Searching OCR'ed text: An LDA based approach

E. Hassan; V. Garg; S.K.M. Haque; S. Chaudhury; Madan Gopal

doi:10.1109/ICDAR.2011.244




Published in 2011


Pages: 1210 - 1214

Abstract

Indexing and retrieval performance over a digitized document collection depends significantly on the accuracy of the available Optical Character Recognition (OCR). This paper presents a novel document indexing framework that accounts for document digitization errors during indexing to improve overall retrieval accuracy. The proposed framework is based on topic modeling using Latent Dirichlet Allocation (LDA). The OCR's confidence in correctly recognizing a symbol is propagated into the topic learning process so that the semantic grouping of word examples carefully distinguishes between commonly confused words. We present a novel application of Lucene with topic modeling for document indexing. The experimental evaluation of the proposed framework is presented on a collection of documents in Devanagari script. © 2011 IEEE.