The BBN Byblos Hindi OCR system

Abstract

The BBN Byblos OCR system implements a script-independent methodology for OCR using hidden Markov models (HMMs). We have successfully ported the system to Arabic, English, Chinese, Pashto, and Japanese. In this chapter, we report on our recent effort in training the system to perform recognition of Hindi (Devanagari) documents. The initial experiments reported in this chapter were performed using a corpus of synthetic (computer-generated) document images, along with slightly degraded versions of the same images that were generated by scanning printed versions of the document images and by scanning faxes of the printed versions. On a fair test set consisting of synthetic images alone, we measured a character error rate of 1.0%. The character error rate on a fair test set consisting of scanned images (scans of printed versions of the synthetic images) was 1.40%, while the character error rate on a fair …

Date: December 19, 2025
Authors: Prem Natarajan, Ehry MacRostie, Michael Decerbo
Journal: Guide to OCR for Indic Scripts: Document Recognition and Retrieval
Pages: 173-180
Publisher: Springer London
