bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents
arXiv 2023


Figure: bbOCR reconstruction of a sample document. The document has been analyzed and the OCR output has been converted to HTML format.

Abstract

Although many Optical Character Recognition (OCR) methods exist, the lack of comprehensive open-source systems hampers the progress of document digitization in many low-resource languages such as Bengali. The existing methods focus primarily on individual tasks such as word-level OCR, document layout extraction, and distortion correction, mostly in high-resource languages. Unfortunately, for low-resource languages, none of them amounts to a practical system, due to a lack of large-scale datasets for the different document OCR components and to problems caused by an alphasyllabary writing system. Moreover, a system-level evaluation metric that takes into account the document layout and the text recognition simultaneously is an under-explored area. In this paper, we introduce an open-source, scalable document OCR system named Bengali-BRACU-OCR (bbOCR) for reconstructing Bengali documents into a structured, searchable, digitized format, considering the document layout, geometric distortions, and illumination variations. For building this pipeline, we provide two synthetic datasets and propose a customized model for Bengali text recognition. Besides evaluating the system at the component level, for an extensive system-level evaluation we introduce a diversified evaluation dataset and comprehensive evaluation metrics. The extensive evaluation suggests the practicality of our system over the state-of-the-art open-source Bengali document OCR system in terms of metrics and runtime. The source code and datasets are available here.



Citation

Annotators

Marshia Haque Meghla, Junayed Bhuiyan.


Acknowledgements

We are thankful to the Center for Bangladesh Genocide Research (CBGR) for sharing some invaluable historical documents for this dataset. We also thank APSIS Solutions for sharing and open-sourcing their word-recognition models and data generation strategies.


Contact

research.bengaliai@gmail.com, farig.sadeque@bracu.ac.bd, sushmit@ieee.org
