bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents

arXiv 2023

Figure: bbOCR reconstruction of sample document. The document has been analyzed and the OCR output has been converted to HTML format.

Abstract

Although many Optical Character Recognition (OCR) methods exist, the lack of comprehensive open-source systems hampers the progress of document digitization in many low-resource languages such as Bengali. The existing methods focus primarily on individual tasks such as word-level OCR, document layout extraction, and distortion correction, mostly in high-resource languages. Unfortunately, for low-resource languages, none of them is a practical system, owing to the lack of large-scale datasets for the different document OCR components and to problems caused by an alphasyllabary writing system. Moreover, a system-level evaluation metric that takes the document layout and the text recognition into account simultaneously is an under-explored area. In this paper, we introduce an open-source, scalable document OCR system named Bengali-BRACU-OCR (bbOCR) for reconstructing Bengali documents into a structured, searchable, digitized format, considering the document layout, geometric distortions, and illumination variations. For building this pipeline, we provide two synthetic datasets and propose a customized model for Bengali text recognition. Besides evaluating the system at the component level, we introduce a diversified evaluation dataset and comprehensive evaluation metrics for an extensive system-level evaluation. The extensive evaluation suggests the practicality of our system over the state-of-the-art open-source Bengali document OCR system in terms of metrics and runtime. The source code and datasets are available here.

Citation

@misc{2308.10647,
  author={Imam Mohammad Zulkarnain and Shayekh Bin Islam and Md. Zami Al Zunaed Farabe and Md. Mehedi Hasan Shawon and Jawaril Munshad Abedin and Beig Rajibul Hasan and Marsia Haque and Istiak Shihab and Syed Mobassir and MD. Nazmuddoha Ansary and Asif Sushmit and Farig Sadeque},
  title={bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents},
  year={2023},
  eprint={arXiv:2308.10647}
}

@article{shihab2023badlad,
  title={BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset},
  author={Shihab, Md and Hossain, Istiak and Hasan, Md and Emon, Mahfuzur Rahman and Hossen, Syed Mobassir and Ansary, Md and Ahmed, Intesur and Rakib, Fazle Rabbi and Dhruvo, Shahriar Elahi and Dip, Souhardya Saha and others},
  journal={arXiv preprint arXiv:2303.05325},
  year={2023}
}

Annotators

Marshia Haque Meghla, Junayed Bhuiyan.


Acknowledgements

We are thankful to the Center for Bangladesh Genocide Research (CBGR) for sharing some invaluable historical documents for this dataset. We also thank APSIS Solutions for sharing and open-sourcing their word-recognition models and data generation strategies.


Contact

research.bengaliai@gmail.com, farig.sadeque@bracu.ac.bd, sushmit@ieee.org
