Figure: bbOCR reconstruction of sample document. The document has been analyzed and the OCR output has been converted to HTML format.
bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents arxiv 2023
Abstract
Although many Optical Character Recognition(OCR) meth- ods exit, the lack of comprehensive open-source systems hampers the progress of document digitization in many low- resource languages such as Bengali. The existing methods fo- cus primarily on individual tasks such as word-level OCR, document layout extraction, and distortion correction mostly in high-resource languages. Unfortunately, for low-resource languages, none is a practical system due to a lack of large- scale datasets for different document OCR components, and problems caused by an alphasyllabary writing system . More- over, a system-level evaluation metric that takes into ac- count the document layout and the text recognition simulta- neously is an under-explored area. In this paper, we intro- duce an open-source scalable document OCR system named Bengali-BRACU-OCR (bbOCR) for reconstructing the Ben- gali documents into a structured searchable digitized format considering the document layout, geometric distortions and illumination variations . For building this pipeline, we pro- vide two synthetic datasets and propose a customized model for Bengali text recognition. Besides evaluating the system at the component level, for a system-level extensive evalua- tion, we introduce a diversified evaluation dataset and com- prehensive evaluation metrics. The extensive evaluation sug- gests the practicality of our system over the state-of-the-art open-source Bengali Document OCR system in terms of met- rics and runtime. The source codes and datasets are available here .