Word-Level Multi-Script Indic Document Image Dataset and Baseline Results on Script Identification

E-ISSN： 2155-6989

ISSN： 2155-6997

Source： International Journal of Computer Vision and Image Processing (IJCVIP), Vol. 7, Iss. 2, 2017-04, pp. 81-94



Abstract

Document analysis research suffers from a lack of public datasets. Without a publicly available dataset, one cannot make a fair comparison with state-of-the-art methods. To bridge this gap, the authors propose a word-level document image dataset covering 13 Indic languages written in 11 official scripts. It comprises 39K words that are equally distributed, i.e., 3K words per language. For baseline results, five classifiers are used: multilayer perceptron (MLP), fuzzy unordered rule induction algorithm (FURIA), simple logistic (SL), library for large linear classification (LibLINEAR) and Bayesian network (BayesNet). They are evaluated with three state-of-the-art features: spatial energy (SE), wavelet energy (WE) and the Radon transform (RT), including their possible combinations. The authors observed that MLP outperforms the other classifiers when all features are used, achieving bi-script accuracy of 99.24% (keeping Roman common) and 98.38% (keeping Devanagari common), and tri-script accuracy of 98.19% (keeping both Devanagari and Roman common).
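As a rough illustration of the experimental setup described in the abstract, the sketch below enumerates the feature combinations and classifier pairings and checks the dataset arithmetic. It assumes that "possible combinations" means every non-empty subset of the three features; the abstract does not confirm that all of them were evaluated. The function names `feature_combinations` and `experiment_grid` are hypothetical and are not taken from the paper.

```python
from itertools import combinations

# Feature and classifier abbreviations as named in the abstract.
FEATURES = ["SE", "WE", "RT"]
CLASSIFIERS = ["MLP", "FURIA", "SL", "LibLINEAR", "BayesNet"]

# Dataset figures stated in the abstract.
LANGUAGES = 13
WORDS_PER_LANGUAGE = 3000


def feature_combinations(features):
    """Return every non-empty subset of the features, smallest subsets first.

    Assumption: 'possible combinations' is read as all non-empty subsets.
    """
    return [
        combo
        for size in range(1, len(features) + 1)
        for combo in combinations(features, size)
    ]


def experiment_grid(features, classifiers):
    """Pair each classifier with each feature combination (hypothetical helper)."""
    return [
        (clf, combo)
        for clf in classifiers
        for combo in feature_combinations(features)
    ]


total_words = LANGUAGES * WORDS_PER_LANGUAGE

if __name__ == "__main__":
    print("Total words:", total_words)
    for clf, combo in experiment_grid(FEATURES, CLASSIFIERS):
        print(clf, "+".join(combo))
```

Under this reading, three features give seven feature sets, and five classifiers give 35 classifier and feature-set pairings per script-identification task. The setting the abstract highlights corresponds to the pair `("MLP", ("SE", "WE", "RT"))`.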