ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 15, Issue 4, June 2016

Printed Text Image Database for Sindhi OCR
Dil Nawaz Hakro, Abdullah Zawawi Talib
Article No.: 21
DOI: 10.1145/2846093

Document Image Understanding (DIU) and Electronic Document Management are active fields of research involving image understanding, interpretation, efficient handling, and routing of documents as well as their retrieval. Research on most of the...

Word Segmentation for Burmese (Myanmar)
Chenchen Ding, Ye Kyaw Thu, Masao Utiyama, Eiichiro Sumita
Article No.: 22
DOI: 10.1145/2846095

Experiments on various word segmentation approaches for the Burmese language are conducted and discussed in this note. Specifically, dictionary-based, statistical, and machine learning approaches are tested. Experimental results demonstrate that...

From Image to Translation: Processing the Endangered Nyushu Script
Tongtao Zhang, Aritra Chowdhury, Nimit Dhulekar, Jinjing Xia, Kevin Knight, Heng Ji, Bülent Yener, Liming Zhao
Article No.: 23
DOI: 10.1145/2857052

The lack of computational support has significantly slowed down automatic understanding of endangered languages. In this paper, we take Nyushu (simplified Chinese: 女书; literally: “women’s writing”) as a case study...

A “Suggested” Picture of Web Search in Turkish
Erdem Sarigil, Oguz Yilmaz, Ismail Sengor Altingovde, Rifat Ozcan, Özgür Ulusoy
Article No.: 24
DOI: 10.1145/2891105

Although query log analysis provides crucial insights about Web users’ search interests, conducting such analyses is almost impossible for some languages, as large-scale and public query logs are quite scarce. In this study, we first survey...

Classification of Printed Gujarati Characters Using Low-Level Stroke Features
Mukesh M. Goswami, Suman K. Mitra
Article No.: 25
DOI: 10.1145/2856105

This article presents an elegant technique for extracting the low-level stroke features, such as endpoints, junction points, line elements, and curve elements, from offline printed text using a template matching approach. The proposed features are...

A Four-Tier Annotated Urdu Handwritten Text Image Dataset for Multidisciplinary Research on Urdu Script
Prakash Choudhary, Neeta Nain
Article No.: 26
DOI: 10.1145/2857053

This article introduces a large handwritten text document image corpus dataset for Urdu script named CALAM (Cursive And Language Adaptive Methodologies). The database contains unconstrained handwritten sentences along with their structural...

A Fast and Compact Language Model Implementation Using Double-Array Structures
Jun-Ya Norimatsu, Makoto Yasuhara, Toru Tanaka, Mikio Yamamoto
Article No.: 27
DOI: 10.1145/2873068

The language model is a widely used component in fields such as natural language processing, automatic speech recognition, and optical character recognition. In particular, statistical machine translation uses language models, and the translation...

Learning Generalized Features for Semantic Role Labeling
Haitong Yang, Chengqing Zong
Article No.: 28
DOI: 10.1145/2890496

This article makes an effort to improve Semantic Role Labeling (SRL) through learning generalized features. The SRL task is usually treated as a supervised problem. Therefore, a huge set of features are crucial to the performance of SRL systems....

Bangla Handwritten Character Segmentation Using Structural Features: A Supervised and Bootstrapping Approach
Tapan Kumar Bhowmik, Swapan Kumar Parui, Utpal Roy, Lambert Schomaker
Article No.: 29
DOI: 10.1145/2890497

In this article, we propose a new framework for segmentation of Bangla handwritten word images into meaningful individual symbols or pseudo-characters. Existing segmentation algorithms are not usually treated as a classification problem. However,...