ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

Latest Articles

Constructing Complex Search Tasks with Coherent Subtask Search Goals

Nowadays, due to the explosive growth of web content and usage, users deal with their complex search tasks by web search engines. However,... (more)

Collective Web-Based Parenthetical Translation Extraction Using Markov Logic Networks

Parenthetical translations are translations of terms in otherwise monolingual text that appear inside parentheses. Parenthetical translations... (more)

Fuzzy Hindi WordNet and Word Sense Disambiguation Using Fuzzy Graph Connectivity Measures

In this article, we propose Fuzzy Hindi WordNet, which is an extended version of Hindi WordNet. The proposed idea of fuzzy relations and their role in... (more)

Acoustic Features for Hidden Conditional Random Fields--Based Thai Tone Classification

In the Thai language, tone information is necessary for Thai speech recognition systems. Previous studies show that many acoustic cues are attributed... (more)

Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora

Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As... (more)


Science Citation Index Listing

TALLIP will be listed in the Science Citation Index Expanded starting with the first 2015 issue, 14(1). TALLIP will be included in the 2017 Journal Citation Report, and the first Impact Factor will be published mid-2018.

Call for Nominations, Editor-in-Chief

TALLIP is seeking nominations for a new EiC for a three-year term starting in June 2016. 

New Name, Expanded Scope

This page provides information about the journal Transactions on Asian and Low-Resource Language Information Processing (TALLIP), a publication of the Association for Computing Machinery (ACM).

The journal was formerly known as the Transactions on Asian Language Information Processing (TALIP): see the editorial charter for information on the expanded scope of the journal.  

ACM Author Options
New options for ACM authors to manage rights and permissions for their work: ACM introduces a new publishing license agreement, an updated copyright transfer agreement, and a new author-pays option that allows for perpetual open access through the ACM Digital Library. For more information, visit the ACM Author Rights page.


Forthcoming Articles
From Image to Translation: Processing the Endangered Nyushu Script

The lack of computational support has significantly slowed automatic understanding of endangered languages. In this paper, we take Nyushu (simplified Chinese: 女书; literally: "women's writing") as a case study to present the first computational approach that combines Computer Vision and Natural Language Processing techniques to deeply understand an endangered language. We developed an end-to-end system that reads a scanned handwritten Nyushu article, segments it into characters, links them to standard characters, and then translates the article into Mandarin Chinese. We propose several novel methods to address the new challenges introduced by noisy input and low resources, including Nyushu-specific feature selection for character segmentation and linking, and machine translation based on a character-linking lattice. The end-to-end system's performance is promising and can serve as a benchmark.

Learning Generalized Features for Semantic Role Labeling

This paper aims to improve Semantic Role Labeling (SRL) by learning generalized features. The SRL task is usually treated as a supervised problem, so a large feature set is crucial to the performance of SRL systems. However, these features generalize poorly when predicting an unseen argument. This paper proposes a simple approach to alleviate the issue. A strong intuition is that arguments occurring in similar syntactic positions are likely to bear the same semantic role and, analogously, that lexically similar arguments are likely to represent the same semantic role. It is therefore informative for SRL if syntactically or lexically similar arguments activate the same feature. Inspired by this, we embed lexical and syntactic information into a feature vector for each argument and then use k-means to cluster all feature vectors of the training set. An unseen argument is assigned to the same cluster as the similar arguments in the training set, so the clusters can be thought of as a kind of generalized feature. We evaluate our method on several benchmarks. The experimental results show that our approach can significantly improve SRL performance.
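The clustering idea in this abstract can be illustrated with a minimal sketch. It assumes each argument has already been embedded as a small numeric vector. The toy vectors, the function names `kmeans` and `nearest`, and the choice of k = 2 are hypothetical and are not taken from the paper. An unseen argument receives the ID of its nearest centroid, which then acts as a single generalized feature.

```python
def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))

def kmeans(points, k, iters=20):
    """Plain Lloyd-style k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

# Hypothetical 2-d argument embeddings (toy values, not real SRL features).
train = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
centroids = kmeans(train, k=2)

# An unseen argument gets the cluster ID of its nearest centroid;
# that ID can then be fed to the SRL classifier as one extra feature.
unseen = (4.8, 5.3)
cluster_feature = nearest(unseen, centroids)
```

In this toy setup the unseen vector falls into the same cluster as the two training vectors near (5, 5), so all three would share one feature value.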

A "Suggested" Picture of Web Search in Turkish

While query log analysis provides crucial information about the search interests and behaviors of web users, conducting such analyses is almost impossible for several languages, as large-scale public query logs are quite scarce. In this study, we adopt a novel strategy to obtain a set of Turkish queries using the instant query auto-completion services of the four major search engines. Our work provides the first large-scale analysis of web queries and their results in Turkish, via automatic methods and extensive user studies.

A Fast and Compact Language Model Implementation Using Double-array Structures

The language model is one of the most important components of statistical machine translation, and its performance greatly affects translation speed and quality. We propose a fast and compact language model querying system that increases translation speed and reduces memory usage. As the core data structure, we use the double-array structure, which is known to be a fast and compact trie representation. We modify the original double-array structure so that it can store model parameters such as log probabilities and log backoff weights. We also propose two optimization methods. The first lets us drop some trie nodes losslessly: because the naive implementation has many unused nodes, eliminating them makes the trie compact. The second tunes the word IDs in the language model, which reduces the variance of the word ID distribution for each node and thereby reduces the model size. We conduct experiments to evaluate the efficiency of our methods on both a perplexity calculation task and a translation task. The results show that our method exploits the properties of double-arrays: it yields a smaller model than one state-of-the-art system and runs faster than another type of system.
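For readers unfamiliar with double-arrays, the following is a minimal sketch of the general idea, not the authors' implementation. A trie is laid out in two integer arrays, `base` and `check`, so that following word ID `c` from node `s` is a single index computation. The parallel `value` array holding (log probability, log backoff weight) pairs, the first-fit placement, the integer word IDs, and the toy n-grams are all assumptions made for illustration. Neither of the paper's two optimizations is shown.

```python
def build_double_array(entries):
    """Build base/check/value arrays from {tuple_of_word_ids: value}; IDs are ints >= 1."""
    # Step 1: an ordinary nested-dict trie.
    root = {"children": {}, "value": None}
    for key, val in entries.items():
        node = root
        for w in key:
            node = node["children"].setdefault(w, {"children": {}, "value": None})
        node["value"] = val

    # Step 2: lay the trie out in arrays. Slot 0 is the root; check == -1 marks a free slot.
    base, check, value = [0], [0], [None]
    queue = [(root, 0)]
    while queue:
        node, s = queue.pop(0)
        codes = sorted(node["children"])
        if not codes:
            continue
        b = 1
        # First-fit search: smallest offset at which every child slot is free.
        while any(b + c < len(check) and check[b + c] != -1 for c in codes):
            b += 1
        while len(check) < b + codes[-1] + 1:
            base.append(0)
            check.append(-1)
            value.append(None)
        base[s] = b
        for c in codes:
            t = b + c
            check[t] = s  # slot t belongs to parent s
            value[t] = node["children"][c]["value"]
            queue.append((node["children"][c], t))
    return base, check, value

def lookup(arrays, key):
    """Follow key through the arrays; return the stored value or None."""
    base, check, value = arrays
    s = 0
    for c in key:
        t = base[s] + c
        if t >= len(check) or check[t] != s:
            return None
        s = t
    return value[s]

# Toy n-grams over hypothetical word IDs, each with (log prob, log backoff weight).
arrays = build_double_array({(1,): (-1.0, -0.5), (2,): (-1.2, -0.3), (1, 2): (-0.4, 0.0)})
bigram = lookup(arrays, (1, 2))
missing = lookup(arrays, (2, 1))
```

Free slots left over after placement are the kind of unused space that a more careful layout would try to reduce.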

Printed Text Image Database for Sindhi OCR

Document Image Understanding (DIU) and Electronic Document Management are active fields of research involving image understanding, interpretation, efficient handling, and routing of documents, as well as their retrieval. Research on most non-cursive scripts (such as Latin) has matured, whereas research on cursive (connected) scripts is still developing. Many researchers around the world are currently working on cursive scripts (Arabic and other scripts adopting it) so that the difficulties and challenges in understanding and handling documents in these scripts can be overcome. Sindhi script has the largest extension of the original Arabic alphabet among languages adopting Arabic script; it contains 52 characters, compared to 28 characters in the Arabic alphabet, in order to accommodate more sounds of the language. There are 24 differentiating characters, some of which have four dots. Sindhi OCR research and development requires a database for training and testing on Sindhi text images. We have developed a large database containing 4 billion 57 million words and 15 billion 275 million characters in 150 fonts, 4 font weights, and 4 styles. The database contents were collected from various sources, including websites, books, theses, and others. A custom-built application was also developed to create a text image from a text document; it supports various fonts and sizes. The database covers words, characters, characters with spaces, and lines. The database is freely available, in part or in full, by sending an email to one of the authors.


This paper presents an elegant technique for extracting low-level stroke features, such as endpoints, junction points, line elements, and curve elements, from offline printed text using a template matching approach. The proposed features are used to classify a subset of characters from the Gujarati script. The database consists of approximately 16782 samples of 42 middle-zone symbols from the Gujarati character set, collected from three different sources, namely machine-printed books, newspapers, and laser-printed documents. The purpose of this division is to add variety in terms of size, font type, style, ink variation, and boundary deformation. The experiments are performed on the database using a k-nearest neighbor (kNN) classifier, and the results are compared with other widely used structural features, namely Chain Codes (CC), Directional Element Features (DEF), and Histogram of Oriented Gradients (HoG). The results show that the features are quite robust against the variations and give performance comparable with other existing works.


This paper presents an elegant technique for extracting low-level stroke features, such as line segments, curve segments, endpoints, and junction points, from offline printed text using a template matching approach. The proposed features are used to classify a subset of characters from the Gujarati character set. The dataset consists of approximately 16000 middle-zone symbols from 42 different character classes. The symbols are collected from three different sources, namely machine-printed books, laser-printed documents, and newspapers, in order to add variety in terms of size, font type, style, ink variation, and boundary deformation. The experiments show that the features are quite robust against the variations and that the results obtained are comparable with other existing work.

A "Suggested" Picture of Web Search in Turkish

While query log analysis provides crucial insights about web users' search interests, conducting such analyses is almost impossible for some languages, as large-scale public query logs are quite scarce. In this study, we first survey the existing query collections in Turkish and discuss their limitations. Next, we adopt a novel strategy to obtain a set of Turkish queries using the query auto-completion services of the four major search engines, and we provide the first large-scale analysis of web queries and their results in Turkish.

Word Segmentation for Burmese (Myanmar)

This note reports and discusses experiments on various word segmentation approaches for the Burmese language. Specifically, dictionary-based, statistical, and machine learning approaches are tested. Experimental results demonstrate that statistical and machine learning approaches perform significantly better than dictionary-based approaches. We believe this note is the first systematic comparison of word segmentation approaches for Burmese based on an annotated corpus of considerable size (approximately half a million words). This work aims to identify the properties of Burmese text and suitable approaches for processing it, and to promote further research on this understudied language.
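As an illustration of what a dictionary-based segmenter can look like, the sketch below uses forward maximum matching, a common greedy baseline. The note does not say which dictionary algorithm was tested, so this choice is an assumption, as are the function name `max_match` and the toy Latin-script lexicon, which stands in for a real Burmese dictionary.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy left-to-right segmentation: take the longest lexicon entry at each position."""
    if max_len is None:
        max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Toy lexicon (hypothetical); unsegmented input has no spaces between words.
lexicon = {"the", "them", "theme", "me", "park"}
tokens = max_match("themepark", lexicon)
unknown = max_match("thexpark", lexicon)  # "x" is not in the lexicon
```

Characters not covered by the lexicon are emitted as single-character tokens, so the segmenter always makes progress.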

A Four-Tier Annotated Urdu Handwritten Text Image Dataset for Multidisciplinary Research on Urdu Script

In this paper, we present a large Urdu handwritten text corpus containing full-length Urdu sentences, together with an XML-based structure for annotating an offline handwritten image corpus. Annotation of the corpus is essential to make it available and applicable across a wide area of computational linguistics. Here, a unified approach is used to develop an Urdu corpus that records the writer's demographic information on the same form. Urdu is the fourth most frequently used language in the world, but due to its complex script and poor resources it is still a thrust area for NLP. We have developed CALAM (cursive and language adaptive methodology), an Urdu corpus consisting of 1200 handwritten images. To capture as many Urdu words and as much variation in handwriting style as possible, data collection is distributed across 6 categories, further divided into 14 subcategories, and the forms were filled in by writers from various geographical regions with different educational qualifications. A structure has been designed to annotate handwritten Urdu script images at the line, word, and component levels using an XML standard, providing ground truth for each image at different levels of annotation. This corpus would be very useful for linguistic research and for benchmarking and evaluating handwritten text recognition techniques for Urdu script, signature verification, writer identification, digital forensics, classification of printed and handwritten text, categorization of texts by use, and so on.

BenLem (a Bengali Lemmatizer) and its Role in WSD

A lemmatization algorithm for Bengali has been developed, and its effectiveness for word sense disambiguation (WSD) is investigated. One of the key challenges in computer processing of agglutinative languages is dealing with the frequent morphological variations of root words as they appear in text. Therefore, designing a lemmatizer is essential for developing many natural language processing (NLP) tools for such languages. In this experiment, Bengali, which is the national language of Bangladesh and the second most popular language in the Indian subcontinent, has been taken as a reference. To design the lemmatizer (BenLem), the possible transformations through which surface words are formed from lemmas are studied so that suitable reverse transformations (along with contextual knowledge) can be applied to a surface word to recover the corresponding lemma. BenLem is found to be capable of handling both inflectional and derivational morphology in Bengali. It is evaluated on a set of 18 news articles taken from the FIRE Bengali News Corpus, consisting of 3,338 surface words (excluding proper nouns), and is found to be about 82.68% accurate. The role of the lemmatizer is then investigated for Bengali WSD. Fifty (50) news articles are randomly selected from the FIRE corpus, and the five most frequent polysemous Bengali words are considered for sense disambiguation. Different WSD systems are considered for this experiment; BenLem improves the performance of all of them, and the improvement is statistically significant.

Bangla Handwritten Character Segmentation using Structural Features: A Supervised and Bootstrapping Approach

In this paper, we describe a new framework for segmenting Bangla handwritten word images into meaningful individual symbols or pseudo-characters. Existing segmentation algorithms do not usually treat segmentation as a classification problem. In the present study, however, segmentation is viewed as a two-class supervised classification problem. The method employs an SVM classifier to select the segmentation points on the word image on the basis of various structural features. To train the SVM classifier, an unannotated training set is first prepared using candidate segmentation points. The training set is then clustered, and each class is labeled with minimal manual intervention. A semiautomatic bootstrapping technique is also employed to enlarge the training set with new samples. The overall architecture is a basic step towards building an annotation system for the segmentation problem, which has not been investigated so far. The experimental results show that our segmentation method is quite efficient for segmenting word images and is also applicable to handwritten text. As part of this work, a database of Bangla handwritten word images has also been developed. Considering our data collection method and a statistical analysis of our lexicon set, we claim that the relevant characteristics of an ideal lexicon set have been incorporated into our handwritten word image database.


Publication Years 2015-2016
Publication Count 36
Citation Count 0
Available for Download 36
Downloads (6 weeks) 434
Downloads (12 Months) 2160
Downloads (cumulative) 2163
Average downloads per article 60
Average citations per article 0
First Name Last Name Award
Baoli Li ACM Senior Member (2012)
Dong Zhou ACM Senior Member (2012)

First Name Last Name Paper Counts
Chengqing Zong 3
Eiichiro Sumita 2
Xiaodong Liu 2
Isao Goto 2
Juifeng Yeh 2
Kevin Duh 2
Masao Utiyama 2
Sadao Kurohashi 2
Yūji Matsumoto 1
Hanping Shen 1
Peishan Tsai 1
Seunghoon Na 1
Xinyu Dai 1
Daya Lobiyal 1
Shujie Liu 1
Jordi Centelles 1
Hideki Mima 1
Amita Jain 1
Chenhui Chu 1
Richard Tsai 1
Yuming Hsieh 1
Kehjiann Chen 1
Hsinmin Wang 1
Hsinhsi Chen 1
Chunghsien Wu 1
Jiajun Zhang 1
Sherief Abdallah 1
Ramisettyrajeshwara Rao 1
Chenchen Ding 1
Keisuke Sakanushi 1
Mu Li 1
Maad Shatnawi 1
Toru Ishida 1
Hiroki Hanaoka 1
Chutamanee Onsuwan 1
Shujian Huang 1
Yu Zhou 1
Rui Wang 1
Baoliang Lu 1
Mikio Yamamoto 1
Wenhsiang Lu 1
Natthawut Kertkeidkachorn 1
Atiwong Suchato 1
Toshiaki Nakazawa 1
Shuling Huang 1
Gina Levow 1
Maochuan Su 1
Neeta Nain 1
Subhash Panwar 1
Xiaoqing Li 1
Yinggong Zhao 1
Bilel Elayeb 1
Tingxuan Wang 1
Proadpran Punyabukkana 1
Wenyi Chen 1
Kuanyu Chen 1
Suresh Sundaram 1
Angarai Ramakrishnan 1
Deepti Khanduja 1
Marta Costa-jussà 1
Mairidan Wushouer 1
Donghui Lin 1
Prasenjit Majumder 1
Thanaruk Theeramunkong 1
Ibrahim Bounhas 1
Hirona Touji 1
Minghong Bai 1
Lunghao Lee 1
Shihhung Wu 1
Arafat Awajan 1
Yusuke Miyao 1
Nitin Ramrakhiyani 1
Nongnuch Ketui 1
B Kumari 1
Yuji Matsumoto 1
Haitong Yang 1
Hai Zhao 1
Chaolin Liu 1
Fei Cheng 1
Ming Zhou 1
Katsutoshi Hirayama 1
Sumire Uematsu 1
Takuya Matsuzaki 1
Kehyih Su 1
Jiajun Chen 1
Mingwen Wang 1

Affiliation Paper Counts
National Central University Taiwan 1
Chaoyang University of Technology 1
National Chengchi University 1
Indian Institute of Technology 1
Kobe University 1
Indian Institute of Science 1
National Tsing Hua University 1
United Arab Emirates University 1
British University in Dubai 1
Princess Sumaya University 1
University of Washington 1
Jawaharlal Nehru University 1
Japan Science and Technology Agency 2
Universitat Politecnica de Catalunya 2
Institute of Automation Chinese Academy of Sciences 3
Thammasat University 3
National Taiwan University 3
Microsoft Research Asia 3
Chulalongkorn University 3
University of Tokyo 3
Shanghai Jiaotong University 3
Chinese Academy of Sciences 4
Nanjing University 4
University of Tsukuba 4
National Chiayi University 4
National Cheng Kung University 5
Japan National Institute of Information and Communications Technology 5
Academia Sinica Taiwan 5
Kyoto University 6
Nara Institute of Science and Technology 7

ACM Transactions on Asian and Low-Resource Language Information Processing

Volume 15 Issue 2, February 2016

Volume 15 Issue 3, December 2015  Issue-in-Progress
Volume 15 Issue 1, January 2016
Volume 14 Issue 4, October 2015 Special Issue on Chinese Spell Checking
Volume 14 Issue 3, June 2015
Volume 14 Issue 2, March 2015
Volume 14 Issue 1, January 2015

Volume 13 Issue 4, December 2014
Volume 13 Issue 3, September 2014
Volume 13 Issue 2, June 2014
Volume 13 Issue 1, February 2014

Volume 12 Issue 4, October 2013
Volume 12 Issue 3, August 2013
Volume 12 Issue 2, June 2013
Volume 12 Issue 1, March 2013

Volume 11 Issue 4, December 2012 Special Issue on RITE
Volume 11 Issue 3, September 2012
Volume 11 Issue 2, June 2012
Volume 11 Issue 1, March 2012

Volume 10 Issue 4, December 2011
Volume 10 Issue 3, September 2011
Volume 10 Issue 2, June 2011
Volume 10 Issue 1, March 2011

Volume 9 Issue 4, December 2010
Volume 9 Issue 3, September 2010
Volume 9 Issue 2, June 2010
Volume 9 Issue 1, March 2010

Volume 8 Issue 4, December 2009
Volume 8 Issue 3, August 2009
Volume 8 Issue 2, May 2009
Volume 8 Issue 1, March 2009

Volume 7 Issue 4, November 2008
Volume 7 Issue 3, August 2008
Volume 7 Issue 2, June 2008
Volume 7 Issue 1, February 2008

Volume 6 Issue 4, December 2007
Volume 6 Issue 3, November 2007
Volume 6 Issue 2, September 2007
Volume 6 Issue 1, April 2007

Volume 5 Issue 4, December 2006
Volume 5 Issue 3, September 2006
Volume 5 Issue 2, June 2006
Volume 5 Issue 1, March 2006

Volume 4 Issue 4, December 2005
Volume 4 Issue 3, September 2005
Volume 4 Issue 2, June 2005
Volume 4 Issue 1, March 2005

Volume 3 Issue 4, December 2004
Volume 3 Issue 3, September 2004
Volume 3 Issue 2, June 2004
Volume 3 Issue 1, March 2004 Special Issue on Temporal Information Processing

Volume 2 Issue 4, December 2003
Volume 2 Issue 3, September 2003
Volume 2 Issue 2, June 2003
Volume 2 Issue 1, March 2003

Volume 1 Issue 4, December 2002
Volume 1 Issue 3, September 2002
Volume 1 Issue 2, June 2002
Volume 1 Issue 1, March 2002