Classification of Printed Gujarati Characters Using Low-Level Stroke Features

This article presents an elegant technique for extracting the low-level stroke features, such as endpoints, junction points, line elements, and curve...

A Fast and Compact Language Model Implementation Using Double-Array Structures

The language model is a widely used component in fields such as natural language processing, automatic speech recognition, and optical character...

Learning Generalized Features for Semantic Role Labeling

This article makes an effort to improve Semantic Role Labeling (SRL) through learning generalized features. The SRL task is usually treated as a...

Bangla Handwritten Character Segmentation Using Structural Features

In this article, we propose a new framework for segmentation of Bangla handwritten word images into meaningful individual symbols or...


Science Citation Index Listing

TALLIP will be listed in the Science Citation Index Expanded starting with the first 2015 issue, 14(1). TALLIP will be included in the 2017 Journal Citation Report, and the first Impact Factor will be published mid-2018.

Call for Nominations, Editor-in-Chief

TALLIP is seeking nominations for a new EiC for a three-year term starting in June 2016. 

New Name, Expanded Scope

This page provides information about the journal Transactions on Asian and Low-Resource Language Information Processing (TALLIP), a publication of the Association for Computing Machinery (ACM).

The journal was formerly known as the Transactions on Asian Language Information Processing (TALIP): see the editorial charter for information on the expanded scope of the journal.  

ACM Author Options
New options for ACM authors to manage rights and permissions for their work: ACM introduces a new publishing license agreement, an updated copyright transfer agreement, and a new author-pays option which allows for perpetual open access through the ACM Digital Library. For more information, visit the ACM Author Rights page.


Forthcoming Articles
From Image to Translation: Processing the Endangered Nyushu Script

The lack of computational support has significantly slowed the automatic understanding of endangered languages. In this paper, we take Nyushu (simplified Chinese: 女书; literally: "women's writing") as a case study to present the first computational approach that combines Computer Vision and Natural Language Processing techniques to deeply understand an endangered language. We developed an end-to-end system that reads a scanned handwritten Nyushu article, segments it into characters, links them to standard characters, and then translates the article into Mandarin Chinese. We propose several novel methods to address the new challenges introduced by noisy input and low resources, including Nyushu-specific feature selection for character segmentation and linking, and machine translation based on a character-linking lattice. The end-to-end system performance is promising and can serve as a benchmark.

Online Handwritten Gurmukhi Strokes Dataset based on Minimal set of Words

Online handwriting data is an integral part of data analysis and classification research, as collected handwritten data poses many challenges for grouping handwritten stroke classes. The present work groups handwritten strokes from the Indic script Gurmukhi, the script of the widely spoken Punjabi language. It includes the development of a dataset of Gurmukhi words for online handwriting recognition in real-life applications such as map navigation. We collected data from one hundred writers for the common places of the Punjab region. Writer variations such as writing skill level (beginner, moderate, and expert), gender, right- or left-handedness, and adaptability to digital handwriting were considered in dataset development. We introduce a novel technique to form handwritten stroke classes based on a limited set of words. The presence of all letters of the Gurmukhi script, including vowels, was considered before selecting each word. The developed dataset includes 39411 strokes from handwritten words, which form 72 stroke classes after k-means clustering and manual verification by expert and moderate writers. Using a Hidden Markov Model, we achieved recognition rates of 87.10%, 85.43%, and 84.33% for middle-zone strokes when using 66%, 50%, and 80% of the developed dataset as training data. The present work is a step toward finding groups for unknown handwriting strokes with a reasonably high level of accuracy.

Improving Unsupervised Dependency Parsing with Knowledge from Query Logs

Unsupervised dependency parsing has become increasingly popular in recent years because it does not need expensive annotations, such as treebanks, which are required for supervised and semi-supervised dependency parsing. However, its accuracy is still far below that of supervised dependency parsers, partly because its parsing model is not sufficient to capture the linguistic phenomena underlying texts. The performance of unsupervised dependency parsing can be improved by mining knowledge from texts and incorporating it into the model. In this paper, syntactic knowledge is acquired from query logs to help estimate better probabilities in the dependency model with valence. The proposed method is language independent, and obtains an improvement of 4.1% in unlabeled accuracy on the Penn Chinese Treebank by utilizing additional dependency relations from the Sogou and Baidu query logs. Moreover, experiments show that the proposed model achieves an improvement of 8.07% on CoNLL 2007 English using the AOL query logs. We believe query logs are useful sources of syntactic knowledge for many NLP tasks.

Printed Text Image Database for Sindhi OCR

Document Image Understanding (DIU) and Electronic Document Management are active fields of research involving image understanding, interpretation, efficient handling, and routing of documents, as well as their retrieval. Research on most non-cursive scripts (e.g., Latin) has matured, whereas research on cursive (connected) scripts is still moving towards perfection. Many researchers around the world are currently working on cursive scripts (Arabic and other scripts adopting it) so that the difficulties and challenges in document understanding and handling of these scripts can be overcome. Sindhi script has the largest extension of the original Arabic alphabet among languages adopting Arabic script; it contains 52 characters, compared to 28 characters in the Arabic alphabet, in order to accommodate more sounds of the language. There are 24 differentiating characters, some of which possess four dots. Sindhi OCR research and development needs a database of Sindhi text images for training and testing. We have developed a large database containing 4 billion 57 million words and 15 billion 275 million characters in 150 fonts, 4 font weights, and 4 styles. The database contents were collected from various sources, including websites, books, theses, and others. A custom-built application was also developed to create a text image from a text document; it supports various fonts and sizes. The database covers words, characters, characters with spaces, and lines. The database is freely available, in part or in full, by sending an email to one of the authors.


Classification of Printed Gujarati Characters Using Low-Level Stroke Features

This paper presents an elegant technique for extracting low-level stroke features, such as line segments, curve segments, endpoints, and junction points, from offline printed text using a template matching approach. The proposed features are used to classify a subset of characters from the Gujarati character set. The dataset consists of approximately 16000 middle-zone symbols from 42 different character classes. The symbols are collected from three different sources, namely a machine-printed book, a laser-printed document, and newspapers, in order to add variety in terms of size, font type, style, ink variation, and boundary deformation. The experiment shows that the features are quite robust against these variations and that the results obtained are comparable with other existing work.

A "Suggested" Picture of Web Search in Turkish

While query log analysis provides crucial insights about web users' search interests, conducting such analyses is almost impossible for some languages, as large-scale and public query logs are quite scarce. In this study, we first survey the existing query collections in Turkish and discuss their limitations. Next, we adopt a novel strategy to obtain a set of Turkish queries using the query auto-completion services of the four major search engines and provide the first large-scale analysis of web queries and their results in Turkish.

Word Segmentation for Burmese (Myanmar)

Experiments on various word segmentation approaches for the Burmese language are conducted and discussed in this note. Specifically, dictionary-based, statistical, and machine learning approaches are tested. Experimental results demonstrate that statistical and machine learning approaches perform significantly better than dictionary-based approaches. We believe this note is the first systematic comparison of word segmentation approaches for Burmese based on an annotated corpus of considerable size (approximately half a million words). This work aims to discover the properties of, and proper approaches to, Burmese text processing and to promote further research on this understudied language.

A Four-Tier Annotated Urdu Handwritten Text Image Dataset for Multidisciplinary Research on Urdu Script

In this paper, we present a large Urdu handwritten text corpus containing full-length Urdu sentences, together with an annotation structure for annotating an offline handwritten image corpus with an XML representation. Annotation is essential to make the corpus available and applicable across a wide area of computational linguistics. Here, a unified approach is used to develop an Urdu corpus along with the demographic information of the writer on a single form. Urdu is the fourth most frequently used language in the world, but because of its complex script and poor resources, it is still a thrust area for NLP. We have developed CALAM (cursive and language adaptive methodology), an Urdu corpus consisting of 1200 handwritten images. To capture the maximum number of Urdu words and variations in handwriting style, data collection was distributed across 6 categories, further divided into 14 subcategories, and forms were filled in by writers from various geographical regions with different educational qualifications. A structure has been designed to annotate handwritten Urdu script images at the line, word, and component levels using an XML standard, providing a ground truth for each image at different levels of annotation. This corpus should be very useful for linguistic research and for benchmarking and evaluating handwritten text recognition techniques for Urdu script, signature verification, writer identification, digital forensics, classification of printed and handwritten text, categorization of texts by use, and so on.

Pairwise Comparative Classification for Translator Stylometric Analysis

When a text is translated by different translators, the features extracted from each translation form a block of records. Within this block, each translation belongs to a specific translator. When the translators are not known for a sample of the data, finding an algorithm to assign translators to records forms a new type of classification problem, which we call the Comparative Classification Problem (CCP). The primary difference between CCP and classical classification is that in the latter, the assignment of a translator to one record is independent of the assignment of a translator to a different record. In CCP, however, the assignment of a translator to one record within a block excludes this translator from further assignments to any other record in that block. This interdependency in the data poses challenges for techniques relying on the independent and identically distributed (iid) assumption. In the Pair-Wise CCP (PWCCP), a pair of records is grouped together. The key difference between PWCCP and classical binary classification problems is that hidden patterns can only be unmasked by comparing the instances as pairs. In this paper, we introduce a new algorithm, PWC4.5, which is based on C4.5, to manage PWCCP. We first show that a simple transformation, which we call Gradient Based Transformation (GBT), can fix the problem of iid in C4.5. We then evaluate PWC4.5 using two real-world corpora to distinguish between translators on Arabic-English and French-English translations. While the traditional C4.5 failed to distinguish between different translators, GBT demonstrated better performance. Meanwhile, PWCCP C4.5 consistently provided the best results over C4.5 and GBT.
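
The exclusion constraint that separates CCP from classical classification can be illustrated with a minimal Python sketch. This is not the PWC4.5 or GBT algorithm from the article; the function names and the score values are hypothetical, and the brute-force search is only an assumption chosen to make the constraint visible on a two-record block.

```python
from itertools import permutations


def independent_assignment(scores):
    """Classical classification: each record picks its best translator on its own."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def block_assignment(scores):
    """Comparative classification: a translator is used at most once per block.

    Brute-force search over one-to-one assignments; only practical for small blocks.
    """
    n_records = len(scores)
    n_translators = len(scores[0])
    best = max(
        permutations(range(n_translators), n_records),
        key=lambda perm: sum(scores[r][t] for r, t in enumerate(perm)),
    )
    return list(best)


# Hypothetical scores: rows are two translations of one text (one block),
# columns are translators 0 and 1.
scores = [
    [0.9, 0.1],
    [0.6, 0.4],
]

print(independent_assignment(scores))  # [0, 0]: both records pick translator 0
print(block_assignment(scores))        # [0, 1]: translator 0 is used only once
```

In the independent case both records are attributed to the same translator, which cannot be correct inside one block; the block-level assignment rules this out by construction.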


Publication Years 2002-2016
Publication Count 278
Citation Count 995
Available for Download 278
Downloads (6 weeks) 1006
Downloads (12 Months) 8889
Downloads (cumulative) 220876
Average downloads per article 795
Average citations per article 4
