ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

Latest Articles

From Genesis to Creole Language: Transfer Learning for Singlish Universal Dependencies Parsing and POS Tagging

Singlish can be interesting to the computational linguistics community both linguistically, as a major low-resource creole based on English, and computationally, for information extraction and sentiment analysis of regional social media. In our conference paper, Wang et al. (2017), we investigated part-of-speech (POS) tagging and dependency parsing... (more)

Chinese Zero Pronoun Resolution: A Chain-to-chain Approach

Chinese zero pronoun (ZP) resolution plays a critical role in discourse analysis. Different from traditional mention-to-mention approaches, this article proposes a chain-to-chain approach to improve the performance of ZP resolution in three aspects. First, consecutive ZPs are clustered into coreferential chains, each working as one independent... (more)

Chinese Zero Pronoun Resolution: A Collaborative Filtering-based Approach

Semantic information that has been proven to be necessary to the resolution of common noun phrases is typically ignored by most existing Chinese zero pronoun resolvers. This is because zero pronouns convey no descriptive information, which makes it almost impossible to calculate semantic similarities between the zero pronoun and its candidate... (more)

Transform, Combine, and Transfer: Delexicalized Transfer Parser for Low-resource Languages

Transfer parsing has been used for developing dependency parsers for languages with no treebank by using transfer from treebanks of other languages (source languages). In delexicalized transfer, parsed words are replaced by their part-of-speech tags. Transfer parsing may not work well if a language does not follow uniform syntactic structure with... (more)

Towards Burmese (Myanmar) Morphological Analysis: Syllable-based Tokenization and Part-of-speech Tagging

This article presents a comprehensive study on two primary tasks in Burmese (Myanmar) morphological analysis: tokenization and part-of-speech (POS) tagging. Twenty thousand Burmese sentences of newswire are annotated with two-layer tokenization and POS-tagging information, as one component of the Asian Language Treebank Project. The annotated... (more)

Ancient–Modern Chinese Translation with a New Large Training Dataset

Ancient Chinese carries the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps to... (more)

Chinese Syntax Parsing Based on Sliding Match of Semantic String

Different from the current syntax parsing based on deep learning, we present a novel Chinese parsing method, which is based on Sliding Match of... (more)

Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications

Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question-answering systems. Recognizing the importance of NER, a plethora of NER techniques for Western and Asian languages have been developed. However, despite having over 490 million Urdu language speakers... (more)

Deep Contextualized Word Embeddings for Universal Dependency Parsing

Deep contextualized word embeddings (Embeddings from Language Model, short for ELMo), as an emerging and effective replacement for the static word... (more)

Matching Graph, a Method for Extracting Parallel Information from Comparable Corpora

Comparable corpora are valuable alternatives to expensive parallel corpora. They comprise informative parallel fragments that are useful... (more)

μ-Forcing: Training Variational Recurrent Autoencoders for Text Generation

It has been previously observed that training Variational Recurrent Autoencoders (VRAE) for text generation suffers from a serious uninformative-latent-variable problem. The model collapses into a plain language model that totally ignores the latent variables and can only generate repetitive and dull samples. In this article, we explore the... (more)

Explicitly Modeling Word Translations in Neural Machine Translation

In this article, we show that word translations can be explicitly incorporated into NMT effectively to avoid wrong translations. Specifically, we... (more)


Call for Nominations
ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)


The term of the current Editor-in-Chief (EiC) of the ACM Trans. on Asian and Low-Resource Language Information Processing (TALLIP) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC.  TALLIP was established in 2002 and has been experiencing steady growth, with 178 submissions received in 2017.

Nominations, including self-nominations, are invited for a three-year term as TALLIP EiC, beginning on June 1, 2019. The EiC appointment may be renewed at most one time. This is an entirely voluntary position, but ACM will provide appropriate administrative support.

Appointed by the ACM Publications Board, Editors-in-Chief (EiCs) of ACM journals are delegated full responsibility for the editorial management of the journal consistent with the journal's charter and general ACM policies. The Board relies on EiCs to ensure that the content of the journal is of high quality and that the editorial review process is both timely and fair. The EiC has final say on acceptance of papers, size of the Editorial Board, and appointment of Associate Editors. A complete list of responsibilities is found in the ACM Volunteer Editors Position Descriptions. Additional information can be found in the following documents:

Nominations should include a vita along with a brief statement of why the nominee should be considered. Self-nominations are encouraged, and should include a statement of the candidate's vision for the future development of TALLIP. The deadline for submitting nominations is April 15, 2019, although nominations will continue to be accepted until the position is filled.

Please send all nominations to the nominating committee chair, Monojit Choudhury (

The search committee members are:

  • Monojit Choudhury (Microsoft Research, India), Chair
  • Kareem M. Darwish (Qatar Computing Research Institute, HBKU)
  • Tei-wei Kuo (National Taiwan University & Academia Sinica) EiC of ACM Transactions on Cyber-Physical Systems; Vice Chair, ACM SIGAPP
  • Helen Meng (Chinese University of Hong Kong)
  • Taro Watanabe (Google Inc., Tokyo)
  • Holly Rushmeier (Yale University), ACM Publications Board Liaison

Word Reordering for Translation into Korean Sign Language Using Syntactically-guided Classification

One of the goals of machine translation is to break the language barrier that prevents people from communicating with others and accessing information. Deaf people, in particular, face significant language barriers in their daily lives, and there are very few digital resources for sign language processing. In this paper, we present a machine translation system for translating Korean to Korean Sign Language (KSL) glosses. The system uses dictionary-based lexical transfer and syntactically guided, data-driven structural transfer. A basic description of the linguistic features of KSL, compared with other sign languages, is also presented. This work focuses especially on structural transfer as word reordering. The core of our work is a neural classification model for reordering order-important constituent pairs, with a reordering task newly designed for Korean-to-KSL translation. Experimental results on news transcript data show that the proposed system achieves a BLEU score of 0.512 and a RIBES score of 0.425, significantly improving on the performance of the baseline system.

Wasf-Vec: Topology-Based Word Embedding for Modern Standard Arabic and Iraqi Dialect Ontology

Word clustering is a crucial issue in low-resource languages. Since words that share semantics are expected to be clustered together, it is common to use feature vector representations generated by distributional-theory-based word embedding methods. The goal of this work is to utilize Modern Standard Arabic (MSA) for better clustering of the vocabulary of the low-resource Iraqi dialect. We start with a fast dialect stemming algorithm that utilizes MSA data, achieving 0.85 accuracy as measured by F1 score, followed by training a distributional-theory-based word embedding method on the stemmed data. We then analyze how dialect words are clustered among other Modern Standard Arabic words using word semantic relations that are well supported by solid linguistic theories, and we shed light on the strong and weak relation representations. The analysis is carried out by visualizing the first two PCA components in 2D space, examining words' nearest neighbors, and analyzing distance histograms of specific word templates. We propose a new, simple yet effective spatial feature vector for word representation, named Wasf-Vec, that utilizes words' orthographic, phonological, and morphological structures. The Wasf technique captures relations that are not context-based, unlike distributional-theory-based word embedding methods. The word classification used in this paper is validated by employing the classes in a class-based language model (CBLM). The Wasf-Vec CBLM achieved 7% lower perplexity (pp) than the distributional-word-embedding CBLM. This result is significant when working with low-resource languages.

Filtered Pseudo-Parallel Corpus Improves Low-Resource Neural Machine Translation

Large-scale parallel corpora are essential for training high-quality machine translation systems, but such corpora are not freely available for many language pairs. In previous studies, training data has been augmented with pseudo-parallel corpora obtained by using machine translation models to translate monolingual corpora into the source language. However, for low-resource language pairs, in which only low-accuracy machine translation systems are available, translation quality degrades when a pseudo-parallel corpus is used naively. To improve machine translation performance for low-resource language pairs, we propose a method to expand the training data effectively by filtering the pseudo-parallel corpus using quality estimation based on sentence-level round-trip translation. We experimented on three language pairs using small, medium, and large parallel corpora and observed that BLEU scores improved significantly for low-resource language pairs.
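As a rough illustration of the filtering idea, the sketch below keeps a pseudo-parallel pair only if the source sentence survives a round trip through two translation systems. The `translate_*` stubs and the unigram-F1 quality estimate are placeholders for real MT models and for the paper's actual quality estimation, not the authors' implementation.

```python
# Sketch of pseudo-parallel corpus filtering via round-trip translation.
from collections import Counter

def translate_src_to_tgt(sentence):
    # Placeholder: a real system would run an MT model here.
    return sentence  # identity stub for illustration

def translate_tgt_to_src(sentence):
    # Placeholder for the reverse-direction MT model.
    return sentence  # identity stub for illustration

def unigram_f1(hyp, ref):
    """Toy quality estimate: unigram F1 between two sentences."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

def filter_pseudo_parallel(pairs, threshold=0.5):
    """Keep (src, tgt) pairs whose source survives a round trip
    src -> tgt' -> src'' with similarity(src, src'') >= threshold."""
    kept = []
    for src, tgt in pairs:
        round_trip = translate_tgt_to_src(translate_src_to_tgt(src))
        if unigram_f1(round_trip, src) >= threshold:
            kept.append((src, tgt))
    return kept
```

With real MT systems plugged in, a low round-trip score signals a pair whose pseudo-source is likely mistranslated and should be discarded.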

An Automatic and A Machine-Assisted Method to Clean Bilingual Corpus

Two different methods of corpus cleaning are presented in this paper. One is a machine-assisted technique suited to cleaning small parallel corpora, and the other is an automatic method suitable for cleaning large parallel corpora. A baseline SMT system (MOSES) is used to evaluate these methods. The machine-assisted technique uses two features: word alignment, and the lengths of the source and target sentences. These features are used to detect mistranslations in the corpus, which are then handled by a human translator. Experiments are conducted on the EILMT (English to Indian Language Machine Translation) corpus (English-Hindi). The BLEU score improves by 0.27% on the cleaned corpus. The automatic method of corpus cleaning uses a combination of two features: the lengths of the source and target sentences, and the Viterbi alignment score generated by a Hidden Markov Model (HMM) for each sentence pair. Two different threshold values are used for these two features, decided using a small manually annotated parallel corpus of 206 sentence pairs. Experiments are conducted on the ACL (Association for Computational Linguistics) 2014 corpus (English-Hindi). The BLEU score improves by 0.6% on the cleaned corpus.
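A minimal sketch of the automatic two-feature filter described above, assuming an illustrative length-ratio feature and a stub in place of the HMM Viterbi alignment score; the threshold values are invented for illustration, not the ones tuned on the 206-pair corpus.

```python
# Sketch of the automatic cleaning step: a sentence pair is kept only if
# both features pass their thresholds.

def length_ratio(src, tgt):
    """Ratio of the shorter sentence length to the longer (in tokens)."""
    a, b = len(src.split()), len(tgt.split())
    return min(a, b) / max(a, b)

def alignment_score(src, tgt):
    # Placeholder: a real system would return the HMM Viterbi
    # alignment score for this sentence pair.
    return 1.0

def clean_corpus(pairs, ratio_threshold=0.5, align_threshold=0.0):
    """Keep pairs passing both the length-ratio and alignment thresholds."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if length_ratio(src, tgt) >= ratio_threshold
        and alignment_score(src, tgt) >= align_threshold
    ]
```

In practice, both thresholds would be tuned on a small annotated corpus, as the paper does with its 206 sentence pairs.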

Children Story Classification in Indian Languages using Linguistic and Keyword based Features

The primary objective of this work is to classify Hindi and Telugu stories into three genres: fable, folk-tale, and legend. We propose a framework for story classification (SC) using keyword and part-of-speech (POS) features. To improve the performance of the SC system, feature reduction techniques and combinations of various POS tags are explored. Further, we investigate the performance of SC by dividing the story into parts depending on its semantic structure. Stories are (i) manually divided into parts based on their semantics, as introduction, main, and climax; and (ii) automatically divided into equal parts based on the number of sentences in a story, as initial, middle, and end. We also examine a sentence increment model that aims to determine the optimum number of sentences required to identify story genre by incremental selection of sentences in a story. Experiments are conducted on Hindi and Telugu story corpora consisting of 300 and 150 short stories, respectively. The performance of the SC system is evaluated using different combinations of keyword- and POS-based features with three promising machine learning classifiers: (i) Naive Bayes (NB), (ii) k-Nearest Neighbour (KNN), and (iii) Support Vector Machine (SVM). Performance is evaluated using 10-fold cross-validation, and the effectiveness of each classifier is measured using precision, recall, and F-measure. From the classification results, it is observed that adding linguistic information boosts the performance of story classification significantly. In view of the structure of the story, the main and initial parts have shown comparatively better performance. The results from the sentence increment model indicate that the first nine and seven sentences in Hindi and Telugu stories, respectively, are sufficient for good classification of stories. In most of the experiments, SVM models outperformed the other models in classification accuracy.
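The sentence increment model can be sketched as follows: evaluate a classifier on the first k sentences of each story for growing k, and take the smallest k that reaches the best observed accuracy. The keyword classifier below is a hypothetical stand-in for the paper's keyword/POS-feature classifiers.

```python
# Sketch of the sentence increment model for finding how many leading
# sentences suffice to identify a story's genre.

def classify(sentences):
    # Placeholder classifier: vote by genre keywords (illustrative only).
    text = " ".join(sentences)
    if "moral" in text:
        return "fable"
    if "king" in text:
        return "legend"
    return "folk-tale"

def accuracy_at_k(stories, labels, k):
    """Accuracy when each story is truncated to its first k sentences."""
    preds = [classify(s[:k]) for s in stories]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def optimal_sentence_count(stories, labels, max_k):
    """Smallest prefix length k reaching the best observed accuracy."""
    scores = [accuracy_at_k(stories, labels, k) for k in range(1, max_k + 1)]
    return scores.index(max(scores)) + 1
```

Run on the Hindi and Telugu corpora, this kind of sweep is what yields the reported nine- and seven-sentence prefixes.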

Neural Conversation Generation with Auxiliary Emotional Supervised Models

An important aspect of developing dialogue agents involves endowing a conversation system with emotion perception and interaction. Most existing emotion dialogue models lack adaptability and extensibility across different scenes because they require a user-specified emotion category or rely on a fixed emotional dictionary. To overcome these limitations, we propose a neural Chinese conversation generation model with an auxiliary emotional supervised model (nCCG-ESM), comprising a sequence-to-sequence (Seq2Seq) generation model and an emotional classifier used as an auxiliary model. The emotional classifier is trained to predict the emotion distributions of the dialogues, which are then used as emotion supervision signals to guide the generation model to generate diverse emotional responses. The proposed nCCG-ESM is flexible enough to generate responses with emotional diversity, including user-specified or unspecified emotions, and can be adapted and extended to different scenarios. Experiments on large-scale Weibo post-response pairs showed that the proposed model was capable of producing more diverse, appropriate, and emotionally rich responses, yielding substantial gains in diversity scores and human evaluations.

Adversarial Training for Unknown Word Problems in Neural Machine Translation

Most work in neural machine translation is limited to a quite restricted vocabulary, crudely treating all other words the same as a single unknown-word symbol. For the translation of agglutinative languages such as Mongolian, unknown (UNK) words also arise from the translation model's misunderstanding of morphological changes. In this study, we introduce a new adversarial training model within a generative adversarial net to alleviate the UNK problem in Mongolian-Chinese machine translation. We add a variety of Mongolian morphological noise samples to the training set in the form of pseudo-data, to increase the model's generalization over UNK words. The training process can be described as three adversarial sub-models (generator, filter, and discriminator) playing a win-win game. In this game, the added filter pushes the discriminator to pay attention to negative generations that contain noise, improving training efficiency. Finally, the discriminator cannot easily distinguish the negative samples generated by the generator with the filter from human translations. The experimental results show state-of-the-art performance on the newly emerged Mongolian-Chinese task, while the training time is greatly shortened.

Enhanced Double-Carrier Word Embedding Via Phonetics and Writing

Word embeddings, which map words into a unified vector space, capture rich semantic information. From a linguistic point of view, words have two carriers, speech and writing, yet most recent word embedding models focus only on the writing carrier and ignore the role of the speech carrier in semantic expression. However, in the development of language, speech appears before writing and plays an important role in the development of writing. For phonetic language systems, the written forms are secondary symbols of the spoken ones. Based on this idea, we propose double-carrier word embedding (DCWE). We use DCWE to simulate the generation order of speech and writing: written embedding is trained on top of phonetic embedding, and the final word embedding fuses the two. To illustrate that our model can be applied to most languages, we selected Chinese, English, and Spanish as examples and evaluated these models through word similarity and text classification experiments.

SentiFars: A Persian Polarity Lexicon for Sentiment Analysis

There is no doubt about the usefulness of public opinion on different issues expressed in social media and on the World Wide Web, but extracting people's feelings about an issue from text is not straightforward. Polarity lexicons, which assign polarity tags or scores to words and phrases, play an important role in sentiment analysis systems. As English is the richest language in this area, leveraging existing English resources to build new ones has attracted many researchers in recent years. In this paper, we propose a new translation-based approach for building polarity resources in resource-lean languages such as Persian. The results of an empirical evaluation confirm the effectiveness of the proposed approach. The generated resource is the largest publicly available polarity lexicon for Persian.
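The translation-based idea can be sketched as projecting scores from an English polarity lexicon into the target language through a bilingual dictionary, averaging when several English words map to the same target word. The words and scores below are invented for illustration; the paper's actual pipeline is more elaborate than this simple averaging.

```python
# Sketch of translation-based polarity lexicon transfer.

def transfer_lexicon(en_lexicon, en_to_target):
    """en_lexicon: English word -> polarity score in [-1, 1];
    en_to_target: English word -> list of target-language translations.
    Returns a target-language word -> averaged polarity score map."""
    scores = {}
    for en_word, polarity in en_lexicon.items():
        for tgt_word in en_to_target.get(en_word, []):
            scores.setdefault(tgt_word, []).append(polarity)
    return {w: sum(v) / len(v) for w, v in scores.items()}
```

Averaging over all English sources of a target word is one simple way to resolve translation ambiguity; more refined approaches weight translations by sense or corpus frequency.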

Order-Sensitive Keywords based Response Generation in Open-domain Conversational Systems

External keywords are crucial for response generation models to address the generic-response problem in open-domain conversational systems. The occurrence of keywords in a response depends heavily on their order, as they are generated sequentially; meanwhile, the order of keywords also affects the semantics of a response. Previous keyword-based methods mainly focus on the composition of keywords, while the order of keywords has not been sufficiently discussed. In this work, we propose an order-sensitive keyword-based model to explore the influence of keyword order. It automatically infers the most suitable order for generating a natural and relevant response to a given message, and subsequently generates the response using the ordered keywords as building blocks. We conducted experiments on a public Twitter dataset, and the results show that our approach outperforms state-of-the-art baselines in both automatic and human evaluations.

Extracting Polarity Shifting Patterns from Any Corpus Based on Natural Annotation

In recent years, online sentiment texts have been generated by users in various domains and in different languages. Binary polarity classification (positive or negative) of business sentiment texts can help both companies and customers evaluate products or services. Sometimes the polarity of sentiment texts can be modified, making polarity classification difficult. In sentiment analysis, such modification of polarity is termed polarity shifting, which shifts the polarity of a sentiment clue (emotion, evaluation, etc.). It is well known that detecting polarity shifting can help improve sentiment analysis of texts. However, detecting polarity shifting in corpora is challenging: 1) polarity shifting is normally sparse in texts, making human annotation difficult; 2) corpora with dense polarity shifting are few, so we may need polarity shifting patterns from various corpora. In this paper, an approach is presented to extract polarity shifting patterns from any text corpus. For the first time, we propose to select texts rich in polarity shifting by the idea of natural annotation, which replaces human annotation. With a sequence mining algorithm, the selected texts are used to generate polarity shifting pattern candidates, which we rank by C-value before human annotation. The approach is tested on different corpora and different languages. The results show that our approach can capture various types of polarity shifting patterns, and some patterns are unique to specific corpora. Therefore, for better performance, it is reasonable to construct polarity shifting patterns directly from the given corpus.
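The C-value ranking step can be sketched with one common formulation of C-value, which rewards longer, more frequent patterns and discounts candidates that mostly occur nested inside longer patterns. This is a generic version, not necessarily the exact variant used in the paper, and it assumes candidates are multi-token sequences (length >= 2).

```python
# Sketch of ranking pattern candidates by C-value before human review.
import math

def c_value(candidate, freq, longer_patterns):
    """candidate: tuple of tokens (length >= 2); freq: pattern -> corpus
    frequency; longer_patterns: patterns containing `candidate`."""
    length_weight = math.log2(len(candidate))
    if not longer_patterns:
        return length_weight * freq[candidate]
    # Discount by the mean frequency of the containing patterns.
    nested = sum(freq[p] for p in longer_patterns) / len(longer_patterns)
    return length_weight * (freq[candidate] - nested)

def rank_by_c_value(freq, nesting):
    """nesting: candidate -> list of longer patterns containing it.
    Returns candidates sorted by descending C-value, for human review."""
    return sorted(freq, key=lambda c: -c_value(c, freq, nesting.get(c, [])))
```

Annotators then inspect the top-ranked candidates first, which concentrates human effort on the patterns most likely to be genuine polarity shifters.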
