ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

Latest Articles

From Genesis to Creole Language: Transfer Learning for Singlish Universal Dependencies Parsing and POS Tagging

Singlish is of interest to the computational linguistics community both linguistically, as a major low-resource creole based on English, and computationally, for information extraction and sentiment analysis of regional social media. In our conference paper, Wang et al. (2017), we investigated part-of-speech (POS) tagging and dependency parsing...

Towards Burmese (Myanmar) Morphological Analysis: Syllable-based Tokenization and Part-of-speech Tagging

This article presents a comprehensive study on two primary tasks in Burmese (Myanmar) morphological analysis: tokenization and part-of-speech (POS) tagging. Twenty thousand Burmese sentences of newswire are annotated with two-layer tokenization and POS-tagging information, as one component of the Asian Language Treebank Project. The annotated...

Ancient–Modern Chinese Translation with a New Large Training Dataset

Ancient Chinese carries the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps to...

NEWS

Call for Nominations
Editor-In-Chief
ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

The term of the current Editor-in-Chief (EiC) of the ACM Trans. on Asian and Low-Resource Language Information Processing (TALLIP) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC.  TALLIP was established in 2002 and has been experiencing steady growth, with 178 submissions received in 2017.

Nominations, including self nominations, are invited for a three-year term as TALLIP EiC, beginning on June 1, 2019.  The EiC appointment may be renewed at most one time. This is an entirely voluntary position, but ACM will provide appropriate administrative support.

Appointed by the ACM Publications Board, Editors-in-Chief (EiCs) of ACM journals are delegated full responsibility for the editorial management of the journal consistent with the journal's charter and general ACM policies. The Board relies on EiCs to ensure that the content of the journal is of high quality and that the editorial review process is both timely and fair. The EiC has final say on acceptance of papers, size of the Editorial Board, and appointment of Associate Editors. A complete list of responsibilities is found in the ACM Volunteer Editors Position Descriptions. Additional information can be found in the following documents:

Nominations should include a vita along with a brief statement of why the nominee should be considered. Self-nominations are encouraged, and should include a statement of the candidate's vision for the future development of TALLIP. The deadline for submitting nominations is April 15, 2019, although nominations will continue to be accepted until the position is filled.

Please send all nominations to the nominating committee chair, Monojit Choudhury (monojitc@microsoft.com).

The search committee members are:

  • Monojit Choudhury (Microsoft Research, India), Chair
  • Kareem M. Darwish (Qatar Computing Research Institute, HBKU)
  • Tei-wei Kuo (National Taiwan University & Academia Sinica), EiC of ACM Transactions on Cyber-Physical Systems; Vice Chair, ACM SIGAPP
  • Helen Meng (Chinese University of Hong Kong)
  • Taro Watanabe (Google Inc., Tokyo)
  • Holly Rushmeier (Yale University), ACM Publications Board Liaison

Chinese Syntax Parsing Based on Sliding Match of Semantic String

Identifying and Analyzing different Aspects of English-Hindi Code-Switching in Twitter

Code-switching, the juxtaposition of linguistic units from two or more languages in a single utterance, has recently become very common in text, thanks to social media and other computer-mediated forms of communication. In this exploratory study of English-Hindi code-switching on Twitter, we automatically create a large corpus of code-switched tweets and devise techniques to identify the relationship between successive components in a code-switched tweet. More specifically, we identify pragmatic functions such as narrative-evaluative, negative reinforcement, and translation that characterize the relation between successive components. We analyze the differences and similarities between switching patterns in code-switched and monolingual multi-component tweets. We observe strong dominance of narrative-evaluative (non-opinion to opinion or vice versa) switching in both code-switched and monolingual multi-component tweets, in around 40% of cases. Polarity switching appears to be a prevalent switching phenomenon (10%) specifically in code-switched tweets (three to four times higher than in monolingual multi-component tweets), where the preference for expressing negative sentiment in Hindi is approximately twice that for English. Positive reinforcement appears to be an important pragmatic function for English multi-component tweets, whereas negative reinforcement plays a key role for Devanagari multi-component tweets. Our results also indicate that the extent and nature of code-switching strongly depend on the topic of discussion (sports, politics, etc.).

Sentiment Analysis for a Resource Poor Language - Roman Urdu

Deep Contextualized Word Embeddings for Universal Dependency Parsing

Deep contextualized word embeddings (ELMo for short), as an emerging and effective replacement for static word embeddings, have achieved success on a range of syntactic and semantic NLP problems. However, little is known about what is responsible for the improvements. In this paper, we focus on the effect of ELMo on a typical syntax problem: universal POS tagging and dependency parsing. We incorporate ELMo as additional word embeddings into the state-of-the-art POS tagger and dependency parser, and it leads to consistent performance improvements. Experimental results show that the model using ELMo outperforms the state-of-the-art baseline by an average of 0.91 for POS tagging and 1.11 for dependency parsing. Further analysis reveals that the improvements mainly result from ELMo's better abstraction ability on out-of-vocabulary (OOV) words, and that this ability is achieved by the character-level word representation in ELMo. Based on ELMo's advantage on OOV words, experiments that simulate low-resource settings are conducted, and the results show that deep contextualized word embeddings are effective for data-insufficient tasks where the OOV problem is severe.
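As a rough illustration of why the OOV problem grows in low-resource settings, the sketch below measures the share of test tokens that never occur in the training data, then repeats the measurement with a truncated training set. This is not the authors' experimental setup: the function name oov_rate, the toy sentences, and the truncation used to mimic a smaller corpus are all assumptions made for this example.

```python
def oov_rate(train_tokens, test_tokens):
    """Return the fraction of test tokens that do not occur in the training tokens."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy data, invented for illustration only.
train = "the cat sat on the mat".split()
test = "the dog sat on the rug".split()

full_rate = oov_rate(train, test)       # training vocabulary from all six tokens
small_rate = oov_rate(train[:2], test)  # assumed "low-resource" variant: first two tokens only

print(f"OOV rate, full training data:      {full_rate:.2f}")
print(f"OOV rate, truncated training data: {small_rate:.2f}")
```

With less training data, more test tokens fall outside the vocabulary. That is the condition under which the abstract reports character-level representations to help.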

Matching Graph, a Method for Extracting Parallel Information from Comparable Corpora

Comparable corpora are valuable alternatives to expensive parallel corpora. They comprise informative parallel fragments which are useful resources for different natural language processing tasks. In this work, a generative model is proposed for efficient extraction of parallel fragments from a pair of comparable documents. The core of the proposed model is a graph called the Matching Graph. The ability of the Matching Graph to be trained on a small initial seed makes it a suitable model for language pairs suffering from the scarce resource problem. Experiments show that the Matching Graph performs significantly better than other recently published models. According to the experiments on English-Persian and Arabic-Persian language pairs, the extracted parallel fragments can be used instead of parallel data for training statistical machine translation systems. Results reveal that the extracted fragments in the best case are able to retrieve about 90% of the information of a statistical machine translation system which is trained on a parallel corpus. Moreover, it is shown that using the extracted fragments as additional information for training statistical machine translation systems leads to an improvement of about 2% for English-Persian and about 1% for Arabic-Persian translation in BLEU score.

Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications

Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question answering. Recognizing the importance of NER, a plethora of NER techniques for Western and Asian languages have been developed. However, despite having over 490 million Urdu language speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this paper makes three key contributions. Firstly, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated NEs. The developed corpus has more than double the number of manually tagged NEs of any existing Urdu NER corpus. Secondly, we have generated four word embeddings using two different techniques, fastText and Word2vec, on two corpora of Urdu text. These are the only publicly available embeddings for the Urdu language, besides the recently released Urdu word embeddings by Facebook. Finally, we have pioneered the application of deep learning techniques, NN and RNN, for Urdu named entity recognition. Based on the analysis of the results, several valuable insights are provided about the effectiveness of deep learning techniques and the impact of word embeddings on these techniques.

Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages

Statistical machine translation (SMT) models require large bilingual corpora to produce high-quality results. Nevertheless, such large bilingual corpora are unavailable for almost all language pairs. In this work, we enhance SMT for low-resource languages using semantic similarity. Specifically, we focus on two strategies: sentence alignment and pivot translation. For sentence alignment, we use the representative method that is based on sentence length and word alignment as a baseline method. We utilize word2vec to extract word similarity from monolingual data to improve the word alignment phase in the baseline method. The proposed sentence alignment algorithm is used to build bilingual corpora from Wikipedia. In pivot translation, the representative method called triangulation connects source to target phrases via common pivot phrases in source-pivot and pivot-target phrase tables. Nevertheless, it may lack information when some pivot phrases have the same meaning but are not matched to each other. Therefore, we use similarity between pivot phrases to improve the triangulation method. Finally, we introduce a framework that combines the two proposed algorithms to improve SMT for low-resource languages. We conduct experiments on low-resource languages including Japanese-Vietnamese and Southeast Asian languages (Indonesian, Malay, Filipino, and Vietnamese). Experimental results show that our proposed methods of sentence alignment and pivot translation based on semantic similarity improve the baseline methods. The proposed framework significantly improves baseline SMT models trained on small bilingual corpora.
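The baseline triangulation step described in this abstract can be sketched as follows. This is a minimal illustration of the commonly used formulation, in which the source-to-target probability is the sum over shared pivot phrases of p(pivot | source) times p(target | pivot). It is not the authors' implementation: the function name triangulate, the nested-dictionary table layout, the placeholder phrases, and the toy probabilities are all assumptions, and the similarity-based improvement proposed in the paper is not shown.

```python
from collections import defaultdict


def triangulate(src_pivot, pivot_tgt):
    """Combine two phrase tables through pivot phrases that match exactly.

    src_pivot: {source_phrase: {pivot_phrase: p(pivot | source)}}
    pivot_tgt: {pivot_phrase: {target_phrase: p(target | pivot)}}
    Returns    {source_phrase: {target_phrase: p(target | source)}}.
    """
    src_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # A pivot phrase missing from the second table contributes nothing.
            for tgt, p_tgt_given_pivot in pivot_tgt.get(pivot, {}).items():
                src_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(tgts) for src, tgts in src_tgt.items()}


# Toy phrase tables with invented values, for illustration only.
src_pivot = {"source_phrase": {"thank you": 0.6, "thanks": 0.4}}
pivot_tgt = {"thank you": {"target_phrase": 1.0}}

table = triangulate(src_pivot, pivot_tgt)
print(table)
```

In this toy case the pivot phrase "thanks" has no exact match in the pivot-target table, so its probability mass is dropped even though it means the same as "thank you". This is the kind of information loss the abstract says similarity between pivot phrases is meant to reduce.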
