ACM Transactions on

Asian and Low-Resource Language Information Processing (TALLIP)

Latest Articles

Handwritten Manipuri Meetei-Mayek Classification Using Convolutional Neural Network

A new technique for classifying all 56 different characters of the Manipuri Meetei-Mayek (MMM) is proposed herein. The characters are grouped under...

A Neural Semantic Parser for Math Problems Incorporating Multi-Sentence Information

In this article, we study the problem of parsing a math problem into logical forms. It is an essential pre-processing step for automatically solving...

A Supplementary Feature Set for Sentiment Analysis in Japanese Dialogues

Recently, real-time affect-awareness has been applied in several commercial systems, such as dialogue systems and computer games. Real-time...

A Sense Annotated Corpus for All-Words Urdu Word Sense Disambiguation

Word Sense Disambiguation (WSD) aims to automatically predict the correct sense of a word used in a given context. All human languages exhibit word...

Chinese-Catalan: A Neural Machine Translation Approach Based on Pivoting and Attention Mechanisms

This article innovatively addresses machine translation from Chinese to Catalan using neural pivot strategies trained without any direct parallel data. The Catalan language is very similar to Spanish from a linguistic point of view, which motivates the use of Spanish as pivot language. Regarding neural architecture, we are using the latest...

POS Tag-enhanced Coarse-to-fine Attention for Neural Machine Translation

Although neural machine translation (NMT) has certain capability to implicitly learn semantic information of sentences, we explore and show that...

Multi-Entity Aspect-Based Sentiment Analysis with Context, Entity, Aspect Memory and Dependency Information

Fine-grained sentiment analysis is a useful tool for producers to understand consumers’...

NEWS

Call for Nominations
Editor-In-Chief
ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

The term of the current Editor-in-Chief (EiC) of the ACM Trans. on Asian and Low-Resource Language Information Processing (TALLIP) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC.  TALLIP was established in 2002 and has been experiencing steady growth, with 178 submissions received in 2017.

Nominations, including self-nominations, are invited for a three-year term as TALLIP EiC, beginning on June 1, 2019. The EiC appointment may be renewed at most one time. This is an entirely voluntary position, but ACM will provide appropriate administrative support.

Appointed by the ACM Publications Board, Editors-in-Chief (EiCs) of ACM journals are delegated full responsibility for the editorial management of the journal consistent with the journal's charter and general ACM policies. The Board relies on EiCs to ensure that the content of the journal is of high quality and that the editorial review process is both timely and fair. The EiC has the final say on acceptance of papers, size of the Editorial Board, and appointment of Associate Editors. A complete list of responsibilities is found in the ACM Volunteer Editors Position Descriptions. Additional information can be found in the following documents:

Nominations should include a vita along with a brief statement of why the nominee should be considered. Self-nominations are encouraged, and should include a statement of the candidate's vision for the future development of TALLIP. The deadline for submitting nominations is April 15, 2019, although nominations will continue to be accepted until the position is filled.

Please send all nominations to the nominating committee chair, Monojit Choudhury.

The search committee members are:

  • Monojit Choudhury (Microsoft Research, India), Chair
  • Kareem M. Darwish (Qatar Computing Research Institute, HBKU)
  • Tei-wei Kuo (National Taiwan University & Academia Sinica), EiC of ACM Transactions on Cyber-Physical Systems; Vice Chair, ACM SIGAPP
  • Helen Meng (Chinese University of Hong Kong)
  • Taro Watanabe (Google Inc., Tokyo)
  • Holly Rushmeier (Yale University), ACM Publications Board Liaison

Chinese Syntax Parsing Based on Sliding Match of Semantic String

Syntax-Based Chinese-Vietnamese Tree-to-Tree Statistical Machine Translation with Bilingual Features

The poor quality of the available Chinese-Vietnamese bilingual parallel corpora makes existing Chinese-Vietnamese machine translation unsatisfactory. Considering the differences between Chinese and Vietnamese, we propose a method of Chinese-Vietnamese tree-to-tree statistical machine translation with language features. Language difference features play a good supervisory role in machine translation. By analyzing the syntactic differences between Chinese and Vietnamese, we define several language difference rules: an attributive postposition award, a time adverbial postposition award, and a locative adverbial postposition award. On the basis of a Chinese-Vietnamese bilingual word-aligned corpus, these awards are incorporated into the extraction of tree-to-tree translation rules. The defined rules are used to constrain decoding and to prune and optimize the candidate sentences, and as a result we obtain the optimal translation sequence. Experiments on Chinese-Vietnamese bilingual sentence translation show that the proposed method performs well and that syntax difference features can greatly improve the efficiency and accuracy of translation.

Chinese Zero Pronoun Resolution: A Chain to Chain Approach

Chinese zero pronoun (ZP) resolution plays a critical role in discourse analysis. Unlike traditional mention-to-mention approaches, this paper proposes a chain-to-chain approach that improves ZP resolution in three respects. First, consecutive ZPs are clustered into coreferential chains, each working as a single independent anaphor. In this way, ZPs far away from their overt antecedents can be bridged via other consecutive ZPs in the same coreferential chain and thus better resolved. Second, common noun phrases (NPs) are automatically grouped into coreferential chains using traditional approaches, each working as a single independent antecedent candidate. That is, NPs occurring in the same coreferential chain are viewed as one antecedent candidate, and ZP resolution is performed between ZP coreferential chains and common NP coreferential chains. In this way, performance can be improved considerably because pruning singletons and negative instances effectively reduces the search space. Third, additional features from ZP and common NP coreferential chains are employed to better represent anaphors and their antecedent candidates, respectively. Comprehensive experiments on the OntoNotes V5.0 corpus show that our chain-to-chain approach significantly outperforms the state-of-the-art mention-to-mention approaches. To our knowledge, this is the first work to resolve zero pronouns in a chain-to-chain way.

Identifying and Analyzing Different Aspects of English-Hindi Code-Switching in Twitter

Code-switching, the juxtaposition of linguistic units from two or more languages in a single utterance, has recently become very common in text, thanks to social media and other computer-mediated forms of communication. In this exploratory study of English-Hindi code-switching on Twitter, we automatically create a large corpus of code-switched tweets and devise techniques to identify the relationship between successive components in a code-switched tweet. More specifically, we identify pragmatic functions such as narrative-evaluative, negative reinforcement, and translation that characterize the relation between successive components. We analyze the differences and similarities between switching patterns in code-switched and monolingual multi-component tweets. We observe a strong dominance of narrative-evaluative (non-opinion to opinion or vice versa) switching in both code-switched and monolingual multi-component tweets, in around 40% of cases. Polarity switching appears to be a prevalent switching phenomenon (10%) specifically in code-switched tweets (three to four times higher than in monolingual multi-component tweets), where the preference for expressing negative sentiment in Hindi is approximately twice that in English. Positive reinforcement appears to be an important pragmatic function for English multi-component tweets, whereas negative reinforcement plays a key role for Devanagari multi-component tweets. Our results also indicate that the extent and nature of code-switching depend strongly on the topic of discussion (sports, politics, etc.).

From Genesis to Creole language: Transfer Learning for Singlish Universal Dependencies Parsing and POS Tagging

Singlish can be interesting to the computational linguistics community both linguistically, as a major low-resource creole based on English, and computationally, for information extraction and sentiment analysis of regional social media. In our conference paper, Wang et al. [2017], we investigated part-of-speech (POS) tagging and dependency parsing for Singlish by constructing a treebank under the Universal Dependencies scheme, and successfully used neural stacking models to integrate English syntactic knowledge for boosting Singlish POS tagging and dependency parsing, achieving state-of-the-art accuracies of 89.50% and 84.47% for Singlish POS tagging and dependency parsing, respectively. In this work, we substantially extend Wang et al. [2017] by enlarging the Singlish treebank to more than triple the size and with much more diversity in topics, as well as further exploring neural multi-task models for integrating English syntactic knowledge. Results show that the enlarged treebank achieves significant relative error reductions of 45.8% and 15.5% on the base model, 27% and 10% on the neural multi-task model, and 21% and 15% on the neural stacking model for POS tagging and dependency parsing, respectively. Moreover, the state-of-the-art Singlish POS tagging and dependency parsing accuracies have been improved to 91.45% and 85.57%, respectively. We make our treebanks and models available for further research.
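For readers unfamiliar with the term, relative error reduction compares error rates rather than accuracies. The following is a minimal sketch, assuming the usual definition (old error minus new error, divided by old error); the function name is hypothetical. It is applied here only to the two pairs of state-of-the-art accuracies quoted in the abstract, so its output is not one of the per-model figures the abstract reports.

```python
def relative_error_reduction(old_accuracy: float, new_accuracy: float) -> float:
    """Relative error reduction in percent, for accuracies given in percent.

    Assumes the common definition: (old_error - new_error) / old_error.
    """
    old_error = 100.0 - old_accuracy
    new_error = 100.0 - new_accuracy
    return (old_error - new_error) / old_error * 100.0


# Accuracies quoted in the abstract: earlier vs. improved state of the art.
pos_rer = relative_error_reduction(89.50, 91.45)
dep_rer = relative_error_reduction(84.47, 85.57)
print(f"POS tagging: {pos_rer:.1f}% relative error reduction")
print(f"Dependency parsing: {dep_rer:.1f}% relative error reduction")
```

Because the error rate is the denominator, a gain of about two accuracy points on a 10.5% error rate is a much larger relative change than the raw accuracy difference suggests.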

Transform, combine and transfer: Delexicalized transfer parser for low-resource languages

Transfer parsing has been used for developing dependency parsers for languages with no treebank using transfer from treebanks of other languages (source languages). In delexicalized transfer parsing the words are replaced by their part-of-speech tags. Transfer parsing may not work well if a language does not follow uniform syntactic structure with respect to its different constituent patterns. Earlier work has used information derived from linguistic databases to transform a source language treebank to reduce the syntactic differences between the source and the target languages. We propose a transformation method where a source language pattern is transformed stochastically to one of the multiple possible patterns followed in the target language. The transformed source language treebank can be used to train a delexicalized parser in the target language. We show that this method significantly improves average performance of single-source delexicalized transfer parsers. We also propose a multi-source transfer parsing approach by concatenating transformed source language treebanks and show that the multi-source parsers work better when using a subset of the source language treebanks rather than all of them or only one. The treebanks are selected greedily based on the labelled attachment scores of the corresponding single-source parser trained using the treebank after transformation.
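Delexicalization, as described in this abstract, replaces each word with its part-of-speech tag so that a parser trained on one language can be applied to another. The sketch below illustrates only that step; the `Token` type, its field names, and the example sentence are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """One treebank token (hypothetical, simplified layout)."""
    form: str    # surface word
    pos: str     # part-of-speech tag
    head: int    # 1-based index of the head token, 0 for the root
    deprel: str  # dependency relation label


def delexicalize(sentence: List[Token]) -> List[Token]:
    """Replace every word form with its POS tag, keeping the tree unchanged."""
    return [tok._replace(form=tok.pos) for tok in sentence]


sentence = [
    Token("She", "PRON", 2, "nsubj"),
    Token("reads", "VERB", 0, "root"),
    Token("books", "NOUN", 2, "obj"),
]
delex = delexicalize(sentence)
print([tok.form for tok in delex])
```

The heads and relation labels are left intact, so the delexicalized treebank still carries the syntactic structure that the transfer parser learns from.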

Sentiment Analysis for a Resource Poor Language - Roman Urdu

Chinese Zero Pronoun Resolution: A Collaborative Filtering-based Approach

Semantic information, which has been proven necessary for the resolution of common noun phrases, is typically ignored by most existing Chinese zero pronoun resolvers. This is because zero pronouns convey no descriptive information, which makes it almost impossible to calculate semantic similarities between a zero pronoun and its candidate antecedents. Moreover, most traditional approaches are based on the single-candidate model, which considers the candidate antecedents of a zero pronoun in isolation and thus overlooks their reciprocities. To address these problems, we first propose a neural network-based zero pronoun resolver (NZR) that is capable of generating vector-space semantics of zero pronouns and candidate antecedents. On the basis of NZR, we develop a collaborative filtering-based framework for the Chinese zero pronoun resolution task, exploring the reciprocities between the candidate antecedents of a zero pronoun to re-estimate their importance more rationally. Experimental results on the Chinese portion of the OntoNotes corpus are encouraging: our proposed model substantially surpasses the Chinese zero pronoun resolution baseline systems.

Deep Contextualized Word Embeddings for Universal Dependency Parsing

Deep contextualized word embeddings (ELMo), an emerging and effective replacement for static word embeddings, have achieved success on a range of syntactic and semantic NLP problems. However, little is known about what is responsible for the improvements. In this paper, we focus on the effect of ELMo on a typical syntax problem: universal POS tagging and dependency parsing. We incorporate ELMo as additional word embeddings into the state-of-the-art POS tagger and dependency parser, and this leads to consistent performance improvements. Experimental results show that the model using ELMo outperforms the state-of-the-art baseline by an average of 0.91 for POS tagging and 1.11 for dependency parsing. Further analysis reveals that the improvements mainly result from ELMo's better abstraction ability on out-of-vocabulary (OOV) words, and that this ability is achieved by the character-level word representation in ELMo. Based on ELMo's advantage on OOV words, experiments that simulate low-resource settings are conducted, and the results show that deep contextualized word embeddings are effective for data-insufficient tasks where the OOV problem is severe.

Matching Graph, a Method for Extracting Parallel Information from Comparable Corpora

Comparable corpora are valuable alternatives to expensive parallel corpora. They contain informative parallel fragments, which are useful resources for different natural language processing tasks. In this work, a generative model is proposed for efficient extraction of parallel fragments from a pair of comparable documents. The core of the proposed model is a graph called the Matching Graph. The ability of the Matching Graph to be trained on a small initial seed makes it a suitable model for language pairs suffering from scarce resources. Experiments show that the Matching Graph performs significantly better than other recently published models. According to experiments on the English-Persian and Arabic-Persian language pairs, the extracted parallel fragments can be used instead of parallel data for training statistical machine translation systems. Results reveal that the extracted fragments in the best case are able to retrieve about 90% of the information of a statistical machine translation system trained on a parallel corpus. Moreover, it is shown that using the extracted fragments as additional information for training statistical machine translation systems leads to an improvement of about 2% for English-Persian and about 1% for Arabic-Persian translation on BLEU score.

Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications

Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question answering. Recognizing the importance of NER, a plethora of NER techniques for Western and Asian languages have been developed. However, despite there being over 490 million Urdu language speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this paper makes three key contributions. Firstly, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated NEs. The developed corpus has more than double the number of manually tagged NEs of any existing Urdu NER corpus. Secondly, we have generated four word embeddings using two different techniques, fastText and Word2vec, on two corpora of Urdu text. These are the only publicly available embeddings for the Urdu language, besides the recently released Urdu word embeddings by Facebook. Finally, we have pioneered the application of deep learning techniques, NN and RNN, for Urdu named entity recognition. Based on the analysis of the results, several valuable insights are provided about the effectiveness of deep learning techniques and the impact of word embeddings on these techniques.

Machine Translation Evaluation Metric Based on Dependency Parsing Model

Most syntax-based metrics obtain the similarity by comparing sub-structures extracted from the trees of the hypothesis and the reference. These sub-structures cannot represent all the information in the trees because their lengths are limited. To make sufficient use of the reference syntax information, a new automatic evaluation metric is proposed based on a dependency parsing model. First, a dependency parsing model is trained using the reference dependency tree for each sentence. Then, the hypothesis is parsed by this dependency parsing model and the corresponding hypothesis dependency tree is generated. The quality of the hypothesis can be judged by the quality of the hypothesis dependency tree. Unigram F-score is included in the new metric so that lexical similarity is captured. According to experimental results, the proposed metric performs better than METEOR and BLEU at the system level and gives results comparable with METEOR at the sentence level. To further improve the performance, we also propose a combined metric, which achieves the best performance at both the sentence level and the system level.
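The lexical component mentioned in this abstract, a unigram F-score between hypothesis and reference, can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, clipped unigram counts, and a balanced F1 weighting of precision and recall. The abstract does not specify these details, and the function name is hypothetical.

```python
from collections import Counter


def unigram_f_score(hypothesis: str, reference: str) -> float:
    """Balanced F-score over clipped unigram matches (whitespace tokens)."""
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps, for each word, the smaller of the two counts.
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = unigram_f_score("the cat sat", "the cat sat down")
print(f"unigram F-score: {score:.3f}")
```

In this toy pair every hypothesis word appears in the reference (precision 1.0) but one reference word is missing (recall 0.75), so the score falls between the two.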

Ancient-Modern Chinese Translation with a New Large Training Dataset

Ancient Chinese carries the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps to inherit and carry forward the quintessence of the ancients. However, the lack of a large-scale parallel corpus limits the study of Ancient-Modern Chinese machine translation. In this paper, we propose an Ancient-Modern Chinese clause alignment approach based on the characteristics of these two languages. This method combines lexical and statistical information and achieves a 94.2 F1-score on our manually annotated test set. We use this method to create a new large-scale Ancient-Modern Chinese parallel corpus that contains over 1.24M bilingual pairs. To the best of our knowledge, this is the first large high-quality Ancient-Modern Chinese dataset. Furthermore, we analyze and compare the performance of SMT and various NMT-based models on this dataset and provide a strong baseline for this task.

Towards Burmese (Myanmar) Morphological Analysis: Syllable-Based Tokenization and Part-of-Speech Tagging

This paper presents a comprehensive study on Burmese (Myanmar) morphological analysis, from annotated data preparation to experiment-based investigation. Twenty thousand Burmese sentences from the news domain are annotated with morphological information as one component of the Asian Language Treebank Project. The annotation includes two-layer tokenization and part-of-speech (POS) tagging, to provide rich information on the morphological level and on the syntactic constituent level. The annotated corpus has been released under a CC BY-NC-SA license, and it was the largest open-access database of annotated Burmese when this manuscript was prepared in 2017. Detailed descriptions of the preparation, refinement, and features of the annotated corpus are provided in the first half of the paper. Facilitated by the deliberately prepared corpus, experiment-based investigations of Burmese morphological analysis are presented in the second half of the paper, wherein the standard sequence-labeling approach of conditional random fields and a long short-term memory (LSTM) based recurrent neural network (RNN) are applied and discussed. We obtained several general conclusions on the Burmese morphological analysis task, covering the scheme design of output tags, the effect of joint tokenization and POS tagging, and the importance of ensembles for stabilizing the performance of the LSTM-based RNN. This study provides a solid basis for further studies on Burmese processing. Owing to the present study, in terms of morphological analysis, Burmese should no longer be referred to as a low-resourced or under-studied language.

Automatic Diacritics Restoration for Tunisian Dialect

Modern Standard Arabic, as well as the Arabic dialects, is usually written without diacritics. The absence of these marks constitutes a real problem for the automatic processing of such data by NLP tools. Indeed, writing Arabic without diacritics introduces several types of ambiguity. Firstly, a word without diacritics could have many possible meanings depending on its diacritization. Secondly, undiacritized surface forms of an Arabic word might have as many as 200 readings depending on the complexity of its morphology [12]. In fact, the agglutination property of Arabic might produce a problem that can only be resolved using diacritics. Thirdly, without diacritics a word could have many possible POS tags instead of one. This is the case with words that have the same spelling and POS tag but a different lexical sense, or words that have the same spelling but different POS tags and lexical senses [8]. Finally, there is ambiguity at the grammatical level (syntactic ambiguity). In this paper, we propose the first work that investigates the automatic diacritization of Tunisian Dialect texts. We first describe our annotation guidelines and procedure. Then, we propose two major models, namely a statistical machine translation (SMT) model and a discriminative model that treats diacritization as a sequence classification task based on Conditional Random Fields (CRFs). In the second approach, we integrate POS features to influence the generation of diacritics. Diacritics restoration was performed at both the word and the character levels. The CRF system achieved the better automatic diacritization results (WER 21.44% for CRF versus WER 34.6% for SMT).
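The word error rate (WER) figures quoted above can be read with the following minimal sketch in mind. It assumes the simplest definition for diacritic restoration, where the output has exactly one word per input word and a word counts as an error if any of its diacritics differs from the gold form. The abstract does not state the exact definition used, and the function name and placeholder strings are hypothetical.

```python
from typing import List


def diacritization_wer(predicted: List[str], gold: List[str]) -> float:
    """Percentage of words whose predicted diacritized form differs from gold.

    Assumes a one-to-one word alignment, which holds when diacritization
    only adds marks to existing words.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same number of words")
    if not gold:
        return 0.0
    errors = sum(p != g for p, g in zip(predicted, gold))
    return 100.0 * errors / len(gold)


# Placeholder transliterated words, for illustration only.
gold_words = ["kataba", "kitaabun", "kutubun", "kaatibun"]
predicted_words = ["kataba", "kitaabun", "kutiba", "kaatibun"]
print(f"WER: {diacritization_wer(predicted_words, gold_words):.1f}%")
```

Under this reading, a WER of 21.44% means roughly one word in five receives at least one wrong or missing diacritic.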

Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages

Statistical machine translation (SMT) models require large bilingual corpora to produce high-quality results. Nevertheless, such large bilingual corpora are unavailable for most language pairs. In this work, we enhance SMT for low-resource languages using semantic similarity. Specifically, we focus on two strategies: sentence alignment and pivot translation. For sentence alignment, we use a representative method based on sentence length and word alignment as the baseline. We utilize word2vec to extract word similarity from monolingual data to improve the word alignment phase of the baseline method. The proposed sentence alignment algorithm is used to build bilingual corpora from Wikipedia. In pivot translation, the representative method, called triangulation, connects source to target phrases via common pivot phrases in the source-pivot and pivot-target phrase tables. Nevertheless, it may lose information when pivot phrases have the same meaning but are not matched to each other. Therefore, we use similarity between pivot phrases to improve the triangulation method. Finally, we introduce a framework that combines the two proposed algorithms to improve SMT for low-resource languages. We conduct experiments on low-resource languages, including Japanese-Vietnamese and Southeast Asian languages (Indonesian, Malay, Filipino, and Vietnamese). Experimental results show that our proposed methods of sentence alignment and pivot translation based on semantic similarity improve on the baseline methods. The proposed framework significantly improves baseline SMT models trained on small bilingual corpora.
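The baseline triangulation step described in this abstract can be illustrated with a small sketch. It assumes the commonly used formulation in which the source-to-target probability is the sum, over shared pivot phrases, of the product of the source-to-pivot and pivot-to-target probabilities. The phrase tables, the choice of English as pivot, and all names are hypothetical toy values. The sketch covers only exact pivot matching, not the similarity-based extension the authors propose.

```python
from collections import defaultdict
from typing import Dict

# phrase -> {translation -> probability}; a toy stand-in for a phrase table.
PhraseTable = Dict[str, Dict[str, float]]


def triangulate(src_to_pivot: PhraseTable, pivot_to_tgt: PhraseTable) -> PhraseTable:
    """Build a source-target table by summing over exactly matching pivot phrases."""
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # Pivot phrases missing from the pivot-target table contribute nothing.
            for tgt, p_tgt_given_pivot in pivot_to_tgt.get(pivot, {}).items():
                src_to_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(targets) for src, targets in src_to_tgt.items()}


# Hypothetical Japanese -> English (pivot) -> Vietnamese entries.
src_to_pivot = {"neko": {"cat": 0.9, "kitty": 0.1}}
pivot_to_tgt = {"cat": {"con mèo": 1.0}}

table = triangulate(src_to_pivot, pivot_to_tgt)
print(table)
```

In this toy example the pivot phrase "kitty" has no entry in the pivot-target table, so its probability mass is dropped. That is the kind of information loss the abstract attributes to unmatched pivot phrases with the same meaning.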
