
ACM Transactions on

Asian and Low-Resource Language Information Processing (TALLIP)

Latest Articles

From Genesis to Creole Language: Transfer Learning for Singlish Universal Dependencies Parsing and POS Tagging

Singlish can be interesting to the computational linguistics community both linguistically, as a major low-resource creole based on English, and computationally, for information extraction and sentiment analysis of regional social media. In our conference paper, Wang et al. (2017), we investigated part-of-speech (POS) tagging and dependency parsing... (more)

Chinese Zero Pronoun Resolution: A Chain-to-chain Approach

Chinese zero pronoun (ZP) resolution plays a critical role in discourse analysis. Different from traditional mention-to-mention approaches, this article proposes a chain-to-chain approach to improve the performance of ZP resolution in three aspects. First, consecutive ZPs are clustered into coreferential chains, each working as one independent... (more)

Chinese Zero Pronoun Resolution: A Collaborative Filtering-based Approach

Semantic information, which has been proven necessary for the resolution of common noun phrases, is typically ignored by most existing Chinese zero pronoun resolvers. This is because zero pronouns convey no descriptive information, which makes it almost impossible to calculate semantic similarities between the zero pronoun and its candidate... (more)

Transform, Combine, and Transfer: Delexicalized Transfer Parser for Low-resource Languages

Transfer parsing has been used to develop dependency parsers for languages with no treebank by transferring from the treebanks of other languages (source languages). In delexicalized transfer, parsed words are replaced by their part-of-speech tags. Transfer parsing may not work well if a language does not follow uniform syntactic structure with... (more)

Towards Burmese (Myanmar) Morphological Analysis: Syllable-based Tokenization and Part-of-speech Tagging

This article presents a comprehensive study on two primary tasks in Burmese (Myanmar) morphological analysis: tokenization and part-of-speech (POS) tagging. Twenty thousand Burmese sentences of newswire are annotated with two-layer tokenization and POS-tagging information, as one component of the Asian Language Treebank Project. The annotated... (more)

Ancient–Modern Chinese Translation with a New Large Training Dataset

Ancient Chinese texts carry the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps to... (more)

Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications

Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question-answering systems. Recognizing the importance of NER, a plethora of NER techniques for Western and Asian languages have been developed. However, despite having over 490 million Urdu language speakers... (more)

NEWS

Call for Nominations
Editor-In-Chief
ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)


The term of the current Editor-in-Chief (EiC) of the ACM Trans. on Asian and Low-Resource Language Information Processing (TALLIP) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC.  TALLIP was established in 2002 and has been experiencing steady growth, with 178 submissions received in 2017.

Nominations, including self-nominations, are invited for a three-year term as TALLIP EiC, beginning on June 1, 2019. The EiC appointment may be renewed at most once. This is an entirely voluntary position, but ACM will provide appropriate administrative support.

Appointed by the ACM Publications Board, Editors-in-Chief (EiCs) of ACM journals are delegated full responsibility for the editorial management of the journal, consistent with the journal's charter and general ACM policies. The Board relies on EiCs to ensure that the content of the journal is of high quality and that the editorial review process is both timely and fair. The EiC has final say on the acceptance of papers, the size of the Editorial Board, and the appointment of Associate Editors. A complete list of responsibilities is found in the ACM Volunteer Editors Position Descriptions.

Nominations should include a vita along with a brief statement of why the nominee should be considered. Self-nominations are encouraged, and should include a statement of the candidate's vision for the future development of TALLIP. The deadline for submitting nominations is April 15, 2019, although nominations will continue to be accepted until the position is filled.

Please send all nominations to the nominating committee chair, Monojit Choudhury ([email protected]).

The search committee members are:

  • Monojit Choudhury (Microsoft Research, India), Chair
  • Kareem M. Darwish (Qatar Computing Research Institute, HBKU)
  • Tei-wei Kuo (National Taiwan University & Academia Sinica) EiC of ACM Transactions on Cyber-Physical Systems; Vice Chair, ACM SIGAPP
  • Helen Meng, (Chinese University of Hong Kong)
  • Taro Watanabe (Google Inc., Tokyo)
  • Holly Rushmeier (Yale University), ACM Publications Board Liaison

Chinese Syntax Parsing Based on Sliding Match of Semantic String

Wasf-Vec: Topology-Based Word Embedding for Modern Standard Arabic and Iraqi Dialect Ontology

Word clustering is a crucial issue in low-resource languages. Since words that share semantics are expected to cluster together, it is common to use feature-vector representations generated by word embedding methods based on distributional theory. The goal of this work is to exploit Modern Standard Arabic (MSA) to improve the clustering of the low-resource Iraqi dialect vocabulary. We start with a fast dialect stemming algorithm that uses the MSA data and achieves an accuracy of 0.85 measured by F1 score, then train a distributional word embedding on the stemmed data. We then analyze how dialect words cluster among other MSA words under word semantic relations that are well supported by solid linguistic theories, shedding light on the strong and weak relation representations. The analysis visualizes the first two PCA components in 2D space, examines words' nearest neighbors, and inspects distance histograms of specific word templates. We propose Wasf-Vec, a new simple yet effective spatial feature vector for word representation that exploits words' orthographic, phonological, and morphological structure. The Wasf technique captures relations that are not context-based, unlike distributional word embedding methods. We validate the word classification used in this paper by employing the classes in class-based language modeling (CBLM): Wasf-Vec CBLM achieves 7% lower perplexity (pp) than the distributional word embedding CBLM. This result is significant when working with low-resource languages.

Filtered Pseudo-Parallel Corpus Improves Low-Resource Neural Machine Translation

Large-scale parallel corpora are essential for training high-quality machine translation systems, but such corpora are not freely available for many language pairs. In previous studies, training data has been augmented with pseudo-parallel corpora obtained by using machine translation models to translate monolingual corpora into the source language. However, for low-resource language pairs, where only low-accuracy machine translation systems are available, translation quality degrades when a pseudo-parallel corpus is used naively. To improve machine translation performance for low-resource language pairs, we propose a method that expands the training data effectively by filtering the pseudo-parallel corpus with quality estimation based on sentence-level round-trip translation. We experimented on three language pairs using small, medium, and large parallel corpora and observed that BLEU scores improved significantly for low-resource language pairs.
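The round-trip filtering step described above can be sketched concretely. In this illustrative snippet (not the paper's implementation), `back_translate` stands in for the reverse MT model, `sentence_bleu` is a simple add-one-smoothed sentence-level BLEU, and the 0.3 threshold is an arbitrary placeholder:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothing on the n-gram precisions."""
    hyp, ref = hyp.split(), ref.split()
    if not hyp:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())          # clipped n-gram matches
        total = max(sum(h.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

def filter_pseudo_parallel(pairs, back_translate, threshold=0.3):
    """Keep (src, tgt) pairs whose source survives round-trip translation.

    `back_translate` maps a target sentence back to the source language;
    in the paper this is an MT model, here it is any callable."""
    return [(src, tgt) for src, tgt in pairs
            if sentence_bleu(back_translate(tgt), src) >= threshold]
```

A pair whose back-translation shares little with the original source falls below the threshold and is discarded, which is the quality-estimation idea in miniature.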

An Automatic and A Machine-Assisted Method to Clean Bilingual Corpus

Two different methods of corpus cleaning are presented in this paper. One is a machine-assisted technique suited to cleaning small parallel corpora; the other is an automatic method suitable for cleaning large parallel corpora. A baseline SMT system (MOSES) is used to evaluate these methods. The machine-assisted technique uses two features: word alignment and the lengths of the source and target sentences. These features are used to detect mistranslations in the corpus, which are then handled by a human translator. Experiments are conducted on the EILMT (English to Indian Language Machine Translation) English-Hindi corpus; the BLEU score improves by 0.27% on the cleaned corpus. The automatic method combines two features: the lengths of the source and target sentences, and the Viterbi alignment score generated by a Hidden Markov Model (HMM) for each sentence pair. A separate threshold value is used for each feature; the two thresholds are set using a small manually annotated parallel corpus of 206 sentence pairs. Experiments are conducted on the ACL (Association for Computational Linguistics) 2014 English-Hindi corpus; the BLEU score improves by 0.6% on the cleaned corpus.
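The automatic method's two-feature filter might look like the following sketch. The `align_score` callable stands in for the HMM Viterbi aligner, and both threshold values are illustrative placeholders (the paper tunes its thresholds on the 206-pair annotated corpus):

```python
def clean_corpus(pairs, align_score, len_ratio_max=2.0, score_min=-8.0):
    """Keep sentence pairs that pass both filters: a bound on the
    source/target length ratio, and a minimum (log) Viterbi alignment
    score. `align_score(src, tgt)` stands in for an HMM word aligner."""
    kept = []
    for src, tgt in pairs:
        ls, lt = len(src.split()), len(tgt.split())
        ratio = max(ls, lt) / max(min(ls, lt), 1)  # length-ratio feature
        if ratio <= len_ratio_max and align_score(src, tgt) >= score_min:
            kept.append((src, tgt))
    return kept
```

Pairs failing either threshold (a wildly mismatched length, or an implausible alignment) are dropped before SMT training.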

Importance of Signal Processing Cues in Transcription Correction for Low Resource Indian Languages

Accurate phonetic transcriptions are crucial for building robust acoustic models for speech recognition and speech synthesis applications. Phonetic transcriptions are not usually provided with speech corpora. A lexicon is used to generate phone-level transcriptions of speech corpora from sentence-level transcriptions; when lexical entries are not available, letter-to-sound (LTS) rules are used. Whether from a lexicon or LTS rules, the pronunciation rules are generic and may not match the spoken utterance, which can lead to transcription errors. The objective of this study is to address the mismatch between a transcription and its acoustic realisation; in particular, the issue of vowel deletions is studied. Group-delay-based segmentation is used to detect insertion/deletion of vowels in the speech utterance, and the transcriptions in the training data are corrected accordingly. Automatic speech recognition (ASR) and text-to-speech synthesis (TTS) systems built with the corrected transcriptions show improved performance.

Identifying and Analyzing different Aspects of English-Hindi Code-Switching in Twitter

Code-switching, the juxtaposition of linguistic units from two or more languages in a single utterance, has recently become very common in text, thanks to social media and other computer-mediated forms of communication. In this exploratory study of English-Hindi code-switching on Twitter, we automatically create a large corpus of code-switched tweets and devise techniques to identify the relationship between successive components in a code-switched tweet. More specifically, we identify pragmatic functions such as narrative-evaluative, negative reinforcement, and translation that characterize the relation between successive components. We analyze the differences and similarities between switching patterns in code-switched and monolingual multi-component tweets. We observe a strong dominance of narrative-evaluative (non-opinion to opinion, or vice versa) switching in both code-switched and monolingual multi-component tweets, in around 40% of cases. Polarity switching appears to be a prevalent switching phenomenon (10%) specifically in code-switched tweets (three to four times higher than in monolingual multi-component tweets), where the preference for expressing negative sentiment in Hindi is approximately twice that of English. Positive reinforcement appears to be an important pragmatic function for English multi-component tweets, whereas negative reinforcement plays a key role for Devanagari multi-component tweets. Our results also indicate that the extent and nature of code-switching strongly depend on the topic (sports, politics, etc.) of discussion.

Explicitly Modeling Word Translations in Neural Machine Translation

In this paper, we show that word translations can be explicitly and effectively incorporated into NMT to avoid wrong translations. Specifically, we propose three cross-lingual encoders: 1) a factored encoder, which encodes a word and its translation in a vertical way; 2) a gated encoder, which uses a gating mechanism to selectively control the amount of word-translation information moving forward; and 3) a mixed encoder, which jointly learns a word and its translation annotations over sequences in which words and their translations are alternately mixed. In addition, we first use a simple word-dictionary approach and then a word sense disambiguation (WSD) approach to model the word context for better word translation. Experiments on Chinese-to-English translation demonstrate that all proposed encoders improve translation accuracy for both traditional RNN-based NMT and recent self-attention-based NMT (hereafter, Transformer). Specifically, the mixed encoder yields the most significant improvement of 2.0 BLEU on RNN-based NMT, while the gated encoder improves Transformer by 1.2 BLEU. This indicates the usefulness of a WSD approach in modeling word context for better word translation, and the effectiveness of our proposed cross-lingual encoders in explicitly modeling word translations to avoid wrong translations in NMT. Finally, we discuss in depth how word translations benefit different NMT frameworks from several perspectives.
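The gated encoder's selective control can be illustrated with a minimal NumPy sketch. The dimensionality, the single gate matrix, and the random initialization below are assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_encode(e_word, e_trans, W_g, b_g):
    """Gated mixing of a source-word embedding with the embedding of its
    dictionary translation: a gate computed from both embeddings controls
    how much translation information moves forward."""
    g = sigmoid(np.concatenate([e_word, e_trans]) @ W_g + b_g)  # gate in (0, 1)
    return g * e_word + (1.0 - g) * e_trans  # per-dimension convex mix

# Toy example with hypothetical dimensions and random parameters.
rng = np.random.default_rng(0)
d = 8
W_g = rng.normal(size=(2 * d, d))
b_g = np.zeros(d)
e_word, e_trans = rng.normal(size=d), rng.normal(size=d)
h = gated_encode(e_word, e_trans, W_g, b_g)
```

Because the gate is sigmoid-valued, each output dimension is a convex combination of the word and translation embeddings, so the encoder can pass anything from almost pure word information to almost pure translation information.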

Children Story Classification in Indian Languages using Linguistic and Keyword based Features

The primary objective of this work is to classify Hindi and Telugu stories into three genres: fable, folk-tale, and legend. We propose a framework for story classification (SC) using keyword and part-of-speech (POS) features. To improve the performance of the SC system, feature reduction techniques and combinations of various POS tags are explored. Further, we investigate the performance of SC when a story is divided into parts according to its semantic structure. Stories are (i) manually divided into introduction, main, and climax parts based on their semantics; and (ii) automatically divided into initial, middle, and end parts of equal numbers of sentences. We also examine a sentence increment model that determines the optimum number of sentences required to identify a story's genre by incrementally selecting sentences. Experiments are conducted on Hindi and Telugu story corpora consisting of 300 and 150 short stories, respectively. The performance of the SC system is evaluated using different combinations of keyword- and POS-based features with three promising machine learning classifiers: (i) Naive Bayes (NB), (ii) k-Nearest Neighbour (KNN), and (iii) Support Vector Machine (SVM). Classifiers are evaluated using 10-fold cross-validation, and effectiveness is measured using precision, recall, and F-measure. The classification results show that adding linguistic information significantly boosts performance. With respect to story structure, the main and initial parts of the story show comparatively better performance. The sentence increment model indicates that the first nine sentences in Hindi stories and the first seven in Telugu stories are sufficient for good classification. In most experiments, the SVM models outperform the others in classification accuracy.

NeuMorph: Neural Morphological Tagging for Low-Resource Languages - An Experimental Study for Indic Languages

This article deals with morphological tagging for five Indic languages and two severely resource-poor languages, Coptic and Kurmanji. The task entails predicting the morphological tags (case, degree, gender, etc.) of an in-context word. We hypothesize that the tag of a word depends only on its local context rather than the entire sentence. In this light, the usefulness of the convolution operation for predicting tags is studied, resulting in a convolutional neural network (CNN) based morphological tagger. Our proposed model (BLSTM-CNN) achieves insightful results in comparison with the present state of the art. Following the recent trend, the task is carried out under three different settings: single language, across languages, and across keys. Whereas previous models used only character-level features, we show that adding word vectors alongside character-level embeddings significantly improves the performance of all models.

Sentiment Analysis for a Resource Poor Language - Roman Urdu

Neural Conversation Generation with Auxiliary Emotional Supervised Models

An important aspect of developing dialogue agents involves endowing a conversation system with emotion perception and interaction. Most existing emotion dialogue models lack adaptability and extensibility across different scenes because they require a user-specified emotion category or rely on a fixed emotional dictionary. To overcome these limitations, we propose a neural Chinese conversation generation model with an auxiliary emotional supervised model (nCCG-ESM), comprising a sequence-to-sequence (Seq2Seq) generation model and an emotional classifier used as an auxiliary model. The emotional classifier is trained to predict the emotion distributions of the dialogues, which are then used as emotion-supervised signals to guide the generation model toward diverse emotional responses. The proposed nCCG-ESM is flexible enough to generate responses with emotional diversity, including user-specified or unspecified emotions, and can be adapted and extended to different scenarios. Experiments on large-scale Weibo post-response pairs show that the proposed model produces more diverse, appropriate, and emotionally rich responses, yielding substantial gains in diversity scores and human evaluations.

Adversarial Training for Unknown Word Problems in Neural Machine Translation

Most work in neural machine translation is limited to a quite restricted vocabulary, crudely treating all other words as a single UNK symbol. For the translation of agglutinative languages such as Mongolian, unknown (UNK) words also arise from the translation model's misunderstanding of morphological changes. In this study, we introduce a new adversarial training model in a generative adversarial net to alleviate the UNK problem in Mongolian-Chinese machine translation. We add a variety of Mongolian morphological noise samples to the training set in the form of pseudo-data, to increase generalization over UNKs. The training process can be described as three adversarial sub-models (generator, filter, and discriminator) playing a win-win game. In this game, the added filter pushes the discriminator to pay attention to the negative generations that contain noise, improving training efficiency. Finally, the discriminator cannot easily distinguish the negative samples generated by the generator with the filter from human translations. The experimental results show state-of-the-art performance on the newly emerged Mongolian-Chinese task, while the training time is greatly shortened.

Enhanced Double-Carrier Word Embedding Via Phonetics and Writing

Word embeddings, which map words into a unified vector space, capture rich semantic information. From a linguistic point of view, words have two carriers, speech and writing, yet most recent word embedding models focus only on the writing carrier and ignore the role of the speech carrier in semantic expression. However, in the development of language, speech appears before writing and plays an important role in the development of writing; for phonetic language systems, the written forms are secondary symbols of the spoken ones. Based on this idea, we propose double-carrier word embedding (DCWE), which simulates the generation order of speech and writing: written embeddings are trained on top of phonetic embeddings, and the final word embedding fuses the two. To illustrate that our model can be applied to most languages, we selected Chinese, English, and Spanish as examples and evaluated these models through word similarity and text classification experiments.

Matching Graph, a Method for Extracting Parallel Information from Comparable Corpora

Comparable corpora are valuable alternatives to expensive parallel corpora. They contain informative parallel fragments, which are useful resources for various natural language processing tasks. In this work, a generative model is proposed for efficient extraction of parallel fragments from a pair of comparable documents. The core of the proposed model is a graph called the Matching Graph. Because the Matching Graph can be trained from a small initial seed, it is well suited to language pairs suffering from the scarce-resource problem. Experiments show that the Matching Graph performs significantly better than other recently published models. According to experiments on the English-Persian and Arabic-Persian language pairs, the extracted parallel fragments can be used instead of parallel data for training statistical machine translation systems. Results reveal that, in the best case, the extracted fragments recover about 90% of the information of a statistical machine translation system trained on a parallel corpus. Moreover, using the extracted fragments as additional training data improves statistical machine translation by about 2% BLEU for English-Persian and about 1% for Arabic-Persian.

SentiFars: A Persian Polarity Lexicon for Sentiment Analysis

There is no doubt about the usefulness of public opinion on different issues expressed in social media and on the World Wide Web. However, extracting people's feelings about an issue from text is not straightforward. Polarity lexicons, which assign polarity tags or scores to words and phrases, play an important role in sentiment analysis systems. As English is the richest language in this area, leveraging existing English resources to build new ones has attracted many researchers in recent years. In this paper, we propose a new translation-based approach for building polarity resources in resource-lean languages such as Persian. Empirical evaluation confirms the effectiveness of the proposed approach. The generated resource is the largest publicly available polarity lexicon for Persian.

Order-Sensitive Keywords based Response Generation in Open-domain Conversational Systems

External keywords are crucial for response generation models to address the generic-response problem in open-domain conversational systems. The occurrence of keywords in a response depends heavily on the order of the keywords, as they are generated sequentially. Meanwhile, the order of keywords also affects the semantics of a response. Previous keyword-based methods mainly focus on the composition of keywords, while the order of keywords has not been sufficiently discussed. In this work, we propose an order-sensitive keyword-based model to explore the influence of keyword order. It automatically infers the most suitable order for generating a natural and relevant response to a given message, and then generates the response using the ordered keywords as building blocks. We conducted experiments on a public Twitter dataset, and the results show that our approach outperforms state-of-the-art baselines in both automatic and human evaluations.

Extracting Polarity Shifting Patterns from Any Corpus Based on Natural Annotation

In recent years, online sentiment texts have been generated by users in various domains and in different languages. Binary polarity classification (positive or negative) of business sentiment texts can help both companies and customers evaluate products or services. Sometimes the polarity of a sentiment text is modified, making polarity classification difficult. In sentiment analysis, such modification of polarity is termed "polarity shifting", which shifts the polarity of a sentiment clue (emotion, evaluation, etc.). It is well known that detecting polarity shifting can help improve sentiment analysis of texts. However, detecting polarity shifting in corpora is challenging: 1) polarity shifting is normally sparse in texts, making human annotation difficult; 2) corpora with dense polarity shifting are few, so polarity shifting patterns may be needed from various corpora. In this paper, an approach is presented to extract polarity shifting patterns from any text corpus. For the first time, we propose to select texts rich in polarity shifting by the idea of "natural annotation", which replaces human annotation. With a sequence mining algorithm, the selected texts are used to generate polarity shifting pattern candidates, which we then rank by C-value before human annotation. The approach is tested on different corpora and different languages. The results show that our approach captures various types of polarity shifting patterns, and some patterns are unique to specific corpora. Therefore, for better performance, it is reasonable to construct polarity shifting patterns directly from the given corpus.
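The C-value ranking step can be sketched as follows. This is a generic C-value implementation over candidate token sequences; the log2(|a|+1) variant, which keeps single-token candidates scored, is an assumption, as the abstract does not give the exact formula:

```python
import math
from collections import defaultdict

def c_values(freq):
    """Rank candidate patterns by C-value.

    `freq` maps each candidate pattern (a tuple of tokens) to its corpus
    frequency. For a candidate a: if a is nested in no longer candidate,
    C-value = log2(|a|+1) * f(a); otherwise the mean frequency of the
    longer candidates containing a is subtracted from f(a) first."""
    containers = defaultdict(list)  # pattern -> freqs of longer candidates containing it
    cands = list(freq)
    for a in cands:
        for b in cands:
            if len(b) > len(a) and any(b[i:i + len(a)] == a
                                       for i in range(len(b) - len(a) + 1)):
                containers[a].append(freq[b])
    scores = {}
    for a in cands:
        nested = containers[a]
        adj = freq[a] - (sum(nested) / len(nested) if nested else 0.0)
        scores[a] = math.log2(len(a) + 1) * adj
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The nesting penalty keeps a short pattern from being over-ranked merely because it appears inside longer, more specific candidates.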

