ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

Latest Articles

Order-Sensitive Keywords Based Response Generation in Open-Domain Conversational Systems

External keywords are crucial for response generation models to address the generic response problems in open-domain conversational systems. The...


Call for Nominations
ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)


The term of the current Editor-in-Chief (EiC) of the ACM Trans. on Asian and Low-Resource Language Information Processing (TALLIP) is coming to an end, and the ACM Publications Board has set up a nominating committee to assist the Board in selecting the next EiC.  TALLIP was established in 2002 and has been experiencing steady growth, with 178 submissions received in 2017.

Nominations, including self nominations, are invited for a three-year term as TALLIP EiC, beginning on June 1, 2019.  The EiC appointment may be renewed at most one time. This is an entirely voluntary position, but ACM will provide appropriate administrative support.

Appointed by the ACM Publications Board, Editors-in-Chief (EiCs) of ACM journals are delegated full responsibility for the editorial management of the journal consistent with the journal's charter and general ACM policies. The Board relies on EiCs to ensure that the content of the journal is of high quality and that the editorial review process is both timely and fair. The EiC has the final say on acceptance of papers, size of the Editorial Board, and appointment of Associate Editors. A complete list of responsibilities is found in the ACM Volunteer Editors Position Descriptions. Additional information can be found in the following documents:

Nominations should include a vita along with a brief statement of why the nominee should be considered. Self-nominations are encouraged, and should include a statement of the candidate's vision for the future development of TALLIP. The deadline for submitting nominations is April 15, 2019, although nominations will continue to be accepted until the position is filled.

Please send all nominations to the nominating committee chair, Monojit Choudhury ([email protected]).

The search committee members are:

  • Monojit Choudhury (Microsoft Research, India), Chair
  • Kareem M. Darwish (Qatar Computing Research Institute, HBKU)
  • Tei-wei Kuo (National Taiwan University & Academia Sinica), EiC of ACM Transactions on Cyber-Physical Systems; Vice Chair, ACM SIGAPP
  • Helen Meng (Chinese University of Hong Kong)
  • Taro Watanabe (Google Inc., Tokyo)
  • Holly Rushmeier (Yale University), ACM Publications Board Liaison

Word Reordering for Translation into Korean Sign Language Using Syntactically-guided Classification

One of the goals of machine translation is to break the language barrier that prevents people from communicating with others and accessing information. Deaf people in particular face big language barriers in their daily lives, and there are very few digital resources for sign language processing. In this paper, we present a machine translation system for translating Korean to Korean Sign Language (KSL) glosses. The system uses dictionary-based lexical transfer and syntactically guided data-driven structural transfer. A basic description of the linguistic features of KSL with other sign languages is also presented. This work focuses especially on structural transfer as word reordering. The core of our work is a neural classification model for reordering order-important constituent pairs, with a reordering task that is newly designed for Korean-to-KSL translation. Experimental results on news transcript data show that the proposed system achieves a BLEU score of 0.512 and a RIBES score of 0.425, significantly improving on the performance of the baseline system.

Wasf-Vec: Topology-Based Word Embedding for Modern Standard Arabic and Iraqi Dialect Ontology

Word clustering is a crucial issue in low-resource languages. Since words that share semantics are expected to be clustered together, it is common to use feature vector representations generated by word embedding methods based on distributional theory. The goal of this work is to utilize Modern Standard Arabic (MSA) to improve clustering of the vocabulary of the low-resource Iraqi dialect. We start with a fast dialect stemming algorithm that utilizes the MSA data and achieves an accuracy of 0.85 as measured by the F1 score, followed by training a distributional word embedding model on the stemmed data. We then analyze how dialect words are clustered among MSA words, using word semantic relations that are well supported by linguistic theory, and we shed light on which word relations are represented strongly and which weakly. The analysis is carried out by visualizing the first two PCA components in 2D space, examining the words' nearest neighbors, and analyzing distance histograms of specific word templates. We propose a new, simple yet effective spatial feature vector for word representation, named Wasf-Vec, that utilizes the orthographical, phonological, and morphological structures of words. The Wasf technique captures relations that are not context-based, unlike those captured by distributional word embedding methods. The word classification is validated by employing the classes in a class-based language model (CBLM). The Wasf-Vec CBLM achieved 7% lower perplexity (pp) than the CBLM built on distributional word embeddings. This result is significant when working with low-resource languages.

Filtered Pseudo-Parallel Corpus Improves Low-Resource Neural Machine Translation

Large-scale parallel corpora are essential for training high-quality machine translation systems, but such corpora are not freely available for many language pairs. In previous studies, training data has been augmented with pseudo-parallel corpora obtained by using machine translation models to translate monolingual corpora into the source language. However, for low-resource language pairs in which only low-accuracy machine translation systems are available, translation quality degrades when a pseudo-parallel corpus is used naively. To improve machine translation performance for low-resource language pairs, we propose a method that expands the training data effectively by filtering the pseudo-parallel corpus using quality estimation based on sentence-level round-trip translation. We experimented on three language pairs using small, medium, and large parallel corpora and observed that BLEU scores improved significantly for low-resource language pairs.
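The filtering idea described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `to_source` and `to_target` translators are hypothetical callables supplied by the caller, the similarity measure is a simple token-overlap ratio standing in for whatever sentence-level metric the paper uses, and the threshold value is an arbitrary placeholder.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List, Tuple

# Hypothetical translator type: any function mapping one sentence to another.
Translator = Callable[[str], str]


def round_trip_score(original: str, round_trip: str) -> float:
    """Token-level similarity in [0, 1]; a stand-in for a sentence-level metric."""
    return SequenceMatcher(None, original.split(), round_trip.split()).ratio()


def filter_pseudo_parallel(
    target_sentences: Iterable[str],
    to_source: Translator,
    to_target: Translator,
    threshold: float = 0.5,  # placeholder value, not taken from the paper
) -> List[Tuple[str, str]]:
    """Keep (pseudo_source, target) pairs whose round trip stays close to the target."""
    kept = []
    for target in target_sentences:
        pseudo_source = to_source(target)         # back-translate the monolingual sentence
        reconstructed = to_target(pseudo_source)  # translate it forward again
        if round_trip_score(target, reconstructed) >= threshold:
            kept.append((pseudo_source, target))
    return kept
```

Pairs that survive the filter would then be added to the genuine parallel data; pairs whose round trip drifts far from the original sentence are discarded as likely mistranslations.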

An Automatic and A Machine-Assisted Method to Clean Bilingual Corpus

Two different methods of corpus cleaning are presented in this paper. One is a machine-assisted technique suited to cleaning small parallel corpora, and the other is an automatic method suited to cleaning large parallel corpora. A baseline SMT (MOSES) system is used to evaluate these methods. The machine-assisted technique uses two features: word alignment and the lengths of the source and target language sentences. These features are used to detect mistranslations in the corpus, which are then handled by a human translator. Experiments are conducted on the EILMT (English to Indian Language Machine Translation) corpus (English-Hindi). The BLEU score is improved by 0.27% on the cleaned corpus. The automatic method also combines two features: the lengths of the source and target language sentences, and the Viterbi alignment score generated by a Hidden Markov Model (HMM) for each sentence pair. A separate threshold value is used for each of these two features. These values are determined using a small manually annotated parallel corpus of 206 sentence pairs. Experiments are conducted on the ACL (Association for Computational Linguistics) 2014 corpus (English-Hindi). The BLEU score is improved by 0.6% on the cleaned corpus.
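A two-threshold filter of the kind described for the automatic method might look like the sketch below. It assumes each sentence pair already carries an alignment score computed elsewhere (for example by an HMM aligner) where higher means better; the `ScoredPair` type, the use of a length ratio, and both threshold values are illustrative assumptions rather than details from the paper.

```python
from typing import Iterable, List, NamedTuple


class ScoredPair(NamedTuple):
    source: str
    target: str
    alignment_score: float  # assumed precomputed; higher means a better alignment


def length_ratio(source: str, target: str) -> float:
    """Ratio of the longer sentence to the shorter one, in whitespace tokens."""
    s, t = len(source.split()), len(target.split())
    if s == 0 or t == 0:
        return float("inf")  # an empty side can never pass the length test
    return max(s, t) / min(s, t)


def clean_corpus(
    pairs: Iterable[ScoredPair],
    max_length_ratio: float = 2.0,       # placeholder threshold
    min_alignment_score: float = -5.0,   # placeholder threshold
) -> List[ScoredPair]:
    """Keep only pairs that pass both the length test and the alignment test."""
    return [
        p
        for p in pairs
        if length_ratio(p.source, p.target) <= max_length_ratio
        and p.alignment_score >= min_alignment_score
    ]
```

In practice the two thresholds would be tuned on a small manually annotated sample, as the abstract describes, rather than fixed in advance.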

Isarn Dharma Word Segmentation Using a Statistical Approach with Named Entity Recognition

In this study, we developed an Isarn Dharma word segmentation system. We mainly focused on solving the word ambiguity and unknown word problems in unsegmented Isarn Dharma text. Ambiguous Isarn Dharma words occur frequently in word construction because the writing style omits tone markers. Thus, the same written form can be interpreted as having different tones and meanings. To overcome these problems, we developed an Isarn Dharma character cluster (IDCC)-character-based statistical model and affixation with named entity recognition method (IDCC-C-based statistical model and affixation with NER). This method integrates the IDCC-based and character-based statistical models to identify word boundaries. The IDCC-based statistical model utilizes the IDCC feature to disambiguate ambiguous words. Unknown words are handled using the character-based statistical model based on character features. In addition, linguistic knowledge is employed to detect the boundaries of a new word based on construction morphology and NER. In evaluations, we compared the proposed method with various word segmentation methods. The experimental results showed that the proposed method performed slightly better than the other methods as the corpus size increased. On the test set, the proposed method obtained the best F-measure of 92.19.

Punjabi to ISO 15919 and Roman Transliteration with Phonetic Rectification

Transliteration removes script barriers. Unfortunately, Punjabi is written in four different scripts, i.e., Gurmukhi, Shahmukhi, Devanagari and Latin. The Latin script is understandable to nearly all sections of the Punjabi community. The objective of our work is to transliterate the Punjabi Gurmukhi script into Latin script. There has been considerable progress in Punjabi to Latin transliteration, but the accuracy of present-day systems is less than fifty percent (Google Translator has approximately 45 percent accuracy). No rich parallel corpus is available for Punjabi, so we cannot use the corpus-based machine learning techniques that are in vogue these days. The existing transliteration systems follow a grapheme-based approach. Grapheme-based transliteration is unable to handle many scenarios such as tones, inherent schwa, glottal stops, nasalization and gemination. In this paper, grapheme-based transliteration has been augmented with phonetic rectification, where the Punjabi script is rectified phonetically before character-to-character mapping is applied. Handling the inherent short vowel schwa was the major challenge in phonetic rectification. Instead of following a fixed syllabic pattern, we devised a generic finite state transducer to insert schwa. The accuracy of our transliteration system is approximately 96.82 percent.

Layer-wise De-training and Re-training for Convolutional Sequence to Sequence Machine Translation

The convolutional sequence-to-sequence (ConvS2S) machine translation system is one of the typical Neural Machine Translation (NMT) systems. In our preliminary studies, training a ConvS2S model tended to get stuck in a local optimum. To overcome this behavior, we propose to de-train a trained ConvS2S model in a mild way and then retrain it to find a better solution globally. In particular, the trained parameters of one layer of the NMT network are abandoned by reinitialization while the other layers' parameters are kept, which kicks off re-optimization from a new starting point while ensuring that this starting point is not too far from the previous optimum. This procedure is executed layer by layer until all layers of the ConvS2S model are explored. Experiments show that, compared with various measures for escaping from local optima, including initialization with random seeds, adding perturbations to the baseline parameters, and retraining with the baseline models, our method consistently improves ConvS2S translation quality across various language pairs and achieves performance comparable to the Transformer.
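The layer-by-layer procedure can be outlined as a generic loop. The sketch below is framework-agnostic and makes several assumptions not stated in the abstract: parameters are held in a plain dictionary keyed by layer name, `reinit`, `train`, and `evaluate` are hypothetical callables supplied by the caller, and a retrained model is kept only when it scores higher than the best model so far.

```python
import copy
from typing import Callable, Dict, List, Sequence

Params = Dict[str, List[float]]  # layer name -> that layer's parameters


def detrain_retrain(
    params: Params,
    layer_order: Sequence[str],
    reinit: Callable[[str], List[float]],   # fresh parameters for one layer
    train: Callable[[Params], Params],      # re-optimise all parameters
    evaluate: Callable[[Params], float],    # higher is better (e.g. dev-set BLEU)
) -> Params:
    """De-train one layer at a time, retrain, and keep the best model found."""
    best = copy.deepcopy(params)
    best_score = evaluate(best)
    for layer in layer_order:
        candidate = copy.deepcopy(best)
        candidate[layer] = reinit(layer)  # abandon this layer's trained parameters
        candidate = train(candidate)      # restart optimisation near the old optimum
        score = evaluate(candidate)
        if score > best_score:            # assumption: only keep improvements
            best, best_score = candidate, score
    return best
```

Because all other layers keep their trained values, each restart begins close to the previous solution, which is the "mild" aspect of the de-training step.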

Children Story Classification in Indian Languages using Linguistic and Keyword based Features

The primary objective of this work is to classify Hindi and Telugu stories into three genres: fable, folk-tale and legend. We propose a framework for story classification (SC) using keyword and part-of-speech (POS) features. To improve the performance of the SC system, feature reduction techniques and combinations of various POS tags are explored. Further, we investigate the performance of SC when a story is divided into parts. Stories are (i) manually divided into parts based on their semantics as introduction, main and climax; and (ii) automatically divided into equal parts based on the number of sentences in a story as initial, middle and end. We also examine a sentence increment model that aims to determine the optimum number of sentences required to identify the story genre by incrementally selecting sentences from a story. Experiments are conducted on Hindi and Telugu story corpora consisting of 300 and 150 short stories, respectively. The performance of the SC system is evaluated using different combinations of keyword and POS-based features with three machine learning classifiers: (i) Naive Bayes (NB), (ii) k-Nearest Neighbour (KNN) and (iii) Support Vector Machine (SVM). Classifier performance is evaluated using 10-fold cross-validation, and effectiveness is measured using precision, recall and F-measure. The classification results show that adding linguistic information boosts the performance of story classification significantly. With respect to story structure, the main and initial parts of the story show comparatively better performance. The results from the sentence increment model indicate that the first nine and seven sentences in Hindi and Telugu stories, respectively, are sufficient for better classification of stories. In most of the experiments, SVM models outperformed the other models in classification accuracy.

Enhanced Double-Carrier Word Embedding Via Phonetics and Writing

Word embeddings, which map words into a unified vector space, capture rich semantic information. From a linguistic point of view, words have two carriers, speech and writing, yet the most recent word embedding models focus on only the writing carrier and ignore the role of the speech carrier in semantic expressions. However, in the development of language, speech appears before writing and plays an important role in the development of writing. For phonetic language systems, the written forms are secondary symbols of spoken ones. Based on this idea, we carried out our work and proposed double-carrier word embedding (DCWE). We used DCWE to conduct a simulation of the generation order of speech and writing. We trained written embedding based on phonetic embedding. The final word embedding fuses writing and phonetic embedding. To illustrate that our model can be applied to most languages, we selected Chinese, English and Spanish as examples and evaluated these models through word similarity and text classification experiments.

Subword Attentive model for Arabic Sentiment Analysis: A deep learning approach

Social media data is unstructured and is growing exponentially day by day across many different disciplines. Analyzing and understanding the semantics of this data is a big challenge due to its variety and huge volume. To address this gap, unstructured Arabic texts are studied in this work, owing to their abundance on social media websites. This work addresses the difficulty of handling unstructured social media texts, particularly when the data at hand is very limited. An intelligent data augmentation technique is used to handle the problem of limited data availability. This paper proposes a novel architecture for classifying and understanding Arabic words based on a convolutional neural network (CNN) and a recurrent neural network (RNN). Moreover, the CNN is the most powerful technique for the analysis of Arabic tweets and social network analysis. The main technique used in this work is a character-level CNN and an RNN stacked on top of one another as the classification architecture. Together, these two techniques give 95% accuracy on the Arabic text data set.

Extracting Polarity Shifting Patterns from Any Corpus Based on Natural Annotation

In recent years, online sentiment texts have been generated by users in various domains and in different languages. Binary polarity classification (positive or negative) of business sentiment texts can help both companies and customers to evaluate products or services. Sometimes, the polarity of sentiment texts can be modified, making polarity classification difficult. In sentiment analysis, such modification of polarity is termed polarity shifting, which shifts the polarity of a sentiment clue (emotion, evaluation etc.). It is well known that detecting polarity shifting can help improve sentiment analysis of texts. However, detecting polarity shifting in corpora is challenging: 1) polarity shifting is normally sparse in texts, making human annotation difficult; 2) corpora with dense polarity shifting are few, so we may need polarity shifting patterns from various corpora. In this paper, an approach is presented to extract polarity shifting patterns from any text corpus. For the first time, we propose to select texts rich in polarity shifting using the idea of natural annotation, which replaces human annotation. With a sequence mining algorithm, the selected texts are used to generate polarity shifting pattern candidates, which we then rank by C-value before human annotation. The approach is tested on different corpora and different languages. The results show that our approach can capture various types of polarity shifting patterns, and that some patterns are unique to specific corpora. Therefore, for better performance, it is reasonable to construct polarity shifting patterns directly from the given corpus.
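The C-value ranking step can be illustrated with a small sketch. It uses the standard C-value formulation from term extraction, in which a candidate's frequency is discounted by the average frequency of the longer candidates that contain it and then weighted by the base-2 logarithm of its length. Treating patterns as contiguous token tuples and the example frequencies are assumptions made for illustration; the paper's mined patterns may take a different form.

```python
import math
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]  # assumed: a non-empty, contiguous token sequence


def contains(longer: Pattern, shorter: Pattern) -> bool:
    """True if `shorter` occurs as a contiguous sub-sequence of a strictly longer pattern."""
    n, m = len(longer), len(shorter)
    return m < n and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_values(freq: Dict[Pattern, int]) -> Dict[Pattern, float]:
    """C-value of each candidate: log2(length) * (frequency - mean frequency of containers)."""
    scores = {}
    for a, f_a in freq.items():
        containers = [f_b for b, f_b in freq.items() if contains(b, a)]
        penalty = sum(containers) / len(containers) if containers else 0.0
        scores[a] = math.log2(len(a)) * (f_a - penalty)
    return scores


def rank_candidates(freq: Dict[Pattern, int]) -> List[Pattern]:
    """Candidates sorted from highest to lowest C-value."""
    scores = c_values(freq)
    return sorted(scores, key=lambda p: scores[p], reverse=True)
```

One consequence of the logarithmic length weight is that single-token candidates always score zero, so this ranking favors multi-token patterns.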

A Deep Neural Network Framework for English Hindi Question Answering
