ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

Latest Articles

Order-Sensitive Keywords Based Response Generation in Open-Domain Conversational Systems

External keywords are crucial for response generation models to address the generic response problems in open-domain conversational systems. The...

Neural Conversation Generation with Auxiliary Emotional Supervised Models

An important aspect of developing dialogue agents involves endowing a conversation system with emotion perception and interaction. Most existing...

SentiFars: A Persian Polarity Lexicon for Sentiment Analysis

There is no doubt about the usefulness of public opinion toward different issues in social media and the World Wide Web. Extracting the feelings of people about an issue from text is not straightforward. Polarity lexicons that assign polarity tags or scores to words and phrases play an important role in sentiment analysis systems. As English is the...

Filtered Pseudo-parallel Corpus Improves Low-resource Neural Machine Translation

Large-scale parallel corpora are essential for training high-quality machine translation systems; however, such corpora are not freely available for...

Layer-Wise De-Training and Re-Training for ConvS2S Machine Translation

The convolutional sequence-to-sequence (ConvS2S) machine translation system is one of the typical neural machine translation (NMT) systems. Training...


ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) welcomes Imed Zitouni as its new Editor-in-Chief, for the term September 1, 2019 to August 31, 2022. Imed is Principal Research Manager at Microsoft.

Learning and Modeling Unit Embeddings Using Deep Neural Networks for Unit Selection Based Mandarin Speech Synthesis

Towards Integrate Classification Lexicon for Handling Unknown Words in Chinese-Vietnamese Neural Machine Translation

In neural machine translation (NMT), words outside the limited vocabulary (out-of-vocabulary, OOV, words) cannot be translated properly, which greatly affects the accuracy of the translation system. The problem is more serious for NMT of resource-scarce languages with little training data. Although bilingual parallel corpora for resource-scarce languages are difficult to obtain, external knowledge such as bilingual lexicons and other resources is easy to obtain and use. Therefore, in this paper, we propose a method for processing unknown words in Chinese-Vietnamese NMT that integrates a classification lexicon. Lexicons for different types of unknown words are constructed by word alignment, Wikipedia extraction, and rule-based methods, and are integrated into Chinese-Vietnamese NMT. To improve the performance of Chinese-Vietnamese NMT, the different types of unknown words are classified and processed with the classification lexicon, and then restored after translation. Experiments on Chinese-Vietnamese, English-Vietnamese, and Mongolian-Chinese translation show that this method significantly improves the accuracy and performance of NMT for resource-scarce languages.
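The classify, process, and restore steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `<CATEGORY_n>` tag format, the toy lexicon, and the English-Vietnamese example sentence are all assumptions made for illustration, and the translation step itself is stubbed out.

```python
def mask_unknown(tokens, vocab, lexicon):
    """Replace OOV tokens found in the lexicon with category tags.

    lexicon maps a source word to (category, target_translation).
    Returns the masked tokens and a tag -> translation table.
    All names and the tag format are hypothetical.
    """
    masked, slots = [], {}
    for tok in tokens:
        if tok not in vocab and tok in lexicon:
            category, translation = lexicon[tok]
            tag = f"<{category}_{len(slots)}>"
            slots[tag] = translation
            masked.append(tag)
        else:
            masked.append(tok)
    return masked, slots


def restore(translated_tokens, slots):
    """Swap each category tag in the NMT output for its lexicon translation."""
    return [slots.get(tok, tok) for tok in translated_tokens]


# Toy example; the NMT system is assumed to copy tags through unchanged.
vocab = {"I", "visited"}
lexicon = {"Hanoi": ("LOC", "Hà Nội")}
masked, slots = mask_unknown(["I", "visited", "Hanoi"], vocab, lexicon)
output = restore(["tôi", "thăm", "<LOC_0>"], slots)
```

The sketch assumes the translation model passes tags through to the output unchanged, which a real system would have to be trained or constrained to do.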

Efficient Low-resource Neural Machine Translation with Reread and Feedback Mechanism

How to make sufficient use of information is a key problem in neural machine translation (NMT). Rich-resource NMT addresses it effectively by leveraging large-scale bilingual sentence pairs. For low-resource NMT, however, the lack of bilingual sentence pairs results in poor translation performance, so taking full advantage of global information in the encoding-decoding process is an effective strategy. In this paper, we propose a novel reread-feedback NMT architecture (RFNMT) that exploits global information. Our architecture builds upon an improved sequence-to-sequence neural network and consists of a double-deck attention-based encoder-decoder framework. In the proposed architecture, the information generated by the first-pass encoding and decoding processes flows to the second-pass encoding process, for more sufficient parameter initialization and information use. Specifically, we first propose a 'reread' mechanism that transfers the annotation of the first-pass encoder to the second-pass encoder, where it is used to initialize the second-pass encoder. Second, we propose a 'feedback' mechanism that transfers the first-pass decoder's outputs to the second-pass encoder via an important weight model and improved gated recurrent units (GRU). Experimental results on multiple corpora demonstrate the effectiveness of the proposed RFNMT architecture, especially in low-resource settings.

Uniformly Interpolated Balancing for Robust Prediction in Translation Quality Estimation: A Case Study of English-Korean Translation

There has been growing interest among researchers in quality estimation (QE), which attempts to automatically predict the quality of MT outputs. Most existing work on QE is based on supervised approaches using quality-annotated training data. However, the quality scores in QE training data readily become imbalanced or skewed: QE data are mostly composed of sentence pairs with high translation quality and lack sentence pairs with low translation quality. A quality estimator induced from imbalanced data tends to produce biased translation quality scores, with "high" scores assigned even to poorly translated sentences. To address the data imbalance, this paper proposes a simple, efficient procedure called uniformly interpolated balancing, which constructs more balanced QE training data by introducing greater uniformity into the training data. The proposed procedure is based on the preparation of two different types of manually annotated QE data: 1) default skewed data and 2) near-uniform data. First, we obtain default skewed data in a naive manner, without considering the imbalance, by manually annotating the quality of MT outputs. Second, we obtain near-uniform data in a selective manner by manually annotating only a subset selected from automatically quality-estimated sentence pairs. Finally, we create uniformly interpolated balanced data by combining these two types of data, where one half originates from the default skewed data and the other half from the near-uniform data. We expect that uniformly interpolated balancing reflects the intrinsic skewness of the true quality distribution while managing the imbalance problem. Experimental results on an English-Korean quality estimation task show that the proposed uniformly interpolated balancing leads to robustness on both skewed and uniformly distributed quality test sets when compared with other non-balanced datasets.
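The data construction described above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' code: the function names are hypothetical, estimated quality scores are assumed to lie in [0, 1], near-uniform selection is approximated by drawing an equal number of items from equal-width score bins, and the manual annotation step is omitted.

```python
import random
from collections import defaultdict


def select_near_uniform(candidates, n_bins, per_bin, seed=0):
    """Pick up to per_bin items from each estimated-quality bin.

    candidates: list of (item, estimated_score), scores assumed in [0, 1].
    The selected items would then be sent for manual annotation.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for item, score in candidates:
        idx = min(int(score * n_bins), n_bins - 1)  # clamp score == 1.0
        bins[idx].append(item)
    selected = []
    for idx in range(n_bins):
        pool = bins[idx]
        rng.shuffle(pool)
        selected.extend(pool[:per_bin])
    return selected


def interpolate_balanced(skewed, near_uniform, total, seed=0):
    """Combine the two annotated sets, one half of the items from each."""
    rng = random.Random(seed)
    half = total // 2
    return rng.sample(skewed, half) + rng.sample(near_uniform, total - half)


# Toy data: 100 candidates whose estimated scores spread over [0, 1].
candidates = [(("u", i), (i + 0.5) / 100) for i in range(100)]
near_uniform = select_near_uniform(candidates, n_bins=5, per_bin=4)
skewed = [("s", i) for i in range(100)]
balanced = interpolate_balanced(skewed, near_uniform, total=10)
```

Equal-width binning is only one way to approximate a near-uniform selection; the abstract does not specify how the subset is chosen beyond its being drawn from automatically quality-estimated sentence pairs.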

Wasf-Vec: Topology-Based Word Embedding for Modern Standard Arabic and Iraqi Dialect Ontology

Word clustering is a crucial issue in low-resource languages. Since words that share semantics are expected to be clustered together, it is common to use feature vector representations generated by word embedding methods based on distributional theory. The goal of this work is to utilize Modern Standard Arabic (MSA) to improve clustering performance on the vocabulary of the low-resource Iraqi dialect. We start with a fast dialect stemming algorithm that utilizes the MSA data and reaches an F1 score of 0.85, followed by training the distributional word embedding method on the stemmed data. We then analyze how dialect words are clustered among other Modern Standard Arabic words, using word semantic relations that are well supported by solid linguistic theories, and we shed light on which word relations are represented strongly and which weakly. The analysis is carried out by visualizing the first two PCA components in 2D space, examining the words' nearest neighbors, and analyzing distance histograms of specific word templates. This work proposes a new, simple yet effective spatial feature vector for word representation, named Wasf-Vec, that utilizes the orthographical, phonological, and morphological structures of words. The Wasf technique captures relations that are not context-based, unlike the distributional word embedding method. The word classification used in this paper is validated by employing the classes in class-based language modeling (CBLM). The Wasf-Vec CBLM achieved 7% lower perplexity (pp) than the CBLM built on the distributional word embedding method. This result is significant when working with low-resource languages.

Isarn Dharma Word Segmentation Using a Statistical Approach with Named Entity Recognition

In this study, we developed an Isarn Dharma word segmentation system. We mainly focused on solving the word ambiguity and unknown word problems in unsegmented Isarn Dharma text. Ambiguous Isarn Dharma words occur frequently in word construction because the writing style has no tone markers. Thus, words can be interpreted as having different tones and meanings in the same writing. To overcome these problems, we developed an Isarn Dharma character cluster (IDCC)-character-based statistical model and affixation with named entity recognition method (IDCC-C-based statistical model and affixation with NER). This method integrates the IDCC-based and character-based statistical models to distinguish the word boundaries. The IDCC-based statistical model utilizes the IDCC feature to disambiguate any ambiguous words. The unknown words are handled using the character-based statistical model based on the character features. In addition, linguistic knowledge is employed to detect the boundaries of a new word based on the construction morphology and NER. In evaluations, we compared the proposed method with various word segmentation methods. The experimental results showed that the proposed method performed slightly better than the other methods when the corpus size increased. Using the test set, the proposed method obtained the best F-measure of 92.19.

Punjabi to ISO 15919 and Roman Transliteration with Phonetic Rectification

Transliteration removes script barriers. Unfortunately, Punjabi is written in four different scripts: Gurmukhi, Shahmukhi, Devnagri, and Latin. The Latin script is understandable to nearly all factions of the Punjabi community. The objective of our work is to transliterate the Punjabi Gurmukhi script into Latin script. There has been considerable progress in Punjabi to Latin transliteration, but the accuracy of present-day systems is less than fifty percent (Google Translator has approximately 45 percent accuracy). No rich parallel corpus is available for Punjabi, so we cannot use the corpus-based machine learning techniques that are in vogue these days. The existing transliteration systems follow a grapheme-based approach. Grapheme-based transliteration is unable to handle many scenarios such as tones, inherent schwa, glottal stops, nasalization, and gemination. In this paper, grapheme-based transliteration is augmented with phonetic rectification, in which the Punjabi script is rectified phonetically before character-to-character mapping is applied. Handling the inherent short vowel schwa was the major challenge in phonetic rectification. Instead of following a fixed syllabic pattern, we devised a generic finite state transducer to insert schwa. The accuracy of our transliteration system is approximately 96.82 percent.

Enhanced Double-Carrier Word Embedding Via Phonetics and Writing

Word embeddings, which map words into a unified vector space, capture rich semantic information. From a linguistic point of view, words have two carriers, speech and writing, yet the most recent word embedding models focus on only the writing carrier and ignore the role of the speech carrier in semantic expressions. However, in the development of language, speech appears before writing and plays an important role in the development of writing. For phonetic language systems, the written forms are secondary symbols of spoken ones. Based on this idea, we carried out our work and proposed double-carrier word embedding (DCWE). We used DCWE to conduct a simulation of the generation order of speech and writing. We trained written embedding based on phonetic embedding. The final word embedding fuses writing and phonetic embedding. To illustrate that our model can be applied to most languages, we selected Chinese, English and Spanish as examples and evaluated these models through word similarity and text classification experiments.

StyloThai: A Scalable Framework For Stylometric Authorship Identification of Thai Documents

Authorship identification helps to identify the true author of a given anonymous document from a set of candidate authors. Applications of this task can be found in several domains, such as law enforcement and information retrieval. These application domains are not limited to a specific language or community. However, most of the existing solutions are designed for English, and little attention has been paid to Thai. These existing solutions are not directly applicable to Thai because of the linguistic differences between the two languages. Moreover, the existing solution designed for Thai is unable to (i) handle outliers in the dataset; (ii) scale when the size of the candidate author set increases; and (iii) perform well when the number of writing samples for each candidate author is low. We identify a stylometric feature space for the Thai authorship identification task. Based on our feature space, we present an authorship identification solution that uses a probabilistic k-nearest-neighbors classifier by transforming each document into a collection of point sets. We create a new Thai authorship identification corpus containing 547 documents from 200 authors, which is significantly larger than the corpus used by the existing study (a 32-fold increase in the number of candidate authors). The experimental results show that our solution can overcome the limitations of the existing solution and outperforms all competitors with an accuracy level of 91.02%. Moreover, we found that combining all categories of the stylometric features outperforms the other combinations. Finally, we cross-compare the feature spaces and classification methods of all solutions. We found that (i) our solution can scale as the number of candidate authors increases; (ii) our method outperforms all the competitors; and (iii) our feature space provides better performance than the feature space used by the existing study.

Deep Learning for Arabic Error Detection and Correction

Research on tools for automating the proofreading of Arabic text has received much attention in recent years. There is an increasing demand for applications that can detect and correct Arabic spelling and grammatical errors to improve the quality of Arabic text content and application input. Our review of previous studies indicates that few Arabic spell-checking research efforts correctly address detection and correction for ill-formed words that do not conform to the Arabic morphology system. Even fewer systems address the detection and correction of erroneous well-formed Arabic words that are either contextually or semantically inconsistent within the text. We introduce an approach that investigates employing deep neural network technology for error detection in Arabic text. We have developed a systematic framework for spelling and grammar error detection as well as correction at the word level based on a Bidirectional Long-Short-Term Memory (Bi-LSTM) mechanism and word embedding, in which a Polynomial Network (PN) classifier is at the top of the system. In order to get conclusive results, we have developed the most significant gold standard annotated corpus to date, containing 15 million fully-inflected Arabic words. This data was collected from diverse text sources and genres, in which any erroneous and ill-formed words have been annotated, validated and manually revised by Arabic specialists. The experimental results confirm that our proposed system significantly outperforms Microsoft Word 2013 and Open Office Ayaspell 3.4, which have been used in the literature for evaluating similar research.

Subword Attentive model for Arabic Sentiment Analysis: A deep learning approach

Social media data are unstructured, and their volume is growing exponentially from day to day across many different disciplines. Analyzing and understanding the semantics of these data is a major challenge because of their variety and huge volume. To address this gap, this work studies unstructured Arabic texts, owing to their abundance on social media websites. It addresses the difficulty of handling unstructured social media texts, particularly when the data at hand are very limited, by using an intelligent data augmentation technique that handles the limited availability of data. This paper proposes a novel architecture for Arabic word classification and understanding based on convolutional neural networks (CNN) and recurrent neural networks (RNN). Moreover, the CNN is the most powerful technique for the analysis of Arabic tweets and social network analysis. The main technique used in this work is a character-level CNN and an RNN stacked on top of one another as the classification architecture. Together, these two techniques achieve 95% accuracy on the Arabic text data set.

Korean Part-of-Speech Tagging based on Morpheme Generation

Two major problems of Korean part-of-speech (POS) tagging are that the word-spacing unit is not mapped one-to-one to a POS tag and that morphemes should be recovered during POS tagging. Therefore, this paper proposes a novel two-step Korean POS tagger that solves these problems. The tagger first generates a sequence of lemmatized and recovered morphemes that can be mapped one-to-one to POS tags, using an encoder-decoder architecture derived from a POS-tagged corpus. Then, the POS tag of each morpheme in the generated sequence is determined by a standard sequence labeling method. Since the knowledge for segmenting and recovering morphemes is extracted automatically from a POS-tagged corpus by the encoder-decoder architecture, the POS tagger is constructed without a dictionary or hand-crafted linguistic rules. The experimental results on a standard data set show that the proposed method outperforms existing POS taggers and achieves state-of-the-art performance.

Extracting Polarity Shifting Patterns from Any Corpus Based on Natural Annotation

In recent years, online sentiment texts have been generated by users in various domains and in different languages. Binary polarity classification (positive or negative) of business sentiment texts can help both companies and customers to evaluate products or services. Sometimes the polarity of a sentiment text is modified, which makes polarity classification difficult. In sentiment analysis, such modification is termed polarity shifting: it shifts the polarity of a sentiment clue (emotion, evaluation, etc.). It is well known that detecting polarity shifting can help improve sentiment analysis of texts. However, detecting polarity shifting in corpora is challenging: 1) polarity shifting is normally sparse in texts, making human annotation difficult; 2) corpora with dense polarity shifting are few, so we may need polarity shifting patterns from various corpora. In this paper, an approach is presented to extract polarity shifting patterns from any text corpus. For the first time, we propose to select texts rich in polarity shifting using the idea of natural annotation, which replaces human annotation. With a sequence mining algorithm, the selected texts are used to generate polarity shifting pattern candidates, which we then rank by C-value before human annotation. The approach is tested on different corpora and different languages. The results show that our approach can capture various types of polarity shifting patterns, and that some patterns are unique to specific corpora. Therefore, for better performance, it is reasonable to construct polarity shifting patterns directly from the given corpus.

S3-NET: SRU-based Sentence and Self-matching Networks for Machine Reading Comprehension

Machine reading comprehension question answering (MRC-QA) is the task of understanding the context of a given passage to find a correct answer within it. A passage is composed of several sentences; therefore, the input sequence becomes long, leading to diminished performance. In this paper, we propose S3-NET, which adds sentence-based encoding to solve this problem. S3-NET, which is based on a simple recurrent unit architecture, is a deep learning model that solves MRC-QA by applying a matching network to sentence-level encoding. In addition, S3-NET utilizes self-matching networks to compute attention weights for its own recurrent neural network sequences. We perform MRC-QA on the English SQuAD dataset and the Korean MindsMRC dataset. The experimental results show that for SQuAD, the S3-NET model proposed in this paper produces 71.91% and 74.12% EM and 81.02% and 82.34% F1 in single and ensemble models, respectively, and for MindsMRC, our model achieves 69.43% and 71.28% EM and 81.53% and 82.77% F1 in single and ensemble models, respectively.

Loanword Identification in Low-resource Languages with Minimal Supervision

Bilingual resources play a very important role in many natural language processing (NLP) tasks, especially tasks in cross-lingual scenarios. However, it is expensive and time-consuming to build such resources. Lexical borrowing happens in almost every language. This inspires us to detect loanwords effectively and to use "loanword (in the recipient language)"-"donor word (in the donor language)" pairs to extend the bilingual resources for NLP tasks in low-resource languages. In this paper, we propose a novel method to identify loanwords in Uyghur. The most important advantage of this method is that our model relies only on a large amount of monolingual corpora and a small amount of annotated data. Our loanword identification model includes two parts: loanword candidate generation and loanword prediction. In the first part, we use two large-scale monolingual corpora and a small bilingual dictionary to train a cross-lingual embedding model; since semantically unrelated words often cannot be treated as loanword pairs, a loanword candidate list is generated according to this model and a word list in Uyghur. In the second part, we predict loanwords from these candidates with a log-linear model that integrates several features such as pronunciation similarity, part-of-speech (POS) tags, and hybrid language modeling. To evaluate the effectiveness of our proposed method, we conduct two types of experiments: loanword identification and OOV translation. Experimental results show that: 1) our proposed method achieves at least 8% F1 improvement over other models in all four loanword identification tasks in Uyghur; 2) after the existing translation models are extended with the loanword identification results, OOV rates in several language pairs are reduced significantly (5%), and translation performance also improves (0.25 BLEU).
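The prediction step above relies on a log-linear model over several features. A minimal sketch of how such a model turns weighted feature values into candidate probabilities follows; the feature names, weights, and candidate labels are hypothetical placeholders, not values from the paper.

```python
import math


def log_linear_probs(candidates, weights):
    """Return P(candidate) proportional to exp(sum_k w_k * f_k(candidate)).

    candidates: dict mapping a candidate name to its feature dict.
    weights: dict mapping a feature name to its weight (hypothetical values).
    """
    scores = {
        name: sum(weights[k] * v for k, v in feats.items())
        for name, feats in candidates.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(s - top) for name, s in scores.items()}
    z = sum(exps.values())
    return {name: e / z for name, e in exps.items()}


# Hypothetical features for two donor-word candidates of one Uyghur word.
weights = {"pron_sim": 2.0, "pos_match": 1.0, "lm_score": 0.5}
candidates = {
    "donor_a": {"pron_sim": 0.9, "pos_match": 1.0, "lm_score": 0.4},
    "donor_b": {"pron_sim": 0.3, "pos_match": 0.0, "lm_score": 0.6},
}
probs = log_linear_probs(candidates, weights)
best = max(probs, key=probs.get)
```

In practice the weights would be learned from the small annotated data set the abstract mentions; here they are fixed by hand only to make the example runnable.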

A Burmese (Myanmar) Treebank: Guideline and Analysis

A 20,000-sentence Burmese (Myanmar) treebank of news articles has been released under a CC BY-NC-SA license. Complete phrase structure annotation was developed for each sentence from the morphologically annotated data prepared in the previous work of Ding et al. [1]. As the final result of the Burmese component of the Asian Language Treebank Project, this is the first large-scale, open-access treebank for the Burmese language. The annotation details and features of this treebank are presented.

Persian Semantic Role Labeling
