ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

Latest Articles

Order-Sensitive Keywords Based Response Generation in Open-Domain Conversational Systems

External keywords are crucial for response generation models to address the generic response problem in open-domain conversational systems. The…

Neural Conversation Generation with Auxiliary Emotional Supervised Models

An important aspect of developing dialogue agents involves endowing a conversation system with emotion perception and interaction. Most existing…

SentiFars: A Persian Polarity Lexicon for Sentiment Analysis

There is no doubt about the usefulness of public opinion toward different issues in social media and the World Wide Web. Extracting the feelings of people about an issue from text is not straightforward. Polarity lexicons that assign polarity tags or scores to words and phrases play an important role in sentiment analysis systems. As English is the…

Filtered Pseudo-parallel Corpus Improves Low-resource Neural Machine Translation

Large-scale parallel corpora are essential for training high-quality machine translation systems; however, such corpora are not freely available for…

Layer-Wise De-Training and Re-Training for ConvS2S Machine Translation

The convolutional sequence-to-sequence (ConvS2S) machine translation system is one of the typical neural machine translation (NMT) systems. Training…


ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) welcomes Imed Zitouni as its new Editor-in-Chief for the term September 1, 2019 to August 31, 2022. Imed is Principal Research Manager at Microsoft.

Learning and Modeling Unit Embeddings Using Deep Neural Networks for Unit Selection Based Mandarin Speech Synthesis

Efficient Low-resource Neural Machine Translation with Reread and Feedback Mechanism

How to utilize information sufficiently is a key problem in neural machine translation (NMT). In rich-resource NMT, this is effectively addressed by leveraging large-scale bilingual sentence pairs; in low-resource NMT, however, the lack of bilingual sentence pairs leads to poor translation performance, so taking full advantage of the global information in the encoding-decoding process is an effective strategy. In this paper, we propose a novel reread-feedback NMT architecture (RFNMT) that exploits this global information. Our architecture builds on an improved sequence-to-sequence neural network and consists of a double-deck attention-based encoder-decoder framework. In the proposed architecture, the information generated by the first-pass encoding and decoding processes flows into the second-pass encoding process, giving better parameter initialization and fuller use of information. Specifically, we first propose a 'reread' mechanism that transfers the annotation of the first-pass encoder to the second-pass encoder, where it is used for initialization. Second, we propose a 'feedback' mechanism that transfers the first-pass decoder's outputs to the second-pass encoder via an importance-weighting model and an improved gated recurrent unit (GRU). Experimental results on multiple corpora demonstrate the effectiveness of the proposed RFNMT architecture, especially in low-resource settings.

Uniformly Interpolated Balancing for Robust Prediction in Translation Quality Estimation: A Case Study of English-Korean Translation

There has been growing interest among researchers in quality estimation (QE), which attempts to automatically predict the quality of MT outputs. Most existing work on QE is based on supervised approaches using quality-annotated training data. However, QE training data readily become imbalanced or skewed: they are mostly composed of sentence pairs of high translation quality and lack pairs of low translation quality. A quality estimator induced from such imbalanced data tends to produce biased translation quality scores, assigning 'high' quality scores even to poorly translated sentences. To address the data imbalance, this paper proposes a simple, efficient procedure called uniformly interpolated balancing, which constructs more balanced QE training data by injecting greater uniformness into the training data. The procedure is based on the preparation of two different types of manually annotated QE data: (1) default skewed data and (2) near-uniform data. First, we obtain the default skewed data in a naive manner, manually annotating qualities on MT outputs without considering the imbalance. Second, we obtain the near-uniform data selectively, manually annotating only a subset selected from the automatically quality-estimated sentence pairs. Finally, we create the uniformly interpolated balanced data by combining these two types of data, with one half originating from the default skewed data and the other half from the near-uniform data. We expect that uniformly interpolated balancing reflects the intrinsic skewness of the true quality distribution while managing the imbalance problem. Experimental results on an English-Korean quality estimation task show that the proposed uniformly interpolated balancing leads to robustness on both skewed and uniformly distributed quality test sets when compared to the test sets of other non-balanced datasets.
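The combining step described in the abstract can be illustrated with a minimal Python sketch; the (sentence, score) tuple format and the function name are our own illustrative assumptions, not the authors' code:

```python
import random

def uniformly_interpolated_balance(skewed, near_uniform, size, seed=0):
    """Combine two annotated QE datasets so that half of the result
    comes from the default skewed data and half from the near-uniform
    data, as in the procedure described above."""
    rng = random.Random(seed)
    half = size // 2
    balanced = rng.sample(skewed, half) + rng.sample(near_uniform, size - half)
    rng.shuffle(balanced)
    return balanced

# Toy data: a skewed set of mostly high-quality pairs vs. a spread of scores.
skewed = [("sent%d" % i, 0.9) for i in range(100)]
near_uniform = [("sent%d" % i, i / 100.0) for i in range(100)]
data = uniformly_interpolated_balance(skewed, near_uniform, 40)
```

By construction, at least half of the resulting training set carries the skewed distribution's scores, while the other half spreads over the whole quality range.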

Word Reordering for Translation into Korean Sign Language Using Syntactically-guided Classification

One of the goals of machine translation is to break the language barriers that prevent people from communicating with others and accessing information. Deaf people, in particular, face substantial language barriers in their daily lives, and there are very few digital resources for sign language processing. In this paper, we present a machine translation system for translating Korean into Korean Sign Language (KSL) glosses. The system uses dictionary-based lexical transfer and syntactically guided, data-driven structural transfer. A basic description of the linguistic features of KSL, compared with other sign languages, is also presented. This work focuses especially on structural transfer as word reordering. The core of our work is a neural classification model for reordering order-important constituent pairs, with a reordering task newly designed for Korean-to-KSL translation. Experimental results on news transcript data show that the proposed system achieves a BLEU score of 0.512 and a RIBES score of 0.425, significantly improving on the performance of the baseline system.

Wasf-Vec: Topology-Based Word Embedding for Modern Standard Arabic and Iraqi Dialect Ontology

Word clustering is a crucial issue for low-resource languages. Since words that share semantics are expected to cluster together, it is common to use feature-vector representations generated by word embedding methods based on distributional theory. The goal of this work is to utilize Modern Standard Arabic (MSA) to better cluster the vocabulary of the low-resource Iraqi dialect. We start with a fast dialect stemming algorithm that utilizes MSA data and reaches 0.85 accuracy measured by F1 score, then train a distributional word embedding method on the stemmed data. We then analyze how dialect words cluster among Modern Standard Arabic words, using word semantic relations that are well supported by established linguistic theories, and shed light on which word-relation representations are strong and which are weak. The analysis is carried out by visualizing the first two PCA components in 2D space, examining words' nearest neighbors, and analyzing distance histograms of specific word templates. We propose a new, simple yet effective spatial feature vector for word representation, named Wasf-Vec, that utilizes words' orthographic, phonological, and morphological structures. The Wasf technique captures relations that are not context-based, unlike distributional word embedding methods. The word classification used in this paper is validated by employing the classes in a class-based language model (CBLM). The Wasf-Vec CBLM achieved 7% lower perplexity (pp) than the distributional word embedding CBLM, a significant result when working with low-resource languages.

Isarn Dharma Word Segmentation Using a Statistical Approach with Named Entity Recognition

In this study, we developed an Isarn Dharma word segmentation system, focusing mainly on the word ambiguity and unknown word problems in unsegmented Isarn Dharma text. Ambiguous Isarn Dharma words occur frequently because the writing style omits tone markers; the same written form can therefore be interpreted with different tones and meanings. To overcome these problems, we developed an Isarn Dharma character cluster (IDCC)-character-based statistical model with affixation and named entity recognition (IDCC-C-based statistical model and affixation with NER). This method integrates the IDCC-based and character-based statistical models to distinguish word boundaries. The IDCC-based statistical model utilizes the IDCC feature to disambiguate ambiguous words, while unknown words are handled by the character-based statistical model based on character features. In addition, linguistic knowledge is employed to detect the boundaries of new words based on constructional morphology and NER. In evaluations, we compared the proposed method with various word segmentation methods. The experimental results showed that the proposed method performed slightly better than the other methods as the corpus size increased. On the test set, the proposed method obtained the best F-measure of 92.19.

Punjabi to ISO 15919 and Roman Transliteration with Phonetic Rectification

Transliteration removes script barriers. Punjabi is written in four different scripts, i.e., Gurmukhi, Shahmukhi, Devanagari, and Latin; the Latin script is understandable to nearly all factions of the Punjabi community. The objective of our work is to transliterate the Punjabi Gurmukhi script into the Latin script. There has been considerable progress in Punjabi-to-Latin transliteration, but the accuracy of present-day systems is less than fifty percent (Google Translate has approximately 45 percent accuracy). We do not have a rich parallel corpus for Punjabi, so we cannot use the corpus-based machine learning techniques that are in vogue these days. Existing transliteration systems follow a grapheme-based approach, which is unable to handle many phenomena such as tones, the inherent schwa, glottal stops, nasalization, and gemination. In this paper, grapheme-based transliteration is augmented with phonetic rectification, in which the Punjabi script is rectified phonetically before character-to-character mapping is applied. Handling the inherent short vowel schwa was the major challenge in phonetic rectification. Instead of following a fixed syllabic pattern, we devised a generic finite-state transducer to insert the schwa. The accuracy of our transliteration system is approximately 96.82 percent.
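The paper's transducer operates on Gurmukhi text with rules not given in the abstract; the toy finite-state transducer below merely illustrates the general idea of schwa insertion on a romanized consonant string (the alphabet and the single rule are invented for illustration, not the authors' actual rules):

```python
def insert_schwa(text, vowels="aeiou", schwa="a"):
    """Toy two-state transducer: walking the input left to right,
    emit an inherent schwa between two adjacent consonants.
    State 0 = start / just saw a vowel, state 1 = just saw a consonant."""
    out, state = [], 0
    for ch in text:
        is_vowel = ch in vowels
        if state == 1 and not is_vowel:
            out.append(schwa)          # insert the inherent schwa
        out.append(ch)
        state = 0 if is_vowel else 1
    return "".join(out)

print(insert_schwa("krm"))  # -> "karam"
```

Real schwa handling also involves deletion contexts (e.g. word-finally), which is why the paper needs a generic transducer rather than a fixed syllabic pattern.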

Children Story Classification in Indian Languages using Linguistic and Keyword based Features

The primary objective of this work is to classify Hindi and Telugu stories into three genres: fable, folk-tale, and legend. We propose a framework for story classification (SC) using keyword and part-of-speech (POS) features. To improve the performance of the SC system, feature reduction techniques and combinations of various POS tags are explored. Further, we investigated the performance of SC by dividing each story into parts depending on its semantic structure. Stories are (i) manually divided into parts based on their semantics as introduction, main, and climax; and (ii) automatically divided into equal parts based on the number of sentences in a story as initial, middle, and end. We also examined a sentence increment model that aims to determine the optimum number of sentences required to identify the story genre by incrementally selecting sentences from a story. Experiments are conducted on Hindi and Telugu story corpora consisting of 300 and 150 short stories, respectively. The performance of the SC system is evaluated using different combinations of keyword- and POS-based features with three promising machine learning classifiers: (i) Naive Bayes (NB), (ii) k-Nearest Neighbour (KNN), and (iii) Support Vector Machine (SVM). Classifier performance is evaluated using 10-fold cross-validation, and effectiveness is measured using precision, recall, and F-measure. The classification results show that adding linguistic information boosts the performance of story classification significantly. With respect to story structure, the main and initial parts of the story showed comparatively better performance. The results from the sentence increment model indicate that the first nine sentences in Hindi stories and the first seven in Telugu stories are sufficient for better classification. In most of the studies, SVM models outperformed the other models in classification accuracy.
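The sentence increment idea can be sketched with a stand-in classifier; the `classify` function and the genre keyword sets below are placeholders for illustration, not the paper's features or models:

```python
def classify(sentences, keyword_sets):
    """Stand-in genre classifier: pick the genre whose keyword set
    overlaps the most with the words seen so far."""
    words = set(w for s in sentences for w in s.lower().split())
    return max(keyword_sets, key=lambda g: len(words & keyword_sets[g]))

def sentence_increment(story_sentences, keyword_sets):
    """Classify using the first k sentences for k = 1..n and report
    the smallest k whose prediction matches the full-story prediction."""
    full = classify(story_sentences, keyword_sets)
    for k in range(1, len(story_sentences) + 1):
        if classify(story_sentences[:k], keyword_sets) == full:
            return k, full
    return len(story_sentences), full

genres = {"fable": {"fox", "moral"}, "legend": {"king", "kingdom"}}
story = ["A fox met a crow.", "The moral is clear."]
k, genre = sentence_increment(story, genres)
```

Averaging the smallest sufficient k over a corpus is one way to arrive at cutoffs like the nine and seven sentences reported above.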

Enhanced Double-Carrier Word Embedding Via Phonetics and Writing

Word embeddings, which map words into a unified vector space, capture rich semantic information. From a linguistic point of view, words have two carriers, speech and writing, yet most recent word embedding models focus only on the writing carrier and ignore the role of the speech carrier in semantic expression. In the development of language, however, speech appears before writing and plays an important role in the development of writing; for phonetic writing systems, the written forms are secondary symbols of the spoken ones. Based on this idea, we propose double-carrier word embedding (DCWE), which simulates the order in which speech and writing emerge: we train written embeddings based on phonetic embeddings, and the final word embedding fuses the two. To illustrate that our model can be applied to most languages, we selected Chinese, English, and Spanish as examples and evaluated the models through word similarity and text classification experiments.

StyloThai: A Scalable Framework For Stylometric Authorship Identification of Thai Documents

Authorship identification aims to identify the true author of a given anonymous document from a set of candidate authors. Applications of this task can be found in several domains, such as law enforcement and information retrieval, and these domains are not limited to a specific language or community. However, most existing solutions are designed for English, and little attention has been paid to Thai. These solutions are not directly applicable to Thai due to the linguistic differences between the two languages. Moreover, the existing solution designed for Thai is unable to (i) handle outliers in the dataset; (ii) scale as the set of candidate authors grows; and (iii) perform well when the number of writing samples per candidate author is low. We identify a stylometric feature space for the Thai authorship identification task. Based on this feature space, we present an authorship identification solution that uses a probabilistic k-nearest neighbors classifier, transforming each document into a collection of point sets. We create a new Thai authorship identification corpus containing 547 documents from 200 authors, significantly larger than the corpus used by the existing study (a 32-fold increase in the number of candidate authors). The experimental results show that our solution overcomes the limitations of the existing solution and outperforms all competitors with an accuracy of 91.02%. Moreover, we found that combining all categories of stylometric features outperforms the other combinations. Finally, we cross-compare the feature spaces and classification methods of all solutions, finding that (i) our solution scales as the number of candidate authors increases; (ii) our method outperforms all competitors; and (iii) our feature space provides better performance than that of the existing study.
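The abstract does not enumerate the stylometric feature space; a minimal feature extractor in this general spirit might look as follows (the three features are chosen for illustration only and are not the paper's):

```python
import string

def stylometric_features(text):
    """Illustrative stylometric features for a document: average word
    length, type-token ratio, and punctuation rate."""
    words = text.split()
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    type_token = len(set(w.lower() for w in words)) / max(len(words), 1)
    punct_rate = sum(c in string.punctuation for c in text) / max(len(text), 1)
    return (avg_word_len, type_token, punct_rate)

f = stylometric_features("Short sentences. Short words. Again and again.")
```

Computing such vectors per text chunk turns each document into a collection of points, which is the representation the probabilistic k-nearest neighbors classifier described above would operate on.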

Fusion of spatio-temporal information for Indic word recognition combining Online and Offline text data

This paper presents a novel approach to Indic handwritten word recognition that fuses spatio-temporal information extracted from handwritten images. The main challenge in Indic word recognition lies in its complexity, due to the presence of modifiers, overlapping and touching characters, compound characters, etc. Hidden Markov Models (HMMs) have been used to model such data because of their ability to learn sequential data; however, their recognition performance is not satisfactory. In this paper, we present a Long Short-Term Memory (LSTM)-based architecture for offline Indic word recognition. Offline recognition methods usually involve only spatial data, whereas online recognition schemes, which exploit the temporal information obtained from pen-tip strokes during writing, have been observed to perform better; this temporal information is missing from offline word images. Here, an effort has been made to extract online temporal information from offline images using stroke recovery and to combine it with spatial information in a deep LSTM architecture. During recognition, character models are trained on offline and online data separately, and a novel fusion scheme is then used to combine them. The experiments show that the recognition performance on handwritten Indic words improves considerably due to this fusion of spatial and temporal data.

Subword Attentive model for Arabic Sentiment Analysis: A deep learning approach

Social media data are unstructured, and such big data are increasing exponentially day by day across many disciplines. Analyzing and understanding the semantics of these data is a major challenge due to their variety and huge volume. To address this gap, this work studies unstructured Arabic texts, owing to their abundance on social media websites, and tackles the difficulty of handling such texts, particularly when the available data are very limited. An intelligent data augmentation technique is used to handle this scarcity of data. The paper proposes a novel architecture for Arabic word classification and understanding based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs), techniques well suited to the analysis of Arabic tweets and social networks. The main architecture used in this work is a character-level CNN and an RNN stacked on top of one another as the classifier. These two techniques achieve 95% accuracy on the Arabic text dataset.

Extracting Polarity Shifting Patterns from Any Corpus Based on Natural Annotation

In recent years, online sentiment texts have been generated by users in various domains and in different languages. Binary polarity classification (positive or negative) of business sentiment texts can help both companies and customers evaluate products or services. However, the polarity of a sentiment text can be modified, making polarity classification difficult. In sentiment analysis, such modification is termed polarity shifting: it shifts the polarity of a sentiment clue (emotion, evaluation, etc.). It is well known that detecting polarity shifting can help improve sentiment analysis, but detecting it in corpora is challenging: (1) polarity shifting is normally sparse in texts, making human annotation difficult; and (2) corpora with dense polarity shifting are few, so polarity shifting patterns may be needed from various corpora. This paper presents an approach to extract polarity shifting patterns from any text corpus. For the first time, we propose selecting texts rich in polarity shifting using the idea of natural annotation, which replaces human annotation. Using a sequence mining algorithm, the selected texts are used to generate polarity shifting pattern candidates, which we rank by C-value before human annotation. The approach is tested on different corpora and languages. The results show that our approach captures various types of polarity shifting patterns, and some patterns are unique to specific corpora; therefore, for better performance, it is reasonable to construct polarity shifting patterns directly from the given corpus.
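C-value, used above to rank the mined pattern candidates, weighs a candidate's frequency by its length and discounts occurrences nested inside longer candidates. A minimal sketch follows; the log2(max(|a|, 2)) floor, which lets single-token candidates score, is our own simplification of the usual formula, and the candidate data is invented:

```python
import math

def c_value_ranking(candidates):
    """candidates: dict mapping a pattern (tuple of tokens) to its corpus
    frequency.  Score(a) = log2(max(|a|, 2)) * (f(a) - mean frequency of
    the longer candidates that contain a), then rank by score."""
    scores = {}
    for a, fa in candidates.items():
        nests = [f for b, f in candidates.items()
                 if len(b) > len(a) and any(b[i:i + len(a)] == a
                                            for i in range(len(b) - len(a) + 1))]
        discount = sum(nests) / len(nests) if nests else 0.0
        scores[a] = math.log2(max(len(a), 2)) * (fa - discount)
    return sorted(scores, key=scores.get, reverse=True)

cands = {("not", "only"): 10, ("not",): 30, ("not", "only", "but"): 4}
ranking = c_value_ranking(cands)
```

Frequent short patterns that mostly occur inside longer ones are pushed down the ranking, so annotators see the most plausible standalone patterns first.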

S3-NET: SRU-based Sentence and Self-matching Networks for Machine Reading Comprehension

Machine reading comprehension question answering (MRC-QA) is the task of understanding the context of a given passage to find a correct answer within it. A passage is composed of several sentences, so the input sequence becomes long, which diminishes performance. In this paper, we propose S3-NET, which adds sentence-based encoding to address this problem. S3-NET, based on a simple recurrent unit (SRU) architecture, is a deep learning model that solves MRC-QA by applying a matching network to sentence-level encodings. In addition, S3-NET utilizes self-matching networks to compute attention weights over its own recurrent neural network sequences. We perform MRC-QA on the SQuAD dataset for English and the MindsMRC dataset for Korean. The experimental results show that on SQuAD, the proposed S3-NET achieves 71.91% and 74.12% EM and 81.02% and 82.34% F1 in single and ensemble models, respectively, and on MindsMRC, our model achieves 69.43% and 71.28% EM and 81.53% and 82.77% F1 in single and ensemble models, respectively.
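The EM and F1 figures above are the standard span-extraction metrics; a simplified sketch of how they are typically computed (omitting the full answer normalization of the official SQuAD script, such as article and punctuation stripping):

```python
from collections import Counter

def exact_match(pred, gold):
    """1.0 if prediction and gold answer match after trivial normalization."""
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Harmonic mean of token-level precision and recall."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

em = exact_match("the cat", "The cat")      # 1.0
f1 = token_f1("the black cat", "the cat")   # 0.8
```

Corpus-level EM and F1 are the averages of these per-question scores, taking the maximum over gold answers when several are provided.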

Transliteration of Arabizi into Arabic script for Tunisian dialect

In this paper, we focus on the design and development of a system for converting Tunisian Dialect text written in Latin script (also called Arabizi) into Arabic script following CODA (Conventional Orthography for Dialectal Arabic). To do so, we collected text from the internet (chats, comments, etc.) and from messages (instant messaging and mobile phone text messages); most of these messages and comments are written in Latin script. The language used in social media and SMS messaging is characterized by informal and non-standard vocabulary such as repeated letters for emphasis, typos, non-standard abbreviations, and non-linguistic content such as emoticons. In the context of Natural Language Processing (NLP), transliterating Arabizi into Arabic script is a necessary step, since most recently available tools for processing Arabic dialects expect Arabic-script input. We propose a hybrid approach to the transliteration of Arabizi into Arabic script for the Tunisian Dialect, combining a rule-based approach with a discriminative model that treats transliteration as a sequence classification task based on CRFs (Conditional Random Fields). Transliteration is performed at both the word and character levels. In the end, our system achieves a WER (Word Error Rate) of 9.80% and a CER (Character Error Rate) of 10.47%.
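The WER and CER reported above are both normalized Levenshtein distances, computed over words and characters respectively; a minimal sketch:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming,
    over any two sequences (strings or lists of words)."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,              # deletion
                        dp[j - 1] + 1,          # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def wer(hyp, ref):
    """Word Error Rate: word-level edits over reference length."""
    return edit_distance(hyp.split(), ref.split()) / len(ref.split())

def cer(hyp, ref):
    """Character Error Rate: character-level edits over reference length."""
    return edit_distance(hyp, ref) / len(ref)

err = wer("saw the cat", "saw a cat")  # one substitution in three words
```

A CER above the WER, as reported here, typically means errors concentrate in long words, where a single misrecognized word costs several character edits.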

A Deep Neural Network Framework for English Hindi Question Answering

Persian Semantic Role Labeling
