ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

Latest Articles

Optimizing Automatic Evaluation of Machine Translation with the ListMLE Approach

Automatic evaluation of machine translation is critical for the evaluation and development of machine translation systems. In this study, we propose a...

Response Selection and Automatic Message-Response Expansion in Retrieval-Based QA Systems using Semantic Dependency Pair Model

This article presents an approach to response selection and message-response (MR) database expansion...

Input Method for Human Translators: A Novel Approach to Integrate Machine Translation Effectively and Imperceptibly

Computer-aided translation (CAT) systems are the most popular tool for helping human translators efficiently perform language translation. To further improve the translation efficiency, there is an increasing interest in applying machine translation (MT) technology to upgrade CAT. To thoroughly integrate MT into CAT systems, in this article, we...

Arabic Authorship Attribution: An Extensive Study on Twitter Posts

Law enforcement faces problems in tracing the true identity of offenders in cybercrime investigations. Most offenders mask their true identity, impersonate people of high authority, or use identity deception and obfuscation tactics to avoid detection and traceability. To address the problem of anonymity, authorship analysis is used to identify...

Word Segmentation for Burmese Based on Dual-Layer CRFs

Burmese is an isolated language, in which the syllable is the smallest unit. Syllable segmentation methods based on matching lead to performance...

Incorporating Multi-Level User Preference into Document-Level Sentiment Classification

Document-level sentiment classification aims to predict a user’s sentiment polarity in a document about a product. Most existing methods only...

“UTTAM”: An Efficient Spelling Correction System for Hindi Language Based on Supervised Learning

In this article, we propose a system called “UTTAM,” for correcting spelling errors in Hindi language text using supervised learning. Unlike other languages, Hindi contains a large set of characters, words with inflections and complex characters, phonetically similar sets of characters, and so on. The complexity increases the...

NEWS

Science Citation Index Listing

TALLIP will be listed in the Science Citation Index Expanded starting with the first 2015 issue, 14(1). TALLIP will be included in the 2017 Journal Citation Report, and the first Impact Factor will be published mid-2018.

New Name, Expanded Scope

This page provides information about the journal Transactions on Asian and Low-Resource Language Information Processing (TALLIP), a publication of the Association for Computing Machinery (ACM).

The journal was formerly known as the Transactions on Asian Language Information Processing (TALIP): see the editorial charter for information on the expanded scope of the journal.  

Sentiment Analysis of Iraqi Arabic Dialect on Facebook Based on Distributed Representations of Documents

Nowadays, social media is used by many people to express their opinions about a variety of topics. Opinion Mining or Sentiment Analysis techniques extract opinions from user-generated content. Over the years, a multitude of Sentiment Analysis studies have been carried out on the English language, with a shortage of research on other languages. Unfortunately, Arabic is one of the languages that seems to lack substantial research, despite the rapid growth of its use on social media outlets. Furthermore, specific Arabic dialects should be studied, not just Modern Standard Arabic. In this paper, we experiment with sentiment analysis of the Iraqi Arabic dialect using word embeddings. First, we built a large corpus from previous works to learn word representations. Second, we generated a word embedding model by training on the corpus using Doc2Vec representations based on the Paragraph and Distributed Memory Model of Paragraph Vectors (DM-PV) architectures. Lastly, the resulting features were used to train four binary classifiers (Logistic Regression, Decision Tree, Support Vector Machine and Naive Bayes) to detect sentiment. We also experimented with different parameter values (window size, dimension and negative samples). In light of the experiments, it can be concluded that our approach achieves better performance with Logistic Regression and Support Vector Machine than with the other classifiers.

Sub-stroke-wise Relative Feature for Online Indic Handwriting Recognition

The main problem of Bangla and Devanagari handwriting recognition is the shape similarity of characters. There are only a few pieces of work on author-independent cursive online Indian text recognition, and the shape similarity problem needs more attention from researchers. To handle the shape similarity problem of cursive characters of Bangla and Devanagari scripts, in this paper, we propose a new category of features called sub-stroke-wise relative features (SRF), which are based on relative information of the constituent parts of the handwritten strokes. Relative information among some of the parts within a character can be a distinctive feature, as it scales up small dissimilarities and enhances discrimination among similar-looking shapes. Also, contextual anticipatory phenomena are automatically modeled by this type of feature, as it takes into account the influence of previous and forthcoming strokes. We have tested popular state-of-the-art feature sets as well as the proposed SRF using various (up to 20,000-word) lexicons and noticed that SRF significantly outperforms the state-of-the-art feature sets for online Bangla and Devanagari cursive word recognition.

Role of Discourse Information in Urdu Sentiment Classification: A Rule-Based Method and Machine Learning Technique

A Rule-based Kurdish Text Transliteration System

In this article, we present a rule-based approach for transliterating between the two most widely used orthographies in Sorani Kurdish. Our work consists of detecting each character in a word by removing the possible ambiguities and mapping it into the target orthography. We describe different challenges in Kurdish text mining and propose novel ideas concerning the transliteration task for Sorani Kurdish. Our transliteration system, named Wergor, achieves 82.79% overall precision and more than 99% in detecting the double-usage characters. We also present a manually transliterated corpus for Kurdish.

Online Handwritten Gurmukhi Words Recognition: An Inclusive Study

Identification of offline and online handwritten words is a challenging and complex task. In comparison to Latin and Oriental scripts, the research and study of handwriting recognition at word level in Indic scripts is in its initial phases. Global and analytical approaches are the two main methods of handwriting recognition. The present work introduces a novel analytical approach for online handwritten Gurmukhi word recognition based on a minimal set of words and recognizes an input Gurmukhi word as a sequence of characters. We employed a sequential step-by-step approach to recognize online handwritten Gurmukhi words. Considering the massive variability in online Gurmukhi handwriting, the present work employs the completely linked non-homogeneous hidden Markov model. In the present study, we considered the dependent, major dependent and super dependent nature of strokes to form Gurmukhi characters in words. On test sets of online handwritten Gurmukhi datasets, the word-level accuracy rates are 85.98%, 84.80%, 82.40% and 82.20% in four different modes. Besides the online Gurmukhi word recognition, the present work also provides a Gurmukhi handwriting analysis study for varying writing styles, and proposes novel techniques for zone detection and rearrangement of strokes. Our proposed algorithms have been successfully applied to online handwritten Gurmukhi word recognition in dependent and independent modes of handwriting.

Low-Resource Machine Transliteration Using Recurrent Neural Networks

Grapheme-to-phoneme models are key components in automatic speech recognition and text-to-speech systems. With low-resource language pairs that do not have available and well-developed pronunciation lexicons, grapheme-to-phoneme models are particularly useful. These models are based on initial alignments between grapheme source and phoneme target sequences. Inspired by sequence-to-sequence recurrent neural network-based translation methods, the current research presents an approach that applies an alignment representation for input sequences and pre-trained source and target embeddings to overcome the transliteration problem for a low-resource language pair. Evaluation and experiments involving French and Vietnamese showed that with only a small bilingual pronunciation dictionary available for training the transliteration models, promising results were obtained, with a large increase in BLEU scores and a reduction in translation error rate (TER) and phoneme error rate (PER). Moreover, we compared our proposed neural network-based transliteration approach with a statistical one.
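
As an illustration of the phoneme error rate (PER) mentioned above, the following is a minimal sketch assuming the common definition: total edit distance between predicted and reference phoneme sequences, divided by the total number of reference phonemes. The abstract does not state the exact formula used in the article, and the function names and sample phoneme strings below are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(references, hypotheses):
    """Assumed PER definition: total edits / total reference phonemes."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_ref = sum(len(r) for r in references)
    return total_edits / total_ref


# Hypothetical phoneme sequences, for illustration only.
refs = [["b", "o", "n"], ["m", "a"]]
hyps = [["b", "a", "n"], ["m", "a"]]
per = phoneme_error_rate(refs, hyps)  # one substitution over five reference phonemes
```

A lower value is better. Because insertions are counted, the rate can exceed 1.0 when a hypothesis is much longer than its reference.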

A Survey of Opinion Mining in Arabic: A Comprehensive System Perspective Covering Challenges and Advances in Tools, Resources, Models, Applications and Visualizations

Opinion mining or sentiment analysis continues to gain interest in industry and academia. While there has been significant progress in developing models for sentiment analysis, the field remains an active area of research for many languages across the world, and in particular for the Arabic language, which is the 5th most spoken language and has become the 4th most used language on the Internet. With the flurry of research activity in Arabic opinion mining, several researchers have provided surveys to capture advances in the field. While these surveys capture a wealth of important progress in the field, the fast pace of advances in machine learning and natural language processing (NLP) creates a continuous need for more up-to-date literature surveys. The aim of this paper is to provide a comprehensive literature survey of state-of-the-art advances in Arabic opinion mining. The survey goes beyond previous surveys, which focused primarily on classification models. Instead, this paper provides a comprehensive system perspective by covering advances in different aspects of an opinion mining system, including advances in NLP software tools, lexical sentiment and corpora resources, classification models and applications of opinion mining. It also presents future directions for opinion mining in Arabic. The survey also covers the latest advances in the field, including deep learning advances in Arabic opinion mining. The paper provides state-of-the-art information to help new or established researchers in the field as well as industry developers who aim to deploy a complete operational opinion mining system. Key insights are captured at the end of each section for particular aspects of the opinion mining system, giving the reader a choice of focusing on particular aspects of interest.

A Survey of Discourse Representations for Chinese Discourse Annotation

A key element in computational discourse analysis is the design of a formal representation for the discourse structure of a text. With machine learning being the dominant method, it is important to identify a discourse representation that can be used to perform large-scale annotation. This survey provides a systematic analysis of existing discourse representation theories to evaluate whether they are suitable for annotation of Chinese text. Specifically, two properties, expressiveness and practicality, are introduced to compare representations based on rhetorical relations and representations based on entity relations. The comparison systematically reveals linguistic and computational characteristics of the theories. We then conclude that none of the existing theories is quite suitable for scalable Chinese discourse annotation because they are not both expressive and practical. Therefore, a new discourse representation needs to be proposed, which should balance expressiveness and practicality, and cover rhetorical relations and entity relations. Inspired by these conclusions, this survey discusses some preliminary proposals on how to represent the discourse structure that are worth pursuing.

On the Usage of a Classical Arabic Corpus as a Language Resource: Related Research and Key Challenges

This paper presents a literature review of computer science related works applied to hadith, a kind of Arabic narration which appeared in the 7th century. We study and compare existing works in several fields of Natural Language Processing (NLP), Information Retrieval (IR) and Knowledge Extraction (KE). Thus, we elicit the main drawbacks of existing works and identify some research issues, which may be considered by the research community. We also study the characteristics of this type of document, by enumerating the advantages and limits of using hadith as a language resource. Moreover, our study shows that existing works used different collections of hadiths, making it hard to compare their results objectively. Besides, many preprocessing steps recur across these applications, wasting a lot of time. Consequently, the key issues for building generic language resources from hadiths are discussed, taking into account the relevance of related works and the wide community of researchers who are interested in them. The ultimate goal is to structure hadith books for multiple usages, thus building common collections which may be exploited in future applications.

NOVA: A Feasible and Flexible Annotation System for Joint Tokenization and Part-of-Speech Tagging

A feasible and flexible annotation system is designed for joint tokenization and part-of-speech (POS) tagging to annotate those languages without a natural definition of words. This design was motivated by the fact that word separators are not used in many highly analytic East and Southeast Asian languages. Although several of the languages are well-studied, e.g., Chinese and Japanese, many are understudied and low-resource, e.g., Burmese (Myanmar) and Khmer. In the first part of the paper, the proposed annotation system, named nova, is introduced. nova contains only four basic tags (n, v, a, and o), while these tags can be further modified and combined to adapt to complex linguistic phenomena in tokenization and POS tagging. In the second part of the paper, the application of nova is discussed, with practical examples on Burmese and Khmer, where the feasibility and flexibility of nova are demonstrated. The relation between nova and two universal POS tagsets is discussed in the final part of the paper.

Improving Word Embedding Coverage in Less-resource Language through Multi-linguality and Cross-linguality: A Case Study with Aspect based Sentiment Analysis

Efficient word representations play an important role in solving various problems related to Natural Language Processing (NLP), data mining, text mining, etc. The issue of data sparsity poses a great challenge in creating an efficient word representation model for solving the underlying problem. The problem is intensified for resource-poor languages due to the absence of a sufficiently large corpus. In this work, we propose to minimize the effect of data sparsity by leveraging bilingual word embeddings learned through a parallel corpus. We train and evaluate a deep Long Short Term Memory (LSTM) based architecture and show the effectiveness of the proposed approach for two aspect-level sentiment analysis tasks, i.e., aspect term extraction and sentiment classification. The neural network architecture is further assisted by hand-crafted features for prediction. We apply the proposed model in two experimental setups, viz. multi-lingual and cross-lingual. Experimental results show the effectiveness of the proposed approach against the state-of-the-art methods.

Tempo-HindiWordNet: A Lexical Knowledge-base for Temporal Information Processing

Temporality has significantly contributed to various Natural Language Processing and Information Retrieval applications. In this paper, we first create a lexical knowledge-base in Hindi by identifying the temporal orientation of word senses based on their definition and then use this resource to detect the underlying temporal orientation of sentences. In order to create the resource, we propose a semi-supervised learning framework, where each synset of the Hindi WordNet is classified into one of five categories, namely past, present, future, neutral and atemporal. The algorithm initiates learning with a set of seed synsets and then iterates following different expansion strategies, viz. probabilistic expansion based on the classifier's confidence and semantic distance based measures. We demonstrate the usefulness of the resource that we build on an external task, viz. sentence-level temporal classification. The underlying idea is that a temporal knowledge-base can help in classifying sentences according to their inherent temporal properties. Experiments on two different domains, viz. General and Twitter, show very interesting results.

Automatic Diacritics Restoration for Tunisian Dialect

Modern Standard Arabic, as well as Arabic dialects, is usually written without diacritics. The absence of these marks constitutes a real problem in the automatic processing of these data by NLP tools. Indeed, writing Arabic without diacritics introduces several types of ambiguity. Firstly, a word without diacritics could have many possible meanings depending on its diacritization. Secondly, undiacritized surface forms of an Arabic word might have as many as 200 readings depending on the complexity of its morphology [12]. In fact, the agglutination property of Arabic might produce a problem that can only be resolved using diacritics. Thirdly, without diacritics a word could have many possible POS tags instead of one. This is the case with words that have the same spelling and POS tag but a different lexical sense, or words that have the same spelling but different POS tags and lexical senses [8]. Finally, there is ambiguity at the grammatical level (syntactic ambiguity). In this paper, we propose the first work that investigates the automatic diacritization of Tunisian Dialect texts. We first describe our annotation guidelines and procedure. Then, we propose two major models, namely a statistical machine translation (SMT) model and a discriminative model that treats the problem as a sequence classification task based on CRFs (Conditional Random Fields). In the second approach, we integrate POS features to influence the generation of diacritics. Diacritics restoration was performed at both the word and the character levels. The results showed that automatic diacritization performed best with the CRF system (WER 21.44% for CRF and WER 34.6% for SMT).
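
To make the WER figures above concrete, here is a minimal sketch assuming the definition often used for diacritization: the share of words whose predicted diacritized form differs from the reference in any way. The abstract does not spell out the article's exact formula. The function name and the romanized sample tokens are hypothetical stand-ins for diacritized Arabic words.

```python
def diacritization_wer(reference_words, predicted_words):
    """Assumed definition: fraction of words with at least one diacritic error.

    Diacritics restoration leaves the word sequence itself unchanged, so the
    two lists are compared position by position without any alignment step.
    """
    if len(reference_words) != len(predicted_words):
        raise ValueError("reference and prediction must have the same number of words")
    if not reference_words:
        return 0.0
    wrong = sum(1 for ref, pred in zip(reference_words, predicted_words) if ref != pred)
    return wrong / len(reference_words)


# Hypothetical romanized tokens standing in for diacritized words.
reference = ["kataba", "kutub", "darsun", "jadid"]
predicted = ["kataba", "kutiba", "darsun", "jadid"]
wer = diacritization_wer(reference, predicted)  # one wrong word out of four
```

Under this definition a WER of 21.44% would mean that roughly one word in five carries at least one wrong or missing diacritic.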

Co-occurrence Weight Selection in Generation of Word Embeddings for Low Resource Languages

This study aims to increase the performance of word embeddings in analogy and similarity tasks by proposing a new weighting scheme for co-occurrence counting. The idea behind this new family of weights is to overcome the disadvantage that word pairs which appear far apart, but are in fact semantically close, suffer in co-occurrence counting. For high-resource languages this disadvantage might have little effect due to the high frequency of co-occurrence. However, when there are not enough resources available, such pairs suffer from being distant. In order to favour such pairs, a polynomial weighting scheme is proposed to shift the weights up for distant words, whereas the weighting of nearby words is left nearly unchanged. The parameter optimization for the new weights and the effects of the weighting scheme are analysed for English, Italian and Turkish. A small portion of the English resources and a quarter of the Italian resources are utilized for demonstration purposes, as if these languages were low-resource languages. A performance increase is observed in analogy tests when the proposed weighting scheme is applied to relatively small corpora (i.e. mimicking low-resource languages) of both English and Italian. To show that the benefit of the proposed scheme is specific to small corpora, it is also shown that on a large English corpus the proposed weighting scheme does not outperform the original weights. Since Turkish is a relatively low-resource language, it is demonstrated that the proposed weighting scheme can increase the performance of both analogy and similarity tests when all Turkish Wikipedia pages are utilized as the corpus.
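
The idea of shifting weights up for distant word pairs can be sketched as follows. This is an illustration only: the abstract does not give the polynomial, so `boosted_weight` and its `alpha` and `power` parameters are hypothetical, and the inverse-distance baseline (as used in GloVe-style co-occurrence counting) is an assumption about what "the original weights" refers to.

```python
from collections import defaultdict


def inverse_distance_weight(distance, window):
    """Assumed baseline: a pair `distance` tokens apart contributes 1/distance."""
    return 1.0 / distance


def boosted_weight(distance, window, alpha=0.2, power=2):
    """Hypothetical polynomial boost, not the article's formula.

    Adds alpha * ((distance - 1) / (window - 1)) ** power to the baseline, so
    adjacent words (distance 1) are unchanged and the farthest words gain most.
    """
    if window <= 1:
        return 1.0 / distance
    return 1.0 / distance + alpha * ((distance - 1) / (window - 1)) ** power


def cooccurrence_counts(tokens, window, weight_fn):
    """Symmetric weighted co-occurrence counts within `window` tokens."""
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i + distance
            if j >= len(tokens):
                break
            weight = weight_fn(distance, window)
            counts[(center, tokens[j])] += weight
            counts[(tokens[j], center)] += weight
    return dict(counts)


tokens = "a b c d e".split()
baseline = cooccurrence_counts(tokens, window=4, weight_fn=inverse_distance_weight)
boosted = cooccurrence_counts(tokens, window=4, weight_fn=boosted_weight)
# ("a", "b") are adjacent: same weight under both schemes.
# ("a", "e") are 4 tokens apart: 0.25 in the baseline, 0.45 with the boost.
```

With these hypothetical settings, the pair at distance 2 moves only from 0.5 to about 0.52, which matches the stated goal of leaving nearby words nearly unchanged.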

Multitask Pointer Network for Korean Dependency Parsing

Dependency parsing is a fundamental problem in natural language processing. We introduce a novel dependency parsing framework called head pointing based dependency parsing. In this framework, we cast the Korean dependency parsing problem as a statistical head pointing and arc labeling problem. To address the problem, a novel neural network called Multitask Pointer Networks is devised for a neural sequential head pointing and type labeling architecture. Our approach does not require any hand-crafted features or language-specific rules to parse dependencies. Furthermore, it shows state-of-the-art performance in Korean dependency parsing.

Unsupervised Joint PoS Tagging and Stemming for Agglutinative Languages

The number of possible word forms is theoretically infinite in agglutinative languages. This raises the out-of-vocabulary (OOV) issue for part-of-speech (PoS) tagging in agglutinative languages. Since inflectional morphology does not change the PoS tag of a word, we propose to learn stems and PoS tags simultaneously. In this way, we aim to overcome the sparsity problem by reducing word forms to their stems. We adopt a Bayesian model that is fully unsupervised. We build a Hidden Markov Model for PoS tagging where the stems are emitted through hidden states. Several versions of the model are introduced in order to observe the effects of the different dependencies throughout the corpus, such as the dependency between stems and PoS tags or the dependency between PoS tags and affixes. Additionally, we use neural word embeddings to estimate the semantic similarity between the word form and the stem. We use the semantic similarity as prior information to discover the actual stem of a word, since inflection does not change the meaning of a word. We compare our models with other unsupervised stemming and PoS tagging models on Turkish, Hungarian, Finnish, Basque, and English. The results show that a joint model for PoS tagging and stemming improves upon an independent PoS tagger and stemmer in agglutinative languages.
