ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

Latest Articles

Sentiment Analysis of Iraqi Arabic Dialect on Facebook Based on Distributed Representations of Documents

Nowadays, social media is used by many people to express their opinions about a variety of topics...

Online Handwritten Gurmukhi Words Recognition: An Inclusive Study

Identification of offline and online handwritten words is a challenging and complex task. In comparison to Latin and Oriental scripts, the research and study of handwriting recognition at word level in Indic scripts is at its initial phases. The two main methods of handwriting recognition are global and analytical. The present work introduces a...

Co-occurrence Weight Selection in Generation of Word Embeddings for Low Resource Languages

This study aims to increase the performance of word embeddings by proposing a new weighting scheme for co-occurrence counting. The idea behind this...

On the Usage of a Classical Arabic Corpus as a Language Resource: Related Research and Key Challenges

This article presents a literature review of computer-science-related research applied on hadith, a kind of Arabic narration which appeared in the 7th...

Unsupervised Joint PoS Tagging and Stemming for Agglutinative Languages

The number of possible word forms is theoretically infinite in agglutinative languages. This brings up the out-of-vocabulary (OOV) issue for...

A Survey of Discourse Representations for Chinese Discourse Annotation

A key element in computational discourse analysis is the design of a formal representation for the discourse structure of a text. With machine...


Science Citation Index Listing

TALLIP will be listed in the Science Citation Index Expanded starting with the first 2015 issue, 14(1). TALLIP will be included in the 2017 Journal Citation Report, and the first Impact Factor will be published mid-2018.

New Name, Expanded Scope

This page provides information about the journal Transactions on Asian and Low-Resource Language Information Processing (TALLIP), a publication of the Association for Computing Machinery (ACM).

The journal was formerly known as the Transactions on Asian Language Information Processing (TALIP): see the editorial charter for information on the expanded scope of the journal.  

Role of Discourse Information in Urdu Sentiment Classification: A Rule-Based Method and Machine Learning Technique

A Comparative Analysis on Hindi and English Text Summarization

Text summarization is the process of condensing a large body of document content into a clear and concise form. In this paper, we present a detailed comparative study of various extractive methods for automatic text summarization on Hindi and English text datasets of news articles. We consider thirteen different summarization techniques, namely TextRank, LexRank, Luhn, LSA, Edmundson, ChunkRank, TGraph, UniRank, NN-ED, NN-SE, FE-SE, SummaRuNNer, and MMR-SE, and evaluate their performance using metrics such as precision, recall, F1, cohesion, non-redundancy, readability, and significance. A thorough analysis is carried out in eight parts that exhibit the strengths and limitations of these methods, the effect of summary length on performance, the impact of the document's language, and other factors. A standard summary evaluation tool (ROUGE) and extensive programmatic evaluation using Python 3.5 in an Anaconda environment are used to evaluate their outcomes.

Handwritten Manipuri Meetei-Mayek Classification using Convolutional Neural Network

A new technique for classifying all the 56 different characters of the Manipuri Meetei-Mayek is proposed herein. The characters are grouped under 5 categories: Eeyek Eepee (Original Alphabets), Lom Eeyek (Additional Letters), Cheising Eeyek (Digits), Lonsum Eeyek (Letters with Short Ending), and Cheitap Eeyek (Vowel Signs). Two related works proposed by previous researchers are studied for understanding the benefits claimed by the proposed Deep Learning Approach in Handwritten Manipuri Meetei-Mayek (HMMM). 1) Histogram of Oriented Gradients (HOG) with an SVM classifier is implemented for thoroughly understanding how HOG features can influence accuracy. 2) The handwritten samples are trained using a simple CNN and compared with the proposed CNN-based architecture. Significant progress has been made in the field of Optical Character Recognition (OCR) for well-known Indian languages as well as globally popular languages. Our work is novel in the sense that there is no record of work available to date which is able to classify all the 56 classes of the MMM. It will also serve as a precursor for developing end-to-end OCR software for translating old manuscripts, newspaper archives, books, etc.

A supplementary feature set for sentiment analysis in Japanese dialogues

Recently, real-time affect awareness has been applied in several commercial systems, such as dialogue systems and computer games. Real-time recognition of affective states, however, requires the application of costly feature extraction methods and/or labor-intensive annotation of large datasets, especially in the case of Asian languages, where large annotated datasets are seldom available. To improve recognition accuracy, we propose the use of cognitive context in the form of "emotion-sensitive" intentions. Intentions are often represented through dialogue acts and, as an emotion-sensitive model of dialogue acts, a tagset of interpersonal relations-directing interpersonal acts (the IA model) is proposed. The model's adequacy is assessed using a sentiment classification task in comparison with two well-known dialogue act models, the SWBD-DAMSL and the DIT++. For the assessment, five Japanese in-game dialogues were annotated with labels of sentiments and the tags of all three dialogue act models, which were used to enhance a baseline sentiment classifier system. The adequacy of the IA tagset is demonstrated by a 9% improvement to the baseline sentiment classifier's recognition accuracy, outperforming the other two models by more than 5%.

A Survey of Opinion Mining in Arabic: A Comprehensive System Perspective Covering Challenges and Advances in Tools, Resources, Models, Applications and Visualizations

Opinion mining, or sentiment analysis, continues to gain interest in industry and academia. While there has been significant progress in developing models for sentiment analysis, the field remains an active area of research for many languages across the world, and in particular for the Arabic language, which is the 5th most spoken language and has become the 4th most used language on the Internet. With the flurry of research activity in Arabic opinion mining, several researchers have provided surveys to capture advances in the field. While these surveys capture a wealth of important progress in the field, the fast pace of advances in machine learning and natural language processing (NLP) creates a continuous need for more up-to-date literature surveys. The aim of this paper is to provide a comprehensive literature survey of state-of-the-art advances in Arabic opinion mining. The survey goes beyond previous surveys, which were primarily focused on classification models. Instead, this paper provides a comprehensive system perspective by covering advances in different aspects of an opinion mining system, including advances in NLP software tools, lexical sentiment and corpora resources, classification models, and applications of opinion mining. It also presents future directions for opinion mining in Arabic. The survey also covers the latest advances in the field, including deep learning advances in Arabic opinion mining. The paper provides state-of-the-art information to help new or established researchers in the field as well as industry developers who aim to deploy a complete operational opinion mining system. Key insights are captured at the end of each section for particular aspects of the opinion mining system, giving the reader a choice of focusing on particular aspects of interest.

Chinese-Catalan: A Neural Machine Translation Approach based on Pivoting and Attention Mechanisms

This paper addresses machine translation from Chinese to Catalan using neural pivot strategies trained without any direct parallel data. Catalan is linguistically very similar to Spanish, which motivates the use of Spanish as the pivot language. For the neural architecture, we use the latest state of the art, the Transformer model, which is based solely on attention mechanisms. Additionally, this work provides new resources to the community, consisting of a human-developed gold standard of 4,000 sentences between Catalan and Chinese and all the other United Nations official languages (Arabic, English, French, Russian and Spanish). Results show that the standard pseudo-corpus or synthetic pivot approach performs better than the cascade approach and is only 6 BLEU points behind a direct Chinese-to-Spanish machine translation system.
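The abstract names two pivot strategies without giving implementation details. The following is a minimal sketch of how the cascade and pseudo-corpus approaches are commonly understood, not the authors' system. Toy lookup tables stand in for trained translation models, and every name here (`zh_to_es`, `es_to_ca`, `cascade_translate`, `build_pseudo_corpus`) is hypothetical.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]

# Toy lookup tables standing in for trained NMT systems (hypothetical data).
_ZH_ES = {"谢谢": "gracias", "你好": "hola"}
_ES_CA = {"gracias": "gràcies", "hola": "hola"}


def zh_to_es(sentence: str) -> str:
    """Stand-in for a Chinese-to-Spanish model."""
    return _ZH_ES[sentence]


def es_to_ca(sentence: str) -> str:
    """Stand-in for a Spanish-to-Catalan model."""
    return _ES_CA[sentence]


def cascade_translate(source: str,
                      src_to_pivot: Translator,
                      pivot_to_tgt: Translator) -> str:
    """Cascade: at inference time, translate source -> pivot -> target."""
    return pivot_to_tgt(src_to_pivot(source))


def build_pseudo_corpus(src_pivot_pairs: List[Tuple[str, str]],
                        pivot_to_tgt: Translator) -> List[Tuple[str, str]]:
    """Pseudo-corpus: translate the pivot side of a source-pivot parallel
    corpus into the target language, giving synthetic source-target pairs
    on which a direct model could then be trained."""
    return [(src, pivot_to_tgt(pivot)) for src, pivot in src_pivot_pairs]


if __name__ == "__main__":
    print(cascade_translate("谢谢", zh_to_es, es_to_ca))
    print(build_pseudo_corpus([("你好", "hola"), ("谢谢", "gracias")], es_to_ca))
```

In the cascade, both models run for every sentence translated. In the pseudo-corpus setting, the pivot-to-target model runs only once, while the synthetic training data is being built.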

Machine Translation Evaluation Metric Based on Dependency Parsing Model

Most syntax-based metrics obtain the similarity by comparing the sub-structures extracted from the trees of the hypothesis and the reference. These sub-structures cannot represent all the information in the trees because their lengths are limited. To make fuller use of the reference syntax information, a new automatic evaluation metric is proposed based on a dependency parsing model. First, a dependency parsing model is trained using the reference dependency tree for each sentence. Then, the hypothesis is parsed by this dependency parsing model and the corresponding hypothesis dependency tree is generated. The quality of the hypothesis can be judged by the quality of the hypothesis dependency tree. A unigram F-score is included in the new metric so that lexical similarity is captured. According to experimental results, the proposed metric performs better than METEOR and BLEU at the system level and achieves results comparable with METEOR at the sentence level. To further improve the performance, we also propose a combined metric which achieves the best performance at both the sentence level and the system level.
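The abstract does not say how the unigram F-score component is computed. As a rough illustration only, the sketch below uses one common formulation: clipped unigram overlap between whitespace-tokenized hypothesis and reference, combined into an F-measure. The function name, the tokenization, and the `beta` weighting are assumptions, not the paper's definition.

```python
from collections import Counter


def unigram_f_score(hypothesis: str, reference: str, beta: float = 1.0) -> float:
    """Unigram F-score with clipped counts (assumed formulation)."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps, per word, the minimum of the two counts,
    # so a repeated word cannot be matched more often than it occurs.
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    print(unigram_f_score("the cat sat", "the cat sat on the mat"))
```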

Tempo-HindiWordNet: A Lexical Knowledge-base for Temporal Information Processing

Temporality has significantly contributed to various Natural Language Processing and Information Retrieval applications. In this paper, we first create a lexical knowledge-base in Hindi by identifying the temporal orientation of word senses based on their definition and then use this resource to detect the underlying temporal orientation of sentences. In order to create the resource, we propose a semi-supervised learning framework, where each synset of the Hindi WordNet is classified into one of five categories, namely past, present, future, neutral and atemporal. The algorithm initiates learning with a set of seed synsets and then iterates following different expansion strategies, viz. probabilistic expansion based on the classifier's confidence and semantic distance based measures. We demonstrate the usefulness of the resource that we build on an external task, viz. sentence-level temporal classification. The underlying idea is that a temporal knowledge-base can help in classifying sentences according to their inherent temporal properties. Experiments on two different domains, viz. General and Twitter, show very interesting results.

Automatic Diacritics Restoration for Tunisian Dialect

Modern Standard Arabic, as well as the Arabic dialects, is usually written without diacritics. The absence of these marks constitutes a real problem for the automatic processing of such data by NLP tools. Indeed, writing Arabic without diacritics introduces several types of ambiguity. Firstly, a word without diacritics could have many possible meanings depending on its diacritization. Secondly, undiacritized surface forms of an Arabic word might have as many as 200 readings depending on the complexity of its morphology [12]. In fact, the agglutination property of Arabic might produce a problem that can only be resolved using diacritics. Thirdly, without diacritics a word could have many possible POS tags instead of one. This is the case with words that have the same spelling and POS tag but a different lexical sense, or words that have the same spelling but different POS tags and lexical senses [8]. Finally, there is ambiguity at the grammatical level (syntactic ambiguity). In this paper, we propose the first work that investigates the automatic diacritization of Tunisian Dialect texts. We first describe our annotation guidelines and procedure. Then, we propose two major models, namely a statistical machine translation (SMT) model and a discriminative model that treats diacritization as a sequence classification task based on CRFs (Conditional Random Fields). In the second approach, we integrate POS features to influence the generation of diacritics. Diacritics restoration was performed at both the word and the character levels. The results showed that the CRF-based system gives the better automatic diacritization (WER 21.44% for CRF and WER 34.6% for SMT).
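The abstract reports WER without defining it. A minimal sketch follows, assuming the convention common in diacritization work: since the system output and the reference contain the same words in the same order, WER is the fraction of words whose predicted diacritized form differs from the reference. The function name is hypothetical, and Latin transliterations are used as placeholder tokens.

```python
from typing import Sequence


def diacritization_wer(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of words whose diacritized form differs from the reference.

    Assumes both sequences are aligned word for word, which holds when
    diacritics are restored on a fixed undiacritized input.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    errors = sum(p != r for p, r in zip(predicted, reference))
    return errors / len(reference)


if __name__ == "__main__":
    reference = ["kataba", "kitAb", "qalam", "bayt"]
    predicted = ["kataba", "kutub", "qalam", "bayt"]
    print(f"WER = {diacritization_wer(predicted, reference):.2%}")
```

Under this definition, lower is better, so the reported 21.44% for the CRF system would mean roughly one word in five is diacritized incorrectly.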
