ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP)

Latest Articles

Words Are Important: Improving Sentiment Analysis in the Persian Language by Lexicon Refining

Lexicon-based sentiment analysis (SA) aims to address the problem of extracting people’s opinions from their comments on the Web using a predefined lexicon of opinionated words. In contrast to the machine learning (ML) approach, lexicon-based methods are domain-independent methods that do not need a large annotated training corpus and hence... (more)

The Rule-Based Sundanese Stemmer

Our research proposed an iterative Sundanese stemmer that removes derivational affixes before inflectional ones. This scheme was chosen because, in Sundanese affixation, a confix (one kind of derivational affix) is applied in the last phase of the morphological process. Moreover, most Sundanese affixes are derivational, so removing the... (more)

A Dependency Parser for Spontaneous Chinese Spoken Language

Dependency analysis is vital for spoken language understanding in spoken dialogue systems. However, existing research has mainly focused on western... (more)

Improving Vector Space Word Representations Via Kernel Canonical Correlation Analysis

Cross-lingual word embeddings are representations for vocabularies of two or more languages in one common continuous vector space and are widely used... (more)

Novel Character Identification Utilizing Semantic Relation with Animate Nouns in Korean

For identifying speakers of quoted speech or extracting social networks from literature, it is indispensable to extract character names and nominals.... (more)

Graph-Based Bilingual Word Embedding for Statistical Machine Translation

Bilingual word embedding has been shown to be helpful for Statistical Machine Translation (SMT). However, most existing methods suffer from two... (more)

CLASENTI: A Class-Specific Sentiment Analysis Framework

Arabic text sentiment analysis suffers from low accuracy due to Arabic-specific challenges (e.g., limited resources, morphological complexity, and dialects) and general linguistic issues (e.g., fuzziness, implicit sentiment, sarcasm, and spam). The limited resources problem requires efforts to build new and improved Arabic corpora and lexica. We... (more)

Domain-specific Named Entity Recognition with Document-Level Optimization

Previous studies normally formulate named entity recognition (NER) as a sequence labeling task and optimize the solution in the sentence level. In... (more)

Comparison of Methods to Annotate Named Entity Corpora

The authors compared two methods for annotating a corpus for the named entity (NE) recognition task using non-expert annotators: (i) revising the... (more)

Weakly Supervised POS Tagging without Disambiguation

Weakly supervised part-of-speech (POS) tagging is to learn to predict the POS tag for a given word in context by making use of partial annotated data... (more)


Science Citation Index Listing

TALLIP will be listed in the Science Citation Index Expanded starting with the first 2015 issue, 14(1). TALLIP will be included in the 2017 Journal Citation Report, and the first Impact Factor will be published mid-2018.

New Name, Expanded Scope

This page provides information about the journal Transactions on Asian and Low-Resource Language Information Processing (TALLIP), a publication of the Association for Computing Machinery (ACM).

The journal was formerly known as the Transactions on Asian Language Information Processing (TALIP): see the editorial charter for information on the expanded scope of the journal.  

Diacritic-Based Matching of Arabic Words

Words in Arabic consist of letters and short vowel symbols called diacritics inscribed atop regular letters. Changing diacritics may change the syntax and semantics of a word, turning it into another word. This results in difficulties when comparing words based solely on string matching. Typically, Arabic NLP applications resort to morphological analysis to battle ambiguity originating from this and other challenges. In this paper, we introduce three alternative algorithms to compare two words with possibly different diacritics. We propose the Subsume knowledge-based algorithm, the Imply rule-based algorithm, and the Alike machine-learning-based algorithm. We evaluated the soundness, completeness and accuracy of the algorithms against a large dataset of 86,886 word pairs. Our evaluation shows accuracies of 100% for Subsume, 99.32% for Imply, and 99.53% for Alike. Although accurate, Subsume was able to judge only 75% of the data. Both Subsume and Imply are sound, while Alike is not. We demonstrate the utility of the algorithms using a real-life use case in lemma disambiguation and in linking hundreds of Arabic dictionaries.
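
To make the comparison problem concrete, the following is a minimal Python sketch of diacritic-tolerant matching. It is not the Subsume, Imply, or Alike algorithm from the article; it only illustrates one simple assumed rule: two words are treated as compatible when their base letters are identical and no letter position carries two different, non-empty sets of diacritics. The function names and the example words are illustrative choices.

```python
# Arabic marks from U+064B to U+0652: tanween forms, fatha, damma, kasra, shadda, sukun.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def split_marks(word):
    """Return a list of (base_letter, frozenset_of_diacritics) pairs."""
    units = []
    for ch in word:
        if ch in DIACRITICS and units:
            # Attach the mark to the most recent base letter.
            letter, marks = units[-1]
            units[-1] = (letter, marks | {ch})
        elif ch not in DIACRITICS:
            units.append((ch, frozenset()))
    return units


def compatible(word_a, word_b):
    """True if base letters match and no position has conflicting marks.

    Assumed simplification: an undiacritized position matches anything,
    while two diacritized positions must carry exactly the same marks.
    """
    a, b = split_marks(word_a), split_marks(word_b)
    if len(a) != len(b):
        return False
    for (letter_a, marks_a), (letter_b, marks_b) in zip(a, b):
        if letter_a != letter_b:
            return False
        if marks_a and marks_b and marks_a != marks_b:
            return False
    return True


# Example words written with escapes so the code points are unambiguous.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTIBA = "\u0643\u064F\u062A\u0650\u0628\u064E"  # kutiba, "it was written"
BARE = "\u0643\u062A\u0628"                      # same letters, no diacritics

print(compatible(KATABA, BARE))    # fully vs. un-diacritized form
print(compatible(KATABA, KUTIBA))  # conflicting short vowels
```

Plain string equality would reject the first pair even though the bare form is a legitimate spelling of the diacritized one, which is the difficulty the abstract describes.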

Response Selection and Automatic Message-Response Expansion in Retrieval-Based QA Systems using Semantic Dependency Pair Model

This study presents an approach for selecting a suitable response and automatically expanding the message-response (MR) database of a QA system from unstructured website data. First, we manually construct a baseline MR database from articles collected from psychological consultation websites. The Chinese Knowledge and Information Processing PCFG is adopted to obtain the semantic dependency graphs (SDGs) of all the messages and responses in the baseline MR database. For each sentence in the MR database, all the semantic dependencies (SDs), each composed of two words and their semantic relation, are extracted from the SDG of the sentence to form a semantic dependency set. Finally, a matrix whose elements represent the correlation between the SDs of the messages and those of their corresponding responses is constructed as an SD Pair Model (SDPM) for response selection. Moreover, as the MR pairs on psychological consultation websites increase day by day, the MR database in the QA system should be expanded to meet new user needs. For MR database expansion, unstructured data from the message board are collected automatically. For the collected data, supervised LDA is adopted for event detection, and event-based delta-BIC is then used for MR article segmentation. Each extracted message segment is then fed to the constructed retrieval-based QA system to find the best-matching response segment, and the matching score is estimated to verify whether the MR pair is suitable for inclusion in the expanded MR database. Compared to the traditional vector space model, the proposed approach achieved more favorable performance according to a statistical significance test. The retrieval accuracy based on MR expansion was also evaluated, and a satisfactory result was obtained, confirming the effectiveness of the expanded MR database.

Sentiment Analysis of Iraqi Arabic Dialect on Facebook Based on Distributed Representations of Documents

Nowadays, social media is used by many people to express their opinions about a variety of topics. Opinion Mining or Sentiment Analysis techniques extract opinions from user-generated content. Over the years, a multitude of Sentiment Analysis studies have addressed the English language, while research on other languages remains scarce. Unfortunately, Arabic is one of the languages that seems to lack substantial research, despite the rapid growth of its use on social media outlets. Furthermore, specific Arabic dialects should be studied, not just Modern Standard Arabic. In this paper, we experiment with sentiment analysis of the Iraqi Arabic dialect using word embeddings. First, we built a large corpus from previous works to learn word representations. Second, we generated a word embedding model by training on the corpus using Doc2Vec representations based on Paragraph and Distributed Memory Model of Paragraph Vectors (DM-PV) architectures. Lastly, the resulting features were used to train four binary classifiers (Logistic Regression, Decision Tree, Support Vector Machine and Naive Bayes) to detect sentiment. We also experimented with different parameter values (window size, dimension and negative samples). In light of the experiments, it can be concluded that our approach achieves better performance with Logistic Regression and Support Vector Machine than with the other classifiers.

Sub-stroke-wise Relative Feature for Online Indic Handwriting Recognition

The main problem of Bangla and Devanagari handwriting recognition is the shape similarity of characters. There are only a few pieces of work on author-independent cursive online Indian text recognition, and the shape similarity problem needs more attention from researchers. To handle the shape similarity problem of cursive characters of Bangla and Devanagari scripts, in this paper, we propose a new category of features called sub-stroke-wise relative feature (SRF), which is based on relative information of the constituent parts of the handwritten strokes. Relative information among some of the parts within a character can be a distinctive feature as it scales up small dissimilarities and enhances discrimination among similar-looking shapes. Also, contextual anticipatory phenomena are automatically modeled by this type of feature, as it takes into account the influence of previous and forthcoming strokes. We have tested popular state-of-the-art feature sets as well as the proposed SRF using various (up to 20,000-word) lexicons and noticed that SRF significantly outperforms the state-of-the-art feature sets for online Bangla and Devanagari cursive word recognition.

A Rule-based Kurdish Text Transliteration System

In this article, we present a rule-based approach for transliterating between the two most widely used orthographies of Sorani Kurdish. Our work consists of detecting each character in a word, resolving possible ambiguities, and mapping it into the target orthography. We describe different challenges in Kurdish text mining and propose novel ideas concerning the transliteration task for Sorani Kurdish. Our transliteration system, named Wergor, achieves 82.79% overall precision and more than 99% in detecting the double-usage characters. We also present a manually transliterated corpus for Kurdish.

Online Handwritten Gurmukhi Words Recognition: An Inclusive Study

Identification of offline and online handwritten words is a challenging and complex task. In comparison to Latin and Oriental scripts, the research and study of handwriting recognition at word level in Indic scripts is in its initial phases. Global and analytical approaches are the two main methods of handwriting recognition. The present work introduces a novel analytical approach for online handwritten Gurmukhi word recognition based on a minimal set of words and recognizes an input Gurmukhi word as a sequence of characters. We employed a sequential step-by-step approach to recognize online handwritten Gurmukhi words. Considering the massive variability in online Gurmukhi handwriting, the present work employs the completely linked non-homogeneous hidden Markov model. In the present study, we considered the dependent, major dependent and super dependent nature of strokes to form Gurmukhi characters in words. On test sets of online handwritten Gurmukhi datasets, the word-level accuracy rates are 85.98%, 84.80%, 82.40% and 82.20% in four different modes. Besides online Gurmukhi word recognition, the present work also provides a Gurmukhi handwriting analysis study for varying writing styles, and proposes novel techniques for zone detection and rearrangement of strokes. Our proposed algorithms have been successfully applied to online handwritten Gurmukhi word recognition in dependent and independent modes of handwriting.

Wikipedia-based Relatedness Measurements for Multilingual Short Text Clustering

Throughout the world, people can post information about their local area in their own languages using social networking services. Multilingual short text clustering is an important task to organize such information and it can be applied to various applications, such as event detection and summarization. However, measuring the relatedness between short texts written in various languages is a challenging problem. In addition to handling multiple languages, the semantic gaps among all languages must be considered. In this paper, we propose two Wikipedia-based semantic relatedness measurement methods for multilingual short text clustering. The proposed methods solve the semantic gap problem by incorporating inter-language links of Wikipedia into Extended Naive Bayes (ENB), a probabilistic method that can be applied to measure semantic relatedness among monolingual short texts. The proposed methods represent a multilingual short text as a vector of the English version of Wikipedia articles (entities). By transferring texts to a unified vector space, the relatedness between texts in different languages with similar meanings can be increased. We also propose an approach that can improve clustering performance and reduce the processing time by eliminating language-specific entities in the unified vector space. Experimental results of multilingual Twitter message clustering revealed that the proposed methods outperformed cross-lingual explicit semantic analysis, a previously proposed method to measure relatedness between texts in different languages. Moreover, the proposed methods were comparable to ENB applied to texts translated into English using a proprietary translation service. The proposed methods enabled relatedness measurements for multilingual short text clustering without requiring machine translation processes.

Low-Resource Machine Transliteration Using Recurrent Neural Networks

Grapheme-to-phoneme models are key components in automatic speech recognition and text-to-speech systems. With low-resource language pairs that do not have available and well-developed pronunciation lexicons, grapheme-to-phoneme models are particularly useful. These models are based on initial alignments between grapheme source and phoneme target sequences. Inspired by sequence-to-sequence recurrent neural network-based translation methods, the current research presents an approach that applies an alignment representation for input sequences and pre-trained source and target embeddings to overcome the transliteration problem for a low-resource language pair. Evaluation and experiments involving French and Vietnamese showed that with only a small bilingual pronunciation dictionary available for training the transliteration models, promising results were obtained, with a large increase in BLEU scores and a reduction in translation error rate (TER) and phoneme error rate (PER). Moreover, we compared our proposed neural network-based transliteration approach with a statistical one.

Using Communities of Words Derived from Multilingual Word Vectors for Cross-Language Informational Retrieval in Indian Languages

We investigate the use of word embeddings for query translation to improve precision in Cross Language Information Retrieval (CLIR). Word vectors represent words in a distributional space such that syntactically or semantically similar words are close to each other in this space. Multilingual word embeddings are constructed in such a way that similar words across languages have similar vector representations. We explore the effective use of bilingual and multilingual word embeddings learned from comparable corpora of Indic languages for the task of CLIR. We propose a clustering method based on the multilingual word vectors to group similar words across languages. For this we construct a graph with words from multiple languages as nodes and with edges connecting words with similar vectors. We use the Louvain Method for community detection to find communities in this graph. We show that choosing target language words as query translations from the clusters or communities containing the query terms helps in improving CLIR. We also find that better quality query translations are obtained when words from more languages are used to do the clustering, even when the additional languages are neither the source nor the target language. This is probably because having more similar words across multiple languages helps define dense sub-clusters that help us obtain precise query translations. In this paper, we demonstrate the use of multilingual word embedding and word clusters for CLIR involving Indic languages. We also make available a tool for obtaining related words and the visualizations of the multilingual word vectors for English, Hindi, Bengali, Marathi, Gujarati and Tamil.
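
The graph construction described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the article's implementation: the two-dimensional vectors, the romanized example words, the `lang:word` naming, and the 0.8 similarity threshold are all made up, and connected components are used as a deliberately simplified stand-in for Louvain community detection so that the example needs only the standard library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def word_clusters(vectors, threshold=0.8):
    """Link words whose cosine similarity is >= threshold, then return
    the connected components of that graph (a stand-in for Louvain)."""
    words = list(vectors)
    neighbours = {w: set() for w in words}
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            if cosine(vectors[w1], vectors[w2]) >= threshold:
                neighbours[w1].add(w2)
                neighbours[w2].add(w1)
    clusters, seen = [], set()
    for w in words:
        if w in seen:
            continue
        stack, component = [w], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


def query_translations(query_word, clusters, target_prefix):
    """Target-language words that share a cluster with the query word."""
    for cluster in clusters:
        if query_word in cluster:
            return sorted(w for w in cluster if w.startswith(target_prefix))
    return []


# Toy multilingual vectors (hypothetical values, for illustration only).
TOY_VECTORS = {
    "en:water": (1.0, 0.1),
    "hi:paani": (0.9, 0.2),
    "bn:jol": (0.95, 0.05),
    "en:fire": (0.1, 1.0),
    "hi:aag": (0.2, 0.9),
}

clusters = word_clusters(TOY_VECTORS)
print(query_translations("en:water", clusters, "hi:"))
```

In this toy setup the Bengali word sits in the same cluster as the English and Hindi words for the same concept, which mirrors the abstract's observation that words from additional languages can help tighten the clusters used for query translation.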

Pause-based phrase extraction and effective OOV handling for low-resource machine translation systems

Machine translation is a core problem in natural language processing research across the globe. However, building a translation system involving low-resource languages remains a challenge with respect to statistical machine translation (SMT). This work proposes and studies the effect of a phrase-induced hybrid machine translation system for translation from English to Tamil, under a low-resource setting, using a limited domain-specific parallel text corpus. Unlike conventional hybrid MT systems, the free-word ordering feature of the target language, Tamil, is exploited to form a re-ordered target language model and to extend the parallel text corpus for training the SMT. In the current work, a novel rule-based phrase extraction method, implemented using parts-of-speech (POS) and place-of-pause (POP) in both languages, is proposed and used to pre-process the training corpus for developing the back-off phrase-induced SMT (PiSMT). Further, out-of-vocabulary (OOV) words are handled using speech-based transliteration and two-level thesaurus intersection techniques based on the parts-of-speech tag of the OOV word. In order to ensure that input with OOV words does not skip phrase-level translation in the hierarchical model, a phrase-level example-based machine translation (PL-EBMT) approach is adopted to find the closest matching phrase and perform translation followed by OOV replacement. The proposed system results in a bilingual evaluation understudy (BLEU) score of 80.21 and a translation edit rate (TER) of 20.18. The performance of the system is compared, in terms of adequacy and fluency, with existing translation systems for this specific language pair, and it is observed that the proposed system outperforms its counterparts.

"UTTAM": An Efficient Spelling Correction System for Hindi Language Based on Supervised Learning

Improving NER Tagging Performance in Low-Resource Languages via Multilingual Learning

Existing supervised solutions for Named Entity Recognition (NER) typically rely on large annotated corpora. Collecting large amounts of annotated data is time consuming and requires considerable human effort. Collecting a small NER-annotated corpus for any language is feasible, but performance may degrade due to data sparsity. We address data sparsity by borrowing features from the data of a closely related language. We use hierarchical neural networks to train a supervised NER system, and the feature borrowing happens via sharing of the layers of the network across languages. The neural network is trained on the combined dataset of the involved languages, an approach also termed Multilingual Learning. Unlike existing systems, we share all layers of the network across languages. In our experiments, sharing all layers of the network has been empirically observed to obtain better NER tagging performance for Indian languages. By multilingual learning, we show that the low-resource language NER performance increases mainly due to (a) increased named entity vocabulary, (b) cross-lingual sub-word features, and (c) multilingual learning playing the role of regularization.

Optimizing Automatic Evaluation of Machine Translation with the ListMLE Approach

Automatic evaluation of machine translation is critical in the evaluation and development of machine translation systems. In this article, we propose a new model for automatic evaluation of machine translation. The proposed model combines standard n-gram precision features and sentence semantic mapping features with neural features, including neural language model probabilities and the embedding distances between translation outputs and their reference translations. We optimize the model with a representative list-wise learning to rank approach, ListMLE, in terms of human ranking assessments. The experimental results on WMT15 Metrics task indicate that the proposed approach has a significantly better correlation with human assessments than several state-of-the-art baseline approaches. In particular, the results confirm that the proposed list-wise learning to rank approach is useful and powerful for optimizing automatic evaluation metrics in terms of human ranking assessments. Deep analysis further reveals that optimizing automatic metrics with the ListMLE approach is reasonable and the neural features can gain considerable improvement over the traditional features.
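
For readers unfamiliar with ListMLE, the loss it minimizes is the negative log-likelihood of the reference ranking under a Plackett-Luce model of the predicted scores. The sketch below shows only that loss for a single list of candidate translations; the function and argument names are illustrative, and nothing here reproduces the article's features or training setup.

```python
import math


def list_mle_loss(scores, human_ranking):
    """Negative Plackett-Luce log-likelihood of the human ranking.

    scores: model scores, one per candidate translation.
    human_ranking: candidate indices ordered from best to worst.
    """
    ordered = [scores[i] for i in human_ranking]
    loss = 0.0
    for k in range(len(ordered)):
        tail = ordered[k:]
        m = max(tail)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        # Cost of picking item k first among the items still unranked.
        loss += log_norm - ordered[k]
    return loss


scores = [3.0, 1.0, 0.0]
print(list_mle_loss(scores, [0, 1, 2]))  # scores agree with the ranking
print(list_mle_loss(scores, [2, 1, 0]))  # scores contradict the ranking
```

The loss is small when higher-scored candidates appear earlier in the human ranking and grows as the scores contradict it, which is what makes it usable as a training objective for a metric that should correlate with human ranking assessments.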

Incorporating Multi-level User Preference into Document-level Sentiment Classification

Document-level sentiment classification aims to predict a user's sentiment polarity in a document about a product. Most existing methods focus only on review contents and ignore the users who post the reviews. In fact, when reviewing a product, different users have different word-using habits to express opinions (i.e., word-level user preference), care about different attributes of the product (i.e., aspect-level user preference) and have different characteristics when scoring the review (i.e., polarity-level user preference). These preferences have a great influence on interpreting the sentiment of text. To address this issue, we propose a model called Hierarchical User Attention Network (HUAN), which incorporates multi-level user preference into a hierarchical neural network to perform document-level sentiment classification. Specifically, HUAN encodes different kinds of information (word, sentence, aspect and document) in a hierarchical structure and introduces user embedding and a user attention mechanism to model these preferences. Empirical results on two real-world datasets show that HUAN achieves state-of-the-art performance. Furthermore, HUAN can also mine important attributes of products for different users.

On the Usage of a Classical Arabic Corpus as a Language Resource: Related Research and Key Challenges

This paper presents a literature review of computer science works applied to hadith, a kind of Arabic narration which appeared in the 7th century. We study and compare existing works in several fields: Natural Language Processing (NLP), Information Retrieval (IR) and Knowledge Extraction (KE). We elicit the main drawbacks of existing works and identify some research issues which may be considered by the research community. We also study the characteristics of this type of document by enumerating the advantages and limits of using hadith as a language resource. Moreover, our study shows that existing works used different collections of hadiths, which makes it hard to compare their results objectively. Besides, many preprocessing steps recur across these applications, wasting a lot of time. Consequently, the key issues for building generic language resources from hadiths are discussed, taking into account the relevance of related works and the wide community of interested researchers. The ultimate goal is to structure hadith books for multiple usages, thus building common collections which may be exploited in future applications.

NOVA: A Feasible and Flexible Annotation System for Joint Tokenization and Part-of-Speech Tagging

A feasible and flexible annotation system is designed for joint tokenization and part-of-speech (POS) tagging to annotate languages without a natural definition of words. This design was motivated by the fact that word separators are not used in many highly analytic East and Southeast Asian languages. Although several of the languages are well-studied, e.g., Chinese and Japanese, many are understudied and low-resource, e.g., Burmese (Myanmar) and Khmer. In the first part of the paper, the proposed annotation system, named nova, is introduced. nova contains only four basic tags (n, v, a, and o), while these tags can be further modified and combined to adapt to complex linguistic phenomena in tokenization and POS tagging. In the second part of the paper, the application of nova is discussed, with practical examples on Burmese and Khmer, where the feasibility and flexibility of nova are demonstrated. The relation between nova and two universal POS tagsets is discussed in the final part of the paper.

Improving Word Embedding Coverage in Less-resource Language through Multi-linguality and Cross-linguality: A Case Study with Aspect based Sentiment Analysis

Efficient word representations play an important role in solving various problems related to Natural Language Processing (NLP), data mining, text mining, etc. The issue of data sparsity poses a great challenge in creating an efficient word representation model for solving the underlying problem. The problem is intensified in resource-poor languages due to the absence of a sufficiently large corpus. In this work we propose to minimize the effect of data sparsity by leveraging bilingual word embeddings learned through a parallel corpus. We train and evaluate a deep Long Short Term Memory (LSTM) based architecture and show the effectiveness of the proposed approach for two aspect-level sentiment analysis tasks, i.e., aspect term extraction and sentiment classification. The neural network architecture is further assisted by hand-crafted features for prediction. We apply the proposed model in two experimental setups, viz. multi-lingual and cross-lingual. Experimental results show the effectiveness of the proposed approach against state-of-the-art methods.

Tempo-HindiWordNet: A Lexical Knowledge-base for Temporal Information Processing

Temporality has significantly contributed to various Natural Language Processing and Information Retrieval applications. In this paper, we first create a lexical knowledge-base in Hindi by identifying the temporal orientation of word senses based on their definition and then use this resource to detect the underlying temporal orientation of sentences. In order to create the resource, we propose a semi-supervised learning framework, where each synset of the Hindi WordNet is classified into one of five categories, namely past, present, future, neutral and atemporal. The algorithm initiates learning with a set of seed synsets and then iterates following different expansion strategies, viz. probabilistic expansion based on the classifier's confidence and semantic distance based measures. We demonstrate the usefulness of the resource that we build on an external task, viz. sentence-level temporal classification. The underlying idea is that a temporal knowledge-base can help in classifying sentences according to their inherent temporal properties. Experiments on two different domains, viz. General and Twitter, show very interesting results.

Co-occurrence Weight Selection in Generation of Word Embeddings for Low Resource Languages

This study aims to increase the performance of word embeddings in analogy and similarity tasks by proposing a new weighting scheme for co-occurrence counting. The idea behind this new family of weights is to overcome the disadvantage that word pairs which appear far apart, yet are semantically close, suffer in co-occurrence counting. For high-resource languages this disadvantage might not matter because such pairs co-occur frequently. However, when there are not enough resources available, such pairs suffer from being distant. In order to favour such pairs, a polynomial weighting scheme is proposed to shift the weights up for distant words, whereas the weighting of nearby words is left nearly unchanged. The parameter optimization for the new weights and the effects of the weighting scheme are analysed for English, Italian and Turkish. A small portion of English resources and a quarter of Italian resources are utilized for demonstration purposes, as if these languages were low-resource languages. A performance increase is observed in analogy tests when the proposed weighting scheme is applied to relatively small corpora (i.e., mimicking low-resource languages) of both English and Italian. To confirm that the benefit of the proposed scheme is specific to small corpora, it is also shown that on a large English corpus the proposed weighting scheme cannot outperform the original weights. Since Turkish is a relatively low-resource language, it is demonstrated that the proposed weighting scheme can increase performance in both analogy and similarity tests when all Turkish Wikipedia pages are utilized as the corpus.

Multitask Pointer Network for Korean Dependency Parsing

Dependency parsing is a fundamental problem in natural language processing. We introduce a novel dependency parsing framework called head-pointing-based dependency parsing. In this framework, we cast the Korean dependency parsing problem as a statistical head pointing and arc labeling problem. To address the problem, a novel neural network called Multitask Pointer Networks is devised for a neural sequential head pointing and type labeling architecture. Our approach does not require any hand-crafted features or language-specific rules to parse dependencies. Furthermore, it shows state-of-the-art performance in Korean dependency parsing.

Transition-Based Korean Dependency Parsing Using Hybrid Word Representations of Syllables and Morphemes with LSTMs

Recently, neural approaches for transition-based dependency parsing have become one of the state-of-the-art methods for performing dependency parsing tasks in many languages. In neural transition-based parsing, a parser state representation is first computed from the configuration of a stack and a buffer, which is then fed into a feed-forward neural network model that predicts the next transition action. Since words are the basic elements of a stack and buffer, a parser state representation is largely affected by how a word representation is defined. Specifically, word representation issues become more severe in morphologically rich languages such as Korean, as the set of possible words is not restricted but rather nearly unlimited due to its agglutinative characteristics. In this paper, we propose a hybrid word representation which combines two compositional word representations, derived from representations of syllables and morphemes, respectively. Our underlying assumption for this hybrid word representation is that because syllables and morphemes are two common ways of decomposing Korean words, their effects in inducing word representations are expected to be complementary to one another. Experimental results on the Sejong and SPMRL 2014 datasets show that our proposed hybrid word representation leads to state-of-the-art performance.

Tempo-HindiWordNet: A Lexical Knowledge-base for Temporal Information Processing

In this paper we propose an efficient sentence-level temporal classifier for tagging sentences of Hindi documents with time senses. In order to achieve this goal, we need to determine the temporal sense of each word in the sentence. We propose a semi-supervised learning framework, where each synset of the HindiWordNet is classified into five temporal dimensions, namely past, present, future, neutral and atemporal. The algorithm initiates learning with a set of seed entities and then iterates following different expansion strategies, viz. probabilistic expansion based on the classifier's confidence and semantic distance based measures. We use different representation methods, varying from simple word uni-grams of HindiWordNet glosses to word embeddings created from the glosses of synsets and other HindiWordNet relations. The resource thus created is used for tagging sentences with past, present and future temporal senses. We develop two approaches based on machine learning and rules. Evaluation on two different domains, viz. newswire and tweets, shows encouraging performance.

Input Method for Human Translators: a Novel Approach to Integrate Machine Translation Effectively and Imperceptibly

Computer-aided translation (CAT) systems are the most popular tool for helping human translators efficiently perform language translation. To further improve translation efficiency, there is increasing interest in applying machine translation (MT) technology to upgrade CAT. To thoroughly integrate MT into CAT systems, in this paper, we propose a novel approach: a new input method that makes full use of the knowledge adopted by MT systems, such as translation rules, decoding hypotheses and n-best translation lists. The proposed input method contains two parts: a phrase generation model, allowing human translators to type target sentences quickly, and an n-gram prediction model, helping users choose suitable MT fragments smoothly. In addition, to tune the underlying MT system to generate results better suited to the input method, we design a new evaluation metric for the MT system. The input method integrates MT effectively and imperceptibly, and it is particularly suitable for target languages with complex characters, such as Chinese and Japanese. Extensive experiments demonstrate that our method saves more than 23% of the time and over 42% of the keystrokes, and it also improves the translation quality by more than 5 absolute BLEU points compared with the strong baseline, i.e., post-editing using Google Pinyin.

Word Segmentation for Burmese Based on Dual-Layer CRFs

Burmese is an isolating language in which the syllable is the smallest unit. Syllable segmentation methods based on matching make word segmentation performance dependent on the quality of syllable segmentation. This paper proposes a word segmentation method based on dual-layer conditional random fields (CRFs) with syllable features. It treats word segmentation and syllable segmentation as a single process, thus reducing the impact of syllable segmentation errors on Burmese word segmentation. In the first CRF layer, Burmese characters are used as atomic features and combined with Burmese Backus-Naur Form (BNF) features to tag Burmese syllable sequences. In the second CRF layer, the tagged syllables are taken as input, and sequence labeling is performed using feature templates with syllables as atomic features. The experimental results show that the proposed method performs better than the method based on syllable matching.

Arabic Authorship Attribution: An Extensive Study on Twitter Posts
