Poor Chinese-Vietnamese bilingual parallel corpora make existing Chinese-Vietnamese machine translation unsatisfactory. Considering the differences between Chinese and Vietnamese, we propose a Chinese-Vietnamese tree-to-tree statistical machine translation method that incorporates language features. Linguistic difference features provide useful supervision for machine translation. By analyzing the syntactic differences between Chinese and Vietnamese, we define several language difference rules: an attributive postposition award, a time adverbial postposition award, and a locative adverbial postposition award. On the basis of a Chinese-Vietnamese word-aligned bilingual corpus, these awards are incorporated into the extraction of tree-to-tree translation rules. The defined rules are used to constrain decoding and to prune and optimize the candidate sentences, so that we obtain the optimal translation sequence. Experiments on Chinese-Vietnamese bilingual sentence translation show that the proposed method performs well and that syntactic difference features can greatly improve the efficiency and accuracy of translation.
With the advent of social network services, Arabs' opinions on the web have attracted many researchers in recent years, who aim to detect and classify sentiments in Arabic tweets and reviews. However, the impact of word embedding vector (WEV) initialization and dataset balance on Arabic sentiment classification using deep learning has not been thoroughly studied. In this paper, a multi-channel embedding convolutional neural network (MCE-CNN) is proposed to improve Arabic sentiment classification by learning sentiment features from different text domains and from word and character n-gram levels. MCE-CNN encodes a combination of different pre-trained word embeddings into the embedding block at each embedding channel and trains these channels in parallel. In addition, a separate feature extraction module implemented in a CNN block is used to extract more relevant sentiment features. These channels and blocks help training start from high-quality WEVs and fine-tune them. The performance of MCE-CNN is evaluated on several standard balanced and imbalanced datasets to reflect real-world use cases. Experimental results show that MCE-CNN achieves high classification accuracy and benefits from the second embedding channel on both standard Arabic and dialectal Arabic text, outperforming state-of-the-art methods.
Translation quality estimation is an important task in machine translation that has attracted increasing interest in recent years. A key problem in translation quality estimation is the lack of a sufficient amount of quality-annotated training data. To address this lack, the Predictor-Estimator was recently proposed; it introduces "word prediction" as an additional pre-subtask that predicts a current target word in consideration of the surrounding source and target contexts, thereby leading to a two-stage neural model: a predictor and an estimator. However, the original Predictor-Estimator is not trained on a continuous stacking model but is instead trained in a cascaded manner that trains the predictor separately from the estimator. In addition, the Predictor-Estimator is trained with single-task learning only, which uses target-specific quality estimation data without using training data that are available from other tasks. In this paper, we thus propose multi-task stack propagation, which extensively applies stack propagation to fully train the Predictor-Estimator on a continuous stacking architecture, and multi-task learning to enhance the training data with data from other related quality estimation tasks. Experimental results on WMT17 quality estimation datasets show that the Predictor-Estimator trained using multi-task stack propagation provides statistically significant improvements over the baseline models. In particular, under an ensemble setting, the proposed multi-task stack propagation leads to state-of-the-art performance on all WMT17 quality estimation tasks (at the sentence, word, and phrase levels).
Chinese zero pronoun (ZP) resolution plays a critical role in discourse analysis. Different from traditional mention to mention approaches, this paper proposes a chain to chain approach to improve the performance of ZP resolution from three aspects. Firstly, consecutive ZPs are clustered into coreferential chains, each working as one independent anaphor as a whole. In this way, those ZPs far away from their overt antecedents can be bridged via other consecutive ZPs in the same coreferential chains and thus better resolved. Secondly, common noun phrases (NPs) are automatically grouped into coreferential chains using traditional approaches, each working as one independent antecedent candidate as a whole. That is, those NPs occurring in the same coreferential chain are viewed as one antecedent candidate as a whole, and ZP resolution is made between ZP coreferential chains and common NP coreferential chains. In this way, the performance can be much improved due to the effective reduction of search space by pruning singletons and negative instances. Thirdly and finally, additional features from ZP and common NP coreferential chains are employed to better represent anaphors and their antecedent candidates, respectively. Comprehensive experiments on the OntoNotes V5.0 corpus show that our chain to chain approach significantly outperforms the state-of-the-art mention to mention approaches. To our knowledge, this is the first work to resolve zero pronouns in a chain to chain way.
Role of Discourse Information in Urdu Sentiment Classification: A Rule-Based Method and Machine Learning Technique
Text summarization is the process of condensing a large document into a clear and concise form. In this paper, we present a detailed comparative study of various extractive methods for automatic text summarization on Hindi and English text datasets of news articles. We consider thirteen different summarization techniques, namely TextRank, LexRank, Luhn, LSA, Edmundson, ChunkRank, TGraph, UniRank, NN-ED, NN-SE, FE-SE, SummaRuNNer, and MMR-SE, and evaluate their performance using various metrics such as precision, recall, F1, cohesion, non-redundancy, readability, and significance. A thorough analysis is carried out in eight parts that exhibit the strengths and limitations of these methods, the effect of summary length on performance, the impact of the language of a document, and other factors. A standard summary evaluation tool (ROUGE) and extensive programmatic evaluation using Python 3.5 in an Anaconda environment are used to evaluate their outcomes.
Abstractive text summarization is a highly difficult problem, and the sequence-to-sequence model has shown success in improving the performance on the task. However, the generated summaries are often inconsistent with the source content in semantics. In such cases, when generating summaries, the model selects semantically unrelated words with respect to the source content as the most probable output. The problem can be attributed to heuristically constructed training data, where summaries can be unrelated to the source content, thus containing semantically unrelated words and spurious word correspondence. In this paper, we propose a regularization approach for the sequence-to-sequence model and make use of what the model has learned to regularize the learning objective to alleviate the effect of the problem. In addition, we propose a practical human evaluation method to address the problem that the existing automatic evaluation method does not evaluate the semantic consistency with the source content properly. Experimental results demonstrate the effectiveness of the proposed approach, which outperforms almost all the existing models. Especially, the proposed approach improves the semantic consistency by 4% in terms of human evaluation.
A new technique for classifying all 56 characters of the Manipuri Meetei-Mayek script is proposed herein. The characters are grouped into 5 categories: Eeyek Eepee (original alphabets), Lom Eeyek (additional letters), Cheising Eeyek (digits), Lonsum Eeyek (letters with short ending), and Cheitap Eeyek (vowel signs). Two related approaches proposed by previous researchers are studied to understand the benefits claimed by the proposed deep learning approach for Handwritten Manipuri Meetei-Mayek (HMMM): 1) a Histogram of Oriented Gradients (HOG) feature extractor with an SVM classifier is implemented to understand thoroughly how HOG features influence accuracy; 2) the handwritten samples are used to train a simple CNN, which is compared with the proposed CNN-based architecture. Significant progress has been made in the field of Optical Character Recognition (OCR) for well-known Indian languages as well as globally popular languages. Our work is novel in the sense that no work recorded to date is able to classify all 56 classes of the MMM. It will also serve as a precursor for developing end-to-end OCR software for translating old manuscripts, newspaper archives, books, etc.
Recently, real-time affect-awareness has been applied in several commercial systems, such as dialogue systems and computer games. Real-time recognition of affective states, however, requires the application of costly feature extraction methods and/or labor-intensive annotation of large datasets, especially in the case of Asian languages, where large annotated datasets are seldom available. To improve recognition accuracy, we propose the use of cognitive context in the form of "emotion-sensitive" intentions. Intentions are often represented through dialogue acts and, as an emotion-sensitive model of dialogue acts, a tagset of interpersonal relations-directing interpersonal acts (the IA model) is proposed. The model's adequacy is assessed using a sentiment classification task in comparison with two well-known dialogue act models, SWBD-DAMSL and DIT++. For the assessment, five Japanese in-game dialogues were annotated with sentiment labels and with the tags of all three dialogue act models, which were used to enhance a baseline sentiment classifier system. The adequacy of the IA tagset is demonstrated by a 9% improvement in the baseline sentiment classifier's recognition accuracy, outperforming the other two models by more than 5%.
A treebank is one of the most important and useful resources in natural language processing, and it is represented in one of two annotation schemas: phrase structure or dependency structure. Many works convert a phrase structure into a dependency structure and vice versa. Most of them are rule-based and exploit a hand-crafted head percolation table and argument table in predefined deterministic ways. In this paper, we propose a method to convert a dependency structure into a phrase structure by enriching a former rule-based approach with a trainable model. By adding a classifier to the algorithm and applying post-processing modifications, the quality of conversion is increased. We evaluate our method on two different languages, English and Persian, and analyze the errors. The results of our experiments show a 46.01% reduction in error rate for English and 76.50% for Persian compared to our baseline. We build a new Head-driven Phrase Structure Grammar (HPSG) treebank by converting the 10,000 sentences of PerDT into their corresponding HPSG structures and correcting them manually.
Code-switching or juxtaposition of linguistic units from two or more languages in a single utterance, in recent times, has become very common in text, thanks to social media and other computer mediated forms of communication. In this exploratory study of English-Hindi code-switching on Twitter, we automatically create a large corpus of code-switched tweets and devise techniques to identify the relationship between successive components in a code-switched tweet. More specifically, we identify pragmatic functions like narrative-evaluative, negative reinforcement, translation etc. characterizing relation between successive components. We analyze the difference / similarity between switching patterns in code-switched and monolingual multi-component tweets. We observe strong dominance of narrative-evaluative (non-opinion to opinion or vice-versa) switching in case of both code-switched and monolingual multi-component tweets in around 40% cases. Polarity switching appears to be a prevalent switching phenomenon (10%) specifically in code-switched tweets (three to four times higher than monolingual multi-component tweets) where preference of expressing negative sentiment in Hindi is approximately twice compared to English. Positive reinforcement appears to be an important pragmatic function for English multi-component tweets whereas negative reinforcement plays a key role for Devanagari multi-component tweets. Our results also indicate that the extent and nature of code-switching also strongly depend on the topic (sports, politics etc.) of discussion.
Part-of-Speech (POS) tagging is a well-established technology for most West European languages and a few other world languages, but it has not been evaluated on Igbo, an agglutinative African language. This article presents POS tagging experiments conducted using an Igbo corpus as a test bed for identifying the POS taggers and the Machine Learning (ML) methods that can achieve good performance with the small data set available for the language. Experiments have been conducted using different well-known POS taggers developed for English or European languages, and different training data styles and sizes. Igbo has a number of language-specific characteristics that present a challenge for effective POS tagging. One interesting case is the wide use of verbs (and nominalisations thereof) that have an inherent noun complement; these form linked pairs in the POS tagging scheme but may appear discontinuously. Another issue is Igbo's highly productive agglutinative morphology, which can produce many variant word forms from a given root. This productivity is a key cause of the out-of-vocabulary (OOV) words observed during Igbo tagging. We report results of experiments on a promising direction for improving tagging performance on such morphologically-inflected OOV words.
Singlish can be interesting to the computational linguistics community both linguistically, as a major low-resource creole based on English, and computationally, for information extraction and sentiment analysis of regional social media. In our conference paper, Wang et al., we investigated part-of-speech (POS) tagging and dependency parsing for Singlish by constructing a treebank under the Universal Dependencies scheme, and successfully used neural stacking models to integrate English syntactic knowledge for boosting Singlish POS tagging and dependency parsing, achieving state-of-the-art accuracies of 89.50% and 84.47% for Singlish POS tagging and dependency parsing, respectively. In this work, we substantially extend Wang et al. by enlarging the Singlish treebank to more than triple the size and with much more diversity in topics, as well as by further exploring neural multi-task models for integrating English syntactic knowledge. Results show that the enlarged treebank achieves significant relative error reductions of 45.8% and 15.5% on the base model, 27% and 10% on the neural multi-task model, and 21% and 15% on the neural stacking model for POS tagging and dependency parsing, respectively. Moreover, the state-of-the-art Singlish POS tagging and dependency parsing accuracies have been improved to 91.45% and 85.57%, respectively. We make our treebanks and models available for further research.
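The "relative error reduction" figures quoted in the abstract above can be related to accuracies through the usual definition: the drop in error rate divided by the baseline error rate. The following is a minimal sketch of that arithmetic only; the function name and the example accuracies are hypothetical and are not taken from the paper, whose exact comparison pairs are not given in the abstract.

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Relative error reduction (in percent) between two accuracies in percent.

    Assumes the common definition: (baseline error - new error) / baseline error.
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    if baseline_err <= 0.0:
        raise ValueError("baseline accuracy must be below 100%")
    return 100.0 * (baseline_err - new_err) / baseline_err


# Hypothetical accuracies, chosen only to illustrate the formula.
print(relative_error_reduction(80.0, 85.0))
```

With these illustrative inputs, the error rate falls from 20% to 15%, which is a quarter of the baseline error.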
Transfer parsing has been used for developing dependency parsers for languages with no treebank using transfer from treebanks of other languages (source languages). In delexicalized transfer parsing the words are replaced by their part-of-speech tags. Transfer parsing may not work well if a language does not follow uniform syntactic structure with respect to its different constituent patterns. Earlier work has used information derived from linguistic databases to transform a source language treebank to reduce the syntactic differences between the source and the target languages. We propose a transformation method where a source language pattern is transformed stochastically to one of the multiple possible patterns followed in the target language. The transformed source language treebank can be used to train a delexicalized parser in the target language. We show that this method significantly improves average performance of single-source delexicalized transfer parsers. We also propose a multi-source transfer parsing approach by concatenating transformed source language treebanks and show that the multi-source parsers work better when using a subset of the source language treebanks rather than all of them or only one. The treebanks are selected greedily based on the labelled attachment scores of the corresponding single-source parser trained using the treebank after transformation.
Word Sense Disambiguation (WSD) aims to automatically predict the correct sense of a word used in a particular sentence. All human languages exhibit word sense ambiguity and resolving this ambiguity can be difficult. Standard benchmark resources are required to develop, compare and evaluate WSD techniques. These are available for many languages but not for Urdu, despite this being a language with more than 300 million speakers and large volumes of text available digitally. To fill this gap, this study proposes a novel benchmark corpus for the Urdu All-Words WSD task. The corpus contains running text of 5,042 words of Urdu, in which all ambiguous words (856 instances) are manually tagged with senses from the Urdu Lughat dictionary. A range of baseline WSD models based on n-grams are applied to the corpus and the best performance (with an accuracy of 57.71%) is achieved using word 4-grams. The corpus is freely available to the research community to encourage further WSD research in Urdu.
Opinion mining or sentiment analysis continues to gain interest in industry and academics. While there has been significant progress in developing models for sentiment analysis, the field remains an active area of research for many languages across the world, and in particular for the Arabic language which is the 5th most spoken language, and has become the 4th most used language on the Internet. With the flurry of research activity in Arabic opinion mining, several researchers have provided surveys to capture advances in the field. While these surveys capture a wealth of important progress in the field, the fast pace of advances in machine learning and natural language processing (NLP) necessitates a continuous need for more up-to-date literature survey. The aim of this paper is to provide a comprehensive literature survey for state-of-the-art advances in Arabic opinion mining. The survey goes beyond surveying previous works that were primarily focused on classification models. Instead, this paper provides a comprehensive system perspective by covering advances in different aspects of an opinion mining system, including advances in NLP software tools, lexical sentiment and corpora resources, classification models and applications of opinion mining. It also presents future directions for opinion mining in Arabic. The survey also covers latest advances in the field, including deep learning advances in Arabic Opinion Mining. The paper provides state-of-the-art information to help new or established researchers in the field as well as industry developers who aim to deploy an operational complete opinion mining system. Key insights are captured at the end of each section for particular aspects of the opinion mining system giving the reader a choice of focusing on particular aspects of interest.
Semantic information, which has been proven necessary for the resolution of common noun phrases, is typically ignored by most existing Chinese zero pronoun resolvers. This is because zero pronouns convey no descriptive information, which makes it almost impossible to calculate semantic similarities between a zero pronoun and its candidate antecedents. Moreover, most traditional approaches are based on the single-candidate model, which considers the candidate antecedents of a zero pronoun in isolation and thus overlooks their reciprocities. To address these problems, we first propose a neural network-based zero pronoun resolver (NZR) that is capable of generating vector-space semantics of zero pronouns and candidate antecedents. On the basis of NZR, we develop a collaborative filtering-based framework for the Chinese zero pronoun resolution task, exploring the reciprocities between the candidate antecedents of a zero pronoun to re-estimate their importance more rationally. Experimental results on the Chinese portion of the OntoNotes corpus are encouraging: our proposed model substantially surpasses the Chinese zero pronoun resolution baseline systems.
Although neural machine translation (NMT) has certain capability to implicitly learn semantic information of sentences, we explore and show that Part-of-Speech (POS) tags can be explicitly incorporated into the attention mechanism of NMT effectively to yield further improvements. More specifically, in this paper, we propose a NMT model with tag-enhanced attention mechanism. In our model, NMT and POS tagging are jointly modeled via multi-task learning, and the predicted POS tags are used to improve the attention model of NMT. Besides following common practice to enrich encoder annotations by introducing predicted source POS tags, we exploit predicted target POS tags to refine attention model in a coarse-to-fine manner. Specifically, we first implement a coarse attention operation solely on source annotations and target hidden state, where the produced context vector is applied to update target hidden state used for target POS tagging. Then, we perform a fine attention operation which extends the coarse one by further exploiting the predicted target POS tags. Finally, we facilitate word prediction by simultaneously utilizing the context vector from fine attention and the predicted target POS tags. Experimental results and further analyses on Chinese-English and Japanese-English translation tasks demonstrate the superiority of our proposed model over the conventional NMT models. We release our code at https://github.com/middlekisser/PEA-NMT.git.
Deep contextualized word embeddings (short for ELMo), as an emerging and effective replacement for the static word embeddings, have achieved success on a bunch of syntactic and semantic NLP problems. However, little is known about what is responsible for the improvements. In this paper, we focus on the effect of ELMo for a typical syntax problem -- universal POS tagging and dependency parsing. We incorporate ELMo as additional word embeddings into the state-of-the-art POS tagger and dependency parser, and it leads to consistent performance improvements. Experimental results show the model using ELMo outperforms the state-of-the-art baseline by an average 0.91 for POS tagging and 1.11 for dependency parsing. Further analysis reveals that the improvements mainly result from the ELMo's better abstraction ability on the out-of-vocabulary (OOV) words, and this ability is achieved by the character-level word representation in ELMo. Based on ELMo's advantage on OOV, experiments that simulate low-resource settings are conducted and the results show that deep contextualized word embeddings are effective for data-insufficient tasks where the OOV problem is severe.
Understanding causality in text is crucial for intelligent agents. In this paper, inspired by human causality learning, we propose an experience-based causality learning framework. Compared with traditional approaches, which attempt to handle the causality problem by relying on textual clues and linguistic resources, we are the first to use experience information for causality learning. Specifically, we first construct various scenarios for intelligent agents so that the agents can gain experience from interaction in these scenarios. Then, human participants build a number of training instances for the agents' causality learning based on these scenarios. Each instance contains two sentences and a label. Each sentence describes an event that an agent experienced in a scenario, and the label indicates whether the sentence (event) pair stands in a causal relation. Accordingly, we propose a model that can infer causality in text using experience by accessing the corresponding event information based on the input sentence pair. Experimental results show that our method achieves impressive performance on the grounded causality corpus and significantly outperforms conventional approaches. Our work suggests that experience is very important for intelligent agents to understand causality.
In this paper, we study the problem of parsing a math problem into logical forms, an essential pre-processing step for automatically solving math problems. Most existing studies on semantic parsing focus on the single-sentence level. However, to parse a math problem, we need to take information from multiple sentences into consideration. To this end, we formulate the task as a machine translation problem and extend the sequence-to-sequence model with a novel two-encoder architecture and a word-level selective mechanism. For training and evaluating the proposed method, we construct a large-scale dataset. Experimental results show that the proposed two-encoder architecture and word-level selective mechanism bring significant improvements. The proposed method achieves better performance than the state-of-the-art methods.
Neural machine translation (NMT) has made remarkable progress in recent years, but its performance suffers from a data sparsity problem, since large-scale parallel corpora are readily available only for high-resource languages (HRLs). Recently, transfer learning (TL) has been widely used in machine translation for low-resource languages (LRLs), and it is becoming one of the vital directions for addressing the data sparsity problem in low-resource NMT. Transfer learning in NMT is generally performed by initializing the low-resource model (child) with the high-resource model (parent). However, applying the original TL to low-resource models can neither make full use of multiple highly related HRLs nor receive different parameters from the same parents. To exploit multiple HRLs effectively, we present a language-independent and straightforward multi-round transfer learning (MRTL) approach to low-resource NMT. In addition, to reduce the differences between high-resource and low-resource languages at the character level, we introduce a unified transliteration method for various language families, which are highly analogous to each other both semantically and syntactically. Experiments on low-resource datasets show that our approaches are effective, significantly outperform the state-of-the-art methods, and yield improvements of up to 5.63 BLEU points.
This paper addresses machine translation from Chinese to Catalan using neural pivot strategies trained without any direct parallel data. Catalan is very similar to Spanish from a linguistic point of view, which motivates the use of Spanish as the pivot language. For the neural architecture, we use the latest state of the art, the Transformer model, which is based only on attention mechanisms. Additionally, this work provides new resources to the community, consisting of a human-developed gold standard of 4,000 sentences between Catalan and Chinese and all the other United Nations official languages (Arabic, English, French, Russian and Spanish). Results show that the standard pseudo-corpus (synthetic) pivot approach performs better than the cascade approach and is only 6 BLEU points behind the direct Chinese-to-Spanish machine translation system.
Fine-grained sentiment analysis is a useful tool for producers to understand consumers' needs as well as complaints to products and related aspects from online platforms. In this paper, we define a novel task named "Multi-Entity Aspect-Based Sentiment Analysis (ME-ABSA)" to investigate the sentiment towards entities and their related aspects, making the well-studied aspect-based sentiment analysis a special case of this one, where the number of entities is limited to one. We contribute a new dataset for this task, with multi-entity Chinese posts in it. We propose to model context, entity and aspect memory to address the task and incorporate dependency information for further improvement. Experiments show that our methods perform significantly better than baseline methods on datasets for both ME-ABSA task and ABSA task. The in-depth analysis further validates the effectiveness of our methods and shows that our methods are capable of generalizing to new (entity, aspect) combinations with little loss of accuracy. This observation indicates that data annotation in real applications can be largely simplified.
Most of the syntax-based metrics obtain the similarity by comparing the sub-structures extracted from the trees of hypothesis and reference. These sub-structures cannot represent all the information in the trees because their lengths are limited. To sufficiently use the reference syntax information, a new automatic evaluation metric is proposed based on dependency parsing model. First, a dependency parsing model is trained using the reference dependency tree for each sentence. Then, the hypothesis is parsed by this dependency parsing model and the corresponding hypothesis dependency tree is generated. The quality of hypothesis can be judged by the quality of the hypothesis dependency tree. Unigram F-score is included in the new metric so that lexicon similarity is obtained. According to experimental results, the proposed metric can perform better than METEOR and BLEU on system level, and get comparable results with METEOR on sentence level. To further improve the performance, we also propose a combined metric which gets the best performance on sentence level and on system level.
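The unigram F-score mentioned in the abstract above as the lexical component of the metric can be illustrated with a short sketch. This is a minimal version under stated assumptions: whitespace tokenization, clipped unigram counts, and a balanced F-measure by default. The function name and the `beta` parameter are hypothetical, and the abstract does not say how the paper weights precision against recall or how it combines this score with the dependency-tree score.

```python
from collections import Counter


def unigram_f_score(hypothesis: str, reference: str, beta: float = 1.0) -> float:
    """Unigram F-score between a hypothesis and a reference translation.

    Tokens are split on whitespace; matches are clipped so that a word
    repeated in the hypothesis is credited at most as often as it occurs
    in the reference.
    """
    hyp_counts = Counter(hypothesis.split())
    ref_counts = Counter(reference.split())
    # Multiset intersection keeps the minimum count of each shared token.
    overlap = sum((hyp_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp_counts.values())
    recall = overlap / sum(ref_counts.values())
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


print(unigram_f_score("the cat", "the cat sat"))
```

In the example, both hypothesis words match (precision 1), but only two of the three reference words are covered (recall 2/3), so the score falls below 1.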
Ancient Chinese carries the wisdom and spiritual culture of the Chinese nation. Automatic translation from ancient Chinese to modern Chinese helps to inherit and carry forward the quintessence of the ancients. However, the lack of a large-scale parallel corpus limits the study of Ancient-Modern Chinese machine translation. In this paper, we propose an Ancient-Modern Chinese clause alignment approach based on the characteristics of these two languages. This method combines lexical and statistical information and achieves a 94.2 F1-score on our manually annotated test set. We use this method to create a new large-scale Ancient-Modern Chinese parallel corpus that contains over 1.24M bilingual pairs. To the best of our knowledge, this is the first large high-quality Ancient-Modern Chinese dataset. Furthermore, we analyze and compare the performance of SMT and various NMT-based models on this dataset and provide a strong baseline for this task.
This paper presents a comprehensive study on Burmese (Myanmar) morphological analysis, from annotated data preparation to experiment-based investigation. Twenty thousand Burmese sentences in news field are annotated with morphological information as one component of the Asian Language Treebank Project. The annotation includes two-layer tokenization and part-of-speech (POS) tagging, to provide rich information on the morphological level and on the syntactic constituent level. The annotated corpus has been released under a CC BY-NC-SA license, and it is the largest open-access database of annotated Burmese when this manuscript was prepared in 2017. Detailed descriptions of the preparation, refinement, and features of the annotated corpus are provided in the first half of the paper. Facilitated by the deliberately prepared corpus, experiment-based investigations of Burmese morphological analysis are presented in the second half of the paper, wherein the standard sequence-labeling approach for conditional random fields and a long short-term memory (LSTM) based recurrent neural network (RNN) are applied and discussed. We obtained several general conclusions on the Burmese morphological analysis task, covering the scheme design of output tags, effect of joint tokenization and POS-tagging, and importance of ensemble from the viewpoint of stabilizing the performance of LSTM-based RNN. This study provides a solid basis for further studies on Burmese processing. Owing to the present study, in terms of morphological analysis, Burmese should no longer be referred to as a low-resourced or under-studied language.
Modern Standard Arabic and the Arabic dialects are usually written without diacritics. The absence of these marks constitutes a real problem for the automatic processing of such data by NLP tools. Indeed, writing Arabic without diacritics introduces several types of ambiguity. Firstly, a word without diacritics could have many possible meanings depending on its diacritization. Secondly, the undiacritized surface form of an Arabic word might have as many as 200 readings depending on the complexity of its morphology. In fact, the agglutination property of Arabic might produce a problem that can only be resolved using diacritics. Thirdly, without diacritics a word could have many possible POS tags instead of one. This is the case with words that have the same spelling and POS tag but a different lexical sense, or words that have the same spelling but different POS tags and lexical senses. Finally, there is ambiguity at the grammatical level (syntactic ambiguity). In this paper, we present the first work that investigates the automatic diacritization of Tunisian Dialect texts. We first describe our annotation guidelines and procedure. Then, we propose two major models, namely a statistical machine translation (SMT) model and a discriminative model that treats diacritization as a sequence classification task based on Conditional Random Fields (CRFs). In the second approach, we integrate POS features to influence the generation of diacritics. Diacritics restoration was performed at both the word and the character levels. The CRF system gave the better results, with a word error rate (WER) of 21.44% against 34.6% for SMT.
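The WER figures above can be read with the following minimal sketch in mind. It assumes the definition commonly used for diacritization, namely the percentage of words whose restored form differs from the reference in any way, and it assumes that the system output has the same number of tokens as the reference. The function name and the transliterated sample words are hypothetical, and the exact scoring conventions of the paper may differ.

```python
def diacritization_wer(reference: list[str], hypothesis: list[str]) -> float:
    """Percentage of words whose diacritized form differs from the reference.

    Assumes both sequences are tokenized identically, since diacritization
    changes the marks on a word but not the number of words.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    if not reference:
        return 0.0
    errors = sum(ref != hyp for ref, hyp in zip(reference, hypothesis))
    return 100.0 * errors / len(reference)


if __name__ == "__main__":
    # Latin transliterations stand in for diacritized words (illustrative only).
    ref = ["kataba", "kitaab", "fii"]
    hyp = ["kutiba", "kitaab", "fii"]
    print(f"WER = {diacritization_wer(ref, hyp):.2f}%")
```

Under this definition a single wrong diacritic makes the whole word count as an error, which is why WER is typically higher than a character-level error rate on the same output.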
Statistical machine translation (SMT) models require large bilingual corpora to produce high-quality results. Nevertheless, such large bilingual corpora are unavailable for most language pairs. In this work, we enhance SMT for low-resource languages using semantic similarity. Specifically, we focus on two strategies: sentence alignment and pivot translation. For sentence alignment, we use the representative method that is based on sentence length and word alignment as a baseline. We utilize word2vec to extract word similarity from monolingual data to improve the word alignment phase of the baseline method. The proposed sentence alignment algorithm is used to build bilingual corpora from Wikipedia. In pivot translation, the representative method, called triangulation, connects source phrases to target phrases via common pivot phrases in the source-pivot and pivot-target phrase tables. Nevertheless, it may lose information when pivot phrases have the same meaning but do not match each other exactly. Therefore, we use the similarity between pivot phrases to improve the triangulation method. Finally, we introduce a framework that combines the two proposed algorithms to improve SMT for low-resource languages. We conduct experiments on low-resource languages including Japanese-Vietnamese and Southeast Asian languages (Indonesian, Malay, Filipino, and Vietnamese). Experimental results show that our proposed methods of sentence alignment and pivot translation based on semantic similarity improve on the baseline methods. The proposed framework significantly improves baseline SMT models trained on small bilingual corpora.
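The baseline triangulation step can be sketched as follows, assuming the standard formulation in which the source-target probability is obtained by summing the product of the source-pivot and pivot-target probabilities over pivot phrases that match exactly. The nested-dictionary phrase-table layout, the function name, and the toy phrases are assumptions for illustration; the similarity-based extension proposed in the paper is not shown.

```python
from collections import defaultdict


def triangulate(src_pivot: dict, pivot_tgt: dict) -> dict:
    """Build a source-target table from source-pivot and pivot-target tables.

    src_pivot[s][p] holds p(pivot p | source s) and pivot_tgt[p][t] holds
    p(target t | pivot p). Only pivot phrases that are string-identical in
    both tables contribute, which is the limitation noted above.
    """
    src_tgt = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_pivot.items():
        for pivot, p_pivot_given_src in pivots.items():
            # A pivot phrase missing from the second table contributes nothing.
            for tgt, p_tgt_given_pivot in pivot_tgt.get(pivot, {}).items():
                src_tgt[src][tgt] += p_pivot_given_src * p_tgt_given_pivot
    return {src: dict(targets) for src, targets in src_tgt.items()}


if __name__ == "__main__":
    # Toy tables with placeholder phrases, not real data.
    src_pivot = {"src_a": {"piv_x": 0.5, "piv_y": 0.5}}
    pivot_tgt = {
        "piv_x": {"tgt_1": 1.0},
        "piv_y": {"tgt_1": 0.5, "tgt_2": 0.5},
    }
    print(triangulate(src_pivot, pivot_tgt))
```

Because the join is on exact strings, two pivot phrases that are paraphrases of each other never connect a source phrase to a target phrase, and the corresponding probability mass is lost.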