Recently, quality estimation has been attracting increasing interest from machine translation researchers, with the aim of finding a good estimator of the quality of machine translation output. The common approach to quality estimation is to treat the problem as a supervised classification task, using a quality-annotated parallel corpus, called quality estimation data, as training data. However, the available amount of quality estimation data for training remains small, due to the exorbitant cost involved in creating such data. In addition, most conventional quality estimation approaches rely on manually designed features to model nonlinear relationships between feature vectors and the corresponding quality labels. To overcome these problems, this paper proposes a novel neural network architecture for the quality estimation task, called the predictor-estimator, that considers word prediction as an additional pre-task. The major component of the proposed neural architecture is a word prediction model based on a modified neural machine translation model: a probabilistic model that predicts a target word conditioned on the source and all other target contexts. Our proposed quality estimation method sequentially trains two types of neural models: 1) the Predictor, a neural word prediction model trained from parallel corpora, and 2) the Estimator, a neural quality estimation model trained from quality estimation data. To transfer the word prediction task to the quality estimation task, we generate quality estimation feature vectors from the word prediction model and feed them into the quality estimation model. The results of experiments conducted on the WMT15 and WMT16 quality estimation datasets indicate that our proposed method is promising, as it achieves state-of-the-art performance on the various sub-challenges.
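To make the two-stage design concrete, the following is a minimal sketch (in PyTorch) of a predictor that derives per-word quality estimation feature vectors from source and target context, and an estimator that maps those features to a sentence-level score. The layer choices, dimensions, and the mean-pooled source context are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Word prediction model: encodes the source sentence and the target
    context; its hidden states serve as quality estimation (QE) features."""
    def __init__(self, vocab, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.src_enc = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.tgt_enc = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        self.out = nn.Linear(4 * dim, vocab)      # scores each target word

    def features(self, src, tgt):
        s, _ = self.src_enc(self.emb(src))        # (B, S, 2*dim)
        t, _ = self.tgt_enc(self.emb(tgt))        # (B, T, 2*dim)
        ctx = s.mean(dim=1, keepdim=True).expand(-1, t.size(1), -1)
        return torch.cat([t, ctx], dim=-1)        # per-word QE feature vectors

    def forward(self, src, tgt):
        return self.out(self.features(src, tgt))  # word prediction logits

class Estimator(nn.Module):
    """Quality estimation model: summarizes the QE feature vectors into a
    sentence-level quality score (e.g., HTER regression)."""
    def __init__(self, feat_dim, dim=64):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, dim, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):
        _, (h, _) = self.rnn(feats)
        return torch.sigmoid(self.score(h[-1])).squeeze(-1)

# Stage 1: train the Predictor on parallel data (cross-entropy over target words).
# Stage 2: feed its features into the Estimator, trained on quality estimation data.
predictor = Predictor(vocab=1000)
estimator = Estimator(feat_dim=4 * 128)
src = torch.randint(0, 1000, (2, 7))
tgt = torch.randint(0, 1000, (2, 9))
quality = estimator(predictor.features(src, tgt))  # one quality score per sentence
```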
Named Entity Recognition and Classification (NERC) is the process of identifying named entities in text and classifying them into categories such as person names, location names, and organization names. In this paper, we discuss the development of an Urdu Named Entity (NE) corpus, called the Kamran-PU-NE (KPU-NE) corpus, for three entity types, i.e., Person (PER), Organization (ORG), and Location (LOC), with the remaining tokens marked as Others (O). We use two supervised learning algorithms, the Hidden Markov Model (HMM) and an Artificial Neural Network (ANN), to develop the Urdu NERC system. We annotate the 652,852-token corpus, drawn from 15 different genres, with a total of 44,480 NEs. The inter-annotator agreement between the two annotators, in terms of the Kappa statistic, is 73.41%. With the HMM, the highest recorded precision, recall, and F-measure values are 55.98%, 83.11%, and 66.90%, respectively, and with the ANN, they are 81.05%, 87.54%, and 84.17%, respectively.
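As an illustration of how the HMM variant of such a tagger can be trained and decoded, the sketch below builds a bigram HMM with add-alpha smoothing and Viterbi decoding over PER/ORG/LOC/O labels; the toy English data and smoothing scheme are assumptions for demonstration, not the paper's Urdu corpus or settings.

```python
from collections import defaultdict
import math

def train_hmm(tagged_sentences, alpha=0.1):
    """Estimate add-alpha smoothed transition and emission log-probabilities."""
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    tags, vocab = set(), set()
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            tags.add(tag)
            vocab.add(word)
            prev = tag
    def logprob(table, row, col, cols):
        total = sum(table[row].values()) + alpha * len(cols)
        return math.log((table[row][col] + alpha) / total)
    return trans, emit, sorted(tags), vocab, logprob

def viterbi(words, model):
    """Return the most likely tag sequence for a tokenized sentence."""
    trans, emit, tags, vocab, logprob = model
    best = {t: (logprob(trans, "<s>", t, tags) +
                logprob(emit, t, words[0], vocab), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            s, path = max(((best[pt][0] + logprob(trans, pt, t, tags), best[pt][1])
                           for pt in tags), key=lambda x: x[0])
            new[t] = (s + logprob(emit, t, w, vocab), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

corpus = [[("Lahore", "LOC"), ("is", "O"), ("a", "O"), ("city", "O")],
          [("Kamran", "PER"), ("visited", "O"), ("Lahore", "LOC")]]
model = train_hmm(corpus)
print(viterbi(["Kamran", "visited", "Lahore"], model))  # -> ['PER', 'O', 'LOC']
```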
Unsupervised word alignment tools (such as GIZA++) are widely used in phrase-based statistical machine translation. The quality of the alignment model is proportional to the size and quality of the bilingual corpus. However, for low-resource language pairs such as Chinese-Vietnamese, unsupervised word alignment sometimes yields low-quality results due to data sparsity. In addition, this model does not take advantage of linguistic relationships to improve word alignment performance. Chinese and Vietnamese are of the same language type and have close linguistic relationships. In this paper, we integrate the characteristics of these linguistic relationships into the word alignment model to enhance the quality of Chinese-Vietnamese word alignment. The linguistic relationships we use are Sino-Vietnamese word correspondences and content words. The experimental results show that our method improves the performance of word alignment as well as the quality of machine translation.
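One simple way to picture how linguistic relationships can be injected into an alignment model is shown below: an IBM Model 1 EM loop whose translation table is initialized with a boost for known Sino-Vietnamese word pairs. The seeding scheme, boost value, and toy data are illustrative assumptions, not the integration actually used in the paper.

```python
from collections import defaultdict

def ibm1_with_prior(bitext, sino_viet_pairs, iterations=10, boost=5.0):
    """IBM Model 1 EM where Sino-Vietnamese pairs start with higher mass."""
    t = defaultdict(float)
    for zh, vi in bitext:                 # uniform init, boosted for known pairs
        for v in vi:
            for z in zh:
                t[(z, v)] = boost if (z, v) in sino_viet_pairs else 1.0
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for zh, vi in bitext:             # E-step: expected alignment counts
            for v in vi:
                norm = sum(t[(z, v)] for z in zh)
                for z in zh:
                    c = t[(z, v)] / norm
                    count[(z, v)] += c
                    total[z] += c
        for (z, v) in count:              # M-step: renormalize per source word
            t[(z, v)] = count[(z, v)] / total[z]
    return t

bitext = [(["中国", "人民"], ["nhân_dân", "trung_quốc"]),
          (["中国", "文化"], ["văn_hoá", "trung_quốc"])]
sino_viet = {("中国", "trung_quốc"), ("人民", "nhân_dân"), ("文化", "văn_hoá")}
t = ibm1_with_prior(bitext, sino_viet)
# Most probable Vietnamese counterpart of 中国 under the learned table.
print(max(((z, v) for (z, v) in t if z == "中国"), key=lambda p: t[p]))
```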
Some natural languages belong to the same family or share similar syntactic and/or semantic regularities. This opportunity encourages researchers to share computational models across languages and use high-quality models to boost existing low-performance alternatives. We follow a similar idea in our research. In this paper, we describe statistical and neural machine translation (MT) engines that are trained on one language pair but used to translate another language. First, we train a reliable model on a high-resource language; then we exploit cross-lingual similarities and adapt the model to work for a closely related language with almost zero resources. Our proposed solution can easily be applied to any pair of closely related languages; we choose Turkish (Tr) and Azeri, or Azerbaijani, (Az) in this regard. Azeri suffers from a lack of resources, as there is almost no bilingual corpus or MT system for this language. To the best of our knowledge, this is the first time that an Azeri MT system has been developed and evaluated. Using our models, we are able to train an engine for the Az->English (En) direction with a BLEU score of 22.30.
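The sketch below illustrates one common way to realize such adaptation, under the assumption of a shared Tr/Az subword vocabulary: a parent model is trained on the high-resource pair and a child model for the low-resource pair is initialized from its weights and then adapted. The tiny seq2seq model, hyperparameters, and data handling are assumptions for demonstration, not the engines described in the paper.

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """A deliberately small encoder-decoder standing in for the real MT engine."""
    def __init__(self, vocab=8000, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.enc = nn.GRU(dim, dim, batch_first=True)
        self.dec = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt_in):
        _, h = self.enc(self.emb(src))
        dec_out, _ = self.dec(self.emb(tgt_in), h)
        return self.out(dec_out)

def train_epoch(model, batches, lr):
    """One pass of teacher-forced cross-entropy training."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for src, tgt_in, tgt_out in batches:
        logits = model(src, tgt_in)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stage 1: parent model on the high-resource pair (a shared Tr/Az subword
# vocabulary lets the child reuse the parent's embeddings directly).
parent = TinySeq2Seq()
# train_epoch(parent, tr_en_batches, lr=1e-3)   # large Tr->En corpus (hypothetical)

# Stage 2: the child starts from the parent's weights, then is adapted with a
# smaller learning rate on whatever little Az->En data can be assembled.
child = TinySeq2Seq()
child.load_state_dict(parent.state_dict())
# train_epoch(child, az_en_batches, lr=1e-4)    # tiny Az->En data (hypothetical)
```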
Phrase-based SMT (PBSMT) is commonly used for automatic translation. However, PBSMT runs into difficulty when either or both of the source and target languages are morphologically rich. Factored models are found to be useful for such cases, as they treat a word as a vector of factors. These factors can contain any information about the surface word, which can then be used during translation. The objective of the current work is to handle morphological inflections in Hindi, Marathi, and Malayalam using factored translation models while translating from English. Statistical MT approaches face the problem of data sparsity when translating into a morphologically rich language, since it is very unlikely for a parallel corpus to contain all morphological forms of words. We propose a simple and effective solution: generate these unseen morphological forms and inject them into the original training corpus, thereby enriching the input with various morphological forms of words. We observe that morphology injection improves the quality of translation in terms of both adequacy and fluency. We verify this with experiments on the three morphologically rich languages, Hindi, Marathi, and Malayalam, while translating from English. From the detailed evaluations, we observe an order-of-magnitude improvement in translation quality.
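The mechanics of morphology injection can be sketched as follows: unseen inflected forms of target-side words are generated from a paradigm table and appended to the training corpus as additional factored entries. The toy Hindi paradigm and the surface|lemma|suffix factor layout below are illustrative assumptions, not the paper's actual resources or factor setup.

```python
def factored(surface, lemma, suffix):
    """Moses-style factored token: surface|lemma|suffix."""
    return f"{surface}|{lemma}|{suffix}"

# Hypothetical paradigm table mapping a Hindi lemma to its inflected forms.
PARADIGMS = {
    "लड़का": {"sg.dir": "लड़का", "sg.obl": "लड़के", "pl.dir": "लड़के", "pl.obl": "लड़कों"},
}

def inject_morphology(parallel_corpus):
    """Return the original corpus plus synthetic pairs covering unseen forms."""
    augmented = list(parallel_corpus)
    seen = {tok.split("|")[0] for _, tgt in parallel_corpus for tok in tgt.split()}
    for src, tgt in parallel_corpus:
        for tok in tgt.split():
            surface, lemma, _ = tok.split("|")
            for suffix, form in PARADIGMS.get(lemma, {}).items():
                if form not in seen:                      # unseen inflection
                    new_tgt = tgt.replace(tok, factored(form, lemma, suffix))
                    augmented.append((src, new_tgt))
                    seen.add(form)
    return augmented

corpus = [("the boy", factored("लड़का", "लड़का", "sg.dir"))]
for src, tgt in inject_morphology(corpus):
    print(src, "->", tgt)   # original pair plus injected inflected variants
```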
This paper discusses the process of automatically building Arabic multi-dialect speech corpora using Voice over Internet Protocol (VoIP). The Asterisk framework was adopted to act as the main connection between the parties, for which two virtual machines were created: a sender and a receiver. The sender makes a VoIP call to the receiver using the Asterisk framework, while the receiver records the call automatically, a process repeated for all the audio files involved in the corpora. In this work, more than sixty-seven thousand automatic calls were made between the sender and receiver machines, resulting in VoIP Arabic corpora for four Arabic dialects. The resulting corpora can be considered the first Arabic VoIP parallel speech corpora and will be made freely available to researchers in Arabic NLP and speech recognition.
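A possible way to automate the sender side is sketched below: for each audio file, a standard Asterisk call file is generated and moved into the outgoing spool directory, which causes Asterisk to dial the receiver and play the recording; recording then happens on the receiver's side. The paths, SIP peer name, and directory layout are assumptions for illustration and may differ from the authors' actual configuration.

```python
import os
import shutil
import tempfile

SPOOL_DIR = "/var/spool/asterisk/outgoing"   # Asterisk picks up .call files here
RECEIVER_CHANNEL = "SIP/receiver"            # hypothetical peer on the receiver VM

def make_call_file(audio_path, index):
    """Write one .call file that dials the receiver and plays a corpus recording."""
    name = os.path.splitext(os.path.basename(audio_path))[0]
    contents = "\n".join([
        f"Channel: {RECEIVER_CHANNEL}",
        "MaxRetries: 2",
        "RetryTime: 30",
        "WaitTime: 45",
        "Application: Playback",
        f"Data: {name}",                      # sound file name, without extension
        "",
    ])
    # Write to a temp file first, then move it in, so Asterisk never reads a
    # half-written call file.
    fd, tmp = tempfile.mkstemp(suffix=".call")
    with os.fdopen(fd, "w") as f:
        f.write(contents)
    shutil.move(tmp, os.path.join(SPOOL_DIR, f"corpus_{index:06d}.call"))

audio_dir = "corpus_audio"                    # hypothetical source directory
if os.path.isdir(audio_dir) and os.path.isdir(SPOOL_DIR):
    for i, wav in enumerate(sorted(os.listdir(audio_dir))):
        make_call_file(os.path.join(audio_dir, wav), i)
```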
A notably challenging problem in emotion analysis is recognizing the cause of an emotion. Although there have been a few studies on emotion cause detection, most of them work on news reports, and only a few focus on microblogs with a single-user structure (i.e., all texts in a microblog are written by the same user). In this paper, we focus on emotion cause detection for Chinese microblogs with a multiple-user structure (i.e., the texts in a microblog are successively written by several users). First, since the causes of a focused user's emotion may be provided by other users in a microblog with the multiple-user structure, we design an emotion cause annotation scheme that can deal with such a complicated case, and then build an emotion cause corpus using the annotation scheme. Second, based on analysis of the emotion cause corpus, we formalize two emotion cause detection tasks for microblogs: current-subtweet-based emotion cause detection and original-subtweet-based emotion cause detection. Furthermore, in order to examine the difficulty of the two tasks and the contributions of texts written by different users in a microblog with the multiple-user structure, we choose two popular classification methods (SVM and LSTM) for emotion cause detection. Our experiments show that current-subtweet-based emotion cause detection is much more difficult than original-subtweet-based emotion cause detection, and that texts written by different users are very helpful for both tasks. This work presents a pilot study of emotion cause detection for Chinese microblogs with a complicated structure.
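As a concrete baseline for the clause-level formulation, the sketch below casts emotion cause detection as binary classification of candidate clauses paired with the emotion keyword, using a character n-gram TF-IDF representation and a linear SVM in scikit-learn. The toy examples, the [SEP] pairing convention, and the feature design are assumptions for illustration, not the paper's exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Each example concatenates the emotion keyword with a candidate clause; in the
# multiple-user setting the clause may come from another user's subtweet.
train_texts = [
    "开心 [SEP] 朋友们都来给我庆祝生日",     # cause clause
    "开心 [SEP] 今天天气不错",               # non-cause clause
    "生气 [SEP] 快递又一次把包裹弄丢了",     # cause clause
    "生气 [SEP] 我在地铁上",                 # non-cause clause
]
train_labels = [1, 0, 1, 0]

# Character n-grams avoid the need for a Chinese word segmenter in this sketch.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LinearSVC(),
)
clf.fit(train_texts, train_labels)

test = ["开心 [SEP] 导师夸奖了我的论文"]
print(clf.predict(test))   # label 1 marks a clause predicted as the emotion cause
```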