Language analysis is an essential need for native speakers to engage with the digital world. Assamese being a largely unexplored language, in this report we analyze different aspects of it from the perspective of speech-to-text processing: building a speech corpus, defining syllabification rules, and finally developing a speech search engine for Assamese. We have collected almost 20 hours of speech in three modes (viz., read, extempore, and conversation) and transcribed it. We also discuss some issues and challenges faced during development of the corpus. We have developed an automatic syllabification model with twelve rules for the Assamese language and obtained an accuracy of more than 95%. Of the twelve patterns, five are the most frequent, and syllables are at most four letters long. With the help of HTK 3.5, we used deep-learning-based neural network models for speech recognition, obtaining 78.05% accuracy for automatic transcription of Assamese speech.
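The rule-based syllabification step could be sketched roughly as follows. The rules here are a generic onset-maximization illustration over romanized text, not the paper's actual twelve Assamese patterns, and the vowel inventory is an assumption for the example.

```python
# Hypothetical rule-based syllabifier (illustration only, not the
# paper's twelve Assamese rules): each syllable is built around one
# vowel nucleus; a lone consonant between vowels starts the next
# syllable (V.CV), while a cluster is split after its first
# consonant (VC.CV). Word-final consonants attach to the last syllable.
VOWELS = set("aeiou")  # assumed romanized vowel set

def syllabify(word):
    syllables = []
    i, n = 0, len(word)
    while i < n:
        syl = ""
        # onset: consonants before the vowel
        while i < n and word[i] not in VOWELS:
            syl += word[i]
            i += 1
        # nucleus: the vowel itself
        if i < n:
            syl += word[i]
            i += 1
        # coda: inspect the consonant cluster that follows
        j = i
        while j < n and word[j] not in VOWELS:
            j += 1
        if j == n:                 # trailing consonants close the word
            syl += word[i:j]
            i = j
        elif j - i >= 2:           # VC.CV: first consonant closes the syllable
            syl += word[i]
            i += 1
        # j - i <= 1 with a vowel after: the next syllable takes the onset
        syllables.append(syl)
    return syllables
```

For example, `syllabify("guwahati")` yields `["gu", "wa", "ha", "ti"]` under these assumed rules; a real Assamese model would operate on the native script and its specific consonant-cluster patterns.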
The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. Pivot-language and cognate-recognition approaches have proven useful for inducing bilingual lexicons for such languages. We propose a constraint-based bilingual lexicon induction method for closely related languages that extends the constraints of a recent pivot-based induction approach and enables multiple constraint cycles to reach more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. In this paper, we use four datasets consisting of one Austronesian low-resource language and three Indo-European high-resource languages. We use the Inverse Consultation method and translation pairs generated from the Cartesian product of the input dictionaries as baselines, and evaluate our results with precision, recall, and F-score. Our method automatically finds the best threshold on the total cost of violating constraints to obtain the highest F-score, but users may specify the threshold and the number of n-cycles according to their preferences and priorities. Our method outperforms the baselines in F-score on all of our datasets. Based on these results, our method shows promise as a complement to other bilingual dictionary creation methods, such as word alignment models using parallel corpora for high-resource languages, and is especially promising for enriching low-resource languages.
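The Cartesian-product baseline and the precision/recall/F-score evaluation described above can be sketched as follows; the dictionary representation (word mapped to a list of translations) is an assumption for illustration, not the paper's data format.

```python
# Sketch of the Cartesian-product baseline: every source word and
# target word that share a pivot-language entry become a candidate
# translation pair. Dictionary format (word -> list of translations)
# is assumed for this illustration.
def cartesian_baseline(src_to_pivot, pivot_to_tgt):
    pairs = set()
    for src, pivots in src_to_pivot.items():
        for pivot in pivots:
            for tgt in pivot_to_tgt.get(pivot, []):
                pairs.add((src, tgt))
    return pairs

def precision_recall_f(predicted, gold):
    """Standard P/R/F evaluation of induced pairs against a gold lexicon."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score
```

A constraint-based method like the one proposed would prune this candidate set by penalizing pairs whose transgraph cycles violate the cognate constraints, keeping only pairs under the cost threshold.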
Unsupervised word alignment models (such as GIZA++) are widely used in phrase-based statistical machine translation. The quality of such a model is proportional to the size and quality of the bilingual corpus. However, for low-resource language pairs such as Chinese-Vietnamese, unsupervised word alignment sometimes yields low-quality results due to data sparsity. In addition, these models do not take advantage of linguistic relationships to improve alignment performance. Chinese and Vietnamese are of the same language type and have close linguistic relationships. In this paper, we integrate characteristics of these linguistic relationships, namely Sino-Vietnamese words and content words, into the word alignment model to enhance the quality of Chinese-Vietnamese word alignment. The experimental results show that our method improves the performance of word alignment as well as the quality of machine translation.
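One simple way to exploit a cognate relationship such as Sino-Vietnamese, sketched here purely as an illustration (the paper's actual integration into the alignment model may differ), is to pre-link known cognate pairs in each sentence pair so that they anchor the alignment before or alongside unsupervised training:

```python
# Illustrative sketch: anchor word alignment with a Sino-Vietnamese
# cognate list. Any (Chinese token, Vietnamese token) pair found in the
# cognate list is recorded as a fixed alignment link; an unsupervised
# aligner could then be constrained or initialized with these links.
def seed_alignments(zh_tokens, vi_tokens, cognate_pairs):
    """Return (i, j) index pairs where a known cognate appears on
    both sides of the sentence pair."""
    links = []
    for i, zh in enumerate(zh_tokens):
        for j, vi in enumerate(vi_tokens):
            if (zh, vi) in cognate_pairs:
                links.append((i, j))
    return links
```

Content-word information could be used similarly, e.g. by restricting candidate links so that content words align preferentially with content words.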
This paper discusses the process of automating the construction of Arabic multi-dialect speech corpora using Voice over Internet Protocol (VoIP). The Asterisk framework was adopted to act as the main connection between the parties, for which two virtual machines were created: a sender and a receiver. The sender makes a VoIP call to the receiver using the Asterisk framework, while the receiver records the call automatically, a process repeated for every audio file in the corpora. In this work, more than sixty-seven thousand automatic calls were made between the sender and receiver machines, resulting in VoIP Arabic corpora covering four Arabic dialects. The resulting corpora can be considered the first Arabic VoIP parallel speech corpora and will be made freely available to researchers in Arabic NLP and speech recognition.
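The automated calling loop could be driven by Asterisk call files, a standard Asterisk mechanism in which dropping a `.call` file into the outgoing spool directory makes Asterisk originate a call. The sketch below generates one such file per audio prompt; the channel name and the use of the `Playback` application are assumptions about the setup, not details taken from the paper.

```python
# Sketch (assumed setup): generate an Asterisk call file for each audio
# prompt. Asterisk originates a call for any file placed in its
# /var/spool/asterisk/outgoing directory; here the call simply plays
# the prompt to the receiver, which records it on its side.
def make_call_file(audio_name, channel="SIP/receiver"):
    """Return the text of an Asterisk .call file that dials `channel`
    and plays back `audio_name` once the call is answered."""
    lines = [
        f"Channel: {channel}",      # where Asterisk should place the call
        "MaxRetries: 2",            # retry failed calls a couple of times
        "Application: Playback",    # run Playback when the call is answered
        f"Data: {audio_name}",      # the sound file to play
    ]
    return "\n".join(lines) + "\n"

def call_files_for_corpus(audio_names):
    """One call file per audio file, mirroring the repeated-call loop."""
    return {name: make_call_file(name) for name in audio_names}
```

Writing each returned string to the spool directory (ideally via an atomic move, so Asterisk never reads a half-written file) would reproduce the repeated sender-to-receiver calls at corpus scale.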
A notably challenging problem in emotion analysis is recognizing the cause of an emotion. Although there have been a few studies on emotion cause detection, most of them work on news reports, and the few that focus on microblogs assume a single-user structure (i.e., all texts in a microblog are written by the same user). In this paper, we focus on emotion cause detection for Chinese microblogs with a multiple-user structure (i.e., texts in a microblog are written successively by several users). First, since the causes of a focused user's emotion may be provided by other users in a microblog with the multiple-user structure, we design an emotion cause annotation scheme that can handle such complicated cases, and use it to build an emotion cause corpus. Second, based on an analysis of the emotion cause corpus, we formalize two emotion cause detection tasks for microblogs: current-subtweet-based and original-subtweet-based emotion cause detection. Furthermore, to examine the difficulty of the two tasks and the contributions of texts written by different users in a microblog with the multiple-user structure, we apply two popular classification methods (SVM and LSTM) to emotion cause detection. Our experiments show that current-subtweet-based emotion cause detection is much more difficult than original-subtweet-based emotion cause detection, and that texts written by different users are very helpful for both tasks. This work presents a pilot study of emotion cause detection for Chinese microblogs with a complicated structure.
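The classification setup could be framed roughly as clause-level binary classification, as sketched below for the SVM-style feature-based variant; the feature names and framing are assumptions for illustration, not the paper's actual feature set. The `same_author` feature reflects the multiple-user structure: whether a candidate cause clause was written by the focused user or by another user in the thread.

```python
# Illustrative sketch (assumed framing, not the paper's features):
# emotion cause detection cast as binary classification over clauses.
# Each clause in the microblog thread is labeled cause / not-cause
# relative to the focused user's emotion expression.
def clause_features(clause, emotion_word, author, focus_author):
    """Simple hand-crafted features for one candidate cause clause."""
    tokens = clause.split()
    return {
        "length": len(tokens),
        "has_emotion_word": emotion_word in tokens,
        # multiple-user cue: was this clause written by the focused user?
        "same_author": author == focus_author,
    }

def featurize_thread(clauses, emotion_word, focus_author):
    """Feature dicts for every (clause, author) pair in a thread; a
    linear classifier such as an SVM would be trained on these."""
    return [
        clause_features(text, emotion_word, author, focus_author)
        for text, author in clauses
    ]
```

An LSTM variant would instead consume the token sequences of each subtweet directly, but the clause-level cause / not-cause labeling would stay the same.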