Chinese Syntax Parsing Based on Sliding Match of Semantic String
Code-switching, the juxtaposition of linguistic units from two or more languages in a single utterance, has recently become very common in text, thanks to social media and other computer-mediated forms of communication. In this exploratory study of English-Hindi code-switching on Twitter, we automatically create a large corpus of code-switched tweets and devise techniques to identify the relationship between successive components in a code-switched tweet. More specifically, we identify pragmatic functions, such as narrative-evaluative, negative reinforcement, and translation, that characterize the relation between successive components. We analyze the differences and similarities between switching patterns in code-switched and monolingual multi-component tweets. We observe a strong dominance of narrative-evaluative (non-opinion to opinion or vice versa) switching in both code-switched and monolingual multi-component tweets, in around 40% of cases. Polarity switching appears to be a prevalent phenomenon (10%) specifically in code-switched tweets (three to four times more frequent than in monolingual multi-component tweets), where the preference for expressing negative sentiment in Hindi is approximately twice that in English. Positive reinforcement appears to be an important pragmatic function for English multi-component tweets, whereas negative reinforcement plays a key role for Devanagari multi-component tweets. Our results also indicate that the extent and nature of code-switching depend strongly on the topic of discussion (sports, politics, etc.).
Sentiment Analysis for a Resource Poor Language - Roman Urdu
Deep contextualized word embeddings (ELMo), an emerging and effective replacement for static word embeddings, have achieved success on a range of syntactic and semantic NLP problems. However, little is known about what is responsible for the improvements. In this paper, we focus on the effect of ELMo on two typical syntactic tasks: universal POS tagging and dependency parsing. We incorporate ELMo as additional word embeddings into a state-of-the-art POS tagger and dependency parser, which leads to consistent performance improvements. Experimental results show that the model using ELMo outperforms the state-of-the-art baseline by an average of 0.91 for POS tagging and 1.11 for dependency parsing. Further analysis reveals that the improvements mainly result from ELMo's better abstraction ability on out-of-vocabulary (OOV) words, and that this ability is achieved by the character-level word representation in ELMo. Based on ELMo's advantage on OOV words, we conduct experiments that simulate low-resource settings, and the results show that deep contextualized word embeddings are effective for data-insufficient tasks where the OOV problem is severe.
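To make the link between training-set size and the OOV problem concrete, the following is a minimal sketch that measures the OOV rate of a test set against a training vocabulary. The function name `oov_rate` and the toy sentences are illustrative assumptions, not part of the paper; the low-resource setting is simulated here simply by truncating the training tokens.

```python
def oov_rate(train_tokens, test_tokens):
    """Return the fraction of test tokens whose form never occurs in training."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


# Toy data (hypothetical, for illustration only).
train = "the cat sat on the mat while the dog slept".split()
test = "the dog chased a cat on the mat".split()

full = oov_rate(train, test)        # all training tokens available
small = oov_rate(train[:3], test)   # simulated low-resource setting

print(f"full training set:      OOV rate = {full:.3f}")
print(f"truncated training set: OOV rate = {small:.3f}")
```

Shrinking the training set raises the share of unseen test words, which is the regime where a character-level representation has the most room to help.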
Comparable corpora are valuable alternatives to expensive parallel corpora. They contain informative parallel fragments that are useful resources for various natural language processing tasks. In this work, a generative model is proposed for efficient extraction of parallel fragments from a pair of comparable documents. The core of the proposed model is a graph called the Matching Graph. The ability of the Matching Graph to be trained on a small initial seed makes it a suitable model for language pairs that suffer from scarce resources. Experiments show that the Matching Graph performs significantly better than other recently published models. According to the experiments on the English-Persian and Arabic-Persian language pairs, the extracted parallel fragments can be used instead of parallel data for training statistical machine translation systems. Results reveal that, in the best case, the extracted fragments recover about 90% of the information of a statistical machine translation system trained on a parallel corpus. Moreover, it is shown that using the extracted fragments as additional training data for statistical machine translation systems leads to an improvement in BLEU score of about 2% for English-Persian and about 1% for Arabic-Persian translation.
Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question answering. Recognizing the importance of NER, researchers have developed a plethora of NER techniques for Western and Asian languages. However, despite there being over 490 million Urdu speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this paper makes three key contributions. Firstly, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated NEs. The developed corpus contains more than twice as many manually tagged NEs as any existing Urdu NER corpus. Secondly, we have generated four sets of word embeddings using two different techniques, fastText and Word2vec, on two corpora of Urdu text. These are the only publicly available embeddings for the Urdu language, besides the Urdu word embeddings recently released by Facebook. Finally, we have pioneered the application of deep learning techniques, NN and RNN, to Urdu named entity recognition. Based on the analysis of the results, we provide several valuable insights into the effectiveness of deep learning techniques and the impact of word embeddings on these techniques.
Statistical machine translation (SMT) models require large bilingual corpora to produce high-quality results. However, such large bilingual corpora are unavailable for most language pairs. In this work, we enhance SMT for low-resource languages using semantic similarity. Specifically, we focus on two strategies: sentence alignment and pivot translation. For sentence alignment, we use as a baseline the representative method that is based on sentence length and word alignment. We utilize word2vec to extract word similarity from monolingual data in order to improve the word alignment phase of the baseline method. The proposed sentence alignment algorithm is used to build bilingual corpora from Wikipedia. In pivot translation, the representative method, called triangulation, connects source phrases to target phrases via common pivot phrases in the source-pivot and pivot-target phrase tables. However, it may lose information when some pivot phrases have the same meaning but are not matched to each other. Therefore, we use similarity between pivot phrases to improve the triangulation method. Finally, we introduce a framework that combines the two proposed algorithms to improve SMT for low-resource languages. We conduct experiments on low-resource languages, including Japanese-Vietnamese and Southeast Asian languages (Indonesian, Malay, Filipino, and Vietnamese). Experimental results show that our proposed methods of sentence alignment and pivot translation based on semantic similarity improve on the baseline methods. The proposed framework significantly improves on baseline SMT models trained on small bilingual corpora.
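The triangulation idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: phrase tables are modeled as nested dictionaries of probabilities, pivot phrases with different surface forms are linked when the cosine similarity of their vectors reaches a hypothetical `threshold`, the similarity is used directly as a weight, and scores are renormalized per source phrase. The toy phrases, vectors, and the names `triangulate` and `cosine` are all invented for the example.

```python
from collections import defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triangulate(src_pivot, pivot_tgt, pivot_vectors=None, threshold=0.8):
    """Build a source-target table by summing over pivot phrases.

    src_pivot:     {source: {pivot: p(pivot | source)}}
    pivot_tgt:     {pivot: {target: p(target | pivot)}}
    pivot_vectors: optional {pivot: vector}. When given, two different
                   pivot phrases are still connected if their cosine
                   similarity is at least `threshold` (an assumption).
    """
    scores = defaultdict(lambda: defaultdict(float))
    vectors = pivot_vectors or {}
    for src, pivots in src_pivot.items():
        for piv_a, p_piv in pivots.items():
            for piv_b, targets in pivot_tgt.items():
                if piv_a == piv_b:
                    weight = 1.0  # exact match: standard triangulation
                elif piv_a in vectors and piv_b in vectors:
                    sim = cosine(vectors[piv_a], vectors[piv_b])
                    weight = sim if sim >= threshold else 0.0
                else:
                    weight = 0.0
                if weight == 0.0:
                    continue
                for tgt, p_tgt in targets.items():
                    scores[src][tgt] += weight * p_piv * p_tgt
    # Renormalize so the scores for each source phrase sum to 1.
    table = {}
    for src, targets in scores.items():
        total = sum(targets.values())
        table[src] = {tgt: s / total for tgt, s in targets.items()}
    return table


# Toy phrase tables (hypothetical). "kitty" has no entry in the
# pivot-target table, so exact matching drops that path entirely.
src_pivot = {"neko": {"cat": 0.6, "kitty": 0.4}}
pivot_tgt = {"cat": {"tgt_cat": 1.0}, "kitten": {"tgt_kitten": 1.0}}
vectors = {"cat": [0.2, 1.0], "kitty": [0.9, 0.1], "kitten": [0.88, 0.15]}

exact = triangulate(src_pivot, pivot_tgt)
soft = triangulate(src_pivot, pivot_tgt, pivot_vectors=vectors)

print("exact matching:   ", exact)
print("similarity-based: ", soft)
```

With exact matching only the `cat` path survives, whereas the similarity-based variant also recovers a target through the `kitty`/`kitten` pair. The double loop over pivot phrases is quadratic and is kept only for readability; a real phrase table would need an index or nearest-neighbor search.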