A Dependency Parser for Spontaneous Chinese Spoken Language
For identifying speakers of quoted speech or extracting social networks from literature, it is indispensable to extract character names and nominals. However, detecting proper nouns in novels translated into or written in Korean is harder than in English because Korean has no capitalization. In addition, it is almost impossible for any proper noun dictionary to include all kinds of character names that have been or will be created by authors. Fortunately, a previous study showed that utilizing postpositions for animate nouns is a simple and effective tool for character identification in Korean novels without a proper noun dictionary or a training corpus. In this paper, we propose a character identification method that exploits the semantic relation with known animate nouns. For 80 Korean novels, the proposed method increases the micro- and macro-average recall by 13.68% and 11.86%, respectively, while decreasing the micro-average precision by 0.28% and increasing the macro-average precision by 0.07% compared to the previous study. If we focus on characters that account for more than 1% of the character name mentions in each novel, the micro- and macro-average F-measures of the proposed method are 96.98% and 97.32%, respectively.
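To make the postposition cue concrete, the sketch below (Python) illustrates how candidate character names could be flagged by counting how often a noun is followed by postpositions that typically attach to animate referents in Korean, such as the dative markers 에게/한테/께. This is not the paper's implementation; the cue list, function names, and input format are illustrative assumptions.

# Illustrative sketch (not the paper's method): flag candidate character names
# by counting nouns that frequently take animate-marking postpositions.
from collections import Counter

ANIMATE_POSTPOSITIONS = ("에게", "한테", "께")  # assumed cue list for illustration

def candidate_characters(noun_postposition_pairs, min_count=2):
    """noun_postposition_pairs: (noun, following postposition or None) pairs
    produced by a morphological analyzer."""
    counts = Counter()
    for noun, post in noun_postposition_pairs:
        if post in ANIMATE_POSTPOSITIONS:
            counts[noun] += 1
    # Nouns that often take animate-marking postpositions are likely characters.
    return [noun for noun, c in counts.items() if c >= min_count]

print(candidate_characters([("철수", "에게"), ("철수", "한테"), ("학교", "에")]))
# -> ['철수']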
Our research proposes an iterative Sundanese stemmer that removes derivational affixes prior to inflectional ones. This scheme was chosen because, in Sundanese affixation, a confix (a type of derivational affix) is applied in the last phase of a morphological process. Moreover, most Sundanese affixes are derivational, so removing the derivational affixes first is reasonable. To handle ambiguity, the last recognized affix is returned as the result. As the baseline, we used a Confix-Stripping Approach that applies the Porter stemmer to Indonesian. This stemmer shares similarities in terms of affix types but uses a different stemming order. To observe whether the baseline stems Sundanese affixed words properly, some features it does not cover, such as infix and allomorph removal, were added. The evaluation was done using 4,453 unique affixed words collected from Sundanese online magazines. The experiments show that, overall, our stemmer outperforms the modified baseline in terms of both recognized affix-type accuracy and properly stemmed affixed words. Our stemmer recognized 68.87% of the Sundanese affix types and correctly stemmed 96.79% of the affixed words; the modified baseline achieved 21.70% and 71.59%, respectively.
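The following sketch illustrates the stemming order described above: strip derivational affixes iteratively, then inflectional ones, and report the last recognized affix. The affix inventory is a tiny illustrative subset chosen for demonstration, not the paper's Sundanese rule set, and the length guards are assumptions.

# Minimal sketch of a two-phase iterative stemmer (derivational before
# inflectional). Affix lists below are illustrative examples only.
DERIVATIONAL_PREFIXES = ("di", "ka", "pa")   # assumed example prefixes
DERIVATIONAL_SUFFIXES = ("keun", "an")        # assumed example suffixes
INFLECTIONAL_SUFFIXES = ("na",)               # assumed example suffix

def strip_once(word, prefixes, suffixes):
    for p in prefixes:
        if word.startswith(p) and len(word) > len(p) + 2:
            return word[len(p):], p + "-"
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s) + 2:
            return word[:-len(s)], "-" + s
    return word, None

def stem(word):
    last_affix = None
    # Phase 1: iteratively remove derivational affixes.
    while True:
        word, affix = strip_once(word, DERIVATIONAL_PREFIXES, DERIVATIONAL_SUFFIXES)
        if affix is None:
            break
        last_affix = affix
    # Phase 2: remove inflectional affixes.
    word, affix = strip_once(word, (), INFLECTIONAL_SUFFIXES)
    if affix is not None:
        last_affix = affix
    # On ambiguity, the last recognized affix is reported alongside the stem.
    return word, last_affix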
Cross-lingual word embeddings represent the vocabularies of two or more languages in one common continuous vector space and are widely used in various NLP tasks. A simple yet efficient way to generate cross-lingual word embeddings is canonical correlation analysis (CCA). However, CCA assumes that the vector representations of similar words in different languages are related by a linear relationship. This assumption does not always hold, especially for substantially different languages. We therefore propose to use kernel canonical correlation analysis (KCCA) to capture non-linear relationships between the word embeddings of two languages. By extensively evaluating the resulting word embeddings on three tasks (word similarity, cross-lingual dictionary induction, and cross-lingual document classification) across five language pairs, we show that our approach produces markedly better semantic vectors than the CCA-based method, especially for substantially different languages.
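A minimal regularized KCCA sketch is given below to illustrate how non-linear relations between two embedding spaces can be captured; the RBF kernel, regularization, solver, and hyper-parameters are assumptions for demonstration, not the paper's exact setup.

# Sketch of regularized kernel CCA over aligned word-embedding matrices X, Y
# (rows are translation pairs). Projects both vocabularies into a shared space.
import numpy as np
from scipy.linalg import eigh
from sklearn.metrics.pairwise import rbf_kernel

def center(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(X, Y, n_components=2, gamma=0.1, reg=1e-3):
    Kx = center(rbf_kernel(X, gamma=gamma))
    Ky = center(rbf_kernel(Y, gamma=gamma))
    n = Kx.shape[0]
    I = np.eye(n)
    # Generalized eigenproblem coupling the two RKHS directions.
    A = np.block([[np.zeros((n, n)), Kx @ Ky],
                  [Ky @ Kx, np.zeros((n, n))]])
    B = np.block([[Kx @ Kx + reg * I, np.zeros((n, n))],
                  [np.zeros((n, n)), Ky @ Ky + reg * I]])
    vals, vecs = eigh(A, B)
    top = vecs[:, np.argsort(vals)[::-1][:n_components]]
    alpha, beta = top[:n], top[n:]
    # Shared-space projections for the source and target vocabularies.
    return Kx @ alpha, Ky @ beta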
Arabic text sentiment analysis suffers from low accuracy due to Arabic-specific challenges (e.g., limited resources, morphological complexity, and dialects) and general linguistic issues (e.g., fuzziness, implicitness, sarcasm, and spam). The limited-resources problem requires efforts to build new Arabic corpora and lexica and to improve existing ones. We propose a class-specific sentiment analysis (CLASENTI) framework. The framework includes a new annotation approach to build a multi-faceted Arabic corpus and lexicon, which are simultaneously annotated with domains, dialects, linguistic issues, and polarity strengths. The new corpus and lexicon annotations facilitate the development of a new classification model and polarity-strength calculation. For the new sentiment classification model, we propose a hybrid model combining corpus-based and lexicon-based models. The corpus-based model is built in two interrelated phases: 1) full-corpus classification models for all facets, and 2) class-specific models trained on subsets of the corpus filtered according to the performance of the full-corpus models. To calculate polarity strengths, the lexicon-based model filters the annotated lexicon based on the specific domain and dialect classes. As a case study, we collected and annotated 15,274 reviews from various sources, including surveys, Facebook comments, and Twitter posts, pertaining to governmental services in an Arab country. The CLASENTI framework reaches up to 95% accuracy and 93% F1-score, surpassing the best-known Arabic sentiment classifiers, which achieve 82% accuracy and 81% F1-score when tested on the same dataset.
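The rough scikit-learn sketch below illustrates the two-phase corpus-based idea: phase 1 trains full-corpus classifiers for the facets, and phase 2 trains class-specific sentiment models on the filtered subsets. Function names, facet choices, and model choices are illustrative assumptions, not the CLASENTI implementation.

# Sketch of a two-phase, class-specific training scheme with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_two_phase(texts, domains, dialects, polarities):
    # Phase 1: full-corpus facet classifiers (domain and dialect).
    domain_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)).fit(texts, domains)
    dialect_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)).fit(texts, dialects)
    # Phase 2: class-specific sentiment models, one per (domain, dialect) pair.
    specific = {}
    for key in set(zip(domains, dialects)):
        idx = [i for i, k in enumerate(zip(domains, dialects)) if k == key]
        if len({polarities[i] for i in idx}) < 2:
            continue  # need at least two polarity classes to train
        specific[key] = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)).fit(
            [texts[i] for i in idx], [polarities[i] for i in idx])
    return domain_clf, dialect_clf, specific

def predict_polarity(text, domain_clf, dialect_clf, specific):
    # Route the review to the sentiment model of its predicted facet classes.
    key = (domain_clf.predict([text])[0], dialect_clf.predict([text])[0])
    model = specific.get(key)
    return model.predict([text])[0] if model is not None else None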
We investigate the use of word embeddings for query translation to improve precision in Cross-Language Information Retrieval (CLIR). Word vectors represent words in a distributional space such that syntactically or semantically similar words are close to each other in this space. Multilingual word embeddings are constructed such that similar words across languages have similar vector representations. We explore the effective use of bilingual and multilingual word embeddings learned from comparable corpora of Indic languages for the task of CLIR. We propose a clustering method based on the multilingual word vectors to group similar words across languages. For this, we construct a graph with words from multiple languages as nodes and with edges connecting words with similar vectors. We use the Louvain method for community detection to find communities in this graph. We show that choosing target-language words as query translations from the clusters or communities containing the query terms helps improve CLIR. We also find that better quality query translations are obtained when words from more languages are used in the clustering, even when the additional languages are neither the source nor the target language. This is probably because having more similar words across multiple languages helps form dense, well-defined sub-clusters that yield precise query translations. In this paper, we demonstrate the use of multilingual word embeddings and word clusters for CLIR involving Indic languages. We also make available a tool for obtaining related words and visualizations of the multilingual word vectors for English, Hindi, Bengali, Marathi, Gujarati, and Tamil.
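The sketch below illustrates the clustering step: build a similarity graph over words from several languages in the shared embedding space and run Louvain community detection, then pick target-language words from the community containing the query term. The similarity threshold and the node naming scheme are assumptions, and the Louvain routine requires a recent networkx version.

# Illustrative sketch of community-based query translation over multilingual
# word vectors. Nodes are (language, word) pairs; edges link similar vectors.
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities

def build_word_graph(vectors, threshold=0.6):
    """vectors: dict mapping (lang, word) -> embedding in the shared space."""
    G = nx.Graph()
    keys = list(vectors)
    G.add_nodes_from(keys)
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            va, vb = vectors[a], vectors[b]
            sim = float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
            if sim >= threshold:
                G.add_edge(a, b, weight=sim)
    return G

def translation_candidates(G, query_node, target_lang):
    # Communities group similar words across languages; return target-language
    # members of the community that contains the query term.
    for community in louvain_communities(G, weight="weight", seed=0):
        if query_node in community:
            return [w for (lang, w) in community if lang == target_lang]
    return []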
Bilingual word embeddings have been shown to be helpful for Statistical Machine Translation (SMT). However, most existing methods suffer from two obvious drawbacks. First, they consider only simple contexts, such as an entire document or a fixed-size sliding window, to build word embeddings and ignore latent useful information in the selected context. Second, the word sense, not the word, should be the minimal semantic unit; however, most existing methods still use word-level representations. To overcome these drawbacks, this paper presents a novel Graph-based Bilingual Word Embedding (GBWE) method that projects bilingual word senses into a multi-dimensional semantic space. First, a bilingual word co-occurrence graph is constructed using the co-occurrence counts and pointwise mutual information between words. Then, maximal complete sub-graphs (cliques), which play the role of a minimal unit for bilingual sense representation, are dynamically extracted according to the contextual information. Finally, correspondence analysis, principal component analysis, and neural networks are used to summarize the clique-word matrix into lower dimensions to build the embedding model. Without contextual information, the proposed GBWE can be applied to lexical translation. Given contextual information, GBWE provides a dynamic solution for bilingual word representations, which can be applied to phrase translation and generation. Empirical results show that GBWE enhances the performance of lexical translation and Chinese/French-to-English phrase-based SMT.
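The simplified sketch below walks through the pipeline on a monolingual toy scale: build a co-occurrence graph weighted by PMI, enumerate maximal cliques as minimal sense units, form a clique-word matrix, and reduce it to low dimensions. PCA stands in for the paper's combination of correspondence analysis, PCA, and neural networks, and the PMI estimate and thresholds are simplifying assumptions.

# Sketch of a clique-based embedding built from a PMI-weighted co-occurrence graph.
import math
from collections import Counter
import networkx as nx
import numpy as np
from sklearn.decomposition import PCA

def build_pmi_graph(sentences, min_pmi=0.0):
    word_counts, pair_counts, total = Counter(), Counter(), 0
    for sent in sentences:
        total += len(sent)
        word_counts.update(sent)
        for i, w in enumerate(sent):
            for v in sent[i + 1:]:
                pair_counts[tuple(sorted((w, v)))] += 1
    G = nx.Graph()
    for (w, v), c in pair_counts.items():
        # Simplified PMI estimate from raw counts.
        pmi = math.log((c * total) / (word_counts[w] * word_counts[v]))
        if pmi > min_pmi:
            G.add_edge(w, v, weight=pmi)
    return G

def clique_word_embeddings(G, dim=2):
    # Maximal cliques act as the minimal sense units.
    cliques = [c for c in nx.find_cliques(G) if len(c) > 1]
    words = sorted(G.nodes)
    M = np.array([[1.0 if w in c else 0.0 for c in cliques] for w in words])
    dim = min(dim, min(M.shape))
    vectors = PCA(n_components=dim).fit_transform(M)  # summarize clique-word matrix
    return dict(zip(words, vectors))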
Lexicon-based sentiment analysis aims to extract people's opinions from their comments on the Web using a pre-defined lexicon of opinionated words. In contrast to machine learning approaches, lexicon-based methods are domain-independent, do not need a large annotated training corpus, and hence are faster. This has made the lexicon-based approach prevalent in the sentiment analysis community. However, the story is different for the Persian language. In contrast to English, lexicon-based sentiment analysis for Persian is a new discipline. There are rather limited resources available for sentiment analysis in Persian, making the accuracy of existing lexicon-based methods lower than for other languages. In the current study, we first perform an exhaustive investigation of lexicon-based methods. Then, we introduce two new resources to address the problem of resource scarcity for sentiment analysis in Persian: a carefully labeled lexicon of sentiment words, PerLex, and a new hand-made dataset of about 16,000 rated documents, PerView. Moreover, we present a new hybrid method using both machine learning and the lexicon-based approach, in which PerLex words are used to train the machine learning algorithm. Experiments are carried out on our new PerView dataset. The results indicate that the accuracy of PerLex is higher than that of the existing NRC and SentiStrength lexicons. They also show that using just adjectives leads to higher performance than using the NRC or SentiStrength lexicons. Moreover, the results demonstrate the effectiveness of using opinionated lexicon terms, followed by bigrams, as the features of the machine learning method.
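A minimal sketch of the hybrid feature design is given below: sentiment-lexicon terms (a tiny placeholder list stands in for PerLex) combined with bigram features feed a machine-learning classifier. The lexicon entries, feature sizes, and classifier choice are illustrative assumptions rather than the paper's configuration.

# Sketch of lexicon-term + bigram features for a sentiment classifier.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

perlex_like = ["خوب", "عالی", "بد", "ضعیف"]  # placeholder for PerLex entries

features = make_union(
    CountVectorizer(vocabulary=perlex_like),                 # opinionated lexicon terms
    TfidfVectorizer(ngram_range=(2, 2), max_features=5000),  # bigram features
)
model = make_pipeline(features, LinearSVC())
# Usage: model.fit(train_texts, train_labels); model.predict(test_texts)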