Phrase representation, an important step in many NLP tasks, involves representing phrases as continuous-valued vectors. This paper presents detailed comparisons of the effects of word vectors, training data, and the composition and objective functions used in a composition model for phrase representation. Specifically, we first discuss how augmented word representations affect the performance of the composition model. Then, we investigate whether different types of training data influence the performance of the composition model and, if so, how. Finally, we evaluate combinations of different composition and objective functions and discuss the factors related to composition model performance. All evaluations were conducted in both English and Chinese. Our main findings are as follows: (1) The Additive model with semantically enhanced word vectors performs comparably to the state-of-the-art model; (2) A simple Matrix model with semantically enhanced word vectors systematically outperforms the state-of-the-art model by a large margin; (3) Representing high-frequency phrases by estimating their surrounding contexts is a good training objective for bigram phrase similarity tasks; and (4) The high performance of the Matrix model with semantically enhanced word vectors is due to the matrix transformation and the greater weight attached to important words. Previous work has focused on the composition function; our findings, however, indicate that other components of the composition model (especially word representation) make a critical difference in phrase representation.
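To make the two central composition functions concrete, the sketch below contrasts them on a single bigram. This is a minimal sketch under stated assumptions: the vectors and matrices are random stand-ins for pretrained embeddings and trained parameters, and the tanh nonlinearity is an illustrative choice rather than the paper's exact formulation.

```python
# Minimal sketch: Additive vs. Matrix composition for a bigram phrase.
# Vectors and matrices are random placeholders, not trained parameters.
import numpy as np

rng = np.random.default_rng(0)
d = 100  # assumed word-vector dimensionality

# Hypothetical pretrained vectors for a bigram, e.g. "stock market".
u = rng.normal(size=d)  # first word
v = rng.normal(size=d)  # second word

# Additive model: the phrase vector is simply the sum of the word vectors.
phrase_add = u + v

# Matrix model: each word is first transformed by a (learned) matrix,
# which lets the model reweight dimensions and attach more weight to
# the more important word before combining.
W1 = 0.01 * rng.normal(size=(d, d))  # stand-in for a trained matrix
W2 = 0.01 * rng.normal(size=(d, d))
phrase_mat = np.tanh(W1 @ u + W2 @ v)  # tanh is an illustrative assumption
```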
In computational linguistics, Hindi and Urdu are not viewed as a monolithic whole and have received separate attention with respect to their text processing. From POS tagging to machine translation, separate models are trained for each, despite the fact that their similarity warrants unified models that could work for both. The main reasons are their divergent literary vocabularies and separate orthographies, and probably also their political status and the social perception that they are two separate languages. Together, Hindi and Urdu constitute the third most widely spoken language in the world, yet they do not receive enough attention in the NLP research community. In this work, we focus on dependency parsing of Hindi and Urdu under two settings: monolingual and cross-register. In the monolingual setting, we aim to learn reasonably accurate dependency parsers for both Hindi and Urdu. To address their differences so that their models can be used interchangeably, we also explore different cross-lingual transfer strategies in the latter setting. With respect to monolingual parsing, we show that incorporating linguistically relevant information such as case marking and grammatical agreement into the parsing model can significantly improve parsing of these languages; a toy illustration of such features follows this abstract. We improve the parsing of both Hindi and Urdu by $\sim$1.5\% absolute over a challenging baseline that uses rich features such as part-of-speech tags, chunk tags, bit strings, and lemmas. In the case of resource sharing, we show that transliteration coupled with class-based information induced over harmonized Hindi and Urdu text can help transfer model parameters efficiently. We achieve an improvement of 14.5\% absolute over a simpler delexicalized baseline and 2.3\% absolute over a more challenging fully lexicalized baseline that uses machine translation to translate the training data into the target language.
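As a rough illustration of how case marking and agreement can enter a feature-based parsing model, the sketch below augments a toy feature extractor for a transition-based parser. The feature templates, token fields, and example tokens are illustrative assumptions, not the paper's actual templates.

```python
# Minimal sketch: adding case-marking and agreement features to the
# feature set of a transition-based dependency parser. Field names and
# templates are hypothetical, for illustration only.
def features(stack_top, buffer_front):
    return [
        f"s0.pos={stack_top['pos']}",
        f"b0.pos={buffer_front['pos']}",
        f"s0.lemma={stack_top['lemma']}",
        # Case marking: case clitics such as "ne"/"ko" strongly constrain
        # which grammatical relation a nominal can bear.
        f"s0.case={stack_top.get('case', '_')}",
        # Agreement: matching gender between a candidate dependent and
        # head is evidence for the attachment decision.
        f"agree={stack_top.get('gen') == buffer_front.get('gen')}",
    ]

# Hypothetical tokens (fields assumed for illustration):
tok_s = {"pos": "NN", "lemma": "laRkaa", "case": "ne", "gen": "m"}
tok_b = {"pos": "VM", "lemma": "khaa", "gen": "m"}
print(features(tok_s, tok_b))
```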
In this paper, we propose a word-embedding-based named entity recognition (NER) approach. NER is commonly approached as a sequence labeling task using methods such as conditional random fields (CRFs). However, for low-resource languages without sufficiently large training data, methods such as CRFs do not perform well. In our work, we make use of the proximity of the vector embeddings of words to approach the NER problem. The hypothesis is that word vectors belonging to the same name category, e.g. person names, occur in close vicinity in the abstract vector space of the embedded words. Assuming that this clustering hypothesis holds, we apply a standard classification approach to the word vectors to learn a decision boundary between the NER classes. Our NER experiments are conducted on a morphologically rich, low-resource language, namely Bengali. Our approach significantly outperforms standard baseline CRF approaches that use cluster labels of word embeddings and gazetteers constructed from Wikipedia. Further, we propose an unsupervised approach that uses an automatically created NE gazetteer from Wikipedia in the absence of training data. For a low-resource language, the training points obtained from Wikipedia are not sufficient to train a classifier. We therefore propose to use the distance between the vector embeddings of words to expand the set of Wikipedia training examples with additional NEs extracted from a monolingual corpus, which yields a significant improvement in unsupervised NER performance. In fact, our expansion method performs better than the traditional CRF-based (supervised) approach (F-score of 65.4% vs. 64.2%). Finally, we compare our proposed approach to the official submissions for the IJCNLP-2008 Bengali NER shared task and achieve an overall F-score improvement of 4.84% over the best official system.
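The expansion step can be sketched as follows, assuming pretrained embeddings and a cosine-similarity acceptance threshold. The threshold value and the toy vectors are illustrative assumptions, not the paper's tuned settings.

```python
# Minimal sketch: expand a seed NE gazetteer (e.g. scraped from Wikipedia)
# with corpus words whose embeddings lie close to a seed name.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand_gazetteer(seeds, vocab_vectors, threshold=0.7):
    """seeds: {word: vector} seed NEs; vocab_vectors: {word: vector}
    from a monolingual corpus. Returns the expanded candidate NE set."""
    expanded = set(seeds)
    for word, vec in vocab_vectors.items():
        if word in seeds:
            continue
        # Accept a word if it is close enough to any seed name.
        if any(cosine(vec, s) >= threshold for s in seeds.values()):
            expanded.add(word)
    return expanded

# Synthetic demo vectors (real usage would load trained embeddings).
rng = np.random.default_rng(0)
seeds = {"rabindranath": rng.normal(size=50)}
vocab = {"tagore": seeds["rabindranath"] + 0.1 * rng.normal(size=50),
         "kolkata": rng.normal(size=50)}
print(expand_gazetteer(seeds, vocab))  # "tagore" joins; "kolkata" does not
```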
This paper proposes a technique for mining bilingual lexicons from pairs of parallel short word sequences. The technique builds a generative model from a corpus of training data consisting of such pairs. The model is a hierarchical non-parametric Bayesian model that directly induces a bilingual lexicon during training. The model learns in an unsupervised manner and is designed to exploit characteristics of the language pairs being mined. The proposed model can utilize commonly used word-pair frequency information and, in addition, can employ the internal character alignments within the words themselves. It is thereby capable of mining transliterations, and can use reliably aligned transliteration pairs to support the mining of other words in their context. The model is also capable of performing word reordering and word deletion during the alignment process. In this paper, we study two mining tasks based on English-Japanese and English-Chinese language pairs and compare the proposed approach to a baseline based on a simpler model that uses only word-pair frequency information. Our results show that the proposed method is able to mine bilingual word pairs at higher levels of precision and recall than the baseline.
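For concreteness, the following sketch implements a co-occurrence-counting heuristic in the spirit of the frequency-only baseline described above (not the proposed Bayesian model). The scoring rule and toy data are illustrative assumptions.

```python
# Minimal sketch of a frequency-based baseline: count how often source and
# target words co-occur across parallel short sequences and keep frequent
# pairs as lexicon candidates.
from collections import Counter
from itertools import product

def mine_pairs(parallel_pairs, min_count=2):
    """parallel_pairs: iterable of (source_tokens, target_tokens)."""
    cooc = Counter()
    for src, tgt in parallel_pairs:
        # Count each word pair once per sequence pair.
        for s, t in product(set(src), set(tgt)):
            cooc[(s, t)] += 1
    # Rank candidate translation pairs by co-occurrence frequency.
    return [(pair, n) for pair, n in cooc.most_common() if n >= min_count]

# Hypothetical English-Japanese data (romanized for readability):
data = [(["machine", "translation"], ["kikai", "honyaku"]),
        (["machine", "learning"], ["kikai", "gakushuu"])]
print(mine_pairs(data))  # ("machine", "kikai") survives the count threshold
```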