In computational linguistics, Hindi and Urdu are not viewed as a monolithic whole and have received separate attention with respect to their text processing. From POS tagging to machine translation, separate models are trained for each, despite the fact that their similarity warrants unified models that could serve both. The main reasons are their divergent literary vocabularies and separate orthographies, and probably also their political status and the social perception that they are two distinct languages. Together, Hindi and Urdu constitute the third most widely spoken language in the world, yet they do not receive enough attention in the NLP research community. In this work, we focus on dependency parsing of Hindi and Urdu under two settings: mono-lingual and cross-register. In the mono-lingual setting, we aim to learn reasonably accurate dependency parsers for both Hindi and Urdu. In the cross-register setting, we explore different cross-lingual transfer strategies to bridge their differences so that their models can be used interchangeably. With respect to mono-lingual parsing, we show that incorporating linguistically relevant information, such as case marking and grammatical agreement, into the parsing model significantly improves parsing of these languages. We improve the parsing of both Hindi and Urdu by $\sim$1.5\% absolute over a challenging baseline that uses rich features such as part-of-speech tags, chunk tags, bit strings and lemmas. For resource sharing, we show that transliteration coupled with class-based information induced over harmonized Hindi and Urdu text helps transfer model parameters efficiently. We achieve an improvement of 14.5\% absolute over a simpler delexicalized baseline and 2.3\% absolute over a more challenging fully lexicalized baseline that uses machine translation to translate the training data into the target language.
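To make the abstract's morphosyntactic features concrete, here is a minimal Python sketch (our own illustration, not the authors' parser) of how case markers and agreement could enter the feature template of a graph-based dependency parser. The token fields and feature names are illustrative assumptions.

```python
# A minimal sketch of case-marking and agreement features for scoring
# a candidate head -> dependent arc; fields like 'case', 'gender' and
# 'number' are assumed token annotations, not a fixed format.

def arc_features(sent, head, dep):
    """Feature strings for a candidate head->dependent arc."""
    h, d = sent[head], sent[dep]
    return [
        f"h.pos={h['pos']}+d.pos={d['pos']}",        # baseline POS pair
        f"h.lemma={h['lemma']}+d.pos={d['pos']}",    # lemma feature
        f"d.case={d.get('case', '_')}",              # case marker on dependent
        # agreement features: do head and dependent match in gender/number?
        f"agr.gen={h.get('gender') == d.get('gender')}",
        f"agr.num={h.get('number') == d.get('number')}",
    ]

sent = [
    {"form": "larke", "lemma": "larkA", "pos": "NOUN",
     "case": "ne", "gender": "m", "number": "sg"},   # ergative subject
    {"form": "kitab", "lemma": "kitAb", "pos": "NOUN",
     "gender": "f", "number": "sg"},
    {"form": "parhi", "lemma": "parh", "pos": "VERB",
     "gender": "f", "number": "sg"},                 # verb agrees with object
]
print(arc_features(sent, head=2, dep=1))
```

In a Hindi/Urdu sentence like "larke ne kitab parhi", the ergative marker "ne" and the feminine agreement on the verb are exactly the signals such features would expose to the parser.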
Discourse relations between two text segments play an important role in many natural language processing (NLP) tasks. Connectives strongly indicate the sense of a discourse relation; in a large proportion of discourse relations, however, no connective is present, i.e., the relations are implicit. Compared with explicit relations, implicit relations are much harder to detect and have drawn significant attention. Until now, many studies have focused on English implicit discourse relations, while few address implicit relation recognition in Chinese, even though implicit discourse relations are more common in Chinese than in English. In our work, we focus on both English and Chinese. The key to implicit relation prediction is to properly model the semantics of the two discourse arguments, as well as the contextual interaction between them. To achieve this goal, we propose a neural-network-based framework that consists of two hierarchies. The first is the model hierarchy, in which we propose a max-margin learning method to explore implicit discourse relations from multiple views. The second is the feature hierarchy, in which we learn multi-level distributed representations, from words, arguments and syntactic structures up to sentences. We have conducted experiments on standard English and Chinese benchmarks, and the results show that, compared with several existing methods, our proposed method achieves the best performance in most cases.
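To make the max-margin idea concrete, the following is a minimal sketch, assuming PyTorch, of a hinge-loss classifier over two encoded discourse arguments. The architecture, dimensions and data are illustrative stand-ins for the paper's multi-view, multi-level model.

```python
# A minimal sketch of max-margin learning over an argument pair:
# each argument is mean-pooled from word embeddings, the pair is
# scored jointly, and a multi-class hinge loss enforces a margin
# between the gold relation and the others.

import torch
import torch.nn as nn

class ArgPairClassifier(nn.Module):
    def __init__(self, vocab=10000, dim=100, n_rel=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(),
                                 nn.Linear(dim, n_rel))

    def forward(self, arg1, arg2):
        v1 = self.emb(arg1).mean(dim=1)   # pooled representation of Arg1
        v2 = self.emb(arg2).mean(dim=1)   # pooled representation of Arg2
        return self.mlp(torch.cat([v1, v2], dim=-1))

model = ArgPairClassifier()
loss_fn = nn.MultiMarginLoss(margin=1.0)   # hinge loss over relation classes
arg1 = torch.randint(0, 10000, (8, 20))    # batch of 8 first arguments
arg2 = torch.randint(0, 10000, (8, 25))    # batch of 8 second arguments
gold = torch.randint(0, 4, (8,))           # gold relation labels
loss = loss_fn(model(arg1, arg2), gold)
loss.backward()
print(float(loss))
```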
In this paper, we propose a word-embedding-based named entity recognition (NER) approach. NER is commonly approached as a sequence labeling task using methods such as conditional random fields (CRFs). However, for low-resource languages without sufficiently large training data, methods such as CRFs do not perform well. In our work, we make use of the proximity of the vector embeddings of words to approach the NER problem. The hypothesis is that word vectors belonging to the same name category, e.g. person names, occur in close vicinity in the abstract vector space of the embedded words. Assuming that this clustering hypothesis is true, we apply a standard classification approach to the word vectors to learn a decision boundary between the NER classes. Our NER experiments are conducted on a morphologically rich and low-resource language, namely Bengali. Our approach significantly outperforms standard baseline CRF approaches that use cluster labels of word embeddings and gazetteers constructed from Wikipedia. Further, we propose an unsupervised approach that, in the absence of training data, uses an NE gazetteer automatically created from Wikipedia. For a low-resource language, the training points obtained from Wikipedia are not sufficient to train a classifier. We therefore propose to use the distance between the vector embeddings of words to expand the set of Wikipedia training examples with additional NEs extracted from a monolingual corpus, which yields a significant improvement in unsupervised NER performance. In fact, our expansion method performs better than the traditional CRF-based (supervised) approach (F-score 65.4% vs. 64.2%). Finally, we compare our proposed approach to the official submissions for the IJCNLP-2008 Bengali NER shared task and achieve an overall F-score improvement of 4.84% with respect to the best official system.
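The two ideas in the abstract, classifying word vectors into NE classes and expanding a Wikipedia-derived gazetteer via embedding distance, can be sketched as follows. This is an illustrative assumption using scikit-learn and random stand-in vectors, not the authors' exact pipeline; the similarity threshold and class labels are hypothetical.

```python
# A minimal sketch of (1) training a classifier on word vectors and
# (2) expanding a small seed gazetteer with nearest neighbours in
# embedding space; vectors here are random stand-ins for real
# pre-trained Bengali embeddings.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
vocab = [f"word{i}" for i in range(1000)]
vecs = {w: rng.normal(size=50) for w in vocab}   # stand-in embeddings

# (1) supervised: learn a decision boundary between NE classes
seed_words = vocab[:100]
seed_labels = rng.integers(0, 3, size=100)       # e.g. PER / LOC / ORG
clf = SVC().fit([vecs[w] for w in seed_words], seed_labels)
print(clf.predict([vecs["word500"]]))

# (2) unsupervised expansion: add words whose vectors lie close
# to an existing gazetteer entry
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

gazetteer = set(seed_words[:10])                 # small Wikipedia seed list
for w in vocab:
    if w not in gazetteer and any(cosine(vecs[w], vecs[g]) > 0.6
                                  for g in list(gazetteer)):
        gazetteer.add(w)                         # expanded NE list
print(len(gazetteer))
```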
This paper proposes a technique for mining bilingual lexicons from pairs of parallel short word sequences. The technique builds a generative model from a corpus of training data consisting of such pairs. The model is a hierarchical non-parametric Bayesian model that directly induces a bilingual lexicon during training. The model learns in an unsupervised manner and is designed to exploit characteristics of the language pairs being mined. The proposed model is capable of utilizing commonly used word-pair frequency information and, in addition, can employ the internal character alignments within the words themselves. It is thereby capable of mining transliterations, and can use reliably aligned transliteration pairs to support the mining of other words in their context. The model can also perform word reordering and word deletion during the alignment process. In this paper, we study two mining tasks based on English-Japanese and English-Chinese language pairs, and compare the proposed approach to a baseline based on a simpler model that uses only word-pair frequency information. Our results show that the proposed method is able to mine bilingual word pairs at higher levels of precision and recall than the baseline.
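The "rich get richer" caching behaviour that non-parametric Bayesian models of this kind rely on can be illustrated with a minimal sketch. The code below is our own simplification, a Dirichlet-process-style cache over word pairs with a placeholder base distribution; in the paper's hierarchical model, a character-level transliteration component would supply the base probabilities instead.

```python
# A minimal sketch of a DP-style cache model for (source, target)
# word pairs: a pair is generated either from the cache of previously
# aligned pairs or from a base distribution, so frequently co-occurring
# pairs are reinforced.

from collections import Counter

ALPHA = 1.0        # DP concentration parameter (illustrative value)
cache = Counter()  # counts of previously generated pairs

def base_prob(pair):
    # placeholder base measure; a real model would score character
    # alignments / transliterations here
    return 1e-6

def pair_prob(pair):
    n = sum(cache.values())
    return (cache[pair] + ALPHA * base_prob(pair)) / (n + ALPHA)

def observe(pair):
    cache[pair] += 1   # reinforce frequent pairs (rich get richer)

data = [("water", "mizu"), ("water", "mizu"), ("dog", "inu")]
for p in data:
    print(p, round(pair_prob(p), 8))
    observe(p)
```

After the first ("water", "mizu") observation, the second occurrence is scored from the cache and becomes far more probable, which is the mechanism that lets reliably aligned pairs support the mining of other words around them.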