In this paper, we summarize the methodology and results of our two-year-long effort to construct a comprehensive WordNet for Turkish. We provide the details of the manual and automated stages of the construction and discuss the remaining problems and future directions.
Word embedding based methods have received increasing attention for their flexibility and effectiveness in many natural language processing (NLP) tasks, including Word Similarity (WS). However, these approaches rely on high-quality corpora and neglect prior knowledge. Lexicon-based methods draw on the human knowledge encoded in semantic resources, e.g., Tongyici Cilin, HowNet, and Chinese WordNet, but they cannot handle unknown words. This paper proposes a three-stage framework for measuring Chinese word similarity that incorporates prior knowledge from lexicons and statistics into word embeddings: in the first stage, we use retrieval techniques to crawl the contexts of word pairs from web resources and extend the context corpus. In the second stage, we investigate three types of individual similarity measurements: lexicon similarities, statistical similarities, and embedding-based similarities. Finally, we exploit simple combination strategies based on arithmetic operations as well as a counter-fitting combination strategy based on optimization. To demonstrate the system's effectiveness, we conduct comparative experiments on the PKU-500 dataset. Our final Spearman/Pearson rank correlation coefficients of 0.561/0.516 outperform, to the best of our knowledge, the state of the art. Results on the Chinese MC-30 and SemEval-2012 datasets show that our system also performs well on other Chinese datasets, demonstrating its transferability. Moreover, our system is not language-specific and can be applied to other languages, e.g., English.
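The combination stage and the Spearman evaluation described above can be sketched minimally as follows. This is an illustrative sketch only, not the paper's actual implementation: the uniform-weight averaging of the three similarity types and the function names are assumptions, and the rank computation omits tie averaging for brevity.

```python
def combine(scores, weights=None):
    """Combine per-pair similarity scores (e.g., lexicon, statistical,
    embedding-based) with a simple weighted arithmetic mean.
    Uniform weights are an illustrative default, not the paper's choice."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))

def spearman(xs, ys):
    """Spearman rank correlation between system scores and gold ratings.
    Sketch only: ties are not averaged, unlike standard implementations."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank + 1.0
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In practice one would tune the combination weights on held-out data and use a library routine (e.g., `scipy.stats.spearmanr`) that handles tied ranks.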
Serendipity plays an important role in users' appreciation of a recommender system and has been shown to be a possible way to alleviate the problem of over-specialization. In this paper, we study the problem of introducing serendipity into an entity recommendation system. Specifically, we aim to recommend more serendipitous entities to users based on the query they are searching for, helping the system present surprisingly interesting entity recommendations that users might not have discovered yet. To this end, we develop neural models that rank each candidate entity according to how well it engages users' interest for a given query. We leverage various factors that may indicate different degrees of serendipity. Extensive experiments are conducted on large-scale, real-world datasets collected from a widely used commercial search engine, and they show that our method significantly outperforms several strong baselines. Experimental results also show that our method effectively recommends serendipitous entities that are of greater interest to users for domain-independent queries. Click-through rate (CTR) results further demonstrate that our method significantly improves user engagement.
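At a high level, the ranking step amounts to scoring each candidate entity on serendipity-related factors and sorting. The sketch below uses a simple linear scorer in place of the paper's neural models; the function, feature names, and weights are all hypothetical and serve only to make the ranking setup concrete.

```python
def rank_entities(candidates, weights):
    """Rank candidate entities for a query by a weighted combination of
    serendipity-related factors (a linear stand-in for a learned neural
    scorer; all feature names here are illustrative).

    candidates: list of (entity, feature_dict) pairs
    weights:    dict mapping feature name -> weight
    Returns candidates sorted from highest to lowest score."""
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)
```

A learned model would replace the dot product with a neural scoring function trained on click signals, but the rank-by-score structure is the same.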
We annotate 60,000 words of Classical Arabic, spanning topics in philosophy, religion, literature, and law, with fine-grained segment-based morphological descriptions. We use these annotations to build a morphological segmenter and part-of-speech tagger for Classical Arabic. Using character-level classification with features from the word and its lexical context, the segmenter achieves a word accuracy of 96.8%, the main issue being a high rate of out-of-vocabulary words. A token-based part-of-speech tagger achieves an accuracy of 96.22% (97.72% on known tokens) despite the small size of the corpus. An error analysis shows that most tagging errors result from segmentation errors and that quality improves as more data is added. The morphological segmenter/tagger has a wide range of potential applications in processing the Arabic linguistic heritage.
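The known-token versus out-of-vocabulary accuracy split reported above can be computed with a small helper like the one below. This is a generic evaluation sketch, not the paper's code; the representation of each token as a (word, tag) pair is an assumption.

```python
def accuracy_by_vocab(gold, pred, train_vocab):
    """Split tagging accuracy into known vs. out-of-vocabulary tokens.

    gold, pred:  aligned lists of (word, tag) pairs
    train_vocab: set of word forms seen in training
    Returns (known_accuracy, oov_accuracy); NaN if a bucket is empty."""
    known = [(g, p) for g, p in zip(gold, pred) if g[0] in train_vocab]
    oov   = [(g, p) for g, p in zip(gold, pred) if g[0] not in train_vocab]
    def acc(pairs):
        if not pairs:
            return float("nan")
        return sum(1 for g, p in pairs if g == p) / len(pairs)
    return acc(known), acc(oov)
```

Reporting both numbers, as the abstract does, makes it clear how much of the error budget is attributable to unseen word forms rather than to the tagger itself.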