Towards Machine Translation in Semantic Vector Space
A Constraint Approach to Pivot-based Bilingual Dictionary Induction
Bigram language models and reevaluation strategy for improved recognition of online handwritten Tamil words
Statistical approaches to Korean morphological analysis have recently attracted interest. However, previous studies have mostly relied on generative models, such as the hidden Markov model (HMM), rather than discriminative models such as the conditional random field (CRF). In this paper, we present a two-stage discriminative approach to Korean morphological analysis based on CRFs. Similar to methods used for Chinese, we perform two CRF-based disambiguation procedures: 1) morpheme segmentation and 2) POS tagging. In morpheme segmentation, an input sentence is segmented into a sequence of morphemes, where each morpheme unit is either atomic or compound. In the POS tagging procedure, each morpheme (atomic or compound) is assigned a POS tag. Once POS tagging is complete, we post-process the compound morphemes: each compound morpheme is further decomposed into atomic morphemes using pre-analyzed patterns and generalized HMMs obtained from the given tagged corpus. Experimental results show the promise of our proposed method.
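The segmentation and post-processing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that stage one emits one B/I label per character (B starts a morpheme, I continues it), as is common in Chinese word segmentation, and that the pre-analyzed patterns can be represented as a plain lookup table. The abstract does not specify the tag set or the pattern format, and the function names `tags_to_morphemes` and `decompose` are hypothetical. The CRF models and the generalized HMM fallback are not shown.

```python
def tags_to_morphemes(chars, tags):
    """Group characters into morphemes from per-character B/I labels.

    Assumption: "B" starts a new morpheme and any other label continues
    the current one. The labels would come from a trained segmentation
    model, which is not part of this sketch.
    """
    morphemes = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not morphemes:
            morphemes.append(ch)   # start a new morpheme
        else:
            morphemes[-1] += ch    # extend the current morpheme
    return morphemes


def decompose(morpheme, patterns):
    """Split a compound morpheme using a pre-analyzed pattern table.

    Assumption: `patterns` maps a compound surface form to its list of
    atomic morphemes. Forms missing from the table are returned as-is.
    """
    return patterns.get(morpheme, [morpheme])
```

For example, `tags_to_morphemes("abcde", ["B", "I", "B", "I", "I"])` groups the characters into `["ab", "cde"]`, and a hypothetical table `{"cde": ["cd", "e"]}` would then decompose the second unit into two atomic morphemes.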
High-quality bilingual dictionaries are very useful, but such resources are rarely available for lower-density language pairs, especially for those that are closely related. Using a third (pivot) language to link two other languages is a well-known solution, and usually requires only two input bilingual dictionaries, A-B and B-C, to automatically induce a new one, A-C. This approach, however, has never been shown to utilize the complete structures of the input bilingual dictionaries, which is a key failing because the dropped meanings negatively influence the result. This paper proposes a constraint approach to pivot-based dictionary induction where languages A and C are closely related. We create constraints from language similarity and model the structures of the input dictionaries as a Boolean optimization problem, which is then formulated within the Weighted Partial Max-SAT framework, an extension of Boolean Satisfiability (SAT). All of the encoded formulas, expressed in CNF (Conjunctive Normal Form), the predominant input language of modern SAT/Max-SAT solvers, are evaluated by a solver to produce the target (output) bilingual dictionary. Moreover, we discuss alternative formalizations as a comparison study. We designed a tool that implements our method using the Sat4j library as the default solver, and conducted an experiment in which the induced bilingual dictionary achieved better quality than that of the baseline method.
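To make the Weighted Partial Max-SAT framework concrete, the sketch below solves a tiny instance by exhaustive search. Hard clauses must all be satisfied, and among the assignments that satisfy them, the one maximizing the total weight of satisfied soft clauses is returned. This is an illustration only: the paper's tool uses Sat4j rather than brute force, and the function name `solve_wpmaxsat` is hypothetical. The variable encoding in the usage note (one Boolean variable per candidate A-C translation pair) is also an assumption, since the abstract does not give the actual encoding.

```python
from itertools import product


def solve_wpmaxsat(n_vars, hard, soft):
    """Brute-force Weighted Partial Max-SAT for very small instances.

    Clauses are lists of non-zero integers in DIMACS style:
    k means variable k is true, -k means variable k is false.
    hard: list of clauses that must all be satisfied.
    soft: list of (weight, clause) pairs.
    Returns (best_weight, assignment) or None if the hard clauses
    cannot be satisfied. Runs in O(2^n_vars), so it is a teaching aid,
    not a practical solver.
    """
    def satisfied(clause, assign):
        return any(assign[abs(lit)] == (lit > 0) for lit in clause)

    best = None
    for values in product([False, True], repeat=n_vars):
        assign = {i + 1: v for i, v in enumerate(values)}
        if not all(satisfied(c, assign) for c in hard):
            continue  # violates a hard constraint
        weight = sum(w for w, c in soft if satisfied(c, assign))
        if best is None or weight > best[0]:
            best = (weight, assign)
    return best
```

As a hypothetical usage, let variables 1 and 2 stand for two competing translations of the same word in language A, and variable 3 for a translation of a different word. The hard clause `[-1, -2]` forbids accepting both competitors, and the soft clauses `(3, [1])`, `(2, [2])`, `(1, [3])` reward each accepted pair. Calling `solve_wpmaxsat(3, [[-1, -2]], [(3, [1]), (2, [2]), (1, [3])])` selects pairs 1 and 3 with total weight 4.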