This paper proposes an approach to infer bilingual Semantic Role Labeling (SRL) efficiently. Since translated bi-texts convey the same meaning, they should share the same predicate-argument structure. However, it is very difficult to obtain consistent SRL results on both sides of bi-texts with monolingual SRL systems. Moreover, the two sides of bi-texts usually contain complementary language cues, which can be used to improve over monolingual SRL systems. It is therefore preferable to infer bilingual SRL jointly. Existing methods for joint bilingual SRL, however, incur high inference costs. In this paper, we employ a simple but efficient technique, Lagrange Dual Decomposition, to search for consistent results on both sides of bi-texts. Intuitively, the complementary bilingual cues can also guide argument identification. To this end, we propose a method called Bi-Directional Projection (BDP) to recover arguments discarded in the argument identification phase of the monolingual SRL systems. We evaluate our method on a standard parallel benchmark, the OntoNotes dataset. The experimental results show that our method yields significant improvements over state-of-the-art monolingual systems. In addition, our approach is both more accurate and faster than existing joint methods, owing to Bi-Directional Projection and Lagrange Dual Decomposition.
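To make the agreement search concrete, the following is a minimal sketch (not the paper's implementation) of subgradient-based Lagrange dual decomposition over two monolingual SRL decoders; the decoder functions, the set of aligned argument slots, and the role inventory are hypothetical placeholders.

    # Hedged sketch: subgradient dual decomposition that pushes two monolingual
    # SRL decoders toward agreement on aligned argument roles.
    # decode_src / decode_tgt are hypothetical: each takes a multiplier table
    # and returns {aligned_slot: role} maximizing (model score + multiplier terms).
    def dual_decompose(decode_src, decode_tgt, slots, roles, iters=50, step=1.0):
        u = {(s, r): 0.0 for s in slots for r in roles}        # one multiplier per indicator
        for t in range(1, iters + 1):
            y_src = decode_src(u)                              # rewarded by +u
            y_tgt = decode_tgt({k: -v for k, v in u.items()})  # rewarded by -u
            if y_src == y_tgt:                                 # both sides agree
                return y_src
            rate = step / t                                    # decaying step size
            for s in slots:
                for r in roles:
                    diff = (y_src.get(s) == r) - (y_tgt.get(s) == r)
                    u[(s, r)] -= rate * diff                   # subgradient update
        return y_src                                           # fall back to the source side

Each decoder maximizes its own model score plus (or minus) the multiplier terms, and the multipliers are nudged until the two role assignments agree or the iteration budget is exhausted.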
Converting Continuous-Space Language Models into N-gram Language Models with Efficient Bilingual Pruning for Statistical Machine Translation
Collective Web-based Parenthetical Translation Extraction Using Markov Logic Networks
Document Image Understanding (DIU) and Electronic Document Management are active fields of research involving image understanding, interpretation, efficient handling, and routing of documents, as well as their retrieval. Research on most non-cursive (Latin) scripts has matured, whereas research on cursive (connected) scripts is still moving towards perfection. Many researchers around the world are currently working on cursive scripts (Arabic and other scripts adopting it) so that the difficulties and challenges in understanding and handling documents in these scripts can be overcome. The Sindhi script is the largest extension of the original Arabic alphabet among languages adopting Arabic script; it contains 52 characters compared to 28 in the Arabic alphabet, in order to accommodate the additional sounds of the language. There are 24 differentiating characters, some possessing four dots. For Sindhi OCR research and development, a database is needed for training and testing on Sindhi text images. We have developed a large database containing 4 billion 57 million words and 15 billion 275 million characters in 150 different fonts, 4 font weights, and 4 styles. The database contents were collected from various sources including websites, books, theses, and others. A custom-built application was also developed to create text images from text documents, supporting various fonts and sizes. The database records counts of words, characters, characters with spaces, and lines. The database is freely available, in part or in full, by sending an email to one of the authors.
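As an illustration of such a text-to-image tool (a hedged sketch only, not the authors' application; the font file name and sizes are placeholders), a single text line can be rendered with Pillow as follows:

    # Hedged sketch: render one text line into an image with Pillow.
    # "SindhiFont.ttf" and the size/margin values are illustrative placeholders.
    from PIL import Image, ImageDraw, ImageFont

    def render_line(text, font_path="SindhiFont.ttf", size=36, margin=10):
        font = ImageFont.truetype(font_path, size)
        probe = ImageDraw.Draw(Image.new("RGB", (1, 1)))
        left, top, right, bottom = probe.textbbox((0, 0), text, font=font)
        img = Image.new("RGB", (right - left + 2 * margin,
                                bottom - top + 2 * margin), "white")
        ImageDraw.Draw(img).text((margin - left, margin - top),
                                 text, font=font, fill="black")
        return img

    # render_line("some Sindhi text").save("line.png")

Note that correct shaping of connected Arabic-script text additionally requires a Pillow build with Raqm (complex text layout) support.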
This paper presents an elegant technique for extracting low-level stroke features, such as line segments, curve segments, end points, and junction points, from offline printed text using a template matching approach. The proposed features are used to classify a subset of characters from the Gujarati character set. The dataset consists of approximately 16,000 middle-zone symbols from 42 different character classes. The symbols are collected from three different sources, namely a machine-printed book, laser-printed documents, and newspapers, in order to add variety in terms of size, font type, style, ink variation, and boundary deformation. The experiments show that the features are quite robust against these variations and the results obtained are comparable with other existing work.
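A minimal sketch of the template-matching step is shown below, using OpenCV rather than whatever toolkit the authors used; the template image and the 0.8 score threshold are assumptions for illustration.

    # Hedged sketch: locate a stroke primitive (e.g., an end-point or junction
    # template) in a binarized glyph image via normalized template matching.
    import cv2
    import numpy as np

    def find_strokes(glyph_path, template_path, threshold=0.8):
        glyph = cv2.imread(glyph_path, cv2.IMREAD_GRAYSCALE)
        templ = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
        scores = cv2.matchTemplate(glyph, templ, cv2.TM_CCOEFF_NORMED)
        ys, xs = np.where(scores >= threshold)        # all sufficiently good hits
        return list(zip(xs.tolist(), ys.tolist()))    # top-left corners of matches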
A rule-based pre-reordering approach is proposed for statistical Japanese-to-English machine translation using the dependency structure of source-side sentences. A Japanese sentence is pre-reordered into an English-like order at the morpheme level for a statistical machine translation system during the training and decoding phases to resolve the reordering problem. In this paper, extra-chunk pre-reordering of morphemes is proposed, which allows Japanese functional morphemes to move across chunk boundaries. This contrasts with the intra-chunk reordering used in previous approaches, which restricts the reordering of morphemes to within a chunk. Linguistically oriented discussions show that correct pre-reordering cannot be realized without extra-chunk movement of morphemes. The proposed approach is compared with five rule-based pre-reordering approaches designed for Japanese-to-English translation and with a language-independent statistical pre-reordering approach on a standard patent data set and on a news data set obtained by crawling Internet news sites. Two state-of-the-art statistical machine translation systems, one phrase-based and the other hierarchical phrase-based, are used in the experiments. Experimental results show that the proposed approach markedly outperforms the compared approaches on automatic reordering measures (Kendall's tau, Spearman's rho, fuzzy reordering score, and test-set RIBES) and on the automatic translation precision measure, test-set BLEU score.
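Since Japanese is largely head-final while English is head-initial, one simple way to picture dependency-based pre-reordering is to emit each head before its dependents. The sketch below illustrates only that generic idea over a morpheme-level dependency tree; it is not the paper's rule set, and the input format (one head index per morpheme) is an assumption.

    # Hedged sketch: emit each head before its dependents to approximate an
    # English-like order from a head-final Japanese dependency tree.
    def preorder(morphemes, heads):
        # heads[i] is the index of morpheme i's head, or -1 for the root.
        children = {i: [] for i in range(len(morphemes))}
        root = 0
        for i, h in enumerate(heads):
            if h < 0:
                root = i
            else:
                children[h].append(i)

        def emit(i):
            yield morphemes[i]            # head first (English-like)
            for c in children[i]:         # then its dependents, left to right
                yield from emit(c)

        return list(emit(root))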
Experiments with various word segmentation approaches for the Burmese language are conducted and discussed in this note. Specifically, dictionary-based, statistical, and machine learning approaches are tested. Experimental results demonstrate that statistical and machine learning approaches perform significantly better than dictionary-based approaches. We believe this note is the first systematic comparison of word segmentation approaches for Burmese based on an annotated corpus of considerable size (approximately half a million words). This work aims to uncover the properties of Burmese text processing and the approaches suited to it, and to promote further research on this understudied language.
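For reference, the dictionary-based family of approaches can be as simple as greedy longest matching; the sketch below, with a toy dictionary and a small maximum word length, is only illustrative and is not the note's exact implementation.

    # Hedged sketch: greedy longest-match segmentation against a word list.
    def longest_match_segment(text, dictionary, max_len=6):
        words, i = [], 0
        while i < len(text):
            for j in range(min(len(text), i + max_len), i, -1):
                if text[i:j] in dictionary or j == i + 1:
                    words.append(text[i:j])   # longest dictionary word, else one character
                    i = j
                    break
        return words

    # Toy usage (Latin letters stand in for Burmese syllables):
    # longest_match_segment("abcde", {"ab", "cde"}) -> ["ab", "cde"]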
A lemmatization algorithm for Bengali has been developed and its effectiveness for word sense disambiguation (WSD) is investigated. One of the key challenges in computer processing of agglutinative languages is dealing with the frequent morphological variations of root words as they appear in text. Designing a lemmatizer is therefore essential for developing many natural language processing (NLP) tools for such languages. In this experiment, Bengali, which is the national language of Bangladesh and the second most widely spoken language in the Indian subcontinent, is taken as a reference. In order to design the lemmatizer (BenLem), the possible transformations through which surface words are formed from lemmas are studied, so that suitable reverse transformations (along with contextual knowledge) can be applied to a surface word to recover the corresponding lemma. BenLem is found to be capable of handling both inflectional and derivational morphology in Bengali. It is evaluated on a set of 18 news articles taken from the FIRE Bengali News Corpus, consisting of 3,338 surface words (excluding proper nouns), and found to be about 82.68% accurate. The role of the lemmatizer is then investigated for Bengali WSD. Fifty news articles are randomly selected from the FIRE corpus and the five most frequent polysemous Bengali words are considered for sense disambiguation. Different WSD systems are considered for this experiment, and it is observed that BenLem improves the performance of all the WSD systems, and the improvement is statistically significant.
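The reverse-transformation idea can be pictured with a simple suffix-stripping sketch; the suffix list, lexicon lookup, and fallback below are assumptions for illustration and omit BenLem's contextual knowledge and derivational rules.

    # Hedged sketch: strip the longest known inflectional suffix and accept the
    # candidate only if it is a known lemma; otherwise return the surface form.
    def lemmatize(word, suffixes, lexicon):
        for suf in sorted(suffixes, key=len, reverse=True):   # longest suffix first
            if word.endswith(suf) and len(word) > len(suf):
                candidate = word[:-len(suf)]
                if candidate in lexicon:                      # contextual checks omitted
                    return candidate
        return word                                           # no valid reverse transformation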