Discourse parsing aims to identify the structure of, and the relationships between, different discourse units. Most existing approaches analyze a whole discourse in a single pass, and thus often fail to distinguish long-span relations and to represent discourse units properly. In this paper, we propose a novel parsing model that analyzes a discourse in two steps, with different features characterizing intra-sentence and inter-sentence discourse structures, respectively. Our model works in a transition-based framework and benefits from a stack long short-term memory (stack-LSTM) neural network. Experiments on benchmark discourse treebanks show that our method outperforms the traditional one-step parsing method in both English and Chinese.
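The transition-based procedure above can be sketched as a shift-reduce loop over elementary discourse units (EDUs). The `oracle` callable below is a stand-in for the paper's stack-LSTM action classifier; the whole sketch is an illustrative assumption, not the authors' implementation.

```python
def parse(edus, oracle):
    """Minimal shift-reduce skeleton for building a binary discourse
    tree over EDUs.  `oracle` decides the next action; in the paper it
    would be a learned stack-LSTM classifier."""
    stack, buffer = [], list(edus)
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))      # move the next EDU onto the stack
        else:                                # REDUCE: merge the top two subtrees
            right, left = stack.pop(), stack.pop()
            stack.append((left, right))
    return stack[0]

# A trivial oracle: shift while anything remains in the buffer,
# then reduce until a single tree is left.
tree = parse(["e1", "e2", "e3"],
             lambda s, b: "SHIFT" if b else "REDUCE")
# tree == ("e1", ("e2", "e3"))
```

A two-step variant would run this loop once per sentence with intra-sentence features and once over sentence-level subtrees with inter-sentence features.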
Texts containing (i) repetition (the same stroke written several times), (ii) over-writing, and (iii) crossing out are very common in natural handwriting. We refer to the presence of these three types of writing as noise. Removing such noise from text is important for robust recognition. To the best of our knowledge, no work has been reported on cleaning such noise from online text in any script, and hence in this paper we propose a novel automatic text-cleaning approach for online handwriting recognition. First, crossing-out noise with straight strike-through lines is detected using a straightness criterion for online strokes. Regions containing repetition, over-writing, and other types of crossing out are located using the positional information of the overlapping strokes. Density, self-intersections, and similar features are computed from the strokes in the located regions to predict the noise type. For crossing out, all strokes of the crossed-out region are removed. For repetition and over-writing, strokes written earlier are removed and the latest strokes are kept. Finally, delayed strokes are properly arranged and the word is passed to the online recognizer. Although recognition of natural handwriting is quite difficult, in this pioneering attempt we obtain encouraging noise detection and word recognition accuracies.
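A straightness criterion for online strokes can be illustrated as the ratio of a stroke's end-to-end distance to its total path length: values near 1.0 indicate a nearly straight stroke such as a strike-through line. The exact criterion and threshold used in the paper are assumptions here.

```python
import math

def straightness(stroke):
    """Ratio of end-to-end (chord) distance to total path length of an
    online stroke given as a list of (x, y) points.  Near 1.0 for a
    straight stroke, smaller for curved or zigzag strokes."""
    if len(stroke) < 2:
        return 1.0
    path = sum(math.dist(a, b) for a, b in zip(stroke, stroke[1:]))
    chord = math.dist(stroke[0], stroke[-1])
    return chord / path if path > 0 else 1.0

def is_strike_through(stroke, threshold=0.95):
    # Hypothetical threshold; the paper's actual value may differ.
    return straightness(stroke) >= threshold
```

For example, a horizontal line `[(0, 0), (1, 0), (2, 0)]` scores 1.0, while a zigzag `[(0, 0), (1, 1), (2, 0)]` scores about 0.71 and is rejected.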
Igbo, an African language with about 32 million speakers worldwide, belongs to the group of languages with zero language-processing resources needed for advanced natural language processing (NLP) and language technology applications. Thus, we present in this article the adapted state-of-the-art methods used to build the NLP resources, namely corpora and a tagset, needed for the development of a part-of-speech (PoS) tagging system for Igbo. We discuss some of the problems encountered along the way and proffer solutions to them.
The lack of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. Pivot-language and cognate-recognition approaches have proven useful for inducing bilingual lexicons for such languages. We propose a constraint-based bilingual lexicon induction method for closely related languages that extends the constraints of the recent pivot-based induction approach and enables multiple constraint cycles to reach more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. In this paper, we use four datasets consisting of one Austronesian low-resource language and three Indo-European high-resource languages. We use the Inverse Consultation method and the translation pairs generated from the Cartesian product of the input dictionaries as baselines, and evaluate our results with precision, recall, and F-score. Our method automatically finds the best threshold on the total cost of violated constraints to maximize the F-score, but users are allowed to specify the threshold and the number of cycles (n) based on their preferences and priorities. Our method outperforms the baselines on F-score for all of our datasets. Based on these results, our method shows promise for complementing other bilingual-dictionary creation methods, such as word-alignment models using parallel corpora for high-resource languages, and especially strong promise for enriching low-resource languages.
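The core pivot step can be sketched as keeping a (source, target) pair whenever both words share a pivot translation, which closes a cycle in the transgraph. This one-cycle version omits the paper's cost-based constraint violations and n-cycle extension, and the toy dictionary entries below are hypothetical.

```python
def induce_pairs(src2piv, tgt2piv):
    """Induce candidate (source, target) translation pairs through a
    shared pivot language.  A pair is kept only if the two words link
    to a common pivot word, i.e. they close a cycle in the transgraph.
    Simplified sketch of the constraint-based approach, not the
    authors' implementation."""
    pairs = set()
    for s, pivots_s in src2piv.items():
        for t, pivots_t in tgt2piv.items():
            if set(pivots_s) & set(pivots_t):  # shared pivot closes a cycle
                pairs.add((s, t))
    return pairs

# Toy source->pivot and target->pivot dictionaries via an English
# pivot (illustrative data only).
src2piv = {"buku": ["book"], "air": ["water"]}
tgt2piv = {"buku": ["book"], "banyu": ["water"]}
# induce_pairs(src2piv, tgt2piv) == {("buku", "buku"), ("air", "banyu")}
```

The full method would additionally score candidates by how many constraint cycles they satisfy and discard pairs whose total violation cost exceeds the learned threshold.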
Topic segmentation is one of the pillars of natural language processing. Yet there is a remarkable lack of research in this field as far as the Arabic language is concerned. The purpose of this paper is to improve Arabic topic segmentation (ATS) by investigating two segmenters: ArabC99 and ArabTextTiling. This study is carried out on two independent levels, pre-processing and segmentation, which represent the basic steps of topic segmentation. On the pre-processing level, we examine the effect of different Arabic stemming algorithms on ATS and find that Light10 is the most appropriate for this step. Based on this conclusion, we proceed to the second level by proposing two Arabic segmenters, ArabC99-LS-LSA and ArabTextTiling-LS-LSA, which use external semantic knowledge derived from Latent Semantic Analysis (LSA). The evaluation results show that LSA provides improvements in this field. Hence, the main outcome of this paper is the multilevel improvement of ATS based on Light10 and LSA.
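A minimal TextTiling-style baseline, without the Light10 stemming or LSA vectors that the paper adds, can be sketched as placing a topic boundary wherever lexical cohesion between adjacent sentences drops below a threshold. The threshold and token representation here are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def boundaries(sentences, threshold=0.1):
    """Place a boundary before sentence i+1 when the lexical cosine
    similarity between sentences i and i+1 falls below `threshold`.
    ArabTextTiling-LS-LSA would stem tokens with Light10 and compare
    LSA vectors instead of raw counts (not shown)."""
    vecs = [Counter(s.split()) for s in sentences]
    return [i + 1 for i in range(len(vecs) - 1)
            if cosine(vecs[i], vecs[i + 1]) < threshold]
```

For example, `boundaries(["the cat sat", "the cat ran", "stocks fell today"])` detects a single topic shift before the third sentence.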
A notably challenging problem in emotion analysis is recognizing the cause of an emotion. Although there have been a few studies on emotion cause detection, most of them work on news reports, and the few that focus on microblogs assume a single-user structure (i.e., all texts in a microblog are written by the same user). In this paper, we focus on emotion cause detection for Chinese microblogs with a multiple-user structure (i.e., texts in a microblog are written successively by several users). First, since the causes of a focused user's emotion may be provided by other users in a microblog with the multiple-user structure, we design an emotion cause annotation scheme that can handle such complicated cases, and then build an emotion cause corpus using this scheme. Second, based on an analysis of the corpus, we formalize two emotion cause detection tasks for microblogs: current-subtweet-based and original-subtweet-based emotion cause detection. Furthermore, to examine the difficulty of the two tasks and the contributions of texts written by different users, we apply two popular classification methods (SVM and LSTM) to emotion cause detection. Our experiments show that current-subtweet-based emotion cause detection is much more difficult than original-subtweet-based detection, and that texts written by different users are very helpful for both tasks. This work presents a pilot study of emotion cause detection for Chinese microblogs with a complicated structure.