Code Mixing: A Challenge for Language Identification in the Language of Social Media (Paper)

2014, cited 269 times

Natural Language Processing Techniques; Hate Speech and Cyberbullying Detection; Authorship Attribution and Profiling

Abstract

In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.
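The contrast the abstract draws between plain dictionary lookup and the use of contextual clues can be illustrated with a minimal sketch. This is not the authors' system: the lexicons below are tiny hypothetical word lists of romanized forms chosen for illustration, the names `LEXICONS`, `tag_word` and `tag_sentence` are assumptions, and the neighbour vote is a simple stand-in for the supervised classifiers and Conditional Random Fields the paper actually evaluates.

```python
from collections import Counter

# Hypothetical toy lexicons (illustrative romanized forms only);
# a real system would build these from large corpora.
LEXICONS = {
    "en": {"i", "am", "going", "home", "today", "the", "is", "good"},
    "hi": {"main", "ghar", "ja", "raha", "hoon", "aaj", "hai", "accha"},
    "bn": {"ami", "bari", "jacchi", "aaj", "bhalo", "achi", "khub"},
}


def tag_word(word, lexicons=LEXICONS):
    """Context-free lookup: return the single matching language,
    'ambiguous' if several lexicons contain the word, else 'unk'."""
    token = word.lower()
    matches = [lang for lang, vocab in lexicons.items() if token in vocab]
    if len(matches) == 1:
        return matches[0]
    return "ambiguous" if matches else "unk"


def tag_sentence(words, window=2):
    """Resolve 'ambiguous' and 'unk' words by a majority vote over the
    confidently tagged neighbours within `window` positions."""
    first_pass = [tag_word(w) for w in words]
    resolved = []
    for i, label in enumerate(first_pass):
        if label in LEXICONS:
            resolved.append(label)
            continue
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        neighbours = [
            first_pass[j]
            for j in range(lo, hi)
            if j != i and first_pass[j] in LEXICONS
        ]
        if neighbours:
            resolved.append(Counter(neighbours).most_common(1)[0][0])
        else:
            resolved.append(label)  # no context available; keep as-is
    return resolved


words = "ami aaj bari jacchi today".split()
print([tag_word(w) for w in words])
print(tag_sentence(words))
```

In this toy example the word "aaj" appears in both the Hindi and Bengali lists, so lookup alone cannot decide; the surrounding Bengali words tip the vote. That is the intuition behind using contextual clues, although a trained sequence model would learn such patterns rather than rely on a fixed voting rule.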

Authors

Jennifer Foster

Joachim Wagner

Amitava Das

Utsab Barman
