Dropped Pronoun Recovery in Chinese SMS

Nov 2, 2015

By Chris Giannella, Ransom Winder, Ph.D., Stacy Petersen

Artificial Intelligence Research & Prototyping

In written Chinese, personal pronouns are often dropped, particularly in informal genres like Short Message Service (SMS) messages sent via cell phones. The authors examine a simplified version of dropped pronoun recovery in Chinese SMS messages.

In written Chinese, personal pronouns are commonly dropped when they can be inferred from context. This practice is particularly common in informal genres such as Short Message Service (SMS) messages sent via cell phones. Restoring dropped personal pronouns can be a useful preprocessing step for information extraction. Dropped personal pronoun recovery can be divided into two subtasks: (1) detecting dropped personal pronoun slots and (2) determining the identity of the pronoun for each slot. We address a simpler version of the recovery task in which only the person number of each dropped pronoun is identified. After applying a word segmenter, we used a linear-chain conditional random field (CRF) to predict which words started an independent clause. Then, using this clause start information together with lexical and syntactic information, we applied a CRF or a maximum-entropy classifier to predict whether a dropped personal pronoun immediately preceded each word and, if so, the person number of that pronoun. We conducted a series of experiments using a manually annotated corpus of Chinese SMS messages. Our machine learning approaches substantially outperformed a rule-based approach that drew partially on rules developed by Chung and Gildea (2010). Features derived from parsing did not help our approaches. We conclude that parse information is largely superfluous for identifying dropped personal pronouns if reasonably accurate independent clause start information is available.
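The per-word framing described above (for each word, predict whether a dropped pronoun immediately precedes it and, if so, its person number) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `<pro:N>` annotation marker, the label strings, the `NONE` label, and the particular features are all hypothetical. The sketch only prepares labels and features; it does not train a CRF or maximum-entropy classifier.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical marker for an annotated dropped pronoun, e.g. "<pro:1>".
# The real corpus annotation format is not given in the source.
PRO_MARKER = re.compile(r"^<pro:(\w+)>$")
NO_PRONOUN = "NONE"  # hypothetical label: no dropped pronoun before this word


def to_sequence_labels(annotated: List[str]) -> Tuple[List[str], List[str]]:
    """Turn a segmented, annotated message into (words, labels).

    labels[i] is the person number of a dropped pronoun immediately
    preceding words[i], or NO_PRONOUN if there is none.
    """
    words: List[str] = []
    labels: List[str] = []
    pending = NO_PRONOUN
    for token in annotated:
        match = PRO_MARKER.match(token)
        if match:
            # Remember the marker; it labels the next real word.
            pending = match.group(1)
            continue
        words.append(token)
        labels.append(pending)
        pending = NO_PRONOUN
    # A marker with no following word is ignored in this sketch.
    return words, labels


def word_features(words: List[str], clause_starts: List[bool], i: int) -> Dict[str, object]:
    """Illustrative lexical features plus the clause-start flag for word i.

    clause_starts would come from the first-stage predictor; here it is
    simply passed in.
    """
    return {
        "word": words[i],
        "prev_word": words[i - 1] if i > 0 else "<BOS>",
        "next_word": words[i + 1] if i + 1 < len(words) else "<EOS>",
        "clause_start": clause_starts[i],
    }


# "I arrived, (I am) waiting for you": first-person pronoun dropped before 在.
annotated = ["我", "到", "了", "，", "<pro:1>", "在", "等", "你"]
words, labels = to_sequence_labels(annotated)
clause_starts = [True, False, False, False, True, False, False]
features = [word_features(words, clause_starts, i) for i in range(len(words))]
```

Feature dictionaries and label sequences of this shape are what a sequence labeler would typically consume. Passing the clause-start flag in as an ordinary feature mirrors the abstract's point that the second stage relies on independent clause start information.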