More is not always better: Balancing sense distributions for all-words Word Sense Disambiguation

Marten Postma, Ruben Izquierdo Bevia, Piek Vossen

Research output: Chapter in Book / Report / Conference proceeding, Conference contribution, Academic, peer-review

Abstract

Current Word Sense Disambiguation (WSD) systems perform extremely poorly on low-frequency senses, mainly because of the difference in sense distributions between training and test data. Efforts to tackle this problem have focused on acquiring more data or selecting a single predominant sense, and not necessarily on the meta properties of the data itself. We demonstrate that these properties, such as volume, provenance, and balancing, play an important role in system performance. In this paper, we describe a set of experiments that analyze these meta properties in the framework of a state-of-the-art WSD system evaluated on the SemEval-2013 English all-words dataset. We show that volume and provenance are indeed important, but that approximating the perfect balancing of the selected training data leads to an improvement of 21 points and exceeds state-of-the-art systems by 14 points while using only simple features. We therefore conclude that unsupervised acquisition of training data should be guided by strategies aimed at matching meta properties.

Original language: English
Title of host publication: COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016
Subtitle of host publication: Technical Papers
Publisher: Association for Computational Linguistics, ACL Anthology
Pages: 3496-3506
Number of pages: 11
ISBN (Print): 9784879747020
Publication status: Published - 1 Jan 2016
Event: 26th International Conference on Computational Linguistics, COLING 2016 - Osaka, Japan
Duration: 11 Dec 2016 to 16 Dec 2016

Publication series

Name: COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016: Technical Papers

Conference

Conference: 26th International Conference on Computational Linguistics, COLING 2016
Country/Territory: Japan
City: Osaka
Period: 11/12/16 to 16/12/16
