More is not always better: balancing sense distributions for all-words Word Sense Disambiguation

Research output: Chapter in Book / Report / Conference proceeding; Conference contribution; Academic; peer-reviewed

Abstract

Current Word Sense Disambiguation (WSD) systems perform extremely poorly on low-frequency
senses, mainly because of the difference in sense distributions between training
and test data. Efforts to tackle this problem have focused on acquiring more data or
selecting a single predominant sense, and not necessarily on the meta properties of the data itself.
We demonstrate that these properties, such as volume, provenance, and balancing, play an
important role in system performance. In this paper, we describe a set of experiments
that analyze these meta properties in the framework of a state-of-the-art WSD system
evaluated on the SemEval-2013 English all-words dataset. We show that volume and provenance
are indeed important, but that approximating the perfect balancing of the selected training data
leads to an improvement of 21 points and exceeds state-of-the-art systems by 14 points while
using only simple features. We therefore conclude that unsupervised acquisition of training data
should be guided by strategies aimed at matching meta properties.
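
To make the notion of "balancing" concrete, the following is a minimal sketch, not the authors' procedure (this record does not describe it). It assumes a hypothetical target sense distribution is given and subsamples labelled training instances so that their sense proportions approximate it. The function name, the sense labels, and the no-oversampling policy are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def balance_to_target(instances, target_dist, total):
    """Subsample (sense, example) pairs to approximate target_dist.

    instances:   list of (sense_label, example) pairs (hypothetical format)
    target_dist: dict mapping sense_label -> desired proportion (sums to 1)
    total:       desired size of the balanced training set
    """
    # Group the available examples by their sense label.
    by_sense = defaultdict(list)
    for sense, example in instances:
        by_sense[sense].append(example)

    selected = []
    for sense, share in target_dist.items():
        quota = round(share * total)
        pool = by_sense.get(sense, [])
        # Assumption: never oversample; take at most what is available.
        selected.extend((sense, ex) for ex in pool[:quota])
    return selected


# Toy data: the frequent sense dominates the raw training pool.
raw = [("bank_money", f"m{i}") for i in range(8)] + \
      [("bank_river", f"r{i}") for i in range(2)]
balanced = balance_to_target(raw, {"bank_money": 0.5, "bank_river": 0.5}, total=4)
counts = Counter(sense for sense, _ in balanced)
```

In this toy case the raw pool is split 8 to 2, while the balanced selection holds two instances of each sense. Note that the selection is smaller than the raw pool, which mirrors the title's point that more data is not always better.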

Original language: English
Title of host publication: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Pages: 3496-3506
Number of pages: 11
Publication status: Published - 2016

Cite this

Postma, M. C., Izquierdo, R., & Vossen, P. T. J. M. (2016). More is not always better: balancing sense distributions for all-words Word Sense Disambiguation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (pp. 3496-3506)
