Abstract
State-of-the-art entity linkers achieve high accuracy scores with probabilistic methods. However,
these scores should be considered in relation to the properties of the datasets they are evaluated
on. Until now, there has not been a systematic investigation of the properties of entity linking
datasets and their impact on system performance. In this paper we report on a series of hypotheses
regarding long-tail phenomena in entity linking datasets, their interaction, and their impact
on system performance. Our systematic study of these hypotheses shows that evaluation datasets
mainly capture head entities and only incidentally cover data from the tail, thus encouraging
systems to overfit to popular/frequent and non-ambiguous cases. We find the most difficult cases
of entity linking among the infrequent candidates of ambiguous forms. With our findings, we
hope to inspire future designs of both entity linking systems and evaluation datasets. To support
this goal, we provide a list of recommended actions for better inclusion of tail cases.
Original language | English |
---|---|
Title of host publication | Proceedings of the International Conference on Computational Linguistics (COLING 2018) |
Place of Publication | Santa Fe |
Publisher | International Conference on Computational Linguistics (COLING) |
Pages | 664-674 |
Number of pages | 11 |
ISBN (Print) | 9781948087506 |
Publication status | Published - 2018 |
Event | 27th International Conference on Computational Linguistics COLING 2018 - Santa Fe, NM. Duration: 20 Aug 2018 → 26 Aug 2018. Conference number: 27 |
Conference
Conference | 27th International Conference on Computational Linguistics COLING 2018 |
---|---|
Abbreviated title | COLING 2018 |
City | Santa Fe, NM |
Period | 20/08/18 → 26/08/18 |