Measuring the Diversity of Automatic Image Descriptions

C.W.J. van Miltenburg, Desmond Elliott, P.T.J.M. Vossen

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Automatic image description systems typically produce generic sentences that only make use of
a small subset of the vocabulary available to them. In this paper, we consider the production
of generic descriptions as a lack of diversity in the output, which we quantify using established
metrics and two new metrics that frame image description as a word recall task. This framing
allows us to evaluate system performance on the head of the vocabulary, as well as on the long
tail, where system performance degrades. We use these metrics to examine the diversity of the
sentences generated by nine state-of-the-art systems on the MS COCO data set. We find that
the systems trained with maximum likelihood objectives produce less diverse output than those
trained with additional adversarial objectives. However, the adversarially-trained models only
produce more types from the head of the vocabulary and not the tail. Besides vocabulary-based
methods, we also look at the compositional capacity of the systems, specifically their ability to
create compound nouns and prepositional phrases of different lengths. We conclude that there
is still much room for improvement, and offer a toolkit to measure progress towards the goal of
generating more diverse image descriptions.
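The abstract describes diversity in terms of vocabulary use and frames image description as a word recall task, evaluated separately on the head and the long tail of the vocabulary. Below is a minimal Python sketch of that idea. It rests on simplified, illustrative definitions: a naive regex tokenizer, "global recall" taken as the share of reference word types that a system produces at least once, and a head/tail split by reference frequency rank with an arbitrary `head_size`. All function names and the toy data are hypothetical. This is not the authors' toolkit, and it does not reproduce the paper's exact metric definitions.

```python
import re
from collections import Counter


def tokenize(sentence: str) -> list[str]:
    # Naive lowercase tokenizer keeping runs of ASCII letters (an assumption;
    # the paper's actual preprocessing may differ).
    return re.findall(r"[a-z]+", sentence.lower())


def vocabulary(sentences: list[str]) -> set[str]:
    # Set of word types used anywhere in a list of sentences.
    return {tok for s in sentences for tok in tokenize(s)}


def diversity_stats(generated: list[str]) -> dict:
    # Basic vocabulary-based diversity measures over all generated descriptions:
    # token count, type count, and type-token ratio (TTR).
    tokens = [tok for s in generated for tok in tokenize(s)]
    n_types = len(set(tokens))
    ttr = n_types / len(tokens) if tokens else 0.0
    return {"tokens": len(tokens), "types": n_types, "ttr": ttr}


def global_recall(generated: list[str], references: list[str]) -> float:
    # Share of reference word types that the system produces at least once
    # (simplified, illustrative definition).
    ref_types = vocabulary(references)
    if not ref_types:
        return 0.0
    return len(vocabulary(generated) & ref_types) / len(ref_types)


def recall_by_frequency(
    generated: list[str], references: list[str], head_size: int
) -> tuple[float, float]:
    # Split the reference vocabulary by frequency rank into a "head" (the
    # head_size most frequent types) and a "tail" (all remaining types),
    # then compute recall separately for each part.
    counts = Counter(tok for s in references for tok in tokenize(s))
    ranked = [word for word, _ in counts.most_common()]
    head, tail = set(ranked[:head_size]), set(ranked[head_size:])
    gen_types = vocabulary(generated)

    def recall(vocab: set[str]) -> float:
        return len(gen_types & vocab) / len(vocab) if vocab else 0.0

    return recall(head), recall(tail)


# Toy data, invented purely for illustration.
generated = ["a man riding a horse", "a man riding a horse", "a dog on a couch"]
references = [
    "a man rides a brown horse on a beach",
    "a small dog sleeping on a red couch",
]

print(diversity_stats(generated))
print(global_recall(generated, references))  # → 0.5
print(recall_by_frequency(generated, references, head_size=2))  # → (1.0, 0.4)
```

In the toy example, the generated output covers both head types ("a", "on") but only 4 of the 10 tail types. This mirrors, on a tiny scale, the head-versus-tail contrast that the abstract describes. The cutoff between head and tail is a free parameter in this sketch.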
Language: English
Title of host publication: Proceedings of the International Conference on Computational Linguistics (COLING 2018)
Publisher: International Conference on Computational Linguistics (COLING)
Pages: 1730-1741
Number of pages: 12
ISBN (Print): 9781948087506
Publication status: Published - 2018
Event: 27th International Conference on Computational Linguistics COLING 2018 - Santa Fe, NM
Duration: 20 Aug 2018 - 26 Aug 2018
Conference number: 27

Conference

Conference: 27th International Conference on Computational Linguistics COLING 2018
Abbreviated title: COLING 2018
City: Santa Fe, NM
Period: 20/08/18 - 26/08/18

Cite this

van Miltenburg, C. W. J., Elliott, D., & Vossen, P. T. J. M. (2018). Measuring the Diversity of Automatic Image Descriptions. In Proceedings of the International Conference on Computational Linguistics (COLING 2018) (pp. 1730-1741). [C18-1147] International Conference on Computational Linguistics (COLING).
@inproceedings{6aea3ca376d74ae5a8b4d2e532f946b5,
title = "Measuring the Diversity of Automatic Image Descriptions",
abstract = "Automatic image description systems typically produce generic sentences that only make use of a small subset of the vocabulary available to them. In this paper, we consider the production of generic descriptions as a lack of diversity in the output, which we quantify using established metrics and two new metrics that frame image description as a word recall task. This framing allows us to evaluate system performance on the head of the vocabulary, as well as on the long tail, where system performance degrades. We use these metrics to examine the diversity of the sentences generated by nine state-of-the-art systems on the MS COCO data set. We find that the systems trained with maximum likelihood objectives produce less diverse output than those trained with additional adversarial objectives. However, the adversarially-trained models only produce more types from the head of the vocabulary and not the tail. Besides vocabulary-based methods, we also look at the compositional capacity of the systems, specifically their ability to create compound nouns and prepositional phrases of different lengths. We conclude that there is still much room for improvement, and offer a toolkit to measure progress towards the goal of generating more diverse image descriptions.",
author = "{van Miltenburg}, C.W.J. and Desmond Elliott and P.T.J.M. Vossen",
year = "2018",
language = "English",
isbn = "9781948087506",
pages = "1730--1741",
booktitle = "Proceedings of the International Conference on Computational Linguistics (COLING 2018)",
publisher = "International Conference on Computational Linguistics (COLING)",

}

TY - GEN

T1 - Measuring the Diversity of Automatic Image Descriptions

AU - van Miltenburg, C.W.J.

AU - Elliott, Desmond

AU - Vossen, P.T.J.M.

PY - 2018

Y1 - 2018

AB - Automatic image description systems typically produce generic sentences that only make use of a small subset of the vocabulary available to them. In this paper, we consider the production of generic descriptions as a lack of diversity in the output, which we quantify using established metrics and two new metrics that frame image description as a word recall task. This framing allows us to evaluate system performance on the head of the vocabulary, as well as on the long tail, where system performance degrades. We use these metrics to examine the diversity of the sentences generated by nine state-of-the-art systems on the MS COCO data set. We find that the systems trained with maximum likelihood objectives produce less diverse output than those trained with additional adversarial objectives. However, the adversarially-trained models only produce more types from the head of the vocabulary and not the tail. Besides vocabulary-based methods, we also look at the compositional capacity of the systems, specifically their ability to create compound nouns and prepositional phrases of different lengths. We conclude that there is still much room for improvement, and offer a toolkit to measure progress towards the goal of generating more diverse image descriptions.

UR - https://aclanthology.info/events/coling-2018

M3 - Conference contribution

SN - 9781948087506

SP - 1730

EP - 1741

BT - Proceedings of the International Conference on Computational Linguistics (COLING 2018)

PB - International Conference on Computational Linguistics (COLING)

ER -
