
CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings

Research output: Contribution to Journal › Article › Academic › peer-review

Abstract

This paper presents CoLLIE, a simple yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model in which language and images are projected into the same semantic space (in this case, CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. It does so by predicting the difference vector to be applied, together with a scaling factor for this vector, so that the adjustment is applied only when needed. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also generalize to similar language use and leverage semantic compositionality. We verify the model's performance on two different tasks of identifying the targets of referring expressions, where it has to learn new language use. The results show that the model can efficiently learn and generalize from only a few examples, with little interference with the model's original zero-shot performance.
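
The adjustment described in the abstract can be read as adjusted = e + s(e) * d(e), where e is a language embedding, d(e) the predicted difference vector, and s(e) the predicted scaling factor. The sketch below is only an illustration of that reading, not the authors' implementation: the names `adjust`, `pick_target`, `toy_delta`, and `toy_scale` are hypothetical, the 3-dimensional vectors stand in for CLIP embeddings, and the two predictors are hand-written rather than learned.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def adjust(text_emb, predict_delta, predict_scale):
    """Return text_emb + scale * delta (assumed form of the adjustment)."""
    delta = predict_delta(text_emb)   # difference vector to apply
    scale = predict_scale(text_emb)   # how strongly to apply it
    return [t + scale * d for t, d in zip(text_emb, delta)]


def pick_target(text_emb, image_embs):
    """Index of the image embedding most similar to the text embedding."""
    return max(range(len(image_embs)),
               key=lambda i: cosine(text_emb, image_embs[i]))


# Toy stand-ins for embeddings (hypothetical values, not CLIP outputs).
NEW_PHRASE = [1.0, 0.0, 0.0]    # phrase whose use has been retaught
KNOWN_PHRASE = [0.0, 1.0, 0.0]  # phrase that should stay untouched
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]


def toy_delta(text_emb):
    # Hand-written stand-in for a learned predictor: moves NEW_PHRASE
    # from image 0 towards image 2.
    return [-1.0, 0.0, 1.0]


def toy_scale(text_emb):
    # Hand-written stand-in: apply the shift only near the retaught phrase.
    return 1.0 if cosine(text_emb, NEW_PHRASE) > 0.9 else 0.0


adjusted_new = adjust(NEW_PHRASE, toy_delta, toy_scale)
adjusted_known = adjust(KNOWN_PHRASE, toy_delta, toy_scale)

print(pick_target(NEW_PHRASE, images), pick_target(adjusted_new, images))
print(pick_target(KNOWN_PHRASE, images), pick_target(adjusted_known, images))
```

In this toy setting the scaling factor is what keeps unrelated phrases on their original embeddings, which is the role the abstract assigns to it: the adjustment is applied only when needed.
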
Original language: English
Pages (from-to): 1201-1223
Journal: Journal of Artificial Intelligence Research
Volume: 74
Publication status: Published - 2022
Externally published: Yes

Funding

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. The authors would like to thank Johan Boye, Ulme Wennberg, Dmytro Kalpakchi, and the anonymous reviewers for their helpful comments.

Funders
Knut och Alice Wallenbergs Stiftelse
