MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning

Yifan Jiang, Jiarui Zhang, Kexuan Sun, Zhivar Sourati, Kian Ahrabian, Kaixin Ma, Filip Ilievski, Jay Pujara

Research output: Chapter in Book / Report / Conference proceedingConference contributionAcademicpeer-review

23 Downloads (Pure)

Abstract

While multi-modal large language models (MLLMs) have shown significant progress across popular visual reasoning benchmarks, whether they possess abstract visual reasoning abilities remains an open question. Similar to the Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns (e.g., repetition constraints on numbers) that control the input shapes (e.g., digits) in a specific task configuration (e.g., matrix). However, existing AVR benchmarks only consider a limited set of patterns (addition, conjunction), input shapes (rectangle, square), and task configurations (3 × 3 matrices). And they fail to capture all abstract reasoning patterns in human cognition necessary for addressing real-world tasks, such as geometric properties and object boundary understanding in real-world navigation. To evaluate MLLMs' AVR abilities systematically, we introduce MARVEL founded on the core knowledge system in human cognition, a multidimensional AVR benchmark with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations. To inspect whether the model performance is grounded in perception or reasoning, MARVEL complements the standard AVR question with perception questions in a hierarchical evaluation framework. We conduct comprehensive experiments on MARVEL with ten representative MLLMs in zero-shot and few-shot settings. Our experiments reveal that all MLLMs show near-random performance on MARVEL, with significant performance gaps (40%) compared to humans across all patterns and task configurations. Further analysis of perception questions reveals that MLLMs struggle to comprehend the visual features (near-random performance). Although closed-source MLLMs, such as GPT-4V, show a promising understanding of reasoning patterns (on par with humans) after adding textual descriptions, this advantage is hindered by their weak perception abilities. We release our entire code and dataset at https://github.com/1171-jpg/MARVEL_AVR.

Original languageEnglish
Title of host publicationAdvances in Neural Information Processing Systems 37 (NeurIPS 2024)
Subtitle of host publication[Proceedings]
EditorsA. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, C. Zhang
PublisherNeurIPS
Pages1-26
Number of pages26
ISBN (Electronic)9798331314385
Publication statusPublished - 2024
Event38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada
Duration: 9 Dec 202415 Dec 2024

Conference

Conference38th Conference on Neural Information Processing Systems, NeurIPS 2024
Country/TerritoryCanada
CityVancouver
Period9/12/2415/12/24

Bibliographical note

Publisher Copyright:
© 2024 Neural information processing systems foundation. All rights reserved.

Funding

We appreciate Fred Morstatter for very helpful comments. We thank Tian Jin and Jiachi Liang for their assistance in data collection. This research was sponsored by the Defense Advanced Research Projects Agency via Contract HR00112390061.

FundersFunder number
Defense Advanced Research Projects AgencyHR00112390061

    Fingerprint

    Dive into the research topics of 'MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning'. Together they form a unique fingerprint.

    Cite this