Abstract
While multi-modal large language models (MLLMs) have shown significant progress on popular visual reasoning benchmarks, whether they possess abstract visual reasoning abilities remains an open question. Similar to Sudoku puzzles, abstract visual reasoning (AVR) problems require finding high-level patterns (e.g., repetition constraints on numbers) that control the input shapes (e.g., digits) in a specific task configuration (e.g., matrix). However, existing AVR benchmarks consider only a limited set of patterns (addition, conjunction), input shapes (rectangle, square), and task configurations (3 × 3 matrices), and they fail to capture all of the abstract reasoning patterns in human cognition that are necessary for addressing real-world tasks, such as understanding geometric properties and object boundaries in real-world navigation. To evaluate MLLMs' AVR abilities systematically, we introduce MARVEL, a multidimensional AVR benchmark founded on the core knowledge system in human cognition, with 770 puzzles composed of six core knowledge patterns, geometric and abstract shapes, and five different task configurations. To inspect whether model performance is grounded in perception or reasoning, MARVEL complements the standard AVR question with perception questions in a hierarchical evaluation framework. We conduct comprehensive experiments on MARVEL with ten representative MLLMs in zero-shot and few-shot settings. Our experiments reveal that all MLLMs show near-random performance on MARVEL, with significant performance gaps (40%) compared to humans across all patterns and task configurations. Further analysis of perception questions reveals that MLLMs struggle to comprehend the visual features (near-random performance). Although closed-source MLLMs, such as GPT-4V, show a promising understanding of reasoning patterns (on par with humans) once textual descriptions are added, this advantage is hindered by their weak perception abilities.
We release our entire code and dataset at https://github.com/1171-jpg/MARVEL_AVR.
| Original language | English |
|---|---|
| Title of host publication | Advances in Neural Information Processing Systems 37 (NeurIPS 2024) |
| Subtitle of host publication | [Proceedings] |
| Editors | A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, C. Zhang |
| Publisher | NeurIPS |
| Pages | 1-26 |
| Number of pages | 26 |
| ISBN (Electronic) | 9798331314385 |
| Publication status | Published - 2024 |
| Event | 38th Conference on Neural Information Processing Systems, NeurIPS 2024 - Vancouver, Canada. Duration: 9 Dec 2024 → 15 Dec 2024 |
Conference
| Conference | 38th Conference on Neural Information Processing Systems, NeurIPS 2024 |
|---|---|
| Country/Territory | Canada |
| City | Vancouver |
| Period | 9/12/24 → 15/12/24 |
Bibliographical note
Publisher Copyright: © 2024 Neural Information Processing Systems Foundation. All rights reserved.
Funding
We are grateful to Fred Morstatter for very helpful comments. We thank Tian Jin and Jiachi Liang for their assistance in data collection. This research was sponsored by the Defense Advanced Research Projects Agency via Contract HR00112390061.
| Funders | Funder number |
|---|---|
| Defense Advanced Research Projects Agency | HR00112390061 |