Neurosymbolic Visual Reasoning with Scene Graphs and Multimodal LLMs

Filip Ilievski, M. Jaleed Khan, Edward Curry

Research output: Chapter in Book / Report / Conference proceeding, Chapter, Academic, peer-review

Abstract

This chapter explores the advancements and challenges in achieving comprehensive scene understanding and visual reasoning through neurosymbolic integration and Multimodal Large Language Models (MLLMs). It begins by highlighting the limitations of basic vision tasks in extracting contextual and relational information from scenes, introducing scene graphs as a structured representation to bridge this gap. The chapter delves into Scene Graph Generation (SGG) methods, emphasising the importance of incorporating common sense knowledge from knowledge graphs to enhance the accuracy and expressiveness of scene graphs. The NeuSyRE framework is presented as a neurosymbolic approach for enriched scene graph generation and reasoning, demonstrating its effectiveness in downstream tasks such as image captioning and visual question answering. The chapter also examines the role of MLLMs in visual reasoning, discussing their architectures, performance on zero-shot tasks and challenges in handling fine-grained visual details. MARVEL, a novel benchmark for abstract visual reasoning, is introduced to evaluate the perceptual and reasoning capabilities of MLLMs. Insights from MARVEL highlight the limitations of current MLLMs in solving complex reasoning tasks and underscore the potential of neurosymbolic systems to address these challenges. The chapter concludes by emphasising the synergy between neurosymbolic approaches and MLLMs in advancing visual intelligence and achieving robust and explainable AI systems.
Original language: English
Title of host publication: Handbook on Neurosymbolic AI and Knowledge Graphs
Editors: Pascal Hitzler, Abhilekha Dalal, Mohammad Saeid Mahdavinejad, Sanaz Saki Norouzi
Publisher: IOS Press
Pages: 689-711
Number of pages: 23
ISBN (Electronic): 9781643685793
ISBN (Print): 9781643685786
Publication status: Published - 2025

Publication series

Name: Frontiers in Artificial Intelligence and Applications
Publisher: IOS Press
Volume: 400
