A Survey on Multimodal Reasoning

A Comprehensive Survey on Multimodal Reasoning

Reasoning categories

In this paper, we focus on the reasoning abilities of Multimodal Large Language Models (MLLMs). The reasoning methods these models employ fall under the category of informal reasoning, primarily because they use natural language to articulate the steps and conclusions of the reasoning process and tolerate a certain degree of inaccuracy in their reasoning mechanisms. We concentrate on three reasoning types: deductive, abductive, and analogical reasoning. These types are highlighted because they are prevalent in real-world reasoning tasks, particularly within the scope of current MLLMs.

Summary of MLLMs

A brief timeline of recent developments in MLLMs.
Summary of evaluation benchmarks.
Comparison of training stages across different models.

Conclusion and Future Directions

Reasoning ability is pivotal in the quest for Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI), a goal that has been pursued across various scientific disciplines for several decades. With recent advances in both models and evaluation benchmarks, there is growing discussion of the reasoning abilities exhibited by current LLMs and MLLMs. In this work, we present the different types of reasoning and discuss the models, data, and evaluation methods used in existing studies to measure and understand these abilities. Our survey aims to clarify where this research direction currently stands, and we hope it inspires further exploration of the reasoning abilities of future models.

