# InfiMM
*Advancing Multimodal Understanding from Flamingo's Legacy through Diverse LLM Integration*
## Overview
InfiMM, inspired by the Flamingo architecture, sets itself apart through its training data and its integration of diverse large language models (LLMs). This approach allows InfiMM to retain the core strengths of Flamingo while offering enhanced capabilities. As an open-source model, InfiMM emphasizes accessibility and adaptability, driven by community collaboration. It is more than an emulation of Flamingo; it is an innovation in visual language processing.
Our model is another attempt to reproduce the results reported in DeepMind's paper “Flamingo: a Visual Language Model for Few-Shot Learning”. Compared with previous open-source attempts (OpenFlamingo and IDEFICS), InfiMM offers more flexible model choices, allowing for a wide range of applications. In particular, InfiMM integrates the latest LLMs into the VLM domain and reveals the impact of LLMs of different scales and architectures.
Please note that InfiMM is currently in the beta stage, and we are continuously working to improve it.
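
For readers unfamiliar with the Flamingo design, the sketch below shows the general pattern Flamingo-style VLMs follow: visual features from a frozen vision encoder are injected into a (largely frozen) LLM through gated cross-attention layers whose gates start at zero. This is a minimal, hypothetical PyTorch sketch of that pattern only; the module names, dimensions, and gating details are illustrative assumptions, not InfiMM's actual implementation.

```python
import torch
import torch.nn as nn


class GatedCrossAttentionBlock(nn.Module):
    """Minimal sketch of a Flamingo-style gated cross-attention layer.

    Text hidden states attend to visual tokens; tanh gates initialized to
    zero let the frozen LLM start out unchanged and learn to use vision
    features gradually. Illustrative only, not InfiMM's actual code.
    """

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Zero-initialized gates preserve the pretrained LLM's behavior at the start of training.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, text_len, d_model); visual_tokens: (batch, num_visual_tokens, d_model)
        attn_out, _ = self.cross_attn(text_hidden, visual_tokens, visual_tokens)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(x)
        return x
```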
## Model Details
- Developed by: Institute of Automation, Chinese Academy of Sciences and ByteDance
- Model Type: Visual Language Model (VLM)
- Language: English
- LLMs: Zephyr, LLaMA2-13B, Vicuna-13B
- Vision Model: EVA CLIP
- License: see License section
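
A minimal usage sketch with Hugging Face transformers is shown below. The repository id, the `<image>` placeholder, and the processor behavior are assumptions for illustration; please refer to the official InfiMM repository for the exact loading and inference code.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# NOTE: placeholder repo id -- check the official InfiMM release for the exact
# checkpoint name; trust_remote_code is needed because the model uses custom code.
repo_id = "Infi-MM/infimm-zephyr"

processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

image = Image.open("example.jpg")
# Assumed prompt format with an <image> placeholder; the real template may differ.
prompt = "<image>Question: What is shown in this image? Answer:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=32)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```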
## Evaluation
### Pretraining Evaluation
We evaluate the pretrained models on the following downstream tasks: image captioning and VQA. We also compare our results with IDEFICS. The Shots column indicates the number of in-context examples provided at evaluation time; a sketch of how such prompts are assembled follows the table.

Model | Shots | COCO CIDEr | Flickr30K CIDEr | VQA v2 Acc | TextVQA Acc | OK-VQA Acc |
---|---|---|---|---|---|---|
IDEFICS-9B | 0 | 46 | 27.3 | 50.9 | 25.9 | 38.4 |
IDEFICS-9B | 4 | 93 | 59.7 | 55.4 | 27.6 | 45.5 |
IDEFICS-80B | 0 | 91.8 | 53.7 | 60 | 30.9 | 45.2 |
IDEFICS-80B | 4 | 110.3 | 73.7 | 64.6 | 34.4 | 52.4 |
InfiMM-Zephyr-7B | 0 | 78.8 | 60.7 | 33.7 | 15.2 | 17.1 |
InfiMM-Zephyr-7B | 4 | 108.6 | 71.9 | 59.1 | 34.3 | 50.5 |
InfiMM-Llama2-13B | 0 | 85.4 | 54.6 | 51.6 | 24.2 | 26.4 |
InfiMM-Llama2-13B | 4 | 125.2 | 87.1 | 66.1 | 38.2 | 55.5 |
InfiMM-Vicuna-13B | 0 | 69.6 | 49.6 | 60.4 | 32.8 | 49.2 |
InfiMM-Vicuna-13B | 4 | 118.1 | 81.4 | 64.2 | 38.4 | 53.7 |
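
In the table above, Shots is the number of in-context examples prepended to the query; 0-shot uses the query alone. The snippet below is a hypothetical illustration of how such an interleaved image-text prompt could be assembled. The `<image>` placeholder and the question/answer template are assumptions, not the actual evaluation harness.

```python
def build_few_shot_prompt(support_examples, query_question, num_shots=4):
    """Assemble an interleaved image-text prompt for few-shot VQA evaluation.

    Each support example is a (question, answer) pair whose image is referenced
    by an `<image>` placeholder; the query is appended with an empty answer slot
    for the model to complete. The template is illustrative only.
    """
    segments = [
        f"<image>Question: {question} Short answer: {answer}"
        for question, answer in support_examples[:num_shots]
    ]
    segments.append(f"<image>Question: {query_question} Short answer:")
    return "\n".join(segments)


# 2-shot example; 0-shot evaluation would pass num_shots=0 and keep only the query.
prompt = build_few_shot_prompt(
    [("What color is the bus?", "red"), ("How many dogs are there?", "two")],
    "What is the man holding?",
    num_shots=2,
)
```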
### Instruction Fine-Tuning (IFT) Evaluation
In our analysis, we concentrate on two primary benchmark types for evaluating MLLMs: 1) multiple-choice question answering (QA) and 2) open-ended evaluation. We have observed that the evaluation metrics for tasks such as Visual Question Answering (VQA) and TextVQA are overly sensitive to exact answer matches. This can be misleading, particularly when models provide synonymous but technically accurate responses; the scoring rule sketched below illustrates the issue. These metrics have therefore been omitted from our comparison for a more precise assessment. The evaluation results are shown in the table that follows.
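
To make the exact-match issue concrete: the standard VQA accuracy metric scores a prediction as min(number of annotators who gave that exact answer / 3, 1), so a synonymous answer that no annotator typed verbatim receives little or no credit. The helper below is a simplified sketch of that rule; the official metric additionally normalizes answers and averages over annotator subsets, which is omitted here.

```python
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(# annotators with this exact answer / 3, 1).

    The official implementation also lowercases answers, strips articles and
    punctuation, and averages over annotator subsets; this sketch keeps only
    the core exact-match rule to show why synonyms are penalized.
    """
    matches = sum(answer == prediction for answer in human_answers)
    return min(matches / 3.0, 1.0)


# "sofa" and "couch" are synonymous, but only the exact string earns full credit.
annotations = ["sofa"] * 9 + ["couch"]
print(vqa_accuracy("sofa", annotations))   # 1.0
print(vqa_accuracy("couch", annotations))  # 0.333... despite being a correct answer
```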

Model | ScienceQA-Img | MME | MM-VET | InfiMM-Eval | MMBench | MMMU-Val | MMMU-Test |
---|---|---|---|---|---|---|---|
Otter-9B | - | P: 1292 C: 306 | 24.6 | 32.2 | - | 22.69 | - |
IDEFICS-9B-Instruct | 60.6 | - | - | - | - | 24.53 | - |
InfiMM-Zephyr-7B | 71.1 | P: 1406 C: 327 | 32.8 | 36.0 | 59.7 | 39.4 | 35.5 |
InfiMM-Llama2-13B | 73.0 | P: 1444.5 C: 337.6 | 39.2 | 0.4559/0.414 | 66.4 | 39.1 | 35.2 |
InfiMM-Vicuna-13B | 74.0 | P: 1461.2 C: 323.5 | 36.0 | 40.0 | 66.7 | 37.6 | 34.6 |