InfiMM-HD
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
Overview
InfiMM-HD is a novel architecture designed to process images of different resolutions with low computational overhead, which makes it practical to extend multimodal large language models (MLLMs) to higher-resolution inputs. To reduce computation costs, InfiMM-HD combines a cross-attention module with visual windows. Paired with a four-stage training pipeline, this design lets our model attain improved visual perception efficiently and cost-effectively. Our empirical study underscores the robustness and effectiveness of InfiMM-HD and opens new avenues for exploration in related areas.
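The two ideas named above, visual windows and cross-attention, can be illustrated with a minimal NumPy sketch. This is not the InfiMM-HD implementation: the function names, the zero-padding strategy, the single-head attention without learned projections, and the window size of 448 used in the usage note are all assumptions made for illustration.

```python
import numpy as np


def split_into_windows(image, window):
    """Split an (H, W, C) image into non-overlapping (window, window, C) tiles.

    Assumption: the bottom and right edges are zero-padded so that H and W
    become multiples of `window`. The real model may resize or pad differently.
    """
    h, w, c = image.shape
    pad_h = (-h) % window
    pad_w = (-w) % window
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))
    rows = padded.shape[0] // window
    cols = padded.shape[1] // window
    # (rows, window, cols, window, C) -> (rows, cols, window, window, C)
    tiles = padded.reshape(rows, window, cols, window, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(rows * cols, window, window, c)


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: (T, d) text hidden states; keys, values: (V, d) visual tokens.
    Returns a (T, d) array in which each text position is a weighted
    average of the visual tokens.
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values
```

As a usage example, `split_into_windows(np.zeros((500, 700, 3)), 448)` returns an array of shape `(4, 448, 448, 3)`. Each window can then be encoded independently at a fixed resolution. With cross-attention, the text sequence attends to the resulting visual tokens rather than having them appended to it, so the language model's sequence length does not grow with image resolution.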
Model Details
- Developed by: Institute of Automation, Chinese Academy of Sciences and ByteDance
- Model Type: Visual Language Model (VLM)
- Language: English