Instruction-aware Visual Feature Extraction for Multimodal Large Language Model

Tags
Visual Instruction Tuning
Efficient Training
Published
January 1, 2025
Authors
Jinhong Tu, Erdong Chen, Shuhan Zhang
Abstract
We propose an architecture for Multimodal Large Language Models (MLLMs) that enhances instruction-aware visual feature extraction. Our method introduces a novel adapter module that iteratively pools visual tokens under guidance from the text prompt, allowing the model to selectively attend to relevant image regions. This approach aims to improve efficiency by reducing redundant visual information while maintaining or enhancing performance on vision-language tasks. We will evaluate our method on standard zero-shot benchmarks for visual question answering and reasoning, and will verify the efficacy of our design by ablating the instruction-aware adapter, showing that it maintains comparable performance while reducing inference cost. Our work aims to contribute to more computationally efficient MLLMs, and we will report further results as we refine and evaluate the method across diverse multimodal benchmarks.
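To make the idea of instruction-guided pooling concrete, here is a minimal PyTorch sketch of one possible adapter of this kind: learnable query tokens first read the instruction embeddings via cross-attention, then attend to the visual tokens, so that only instruction-relevant visual information is passed to the LLM. The module name, dimensions, and the single pooling round shown here are illustrative assumptions, not the authors' implementation (the abstract describes an iterative variant).

```python
# Illustrative sketch only: an instruction-conditioned cross-attention pooler.
# Names, sizes, and the single-round pooling are assumptions for clarity.
import torch
import torch.nn as nn


class InstructionAwarePooler(nn.Module):
    """Pools N visual tokens down to K tokens, conditioned on the instruction."""

    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        # Learnable query tokens that will carry the pooled visual information.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        # Queries first read the instruction, then attend to the visual tokens.
        self.text_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_out = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, dim) from the vision encoder
        # text_tokens:   (B, T, dim) instruction embeddings
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)           # (B, K, dim)
        # Condition the queries on the instruction via cross-attention.
        q = self.norm_q(q + self.text_attn(q, text_tokens, text_tokens)[0])
        # Instruction-conditioned queries select relevant visual regions.
        pooled = self.visual_attn(q, visual_tokens, visual_tokens)[0]
        return self.norm_out(q + pooled)                          # (B, K, dim)


if __name__ == "__main__":
    pooler = InstructionAwarePooler()
    vis = torch.randn(2, 576, 768)   # e.g. a 24x24 patch grid from a ViT encoder
    txt = torch.randn(2, 20, 768)    # instruction embeddings from the text side
    print(pooler(vis, txt).shape)    # torch.Size([2, 32, 768]) -> fed to the LLM
```

Because the LLM then consumes only the K pooled tokens instead of all N visual tokens, this kind of design is what allows inference cost to drop while the instruction still determines which image regions are represented.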