Publications
Exploring Perceptual Limitations of Multimodal LLMs on Small Visual Objects
Abstract
Multimodal Large Language Models (MLLMs) have recently achieved remarkable performance on various multimodal benchmarks. However, general benchmarks often fail to reveal specific limits of their visual perception because they lack controllability. In this work, we quantitatively study the perception of small visual objects in several widely used MLLMs and reveal a pervasive limitation in answering questions about small objects in images. We then conduct a controlled study of MLLMs' perception, using text reading as a surrogate task for general visual perception, to understand how the quality, size, distractors, and location of an object can independently affect the ability of MLLMs to perceive it in images. Through this controlled study, we find that lower object quality, smaller object size, and the presence of visual distractors can each independently reduce MLLMs' ability to answer visual questions.
- Date
- 2026
- Authors
- Jiarui Zhang, Jinyi Hu, Mahyar Khayatkhoei, Filip Ilievski, Maosong Sun
- Journal
- Transactions on Machine Learning Research