Are 'visual' AI models actually blind?

The latest round of language models, such as GPT-4o and Gemini 1.5 Pro, are described as “multimodal” because they can take images and audio as input in addition to text. A new study, however, suggests that these models may not actually “see” in the way humans do. The researchers presented current AI models with simple visual tasks that even a first-grader could easily solve, such as identifying overlapping shapes or counting interlocking circles. The models struggled with them, which exposes the limits of their visual reasoning.

The researchers found that the models’ responses were inconsistent and often inaccurate, which points to a lack of true visual understanding. The models seemed to rely on patterns in their training data rather than on genuine visual comprehension. For example, they performed well when a task involved a specific, familiar image, such as the Olympic Rings, but struggled with variations of the same task.
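
To make the idea of "variations of the same task" concrete, here is a minimal sketch of how circle-counting and overlap test cases with known answers could be generated. It is not the researchers' code: the function names (`circles_overlap`, `interlocking_row`, `make_case`), the question wording, and the layout parameters are all hypothetical, and rendering the circles to an image is left out.

```python
import math
import random


def circles_overlap(c1, c2):
    """Return True if two circles, each given as (x, y, r), touch or overlap."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    # Two circles touch or overlap when the distance between their
    # centers is no greater than the sum of their radii.
    return math.hypot(x2 - x1, y2 - y1) <= r1 + r2


def interlocking_row(n, radius=1.0, spacing=1.5):
    """Lay out n equal circles in a row.

    With the default spacing (1.5 radii between centers), each circle
    overlaps its immediate neighbors but not circles two steps away.
    """
    return [(i * spacing * radius, 0.0, radius) for i in range(n)]


def make_case(rng, min_n=5, max_n=9):
    """Build one hypothetical counting task with a known ground truth."""
    n = rng.randint(min_n, max_n)
    return {
        "question": "How many circles are in the image?",  # assumed wording
        "circles": interlocking_row(n),
        "answer": n,
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the cases are reproducible
    for _ in range(3):
        case = make_case(rng)
        print(case["answer"], len(case["circles"]))
```

Because the geometry is generated rather than collected, the correct answer is known exactly for every case, so a model's reply can be scored without human labeling, and the count can be varied away from familiar configurations.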

The study suggested that the models may not be blind in the traditional sense, but that they lack the ability to make accurate visual judgments. Their responses were described as approximate and abstract, reflecting a limited grasp of visual concepts. While these models excel at certain tasks, such as identifying human actions or everyday objects, they fall short on basic visual reasoning.

Overall, the study emphasized the need to understand the limitations of these “visual” AI models rather than relying solely on marketing claims. The models may excel in specific areas, but they perform poorly on tasks that require true visual understanding. Research like this is essential for an accurate assessment of what these models can do.

