Where Do Vision Language Models (VLMs) Fall Short?


BlindTest: A New Benchmark for Probing the Limits of VLMs on Simple Visual Tasks

In the past 8 months, the emergence of Vision Language Models (VLMs) like GPT-4V has led to a surge in image-text processing applications. These models can accurately identify objects in a scene and perform complex tasks, such as calculating the cost of the beer on a table from images of the scene and a menu. However, the paper we examine today reveals surprising limitations of VLMs on certain very simple tasks, raising the question of whether these models perceive images the way humans do.

BlindTest: Exposing the Limitations of VLMs

This paper introduces a set of seven visual tasks called BlindTest. The tasks are trivial for humans yet pose significant challenges to the latest VLMs; examples include judging whether two circles overlap and counting the number of shapes in an image.

Task 1: Counting Intersection Points

In this task, the models are asked how many times two linear functions plotted in an image intersect. GPT-4o, Gemini-1.5 Pro, Claude-3 Sonnet, and Claude-3.5 Sonnet achieved accuracy rates of 48.67%, 69.67%, 64.00%, and 77.33%, respectively. These results show that VLMs struggle to reliably count how many times two lines intersect, even though the answer is obvious to a human at a glance.
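To make the task concrete, here is a minimal sketch of how the ground-truth intersection count for such a plot could be computed: sample both curves on a shared x grid and count sign changes of their difference. The control points and grid resolution below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def count_intersections(y1, y2):
    """Count sign changes of (y1 - y2) between consecutive samples.

    Assumes both curves are sampled on the same x grid and are never
    exactly equal at a sample point (true for generic control points).
    """
    diff = np.sign(np.asarray(y1) - np.asarray(y2))
    return int(np.count_nonzero(diff[:-1] * diff[1:] < 0))

xs = np.linspace(0, 10, 1001)
# Two simple piecewise-linear functions; the control points are illustrative.
y1 = np.interp(xs, [0, 5, 10], [0, 6, 1])
y2 = np.interp(xs, [0, 5, 10], [5, 2, 8])
print(count_intersections(y1, y2))  # 2 -- the ground truth the VLM's answer is scored against
```

The VLM only ever sees the rendered plot; a count like this serves purely as the answer key when scoring its response.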

Task 2: Evaluating the State of Two Circles

This task evaluates whether two circles in an image are touching or overlapping. GPT-4o showed an accuracy of 72.69%, Gemini-1.5 Pro 92.78%, Claude-3 Sonnet 84.52%, and Claude-3.5 Sonnet 91.66%. While VLMs demonstrate some ability to assess overlapping circles, there is still room for improvement on a task humans solve effortlessly.
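The ground truth here is elementary geometry: two circles overlap when the distance between their centers is less than the sum of their radii, and touch when the two are equal. A minimal sketch, with illustrative coordinates:

```python
import math

def circle_relation(c1, r1, c2, r2):
    """Classify two circles as 'overlapping', 'touching', or 'separate'.

    Compares the center-to-center distance with the sum of the radii,
    which is the ground truth the overlap/touch prompts are scored against.
    """
    d = math.dist(c1, c2)
    if d < r1 + r2:
        return "overlapping"
    if math.isclose(d, r1 + r2):
        return "touching"
    return "separate"

print(circle_relation((0, 0), 1.0, (1.5, 0), 1.0))  # overlapping
print(circle_relation((0, 0), 1.0, (2.0, 0), 1.0))  # touching
print(circle_relation((0, 0), 1.0, (3.0, 0), 1.0))  # separate
```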

Task 3: Identifying Circled Characters

In this task, the models are shown words in which one character is circled, and they are evaluated on their ability to identify which character it is. GPT-4o achieved 70.18%, Gemini-1.5 Pro 92.81%, Claude-3 Sonnet 73.34%, and Claude-3.5 Sonnet 89.22% accuracy. These results show that accuracy varies widely across models and that even the best fall short on a reading task humans find trivial.
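A rough sketch of how a stimulus of this kind could be generated is shown below. The word choice, target index, and styling are illustrative assumptions, not the paper's exact rendering code.

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches

def draw_circled_letter(word, index, path="circled.png"):
    """Render `word` and draw a red oval around the letter at `index`.

    The saved image is the stimulus; the model must name the circled character.
    """
    fig, ax = plt.subplots(figsize=(0.6 * len(word), 1.2))
    ax.axis("off")
    # Lay out the letters one unit apart.
    for i, ch in enumerate(word):
        ax.text(i + 0.5, 0.5, ch, fontsize=32, ha="center", va="center")
    # Red oval centered on the target character.
    oval = patches.Ellipse((index + 0.5, 0.5), 0.9, 0.9,
                           fill=False, edgecolor="red", linewidth=3)
    ax.add_patch(oval)
    ax.set_xlim(0, len(word))
    ax.set_ylim(0, 1)
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)

draw_circled_letter("Acknowledgement", 5)  # circles the 'w'; the expected answer is 'w'
```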

Limitations of VLMs and Future Research Directions

These experimental results reveal the limitations of VLMs’ visual recognition capabilities, particularly in tasks such as counting intersection points, evaluating the state of two circles, and identifying circled characters. This suggests that VLMs struggle to accurately perceive detailed visual information.

Conclusion

Why can’t VLMs perceive images like humans do? The results from BlindTest highlight the limitations of VLMs’ visual understanding. These findings emphasize the need for further research and development to enhance the visual capabilities of VLMs. Future studies may require new approaches to improve the visual recognition abilities of VLMs, such as utilizing early fusion techniques to enhance vision modules.

