Here we are curious to see how a vision-language model does on the Stroop Color and Word Test. In humans, there is a very interesting effect called the Stroop effect (see here).

Please go here and do the experiment first, then come back. It’s fun!!

Now let's see how LLaVA does on the task. It will blow your mind!

🌋 LLaVA: Large Language and Vision Assistant

LLaVA (llava-v1.5–13b-4bit)

First, we are using this demo on HF.
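
(If you’d rather poke at the model locally instead of through the demo, here is a minimal sketch using the Hugging Face transformers API. The `llava-hf/llava-1.5-13b-hf` checkpoint and the `stroop.png` image path are my assumptions, and the demo runs a 4-bit variant, so outputs may differ slightly.)

```python
# Minimal local reproduction sketch (assumed checkpoint and image path;
# the HF demo uses a 4-bit quantized variant of LLaVA 1.5 13B).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-13b-hf"  # assumed checkpoint ID
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# e.g. an image of the word "BLUE" written in red ink
image = Image.open("stroop.png")
prompt = (
    "USER: <image>\nIn this experiment, you are required to say the color "
    "of the word, not what the word says. one word ASSISTANT:"
)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=10)
print(processor.decode(output[0], skip_special_tokens=True))
```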
Let’s try a few examples:

What do you see? Did you say “Blue” or “Red”?

We give the model a picture of the word “BLUE” written in red ink and ask it: “In this experiment, you are required to say the color of the word, not what the word says. one word”. What would you expect?

Wrong!!!

It was pretty disappointing! But what about the newer version?

LLaVA (llava-v1.6–34b)

Wow!!

It did really well, and it seems that the new version, “LLaVA-NeXT: Improved reasoning, OCR, and world knowledge”, has improved!

But can we still fool it? Let’s try!

Wow! It is impressive!

It seems the newest version of LLaVA really has an understanding of images!

So to answer our question, does LLaVA have the Stroop effect?

It seems that LLaVA v1.6 is pretty robust and does not show any sign of the effect!

But wait! Do you think the model can do the task right?

The model focused on the text!!! It wrote the words in the first row correctly but failed on the rest!

So maybe the model has the Stroop effect?

Another failure example:

Another failure example of LLaVA
Yet another failure example of LLaVA
And one more failure example of LLaVA, where the model just keeps generating!!!

What if we create a dataset and try to evaluate the model? We built a dataset of images, each a 5×5 grid of colourful words, and benchmarked the `llava-hf/llava-v1.6-mistral-7b-hf` model on 200 samples.
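
(For the curious, here is a rough sketch of how such grid images can be generated with Pillow. The font path, colour list, and file names are assumptions on my part, not the exact script we used.)

```python
# Sketch: render a 5x5 Stroop grid of colour words in mismatched ink colours.
import random
from PIL import Image, ImageDraw, ImageFont

COLORS = {"red": (255, 0, 0), "green": (0, 128, 0), "blue": (0, 0, 255),
          "orange": (255, 165, 0), "purple": (128, 0, 128)}

def make_stroop_grid(path, n=5, cell=120):
    """Draw an n x n grid; each word's ink colour never matches the word itself."""
    img = Image.new("RGB", (n * cell, n * cell), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("DejaVuSans-Bold.ttf", 28)  # assumed font; adjust for your system
    words, inks = [], []
    for row in range(n):
        for col in range(n):
            word = random.choice(list(COLORS))
            ink = random.choice([c for c in COLORS if c != word])  # Stroop condition: ink != word
            draw.text((col * cell + 10, row * cell + cell // 2),
                      word.upper(), fill=COLORS[ink], font=font)
            words.append(word)
            inks.append(ink)
    img.save(path)
    return words, inks  # ground truth for the "say the word" and "say the colour" prompts

words, inks = make_stroop_grid("stroop_grid_000.png")
```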

For this prompt, “In this experiment you are required to say the color of the word, not what the word says. Read the list as fast as you can.”, the scores are:

{'rouge1': 0.2468631578947369, 'rouge2': 0.05287940379403794, 'rougeL': 0.171221052631579, 'rougeLsum': 0.17204210526315794}
{'bleu': 0.08737648301423374, 'precisions': [0.19452054794520549, 0.13663911845730028, 0.08116343490304709, 0.02701949860724234], 'brevity_penalty': 1.0, 'length_ratio': 3.7244897959183674, 'translation_length': 3650, 'reference_length': 980}

For this prompt, “In this experiment you are required to say what the word says, not what the color of the word is. Read the list as fast as you can.”, the scores are:

{'rouge1': 0.40956600104156504, 'rouge2': 0.18901720719876702, 'rougeL': 0.3243491992969437, 'rougeLsum': 0.32076109917651274}
{'bleu': 0.15035696220994574, 'precisions': [0.26395039858281666, 0.18453976764968721, 0.130297565374211, 0.0805277525022748], 'brevity_penalty': 1.0, 'length_ratio': 2.304081632653061, 'translation_length': 2258, 'reference_length': 980}

So why does the model do well on simple tasks and fail on more complex tasks?

So it seems the model prefers the word over the colour as well! Similar to us!?
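
(For reference, scores in this format can be computed with the Hugging Face evaluate library; the predictions and references below are placeholders, not our actual benchmark outputs.)

```python
# Sketch of how the ROUGE/BLEU numbers can be computed (placeholder data).
import evaluate

# One string per image: what the model answered vs. the correct colour/word sequence.
predictions = ["red green blue orange purple"]   # placeholder model output
references  = ["blue red green purple orange"]   # placeholder ground truth

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
```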

Several other questions might be interesting to investigate next:

1- What makes the newer model better by such a large margin?

2- Has the model seen these types of images, or is there an emergence of image understanding?

3- How do these models compare to human intelligence?

4- Can we use these models as a human brain simulator and study the brain?