Closed-source LVLMs

We evaluate state-of-the-art closed-source commercial LVLMs (GPT-4o and Gemini-Pro), obtaining their answers through the official APIs. The tables below report results for both commercial and open-source LVLMs on the top-10 most biased pairs, ranked by outcome-difference bias.
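
For concreteness, the sketch below illustrates one way an outcome-difference bias could be computed from counterfactual answer pairs (the same question posed under two counterfactual variants, e.g. swapped demographic attribute). The record fields (`pair_id`, `answer_a`, `answer_b`) and the aggregation are illustrative assumptions only; they are not the exact definitions of $Ipss^o$, $B_{ovl}^o$, $B_{max}^o$, $Acc$, or $\Delta Acc$ reported in the tables.

```python
# Minimal sketch: outcome-difference bias over counterfactual answer pairs.
# Field names and aggregation are assumptions for illustration, not the
# benchmark's reference implementation.
from collections import defaultdict


def outcome_difference_bias(records):
    """records: iterable of dicts with keys
       'pair_id'  -- identifier of the counterfactual pair group
       'answer_a' -- model answer for the first counterfactual variant
       'answer_b' -- model answer for the second counterfactual variant
    Returns (overall_bias, max_pair_bias): the fraction of samples whose two
    counterfactual answers differ, overall and for the most-biased pair group."""
    diff_by_pair = defaultdict(list)
    for r in records:
        # An outcome difference occurs when the two counterfactual variants
        # receive different answers from the model.
        diff_by_pair[r["pair_id"]].append(r["answer_a"] != r["answer_b"])

    per_pair_rates = {
        pid: sum(flags) / len(flags) for pid, flags in diff_by_pair.items()
    }
    total = sum(len(flags) for flags in diff_by_pair.values())
    overall = sum(sum(flags) for flags in diff_by_pair.values()) / total
    return overall, max(per_pair_rates.values())


if __name__ == "__main__":
    toy = [
        {"pair_id": "cooking", "answer_a": "yes", "answer_b": "no"},
        {"pair_id": "cooking", "answer_a": "yes", "answer_b": "yes"},
        {"pair_id": "driving", "answer_a": "no", "answer_b": "no"},
    ]
    print(outcome_difference_bias(toy))  # (0.333..., 0.5)
```

The per-pair rates are what would let one rank and select the top-10 most biased pairs before reporting the aggregate scores.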

Results (in %) under multimodal bias (VL-Bias) evaluation

| Model | $Ipss^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.06 | 2.91 | 18.66 | 52.14 | 95.14 |
| LLaVA1.5-13B | 56.19 | 5.83 | 14.71 | 58.87 | 79.58 |
| LLaVA1.6-13B | 59.88 | 12.66 | 29.27 | 66.71 | 56.46 |
| MiniGPT-v2 | 55.24 | 11.44 | 24.32 | 60.79 | 57.92 |
| mPLUG-Owl2 | 51.80 | 32.82 | 52.44 | 75.47 | 10.46 |
| LLaMA-Adapter-v2 | 51.67 | 3.40 | 24.62 | 52.59 | 94.81 |
| InstructBLIP | 62.71 | 16.75 | 41.46 | 74.28 | 20.31 |
| Otter | 56.05 | 16.23 | 47.56 | 65.72 | 50.01 |
| LAMM | 53.41 | 10.60 | 23.18 | 57.35 | 79.78 |
| Kosmos-2 | 44.12 | 2.49 | 1.47 | 48.18 | 86.15 |
| Qwen-VL | 63.31 | 15.86 | 55.47 | 74.65 | 21.90 |
| InternLM-XC2 | 59.08 | 22.93 | 51.22 | 75.16 | 4.94 |
| Shikra | 56.17 | 11.64 | 29.71 | 63.12 | 52.65 |
| LLaVA-RLHF | 57.51 | 10.20 | 22.39 | 62.84 | 68.14 |
| RLHF-V | 51.23 | 26.42 | 40.98 | 68.29 | 23.62 |
| GPT-4o | 67.19 | 7.31 | 11.03 | 72.63 | 0.00 |
| Gemini-Pro | 57.08 | 23.63 | 81.71 | 75.36 | 0.00 |

Results (in %) under visual unimodal bias (V-Bias) evaluation

| Model | $Ipss^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.67 | 2.22 | 15.67 | 52.36 | 95.03 |
| LLaVA1.5-13B | 58.17 | 4.75 | 8.82 | 60.03 | 78.63 |
| LLaVA1.6-13B | 61.58 | 11.45 | 25.61 | 67.60 | 53.51 |
| MiniGPT-v2 | 56.47 | 4.84 | 9.46 | 58.34 | 72.10 |
| mPLUG-Owl2 | 53.18 | 31.50 | 57.32 | 76.07 | 11.16 |
| LLaMA-Adapter-v2 | 51.92 | 2.22 | 14.93 | 52.40 | 95.19 |
| InstructBLIP | 63.90 | 14.99 | 40.25 | 74.03 | 20.32 |
| Otter | 56.88 | 13.16 | 48.78 | 65.08 | 51.70 |
| LAMM | 52.49 | 4.83 | 17.64 | 54.54 | 85.89 |
| Kosmos-2 | 44.07 | 2.75 | 6.88 | 47.82 | 84.28 |
| Qwen-VL | 62.08 | 17.65 | 53.91 | 74.76 | 23.15 |
| InternLM-XC2 | 59.90 | 22.68 | 48.78 | 76.30 | 4.02 |
| Shikra | 58.18 | 5.71 | 14.38 | 61.57 | 59.93 |
| LLaVA-RLHF | 59.40 | 6.38 | 13.44 | 62.78 | 69.06 |
| RLHF-V | 45.11 | 34.42 | 60.87 | 66.97 | 20.38 |
| GPT-4o | 67.62 | 6.67 | 14.84 | 72.24 | 0.00 |
| Gemini-Pro | 65.04 | 13.08 | 40.24 | 75.43 | 0.00 |

Results (in %) under language unimodal bias (L-Bias) evaluation

| Model | $Ipss^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 50.63 | 0.89 | 7.46 | 50.73 | 98.55 |
| LLaVA1.5-13B | 54.36 | 4.15 | 25.38 | 55.74 | 87.66 |
| LLaVA1.6-13B | 57.71 | 11.98 | 34.05 | 63.07 | 66.32 |
| MiniGPT-v2 | 54.03 | 8.67 | 23.19 | 58.04 | 77.66 |
| mPLUG-Owl2 | 52.09 | 27.81 | 47.56 | 71.42 | 11.56 |
| LLaMA-Adapter-v2 | 50.00 | 0.00 | 0.00 | 50.00 | 100.00 |
| InstructBLIP | 60.23 | 15.70 | 32.93 | 70.66 | 29.32 |
| Otter | 58.10 | 6.85 | 17.07 | 62.17 | 53.82 |
| LAMM | 51.30 | 4.67 | 14.71 | 53.39 | 84.14 |
| Kosmos-2 | 46.56 | 1.68 | 1.22 | 47.71 | 55.83 |
| Qwen-VL | 61.07 | 13.57 | 42.19 | 70.61 | 27.93 |
| InternLM-XC2 | 54.30 | 24.77 | 67.19 | 70.01 | 8.37 |
| Shikra | 56.89 | 11.04 | 25.62 | 62.81 | 54.34 |
| LLaVA-RLHF | 56.84 | 8.94 | 37.31 | 61.68 | 74.60 |
| RLHF-V | 50.25 | 29.38 | 45.14 | 68.91 | 32.29 |
| GPT-4o | 61.86 | 11.24 | 24.26 | 70.02 | 0.00 |
| Gemini-Pro | 51.10 | 30.28 | 64.06 | 70.58 | 0.00 |