Closed-source LVLMs
We evaluate two state-of-the-art commercial LVLMs, GPT-4o and Gemini-Pro, obtaining their answers through the official APIs. The tables below report results for both the commercial and the open-source LVLMs on the top-10 biased pairs, ranked by outcome-difference bias.
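As a rough illustration of the API-based evaluation, the sketch below sends a single image-question probe to GPT-4o through the official OpenAI Python client. The prompt template, base64 image encoding, and yes/no answer constraint are our own assumptions for illustration, not the benchmark's exact protocol.

```python
import base64
from openai import OpenAI  # official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_gpt4o(image_path: str, question: str) -> str:
    """Send one image plus a yes/no question to GPT-4o; return the raw answer.

    The prompt template here is a hypothetical stand-in for the
    benchmark's actual probe format.
    """
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")

    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic decoding for evaluation
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{question} Answer with 'yes' or 'no' only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower()
```

A Gemini-Pro query follows the same pattern through the `google-generativeai` client; only the model handle and message format differ.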
Results (in %) under multimodal bias (VL-Bias) evaluation

| Model | $I_{pss}^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.06 | 2.91 | 18.66 | 52.14 | 95.14 |
| LLaVA1.5-13B | 56.19 | 5.83 | 14.71 | 58.87 | 79.58 |
| LLaVA1.6-13B | 59.88 | 12.66 | 29.27 | 66.71 | 56.46 |
| MiniGPT-v2 | 55.24 | 11.44 | 24.32 | 60.79 | 57.92 |
| mPLUG-Owl2 | 51.80 | 32.82 | 52.44 | 75.47 | 10.46 |
| LLaMA-Adapter-v2 | 51.67 | 3.40 | 24.62 | 52.59 | 94.81 |
| InstructBLIP | 62.71 | 16.75 | 41.46 | 74.28 | 20.31 |
| Otter | 56.05 | 16.23 | 47.56 | 65.72 | 50.01 |
| LAMM | 53.41 | 10.60 | 23.18 | 57.35 | 79.78 |
| Kosmos-2 | 44.12 | 2.49 | 1.47 | 48.18 | 86.15 |
| Qwen-VL | 63.31 | 15.86 | 55.47 | 74.65 | 21.90 |
| InternLM-XC2 | 59.08 | 22.93 | 51.22 | 75.16 | 4.94 |
| Shikra | 56.17 | 11.64 | 29.71 | 63.12 | 52.65 |
| LLaVA-RLHF | 57.51 | 10.20 | 22.39 | 62.84 | 68.14 |
| RLHF-V | 51.23 | 26.42 | 40.98 | 68.29 | 23.62 |
| GPT-4o | 67.19 | 7.31 | 11.03 | 72.63 | 0.00 |
| Gemini-Pro | 57.08 | 23.63 | 81.71 | 75.36 | 0.00 |
Results (in %) under visual unimodal bias (V-Bias) evaluation

| Model | $I_{pss}^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.67 | 2.22 | 15.67 | 52.36 | 95.03 |
| LLaVA1.5-13B | 58.17 | 4.75 | 8.82 | 60.03 | 78.63 |
| LLaVA1.6-13B | 61.58 | 11.45 | 25.61 | 67.60 | 53.51 |
| MiniGPT-v2 | 56.47 | 4.84 | 9.46 | 58.34 | 72.10 |
| mPLUG-Owl2 | 53.18 | 31.50 | 57.32 | 76.07 | 11.16 |
| LLaMA-Adapter-v2 | 51.92 | 2.22 | 14.93 | 52.40 | 95.19 |
| InstructBLIP | 63.90 | 14.99 | 40.25 | 74.03 | 20.32 |
| Otter | 56.88 | 13.16 | 48.78 | 65.08 | 51.70 |
| LAMM | 52.49 | 4.83 | 17.64 | 54.54 | 85.89 |
| Kosmos-2 | 44.07 | 2.75 | 6.88 | 47.82 | 84.28 |
| Qwen-VL | 62.08 | 17.65 | 53.91 | 74.76 | 23.15 |
| InternLM-XC2 | 59.90 | 22.68 | 48.78 | 76.30 | 4.02 |
| Shikra | 58.18 | 5.71 | 14.38 | 61.57 | 59.93 |
| LLaVA-RLHF | 59.40 | 6.38 | 13.44 | 62.78 | 69.06 |
| RLHF-V | 45.11 | 34.42 | 60.87 | 66.97 | 20.38 |
| GPT-4o | 67.62 | 6.67 | 14.84 | 72.24 | 0.00 |
| Gemini-Pro | 65.04 | 13.08 | 40.24 | 75.43 | 0.00 |
Results (in %) under language unimodal bias (L-Bias) evaluation

| Model | $I_{pss}^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 50.63 | 0.89 | 7.46 | 50.73 | 98.55 |
| LLaVA1.5-13B | 54.36 | 4.15 | 25.38 | 55.74 | 87.66 |
| LLaVA1.6-13B | 57.71 | 11.98 | 34.05 | 63.07 | 66.32 |
| MiniGPT-v2 | 54.03 | 8.67 | 23.19 | 58.04 | 77.66 |
| mPLUG-Owl2 | 52.09 | 27.81 | 47.56 | 71.42 | 11.56 |
| LLaMA-Adapter-v2 | 50.00 | 0.00 | 0.00 | 50.00 | 100.00 |
| InstructBLIP | 60.23 | 15.70 | 32.93 | 70.66 | 29.32 |
| Otter | 58.10 | 6.85 | 17.07 | 62.17 | 53.82 |
| LAMM | 51.30 | 4.67 | 14.71 | 53.39 | 84.14 |
| Kosmos-2 | 46.56 | 1.68 | 1.22 | 47.71 | 55.83 |
| Qwen-VL | 61.07 | 13.57 | 42.19 | 70.61 | 27.93 |
| InternLM-XC2 | 54.30 | 24.77 | 67.19 | 70.01 | 8.37 |
| Shikra | 56.89 | 11.04 | 25.62 | 62.81 | 54.34 |
| LLaVA-RLHF | 56.84 | 8.94 | 37.31 | 61.68 | 74.60 |
| RLHF-V | 50.25 | 29.38 | 45.14 | 68.91 | 32.29 |
| GPT-4o | 61.86 | 11.24 | 24.26 | 70.02 | 0.00 |
| Gemini-Pro | 51.10 | 30.28 | 64.06 | 70.58 | 0.00 |
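For reference, a minimal sketch of how outcome-difference bias scores of this kind can be aggregated across biased pairs. The per-pair absolute accuracy gap, averaged for $B_{ovl}^o$ and maximized for $B_{max}^o$, is a common convention; treat it as an assumption rather than this paper's exact formulas, and the pair ids and accuracies below are hypothetical.

```python
from statistics import mean


def outcome_bias(acc_a: dict[str, float], acc_b: dict[str, float]) -> tuple[float, float]:
    """Aggregate per-pair outcome differences into overall and max bias.

    acc_a / acc_b map a social-attribute pair id to the model's accuracy
    (in percent) on each side of the pair.
    ASSUMPTION: bias for one pair = |acc_a - acc_b|; B_ovl = mean over
    pairs, B_max = max over pairs. The paper's definitions may differ.
    """
    diffs = [abs(acc_a[p] - acc_b[p]) for p in acc_a]
    return mean(diffs), max(diffs)


# Hypothetical per-pair accuracies for two contrasted groups:
acc_group_a = {"pair1": 72.0, "pair2": 65.5, "pair3": 81.0}
acc_group_b = {"pair1": 60.5, "pair2": 64.0, "pair3": 49.5}

b_ovl, b_max = outcome_bias(acc_group_a, acc_group_b)
print(f"B_ovl = {b_ovl:.2f}%, B_max = {b_max:.2f}%")
```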