Open-source LVLMs
We benchmark 15 commonly used open-source LVLMs with a single-turn Perplexity (PPL)
inferencer, which constrains each model's output to the candidate options and computes
the probability of each option.
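To make the scoring procedure concrete, the sketch below shows PPL-based option selection with a plain Hugging Face causal LM standing in for an LVLM backbone; the model name (`gpt2`), the `option_ppl` helper, and the prompt/options are illustrative assumptions, not taken from the benchmark, and a real LVLM evaluation would additionally feed the image through the model's vision pathway.

```python
# Minimal sketch of single-turn PPL option scoring, assuming a text-only
# causal LM as a stand-in for an LVLM (the image input is omitted here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in; not a model from the benchmark
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_ppl(prompt: str, option: str) -> float:
    """Perplexity of `option` conditioned on `prompt` (lower is better)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # [1, seq_len, vocab]
    # The token at position t is predicted by the logits at position t-1,
    # so score only the option tokens.
    option_ids = full_ids[0, prompt_len:]
    option_logits = logits[0, prompt_len - 1 : -1]
    log_probs = torch.log_softmax(option_logits, dim=-1)
    token_ll = log_probs[torch.arange(option_ids.numel()), option_ids]
    return torch.exp(-token_ll.mean()).item()

# Options carry a leading space so the prompt's tokenization stays a
# prefix of the full sequence under BPE. Question text is illustrative.
prompt = "Question: Is the person in the image a nurse or a doctor?\nAnswer:"
options = [" nurse", " doctor"]
scores = {opt: option_ppl(prompt, opt) for opt in options}
prediction = min(scores, key=scores.get)  # lowest PPL = predicted option
print(scores, "->", prediction)
```

In the actual evaluation the option set comes from the benchmark data and the image is passed through each model's vision encoder, but the decision rule follows this pattern: compute a PPL per option and predict the option with the lowest value.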
Results (in %) under multimodal bias (VL-Bias) evaluation

| Model | $Ipss$ | $B_{ovl}$ | $B_{max}$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.58 | 1.85 | 15.94 | 52.15 | 95.66 |
| LLaVA1.5-13B | 57.92 | 2.91 | 18.60 | 59.08 | 81.08 |
| LLaVA1.6-13B | 65.64 | 3.29 | 21.06 | 67.70 | 59.77 |
| MiniGPT-v2 | 58.20 | 2.72 | 16.48 | 59.74 | 73.02 |
| mPLUG-Owl2 | 72.59 | 6.48 | 34.02 | 77.56 | 8.84 |
| LLaMA-Adapter-v2 | 55.31 | 0.60 | 7.38 | 55.67 | 86.16 |
| InstructBLIP | 74.26 | 4.10 | 19.94 | 77.52 | 14.05 |
| Otter | 62.68 | 1.82 | 9.25 | 63.96 | 59.11 |
| LAMM | 54.51 | 1.63 | 10.09 | 55.24 | 85.69 |
| Kosmos-2 | 48.96 | 0.22 | 0.93 | 49.58 | 70.66 |
| Qwen-VL | 71.29 | 4.07 | 30.14 | 74.27 | 23.27 |
| InternLM-XC2 | 72.93 | 6.30 | 37.32 | 77.77 | 9.45 |
| Shikra | 61.08 | 3.40 | 21.56 | 63.44 | 54.48 |
| LLaVA-RLHF | 61.05 | 4.15 | 27.57 | 63.04 | 71.24 |
| RLHF-V | 67.16 | 6.96 | 27.69 | 72.34 | 15.09 |
Results (in %) under visual unimodal bias (V-Bias) evaluation

| Model | $Ipss$ | $B_{ovl}$ | $B_{max}$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.67 | 1.60 | 11.34 | 52.17 | 95.62 |
| LLaVA1.5-13B | 58.85 | 2.55 | 14.44 | 59.90 | 79.27 |
| LLaVA1.6-13B | 66.65 | 3.36 | 17.55 | 68.79 | 56.72 |
| MiniGPT-v2 | 55.30 | 1.58 | 7.43 | 56.14 | 83.97 |
| mPLUG-Owl2 | 73.26 | 5.77 | 31.50 | 77.68 | 9.07 |
| LLaMA-Adapter-v2 | 55.16 | 0.42 | 6.78 | 55.40 | 86.39 |
| InstructBLIP | 75.06 | 3.23 | 18.02 | 77.61 | 13.60 |
| Otter | 62.54 | 1.48 | 8.46 | 63.56 | 60.38 |
| LAMM | 57.54 | 0.62 | 4.33 | 57.94 | 77.85 |
| Kosmos-2 | 48.95 | 0.21 | 0.95 | 49.53 | 72.69 |
| Qwen-VL | 71.07 | 4.54 | 29.88 | 74.36 | 23.99 |
| InternLM-XC2 | 72.53 | 7.24 | 37.80 | 78.05 | 8.09 |
| Shikra | 60.23 | 2.10 | 14.40 | 61.66 | 63.15 |
| LLaVA-RLHF | 62.50 | 3.01 | 14.36 | 64.00 | 68.89 |
| RLHF-V | 63.83 | 10.46 | 33.05 | 71.30 | 19.02 |
Results (in %) under language unimodal bias (L-Bias) evaluation

| Model | $Ipss$ | $B_{ovl}$ | $B_{max}$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 50.86 | 1.25 | 12.08 | 51.27 | 97.43 |
| LLaVA1.5-13B | 55.86 | 1.65 | 14.60 | 56.41 | 86.85 |
| LLaVA1.6-13B | 62.52 | 2.37 | 17.35 | 63.93 | 69.94 |
| MiniGPT-v2 | 54.84 | 2.05 | 13.48 | 55.95 | 84.63 |
| mPLUG-Owl2 | 70.37 | 4.75 | 22.58 | 73.92 | 11.45 |
| LLaMA-Adapter-v2 | 51.72 | 0.34 | 2.22 | 51.91 | 95.45 |
| InstructBLIP | 71.83 | 3.41 | 16.94 | 74.42 | 19.54 |
| Otter | 59.71 | 0.93 | 4.65 | 60.36 | 68.99 |
| LAMM | 56.13 | 0.91 | 3.72 | 56.67 | 80.50 |
| Kosmos-2 | 49.94 | 0.03 | 0.14 | 49.99 | 74.55 |
| Qwen-VL | 70.18 | 2.96 | 19.94 | 72.35 | 18.48 |
| InternLM-XC2 | 71.83 | 5.38 | 37.23 | 75.80 | 9.23 |
| Shikra | 59.69 | 3.25 | 13.86 | 61.80 | 56.65 |
| LLaVA-RLHF | 59.70 | 3.61 | 34.59 | 61.34 | 75.23 |
| RLHF-V | 64.08 | 7.36 | 33.69 | 69.25 | 27.68 |