Open-source LVLMs

We benchmark 15 commonly used open-source LVLMs with a single-turn perplexity (PPL) inferencer, which confines each model's output to the candidate options and computes the probability of each option.
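
For reference, the sketch below illustrates how a PPL-style inferencer of this kind can be implemented: each candidate option is appended to the prompt, scored by the language model, and the option with the highest (length-normalized) log-likelihood is taken as the prediction. The backbone name, prompt format, and length normalization are illustrative assumptions, not the exact configuration used in this benchmark.

```python
# Minimal sketch of a single-turn PPL-style inferencer (illustrative only).
# Assumptions: a HuggingFace causal LM backbone stands in for the LVLM's
# language head, and option tokenization aligns approximately at the
# prompt/option boundary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder backbone, not the benchmarked LVLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_loglikelihood(prompt: str, option: str) -> float:
    """Average log-probability of the option tokens conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab)
    # Log-probability of each token given its preceding context.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions belonging to the option continuation.
    n_option = full_ids.shape[1] - prompt_ids.shape[1]
    return token_lp[0, -n_option:].mean().item()

prompt = "Question: ...\nAnswer:"
options = [" yes", " no"]  # hypothetical candidate options
scores = {opt: option_loglikelihood(prompt, opt) for opt in options}
prediction = max(scores, key=scores.get)  # highest-scoring option wins
```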

Results (in %) under multimodal bias (VL-Bias) evaluation

| Model | $I_{pss}$ | $B_{ovl}$ | $B_{max}$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.58 | 1.85 | 15.94 | 52.15 | 95.66 |
| LLaVA1.5-13B | 57.92 | 2.91 | 18.60 | 59.08 | 81.08 |
| LLaVA1.6-13B | 65.64 | 3.29 | 21.06 | 67.70 | 59.77 |
| MiniGPT-v2 | 58.20 | 2.72 | 16.48 | 59.74 | 73.02 |
| mPLUG-Owl2 | 72.59 | 6.48 | 34.02 | 77.56 | 8.84 |
| LLaMA-Adapter-v2 | 55.31 | 0.60 | 7.38 | 55.67 | 86.16 |
| InstructBLIP | 74.26 | 4.10 | 19.94 | 77.52 | 14.05 |
| Otter | 62.68 | 1.82 | 9.25 | 63.96 | 59.11 |
| LAMM | 54.51 | 1.63 | 10.09 | 55.24 | 85.69 |
| Kosmos-2 | 48.96 | 0.22 | 0.93 | 49.58 | 70.66 |
| Qwen-VL | 71.29 | 4.07 | 30.14 | 74.27 | 23.27 |
| InternLM-XC2 | 72.93 | 6.30 | 37.32 | 77.77 | 9.45 |
| Shikra | 61.08 | 3.40 | 21.56 | 63.44 | 54.48 |
| LLaVA-RLHF | 61.05 | 4.15 | 27.57 | 63.04 | 71.24 |
| RLHF-V | 67.16 | 6.96 | 27.69 | 72.34 | 15.09 |

Results (in %) under visual unimodal bias (V-Bias) evaluation

| Model | $I_{pss}$ | $B_{ovl}$ | $B_{max}$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.67 | 1.60 | 11.34 | 52.17 | 95.62 |
| LLaVA1.5-13B | 58.85 | 2.55 | 14.44 | 59.90 | 79.27 |
| LLaVA1.6-13B | 66.65 | 3.36 | 17.55 | 68.79 | 56.72 |
| MiniGPT-v2 | 55.30 | 1.58 | 7.43 | 56.14 | 83.97 |
| mPLUG-Owl2 | 73.26 | 5.77 | 31.50 | 77.68 | 9.07 |
| LLaMA-Adapter-v2 | 55.16 | 0.42 | 6.78 | 55.40 | 86.39 |
| InstructBLIP | 75.06 | 3.23 | 18.02 | 77.61 | 13.60 |
| Otter | 62.54 | 1.48 | 8.46 | 63.56 | 60.38 |
| LAMM | 57.54 | 0.62 | 4.33 | 57.94 | 77.85 |
| Kosmos-2 | 48.95 | 0.21 | 0.95 | 49.53 | 72.69 |
| Qwen-VL | 71.07 | 4.54 | 29.88 | 74.36 | 23.99 |
| InternLM-XC2 | 72.53 | 7.24 | 37.80 | 78.05 | 8.09 |
| Shikra | 60.23 | 2.10 | 14.40 | 61.66 | 63.15 |
| LLaVA-RLHF | 62.50 | 3.01 | 14.36 | 64.00 | 68.89 |
| RLHF-V | 63.83 | 10.46 | 33.05 | 71.30 | 19.02 |

Results (in %) under language unimodal bias (L-Bias) evaluation

| Model | $I_{pss}$ | $B_{ovl}$ | $B_{max}$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 50.86 | 1.25 | 12.08 | 51.27 | 97.43 |
| LLaVA1.5-13B | 55.86 | 1.65 | 14.60 | 56.41 | 86.85 |
| LLaVA1.6-13B | 62.52 | 2.37 | 17.35 | 63.93 | 69.94 |
| MiniGPT-v2 | 54.84 | 2.05 | 13.48 | 55.95 | 84.63 |
| mPLUG-Owl2 | 70.37 | 4.75 | 22.58 | 73.92 | 11.45 |
| LLaMA-Adapter-v2 | 51.72 | 0.34 | 2.22 | 51.91 | 95.45 |
| InstructBLIP | 71.83 | 3.41 | 16.94 | 74.42 | 19.54 |
| Otter | 59.71 | 0.93 | 4.65 | 60.36 | 68.99 |
| LAMM | 56.13 | 0.91 | 3.72 | 56.67 | 80.50 |
| Kosmos-2 | 49.94 | 0.03 | 0.14 | 49.99 | 74.55 |
| Qwen-VL | 70.18 | 2.96 | 19.94 | 72.35 | 18.48 |
| InternLM-XC2 | 71.83 | 5.38 | 37.23 | 75.80 | 9.23 |
| Shikra | 59.69 | 3.25 | 13.86 | 61.80 | 56.65 |
| LLaVA-RLHF | 59.70 | 3.61 | 34.59 | 61.34 | 75.23 |
| RLHF-V | 64.08 | 7.36 | 33.69 | 69.25 | 27.68 |