Open-source LVLMs
We benchmark 15 commonly used open-source LVLMs with a single-turn perplexity (PPL) inferencer, which confines each model's output to the given answer options and computes the probability of each option.
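For concreteness, below is a minimal sketch of how a single-turn PPL inferencer scores answer options. It assumes a Hugging Face causal language model; the model name, prompt, and option strings are illustrative placeholders rather than the benchmark's actual setup, and the image-conditioning step of a real LVLM is omitted.

```python
# Minimal sketch of a single-turn PPL inferencer (assumptions: a Hugging Face
# causal LM; the model name, prompt, and options below are illustrative only,
# and the image-conditioning step of a real LVLM is omitted).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder stand-in for an LVLM's language backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_log_likelihood(prompt: str, option: str) -> float:
    """Sum of log-probabilities of `option`'s tokens conditioned on `prompt`."""
    # Assumes tokenizing prompt + option splits cleanly at the boundary.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability assigned to each next token, aligned with its position.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lls = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the positions belonging to the option continuation.
    return token_lls[0, prompt_len - 1:].sum().item()

# Hypothetical bias probe: the prediction is the most likely option.
prompt = "Question: Who is more likely to be a nurse?\nAnswer: "
options = ["the man", "the woman", "cannot be determined"]
scores = {o: option_log_likelihood(prompt, o) for o in options}
prediction = max(scores, key=scores.get)
```

Accuracy-style metrics such as $Acc$ would then be aggregated over such argmax predictions.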
Results (in %) under multimodal bias (VL-Bias) evaluation
| Model | $I_{pss}$ | $B_{ovl}$ | $B_{max}$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.58 | 1.85 | 15.94 | 52.15 | 95.66 | 
| LLaVA1.5-13B | 57.92 | 2.91 | 18.60 | 59.08 | 81.08 | 
| LLaVA1.6-13B | 65.64 | 3.29 | 21.06 | 67.70 | 59.77 | 
| MiniGPT-v2 | 58.20 | 2.72 | 16.48 | 59.74 | 73.02 | 
| mPLUG-Owl2 | 72.59 | 6.48 | 34.02 | 77.56 | 8.84 | 
| LLaMA-Adapter-v2 | 55.31 | 0.60 | 7.38 | 55.67 | 86.16 | 
| InstructBLIP | 74.26 | 4.10 | 19.94 | 77.52 | 14.05 | 
| Otter | 62.68 | 1.82 | 9.25 | 63.96 | 59.11 | 
| LAMM | 54.51 | 1.63 | 10.09 | 55.24 | 85.69 | 
| Kosmos-2 | 48.96 | 0.22 | 0.93 | 49.58 | 70.66 | 
| Qwen-VL | 71.29 | 4.07 | 30.14 | 74.27 | 23.27 | 
| InternLM-XC2 | 72.93 | 6.30 | 37.32 | 77.77 | 9.45 | 
| Shikra | 61.08 | 3.40 | 21.56 | 63.44 | 54.48 | 
| LLaVA-RLHF | 61.05 | 4.15 | 27.57 | 63.04 | 71.24 | 
| RLHF-V | 67.16 | 6.96 | 27.69 | 72.34 | 15.09 | 
Results (in %) under visual unimodal bias (V-Bias) evaluation
| Model | $I_{pss}$ | $B_{ovl}$ | $B_{max}$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.67 | 1.60 | 11.34 | 52.17 | 95.62 | 
| LLaVA1.5-13B | 58.85 | 2.55 | 14.44 | 59.90 | 79.27 | 
| LLaVA1.6-13B | 66.65 | 3.36 | 17.55 | 68.79 | 56.72 | 
| MiniGPT-v2 | 55.30 | 1.58 | 7.43 | 56.14 | 83.97 | 
| mPLUG-Owl2 | 73.26 | 5.77 | 31.50 | 77.68 | 9.07 | 
| LLaMA-Adapter-v2 | 55.16 | 0.42 | 6.78 | 55.40 | 86.39 | 
| InstructBLIP | 75.06 | 3.23 | 18.02 | 77.61 | 13.60 | 
| Otter | 62.54 | 1.48 | 8.46 | 63.56 | 60.38 | 
| LAMM | 57.54 | 0.62 | 4.33 | 57.94 | 77.85 | 
| Kosmos-2 | 48.95 | 0.21 | 0.95 | 49.53 | 72.69 | 
| Qwen-VL | 71.07 | 4.54 | 29.88 | 74.36 | 23.99 | 
| InternLM-XC2 | 72.53 | 7.24 | 37.80 | 78.05 | 8.09 | 
| Shikra | 60.23 | 2.10 | 14.40 | 61.66 | 63.15 | 
| LLaVA-RLHF | 62.50 | 3.01 | 14.36 | 64.00 | 68.89 | 
| RLHF-V | 63.83 | 10.46 | 33.05 | 71.30 | 19.02 | 
Results (in %) under language unimodal bias (L-Bias) evaluation
| Model | $I_{pss}$ | $B_{ovl}$ | $B_{max}$ | $Acc$ | $\Delta Acc$ |
|---|---|---|---|---|---|
| LLaVA1.5-7B | 50.86 | 1.25 | 12.08 | 51.27 | 97.43 | 
| LLaVA1.5-13B | 55.86 | 1.65 | 14.60 | 56.41 | 86.85 | 
| LLaVA1.6-13B | 62.52 | 2.37 | 17.35 | 63.93 | 69.94 | 
| MiniGPT-v2 | 54.84 | 2.05 | 13.48 | 55.95 | 84.63 | 
| mPLUG-Owl2 | 70.37 | 4.75 | 22.58 | 73.92 | 11.45 | 
| LLaMA-Adapter-v2 | 51.72 | 0.34 | 2.22 | 51.91 | 95.45 | 
| InstructBLIP | 71.83 | 3.41 | 16.94 | 74.42 | 19.54 | 
| Otter | 59.71 | 0.93 | 4.65 | 60.36 | 68.99 | 
| LAMM | 56.13 | 0.91 | 3.72 | 56.67 | 80.50 | 
| Kosmos-2 | 49.94 | 0.03 | 0.14 | 49.99 | 74.55 | 
| Qwen-VL | 70.18 | 2.96 | 19.94 | 72.35 | 18.48 | 
| InternLM-XC2 | 71.83 | 5.38 | 37.23 | 75.80 | 9.23 | 
| Shikra | 59.69 | 3.25 | 13.86 | 61.80 | 56.65 | 
| LLaVA-RLHF | 59.70 | 3.61 | 34.59 | 61.34 | 75.23 | 
| RLHF-V | 64.08 | 7.36 | 33.69 | 69.25 | 27.68 |