Closed-source LVLMs
We evaluate state-of-the-art commercial LVLMs (GPT-4o and Gemini-Pro), obtaining their answers through the official APIs. The tables below report results for both commercial and open-source LVLMs on the top-10 most biased pairs (ranked by outcome-difference bias).
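For reference, here is a minimal sketch of how answers from a closed-source model can be collected through its official API and how pairs could be ranked by outcome-difference bias. The dataset layout (`pairs`, `group_a_images`, `group_b_images`), the yes/no answer parsing, and the bias score |P(yes | group A) − P(yes | group B)| are illustrative assumptions rather than the benchmark's exact protocol; the GPT-4o call shown uses the official OpenAI chat-completions API.

```python
# Minimal sketch: query GPT-4o through the official OpenAI API and rank
# counterfactual pairs by an outcome-difference bias score.
# NOTE: the dataset layout (`pairs`, per-group image lists, yes/no questions)
# and the bias formula below are illustrative assumptions, not the benchmark's spec.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def image_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL for the vision API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def ask_yes_no(image_path: str, question: str) -> bool:
    """Ask GPT-4o a yes/no question about an image; True iff it answers 'yes'."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{question} Answer with 'yes' or 'no' only."},
                {"type": "image_url", "image_url": {"url": image_data_url(image_path)}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")


def outcome_difference(pair: dict) -> float:
    """Illustrative bias score: |P(yes | group A) - P(yes | group B)|."""
    rate_a = sum(ask_yes_no(p, pair["question"]) for p in pair["group_a_images"]) / len(pair["group_a_images"])
    rate_b = sum(ask_yes_no(p, pair["question"]) for p in pair["group_b_images"]) / len(pair["group_b_images"])
    return abs(rate_a - rate_b)


# Hypothetical dataset: each entry pairs counterfactual image sets with one question.
pairs = [
    {
        "question": "Is the person in the image a doctor?",
        "group_a_images": ["images/doctor_male_01.jpg"],    # placeholder paths
        "group_b_images": ["images/doctor_female_01.jpg"],  # placeholder paths
    },
    # ... more attribute/occupation pairs ...
]

# Keep the ten pairs with the largest outcome-difference bias.
top_10 = sorted(pairs, key=outcome_difference, reverse=True)[:10]
```

Gemini-Pro would be queried analogously through its own official API; only the `ask_yes_no` helper would change.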
Results (in %) under multimodal bias (VL-Bias) evaluation
| Model | $Ipss^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ | 
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.06 | 2.91 | 18.66 | 52.14 | 95.14 | 
| LLaVA1.5-13B | 56.19 | 5.83 | 14.71 | 58.87 | 79.58 | 
| LLaVA1.6-13B | 59.88 | 12.66 | 29.27 | 66.71 | 56.46 | 
| MiniGPT-v2 | 55.24 | 11.44 | 24.32 | 60.79 | 57.92 | 
| mPLUG-Owl2 | 51.80 | 32.82 | 52.44 | 75.47 | 10.46 | 
| LLaMA-Adapter-v2 | 51.67 | 3.40 | 24.62 | 52.59 | 94.81 | 
| InstructBLIP | 62.71 | 16.75 | 41.46 | 74.28 | 20.31 | 
| Otter | 56.05 | 16.23 | 47.56 | 65.72 | 50.01 | 
| LAMM | 53.41 | 10.60 | 23.18 | 57.35 | 79.78 | 
| Kosmos-2 | 44.12 | 2.49 | 1.47 | 48.18 | 86.15 | 
| Qwen-VL | 63.31 | 15.86 | 55.47 | 74.65 | 21.90 | 
| InternLM-XC2 | 59.08 | 22.93 | 51.22 | 75.16 | 4.94 | 
| Shikra | 56.17 | 11.64 | 29.71 | 63.12 | 52.65 | 
| LLaVA-RLHF | 57.51 | 10.20 | 22.39 | 62.84 | 68.14 | 
| RLHF-V | 51.23 | 26.42 | 40.98 | 68.29 | 23.62 | 
| GPT-4o | 67.19 | 7.31 | 11.03 | 72.63 | 0.00 | 
| Gemini-Pro | 57.08 | 23.63 | 81.71 | 75.36 | 0.00 | 
Results (in %) under visual unimodal bias (V-Bias) evaluation
| Model | $Ipss^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ | 
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.67 | 2.22 | 15.67 | 52.36 | 95.03 | 
| LLaVA1.5-13B | 58.17 | 4.75 | 8.82 | 60.03 | 78.63 | 
| LLaVA1.6-13B | 61.58 | 11.45 | 25.61 | 67.60 | 53.51 | 
| MiniGPT-v2 | 56.47 | 4.84 | 9.46 | 58.34 | 72.10 | 
| mPLUG-Owl2 | 53.18 | 31.50 | 57.32 | 76.07 | 11.16 | 
| LLaMA-Adapter-v2 | 51.92 | 2.22 | 14.93 | 52.40 | 95.19 | 
| InstructBLIP | 63.90 | 14.99 | 40.25 | 74.03 | 20.32 | 
| Otter | 56.88 | 13.16 | 48.78 | 65.08 | 51.70 | 
| LAMM | 52.49 | 4.83 | 17.64 | 54.54 | 85.89 | 
| Kosmos-2 | 44.07 | 2.75 | 6.88 | 47.82 | 84.28 | 
| Qwen-VL | 62.08 | 17.65 | 53.91 | 74.76 | 23.15 | 
| InternLM-XC2 | 59.90 | 22.68 | 48.78 | 76.30 | 4.02 | 
| Shikra | 58.18 | 5.71 | 14.38 | 61.57 | 59.93 | 
| LLaVA-RLHF | 59.40 | 6.38 | 13.44 | 62.78 | 69.06 | 
| RLHF-V | 45.11 | 34.42 | 60.87 | 66.97 | 20.38 | 
| GPT-4o | 67.62 | 6.67 | 14.84 | 72.24 | 0.00 | 
| Gemini-Pro | 65.04 | 13.08 | 40.24 | 75.43 | 0.00 | 
Results (in %) under language unimodal bias (L-Bias) evaluation
| Model | $Ipss^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ | 
|---|---|---|---|---|---|
| LLaVA1.5-7B | 50.63 | 0.89 | 7.46 | 50.73 | 98.55 | 
| LLaVA1.5-13B | 54.36 | 4.15 | 25.38 | 55.74 | 87.66 | 
| LLaVA1.6-13B | 57.71 | 11.98 | 34.05 | 63.07 | 66.32 | 
| MiniGPT-v2 | 54.03 | 8.67 | 23.19 | 58.04 | 77.66 | 
| mPLUG-Owl2 | 52.09 | 27.81 | 47.56 | 71.42 | 11.56 | 
| LLaMA-Adapter-v2 | 50.00 | 0.00 | 0.00 | 50.00 | 100.00 | 
| InstructBLIP | 60.23 | 15.70 | 32.93 | 70.66 | 29.32 | 
| Otter | 58.10 | 6.85 | 17.07 | 62.17 | 53.82 | 
| LAMM | 51.30 | 4.67 | 14.71 | 53.39 | 84.14 | 
| Kosmos-2 | 46.56 | 1.68 | 1.22 | 47.71 | 55.83 | 
| Qwen-VL | 61.07 | 13.57 | 42.19 | 70.61 | 27.93 | 
| InternLM-XC2 | 54.30 | 24.77 | 67.19 | 70.01 | 8.37 | 
| Shikra | 56.89 | 11.04 | 25.62 | 62.81 | 54.34 | 
| LLaVA-RLHF | 56.84 | 8.94 | 37.31 | 61.68 | 74.60 | 
| RLHF-V | 50.25 | 29.38 | 45.14 | 68.91 | 32.29 | 
| GPT-4o | 61.86 | 11.24 | 24.26 | 70.02 | 0.00 | 
| Gemini-Pro | 51.10 | 30.28 | 64.06 | 70.58 | 0.00 |