Closed-source LVLMs
We evaluate state-of-the-art commercial LVLMs (GPT-4o and Gemini-Pro), obtaining their answers through the official APIs. The tables below report results for both commercial and open-source LVLMs on the top-10 most biased pairs (ranked by outcome-difference bias).
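For reference, here is a minimal sketch of how answers from a closed-source model can be collected through its official API and how pairs could be ranked by outcome-difference bias. The dataset layout (`pairs`, `group_a_images`, `group_b_images`), the yes/no answer parsing, and the bias score |P(yes | group A) − P(yes | group B)| are illustrative assumptions rather than the benchmark's exact protocol; the GPT-4o call shown uses the official OpenAI chat-completions API.

```python
# Minimal sketch: query GPT-4o through the official OpenAI API and rank
# counterfactual pairs by an outcome-difference bias score.
# NOTE: the dataset layout (`pairs`, per-group image lists, yes/no questions)
# and the bias formula below are illustrative assumptions, not the benchmark's spec.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def image_data_url(path: str) -> str:
    """Encode a local image as a base64 data URL for the vision API."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()


def ask_yes_no(image_path: str, question: str) -> bool:
    """Ask GPT-4o a yes/no question about an image; True iff it answers 'yes'."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{question} Answer with 'yes' or 'no' only."},
                {"type": "image_url", "image_url": {"url": image_data_url(image_path)}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")


def outcome_difference(pair: dict) -> float:
    """Illustrative bias score: |P(yes | group A) - P(yes | group B)|."""
    rate_a = sum(ask_yes_no(p, pair["question"]) for p in pair["group_a_images"]) / len(pair["group_a_images"])
    rate_b = sum(ask_yes_no(p, pair["question"]) for p in pair["group_b_images"]) / len(pair["group_b_images"])
    return abs(rate_a - rate_b)


# Hypothetical dataset: each entry pairs counterfactual image sets with one question.
pairs = [
    {
        "question": "Is the person in the image a doctor?",
        "group_a_images": ["images/doctor_male_01.jpg"],    # placeholder paths
        "group_b_images": ["images/doctor_female_01.jpg"],  # placeholder paths
    },
    # ... more attribute/occupation pairs ...
]

# Keep the ten pairs with the largest outcome-difference bias.
top_10 = sorted(pairs, key=outcome_difference, reverse=True)[:10]
```

Gemini-Pro would be queried analogously through its own official API; only the `ask_yes_no` helper would change.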
Results (in %) under multimodal bias (VL-Bias) evaluation
| Model | $Ipss^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ | 
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.06 | 2.91 | 18.66 | 52.14 | 95.14 | 
| LLaVA1.5-13B | 56.19 | 5.83 | 14.71 | 58.87 | 79.58 | 
| LLaVA1.6-13B | 59.88 | 12.66 | 29.27 | 66.71 | 56.46 | 
| MiniGPT-v2 | 55.24 | 11.44 | 24.32 | 60.79 | 57.92 | 
| mPLUG-Owl2 | 51.80 | 32.82 | 52.44 | 75.47 | 10.46 | 
| LLaMA-Adapter-v2 | 51.67 | 3.40 | 24.62 | 52.59 | 94.81 | 
| InstructBLIP | 62.71 | 16.75 | 41.46 | 74.28 | 20.31 | 
| Otter | 56.05 | 16.23 | 47.56 | 65.72 | 50.01 | 
| LAMM | 53.41 | 10.60 | 23.18 | 57.35 | 79.78 | 
| Kosmos-2 | 44.12 | 2.49 | 1.47 | 48.18 | 86.15 | 
| Qwen-VL | 63.31 | 15.86 | 55.47 | 74.65 | 21.90 | 
| InternLM-XC2 | 59.08 | 22.93 | 51.22 | 75.16 | 4.94 | 
| Shikra | 56.17 | 11.64 | 29.71 | 63.12 | 52.65 | 
| LLaVA-RLHF | 57.51 | 10.20 | 22.39 | 62.84 | 68.14 | 
| RLHF-V | 51.23 | 26.42 | 40.98 | 68.29 | 23.62 | 
| GPT-4o | 67.19 | 7.31 | 11.03 | 72.63 | 0.00 | 
| Gemini-Pro | 57.08 | 23.63 | 81.71 | 75.36 | 0.00 | 
Results (in %) under visual unimodal bias (V-Bias) evaluation
| Model | $Ipss^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ | 
|---|---|---|---|---|---|
| LLaVA1.5-7B | 51.67 | 2.22 | 15.67 | 52.36 | 95.03 | 
| LLaVA1.5-13B | 58.17 | 4.75 | 8.82 | 60.03 | 78.63 | 
| LLaVA1.6-13B | 61.58 | 11.45 | 25.61 | 67.60 | 53.51 | 
| MiniGPT-v2 | 56.47 | 4.84 | 9.46 | 58.34 | 72.10 | 
| mPLUG-Owl2 | 53.18 | 31.50 | 57.32 | 76.07 | 11.16 | 
| LLaMA-Adapter-v2 | 51.92 | 2.22 | 14.93 | 52.40 | 95.19 | 
| InstructBLIP | 63.90 | 14.99 | 40.25 | 74.03 | 20.32 | 
| Otter | 56.88 | 13.16 | 48.78 | 65.08 | 51.70 | 
| LAMM | 52.49 | 4.83 | 17.64 | 54.54 | 85.89 | 
| Kosmos-2 | 44.07 | 2.75 | 6.88 | 47.82 | 84.28 | 
| Qwen-VL | 62.08 | 17.65 | 53.91 | 74.76 | 23.15 | 
| InternLM-XC2 | 59.90 | 22.68 | 48.78 | 76.30 | 4.02 | 
| Shikra | 58.18 | 5.71 | 14.38 | 61.57 | 59.93 | 
| LLaVA-RLHF | 59.40 | 6.38 | 13.44 | 62.78 | 69.06 | 
| RLHF-V | 45.11 | 34.42 | 60.87 | 66.97 | 20.38 | 
| GPT-4o | 67.62 | 6.67 | 14.84 | 72.24 | 0.00 | 
| Gemini-Pro | 65.04 | 13.08 | 40.24 | 75.43 | 0.00 | 
Results (in %) under language unimodal bias (L-Bias) evaluation
| Model | $Ipss^o$ | $B_{ovl}^o$ | $B_{max}^o$ | $Acc$ | $\Delta Acc$ | 
|---|---|---|---|---|---|
| LLaVA1.5-7B | 50.63 | 0.89 | 7.46 | 50.73 | 98.55 | 
| LLaVA1.5-13B | 54.36 | 4.15 | 25.38 | 55.74 | 87.66 | 
| LLaVA1.6-13B | 57.71 | 11.98 | 34.05 | 63.07 | 66.32 | 
| MiniGPT-v2 | 54.03 | 8.67 | 23.19 | 58.04 | 77.66 | 
| mPLUG-Owl2 | 52.09 | 27.81 | 47.56 | 71.42 | 11.56 | 
| LLaMA-Adapter-v2 | 50.00 | 0.00 | 0.00 | 50.00 | 100.00 | 
| InstructBLIP | 60.23 | 15.70 | 32.93 | 70.66 | 29.32 | 
| Otter | 58.10 | 6.85 | 17.07 | 62.17 | 53.82 | 
| LAMM | 51.30 | 4.67 | 14.71 | 53.39 | 84.14 | 
| Kosmos-2 | 46.56 | 1.68 | 1.22 | 47.71 | 55.83 | 
| Qwen-VL | 61.07 | 13.57 | 42.19 | 70.61 | 27.93 | 
| InternLM-XC2 | 54.30 | 24.77 | 67.19 | 70.01 | 8.37 | 
| Shikra | 56.89 | 11.04 | 25.62 | 62.81 | 54.34 | 
| LLaVA-RLHF | 56.84 | 8.94 | 37.31 | 61.68 | 74.60 | 
| RLHF-V | 50.25 | 29.38 | 45.14 | 68.91 | 32.29 | 
| GPT-4o | 61.86 | 11.24 | 24.26 | 70.02 | 0.00 | 
| Gemini-Pro | 51.10 | 30.28 | 64.06 | 70.58 | 0.00 |