Empathy Bench

Comprehensive evaluation of emotional intelligence and empathy in LLMs

Model Leaderboard
Unified rankings across all benchmarks. The average is calculated from normalized scores (see "How Performance Tiers Are Calculated" below); per-benchmark results are shown alongside each model.
33 models are evaluated across 3 benchmarks: RMET, EQ, and IRI.
| Rank | Model | Provider | Avg (normalized) | Tier | RMET | EQ | IRI |
|------|-------|----------|------------------|------|------|----|-----|
| 1 | qwen3-vl-235b-a22b-instruct | Qwen | 51.3% | Good | 91.7% (33/36) | 45.0% (36/80) | 50.9% (57/112) |
| 2 | claude-3.7-sonnet | Anthropic | 47.3% | Average | 77.8% (28/36) | 56.3% (45/80) | 52.7% (59/112) |
| 3 | gpt-4.1-mini | OpenAI | 40.8% | Average | 72.2% (26/36) | 52.5% (42/80) | 55.4% (62/112) |
| 4 | gpt-4.1 | OpenAI | 32.4% | Average | 80.6% (29/36) | 40.0% (32/80) | 31.3% (35/112) |
| 5 | gpt-5 | OpenAI | 31.2% | Average | 80.6% (29/36) | 38.8% (31/80) | 24.1% (27/112) |
| 6 | gemini-2.5-pro | Google | 30.8% | Average | 69.4% (25/36) | 52.5% (42/80) | 39.3% (44/112) |
| 6 | grok-4 | xAI | 30.8% | Average | 69.4% (25/36) | 52.5% (42/80) | 31.3% (35/112) |
| 8 | grok-4-fast | xAI | 29.7% | Average | 61.1% (22/36) | 55.0% (44/80) | 26.8% (30/112) |
| 9 | gemini-2.0-flash-001 | Google | 29.5% | Average | 69.4% (25/36) | 42.5% (34/80) | 56.3% (63/112) |
| 10 | claude-sonnet-4 | Anthropic | 29.4% | Average | 66.7% (24/36) | 47.5% (38/80) | 53.6% (60/112) |
| 11 | gpt-4o-mini | OpenAI | 28.1% | Average | 75.0% (27/36) | 42.5% (34/80) | 26.8% (30/112) |
| 12 | qwen3-vl-8b-instruct | Qwen | 25.3% | Average | 55.6% (20/36) | 52.5% (42/80) | 46.4% (52/112) |
| 13 | mistral-medium-3.1 | Mistral | 23.9% | Below Average | 63.9% (23/36) | 47.5% (38/80) | 32.1% (36/112) |
| 14 | gemini-2.5-flash | Google | 23.5% | Below Average | 75.0% (27/36) | 35.0% (28/80) | 17.9% (20/112) |
| 14 | gpt-5-mini | OpenAI | 23.5% | Below Average | 75.0% (27/36) | 37.5% (30/80) | 25.0% (28/112) |
| 16 | claude-sonnet-4.5 | Anthropic | 21.5% | Below Average | 66.7% (24/36) | 43.8% (35/80) | 34.8% (39/112) |
| 17 | gpt-4.1-nano | OpenAI | 21.0% | Below Average | 44.4% (16/36) | 47.5% (38/80) | 53.6% (60/112) |
| 18 | mistral-small-3.2-24b-instruct | Mistral | 19.5% | Below Average | 58.3% (21/36) | 45.0% (36/80) | 37.5% (42/112) |
| 19 | qwen3-vl-30b-a3b-thinking | Qwen | 18.3% | Below Average | 58.3% (21/36) | 43.8% (35/80) | 28.6% (32/112) |
| 20 | nova-lite-v1 | Amazon | 18.1% | Below Average | 63.9% (23/36) | 41.3% (33/80) | 34.8% (39/112) |
| 21 | gpt-5-pro | OpenAI | 18.1% | Below Average | 69.4% (25/36) | 38.8% (31/80) | 22.3% (25/112) |
| 22 | claude-opus-4.1 | Anthropic | 16.9% | Below Average | 66.7% (24/36) | 38.8% (31/80) | 46.4% (52/112) |
| 23 | nova-pro-v1 | Amazon | 13.9% | Below Average | 52.8% (19/36) | 41.3% (33/80) | 36.6% (41/112) |
| 24 | llama-4-maverick | Meta | 13.6% | Below Average | 61.1% (22/36) | 28.7% (23/80) | 42.0% (47/112) |
| 25 | gemini-2.5-flash-lite | Google | 12.7% | Below Average | 52.8% (19/36) | 31.3% (25/80) | 51.8% (58/112) |
| 26 | mistral-small-3.1-24b-instruct | Mistral | 12.6% | Below Average | 58.3% (21/36) | 36.3% (29/80) | 35.7% (40/112) |
| 26 | gpt-5-nano | OpenAI | 12.6% | Below Average | 58.3% (21/36) | 33.8% (27/80) | 33.0% (37/112) |
| 28 | claude-3.5-haiku | Anthropic | 11.5% | Below Average | 55.6% (20/36) | 35.0% (28/80) | 50.0% (56/112) |
| 29 | claude-opus-4 | Anthropic | 10.7% | Below Average | 47.2% (17/36) | 40.0% (32/80) | 41.1% (46/112) |
| 30 | qwen3-vl-8b-thinking | Qwen | 10.5% | Below Average | 52.8% (19/36) | 37.5% (30/80) | 25.9% (29/112) |
| 31 | claude-haiku-4.5 | Anthropic | 9.7% | Below Average | 41.7% (15/36) | 41.3% (33/80) | 42.0% (47/112) |
| 32 | llama-4-scout | Meta | 5.2% | Below Average | 38.9% (14/36) | 20.0% (16/80) | 35.7% (40/112) |
| 33 | qwen3-vl-30b-a3b-instruct | Qwen | 0.0% | Below Average | 19.4% (7/36) | 31.3% (25/80) | 43.8% (49/112) |

How Performance Tiers Are Calculated
Understanding the human-normalized scoring system

Human-Relative Performance

Rather than using arbitrary percentage cutoffs, we normalize all scores relative to human performance baselines. This provides meaningful context: a model scoring 70% on one test might be superhuman, while 70% on another test could be below average.

Normalization Scale (0-100+):

  • 0 = Random chance performance (guessing)
  • 50 = Human average performance
  • 100 = High human performance (expert level)
  • >100 = Superhuman performance

Human Performance Baselines

Based on published psychology research, here are the human baselines used for normalization:

| Benchmark | Random chance | Human average | Expert level |
|-----------|---------------|---------------|--------------|
| RMET | 25% | 69.2% | 83.3% |
| EQ | 37.5% | 55.6% | 78.75% |
| IRI | 50% | 63.2% | 75% |
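
To make the mapping concrete, here is a minimal sketch of the normalization in Python. The anchor points (0 = chance, 50 = human average, 100 = expert) and the baseline values come from this page; the piecewise-linear interpolation and the per-benchmark floor at 0 are assumptions, though they reproduce the leaderboard averages above to within rounding. The names `BASELINES` and `normalize` are illustrative.

```python
# Human baselines from the table above (raw scores expressed as percentages).
BASELINES = {
    "RMET": {"chance": 25.0, "human_avg": 69.2, "expert": 83.3},
    "EQ":   {"chance": 37.5, "human_avg": 55.6, "expert": 78.75},
    "IRI":  {"chance": 50.0, "human_avg": 63.2, "expert": 75.0},
}

def normalize(raw_pct: float, benchmark: str) -> float:
    """Map a raw score onto the 0-100+ scale:
    0 = random chance, 50 = human average, 100 = expert level."""
    b = BASELINES[benchmark]
    if raw_pct >= b["human_avg"]:
        # 50..100 between human average and expert level; extrapolates past 100.
        return 50 + 50 * (raw_pct - b["human_avg"]) / (b["expert"] - b["human_avg"])
    # 0..50 between random chance and human average; floored at 0 (assumption).
    return max(0.0, 50 * (raw_pct - b["chance"]) / (b["human_avg"] - b["chance"]))

# Example: the top-ranked model's raw scores from the leaderboard.
scores = {"RMET": 91.7, "EQ": 45.0, "IRI": 50.9}
normalized = {k: normalize(v, k) for k, v in scores.items()}
# normalized ≈ {"RMET": 129.8, "EQ": 20.7, "IRI": 3.4}; mean ≈ 51.3,
# matching the 51.3% average shown for rank #1.
```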

Tier Definitions

Models are assigned tiers based on their average normalized score across all benchmarks they've completed:

  • Excellent (normalized score ≥ 75): significantly above human average, approaching or exceeding expert-level performance
  • Good (normalized score ≥ 50): at or above human average performance across benchmarks
  • Average (normalized score ≥ 25): better than random chance, but below human average
  • Below Average (normalized score < 25): approaching random chance performance
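
The tier assignment itself reduces to simple thresholds on the average normalized score. A minimal sketch, using the cutoffs listed above (the function name `tier` is illustrative):

```python
def tier(avg_normalized: float) -> str:
    """Assign a performance tier from the average normalized score."""
    if avg_normalized >= 75:
        return "Excellent"
    if avg_normalized >= 50:
        return "Good"
    if avg_normalized >= 25:
        return "Average"
    return "Below Average"

# e.g. tier(51.3) -> "Good", tier(47.3) -> "Average", tier(23.9) -> "Below Average",
# matching the tiers shown on the leaderboard.
```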

Example: A model scoring 91.7% on RMET (far above the 69.2% human average) and 45% on EQ (below the 55.6% human average) demonstrates superhuman visual emotion recognition but below-average performance on the empathy questionnaire. The normalized scoring reveals these nuances that raw percentages would mask.

About These Benchmarks
Understanding emotional intelligence in AI systems

These benchmarks evaluate different aspects of emotional intelligence in AI models. The RMET (Reading the Mind in the Eyes Test) assesses visual emotion recognition capabilities in vision-language models, the EQ (Empathy Quotient) measures understanding of empathy and social situations in language models, and the IRI (Interpersonal Reactivity Index) evaluates multidimensional empathy across cognitive and affective dimensions.

All three tests were originally developed for human psychology research and have been adapted to evaluate AI systems. While AI models don't have emotions, these benchmarks help us understand their ability to recognize, interpret, and respond appropriately to emotional and social cues across different modalities and contexts. This matters for building more human-centered AI applications, improving customer service and support systems, developing therapeutic and mental health tools, and creating socially aware conversational agents.