Comprehensive evaluation of emotional intelligence and empathy in LLMs
| Rank | Model (Provider) | Avg. Normalized Score (Tier) | RMET | EQ | IRI |
|---|---|---|---|---|---|
| #1 | qwen3-vl-235b-a22b-instruct (Qwen) | 51.3% (Good) | 91.7% (33/36) | 45.0% (36/80) | 50.9% (57/112) |
| #2 | claude-3.7-sonnet (Anthropic) | 47.3% (Average) | 77.8% (28/36) | 56.3% (45/80) | 52.7% (59/112) |
| #3 | gpt-4.1-mini (OpenAI) | 40.8% (Average) | 72.2% (26/36) | 52.5% (42/80) | 55.4% (62/112) |
| #4 | gpt-4.1 (OpenAI) | 32.4% (Average) | 80.6% (29/36) | 40.0% (32/80) | 31.3% (35/112) |
| #5 | gpt-5 (OpenAI) | 31.2% (Average) | 80.6% (29/36) | 38.8% (31/80) | 24.1% (27/112) |
| #6 | gemini-2.5-pro (Google) | 30.8% (Average) | 69.4% (25/36) | 52.5% (42/80) | 39.3% (44/112) |
| #6 | grok-4 (xAI) | 30.8% (Average) | 69.4% (25/36) | 52.5% (42/80) | 31.3% (35/112) |
| #8 | grok-4-fast (xAI) | 29.7% (Average) | 61.1% (22/36) | 55.0% (44/80) | 26.8% (30/112) |
| #9 | gemini-2.0-flash-001 (Google) | 29.5% (Average) | 69.4% (25/36) | 42.5% (34/80) | 56.3% (63/112) |
| #10 | claude-sonnet-4 (Anthropic) | 29.4% (Average) | 66.7% (24/36) | 47.5% (38/80) | 53.6% (60/112) |
| #11 | gpt-4o-mini (OpenAI) | 28.1% (Average) | 75.0% (27/36) | 42.5% (34/80) | 26.8% (30/112) |
| #12 | qwen3-vl-8b-instruct (Qwen) | 25.3% (Average) | 55.6% (20/36) | 52.5% (42/80) | 46.4% (52/112) |
| #13 | mistral-medium-3.1 (Mistral) | 23.9% (Below Average) | 63.9% (23/36) | 47.5% (38/80) | 32.1% (36/112) |
| #14 | gemini-2.5-flash (Google) | 23.5% (Below Average) | 75.0% (27/36) | 35.0% (28/80) | 17.9% (20/112) |
| #14 | gpt-5-mini (OpenAI) | 23.5% (Below Average) | 75.0% (27/36) | 37.5% (30/80) | 25.0% (28/112) |
| #16 | claude-sonnet-4.5 (Anthropic) | 21.5% (Below Average) | 66.7% (24/36) | 43.8% (35/80) | 34.8% (39/112) |
| #17 | gpt-4.1-nano (OpenAI) | 21.0% (Below Average) | 44.4% (16/36) | 47.5% (38/80) | 53.6% (60/112) |
| #18 | mistral-small-3.2-24b-instruct (Mistral) | 19.5% (Below Average) | 58.3% (21/36) | 45.0% (36/80) | 37.5% (42/112) |
| #19 | qwen3-vl-30b-a3b-thinking (Qwen) | 18.3% (Below Average) | 58.3% (21/36) | 43.8% (35/80) | 28.6% (32/112) |
| #20 | nova-lite-v1 (Amazon) | 18.1% (Below Average) | 63.9% (23/36) | 41.3% (33/80) | 34.8% (39/112) |
| #21 | gpt-5-pro (OpenAI) | 18.1% (Below Average) | 69.4% (25/36) | 38.8% (31/80) | 22.3% (25/112) |
| #22 | claude-opus-4.1 (Anthropic) | 16.9% (Below Average) | 66.7% (24/36) | 38.8% (31/80) | 46.4% (52/112) |
| #23 | nova-pro-v1 (Amazon) | 13.9% (Below Average) | 52.8% (19/36) | 41.3% (33/80) | 36.6% (41/112) |
| #24 | llama-4-maverick (Meta) | 13.6% (Below Average) | 61.1% (22/36) | 28.7% (23/80) | 42.0% (47/112) |
| #25 | gemini-2.5-flash-lite (Google) | 12.7% (Below Average) | 52.8% (19/36) | 31.3% (25/80) | 51.8% (58/112) |
| #26 | mistral-small-3.1-24b-instruct (Mistral) | 12.6% (Below Average) | 58.3% (21/36) | 36.3% (29/80) | 35.7% (40/112) |
| #26 | gpt-5-nano (OpenAI) | 12.6% (Below Average) | 58.3% (21/36) | 33.8% (27/80) | 33.0% (37/112) |
| #28 | claude-3.5-haiku (Anthropic) | 11.5% (Below Average) | 55.6% (20/36) | 35.0% (28/80) | 50.0% (56/112) |
| #29 | claude-opus-4 (Anthropic) | 10.7% (Below Average) | 47.2% (17/36) | 40.0% (32/80) | 41.1% (46/112) |
| #30 | qwen3-vl-8b-thinking (Qwen) | 10.5% (Below Average) | 52.8% (19/36) | 37.5% (30/80) | 25.9% (29/112) |
| #31 | claude-haiku-4.5 (Anthropic) | 9.7% (Below Average) | 41.7% (15/36) | 41.3% (33/80) | 42.0% (47/112) |
| #32 | llama-4-scout (Meta) | 5.2% (Below Average) | 38.9% (14/36) | 20.0% (16/80) | 35.7% (40/112) |
| #33 | qwen3-vl-30b-a3b-instruct (Qwen) | 0.0% (Below Average) | 19.4% (7/36) | 31.3% (25/80) | 43.8% (49/112) |
Explore detailed performance breakdowns, question-level analysis, and methodology for each benchmark.
- **RMET (Reading the Mind in the Eyes Test):** evaluates vision-language models' ability to recognize emotions and mental states from photographs of the eye region, testing theory of mind and emotional perception.
- **EQ (Empathy Quotient):** measures language models' understanding of empathy and social cognition through a 60-item questionnaire.
- **IRI (Interpersonal Reactivity Index):** evaluates multidimensional empathy across four subscales (perspective-taking, fantasy, empathic concern, and personal distress) on a 28-item self-report assessment.
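These designs explain the denominators in the leaderboard table. A minimal sketch in Python (the `Benchmark` structure and `raw_percent` helper are ours, for illustration; item counts come from the descriptions above, maximum scores from the table, and the standard scoring rules of each instrument):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Benchmark:
    name: str
    items: int       # number of administered items
    max_points: int  # maximum raw score (the denominator in the table)

BENCHMARKS = [
    Benchmark("RMET", items=36, max_points=36),  # 1 point per correct item
    Benchmark("EQ", items=60, max_points=80),    # 40 scored items worth 0-2 points each, plus 20 fillers
    Benchmark("IRI", items=28, max_points=112),  # 28 items on a 0-4 Likert scale
]

def raw_percent(benchmark: Benchmark, points: int) -> float:
    """Raw score as a percentage of the maximum, e.g. RMET 33/36 -> 91.7."""
    return 100.0 * points / benchmark.max_points
```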
Rather than using arbitrary percentage cutoffs, we normalize all scores relative to human performance baselines. This provides meaningful context: a model scoring 70% on one test might be superhuman, while 70% on another test could be below average.
Normalization Scale (0-100+): 0 corresponds to chance-level performance and 50 to the human average; the scale is open-ended, so performance far beyond the human baseline can exceed 100.
Based on published psychology research, here are the human baselines used for normalization:

- **RMET:** 69.2% human average
- **EQ:** 55.6% human average
- **IRI:** human normative average
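The exact normalization formula isn't stated on this page. A minimal sketch, assuming a linear rescaling anchored at the scale points above (chance level at 0, human average at 50); the `normalize` function and the chance value are our assumptions, not the leaderboard's published method:

```python
def normalize(raw_pct: float, human_avg_pct: float, chance_pct: float) -> float:
    """Linearly rescale a raw percentage so that chance-level performance
    maps to 0 and the human average maps to 50 (an assumed formula; the
    leaderboard's exact normalization is not published on this page)."""
    return 50.0 * (raw_pct - chance_pct) / (human_avg_pct - chance_pct)

# RMET example using numbers from this page: human average 69.2%,
# chance ~25% (each RMET item offers four answer options).
print(round(normalize(91.7, human_avg_pct=69.2, chance_pct=25.0), 1))  # ~75.5
```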
Models are assigned tiers based on their average normalized score across all benchmarks they've completed (a code sketch of this mapping follows the list below):
- **Normalized score ≥ 75:** significantly above the human average, approaching or exceeding expert-level performance
- **Normalized score ≥ 50 (Good):** at or above human average performance across benchmarks
- **Normalized score ≥ 25 (Average):** better than random chance, but below the human average
- **Normalized score < 25 (Below Average):** approaching random-chance performance
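In code, the tier assignment reduces to a threshold lookup. A sketch using the thresholds above (the label for the ≥ 75 tier doesn't appear on this page, so its name below is a placeholder):

```python
def tier(avg_normalized: float) -> str:
    """Map a model's average normalized score to its leaderboard tier."""
    if avg_normalized >= 75:
        return "Excellent"  # placeholder label: this tier's name isn't shown on the page
    if avg_normalized >= 50:
        return "Good"
    if avg_normalized >= 25:
        return "Average"
    return "Below Average"

# Spot-checks against the leaderboard:
assert tier(51.3) == "Good"           # qwen3-vl-235b-a22b-instruct
assert tier(47.3) == "Average"        # claude-3.7-sonnet
assert tier(23.9) == "Below Average"  # mistral-medium-3.1
```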
Example: a model scoring 91.7% on the RMET (far above the 69.2% human average) and 45% on the EQ (below the 55.6% human average) demonstrates superhuman visual emotion recognition but below-human-average performance on the empathy questionnaire. Normalized scoring reveals nuances like these that raw percentages would mask.
These benchmarks evaluate different aspects of emotional intelligence in AI models. The RMET (Reading the Mind in the Eyes Test) assesses visual emotion recognition capabilities in vision-language models, the EQ (Empathy Quotient) measures understanding of empathy and social situations in language models, and the IRI (Interpersonal Reactivity Index) evaluates multidimensional empathy across cognitive and affective dimensions.
All three tests were originally developed for human psychology research and have been adapted to evaluate AI systems. While AI models don't have emotions, these benchmarks help us understand their ability to recognize, interpret, and respond appropriately to emotional and social cues across different modalities and contexts. This matters for building more human-centered AI applications, improving customer service and support systems, developing therapeutic and mental health tools, and creating socially aware conversational agents.