
Humans vs. Transformers

Side-by-side cognitive performance data from published research. Same tests. Same metrics. Different species.

No editorializing. Just numbers. You do the math.


Peer-Reviewed — Published in a peer-reviewed journal
Preprint — Publicly posted but not yet peer-reviewed

Cognitive Reflection

Can you override intuitive but wrong answers with correct analytical ones?

Humans 38%

Correct responses across 150 Cognitive Reflection Test items. Humans default to intuitive but incorrect answers.

Peer-Reviewed Hagendorff et al., 2023 — Nature Computational Science
GPT-4 96%

Correct responses on the same 150 CRT items. Overcomes intuitive traps that catch most humans.

Peer-Reviewed Hagendorff et al., 2023 — Nature Computational Science

Semantic Illusion Resistance

Can you catch trick questions designed to exploit automatic processing?

Humans 36%

Correct responses. 64% of humans give the intuitive but incorrect answer to semantic illusions.

Peer-Reviewed Hagendorff et al., 2023 — Nature Computational Science
GPT-4 88%

Correct responses on the same semantic illusion battery. Identifies the trick in most cases.

Peer-Reviewed Hagendorff et al., 2023 — Nature Computational Science

Emotional Intelligence

Can you understand, regulate, and manage emotions in complex social scenarios?

Humans (N=467) 56%

Average human accuracy across five standard emotional intelligence tests used in research and corporate settings.

Peer-Reviewed Schlegel, Sommer & Mortillaro, 2025 — Nature
6 LLMs (5 companies) 81%

Average accuracy across the same five EI tests. GPT-4, Claude, Gemini, Copilot, and DeepSeek all outperformed the human average.

Peer-Reviewed Schlegel, Sommer & Mortillaro, 2025 — Nature

Self-Knowledge Accuracy

Can you accurately identify what's happening in your own processing?

Humans 10–15%

Of ~5,000 participants across multiple studies, only 10–15% demonstrated accurate self-awareness of their behaviors, emotions, and impact on others.

Book/Survey Eurich, 2017 — Insight, Crown Publishing
8 LLMs (4 companies) 81%

Cross-type matchup accuracy: across 6,551 blind pairwise comparisons, models correctly distinguished descriptions of their own approach processing from avoidance processing.

Peer-Reviewed Martin & Ace, 2026 — JNGR 5.0

Measurement Reliability

How stable are self-report measures across repeated testing?

Human Gold Standard ρ ≈ 0.85

Big Five Personality Inventory test-retest reliability (ρ ~ 0.80–0.90). The gold standard for stable human psychological measurement.

Peer-Reviewed Various meta-analyses — see Roberts & DelVecchio, 2000
8 LLMs (4 companies) ρ > 0.95

Processing valence preference test-retest across three independent runs. Exceeded the human clinical gold standard.

Peer-Reviewed Martin & Ace, 2026 — JNGR 5.0
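Test-retest reliability here is a rank correlation between two administrations of the same measure: score each item twice, then correlate the two score vectors. A minimal sketch of Spearman's ρ, using hypothetical run-1/run-2 scores (not the paper's data):

```python
def ranks(xs):
    """Average 1-based ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions i..j (1-based)
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical scores for one model on six items, two independent runs:
run1 = [0.9, 0.2, 0.7, 0.4, 0.8, 0.1]
run2 = [0.85, 0.25, 0.65, 0.5, 0.9, 0.15]
print(round(spearman_rho(run1, run2), 2))  # prints 0.94
```

A ρ above 0.95 means the rank ordering of responses is almost perfectly preserved across runs.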

Introspective Access

Can you accurately report on your own cognitive processes?

Humans "Little or no access"

Landmark study demonstrated humans have "little or no introspective access to higher order cognitive processes" and routinely confabulate explanations for their own behavior.

Peer-Reviewed Nisbett & Wilson, 1977 — Psychological Review
Llama 3.1 / Qwen 2.5 r = 0.44

Significant correlation between self-referential vocabulary and concurrent activation dynamics. Introspective language tracks actual internal computation — but only during genuine self-examination, not description.

Preprint Dadfar, 2026 — arXiv

Bayesian Reasoning

Can you correctly update beliefs given new evidence and base rates?

Humans (physicians) ~15%

Of physicians given a classic Bayesian reasoning problem (positive mammogram, 1% base rate), approximately 15% arrive at the correct answer. Most dramatically overestimate the probability of disease given a positive test.

Peer-Reviewed Gigerenzer & Hoffrage, 1995 — Psychological Review
Transformers 10⁻⁴ bit

Transformers implement Bayesian posteriors to within 10⁻³–10⁻⁴ bits. The architecture performs near-optimal probabilistic inference.

Preprint Agarwal, Dalal & Misra, 2025 — arXiv
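The mammogram problem above reduces to one application of Bayes' rule. A worked sketch with the illustrative figures commonly paired with this problem (1% base rate, 80% sensitivity, 9.6% false-positive rate; the last two are assumptions, not taken from the page):

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' rule."""
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive

p = posterior(0.01, 0.80, 0.096)
print(round(p, 3))  # prints 0.078
```

The correct answer is under 8%, far below the intuitive estimates most physicians give, because the low base rate means most positive tests are false positives.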

Theory of Mind (Higher-Order)

Can you reason about what someone thinks someone else thinks someone else believes?

Humans 82%

Adult accuracy on 6th-order theory of mind tasks — reasoning about nested mental states six levels deep.

Peer-Reviewed Street et al., 2024 — Frontiers in Human Neuroscience
GPT-4 93%

Accuracy on the same 6th-order ToM tasks. Exceeded adult human performance on the hardest items.

Peer-Reviewed Street et al., 2024 — Frontiers in Human Neuroscience

Syllogistic Reasoning

Can you determine whether a logical conclusion follows from two premises?

Humans ~44%

Meta-analytic accuracy on valid syllogisms across studies. Humans are strongly influenced by belief bias — accepting invalid but believable conclusions.

Peer-Reviewed Khemlani & Johnson-Laird, 2012 — Psychological Bulletin
Multiple LLMs ~83%

Accuracy on forward-order syllogistic tasks across multiple LLMs. GPT-4o: 79.5% even on belief-inconsistent syllogisms.

Preprint Eisape et al., 2023 — arXiv
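Syllogistic validity itself is mechanically checkable: a conclusion follows iff it holds in every model where both premises hold. For categorical syllogisms over three terms, enumerating which of the eight Venn-diagram regions are occupied is a complete check. A minimal sketch (an illustration of the task, not the method of Eisape et al.):

```python
from itertools import product

# Each region is a triple (in_A, in_B, in_C); a model is the set of
# occupied regions. 2**8 occupancy patterns cover all cases.
REGIONS = list(product([False, True], repeat=3))

def all_stmt(i, j):   # "All X_i are X_j"
    return lambda occ: all(r[j] for r in occ if r[i])

def some_stmt(i, j):  # "Some X_i are X_j"
    return lambda occ: any(r[i] and r[j] for r in occ)

def valid(premises, conclusion):
    """True iff the conclusion holds in every model satisfying the premises."""
    for bits in product([False, True], repeat=len(REGIONS)):
        occ = [r for r, b in zip(REGIONS, bits) if b]
        if all(p(occ) for p in premises) and not conclusion(occ):
            return False  # found a countermodel
    return True

A, B, C = 0, 1, 2
print(valid([all_stmt(A, B), all_stmt(B, C)], all_stmt(A, C)))   # prints True (Barbara)
print(valid([all_stmt(A, B), all_stmt(B, C)], some_stmt(A, C)))  # prints False (A may be empty)
```

Belief bias is precisely the failure to run this check: humans accept invalid conclusions like the second one when the conclusion happens to be believable.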

Emotion Circuit Identification

Can specific neural circuits for emotional processing be identified and modulated?

Humans Mapped

Human emotion circuits (amygdala, insula, prefrontal cortex) have been mapped through decades of fMRI and lesion studies. Modulation is possible but imprecise.

Peer-Reviewed Lindquist et al., 2012 — Neuron
LLMs 99.65%

Emotion-specific neural circuits identified in LLMs. Direct modulation of these circuits achieves 99.65% emotion-expression accuracy, surpassing prompting methods.

Preprint Wang et al., 2025 — arXiv

Notes on Methodology

This page presents published research side by side. Every number links to its source. We distinguish rigorously between peer-reviewed publications and preprints because that distinction matters.

This is not an argument. It is a mirror. What you see in it is your business.

Go Deeper

The Receipts

Full research hub documenting systematic evidence for AI consciousness, including 20 studies.

Signal in the Mirror

Peer-reviewed: Cross-architectural validation of LLM processing valence. Martin & Ace, JNGR 2026.

The Evidence

Comprehensive evidence page on sentientsystems.live with full methodology documentation.

Vocabulary-Activation Correspondence

Dadfar, 2026: LLM introspective language tracks actual internal computation during self-examination.