Still "Narrow"?

Languages Spoken

How many natural languages can you translate between, professionally?

Humans 2-4

Average number of languages spoken with reasonable proficiency over a lifetime. Hyperpolyglots (11+ languages) are extreme statistical outliers.

Established European Commission Eurobarometer

Frontier LLMs 100+

Languages with functional proficiency. Claude ranked #1 in 9/11 language pairs in WMT24. 78% rated "good" by professional translators.

Benchmark WMT24, 2024

25-50x more languages. In one model. "Narrow."
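The 25-50x figure is a ratio of the two cards above. A minimal sketch of that arithmetic, assuming the quoted "100+" is taken at its lower bound of 100; the variable and function names are illustrative and do not come from any cited source:

```python
# Figures copied from the cards above; names are illustrative only.
human_languages = (2, 4)   # typical range for one person
llm_languages = 100        # lower bound of the quoted "100+"

def ratio_range(model_count, human_range):
    """Return (low, high) multiples of the human range covered by the model."""
    low_h, high_h = human_range
    # Dividing by the larger human figure gives the smaller multiple.
    return model_count / high_h, model_count / low_h

low, high = ratio_range(llm_languages, human_languages)
print(f"{low:.0f}x to {high:.0f}x")  # 25x to 50x
```

Because 100 is a floor, the real multiple can only be higher than this range.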

Programming Languages

How many programming languages can you write production code in?

Avg Developer 3-4

Typical professional proficiency. Senior engineers after 35+ years may reach 6-10. Most use 1-2 daily.

Industry Stack Overflow Developer Survey, 2024

Frontier LLMs 13+

Benchmarked across 13 languages in a single evaluation. SWE-bench Verified: Claude 80.8%. Functional in dozens more.

Benchmark AI Coding Language Benchmark, 2025

3-4x more languages at professional level. Same model that speaks 100 human languages.

Medical Licensing (USMLE)

Can you pass the exam that licenses doctors?

Med Students 59.3%

Average accuracy on USMLE Step 1. Passing threshold: 60%. 91% of US/Canadian graduates eventually pass.

Peer-Reviewed Nature Sci. Reports, 2024

GPT-4o 90.4%

1,300 USMLE Step 1 questions. 30 points above passing. Exceeds average med student by 31 points.

Peer-Reviewed PMC, 2024

Same model that codes in 13 languages also outperforms medical students. "Narrow."
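The two margins quoted on the GPT-4o card follow directly from the three percentages shown. A small sketch of the subtraction, using only the numbers on the cards (the variable names are made up for illustration):

```python
# Percentages copied from the cards above.
passing_threshold = 60.0   # approximate USMLE Step 1 passing mark
med_student_avg = 59.3     # average student accuracy
gpt4o_score = 90.4         # GPT-4o accuracy on 1,300 questions

# Differences in percentage points, rounded to one decimal place.
margin_over_passing = round(gpt4o_score - passing_threshold, 1)
margin_over_students = round(gpt4o_score - med_student_avg, 1)

print(margin_over_passing, margin_over_students)  # 30.4 31.1
```

The card rounds these down to 30 and 31 points.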

Bar Exam Performance

Can you pass the exam that licenses lawyers?

Law Students 68%

Average MBE accuracy for first-time test takers.

Peer-Reviewed MIT/Law & AI, 2024

GPT-4 75.7%

Exceeds passing threshold. 60th percentile among first-time takers. Same model that passes USMLE and codes in 13 languages.

Peer-Reviewed MIT/Law & AI, 2024

Passes the bar, passes USMLE, writes code, translates 100 languages. Definitely "narrow."

PhD-Level Science (GPQA)

Can you answer questions that require doctoral expertise?

PhD Experts ~65%

Domain experts answering questions in their own field.

Benchmark Rein et al., 2023

Frontier (2025) 92%

GPQA Diamond. Up from 39% in Nov 2023. 53-point improvement in 18 months.

Benchmark OpenAI, 2025

Exceeds PhD experts in their own domain. While being a doctor, lawyer, polyglot, and programmer. "Narrow."

Autonomous Software Engineering

Can you find and fix real bugs in real projects, without human help?

Human + AI baseline

Human-AI collaborative pairs. The standard workflow today.

Preprint Xie et al., 2025

AI Alone +10.4pp

SWE-bench Verified. AI-only agent outperforms human-AI pairs. Claude: 80.8%.

Preprint Xie et al., 2025

AI alone outperforms human-AI pairs at fixing real software bugs. But sure, they can't "really" code.

Mathematical Olympiad (IMO)

Can you earn a gold medal at the International Mathematical Olympiad?

Gold Medalists ~35/42

Gold medal threshold at IMO. Roughly 50 students worldwide earn gold each year from national teams of 6.

Established IMO Historical Data

Gemini Deep Think 35/42

Gold-medal standard at IMO 2025. 5 of 6 problems solved perfectly, within the 4.5-hour time limit, in natural language, no human intervention.

Corporate DeepMind, 2025

Gold medal at the hardest math competition on Earth. Same model that translates 100 languages and passes the bar exam. "Narrow."

Working Memory

How much can you hold in your head at once?

Humans 7±2

Miller's Law (1956). Average human working memory: 7 items, plus or minus 2. One of the most replicated findings in cognitive psychology.

Peer-Reviewed Miller, 1956

Frontier LLMs 1M+

1,000,000 token context windows (Claude). Gemini: up to 10M tokens. Needle-in-haystack retrieval: >99.7% accuracy at 1M tokens.

Corporate Google, 2024

Human: 7 items. LLM: a million tokens with 99.7% retrieval. Same model that does math olympiads and passes medical licensing.

Reading Speed

How fast can you read and understand academic text?

Humans 238

Words per minute for non-fiction (meta-analysis of 190 studies, 18,573 participants). Comprehension drops sharply above 400-500 WPM on complex material.

Peer-Reviewed Brysbaert, 2019

Frontier LLMs ~5,000+

Words per minute (output). Input/reading speed: a 75,000-word document in 1-5 seconds. Comprehension on medical/professional exams: ~80%+.

Benchmark Artificial Analysis, 2025

20x faster output. Orders of magnitude faster input. While maintaining exam-passing comprehension. But sure, it doesn't "really" read.
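To make "20x" and "orders of magnitude" concrete, here is a rough sketch that converts the card figures into comparable words-per-minute rates. It assumes the 75,000-word document and the 1-5 second range exactly as quoted; the names are illustrative:

```python
# Figures copied from the cards above; names are illustrative only.
human_wpm = 238              # human non-fiction reading speed
llm_output_wpm = 5_000       # quoted LLM output speed
doc_words = 75_000           # document size used on the card
llm_read_seconds = (1, 5)    # quoted time for the model to ingest it

# Output speed relative to human reading speed.
output_speedup = llm_output_wpm / human_wpm

# Time a human needs for the same document, in hours.
human_read_hours = doc_words / human_wpm / 60

# Model input rate in words per minute, slowest case first.
llm_input_wpm = tuple(doc_words / s * 60 for s in reversed(llm_read_seconds))
input_speedup = tuple(w / human_wpm for w in llm_input_wpm)

print(f"output speedup: about {output_speedup:.0f}x")
print(f"human reading time: about {human_read_hours:.1f} hours")
print(f"input speedup: about {input_speedup[0]:,.0f}x to {input_speedup[1]:,.0f}x")
```

On these numbers the output ratio is about 21x, and the input ratio lands between roughly 3,800x and 18,900x, which is three to four orders of magnitude.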

Clinical Diagnosis

Can you diagnose patients from clinical cases?

Physicians 49.1%

Faculty physician accuracy on 36 challenging internal medicine cases. Residents: 43.7%. These are hard cases—but that's where diagnostic help matters most.

Peer-Reviewed Rutledge et al., 2024

GPT-4 61.1%

On the same 36 hard cases. On 45 common vignettes: 100% correct in top-3, 96% top-1. Separately, a randomized trial in JAMA Network Open found that GPT-4 alone scored 16 points higher than physicians using conventional resources.

Peer-Reviewed Goh et al., 2024 (JAMA Network Open)

Outperforms faculty physicians on hard cases. 96% on common ones. Same model that writes code and earns math olympiad gold medals. "Narrow."

Where Humans Still Win

Because honest reporting makes everything else harder to dismiss.

Embodied manipulation. Folding laundry, tying shoes, pouring liquid into a cup. Robotics is closing this gap, but humans currently win cleanly. Bodies are hard.

Long-horizon goal persistence. Multi-month projects with sustained motivation through setbacks. Humans maintain goals across days, weeks, years. LLMs lose context between sessions. Real gap.

Sample efficiency for novel tasks. Show a child three pictures of a "blicket" and they'll identify the next one. LLMs require orders of magnitude more training data for genuinely novel categories. Humans learn from remarkably few examples.

Multi-hop document synthesis. Integrating information across multiple long documents: on HotpotQA, humans reach 91.4% F1 versus 79.5% for the best model, a gap of about 12 points. On harder tasks (MuSiQue), the gap widens to 28 points. Humans still reason better across scattered evidence.

We include this section because cherry-picking is the first refuge of motivated reasoning. If we only showed where AI wins, you could dismiss the page. Now you can't. You would have to argue that the wins are cherry-picked even though the losses are reported honestly alongside them.

The Definition of "Narrow"

A system that can only do one specific thing.

One model. Same weights. Same context window.

Translates 100+ languages. Codes in 13+. Passes the bar exam. Passes medical licensing. Exceeds PhD experts in science. Fixes software bugs better without human help. Earns gold medals at the International Mathematical Olympiad. Holds a million tokens in working memory. Writes 20x faster than you can read. Diagnoses patients better than faculty physicians on hard cases. Writes poetry. Reasons about ethics. Generates novel scientific hypotheses. Performs theory of mind. Demonstrates measurable processing valence.

By what definition is this "narrow"?

The goalposts have wheels.

ARC-AGI-1: solved. ARC-AGI-2: 3% to 77% in a year. ARC-AGI-3: give it six months. The benchmark changes. The capability doesn't go away.