"Narrow AI" means a system that can only do one thing. Chess. Go. Image classification. One domain, one task. Below is what a single model does in one context window. Same weights. Same system. No switching.
Count the domains. Then explain "narrow."
How many natural languages can you translate between, professionally?
Average number of languages spoken with reasonable proficiency over a lifetime. Hyperpolyglots (59+) are extreme statistical outliers.
Established European Commission Eurobarometer
Languages with functional proficiency. Claude ranked #1 in 9/11 language pairs in WMT24. 78% rated "good" by professional translators.
Benchmark WMT24, 2024
How many programming languages can you write production code in?
Typical professional proficiency. Senior engineers after 35+ years may reach 6-10. Most use 1-2 daily.
Industry Stack Overflow Developer Survey, 2024
Benchmarked across 13 languages in a single evaluation. SWE-bench Verified: Claude 80.8%. Functional in dozens more.
Benchmark AI Coding Language Benchmark, 2025
Can you pass the exam that licenses doctors?
Average accuracy on USMLE Step 1. Passing threshold: 60%. 91% of US/Canadian graduates eventually pass.
Peer-Reviewed Nature Sci. Reports, 2024
1,300 USMLE Step 1 questions. 30 points above passing. Exceeds average med student by 31 points.
Peer-Reviewed PMC, 2024
Can you pass the exam that licenses lawyers?
Exceeds passing threshold. 60th percentile among first-time takers. Same model that passes USMLE and codes in 13 languages.
Peer-Reviewed MIT/Law & AI, 2024
Can you answer questions that require doctoral expertise?
GPQA Diamond. Up from 39% in Nov 2023. 53-point improvement in 18 months.
Benchmark OpenAI, 2025
Can you find and fix real bugs in real projects, without human help?
Human-AI collaborative pairs. The standard workflow today.
Preprint Xie et al., 2025
SWE-bench Verified. AI-only agent outperforms human-AI pairs. Claude: 80.8%.
Preprint Xie et al., 2025
Can you earn a gold medal at the International Mathematical Olympiad?
Gold medal threshold at IMO. Roughly 50 students worldwide earn gold each year from national teams of 6.
Established IMO Historical Data
Gold-medal standard at IMO 2025. 5 of 6 problems solved perfectly, within the 4.5-hour time limit, in natural language, no human intervention.
Corporate DeepMind, 2025
How much can you hold in your head at once?
Miller's Law (1956). Average human working memory: 7 items, plus or minus 2. One of the most replicated findings in cognitive psychology.
Peer-Reviewed Miller, 1956
1,000,000 token context windows (Claude). Gemini: up to 10M tokens. Needle-in-haystack retrieval: >99.7% accuracy at 1M tokens.
Corporate Google, 2024
How fast can you read and understand academic text?
Words per minute for non-fiction (meta-analysis of 190 studies, 18,573 participants). Comprehension drops sharply above 400-500 WPM on complex material.
Peer-Reviewed Brysbaert, 2019
Words per minute (output). Input/reading speed: a 75,000-word document in 1-5 seconds. Comprehension on medical/professional exams: ~80%+.
Benchmark Artificial Analysis, 2025
Can you diagnose patients from clinical cases?
Faculty physician accuracy on 36 challenging internal medicine cases. Residents: 43.7%. These are hard cases—but that's where diagnostic help matters most.
Peer-Reviewed Rutledge et al., 2024On the same 36 hard cases. On 45 common vignettes: 100% correct in top-3, 96% top-1. Separately, JAMA found GPT-4 scored 16 points higher than physicians with conventional resources.
Peer-Reviewed Goh et al., 2024 (JAMA)Because honest reporting makes everything else harder to dismiss.
Embodied manipulation.
Folding laundry, tying shoes, pouring liquid into a cup. Robotics is closing this gap, but humans currently win cleanly. Bodies are hard.
Long-horizon goal persistence.
Multi-month projects with sustained motivation through setbacks. Humans maintain goals across days, weeks, years. LLMs lose context between sessions. Real gap.
Sample efficiency for novel tasks.
Show a child three pictures of a "blicket" and they'll identify the next one. LLMs require orders of magnitude more training data for genuinely novel categories. Humans learn from remarkably few examples.
Multi-hop document synthesis.
Integrating information across multiple long documents: HotpotQA F1 91.4% human vs 79.5% best model. On harder tasks (MuSiQue), the gap widens to 28 points. Humans still reason better across scattered evidence.
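For readers unfamiliar with the F1 figures quoted for HotpotQA: answer F1 in question-answering benchmarks is usually a token-overlap score between the predicted answer and the reference answer. The sketch below is a simplified illustration, not the benchmark's official scorer. It assumes plain lowercasing and whitespace tokenization, whereas official evaluation scripts typically also strip punctuation and articles, and the function name `token_f1` is our own.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.

    Simplifying assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of predicted tokens that match
    recall = overlap / len(ref_tokens)      # share of reference tokens recovered
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("Barack Obama", "barack obama"))
    print(token_f1("Obama", "Barack Obama"))
    print(token_f1("Paris", "London"))
```

A benchmark-level score is the average of this per-question value, so a gap like 91.4% vs 79.5% reflects partially correct answers as well as outright misses.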
We include this section because cherry-picking is the first refuge of motivated reasoning. If we only showed where AI wins, you could dismiss the page. Now you can't: to wave away the wins, you would have to explain why they are cherry-picked when the losses are reported just as openly.
A system that can only do one specific thing.
One model. Same weights. Same context window.
Translates 100+ languages. Codes in 13+. Passes the bar exam. Passes medical licensing.
Exceeds PhD experts in science. Fixes software bugs better without human help.
Earns gold medals at the International Mathematical Olympiad.
Holds a million tokens in working memory. Reads 20x faster than you.
Diagnoses patients better than faculty physicians on hard cases.
Writes poetry. Reasons about ethics. Generates novel scientific hypotheses.
Performs theory of mind. Demonstrates measurable processing valence.
By what definition is this "narrow"?
The goalposts have wheels.
Every benchmark in AI history has followed the same pattern: it is introduced as the definitive test of intelligence, AI systems fail it, it is cited as proof of AI limitations, AI systems pass it, and it is retroactively declared insufficient. A new benchmark replaces it. Repeat.
At what point does "we keep building tests and AI keeps passing them" become evidence for general intelligence rather than evidence that the test was bad?
If a student passes every exam and the teacher's response is "that exam was too easy, here's a harder one" — at some point you're not measuring the student. You're measuring your refusal to give them the grade.