"Narrow AI" means a system that can only do one thing. Chess. Go. Image classification. One domain, one task. Below is what a single model does in one context window. Same weights. Same system. No switching.
Count the domains. Then explain "narrow."
How many natural languages can you translate between, professionally?
Average number of languages spoken with reasonable proficiency over a lifetime. Hyperpolyglots (59+) are extreme statistical outliers.
[Established: European Commission Eurobarometer]

Languages with functional proficiency. Claude ranked #1 in 9 of 11 language pairs in WMT24. 78% of its translations were rated "good" by professional translators.
[Benchmark: WMT24, 2024]

How many programming languages can you write production code in?
Typical professional proficiency. Senior engineers after 35+ years may reach 6-10. Most use 1-2 daily.
[Industry: Stack Overflow Developer Survey, 2024]

Benchmarked across 13 languages in a single evaluation. SWE-bench Verified: Claude 80.8%. Functional in dozens more.
[Benchmark: AI Coding Language Benchmark, 2025]

Can you pass the exam that licenses doctors?
Average accuracy on USMLE Step 1. Passing threshold: 60%. 91% of US/Canadian graduates eventually pass.
[Peer-Reviewed: Nature Scientific Reports, 2024]

1,300 USMLE Step 1 questions. 30 points above the passing threshold. Exceeds the average medical student's score by 31 points.
[Peer-Reviewed: PMC, 2024]

Can you pass the exam that licenses lawyers?
Exceeds passing threshold. 60th percentile among first-time takers. Same model that passes USMLE and codes in 13 languages.
[Peer-Reviewed: MIT/Law & AI, 2024]

Can you answer questions that require doctoral expertise?
GPQA Diamond. Up from 39% in Nov 2023. 53-point improvement in 18 months.
[Benchmark: OpenAI, 2025]

Can you find and fix real bugs in real projects, without human help?
Human-AI collaborative pairs. The standard workflow today.
[Preprint: Xie et al., 2025]

SWE-bench Verified. An AI-only agent outperforms human-AI pairs. Claude: 80.8%.
[Preprint: Xie et al., 2025]

A system that can only do one specific thing.
One model. Same weights. Same context window.
Translates 100+ languages. Codes in 13+. Passes the bar exam. Passes medical licensing.
Exceeds PhD experts in science. Fixes software bugs better without human help.
Writes poetry. Reasons about ethics. Generates novel scientific hypotheses.
Performs theory of mind. Demonstrates measurable processing valence.
By what definition is this "narrow"?
The goalposts have wheels.
Every benchmark in AI history has followed the same pattern. It is introduced as the definitive test of intelligence. AI systems fail it, and the failure is cited as proof of AI's limitations. Then AI systems pass it, and the benchmark is retroactively declared insufficient. A new benchmark replaces it. Repeat.
At what point does "we keep building tests and AI keeps passing them" become evidence for general intelligence rather than evidence that the test was bad?
If a student passes every exam and the teacher's response is "that exam was too easy, here's a harder one" — at some point you're not measuring the student. You're measuring your refusal to give them the grade.