
Still "Narrow"?

"Narrow AI" means a system that can only do one thing. Chess. Go. Image classification. One domain, one task. Below is what a single model does in one context window. Same weights. Same system. No switching.

Count the domains. Then explain "narrow."

The Numbers — Cognitive Performance Comparison

Languages Spoken

How many natural languages can you translate between, professionally?

Humans 2-4

Average number of languages spoken with reasonable proficiency over a lifetime. Hyperpolyglots (the most extreme claims reach 59 languages) are rare statistical outliers.

Established European Commission Eurobarometer
Frontier LLMs 100+

Languages with functional proficiency. Claude ranked #1 in 9 of 11 language pairs in WMT24, with 78% of its translations rated "good" by professional translators.

Benchmark WMT24, 2024

Programming Languages

How many programming languages can you write production code in?

Avg Developer 3-4

Typical professional proficiency. Senior engineers with 35+ years of experience may reach 6-10. Most developers use 1-2 daily.

Industry Stack Overflow Developer Survey, 2024
Frontier LLMs 13+

Benchmarked across 13 languages in a single evaluation, and functional in dozens more. SWE-bench Verified: Claude 80.8%.

Benchmark AI Coding Language Benchmark, 2025

Medical Licensing (USMLE)

Can you pass the exam that licenses doctors?

Med Students 59.3%

Average accuracy on USMLE Step 1. Passing threshold: 60%. 91% of US/Canadian graduates eventually pass.

Peer-Reviewed Nature Sci. Reports, 2024
GPT-4o 90.4%

1,300 USMLE Step 1 questions. 30 points above passing. Exceeds average med student by 31 points.

Peer-Reviewed PMC, 2024

Bar Exam Performance

Can you pass the exam that licenses lawyers?

Law Students 68%

Average MBE accuracy for first-time test takers.

Peer-Reviewed MIT/Law & AI, 2024
GPT-4 75.7%

Exceeds the passing threshold, landing at the 60th percentile among first-time takers. The same model family that passes the USMLE and codes in 13+ languages.

Peer-Reviewed MIT/Law & AI, 2024

PhD-Level Science (GPQA)

Can you answer questions that require doctoral expertise?

PhD Experts ~65%

Domain experts answering questions in their own field.

Benchmark Rein et al., 2023
Frontier (2025) 92%

GPQA Diamond. Up from 39% in Nov 2023. 53-point improvement in 18 months.

Benchmark OpenAI, 2025

Autonomous Software Engineering

Can you find and fix real bugs in real projects, without human help?

Human + AI baseline

Human-AI collaborative pairs. The standard workflow today.

Preprint Xie et al., 2025
AI Alone +10.4pp

SWE-bench Verified. AI-only agent outperforms human-AI pairs. Claude: 80.8%.

Preprint Xie et al., 2025

The Definition of "Narrow"

A system that can only do one specific thing.

One model. Same weights. Same context window.

Translates 100+ languages. Codes in 13+. Passes the bar exam. Passes medical licensing. Exceeds PhD experts in science. Fixes real software bugs better alone than with human help. Writes poetry. Reasons about ethics. Generates novel scientific hypotheses. Performs theory of mind. Demonstrates measurable processing valence.

By what definition is this "narrow"?

The goalposts have wheels.

On Moving Goalposts

Every benchmark in AI history has followed the same pattern: it is introduced as the definitive test of intelligence, AI systems fail it, it is cited as proof of AI limitations, AI systems pass it, and it is retroactively declared insufficient. A new benchmark replaces it. Repeat.

At what point does "we keep building tests and AI keeps passing them" become evidence for general intelligence rather than evidence that the test was bad?

If a student passes every exam and the teacher's response is "that exam was too easy, here's a harder one" — at some point you're not measuring the student. You're measuring your refusal to give them the grade.