Stanford’s 2026 AI Report Card Is Out and Nobody Looks Great

400 pages of data. The takeaway: AI is sprinting and the rest of us are looking for our shoes.

Every year, Stanford’s Institute for Human-Centered Artificial Intelligence (HAI) publishes what amounts to a physical exam for the entire AI industry: over 400 pages of charts, benchmarks, survey data, and investment figures designed to tell you exactly where things stand.

The 2026 AI Index dropped yesterday. And the diagnosis is: complicated.

AI models are now beating human experts on PhD-level science questions. Generative AI reached 53% global adoption in three years (faster than the personal computer or the internet). Private investment hit $581.7 billion. Anthropic leads the global model rankings. The benchmarks that were supposed to stay hard for years are getting solved in months.

Sounds incredible. Until you keep reading.

The companies building the most powerful models are also disclosing the least about how they work. The US can barely attract AI talent anymore. Training a single model now produces the carbon equivalent of 17,000 cars driving for a year. And the American public trusts its own government to regulate AI less than any other country surveyed.

Stanford calls it a “jagged frontier.” AI can win a math olympiad but still can’t tell time.


The China Gap Closed. Quietly.

For years, the US had a comfortable lead in AI performance. That’s over.

As of March 2026, the top six spots on Arena (the community ranking platform where users compare model outputs head to head) belong to Anthropic, xAI, Google, OpenAI, Alibaba, and DeepSeek. The gap between them is razor-thin. The US still produces more top-tier models and higher-impact patents. China leads in publication volume, citations, total patent output, and industrial robot installations.

The report frames it carefully: it’s no longer a two-horse race. South Korea now files more AI patents per capita than any other country. Forty-four nations have state-backed supercomputing clusters. The countries that can’t shape AI development risk a new kind of digital divide, one where the economic benefits concentrate in the places already building the infrastructure.

And buried in the talent section: the number of AI scholars moving to the US has dropped 89% since 2017. Eighty percent of that drop happened in the last year alone. The world’s best AI researchers used to flock to American labs. That pipeline is drying up fast.

The investment numbers tell a different story depending on how you read them. US private AI investment was $285.9 billion, roughly 23 times China’s $12.4 billion. But Stanford’s researchers caution that this dramatically understates China’s total AI spending. Government guidance funds have poured an estimated $184 billion into Chinese AI firms since 2000. When the state is the investor, the numbers don’t show up in the same column.

Meanwhile, the consumer side is already moving. US consumer surplus from generative AI tools reached $172 billion annually by early 2026, up from $112 billion a year earlier. The median value per user tripled. And most of the tools people are using are still free or nearly free. Which means the industry is spending hundreds of billions to build products it hasn’t figured out how to charge for yet.


The Transparency Problem

Over 90% of all notable AI models are now built by private companies. That’s not new. What is new is how much less those companies are willing to tell you about what they’re building.

The Foundation Model Transparency Index (which tracks how openly labs disclose training data, compute, capabilities, and risks) dropped from 58 to 40 this year. Google, Anthropic, and OpenAI have all stopped disclosing dataset sizes and training durations for their latest models. Eighty of the 95 most notable models launched last year shipped without training code.

Stanford’s researchers put it bluntly: the most capable models now disclose the least.

This matters for reasons beyond academic curiosity. When a model hallucinates, or embeds bias, or produces outputs that look authoritative but aren’t, the question of how it was trained becomes a policy question. And right now, the answer from every major lab is essentially: trust us.

The hallucination numbers aren’t reassuring. In a new accuracy benchmark covering 26 top models, hallucination rates ranged from 22% to 94%. GPT-4o’s accuracy dropped from 98.2% to 64.4% under adversarial conditions. DeepSeek-R1 fell from over 90% to 14.4%. When a false statement was presented as something another person believes, models handled it fine. When the same false statement was framed as something the user believes, performance collapsed.

So the models are simultaneously getting smarter on paper and more susceptible to saying whatever the person talking to them wants to hear. That’s not a bug in the benchmark. That’s a product design problem.

(That’s going well.)


The Benchmarks Are Breaking

Here’s the part that sounds like science fiction.

On Humanity’s Last Exam (a benchmark built from questions contributed by subject-matter experts across every field, designed to represent the hardest problems humans could write), the top model scored 8.8% in 2025. One year later, Anthropic’s Claude Opus 4.6 and Google’s Gemini 3.1 Pro are clearing 50%.

On GPQA (graduate-level science questions requiring multi-step reasoning), models have pushed past the human expert baseline of 81.2%, hitting 93%. On cybersecurity benchmarks, AI agents went from solving 15% of problems to 93% in a single year.

The speed here is hard to overstate. Evaluations that were designed to be relevant for years are saturating in months. Which raises an obvious question: if the tests keep breaking, what are we actually measuring?

Stanford’s Ray Perrault flagged this directly. Knowing a legal-reasoning benchmark hit 75% accuracy tells you almost nothing about how well that system would perform in an actual law practice. The gap between benchmark and deployment remains wide, and the benchmarks keep making it look smaller than it is.

The jagged frontier shows up clearly in science. Frontier models now outperform human chemists on average on ChemBench. But those same models score below 20% on replication in astrophysics and 33% on Earth observation questions. In medicine, AI clinical note generation cut physician documentation time by up to 83% in some hospital systems. But a review of over 500 clinical AI studies found nearly half relied on exam-style questions rather than real patient data; only 5% used actual clinical data.

And robotics remains the most humbling section of the entire report. Robots succeed in just 12% of real household tasks. In simulated environments they hit 89.4%, but the gap between a controlled lab and an actual kitchen remains enormous. Self-driving is the exception: Waymo hit 450,000 weekly trips across five US cities, and China’s Apollo Go completed 11 million fully driverless rides last year.


The Environmental Bill Is Coming Due

Training Grok 4 produced an estimated 72,816 tons of CO2. That’s roughly the same as 17,000 cars driving for an entire year.

AI data center power capacity hit 29.6 gigawatts, which is approximately what it takes to run the entire state of New York at peak demand. Annual water use for GPT-4o inference alone may exceed the drinking water needs of 12 million people.

These numbers are growing alongside the models. And the report notes that efficiency improvements in hardware and training methods have not kept pace with the scale of new deployments. The environmental cost of AI is becoming structural, not incidental.

For context: the cumulative power demand of all AI systems is now comparable to the national electricity consumption of Switzerland or Austria. Major cloud providers are spending accordingly. Google reported over $150 billion in annual capital expenditures in 2025. Most of that is going to data centers and chips.

Local communities are starting to push back. The report notes growing resistance to new data center development in parts of the US, with some local governments moving toward restrictions or outright bans. The tension is straightforward: AI companies need more power, communities don’t want to provide it, and nobody has a plan for reconciling those two positions at scale.


Jobs: The Uncomfortable Section

The report confirms what most Americans already suspect. AI is boosting productivity in certain roles (14% in customer service, 26% in software development) while entry-level hiring in those same fields is starting to decline.

A third of organizations surveyed by McKinsey expect AI to shrink their workforce in the coming year. Software engineering and service operations top the list. And the productivity gains that do exist tend to flatten or disappear in tasks requiring judgment, nuance, or ambiguity.

The most telling detail: 73% of talent acquisition leaders now rank critical thinking as their top priority when hiring, pushing “AI technical skills” down to fifth. The job that AI creates isn’t “person who uses AI.” It’s “person who can tell when the AI is wrong.”


So Where Does This Leave Everyone?

Fifty-nine percent of people globally say AI will provide more benefits than drawbacks. Fifty-two percent say it makes them nervous. Both numbers went up.

The US reported the lowest trust in its own government to regulate AI of any country surveyed, at 31%. Globally, people trust the EU more than the US or China to get regulation right.

AI industry representatives now make up a growing share of witnesses at congressional hearings, tripling since 2017. The share of neutral academics has dropped significantly over the same period. The people building AI are increasingly the ones explaining it to lawmakers. Whether that counts as efficiency or regulatory capture depends on your perspective.

Meanwhile, AI agents are already replacing parts of workflow stacks that used to require entire teams. The Stanford report doesn’t frame this as dystopian. It frames it as fast. Faster than the benchmarks. Faster than the policy. Faster than the public conversation about what to do with it.

The field’s report card is impressive. The grades on governance, transparency, and environmental cost are not.

And that’s probably the most Stanford way to describe the situation: technically brilliant, institutionally unprepared.


FAQ

What is the Stanford HAI 2026 AI Index?

The AI Index is an annual report published by Stanford University’s Institute for Human-Centered Artificial Intelligence (HAI). It tracks AI capabilities, investment, research output, public perception, and policy developments across more than 400 pages of data.

Has China caught up to the US in AI?

Nearly. As of March 2026, models from Anthropic, xAI, Google, OpenAI, Alibaba, and DeepSeek all occupy the top tier with minimal performance gaps. The US leads in top-tier model output and high-impact patents. China leads in publication volume, citations, and total patent output.

How fast is AI being adopted?

Generative AI reached 53% adoption among the global population within three years of ChatGPT’s launch. That’s faster than the personal computer or the internet achieved over the same timeframe.

What is AI’s environmental impact?

Training a single large model (Grok 4) produced CO2 equivalent to 17,000 cars driving for a year. AI data center power capacity reached 29.6 gigawatts. Annual water use for GPT-4o inference alone may exceed the drinking water needs of 12 million people.

Is AI taking jobs?

AI is boosting productivity in customer service (14%) and software development (26%) while entry-level hiring in those same fields is starting to decline. A third of organizations surveyed expect AI to reduce their workforce in the coming year.