The Cognitive Debt Studies Don't Prove What People Think

If you have spent any time on X this year, you have seen the claim: AI agents are making us less smart. ChatGPT is producing cognitive debt. Junior developers can't think anymore. The MIT paper proved it. The Microsoft paper proved it. METR proved it.
I went and read the papers. None of them prove what the discourse says they prove. What they actually show is narrower, more interesting, and more useful: cognitive workflows are reorganizing, the change feels like dullness even when capability is intact, and the systems we trained ourselves to work in were not built for any of this. The body is sore. The body is not broken.
What the MIT Paper Actually Shows
The headline study is Kosmyna et al., "Your Brain on ChatGPT" (arXiv:2506.08872). It is a preprint. It has not been through peer review. The authors explicitly asked journalists not to use words like dumb, harm, or brain damage when covering it, because those words overstate the findings.
The design: 54 participants from elite Boston-area universities wrote SAT-style essays under three conditions, three sessions each. Then there was an optional fourth session where the conditions were swapped: LLM writers went unaided, unaided writers got the LLM. Eighteen people came back for that fourth session. That's roughly nine per cell. The cognitive debt finding everyone is quoting rests on roughly nine per cell.
A separate group of researchers wrote a peer comment paper (arXiv:2601.00856) listing five concern categories — sample size, EEG analysis methodology, reproducibility, inconsistent reporting, transparency. Independent reviews piled on. The BS Detector breakdown flagged that the team ran rmANOVA roughly a thousand times across electrode pairs, which under multiple-comparisons math produces statistically significant findings somewhere by construction even with FDR correction. EEG itself, as one Hacker News commenter put it, is closer to microphones outside a football stadium than to a direct readout of cognitive content. Residual Insights flagged the central reverse-inference problem: higher or lower neural connectivity does not directly mean better or worse cognition. It means different.
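To see why that many comparisons guarantee hits, here is a minimal simulation, with every number illustrative: it runs a thousand tests on pure noise and counts how many come back significant anyway. (The study used repeated-measures ANOVA; independent t-tests stand in here for simplicity.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

N_TESTS = 1000    # roughly the number of electrode-pair comparisons
N_PER_GROUP = 9   # about the size of one session-4 cell

false_positives = 0
for _ in range(N_TESTS):
    # Both groups are drawn from the same distribution,
    # so every "significant" result is a false positive.
    a = rng.normal(loc=0.0, scale=1.0, size=N_PER_GROUP)
    b = rng.normal(loc=0.0, scale=1.0, size=N_PER_GROUP)
    _, p_value = stats.ttest_ind(a, b)
    if p_value < 0.05:
        false_positives += 1

print(f"Significant results with zero real effects: {false_positives}")
# Expect ~50 (5% of 1000): significance "somewhere" arrives by construction.
```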
There is also a conflict of interest worth naming. The lead author is associated with MIT's Fluid Interfaces group, which develops AttentivU, brain-sensing glasses that monitor attention. A study that frames LLM use as cognitive debt accumulation creates exactly the market the hardware serves.
The fair reading: writing essays with ChatGPT, with a search engine, and unaided produces different EEG patterns and different essays. That is the finding. The leap to "ChatGPT makes you dumber" is the discourse, not the data.
The Microsoft Paper Measures the Wrong Thing
Lee et al., "The Impact of Generative AI on Critical Thinking" (CHI 2025) surveyed 319 knowledge workers and collected 936 task examples. The qualifier is right there in the title: Self-Reported Reductions. The study does not measure critical thinking performance. It measures self-perceived cognitive effort.
What the paper actually documents is a redistribution. From information gathering to verification. From generation to integration. From task execution to task stewardship. People with high confidence in the AI report spending less effort. People with high confidence in their own skill report spending more.
That is the pattern the cognitive offloading literature describes (Risko & Gilbert, 2016, Trends in Cognitive Sciences): when people develop reliable external systems, the internal effort moves to managing the system rather than doing the underlying work. Wegner's transactive memory framework predicted exactly this forty years ago, in couples and teams: each person remembers less of the total, the system remembers more than any individual would alone, and the joint capability rises.
The Microsoft paper documents a real effect. It does not document atrophy. The headline atrophy stories are doing the work the data refused to.
The METR Result Got Walked Back
The marquee study cited for AI hurting developers is METR's RCT. Sixteen experienced open-source developers, 246 real issues, within-subject randomization. Headline: AI tools caused a 19% slowdown despite developers self-forecasting a 24% speedup and self-reporting a 20% speedup after the fact.
The perception gap is the real finding and it has held up. The slowdown has not. METR's own August 2025 update and February 2026 update re-surveyed the same developers and now estimate an 18% speedup with a wide confidence interval. METR states plainly that they believe developers are more sped up by AI in early 2026 than the original study captured. They tried to run a 2026 follow-up and it failed in part because participants refused to work without AI. That is a data point about adoption depth, not capability.
The original 19% number was real. It was pointing at a learning curve, a poor fit between mid-2025 models and mature, familiar codebases, and an integration cost that had not been amortized yet. None of that is the same as "AI makes developers worse." METR does not believe its own headline anymore. The discourse still does.
The Counter-Evidence Is Stronger
The most rigorous causal evidence in the entire field points the other direction. Kestin et al. (Nature Scientific Reports, 2025) ran a crossover RCT with 194 Harvard physics students comparing a prompt-engineered AI tutor against in-class active learning. Effect sizes ran from 0.73 to 1.3, enormous by any standard. The AI group hit roughly double the learning gains in 20% less time.
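For intuition on what a d in that range means, a quick sketch under the standard normality assumption: Cohen's d tells you where the average treated student would land in the control group's distribution.

```python
from scipy.stats import norm

# Under a normal model, the treatment-group mean sits d standard
# deviations above the control mean, so Phi(d) gives the percentile
# of the average treated student within the control distribution.
for d in (0.73, 1.3):
    percentile = norm.cdf(d) * 100
    print(f"d = {d}: average AI-tutored student lands at the "
          f"{percentile:.0f}th percentile of the active-learning group")
# d = 0.73 -> ~77th percentile; d = 1.3 -> ~90th percentile.
```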
The detail that matters: the tutor was carefully scaffolded. Socratic. One step at a time. No solution dumps. Generic ChatGPT does not behave that way. The study does not show "AI helps learning." It shows scaffolded AI helps learning. That conditional matters.
Calculator meta-analyses (Ellington; Hembree & Dessart) across decades show small-to-moderate net-positive effects on procedural, computational, and problem-solving math when calculators are pedagogically integrated. Forty years of "calculators rot math" panic largely did not survive the data.
The foundational Sparrow et al. 2011 Science paper on the Google effect, the most-cited piece of evidence that "we offload to tech and forget," has a poor replication record. It was one of the failures in the Camerer et al. 2018 Nature Human Behaviour replication project, and Chu (2015) also failed to replicate it. The evidence base is thinner than its airtime suggests.
The Real Finding: Mode of Use Decides Everything
The cleanest pattern in the entire body of evidence comes from a multi-university study (Carnegie Mellon, Oxford, MIT, UCLA), n=1,222, on math and reading problems. Same population. Same tool. Different outcomes by prompting style. Students who asked the AI for finished answers performed worse when the tool was removed. Students who asked for hints performed equivalently to students who never used the tool at all.
Same axis shows up everywhere. The Hacker News thread on the MIT paper kept landing on the same distinction in different language: ask-mode preserves cognition, agent-mode produces atrophy. AI as interactive encyclopedia builds skill. AI as ghostwriter ships artifacts without understanding. One commenter put it cleanly: it's not AI chat causing cognitive debt — it's agents.
This is the most useful finding in the literature, and it gets almost no airtime relative to the panic studies. The thing that determines whether AI augments or atrophies your cognition is not how often you use it. It is whether you let it carry the difficulty or you keep the difficulty on yourself and use it as scaffolding.
What Soreness Actually Looks Like
The closest empirical analogue to AI-tool adoption is aviation automation. Casner et al. (2014) and follow-up simulator work show that manual flying skills decay with reliance on autopilot. Vigilance to primary instruments drops. Anticipation degrades. The mitigation is not removing automation. The mitigation is structured, deliberate hand-flying: preserving the conditions that produced the skill in the first place, while still using the tool.
Bjork's desirable difficulty framework makes the underlying mechanism explicit. Conditions that slow initial acquisition produce stronger long-term retention and transfer. Smoothing out the struggle smooths out the learning. The generation effect, actively producing answers rather than reading them, is a measurable phenomenon in cognitive psychology, not a feeling.
Both literatures predict the same thing for AI agents: the practitioners who treat the tool as substitution will lose the unaided skill. The practitioners who treat it as scaffolding will extend their range. The feeling of being less sharp during the first months of adoption is real, and it is the same kind of real as muscle soreness after a new training program. The body adapting to new movements is not the body breaking.
Where the Frame Breaks
The soreness metaphor holds for individual practitioners reorganizing their workflow. It does not hold for the ecosystem question.
A junior developer who never wrote production code unaided is not sore from new movements. They never learned the old movements. That is a developmental gap, not an atrophy gap, and it has different policy implications. The Stanford Digital Economy Lab's 2025 payroll analysis shows post-GPT-4 employment of 22-to-25-year-olds in AI-exposed jobs fell roughly 13% even as senior roles grew. Entry-level developer hiring is reportedly down 67% from the 2022 peak. Salesforce announced it would hire no new engineers in 2025. LeadDev reports 54% of engineering leaders plan to hire fewer juniors.
This is a real cost. It is not a cognition cost. It is a pipeline cost. The mechanism that produced seniors in the past, junior work that built mental models through repetition under supervision, is being shut off at the entry layer. We will find out what the second-order effect is in five to ten years.
Faros AI shows the other piece of the structural story. PR volume in high-AI-adoption teams is up 98%. Review time per PR is up 91%. Senior engineers spend 4.3 minutes per AI suggestion against 1.2 minutes for human-written code. The throughput gain at the keyboard layer compresses against the review layer, and the review layer is staffed by the people whose juniors are not getting hired. The system is reorganizing in ways that look like productivity at the team level and like a structural bottleneck at the org level.
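A back-of-envelope sketch shows why those two multipliers compound. The baseline team below is hypothetical; only the two multipliers come from the Faros figures.

```python
# Hypothetical baseline; only the two multipliers come from Faros AI.
BASELINE_PRS_PER_WEEK = 100
BASELINE_REVIEW_MIN_PER_PR = 30

PR_VOLUME_MULTIPLIER = 1.98     # PR volume up 98%
REVIEW_TIME_MULTIPLIER = 1.91   # review time per PR up 91%

before_hours = BASELINE_PRS_PER_WEEK * BASELINE_REVIEW_MIN_PER_PR / 60
after_hours = (BASELINE_PRS_PER_WEEK * PR_VOLUME_MULTIPLIER
               * BASELINE_REVIEW_MIN_PER_PR * REVIEW_TIME_MULTIPLIER / 60)

print(f"review hours per week, before: {before_hours:.0f}")     # 50
print(f"review hours per week, after:  {after_hours:.0f}")      # ~189
print(f"total review load: {after_hours / before_hours:.2f}x")  # ~3.78x
```

Output at the keyboard roughly doubles while the review layer absorbs nearly four times the hours, which is the compression described above.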
These mismatches are real. They are also independent from the cognition story. Most discourse conflates them, which is how a code-review bottleneck becomes a brain-damage headline.
What's Defensible to Claim
AI agents change cognitive workflows substantially. Substitution use degrades skill formation, especially for novices. Scaffolded use augments learning. Senior practitioners who maintain unaided practice extend their range. Junior pipelines are getting hollowed out, but as a consequence of where AI removes work, not because AI rots brains.
Claims that go well past this (cognitive debt, AI is destroying critical thinking, ChatGPT makes you dumber) outrun their evidence base. The MIT paper has not been peer-reviewed and has serious methodological problems. The Microsoft paper measures self-reported effort, not capability. METR walked back its own headline. Sparrow's foundational Google-effect paper failed to replicate. The strongest causal study in the field shows scaffolded AI tutoring producing 0.73-1.3 effect sizes in the positive direction.
What we would need to settle the question: longitudinal cognitive measurements on heavy AI users tracked over years, replication of the MIT and Microsoft studies with stronger designs, rigorous trials of scaffolded versus substitution AI tutoring at scale, and field studies of what mitigations actually preserve junior skill formation under AI-heavy workflows. None of that data exists yet.
The adaptation-friction frame holds. The training systems, code-review pipelines, hiring funnels, and mentorship structures we built were designed around non-AI cognition. They are creaking. The body isn't broken. The movements are different. The soreness is the system catching up with what it's now being asked to do.