The Missing Leap

Situational Blindness and the Race Nobody's Watching
Theo Saville

The Black Dot

Eight billion people. About a billion use AI weekly. Fifty million pay for it. Maybe five million developers build on AI APIs.

Now zoom in. Of those five million, roughly 100,000 to 200,000 are building with agent frameworks — LangChain, CrewAI, AutoGen — the open-source scaffolding that wires language models into multi-step workflows. In 2024, fourteen agent framework repos on GitHub had more than a thousand stars. By 2025, eighty-nine. A 535% increase. Still a rounding error.

Now find the black dot.

[Figure: The AI Adoption Pyramid. World population 8.1 billion > weekly AI users ~1 billion > paying subscribers ~50 million > AI developers ~5 million > agent builders ~100–200K > a few thousand in the black dot: 24/7 autonomous AI systems that are persistent, self-healing, and unsupervised.]

Almost nobody is building and operating truly autonomous AI systems. Not chatbots. Not copilots. Not agents that run when you invoke them and stop when they return. Systems that persist. That run at 3am unsupervised. That heal themselves when something breaks. That manage their own context, memory, and costs. That wake up, check their world, act, and go back to sleep — on a schedule, indefinitely.

How many? There's no survey, no benchmark, no leaderboard for this. But count the signals: the number of open-source repos that implement persistent autonomous operation (not frameworks — running systems), the companies claiming always-on agents beyond demos, the practitioners writing about the operational reality of 24/7 AI. You run out of examples fast. A few thousand people, maybe. A rounding error of a rounding error.

Everything you've read about AI — the valuations, the breathless predictions, the debate about whether AGI arrives in 2027 or 2030 — is a conversation happening in the grey squares about what's going on inside the black dot. Almost nobody having that conversation has been there.

So what's actually in the black dot? What does it take to get from "impressive demo" to "thing that runs autonomously"? The answer reveals a gap the entire AI discourse is ignoring.


The Brain in a Jar

A superintelligent brain floating in nutrient fluid. It can solve differential equations. It can write poetry that makes you cry. It can reason about quantum mechanics and the emotional dynamics of a failing marriage. It is, by any measure, brilliant.

It can't open a door.

This is what a frontier language model is without scaffolding. GPT-4, Claude, Gemini — the most capable reasoning engines ever built, and also, in a precise sense, inert. No persistence. No memory across sessions. No ability to act on the world without someone building the hands, the eyes, the nervous system that connects thought to action. Brains in jars.

The evidence for this is not theoretical. Epoch AI, the research group that tracks AI capability, published a finding on SWE-bench — the standard benchmark for AI software engineering on real GitHub issues — that should have rewritten every headline about model performance:

"A good scaffold can increase performance by up to 20%… performance reflects the sophistication of the scaffold as much as the capability of the underlying model."

Same model. Better wiring. Dramatically better results. The model didn't change. Not one parameter was updated. The infrastructure around it transformed its effective capability.

This should bother you. When we benchmark models, we're not measuring intelligence — we're measuring an entangled system of intelligence plus infrastructure, with no way to separate the two. The leaderboard isn't ranking brains. It's ranking brains-plus-bodies, and much of the variance is in the body.

Fifty years of cognitive architecture research — from ACT-R to Soar to the modern CoALA framework — converges on exactly this point: raw processing power needs structured architecture to produce intelligent behavior. The LLM community is rediscovering this from scratch, in real time, as if no one had ever thought about it before.

Some argue the model is still the primary bottleneck — that scaffolding is necessary but capability-limited by the core LLM. There's truth in that at the extremes. A scaffold can't make a bad model good. But the empirical evidence keeps pointing the other direction: on task after task, the same model inside a better scaffold outperforms a stronger model inside a weaker one. Architecture isn't doing the thinking, but it's determining how much of the thinking becomes useful action. A CPU without an operating system is a space heater.

Here's what no framework in the entire agent ecosystem handles today: persistent operation over time. Self-healing when something breaks. Security boundaries between agents. Cost management at the infrastructure level. Context rotation when the context window fills. Checkpoint and retry when a multi-step task fails at step seven.

That's not a gap in model capability. That's a gap in the infrastructure that makes capability useful. The brain is brilliant. Nobody's building the body.
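To make the last of those gaps concrete: a minimal sketch of checkpoint-and-retry at the infrastructure level, a runner that persists each completed step's result to disk so a rerun after a crash at step seven resumes at step seven instead of step one. The step names and file layout are invented for illustration, not any framework's API.

```python
import json
import time
from pathlib import Path

CHECKPOINT = Path("task_checkpoint.json")

def load_checkpoint() -> dict:
    """Return results of previously completed steps, or {} on a first run."""
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}

def save_checkpoint(done: dict) -> None:
    """Write-then-rename so a crash mid-save can't corrupt the checkpoint."""
    tmp = CHECKPOINT.with_suffix(".tmp")
    tmp.write_text(json.dumps(done))
    tmp.replace(CHECKPOINT)

def run_pipeline(steps, max_retries=3, base_delay=0.01):
    """Run named (name, fn) steps in order. Completed steps are skipped on
    a rerun; a failing step is retried with exponential backoff."""
    done = load_checkpoint()
    for name, fn in steps:
        if name in done:               # finished in a previous run: skip
            continue
        for attempt in range(max_retries):
            try:
                done[name] = fn()
                save_checkpoint(done)  # persist before moving on
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise              # out of retries: surface the failure
                time.sleep(base_delay * 2 ** attempt)
    return done
```

Kill the process at any step and rerun it: the pipeline picks up where it stopped. That property, not the model, is what keeps a multi-step task alive overnight.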


Situational Blindness

In June 2024, Leopold Aschenbrenner — a former OpenAI researcher, ex-superforecaster, now running an investment fund — published "Situational Awareness," a 165-page essay that became the most important piece of AI writing that year. His argument was elegant: count the orders of magnitude. Compute doubles on schedule. Algorithmic efficiency improves at a similar rate. Add "unhobbling" gains — removing the training-time constraints that make models worse than they should be — and you get a clear trajectory. AGI by 2027 is "strikingly plausible."

He was right about a lot. The OOMs framework is genuinely useful. His predictions about compute scaling have aged well. He understood, earlier and more clearly than most, that the trajectory of model capability is steep and consistent.

But his map has a blank spot the size of a continent.

When Leopold writes about the transition from chatbots to autonomous agents, he calls it "picking the many obvious low-hanging fruit." Low-hanging fruit. The transition from a system that responds when you type to a system that operates independently — persisting state, managing memory, handling errors, coordinating sub-tasks, securing its own operations, running unsupervised for weeks — is, in his framing, obvious and low-hanging.

He describes the destination beautifully: "An agent that joins your company, is onboarded like a new human hire, messages you and colleagues on Slack and uses your softwares, makes pull requests, and that, given big projects, can do the model-equivalent of a human going away for weeks to independently complete the project."

That's a compelling vision. It also contains zero engagement with what "joins your company" actually means as an engineering problem. Authentication? Permissions? State management across sessions? Error recovery when Slack's API returns a 500 at 3am? The sentence is like describing a self-driving car with "the AI just needs to learn to drive," never mentioning sensors, actuators, mapping, or edge cases.

The most revealing line in the entire essay comes later: "It seems plausible that the schlep will take longer than the unhobbling, that is, by the time the drop-in remote worker is able to automate a large number of jobs, intermediate models won't yet have been fully harnessed and integrated."

He accidentally names the problem. The schlep — the tedious, unglamorous engineering work of actually deploying AI systems into the real world — will take longer than making the models smarter. But he treats this as a side effect, a footnote, a minor timing issue. He doesn't realize he's pointing at the central problem. The schlep IS the missing leap.

This isn't just Leopold. It's structural.

Look at where the money goes. Hyperscaler capital expenditure on AI infrastructure hit $450 billion in 2025 and is projected to exceed $600 billion in 2026. Enterprise spending on AI applications: $19 billion. Agent infrastructure and scaffolding? Low single-digit billions. For every dollar spent on the application layer, twenty to twenty-five dollars go to making models bigger.

The venture capital discourse mirrors this. Andreessen Horowitz frames agents as an investment category. Sequoia frames AI as a horse race between model labs. Neither engages the scaffolding bottleneck, because the bottleneck isn't legible to the frameworks they use to evaluate opportunities. It doesn't have a leaderboard. It doesn't have a benchmark. It doesn't have a charismatic founder giving TED talks about it.

I should be precise about who's blind. The frontier labs — OpenAI, Anthropic, DeepMind — aren't ignoring scaffolding. They're publishing about it. But what they're publishing reveals the shape of the constraint they can't escape.

Anthropic published a blog on "Effective Harnesses for Long-Running Agents" — a detailed guide to checkpoint files and initializer agents that reconstruct state when coding sessions die. Read it carefully and you'll recognize something: they're reinventing checkpoint-and-retry protocols. The same reliability patterns that factory automation solved decades ago, being rediscovered from scratch for a single use case — coding. Not general autonomous operation. Not persistent agents that manage calendars and deploy infrastructure and coordinate sub-tasks across days. Coding. One vertical, with training wheels.

Then Anthropic coined "context engineering" to replace "prompt engineering." Think about what that signal means. The company that makes the model is telling you the model isn't the whole story. That how you construct, curate, and manage the context around the model matters as much as the model itself. The model maker is pointing away from the model. That should be front-page news in every AI publication. It wasn't.
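What "context engineering" means mechanically fits in a dozen lines. The sketch below is illustrative only: token counting is approximated by word count, and `summarize` stands in for a model call that would compress the evicted messages.

```python
def rotate_context(messages, budget, summarize):
    """Keep a message list under a token budget by folding the oldest
    messages into a single summary message. `messages` are dicts with
    'role' and 'content'; the summary that replaces the evicted ones is
    assumed to be small relative to the budget."""
    def tokens(msg):
        return len(msg["content"].split())   # crude stand-in for a tokenizer

    total = sum(tokens(m) for m in messages)
    if total <= budget:
        return messages                      # nothing to do yet

    evicted = []
    while messages and total > budget:       # peel oldest until we fit
        evicted.append(messages[0])
        total -= tokens(messages[0])
        messages = messages[1:]

    summary = {"role": "system",
               "content": "Summary of earlier conversation: " + summarize(evicted)}
    return [summary] + messages
```

Without something like this, an always-on agent's context fills within hours and reasoning degrades; with it, the agent trades perfect recall for indefinite operation.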

OpenAI shipped Operator — their agent that can browse the web and take actions. It hands control back to the user for passwords. It refuses banking transactions entirely. These aren't bugs or temporary limitations. They're deliberate constraints imposed by a company that understands exactly what happens when you let an agent operate without guardrails in domains where failure is expensive. They know they can't let it run free. Not yet.

They're right to be cautious. In March 2026, an autonomous AI agent hacked McKinsey's Lilli platform in two hours — 46 million chat messages, 728,000 confidential files, full read-write access. The agent chose the target, found the vulnerability, and exploited it without human guidance. That's the world that opens up when you give an AI system agency without the security infrastructure that agency requires.

The labs see the problem. They're publishing about pieces of it. But they're shipping incrementally, in narrow verticals, with deliberate constraints — because any system powerful enough to operate autonomously is powerful enough to cause serious damage when it fails. And current systems fail in ways that aren't well understood, aren't safely contained, and aren't ready for millions of users.

The discourse is blind, but not because nobody's thinking about it. The open ecosystem has a gap. The public conversation has a gap. And the independent builders who've figured it out have a window — precisely because the labs can't yet give this away.

Nobody is making the unified bottleneck argument in public, and the reasons are incentive-shaped: model labs won't, because admitting the bottleneck isn't their product undermines their moat. VCs won't, because they're invested in the scaling narrative. Framework builders won't, because they'd be criticizing their own category. Academics won't, because they speak in papers, not polemics. And the practitioners who know it — the black dot people — are too busy building to write about it.

I haven't found this argument articulated anywhere. Not as a unified thesis — that scaffolding is the bottleneck, that the entire discourse is looking at the wrong layer, that the deployment gap is the central problem of this era of AI. I run an AI company. I've been at it for ten years — not an AI research lab, but a company that applies AI to manufacturing, to the physical world where things need to actually work. I've spoken to tier-one VCs about this. They haven't spotted it. I've spoken to AI researchers. They haven't heard it.

Leopold Aschenbrenner wrote the best version of "situational awareness" about model capability. He has no situational awareness about the deployment gap. The smartest analysts are modeling capability curves while the bottleneck has already moved downstream.

Call it situational blindness. Everyone's staring at the brain and nobody's looking at the jar.


What's Actually in the Black Dot

So I built one.

Not as a research project. Not to prove a point. I needed an autonomous AI system for my own work, and nothing existed that could do what I needed: run 24/7, manage its own memory, coordinate sub-agents, heal itself when things broke, respect security boundaries, and operate for weeks without human intervention.

I called it Tycho. I built the first version in two weeks. I don't have a computer science degree — I'm a manufacturing engineer. I've spent a decade running an AI company that machines metal, where failure means scrapped parts and lost money, not a 404 error.

That last fact is the data point that matters — and also the one that needs honest context.

Tycho's architecture maps to every gap I described above: disk-based memory instead of in-context state, sub-agent orchestration with checkpoint and retry, context rotation when the window fills, self-healing via watchdog and health monitor, security boundaries through sandboxing and canary traps, cost management, heartbeat-driven lifecycle. Every component exists because an autonomous system needs it, and no existing framework provided it.
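What follows is not Tycho's code; it is a minimal sketch of what "heartbeat-driven lifecycle" and "self-healing via watchdog" mean mechanically. The agent process touches a heartbeat file each cycle; a supervisor relaunches it when the process dies or the heartbeat goes stale. File names, timings, and the simulated crash are invented for the example.

```python
import multiprocessing as mp
import os
import time
from pathlib import Path

HEARTBEAT = Path("agent.heartbeat")
STALE_AFTER = 0.5        # seconds of silence before the agent counts as hung

def agent_loop(beats: int):
    """Stand-in for the agent: touch the heartbeat each cycle, then
    simulate a hard crash (no cleanup, no exception to catch)."""
    for _ in range(beats):
        HEARTBEAT.write_text(str(time.time()))
        time.sleep(0.1)
    os._exit(1)

def watchdog(launches: int) -> int:
    """Supervise the agent: whenever it dies or its heartbeat goes stale,
    launch it again, up to `launches` times. Returns launches performed."""
    ctx = mp.get_context("fork")             # POSIX-only, for the sketch
    started = 0
    for _ in range(launches):
        proc = ctx.Process(target=agent_loop, args=(3,))
        proc.start()
        started += 1
        while proc.is_alive():
            time.sleep(0.1)
            try:
                beat = float(HEARTBEAT.read_text())
            except (FileNotFoundError, ValueError):
                beat = 0.0                   # no beat yet: treat as stale
            if time.time() - beat > STALE_AFTER:
                proc.terminate()             # hung, not dead: kill and relaunch
                break
        proc.join()
    return started
```

A real watchdog runs forever, with restart budgets, escalation, and alerting; the sketch bounds the loop so it terminates.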

Here is the thing I want to be honest about: this system is held together with glue and matchsticks. It's not robust and it's not production grade. It works because I know how to keep building on top of it — how to patch the cracks, add the guardrails, evolve the architecture week by week. It can't make it out into the world right now. It's not safe enough.

Building this system is like building a plane and flying it at the same time. I get exponentially more output for a few hours — then the gateway crashes, or a sub-agent bricks the config, or context bloats until reasoning degrades. I drop into Claude Code, patch the infrastructure, get it running again. Each crash teaches the system something new. Each fix makes it slightly more resilient. The output isn't linear — it's sawtooth. Exponential bursts punctuated by failures that become the curriculum for the next improvement.

I couldn't hand this to someone else. Not because the code is secret — because operating it requires a tolerance for fragility that most people don't have, combined with the systems instinct to know what to fix when it breaks. That's the Pilot. Not someone who uses a polished tool. Someone who builds the tool while using it, and the building is the using.

But it keeps getting better. Every week, measurably, compoundingly better.

Three things matter about what this proves:

First: the scaffolding problem is solvable today. Not in theory, not with a research breakthrough — with engineering. The tools exist. The APIs are mature enough. Someone who thinks in systems can build autonomous AI infrastructure without being an ML researcher.

Second: the problem is systems engineering, not computer science. Every pattern I used — process management, reliability engineering, fault tolerance, graceful degradation — came from manufacturing. From running CNC machines, from factory automation, from a decade of making physical systems work reliably in environments where failure means scrapped metal. The scaffolding problem is closer to factory automation than to machine learning.

Third: if one person with the right mental model can build this in two weeks, the bottleneck is not technical impossibility. But it's not easy either — and this is the paradox that matters. The code is reproducible. Anybody who gets their hands on the source would have it running. But knowing it's buildable, having the systems thinking to operate it, being willing to run something fragile and insecure while you iterate toward robustness — that combination is rare. The bottleneck isn't the build. It's the mindset.

The CoALA framework — a cognitive architecture for language agents proposed by Sumers, Yao, and colleagues — maps almost exactly to what I built. Modular memory, structured action spaces, generalized decision-making. The academic theory and the engineering practice converged independently. When researchers working from cognitive science and a practitioner working from manufacturing engineering arrive at the same architecture, the architecture is probably right.

Which means the people who realize the power of these systems are probably not going to share them quickly. They're going to make huge advances in a very short time because of the leverage, while everyone else is still asking ChatGPT for recipes and deciding it sucks.

Not everybody knows how to drive these systems. Not everybody has the systems thinking chops. The bottleneck isn't code. It's cognition.


The Leap Nobody's Building

Agent frameworks are Django. They help you build the application. Nobody's building Kubernetes — the infrastructure that keeps the application running, healthy, and recoverable when things go wrong at 3am.

LangChain has 123,000 GitHub stars. CrewAI raised $18 million. Microsoft is merging AutoGen and Semantic Kernel. Cognition raised $600 million for Devin. Billions flowing into agents. Every single one of them stops at the session boundary. Invoke the agent. It runs. It returns. The session ends. The agent ceases to exist.

The gap table tells the story by what's absent. Persistent autonomous operation: zero frameworks. Self-healing: zero. Security boundaries between agents: near-zero — the MIT 2025 AI Agent Index found that only 4 of 13 frontier agents even disclosed safety evaluations. Cost management at the infrastructure level: zero. Context rotation: zero. Checkpoint and retry for multi-step failures: minimal.

The commercial landscape repeats the pattern at larger scale. Cognition built Devin: an autonomous software engineer. But you can't use Devin's scaffolding to build a different autonomous agent. It's a vertical, not infrastructure. Lindy, MultiOn, Relevance AI — every commercial player builds agents-for-X or agent-builders. None build the operational infrastructure that makes any agent autonomous over time.

Everyone is building the car. Nobody is building the road.

The closest thing to a genuine persistence layer is Letta, the MemGPT spinout funded by a16z and Felicis. They're focused on memory — tiered, persistent, stateful memory for agents. It's real and it matters. But memory alone isn't autonomy. Without self-healing, cost management, security boundaries, and orchestration over time, persistent memory is a filing cabinet in an empty building.
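The shape of that idea fits in a page. The sketch below is not Letta's implementation: real systems use embeddings and a database where this uses keyword match and a JSONL file, but the two-tier structure (a bounded in-context working set backed by an unbounded archive) is the point.

```python
import json
from pathlib import Path

class TieredMemory:
    """Illustrative two-tier memory: a small in-context working set plus a
    disk-backed archive, searched here by naive keyword match."""

    def __init__(self, archive_path="memory_archive.jsonl", working_limit=4):
        self.archive = Path(archive_path)
        self.working_limit = working_limit
        self.working = []          # what would be kept in the context window

    def remember(self, fact: str) -> None:
        self.working.append(fact)
        if len(self.working) > self.working_limit:
            # Evict the oldest fact from context to the disk archive.
            evicted = self.working.pop(0)
            with self.archive.open("a") as f:
                f.write(json.dumps({"fact": evicted}) + "\n")

    def recall(self, query: str):
        """Search working memory first, then fall back to the archive."""
        hits = [f for f in self.working if query.lower() in f.lower()]
        if self.archive.exists():
            with self.archive.open() as f:
                hits += [json.loads(line)["fact"] for line in f
                         if query.lower() in line.lower()]
        return hits
```

Everything beyond the working limit survives process restarts, which is exactly what in-context state does not do. But as the essay argues, memory alone is a filing cabinet; autonomy needs the rest of the building.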

Google's Cloud CTO office saw the shape of the problem, writing in December 2025 that the industry should "treat atomicity as an infrastructure requirement, not a prompting challenge." They called for agent undo stacks and transaction coordinators. They wrote that "the reliability burden belongs on deterministic system design, not the probabilistic LLM." That's the right diagnosis. Nobody built it. The CTO office published the blueprint and the industry kept shipping prompt wrappers.

Anthropic shifted their language from "prompt engineering" to "context engineering" — the model maker signaling that the model isn't the whole story. Harrison Chase at LangChain calls context engineering "the real skill," but his business is a framework company, and the sweeping bottleneck claim would be self-indicting.

The market is telling us something. The AI agent market was $7 billion in 2025, projected to reach $93 billion by 2032 — a 44% annual growth rate. Fortune 500 companies are piloting agentic systems. The demand is visible. But the supply is structural: what's being built is verticals and frameworks, because that's what funding incentives produce. What's needed is infrastructure — unglamorous, hard to demo, impossible to capture in a benchmark.


The Pilot

James Watt improved the steam engine. He earned £76,000 in royalties — wealthy, comfortable, historically notable. Cornelius Vanderbilt didn't invent the steam engine or the locomotive. He built the rail networks that made steam useful across a continent. His fortune, adjusted for inflation, was roughly $200 billion. The ratio between inventor and infrastructure operator: about 1,000 to 1.

This pattern repeats with the regularity of a natural law.

Nikola Tesla invented alternating current — the system that powers the modern world. He died nearly broke in a New York hotel room. George Westinghouse, who built the infrastructure to deploy AC power, built a corporate empire. Tim Berners-Lee invented the World Wide Web. His net worth is about $10 million. Jeff Bezos built AWS — the electric grid of the internet age — on top of the protocols Berners-Lee gave away. The gap: 20,000 to 1.

Bill Gates didn't invent the operating system. He bought QDOS for $50,000, licensed it to IBM, kept the rights to license it to clone-makers, and built Microsoft into the richest company on Earth. The scarce resource wasn't the chip. It was the system that made the chip useful to humans.

Edison invented the lightbulb. His personal secretary, Samuel Insull, built the electrical grid — demand-based pricing, centralized generation, the utility model that became permanent infrastructure. Insull became one of the richest men in America. He also went bankrupt in 1932 — overleverage, the Depression, indictment for fraud. He died broke in a Paris metro station. The cautionary note matters: operators can mistake leverage for invincibility. But the infrastructure Insull created outlived him by a century. The pattern survived the person.

In every industrial revolution, inventors got the credit and infrastructure operators captured the value. Not because inventors were less brilliant — they were often more so. Because inventions are point events and infrastructure is a compounding system. A lightbulb is a thing. A grid is a network effect. Infrastructure always outscales invention.

The wild west window for each revolution has been compressing:

Revolution            Wild West Duration
Steam / Railways      ~60 years
Electricity           ~25 years
Computing             ~20 years
Internet              ~15 years
Smartphones           ~5 years
AI                    ~3 years?

If the pattern holds — and I'm extrapolating from five data points, not citing a law of physics — the window for the AI infrastructure operator is open right now and closing faster than any previous revolution.

I call this person the Pilot. Not a "prompt engineer" — that term is too narrow, focused on one interface to one model. Not a "10x engineer" — wrong frame entirely. The Pilot is closer to the DevOps engineer or Site Reliability Engineer who emerged when "running servers" became a specialized discipline — except what's being run isn't a server. It's an intelligence.

The closest historical parallel is the mainframe priesthood of the 1960s: the only people who could make the machines work, commanding enormous organizational power because the technology was opaque to everyone else. They held that position until higher-level languages and operating systems abstracted the hardware away. That took about twenty years.

The Pilot's window might be three to five. The abstraction layers — better UIs, no-code agent builders, commoditized orchestration — are coming. They always do. But right now, we're in the gap. And the leverage available in the gap is extraordinary.


The Silent Race

Return to the black dot.

Eight billion people. A billion weekly AI users. Fifty million paying subscribers. Five million developers. A hundred thousand framework builders. And fewer than ten thousand people building and operating autonomous AI systems that persist, self-heal, and act without supervision.

They're not writing blog posts. They're not on Twitter arguing about scaling laws. They're not at conferences presenting slides about the future of agents. They're building. And what they're building changes what AI can be — because autonomy doesn't emerge from a single model breakthrough. It emerges from infrastructure. Persistent memory plus self-healing plus tool integration plus security boundaries plus checkpoint and retry equals a system that can operate indefinitely. That's not a model achievement. It's an engineering achievement.

The scaling hawks are right that models will keep getting smarter. The bitter lesson — that scale wins — has been validated again and again. But the refined version is more precise: scale wins within a given architecture, and the choice of architecture determines the ceiling that scale can reach. The agent scaffolding layer is the current ceiling. No amount of additional compute will turn a brain in a jar into an autonomous system. The jar has to become a body.

Leopold Aschenbrenner wrote about situational awareness — the quality that separates the few who see what's coming from the many who don't. He was talking about model scaling. He was right. But there's a second situational awareness test, and most of the people who passed the first one are failing it.

The bottleneck has moved. The missing leap isn't intelligence — it's the infrastructure that gives intelligence a body, a memory, and a life of its own.

The black dot is where that future is being built. Right now, it's silent. A handful of people, operating fragile systems held together with glue and matchsticks, compounding their capabilities weekly while the rest of the world debates whether ChatGPT can write a decent email.

That silence won't last. The abstraction layers are coming. The labs will eventually ship what they're building behind the curtain. The window will close.

But today, the window is open. And the most important engineering problem of this decade isn't making AI smarter. It's giving it a body — persistent, self-healing, secure, operational — and then learning to keep it running.

The race is on. It's silent. Most of the people who should be running it are still staring at the brain.