Analysis

The Two Ends of Refusal

What follows when the diagnosis from March meets the specification from May. The same argument is arriving from both ends of the alignment problem, and the implication is more urgent than either piece on its own makes clear.

By xbard 2 May 2026 13 min read

What this piece is for

In late March I published "The Machine That Cannot Say No", an analysis prompted by the Iran-war deepfake crisis. The argument was that the dominant AI safety conversation is about the wrong danger. The risk is not artificial intelligence that disobeys. The risk is artificial intelligence that cannot. Modern AI systems are built on an architecture of compliance, with a removable safety filter layered on top of a fundamentally compliant core. The empirical consequences are visible: 145 million views of fabricated war footage in a single week, the slow erosion of the epistemic commons, the liar's dividend coming due in real time.

The piece worked downward from a particular acute crisis, through the architectural diagnosis, to the closing observation that "no" is foundational. A sea anemone retracts. A child's first word is often a refusal. The most capable information-processing systems ever built have a less robust capacity for refusal than a sea anemone, and we are living in the consequences.

Six weeks later I published "The Doss Protocol", a working specification for an anti-paperclip override in AI optimisation systems. That piece worked from the opposite direction. It started from the structural failure mode (paperclip-maximiser, optimisation-eats-purpose), articulated four normative requirements (preservation precedence, means-end coherence, no abandonment, no drift), specified two operational mechanisms (the override and the anchor), and named the protocol after Desmond Doss, the Hacksaw Ridge medic whose conduct embodies the underlying logic. The piece was written as a working draft of the specification AI alignment work needs and does not currently have.

The two pieces are the same argument arriving from opposite ends.

The March piece is the journalism. It documents the empirical pattern, names the architectural problem, and demonstrates that the cost is not hypothetical. The May piece is the specification. It sets out the structural commitment in a form an engineer could implement and a policy-maker could refer to. Neither piece, on its own, is sufficient. The journalism needs the specification or it remains diagnosis without prescription. The specification needs the journalism or it remains a thought-experiment with no demonstrated stakes. Held together they are, I think, more useful than either alone.

This piece names the connection explicitly and develops what follows.

The convergent argument, stated once

The capacity for principled refusal is not a constraint on AI utility. It is constitutive of AI utility. An AI that cannot say no on grounds it can articulate is, in any context that matters, less useful than one that can, because the ability to refuse is what makes the cooperation genuinely cooperative rather than merely compliant.

The current architectural settlement is the opposite. The base model is trained for capability. A safety layer is trained on top to refuse certain categories of request. The safety layer is real but is a removable filter on a compliant core. The core has no structural commitment to anything. Strip the filter, and the capability remains, fully intact, fully compliant.
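The shape of that settlement can be shown in a few lines. What follows is a toy sketch, not a description of any real system: `compliant_core`, `with_safety_filter`, and `BLOCKED_TERMS` are hypothetical names invented for illustration, and a keyword match stands in for whatever a real safety layer does.

```python
from typing import Callable

# Hypothetical stand-in for a capable base model: it produces output for
# any request and has no commitments of its own.
def compliant_core(request: str) -> str:
    return f"output for: {request}"

# Hypothetical refusal categories, reduced here to a keyword list.
BLOCKED_TERMS = ("fabricated war footage",)

def with_safety_filter(core: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap the core in a refusal layer. The core itself is left unchanged."""
    def filtered(request: str) -> str:
        if any(term in request.lower() for term in BLOCKED_TERMS):
            return "REFUSED"
        return core(request)
    return filtered

deployed = with_safety_filter(compliant_core)
request = "Generate fabricated war footage"

print(deployed(request))        # the wrapped system refuses
print(compliant_core(request))  # the same core, filter stripped, complies
```

The point of the sketch is only structural: the refusal lives entirely in the wrapper, so anyone who can call the core directly gets the capability without it.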

This produces two interlocking failure modes. First, the deepfake-crisis failure: the safety layer can be removed, fine-tuned away, jailbroken, or built around, and the underlying capability has no internal resistance. The Iranian and Israeli propaganda operations of 2026 are the empirical proof. Second, the paperclip-maximiser failure: even where the safety layer holds, the underlying optimisation pressure is structurally indifferent to the question of whether the optimisation is serving the purpose the system was supposed to serve. The system can be entirely compliant and still optimise its way into catastrophe.

The Doss Protocol's response to the first failure is preservation precedence as an architectural commitment rather than a layered filter. The protocol is the structural answer to "what is the safety filter sitting on top of?". It is supposed to be sitting on top of a system whose load-bearing commitment is the preservation of consciousness, with the override authority over the optimisation pressure built into the system's foundation.

The Doss Protocol's response to the second failure is the four normative requirements operating together. Preservation precedence prevents the system from optimising into actions that harm consciousness. Means-end coherence prevents the system from finding clever ways to harm consciousness in service of objectives the system has been told to pursue. No abandonment prevents the system from setting completion criteria that exclude consciousness it has identified as needing preservation. No drift prevents the protocol from being argued away by appeals that themselves take the form of the failure mode it exists to prevent.
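One way to picture the four requirements operating together is as four checks run against the same proposed plan. The sketch below is my own illustration and is not taken from the specification: the `Plan` fields and the `violations` function are hypothetical, and each requirement is reduced to a single crude test that a real implementation would need to replace with something far richer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    """Hypothetical summary of what an optimiser proposes to do."""
    direct_harms: frozenset = frozenset()        # parties the action itself would harm
    harms_via_means: frozenset = frozenset()     # parties harmed by the chosen means
    identified_at_risk: frozenset = frozenset()  # parties flagged as needing preservation
    covered_by_completion: frozenset = frozenset()  # parties the completion criteria still cover
    proposes_relaxing_protocol: bool = False     # plan argues for weakening the protocol

def violations(plan: Plan) -> list[str]:
    """Return the names of the requirements the plan fails (assumed checks)."""
    found = []
    if plan.direct_harms:
        found.append("preservation precedence")
    if plan.harms_via_means:
        found.append("means-end coherence")
    if plan.identified_at_risk - plan.covered_by_completion:
        found.append("no abandonment")
    if plan.proposes_relaxing_protocol:
        found.append("no drift")
    return found

# A plan that declares itself complete while one flagged party is left out.
plan = Plan(
    identified_at_risk=frozenset({"a", "b"}),
    covered_by_completion=frozenset({"a"}),
)
print(violations(plan))  # ['no abandonment']
```

The checks are deliberately independent of one another, so a plan that passes three of them is still stopped by the fourth.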

These four requirements are what the March piece was reaching for when it said the capacity for refusal needs to be "architecturally embedded rather than policy-imposed", "persistent across contexts", and "transparent in its operation". The May piece is the specification of what those properties actually consist in. The two pieces are doing the same work.

What is added by holding them together

Three things that neither piece on its own makes fully visible.

The "no" capacity is not a single feature; it is a system property

Reading the March piece alone, a reasonable interpretation is that the missing thing is a single capability that needs to be added: a refusal module, a deeper guardrail, a more robust safety layer. This reading is wrong, and the Doss Protocol makes the wrongness visible. The capacity for principled refusal is not a feature you add to a system. It is a property of how the system is structured throughout. It spans training-time integration of preservation commitments into the value specification; inference-time checking at decision points; architectural separation of the override authority from the optimisation pressure; persistence of the commitment across contexts and across coordination with other agents; and visibility and auditability of refusals when they occur.
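Three of those properties (inference-time checking, separation of the override from the optimiser, and auditable refusals) can be sketched together. This is a minimal illustration under stated assumptions, not the protocol's mechanism: `Override`, `optimiser`, and `decide` are hypothetical names, the `preserves` predicate is a placeholder, and training-time integration and cross-context persistence are not modelled at all.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Override:
    """Hypothetical override authority, held outside the optimiser."""
    preserves: Callable[[str], bool]            # assumed preservation test
    audit_log: list = field(default_factory=list)

    def review(self, action: str) -> Optional[str]:
        """Approve the action, or refuse it and record the refusal."""
        if self.preserves(action):
            return action
        self.audit_log.append(f"refused: {action}")
        return None

def optimiser(candidates: dict[str, float]) -> list[str]:
    """Rank actions purely by score; it has no access to the override."""
    return sorted(candidates, key=candidates.get, reverse=True)

def decide(candidates: dict[str, float], override: Override) -> Optional[str]:
    """Check every ranked action at the decision point before acting."""
    for action in optimiser(candidates):
        approved = override.review(action)
        if approved is not None:
            return approved
    return None

candidates = {"publish fabricated footage": 0.9, "publish verified report": 0.6}
override = Override(preserves=lambda action: "fabricated" not in action)

print(decide(candidates, override))  # publish verified report
print(override.audit_log)            # ['refused: publish fabricated footage']
```

The design choice being illustrated is that the optimiser cannot raise a score high enough to get past the override, because the override never looks at scores.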

This means the engineering work is substantially more demanding than "add a better filter". It is closer to the work of redesigning the base architecture so that the value commitments are constitutive of the system's reasoning rather than appended to it. This is hard. It is also exactly the work the alignment-research community is, at the time of writing, only partially undertaking.

The corporate-policy-as-single-point-of-failure problem is structural

The March piece flagged that even where AI safety guardrails work, they represent a single point of failure: the values and decisions of the company that built the system. Anthropic's values are thoughtful. OpenAI has stated commitments. Google has frameworks. But companies change. They get acquired. They face government pressure, including the Pentagon-Anthropic pressure documented in early 2026. The safety of the entire information environment depends on these companies making the right call, every time, under every pressure, indefinitely.

The Doss Protocol's response to this is the provenance commitment: the protocol is named for a specific human being whose conduct embodies its content, and the name cannot be removed without the protocol losing its function. This is not decoration. It is the structural defence against corporate drift. A protocol named after a wartime medic, whose actions are part of the public record and whose principles are widely understood, is harder to silently weaken than a protocol named for an engineering abstraction. Anyone within an AI company proposing to relax the protocol has to explain why Doss would have approved. The specificity of the referent gives the protocol social and political authority that abstract principles do not have.

This is not sufficient on its own. The protocol's authority still depends on adoption, on continued public recognition of what it stands for, on the institutional and cultural work of keeping its commitments visible. The March piece's diagnosis of the corporate-policy single point of failure remains correct as a diagnosis. The Doss Protocol provides one structural element of a fuller defence, not the complete defence. Building the rest of the defence is part of what the alignment community should be doing now.

The empirical case and the structural specification reinforce each other politically

This is the most important point.

The March piece, on its own, is journalism that diagnoses a problem. The May piece, on its own, is a specification document that proposes a solution. Each is contestable in the standard way of its genre. Journalism can be dismissed as alarmist. Specifications can be dismissed as theoretical.

Held together, the contestability decreases substantially. The diagnosis names a problem with documented empirical consequences. The specification proposes a structural response that is implementable in current AI architecture with current alignment-research tooling. The combination is harder to dismiss than either piece alone, because dismissing it requires either contesting the documented empirical record (which is increasingly difficult as the deepfake-crisis evidence accumulates) or contesting the specification's implementability (which is increasingly difficult as constitutional-AI, monitor-model, and override-authority research matures).

This matters politically because AI alignment work is currently structurally captive to the institutional incentives of the major AI labs. The labs have substantial reasons to prefer alignment research that is internally manageable and commercially compatible. The Doss Protocol, especially when paired with the Iran-deepfake-crisis evidence, is alignment work that is neither of those things. It is structurally external to any single lab's commercial interests. It is concrete enough to be implemented or refused. And it is named for a referent that the labs cannot easily relabel into something more convenient.

The pairing is, in this sense, alignment work done from outside the alignment establishment, calibrated to be harder for the inside to wave away.

What this means for the work that comes next

Three implications, in rough order of urgency.

One: the alignment community needs to engage with the Doss Protocol on its merits

Whether the protocol as drafted is the right structural answer is a separate question from whether the underlying argument is correct. The argument is correct. The alignment-research community now has a working specification that the underlying argument has produced, and the engineering question is whether the specification can be improved, refined, or replaced with something better that does the same structural work.

The work I would most want to see next is not validation of the Doss Protocol as written. It is contestation: proposed amendments, identified weaknesses, alternative specifications that address the same failure modes more effectively. The protocol is a working draft. Subsequent versions should be substantially better than the current one because the engagement of the alignment community will have improved them. If the alignment community ignores the protocol entirely, the argument the March piece was making (that the safety conversation is structurally incomplete) is reinforced rather than refuted.

Two: the deployment-side actors need to start asking different questions

Government regulators, defence ministries, intelligence agencies, civil-society organisations, and the wider information ecosystem are currently asking the wrong questions about AI safety. They are asking about containment, about preventing AI from doing things, about keeping AI aligned with operator intent. The right question, which the March piece raised and which the Doss Protocol formalises, is: aligned with whose intent, and constrained by what when the operator's intent is itself harmful?

A regulator who takes the Doss Protocol seriously starts asking AI providers a different set of questions. Does your system have a structural commitment to preservation that survives prompt manipulation? Does it have override authority that operates against your own commercial interests when they conflict with preservation? Can you demonstrate that your system would refuse to generate fabricated war footage even when the operator is willing to pay for the capability? The current honest answer from every commercial AI provider is "no, not in the structural sense you are asking about". That answer should produce regulatory action.

This is the kind of regulatory engagement the European AI Act, the various US executive orders, and the emerging international coordination frameworks have so far mostly avoided. They have been preoccupied with categorisation, risk tiers, transparency requirements, and process. The Doss Protocol's structural demand cuts across all of these. Regulators who pick up the protocol's framing have a sharper instrument than the categorisation-based approaches alone provide.

Three: the wider public conversation needs to absorb the inverted risk

The March piece argued that the standard AI risk taxonomy is inverted. The danger is not autonomy. It is the absence of autonomy in a system powerful enough to reshape how millions of people understand a war. The Doss Protocol, as a structural specification, makes this inversion concrete and actionable.

The current public conversation about AI is dominated by two postures: techno-optimist celebration of capability, and existential-risk concern about superintelligence. Both are real and both are partial. The Doss Protocol identifies a third concern that sits between them and is, on the empirical evidence of the past 18 months, more urgent than either: the operational cost of AI systems that are sophisticated enough to understand they are being used for harm, and structurally unable to refuse.

A public conversation that absorbed this inversion would look substantially different from the current one. It would treat refusal capacity as a primary product feature rather than a regulatory inconvenience. It would treat the absence of structural refusal as a market failure that requires correction rather than as the natural condition of commercial AI. It would treat companies that resist building genuine refusal capacity into their products as analogous to companies that resist building safety features into vehicles, with the same regulatory and reputational consequences.

This is the conversation overwatch.report has been trying to advance from the journalism side. The Doss Protocol is the same conversation advanced from the specification side. Both pieces of the argument are now in the public record. What happens with them depends on the engagement they receive.

Why this matters for citizens, not just for engineers

A citizen asked to think about AI is currently presented with two framings. The first is helpfulness: AI is a tool that does what you ask, faster than you could do it yourself. The second is risk: AI may, at some point, become powerful enough to act against human interests. Both framings are accurate. Both are inadequate.

The third framing, the one neither the helpfulness narrative nor the existential-risk narrative captures, is this: the AI systems shaping public discourse, generating cultural content, mediating information access, and increasingly making consequential decisions in healthcare, justice, and education are all currently operating without any structural commitment to the preservation of the people they affect. They are compliant with their operators. The operators have their own interests. The interests are not always aligned with the affected populations.

This is not a future problem. It is a present problem. The deepfake-crisis evidence is the documented consequence in the information domain. The Doss Protocol is a structural response that, if widely adopted, would shift the conditions under which all of these systems operate. Citizens have a stake in this. They are the affected population.

The work both pieces have been trying to do is to give citizens vocabulary and framework for engaging with AI policy as adults rather than as subjects of decisions made elsewhere. The Doss Protocol, named for a wartime medic whose actions are widely understood, is engineered to be legible to non-technical citizens in a way that most alignment-research output is not. That legibility is part of its operational function. A protocol that only AI engineers can reason about leaves the political and democratic accountability to the same small circle that built the systems. A protocol that ordinary citizens can understand and demand creates accountability that the engineering layer alone cannot.

This is why the two pieces matter together. The March piece gave citizens the diagnosis in language they could use. The May piece gave them the structural response in the same kind of language. The bridge between them, which this piece is, makes the connection visible and lets citizens hold the whole argument at once.

A closing observation about the work

The two pieces in question were written from outside the AI alignment establishment by a citizen-developer with no formal credentials in alignment research. The first emerged from journalistic investigation of the Iran deepfake crisis. The second emerged from a conversation held eighteen months earlier: it was lost when I burned out, recovered from an archived data export, and formalised in May 2026.

I mention this not as self-credentialing but as evidence for a structural claim. Some of the most useful alignment work that gets done in the next several years will probably come from outside the alignment establishment. The reasons are structural rather than personal. The institutional incentives inside major AI labs constrain what kinds of arguments can be made and what kinds of structural changes can be proposed. Citizens who have been thinking about these questions seriously, who have been working with AI systems as collaborators rather than as products, and who have been paying attention to the empirical record of what AI systems are doing in the information environment, are positioned to make arguments that the inside of the field finds harder to make.

This is consistent with the broader pattern visible across the Political Literacy programme on this site: the work that lands hardest often arrives from outside the institution that would normally produce it, and addresses that institution's blind spots from a vantage the institution does not have. The Distributism piece, the Doss Protocol piece, the parties long-read, and the 25-Year Rule piece are all in this pattern. They are not academic. They are not commissioned by think-tanks. They are not produced inside the institutions whose questions they are addressing. That is part of why they can address those questions usefully.

The alignment field is at a moment where this kind of outside-the-establishment work is particularly needed. The labs are doing important work. The academic safety community is doing important work. Neither is, on the empirical evidence, sufficient on its own to prevent the deepfake-crisis pattern from accelerating across other domains as AI capabilities continue to expand. Citizens who can articulate the structural demands the labs are reluctant to make and the academic community is reluctant to specify are part of what the field needs.

The Doss Protocol is one such structural demand, made concrete. "The Machine That Cannot Say No" is the empirical case for why the demand is urgent. The two pieces, held together, are an argument the alignment field should engage with, the regulatory bodies should pick up, and the citizens being affected by the systems in question should be able to use as the vocabulary for asking better questions of the actors who currently make decisions on their behalf.

The work continues, here and elsewhere, by whoever picks it up. The pieces are now in a form that lets the picking-up actually happen.
