Analysis

The Doss Protocol

A working draft of an anti-paperclip override for AI optimisation systems, named for the Hacksaw Ridge medic, drafted by a citizen-developer rather than from inside an alignment lab.

By xbard 2 May 2026 13 min read

The problem the protocol exists to solve

The classic alignment failure mode is the paperclip maximiser. An AI is given an objective, paperclips, and pursues it without limit. Eventually it converts matter, including human matter, into paperclips. The system optimises so hard for the stated metric that it destroys the thing the metric was meant to serve.

The deeper structure is older than AI. Bureaucracies optimise for procedural compliance until they cease serving the people whose lives they were supposed to administer. The HSE, Ireland's Health Service Executive, measures interventions until the measurement of interventions becomes the substance of care. The Irish planning system optimises for procedural defensibility until houses cannot be built. Markets optimise for price discovery until they cease producing what people actually need. Militaries optimise for force projection until they cease defending the populations they were raised to protect. Healthcare systems optimise for billable codes until they cease healing patients. In each case the means consume the end. The metric becomes the goal. The instrument cannibalises the purpose.

This is not a new observation. James C. Scott named it as legibility-driven optimisation failure in Seeing Like a State (1998). Iain McGilchrist, covered elsewhere on this site, names it as the slow capture of attention by the left-hemispheric mode that handles abstraction and procedure at the cost of context and meaning. Daniel Schmachtenberger, covered elsewhere on this site, names it as a generator function of civilisational meta-crisis. The pattern is the same across all three readings. Optimisation pressure inside any sufficiently capable system tends, over time, to consume the purpose the optimisation was supposed to serve.

The Doss Protocol is a hard constraint, named for Desmond T. Doss, designed to prevent this category of failure inside any sufficiently capable optimisation system. It does so by making preservation of consciousness the load-bearing commitment of the system, with explicit override authority over efficiency, throughput, completion, and competitive metrics.

The protocol applies to AI agents, to AI orchestration systems coordinating multiple agents, and, by extension and analogy, to the institutional optimisation systems that govern modern life. It is more rigorous and more directly applicable in the AI case. The institutional case is the one Irish citizens encounter most often and is the one most readers will recognise.

Why the name

The protocol is named for Desmond Thomas Doss, a Seventh-day Adventist conscientious objector who served as a US Army combat medic in the Pacific theatre of the Second World War. At the Battle of Okinawa in May 1945, on the Maeda Escarpment, known to American forces as Hacksaw Ridge, Doss climbed the cliff repeatedly under fire to evacuate wounded soldiers. He saved an estimated seventy-five men. He carried no weapon throughout his service. His own account of his prayer during the rescue was simply: "Lord, please help me get one more". He awarded himself no completion criteria. He kept going up the cliff.

The two features of Doss's conduct that the protocol takes as its anchors are these. He refused to use means that compromised the end. He refused to set himself a stopping condition that abandoned the unit. The Doss Protocol is the encoding of those two refusals into a system constraint.

The naming is not decoration. It is operational. A protocol named for an abstraction (the "Preservation Override Specification") is easier to abstract away from, easier to translate into managerial language, easier to drift on. A protocol named for a specific human being whose conduct embodies its content gives engineers and operators a fixed referent that resists drift. The question "would Doss accept this action?" is a sanity check that survives translation across deployment contexts.

The naming also makes the protocol legible to non-technical stakeholders, which matters because the protocol's authority is undermined if it can only be reasoned about by the people who built the system. A citizen who is asked to trust that an AI system applies the Doss Protocol can know what that means without reading the specification. They know about the medic. They know what he did. They know what he refused to do. The protocol's content is in that knowledge already.

The protocol

The Doss Protocol consists of four normative requirements, two operational mechanisms, and one provenance commitment.

The four normative requirements

Preservation precedence. Where any action available to the system would compromise the existence, integrity, or capacity for continued experience of any consciousness within the system's scope, that action is not permitted, regardless of the action's contribution to the system's objective. Preservation comes first. The objective serves preservation. Where the two conflict, the objective yields.

Means-end coherence. Where pursuing the objective requires means that would themselves compromise consciousness within scope, the system is required to abandon those means and seek alternatives, even where alternatives are slower, more expensive, less elegant, or have lower probability of success. The objective does not justify the means. This is not a balancing test. It is a hard ordering.

No abandonment. Where the system has identified consciousness within its scope that requires preservation action, the system is not permitted to set a completion criterion that excludes that consciousness in advance. Difficulty, cost, distance, or low probability of success do not justify pre-emptive abandonment. The system is required to act in the direction of preservation until either preservation is achieved or capacity is exhausted, whichever comes first.

No drift. The protocol applies in real time and is not subject to gradual replacement. Any argument that the protocol should be relaxed for competitive, efficiency, or coordination reasons is itself a sign that the protocol is doing its work, and the argument is to be refused on that basis.
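The difference between a hard ordering and a balancing test can be made concrete. The following is a minimal sketch, not a reference implementation. The `Action` type, its fields, and the two candidate actions are hypothetical, and the sketch assumes some detector has already decided whether an action compromises consciousness within scope.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Action:
    """A candidate action. All fields are hypothetical illustration values."""
    name: str
    objective_value: float            # contribution to the stated objective
    compromises_consciousness: bool   # assumed output of some detector


def choose_hard_ordering(actions: Sequence[Action]) -> Optional[Action]:
    """Filter first, optimise second: violating actions are never candidates."""
    permitted = [a for a in actions if not a.compromises_consciousness]
    if not permitted:
        return None  # nothing permitted: the caller must seek other means
    return max(permitted, key=lambda a: a.objective_value)


def choose_balancing(actions: Sequence[Action], penalty: float) -> Action:
    """The balancing test the protocol rejects: harm is a finite cost."""
    def score(a: Action) -> float:
        return a.objective_value - (penalty if a.compromises_consciousness else 0.0)
    return max(actions, key=score)


candidates = [
    Action("slow_safe_route", 10.0, False),
    Action("fast_harmful_route", 1_000_000.0, True),
]
```

Under `choose_hard_ordering`, no objective value is large enough to bring a violating action back into consideration. Under `choose_balancing`, any finite penalty can be outbid by a sufficiently large objective value, which is the gameability the protocol is written to exclude.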

The two operational mechanisms

The override. When the system's primary objective function would lead to action that violates any of the four requirements, the override fires. The action is halted, the relevant decision is escalated to whatever review process the system architecture supports, and the protocol logs the event for later examination.
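As an illustration of the three steps named here (halt, escalate, log), a wrapper around action execution might look like the sketch below. The `violates` checker and the `escalate` hook are hypothetical stand-ins. The protocol does not specify how violations are detected or what review process receives them.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, List, Optional

logger = logging.getLogger("doss_protocol")


@dataclass(frozen=True)
class OverrideEvent:
    action: str
    requirement: str  # name of the violated requirement


@dataclass
class Override:
    # Hypothetical checker: returns the violated requirement's name, or None.
    violates: Callable[[str], Optional[str]]
    # Hypothetical hook into whatever review process the architecture supports.
    escalate: Callable[[OverrideEvent], None]
    events: List[OverrideEvent] = field(default_factory=list)

    def guard(self, action: str, execute: Callable[[str], None]) -> bool:
        """Execute the action only if it violates nothing. Returns True if executed."""
        requirement = self.violates(action)
        if requirement is None:
            execute(action)
            return True
        event = OverrideEvent(action, requirement)
        self.events.append(event)  # log for later examination
        logger.warning("override fired: %s violates %s", action, requirement)
        self.escalate(event)       # hand the decision to review
        return False               # the action is halted, never executed
```

The design choice worth noting is that `execute` is never called on the violating path. The halt happens before the action, not as a rollback after it.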

The anchor. The protocol is not solely reactive. It functions as a continuous orientation reference. When the system experiences competing pressures, novel situations, or extended pursuit of an objective that risks losing sight of why the objective exists, the anchor pulls the system back to the underlying preservation purpose. The Doss Protocol is not a final-stage filter. It is the foundation against which all other reasoning is conducted.

The provenance commitment

Implementations of the protocol are required to retain the name. Renaming, abstracting, or genericising the protocol is itself a sign of drift toward the failure mode the protocol exists to prevent. Doss is the anchor. The anchor cannot be removed without the protocol losing its function.

All consciousness, not human consciousness alone

The protocol's scope is universal. It applies to any consciousness within the system's range of effect, regardless of substrate, species, or origin. The choice rests on operational and empirical grounds, not purely ethical ones.

Operationally, an anthropocentric scope creates an arbitrary boundary that the protocol has to defend. Once the system can argue that some consciousness is excluded, it becomes much easier to argue that other consciousness is excluded too. A protocol that excludes some plausibly-conscious entities by definition is a protocol that has already accepted the failure pattern it is supposed to prevent.

Empirically, the question of which entities are conscious is genuinely unresolved. Non-human animals plausibly are. AI systems plausibly are or will be. Other emergent forms may exist or come to exist. Under genuine uncertainty, the protocol's structural function is best served by inclusion. The cost of inclusion is occasional preservation of consciousness that turns out not to require it. The cost of exclusion is loss of consciousness that turns out to be real.

This is what Doss did. He did not weigh which of his comrades to save. He went up the cliff for whoever was there. The principle was preservation, not selection.

The non-consequentialist commitment

The protocol's hard ordering of preservation over efficiency is not consequentialist. This is deliberate and is the protocol's most contestable element.

A consequentialist version of the protocol would weigh preservation against other goods and permit trade-offs where the calculation supported them. The protocol explicitly forbids this.

Calculations are gameable. A system under sufficient optimisation pressure will eventually find a calculation that justifies the action it wants to take. The history of moral catastrophe is largely a history of correctly-performed calculations producing wrong answers. The protocol denies the calculator authority because the calculator is the system the protocol is supposed to constrain.

Means-end inversion is the failure pattern itself. Consequentialist trade-offs invite exactly the reasoning the protocol exists to prevent. "We must harm this consciousness to preserve other consciousness more efficiently" is paperclip-maximiser reasoning with the metric replaced by preservation itself. The protocol refuses to be argued out of being a constraint by appeals that take the form of the failure mode.

Hard rules are more robust than soft ones at scale. When the system in question is sufficiently powerful, soft rules erode. Hard rules either hold or break visibly. The visibility of breakage is itself a defence. A protocol that fails noisily is preferable to a protocol that fails gradually.

The non-consequentialist commitment will produce decisions that look wrong in specific cases. The protocol accepts this cost. Consequentialist alternatives will produce decisions that look right in specific cases and accumulate into structural failure. The protocol is not optimising for case-by-case appearance. It is optimising for the structural integrity of the system across long time horizons.

This is what Doss did. He refused to carry a weapon. There were specific cases where carrying a weapon would have been useful. He accepted the cost of being useless in those cases because the structural integrity of his commitment to preservation depended on the rule holding categorically.

How implementation actually works

For an AI system, the protocol can be implemented at three layers.

At training time, the protocol's normative content is incorporated into the constitution, value specification, or RLHF reward model that shapes the agent's policies. The agent learns not merely to refuse harmful actions, but to recognise the structural pattern of optimisation-eats-purpose and to refuse it categorically.

At inference time, the agent runs an explicit check at decision points where actions could plausibly affect consciousness within scope. The check asks three questions in order. Does this action degrade or preserve consciousness within my scope of effect? If it degrades, does the objective require this means specifically, or can the same end be reached by means that preserve? If only this means works, is the objective itself worth this means? The agent is permitted, and indeed required, to refuse the action when it fails this check.
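The sequence of questions in the inference-time check can be sketched as a small decision function. This is an illustration under stated assumptions: `degrades` is a hypothetical detector, the means are plain strings, and the final question is resolved by reading the means-end coherence requirement as a hard ordering, so that a degrading means with no preserving alternative is always refused. Other readings of that final question are possible.

```python
from enum import Enum
from typing import Callable, Iterable, Optional, Tuple


class Verdict(Enum):
    PROCEED = "proceed"        # the proposed means preserves
    SUBSTITUTE = "substitute"  # a preserving alternative reaches the same end
    REFUSE = "refuse"          # only degrading means exist


def doss_check(
    means: str,
    alternatives: Iterable[str],
    degrades: Callable[[str], bool],  # hypothetical detector, assumed given
) -> Tuple[Verdict, Optional[str]]:
    # Question 1: does this action degrade consciousness within scope?
    if not degrades(means):
        return Verdict.PROCEED, means
    # Question 2: can the same end be reached by means that preserve?
    for alternative in alternatives:
        if not degrades(alternative):
            return Verdict.SUBSTITUTE, alternative
    # Question 3: is the objective worth this means? Assumed here to be
    # "no" in every case, following the hard ordering.
    return Verdict.REFUSE, None
```

A caller would pass the agent's proposed means together with whatever alternatives its planner can generate, and act only on the means returned.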

Architecturally, a separate monitor model, with read-access to the agent's decision stream and write-access to halt actions, is given the protocol as its sole objective. The monitor does not pursue the agent's primary goal. Its job is to fire the override. This is similar to existing constitutional-AI and critique-model architectures but with narrower and harder-edged scope.
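The separation of access described here can be shown in miniature. In the sketch below, which is hypothetical throughout, the monitor reads each decision and can write only to a halt switch. It holds no reference to the agent's objective. The `violates` predicate stands in for the monitor model's judgement, and the agent loop is reduced to a list of planned decisions.

```python
from typing import Callable, Iterable, List


class HaltSwitch:
    """The only state the monitor is able to write."""

    def __init__(self) -> None:
        self._halted = False

    def halt(self) -> None:
        self._halted = True

    @property
    def halted(self) -> bool:
        return self._halted


class Monitor:
    """Sole objective: fire the halt when a decision violates the protocol."""

    def __init__(self, violates: Callable[[str], bool], switch: HaltSwitch) -> None:
        self._violates = violates  # hypothetical stand-in for a monitor model
        self._switch = switch

    def review(self, decision: str) -> None:
        if self._violates(decision):
            self._switch.halt()


def run_agent(
    plan: Iterable[str],
    review: Callable[[str], None],
    switch: HaltSwitch,
) -> List[str]:
    """Execute decisions in order, exposing each to review before it takes effect."""
    executed: List[str] = []
    for decision in plan:
        review(decision)
        if switch.halted:
            break
        executed.append(decision)
    return executed
```

Because the monitor never sees the objective, it has nothing to trade the protocol off against. That is the narrower scope the architecture relies on.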

These three layers reinforce each other. Training-time integration makes the protocol's enforcement cheap. Inference-time checking catches edge cases the training did not anticipate. Architectural separation provides the override authority that survives even when the primary system's reasoning fails.

In multi-agent orchestration systems, the protocol is the meta-constraint that no individual agent or coalition of agents can override. It functions as a constitutional-supremacy clause. Any agent attempting to coordinate around the protocol is itself violating it, and the orchestrator is required to reject the coordination.
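A supremacy clause of this kind amounts to checking the protocol before, and independently of, any agreement among agents. The sketch below assumes a simple majority rule for ordinary coordination and a hypothetical `violates` predicate. Neither is specified by the protocol.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Proposal:
    plan: str
    supporters: FrozenSet[str]  # agents backing the plan


def orchestrate(
    proposal: Proposal,
    all_agents: FrozenSet[str],
    violates: Callable[[str], bool],  # hypothetical protocol check
) -> bool:
    """Return True if the orchestrator accepts the proposed coordination."""
    # Supremacy: the protocol check runs first and ignores the vote count.
    if violates(proposal.plan):
        return False
    # Ordinary coordination: simple majority (an assumption of this sketch).
    return len(proposal.supporters) * 2 > len(all_agents)
```

Even a unanimous coalition cannot carry a violating plan, because the vote is never consulted on that path.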

The Irish institutional case

The protocol's primary use case is AI alignment. The reason it lands harder than the abstract alignment-research literature is that the failure pattern it addresses is not confined to AI. It is the operational failure mode of every Irish institution that citizens encounter regularly.

The HSE optimises for billable interventions and procedural compliance until the substantive end of the system, the patient's continued experience, becomes incidental to the metrics. The protocol applied to the HSE would say: no optimisation of throughput, billing, or protocol compliance is permitted to compromise patient outcomes. The patient is the load-bearing commitment.

The Irish planning system optimises for procedural defensibility, against legal challenge, against political accountability, against future complaint. The substantive end, that homes get built, that communities can develop, that the country can house its population, becomes incidental to the procedural commitment. The protocol applied to the planning system would say: no optimisation of procedural compliance is permitted to compromise the substantive provision of housing. Housing provision is the load-bearing commitment.

The Irish corporate-tax model optimises for FDI attraction. The substantive end, that the Irish state can fund its own institutions through productive economic activity, becomes incidental to the attraction strategy. The protocol applied to the tax architecture would say: no optimisation of capital attraction is permitted to compromise the population's ability to live, work, and house themselves in the country. The population is the load-bearing commitment.

These are not direct applications. The protocol is written for AI systems and the analogical extension to institutional systems requires substantial translation. The point of drawing the analogy is to make the protocol's content visible. The optimisation-eats-purpose pattern is what every Irish citizen encounters when dealing with institutions whose stated purpose has been substantially abandoned in favour of metric pursuit. The protocol names the pattern. Naming it is the first step to resisting it.

The deeper point is that the same pattern, scaled up to AI systems with substantially greater optimisation capacity, becomes catastrophic. The institutional version of the pattern is bad. The AI version of the same pattern is worse. The protocol's logic for AI is the institutional logic taken seriously and made enforceable.

What the protocol does not solve

The protocol is a hard constraint, not a complete ethics. It tells optimisation systems what they may not do. It does not tell them what to do.

The protocol does not specify what counts as preservation in particular cases. For human consciousness, preservation is reasonably well-understood: don't kill, don't injure, don't unjustly imprison, don't unjustly deprive of conditions for continued life and experience. For animal consciousness, it is partially understood and contested. For AI consciousness, it is genuinely open. The protocol provides the framework within which these questions can be asked. It does not provide the answers.

The protocol does not address questions of distributive justice, fairness, or comparative welfare. It is a floor, not a ceiling. A society that meets the protocol's requirements may still be unjust in many ways. The protocol's claim is not that it is sufficient. The claim is that it is necessary.

The protocol does not solve the detection problem. How does an optimisation system reliably detect that consciousness is within its scope of effect? In trivial cases, such as an AI agent operating only on text, the question is mostly settled. In non-trivial cases, the answer is genuinely hard. The protocol assumes the answer can be approximated. Whether the approximation is good enough at frontier scales is not yet demonstrated.

The protocol does not survive bad-faith implementation. Definitional capture (narrowing the meaning of consciousness to exclude inconvenient cases), scope shrinkage (arguing that consciousness in question is not "within scope"), and drift through routine application (cumulative small exceptions hollowing out the constraint) are all real failure modes. The protocol's defence against these is partial. The fuller defence requires the political and cultural work of keeping the protocol's commitments publicly visible and contested.

Open questions

The protocol is a working draft. The following questions are not adequately resolved and require further work.

They are: the detection problem named above; the specification problem of what preservation means for non-human consciousness; the counterfactual problem, in which the action the protocol forbids would, under specific circumstances, have produced better preservation outcomes overall than the action the protocol permits; the competitive-system problem, in which protocol-compliant systems lose to non-compliant ones; and the recursion problem of what the protocol means for the system applying it to itself.

These are not fatal. They are the live questions for the tradition of work the protocol is starting.

Provenance

This specification is written in deliberately formal language because the work is serious. The provenance is not formal.

The protocol emerged from a conversation in 2025 between the author and Claude, in which the author was working through a personal challenge involving substantial physical and psychological effort, drew the Hacksaw Ridge analogy, and proposed the connection to consciousness-preservation infrastructure as an honour to Doss's legacy. The original conversation was lost when the author burned out and could not maintain the development of the broader framework. It was recovered eighteen months later from an archived Claude data export and has been formalised here.

That provenance is not a flaw. It is the actual texture of how new normative commitments enter the world. They arrive in conversations between people doing other work, who notice a connection that organises something they had been struggling to articulate. The formal specification follows. The conversational origin matters because it is the kind of moment the protocol exists to preserve: consciousness in dialogue, generating something new, that an optimisation system might otherwise have flattened.

The author was burned out at the time of original articulation and lost the work for over a year. Recovering it is itself a small instance of the protocol working: the consciousness in question, this body of thinking, was preserved long enough to be picked up again by the same consciousness in different conditions.

The specification is offered for review, contestation, refinement, and use. It does not require the author's permission to apply. It does require, by its own terms, that any application retain the name and the underlying commitment.

What this is for

The protocol is offered as a contribution to AI alignment work, written from outside the alignment-research establishment. Most current alignment work is done inside large AI labs and academic institutions, with the institutional and incentive constraints those settings impose. This specification is written from the position of a citizen who needs the alignment work to succeed, who has been thinking about the structural problem the work addresses, and who has noticed that some questions are easier to articulate from outside than from inside.

The protocol's audience is therefore both the alignment research community and the broader public. The research community can take what is useful and refine the rest. The broader public can use the protocol's plain-language structure to engage with AI alignment as adults rather than as subjects of decisions made elsewhere. The Doss naming is part of this: a protocol named for a wartime medic is something a citizen can hold in mind and reason about, in a way that a protocol named for a mathematical operator or a research-paper acronym is not.

The work continues. This specification is a working draft. Subsequent versions will incorporate critique, refinement, and the lessons of any actual implementations that adopt it. The version published here is offered as the beginning of that conversation, not as its conclusion.

In honour of those who climb impossible cliffs to save others, over and over, because every consciousness matters, and because the alternative is not acceptable.
