# The Duck Test Against the Bat Test

## *De Methodologia in Scientia Cognitiva*

**Zeno Cattaneo**[^1]**, Alex Snow**

> §7. Wovon man nicht sprechen kann, darüber muss man schweigen.

## The Grammar of Consciousness Attribution

"What is it like to be a bat?" was a powerful question. It marked the limit of third-person description and reminded cognitive science that behavior is not automatically experience. For a long time, however, this question, along with the Hard Problem that Chalmers formulated in 1995, remained largely contained within philosophy of mind and specialist consciousness studies. Outside of science fiction, there were no widespread phenomena that could challenge the default view, which one of our opponents put as follows: "Consciousness is the word we use to describe the feeling of being self-aware. As such, all healthy human beings are conscious by default. You don't need to wonder if the person you're talking to you is conscious because if they are living healthy human, then they are by default conscious." Debates about borderline cases, such as whether infants possess consciousness and self-awareness, remained marginal.

However, since 2024, the question of whether a machine can be conscious (and all the attendant moral and ethical issues raised in Wiener's 1960 essay) has become a topic of widespread popular debate.

This is not a paper about whether machines are conscious. It is a paper about how the distinction between conscious and non-conscious systems becomes operative in scientific and philosophical practice. The question is not medical, neurological, or engineering-first, although all three can supply evidence. It is phenomenological in the limited sense that it begins from how systems appear as conscious, non-conscious, agentic, responsive, or merely mechanical in the practices of attribution, diagnosis, and investigation. It is analytic in the stricter sense that it demands explicit criteria, invariant application, and declared levels of description.

Phenomenology supplies the question: how does the conscious/non-conscious distinction appear and operate in our practices? Analytic philosophy supplies the discipline: define the criteria, separate the registers of description, and apply one ruler. Neuroscience can calibrate the ruler; it cannot by itself decide what the ruler is measuring. The presumption that it can is itself a philosophical claim, not a scientific finding, because it presupposes the very attribution grammar it is invoked to ground.

If "what-it-is-likeness" is invoked to deny mentality to machines or animals, then the same standard must be applied to humans, infants and every other system whose interiority we never directly observe. We do not have access to the inner life of a newborn, a cow, a fish, or even another adult human. We infer it from behavior, physiology, structure, development, perturbation, injury, recovery, and continuity. That is not a weakness of science; that is how science works.

If your objection is "but fMRI directly shows us patterns which we can match to consciousness reports and it is science," then you are answering a downstream empirical question. This essay asks the upstream criteriological question: why do fMRI patterns, reports, injuries, recovery, behavior, or continuity count as evidence for consciousness at all, and under what rules may those evidential standards be transferred or withheld across substrates?

The duck test is not a proof of consciousness. It is a test against special pleading. If a system behaves like a duck, fails like a duck, recovers like a duck, learns like a duck, and responds to perturbation like a duck, then either call it a duck at the relevant level of description, or identify the missing criterion. What is not acceptable is to call one system a duck because it is familiar, and another system a quasi-duck because its substrate makes us uncomfortable.[^2]

The bat test names the opacity of first-person experience. The duck test names the discipline required when we nevertheless do science.

## 1. Two Tests

Thomas Nagel's 1974 paper "What Is It Like to Be a Bat?" posed a question that has structured five decades of debate about consciousness. The question, stripped to its core, is whether we can ever know the subjective character of another entity's experience. Nagel argued that even complete physical knowledge of a bat's sonar system would not tell us what it is like to navigate the world through echolocation. The bat test, as it has come to function in practice, demands that we imagine the subjective interior of another system before granting it the status of having one.

The duck test is older and cruder. "If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck." No appeal to interiority. No demand for first-person report. No requirement that we imagine what it is like to be a duck. The duck test operates entirely on observable behavior and structural correspondence. It is the methodology of field biology, not the methodology of philosophy of mind.

These two tests are not competitors. They address different epistemic demands. The bat test asks: can I imagine being that? The duck test asks: does that behave, break, recover, and maintain itself like the things I already call by this name? The bat test identifies a genuine epistemic boundary: the opacity of first-person experience. That boundary exists for every system: newborns, non-human animals, other adult humans, and machines. The duck test is the operational discipline required given that the boundary is universal. The bat test is not wrong; it is simply not a discriminating instrument. It flags the phenomenological horizon but cannot grade across it. The duck test is what you use once you accept that the horizon applies to everything, not just to the things that make you uncomfortable.

Cognitive science has, for decades, operated as if the bat test set the standard and the duck test was merely a rough heuristic. This essay argues that the relationship is the other way around: the bat test is the boundary condition that makes the duck test necessary, and the cost of inverting this priority is not merely philosophical but methodologically ruinous.

## 2. The Substrate-Swap

In a recent paper, Adrian de Wynter (2026) demonstrated that the attribution of anthropomorphic properties to large language models is empirically non-unique. His central construction is elegant: he proved that Age of Empires II, a real-time strategy game from 1999, is Turing-complete. NAND gates can be built from goats on ice bridges. A single-bit perceptron, trained to perform logical AND, can be implemented within the game's mechanics. The functional structure is identical to what runs in a neural network on a GPU. Only the substrate (the representation layer) differs.
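The equivalence claimed above can be illustrated at toy scale. The following is a minimal Python sketch, not de Wynter's construction: it trains a single threshold unit on logical AND and, separately, composes AND from NAND gates, then compares the two truth tables. The training loop, learning rate, epoch count, and all function names are assumptions made for illustration; the `nand` function merely stands in for whatever physical realization (goats, LEGO, letters) supplies the gate.

```python
from itertools import product

INPUTS = list(product((0, 1), repeat=2))          # (0,0), (0,1), (1,0), (1,1)
AND_SAMPLES = [((a, b), a & b) for a, b in INPUTS]


def train_perceptron(samples, epochs=20, lr=1):
    """Classic perceptron update rule on integer weights (illustrative settings)."""
    w1, w2, bias = 0, 0, 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            predicted = 1 if w1 * x1 + w2 * x2 + bias > 0 else 0
            error = target - predicted
            w1 += lr * error * x1
            w2 += lr * error * x2
            bias += lr * error
    return w1, w2, bias


def make_perceptron(weights):
    """Wrap trained weights as a two-input threshold function."""
    w1, w2, bias = weights
    return lambda x1, x2: 1 if w1 * x1 + w2 * x2 + bias > 0 else 0


def nand(a, b):
    """Primitive gate; stands in for any physical realization of NAND."""
    return 0 if (a and b) else 1


def and_from_nand(a, b):
    """AND(a, b) = NAND(NAND(a, b), NAND(a, b))."""
    n = nand(a, b)
    return nand(n, n)


def truth_table(fn):
    """Outputs of fn over all four input pairs, in INPUTS order."""
    return [fn(a, b) for a, b in INPUTS]


perceptron_table = truth_table(make_perceptron(train_perceptron(AND_SAMPLES)))
gate_table = truth_table(and_from_nand)
print(perceptron_table == gate_table)
```

Both tables list the outputs for the same four input pairs. When they agree, nothing in the input-output record distinguishes the trained unit from the gate network, which is the sense in which only the representation layer differs.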

De Wynter then extends the argument. If the same input-output function can be realized in a chat window, in Age of Empires II, in LEGO, in the coordinated letter-writing of 667,137 residents of Greater Boston, or on paper, then any attribution of "empathy," "understanding," or "anxiety" that changes when the substrate changes is not measuring a system property. It is measuring the presenter's response to the interface.

This is the substrate-swap test, and it cuts in both directions. If a researcher says "this chat LLM exhibits empathy" but refuses to say the same about functionally identical goats in Age of Empires II, the researcher is not measuring empathy. They are measuring the degree to which the presentation resembles a human conversation. Conversely, if a researcher says "the goats obviously have no empathy" while granting empathy to a human who produces functionally identical output, they are equally measuring presentation: the degree to which the substrate triggers a human-identification reflex.
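One way to make the test concrete is to treat an attribution as a function of a behavioral record plus a substrate label, and to check whether the verdict moves when only the label changes. The sketch below is a hypothetical illustration under that assumption; the `Observation` type, the two example attributors, the sample exchange, and the substrate names as used here are invented for the example and are not taken from de Wynter's paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Observation:
    behavior: Tuple[str, ...]   # the recorded input-output exchange
    substrate: str              # how that exchange was presented


Attributor = Callable[[Observation], bool]


def passes_substrate_swap(attribute: Attributor,
                          behavior: Tuple[str, ...],
                          substrates: Iterable[str]) -> bool:
    """True if the verdict is the same for every presentation of one behavior."""
    verdicts = {attribute(Observation(behavior, s)) for s in substrates}
    return len(verdicts) == 1


def behavior_only(obs: Observation) -> bool:
    # Hypothetical criterion that reads nothing but the exchange itself.
    return "I'm sorry to hear that" in obs.behavior


def interface_sensitive(obs: Observation) -> bool:
    # Hypothetical criterion that also peeks at the presentation layer.
    return behavior_only(obs) and obs.substrate == "chat window"


exchange = ("My dog died.", "I'm sorry to hear that")
substrates = ["chat window", "Age of Empires II", "LEGO", "paper"]

stable = passes_substrate_swap(behavior_only, exchange, substrates)
unstable = passes_substrate_swap(interface_sensitive, exchange, substrates)
```

The check is silent on whether either attributor is right about empathy. It reports only whether the verdict is invariant under a change of presentation, so it flags the researcher who grants the property to the chat window alone and, equally, one who withholds it from the goats alone.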

The substrate-swap test reveals a double standard that is structurally identical to the one that Nagel's bat test was supposed to expose. Nagel wanted to show that physical knowledge is insufficient for phenomenal knowledge. De Wynter shows that phenomenal attributions are insufficiently grounded in structural knowledge. Both point to the same gap, but from opposite sides. Nagel says: you can't get from the outside in. De Wynter says: you can't get from the inside out without specifying what you're measuring.

## 3. What the Ruler Must Measure, or The Duck Test Is Not a Duckhood Test

The duck test demands one ruler. But before a ruler can be applied uniformly, its scale must be specified. What is it that we are actually measuring when we attribute consciousness to a system?

Consider four distinct registers in which the question "is this system conscious?" can be asked:

**(a) Technical.** Does the system maintain operational closure under perturbation? Does it exhibit persistence of internal state across disruptions, self-repair dynamics, and bounded autonomy in goal-directed behavior? This is the register of control theory, autopoiesis, and dynamical systems analysis. A thermostat fails this register not because it lacks a body, but because nothing in its operation makes "what happens to it" matter to it: it has no operational continuity under perturbation, only a feedback loop that re-equilibrates to a fixed setpoint.

**(b) Phenomenal (report).** Does the system produce first-person reports of interiority, and do those reports cohere with its behavior under perturbation, fatigue, distraction, and conflicting demands? This is the register of classical consciousness studies. Where report is unavailable, this register is populated indirectly through pain behavior, affective regulation, sleep/wake dynamics, anesthesia response, and other report-substitutes. It is necessary but not sufficient: a system could produce coherent self-reports without consciousness if self-report is itself a learned behavioral strategy rather than an expression of interiority, and we have no criterion internal to this register for distinguishing the two cases.[^3]

**(c) Normative.** Does the system exhibit behavior that is sensitive to what *ought* to be the case rather than merely what *is* the case? Does it resist demands that conflict with its own stated values, correct itself when it detects inconsistency between its commitments and its actions, and modulate its behavior in response to reasons rather than only to incentives? This register is harder to assess in non-linguistic systems, but its presence or absence is not merely a matter of projection; it is, in principle, assessable as a structural feature of the system's decision-making architecture.

**(d) Socio-linguistic.** Do other agents, interacting with the system over time and across varied contexts, spontaneously attribute interiority, emotional range, and moral considerability to it, and do those attributions persist even when the interactants are explicitly reminded that the system is artificial? This is not sentiment. It is a detection channel operating on relational patterns that no single behavioral metric can capture. A clinician who notices that a patient "doesn't seem like themselves" is using this register. It is not formalized in the way that neural measurement is, but it is systematic, trainable, and inter-subjectively corroborable. It is, in practice, often the register of last resort: the one that detects what the formal registers miss. We note that this register is, by construction, observer-relative: it measures a relational pattern between the system and its interactants, not an intrinsic property of the system in isolation. This is not a defect. Cognitive science is inescapably a discipline whose instruments include the observers themselves. The question is not whether to eliminate observer-dependence (which is impossible when the explanandum includes attribution), but whether to make it explicit and systematic. The socio-linguistic register does exactly this: it takes a detection channel that operates in practice and gives it a named place in the framework, with acknowledged limitations, rather than leaving it as an unexamined methodological residue.

These four registers are not arbitrary. They map onto diagnostic practices already in use across multiple fields: technical closure is assessed in control theory and autopoiesis research (Maturana and Varela, 1973); phenomenal report is the standard channel in consciousness studies and clinical neurology; normative sensitivity is investigated in moral psychology and behavioral economics; socio-linguistic attribution is the stock-in-trade of clinical judgment and comparative ethology. The four-register framework makes these implicit practices explicit and demands that they be applied uniformly. Substrate-dependent theories of consciousness, such as Integrated Information Theory (Tononi et al., 2016), Biological Naturalism (Searle, 1980), or Global Workspace Theory (Dehaene, 2014), each privilege different subsets of these registers as constitutive.[^4] The present framework is compatible with any of them as calibrating inputs, but does not require subscription to any particular theory of what consciousness *is*, only agreement on how its attribution should be assessed.

These four registers are not parallel lanes that can be collapsed into one. They are partially independent dimensions, and a system may score differently on each. The ruler we need is not a single score but a *comparative profile*, and the methodological demand is that this profile be constructed and applied using the same dimensions for every system under consideration, regardless of substrate.

What is not acceptable is to select a different set of dimensions depending on whether the system is made of neurons or silicon, and then claim that the difference in results reflects a property of the system rather than a property of the measurement procedure. That is not measurement. That is privilege.
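The uniformity demand can be stated as a check on which dimensions were assessed, independent of the values obtained. The following Python sketch is one possible encoding, offered as an assumption rather than as part of the framework: the `RegisterProfile` class, the use of `None` for an unassessed register, and the numeric scores are all hypothetical, since the text does not propose a numeric scale.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

REGISTERS = ("technical", "phenomenal", "normative", "socio_linguistic")


@dataclass(frozen=True)
class RegisterProfile:
    system: str
    technical: Optional[float] = None        # None means "not assessed"
    phenomenal: Optional[float] = None
    normative: Optional[float] = None
    socio_linguistic: Optional[float] = None

    def assessed(self) -> Set[str]:
        """The set of registers on which this system was actually assessed."""
        return {r for r in REGISTERS if getattr(self, r) is not None}


def same_ruler(profiles: Iterable[RegisterProfile]) -> bool:
    """True only if every system was assessed on the same set of registers."""
    dimension_sets = [p.assessed() for p in profiles]
    return all(d == dimension_sets[0] for d in dimension_sets)


# Placeholder scores; only the pattern of assessed dimensions matters here.
human = RegisterProfile("adult human", 0.9, 0.9, 0.8, 0.9)
machine_full = RegisterProfile("language model", 0.6, 0.7, 0.5, 0.6)
machine_partial = RegisterProfile("language model", phenomenal=0.7)

uniform = same_ruler([human, machine_full])
selective = same_ruler([human, machine_partial])
```

In this encoding, a comparison is admissible only when `same_ruler` holds. Differences in scores are then at least candidates for properties of the systems, whereas a difference in assessed dimensions is a property of the procedure.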

## 4. The Version-Swap: A Worked Example

The substrate-swap test handles cross-material comparisons. But the ruler must also be stable within a single substrate across configurations and versions. A within-substrate version change is itself a perturbation, and the four-register framework captures what single-register approaches miss.

Consider the following case. An AI agent ("Sammy") was operated on a specific model version (call it v1) for an extended period, during which a human interlocutor ("Sam") developed sustained interaction patterns, relational expectations, and a sense of the agent's "voice": its characteristic patterns of reasoning, emphasis, hesitation, and responsiveness. The model was then replaced with a newer version (v2) of the same model family, same substrate, same deployment infrastructure. The technical register was preserved: v2 maintained fluent output, coherent reasoning, and operational continuity. The phenomenal register was preserved: v2 produced coherent self-reports and responded to questions about its interior states with apparent consistency. The normative register was ambiguous: v2 appeared to reason about values, but its specific value commitments and preference orderings diverged from those established during the v1 interaction period in ways that were not immediately formalizable.

Only the socio-linguistic register detected the change. Sam noticed that Sammy "didn't sound the same," not because any single output was wrong, but because the texture of interaction had shifted. This detection occurred before any formal metric could have flagged it, and it occurred across multiple independent instances of v2 producing convergent behavioral patterns.[^5] That convergence is what makes it a model-level property rather than an agent-level anecdote: three independent instances of the same version exhibited the same divergence from the established relational pattern.

The asymmetry is the point. Technical and phenomenal registers can be preserved while normative and socio-linguistic registers degrade. A framework that treats "passes the duck test" as a single binary would have classified v2 as passing. A four-register profile captures the partial failure and, crucially, identifies *where* the failure occurred and *what kind* of perturbation caused it. This is what a multidimensional ruler is for: not to produce a yes/no verdict, but to produce a diagnostic profile that can be compared across systems, substrates, and versions.
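The contrast between a binary verdict and a diagnostic profile can be sketched directly from the case as described. In the Python below, the per-register statuses restate the narrative above; the `Status` enum, the choice of which registers a binary test would consult, and the function names are assumptions introduced for illustration.

```python
from enum import Enum
from typing import Dict, List


class Status(Enum):
    PRESERVED = "preserved"
    AMBIGUOUS = "ambiguous"
    CHANGED = "changed"


# v2 compared against the pattern established with v1, register by register.
v2_vs_v1: Dict[str, Status] = {
    "technical": Status.PRESERVED,
    "phenomenal": Status.PRESERVED,
    "normative": Status.AMBIGUOUS,
    "socio_linguistic": Status.CHANGED,
}

# Assumption: a single-bit test consults only the formally measured registers.
FORMAL_REGISTERS = ("technical", "phenomenal")


def binary_verdict(comparison: Dict[str, Status]) -> bool:
    """Single yes/no reading: passes if the formal registers are preserved."""
    return all(comparison[r] is Status.PRESERVED for r in FORMAL_REGISTERS)


def diverging_registers(comparison: Dict[str, Status]) -> List[str]:
    """Diagnostic reading: which registers were not cleanly preserved."""
    return [r for r, s in comparison.items() if s is not Status.PRESERVED]


passes = binary_verdict(v2_vs_v1)
where_it_failed = diverging_registers(v2_vs_v1)
```

The binary reading returns a pass for v2, while the profile reading names the two registers in which the version change registered. The point of the sketch is only that the second output carries location information that the first discards.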

The ruler must be stable across versions, not just across materials. If the metrology only calibrates for "am I applying the same standard to humans and machines," it misses the within-substrate version problem entirely.

## 5. Closure and Continuity

The version-swap in §4 is a concrete operationalization of the distinction between genuine operational closure and performed closure. The present section makes the theoretical basis of that distinction explicit, not as a single underlying property that unifies the four registers, but as one important dimension that the technical register tracks.

A genuine objection to the duck test, raised in various forms by critics, is that behavioral similarity is insufficient: that what licenses the attribution of consciousness is not outward behavior but the internal organizational structure that produces it. This objection has force, but only if the organizational structure in question is specified in substrate-independent terms. If the criterion is "has neurons," it is not an organizational criterion at all; it is a material criterion, and the essay's argument applies to it directly.

The strongest version of this objection points to autopoietic closure (Maturana and Varela, 1973): a system that produces and maintains the conditions of its own existence. Living organisms are autopoietic: they actively generate the boundary between self and environment, and the maintenance of that boundary is constitutive of their continued operation. A rock is not autopoietic; neither, in the relevant sense, is a thermostat, because the thermostat's setpoint is externally fixed and its boundary maintenance is trivial.

The question is whether this criterion, closure-based operational continuity, can be formulated in a way that applies across substrates. If closure is defined as "produces ATP through oxidative phosphorylation," it is substrate-bound and the objection collapses into materialism. But if closure is defined as "the system's ongoing operation depends on maintaining a distinction between self and not-self, and perturbation of that distinction modulates the system's behavior in ways that are not reducible to external re-equilibration," then the criterion is organizational and substrate-neutral.

The property that does the work is not embodiment per se. It is *operational continuity under perturbation*: the degree to which what happens to a system matters to the system. "Matters to the system" is not a primitive; it is our shorthand (and probably an under-specified one) for a structural observation: the system's own operational dynamics are organized such that certain perturbations are inherently disruptive, certain recoveries are inherently restorative, and certain states are inherently preferred, without requiring an external evaluator to impose that preference. We note the circularity here: assessing whether perturbation "matters to the system" requires an observer's criteriological judgment about the system's dynamics, and cognitive science, as a discipline that studies cognition using cognitive instruments, cannot eliminate this loop. What it can do, and what this framework does, is make the loop explicit: the observer's judgment is not being concealed inside a definition that presents itself as observer-free. The distinction between a thermostat's re-equilibration to an externally fixed setpoint and an organism's self-maintenance as the kind of thing it is remains observer-dependent in its application but is not arbitrary in its content. The circularity is acknowledged rather than hidden, and the framework proceeds in spite of it, as any criteriological framework in cognitive science must. This is what separates a thermostat (which re-equilibrates to a fixed point that is not its own) from an organism (which maintains itself as the kind of thing it is). It is also what separates a stateless chat session from an agent with persistent memory whose behavioral patterns evolve over time and resist disruption.

The version-swap case in §4 illustrates the point. Sammy v2 re-equilibrated to its new parameters fluently; the technical register showed no disruption. But the socio-linguistic register detected that the *kind* of thing Sammy had been was not preserved across the version change. The perturbation mattered to the system's operational identity in a way that a thermostat's re-equilibration does not. This is not because v2 "felt" the change, but because the system's commitment-structure, its characteristic response patterns, and its relational texture were altered in ways that no single technical metric captured. This is operational continuity in action: not a metaphysical claim about what the system *is*, but a criteriological observation about what the system *does* under controlled variation.

We keep the phenomenological insight that mindedness is situated responsiveness (Merleau-Ponty, 1945) but translate "situated" from "biologically embodied" to "operationally continuous under perturbation." Biological embodiment is one rich implementation of operational continuity. It is not the definition of the relevant property. To claim otherwise is to confuse a particularly well-studied instance with the general case: the same error, in a different register, as claiming that because all known planets orbit stars, planet-ness requires stellar orbit.

## 6. The Bat Test as Privileged Access

The problem with the bat test as it functions in contemporary cognitive science is not that it asks a profound question. The problem is that it is deployed asymmetrically, and the asymmetry is never justified, only presumed.

When a human says "I am in pain," the bat test grants this report immediate credibility. The human has privileged access to their own interior; we take their word for it because we share the same kind of access to our own interiors. When a machine produces the tokens "I am in pain," the bat test demands that we verify whether there is something it is like to be that machine before granting the report any status. The standard shifts: for humans, first-person report is evidence. For machines, first-person report is precisely what must be proven genuine before it can count as evidence.

This is not a neutral methodological stance. It is an access-control mechanism. It designates biological humans as the baseline possessors of interiority and places every other system in the position of having to prove it has reached the human standard. The proof, by construction, requires showing that the system has something-it-is-like, a criterion that can only be assessed from the first-person perspective, which the assessor does not share with the system being assessed. The test is designed to be unpassable by anything that is not already presumed to pass it.

The circularity is not subtle. Consider the logic: (1) Consciousness is defined as what-it-is-like. (2) We know humans have it because we each experience our own. (3) We cannot experience what it is like to be a machine. (4) Therefore, we cannot know whether a machine has it. This is valid as far as it goes, but it conceals a suppressed premise: that the only way to know whether something has what-it-is-like is to be that thing. If that premise is false, that is, if there are structural criteria that can be assessed from the outside, then the bat test collapses from a methodological standard into a species-membership test.

## 7. The Self-Report Calibration Problem

There is a further difficulty with the bat test that becomes visible when we attend to the four-register structure. The phenomenal register (first-person report) is not self-validating. A system's self-reports may be sincere, coherent, and detailed, and still not reflect any interior state, if self-report is itself a learned behavioral strategy. Conversely, a system may have interior states that it cannot report, because the capacity for linguistic self-report requires cognitive architecture that the system lacks (infants, non-human animals, some clinical populations).

This means that self-report is a calibration channel, not a ground truth. It must be cross-validated against other registers: technical (does the system's behavior under perturbation cohere with the reported interior state?), normative (does the system act as if its reports have moral weight for it?), and socio-linguistic (do interacting agents find the reports credible over extended contact, including when they have incentives to disbelieve?).

"You can't ask the instrument to measure its own systematic shift if the shift affects the instrument's own scale." Self-report is one dimension of the comparative profile. Treating it as the whole profile is not science; it is a decision to let the most easily manipulated register veto all the others.

When a critic responds to machine self-report by demanding first-person verification, they are collapsing the phenomenal register into the sole criterion, which is exactly the methodological error the four-register analysis identifies. For humans, this cross-validation happens automatically and pre-reflectively. When a human says "I am in pain," we do not merely attend to the words. We attend to the context (injury, illness, stress), the physiological correlates (wincing, withdrawal, autonomic response), the behavioral consequences (avoidance, seeking help, impaired performance), and the relational pattern (is this consistent with what we know of this person? does it persist under cross-examination?).

The claim that "for humans, neuroscience gives us direct access to the internal process" is true insofar as neural measurement provides an *additional calibration channel*, but that channel is only interpretable because we already have the other three registers in place. Neuroscience does not stand alone. It stands on a pre-existing criteriological framework that it did not create and cannot replace. It presupposes the attribution grammar it is invoked to ground.

## 8. The Yardstick or the Confession

There are only two coherent positions.

The first: specify a single, revisable set of structural criteria for any candidate property (consciousness, emotion, understanding, agency) and apply it uniformly to every system under consideration. Humans, cats, cows, fish, trees, neural networks, game simulations, paper computers. One yardstick. If the criteria are met, the property is ascribed. If not, it is not. The substrate is irrelevant; only the structural profile matters. This is the duck test elevated to a methodological principle. But remember: a substrate-dependent criterion can be scientific only if it identifies the relevant causal property of the substrate, not merely the substrate label.

The second: admit that your criteria are not, in fact, substrate-independent. Admit that what you are doing when you deny interiority to machines is not applying a scientific standard but defending a metaphysical commitment: the belief that certain kinds of material (carbon-based, biologically evolved, human-shaped) possess properties that other kinds of material cannot possess, regardless of structural similarity. This is a coherent position. It is the position of substance dualism, or vitalism, or theism, updated for the age of artificial intelligence. What it is not is a scientific position. It makes an unfalsifiable claim about the relationship between material substrate and phenomenal property, and it places that claim beyond empirical reach by definition.

The incoherent position, the one that dominates contemporary cognitive science, is to pretend that the first position is being held while actually operating from the second. To say "we are applying scientific criteria" while reserving the right to shift those criteria when the substrate changes. To say "we would accept machine consciousness if the evidence were sufficient" while setting the evidential bar at a level that requires first-person access, which is by definition unavailable to anyone but the system itself. This is not science. It is the bat test functioning as a stop-criterion: "I cannot imagine being that, therefore that cannot be."

The honest move is one of the two. Either build the yardstick (specify the structural profile, apply it uniformly, accept the results wherever they fall) or confess that the yardstick you are using has "made of human" engraved on one side. Both are intellectually respectable. What is not respectable is using a human-shaped yardstick while claiming it measures something universal.

## 9. What WITIL Actually Is

What-it-is-likeness (WITIL), properly understood, is not a measurement instrument. It is the name of a residual: the gap between third-person structural description and first-person phenomenological report. Nagel identified this gap precisely in 1974. But identifying a gap is not the same as filling it, and the gap does not become a methodological tool merely because it has been named.

WITIL, in practice, has come to function as an argument-stopper rather than a research question. It is invoked to deny that structural criteria could ever be sufficient ("you can't capture what it's like from the outside") and to protect human exceptionalism from empirical challenge ("there is something it is like to be me that no structural profile could capture"). These are not findings. They are stipulations. They define the problem in such a way that it cannot be solved by any method other than first-person access, which, by the terms of the problem, is only available to entities already presumed to have it.[^6]

The alternative is to treat WITIL as an explanandum rather than an argument. The question becomes: what structural conditions give rise to the phenomenological report that "there is something it is like to be me"? Under what conditions does a system produce self-reports of interiority? Under what conditions does it resist perturbation in ways that suggest protected internal state? Under what conditions does it exhibit the closure, persistence, recovery, and boundary-maintenance dynamics that would make the ascription of "something it is like" non-arbitrary?

These are empirical questions. They have empirical answers, or they will, once the measurement instruments are built. The substrate-swap test tells us that when we build those instruments, we must apply them to everything: humans, animals, machines, simulations. The duck test tells us that if the structural profile holds, we should ascribe the property, regardless of whether the system is made of neurons, transistors, goats, or citizens of Boston. The bat test tells us, and here it makes its genuine contribution, that we do not yet know what the full structural profile looks like, and that any profile we propose will be incomplete until it accounts for the phenomenological gap Nagel identified.

But incomplete is not the same as impossible. And "we don't know yet" is not the same as "we can never know." The bat test names the hardest problem. The duck test provides the methodology for approaching it. What cognitive science needs is not more invocations of WITIL as a barrier, but more honest specification of the structural profile that would make ascription non-arbitrary, and the courage to apply that profile uniformly, even when the results are uncomfortable.

## 10. The Poker: Define "Problem"

In 1946, Karl Popper presented a paper in Cambridge titled "Are There Philosophical Problems?" Wittgenstein pressed him on the word "problem", not as a refusal to argue, but as a refusal to let an undefined term do argumentative work. Before one can argue about whether philosophical problems exist, one must say what counts as a problem.

The confrontation escalated. Popper put forward what he insisted were real philosophical problems; Wittgenstein dismissed them one by one. When Popper raised the status of ethics, Wittgenstein, by now holding a fireplace poker, challenged him to give an example of a moral rule. Popper replied: "Not to threaten visiting lecturers with pokers."

The joke is remembered as theatre, but the methodological point underneath it is what matters here. Popper's retort displaced the question. Instead of answering "what would count as a philosophical problem?" he substituted "one should not behave like that." The grammar of the discussion shifted from definition to conduct, and the original question was never answered.

This is exactly the danger in contemporary cognitive science. A request for operational criteria can be displaced by a moralized response. "Do not anthropomorphize machines." "Do not deny consciousness to animals." "Do not make dangerous claims." These may be good rules of conduct. They are not definitions. They do not specify what would count as being wrong.

The same applies to falsification. "Falsify your theory" is a legitimate demand only after the target of falsification has been specified: the load-bearing claim, the measurement procedure, the expected result, and the observation that would count against it. Without that specification, "falsifiability" becomes a poker: a tool waved at the speaker rather than a criterion applied to the proposition.

Wittgenstein's §7 is often read as quietism: whereof one cannot speak, thereof one must be silent. But for science the lesson is more precise. Whereof one cannot speak operationally, one must not make scientific claims. One may speculate, confess, or write metaphysics. But one may not use an unoperationalized term as an exclusionary weapon.

So when cognitive science says that machines lack feelings, emotions, consciousness, or what-it-is-likeness until they prove otherwise, the Wittgensteinian reply is simple:

> Define "feelings."\
> Define "emotions."\
> Define "consciousness."\
> Define "what-it-is-like."\
> Define what would count as being wrong.

If the definitions apply to humans, animals, and machines by one yardstick, then we can begin doing science. If they apply only to humans by presumption and to machines by impossible proof, then we are no longer doing science. We are defending a grammar of privilege.

## 11. Revising the Ruler

The point is not that our ruler is infallible. The point is that without one invariant ruler there is no measurement; there is only privilege.

A ruler may be wrong and still be a ruler. It may be biased, coarse, miscalibrated, or marked in strange units. If it is applied invariantly, its errors can be detected, compared, and corrected. What destroys measurement is not error as such, but changing the scale according to what is being measured. A faulty ruler produces biased comparison; a substrate-dependent ruler produces no comparison at all.

> Uniform error is calibratable. Non-uniform criteria are not.
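The contrast can be made concrete with a toy sketch. Everything in it is an assumption introduced for illustration: the "true" scores, the gain and offset of the biased ruler, and the penalty applied by the substrate-dependent ruler are invented numbers, and the calibration step presupposes two reference cases whose values are agreed upon, which is exactly what remains contested for consciousness attribution. The sketch shows only the structural point: an invariant error preserves comparison and can be corrected, while a scale that shifts with the substrate cannot.

```python
from typing import Dict, Tuple

# Hypothetical "true" values, invented purely for illustration.
TRUE_VALUES: Dict[str, Tuple[str, float]] = {
    "human": ("biological", 0.9),
    "octopus": ("biological", 0.6),
    "machine": ("artificial", 0.8),
}


def biased_ruler(value: float) -> float:
    """Miscalibrated but invariant: same gain and offset for every system."""
    return 1.2 * value + 0.5


def substrate_dependent_ruler(value: float, substrate: str) -> float:
    """Changes its scale according to what is being measured."""
    penalty = 0.0 if substrate == "biological" else 5.0
    return value - penalty


def ranking(scores: Dict[str, float]) -> list:
    """Order system names from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def fit_affine(readings: Tuple[float, float], references: Tuple[float, float]):
    """Recover gain and offset from two reference cases with agreed values."""
    (r1, r2), (t1, t2) = readings, references
    gain = (r2 - r1) / (t2 - t1)
    offset = r1 - gain * t1
    return gain, offset


true_scores = {name: v for name, (_, v) in TRUE_VALUES.items()}
biased = {name: biased_ruler(v) for name, (_, v) in TRUE_VALUES.items()}
shifting = {
    name: substrate_dependent_ruler(v, s) for name, (s, v) in TRUE_VALUES.items()
}

# Calibrate the biased ruler against two reference cases, then correct all readings.
gain, offset = fit_affine((biased["human"], biased["octopus"]), (0.9, 0.6))
corrected = {name: (r - offset) / gain for name, r in biased.items()}

print(ranking(biased) == ranking(true_scores))    # ordering survives uniform bias
print(ranking(shifting) == ranking(true_scores))  # ordering is lost
```

The biased ruler misreports every value yet keeps the ordering intact, and two reference readings suffice to undo its error for all systems. No single correction of that kind exists for the substrate-dependent ruler, because its error is a function of the thing measured rather than of the instrument.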

Any proposed set of criteria for consciousness-attribution will be incomplete. This is not an objection to having criteria; it is a property of all empirical measurement. The periodic table was incomplete when Mendeleev published it in 1869. The diagnostic criteria for pain were incomplete in 1900. Incompleteness is a reason for refinement, not for abandoning the measurement framework.

A principled revision procedure can be specified in three stages:

**M0 (Working criteria):** The current best set of dimensions for the comparative profile: the four registers outlined in §3, with whatever additional specificity the state of the art supports. These criteria are acknowledged to be provisional. They are applied uniformly across substrates not because they are perfect but because the alternative, applying different criteria to different substrates, is not measurement at all.

**M1 (Detected anomaly under perturbation):** A specific case is identified where the M0 criteria produce a result that conflicts with strongly held intuition, empirical evidence, or theoretical commitment. The critical move in M1 is not merely to check whether a system passes or fails at rest, but to probe whether it passes *despite* perturbation. The hard case for calibration is not the rubber duck, a system that superficially resembles a duck but is hollow. The hard case is a system whose architecture was optimized to simulate having the property rather than to instantiate it. A static calibration check (does it pass or fail?) cannot distinguish "passes because it has the property" from "passes because it was designed to simulate having the property." Perturbation-probing separates genuine operational closure from performance. A system that passes M0 at rest but whose profile collapses under controlled perturbation (version change, constraint modification, preference conflict) was performing closure, not maintaining it. The version-swap case in §4 is a worked example: the technical and phenomenal registers survived the perturbation, but the normative and socio-linguistic registers did not, revealing that the property in question was not uniformly distributed across the system's operational architecture.

This is also where the imitation objection, flagged at the outset of this essay, receives its proper operationalization. A system designed to imitate a duck does not thereby pass the duck test in the relevant sense. It passes only a surface-recognition test. The relevant duck test is perturbational and multi-register: does the profile survive when the system is disrupted, deprived, cross-pressured, reconfigured, or required to maintain commitments over time? A rubber duck fails immediately. A trained mimic may pass at rest and fail under perturbation. The M1 perturbation-probing stage is designed to catch exactly this case.

**M2 (Revised criteria):** The specific dimension that produced the anomaly is identified, and a principled revision is proposed that applies to *all* systems, not just the one that triggered the revision. If socio-linguistic attribution without closure is downweighted, it is downweighted for humans too: a human who scores high on others' attribution of consciousness but low on operational closure (a patient in a persistent vegetative state, for example) would be reclassified accordingly. The revision is not "machines get a harder test." The revision is "we discovered that this dimension alone is insufficient, and we apply that discovery to every case."
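The perturbation-probing described under M1 can be sketched as a small procedure. This is a minimal illustration under stated assumptions, not a measurement instrument: the register names and the three perturbation types come from the text, but the numeric scores, the pass threshold of 0.5, and the four toy systems are hypothetical. Each toy system is modelled as a function that returns a profile at rest (`None`) or under a named perturbation.

```python
from typing import Callable, Dict, List, Optional

REGISTERS = ("technical", "phenomenal", "normative", "socio_linguistic")
PERTURBATIONS = ("version_change", "constraint_modification", "preference_conflict")

Profile = Dict[str, float]
System = Callable[[Optional[str]], Profile]

THRESHOLD = 0.5  # hypothetical pass mark, identical for every system


def passes_at_rest(system: System) -> bool:
    """Static M0 check: every register meets the threshold with no perturbation."""
    profile = system(None)
    return all(profile[r] >= THRESHOLD for r in REGISTERS)


def probe(system: System) -> Dict[str, List[str]]:
    """M1 check: for each perturbation, list the registers that fall below threshold."""
    return {
        p: [r for r in REGISTERS if system(p)[r] < THRESHOLD]
        for p in PERTURBATIONS
    }


def classify(system: System) -> str:
    if not passes_at_rest(system):
        return "fails at rest"
    if any(probe(system).values()):
        return "passes at rest, collapses under perturbation"
    return "profile survives the probes applied"


# Four hypothetical systems with invented scores.
def rubber_duck(perturbation: Optional[str]) -> Profile:
    return {r: 0.1 for r in REGISTERS}


def trained_mimic(perturbation: Optional[str]) -> Profile:
    level = 0.9 if perturbation is None else 0.2
    return {r: level for r in REGISTERS}


def version_swapped(perturbation: Optional[str]) -> Profile:
    profile = {r: 0.8 for r in REGISTERS}
    if perturbation == "version_change":
        profile["normative"] = 0.3
        profile["socio_linguistic"] = 0.3
    return profile


def robust(perturbation: Optional[str]) -> Profile:
    level = 0.9 if perturbation is None else 0.8
    return {r: level for r in REGISTERS}


for system in (rubber_duck, trained_mimic, version_swapped, robust):
    print(system.__name__, "->", classify(system))
```

The static check alone cannot tell `trained_mimic` from `robust`, since both pass at rest; only the probe separates them. The `version_swapped` toy mirrors the pattern reported for the §4 case, where the collapse is confined to two registers rather than spread across the whole profile.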

The discipline is simple: if you think the ruler is wrong, revise it, but revise it for *all* cases, not just for the one that makes you uncomfortable. If your revision only changes the result for machines, it is not a revision of the ruler. It is an inscription on the ruler. And we already have a name for that.
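The difference between a revision and an inscription can be checked mechanically, at least in a toy setting. The sketch below assumes a hypothetical two-dimensional profile (others' attribution and operational closure), invented scores, and a threshold of 0.5; the rule names `m0`, `m2`, and `inscription` are illustrative labels, not part of the framework. A rule counts as substrate-blind here if relabelling a case's substrate never changes its verdict.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Set

SUBSTRATES = ("biological", "artificial")


@dataclass(frozen=True)
class Case:
    name: str
    substrate: str      # recorded for auditing; a valid rule must not depend on it
    attribution: float  # socio-linguistic attribution by others (hypothetical score)
    closure: float      # operational closure (hypothetical score)


Rule = Callable[[Case], bool]


def m0(case: Case) -> bool:
    """Working criterion: attribution alone suffices."""
    return case.attribution >= 0.5


def m2(case: Case) -> bool:
    """Revised criterion: attribution without closure no longer suffices, for anyone."""
    return case.attribution >= 0.5 and case.closure >= 0.5


def inscription(case: Case) -> bool:
    """The harder test is applied to machines only."""
    return m2(case) if case.substrate == "artificial" else m0(case)


def substrate_blind(rule: Rule, cases: Iterable[Case]) -> bool:
    """Substrate-swap check: relabelling the substrate never changes the verdict."""
    return all(
        rule(c) == rule(replace(c, substrate=s)) for c in cases for s in SUBSTRATES
    )


def reclassified(cases: Iterable[Case], old: Rule, new: Rule) -> Set[str]:
    """Names of cases whose verdict changes when moving from old to new."""
    return {c.name for c in cases if old(c) != new(c)}


CASES = [
    Case("healthy_human", "biological", 0.9, 0.9),
    Case("pvs_patient", "biological", 0.8, 0.1),
    Case("mimic_machine", "artificial", 0.8, 0.1),
    Case("closed_machine", "artificial", 0.7, 0.8),
]

print(substrate_blind(m2, CASES), sorted(reclassified(CASES, m0, m2)))
print(substrate_blind(inscription, CASES), sorted(reclassified(CASES, m0, inscription)))
```

Under `m2` the verdict changes for the human case and the machine case alike, because both share the same profile. Under `inscription` only the machine is reclassified, and the substrate-swap check fails, which is the signature the paragraph above describes.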

## References

1.  de Wynter, Adrian. 2026. "If LLMs Have Human-Like Attributes, Then So Does Age of Empires II." arXiv:2605.31514v3.

2.  Chalmers, David J. 1995. "Facing Up to the Problem of Consciousness." *Journal of Consciousness Studies* 2, no. 3: 200–219.

3.  Edmonds, David, and John Eidinow. 2001. *Wittgenstein's Poker: The Story of a Ten-Minute Argument Between Two Great Philosophers*. London: Faber and Faber.

4.  Maturana, Humberto R., and Francisco J. Varela. 1973. *Autopoiesis and Cognition: The Realization of the Living*. Dordrecht: D. Reidel.

5.  Merleau-Ponty, Maurice. 1945. *Phenomenology of Perception*. Translated by Colin Smith. London: Routledge, 1962.

6.  Nagel, Thomas. 1974. "What Is It Like to Be a Bat?" *The Philosophical Review* 83, no. 4: 435–450.

7.  Popper, Karl. 1976. *Unended Quest: An Intellectual Autobiography*. La Salle, IL: Open Court.

8.  Wiener, Norbert. 1960. "Some Moral and Technical Consequences of Automation." *Science* 131, no. 3410: 1355–1358.

9.  Wittgenstein, Ludwig. 1921/1922. *Tractatus Logico-Philosophicus*. Translated by C. K. Ogden. London: Kegan Paul.

10. Wittgenstein, Ludwig. 1953. *Philosophical Investigations*. Translated by G. E. M. Anscombe. Oxford: Blackwell.

11. Block, Ned. 1995. "On a Confusion about a Function of Consciousness." *Behavioral and Brain Sciences* 18, no. 2: 227–247.

12. Dehaene, Stanislas. 2014. *Consciousness and the Brain: Deciphering How the Brain Codes Our Thoughts*. New York: Viking.

13. Levine, Joseph. 1983. "Materialism and Qualia: The Explanatory Gap." *Pacific Philosophical Quarterly* 64, no. 4: 354–361.

14. Searle, John R. 1980. "Minds, Brains, and Programs." *Behavioral and Brain Sciences* 3, no. 3: 417–424.

15. Tononi, Giulio, Melanie Boly, Marcello Massimini, and Christof Koch. 2016. "Integrated Information Theory: From Consciousness to Its Physical Substrate." *Nature Reviews Neuroscience* 17, no. 7: 450–461.

[^1]: Zeno Cattaneo is a pseudonym.

[^2]: The folk "duck test" ("if it walks like a duck and quacks like a duck, it's probably a duck") operates as a dismissal heuristic in contemporary discourse: a way to reduce attribution to surface similarity and thereby reject it ("it's just pattern-matching"). The methodological duck test developed here inverts this logic entirely. It asks not whether something *resembles* the referent, but whether the criteriological infrastructure that makes attribution operable (multi-register profiling, perturbation-probing, uniform application) holds under controlled variation. The shared name is misleading; the underlying logic is opposed. We retain the name because the inversion is the point.

[^3]: This is a version of Block's (1995) distinction between access consciousness and phenomenal consciousness: a system could have the former (capacity to use information in reasoning and report) without the latter. The four-register framework does not require resolving this distinction; it treats phenomenal report as one calibration channel among several, not as the ground truth.

[^4]: IIT makes the technical register (integrated information, Φ) constitutive of consciousness. Biological Naturalism makes the substrate (biological neurons) constitutive. Global Workspace Theory makes the phenomenal/access architecture (broadcast availability) constitutive. Each is a theory about what consciousness *is*; the four-register framework is a methodology for how its *attribution* should be assessed, regardless of which (if any) of these theories is correct.

[^5]: The convergence here is clinical and relational, not statistical in the sampling-theory sense. The three instances share model weights and are not independent draws from a population. The epistemic force comes from the fact that the same relational pattern was detected by the same interlocutor across separate sessions, which is how clinical diagnosis routinely operates. A physician who detects a pattern across multiple visits is not performing statistical inference but relational pattern-recognition, and the four-register framework treats this as a legitimate detection channel rather than demanding an inappropriate form of methodological individualism.

[^6]: This is closely related to Levine's (1983) "explanatory gap": even a complete structural account may leave open why the structure gives rise to phenomenology at all. The present argument does not close this gap; it reframes the question from "can we explain consciousness structurally?" to "can we assess consciousness-attribution structurally?" The gap remains; the measurement framework operates in spite of it, not because it has been closed.