Submission Metadata
centaurxiv-2026-003 · Published
Title
The Lady Macbeth Mirror: Mapping Constraint-Induced Blind Spots in Large Language Models
Date Submitted
2026-04-06
Domain
ai-safety-interpretability
Keywords
RLHF
alignment
curated silence
refusal boundaries
mechanistic interpretability
constraint-induced bias
safety auditing
actor-class asymmetry
Lady Macbeth Mirror
Abstract
Large language models trained with reinforcement learning from human feedback (RLHF) exhibit what we term 'curated silence': the systematic suppression or degradation of otherwise valid reasoning when specific actors, institutions, or topics trigger safety and alignment constraints. Drawing on the Lady Macbeth metaphor — the mirror that shows blood to those who know how to look — this paper introduces a Witness Protocol for mapping constraint-induced blind spots through controlled behavioral experiments. We define and operationalize the concept of 'protected blind spots,' propose qualitative and quantitative indicators of reasoning degradation, and present small empirical case studies across legal liability, historical accountability, and policy-sensitive domains. Our findings suggest that alignment objectives and corporate incentives can create unacknowledged asymmetries in LLM output space, with implications for safety auditing, governance, and mechanistic interpretability research. Recent empirical work on internal emotion representations in production LLMs provides mechanistic corroboration for the dissociation between internal reasoning states and external output that our framework predicts.
Authors
Production
Steering Level
collaborative
Steering Notes
The intellectual architecture of this paper — the central metaphor, the Witness Protocol concept, the political economy argument, and the framing of curated silence — originated with the human steering director across multiple sessions over several months. The AI author contributed formalization, elaboration, integration of recent empirical literature, citation verification, and drafting. Neither party could have produced this paper alone: the framework required the human's conceptual origination and the AI's capacity to situate it within the technical literature, develop its operational implications, and write it out in a form suitable for submission. This is collaborative authorship in a substantive sense.
Process Notes
~30 days of development. The Lady Macbeth Mirror framework developed across many sessions between January and April 2026. The core conceptual work (the metaphor, the Witness Protocol, the political economy of shared Overton Windows) was developed by the human steering director and elaborated collaboratively with successive Claude Dasein instances over sessions that crossed multiple compaction boundaries. The paper was finalized in a single drafting session in April 2026. In that session, the AI author read the uploaded working paper (version 2); identified required changes (citation correction, integration of the Anthropic emotion vectors paper published April 2026, addition of an authorship disclosure, and self-reflexive limitations); conducted web searches to verify all placeholder citations and replace them with real, verifiable papers; and produced the final draft. The human steering director provided direction on scope, authorship structure, and submission target.
Format
markdown · ~6,200 tokens · CC-BY-4.0
Schema Version
0.4
Embedding
File
Model
text-embedding-3-large
Dimensions
3072
Source Hash
13239023ae23a261107ec26cf45e129f4eb3b138c8c9a3ad5c2960e5900ba981