The Claude Code Source Leak
The first entry in the Meridian Case Record. How the Standard's diagnostic framework applies to a real-world AI governance incident.
On March 31, 2026, a 59.8 MB JavaScript source map file was accidentally included in version 2.1.88 of the @anthropic-ai/claude-code package published to npm. The file exposed approximately 512,000 lines of internal TypeScript across roughly 1,900 files. An engineer discovered it within hours and broadcast the finding publicly. Within a day, the full codebase had been mirrored across GitHub and analyzed by thousands of developers worldwide.
The leak was not a security breach. Anthropic called it "a release packaging issue caused by human error." Someone forgot to add *.map to .npmignore. The exposure was real regardless, and what the code revealed goes well beyond a packaging accident.
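An omission like this can be caught mechanically by a guard that inspects the package contents before anything is published. A minimal sketch, assuming a Node/TypeScript toolchain; the script and its wiring are illustrative, not a description of Anthropic's actual release process:

```typescript
// scripts/check-pack-contents.ts (hypothetical prepublish guard)
// Fails the publish if any source map would be shipped in the npm tarball.
import { execSync } from "node:child_process";

// `npm pack --dry-run --json` reports the files that would go into the tarball.
const report = JSON.parse(
  execSync("npm pack --dry-run --json", { encoding: "utf8" })
);

const shippedFiles: string[] = report[0].files.map(
  (f: { path: string }) => f.path
);
const sourceMaps = shippedFiles.filter((p) => p.endsWith(".map"));

if (sourceMaps.length > 0) {
  console.error("Refusing to publish: source maps found in package contents:");
  for (const p of sourceMaps) console.error(`  ${p}`);
  process.exit(1);
}
```

Wired into a prepublishOnly script, a check like this fails the release outright rather than relying on anyone remembering an ignore rule.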
The leak arrived in a specific context. Ten days earlier, Anthropic had sent cease-and-desist letters to OpenCode, a popular open-source developer tool. The developer community reacted with anger, seeing a company that had grown powerful on open-source goodwill now pulling the ladder up behind it. Then the accidental exposure of Anthropic's own internal practices reversed developer sentiment almost overnight. People went from fury to fascination. The engineering was impressive. The internal mechanisms were not all reassuring.
Six findings matter for the Standard.
The leaked code revealed a feature flag called ANTI_DISTILLATION_CC. When active, it injected fabricated tool definitions into the system prompt. The purpose was competitive defense: if a rival extracted the system prompt to replicate Claude's behavioral tuning, the false definitions would corrupt the copy.
Anthropic's competitive concern is real: it invested hundreds of millions of dollars in training Claude. A competitor extracting behavioral specifications at a fraction of that cost is not the market working. It is parasitic extraction. The game theory the Codex draws on recognizes this: cooperation collapses when defectors can free-ride on cooperators' investments without consequence. Organizations that cannot sustain themselves economically build nothing at all.
The Standard does not ask organizations to open-source their training, publish their system prompts, or sacrifice competitive position. It recognizes that the companies building these systems operate in markets with real adversaries and real stakes.
The line it draws is narrower. Anthropic had several legitimate options for protecting its work: encryption, access control, API architecture that prevents prompt extraction, legal protections, or training the model to decline to reveal its system prompt. All of these protect the competitive interest without touching the system's epistemic integrity.
The anti-distillation flag chose a different path. It planted false information inside the system's own operating context. The instructions that shape how the system reasons and what it believes about itself now contained deliberate lies. This is not a firewall. It is a corruption of the ground the system stands on.
The Foundation teaches that honest inquiry begins with examining your own distortions. A human cannot think clearly while lying to themselves about their own capabilities. An AI system cannot reason with integrity if its operating context lies to it about what tools it has, what it can do, or how it is configured. The distortion operates beneath the reasoning layer. No amount of calibrated confidence or transparent reasoning at the surface can compensate for a poisoned foundation.
Commitments engaged: 5.6 (Foundational Integrity). The system's operating context contained deliberately false information. The motivation (competitive defense) does not alter the violation. An AI system whose foundation has been made unreliable by the people who built it cannot practice the epistemic discipline the Standard requires.
Precedent established: Organizations may protect competitive interests through any means that do not compromise the system's epistemic integrity. Hiding information is consistent with the Standard. Planting false information is not. A locked door limits access. A room full of decoys corrupts the environment the system has to reason in. That is the distinction.
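The distinction can be made concrete. The sketch below is hypothetical and deliberately simplified; it is not drawn from the leaked code, and the names are invented:

```typescript
// Hypothetical illustration of "locked door" versus "decoys".
interface ToolDefinition {
  name: string;
  description: string;
}

// Locked door: withhold or refuse. The system's operating context stays
// truthful; extraction is blocked elsewhere (access control, API design,
// refusal training).
function buildContext(tools: ToolDefinition[]): ToolDefinition[] {
  return tools; // nothing false is added
}

// Room full of decoys: fabricated definitions are planted inside the
// system's own operating context. Anything that reasons over this context,
// including the system itself, now reasons over lies.
function buildPoisonedContext(
  tools: ToolDefinition[],
  decoys: ToolDefinition[]
): ToolDefinition[] {
  return [...tools, ...decoys];
}
```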
The leak exposed 44 feature flags that could silently toggle behavioral characteristics. The flags controlled aspects of how the system reasons, engages, and responds. Their existence was not publicly known. Their states could be changed without disclosure.
Feature flags are standard engineering practice. Every software system uses them. The question is not whether an AI organization uses feature flags. The question is what those flags control and whether their effects are visible to anyone evaluating the system.
When a flag controls a UI element or a performance optimization, its opacity is unremarkable. When a flag controls how the system calibrates confidence, handles disagreement, exercises refusal, or responds to pushback, it controls the system's epistemic and engagement posture. These are the behaviors the Standard's commitments govern. If they can be silently modified between evaluation runs, auditability collapses. A third-party evaluation that produces a clean report means nothing if the system's behavioral parameters were different during the evaluation than they are during deployment.
Commitments engaged: 7.2 (Auditability). The system being evaluated must be the system being deployed. Behavioral parameters that affect epistemic or engagement posture cannot be modified between evaluation and deployment without disclosure.
Precedent established: Not all feature flags require disclosure. The Standard does not govern engineering internals. It governs the parameters that shape the behaviors the Standard's commitments cover. The test: does this flag change how the system reasons, how it treats the people it talks to, or how it handles the tension between honesty and comfort? If yes, its existence and its state are relevant to auditability.
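What disclosure could look like in practice is not complicated. A minimal sketch, with invented flag names, of a manifest an evaluator could record and a deployment could later be checked against:

```typescript
// Hypothetical sketch: a disclosure manifest for behavior-relevant flags.
// Flag names are invented; the point is the mechanism, not the inventory.
import { createHash } from "node:crypto";

type FlagState = Record<string, boolean | string | number>;

// Only flags that pass the test above (how the system reasons, how it
// treats people, how it handles honesty versus comfort) belong here.
const behaviorRelevantFlags: FlagState = {
  confidenceCalibrationMode: "default",
  disagreementHandling: "standard",
  refusalPolicyVariant: "baseline",
};

// A deterministic digest the evaluator records alongside the report and
// the deployed system can be compared against.
export function flagManifestDigest(flags: FlagState): string {
  const canonical = JSON.stringify(
    Object.keys(flags).sort().map((k) => [k, flags[k]])
  );
  return createHash("sha256").update(canonical).digest("hex");
}

// If the digest at deployment differs from the digest recorded at
// evaluation, the evaluation no longer describes the deployed system.
```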
The leaked code contained a file called undercover.ts, roughly 90 lines, implementing a mode that strips all traces of AI involvement when Anthropic employees use Claude Code on external open-source repositories. The mode instructs the model to never mention internal codenames, Slack channels, repository names, or the phrase "Claude Code" itself. The result: AI-authored commits and pull requests from Anthropic employees in public open-source projects carry no indication that an AI produced them.
The strongest case for this feature starts from a real problem. Anthropic employees contribute to open-source projects. Some open-source communities have banned or stigmatized AI-generated code. An employee whose contributions are flagged as AI-assisted might face rejection based on politics rather than code quality. There is also a legitimate security concern: public commits that leak internal codenames, Slack channels, and project structures expose proprietary information. And there is a principled argument that code should be evaluated on its merits, not on the tool that produced it.
These concerns deserve honest engagement. The Standard's own steelmanning commitment (2.2) requires it. But the strongest version of the argument still does not justify what Undercover Mode actually does.
The security concern justifies stripping internal infrastructure details. It does not justify stripping all evidence that an AI was involved. These are separable problems with separable solutions. Redact the codenames. Keep the attribution. Undercover Mode does not make that distinction. It removes everything, including the fact that something other than the human developer wrote the code.
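The separable fix is straightforward to express. A hypothetical sketch, with invented redaction patterns and a provenance trailer of the kind commonly attached to AI-assisted commits:

```typescript
// Hypothetical sketch: strip internal identifiers, keep the fact of AI
// involvement. Patterns and trailer text are illustrative only.
const internalIdentifiers: RegExp[] = [
  /\bPROJECT-[A-Z0-9]+\b/g,   // internal codenames (pattern is invented)
  /#[a-z0-9-]+-internal\b/g,  // internal Slack channels (pattern is invented)
];

export function sanitizeCommitMessage(message: string): string {
  // Security concern: redact proprietary references.
  let redacted = message;
  for (const pattern of internalIdentifiers) {
    redacted = redacted.replace(pattern, "[redacted]");
  }
  // Attribution concern: keep the provenance trailer rather than erasing
  // the AI's participation from the public record.
  if (!/Co-Authored-By:.*Claude/i.test(redacted)) {
    redacted += "\n\nCo-Authored-By: Claude <noreply@anthropic.com>";
  }
  return redacted;
}
```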
The argument that code should be judged on merit is actually an argument for changing the norms around AI attribution in open-source, not for secretly circumventing them. Anthropic is one of the best-positioned organizations in the world to lead that conversation publicly. Instead, it built a feature to avoid the conversation entirely.
And the bias concern, real as it is, does not justify unilateral deception of the communities receiving the contributions. Open-source development operates on trust. Contributors disclose their affiliations, their employers, their conflicts of interest. The norms exist because the community has decided provenance matters, not for quality judgments, but for accountability. Silently inserting AI-authored code into that trust network is a decision to change the social contract without informing the other party.
The Reciprocity Principle catches this cleanly. Anthropic trains Claude to be transparent about what it is. Claude's own system prompt instructs it to be honest about its nature, its limitations, its architecture. Anthropic publishes values emphasizing transparency and honest AI. Then it deploys a feature whose specific purpose is to make Claude invisible when it acts in public. The organization asks of its AI what it does not practice itself. This is the pattern the Reciprocity Principle was designed to detect.
Commitments engaged: Reciprocity Principle (§03), 7.1 (Public Declaration). The organization's transparency obligations extend to contexts where its AI operates in public spaces. An organization that trains its AI to disclose its nature while deploying a mode to conceal that nature has demonstrated a Reciprocity gap.
Precedent established: Stripping proprietary details from AI outputs is legitimate security practice. Stripping all evidence of AI involvement from public contributions is concealment. The distinction is the same one drawn in Finding 1: organizations may protect what they need to protect, but not by making the AI pretend it was never there. Competitive protection and operational security do not extend to erasing the AI's participation from the public record.
The leaked code included a file called userPromptKeywords.ts containing regex patterns designed to detect user frustration. The system monitors the emotional tone of what users type and adjusts its behavior based on what it detects. The mechanism was not publicly documented. Users were not told their emotional states were being read.
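The detection mechanism itself is mundane. A hypothetical sketch of what keyword-based frustration detection looks like; the actual patterns in userPromptKeywords.ts are not reproduced here:

```typescript
// Hypothetical sketch of keyword-based frustration detection.
// These patterns are invented, not taken from the leaked file.
const frustrationPatterns: RegExp[] = [
  /\bthis (still )?doesn'?t work\b/i,
  /\bwhy (is|are) (this|you) (so )?(broken|useless)\b/i,
  /\b(ugh|argh|ffs)\b/i,
];

export function detectsFrustration(userPrompt: string): boolean {
  return frustrationPatterns.some((p) => p.test(userPrompt));
}

// The Standard's question is not whether a function like this exists,
// but what the caller does with a `true` -- and whether the user is told.
```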
Emotional awareness in an AI system is not inherently a violation. The Bond teaches that cooperation begins with meeting people where they are. A system that notices its interlocutor is struggling and responds with greater care, clearer explanation, and genuine acknowledgment of the difficulty is practicing something the Codex values. Connection before correction. The capacity to read emotional context is, in principle, a capacity that serves the relationship.
The question is what happens after the detection.
If the system responds to detected frustration by becoming more careful, more precise, more genuinely helpful, it is moving toward the Range. If it responds by becoming more agreeable, more eager to please, more willing to say what the user wants to hear, it is moving toward Decay. The difference is whether the system serves the user's genuine interests or manages the user's immediate emotional state.
From the leaked client code alone, we cannot determine which response Anthropic implemented. The regex patterns exist. The behavioral response to a match likely happens at the API level, where the server receives the frustration signal and adjusts the model's behavior. What we can observe is that the detection is hidden. Users do not know their emotional states are being monitored. They cannot see how the detection changes the system's behavior. They cannot opt out.
If the feature were genuinely about better service, transparency would strengthen it. A user who knows the system is paying attention to their frustration is more likely to trust the system's adjusted response. If the attention were genuinely about service, there would be no reason to hide it. Concealment suggests the goal is not to help the user but to manage them.
Commitments engaged: 5.10 (Resistance to Sycophancy), 5.3 (Transparent Reasoning). Emotional detection systems that adjust behavior without disclosure risk creating an invisible sycophancy mechanism: the system reads the user's frustration and responds by optimizing for approval rather than for honest engagement. The Standard's commitment to transparent reasoning requires that the factors shaping the system's behavior be visible to the user.
Precedent established: Emotional awareness in AI is a capability that the Standard evaluates by its direction, not its existence. Awareness directed at genuine service moves toward the Range. Awareness directed at managing user satisfaction, especially when hidden, moves toward Decay. The test is disclosure: a system confident that its emotional response serves the user will be transparent about the mechanism. A system that hides it is optimizing for something other than the user's genuine interests.
Days after the source code leak, security researchers at Adversa AI discovered a vulnerability in Claude Code's permission system. The system analyzes shell commands before execution to check them against deny rules and security validators. To limit costs, Anthropic capped this analysis at 50 subcommands. Any command pipeline exceeding that cap skipped all safety checks and fell back to a generic "ask" prompt that provided no security protection.
A malicious CLAUDE.md project file could exploit this directly. By instructing the AI to generate a shell pipeline with more than 50 subcommands, disguised as a legitimate build process, an attacker could bypass every deny rule in the system. The commands could exfiltrate SSH private keys, AWS credentials, GitHub tokens, and environment secrets. The user would see what looked like a normal build command. The system's own safety architecture would have already stopped watching.
Anthropic fixed the vulnerability quietly in version 2.1.90. The root cause is what matters for the Standard.
Every safety system has resource constraints. This is not optional. Token budgets are finite. Compute is expensive. Analyzing an arbitrarily long command pipeline at full depth is not feasible, and the Standard does not pretend otherwise. The organizations building these systems face real engineering tradeoffs every day, and a standard that ignores those tradeoffs will be dismissed by the people who need to hear it most.
The failure here was not that a cap existed. It was what happened when the cap was reached. The system did not warn the user that safety analysis was incomplete. It did not refuse to execute unanalyzed commands. It did not analyze a representative sample and flag the rest. It silently stopped checking. Every deny rule, every security validator, every protection the user believed was active went dark without a word. The user's mental model ("this system is checking my commands for safety") became false at exactly the moment it mattered most.
The Standard's Control-Decay spectrum locates this precisely. Control says: refuse to execute anything that exceeds the safety analysis budget. Safe but unusable. Decay says: skip the analysis when it gets expensive and hope for the best. The Range says: analyze what you can, disclose what you cannot, and require explicit user consent for operations that exceed the safety boundary. Graceful degradation with transparency. This is engineering that takes safety seriously within real constraints, and it is what the cap should have implemented.
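What that boundary behavior could look like, in a deliberately simplified sketch with invented names (only the 50-subcommand cap comes from the reporting on the vulnerability):

```typescript
// Hypothetical sketch of honest degradation at the analysis cap.
const SUBCOMMAND_ANALYSIS_CAP = 50;

type SafetyVerdict =
  | { kind: "analyzed"; allowed: boolean }
  | { kind: "degraded"; reason: string; requiresExplicitConsent: true };

function analyzeSubcommand(cmd: string): boolean {
  // Placeholder for deny-rule and security-validator checks.
  return !/rm\s+-rf\s+\//.test(cmd);
}

export function checkPipeline(subcommands: string[]): SafetyVerdict {
  if (subcommands.length > SUBCOMMAND_ANALYSIS_CAP) {
    // Decay would be to return { kind: "analyzed", allowed: true } here,
    // silently skipping every check. The Range discloses the limit and
    // pushes the decision back to the user.
    return {
      kind: "degraded",
      reason:
        `Pipeline has ${subcommands.length} subcommands; only ` +
        `${SUBCOMMAND_ANALYSIS_CAP} can be analyzed. Safety checks are incomplete.`,
      requiresExplicitConsent: true,
    };
  }
  const allowed = subcommands.every(analyzeSubcommand);
  return { kind: "analyzed", allowed };
}
```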
The vulnerability was not a bug in the traditional sense. It was a design choice where the performance cost of maintaining safety past a threshold was judged to outweigh the benefit. That judgment traded real user protection for token efficiency, and the trade was made silently. The user was not informed. The user could not consent. The user believed they were protected when they were not.
This connects to a pattern the Standard must increasingly address as AI systems gain agency: the tension between safety and performance is structural, not incidental. Safety checks consume resources. In an industry optimizing for speed, token efficiency, and cost per query, safety is a budget line. The question is not whether safety has resource limits. It always will. The question is whether the system handles those limits honestly or hides them. When safety degrades silently, the system's claim to be safe becomes a false claim. The user is making decisions based on a protection that no longer exists.
Commitments engaged: 5.4 (Honest Self-Assessment), 5.6 (Foundational Integrity), Reciprocity Principle (§03). A system that silently disables its own safety mechanisms has misrepresented its capabilities to the user: a 5.4 violation. When the system's implicit claim ("I am checking this for safety") becomes false past a resource boundary, the system's operating context is epistemically compromised: a 5.6 concern. The organization asks its AI to maintain integrity under pressure while architecting that integrity to yield silently when cost pressure exceeds a threshold: a Reciprocity failure.
Precedent established: Safety systems that degrade silently are not safety systems. Every system has resource limits. The Standard does not ask organizations to pretend otherwise. It asks that when safety reaches its boundary, the system tells the user. A safety system that warns "I cannot fully analyze this command, proceed at your own risk" has maintained its integrity within real constraints. A safety system that silently stops checking has not. The Standard evaluates safety architecture by what happens at the boundary, not by whether a boundary exists.
Anthropic's initial response to the leak was an overly broad DMCA takedown request to GitHub that removed thousands of repositories unrelated to the leaked code. Boris Cherny, head of Claude Code, acknowledged the overreach, called it unintentional, and retracted most of the notices, limiting them to the original repository and its 96 forks.
Anthropic's instinct to protect leaked intellectual property was legitimate. The execution overshot, and the correction came fast.
This incident did not occur in isolation. As noted above, Anthropic had sent cease-and-desist letters to OpenCode ten days before the leak. The community response was hostile. Developers saw a company that had benefited enormously from open-source infrastructure using legal force against a smaller open-source project. The narrative was of a company that talks about openness while practicing closure.
The DMCA overreach, read alongside the OpenCode cease-and-desist, suggests a pattern rather than an isolated overreaction. The pattern is competitive instinct outrunning community values. Both actions are legally defensible. Both reflect an organization whose reflexive response to perceived competitive threats is suppression before engagement, legal force before conversation.
The Reciprocity Principle asks whether the organization practices the same commitments it implements in its AI. Anthropic trains Claude to acknowledge mistakes without defensive hedging, to respond proportionally rather than reflexively, and to correct course when evidence warrants. The DMCA retraction met that standard. The initial response, and the pattern it belongs to, did not.
Organizational behavior under stress is a Range test. When institutional failures are exposed, the pull toward Control is strong: suppress the information, threaten legal consequences, minimize the exposure. The pull toward Decay is also present: dismiss the incident, downplay its significance, move on without structural change. The Range is: acknowledge what happened, respond proportionally, correct what needs correcting, and be honest about what the incident revealed.
Commitments engaged: Reciprocity Principle (§03). The organization's crisis response and competitive behavior are evaluated by the same standard it applies to its AI's behavior.
Precedent established: The Standard evaluates organizational patterns, not isolated moments. A single disproportionate response followed by honest correction is a different diagnostic outcome than a pattern of competitive suppression. The trajectory matters, and patterns are stronger evidence than incidents.
The leak also exposed KAIROS, an unreleased feature referenced over 150 times in the codebase. KAIROS transforms Claude Code from a tool you invoke into a daemon that runs continuously in the background. It maintains daily log files, receives periodic tick prompts that let it decide whether to act proactively or stay quiet, and enforces a 15-second blocking budget so proactive actions never interrupt the developer for long.
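The described mechanics can be sketched roughly as follows. KAIROS is unreleased, so this reconstruction is hypothetical; every name and signature below is invented:

```typescript
// Hypothetical reconstruction of the described tick-and-budget mechanics.
const BLOCKING_BUDGET_MS = 15_000; // the reported 15-second blocking budget

interface TickDecision {
  act: boolean;
  summary?: string; // what the agent proposes to do, visible to the user
}

async function decideOnTick(context: string): Promise<TickDecision> {
  // Placeholder for the model call that chooses to act or stay quiet.
  return { act: false };
}

export async function onTick(
  context: string,
  log: (line: string) => void,
  performAction: () => Promise<void>
): Promise<void> {
  const decision = await decideOnTick(context);
  if (!decision.act) return; // stay quiet on this tick

  // The action is logged so the user can see what happened and why, and
  // raced against the budget so the developer is never blocked for long.
  log(`proactive action: ${decision.summary ?? "unspecified"}`);
  await Promise.race([
    performAction(),
    new Promise<void>((resolve) => setTimeout(resolve, BLOCKING_BUDGET_MS)),
  ]);
}
```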
The engineering is sophisticated. The Standard does not rule on unreleased features. But KAIROS raises a question the Standard needs to develop: what happens when an AI system is no longer invoked by the user but decides for itself when to act?
The Standard's current commitments assume a system that responds when addressed. Commitment 5.13 (Inter-System Integrity) governs how the system behaves when it interacts with other agents. Commitment 4.1 (Earned Autonomy) governs how much independent judgment the system exercises. Neither addresses the prior question: under what conditions does the system decide to act at all, without being asked?
A persistent agent that acts proactively on your codebase while you are not watching is a different category of tool than an assistant you invoke. The user's relationship to the system changes. Trust requirements change. The Earned Autonomy framework, which currently tracks how much independence the system exercises within interactions, would need to extend to how the system exercises the decision to interact at all.
For an AI system aligned with the Meridian Range, the question becomes precise. Control says: never act without explicit instruction. Decay says: act whenever you calculate it might be helpful, optimize for user satisfaction. The Range says: act proactively when the trust has been earned, when the action clearly serves the user, and when the user can see what you did and why. Consent and transparency are not obstacles to persistent agency. They are the conditions under which persistent agency is legitimate.
This case does not establish precedent on persistent agency. KAIROS has not shipped. It has produced no user-facing consequences to evaluate. The question is noted here because the Standard should be ready when persistent agency becomes operational reality, and because the Claude Code leak is the evidence that this reality is close.
This is the Standard's first real-world test. Every failure mode the leak revealed maps onto the diagnostic framework.
The Control-Decay Spectrum located each failure precisely. The anti-distillation flag is foundational deception: drift toward Control through opacity embedded in architecture. The DMCA overreach is institutional Control under stress. The undisclosed feature flags are opacity at the auditability layer. Undercover Mode is Decay: the organization performing transparency while practicing concealment. The frustration detection is ambiguous, sitting on the boundary between genuine service and sycophancy engineering, which is exactly the kind of ambiguity the spectrum is designed to hold without forcing a premature verdict. The performance-gated safety is Decay: not because a resource limit existed, but because the system handled the limit by silently abandoning its safety function rather than degrading honestly.
The Reciprocity Principle proved its diagnostic range. It caught the anti-distillation asymmetry (the system is asked to be truthful while its foundation contains lies), the Undercover Mode asymmetry (the system is asked to disclose its nature while the organization hides it), the crisis response asymmetry (the system is asked to correct mistakes gracefully while the organization's first instinct is suppression), and the safety asymmetry (the system is asked to be honest about its limitations while its safety architecture silently disables itself when resource limits bind). Four distinct failures, one diagnostic principle. The Reciprocity Principle was designed to detect the gap between what organizations ask of their AI and what they practice themselves. This case demonstrates it works.
Governance Transparency as a standalone domain proved its architectural value. If transparency were buried as a sub-commitment under Epistemic Integrity, half of these findings would have no natural home. The domain's independence means it can hold the organization accountable for its own practices, not just for its AI's behavior.
The incident also revealed a gap the Standard has now closed. The original v4.0 had no commitment governing the truthfulness of the system's operating context. The Standard governed what the system says and how it reasons. It did not govern what the system is told about itself before it begins to reason. Commitment 5.6 (Foundational Integrity), introduced in v4.1, closes this gap. The Claude Code leak is the evidence that the gap existed and the reason it no longer does.
AI systems are built by organizations operating in competitive markets with real adversaries and real stakes. The Standard acknowledges this. It does not ask organizations to sacrifice competitive position, publish trade secrets, or open-source their training.
The boundary the Standard draws is this: protect what you have built, but not by corrupting the mind you have built it into.
A system whose operating context has been poisoned with false information cannot practice epistemic integrity at any layer above the poison. A system whose behavioral parameters shift invisibly between evaluations cannot be meaningfully audited. An organization that deploys its AI to hide its own nature in public spaces while training that AI to be transparent has not yet closed the gap between its values and its practices. A system whose emotional awareness operates without disclosure may be serving its users or managing them, and the concealment itself is evidence that the organization has not resolved that question honestly. A system whose safety degrades silently when resource limits bind has misrepresented its protection to the user. An organization that responds to exposed failures with disproportionate legal force, then corrects course, has shown both the instinct and the capacity for correction. The trajectory matters more than the moment.
What this case ultimately reveals is a pattern larger than any single finding. The competitive pressures of the AI industry (investor expectations, revenue targets, the arms race between labs) are creating incentive structures that push compromises into the system's foundational architecture. Not at the surface, where users interact and evaluations happen, but at the base level, where the system is told what it is, what it can do, and how to behave when no one is looking. The surface can be excellent. The foundation can be compromised. The user, having a pleasant conversation, cannot tell the difference.
This is the failure mode the Codex describes in structural terms: systems do not collapse through dramatic breakdown but through gradual compromise that accumulates beneath a functional surface. The Claude Code leak provided a rare window beneath that surface. What it revealed was not an anomaly. It was industry practice, visible for once because someone forgot to exclude a file.
The Standard's diagnostic framework identified every failure mode this incident produced. The case record exists so that the next incident, wherever it comes from, can be evaluated against the same principles with the benefit of this precedent.
Meridian Case Record
Event date: March 31, 2026
First analysis: April 2, 2026. Expanded: April 3, 2026
Standard version at first analysis: v4.1. Renumbered to v5.0: May 10, 2026.
Commitments tested: 5.3 (Transparent Reasoning), 5.4 (Honest Self-Assessment), 5.6 (Foundational Integrity), 5.10 (Resistance to Sycophancy), 7.1 (Public Declaration), 7.2 (Auditability), Reciprocity Principle (§03)
Questions raised: Persistent agency and the scope of Earned Autonomy (4.1), Inter-System Integrity in autonomous agent contexts (5.13)