A five-paper philosophical program asking what a mechanism is, when a mechanistic claim is warranted, when a mechanism term refers across systems, when independent methods converge on the same claim, and when individual discoveries compose into understanding.
Mechanistic Views
A Five-Axis Ontology for Mechanistic Interpretability
Every mechanistic claim rests on an implicit answer to a prior question: what kind of thing is a mechanism? The competing answers are not terminological variants. They differ in ontology, and ontological differences propagate: what a mechanism is determines when two descriptions pick out the same one, what evidence can warrant a claim, and what inferences that claim licenses.
A mechanistic view answers five questions: what entity counts as a mechanism (ontology), when two descriptions refer to the same one (identity), what measurements warrant the claim (evidence), what mathematical language expresses it (formalism), and what phenomenon is being explained (target). The five axes form a dependency chain: ontology constrains identity, which constrains formalism. A coherence condition links them — a view whose evidence discriminates objects its identity criterion calls equivalent is silently tracking a finer ontology than it declares.
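The coherence condition can be read as a mechanical check. The sketch below is a minimal illustration, not the paper's procedure: it assumes a view can be represented by two functions, a hypothetical `identity_key` (objects with equal keys count as the same mechanism) and a hypothetical `evidence_signature` (what the view's measurements report). The toy heads and numbers are invented for the example.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable, TypeVar

T = TypeVar("T")


def coherence_violations(
    objects: Iterable[T],
    identity_key: Callable[[T], Hashable],
    evidence_signature: Callable[[T], Hashable],
) -> list[tuple[T, T]]:
    """Return pairs that the identity criterion equates but the evidence tells apart."""
    return [
        (a, b)
        for a, b in combinations(list(objects), 2)
        if identity_key(a) == identity_key(b)
        and evidence_signature(a) != evidence_signature(b)
    ]


# Hypothetical toy data: (component name, role label, patching effect).
heads = [
    ("head_A", "inhibition", 0.8),
    ("head_B", "inhibition", 0.2),
    ("head_C", "copying", 0.5),
]

# A role-level view treats heads with the same role label as equivalent,
# while its evidence (the patching effect) still separates them.
violations = coherence_violations(
    heads,
    identity_key=lambda h: h[1],
    evidence_signature=lambda h: h[2],
)
```

A non-empty `violations` list is the signal described above: the evidence separates objects the identity criterion treats as one, so the view is tracking a finer ontology than it declares.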
The paper catalogs an atlas of nine views (Object, Role, Subspace, Structural, Process, Stratified, Perspectival, Instrumental, and Contrastive) and shows that a single sentence, “head 9.9 is an S-inhibition head,” splits into seven distinct claims once the view is made explicit. The field’s most prevalent error, the object-to-role slide, is diagnosed here: establishing component identity by activation patching, then sliding into a functional role label that was never independently tested.
Mechanistic Validity
A Framework for Evaluating Mechanistic Claims About Neural Networks
Five validity lenses — construct, internal, external, measurement, and interpretive — drawn from philosophy of science, neuroscience, pharmacology, and measurement theory. The framework grades claims on a tier ladder calibrated to view type, where the tier is bounded by the weakest validity dimension: strong internal validity cannot compensate for absent construct validity.
Applied to 13 published interpretability results — IOI, sparse autoencoders, probing — the audit surfaces systematic gaps in causal evidence and construct validity. The tiers run from descriptive identification, through observationally consistent and causally suggestive, up to mechanistically supported (convergent evidence across independent methods under multiple views).
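The weakest-link rule can be sketched in a few lines. This is an illustration under stated assumptions rather than the framework's actual scoring procedure: it assumes each validity lens is scored on the same four-step ladder, that the steps are numbered 1 to 4 in the order listed above, and it ignores the calibration to view type. The names `Tier`, `LENSES`, and `overall_tier` are hypothetical.

```python
from enum import IntEnum


class Tier(IntEnum):
    # Numbering is an assumption; only the ordering comes from the text.
    DESCRIPTIVE_IDENTIFICATION = 1
    OBSERVATIONALLY_CONSISTENT = 2
    CAUSALLY_SUGGESTIVE = 3
    MECHANISTICALLY_SUPPORTED = 4


LENSES = ("construct", "internal", "external", "measurement", "interpretive")


def overall_tier(per_lens: dict[str, Tier]) -> Tier:
    """Grade a claim by its weakest validity lens."""
    missing = [lens for lens in LENSES if lens not in per_lens]
    if missing:
        raise ValueError(f"unscored lenses: {missing}")
    return min(per_lens[lens] for lens in LENSES)


# Hypothetical audit: strong internal validity, absent construct validity.
example = {
    "construct": Tier.DESCRIPTIVE_IDENTIFICATION,
    "internal": Tier.MECHANISTICALLY_SUPPORTED,
    "external": Tier.CAUSALLY_SUGGESTIVE,
    "measurement": Tier.CAUSALLY_SUGGESTIVE,
    "interpretive": Tier.OBSERVATIONALLY_CONSISTENT,
}
graded = overall_tier(example)
```

Taking the minimum, rather than an average, is what makes strong internal validity unable to compensate for absent construct validity: here `graded` is the lowest tier despite the top score on the internal lens.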
Mechanistic Reference
When Does a Mechanism Term Pick Out the Same Thing?
Neither mechanistic interpretability nor the philosophy of mechanisms provides an account of cross-system mechanistic identity. Mechanism terms do not stay home — “the IOI circuit,” “the mirror neuron system,” “the language gene” are all applied across system boundaries as routine scientific practice. Many apparent replication failures are better understood as reference failures: the original finding and the replication are both valid within their domains, but the mechanism term does not refer across the boundary.
A transport hierarchy specifies, for each level of mechanistic commitment, what evidence is required for a term to refer across systems. Inferential reach is strictly ordered — no quantity of behavioral evidence can substitute for structural evidence, regardless of accumulation. Five reference failure modes — evidence misfire, claim laundering, mimic mechanism, reference debt, and zombie mechanism — are ordered by severity, each triggering a distinct contraction pattern. A Levi-rational contraction algorithm handles principled partial revision: discard the weakest inferential steps, contract at the periphery, not the center.
Fifteen worked examples span mechanistic interpretability, neuroscience, pharmacology, and genetics — including cross-species mechanism transport in drug development (mTOR / rapamycin) and the amyloid cascade hypothesis, where a causal mechanism established in rare familial Alzheimer’s was applied wholesale to sporadic disease, driving $40B+ in failed trials.
Cross-View Invariance
Cross-View Invariance as a Realism Criterion for Mechanistic Interpretability
When should we believe a mechanism claim is real — that it tracks something in the model rather than something in the method? The answer proposed here is cross-view invariance. A claim’s invariance depth counts how many independent views support it, weighted by failure-mode orthogonality. Because each view family has a characteristic, independent failure mode — role inflation, alignment vacuity (the Sutter problem), training-detail overfitting — a claim that survives across views with non-overlapping failure modes is detecting something no single method’s artifact can explain.
Formally, the invariance depth of a claim is the weighted size of its view-support set,
$$\delta(C) = \sum_{V \in \operatorname{supp}(C)} w_V$$
with harmonic discounting within families: the first view from each family gets weight 1, the k-th view from the same family gets weight 1/k. Maximum depth δmax equals the number of independent families, currently 5. One constraint does the critical work — if any pair of views in the support set produces mutually inconsistent descriptions, the inconsistent views are removed and depth is recomputed on the maximal consistent subset. A large but inconsistent support set scores lower than a small consistent one.
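A small sketch makes the arithmetic concrete. Two things in it are assumptions rather than the paper's definitions: the assignment of views to families (the labels `F1` to `F3` are placeholders) and the reading of "maximal consistent subset" as the conflict-free subset with the highest depth, found here by brute force over the at most nine views.

```python
from collections import defaultdict
from itertools import combinations


def depth(views, family_of):
    """Harmonic discounting: the k-th view from a family contributes 1/k."""
    per_family = defaultdict(int)
    for view in views:
        per_family[family_of[view]] += 1
    return sum(1.0 / k for n in per_family.values() for k in range(1, n + 1))


def invariance_depth(support, family_of, inconsistent_pairs=()):
    """Depth of the best subset of `support` containing no inconsistent pair.

    Assumption: 'maximal consistent subset' means the conflict-free subset
    with the highest depth. Brute force is fine for a handful of views.
    """
    support = list(support)
    conflicts = {frozenset(pair) for pair in inconsistent_pairs}
    best = 0.0
    for size in range(len(support), 0, -1):
        for subset in combinations(support, size):
            if any(frozenset(p) in conflicts for p in combinations(subset, 2)):
                continue
            best = max(best, depth(subset, family_of))
    return best


# Hypothetical grouping: four views in three families.
family_of = {"Object": "F1", "Role": "F2", "Structural": "F3", "Process": "F3"}
views = ["Object", "Role", "Structural", "Process"]

consistent = invariance_depth(views, family_of)
conflicted = invariance_depth(views, family_of, [("Object", "Role")])
```

With four consistent views in three families the depth is 1 + 1 + 1 + 1/2 = 3.5. Declaring the Object and Role descriptions mutually inconsistent forces one of them out, and the depth falls to 2.5, which is how a larger but inconsistent support set can score below a smaller consistent one.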
The unit of assessment is the triple (claim, view, evidence tier). Invariance depth weights each view by both its independence and its within-view validity tier, re-merging Mechanistic Validity and Mechanistic Views as intended. Every validity audit is implicitly a view-relative audit, and cross-view convergence of strong per-view evidence licenses the realism inference.

| Claim | Support set | Depth δ | Realism inference |
|---|---|---|---|
| Induction heads | Object, Role, Structural, Process — 3 families | ~3.5 | Licensed |
| The IOI circuit | large but internally inconsistent | ~2.1 | Not licensed |
| A single SAE feature | Object only — 1 family | 1 | Not licensed |
A composition operator (using the Szymkiewicz–Simpson overlap coefficient) and a quality-weighted coverage metric extend the per-claim measure to a per-corpus one, giving the field’s first formal answer to the denominator problem: how much of the model does the field’s validated work collectively explain? For GPT-2 Small at the attention-head level (144 heads), the IOI literature, the field’s most extensive case study, provides quality-weighted coverage of roughly 5% of the model, and only for a single task.
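The Szymkiewicz–Simpson overlap coefficient of two sets is the size of their intersection divided by the size of the smaller set. The sketch below computes it for two circuits represented as sets of (layer, head) pairs. The coverage function is only one plausible reading of "quality-weighted coverage" and is an assumption, not the paper's definition: each head is credited with the highest quality weight among the claims that include it, and the total is divided by the number of heads. The circuits and weights are invented placeholders.

```python
def overlap_coefficient(a: set, b: set) -> float:
    """Szymkiewicz-Simpson overlap: |A & B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def quality_weighted_coverage(claims, total_units: int) -> float:
    """Hypothetical coverage metric (an assumption, not the paper's formula).

    `claims` is a list of (set_of_units, quality_weight in [0, 1]) pairs.
    Each unit is counted once, at the best quality of any claim covering it,
    so overlapping circuits are not double counted.
    """
    best_quality = {}
    for units, quality in claims:
        for unit in units:
            best_quality[unit] = max(best_quality.get(unit, 0.0), quality)
    return sum(best_quality.values()) / total_units


# Placeholder circuits as (layer, head) pairs; not real findings.
circuit_a = {(0, 1), (0, 2), (1, 0)}
circuit_b = {(0, 2), (1, 0), (2, 3), (2, 4)}

overlap = overlap_coefficient(circuit_a, circuit_b)
coverage = quality_weighted_coverage(
    [(circuit_a, 0.5), (circuit_b, 0.25)],
    total_units=12 * 12,  # 12 layers x 12 heads = 144 heads in GPT-2 Small
)
```

Here the two circuits share two of the smaller circuit's three heads, so the overlap is 2/3, and the five distinct heads contribute a combined quality of 2.0 out of a possible 144. Fixing the denominator at the full head count is what turns a list of circuits into a statement about how much of the model has been explained.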
Mechanistic Knowledge
When Do Circuit Discoveries Compose Into Understanding?
Individual circuit discoveries have accumulated — dozens of circuit-level analyses of language models exist. No framework exists for composing them into system-level understanding. Three gaps prevent composition: no parcellation theory (SAE features, circuits, and causal variables are incommensurable, with no theory of how they relate), no composition theory (overlapping circuits cannot be aggregated without accounting for overlap, interference, and redundancy), and no coverage metric (there is no denominator against which to measure how much of a model’s behavior has been explained).
The mechanism cluster — a collection of mechanism claims about a single system, evaluated for parcellation coherence, individual validity, and collective coverage — is the proposed unit of analysis. The resolution template is drawn from neuroscience’s century-long parcellation problem, from Brodmann’s cytoarchitectonic maps through the Human Connectome Project to network neuroscience.
| Condition | IOI cluster | Induction-head cluster |
|---|---|---|
| Parcellation coherence | Fails — views unreconciled | Partial — role-level consistent |
| Individual validity | Partial — core Tier 3, periphery below | Near pass — convergent evidence |
| Collective coverage | Fails — no denominator | Fails — no denominator |
| Overall | 0 / 3 met | 1.5 / 3 met |