
Philosophy of Science · Causal Inference · Measurement Theory

A five-paper philosophical program asking what a mechanism is, when a claim is warranted, when the term refers across systems, when independent methods converge on the same claim, and when individual discoveries compose into understanding.

The program
Mechanistic Views: What is a mechanism?
Mechanistic Validity: Is the claim warranted?
Mechanistic Reference: Does the term refer across systems?
Cross-View Invariance: When do independent methods agree?
Mechanistic Knowledge: When do discoveries compose into understanding?

Paper 1 — Ontology

Mechanistic Views

A Five-Axis Ontology for Mechanistic Interpretability

Every mechanistic claim rests on an implicit answer to a prior question: what kind of thing is a mechanism? These are not terminological variants. They differ in ontology, and ontological differences propagate — what a mechanism is determines when two descriptions pick out the same one, what evidence can warrant a claim, and what inferences that claim licenses.

A mechanistic view answers five questions: what entity counts as a mechanism (ontology), when two descriptions refer to the same one (identity), what measurements warrant the claim (evidence), what mathematical language expresses it (formalism), and what phenomenon is being explained (target). The five axes form a dependency chain: ontology constrains identity, which constrains formalism. A coherence condition links them — a view whose evidence discriminates objects its identity criterion calls equivalent is silently tracking a finer ontology than it declares.
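
The coherence condition can be read as a mechanical check. The sketch below is one illustrative reading, not the paper's formal definition: it assumes the identity criterion and the evidence can each be reduced to a key per object, and it flags pairs that the identity criterion equates but the evidence separates. The function name, the head names, and the toy labels are all hypothetical.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable


def coherence_violations(
    objects: Iterable[str],
    identity_key: Callable[[str], Hashable],
    evidence_signature: Callable[[str], Hashable],
) -> list[tuple[str, str]]:
    """Return pairs the identity criterion equates but the evidence tells apart."""
    return [
        (a, b)
        for a, b in combinations(sorted(objects), 2)
        if identity_key(a) == identity_key(b)
        and evidence_signature(a) != evidence_signature(b)
    ]


# Hypothetical toy data: identity is "same role label" (a Role-style view),
# evidence is a coarse activation-patching effect size.
role = {"head_a": "mover", "head_b": "mover", "head_c": "inhibitor"}
patch_effect = {"head_a": "large", "head_b": "small", "head_c": "large"}

violations = coherence_violations(role, role.get, patch_effect.get)
print(violations)  # pairs where the evidence is finer than the declared identity
```

A non-empty result means the evidence is discriminating between objects the view declares identical, which is the situation the text describes as silently tracking a finer ontology.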

The paper catalogs an atlas of nine views (Object, Role, Subspace, Structural, Process, Stratified, Perspectival, Instrumental, and Contrastive) and shows that a single sentence, “head 9.9 is an S-inhibition head,” splits into seven distinct claims once the view is made explicit. The field’s most prevalent error, the object-to-role slide, is diagnosed here: establishing component identity by activation patching, then sliding into a functional role label that was never independently tested.

Of 17 open problems surveyed from seven sources, 8 dissolve once the view is specified, 2 become answerable once the required evidence type is identified, and 7 arise structurally from the limits of the view currently used to study them. That last category matters most: these are ceilings the field’s default operating point (object ontology with activation-patching evidence) cannot transcend.

9 views catalogued
7 readings of one claim
7 / 17 problems are structural ceilings

Paper 2 — Warrant

Mechanistic Validity

A Framework for Evaluating Mechanistic Claims About Neural Networks

The framework applies five validity lenses (construct, internal, external, measurement, and interpretive) drawn from philosophy of science, neuroscience, pharmacology, and measurement theory. It grades claims on a tier ladder calibrated to view type, where the tier is bounded by the weakest validity dimension: strong internal validity cannot compensate for absent construct validity.

Applied to 13 published interpretability results — IOI, sparse autoencoders, probing — the audit surfaces systematic gaps in causal evidence and construct validity. The tiers run from descriptive identification, through observationally consistent and causally suggestive, up to mechanistically supported (convergent evidence across independent methods under multiple views).

The consistent finding: most work achieves descriptive adequacy and causal sufficiency but skips construct validity entirely. Claims establish that a component matters and stop short of establishing what it is.
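
The weakest-dimension rule amounts to taking a minimum over the five lenses. The sketch below assumes, purely for illustration, that each lens receives its own score on the same four-step ladder and that the tiers are numbered 1 to 4 in the order they are listed; the per-lens scoring scheme and the example scores are assumptions, not the paper's procedure.

```python
LENSES = ("construct", "internal", "external", "measurement", "interpretive")

# Assumed numbering: tiers in the order the text lists them.
TIERS = {
    1: "descriptive identification",
    2: "observationally consistent",
    3: "causally suggestive",
    4: "mechanistically supported",
}


def overall_tier(lens_tiers: dict[str, int]) -> int:
    """Overall tier is capped by the weakest lens (a simple minimum)."""
    missing = [lens for lens in LENSES if lens not in lens_tiers]
    if missing:
        raise ValueError(f"no score for lenses: {missing}")
    return min(lens_tiers[lens] for lens in LENSES)


# Hypothetical claim: strong causal (internal) evidence, construct validity untested.
example = {
    "construct": 1,
    "internal": 4,
    "external": 2,
    "measurement": 3,
    "interpretive": 2,
}
tier = overall_tier(example)
print(tier, TIERS[tier])
```

Under this reading, raising the internal-validity score further changes nothing; only improving the construct score can lift the overall tier.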

5 validity lenses
13 results audited
Tier 4 requires cross-method convergence

Paper 3 — Reference

Mechanistic Reference

When Does a Mechanism Term Pick Out the Same Thing?

Neither mechanistic interpretability nor the philosophy of mechanisms provides an account of cross-system mechanistic identity. Mechanism terms do not stay home — “the IOI circuit,” “the mirror neuron system,” “the language gene” are all applied across system boundaries as routine scientific practice. Many apparent replication failures are better understood as reference failures: the original finding and the replication are both valid within their domains, but the mechanism term does not refer across the boundary.

A transport hierarchy specifies, for each level of mechanistic commitment, what evidence is required for a term to refer across systems. Inferential reach is strictly ordered — no quantity of behavioral evidence can substitute for structural evidence, regardless of accumulation. Five reference failure modes — evidence misfire, claim laundering, mimic mechanism, reference debt, and zombie mechanism — are ordered by severity, each triggering a distinct contraction pattern. A Levi-rational contraction algorithm handles principled partial revision: discard the weakest inferential steps, contract at the periphery, not the center.

Applied to the IOI circuit: object-level transport fails even across random seeds of the same architecture (mid-layer stability ~0.70). Role-level transport partially succeeds. Structural-level transport fails across model families (cross-family congruence drops to 0.13). “The IOI circuit” does not refer to the same object in GPT-2 and Pythia, not because either analysis is wrong, but because no transport condition at the structural level is satisfied.

Fifteen worked examples span mechanistic interpretability, neuroscience, pharmacology, and genetics — including cross-species mechanism transport in drug development (mTOR / rapamycin) and the amyloid cascade hypothesis, where a causal mechanism established in rare familial Alzheimer’s was applied wholesale to sporadic disease, driving $40B+ in failed trials.

0.70 mid-layer head stability
0.13 cross-family congruence
15 worked examples
5 failure modes

Paper 4 — Realism

Cross-View Invariance

Cross-View Invariance as a Realism Criterion for Mechanistic Interpretability

When should we believe a mechanism claim is real — that it tracks something in the model rather than something in the method? The answer proposed here is cross-view invariance. A claim’s invariance depth counts how many independent views support it, weighted by failure-mode orthogonality. Because each view family has a characteristic, independent failure mode — role inflation, alignment vacuity (the Sutter problem), training-detail overfitting — a claim that survives across views with non-overlapping failure modes is detecting something no single method’s artifact can explain.

Formally, the invariance depth of a claim is the weighted size of its view-support set,

\[ \delta(C) = \sum_{V \in \operatorname{supp}(C)} w_V \]

with harmonic discounting within families: the first view from each family gets weight 1, the k-th view from the same family gets weight 1/k. Maximum depth δ_max equals the number of independent families, currently 5. One constraint does the critical work: if any pair of views in the support set produces mutually inconsistent descriptions, the inconsistent views are removed and depth is recomputed on the maximal consistent subset. A large but inconsistent support set scores lower than a small consistent one.
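
A minimal sketch of the depth computation follows, covering only the harmonic discounting and the removal of inconsistent views. The family labels and the assignment of views to families are hypothetical, and the set of inconsistent views is taken as given rather than derived; finding the maximal consistent subset is outside the sketch.

```python
from collections import defaultdict


def invariance_depth(support, inconsistent=frozenset()):
    """Sum harmonic weights over a support set of (view, family) pairs.

    The k-th retained view from a family contributes 1/k. Views listed in
    `inconsistent` are dropped before weighting.
    """
    seen_per_family = defaultdict(int)
    depth = 0.0
    for view, family in support:
        if view in inconsistent:
            continue
        seen_per_family[family] += 1
        depth += 1.0 / seen_per_family[family]
    return depth


# Hypothetical grouping: four views in three families, two sharing family "F3".
support = [
    ("Object", "F1"),
    ("Role", "F2"),
    ("Structural", "F3"),
    ("Process", "F3"),
]

depth = invariance_depth(support)            # 1 + 1 + 1 + 1/2
pruned = invariance_depth(support, {"Structural", "Process"})
print(depth, pruned)
```

Four views spread over three families give 1 + 1 + 1 + 1/2 = 3.5, and removing the two views in the shared family drops the depth to 2, which illustrates how an inconsistent pair can cost more than its nominal weight.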

The corrected formulation: validity is view-relative, not a separate axis. The atomic unit of mechanistic knowledge is the triple (claim, view, evidence tier). Invariance depth weights each view by both its independence and its within-view validity tier — re-merging Mechanistic Validity and Mechanistic Views as intended. Every validity audit is implicitly a view-relative audit, and cross-view convergence of strong per-view evidence licenses the realism inference.

| Claim | Support set | Depth δ | Realism inference |
|---|---|---|---|
| Induction heads | Object, Role, Structural, Process (3 families) | ~3.5 | Licensed |
| The IOI circuit | large but internally inconsistent | ~2.1 | Not licensed |
| A single SAE feature | Object only (1 family) | 1 | Not licensed |

A composition operator (using the Szymkiewicz–Simpson overlap coefficient) and a quality-weighted coverage metric extend the per-claim measure to a per-corpus one, giving the field’s first formal answer to the denominator problem: how much of the model does the field’s validated work collectively explain? For GPT-2 Small at the attention-head level (144 heads), the IOI literature, the field’s most extensive case study, provides quality-weighted coverage of roughly 5% for a single task.
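
The two quantities can be sketched as follows. The overlap coefficient is the standard Szymkiewicz–Simpson definition, |A ∩ B| / min(|A|, |B|). The coverage function is an assumed reading, not the paper's formula: each head counts once, weighted by the best quality score of any claim that includes it, divided by the 144 heads in the denominator. The head indices and quality scores are hypothetical.

```python
def overlap_coefficient(a: set, b: set) -> float:
    """Szymkiewicz-Simpson overlap: |A & B| / min(|A|, |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def quality_weighted_coverage(claims, total_units: int) -> float:
    """Assumed metric: each unit counts once, at its best claim quality.

    `claims` is a list of (set_of_units, quality) with quality in [0, 1].
    """
    best = {}
    for units, quality in claims:
        for unit in units:
            best[unit] = max(best.get(unit, 0.0), quality)
    return sum(best.values()) / total_units


# Hypothetical circuits as sets of (layer, head) pairs.
circuit_a = {(9, 6), (9, 9), (10, 0)}
circuit_b = {(9, 9), (10, 0), (10, 7), (11, 10)}

overlap = overlap_coefficient(circuit_a, circuit_b)
coverage = quality_weighted_coverage(
    [(circuit_a, 0.75), (circuit_b, 0.5)],
    total_units=144,  # attention heads in GPT-2 Small
)
print(round(overlap, 3), round(coverage, 4))
```

Counting each head once is the design choice that matters here: heads shared between circuits add evidence about those heads but do not add coverage.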

3.5 induction-head depth
δ ≥ 3 realism threshold (distinct families)
~5% IOI coverage of GPT-2 Small
144 heads in the denominator

Paper 5 — Composition

Mechanistic Knowledge

When Do Circuit Discoveries Compose Into Understanding?

Individual circuit discoveries have accumulated — dozens of circuit-level analyses of language models exist. No framework exists for composing them into system-level understanding. Three gaps prevent composition: no parcellation theory (SAE features, circuits, and causal variables are incommensurable, with no theory of how they relate), no composition theory (overlapping circuits cannot be aggregated without accounting for overlap, interference, and redundancy), and no coverage metric (there is no denominator against which to measure how much of a model’s behavior has been explained).

The mechanism cluster — a collection of mechanism claims about a single system, evaluated for parcellation coherence, individual validity, and collective coverage — is the proposed unit of analysis. The resolution template is drawn from neuroscience’s century-long parcellation problem, from Brodmann’s cytoarchitectonic maps through the Human Connectome Project to network neuroscience.

Applied to the IOI literature, the cluster comprises ~20 published analyses with 78% head overlap between circuits for different tasks. Fifty circuits involving the same attention head accumulate evidence about that head’s generality but cover one head, not fifty. The cluster meets none of the three composition conditions.

| Condition | IOI cluster | Induction-head cluster |
|---|---|---|
| Parcellation coherence | Fails: views unreconciled | Partial: role-level consistent |
| Individual validity | Partial: core Tier 3, periphery below | Near pass: convergent evidence |
| Collective coverage | Fails: no denominator | Fails: no denominator |
| Overall | 0 / 3 met | 1.5 / 3 met |

The unifying thesis across all five papers: a mechanism claim is not a binary fact but a structured object with identity fibers, validity neighborhoods, transport maps, and coverage ranks. The science advances when those structures are made explicit.
