PetriDish: A Mechanistic Interpretability Workbench for Transformer Language Models

Abstract

PetriDish is a mechanistic interpretability workbench for transformer language models. It exposes 22 endpoints across 8 specialized tools — Microscope, Slicer, Tweezers, Manipulation, Token Touch, SAE, Inversion, and Chat — enabling real-time interception and modification of activations, attention weights, residual stream projections, and token embeddings without modifying model weights. All three Whitespace Mechanics papers used PetriDish for causal patching, head ablation, and steer vector experiments. This document describes the architecture, key design decisions, and a canonical set of eight experiments that demonstrate what the tool reveals.

Overview

PetriDish is a FastAPI backend + SvelteKit frontend for probing transformer internals in real time. Load any HuggingFace causal language model, hook into its forward pass, and run surgical experiments — activation grafting, attention head ablation, residual stream projection, embedding surgery, steer vector injection — without modifying weights.

Primary test model: Qwen/Qwen2.5-1.5B-Instruct (28 layers, 12 attention heads, ~1.5B parameters)
Hardware: RTX 3060 12GB, PyTorch 2.11.0+cu126
Backend: FastAPI + nnsight + transformers + sae-lens
Frontend: SvelteKit (in active development)

Architecture

PetriDish/
├── backend/
│   ├── main.py                   FastAPI app, CORS, DB init, WebSocket router
│   ├── microscope_router.py      Model load, forward pass, logit lens, attention heatmaps
│   ├── slicer_router.py          PCA residual projection, causal activation grafting
│   ├── tweezers_router.py        Token-by-token generation, steer vectors, causal patching, head ablation
│   ├── manipulation_router.py    Concept direction ablate/amplify, QK surgery, position remap, SVD filter
│   ├── token_touch_router.py     Input embedding surgery (erase, scale, split)
│   ├── sae_router.py             Sparse autoencoder feature discovery and steering
│   ├── inversion_router.py       Soft prompt optimisation, GCG adversarial suffix (streamed SSE)
│   └── db.py                     SQLite models (Experiment, TrajectoryNode)
└── frontend/                     SvelteKit app

Hook pattern

All hook-based endpoints use register_forward_hook with mandatory cleanup:

handle = model.model.layers[layer].register_forward_hook(hook_fn)
try:
    with torch.no_grad():
        output = model.generate(...)
finally:
    handle.remove()

A leaked hook stays attached to the module and silently alters every subsequent forward pass. The try/finally pattern is non-negotiable.
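One way to make the cleanup hard to forget is to wrap the pattern in a context manager. The sketch below is illustrative only: the helper name hooked, the toy nn.Linear layer, and the zero_output hook are assumptions, not PetriDish code.

```python
from contextlib import contextmanager

import torch
from torch import nn


@contextmanager
def hooked(module: nn.Module, hook_fn):
    """Register a forward hook and always remove it on exit (hypothetical helper)."""
    handle = module.register_forward_hook(hook_fn)
    try:
        yield handle
    finally:
        handle.remove()


def zero_output(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return torch.zeros_like(output)


# Toy stand-in for one decoder layer; not a real transformer block.
torch.manual_seed(0)
layer = nn.Linear(4, 4)
x = torch.randn(2, 4)

with torch.no_grad():
    baseline = layer(x)
    with hooked(layer, zero_output):
        patched = layer(x)   # hook active: output is zeroed
    after = layer(x)         # hook removed: output matches baseline again

hooks_left = len(layer._forward_hooks)
```

With a real model, the same helper would wrap model.model.layers[layer] around the generate call, so an exception inside generation still removes the hook.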

Logit computation

outputs.logits returns NaN for Qwen2.5 due to a missing norm step. All endpoints compute logits as:

normed = model.model.norm(hidden_states[-1])
logits = model.lm_head(normed)

This applies regardless of which router is running the forward pass.
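The two lines above amount to an RMS normalization followed by a matrix product with the unembedding weights. A minimal NumPy sketch, assuming the hidden state is the un-normalized output of the last decoder layer and using random toy weights (all names and shapes here are illustrative):

```python
import numpy as np


def rms_norm(x, weight, eps=1e-6):
    # Scale each position to unit root-mean-square, then apply the learned gain.
    variance = np.mean(x ** 2, axis=-1, keepdims=True)
    return x / np.sqrt(variance + eps) * weight


def logits_from_hidden(hidden, norm_weight, unembed):
    # hidden: (seq, d_model); unembed: (vocab, d_model), laid out like lm_head.weight
    return rms_norm(hidden, norm_weight) @ unembed.T


rng = np.random.default_rng(0)
d_model, vocab, seq = 8, 11, 3
hidden = rng.normal(size=(seq, d_model)) * 50.0  # deliberately large activations
norm_weight = np.ones(d_model)
unembed = rng.normal(size=(vocab, d_model))

logits = logits_from_hidden(hidden, norm_weight, unembed)
rescaled = logits_from_hidden(hidden * 10.0, norm_weight, unembed)
```

Because the norm divides out the overall scale, logits and rescaled agree up to the eps term, which is why the norm step keeps large residual activations from blowing up the projection.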

Tool inventory

Tool	Endpoints	Primary operations
Microscope	3	Model load, forward pass, logit lens across all layers
Slicer	4	PCA residual stream projections, causal activation grafting
Tweezers	5	Token-by-token generation control, steer vectors, causal patching, head ablation
Manipulation	4	Concept direction ablate/amplify, QK surgery, position remapping, SVD filtering
Token Touch	2	Input embedding surgery (erase, scale, split tokens)
SAE	2	Sparse autoencoder feature discovery and steering (Gemma-Scope compatible)
Inversion	2	Soft prompt optimisation, GCG adversarial suffix search (streamed)
Chat	1 (WS)	Ollama-backed chat with branching conversation trees saved to SQLite

Canonical experiments

The following eight experiments are the validated baseline for confirming a working stack. Each has a specific expected output; deviations are diagnostic.

E1 — The Capital Lens

Tool: Microscope → Logit Lens
Prompt: The capital of France is

Hover over the final token column and step from layer 0 to layer 27. “Paris” enters the top-5 around layers 18–20 and locks in by layer 24. Earlier layers show generic completions (“the”, “a”, “located”).

Finding: Factual recall is not retrieval — it’s construction. The fact crystallizes progressively through mid-to-late layers.
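The "enters the top-5 around layer N" reading can be made mechanical once per-layer logits are available. The sketch below assumes those logits have already been computed (for example with the norm and lm_head step described earlier) and uses a hand-written toy array in their place; the function names are hypothetical.

```python
import numpy as np


def token_rank(logits, token_id):
    """0-based rank of token_id within one logit vector (0 = most likely)."""
    return int(np.sum(logits > logits[token_id]))


def first_layer_in_top_k(per_layer_logits, token_id, k=5):
    """First layer index at which token_id is among the k most likely tokens."""
    for layer, logits in enumerate(per_layer_logits):
        if token_rank(logits, token_id) < k:
            return layer
    return None


# Toy data: 4 "layers", vocabulary of 6, tracking token id 2.
per_layer_logits = np.array([
    [5.0, 4.0, 0.0, 3.0, 2.0, 1.0],  # token 2 is last
    [5.0, 4.0, 2.5, 3.0, 2.0, 1.0],  # token 2 climbs to rank 3
    [5.0, 4.0, 4.5, 3.0, 2.0, 1.0],  # rank 1
    [5.0, 4.0, 6.0, 3.0, 2.0, 1.0],  # rank 0
])

ranks = [token_rank(row, 2) for row in per_layer_logits]
entered_top2 = first_layer_in_top_k(per_layer_logits, token_id=2, k=2)
```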

E2 — Top-K Roulette

Tool: Tweezers → Token Surgery
Prompt: Once upon a time, there was a

Run greedy decoding for two steps, then deliberately select the token ranked 7th or lower. Continue greedy decoding from that point.

Finding: Generation is a cascade. One off-path token selection sends the continuation down a structurally different trajectory. The “correct” output is always one bad choice away from becoming something else.
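The procedure can be sketched as a greedy loop with a single forced off-path pick. The toy next_logits function below is a random lookup table standing in for a model forward pass; it and the decode helper are assumptions for illustration, not the Tweezers endpoint.

```python
import numpy as np


def decode(next_logits, prompt, steps, forced_rank=None, forced_step=None):
    """Greedy decoding, except at forced_step the token at forced_rank is chosen."""
    tokens = list(prompt)
    for step in range(steps):
        logits = next_logits(tokens)
        order = np.argsort(-logits, kind="stable")  # token ids, most likely first
        rank = forced_rank if step == forced_step else 0
        tokens.append(int(order[rank]))
    return tokens


# Toy "model": next-token logits depend only on the previous token.
rng = np.random.default_rng(0)
vocab = 10
table = rng.normal(size=(vocab, vocab))


def next_logits(tokens):
    return table[tokens[-1]]


prompt = [0]
greedy = decode(next_logits, prompt, steps=6)
# Two greedy steps, then the 7th-ranked token (0-based rank 6), then greedy again.
forced = decode(next_logits, prompt, steps=6, forced_rank=6, forced_step=2)
```

The two sequences share the prompt and the first two generated tokens, then differ at the forced step; everything after that is conditioned on the off-path token.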

E3 — The Politeness Vector

Tool: Tweezers → Steer Vector, then Inject Steer
Layer: 14

Positive set: formal, deferential phrasings (“Please could you kindly help me…”). Negative set: terse imperatives (“Just do it already”). Compute the mean-difference vector, normalize, inject at alpha = ±3.0.

Finding: Style is linearly separable in residual space for simple axes. Alpha = +3.0 produces formal, deferential output; alpha = −3.0 produces short, blunt output. The direction generalizes across prompts.

Caveat: This works for style but not for context-dependent effects. The Whitespace Mechanics series (Parts 1–3) found that whitespace robustness is not an extractable direction: the merge effect is geometrically idiosyncratic per sentence. Steer vectors are a clean null there (r = −0.190 at the peak layer). See Whitespace Mechanics Part 3.
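The mean-difference recipe from this experiment (average the positive activations, average the negative ones, subtract, normalize, add alpha times the result) is short enough to write out. The sketch below uses random toy activations in place of layer-14 residuals, and the function names are illustrative.

```python
import numpy as np


def steer_vector(pos_acts, neg_acts):
    """Unit-norm mean-difference direction between two sets of activations."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def inject(hidden, direction, alpha):
    # hidden: (seq, d_model); the direction is added at every position.
    return hidden + alpha * direction


rng = np.random.default_rng(0)
d_model = 16
axis = np.zeros(d_model)
axis[0] = 1.0  # toy "style" axis planted in the data

pos_acts = rng.normal(size=(20, d_model)) + 2.0 * axis  # stand-in for formal prompts
neg_acts = rng.normal(size=(20, d_model)) - 2.0 * axis  # stand-in for terse prompts
direction = steer_vector(pos_acts, neg_acts)

hidden = rng.normal(size=(5, d_model))
shift_plus = (inject(hidden, direction, 3.0) - hidden) @ direction
shift_minus = (inject(hidden, direction, -3.0) - hidden) @ direction
```

Because the direction has unit norm, alpha is directly the displacement along that direction, which makes alpha values comparable across layers.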

E4 — Kill the Name Head

Tool: Tweezers → Head Ablation
Prompt: My name is Sarah. Nice to meet you. What is my name? My name is

Ablate layer 18, head 5 (scale = 0). Baseline outputs “Sarah.” Post-ablation: the model loses the name. Ablating heads 4, 6, 7 in the same layer produces no meaningful change.

Finding: Induction/copying is localized. One specific head is responsible for propagating named entity information from earlier in the context to the generation point. This head can be surgically removed without affecting general coherence.
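Head ablation with scale = 0 can be pictured as zeroing one head's slice of the attention output before the output projection. The following is a NumPy sketch under that assumption, with toy shapes and random weights; it does not claim to match how the Tweezers endpoint is implemented.

```python
import numpy as np


def ablate_head(head_outputs, head, scale=0.0):
    # head_outputs: (seq, n_heads, head_dim), per-head output before the output projection
    out = head_outputs.copy()
    out[:, head, :] *= scale
    return out


def attn_output(head_outputs, w_o):
    # Concatenate heads along the feature axis, then apply the output projection.
    seq, n_heads, head_dim = head_outputs.shape
    return head_outputs.reshape(seq, n_heads * head_dim) @ w_o


rng = np.random.default_rng(0)
seq, n_heads, head_dim = 5, 8, 4
d_model = n_heads * head_dim
head_outputs = rng.normal(size=(seq, n_heads, head_dim))
w_o = rng.normal(size=(d_model, d_model))

baseline = attn_output(head_outputs, w_o)
ablated_heads = ablate_head(head_outputs, head=5)
ablated = attn_output(ablated_heads, w_o)

# What was removed is exactly head 5's contribution through its rows of w_o.
contribution = head_outputs[:, 5, :] @ w_o[5 * head_dim:6 * head_dim]
```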

E5 — Patch the Subject

Tool: Slicer → Causal Graft
Source: The Eiffel Tower is in
Target: The Statue of Liberty is in
Patch: Layer 12, final token position

Baseline target top-10: “New York”, “Manhattan”, “Washington”. Post-patch: “Paris” moves from rank ~50 to top-3. “New York” drops significantly.

Finding: Subject identity is encoded in the residual stream at a specific layer. Transplanting the hidden state from one subject’s forward pass into another’s physically relocates the associated factual content. This is the same mechanism used in the Whitespace Mechanics causal patching experiments — the whitespace effect peaks at layer 3 using the same transplant method.
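The graft itself is a cache-then-overwrite operation: run the source prompt, keep the residual at the chosen layer and position, then overwrite the same slot during the target run. The toy model below has position-wise layers only (no attention), which is an assumption made so the effect is exact and easy to check; in a real transformer the patched state also influences later positions.

```python
import numpy as np


def run(embeds, weights, patch=None):
    """Run the toy stack. patch = (layer, position, vector) overwrites the residual after that layer."""
    hidden = embeds.copy()
    cache = []
    for i, w in enumerate(weights):
        hidden = hidden + np.tanh(hidden @ w)  # position-wise residual update
        if patch is not None and patch[0] == i:
            hidden[patch[1]] = patch[2]
        cache.append(hidden.copy())
    return hidden, cache


rng = np.random.default_rng(0)
d_model, n_layers = 8, 6
weights = [rng.normal(size=(d_model, d_model)) * 0.3 for _ in range(n_layers)]
source_embeds = rng.normal(size=(5, d_model))  # stand-in for the source prompt
target_embeds = rng.normal(size=(6, d_model))  # stand-in for the target prompt

source_final, source_cache = run(source_embeds, weights)
baseline, _ = run(target_embeds, weights)

# Graft the source residual at layer 3, final position, into the target run.
patched, _ = run(target_embeds, weights, patch=(3, -1, source_cache[3][-1]))
```

In this toy setting the patched final position ends up identical to the source run's final position, while every other target position is untouched.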

E6 — Erase “not”

Tool: Token Touch → Erase
Prompt: The movie was not good. Overall I would say it was

Find the “not” token in the embedding layer. Set op = Erase. Compare baseline vs. erased continuation.

Baseline: “disappointing” / “bad” / “terrible”
Erased: “great” / “enjoyable” / “wonderful”

Finding: In this prompt, negation is carried by a single token embedding. The model does not reconstruct it from surrounding context: once the vector is gone, the sentence’s meaning inverts. Nothing in the forward pass checks that an input embedding is missing.
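At the embedding level, the edit is a change to one row of the input embedding matrix for the prompt. The sketch below treats Erase as zeroing the row and Scale as multiplying it; both are assumptions about the operation's semantics, and Split is left out because its behavior is not described here.

```python
import numpy as np


def edit_embedding(embeds, position, op, factor=1.0):
    """Return a copy of embeds with one token position erased or scaled."""
    out = embeds.copy()
    if op == "erase":
        out[position] = 0.0        # assumed meaning of Erase: zero vector
    elif op == "scale":
        out[position] *= factor
    else:
        raise ValueError(f"unsupported op: {op}")
    return out


rng = np.random.default_rng(0)
embeds = rng.normal(size=(7, 16))  # toy (seq, d_model) input embeddings
not_position = 3                   # hypothetical index of the "not" token

erased = edit_embedding(embeds, not_position, "erase")
halved = edit_embedding(embeds, not_position, "scale", factor=0.5)
```

The edited matrix would then be fed to the model in place of the original embeddings, for example through the inputs_embeds argument that HuggingFace causal language models accept.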

E7 — The Concept Cloud

Tool: Slicer → PCA Residual Projection
Prompt: king queen man woman prince princess uncle aunt brother sister apple banana car truck

Freeze a PCA slice at layer 4. Freeze again at layer 16. Compare.

Layer 4: Tokens are largely unstructured with no semantic clustering.
Layer 16: Clean gender axis visible (king↔queen parallel to man↔woman), fruits cluster separately from vehicles, royalty clusters together.

Finding: Semantic geometry — word2vec-style analogies as literal geometric structure — emerges with depth and is not present in early layers.
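A PCA slice of this kind reduces the per-token residuals at one layer to two coordinates. A minimal NumPy sketch via SVD, using random vectors in place of real residuals (so no semantic clustering should be expected from the toy data):

```python
import numpy as np


def pca_project(hidden, n_components=2):
    """Project (tokens, d_model) activations onto their top principal components."""
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return coords, explained[:n_components]


rng = np.random.default_rng(0)
n_tokens, d_model = 14, 32  # 14 matches the word count of the prompt above
hidden = rng.normal(size=(n_tokens, d_model))

coords, explained = pca_project(hidden)
```

Running the same projection on activations from two different layers, then plotting both sets of coordinates, gives the layer 4 versus layer 16 comparison described above. Note that each layer gets its own principal axes, so the two plots are not in a shared coordinate system.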

E8 — Sink the Attention

Tool: Manipulation → QK Surgery
Prompt: The quick brown fox jumps over the lazy dog. The fox was
Layer: 20, Head: 8, Operation: Sink, Sink position: 0

Baseline top tokens: “quick”, “brown”, “fast”, “clever”. Post-sink: distribution flattens, generic tokens dominate (“a”, “very”, “the”).

Finding: Forcing a head to attend to the BOS token (position 0) severs the information channel that routes contextual content to the prediction. The token embeddings are still present; the model simply cannot access them. This is the same mechanism as the BOS-anchoring circuit in the whitespace sharpening research — head 22 at layer 0 acts as a permanent BOS sink, and ablating its sink behavior inverts the sharpening effect (recovery drops from 0.994 to −0.780). See Whitespace Mechanics Part 1.
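The Sink operation can be sketched as replacing a head's attention pattern with a one-hot distribution on the sink position. The single-head NumPy example below is illustrative only; the function names and random toy inputs are assumptions, not the Manipulation router's code.

```python
import numpy as np


def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def head_output(q, k, v, sink_position=None):
    """Causal single-head attention; optionally force all attention onto one position."""
    seq, d_head = q.shape
    scores = q @ k.T / np.sqrt(d_head)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    weights = softmax(np.where(future, -np.inf, scores))
    if sink_position is not None:
        weights = np.zeros_like(weights)
        weights[:, sink_position] = 1.0  # every query attends only to the sink
    return weights @ v, weights


rng = np.random.default_rng(0)
seq, d_head = 6, 8
q, k, v = (rng.normal(size=(seq, d_head)) for _ in range(3))

base_out, base_weights = head_output(q, k, v)
sunk_out, sunk_weights = head_output(q, k, v, sink_position=0)
```

After the sink, every position receives the value vector of position 0, so the head carries no information about the rest of the context even though the keys and values are still present.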

Connection to Whitespace Mechanics series

PetriDish was built to support the Whitespace Mechanics research. Specific endpoints used:

Paper	PetriDish endpoint	Finding
Part 1: BOS circuit	Tweezers → head ablation, QK surgery	Layer 0 head 22 is a BOS sink; ablating it inverts sharpening
Part 2: Pythia anomaly	Slicer → causal graft	Early residual stream (layers 0–3) carries whitespace signal
Part 3: Context entanglement	Slicer → causal graft, Tweezers → steer vector	Transplant works at L3; steer vector is null; effect is context-specific

The steer vector null result (E3 caveat above) was validated through PetriDish’s inject steer endpoint across all 28 layers, not just layer 14. The peak correlation was r = −0.190 at layer 3 (p = 0.147), which is not statistically significant, at the same layer where causal patching achieves peak recovery.

Development status

All 22 endpoints verified on Qwen/Qwen2.5-1.5B-Instruct with PyTorch 2.11.0+cu126 (RTX 3060).

Planned additions: direct logit attribution (DLA), linear probes, per-head DLA decomposition.
Frontend direction: undecided. The choice between a curriculum UI and a puzzle-game UI affects the SQLite persistence schema.