Delete the 'Not'. The Model Has No Idea.

Erase one token from the embedding layer and sentiment inverts completely. A writeup of PetriDish, the mechanistic interpretability workbench built to run experiments like this.

Erase “not” from the embedding layer (not from the text, from the internal vector) and the model forgets it was ever there. Given “The movie was not good. Overall I would say it was”, the model confidently finishes with “…wonderful.” Sentiment inverts completely. The surrounding context doesn’t save it. The model works with what’s in the residual stream, and “not” isn’t there anymore, so the architecture just moves on. This is not a bug. It’s exactly how it’s supposed to work.
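For readers who want to see the shape of the intervention, here is a minimal sketch of erasing one token's vector with a PyTorch forward hook. It is not PetriDish's code: `ToyLM`, `erase_position`, and the token ids are hypothetical stand-ins, the model is randomly initialized, and I am assuming "erase" means zeroing the embedding output at one position.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyLM(nn.Module):
    """Hypothetical stand-in for a causal LM: embedding -> mean pool -> logits."""
    def __init__(self, vocab=10, d_model=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, ids):
        x = self.embed(ids)            # (batch, seq, d_model)
        pooled = x.mean(dim=1)         # crude stand-in for context mixing
        return self.proj(pooled)       # (batch, vocab)

def erase_position(position):
    """Build a forward hook that zeroes the embedding output at one position."""
    def hook(module, inputs, output):
        patched = output.clone()
        patched[:, position, :] = 0.0
        return patched                 # a returned tensor replaces the module output
    return hook

model = ToyLM().eval()
ids = torch.tensor([[1, 2, 3, 4]])     # pretend the token at position 2 is "not"

with torch.no_grad():
    clean = model(ids)
    handle = model.embed.register_forward_hook(erase_position(2))
    erased = model(ids)                # same text, but one vector is gone
    handle.remove()
    restored = model(ids)              # hook removed, behavior is back to normal
```

The token ids never change. Only the vector the rest of the network sees at that position does, which is the distinction the experiment above relies on.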

1 token erased, sentiment flips
Layer 12: where subject identity lives
22 endpoints for looking inside

Running experiments like this requires getting inside a forward pass while it’s happening — intercepting activations, modifying them, comparing what changes. PetriDish is what I built for that. Load any Hugging Face causal language model, hook into its internals, run the surgery. Eight tools, twenty-two endpoints. FastAPI backend, SvelteKit frontend, runs on an RTX 3060 without much complaint.

The Causal Patcher transplants activations between forward passes. Run “The Eiffel Tower is in” and “The Statue of Liberty is in” simultaneously, patch the layer-12 residual stream from the first run into the second, and “Paris” jumps from rank ~50 to the top of the second prompt’s prediction. Subject identity is localized: you can physically move a fact from one sentence’s forward pass into another’s, and the model just accepts it.
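The patching step can be sketched with two forward hooks: one caches the residual stream on the source run, the other overwrites it on the target run. This is a sketch under assumptions, not the Causal Patcher itself: `ToyStack`, `Block`, and `run_with_patch` are hypothetical names, the blocks are MLP-only (no attention), and the choice of layer 1 and position 2 is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Block(nn.Module):
    """Toy residual block (MLP only; a real transformer block also has attention)."""
    def __init__(self, d_model):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.Tanh())

    def forward(self, x):
        return x + self.mlp(x)

class ToyStack(nn.Module):
    def __init__(self, vocab=16, d_model=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList([Block(d_model) for _ in range(n_layers)])
        self.unembed = nn.Linear(d_model, vocab)

    def forward(self, ids):
        x = self.embed(ids)
        for block in self.blocks:
            x = block(x)
        return self.unembed(x)         # (batch, seq, vocab)

def run_with_patch(model, source_ids, target_ids, layer, position):
    """Copy the residual stream at (layer, position) from the source run into the target run."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["resid"] = output.detach().clone()

    def patch_hook(module, inputs, output):
        patched = output.clone()
        patched[:, position, :] = cache["resid"][:, position, :]
        return patched

    block = model.blocks[layer]
    with torch.no_grad():
        handle = block.register_forward_hook(save_hook)
        source_logits = model(source_ids)
        handle.remove()

        handle = block.register_forward_hook(patch_hook)
        patched_logits = model(target_ids)
        handle.remove()

        target_logits = model(target_ids)   # unpatched baseline
    return source_logits, target_logits, patched_logits

model = ToyStack().eval()
source_ids = torch.tensor([[1, 2, 3, 4]])
target_ids = torch.tensor([[5, 6, 7, 4]])
source_logits, target_logits, patched_logits = run_with_patch(
    model, source_ids, target_ids, layer=1, position=2
)
```

Because this toy has no attention, the patched position ends up with exactly the source run's logits and the other positions are untouched. In a real transformer, later positions can attend to the patched one, so the effect spreads downstream.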

The Head Ablation panel kills attention heads one at a time. In layer 18, head 5 is responsible for copying your name from earlier in the prompt to the position where the model has to produce it. Ablate that one head and the model loses you entirely. Ablate the heads around it and nothing changes, because they weren’t doing that job.
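A minimal sketch of what ablating a head means mechanically: compute each head's output, zero the one you want to kill, then apply the output projection. The function below is a hand-rolled toy with random weights, not PetriDish's panel. The name `multi_head_attention` and its `ablate` parameter are hypothetical, and zero-ablation is an assumption (mean-ablation is a common alternative).

```python
import math
import torch

torch.manual_seed(0)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, ablate=()):
    """Causal self-attention over x of shape (seq, d_model); zeroes heads listed in `ablate`."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Split projections into heads: (n_heads, seq, d_head)
    q = (x @ w_q).view(seq, n_heads, d_head).transpose(0, 1)
    k = (x @ w_k).view(seq, n_heads, d_head).transpose(0, 1)
    v = (x @ w_v).view(seq, n_heads, d_head).transpose(0, 1)

    scores = q @ k.transpose(-1, -2) / math.sqrt(d_head)   # (n_heads, seq, seq)
    future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))     # causal mask
    pattern = scores.softmax(dim=-1)

    z = pattern @ v                                        # per-head outputs
    for h in ablate:
        z[h] = 0.0                                         # zero-ablate head h
    z = z.transpose(0, 1).reshape(seq, d_model)            # concatenate heads
    return z @ w_o

seq, d_model, n_heads = 5, 8, 4
x = torch.randn(seq, d_model)
w_q, w_k, w_v, w_o = (torch.randn(d_model, d_model) * 0.3 for _ in range(4))

full = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
without_head_1 = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, ablate=[1])
only_head_1 = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads, ablate=[0, 2, 3])
```

Head outputs add linearly through the output projection, so `full - without_head_1` is head 1's entire contribution. That additivity is why ablating one head can remove one behavior while its neighbors carry on unchanged.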

The QK Surgery panel lets you force heads to attend wherever you want. Force any head to stare at position 0 and context evaporates — the embeddings are all still present, the model just can’t route the information to where it needs to go. This turned out to be the same mechanism the whitespace sharpening research kept running into. The BOS sink at layer 0 isn’t a quirk of the architecture; it’s structural. The sharpening effect depends on it, and when you break it, the sharpening inverts.
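One simple way to get the "stare at position 0" effect is to overwrite a head's pre-softmax scores so that all attention weight lands on a single key. The sketch below does that for one toy head with random weights. `attention_head` and `force_position` are hypothetical names, and PetriDish may well implement the surgery differently (for example by editing the queries or keys directly).

```python
import math
import torch

torch.manual_seed(0)

def attention_head(x, w_q, w_k, w_v, force_position=None):
    """Single causal attention head; optionally force every query to attend to one key."""
    seq = x.shape[0]
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / math.sqrt(q.shape[-1])              # (seq, seq)
    future = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))     # causal mask
    if force_position is not None:
        # Replace the scores so softmax puts weight 1.0 on the chosen key.
        # Position 0 is visible to every query, so causality is preserved.
        forced = torch.full_like(scores, float("-inf"))
        forced[:, force_position] = 0.0
        scores = forced
    pattern = scores.softmax(dim=-1)
    return pattern, pattern @ v

seq, d_model, d_head = 6, 8, 4
x = torch.randn(seq, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) * 0.3 for _ in range(3))
v = x @ w_v

normal_pattern, normal_out = attention_head(x, w_q, w_k, w_v)
forced_pattern, forced_out = attention_head(x, w_q, w_k, w_v, force_position=0)
```

In the forced case every position receives the same output, the value vector at position 0. The other value vectors are still computed, but nothing routes them anywhere.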

The dead end worth mentioning: steering vectors work cleanly for simple style axes. You can dial politeness, formality, or tone like a knob. They fall apart for anything context-dependent. The whitespace work showed this clearly: the merge effect doesn’t define a consistent geometric direction, so the naive approach (compute the mean difference vector, inject it) produces a clean null. I found this out the same way I find most things, by building a tool that makes the failure legible and then staring at it until something makes sense.
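The naive approach described above fits in a few lines. This is a sketch with made-up data: `mean_difference_vector` and `steering_hook` are hypothetical names, a single `nn.Linear` stands in for one layer of a model, and the "positive" and "negative" inputs are random tensors rather than real prompt activations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def mean_difference_vector(pos_acts, neg_acts):
    """Mean activation on 'positive' examples minus mean on 'negative' ones."""
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def steering_hook(vector, alpha):
    """Build a forward hook that adds alpha * vector to a layer's output."""
    def hook(module, inputs, output):
        return output + alpha * vector
    return hook

d_model = 8
layer = nn.Linear(d_model, d_model)            # hypothetical stand-in for one layer
pos_inputs = torch.randn(32, d_model) + 1.0    # synthetic "positive" examples
neg_inputs = torch.randn(32, d_model) - 1.0    # synthetic "negative" examples

with torch.no_grad():
    vec = mean_difference_vector(layer(pos_inputs), layer(neg_inputs))

    x = torch.randn(4, d_model)
    clean = layer(x)
    handle = layer.register_forward_hook(steering_hook(vec, alpha=2.0))
    steered = layer(x)                         # every row shifted by 2.0 * vec
    handle.remove()
```

The method assumes the two groups differ along one shared direction. If the per-example differences point in inconsistent directions, they tend to cancel in the mean, which is one way this approach can come back null.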

The whitespace series ran on PetriDish. The causal patching results, the layer-3 peak, the steering vector null: all of it went through these endpoints. Which is probably the honest reason I built it in the first place.

Full paper →