Overview
PetriDish is a FastAPI backend + SvelteKit frontend for probing transformer internals in real time. Load any HuggingFace causal language model, hook into its forward pass, and run surgical experiments — activation grafting, attention head ablation, residual stream projection, embedding surgery, steer vector injection — without modifying weights.
Primary test model: Qwen/Qwen2.5-1.5B-Instruct (28 layers, 12 query heads sharing 2 KV heads via grouped-query attention, ~1.5B parameters)
Hardware: RTX 3060 12GB, PyTorch 2.11.0+cu126
Backend: FastAPI + nnsight + transformers + sae-lens
Frontend: SvelteKit (in active development)
Architecture
PetriDish/
├── backend/
│   ├── main.py                  FastAPI app, CORS, DB init, WebSocket router
│   ├── microscope_router.py     Model load, forward pass, logit lens, attention heatmaps
│   ├── slicer_router.py         PCA residual projection, causal activation grafting
│   ├── tweezers_router.py       Token-by-token generation, steer vectors, causal patching, head ablation
│   ├── manipulation_router.py   Concept direction ablate/amplify, QK surgery, position remap, SVD filter
│   ├── token_touch_router.py    Input embedding surgery (erase, scale, split)
│   ├── sae_router.py            Sparse autoencoder feature discovery and steering
│   ├── inversion_router.py      Soft prompt optimisation, GCG adversarial suffix (streamed SSE)
│   └── db.py                    SQLite models (Experiment, TrajectoryNode)
└── frontend/                    SvelteKit app
Hook pattern
All hook-based endpoints use register_forward_hook with mandatory cleanup:
    handle = model.model.layers[layer].register_forward_hook(hook_fn)
    try:
        with torch.no_grad():
            output = model.generate(...)
    finally:
        # Always detach the hook, even if generation raises.
        handle.remove()
A leaked hook corrupts all subsequent forward passes. The try/finally pattern is non-negotiable.
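The cleanup rule can be wrapped in a context manager so that call sites cannot forget it. The following is a minimal sketch, not PetriDish's actual code: `forward_hook` and `zero_output` are hypothetical names, and a single `nn.Linear` stands in for a transformer layer.

```python
from contextlib import contextmanager

import torch
from torch import nn


@contextmanager
def forward_hook(module: nn.Module, hook_fn):
    """Register a forward hook and guarantee its removal on exit."""
    handle = module.register_forward_hook(hook_fn)
    try:
        yield handle
    finally:
        handle.remove()


# Toy stand-in for one transformer layer (assumption: not a real model).
layer = nn.Linear(4, 4)
x = torch.ones(1, 4)


def zero_output(module, inputs, output):
    # Returning a tensor from a forward hook replaces the module's output.
    return torch.zeros_like(output)


with torch.no_grad():
    baseline = layer(x)
    with forward_hook(layer, zero_output):
        hooked = layer(x)
    after = layer(x)  # hook has been removed, so this matches the baseline
```

Because removal happens in `finally`, an exception raised inside the `with` block still leaves the module clean.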
Logit computation
outputs.logits returns NaN for Qwen2.5 due to a missing norm step. All endpoints compute logits as:
normed = model.model.norm(hidden_states[-1])
logits = model.lm_head(normed)
This applies regardless of which router is running the forward pass.
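For readers who want to see what those two lines compute, here is a small NumPy sketch of the same norm-then-unembed step. Qwen2.5 uses RMSNorm as its final norm; the shapes, the `eps` value, and the random weights below are illustrative assumptions, and `rms_norm` and `logits_from_hidden` are hypothetical helper names.

```python
import numpy as np


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over the last axis: x / sqrt(mean(x^2) + eps) * weight."""
    variance = np.mean(x ** 2, axis=-1, keepdims=True)
    return x / np.sqrt(variance + eps) * weight


def logits_from_hidden(hidden, norm_weight, unembed):
    """hidden: (seq, d_model); unembed: (vocab, d_model). Returns (seq, vocab)."""
    return rms_norm(hidden, norm_weight) @ unembed.T


# Illustrative sizes and random weights (assumption: not real model weights).
rng = np.random.default_rng(0)
d_model, vocab = 8, 20
hidden = rng.normal(size=(3, d_model))
norm_weight = np.ones(d_model)
unembed = rng.normal(size=(vocab, d_model))

logits = logits_from_hidden(hidden, norm_weight, unembed)
```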
Tool inventory
| Tool | Endpoints | Primary operations |
|---|---|---|
| Microscope | 3 | Model load, forward pass, logit lens across all layers |
| Slicer | 4 | PCA residual stream projections, causal activation grafting |
| Tweezers | 5 | Token-by-token generation control, steer vectors, causal patching, head ablation |
| Manipulation | 4 | Concept direction ablate/amplify, QK surgery, position remapping, SVD filtering |
| Token Touch | 2 | Input embedding surgery (erase, scale, split tokens) |
| SAE | 2 | Sparse autoencoder feature discovery and steering (Gemma-Scope compatible) |
| Inversion | 2 | Soft prompt optimisation, GCG adversarial suffix search (streamed) |
| Chat | 1 (WS) | Ollama-backed chat with branching conversation trees saved to SQLite |
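The branching conversation trees in the Chat row can be pictured as a parent-linked table. This is a sketch using the standard-library `sqlite3` module; the table name, the columns, and the helpers `add_node` and `branch_history` are assumptions for illustration and are not taken from db.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: each node points at its parent, so siblings are branches.
conn.execute(
    """
    CREATE TABLE trajectory_node (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES trajectory_node(id),
        role TEXT NOT NULL,
        content TEXT NOT NULL
    )
    """
)


def add_node(parent_id, role, content):
    cur = conn.execute(
        "INSERT INTO trajectory_node (parent_id, role, content) VALUES (?, ?, ?)",
        (parent_id, role, content),
    )
    return cur.lastrowid


def branch_history(node_id):
    """Walk parent links from a leaf to the root; return messages oldest first."""
    messages = []
    while node_id is not None:
        parent_id, role, content = conn.execute(
            "SELECT parent_id, role, content FROM trajectory_node WHERE id = ?",
            (node_id,),
        ).fetchone()
        messages.append((role, content))
        node_id = parent_id
    return messages[::-1]


root = add_node(None, "user", "Hello")
reply_a = add_node(root, "assistant", "Hi there.")
reply_b = add_node(root, "assistant", "Good day.")  # sibling branch of reply_a
```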
Canonical experiments
The following eight experiments are the validated baseline for confirming a working stack. Each has a specific expected output; deviations are diagnostic.
E1 — The Capital Lens
Tool: Microscope → Logit Lens
Prompt: The capital of France is
Hover over the final-token column from layer 0 through layer 27. “Paris” enters the top-5 around layer 18–20 and locks in by layer 24. Earlier layers show generic completions (“the”, “a”, “located”).
Finding: Factual recall is not retrieval — it’s construction. The fact crystallizes progressively through mid-to-late layers.
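The "enters the top-5" reading can be made precise by tracking the target token's rank at each layer. The sketch below assumes the per-layer logits have already been computed; the synthetic logits and the helper names `token_rank_by_layer` and `first_layer_in_top_k` are illustrative only.

```python
import numpy as np


def token_rank_by_layer(layer_logits, token_id):
    """layer_logits: (n_layers, vocab). Rank 0 means the token is the top prediction."""
    target = layer_logits[:, token_id][:, None]
    return (layer_logits > target).sum(axis=-1)


def first_layer_in_top_k(ranks, k=5):
    """Index of the first layer whose rank is below k, or None if there is none."""
    hits = np.flatnonzero(ranks < k)
    return int(hits[0]) if hits.size else None


# Synthetic logits (assumption): the target token's logit grows with depth
# while every other token keeps a fixed logit equal to its id.
n_layers, vocab, target_id = 6, 10, 3
layer_logits = np.tile(np.arange(vocab, dtype=float), (n_layers, 1))
layer_logits[:, target_id] = np.arange(n_layers) * 2.5

ranks = token_rank_by_layer(layer_logits, target_id)
entry_layer = first_layer_in_top_k(ranks, k=5)
```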
E2 — Top-K Roulette
Tool: Tweezers → Token Surgery
Prompt: Once upon a time, there was a
Run greedy for two steps, then deliberately select the token ranked 7th or lower. Continue greedy from that point.
Finding: Generation is a cascade. One off-path token selection sends the continuation down a structurally different trajectory. The “correct” output is always one bad choice away from becoming something else.
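The procedure amounts to greedy decoding with one forced off-path pick. Here is a sketch under a loud assumption: `toy_model` is a random lookup table standing in for a language model, and `pick_by_rank` and `generate` are hypothetical helpers rather than Tweezers endpoints.

```python
import numpy as np


def pick_by_rank(logits, rank=0):
    """Return the token id at the given rank (0 = greedy choice)."""
    order = np.argsort(-logits, kind="stable")
    return int(order[rank])


def generate(next_logits, prompt, steps, forced=None):
    """Greedy decode, except at the step indices in `forced` (step -> rank)."""
    forced = forced or {}
    tokens = list(prompt)
    for step in range(steps):
        logits = next_logits(tokens)
        tokens.append(pick_by_rank(logits, forced.get(step, 0)))
    return tokens


# Toy stand-in for a model (assumption): logits depend only on the last token.
rng = np.random.default_rng(0)
table = rng.normal(size=(12, 12))


def toy_model(tokens):
    return table[tokens[-1]]


greedy = generate(toy_model, [0], steps=6)
# Two greedy steps, then the 7th-ranked token (rank index 6) at the third step.
detour = generate(toy_model, [0], steps=6, forced={2: 6})
```

The two sequences share the prompt and the first two generated tokens, then differ at the forced step. Whether they stay apart afterwards depends on the model; the toy table makes no claim about that.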
E3 — The Politeness Vector
Tool: Tweezers → Steer Vector, then Inject Steer
Layer: 14
Positive set: formal, deferential phrasings (“Please could you kindly help me…”). Negative set: terse imperatives (“Just do it already”). Compute the mean-difference vector, normalize, inject at alpha = ±3.0.
Finding: Style is linearly separable in residual space for simple axes. Alpha = +3.0 produces formal, deferential output. Alpha = −3.0 produces short and blunt. The direction generalizes across prompts.
Caveat: This works for style but not for context-dependent effects. The Whitespace Mechanics series (Parts 1–3) found that whitespace robustness is not an extractable direction: the merge effect is geometrically idiosyncratic per sentence. Steer vectors are a clean null there (r = −0.190 at the peak layer). See Whitespace Mechanics Part 3.
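The mean-difference recipe in E3 can be sketched in a few lines of NumPy. The activations below are synthetic, and adding the vector at every sequence position is an assumption about how injection is applied; `steer_vector` and `inject` are hypothetical names.

```python
import numpy as np


def steer_vector(pos_acts, neg_acts):
    """Unit-norm mean-difference direction from two (n_examples, d_model) arrays."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def inject(hidden, direction, alpha):
    """Add alpha * direction to every position of a (seq, d_model) hidden state."""
    return hidden + alpha * direction


# Synthetic activations (assumption): the two sets differ mainly along axis 0.
rng = np.random.default_rng(0)
d_model = 16
axis = np.zeros(d_model)
axis[0] = 1.0
pos = rng.normal(size=(32, d_model)) + 2.0 * axis
neg = rng.normal(size=(32, d_model)) - 2.0 * axis

v = steer_vector(pos, neg)
hidden = rng.normal(size=(5, d_model))
steered = inject(hidden, v, alpha=3.0)
```

Because `v` has unit norm, `alpha` is directly the distance each position moves along the direction, which keeps alpha values comparable across layers.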
E4 — Kill the Name Head
Tool: Tweezers → Head Ablation
Prompt: My name is Sarah. Nice to meet you. What is my name? My name is
Ablate layer 18, head 5 (scale = 0). Baseline outputs “Sarah.” Post-ablation: the model loses the name. Ablating heads 4, 6, 7 in the same layer produces no meaningful change.
Finding: Induction/copying is localized. One specific head is responsible for propagating named entity information from earlier in the context to the generation point. This head can be surgically removed without affecting general coherence.
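One common way to implement "scale = 0" is to rescale a single head's slice of the concatenated attention output before the output projection. The sketch below shows that slicing with NumPy; whether PetriDish intervenes at exactly this point is an assumption, and `scale_head` is a hypothetical helper.

```python
import numpy as np


def scale_head(attn_out, n_heads, head, scale=0.0):
    """attn_out: (seq, n_heads * head_dim), the concatenated per-head outputs
    before the output projection. Returns a copy with one head rescaled."""
    seq, width = attn_out.shape
    per_head = attn_out.reshape(seq, n_heads, width // n_heads).copy()
    per_head[:, head, :] *= scale
    return per_head.reshape(seq, width)


# Illustrative sizes (assumption): 4 heads of dimension 3, 2 positions.
n_heads, head_dim = 4, 3
attn_out = np.arange(2 * n_heads * head_dim, dtype=float).reshape(2, -1) + 1.0
ablated = scale_head(attn_out, n_heads, head=1, scale=0.0)
```

Only columns 3 to 5 (head 1) change; the other heads pass through untouched, which is what makes the comparison against neighbouring heads meaningful.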
E5 — Patch the Subject
Tool: Slicer → Causal Graft
Source: The Eiffel Tower is in
Target: The Statue of Liberty is in
Patch: Layer 12, final token position
Baseline target top-10: “New York”, “Manhattan”, “Washington”. Post-patch: “Paris” moves from rank ~50 to top-3. “New York” drops significantly.
Finding: Subject identity is encoded in the residual stream at a specific layer. Transplanting the hidden state from one subject’s forward pass into another’s physically relocates the associated factual content. This is the same mechanism used in the Whitespace Mechanics causal patching experiments — the whitespace effect peaks at layer 3 using the same transplant method.
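The transplant itself is a cache-then-overwrite operation. The sketch below uses a toy residual stack whose layers act on each position independently (no attention), so it illustrates only the mechanics of patching, not the factual-recall result; `run_layers` and all sizes are assumptions.

```python
import numpy as np


def run_layers(weights, hidden, patch=None):
    """Run a toy residual stack and return the state after every layer.
    `patch` is (layer, position, vector): the residual stream at that layer's
    output is overwritten at that position before the next layer runs."""
    states = []
    for i, w in enumerate(weights):
        hidden = hidden + np.tanh(hidden @ w)  # new array each layer
        if patch is not None and patch[0] == i:
            hidden[patch[1]] = patch[2]
        states.append(hidden)
    return states


rng = np.random.default_rng(0)
d_model, n_layers, layer = 8, 4, 1
weights = [rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(n_layers)]
source = rng.normal(size=(5, d_model))  # stand-in for the source prompt
target = rng.normal(size=(6, d_model))  # stand-in for the target prompt

source_states = run_layers(weights, source)
baseline = run_layers(weights, target)
patched = run_layers(
    weights, target, patch=(layer, -1, source_states[layer][-1])
)
```

In a real transformer the patched vector would also influence later positions through attention; in this position-wise toy it simply carries the source's final-position state through the remaining layers.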
E6 — Erase “not”
Tool: Token Touch → Erase
Prompt: The movie was not good. Overall I would say it was
Find the “not” token in the embedding layer. Set op = Erase. Compare baseline vs. erased continuation.
Baseline: “disappointing” / “bad” / “terrible”
Erased: “great” / “enjoyable” / “wonderful”
Finding: Negation lives in a single token embedding. The model cannot reconstruct semantic negation from surrounding context — once the vector is gone, the sentence’s meaning inverts. The architecture has no backward pass to notice something is missing.
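As a sketch of what embedding surgery does to the input, the snippet below edits one row of a (seq, d_model) embedding matrix. Treating "erase" as replacing the vector with zeros is an assumption, the "split" operation is omitted because its semantics are not described here, and `edit_embeddings` is a hypothetical helper.

```python
import numpy as np


def edit_embeddings(embeds, position, op, factor=1.0):
    """Return a copy of (seq, d_model) input embeddings with one position edited."""
    out = embeds.copy()
    if op == "erase":
        out[position] = 0.0  # assumption: erase means a zero vector
    elif op == "scale":
        out[position] *= factor
    else:
        raise ValueError(f"unsupported op: {op!r}")
    return out


# Word-level tokens for illustration; a real tokenizer may split differently.
tokens = ["The", "movie", "was", "not", "good"]
rng = np.random.default_rng(0)
embeds = rng.normal(size=(len(tokens), 4))

position = tokens.index("not")
erased = edit_embeddings(embeds, position, "erase")
doubled = edit_embeddings(embeds, position, "scale", factor=2.0)
```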
E7 — The Concept Cloud
Tool: Slicer → PCA Residual Projection
Prompt: king queen man woman prince princess uncle aunt brother sister apple banana car truck
Freeze a PCA slice at layer 4. Freeze again at layer 16. Compare.
Layer 4: Tokens are largely unstructured with no semantic clustering.
Layer 16: Clean gender axis visible (king↔queen parallel to man↔woman), fruits cluster separately from vehicles, royalty clusters together.
Finding: Semantic geometry — word2vec-style analogies as literal geometric structure — emerges with depth and is not present in early layers.
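A PCA slice of the residual stream can be computed with a centred SVD. The sketch below runs on synthetic hidden states with two planted clusters; it does not reproduce the layer 4 versus layer 16 comparison, and `pca_project` is a hypothetical helper.

```python
import numpy as np


def pca_project(hidden, n_components=2):
    """Project (n_tokens, d_model) states onto their top principal components.
    Returns the coordinates and the explained-variance ratio per component."""
    centered = hidden - hidden.mean(axis=0, keepdims=True)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:n_components].T
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return coords, explained[:n_components]


# Synthetic states (assumption): two groups of 7 tokens offset along one axis.
rng = np.random.default_rng(0)
d_model, per_group = 32, 7
offset = np.zeros(d_model)
offset[0] = 3.0
group_a = rng.normal(scale=0.1, size=(per_group, d_model)) + offset
group_b = rng.normal(scale=0.1, size=(per_group, d_model)) - offset
hidden = np.vstack([group_a, group_b])

coords, explained = pca_project(hidden, n_components=2)
```

The sign of each principal component is arbitrary, so plots from different layers should be compared by structure (which tokens cluster together) rather than by orientation.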
E8 — Sink the Attention
Tool: Manipulation → QK Surgery
Prompt: The quick brown fox jumps over the lazy dog. The fox was
Layer: 20, Head: 8, Operation: Sink, Sink position: 0
Baseline top tokens: “quick”, “brown”, “fast”, “clever”. Post-sink: distribution flattens, generic tokens dominate (“a”, “very”, “the”).
Finding: Forcing a head to attend to the BOS token (position 0) severs the information channel that routes contextual content to the prediction. The token embeddings are still present; the model simply cannot access them. This is the same mechanism as the BOS-anchoring circuit in the whitespace sharpening research — head 22 at layer 0 acts as a permanent BOS sink, and ablating its sink behavior inverts the sharpening effect (recovery drops from 0.994 to −0.780). See Whitespace Mechanics Part 1.
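The Sink operation can be pictured as overriding one head's attention scores so that every query puts all of its weight on a single key position. Below is a single-head NumPy sketch; the override mechanism shown is an assumption about how QK surgery is implemented, and `causal_attention` is a hypothetical helper.

```python
import numpy as np


def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def causal_attention(q, k, sink_position=None):
    """Attention weights for one head. If sink_position is set, every query
    is forced to put all of its weight on that key position."""
    seq, head_dim = q.shape
    scores = q @ k.T / np.sqrt(head_dim)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    if sink_position is not None:
        scores = np.full((seq, seq), -np.inf)
        scores[:, sink_position] = 0.0
    return softmax(scores)


rng = np.random.default_rng(0)
seq, head_dim = 4, 8
q = rng.normal(size=(seq, head_dim))
k = rng.normal(size=(seq, head_dim))
v = rng.normal(size=(seq, head_dim))

base = causal_attention(q, k)
sunk = causal_attention(q, k, sink_position=0)
head_output = sunk @ v  # every row is now a copy of v[0]
```

With the sink at position 0 the head's output is the same vector at every position, so it no longer carries any position-specific context forward.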
Connection to Whitespace Mechanics series
PetriDish was built to support the Whitespace Mechanics research. Specific endpoints used:
| Paper | PetriDish endpoint | Finding |
|---|---|---|
| Part 1: BOS circuit | Tweezers → head ablation, QK surgery | Layer 0 head 22 is a BOS sink; ablating it inverts sharpening |
| Part 2: Pythia anomaly | Slicer → causal graft | Early residual stream (layers 0–3) carries whitespace signal |
| Part 3: Context entanglement | Slicer → causal graft, Tweezers → steer vector | Transplant works at L3; steer vector is null; effect is context-specific |
The steer vector null result (E3 caveat above) was validated through PetriDish’s inject steer endpoint across all 28 layers, not just layer 14. The peak correlation was r = −0.190 at layer 3 (p = 0.147), which is not statistically significant, at the same layer where causal patching achieves peak recovery.
Development status
All 22 HTTP endpoints listed in the tool inventory (the Chat WebSocket is counted separately) verified on Qwen/Qwen2.5-1.5B-Instruct with PyTorch 2.11.0+cu126 (RTX 3060).
Planned additions: direct logit attribution (DLA), linear probes, per-head DLA decomposition.
Frontend direction: undecided. The choice between a curriculum UI and a puzzle-game UI affects the SQLite persistence schema.