The Effect Lives at Layer 3. You Can Move It. You Can't Steer It.

Causal patching at layer 3 recovers the whitespace effect cleanly. Steer vectors at the same layer get nothing. The dissociation tells you something about how the effect is stored.

Transplant the exact hidden state from the modified sentence into the original forward pass at layer 3 and the whitespace effect propagates: peak recovery, clean, across 20 pairs. Inject an averaged merge direction at the same layer and you get r = −0.190, p = 0.147, which is not distinguishable from zero at the usual 0.05 threshold. Same layer, same information in principle, and one method works while the other doesn’t.
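The transplant itself can be sketched on a toy model. This is a minimal illustration and not the code behind the result above: the residual stack, the causal running-mean mixing, the helper names `forward` and `recovery`, the recovery metric, and the assumption that both runs have the same sequence length are all made up for the example.

```python
import numpy as np

def forward(h0, weights, patch=None):
    """Run a toy residual stack over hidden states of shape (T, d).

    patch = (layer, positions, states) overwrites the hidden states entering
    `layer` at `positions` before that layer runs (the transplant).
    Returns the final states and the cached input to every layer.
    """
    h = h0.copy()
    cache = []
    steps = np.arange(1, len(h) + 1)[:, None]
    for i, W in enumerate(weights):
        if patch is not None and patch[0] == i:
            _, positions, states = patch
            h[positions] = states
        cache.append(h.copy())
        ctx = np.cumsum(h, axis=0) / steps  # causal running mean (toy mixing)
        h = h + np.tanh(ctx @ W)
    return h, cache

def recovery(patched, original, modified):
    """Fraction of the original-to-modified output gap closed by the patch."""
    gap = modified - original
    return float(np.dot(patched - original, gap) / np.dot(gap, gap))

rng = np.random.default_rng(0)
T, d, n_layers, seam, layer = 6, 8, 6, 2, 3  # hypothetical toy sizes
weights = [rng.normal(scale=0.3, size=(d, d)) for _ in range(n_layers)]

h_orig = rng.normal(size=(T, d))
h_mod = h_orig.copy()
h_mod[seam] += rng.normal(size=d)  # stand-in for the whitespace edit

out_orig, _ = forward(h_orig, weights)
out_mod, cache_mod = forward(h_mod, weights)

# Transplant only the seam position's layer-3 state from the modified run.
out_seam, _ = forward(h_orig, weights,
                      patch=(layer, seam, cache_mod[layer][seam]))
# Transplant every position: this reproduces the modified run.
out_full, _ = forward(h_orig, weights,
                      patch=(layer, slice(None), cache_mod[layer]))

print("seam-only recovery:", recovery(out_seam[-1], out_orig[-1], out_mod[-1]))
print("full-state recovery:", recovery(out_full[-1], out_orig[-1], out_mod[-1]))
```

Patching every position at a layer reproduces the modified run by construction, so it is the sanity check. The seam-only patch is the interesting number, because it measures how much of the effect passes through that one position.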

Peak causal patch recovery: layer 3 (L3)
Steer vector correlation at L3: r = −0.19
Layers where the steer vector actually worked: 0

The steer vector approach is straightforward: take 100 training pairs, compute the mean difference between merged and original hidden states at the seam position, normalize it, inject it into held-out prompts. If the merge operator moved the hidden state along a consistent universal direction, this should reproduce the whitespace robustness effect — but it doesn’t.
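The recipe above, written out as a minimal sketch. The function names, the injection scale `alpha`, and the synthetic data are assumptions for illustration. The toy data deliberately plants a shared direction, which is the situation the method needs in order to work.

```python
import numpy as np

def steer_vector(h_original, h_merged):
    """Unit-norm mean difference of seam-position hidden states, shape (d,)."""
    mean_diff = (h_merged - h_original).mean(axis=0)
    norm = np.linalg.norm(mean_diff)
    if norm == 0.0:
        raise ValueError("mean difference is zero; no direction to normalize")
    return mean_diff / norm

def inject(h, v, alpha=1.0):
    """Add alpha * v to one hidden state (d,) or a batch (n, d).

    alpha is a hypothetical scale; the post does not state one.
    """
    return h + alpha * v

# Synthetic check in which a shared direction really exists.
rng = np.random.default_rng(0)
n_pairs, d = 100, 16
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
h_original = rng.normal(size=(n_pairs, d))
h_merged = h_original + 2.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, d))

v = steer_vector(h_original, h_merged)
held_out = rng.normal(size=(5, d))
steered = inject(held_out, v, alpha=2.0)
print("cosine with planted direction:", float(v @ true_dir))
```

When every pair shifts the same way, the mean difference recovers the planted direction almost exactly. The null result says the real hidden states do not behave like this toy.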

The merge effect doesn’t define a direction. It defines a cloud.

The reason the transplant works and the average doesn’t is the same reason token-level features failed to predict K back in the first experiment: the effect is not a property of the tokens, it’s a property of the context. Each sentence’s merge effect on the hidden state is geometrically idiosyncratic — the direction from h_original to h_modified varies across sentences in a way that averages to noise. The mean-difference vector points roughly nowhere.
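One way to check the "cloud, not direction" claim is to measure how well the per-sentence difference vectors agree with each other. The sketch below is an assumed diagnostic, not the analysis from the paper. It reports the length of the mean unit difference (near 1 for a shared axis, near 0 for a cloud) and the mean pairwise cosine. The data is synthetic and the name `direction_consistency` is hypothetical.

```python
import numpy as np

def direction_consistency(h_original, h_modified):
    """Return (resultant length, mean pairwise cosine) of per-pair shifts.

    Both inputs have shape (n_pairs, d). Each shift is normalized first, so
    only its direction counts, not its size.
    """
    diffs = h_modified - h_original
    units = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    n = len(units)
    resultant = float(np.linalg.norm(units.mean(axis=0)))
    cos = units @ units.T
    mean_pairwise = float((cos.sum() - np.trace(cos)) / (n * (n - 1)))
    return resultant, mean_pairwise

rng = np.random.default_rng(0)
n_pairs, d = 100, 64
h_original = rng.normal(size=(n_pairs, d))

# Case 1: every pair moves along one shared axis, plus a little noise.
axis = rng.normal(size=d)
axis /= np.linalg.norm(axis)
h_axis = h_original + 2.0 * axis + 0.05 * rng.normal(size=(n_pairs, d))

# Case 2: every pair moves in its own unrelated direction (a cloud).
h_cloud = h_original + rng.normal(size=(n_pairs, d))

axis_stats = direction_consistency(h_original, h_axis)
cloud_stats = direction_consistency(h_original, h_cloud)
print("shared axis:", axis_stats)
print("cloud:      ", cloud_stats)
```

In the cloud case the mean of the unit shifts is short, so normalizing it produces a unit vector that reflects sampling noise more than any real axis.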

There was one partial signal worth mentioning: layer 2, anti-direction, r = +0.285. The merge direction at layer 2, injected in the opposite sign, has weak predictive power. That sits right below the layer 3 peak and inside the same 0–3 window where the BOS-anchoring circuit lives. We’re noting it, not claiming it. It’s weak, directionally backwards from what you’d expect, and doesn’t generalize.

The causal evidence is intact. The effect is real, it’s localized to the early residual stream, and it propagates forward from there — the transplant confirms all of that. The steer result just tells you how it’s stored: not as a universal axis you can find and push, but as a property of how each specific contextual representation interacts with the architecture at that layer. You can replay it. You can’t extract it.

To actually steer whitespace robustness you’d need context-conditioned vectors, direct causal patching, or a nonlinear probe trained to predict K from hidden states and then push toward high-K regions. None of those are trivial. The naive version is a clean null.
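To make "context-conditioned" concrete, here is one possible reading, offered as an assumption and not as a tested method: instead of one global mean, average the shifts of the training pairs whose original hidden states are closest to the prompt being steered. The nearest-neighbour rule, the value of `k`, the function names, and the synthetic shift function are all hypothetical.

```python
import numpy as np

def knn_conditioned_vector(h_query, h_train, diff_train, k=5):
    """Unit vector averaged over the shifts of the k nearest training states."""
    dists = np.linalg.norm(h_train - h_query, axis=1)
    nearest = np.argsort(dists)[:k]
    v = diff_train[nearest].mean(axis=0)
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d, n_train, n_test = 4, 500, 50
A = rng.normal(scale=0.5, size=(d, d))

def shift(h):
    """Synthetic shift that depends on the context h (toy assumption)."""
    return np.tanh(h @ A)

h_train = rng.normal(size=(n_train, d))
diff_train = shift(h_train)
h_test = rng.normal(size=(n_test, d))
diff_test = shift(h_test)

# Baseline: the single global mean-difference direction.
global_v = diff_train.mean(axis=0)
global_v /= np.linalg.norm(global_v)

knn_score = float(np.mean([
    cosine(knn_conditioned_vector(h, h_train, diff_train), t)
    for h, t in zip(h_test, diff_test)
]))
global_score = float(np.mean([cosine(global_v, t) for t in diff_test]))
print("mean cosine, kNN-conditioned:", knn_score)
print("mean cosine, global mean:    ", global_score)
```

The toy only shows that conditioning can help when the shift is a smooth function of the original state. Whether real hidden states have that smoothness at layer 3 is exactly what is not yet established.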

The three posts in this series now form a reasonably complete picture: the BOS-anchoring circuit explains the sharpening effect mechanistically, Pythia inverts the whole thing in a way we can’t fully explain, and the effect is stored in a way that resists extraction. What the architecture is doing at that seam is real and localized. It just doesn’t want to be moved.

