Transplant the exact hidden state from the modified sentence into the original forward pass at layer 3 and the whitespace effect propagates: peak recovery, clean, across 20 pairs. Inject an averaged merge direction at the same layer and you get r = −0.190, p = 0.147. Statistical nothing. Same layer, same information in principle, and one method works while the other doesn't.
The steer vector approach is straightforward: take 100 training pairs, compute the mean difference between merged and original hidden states at the seam position, normalize it, inject it into held-out prompts. If the merge operator moved the hidden state along a consistent universal direction, this should reproduce the whitespace robustness effect. It doesn't.
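For concreteness, here is a minimal sketch of that recipe in NumPy. It runs on synthetic arrays rather than real activations. The hidden size of 64, the `alpha` scale, and the function names are illustrative assumptions, not the code used for the experiment.

```python
import numpy as np

def steer_vector(h_original, h_merged):
    """Unit-norm mean difference between merged and original hidden states.

    Both inputs have shape (n_pairs, d_model): one seam-position hidden
    state per training pair, taken at the layer of interest.
    """
    mean_diff = (h_merged - h_original).mean(axis=0)
    norm = np.linalg.norm(mean_diff)
    if norm == 0.0:
        raise ValueError("mean difference is zero; no direction to normalize")
    return mean_diff / norm

def inject(h, direction, alpha=1.0):
    """Add alpha * direction to a hidden state (alpha is a hypothetical scale)."""
    return h + alpha * direction

# Synthetic stand-in data: 100 training pairs, d_model = 64 (both assumed).
rng = np.random.default_rng(0)
h_orig = rng.normal(size=(100, 64))
h_merg = h_orig + rng.normal(size=(100, 64))  # idiosyncratic per-pair shifts

v = steer_vector(h_orig, h_merg)
h_steered = inject(rng.normal(size=64), v, alpha=4.0)
```

In a real run the two arrays would be cached seam-position activations, and `inject` would be applied inside the forward pass on held-out prompts.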
The merge effect doesn't define a direction. It defines a cloud.
The reason the transplant works and the average doesn't is the same reason token-level features failed to predict K back in the first experiment: the effect is not a property of the tokens, it's a property of the context. Each sentence's merge effect on the hidden state is geometrically idiosyncratic: the direction from h_original to h_modified varies across sentences in a way that averages to noise. The mean-difference vector points roughly nowhere.
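One way to put a number on "direction versus cloud" is the length of the mean of the unit-normalized difference vectors. It is close to 1 when every pair shifts the same way and on the order of 1/sqrt(n) when the shifts are unrelated. The sketch below uses synthetic data, and the metric is our choice for illustration, not something reported in the experiment.

```python
import numpy as np

def direction_consistency(h_original, h_modified):
    """Length of the mean unit difference vector, in [0, 1].

    Assumes no pair has an exactly zero difference (it would divide by zero).
    """
    deltas = h_modified - h_original
    units = deltas / np.linalg.norm(deltas, axis=1, keepdims=True)
    return float(np.linalg.norm(units.mean(axis=0)))

rng = np.random.default_rng(0)
n, d = 100, 64                      # assumed pair count and hidden size
h = rng.normal(size=(n, d))

# Case 1: every pair moves along one shared axis, plus a little noise.
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)
consistent = direction_consistency(
    h, h + 3.0 * shared + 0.1 * rng.normal(size=(n, d))
)

# Case 2: every pair moves in its own unrelated direction.
cloud = direction_consistency(h, h + rng.normal(size=(n, d)))

print(f"shared axis: {consistent:.3f}   cloud: {cloud:.3f}")
```

A mean-difference steer vector only carries signal in the first regime. In the second, normalizing the mean just rescales noise to unit length.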
There was one partial signal worth mentioning: layer 2, anti-direction, r = +0.285. The merge direction at layer 2, injected in the opposite sign, has weak predictive power. That sits right below the layer 3 peak and inside the same 0–3 window where the BOS-anchoring circuit lives. We're noting it, not claiming it. It's weak, directionally backwards from what you'd expect, and doesn't generalize.
The causal evidence is intact. The effect is real, it's localized to the early residual stream, and it propagates forward from there; the transplant confirms all of that. The steer result just tells you how it's stored: not as a universal axis you can find and push, but as a property of how each specific contextual representation interacts with the architecture at that layer. You can replay it. You can't extract it.
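To make "replay" concrete, here is a toy version of the transplant. A small causal residual stack stands in for the transformer, and a fixed lower-triangular averaging matrix stands in for attention. Everything here (the model, the sizes, the seam position, the recovery formula) is an assumption for illustration. Only the procedure mirrors the experiment: cache the modified run's seam state at layer 3, write it into the original run, and compare outputs.

```python
import numpy as np

def forward(x, weights, mix, patch=None):
    """Toy causal residual stack.

    x: (T, d) inputs; weights: list of (d, d) matrices; mix: (T, T)
    lower-triangular matrix standing in for causal attention.
    patch: optional (layer, position, state); overwrites one hidden state
    right after that layer, which is the transplant.
    Returns the final states and the list of per-layer states.
    """
    h = x.copy()
    states = []
    for i, W in enumerate(weights):
        h = h + np.tanh(mix @ h @ W)
        if patch is not None and patch[0] == i:
            h[patch[1]] = patch[2]
        states.append(h.copy())
    return h, states

rng = np.random.default_rng(0)
T, d, n_layers, seam, layer = 6, 16, 8, 2, 3   # all hypothetical sizes
weights = [rng.normal(scale=0.3, size=(d, d)) for _ in range(n_layers)]
mix = np.tril(np.ones((T, T)))
mix /= mix.sum(axis=1, keepdims=True)

x_orig = rng.normal(size=(T, d))
x_mod = x_orig.copy()
x_mod[seam] += rng.normal(size=d)              # the "merge" changes only the seam

out_orig, _ = forward(x_orig, weights, mix)
out_mod, states_mod = forward(x_mod, weights, mix)
out_patched, _ = forward(
    x_orig, weights, mix, patch=(layer, seam, states_mod[layer][seam])
)

def recovery(pos):
    """Fraction of the original-to-modified output gap closed by the patch."""
    gap = np.linalg.norm(out_orig[pos] - out_mod[pos])
    return 1.0 - np.linalg.norm(out_patched[pos] - out_mod[pos]) / gap

recovery_seam = recovery(seam)
recovery_last = recovery(T - 1)
```

In this toy the seam position recovers fully, because everything upstream of it is identical in both runs. Later positions recover only partially, since they already read the original seam state in the layers before the patch.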
To actually steer whitespace robustness you'd need context-conditioned vectors, direct causal patching, or a nonlinear probe trained to predict K from hidden states and then push toward high-K regions. None of those are trivial. The naive version is a clean null.
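As one possible reading of "context-conditioned vectors", the sketch below swaps the single global mean for a per-prompt vector built from the k training contexts most similar to the query. This is an untested idea shown on synthetic data. The nearest-neighbor scheme, the cosine metric, and the function name are all assumptions.

```python
import numpy as np

def conditioned_vector(h_query, h_original, deltas, k=5):
    """Unit-norm mean of the deltas from the k most similar training contexts.

    h_query: (d,) hidden state of the held-out prompt at the seam.
    h_original: (n, d) original hidden states of the training pairs.
    deltas: (n, d) per-pair differences h_modified - h_original.
    """
    q = h_query / np.linalg.norm(h_query)
    keys = h_original / np.linalg.norm(h_original, axis=1, keepdims=True)
    sims = keys @ q                      # cosine similarity to each context
    nearest = np.argsort(sims)[-k:]      # indices of the k closest contexts
    v = deltas[nearest].mean(axis=0)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
h_train = rng.normal(size=(100, 64))     # synthetic stand-in activations
train_deltas = rng.normal(size=(100, 64))

v_local = conditioned_vector(rng.normal(size=64), h_train, train_deltas, k=5)
```

Whether nearby contexts actually share a merge direction is an empirical question the data here doesn't answer. If they don't, this collapses to the same null as the global average.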
The three posts in this series now form a reasonably complete picture: the BOS-anchoring circuit explains the sharpening effect mechanistically, Pythia inverts the whole thing in a way we can't fully explain, and the effect is stored in a way that resists extraction. What the architecture is doing at that seam is real and localized. It just doesn't want to be moved.