1. Background
Prior work in this series established:
- Whitespace perturbations (merge, split, shift) divide next-token probability mass in a conserved way across BPE-tokenized models (JS < 0.3 nats universally).
- Per-pair robustness K = log(2) − (H_modified − H_original) is not predictable from tokenizer-local features. The correct decomposition is K = g(token) + h(context), with h(context) dominant (Q1 falsification, R² = −0.41 on held-out data).
- Final-to-seam attention features partially operationalize h(context): r ≈ −0.45 to −0.55, R² ≈ 0.22 on held-out merge pairs (Q3/Q4).
- A two-stage BOS-anchoring circuit is responsible for sharpening in Llama-3.2-3B-Instruct: early QK routing (layers 0–3) is load-bearing; late OV projections (layers 15–27) carry the signal independently of attention patterns.
This note asks: where in the residual stream does the whitespace perturbation signal live, and can it be extracted as a steering direction?
2. Q6: Causal Patching
2.1 Method
For each of 20 merge-sharpening pairs (K > log(2)), extract hidden states at all 28 layers for both the original sentence and the modified (merged) sentence. At each layer L, transplant the merged token's hidden state from the modified forward pass into the original forward pass via a forward hook on model.model.layers[L]. Measure:
K_patch = log(2) − (H_patch − H_orig)
recovery = 1 − |K_patch − K_actual| / |K_actual|
Recovery = 1.0 means the patched forward pass fully reproduces the whitespace effect. Recovery = 0 means the patch had no effect.
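The two measurements above can be sketched in a few lines. This is a minimal illustration, not the code used for the experiments: it assumes H is the Shannon entropy (in nats) of the next-token distribution, and the three distributions are made-up four-token examples. The function names `entropy_nats`, `k_score`, and `recovery` are hypothetical.

```python
import math

def entropy_nats(probs):
    """Shannon entropy in nats of a next-token distribution (assumed meaning of H)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def k_score(h_changed, h_orig):
    """K = log(2) - (H_changed - H_orig). K > log(2) means the change lowered entropy."""
    return math.log(2) - (h_changed - h_orig)

def recovery(k_patch, k_actual):
    """recovery = 1 - |K_patch - K_actual| / |K_actual|."""
    return 1.0 - abs(k_patch - k_actual) / abs(k_actual)

# Made-up distributions, for illustration only.
p_orig = [0.25, 0.25, 0.25, 0.25]      # original sentence
p_modified = [0.70, 0.10, 0.10, 0.10]  # merged sentence (sharper)
p_patched = [0.60, 0.20, 0.10, 0.10]   # original pass with the transplanted state

h_orig = entropy_nats(p_orig)
k_actual = k_score(entropy_nats(p_modified), h_orig)
k_patch = k_score(entropy_nats(p_patched), h_orig)

print(f"K_actual={k_actual:.3f}  K_patch={k_patch:.3f}  "
      f"recovery={recovery(k_patch, k_actual):.3f}")
```

In this toy case the patched distribution is sharper than the original but less sharp than the merged one, so recovery falls strictly between 0 and 1.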
2.2 Results
Mean recovery peaks at layer 3 (of 28) and decays monotonically through deeper layers.
The distribution of peak-recovery layers across pairs is bimodal: 12/20 pairs peak in early layers (0–7); 8/20 in late layers (14–27). This maps to the g(token)/h(context) decomposition: pairs where the token geometry dominates peak early; pairs where context dominates peak late.
The layer 3 result converges with the BOS-anchoring circuit finding: the load-bearing early QK routing heads are concentrated in layers 0–3, with the causal patching peak at the boundary.
2.3 Interpretation
The whitespace perturbation signal is carried in the residual stream as early as layer 3. It is not processed away by mid-network; it propagates forward as a localized perturbation. The transplant works because it provides perfect local information: the actual merged token's representation in its actual context, placed at the exact layer where the effect is encoded.
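The transplant logic from Section 2.1 can be illustrated on a toy layer stack. This is a sketch under stated assumptions, not the experimental code: the four "layers" are a made-up causal mixing function, the `patch` argument of the hypothetical `run` function plays the role that the forward hook on model.model.layers[L] plays in the real setup, and the position alignment (`src_pos` in the merged sentence, `dst_pos` in the original) is an assumption, since the two sentences have different token counts.

```python
import math

NUM_LAYERS = 4  # toy depth; the real model has 28 layers

def layer(hidden, idx):
    """Toy causal layer: each position adds a squashed mean of itself and earlier positions."""
    out = []
    for i in range(len(hidden)):
        prefix = hidden[: i + 1]
        mean = [sum(col) / len(prefix) for col in zip(*prefix)]
        out.append([h + 0.5 * math.tanh(m + 0.1 * idx) for h, m in zip(hidden[i], mean)])
    return out

def run(hidden, patch=None, cache=None):
    """Run all layers and return the final hidden states.

    patch: optional (layer_idx, position, vector); overwrites one position of
           that layer's output, like a forward hook that edits the output.
    cache: optional list that receives a copy of every layer's output.
    """
    for idx in range(NUM_LAYERS):
        hidden = layer(hidden, idx)
        if patch is not None and patch[0] == idx:
            hidden = [list(v) for v in hidden]
            hidden[patch[1]] = list(patch[2])
        if cache is not None:
            cache.append([list(v) for v in hidden])
    return hidden

# Made-up inputs: the original has a split token (positions 1 and 2),
# the modified sentence has the merged token at position 1.
original = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.0, 0.1]]
modified = [[0.1, 0.2], [0.5, 0.5], [0.0, 0.1]]

cache = []
run(modified, cache=cache)           # 1. record the modified pass

L, src_pos, dst_pos = 1, 1, 2        # assumed alignment of merged token to seam
clean = run(original)                # 2. unpatched original pass
patched = run(original, patch=(L, dst_pos, cache[L][src_pos]))  # 3. transplant

print("final position, clean:  ", clean[-1])
print("final position, patched:", patched[-1])
```

Because the toy layers are causal, the patch leaves positions before `dst_pos` untouched and changes only the seam and what follows it, which is the sense in which the perturbation is localized and then propagates forward.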
3. Q7: Steer Vectors
3.1 Method
From 100 training merge pairs, extract hidden states at the seam position for both original and modified sentences. Compute the mean difference vector:
d_L = mean(h_modified_L − h_original_L) / ||mean(h_modified_L − h_original_L)||
For each of 60 held-out merge pairs, inject alpha × d_L into the forward pass at the seam position at layer L. Measure K_steered and correlate with K_actual across pairs.
Tested layers: [2, 3, 5, 10]. Tested alphas: [−2, −1, −0.5, +0.5, +1, +2].
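The direction d_L and the injection step can be sketched as follows. This is a minimal pure-Python illustration with made-up two-dimensional states. The helper names `mean_difference_direction` and `inject` are hypothetical, and in the real setup the addition would happen inside a forward hook at layer L rather than on a plain list.

```python
import math

def mean_difference_direction(h_modified, h_original):
    """Unit vector d = mean(h_mod - h_orig) / ||mean(h_mod - h_orig)|| over training pairs."""
    n, dim = len(h_modified), len(h_modified[0])
    mean_diff = [
        sum(m[j] - o[j] for m, o in zip(h_modified, h_original)) / n
        for j in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean_diff))
    if norm == 0.0:
        raise ValueError("mean difference is zero; direction is undefined")
    return [x / norm for x in mean_diff]

def inject(hidden, position, direction, alpha):
    """Return a copy of `hidden` with alpha * direction added at one position."""
    steered = [list(v) for v in hidden]
    steered[position] = [h + alpha * d for h, d in zip(hidden[position], direction)]
    return steered

# Made-up seam-position states for three training pairs; every difference is (3, 4).
h_orig_train = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
h_mod_train = [[3.0, 4.0], [4.0, 5.0], [5.0, 4.0]]
d = mean_difference_direction(h_mod_train, h_orig_train)

# Made-up layer-L states for a held-out sentence with the seam at position 1.
sentence = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
seam = 1
for alpha in (-2, -1, -0.5, 0.5, 1, 2):  # the alphas tested above
    print(alpha, inject(sentence, seam, d, alpha)[seam])
```

Normalizing d_L to unit length means alpha is measured in absolute hidden-state units, so the same alpha may be a large or small nudge depending on the typical norm of the residual stream at that layer.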
3.2 Results
| Layer | Best alpha | Best r | p |
|---|---|---|---|
| 2 | −2.0 | +0.285 | 0.027* |
| 3 | +1.0 | −0.190 | 0.147 |
| 5 | +2.0 | −0.242 | 0.063 |
| 10 | +2.0 | −0.120 | 0.362 |
Baseline K_actual for merge pairs ≈ 0.64. K_steered clusters tightly in [0.51, 0.69] across all layers and alphas: the steer injection is not meaningfully shifting the distribution.
The layer 2 result (r = +0.285, alpha = −2.0) is statistically significant but directionally anomalous: the anti-merge direction has marginal predictive power. This may reflect low-level positional encoding geometry at the embedding layer that is overridden by subsequent contextual processing. It does not generalize and is not robust to alpha selection.
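The r values in the table are correlations between K_steered and K_actual across held-out pairs. A minimal sketch of that computation follows, assuming a Pearson coefficient (the text above does not name the coefficient); the K values are made up and `pearson_r` is a hypothetical helper. The p-values are not reproduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up per-pair values: K_actual varies widely, K_steered barely moves.
k_actual = [0.20, 0.45, 0.64, 0.80, 1.10]
k_steered = [0.58, 0.61, 0.55, 0.66, 0.60]

print("r =", round(pearson_r(k_steered, k_actual), 3))
print("spread of K_steered:", round(max(k_steered) - min(k_steered), 3))
print("spread of K_actual: ", round(max(k_actual) - min(k_actual), 3))
```

Reporting the spread alongside r is useful here because a correlation can be nonzero even when the steered values occupy a much narrower range than the actual ones.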
3.3 The Dissociation
Q6 and Q7 probe the same hidden state at layer 3 via orthogonal methods:
- Q6 injects perfect local information: the exact merged token's representation in its exact sentential context.
- Q7 injects a fixed global direction: the mean difference across 100 training contexts, normalized.
Q6 works. Q7 fails at the same layer.
The dissociation has a direct interpretation: the merge operator does not move the hidden state along a consistent universal direction in representation space. Each token pair's merge effect is geometrically idiosyncratic: the direction from h_original to h_modified varies across contexts in a way that averages to noise.
This is the residual-stream analogue of the Q1 falsification. Q1 showed that K cannot be predicted from static tokenizer features because h(context) dominates. Q7 shows that the hidden-state encoding of h(context) is not a fixed direction; it is entangled with semantic content in a way that averaging destroys. The effect is linearly inseparable from the context that produces it.
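One way to make "averages to noise" concrete is to compare the norm of the mean difference vector with the mean norm of the individual differences. The sketch below is an illustrative diagnostic only; it was not part of the experiments reported here, the vectors are made up, and `direction_consistency` is a hypothetical helper.

```python
import math

def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def direction_consistency(diffs):
    """||mean(diff)|| / mean(||diff||).

    Equals 1.0 when every difference points the same way, and approaches 0
    when the per-pair differences cancel each other out.
    """
    n, dim = len(diffs), len(diffs[0])
    mean_vec = [sum(d[j] for d in diffs) / n for j in range(dim)]
    mean_norm = sum(norm(d) for d in diffs) / n
    if mean_norm == 0.0:
        raise ValueError("all difference vectors are zero")
    return norm(mean_vec) / mean_norm

# Made-up per-pair differences h_modified - h_original.
aligned = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]                 # one shared direction
scattered = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]  # idiosyncratic

print("aligned:  ", direction_consistency(aligned))
print("scattered:", direction_consistency(scattered))
```

A ratio near 1 would indicate that a single mean-difference direction summarizes the pairs well; a ratio near 0 would be what the interpretation above predicts. Note that normalizing the mean vector, as d_L does, hides this ratio, which is why it has to be checked separately.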
4. Implications for Steerability
The whitespace merge effect is not steerable via mean-difference injection. To reproduce the effect you would need one of:
- Context-conditioned vectors: A separate direction per sentence type or register, trained to capture within-class variation rather than a global mean.
- Direct causal patching: Replay the exact hidden state. This is replay, not steering, and requires the modified forward pass to already exist.
- Nonlinear probe steering: Train a probe to predict K from hidden states, then steer toward high-K regions via gradient in the probe's input space. This is technically possible but requires a well-calibrated probe (current seam attention features explain ~22% of K variance, which is insufficient for reliable steering).
None of these are trivial, and none are what "steer vector" typically means. The implication is that whitespace boundary effects are contextually entangled at the representation level. The architecture encodes them, but not in a form that naive extraction methods can retrieve.
5. What This Adds to the Series
| Experiment | Question | Answer |
|---|---|---|
| Q1 | Is K predictable from tokenizer features? | No. h(context) dominates. |
| Q3/Q4 | Does seam attention operationalize h(context)? | Partially. R² ≈ 0.22. |
| Q6 | Where in the residual stream is the effect? | Layer 3. Transplant works. |
| Q7 | Can the effect be extracted as a universal direction? | No. Context-entangled. |
| BOS circuit | What circuit produces sharpening? | Two-stage: BOS anchoring (L0–3) + OV projection (L15–27). |
The picture across Q1–Q7 and the circuit experiments: whitespace robustness is a real, localized, mechanistically traceable property of BPE-tokenized language models. It is not a random noise phenomenon, not a tokenizer artifact, and not a property of individual tokens. It is a contextual property encoded in the early residual stream via a specific circuit, stored in a form that resists extraction as a universal direction, and distributed differently across model families in ways we do not yet fully understand.
6. Limitations
- Q7 tests only four layers and six alphas. A finer sweep might reveal a weak but consistent signal at a specific (layer, alpha) combination not tested here.
- The mean-difference direction is extracted at the seam position only. A direction extracted from a different position (e.g., the final token position, where next-token prediction happens) might behave differently.
- All experiments are on Llama-3.2-3B-Instruct. Whether the Q6/Q7 dissociation holds for Pythia, which inverts the K rankings, is untested.
7. Conclusion
Causal patching and steer vectors probe the same information at the same layer with opposite results. The transplant works because it carries exact contextual information. The averaged direction fails because the merge operator's hidden-state effect is not linearly separable from semantic content. The effect is real and localized at layer 3, but it cannot be extracted as a single mean-difference direction.