
Context Entanglement in Whitespace Robustness: The Q6/Q7 Dissociation

Causal patching (exact hidden state transplant) recovers the whitespace merge effect at layer 3. Steer vectors (averaged direction injection) at the same layer are null. The dissociation establishes that the effect is context-entangled in residual space: it is not extractable as a universal direction.

May 11, 2026 · 16 min read · zer0contextlost
Abstract

Two experiments probe the same hidden state at layer 3 with orthogonal methods. Q6 (causal patching: transplanting the exact context-specific hidden state from the modified sentence into the original forward pass) achieves peak mean recovery of whitespace robustness (K) at layer 3 across 20 merge pairs. Q7 (steer vectors: injecting a normalized mean-difference direction averaged over 100 training pairs) yields r = −0.190 at layer 3, not significant. The Q6/Q7 dissociation establishes that the merge operator does not move the hidden state along a consistent universal direction: its effect is geometrically idiosyncratic per context and averages to noise. This is consistent with the prior finding that K = g(token) + h(context) with h(context) dominant. The effect is real and localized; it cannot be extracted as a steering vector.

1. Background

Prior work in this series established:

  1. Whitespace perturbations (merge, split, shift) divide next-token probability mass in a conserved way across BPE-tokenized models (JS < 0.3 nats universally).
  2. Per-pair robustness K = log(2) − (H_modified − H_original) is not predictable from tokenizer-local features. The correct decomposition is K = g(token) + h(context), with h(context) dominant (Q1 falsification, R² = −0.41 on held-out data).
  3. Final-to-seam attention features partially operationalize h(context): r ≈ −0.45 to −0.55, R² ≈ 0.22 on held-out merge pairs (Q3/Q4).
  4. A two-stage BOS-anchoring circuit is responsible for sharpening in Llama-3.2-3B-Instruct: early QK routing (layers 0–3) is load-bearing; late OV projections (layers 15–27) carry the signal independently of attention patterns.
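As a concrete reading of the definitions in items 1 and 2, the following sketch computes Shannon entropy, K, and Jensen-Shannon divergence (all in nats) from two next-token distributions. It assumes H denotes the entropy of the next-token distribution, which the text implies but does not spell out. The function names and the toy distributions are illustrative and are not taken from the series' code.

```python
import math

def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def robustness_k(p_original, p_modified):
    """K = log(2) - (H_modified - H_original), as defined in item 2."""
    return math.log(2) - (entropy(p_modified) - entropy(p_original))

def js_divergence(p, q):
    """Jensen-Shannon divergence in nats; bounded above by log(2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy next-token distributions over a 4-token vocabulary (hypothetical values).
p_orig = [0.40, 0.30, 0.20, 0.10]
p_mod = [0.70, 0.15, 0.10, 0.05]  # lower entropy, i.e. sharper, after the merge

k = robustness_k(p_orig, p_mod)
print(f"K = {k:.3f}")
print(f"sharpening (K > log 2): {k > math.log(2)}")
print(f"JS = {js_divergence(p_orig, p_mod):.3f}")
```

Under this reading, K equals log(2) exactly when the perturbation leaves the entropy unchanged, and K > log(2) whenever the modified distribution is sharper than the original.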

This note asks: where in the residual stream does the whitespace perturbation signal live, and can it be extracted as a steering direction?


2. Q6: Causal Patching

2.1 Method

For each of 20 merge-sharpening pairs (K > log(2)), extract hidden states at all 28 layers for both the original sentence and the modified (merged) sentence. At each layer L, transplant the merged token’s hidden state from the modified forward pass into the original forward pass via a forward hook on model.model.layers[L]. Measure:

K_patch = log(2) − (H_patch − H_orig)
recovery = 1 − |K_patch − K_actual| / |K_actual|

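Recovery = 1.0 means the patched forward pass fully reproduces the whitespace effect (K_patch = K_actual). Recovery = 0 means K_patch misses K_actual by as much as |K_actual| itself, for example K_patch = 0. A patch that leaves the entropy unchanged gives K_patch = log(2), so its recovery is 1 − |log(2) − K_actual| / |K_actual| rather than 0.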
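To make the recovery metric concrete, here is a small worked example of the formula as written. The K values are made up for illustration. Under the formula, a patch that leaves the entropy unchanged yields K_patch = log(2), so that case is worth keeping in mind as a baseline when reading recovery numbers.

```python
import math

def recovery(k_patch, k_actual):
    """recovery = 1 - |K_patch - K_actual| / |K_actual|"""
    return 1.0 - abs(k_patch - k_actual) / abs(k_actual)

k_actual = 0.90  # hypothetical sharpening pair, K > log(2)

cases = {
    "patch reproduces K exactly": 0.90,
    "patch recovers part of the effect": 0.80,
    "entropy unchanged, K_patch = log(2)": math.log(2),
    "K_patch = 0": 0.0,
}
for label, k_patch in cases.items():
    print(f"{label}: recovery = {recovery(k_patch, k_actual):.3f}")
```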

2.2 Results

Mean recovery peaks at layer 3 (of 28) and decays monotonically through deeper layers.

The distribution of per-pair peak-recovery layers is bimodal: 12/20 pairs peak in early layers (0–7) and 8/20 in late layers (14–27). This is consistent with the g(token)/h(context) decomposition: pairs where the token geometry dominates peak early; pairs where context dominates peak late.

The layer 3 result converges with the BOS-anchoring circuit finding: the load-bearing early QK routing heads are concentrated in layers 0–3, with the causal patching peak at the boundary.

2.3 Interpretation

The whitespace perturbation signal is carried in the residual stream as early as layer 3. It is not processed away by mid-network; it propagates forward as a localized perturbation. The transplant works because it provides perfect local information: the actual merged token’s representation in its actual context, placed at the exact layer where the effect is encoded.
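The transplant in 2.1 can be sketched on a framework-free toy model. This is not the experiment's code: the real setup uses a forward hook on model.model.layers[L], whereas here a hypothetical `run` function with a `patch` argument stands in for the hook, hidden states are single floats, and the layer weights and inputs are invented. The point is only the mechanics: cache the modified run, then overwrite one position at one layer in the original run.

```python
import math

WEIGHTS = [0.5, -0.3, 0.8, 0.2]  # hypothetical per-layer weights

def layer(h, w):
    """Toy residual block: each position adds tanh(w * mean of its causal prefix)."""
    out = []
    for i in range(len(h)):
        prefix_mean = sum(h[: i + 1]) / (i + 1)
        out.append(h[i] + math.tanh(w * prefix_mean))
    return out

def run(x, patch=None, cache=None):
    """Forward pass. patch = (layer_idx, position, value) overwrites one hidden
    state at the output of layer_idx, mimicking what a forward hook would do."""
    h = list(x)
    for idx, w in enumerate(WEIGHTS):
        h = layer(h, w)
        if patch is not None and patch[0] == idx:
            h[patch[1]] = patch[2]
        if cache is not None:
            cache[idx] = list(h)
    return h

original = [0.1, 0.4, -0.2, 0.3]
modified = [0.1, 0.9, -0.2, 0.3]  # differs only at the "seam" position
seam = 1

mod_cache = {}
out_mod = run(modified, cache=mod_cache)
out_orig = run(original)

# Transplant the modified seam state into the original run, one layer at a time,
# and report how far the final-position output still is from the modified run.
for L in range(len(WEIGHTS)):
    out_patch = run(original, patch=(L, seam, mod_cache[L][seam]))
    print(L, round(abs(out_patch[-1] - out_mod[-1]), 4))
```

In this toy, patching after the last layer cannot change the final position at all, which is one reason a per-layer sweep is informative.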


3. Q7: Steer Vectors

3.1 Method

From 100 training merge pairs, extract hidden states at the seam position for both original and modified sentences. Compute the mean difference vector:

d_L = mean(h_modified_L − h_original_L) / ||mean(h_modified_L − h_original_L)||

For each of 60 held-out merge pairs, inject alpha × d_L into the forward pass at the seam position at layer L. Measure K_steered and correlate with K_actual across pairs.

Tested layers: [2, 3, 5, 10]. Tested alphas: [−2, −1, −0.5, +0.5, +1, +2].
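A minimal sketch of the direction extraction and injection described above, using plain Python lists in place of real hidden states. The helper names and the three toy training pairs are hypothetical; only the formula for d_L and the alpha grid come from the text.

```python
import math

def mean_difference_direction(h_original, h_modified):
    """d = mean(h_mod - h_orig) / ||mean(h_mod - h_orig)|| over training pairs."""
    n, dim = len(h_original), len(h_original[0])
    mean_diff = [
        sum(h_modified[i][j] - h_original[i][j] for i in range(n)) / n
        for j in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean_diff))
    if norm == 0.0:
        raise ValueError("mean difference is zero; no direction to normalize")
    return [x / norm for x in mean_diff]

def inject(hidden, direction, alpha):
    """Return hidden + alpha * direction for one seam-position hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy seam-position hidden states for three training pairs (hypothetical values).
h_orig = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
h_mod = [[0.5, 1.0, 0.1], [1.4, 0.1, 0.0], [0.3, -0.1, 1.2]]

d = mean_difference_direction(h_orig, h_mod)
for alpha in [-2, -1, -0.5, 0.5, 1, 2]:  # the alpha grid listed above
    steered = inject(h_orig[0], d, alpha)
    print(alpha, [round(v, 3) for v in steered])
```

Because d is unit-normalized, alpha is the Euclidean size of the shift, so its effect depends on how large the typical hidden-state norm is at that layer.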

3.2 Results

Layer  Best alpha  Best r   p
2      −2.0        +0.285   0.027*
3      +1.0        −0.190   0.147
5      +2.0        −0.242   0.063
10     +2.0        −0.120   0.362

* p < 0.05

Baseline K_actual for merge pairs ≈ 0.64. K_steered clusters tightly in [0.51, 0.69] across all layers and alphas: the steer injection is not meaningfully shifting the distribution.

The layer 2 result (r = +0.285, alpha = −2.0) is statistically significant but directionally anomalous: the anti-merge direction has marginal predictive power. This may reflect low-level positional encoding geometry at the embedding layer that is overridden by subsequent contextual processing. It does not generalize and is not robust to alpha selection.
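The reported significance levels can be sanity-checked from r and the sample size alone. The sketch below assumes a standard two-sided t-test for a Pearson correlation with n = 60 held-out pairs; the source does not state which test was used, and the critical value is an approximate textbook figure for 58 degrees of freedom.

```python
import math

def pearson_t(r, n):
    """t statistic for testing Pearson r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

N_PAIRS = 60        # held-out merge pairs, from section 3.1
T_CRITICAL = 2.002  # approximate two-sided 5% critical value, 58 degrees of freedom

best_r = {2: 0.285, 3: -0.190, 5: -0.242, 10: -0.120}  # "Best r" per layer
for layer_idx, r in best_r.items():
    t = pearson_t(r, N_PAIRS)
    print(f"layer {layer_idx:>2}: t = {t:+.2f}, |t| > critical: {abs(t) > T_CRITICAL}")
```

Under these assumptions only the layer 2 correlation clears the 5% threshold, which agrees with the significance pattern reported in the results.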

3.3 The Dissociation

Q6 and Q7 probe the same hidden state at layer 3 via orthogonal methods:

  • Q6 injects perfect local information: the exact merged token’s representation in its exact sentential context.
  • Q7 injects a fixed global direction: the mean difference across 100 training contexts, normalized.

Q6 works. Q7 fails at the same layer.

The dissociation has a direct interpretation: the merge operator does not move the hidden state along a consistent universal direction in representation space. Each token pair’s merge effect is geometrically idiosyncratic: the direction from h_original to h_modified varies across contexts in a way that averages to noise.

This is the residual-stream analogue of the Q1 falsification. Q1 showed that K cannot be predicted from static tokenizer features because h(context) dominates. Q7 shows that the hidden-state encoding of h(context) is not a fixed direction: it is entangled with semantic content in a way that averaging destroys. The effect is linearly inseparable from the context that produces it.
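The claim that idiosyncratic directions average to noise can be illustrated on synthetic vectors. The simulation below is not the experiment's data and is not a diagnostic the source reports: it compares the norm of the mean of unit-normalized difference vectors when all pairs share one direction against the case where each pair has its own. The hidden size of 64 is an arbitrary assumption; 100 matches the training-pair count.

```python
import math
import random

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_resultant_length(vectors):
    """Norm of the mean of unit-normalized vectors: near 1 if aligned, near 0 if scattered."""
    units = [unit(v) for v in vectors]
    dim = len(units[0])
    mean = [sum(u[j] for u in units) / len(units) for j in range(dim)]
    return math.sqrt(sum(x * x for x in mean))

rng = random.Random(0)
DIM, N = 64, 100  # hypothetical hidden size; 100 training pairs

# Case A: every pair shifts along one shared direction plus small noise.
shared = unit([rng.gauss(0, 1) for _ in range(DIM)])
aligned = [[s + 0.05 * rng.gauss(0, 1) for s in shared] for _ in range(N)]

# Case B: every pair shifts along its own unrelated direction.
idiosyncratic = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]

print("aligned:", round(mean_resultant_length(aligned), 3))
print("idiosyncratic:", round(mean_resultant_length(idiosyncratic), 3))
```

The same statistic could be computed on the actual h_modified − h_original vectors to quantify how far the real data sits from either extreme.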


4. Implications for Steerability

The whitespace merge effect is not steerable via mean-difference injection. To reproduce the effect you would need one of:

  1. Context-conditioned vectors: A separate direction per sentence type or register, trained to capture within-class variation rather than a global mean.
  2. Direct causal patching: Replay the exact hidden state. This is replay, not steering, and requires the modified forward pass to already exist.
  3. Nonlinear probe steering: Train a probe to predict K from hidden states, then steer toward high-K regions via gradient in the probe’s input space. This is technically possible but requires a well-calibrated probe (current seam attention features explain ~22% of K variance, which is insufficient for reliable steering).

None of these are trivial, and none are what “steer vector” typically means. The implication is that whitespace boundary effects are contextually entangled at the representation level. The architecture encodes them, but not in a form that naive extraction methods can retrieve.
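Option 1 (context-conditioned vectors) might look like the sketch below, which computes one normalized mean-difference direction per context label. The labels, the grouping by register, and the toy vectors are all hypothetical; whether the real merge effect separates along any such labeling is untested in this note. The toy data is built so that the two classes cancel in the global mean while each class keeps a clean direction.

```python
import math
from collections import defaultdict

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_directions(diffs, labels):
    """One normalized mean-difference direction per context label."""
    grouped = defaultdict(list)
    for diff, label in zip(diffs, labels):
        grouped[label].append(diff)
    directions = {}
    for label, rows in grouped.items():
        dim = len(rows[0])
        mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
        directions[label] = unit(mean)
    return directions

# Toy difference vectors (h_modified - h_original); labels are hypothetical registers.
diffs = [[1.0, 0.1], [0.9, -0.1], [-1.0, 0.1], [-0.9, -0.1]]
labels = ["formal", "formal", "casual", "casual"]

per_class = class_directions(diffs, labels)
global_mean = [sum(d[j] for d in diffs) / len(diffs) for j in range(2)]
global_norm = math.sqrt(sum(x * x for x in global_mean))

print(per_class)
print("global mean norm:", global_norm)
```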


5. What This Adds to the Series

Experiment   Question                                               Answer
Q1           Is K predictable from tokenizer features?              No. h(context) dominates.
Q3/Q4        Does seam attention operationalize h(context)?         Partially. R² ≈ 0.22.
Q6           Where in the residual stream is the effect?            Layer 3. Transplant works.
Q7           Can the effect be extracted as a universal direction?  No. Context-entangled.
BOS circuit  What circuit produces sharpening?                      Two-stage: BOS anchoring (L0–3) + OV projection (L15–27).

The picture across Q1–Q7 and the circuit experiments: whitespace robustness is a real, localized, mechanistically traceable property of BPE-tokenized language models. It is not a random noise phenomenon, not a tokenizer artifact, and not a property of individual tokens. It is a contextual property encoded in the early residual stream via a specific circuit, stored in a form that resists extraction as a universal direction, and distributed differently across model families in ways we do not yet fully understand.


6. Limitations

  • Q7 tests only four layers and six alphas. A finer sweep might reveal a weak but consistent signal at a specific (layer, alpha) combination not tested here.
  • The mean-difference direction is extracted at the seam position only. A direction extracted from a different position (e.g., the final token position, where next-token prediction happens) might behave differently.
  • All experiments are on Llama-3.2-3B-Instruct. Whether the Q6/Q7 dissociation holds for Pythia, which inverts the K rankings, is untested.

7. Conclusion

Causal patching and steer vectors probe the same information at the same layer with opposite results. The transplant works because it carries exact contextual information. The averaged direction fails because the merge operator’s hidden-state effect is not linearly separable from semantic content. The effect is real and localized at layer 3. It cannot be extracted as a universal mean-difference steering direction.