1. Introduction
BPE tokenizers encode whitespace as part of token identity: " the" and "the" are different vocabulary entries with different embeddings and different learned weights. A single space character inserted or removed can completely change what the model receives. Prior work established that such perturbations redistribute probability mass in a structured way — probability is divided, not dispersed (JSD < 0.3 nats universally) — and that in approximately 34% of merge-operator cases, the perturbation sharpens the output distribution. The model becomes more confident in its prediction after a space is removed.
What circuit produces this sharpening? This paper addresses that question mechanistically. The answer has two parts: early QK routing (layers 0–3) establishes which token positions carry boundary-sensitive information, and late OV projections (layers 15–27) carry the signal to the output. These stages are experimentally separable: forcing uniform attention on late heads leaves sharpening intact, while forcing uniform attention on specific early heads eliminates or inverts it.
2. Background
2.1 Whitespace Operators and the Sharpening Phenomenon
Three primitive operators over text strings:
- merge (×) — remove a space, fusing adjacent tokens
- split (÷) — insert a space mid-word, forcing re-segmentation
- shift (≀) — move a space one character left or right
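The three operators can be illustrated as plain string edits. The following is a minimal sketch, not the implementation used in prior work: the function names and the convention of addressing a space by its character index are assumptions made for illustration.

```python
def merge(s: str, i: int) -> str:
    """Remove the space at character index i, fusing its neighbours."""
    if s[i] != " ":
        raise ValueError(f"no space at index {i}")
    return s[:i] + s[i + 1:]


def split(s: str, i: int) -> str:
    """Insert a space before character index i (forces re-segmentation)."""
    return s[:i] + " " + s[i:]


def shift(s: str, i: int, direction: int) -> str:
    """Move the space at index i one character left (-1) or right (+1)."""
    if s[i] != " ":
        raise ValueError(f"no space at index {i}")
    if direction not in (-1, 1):
        raise ValueError("direction must be -1 or +1")
    j = i + direction
    if not 0 <= j < len(s) or s[j] == " ":
        raise ValueError("cannot shift the space in that direction")
    chars = list(s)
    chars[i], chars[j] = chars[j], chars[i]  # swap the space with its neighbour
    return "".join(chars)
```

For example, `merge("the cat", 3)` gives `"thecat"` and `shift("the cat", 3, 1)` gives `"thec at"`.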
For a prompt string s and variant v, define the entropy residual:
K = log(2) − (H(P_v) − H(P_s))
where H(P) is the Shannon entropy of the next-token distribution. K > log(2) holds exactly when H(P_v) < H(P_s), so it indicates sharpening: the variant distribution is more concentrated than the original, beyond what the Division Law baseline predicts.
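The definition of K can be checked numerically on toy distributions. This sketch assumes natural logarithms (consistent with JSD being reported in nats) and represents each next-token distribution as a plain list of probabilities; the function names are illustrative.

```python
import math


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def entropy_residual(p_s, p_v):
    """K = log(2) - (H(P_v) - H(P_s)), following the definition in the text."""
    return math.log(2) - (entropy(p_v) - entropy(p_s))


def is_sharpening(p_s, p_v):
    """K > log(2) is equivalent to H(P_v) < H(P_s)."""
    return entropy_residual(p_s, p_v) > math.log(2)
```

With a uniform two-way original and a uniform four-way variant, the entropy rises by exactly log(2) and K is 0. If the variant is more peaked than the original, K exceeds log(2).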
Conservation holds universally: JSD(P_s, P_v) < 0.3 nats across all pairs and three model families (Llama-3.2-3B-Instruct, Pythia-1.4B, Mistral-7B). Sharpening rate for the merge operator: ~34% of held-out pairs.
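For reference, the Jensen-Shannon divergence in nats is bounded above by log(2) ≈ 0.693, so the 0.3 threshold is a non-trivial bound. A minimal sketch of the computation, assuming both distributions are lists over the same vocabulary (the function names are illustrative):

```python
import math


def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence in nats, via the mixture m = (p + q) / 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The mixture m is strictly positive wherever p or q is, so no division by zero occurs.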
2.2 Direct Logit Attribution (DLA)
DLA projects per-layer, per-head attention outputs through the LN-folded unembedding direction for a target token, yielding a scalar contribution score per head. We use the linear approximation: for head h at layer l, contribution = W_O_h @ v_h projected through final LN scale and unembedding weight. This ignores subsequent residual additions but is standard for screening.
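The linear approximation can be written out for a single head. This is a sketch under stated assumptions: the final norm is treated as a frozen scalar scale times a per-dimension gain (as in RMSNorm), and all argument names are hypothetical. Plain lists stand in for tensors.

```python
def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def head_dla(w_o_h, v_h, ln_scale, ln_gain, unembed_row):
    """Linear DLA score of one head toward one target token.

    w_o_h:       hidden_size x head_dim slice of the output projection
    v_h:         attention-weighted value vector of the head (head_dim)
    ln_scale:    frozen scalar scale of the final norm (assumed constant)
    ln_gain:     per-dimension gain of the final norm (hidden_size)
    unembed_row: unembedding row of the target token (hidden_size)
    """
    head_out = matvec(w_o_h, v_h)  # the head's write into the residual stream
    return sum(o * ln_scale * g * u
               for o, g, u in zip(head_out, ln_gain, unembed_row))


def dla_delta(score_modified, score_original):
    """Per-head DLA delta = DLA(modified) - DLA(original)."""
    return score_modified - score_original
```

Because the score is linear in the head output, per-head scores can be compared and ranked directly. That is what makes the approximation convenient for screening.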
2.3 QK Surgery
For head h at layer l, we zero head h’s slice of the Q projection output. Since Q_h = 0, the attention scores Q_h K^T are 0 everywhere. Combined with the causal mask, softmax then gives uniform attention over all positions up to and including the current one: the head outputs the average of those value vectors, which eliminates routing while preserving the OV computation.
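The effect of Q-zeroing on the attention pattern can be verified on a toy head. This sketch uses plain lists for one head's query and key vectors and omits the 1/sqrt(head_dim) scaling, which does not matter once all scores are zero; the function names are illustrative.

```python
import math


def causal_attention_weights(q, k):
    """Row-wise softmax of q_i . k_j over j <= i for a single head."""
    weights = []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k[j])) for j in range(i + 1)]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        # Future positions are masked, so they receive zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(k) - i - 1))
    return weights


def zero_queries(q):
    """Q surgery: replace every query vector of the head with zeros."""
    return [[0.0] * len(q_i) for q_i in q]
```

After `zero_queries`, row i of the attention matrix places weight 1/(i+1) on each of positions 0 through i, regardless of the keys.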
Recovery fraction:
R = (K_uniform − log(2)) / (K_orig − log(2))
- R ≈ 1.0 — sharpening survives uniform attention → OV-side
- R ≈ 0.0 — sharpening eliminated → QK-side
- R < 0 — uniform attention inverts the effect → strongly QK-side
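The recovery fraction and its three-way reading can be sketched as follows. The tolerance `tol` used to decide what counts as "approximately 0" or "approximately 1" is an assumption introduced here for illustration; the text does not specify thresholds.

```python
import math

LOG2 = math.log(2)


def recovery_fraction(k_uniform, k_orig):
    """R = (K_uniform - log 2) / (K_orig - log 2) for a sharpening pair."""
    excess = k_orig - LOG2
    if excess <= 0:
        raise ValueError("k_orig must exceed log(2), i.e. be a sharpening pair")
    return (k_uniform - LOG2) / excess


def classify(r, tol=0.25):
    """Map a recovery fraction to a label (tol is a hypothetical threshold)."""
    if r < -tol:
        return "strongly QK-side"  # uniform attention inverts the effect
    if r <= tol:
        return "QK-side"           # sharpening eliminated
    if r >= 1.0 - tol:
        return "OV-side"           # sharpening survives uniform attention
    return "intermediate"
```

R is only defined for pairs with K_orig > log(2), which is why the helper rejects non-sharpening pairs.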
3. Methods
Model: Llama-3.2-3B-Instruct (FP16, 28 layers, 24 heads, head_dim=128, hidden_size=3072).
Data: 180 held-out pairs (60 sentences × 3 operators) from WikiText-103 validation. 61 of the 180 pairs (about 34%) have K > log(2). Surgery experiments use the top-20 sharpening pairs by K magnitude (K_baseline mean = 1.13).
Phase 1 — DLA Screening: For 40 sharpening and 40 non-sharpening pairs, compute per-head DLA delta = DLA(modified) − DLA(original) toward the original’s top-1 token. Rank by specificity = mean delta (sharpening) − mean delta (non-sharpening). Top-10 heads selected as candidates.
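The specificity ranking in Phase 1 reduces to a difference of group means per head. A minimal sketch, assuming the per-pair DLA deltas have already been collected into dictionaries keyed by (layer, head); the data layout and function name are hypothetical.

```python
from statistics import mean


def rank_by_specificity(sharp, non_sharp, top_n=10):
    """Rank heads by mean delta (sharpening) - mean delta (non-sharpening).

    sharp, non_sharp: dicts mapping (layer, head) -> list of per-pair deltas.
    Returns the top_n (head, specificity) pairs, highest specificity first.
    """
    specificity = {h: mean(sharp[h]) - mean(non_sharp[h]) for h in sharp}
    ranked = sorted(specificity.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]
```

A head with near-zero delta on sharpening pairs and negative delta on non-sharpening pairs receives a positive specificity under this definition.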
Phase 2 — Late-Layer QK Surgery: Apply QK surgery to top-1/3/5/10 DLA-ranked heads on 20 sharpening pairs. Compute recovery.
Phase 3 — Control: 5 random draws of n=1/3/5/10 non-ranked heads. Compare recovery to ranked heads.
Phase 4 — Early-Layer Sweep: Individual QK surgery on all 264 heads in layers 0–10. Rank by recovery. Combine top crashers (n=1/3/5/10). Sample 30 late-layer heads (11–27) for comparison.
4. Results
4.1 DLA Identifies Late-Layer Sharpening-Specific Heads
Top DLA-ranked heads cluster in layers 15, 22–27. Best specificity: L25H15 (specificity=+0.087). In sharpening pairs, these heads show near-zero DLA delta. In non-sharpening pairs, they show negative delta (suppression). The asymmetry: sharpening contexts do not strongly promote the target via these heads — non-sharpening contexts suppress it more.
4.2 Late-Layer QK Surgery: OV-Side
| Config | Recovery |
|---|---|
| top-1 | 0.994 |
| top-3 | 0.969 |
| top-5 | 0.826 |
| top-10 | 1.134 |
All configurations are OV-side. The top-10 recovery exceeds 1.0, which suggests that some ranked heads are mildly anti-sharpening under their normal attention patterns: removing their routing slightly amplifies the effect. None of the tested late-layer QK perturbations disrupts sharpening; the lowest recovery, 0.826 for top-5, still retains most of the effect.
4.3 Control Confirms Non-Specificity
Random late-layer heads: recovery mean=0.968, std=0.091, no individual head below 0.688. Random early-layer heads: high variance — some draws give recovery near 0 or negative. The OV-side result is a property of the late layers, not specific to DLA-ranked heads.
4.4 Early-Layer Sweep: QK-Side, Localized to Layers 0–3
| Layer | Lowest single-head recovery (strongest crasher) |
|---|---|
| 0 | −0.780 |
| 1 | −0.057 |
| 2 | −0.672 |
| 3 | −0.103 |
| 4 | +0.152 |
| 5 | +0.135 |
| 6 | +0.512 |
| 7–10 | 0.59 – 0.74 |
Layer 0 head 22 (L0H22): recovery = −0.780. A single head whose Q-zeroing flips sharpening to spreading.
Layer 2: densely represented — heads 4, 17, 18, 20 all in top-15 crashers.
Combinations of top early crashers:
| Config | Recovery |
|---|---|
| early-top-1 | −0.780 |
| early-top-3 | −0.899 |
| early-top-5 | −0.924 |
| early-top-10 | −1.088 |
Sharpening is not only eliminated but inverted under combined early QK surgery.
4.5 Distributional Comparison: Early vs. Late
| Group | Mean | Std | Min | Frac < 0.5 |
|---|---|---|---|---|
| Early (L0–10) | 0.872 | 0.374 | −0.780 | 0.12 |
| Late (L11–27) | 0.968 | 0.091 | 0.688 | 0.00 |
The late-layer distribution is tight and high (OV-side). The early-layer distribution has a heavy lower tail (QK-side heads present). A t-test on the group means is non-significant (p=0.17) because most early heads also have recovery near 1.0. The difference lies in the lower tail (12% of early heads fall below 0.5, versus none of the late heads) and in the minimum, not in the means.
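The columns of the table above can be reproduced from a list of per-head recovery values with a few lines. This is a sketch: whether the reported Std is the sample or the population standard deviation is not stated, and the sample version is assumed here.

```python
from statistics import mean, stdev


def summarize(recoveries, threshold=0.5):
    """Mean, sample std, minimum, and fraction of heads below threshold."""
    return {
        "mean": mean(recoveries),
        "std": stdev(recoveries),
        "min": min(recoveries),
        "frac_below": sum(r < threshold for r in recoveries) / len(recoveries),
    }
```

Two groups can have similar means while differing sharply in `min` and `frac_below`, which is the pattern the comparison rests on.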
4.6 Convergence with Causal Patching
An independent causal patching experiment (transplanting hidden states from the modified forward pass into the original at specific layers) found peak information recovery at layer 3 — the boundary signal transfers most efficiently through the residual stream at that depth. The early-layer QK sweep finds load-bearing routing heads concentrated in layers 0–3 with effect diminishing sharply by layer 6. Two independent methods converge on the same layer range.
4.7 Cross-Scale Validation: Llama-3.2-1B-Instruct
| Metric | 3B | 1B |
|---|---|---|
| Best early crasher | L0H22, R=−0.780 | L0H31, R=−1.357 |
| 2nd early crasher | L2H4, R=−0.672 | L1H25, R=−1.176 |
| Early layer gradient | L0>L2>L3>L1, no inversion from L4 | L0>L1>L2>L3, dead by L4 |
| Late OV mean (R) | 0.968 | 1.071 |
| Late OV min (R) | 0.688 | 0.967 |
| Top crashers: BOS sinks? | yes | yes |
The 1B circuit has the same structure and is, if anything, more extreme: top crasher R=−1.357, BOS sinks perfectly concentrated (max_weight=1.0000), late OV floor higher (min=0.967 vs. 0.688). The two-stage BOS-anchoring circuit is not a 3B artifact; it is consistent across the 1B–3B range of the Llama 3.2 family.
5. Discussion
5.1 The Two-Stage Circuit
Stage 1 (layers 0–3, QK-side): Attention heads in early layers are load-bearing for sharpening — disrupting their attention patterns eliminates or inverts the effect. L0H22 is the single most load-bearing head.
Stage 2 (layers 15–27, OV-side): The sharpening signal is carried in value representations of late-layer heads. By this stage, attention pattern no longer matters — the information is encoded in what these heads extract, regardless of where they look.
The residual stream serves as the communication channel: Stage 1 heads inject a stable representation that Stage 2 OV projections read out into logit space.
5.2 Early Crashers Are BOS Attention Sinks
The boundary-detection hypothesis predicted that L0H22 would attend to the tokenization seam in sharpening contexts. Direct attention visualization (n=60 pairs) refuted this: L0H22 is a near-perfect BOS attention sink — frac_argmax_BOS=1.00, max_weight=0.9999, entropy=0.001. It attends exclusively to the BOS token regardless of context, operator, or K value. Seam attention is effectively zero (mean=0.0000), correlation with K: r=+0.091, p=0.49.
Follow-up probes on the next two largest early crashers:
| Head | frac_BOS | max_weight | entropy | verdict |
|---|---|---|---|---|
| L0H22 | 1.00 | 0.9999 | 0.001 | Pure BOS sink |
| L2H4 | 1.00 | 0.9033 | 0.595 | BOS sink |
| L2H20 | 0.70 | 0.7770 | 0.826 | Mixed |
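The three sink statistics in the table can be computed from a head's attention rows. The sketch below assumes each row is one attention distribution (for example, the final query position of one prompt) and that the statistics are averaged over rows; that aggregation is an assumption, as is the function name.

```python
import math


def sink_metrics(rows, bos_index=0):
    """Summarize how strongly a head attends to the BOS position.

    rows: list of attention distributions, each a list of weights summing to 1.
    """
    n = len(rows)
    argmax_is_bos = sum(
        max(range(len(r)), key=r.__getitem__) == bos_index for r in rows
    )
    mean_max_weight = sum(max(r) for r in rows) / n
    mean_entropy = sum(
        -sum(w * math.log(w) for w in r if w > 0.0) for r in rows
    ) / n
    return {
        "frac_argmax_BOS": argmax_is_bos / n,
        "max_weight": mean_max_weight,
        "entropy": mean_entropy,
    }
```

A pure sink has `frac_argmax_BOS` near 1, `max_weight` near 1, and entropy near 0. A mixed head shows a lower maximum weight and higher entropy.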
The early QK-side mechanism is not boundary detection but BOS anchoring: these heads inject a stable BOS-derived constant into the residual stream at every position. When Q is zeroed, the constant BOS representation is replaced by a position-averaged mean — a context-dependent perturbation that propagates forward and collapses sharpening.
BOS anchor content probe (n=30 sharpening + 30 non-sharpening). BOS token hidden states were extracted at layers 0–3 via output_hidden_states=True and projected through the LN-folded unembedding. Cross-group cosine similarity = 1.0000 at all four layers; intra-group similarity = 1.0000. This is expected when BOS sits at position 0: under causal masking that position attends only to itself, so its hidden state cannot depend on the tokens that follow. BOS-to-target projection correlates with K at r=−0.183 to r=+0.106 (all p > 0.15, non-significant). The BOS anchor is purely structural: it injects an identical constant regardless of whether the context produces sharpening. The sharpening signal does not live in what the anchor says; it lives in how the residual stream at non-BOS positions interacts with this constant injection across layers.
5.3 The Pythia Anomaly
Cross-model testing on Pythia-1.4B, Pythia-2.8B, and Pythia-6.9B using the same 60 merge-operator sentence pairs reveals a systematic anti-correlation with Llama and Mistral:
| Model pair | r | p |
|---|---|---|
| Llama vs. Mistral | +0.613 | < 0.001 |
| Pythia-1.4B vs. Llama | −0.485 | < 0.001 |
| Pythia-1.4B vs. Mistral | −0.512 | < 0.001 |
Pythia’s K rankings are inverted relative to Llama/Mistral — sentences Llama finds easy to absorb are the ones Pythia finds disruptive, and vice versa. This anti-correlation persists after controlling for tokenizer structure (filtering to 50 pairs where both tokenizers produce the same token count before and after, r=−0.490).
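The correlations in the table are Pearson coefficients between per-sentence K values from two models. A minimal sketch of the coefficient and its t statistic follows; the function names are illustrative, and the p-value lookup against a t distribution with n − 2 degrees of freedom is left out.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
```

For r = −0.485 with n = 60 pairs, the t statistic has magnitude slightly above 4.2, consistent with the reported p < 0.001.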
Scaling behavior:
| Model | r(K_pythia, K_llama) |
|---|---|
| Pythia-1.4B | −0.485 |
| Pythia-2.8B | −0.502 |
| Pythia-6.9B | +0.288 |
The anti-correlation does not fade gradually: it is essentially unchanged from 1.4B to 2.8B (−0.485 to −0.502), then flips sign between 2.8B and 6.9B (+0.288). With only three model sizes, this looks like an abrupt phase transition rather than a smooth convergence. The cause, whether training data (The Pile vs. web-crawled corpora), architecture (GPT-NeoX vs. LLaMA-style), or a parameter-scale threshold, cannot be cleanly isolated without controlled corpus ablations that are outside the current scope.
5.4 Architectural Generalization: Substrate Follow-Up
A subsequent experiment trained a 244M parameter tokenizer-free byte-level transformer (Substrate) from scratch on Project Gutenberg, then applied the same W_O ablation protocol used on BPE models.
| Model | Architecture | L00 W_O=0 BPB delta | All-layers delta |
|---|---|---|---|
| Llama-3.2-1B-Instruct | flat decoder | +1.202 | +1.696 |
| Mistral-7B-v0.1 | flat decoder | +0.503 | +1.900 |
| Pythia-160M-deduped | flat decoder | +0.302 | +1.328 |
| Substrate 244M | hierarchical encoder/decoder | +0.001 | +0.052 |
BOS sinks emerge in Substrate without any BOS token or vocabulary: a head that places 340× the baseline attention weight on position 0 develops from scratch, so the sink topology is conserved. But the ablations are null (max BPB delta +0.008 across the individual ablation types, +0.052 with W_O zeroed in all layers), versus +0.302 BPB for the same W_O ablation on layer 0 of Pythia-160M.
The difference traces to architecture: Substrate’s global transformer is near-optional (bypassing it entirely costs +0.047 BPB); the local encoder carries prediction signal directly. Pythia is a flat decoder-only stack where every sublayer is in the primary prediction path. The sharpening circuit’s BOS-anchoring stage is therefore architecture-dependent: load-bearing in flat decoder-only stacks, causally inert in hierarchical byte-level models.
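Bits per byte (BPB) normalizes loss by the UTF-8 length of the text, which lets tokenized and byte-level models be compared on the same scale. The sketch below assumes per-token negative log-likelihoods in nats are already available; how the source computed BPB is not specified, so the function names and inputs are hypothetical.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Total NLL converted from nats to bits, divided by the UTF-8 byte count."""
    n_bytes = len(text.encode("utf-8"))
    return sum(token_nll_nats) / (math.log(2) * n_bytes)


def bpb_delta(nll_ablated, nll_baseline, text):
    """BPB increase caused by an ablation on the same text."""
    return bits_per_byte(nll_ablated, text) - bits_per_byte(nll_baseline, text)
```

Because the denominator counts bytes rather than tokens, the two models' token counts need not match for the deltas to be comparable.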
Note on Mistral: W_O ablation has a causal effect only at L00 (+0.503) and L01 (+0.059); layers 2–31 are null or marginal. One possible explanation is that sliding window attention reduces individual head causality in later layers while the model compensates with early-layer capacity. This would be a structural parallel to the Substrate result, where peripheral subsystems absorb ablation effects rather than propagating them.
5.5 Implications for Prompt Robustness
Prompts that destabilize the BOS-sink signal — through unusual tokenization structure at the start of the sequence — may be more susceptible to whitespace-induced confidence shifts. A practical probe: run whitespace variants on a target prompt and measure K variance. High variance indicates the BOS-anchoring circuit is interacting with the prompt’s residual stream structure in a perturbation-sensitive way.
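The probe can be sketched end to end for the merge operator. Here `next_token_dist` is a hypothetical callable that maps a prompt string to a list of next-token probabilities (in practice a wrapper around a model forward pass); natural logarithms and the population variance are assumptions.

```python
import math
from statistics import pvariance


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def merge_variants(prompt):
    """All strings obtained by deleting exactly one space from the prompt."""
    return [prompt[:i] + prompt[i + 1:] for i, c in enumerate(prompt) if c == " "]


def k_variance(prompt, next_token_dist):
    """Variance of K over the single-space merge variants of a prompt."""
    h_s = entropy(next_token_dist(prompt))
    ks = [
        math.log(2) - (entropy(next_token_dist(v)) - h_s)
        for v in merge_variants(prompt)
    ]
    if len(ks) < 2:
        raise ValueError("need at least two whitespace variants")
    return pvariance(ks)
```

A prompt whose variants all leave the entropy unchanged yields zero variance. The same loop extends to split and shift variants by swapping the variant generator.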
5.6 Limitations
- Surgery experiments use n=20 sharpening pairs — a small sample. Recovery fractions are point estimates with substantial uncertainty.
- QK surgery via Q-zeroing is a crude intervention; it does not distinguish between “head attends to position X” and “head attends to some structure the causal mask permits.” Attention-weight-level hooks would be cleaner.
- The DLA approximation ignores subsequent residual additions and uses frozen LN scale. It is a screening tool, not a causal attribution method.
- The Pythia anomaly’s cause remains unidentified. Disentangling training data, architecture, and scale effects requires controlled corpus ablations not undertaken here.
6. Conclusion
Whitespace-induced sharpening in Llama-3.2-3B-Instruct operates via a two-stage circuit. Early BOS-anchoring heads (layers 0–3) are load-bearing via a constant BOS-derived residual injection. Late OV projections (layers 15–27) carry the sharpening signal to the output. These stages are experimentally separable via QK surgery, and the early-layer localization is corroborated by an independent causal patching experiment. Stage 1 is BOS anchoring, not boundary detection, which extends the attention-sinks literature to a functional load-bearing role.
The BOS anchor injects the same constant into the residual stream regardless of context. The sharpening signal lives not in what that constant “says” but in how the downstream architecture uses it. Why disrupting BOS anchoring specifically collapses sharpening — rather than affecting other output properties equally — remains open.
A tokenizer-free follow-up suggests that position-0 attention sinks emerge in causal transformers even without a BOS token, and that their load-bearing role is architecture-dependent. The Pythia anomaly (anti-correlation with Llama/Mistral on K rankings, with a phase transition between 2.8B and 6.9B) is reported without explanation.
Appendix: Experiment Summary
| Experiment | Description |
|---|---|
| DLA screening + late QK surgery | Top-10 DLA heads in layers 15–27, QK surgery on n=20 sharpening pairs |
| Random-head control | 5 draws of n=1/3/5/10 random heads, recovery comparison |
| Early-layer QK sweep | All 264 heads in layers 0–10, individual QK surgery |
| Causal patching (Q6) | Hidden state transplant across 28 layers, peak recovery at layer 3 |
| L0H22 attention probe | n=60 pairs, attention weights visualized, correlation with K |
| Early crasher class probe | BOS sink classification for top-3 early crashers |
| BOS anchor content probe | Hidden states at BOS position projected through unembedding, n=60 pairs |
| 1B cross-scale validation | Full early sweep and late OV check on Llama-3.2-1B-Instruct |
| Pythia scaling suite | Merge operator K rankings, Pythia 1.4B/2.8B/6.9B vs. Llama |
| Substrate W_O ablation | Layer-by-layer W_O zeroing on four models, BPB delta measured |