Sharpening Circuits in BPE-Tokenized Language Models: Early QK Routing and Late OV Projection

Abstract

Whitespace perturbations applied to prompts for BPE-tokenized language models produce a systematic sharpening effect: in approximately 34% of merge-operator cases, removing a single space increases model confidence in the next-token prediction (K > log 2). We investigate the circuit-level mechanism in Llama-3.2-3B-Instruct using direct logit attribution (DLA) and QK surgery. A two-stage mechanism emerges: (1) attention heads in layers 0–3 route information via QK patterns that are load-bearing for sharpening; zeroing the queries of layer-0 head 22 alone yields a recovery of −0.780 (versus 0.994 for the top DLA-ranked late-layer head), inverting sharpening to spreading; (2) attention heads in layers 15–27 carry the sharpening signal in value representations (OV-side), surviving complete destruction of their attention patterns with recovery between 0.826 and 1.134 across the tested head sets. Stage-1 heads are BOS attention sinks, not boundary-detection heads: they inject a constant BOS-derived representation regardless of context. Cross-scale validation on Llama-3.2-1B-Instruct confirms the circuit structure. A follow-up tokenizer-free experiment (Substrate 244M) shows that position-0 attention sinks emerge even without a BOS token, but that their load-bearing role is architecture-dependent.

1. Introduction

BPE tokenizers encode whitespace as part of token identity: " the" and "the" are different vocabulary entries with different embeddings and different learned weights. A single space character inserted or removed can completely change what the model receives. Prior work established that such perturbations redistribute probability mass in a structured way — probability is divided, not dispersed (JSD < 0.3 nats universally) — and that in approximately 34% of merge-operator cases, the perturbation sharpens the output distribution. The model becomes more confident in its prediction after a space is removed.

What circuit produces this sharpening? This paper addresses that question mechanistically. The answer has two parts: early QK routing (layers 0–3) establishes which token positions carry boundary-sensitive information, and late OV projections (layers 15–27) carry the signal to the output. These stages are experimentally separable: forcing uniform attention on late heads leaves sharpening intact, while forcing uniform attention on specific early heads eliminates or inverts it.

2. Background

2.1 Whitespace Operators and the Sharpening Phenomenon

Three primitive operators over text strings:

merge (×) — remove a space, fusing adjacent tokens
split (÷) — insert a space mid-word, forcing re-segmentation
shift (≀) — move a space one character left or right
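
The three operators can be sketched as plain string functions. This is a minimal illustration, not the code used in the experiments: the function names, the index convention (i is the character index of the space, or the insertion point for split), and the helper merge_variants are all assumptions introduced here.

```python
def merge(s: str, i: int) -> str:
    """Remove the space at index i, fusing the adjacent words."""
    if s[i] != " ":
        raise ValueError(f"no space at index {i}")
    return s[:i] + s[i + 1:]


def split(s: str, i: int) -> str:
    """Insert a space before index i (intended for mid-word positions)."""
    if not 0 < i < len(s):
        raise ValueError("index must be strictly inside the string")
    return s[:i] + " " + s[i:]


def shift(s: str, i: int, direction: int) -> str:
    """Move the space at index i one character left (-1) or right (+1)."""
    if s[i] != " " or direction not in (-1, 1):
        raise ValueError("need a space at index i and a direction of -1 or +1")
    j = i + direction
    if not 0 <= j < len(s):
        raise ValueError("shift would move the space outside the string")
    chars = list(s)
    chars[i], chars[j] = chars[j], chars[i]  # swap the space with its neighbour
    return "".join(chars)


def merge_variants(s: str) -> list[str]:
    """All variants of s obtained by removing exactly one space."""
    return [merge(s, i) for i, ch in enumerate(s) if ch == " "]
```

Each variant string would then be re-tokenized from scratch, so a one-character edit can change the token sequence well beyond the edited position.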

For a prompt string s and variant v, define the entropy residual:

K = log(2) − (H(P_v) − H(P_s))

where H(P) is Shannon entropy of the next-token distribution. K > log(2) indicates sharpening: the variant distribution is more concentrated than the Division Law baseline predicts.
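
A minimal sketch of the entropy residual, assuming the two next-token distributions are already available as lists of probabilities. The function names are illustrative, not taken from the experimental code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def entropy_residual(p_s, p_v):
    """K = log(2) - (H(P_v) - H(P_s)) for original p_s and variant p_v."""
    return math.log(2) - (entropy(p_v) - entropy(p_s))


def is_sharpening(p_s, p_v):
    """K > log(2) is equivalent to H(P_v) < H(P_s)."""
    return entropy_residual(p_s, p_v) > math.log(2)
```

Because log(2) appears on both sides of the threshold, the sharpening test reduces to checking whether the variant distribution has lower entropy than the original.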

Conservation holds in every case tested: JSD(P_s, P_v) < 0.3 nats across all pairs and three model families (Llama-3.2-3B-Instruct, Pythia-1.4B, Mistral-7B). Sharpening rate for the merge operator: ~34% of held-out pairs.
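
For reference, the Jensen-Shannon divergence in nats can be computed as below. It is bounded above by log(2) ≈ 0.693, so the 0.3 threshold sits below half of the maximum possible value. This is a sketch with illustrative function names, assuming both distributions are given over the same vocabulary.

```python
import math


def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence in nats, symmetric in p and q."""
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture; m_i > 0 wherever p_i or q_i > 0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```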

2.2 Direct Logit Attribution (DLA)

DLA projects per-layer, per-head attention outputs through the LN-folded unembedding direction for a target token, yielding a scalar contribution score per head. We use the linear approximation: for head h at layer l, contribution = W_O_h @ v_h projected through final LN scale and unembedding weight. This ignores subsequent residual additions but is standard for screening.
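
The linear approximation can be sketched for a single head as follows. This is an illustration under stated assumptions, not the screening code itself: matrices are plain nested lists, z_h stands for the head's attention-weighted value vector (v_h in the text), ln_gain is the final-norm gain vector, and frozen_scale is a hypothetical scalar standing in for the frozen normalization factor.

```python
def matvec(W, x):
    """Multiply a (rows x cols) nested-list matrix by a length-cols vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dla_head(W_O_h, z_h, ln_gain, unembed_row, frozen_scale=1.0):
    """Scalar contribution of one head to one target token's logit.

    W_O_h:        (d_model x head_dim) slice of the output projection
    z_h:          (head_dim,) attention-weighted value vector for the head
    ln_gain:      (d_model,) final-norm gain, folded into the unembedding
    unembed_row:  (d_model,) unembedding weights for the target token
    frozen_scale: assumed scalar normalization factor, held fixed
    """
    head_out = matvec(W_O_h, z_h)                              # write into residual
    direction = [g * u for g, u in zip(ln_gain, unembed_row)]  # LN-folded direction
    return frozen_scale * sum(d * o for d, o in zip(direction, head_out))
```

Because every step is linear once the scale is frozen, per-head scores computed this way add up to the projection of the summed head outputs, which is what makes them usable for ranking.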

2.3 QK Surgery

For head h at layer l, we zero head h’s slice of the Q projection output. Since Q_h = 0, attention scores Q_h K^T = 0 everywhere. Combined with the causal mask, softmax gives uniform attention over valid past positions — the head sees an average of all past value vectors, eliminating routing while preserving the OV computation.
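
The claim that Q-zeroing yields uniform causal attention can be checked on a toy example. The sketch below implements single-head causal attention weights over nested lists; it is an illustration of the argument, not the hook used on the model.

```python
import math


def causal_attention(Q, K):
    """Attention weights softmax(Q K^T / sqrt(d)) under a causal mask.

    Q, K: lists of T vectors of dimension d. Row t of the result is a
    distribution over positions 0..t, padded with zeros for masked positions.
    """
    d = len(Q[0])
    T = len(Q)
    rows = []
    for t, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(t + 1)]          # only positions j <= t are valid
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        rows.append([e / z for e in exps] + [0.0] * (T - t - 1))
    return rows


# With Q zeroed, every valid score is 0, so row t is uniform over t + 1 positions.
K_toy = [[0.3, -1.2], [2.0, 0.5], [-0.7, 0.9]]
Q_zero = [[0.0, 0.0] for _ in K_toy]
uniform_rows = causal_attention(Q_zero, K_toy)
```

Note that the first query position always attends to itself with weight 1, with or without surgery, so the intervention only changes rows from the second position onward.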

Recovery fraction:

R = (K_uniform − log(2)) / (K_orig − log(2))

R ≈ 1.0 — sharpening survives uniform attention → OV-side
R ≈ 0.0 — sharpening eliminated → QK-side
R < 0 — uniform attention inverts the effect → strongly QK-side
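
A small sketch of the recovery fraction and the three-way reading above. The tolerance of 0.25 and the "intermediate" label are assumptions added for illustration; the text only gives the approximate anchors R ≈ 1, R ≈ 0, and R < 0.

```python
import math

LOG2 = math.log(2)


def recovery(k_uniform, k_orig):
    """R = (K_uniform - log 2) / (K_orig - log 2)."""
    denom = k_orig - LOG2
    if denom == 0.0:
        raise ValueError("K_orig equals log(2): no sharpening to recover")
    return (k_uniform - LOG2) / denom


def classify(r, tol=0.25):
    """Map a recovery fraction to a label; tol is an assumed cut-off."""
    if r < -tol:
        return "strongly QK-side"   # uniform attention inverts the effect
    if r <= tol:
        return "QK-side"            # sharpening eliminated
    if r >= 1.0 - tol:
        return "OV-side"            # sharpening survives uniform attention
    return "intermediate"
```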

3. Methods

Model: Llama-3.2-3B-Instruct (FP16, 28 layers, 24 heads, head_dim=128, hidden_size=3072).

Data: 180 held-out pairs (60 sentences × 3 operators) from WikiText-103 validation. 61 merge pairs have K > log(2). Surgery experiments use top-20 sharpening pairs by K magnitude (K_baseline mean = 1.13).

Phase 1 — DLA Screening: For 40 sharpening and 40 non-sharpening pairs, compute per-head DLA delta = DLA(modified) − DLA(original) toward the original’s top-1 token. Rank by specificity = mean delta (sharpening) − mean delta (non-sharpening). Top-10 heads selected as candidates.

Phase 2 — Late-Layer QK Surgery: Apply QK surgery to top-1/3/5/10 DLA-ranked heads on 20 sharpening pairs. Compute recovery.

Phase 3 — Control: 5 random draws of n=1/3/5/10 non-ranked heads. Compare recovery to ranked heads.

Phase 4 — Early-Layer Sweep: Individual QK surgery on all 264 heads in layers 0–10. Rank by recovery. Combine top crashers (n=1/3/5/10). Sample 30 late-layer heads (11–27) for comparison.
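
The Phase 1 ranking step can be sketched as follows, assuming the per-pair DLA deltas have already been collected for each head. The dictionary layout and function names are hypothetical.

```python
from statistics import mean


def specificity(sharp_deltas, non_sharp_deltas):
    """Mean DLA delta on sharpening pairs minus mean on non-sharpening pairs."""
    return mean(sharp_deltas) - mean(non_sharp_deltas)


def rank_heads(deltas, top_k=10):
    """Return the top_k heads by specificity, highest first.

    deltas: {(layer, head): (sharp_deltas, non_sharp_deltas)}
    """
    scored = {h: specificity(s, n) for h, (s, n) in deltas.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]
```

A head can score high either by promoting the target in sharpening pairs or by suppressing it in non-sharpening pairs, so the ranking alone does not say which of the two is happening.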

4. Results

4.1 DLA Identifies Late-Layer Sharpening-Specific Heads

Top DLA-ranked heads cluster in layers 15, 22–27. Best specificity: L25H15 (specificity=+0.087). In sharpening pairs, these heads show near-zero DLA delta. In non-sharpening pairs, they show negative delta (suppression). The asymmetry: sharpening contexts do not strongly promote the target via these heads — non-sharpening contexts suppress it more.

4.2 Late-Layer QK Surgery: OV-Side

Config	Recovery
top-1	0.994
top-3	0.969
top-5	0.826
top-10	1.134

All configurations are OV-side. Top-10 recovery exceeds 1.0, which suggests that some ranked heads are mildly anti-sharpening under their normal attention patterns, so removing the routing slightly amplifies the effect. None of the tested late-layer head sets disrupts sharpening under QK surgery.

4.3 Control Confirms Non-Specificity

Random late-layer heads: recovery mean=0.968, std=0.091, no individual head below 0.688. Random early-layer heads: high variance — some draws give recovery near 0 or negative. The OV-side result is a property of the late layers, not specific to DLA-ranked heads.

4.4 Early-Layer Sweep: QK-Side, Localized to Layers 0–3

Layer	Lowest single-head recovery
0	−0.780
1	−0.057
2	−0.672
3	−0.103
4	+0.152
5	+0.135
6	+0.512
7–10	0.59 – 0.74

Layer 0 head 22 (L0H22): recovery = −0.780. A single head whose Q-zeroing flips sharpening to spreading.

Layer 2: densely represented — heads 4, 17, 18, 20 all in top-15 crashers.

Combinations of top early crashers:

Config	Recovery
early-top-1	−0.780
early-top-3	−0.899
early-top-5	−0.924
early-top-10	−1.088

Sharpening not only eliminated but inverted under combined early QK surgery.

4.5 Distributional Comparison: Early vs. Late

Group	Mean	Std	Min	Frac < 0.5
Early (L0–10)	0.872	0.374	−0.780	0.12
Late (L11–27)	0.968	0.091	0.688	0.00

Late-layer distribution is tight and high (OV-side). Early-layer distribution has a heavy lower tail (QK-side heads present). The t-test is non-significant (p=0.17) because both group means are near 1.0 — the relevant statistic is the tail behavior and the minimum, not the means.
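
The summary columns in the table can be reproduced from a list of per-head recovery values with a few lines. This sketch assumes the sample standard deviation; the text does not say which estimator was used, and the function name is illustrative.

```python
from statistics import mean, stdev


def recovery_summary(recoveries, threshold=0.5):
    """Mean, sample std, minimum, and fraction of heads below threshold."""
    return {
        "mean": mean(recoveries),
        "std": stdev(recoveries),
        "min": min(recoveries),
        "frac_below": sum(r < threshold for r in recoveries) / len(recoveries),
    }
```

A few strongly negative heads barely move the mean when most heads sit near 1.0, which is why the minimum and the below-threshold fraction separate the two groups more clearly than a test on means.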

4.6 Convergence with Causal Patching

An independent causal patching experiment (transplanting hidden states from the modified forward pass into the original at specific layers) found peak information recovery at layer 3 — the boundary signal transfers most efficiently through the residual stream at that depth. The early-layer QK sweep finds load-bearing routing heads concentrated in layers 0–3 with effect diminishing sharply by layer 6. Two independent methods converge on the same layer range.

4.7 Cross-Scale Validation: Llama-3.2-1B-Instruct

Metric	3B	1B
Best early crasher	L0H22, R=−0.780	L0H31, R=−1.357
2nd early crasher	L2H4, R=−0.672	L1H25, R=−1.176
Early layer gradient	L0>L1>L2>L3, dead by L4	L0>L1>L2>L3, dead by L4
Late OV mean (R)	0.968	1.071
Late OV min (R)	0.688	0.967
Top crashers: BOS sinks?	yes	yes

The 1B circuit is structurally identical and if anything more extreme: top crasher R=−1.357, BOS sinks perfectly concentrated (max_weight=1.0000), late OV floor higher (min=0.967 vs. 0.688). The two-stage BOS-anchoring circuit is not a 3B artifact — it is a consistent property of the Llama 3.2 architecture across the 1B–3B range.

5. Discussion

5.1 The Two-Stage Circuit

Stage 1 (layers 0–3, QK-side): Attention heads in early layers are load-bearing for sharpening — disrupting their attention patterns eliminates or inverts the effect. L0H22 is the single most load-bearing head.

Stage 2 (layers 15–27, OV-side): The sharpening signal is carried in value representations of late-layer heads. By this stage, attention pattern no longer matters — the information is encoded in what these heads extract, regardless of where they look.

The residual stream serves as the communication channel: Stage 1 heads inject a stable representation that Stage 2 OV projections read out into logit space.

5.2 Early Crashers Are BOS Attention Sinks

The boundary-detection hypothesis predicted that L0H22 would attend to the tokenization seam in sharpening contexts. Direct attention visualization (n=60 pairs) refuted this: L0H22 is a near-perfect BOS attention sink — frac_argmax_BOS=1.00, max_weight=0.9999, entropy=0.001. It attends exclusively to the BOS token regardless of context, operator, or K value. Seam attention is effectively zero (mean=0.0000), correlation with K: r=+0.091, p=0.49.
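
The sink statistics quoted above can be computed from a head's attention rows roughly as follows. This is a sketch: it assumes one attention distribution per prompt (for example the row for the final query position), that BOS sits at index 0, and that max_weight and entropy are averaged across prompts, which the text does not state explicitly.

```python
import math


def sink_metrics(attn_rows, bos_index=0):
    """Summarize how strongly a head's attention concentrates on BOS.

    attn_rows: list of attention distributions, each a list summing to 1.
    """
    n = len(attn_rows)
    argmax_is_bos = [
        max(range(len(row)), key=row.__getitem__) == bos_index for row in attn_rows
    ]
    entropies = [-sum(w * math.log(w) for w in row if w > 0.0) for row in attn_rows]
    return {
        "frac_argmax_bos": sum(argmax_is_bos) / n,
        "mean_max_weight": sum(max(row) for row in attn_rows) / n,
        "mean_entropy": sum(entropies) / n,
    }
```

A pure sink shows frac_argmax_bos near 1, a maximum weight near 1, and entropy near 0 simultaneously; a head with a high BOS argmax fraction but substantial entropy is spreading residual weight elsewhere.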

Follow-up probes on the next two largest early crashers:

Head	frac_BOS	max_weight	entropy	verdict
L0H22	1.00	0.9999	0.001	Pure BOS sink
L2H4	1.00	0.9033	0.595	BOS sink
L2H20	0.70	0.7770	0.826	Mixed

The early QK-side mechanism is not boundary detection but BOS anchoring: these heads inject a stable BOS-derived constant into the residual stream at every position. When Q is zeroed, the constant BOS representation is replaced by a position-averaged mean — a context-dependent perturbation that propagates forward and collapses sharpening.

BOS anchor content probe (n=30 sharpening + 30 non-sharpening). BOS token hidden states extracted at layers 0–3 via output_hidden_states=True, projected through LN-folded unembedding. Cross-group cosine similarity = 1.0000 at all four layers; intra-group similarity = 1.0000. This is the expected result under causal masking: the BOS position attends only to itself, so its hidden state cannot depend on the tokens that follow. BOS-to-target projection correlates with K at r=−0.183 to r=+0.106 (all p > 0.15, non-significant). The BOS anchor is purely structural: it injects an identical constant regardless of whether the context produces sharpening. The sharpening signal does not live in what the anchor says; it lives in how the residual stream at non-BOS positions interacts with this constant injection across layers.

5.3 The Pythia Anomaly

Cross-model testing on Pythia-1.4B, Pythia-2.8B, and Pythia-6.9B using the same 60 merge-operator sentence pairs reveals a systematic anti-correlation with Llama and Mistral:

Model pair	r	p
Llama vs. Mistral	+0.613	< 0.001
Pythia-1.4B vs. Llama	−0.485	< 0.001
Pythia-1.4B vs. Mistral	−0.512	< 0.001

Pythia’s K rankings are inverted relative to Llama/Mistral — sentences Llama finds easy to absorb are the ones Pythia finds disruptive, and vice versa. This anti-correlation persists after controlling for tokenizer structure (filtering to 50 pairs where both tokenizers produce the same token count before and after, r=−0.490).

Scaling behavior:

Model	r(K_pythia, K_llama)
Pythia-1.4B	−0.485
Pythia-2.8B	−0.502
Pythia-6.9B	+0.288

The anti-correlation does not fade gradually: it strengthens slightly from 1.4B (−0.485) to 2.8B (−0.502), then flips sign between 2.8B and 6.9B (+0.288). With only three model sizes, this pattern is more consistent with an abrupt transition than with smooth convergence, but it cannot establish one. The cause, whether training data (The Pile vs. web-crawled corpora), architecture (GPT-NeoX vs. LLaMA-style), or a parameter-scale threshold, cannot be cleanly isolated without controlled corpus ablations, which are outside the current scope.

5.4 Architectural Generalization: Substrate Follow-Up

A subsequent experiment trained a 244M parameter tokenizer-free byte-level transformer (Substrate) from scratch on Project Gutenberg, then applied the same W_O ablation protocol used on BPE models.

Model	Architecture	L00 W_O=0 BPB delta	All-layers delta
Llama-3.2-1B-Instruct	flat decoder	+1.202	+1.696
Mistral-7B-v0.1	flat decoder	+0.503	+1.900
Pythia-160M-deduped	flat decoder	+0.302	+1.328
Substrate 244M	hierarchical encoder/decoder	+0.001	+0.052

BOS-style sinks emerge in Substrate without any BOS token or vocabulary: a head that places 340× the baseline attention weight on position 0 develops from scratch. The sink topology is conserved. But all ablations are null (max BPB delta +0.008 across all ablation types), versus +0.302 BPB for the same W_O ablation on Pythia-160M’s early layers.

The difference traces to architecture: Substrate’s global transformer is near-optional (bypassing it entirely costs +0.047 BPB); the local encoder carries prediction signal directly. Pythia is a flat decoder-only stack where every sublayer is in the primary prediction path. The sharpening circuit’s BOS-anchoring stage is therefore architecture-dependent: load-bearing in flat decoder-only stacks, but not load-bearing in hierarchical byte-level models.

Note on Mistral: W_O ablation has a causal effect only at L00 (+0.503) and L01 (+0.059); layers 2–31 are null or marginal. One possible explanation is that sliding window attention reduces individual head causality in later layers while the model compensates with early-layer capacity, a structural parallel to the Substrate result, where peripheral subsystems absorb ablation effects rather than propagating them.

5.5 Implications for Prompt Robustness

Prompts that destabilize the BOS-sink signal — through unusual tokenization structure at the start of the sequence — may be more susceptible to whitespace-induced confidence shifts. A practical probe: run whitespace variants on a target prompt and measure K variance. High variance indicates the BOS-anchoring circuit is interacting with the prompt’s residual stream structure in a perturbation-sensitive way.
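
The probe described above can be sketched as follows. To stay self-contained, the model is abstracted as a callable next_token_probs(prompt) that returns a next-token probability list; that callable, the function names, and the restriction to merge variants are all assumptions made for this illustration.

```python
import math
from statistics import pvariance


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def k_variance(prompt, next_token_probs):
    """Variance of K across all single-space merge variants of prompt.

    next_token_probs: hypothetical callable mapping a prompt string to a
    list of next-token probabilities (e.g. a wrapper around a model call).
    Returns (population variance of K, list of K values).
    """
    base_entropy = entropy(next_token_probs(prompt))
    ks = []
    for i, ch in enumerate(prompt):
        if ch != " ":
            continue
        variant = prompt[:i] + prompt[i + 1:]            # merge: drop one space
        delta_h = entropy(next_token_probs(variant)) - base_entropy
        ks.append(math.log(2) - delta_h)                 # K = log 2 - (H_v - H_s)
    if len(ks) < 2:
        raise ValueError("need at least two spaces to compute a variance")
    return pvariance(ks), ks
```

In practice the callable would wrap a forward pass and a softmax over the final position's logits; split and shift variants could be added to the loop in the same way.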

5.6 Limitations

Surgery experiments use n=20 sharpening pairs — a small sample. Recovery fractions are point estimates with substantial uncertainty.
QK surgery via Q-zeroing is a crude intervention; it does not distinguish “head attends to position X” from “head attends to some structure the causal mask permits.” Attention-weight-level hooks would be cleaner.
The DLA approximation ignores subsequent residual additions and uses frozen LN scale. It is a screening tool, not a causal attribution method.
The Pythia anomaly’s cause remains unidentified. Disentangling training data, architecture, and scale effects requires controlled corpus ablations not undertaken here.

6. Conclusion

Whitespace-induced sharpening in Llama-3.2-3B-Instruct operates via a two-stage circuit. Early BOS-anchoring heads (layers 0–3) are load-bearing via a constant BOS-derived residual injection. Late OV projections (layers 15–27) carry the sharpening signal to the output. These stages are experimentally separable via QK surgery and independently confirmed by causal patching. Stage 1 is BOS anchoring, not boundary detection — extending the attention-sinks literature to a functional load-bearing role.

The BOS anchor injects the same constant into the residual stream regardless of context. The sharpening signal lives not in what that constant “says” but in how the downstream architecture uses it. Why disrupting BOS anchoring specifically collapses sharpening — rather than affecting other output properties equally — remains open.

A tokenizer-free follow-up shows that position-0 attention sinks emerge even without a BOS token and that their load-bearing role is architecture-dependent. The Pythia anomaly (anti-correlation with Llama/Mistral on K rankings, with a sign flip between 2.8B and 6.9B) is reported without explanation.

Appendix: Experiment Summary

Experiment	Description
DLA screening + late QK surgery	Top-10 DLA heads in layers 15–27, QK surgery on n=20 sharpening pairs
Random-head control	5 draws of n=1/3/5/10 random heads, recovery comparison
Early-layer QK sweep	All 264 heads in layers 0–10, individual QK surgery
Causal patching (Q6)	Hidden state transplant across 28 layers, peak recovery at layer 3
L0H22 attention probe	n=60 pairs, attention weights visualized, correlation with K
Early crasher class probe	BOS sink classification for top-3 early crashers
BOS anchor content probe	Hidden states at BOS position projected through unembedding, n=60 pairs
1B cross-scale validation	Full early sweep and late OV check on Llama-3.2-1B-Instruct
Pythia scaling suite	Merge operator K rankings, Pythia 1.4B/2.8B/6.9B vs. Llama
Substrate W_O ablation	Layer-by-layer W_O zeroing on four architectures, BPB delta measured