1. Background
Prior work (R1 series) established that whitespace perturbations divide next-token probability mass in a conserved, structured way across BPE-tokenized language models. For the merge operator (removing a space between two words), approximately 34% of cases produce sharpening — the output distribution becomes more concentrated after the perturbation. Per-pair robustness is measured as:
K = log(2) − (H_modified − H_original)
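Here H is the entropy of the next-token distribution, so H_modified − H_original is the entropy change caused by the perturbation. High K means the model absorbed the perturbation. Low K means it was disrupted. K > log(2) means H_modified < H_original: the model actually sharpened.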
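As a minimal sketch of the K definition above, the snippet below computes K from two next-token distributions. It assumes H is Shannon entropy in nats (natural log, matching the log(2) offset); the function names and the toy four-token distributions are illustrative and do not come from the R1 series.

```python
import math

def entropy(probs):
    """Shannon entropy in nats (assumption: natural log) of a distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def robustness_k(p_original, p_modified):
    """K = log(2) - (H_modified - H_original)."""
    return math.log(2) - (entropy(p_modified) - entropy(p_original))

# Toy next-token distributions over a 4-token vocabulary (illustrative only).
p_orig = [0.4, 0.3, 0.2, 0.1]
p_sharp = [0.7, 0.1, 0.1, 0.1]      # more concentrated than p_orig
p_flat = [0.25, 0.25, 0.25, 0.25]   # less concentrated than p_orig

k_sharp = robustness_k(p_orig, p_sharp)
k_flat = robustness_k(p_orig, p_flat)

print("sharpened:", k_sharp > math.log(2))
print("disrupted:", k_flat < math.log(2))
```

An unchanged distribution gives K = log(2) exactly, which is why log(2) separates sharpening from flattening.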
This note asks whether different model families agree on which sentences are hard or easy to perturb. The answer is: Llama and Mistral do. Pythia doesn’t.
2. Setup
Models: Llama-3.2-3B-Instruct, Pythia-1.4B-deduped (EleutherAI), Mistral-7B-v0.1.
Data: 60 sentence pairs from WikiText-103 validation, merge operator only (one space removed per pair at a word boundary).
Metric: Pearson r between per-pair K scores across model pairs. Positive r means models agree on sentence-level robustness rankings. Negative r means they invert.
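The agreement metric can be sketched as follows. This is a plain-Python Pearson correlation, not the study's analysis code, and the K scores are made-up values chosen only to show one agreeing and one inverting model pair.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical per-pair K scores for three models (not data from this study).
k_model_a = [0.9, 0.5, 0.7, 0.3, 0.6]
k_model_b = [0.8, 0.4, 0.75, 0.35, 0.5]   # ranks sentences like model A
k_model_c = [0.2, 0.6, 0.3, 0.7, 0.5]     # ranks sentences in reverse

print("A vs B:", round(pearson_r(k_model_a, k_model_b), 3))
print("A vs C:", round(pearson_r(k_model_a, k_model_c), 3))
```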
3. Main Result
| Model pair | r | p |
|---|---|---|
| Llama-3.2-3B vs Mistral-7B | +0.613 | < 0.001 |
| Pythia-1.4B vs Llama-3.2-3B | −0.485 | < 0.001 |
| Pythia-1.4B vs Mistral-7B | −0.512 | < 0.001 |
Llama and Mistral agree moderately. Pythia inverts both, at comparable magnitude and with p < 0.001 in each case. This is not a null or weak effect: r = −0.49 to −0.51 against two independent model families is a systematic inversion.
4. Controls
4.1 Tokenizer Structure
Mistral-7B-v0.1 uses a SentencePiece-based BPE tokenizer, Llama-3.2 uses a tiktoken-style byte-level BPE tokenizer, and Pythia uses the GPT-NeoX byte-level BPE tokenizer. When a space is removed, different tokenizers may produce different before/after token counts, so “merge” could mean structurally different things to each model.
Control: We filtered to the 50 pairs for which both tokenizers in the comparison produced identical token counts before and after the perturbation (structurally matched pairs). The anti-correlation persists: r = −0.490 in the filtered set. Tokenizer mismatch does not explain the inversion.
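The structural-matching filter can be sketched as below. This reads "identical token counts" as meaning the two tokenizers agree on the count both before and after the merge, which is an interpretation rather than something the note spells out. The `PairCounts` record and its values are hypothetical.

```python
from typing import NamedTuple

class PairCounts(NamedTuple):
    """Token counts for one sentence pair under two tokenizers (hypothetical record)."""
    pair_id: int
    a_before: int   # tokenizer A, original sentence
    a_after: int    # tokenizer A, after the space is removed
    b_before: int   # tokenizer B, original sentence
    b_after: int    # tokenizer B, after the space is removed

def structurally_matched(pairs):
    """Keep pairs whose counts agree across tokenizers before and after the merge."""
    return [p for p in pairs
            if p.a_before == p.b_before and p.a_after == p.b_after]

# Made-up counts for illustration; not data from this study.
pairs = [
    PairCounts(0, 12, 12, 12, 12),
    PairCounts(1, 15, 14, 15, 15),   # tokenizers disagree after the merge
    PairCounts(2, 9, 10, 9, 10),
    PairCounts(3, 20, 20, 21, 21),   # tokenizers disagree throughout
]

matched_ids = [p.pair_id for p in structurally_matched(pairs)]
print(matched_ids)
```

The cross-model correlation is then recomputed on the surviving pairs only.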
4.2 Token Frequency Rank
If Pythia’s vocabulary assigns different frequency weights to merged forms, its confidence might shift in the opposite direction from Llama/Mistral. We computed normalized rank divergence between tokenizers for each sentence pair and tested correlation with K inversion.
Result: All r < 0.14, below the roughly 0.25 needed for significance at n = 60 (two-tailed, p < 0.05). Token frequency rank does not predict the inversion.
4.3 Model Size Confound
Pythia-1.4B is smaller than both Llama-3.2-3B and Mistral-7B. The inversion might simply reflect noisier behavior at lower parameter count, that is, random disagreement rather than systematic inversion.
Control: Pythia scaling sweep (Section 5). If the inversion were noise, it would decrease monotonically with scale. It does not.
5. Scaling Analysis
We ran three Pythia sizes (1.4B, 2.8B, and 6.9B) on the same 60 pairs.
| Model | r(K_pythia, K_llama) |
|---|---|
| Pythia-1.4B | −0.485 |
| Pythia-2.8B | −0.502 |
| Pythia-6.9B | +0.288 |
The anti-correlation strengthens slightly from 1.4B to 2.8B (−0.485 to −0.502) before reversing sign between 2.8B and 6.9B. This shape argues against a simple noise explanation: noise would decrease monotonically with scale, not get stronger before flipping. The 2.8B → 6.9B reversal looks like a qualitative threshold, not a smooth convergence.
6. Causal Patching
To localize where the whitespace perturbation signal lives in Llama-3.2-3B, we ran a causal patching experiment: at each layer, transplant the hidden state from the modified sentence’s forward pass into the original sentence’s forward pass and measure the change in K.
Result: Peak recovery occurs at layers 0–3 (+0.261 change in K, versus −0.045 for a random patch from an unrelated position). The effect of patching at layers 10–20 mostly washes out.
The whitespace perturbation signal is localized to the early residual stream in Llama. This is consistent with the BOS-anchoring circuit finding (sharpening_circuit_paper): the load-bearing routing heads for sharpening are concentrated in layers 0–3.
Whether the same localization holds for Pythia is untested. If Pythia’s early residual stream handles boundary-sensitive information differently — due to GPT-NeoX attention structure or different training data statistics — that could explain the inversion. This is speculative.
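The patching procedure in this section can be illustrated on a toy model. The sketch below is not the Llama experiment: it uses a hypothetical scalar "residual stream" with causal mixing, patches one position at one layer from the modified run into the original run, and measures the change in the final output rather than the change in K. All names and values are assumptions made for illustration.

```python
def forward(tokens, weights, patch=None):
    """Run a toy causal residual stack.

    tokens:  one initial hidden value (float) per position.
    weights: one mixing weight per layer.
    patch:   optional (layer, position, value); overwrites the hidden state
             entering `layer` at `position`.
    Returns (output at the last position, hidden states entering each layer).
    """
    h = list(tokens)
    cache = []
    for layer, w in enumerate(weights):
        if patch is not None and patch[0] == layer:
            h[patch[1]] = patch[2]
        cache.append(list(h))
        # Causal mixing: each position adds a weighted mean of itself and earlier positions.
        h = [h[i] + w * sum(h[: i + 1]) / (i + 1) for i in range(len(h))]
    return h[-1], cache

def patch_effects(original, modified, weights, position):
    """Per-layer change in output when patching `position` from the modified run."""
    base_out, _ = forward(original, weights)
    _, mod_cache = forward(modified, weights)
    effects = []
    for layer in range(len(weights)):
        donor = mod_cache[layer][position]
        patched_out, _ = forward(original, weights, patch=(layer, position, donor))
        effects.append(patched_out - base_out)
    return effects

weights = [0.5, 0.3, 0.2, 0.1]     # toy 4-layer stack
original = [1.0, 2.0, 3.0]
modified = [1.0, 2.5, 3.0]         # differs only at position 1

effects = patch_effects(original, modified, weights, position=1)
for layer, effect in enumerate(effects):
    print(f"layer {layer}: change in output = {effect:.4f}")
```

In this toy, patching the only differing position at layer 0 reproduces the modified run's output exactly, which is a useful sanity check for any real patching harness.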
7. What Remains Unexplained
The Pythia inversion cannot be cleanly attributed to any single factor. The three candidate explanations are entangled:
Training corpus (H1): Pythia is trained on The Pile, which has higher proportions of code, academic text, and structured web data relative to the largely web-crawled corpora used for Llama and Mistral. Compound words, domain-specific merged forms, and whitespace norms differ between these distributions. At sufficient scale (≥6.9B), Pythia may develop enough capacity to override The Pile’s composition and converge on the same bigram statistics that dominate in Llama/Mistral.
Architecture (H2): GPT-NeoX blocks differ from LLaMA-style blocks in several ways. Both use rotary positional embeddings, but GPT-NeoX applies them to only a fraction of each head’s dimensions, computes attention and MLP in parallel within each layer, and uses LayerNorm rather than RMSNorm. If the BOS-anchoring mechanism (layers 0–3) is organized differently under GPT-NeoX, it could produce systematically different whitespace robustness behavior.
Scale × corpus interaction (H3): The phase transition at 2.8B → 6.9B might reflect a scale threshold specific to The Pile’s composition — a point at which Pythia’s capacity exceeds the corpus’s ability to enforce The-Pile-specific priors. This would explain why the inversion strengthens before it flips rather than fading monotonically.
Disentangling H1 from H2 requires a LLaMA-architecture model trained on The Pile, or a GPT-NeoX model trained on web text. Neither was available for this study.
8. Limitations
- n=60 sentence pairs is small for a correlation study. Effect sizes should be treated as point estimates.
- The Pythia scaling sweep covers three points. The transition between 2.8B and 6.9B is unresolved: the reversal could occur anywhere between those two sizes.
- Causal patching was run on Llama only. Whether the early-layer localization holds for Pythia is untested.
- All sentences are from WikiText-103 (English Wikipedia, declarative prose). The inversion may be specific to this corpus or may generalize across text types; this was not tested.
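To make the small-sample caveat concrete, the sketch below computes approximate 95% confidence intervals for the reported correlations using the Fisher z-transformation. It assumes independent pairs and approximately bivariate-normal K scores, neither of which this note verifies, and the function name is illustrative.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson r via Fisher's z-transform."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                 # Fisher z-transform
    se = 1.0 / math.sqrt(n - 3)       # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Correlations reported in Sections 3 and 5, all at n = 60.
reported = [
    ("Llama vs Mistral", 0.613),
    ("Pythia-1.4B vs Llama", -0.485),
    ("Pythia-1.4B vs Mistral", -0.512),
    ("Pythia-2.8B vs Llama", -0.502),
    ("Pythia-6.9B vs Llama", 0.288),
]

for label, r in reported:
    lo, hi = fisher_ci(r, 60)
    print(f"{label}: r = {r:+.3f}, 95% CI ~ [{lo:+.3f}, {hi:+.3f}]")
```

Under these assumptions the interval for r = −0.485 is roughly −0.66 to −0.26, so the sign of the inversion is well supported even though its magnitude is not tightly pinned down.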
9. Conclusion
Pythia-1.4B systematically inverts whitespace robustness rankings relative to Llama-3.2-3B and Mistral-7B. The inversion is robust to tokenizer and frequency controls and strengthens with scale before reversing at 6.9B. The cause is not identified. The scaling shape — inversion strengthening then flipping — is inconsistent with a noise explanation and suggests a qualitative threshold rather than a gradual convergence.
This anomaly is reported without resolution. It is a constraint any mechanistic account of whitespace robustness must eventually explain.
Appendix: Experiment Summary
| Experiment | Pairs | Result |
|---|---|---|
| Cross-model K correlation (all 60) | 60 | Pythia inverts Llama/Mistral |
| Tokenizer-controlled subset | 50 | Inversion persists (r = −0.490) |
| Token frequency rank control | 60 | r < 0.14, null |
| Pythia 1.4B / 2.8B / 6.9B scaling | 60 | Sign flips between 2.8B and 6.9B (r = −0.485, −0.502, +0.288) |
| Causal patching, Llama layers 0–27 | 60 | Peak recovery at layers 0–3 |