Cross-Model Inversion in Whitespace Robustness: The Pythia Anomaly

Abstract

We document a systematic anti-correlation between Pythia-1.4B and Llama-3.2-3B / Mistral-7B on per-sentence whitespace robustness scores (K) for the merge operator. Where Llama and Mistral agree (r = +0.613), Pythia inverts both (r = −0.485 to −0.512). Controls for tokenizer structure and token frequency do not explain the inversion. A scaling sweep reveals the anti-correlation strengthens at 2.8B (r = −0.502) before reversing at 6.9B (r = +0.288), indicating a qualitative threshold rather than smooth convergence. Causal patching confirms the whitespace perturbation signal is localized to early residual stream layers (0–3) in Llama. The cause of the Pythia inversion — training corpus, architecture, or scale threshold — cannot be cleanly isolated without controlled ablations not undertaken here.

1. Background

Prior work (R1 series) established that whitespace perturbations divide next-token probability mass in a conserved, structured way across BPE-tokenized language models. For the merge operator (removing a space between two words), approximately 34% of cases produce sharpening — the output distribution becomes more concentrated after the perturbation. Per-pair robustness is measured as:

K = log(2) − (H_modified − H_original)

High K means the model absorbed the perturbation. Low K means it was disrupted. K > log(2) means the model actually sharpened.
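
As a minimal sketch of the metric, the snippet below computes K from two next-token probability vectors. It assumes H is the Shannon entropy of the next-token distribution measured in nats, which matches the log(2) offset but is not stated explicitly above. The function names are illustrative, not taken from the R1 code.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def robustness_k(p_original, p_modified):
    """K = log(2) - (H_modified - H_original), with H assumed to be entropy in nats."""
    return math.log(2) - (entropy(p_modified) - entropy(p_original))


# Toy distributions over a two-token vocabulary (illustrative values only).
flat = [0.5, 0.5]
peaked = [0.9, 0.1]
k_unchanged = robustness_k(flat, flat)    # equals log(2): entropy did not move
k_sharpened = robustness_k(flat, peaked)  # exceeds log(2): entropy dropped
```

Under this reading, K = log(2) marks an unchanged entropy, and any value above it corresponds to the sharpening case.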

This note asks whether different model families agree on which sentences are hard or easy to perturb. The answer is: Llama and Mistral do. Pythia doesn’t.

2. Setup

Models: Llama-3.2-3B-Instruct, Pythia-1.4B-deduped (EleutherAI), Mistral-7B-v0.1.

Data: 60 sentence pairs from WikiText-103 validation, merge operator only (one space removed per pair at a word boundary).
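
The merge operator itself is simple to illustrate. The sketch below removes one space at a chosen word boundary; the helper name and the 0-based boundary index are assumptions made for illustration, and how boundaries were actually selected for the 60 pairs is not specified here.

```python
def merge_at(sentence, boundary_index):
    """Return `sentence` with the space at the given word boundary removed.

    boundary_index is 0-based: 0 removes the space between the first and
    second words. Assumes words are separated by single spaces.
    """
    words = sentence.split(" ")
    if not 0 <= boundary_index < len(words) - 1:
        raise ValueError("boundary_index out of range")
    left = " ".join(words[: boundary_index + 1])
    right = " ".join(words[boundary_index + 1:])
    return left + right


original = "the cat sat"
modified = merge_at(original, 0)  # "thecat sat"
pair = (original, modified)
```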

Metric: Pearson r between per-pair K scores across model pairs. Positive r means models agree on sentence-level robustness rankings. Negative r means they invert.
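
For reference, the agreement metric can be sketched in a few lines of standard-library Python. The inputs are two equal-length lists of per-pair K scores, one per model; the function name is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of K scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value near +1 means the two models rank sentences the same way; a value near −1 means one model finds easy exactly what the other finds hard.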

3. Main Result

Model pair	r	p
Llama-3.2-3B vs Mistral-7B	+0.613	< 0.001
Pythia-1.4B vs Llama-3.2-3B	−0.485	< 0.001
Pythia-1.4B vs Mistral-7B	−0.512	< 0.001

Llama and Mistral agree moderately. Pythia inverts both, at comparable magnitude and with statistical significance. This is not a null or weak effect: r = −0.49 to −0.51 against two independently trained model families is a systematic inversion.

4. Controls

4.1 Tokenizer Structure

Mistral uses SentencePiece-based BPE, Llama-3.2 uses a tiktoken-style byte-level BPE, and Pythia uses the GPT-NeoX byte-level BPE tokenizer. When a space is removed, different tokenizers may produce different before/after token counts, so “merge” could mean structurally different things per model.

Control: Filtered to 50 pairs where both tokenizers produced identical token counts before and after perturbation (structurally matched pairs). The anti-correlation persists: r = −0.490 in the filtered set. Tokenizer mismatch does not explain the inversion.
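
One way to read the structural-matching filter is sketched below. It assumes "identical token counts" means the two tokenizers agree with each other on the token count of both the original and the modified sentence; that reading, the function names, and the two toy tokenizers (stand-ins for the real ones) are all assumptions.

```python
def structurally_matched(pairs, tokenize_a, tokenize_b):
    """Keep (original, modified) pairs whose token counts agree across tokenizers."""
    kept = []
    for original, modified in pairs:
        same_before = len(tokenize_a(original)) == len(tokenize_b(original))
        same_after = len(tokenize_a(modified)) == len(tokenize_b(modified))
        if same_before and same_after:
            kept.append((original, modified))
    return kept


def whitespace_tokens(text):
    """Toy tokenizer: one token per whitespace-separated word."""
    return text.split()


def chunked_tokens(text, max_len=5):
    """Toy tokenizer: words longer than max_len are split into chunks."""
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
    return tokens


pairs = [("the cat sat", "thecat sat"), ("a b c", "ab c")]
matched = structurally_matched(pairs, whitespace_tokens, chunked_tokens)
```

In the toy data, the first pair is dropped because the merged form "thecat" splits into two chunks under one tokenizer but stays a single token under the other.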

4.2 Token Frequency Rank

If Pythia’s vocabulary assigns different frequency weights to merged forms, its confidence might shift in the opposite direction from Llama/Mistral. We computed normalized rank divergence between tokenizers for each sentence pair and tested correlation with K inversion.

Result: All r < 0.14. Token frequency rank divergence does not predict the inversion.

4.3 Model Size Confound

Pythia-1.4B is smaller than both Llama-3.2-3B and Mistral-7B. The inversion might simply reflect noisier behavior at lower parameter count, that is, random disagreement rather than systematic inversion.

Control: Pythia scaling sweep (Section 5). If the inversion were noise, it would decrease monotonically with scale. It does not.

5. Scaling Analysis

We ran three Pythia sizes (1.4B, 2.8B, 6.9B) on the same 60 pairs.

Model	r(K_pythia, K_llama)
Pythia-1.4B	−0.485
Pythia-2.8B	−0.502
Pythia-6.9B	+0.288

The anti-correlation strengthens slightly from 1.4B to 2.8B (−0.485 to −0.502) before reversing sign between 2.8B and 6.9B. This shape argues against a simple noise explanation: noise would fade monotonically with scale, not hold or grow before flipping. With only three points, the 2.8B → 6.9B reversal looks like a qualitative threshold rather than a smooth convergence.

6. Causal Patching

To localize where the whitespace perturbation signal lives in Llama-3.2-3B, we ran a causal patching experiment: at each layer, transplant the hidden state from the modified sentence’s forward pass into the original sentence’s forward pass and measure the change in K.

Result: Peak recovery at layers 0–3 (+0.261 change in K versus −0.045 for a random patch from an unrelated position). When the patch is applied at layers 10–20, the effect mostly washes out.
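
The patching procedure can be illustrated on a toy model. The sketch below is not the Llama experiment: it uses a made-up stack of scalar "layers" with causal averaging, equal-length inputs, and the final-position output as a stand-in for K. All names and weights are hypothetical. It shows only the mechanics of transplanting one position's state at one layer from the modified run into the original run.

```python
import math


def run_layers(inputs, weights, patch=None):
    """Run a toy causal model and return (final_output, cache).

    inputs  : list of floats, one scalar hidden state per position
    weights : list of (w_self, w_ctx) tuples, one per layer
    patch   : optional (layer, position, value); overwrites the state
              entering `layer` at `position` before that layer runs
    """
    h = list(inputs)
    cache = []
    for layer, (w_self, w_ctx) in enumerate(weights):
        if patch is not None and patch[0] == layer:
            h[patch[1]] = patch[2]
        cache.append(list(h))  # state entering this layer (after any patch)
        new_h = []
        for i in range(len(h)):
            ctx = sum(h[: i + 1]) / (i + 1)  # causal mean over positions <= i
            new_h.append(math.tanh(w_self * h[i] + w_ctx * ctx))
        h = new_h
    return h[-1], cache


def patch_effect(original, modified, weights, layer, position):
    """Change in the final output when one state from the modified run
    is transplanted into the original run at (layer, position)."""
    base_out, _ = run_layers(original, weights)
    _, donor_cache = run_layers(modified, weights)
    donor_value = donor_cache[layer][position]
    patched_out, _ = run_layers(original, weights, patch=(layer, position, donor_value))
    return patched_out - base_out


weights = [(0.8, 0.5), (1.1, -0.4), (0.9, 0.3)]  # arbitrary toy weights
original = [0.1, 0.5, -0.3]
modified = [0.1, 0.9, -0.3]                      # differs at position 1 only
effects = [patch_effect(original, modified, weights, l, 1) for l in range(3)]
```

Sweeping the layer index as in the last line gives a per-layer effect profile, which is the toy analogue of the layer-wise recovery curve reported above.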

The whitespace perturbation signal is localized to the early residual stream in Llama. This is consistent with the BOS-anchoring circuit finding (sharpening_circuit_paper): the load-bearing routing heads for sharpening are concentrated in layers 0–3.

Whether the same localization holds for Pythia is untested. If Pythia’s early residual stream handles boundary-sensitive information differently — due to GPT-NeoX attention structure or different training data statistics — that could explain the inversion. This is speculative.

7. What Remains Unexplained

The Pythia inversion cannot be cleanly attributed to any single factor. The three candidate explanations are entangled:

Training corpus (H1): Pythia is trained on The Pile, which has higher proportions of code, academic text, and structured web data relative to the largely web-crawled corpora used for Llama and Mistral. Compound words, domain-specific merged forms, and whitespace norms differ between these distributions. At sufficient scale (≥6.9B), Pythia may develop enough capacity to override The Pile’s composition and converge on the same bigram statistics that dominate in Llama/Mistral.

Architecture (H2): GPT-NeoX differs from LLaMA-style blocks in its attention projection structure and in how rotary positional embeddings are applied (the NeoX variant). If the BOS-anchoring mechanism (layers 0–3) is organized differently under GPT-NeoX, it could produce systematically different whitespace robustness behavior.

Scale × corpus interaction (H3): The sign reversal at 2.8B → 6.9B might reflect a scale threshold specific to The Pile’s composition, a point at which Pythia’s capacity exceeds the corpus’s ability to enforce The-Pile-specific priors. This would explain why the inversion strengthens before it flips rather than fading monotonically.

Disentangling H1 from H2 requires a LLaMA-architecture model trained on The Pile, or a GPT-NeoX model trained on web text. Neither was available for this study.

8. Limitations

With n = 60 sentence pairs, the sample is small for a correlation study. The reported r values should be treated as point estimates with wide uncertainty.
The Pythia scaling sweep covers three points. The transition between 2.8B and 6.9B is unresolved — the reversal could occur at 3B, 5B, or anywhere in between.
Causal patching was run on Llama only. Whether the early-layer localization holds for Pythia is untested.
All sentences are from WikiText-103 (English Wikipedia, declarative prose). Whether the inversion is specific to this corpus or generalizes across text types is untested.
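
To make the sample-size caveat concrete, the sketch below computes an approximate 95% confidence interval for a Pearson r using the standard Fisher z-transformation. This interval was not part of the original analysis. It assumes roughly bivariate-normal K scores and independent sentence pairs, and the function name is illustrative.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for Pearson r via Fisher's z-transform.

    z_crit=1.96 corresponds to a two-sided 95% interval under normality.
    """
    if n <= 3 or not -1.0 < r < 1.0:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r)               # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)     # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


low, high = fisher_ci(-0.485, 60)
```

For r = −0.485 at n = 60 this gives an interval of roughly −0.66 to −0.26: clearly negative, but too wide to distinguish the 1.4B and 2.8B estimates from each other.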

9. Conclusion

Pythia-1.4B systematically inverts whitespace robustness rankings relative to Llama-3.2-3B and Mistral-7B. The inversion is robust to tokenizer and frequency controls and strengthens with scale before reversing at 6.9B. The cause is not identified. The scaling shape — inversion strengthening then flipping — is inconsistent with a noise explanation and suggests a qualitative threshold rather than a gradual convergence.

This anomaly is reported without resolution. It is a constraint any mechanistic account of whitespace robustness must eventually explain.

Appendix: Experiment Summary

Experiment	Pairs	Result
Cross-model K correlation (all 60)	60	Pythia inverts Llama/Mistral
Tokenizer-controlled subset	50	Inversion persists (r = −0.490)
Token frequency rank control	60	r < 0.14, null
Pythia 1.4B / 2.8B / 6.9B scaling	60	Sign reversal between 2.8B and 6.9B
Causal patching, Llama layers 0–27	60	Peak recovery at layers 0–3