One Model Disagreed With Everyone. Then Got More Disagreeable.

Pythia-1.4B produces whitespace robustness rankings that are anti-correlated with Llama and Mistral. The inversion gets stronger at 2.8B before flipping at 6.9B. Nobody has a clean explanation.

Two models look at the same sentence. You remove a space. One gets more confident, absorbing the change cleanly. The other gets disrupted. Swap the sentence and it reverses: the one that struggled sails through, the one that sailed through struggles. That’s not noise. The Pearson correlation between Pythia-1.4B and Llama-3.2-3B on per-sentence whitespace robustness is r = −0.485. Negative. The sentences one model finds easy are systematically the ones the other finds hard.

−0.49 Pythia vs Llama correlation
+0.61 Llama vs Mistral correlation
2.8B where inversion peaks before flipping

We ran the same 60 merge-operator pairs through Llama-3.2-3B, Pythia-1.4B, and Mistral-7B. Llama and Mistral agreed with each other (r = +0.613) — not perfectly, but enough to suggest they’re picking up something real about the sentences. Pythia agreed with neither and inverted both.
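
For readers who want to see what the agreement numbers are measuring, here is a minimal sketch of the pairwise comparison. It assumes each model has already been reduced to one robustness score per pair, listed in the same pair order; how that score is computed is not shown here. The names pearson, pairwise_correlations, and the toy scores are illustrative, not taken from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def pairwise_correlations(scores):
    """scores maps model name -> per-pair robustness scores, same pair order."""
    names = sorted(scores)
    return {
        (a, b): pearson(scores[a], scores[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

# Made-up scores for five pairs, purely to exercise the code.
# These are NOT the measurements reported in the post.
toy_scores = {
    "model_a": [0.9, 0.1, 0.5, 0.7, 0.2],
    "model_b": [0.8, 0.2, 0.6, 0.6, 0.1],
    "model_c": [0.1, 0.9, 0.4, 0.3, 0.8],
}
toy_r = pairwise_correlations(toy_scores)
```

In the toy data, model_a and model_b rank the pairs similarly and come out positively correlated, while model_c ranks them roughly in reverse and comes out negative against both. That is the shape of the Llama, Mistral, and Pythia result, not its values.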

The first thing you check is tokenizer mismatch. Mistral uses SentencePiece and Llama-3.2 uses a tiktoken-style BPE vocabulary, while Pythia uses GPT-NeoX’s tokenizer. Removing a space can mean structurally different things depending on how many tokens change, so we filtered to the 50 pairs where the tokenizers being compared produced identical token counts before and after: same operation, matched structure. The anti-correlation held at r = −0.490. Next guess: token frequency. We computed normalized rank divergence between tokenizers for each pair and looked for a correlation with the inversion. Every r came in below 0.14. It predicts nothing.
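
The token-count filter can be sketched in a few lines. This is one reading of the filter, assumed rather than confirmed: a pair is kept when the two tokenizers agree on the token count for the original text and again for the modified text. The two tokenizer functions below are hypothetical stand-ins so the example runs without downloading anything; in practice you would pass in the real tokenizers' encode functions.

```python
import re

def matched_pairs(pairs, tokenize_a, tokenize_b):
    """Keep (before, after) pairs on which both tokenizers agree on token counts."""
    kept = []
    for before, after in pairs:
        same_before = len(tokenize_a(before)) == len(tokenize_b(before))
        same_after = len(tokenize_a(after)) == len(tokenize_b(after))
        if same_before and same_after:
            kept.append((before, after))
    return kept

# Hypothetical stand-in tokenizers, NOT the real Llama or Pythia tokenizers.
def whitespace_tokens(text):
    return text.split()

def word_punct_tokens(text):
    # Words and individual punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

toy_pairs = [
    ("ice cream is good", "icecream is good"),  # both tokenizers: 4 then 3 tokens
    ("wait, what now", "wait,what now"),        # tokenizers disagree on the comma
]
toy_kept = matched_pairs(toy_pairs, whitespace_tokens, word_punct_tokens)
```

Only the first toy pair survives, because the second is split differently around the comma. The correlation is then recomputed on the surviving pairs alone.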

What’s left is training data, architecture, or scale: three things so tangled together here that they can’t be cleanly separated. Pythia is simultaneously smaller, trained on The Pile instead of web text, and built on GPT-NeoX instead of LLaMA-style attention. Any of those could explain it. None of them cleanly does.

Then we ran the full Pythia scaling suite — same 60 pairs across three model sizes — and the result was not a smooth convergence where scale gradually corrects the disagreement. Pythia-1.4B: r = −0.485, inverted. Pythia-2.8B: r = −0.502, more inverted. Pythia-6.9B: r = +0.288, flipped. The inversion strengthens before it reverses. Something changes qualitatively between 2.8B and 6.9B — a threshold, not a ramp. Without a LLaMA-style model trained on The Pile, there’s no clean way to tell the training-data story from the architecture story.

One thing that did work: transplanting the hidden state from the modified forward pass into the original at layers 0–3 produced a real recovery effect (+0.261 versus −0.045 for a random patch). The whitespace perturbation signal lives in the early residual stream. Late layers don’t carry it. That’s a foothold, not an explanation.
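
The bookkeeping behind a patch like this is simple even though the real experiment is not. Below is a toy sketch, not the setup used for the numbers above: the "model" is a stack of made-up layers over a list of scalars, both inputs are assumed to have the same length, and the recovery metric (the fraction of the modified run's output shift that the patched run reproduces) is an assumption, since the exact definition is not given here. With a real transformer the same overwrite would typically be done with forward hooks.

```python
import math

def make_layer(weight):
    """Hypothetical toy layer: each position is nudged by itself plus the sequence mean."""
    def layer(h):
        mean = sum(h) / len(h)
        return [x + weight * math.tanh(mean + x) for x in h]
    return layer

def forward(layers, h, donor=None, patch_layers=(), positions=()):
    """Run the stack and cache every layer's output.

    If donor (a cache from another run) is given, overwrite the chosen
    positions with the donor's values right after each layer in patch_layers.
    """
    cache = []
    for i, layer in enumerate(layers):
        h = layer(h)
        if donor is not None and i in patch_layers:
            h = [donor[i][p] if p in positions else x for p, x in enumerate(h)]
        cache.append(h)
    return h[-1], cache  # read the output off the last position

def recovery(layers, original, modified, patch_layers, positions):
    """Assumed metric: share of the modified run's output shift reproduced by the patch."""
    y_orig, _ = forward(layers, original)
    y_mod, mod_cache = forward(layers, modified)
    y_patch, _ = forward(layers, original, mod_cache, patch_layers, positions)
    return (y_patch - y_orig) / (y_mod - y_orig)

toy_layers = [make_layer(0.3) for _ in range(6)]
toy_original = [0.2, -0.1, 0.4, 0.3]
toy_modified = [0.2, 0.5, 0.4, 0.3]  # position 1 stands in for the edited token

early = recovery(toy_layers, toy_original, toy_modified, range(0, 4), {1})
nothing = recovery(toy_layers, toy_original, toy_modified, (), {1})
everything = recovery(toy_layers, toy_original, toy_modified, {5}, {0, 1, 2, 3})
```

The two sanity checks matter more than the toy number: patching nothing gives a recovery of 0, and overwriting every position at the final layer gives exactly 1. A random-patch baseline, like the one quoted above, would swap in a donor cache from an unrelated input. The toy does not reproduce the early-versus-late layer difference; it only shows the mechanics.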

The previous post in this series found the head responsible for sharpening. This one found a model that does the opposite, and we can’t yet explain why. Both results are solid. They just don’t fit the same story.

Full paper →