Removing a Space Made the Model More Confident. We Found the Head Responsible.

Remove the space between two words, and in 34% of cases the model becomes more confident about what comes next. That’s backwards. We went looking for why.

34% merge ops that sharpen

−0.78 recovery when one head’s attention is rerouted

100% of the time that head stares at position zero

Fuse “time table” into “timetable” and the model locks in tighter on its next prediction. The effect is measurable and consistent across 180 sentence pairs. Intuition says this shouldn’t happen. The data doesn’t care.
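The post doesn’t define the sharpening score, so the sketch below assumes one: the drop in Shannon entropy of the next-token distribution when the spaced form is replaced by the merged form. The helper names (`next_token_entropy`, `sharpening`, `share_that_sharpen`) and the toy logits are hypothetical. Real logits would come from the model at the position right after the word pair.

```python
import numpy as np


def next_token_entropy(logits: np.ndarray) -> float:
    """Shannon entropy (nats) of the softmax distribution over the vocabulary."""
    shifted = logits - logits.max()            # subtract the max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    nonzero = probs[probs > 0]                 # treat 0 * log(0) as 0
    return float(-(nonzero * np.log(nonzero)).sum())


def sharpening(logits_spaced: np.ndarray, logits_merged: np.ndarray) -> float:
    """Hypothetical score: entropy before the merge minus entropy after it.

    Positive means the merged form ("timetable") yields a more concentrated
    next-token distribution than the spaced form ("time table").
    """
    return next_token_entropy(logits_spaced) - next_token_entropy(logits_merged)


def share_that_sharpen(scores: np.ndarray) -> float:
    """Fraction of sentence pairs whose hypothetical sharpening score is positive."""
    return float((scores > 0).mean())


# Toy next-token logits over a 4-token vocabulary, invented for illustration.
spaced = np.array([1.0, 1.0, 1.0, 1.0])        # uniform: maximally uncertain
merged = np.array([4.0, 1.0, 0.5, 0.0])        # one continuation dominates
print(sharpening(spaced, merged) > 0)          # True
```

Under this assumed metric, a headline figure like 34% would be `share_that_sharpen` applied to the per-pair scores.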

We found the head responsible. It has never looked at a word boundary in its life.

The working theory was that some attention head notices where words fuse and uses that to clean up the ambiguity. A reasonable theory, and wrong. Layer 0, head 22 is a BOS sink. BOS (beginning-of-sequence) is the opening token every prompt starts with, the model’s equivalent of clearing its throat before speaking. Head 22 stares at it 100% of the time: every sentence, every prompt, regardless of what you’re asking. Its attention to the word actually being merged is effectively zero, and the correlation between that attention and sharpening magnitude is r = 0.091. Statistically, nothing.
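Flagging a BOS sink comes down to measuring how much of a head’s attention mass lands on position 0, averaged over query positions. The sketch below assumes attention weights have already been extracted as a NumPy array of shape (heads, queries, keys) for one prompt; that layout, the 0.9 cutoff, and the helper names are hypothetical. It also shows the kind of check behind a figure like r = 0.091: a Pearson correlation between a head’s attention to the merged token and the sharpening score, here on invented numbers.

```python
import numpy as np


def bos_attention_share(attn: np.ndarray) -> np.ndarray:
    """Average attention mass each head puts on position 0.

    attn has shape (n_heads, n_queries, n_keys), each row summing to 1.
    Query 0 is skipped: under a causal mask it can only attend to itself,
    so it would count as "attending to BOS" for every head.
    """
    return attn[:, 1:, 0].mean(axis=1)


def find_bos_sinks(attn: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Heads whose average attention to position 0 exceeds a hypothetical cutoff."""
    return [h for h, share in enumerate(bos_attention_share(attn)) if share > threshold]


# Toy causal attention for two heads over a 3-token prompt.
sink_head = np.array([[1.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0]])
spread_head = np.array([[1.0, 0.0, 0.0],
                        [0.5, 0.5, 0.0],
                        [0.2, 0.3, 0.5]])
attn = np.stack([sink_head, spread_head])
print(find_bos_sinks(attn))                    # [0]

# Does attention to the merged token track sharpening? Invented values only.
attn_to_merged = np.array([0.001, 0.000, 0.002, 0.001, 0.003])
sharpening_scores = np.array([0.30, -0.10, 0.05, 0.20, 0.12])
r = float(np.corrcoef(attn_to_merged, sharpening_scores)[0, 1])
print(round(r, 3))
```

In practice the share would be averaged over many prompts before thresholding, since a single short prompt says little about a head’s habits.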

But when you surgically force that head to attend evenly to all positions, so its output becomes an average over the context instead of coming from position zero alone, sharpening doesn’t just stop; it inverts. Recovery drops from 0.994 to −0.780, and the model starts spreading probability where it was concentrating it. The head that controls this has never looked at the thing it’s controlling. Because it sits in layer 0 and attends only to BOS, whose layer-0 representation is the same in every prompt, it injects the same constant signal into every forward pass. The sharpening signal doesn’t live in what that anchor says. It lives in how the rest of the model reacts when that fixed point is there.
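Why a BOS-only head emits a constant is easy to see in isolation. A head’s output at each query is its attention row times the value vectors, so a row that puts all its weight on position 0 returns the position-0 value vector no matter what follows. The sketch contrasts that with the intervention above, read here as a uniform attention pattern over every position a query may see under causal masking. Random value vectors stand in for real activations. The `recovery` helper assumes the score is sharpening with the intervention divided by sharpening without it; that definition is an assumption, not one stated in the post.

```python
import numpy as np


def head_output(pattern: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Attention-weighted sum of value vectors: (n_queries, n_keys) @ (n_keys, d_head)."""
    return pattern @ values


def bos_sink_pattern(n: int) -> np.ndarray:
    """Every query puts all of its attention on position 0."""
    pattern = np.zeros((n, n))
    pattern[:, 0] = 1.0
    return pattern


def uniform_causal_pattern(n: int) -> np.ndarray:
    """Query i spreads attention evenly over positions 0..i (causal mask respected)."""
    mask = np.tril(np.ones((n, n)))
    return mask / mask.sum(axis=1, keepdims=True)


def recovery(sharpening_patched: float, sharpening_clean: float) -> float:
    """Hypothetical recovery score: fraction of the clean effect that survives.

    1.0 = unchanged, 0.0 = abolished, negative = the effect reverses sign.
    """
    return sharpening_patched / sharpening_clean


rng = np.random.default_rng(0)
n_tokens, d_head = 5, 4
values = rng.normal(size=(n_tokens, d_head))   # stand-in value vectors

sink_out = head_output(bos_sink_pattern(n_tokens), values)
uniform_out = head_output(uniform_causal_pattern(n_tokens), values)

# With the sink pattern every row equals values[0]: a constant, context-free signal.
print(np.allclose(sink_out, values[0]))        # True
# With the uniform pattern each row is a running mean, so it changes with context.
print(np.allclose(uniform_out, values[0]))     # False
```

Under this assumed definition, a negative recovery simply means the intervened model sharpens in the opposite direction from the clean one.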

Steering vectors produced nothing. The effect doesn’t live in a clean geometric direction; it’s fused with context in a way that averaging destroys. Predicting sharpening from tokenizer features gave R² = −0.41 on held-out data, worse than always predicting the mean: classic overfitting. The regressor memorized training fingerprints and fell apart on anything new.
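A held-out R² below zero has a concrete meaning: the regressor’s squared error exceeds what you would get by always predicting the held-out mean. The hand-made predictions below show the three landmarks of the scale using the standard definition (the same formula scikit-learn’s `r2_score` uses). None of the numbers come from the post’s data.

```python
import numpy as np


def r_squared(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot around y_true's mean."""
    ss_res = float(((y_true - y_pred) ** 2).sum())
    ss_tot = float(((y_true - y_true.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


held_out = np.array([1.0, 2.0, 3.0])

print(r_squared(held_out, held_out))                        # 1.0   perfect predictions
print(r_squared(held_out, np.full(3, held_out.mean())))     # 0.0   always predict the mean
print(r_squared(held_out, np.array([3.0, 2.0, 1.0])))       # -3.0  worse than the mean
```

Because SS_tot is computed on the held-out set itself, R² is not bounded below; an overfit model can land anywhere under zero.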

One model goes the opposite direction. Pythia-1.4B produces sharpening rankings that are anti-correlated with Llama and Mistral on the same sentences (r = −0.485 to −0.512). What Llama finds easy to absorb, Pythia finds disruptive. That inversion actually strengthens at 2.8B, then flips somewhere between 2.8B and 6.9B. That looks more like a phase transition than a gradual ramp, and we don’t have a clean explanation for it yet.
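Comparing sharpening rankings across models reduces to a rank correlation over the same sentence pairs. The post reports r without naming the estimator, so the sketch below assumes Spearman’s rho (Pearson correlation of the ranks), written with NumPy only and without tie handling; `scipy.stats.spearmanr` is the ready-made alternative. The per-sentence scores are invented.

```python
import numpy as np


def ranks(x: np.ndarray) -> np.ndarray:
    """0-based ranks of x (assumes no ties; scipy.stats.rankdata handles ties properly)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r


def spearman_rho(a: np.ndarray, b: np.ndarray) -> float:
    """Spearman's rho: Pearson correlation computed on ranks."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])


# Hypothetical sharpening scores for the same five sentence pairs in two models.
model_a = np.array([0.40, 0.10, -0.20, 0.30, 0.05])
model_b = np.array([-0.10, 0.35, 0.05, -0.25, 0.20])
print(spearman_rho(model_a, model_b))          # approximately -0.5 for these invented scores
```

Ranking first makes the comparison insensitive to each model’s overall scale, which matters when two models sharpen by very different absolute amounts.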

Full paper →