We built a classifier to detect AI-generated code. 65 stylometric features, half a million samples, 34 different AI models. We were briefly very proud of ourselves.
Then we ran the same model, with the same features, on a dataset where the “human” code was written by actual experts: the best Python anyone’s bothered to write, carefully reviewed, canonical. It fell apart entirely.
The detector couldn’t tell expert human code from AI. Because AI was trained on expert humans.
Expert human code landed inside the AI distribution on every feature that mattered. The model had learned to detect polish. Clean code looks like AI now because AI learned to write clean code from the same people who defined what clean code looks like. AI trains on the best humans, detectors train on the best humans to catch AI, and at some point the signal just goes away.
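To make "landed inside the AI distribution" concrete, here is a minimal sketch of a per-feature separability check. It computes AUROC as the fraction of (AI, human) pairs where the AI sample has the higher feature value, with ties counting half. The feature values are made-up toy numbers for a single hypothetical "polish" feature, not data from our experiments.

```python
def auroc(pos, neg):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. 1.0 means perfect separation,
    0.5 means the two groups are indistinguishable on this score.
    """
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy values of one hypothetical "polish" feature (illustrative only).
ai_code = [0.80, 0.90, 0.70, 0.85]
messy_human = [0.20, 0.30, 0.10, 0.40]
expert_human = [0.75, 0.90, 0.80, 0.70]

easy = auroc(ai_code, messy_human)   # groups do not overlap at all
hard = auroc(ai_code, expert_human)  # groups overlap heavily

print(f"AI vs messy human:  {easy:.3f}")
print(f"AI vs expert human: {hard:.3f}")
```

When the human group is messy, the feature separates the two cleanly. When the human group is expert code, the same feature lands close to coin-flip territory, which is the failure described above.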
What actually fixes it is more types of humans: messy GitHub code, competitive programmers, and expert functions all at once. Give the detector a wide enough view of how people actually write and it starts working again. The features were fine the whole time. The assumption about what “human” meant was the problem.
What we built instead is Human Baseline Distance — a score for how far a piece of code sits from a broad pool of human writing, with a confidence rating attached so you know whether the number means anything in your specific context. On competitive programming code it holds up (AUROC 0.975). On expert canonical functions it basically shrugs (AUROC 0.540). At least it knows what it doesn’t know.
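As a rough illustration of the idea only, here is a toy distance-from-baseline score. Everything in it is an assumption made for the sketch: the two features, the mean absolute z-score as the distance, and the confidence thresholds are all hypothetical, and the actual formulation is the one in the paper, not this one.

```python
from statistics import mean, pstdev


def fit_baseline(human_rows):
    """Per-feature (mean, std) over a pooled set of human samples."""
    columns = list(zip(*human_rows))
    # Fall back to 1.0 when a feature is constant, to avoid dividing by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]


def baseline_distance(row, baseline):
    """Mean absolute z-score of one sample against the human baseline.

    Assumption: a simple stand-in for whatever distance the real score uses.
    """
    return mean(abs(x - mu) / sd for x, (mu, sd) in zip(row, baseline))


def confidence_label(context_auroc, high=0.9, low=0.65):
    """Map a context's measured AUROC to a coarse label.

    The thresholds are hypothetical, chosen only for illustration.
    """
    if context_auroc >= high:
        return "high"
    if context_auroc >= low:
        return "medium"
    return "low"


# Toy pool mixing three kinds of human code. The features are hypothetical:
# [comment density, mean line length].
human_pool = [
    [0.05, 40.0], [0.02, 55.0],  # messy GitHub code
    [0.00, 30.0], [0.01, 35.0],  # competitive programming
    [0.20, 60.0], [0.25, 62.0],  # expert functions
]
baseline = fit_baseline(human_pool)

typical = baseline_distance([0.22, 61.0], baseline)
outlier = baseline_distance([0.90, 200.0], baseline)

print(f"typical sample distance: {typical:.2f}")
print(f"outlier sample distance: {outlier:.2f}")
print("competitive programming:", confidence_label(0.975))
print("expert canonical functions:", confidence_label(0.540))
```

The point of pairing the two outputs is that the distance alone is not enough: the same number is trustworthy in a context where the score separates well and close to meaningless where it does not.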
Two weeks of chasing perplexity scoring are also documented in the paper. Complete dead end. You’re welcome.