The Human Baseline Problem: Why AI Code Detectors Fail Across Domains

Abstract

Stylometric classifiers for AI-generated code detection achieve 0.92–0.99 F1 within a single corpus but collapse to near or below random chance cross-corpus. We investigate this failure across three paired human/AI corpora: SemEval-2026 Task 13 (500K samples, 34 AI models), a novel paired Codeforces corpus (96 competitive programming problems, pre-2022 human solutions vs. deepseek-coder:6.7b), and HumanEval (164 problems, OpenAI canonical solutions vs. deepseek-coder:6.7b). Feature distribution analysis reveals the mechanism: AI distributions are stable across corpora; human distributions are not. The top-10 discriminative features uniformly show HumanEval's expert-written human code sitting inside SemEval's AI distribution. Training on a union of all three corpora recovers 0.95 F1, confirming the problem is human-population homogeneity, not feature inadequacy. We provide a continuous Human Baseline Distance (HBD) diagnostic score with AUROC ranging from 0.975 (competitive programming context) to 0.540 (expert-canonical context).

1. Introduction

The rise of LLM-generated code has prompted a parallel effort in detection: can a classifier reliably identify AI-authored source code from its surface style? Several papers report detection accuracies above 90% on labeled benchmarks, and multiple commercial tools claim reliable AI code identification.

We revisit this claim with a simple methodological constraint: if a detector is measuring AI authorship and not corpus-specific artifacts, it must generalize across independently collected corpora. We apply this constraint and find that the detectors we test fail it in every train/test pairing, with cross-corpus F1 between 0.33 and 0.61.

Our contributions are:

A three-corpus cross-domain evaluation demonstrating that stylometric AI code detectors fail out-of-distribution across all training/test corpus pairs.
A mechanistic explanation: a feature distribution analysis showing that the top discriminative features place expert-quality human code inside the AI region of detectors trained on scraped human code. The failure is not random — it is specifically human-baseline drift.
Evidence that the fix is corpus diversity: a union-trained classifier achieves 0.95 overall F1 (0.78 to 0.96 on the individual corpus slices), showing the features can generalize given heterogeneous human representation.
A Human Baseline Distance (HBD) diagnostic — a continuous Mahalanobis score from a diverse human baseline on 9 population-stable features — as a more honest deployable artifact than a binary classifier.
A 35-way model fingerprinting result showing that within-corpus, individual AI models have distinguishable stylometric signatures (~8x above chance), with family structure matching architectural lineage.

1.1 Relationship to Existing Work

Prior work on machine-generated text detection has documented cross-domain brittleness in natural language, typically attributed to distribution shift. Our contribution is a structural account of why the shift direction is asymmetric in code: AI output is more stylistically uniform across contexts than human output, so detectors trained on any specific human population learn the wrong axis.

The SemEval-2026 Task 13 shared task motivates our primary corpus. Prior competitive entries that report high accuracy on the shared task’s test set have not, to our knowledge, been evaluated against independently collected human baselines.

2. Datasets

2.1 SemEval-2026 Task 13 Corpus

499,999 labeled Python/C++/Java samples from DaniilOr/SemEval-2026-Task13 (subtask A), comprising 238,474 human and 261,525 AI samples across 34 AI generators (GPT-4o, CodeLlama variants, Qwen-Coder, deepseek-coder, StarCoder2, Phi-3, Llama-3, CodeGemma, Granite, and others). All experiments use the Python subset (91% of samples, 457,306 rows). Human samples are scraped from public code repositories, era unspecified.

2.2 Paired Codeforces Corpus

We constructed a novel paired corpus addressing a key confound in existing datasets: human and AI samples solve different problems, so style differences may reflect problem structure rather than authorship.

For each of 96 Codeforces problems (contest ratings ≤ 1500, difficulty 800–1400, tag: implementation, pre-2021 era):

Human solution: A Python file from a public GitHub repository with per-file last commit date before 2022-01-01. Provenance is verified at file granularity (not repo push date) via GET /repos/{owner}/{repo}/commits?path={file}&per_page=1. This excludes any solution that could have been produced with LLM assistance.
AI solution: Generated by deepseek-coder:6.7b via Ollama with the problem statement as prompt.

The resulting corpus has 373 solutions (273 human, 100 AI). Any feature difference between classes is attributable to authorship, not problem structure or era.
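The file-level provenance check described above can be sketched as follows. This is a minimal illustration, not the collection script itself: the helper names are hypothetical, the use of the committer date (rather than the author date) is an assumption, and the HTTP request is left out so that the logic can be exercised on a saved response.

```python
from datetime import datetime, timezone
from urllib.parse import quote, urlencode

API_ROOT = "https://api.github.com"
CUTOFF = datetime(2022, 1, 1, tzinfo=timezone.utc)  # cutoff stated in the text


def commits_url(owner: str, repo: str, path: str) -> str:
    """Build the per-file commits query; per_page=1 keeps only the newest commit."""
    query = urlencode({"path": path, "per_page": 1})
    return f"{API_ROOT}/repos/{quote(owner)}/{quote(repo)}/commits?{query}"


def last_commit_date(commits: list) -> datetime:
    """Read the date of the first (most recent) commit in the JSON response."""
    # Assumption: the committer date is the one that counts as "last commit date".
    stamp = commits[0]["commit"]["committer"]["date"]  # e.g. "2020-05-17T09:30:00Z"
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def is_pre_cutoff(commits: list, cutoff: datetime = CUTOFF) -> bool:
    """True when the file's latest commit predates the cutoff; no history is rejected."""
    return bool(commits) and last_commit_date(commits) < cutoff


# Hypothetical saved responses, shaped like the GitHub "list commits" payload.
old_file = [{"commit": {"committer": {"date": "2020-05-17T09:30:00Z"}}}]
new_file = [{"commit": {"committer": {"date": "2023-02-01T12:00:00Z"}}}]
```

Checking the date per file rather than per repository matters because a repository created in 2019 can still receive new solution files years later.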

2.3 HumanEval Corpus

164 Python programming problems from openai/openai_humaneval, with canonical human-written function implementations as the human class and deepseek-coder:6.7b completions as the AI class (328 total, perfectly balanced).

HumanEval human solutions are qualitatively distinct from the other two human populations: they are expert-written, heavily reviewed, idiomatic functions — the canonical definition of correct Python style. This makes them the hardest test case for any stylometric detector trained on “typical” human code.

This corpus was held out entirely during all feature selection procedures.

3. Feature Extraction

We extract 65 features from raw Python source, combining lexical and AST-level signals.

Lexical features (37): Line count, character count, blank line ratio, line length statistics (mean, max, p90), comment line ratio, inline comment ratio, whitespace ratio, trailing whitespace ratio, digit ratio, uppercase ratio, snake_case count, camelCase count, naming style ratio (snake/(snake+camel)), type annotation density, print guard count, bare except count, uniform except count, null check count, isinstance check count, guard clause count, early return count, long identifier count, logging call count, and derived ratios thereof.

AST features (28): Function count, class count, max nesting depth, comprehension count (list/dict/set/generator), lambda count, decorator count, docstring presence ratio, type annotation count, f-string count, walrus operator count, import count, max function length (in lines), none comparison count, isinstance call count, broad except handler count, finally block count, default argument count, early return count (AST-level).

Features are extracted without model inference; no perplexity or embedding features are used. A null result for perplexity-based detection (0/8 features significant, Mann-Whitney U) ruled out continuation-scoring approaches early.
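To make the two feature families concrete, the sketch below computes a handful of the lexical and AST features named above using only the standard library. The exact definitions in features/extractor.py are not given in the text, so every definition here (distinct identifiers rather than occurrences, the regular expressions, which statements count toward nesting) is an assumption made for illustration.

```python
import ast
import re

# Assumed naming patterns: at least one underscore for snake_case,
# at least one interior capital for camelCase.
SNAKE = re.compile(r"^[a-z]+(?:_[a-z0-9]+)+$")
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")
NESTING = (ast.If, ast.For, ast.While, ast.With, ast.Try, ast.FunctionDef)


def max_depth(node: ast.AST, depth: int = 0) -> int:
    """Deepest chain of nested compound statements below `node`."""
    best = depth
    for child in ast.iter_child_nodes(node):
        step = 1 if isinstance(child, NESTING) else 0
        best = max(best, max_depth(child, depth + step))
    return best


def extract(source: str) -> dict:
    """Compute a few lexical and AST features from raw Python source."""
    lines = source.splitlines()
    n_lines = max(len(lines), 1)
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    # Distinct variable and function names (an assumption; counts could also be per use).
    names = {n.id for n in nodes if isinstance(n, ast.Name)}
    names |= {n.name for n in nodes if isinstance(n, ast.FunctionDef)}
    snake = sum(1 for name in names if SNAKE.match(name))
    camel = sum(1 for name in names if CAMEL.match(name))
    return {
        "line_count": len(lines),
        "avg_line_length": sum(len(line) for line in lines) / n_lines,
        "blank_line_ratio": sum(1 for line in lines if not line.strip()) / n_lines,
        "snake_case_count": snake,
        "camel_case_count": camel,
        "naming_style_ratio": snake / (snake + camel) if snake + camel else 0.0,
        "function_count": sum(isinstance(n, ast.FunctionDef) for n in nodes),
        "max_nesting_depth": max_depth(tree),
    }


features = extract("def add_one(x):\n    if x:\n        return x + 1\n\n    return 1\n")
```

Nothing here requires running a model, which is the property the paragraph above relies on: the features depend only on the text and its parse tree.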

4. Experiments

4.1 Within-Corpus Baselines

We establish that the signal exists in all three corpora independently before testing transfer. Table 1 reports within-corpus F1 using 5-fold stratified cross-validation for SemEval and HumanEval and leave-one-problem-out cross-validation (LOPO-CV) for Codeforces.

Table 1: Within-corpus classification F1 (Random Forest, 300 trees)

Corpus	F1 Macro	Accuracy	n
SemEval (Python subset)	0.9544	0.9544	100,000
Codeforces (LOPO-CV)	0.9931	0.9946	373
HumanEval	0.9200	0.9200	328

The Codeforces result uses leave-one-problem-out cross-validation, which holds all solutions for a given problem out during training, eliminating any leakage between paired samples. The HumanEval result uses standard 5-fold CV.
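A minimal sketch of the leave-one-problem-out split is shown below. The function name and the toy problem identifiers are illustrative; scikit-learn's LeaveOneGroupOut produces the same partitioning when the problem identifier is passed as the group label.

```python
from collections import defaultdict


def leave_one_problem_out(problem_ids):
    """Yield (held_out_problem, train_indices, test_indices) once per distinct problem."""
    by_problem = defaultdict(list)
    for index, problem in enumerate(problem_ids):
        by_problem[problem].append(index)
    for problem, test_idx in by_problem.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(problem_ids)) if i not in held_out]
        yield problem, train_idx, test_idx


# Toy example: six solutions (human and AI mixed) to three problems.
problem_ids = ["4A", "4A", "71A", "71A", "71A", "231A"]
folds = list(leave_one_problem_out(problem_ids))
```

Because every solution to the held-out problem leaves the training set together, a classifier cannot score well by memorizing problem-specific tokens shared between a human solution and its AI counterpart.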

High within-corpus F1 in all three cases confirms that stylometric signal exists. The question is whether it transfers.

4.2 Cross-Corpus Transfer

Table 2: Cross-corpus generalization (test corpus never seen during training)

Train	Test	F1 Macro	Notes
Codeforces	SemEval	0.6089	Large instruct models undetected
SemEval	Codeforces	0.4038	Below random chance
SemEval	HumanEval	0.3320	Below random chance
Codeforces	HumanEval	0.4449	Below random chance
SemEval+CF	HumanEval	0.3320	Combined training no better

Random chance for binary balanced classification = 0.50.

Only the Codeforces -> SemEval direction (0.61) is above random chance, and only modestly; every other transfer is below it. The SemEval -> Codeforces result (0.40) is the clearest: a classifier trained on 100,000 balanced samples from 34 AI generators and evaluated against a ground-truth paired corpus performs worse than flipping a coin.
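One way to read a macro F1 near 0.33 on the balanced HumanEval set (164 human, 164 AI) is to compare it with a degenerate predictor that labels every sample as AI. The worked example below computes that reference value. It is a consistency check only: the confusion matrices are not reported in the text, so whether the SemEval-trained classifier actually behaves this way is an assumption.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Per-class F1, taken as 0.0 when the class is neither predicted nor present."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, labels=("human", "ai")) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


y_true = ["human"] * 164 + ["ai"] * 164
always_ai = ["ai"] * 328

# Human F1 is 0; AI F1 is 2*164 / (2*164 + 164) = 2/3; the macro average is 1/3.
baseline = macro_f1(y_true, always_ai)
```

The always-AI reference of 1/3 sits very close to the 0.3320 reported for SemEval -> HumanEval, which is what one would expect if nearly all HumanEval human solutions were labeled AI.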

4.3 Feature Forensics: Stable vs. Conflicting Features

To understand the direction of failure, we measure per-feature directional shift between SemEval and Codeforces using Wasserstein distance and Mann-Whitney U tests.

For each feature significant in both corpora (p < 0.05), we compare the direction of the human-to-AI delta:

Conflict features (significant in both, opposite directions):

Feature	SemEval delta	CF delta	Interpretation
line_count	+17.1	-10.9	AI longer in SemEval, shorter in CF
non_blank_line_count	+12.1	-10.8	Same flip
comment_line_ratio	+	-	AI more comments in SemEval, fewer in CF
whitespace_ratio	+	-
print_guard_count	+	-
camel_case_count	+	-
inline_comment_ratio	+	-

Stable features (same direction in both corpora, both significant):

avg_line_length, max_line_length, p90_line_length, digit_ratio, snake_case_count, long_ident_count, naming_style_ratio, early_return_count, trailing_ws_ratio

Using only the 9 stable features raises cross-domain F1 on SemEval -> Codeforces from 0.40 (Table 2) to 0.62, a relative gain of about 50%. The conflict features account for a large share of the failure.
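The stable/conflict labeling can be sketched as a sign comparison gated on significance in both corpora. In the sketch below the function names are hypothetical and the p-values are placeholders; in practice each p-value would come from a Mann-Whitney U test on that feature within one corpus (for example via scipy.stats.mannwhitneyu).

```python
def mean_delta(human, ai) -> float:
    """Human-to-AI shift for one feature: mean(AI) minus mean(human)."""
    return sum(ai) / len(ai) - sum(human) / len(human)


def classify_feature(delta_a: float, p_a: float, delta_b: float, p_b: float,
                     alpha: float = 0.05) -> str:
    """Compare the sign of the shift in two corpora, given each corpus's p-value."""
    if p_a >= alpha or p_b >= alpha:
        return "not_significant_in_both"
    if delta_a * delta_b > 0:
        return "stable"
    if delta_a * delta_b < 0:
        return "conflict"
    return "undetermined"  # a zero mean shift has no direction


# line_count deltas are taken from the conflict table above; p-values are placeholders.
line_count_label = classify_feature(+17.1, 1e-6, -10.9, 1e-6)
```

A feature labeled "conflict" is actively harmful under transfer: the classifier learns a threshold whose direction is reversed in the target corpus.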

The line_count flip is interpretable: SemEval human code is scraped from diverse repositories where humans write terse scripts; AI writes longer, more complete code. In paired Codeforces, the human baseline is competitive programmers, who write verbose, explicit solutions; deepseek-coder:6.7b generates compact implementations. The classifier learned the wrong axis in both cases.

4.4 Third-Corpus Validation (HumanEval, No Leakage)

The stable feature set was identified using only SemEval and Codeforces. HumanEval was never used in feature selection. Table 3 shows the OOD test.

Table 3: HumanEval OOD validation (stable features identified on SemEval+CF only)

Train	Features	F1 (HumanEval)
SemEval	All 65	0.3320
SemEval	Stable 9	0.4298
Codeforces	All 65	0.4449
Codeforces	Stable 9	0.3909
SemEval+CF	All 65	0.3320
SemEval+CF	Stable 9	0.4043

Stable features lift SemEval transfer (+0.10) but not Codeforces transfer (-0.05). All results remain below chance. This is not a labeling error — within-HumanEval F1 is 0.92, confirming the signal exists locally.

4.5 Union Training

If the failure is human-population homogeneity, training on a diverse union of human populations should recover performance. Table 4 confirms this.

Table 4: Union training (SemEval + CF + HumanEval, 5-fold CV)

Corpus slice	F1 Macro
SemEval	0.9612
Codeforces	0.7817
HumanEval	0.8101
Overall	0.9517

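Union training with 10,614 balanced samples recovers 0.95 overall F1, although the Codeforces (0.78) and HumanEval (0.81) slices stay below their within-corpus scores in Table 1. This is the diagnostic confirmation: the features can generalize; they simply require a human baseline representative of the deployment population.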

5. The Mechanism: Human-Quality Drift

The cross-corpus failures are not random. For the top-10 discriminative features by ANOVA F-statistic on SemEval, we measure where each corpus’s human and AI distributions sit relative to each other.

Table 5: Feature distribution means across corpora

Feature	SemEval-H	SemEval-AI	HumanEval-H	HumanEval-AI	HE-H in AI zone?
blank_line_ratio	0.03	0.16	0.15	0.14	YES
naming_style_ratio	0.15	0.45	0.63	0.68	YES
char_count	442	1,027	631	677	YES
snake_case_count	1.42	5.60	4.18	4.73	YES
avg_line_length	20.5	26.4	29.4	30.8	YES
line_count	21.95	39.08	20.49	20.96	YES

All 10 top discriminative features, six of which are listed in Table 5, place the HumanEval human mean inside the SemEval-AI distribution (within 1 standard deviation of the SemEval-AI mean).
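The "in AI zone" criterion (a population mean within 1 standard deviation of the SemEval-AI mean) can be written as a short check. The sample values below are toy numbers chosen only so that their means match the avg_line_length row of Table 5; the per-corpus standard deviations are not reported in the text, so the spread here is an assumption.

```python
from statistics import mean, stdev


def in_reference_zone(values, reference, width: float = 1.0) -> bool:
    """True when mean(values) lies within `width` sample SDs of mean(reference)."""
    return abs(mean(values) - mean(reference)) <= width * stdev(reference)


# Toy avg_line_length samples (hypothetical values, means as in Table 5).
semeval_ai = [22.0, 24.0, 26.0, 28.0, 32.0]   # mean 26.4
semeval_human = [18.0, 20.0, 21.0, 23.0]      # mean 20.5
humaneval_human = [27.0, 29.0, 30.0, 31.6]    # mean 29.4

he_in_ai_zone = in_reference_zone(humaneval_human, semeval_ai)
se_in_ai_zone = in_reference_zone(semeval_human, semeval_ai)
```

With these toy values the HumanEval human mean falls inside the SemEval-AI zone while the SemEval human mean does not, which is the pattern the table's last column records.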

The SemEval-trained classifier therefore tends to label HumanEval humans as AI. Loosely speaking, it learned that an avg_line_length above about 26 indicates AI, and HumanEval humans average 29.4. It learned that a snake_case_count above about 4 indicates AI, and HumanEval humans average 4.18. On these features they fall on the AI side of the learned boundary.

This is human-quality drift. Expert-written, reviewed, idiomatic Python code is stylometrically indistinguishable from AI output on surface features, because both represent high-quality code. The detector was measuring “distance from whatever typical human code looked like in training,” not AI provenance.

The asymmetry is structural: AI models produce outputs drawn from a distribution trained to maximize code quality metrics, making AI output relatively consistent across contexts. Human code varies enormously. Any detector trained on a specific human sample inherits the location of that sample in feature space, not a representation of the full human distribution.

6. Human Baseline Distance (HBD) Scorer

Given the corpus-dependence of binary classification, we provide a continuous diagnostic: the Human Baseline Distance (HBD) score.

6.1 Design

HBD is the Mahalanobis distance of a code sample from the union human distribution on the 9 population-stable features:

avg_line_length, max_line_length, p90_line_length, digit_ratio,
snake_case_count, long_ident_count, naming_style_ratio,
early_return_count, trailing_ws_ratio

The covariance matrix is estimated with Ledoit-Wolf shrinkage on a union human baseline of 5,437 samples (5,000 SemEval + 273 CF + 164 HumanEval), standardized jointly. The condition number of the resulting covariance estimate is 1.2, indicating a well-conditioned system with no near-collinear feature pairs.
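A minimal sketch of such a scorer is given below, using NumPy only. It is not the contents of scripts/hbd_scorer.py: the class name is hypothetical, and the covariance is shrunk toward a scaled identity with a fixed intensity as a stand-in for Ledoit-Wolf, which would instead estimate that intensity from the data (for example with sklearn.covariance.LedoitWolf).

```python
import numpy as np


class HumanBaselineDistance:
    """Mahalanobis distance from a human baseline with a shrunk covariance estimate."""

    def __init__(self, shrinkage: float = 0.1):
        # Fixed intensity (assumption); Ledoit-Wolf would choose it from the data.
        self.shrinkage = shrinkage

    def fit(self, human: np.ndarray) -> "HumanBaselineDistance":
        """Standardize the human baseline and store the inverse shrunk covariance."""
        self.mean_ = human.mean(axis=0)
        self.scale_ = human.std(axis=0) + 1e-12  # guard against constant features
        z = (human - self.mean_) / self.scale_
        sample_cov = np.cov(z, rowvar=False)
        n_features = z.shape[1]
        target = np.eye(n_features) * np.trace(sample_cov) / n_features
        cov = (1 - self.shrinkage) * sample_cov + self.shrinkage * target
        self.precision_ = np.linalg.inv(cov)
        return self

    def score(self, samples: np.ndarray) -> np.ndarray:
        """Return one distance per row; larger means further from the human baseline."""
        z = (np.atleast_2d(samples) - self.mean_) / self.scale_
        return np.sqrt(np.einsum("ij,jk,ik->i", z, self.precision_, z))


# Synthetic stand-in for the 9 stable features of a human baseline.
rng = np.random.default_rng(0)
human = rng.normal(size=(500, 9))
scorer = HumanBaselineDistance().fit(human)
center_score = scorer.score(human.mean(axis=0))[0]
far_score = scorer.score(human.mean(axis=0) + 10.0)[0]
```

Only human samples are used in fit, so the score carries no notion of an AI class; it measures distance from the baseline and nothing else.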

6.2 Results

Table 6: HBD score distributions and AUROC per corpus

Corpus	HBD mean (human)	HBD mean (AI)	AUROC
SemEval	1.78	4.61	0.798
Codeforces	1.83	8.06	0.975
HumanEval	4.35	4.56	0.540

The HumanEval AUROC of 0.540 is not a failure of the scorer — it is the mechanism quantified. HumanEval humans score HBD 4.35, nearly identical to HumanEval AI at 4.56. Both are far from the “average human” in the union baseline. The scorer correctly reports that it cannot distinguish them.

The Codeforces AUROC of 0.975 reflects that competitive programmers (HBD 1.83) are close to the union human mean and clearly separated from deepseek-coder output (HBD 8.06). In a deployment context matching this profile, HBD separates the two classes almost perfectly.
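The AUROC values in Table 6 have a direct reading: the probability that a randomly chosen AI sample receives a higher HBD than a randomly chosen human sample, with ties counted as one half. The sketch below computes it by exhaustive pairwise comparison, which is quadratic in sample size and meant only to make the definition explicit; the scores are toy values.

```python
def auroc(human_scores, ai_scores) -> float:
    """Probability that a random AI score exceeds a random human score (ties = 0.5)."""
    wins = 0.0
    for a in ai_scores:
        for h in human_scores:
            if a > h:
                wins += 1.0
            elif a == h:
                wins += 0.5
    return wins / (len(ai_scores) * len(human_scores))


separated = auroc([1.0, 2.0], [3.0, 4.0])      # every AI score is higher
overlapping = auroc([1.0, 2.0], [1.0, 2.0])    # identical score distributions
```

An AUROC near 0.5, as in the HumanEval row, therefore means the two score distributions overlap almost completely.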

6.3 Interpretation

HBD should be understood as: how anomalous is this code relative to the average of the diverse human populations we have characterized? It does not say the code is AI. It says the code is unlike the humans we have measured.

Practical thresholds:

HBD > 6: Anomalous relative to the diverse human baseline — warrants human review
HBD < 2: Sits well within the human distribution
HBD 2–6: Ambiguous; interpret against the AUROC for your deployment context

The confidence of the inference scales with how well your human population is represented in the training baseline. A CF-context deployment (AUROC 0.975) is actionable. A HumanEval-context deployment (AUROC 0.540) is not.
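The three bands above map onto a small helper. The function name and band labels are illustrative; the cut points are the ones listed, with the boundary values 2 and 6 themselves falling in the ambiguous band.

```python
def hbd_band(hbd: float, low: float = 2.0, high: float = 6.0) -> str:
    """Map an HBD score to the review band described in the text."""
    if hbd > high:
        return "anomalous"      # warrants human review
    if hbd < low:
        return "within_human"   # sits well within the human distribution
    return "ambiguous"          # interpret against the context's AUROC


# Corpus means from Table 6, used here only as example inputs.
cf_ai_band = hbd_band(8.06)
cf_human_band = hbd_band(1.83)
he_human_band = hbd_band(4.35)
```

The band is a triage label rather than a verdict: it should be read together with the AUROC for the deployment context, as the paragraph above notes.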

7. Model Fingerprinting

As a secondary analysis, we investigate whether individual AI models have distinguishable stylometric fingerprints within a single corpus. We train a 35-way Random Forest (34 AI models + human class, 1,000 samples per class, 5-fold CV) on the SemEval Python subset.

Macro F1 = 0.2342 (chance = 0.0286, ~8x above chance).
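The chance level and the "~8x" figure follow from simple arithmetic, assuming balanced classes and a uniform random guesser (for which per-class precision, recall, and F1 all equal one over the number of classes):

```python
n_classes = 34 + 1            # 34 AI generators plus the human class
chance = 1 / n_classes        # expected macro F1 of a uniform random guesser
reported_macro_f1 = 0.2342    # value reported above
lift = reported_macro_f1 / chance
```

Here chance rounds to 0.0286 and the lift is a little above 8.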

Key observations:

The human class is the most distinguishable (F1 = 0.622)
Family clusters match architectural lineage: Phi-3 variants cluster together, Yi-Coder variants cluster together, Llama-family models group together
Most confused pair: starcoder2-15b and starcoder2-3b (23% confusion rate), consistent with models from the same family sharing a stylometric signature across parameter scales

This result supports the claim that AI stylometric signal is real within corpus. The cross-corpus failure is not because the signal doesn’t exist — it is because the signal is corpus-indexed.

8. Discussion

8.1 Implications for Benchmarking

Reported single-corpus AI code detection accuracies are artifacts of corpus construction as much as model capability. A corpus that scrapes human code from low-quality sources will produce high detection accuracy because the detector learns “polished vs. unpolished code,” not “AI vs. human.” HumanEval demonstrates the ceiling: when the human gold standard is expert-quality code, there is no surface signal left to exploit.

The appropriate evaluation is the one performed here: train on one independently collected corpus, evaluate on another, report cross-corpus F1 alongside within-corpus F1.

8.2 Practical Recommendations

For practitioners considering deployment of a stylometric AI code detector:

Characterize your human author population before calibrating any threshold.
Compute within-corpus AUROC on a representative sample of your known-human code together with AI-generated code for the same tasks; if it is low, the detector is not useful in your context regardless of training set size.
Prefer continuous scores (HBD or equivalent) over binary labels. The decision of “this warrants review” is context-dependent; the score provides the raw signal.
Union training across multiple human populations is necessary for a generalizing detector; no single corpus is sufficient.

8.3 Limitations

All AI solutions in the two paired corpora (Codeforces and HumanEval) use deepseek-coder:6.7b. A multi-model paired corpus may alter findings.
Competitive programming is an extreme human domain. Codeforces humans are not representative of production software engineering.
AST-level features do not capture semantic choices: algorithm selection, identifier semantics, comment content.
Formatter normalization (black, autopep8) was not tested. Formatting-sensitive features may become uninformative after normalization.
HBD calibration is specific to the union baseline described. Deployment in new contexts requires recalibration against a local human sample.

9. Conclusions

Style-based detectors learn human baselines, not AI signatures. The dominant source of variance across corpora is the human distribution, not the AI distribution.
Human-quality drift is the failure mechanism. When humans write expert-quality, polished code, they look like AI on every discriminative feature. The detector conflates code quality with AI authorship.
The signal is local and real. Within-corpus F1 is high for all three corpora (0.92–0.99). The features carry information; it just does not transfer without a representative human baseline.
Diverse training fixes it. Union training on three human populations recovers 0.95 overall F1. The practical recommendation is corpus diversity, not feature engineering.
A continuous HBD score is more honest than a binary label. Given the baseline-dependence of the signal, a distance score with explicit calibration bounds is more deployable than a classifier that silently fails when the human population shifts.
Per-model fingerprinting works in-domain (~8x above chance, 35-way), suggesting model-specific detectors within a controlled corpus may be useful in constrained contexts.
High in-domain accuracy is not evidence of generalization. The gap between 0.99 F1 (paired, in-domain) and 0.40 F1 (cross-corpus) is not a tuning problem. It is the measurement itself.

Appendix: Reproducibility

All code, data collection scripts, and feature extraction pipelines are in the SentinalAI repository.

corpus/processed/features.parquet           # 500K SemEval feature matrix
corpus/processed/paired_features.parquet    # 373-row paired CF feature matrix
corpus/processed/humaneval_features.parquet # 328-row HumanEval feature matrix
features/extractor.py                       # 65-feature extractor
scripts/train_paired_classifier.py          # LOPO-CV training
scripts/test_generalization.py              # Cross-dataset evaluation
scripts/validate_third_corpus.py            # HumanEval OOD validation
scripts/followup_analysis.py                # Feature overlap + union training
scripts/model_fingerprinting.py             # 35-way model identification
scripts/feature_forensics.py                # Distribution shift analysis
scripts/hbd_scorer.py                       # HBD scorer (fit + evaluate)
models/hbd_scorer.pkl                       # Trained HBD scorer

The SemEval corpus is available at DaniilOr/SemEval-2026-Task13 on HuggingFace. HumanEval is at openai/openai_humaneval. Paired corpus collection requires a GitHub API token (see collector/fetch_github_human_solutions.py).