We Gave AI 43,000 Tokens of Documentation and It Scored Lower Than Nothing

We were building a tool to help AI assistants understand codebases faster. The obvious approach — feed the model the raw documentation — seemed like the right starting point. Just read the README, the CONTRIBUTING guide, maybe the package manifest. Paste it all in. Let the model figure it out.

So we ran the benchmark. Across 9 real-world repos and 6 models, we asked 12 specific, factual questions per repo — things like: what HTTP router does this project use? what authentication methods does it support? what’s the entry point file?

The self-profiling approach — where the model reads 43,000 tokens of raw documentation and then answers — scored 40%. Baseline, with no context at all, scored 44%.

More context. Lower score.

Self-profiling (43,000 tokens): 40%
Baseline (no context at all): 44%
CHODE (~350 tokens): 90%

That result broke our assumptions. We spent a week figuring out why, and it led us to build something quite different from what we started with.

Why More Context Hurt

The questions were designed to be anti-parametric — they target facts the model wouldn’t know from training data alone. Which specific HTTP router does this project use? Which auth backend? Not the popular answer. The correct answer.

When models hit those questions with no context, they default to their training priors: the most common answer for a Go HTTP server (Gin or gorilla/mux), the most common auth stack for a Python app (OAuth, password). They’re wrong, but confidently wrong in a predictable way.

When models hit those questions with 43,000 tokens of raw documentation, something different happens depending on the model architecture:

Prior Overwhelming (smaller models)

Models like Mistral Large and Llama 4 Maverick read the raw context and still answer with the common case. The documentation reinforces the generic signal — it’s full of framework-standard boilerplate — drowning out the specific needle. Mistral Large, asked about the HTTP router with 43k tokens of context, answered: “Gin is more likely if the project is lightweight.” It had just read the answer. It defaulted to its training prior anyway.

Attention Dilution (larger models)

GPT-4o behaved differently. It didn’t hallucinate — it abstained. Asked about configuration format with 82,000 tokens of context that included the answer, it answered: “Not in profile.” The fact was there. The model scanned the surface, missed the buried detail, and correctly reported that it hadn’t found it.

Six out of twelve questions triggered this response. The model wasn’t wrong — it just didn’t find what it was looking for in 82k tokens of noise.

“CHODE doesn’t just save tokens; it saves the signal. Providing 43,000 tokens of raw documentation produced lower accuracy on specific technical facts (40%) than providing zero context (44%). More context was more distraction.”

We confirmed the mechanism directly. For the router question on the Gitea repo, we ran GPT-4o in logprobs mode. Baseline (no context): the model assigned 2.5% probability to the correct answer (Chi). With CHODE’s 507-token profile: 99.7%. The signal went from noise to certainty: ΔP(Chi) ≈ +97 percentage points.
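For readers who want to reproduce this kind of measurement, here is a minimal Python sketch of the arithmetic. It assumes the API reports natural-log token probabilities, which is the common convention for a logprobs option. The function names are ours for illustration, not part of CHODE or any API. With the rounded figures quoted above, the shift comes out at roughly 97 percentage points.

```python
import math


def logprob_to_percent(logprob: float) -> float:
    """Convert a natural-log token probability to a percentage."""
    return math.exp(logprob) * 100.0


def delta_pp(baseline_percent: float, treated_percent: float) -> float:
    """Change in probability, expressed in percentage points."""
    return treated_percent - baseline_percent


# Rounded figures quoted in the text for the Gitea router question.
baseline = 2.5       # P(Chi) with no context, in percent
with_profile = 99.7  # P(Chi) with the CHODE profile, in percent

shift = delta_pp(baseline, with_profile)
print(f"shift = {shift:.1f} pp")
```

Working in percentage points rather than relative change keeps the comparison honest when the baseline probability is close to zero.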

What CHODE Is

CHODE (Compressed Hierarchical Overview of Directory and Ecosystems) is a CLI that profiles any software repo in under a second, producing a ~350–550 token structured document — a fingerprint, not a summary.

git clone https://github.com/zer0contextlost/CHODE
cd CHODE && npm install && npm install -g .
chode /path/to/repo

Here’s what it produces for Gitea (6,900 files → 507 tokens in 0.2 seconds):

PROFILE

---CHODE v2 @ a3f9c12---

---DNA---
@STACK go chi mysql jwt sqlite3
@FRONTEND ts vue esbuild vite (web_src/)
@CI github-actions
@PKG pnpm gomod uv
@TEST playwright vitest make test
@CONFIG ini
@ENTRY main.go
@ROUTES chi → routers/ (447 files)
@DATA models/migrations/ (305 migrations)
@AUTH openid pam password webauthn ldap oauth
@ARCH layered(cmd→routes→svc→mdl)
@PACKAGES modules/(968) models/(649) services/(479)

---CONTEXT---
@PURPOSE easiest, fastest, most painless way of setting up self-hosted Git service…
@CONVENTIONS Always run make fmt before committing to conform to Gitea’s styleguide.
@TESTING Run make lint before submitting. Test commands: make test | make test-backend

No source code is read. No LLM is invoked. It runs entirely offline. The git commit hash in the header lets you detect profile staleness with a single command.
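Because every fact sits in a labeled field, a consumer can read the profile with a few regular expressions. The Python sketch below is an illustration based only on the sample shown here: the header, section, and field patterns are inferred from that sample, `parse_profile` is a hypothetical helper rather than part of the CHODE CLI, and the real format may have details this misses. It accepts fields that share a line as well as fields on separate lines.

```python
import re

# Patterns inferred from the sample profile; treat them as assumptions.
HEADER_RE = re.compile(r"---CHODE v(\d+) @ ([0-9a-f]{7,40})---")
SECTION_RE = re.compile(r"---([A-Z]+)---")
FIELD_RE = re.compile(r"@([A-Z_]+)\s+(.*?)(?=\s+@[A-Z_]+\s|\s*$)", re.S)


def parse_profile(text: str) -> dict:
    """Split a profile into version, commit, and per-section field maps."""
    header = HEADER_RE.search(text)
    if header is None:
        raise ValueError("missing CHODE header")
    profile = {
        "version": int(header.group(1)),
        "commit": header.group(2),
        "sections": {},
    }
    # re.split with one capture group yields [preamble, name, body, name, body, ...]
    parts = SECTION_RE.split(text[header.end():])
    for name, body in zip(parts[1::2], parts[2::2]):
        profile["sections"][name] = {
            match.group(1): match.group(2).strip()
            for match in FIELD_RE.finditer(body)
        }
    return profile


sample = (
    "---CHODE v2 @ a3f9c12---\n\n"
    "---DNA--- @STACK go chi mysql jwt sqlite3 @CONFIG ini @ENTRY main.go "
    "@ROUTES chi → routers/ (447 files)\n\n"
    "---CONTEXT--- @TESTING Run make lint before submitting.\n"
)
profile = parse_profile(sample)
print(profile["commit"], "|", profile["sections"]["DNA"]["ROUTES"])
```

A field value that itself contains an `@NAME` token would confuse this sketch, so a production parser would want a stricter grammar.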

The Benchmark

We tested across 9 real-world repos spanning Go, Ruby, TypeScript, Python, PHP, Rust, and Elixir, using 6 models and 12 profile-dependent questions per repo. Questions were designed so the correct answer exists in the CHODE profile but would require specific knowledge to answer from training alone.

Mode	Avg score	Avg input tokens
Baseline (training only)	44%	~300
Self-profile (raw docs)	40%	~43,000
CHODE	90%	~353

By repo:

Repo	Baseline	CHODE	Δ	Profile size
gitea	20%	97%	+77pp	~507 tok
rails	38%	93%	+55pp	~548 tok
gh-cli	39%	94%	+55pp	~451 tok
laravel	39%	88%	+49pp	~237 tok
next.js	24%	77%	+53pp	~567 tok
fastapi	44%	87%	+43pp	~211 tok
django	47%	85%	+38pp	~160 tok
ripgrep	75%	97%	+22pp	~189 tok
phoenix	69%	86%	+17pp	~306 tok

For large repos like Next.js with 27,000 files, self-profiling requires 300,000+ tokens and exceeds most model context windows entirely. CHODE handles them in 567 tokens.

“Why Not Just Paste the README?”

We ran a second benchmark specifically to answer this: four context strategies, each evaluated on the same questions, repos, and models.

Technique	Score	Avg tokens	Notes
Baseline	14%	~300	Training knowledge only
README-only	18%	~2,956	~7× more tokens than CHODE, still fails
LLM summary	25%	~437	Requires 2 API calls, summarization discards specific facts
CHODE	90%	~406	Structured profile, offline, no LLM

README-only failed on two repos completely (0%) despite being given the document. The Caddy README is pure overview prose — the entry point path, the HTTP router dependency, and the coding conventions are in CONTRIBUTING.md and source trees, not the README. The Zulip README is 80 lines of marketing copy. Auth methods (LDAP, SAML), frontend framework (Preact), and package managers (pnpm, uv) live in config files. The README is the project’s public face, optimized for first impressions — not for answering technical navigation questions.

The LLM summary result is more interesting. We asked the model to summarize the README first, producing a ~500-token structured summary, then answer questions from that summary. Same token budget as CHODE. Still only 25%. Summarization is lossy in exactly the wrong direction: it keeps the narrative and drops the specific names. “Supports multiple authentication methods” instead of “LDAP, SAML, OAuth2, password, WebAuthn.” The atoms that matter get collapsed into prose.

CHODE wins because it doesn’t summarize — it extracts. It reads the files most likely to contain each fact and writes the answer into a labeled field. When a model reads @ROUTES chi → routers/ (447 files), there’s nothing to filter. The fact is already isolated.

Cross-Domain Generalization

We extended the benchmark to 18 additional repos across 8 languages — browsers, chat apps, linters, web servers, AI models, graphics engines, and more. We split questions into two tiers:

Tier	What it tests	Baseline	CHODE
Tier 1 — facts in the profile	Profile retrieval accuracy	22%	100%
Tier 2 — facts not in the profile	Hallucination resistance	48%	17%*

*Tier 2: models correctly respond “Not in profile” for facts the profile doesn’t contain. Baseline gets 48% by guessing from training knowledge; CHODE gets 17% because it correctly scopes the model to the profile, which doesn’t include those facts. The 17% is the right behavior.

Tier 1: 100% accuracy. Every fact CHODE explicitly extracts was retrieved perfectly across every repo and model combination.

Tier 2 reveals the tradeoff CHODE makes: by instructing the model to answer only from the profile, you lose access to facts that the model knows from training but the profile doesn’t cover. That’s a conscious design choice. The profile is authoritative for what it contains; for everything else, the model should say so.
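To make the scoping concrete, here is a small Python sketch of a profile-only prompt. The wording is hypothetical: the benchmark's actual prompt is not reproduced in this post, and `build_prompt` is an illustrative name. The only detail taken from the text is the “Not in profile” refusal string.

```python
def build_prompt(profile_text: str, question: str) -> str:
    """Wrap a profile and a question in a profile-only instruction (assumed wording)."""
    return (
        "Answer using only the repository profile below. "
        'If the profile does not contain the answer, reply exactly "Not in profile".\n\n'
        f"PROFILE:\n{profile_text.strip()}\n\n"
        f"QUESTION: {question.strip()}\n"
    )


prompt = build_prompt(
    "---DNA--- @ROUTES chi → routers/ (447 files)",
    "What HTTP router does this project use?",
)
print(prompt)
```

Dropping the refusal instruction would let the model fall back on training knowledge for Tier 2 questions, at the cost of the profile no longer being the single source of truth.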

How It Works

CHODE reads two categories of information from a repo:

Structural (DNA section): filenames and anchor manifests only, no source code

File tree analysis produces @PACKAGES and @STRUCT — the top directories weighted by file count. Anchor files (package.json, go.mod, Cargo.toml, pyproject.toml, etc.) are parsed for dependency names and test commands. No source code is read.
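The directory weighting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CHODE's generator: it assumes each count covers every file nested under a top-level directory and that root-level files are skipped, and both function names are made up for the example. The output imitates the @PACKAGES style from the Gitea profile.

```python
from collections import Counter
from pathlib import PurePosixPath


def top_directories(paths, limit=3):
    """Rank top-level directories by how many files live under them."""
    counts = Counter()
    for path in paths:
        parts = PurePosixPath(path).parts
        if len(parts) > 1:  # skip files that sit at the repo root
            counts[parts[0]] += 1
    return counts.most_common(limit)


def format_packages(ranked):
    """Render ranked directories in the style of the @PACKAGES field."""
    return "@PACKAGES " + " ".join(f"{name}/({count})" for name, count in ranked)


files = [
    "main.go",
    "modules/git/repo.go",
    "modules/git/commit.go",
    "modules/log/log.go",
    "models/user.go",
    "models/repo.go",
    "services/auth/login.go",
]
print(format_packages(top_directories(files)))
# prints: @PACKAGES modules/(3) models/(2) services/(1)
```

Only path strings are inspected, which matches the claim that this stage never opens a source file.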

Semantic (CONTEXT section) — doc files only

README, CONTRIBUTING, and similar files are scanned. Section headings are mapped to fields (@PURPOSE, @CONVENTIONS, @TESTING, @ENV). Content is compressed with a deterministic algorithm that strips filler words, markdown noise, and repeated boilerplate, with a hard cap of 200 characters per field.
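Here is a rough Python sketch of what a deterministic compressor with a hard cap could look like. Only the 200-character cap comes from the text. The filler list, the markdown patterns, and the name `compress_field` are assumptions for illustration, and the real algorithm almost certainly differs in its details.

```python
import re

# Hypothetical filler list; the tool's actual word list is not given in the text.
FILLER = {"the", "a", "an", "please", "simply", "just", "very", "really", "basically"}
FIELD_CAP = 200  # hard cap per field, as stated in the text


def compress_field(text: str, cap: int = FIELD_CAP) -> str:
    """Strip markdown noise and filler words, then truncate to the cap."""
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)  # [label](url) -> label
    text = re.sub(r"[`*#>]+", "", text)                   # code ticks, emphasis, headings, quotes
    words = [word for word in text.split() if word.lower() not in FILLER]
    return " ".join(words)[:cap].rstrip()


raw = "## Please **always** run `make fmt` before committing the [code](https://example.com)."
print(compress_field(raw))
# prints: always run make fmt before committing code.
```

Because there is no model in the loop, the same input always yields the same field, which is what makes profiles diffable between commits.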

The git commit hash in the header gives consumers a direct staleness signal:

git rev-list $(head -1 .chode | grep -o '[a-f0-9]\{7\}')..HEAD --count
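The same check can be scripted for tooling that prefers a number over shell output. The Python sketch below assumes the profile lives at `.chode` in the repo root and that the header looks like the sample above. `profile_commit` and `commits_behind` are illustrative names, not part of the CHODE CLI. The `git rev-list --count` call is the one from the command above.

```python
import re
import subprocess
from pathlib import Path


def profile_commit(first_line: str) -> str:
    """Pull the commit hash out of a CHODE header line."""
    match = re.search(r"@ ([0-9a-f]{7,40})-", first_line)
    if match is None:
        raise ValueError("no commit hash in header")
    return match.group(1)


def commits_behind(repo_dir: str, profile_name: str = ".chode") -> int:
    """Count commits on HEAD that are newer than the profile's commit."""
    with open(Path(repo_dir) / profile_name, encoding="utf-8") as handle:
        commit = profile_commit(handle.readline())
    result = subprocess.run(
        ["git", "rev-list", "--count", f"{commit}..HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


print(profile_commit("---CHODE v2 @ a3f9c12---"))
# prints: a3f9c12
```

A result of 0 from `commits_behind` means the profile was generated at the current HEAD. Anything larger is a cue to regenerate it.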

Two Retrieval Modes

We ran a poison profile test — deliberately filling a .chode profile with wrong information — and found that model behavior splits cleanly by how well-known the repo is.

For famous repos (Caddy, ruff): both models rejected the wrong profile content and answered from training knowledge. The model’s semantic retrieval path has strong priors for well-known projects and overrides the profile.

For obscure repos (PocketBase on Gemini Flash): the model trusted the profile completely, including the planted wrong answer. No training prior to override it.

This is expected and, we’d argue, correct behavior. CHODE is designed as a prior-filling mechanism: it’s most powerful and most trusted exactly where training knowledge is absent — obscure internal repos, private codebases, newly released projects. For widely-documented open-source projects, the model’s semantic retrieval fills in gaps that the profile might miss.

The Efficiency Numbers

We measure comprehension efficiency as score per token of input:

CE = Score / Input_tokens

For Gitea + GPT-4o:

Method	Score	Tokens	CE
Raw docs (self-profile)	36%	82,918	0.0000043
CHODE	100%	507	0.00197

CHODE delivers 454× better comprehension per token for this repo. Averaged across all 9 benchmark repos and 6 models, self-profiling consumes ~43,000 tokens per query while scoring lower than CHODE’s ~353 token profile.
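The arithmetic behind those figures is easy to check. This short Python sketch recomputes CE from the table values, treating the score as a fraction between 0 and 1. The function name is ours for illustration.

```python
def comprehension_efficiency(score: float, input_tokens: int) -> float:
    """CE = score / input tokens, with score as a fraction in [0, 1]."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    return score / input_tokens


# Gitea + GPT-4o, values from the table above.
raw_docs = comprehension_efficiency(0.36, 82_918)
chode = comprehension_efficiency(1.00, 507)

print(f"raw docs CE = {raw_docs:.7f}")        # 0.0000043
print(f"CHODE CE    = {chode:.5f}")           # 0.00197
print(f"ratio       = {chode / raw_docs:.0f}x")  # 454x
```

The ratio depends on both the score gap and the token gap, so a method that matched CHODE's accuracy with 82,918 tokens would still trail by about 164× on this metric.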

454× better comprehension per token (Gitea)
122× fewer tokens than self-profiling (avg)
0.2s generation time, no LLM invoked

The Landscape

There are excellent tools for dumping repo contents into an LLM context window — Repomix (23k stars), GitIngest, Code2Prompt. They’re all solving a different problem: raw context for a specific targeted task like code review or one-shot editing. They produce 50k–300k token outputs by design.

There’s also LLMLingua (Microsoft Research), which compresses token-by-token using a smaller LM. Interesting approach — but it requires LLM inference at generation time and optimizes for general text, not structured repo facts.

CHODE occupies a gap none of these fill: a fixed ~500-token structured hierarchical profile, generated offline in under a second, optimized for AI model orientation rather than task execution. It’s not trying to give a model everything. It’s trying to give a model exactly what it needs to understand where it is.

Try CHODE

Profile any repo in under a second. Works offline, no LLM required.

GitHub: https://github.com/zer0contextlost/CHODE
npm install -g chode

What’s Next

A few threads we’re actively working on:

VS Code extension — auto-attach .chode to AI chat context on session start
chode.dev hosted service — profile any public GitHub repo without cloning
--query flag — inline model call with the profile as context, so chode /repo --query "where do I add a new endpoint?" just works
Unconventional repo benchmark — monorepos, embedded systems repos, config-only repos: stress-testing the generator’s structural assumptions

The benchmark suite and all result files are in the repo under benchmarks/. Questions, scoring logic, and raw results are open — run it yourself against any model you have access to.

Read the Full Technical Report

16 sections covering architecture, failure mode analysis, logprob evidence, ablations, and cross-domain generalization across 27 repos.


If I want to make wooden assholes for hobby horses, I will. Nobody gets to decide what’s worth building except the person building it. Some of the best things ever made started as a joke, a bad idea, or a project someone’s smarter friend told them not to bother with. This started as a one-liner I wrote at 2am because I was annoyed. Now it’s a 17-section technical paper and an npm package. I have no regrets. Till next time, ya nerds.