We were building a tool to help AI assistants understand codebases faster. The obvious approach — feed the model the raw documentation — seemed like the right starting point. Just read the README, the CONTRIBUTING guide, maybe the package manifest. Paste it all in. Let the model figure it out.
So we ran the benchmark. Across 9 real-world repos and 6 models, we asked 12 specific, factual questions per repo — things like: what HTTP router does this project use? what authentication methods does it support? what’s the entry point file?
The self-profiling approach — where the model reads 43,000 tokens of raw documentation and then answers — scored 40%. Baseline, with no context at all, scored 44%.
More context. Lower score.
That result broke our assumptions. We spent a week figuring out why, and it led us to build something quite different from what we started with.
Why More Context Hurt
The questions were designed to be anti-parametric — they target facts the model wouldn’t know from training data alone. Which specific HTTP router does this project use? Which auth backend? Not the popular answer. The correct answer.
When models hit those questions with no context, they default to their training priors: the most common answer for a Go HTTP server (Gin or gorilla/mux), the most common auth stack for a Python app (OAuth, password). They’re wrong, but confidently wrong in a predictable way.
When models hit those questions with 43,000 tokens of raw documentation, something different happens depending on model size:
Prior Overwhelming (smaller models)
Models like Mistral Large and Llama 4 Maverick read the raw context and still answer with the common case. The documentation reinforces the generic signal — it’s full of framework-standard boilerplate — drowning out the specific needle. Mistral Large, asked about the HTTP router with 43k tokens of context, answered: “Gin is more likely if the project is lightweight.” It had just read the answer. It defaulted to its training prior anyway.
Attention Dilution (larger models)
GPT-4o behaved differently. It didn’t hallucinate — it abstained. Asked about configuration format with 82,000 tokens of context that included the answer, it answered: “Not in profile.” The fact was there. The model scanned the surface, missed the buried detail, and correctly reported that it hadn’t found it.
Six out of twelve questions triggered this response. The model wasn’t wrong — it just didn’t find what it was looking for in 82k tokens of noise.
“CHODE doesn’t just save tokens — it saves the signal. Providing 43,000 tokens of raw documentation produced 15% lower accuracy on specific technical facts than providing zero context. More context was more distraction.”
We confirmed the mechanism directly. For the router question on the Gitea repo, we ran GPT-4o in logprobs mode. Baseline (no context): the model assigned 2.5% probability to the correct answer (Chi). With CHODE’s 507-token profile: 99.7%. The signal went from noise to certainty. ΔP(Chi) = +97.3 percentage points.
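For readers who want to reproduce this kind of measurement, here is a minimal sketch of the arithmetic, assuming the API returns a natural-log probability for the answer token. The two logprob values are hypothetical stand-ins derived from the rounded percentages above, not the benchmark's raw output, and the function names are ours.

```python
import math

def prob_from_logprob(logprob: float) -> float:
    """Convert a natural-log probability into a plain probability."""
    return math.exp(logprob)

def delta_pp(baseline_logprob: float, profiled_logprob: float) -> float:
    """Change in probability between two runs, in percentage points."""
    before = prob_from_logprob(baseline_logprob)
    after = prob_from_logprob(profiled_logprob)
    return 100.0 * (after - before)

# Hypothetical logprobs corresponding to roughly 2.5% and 99.7%.
baseline_logprob = math.log(0.025)
profiled_logprob = math.log(0.997)
shift = delta_pp(baseline_logprob, profiled_logprob)
```

Because the inputs here are rounded, the result can differ from the reported figure in the last decimal place.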
What CHODE Is
CHODE (Compressed Hierarchical Overview of Directory and Ecosystems) is a CLI that profiles any software repo in under a second, producing a ~350–550 token structured document — a fingerprint, not a summary.
git clone https://github.com/zer0contextlost/CHODE
cd CHODE && npm install && npm install -g .
chode /path/to/repo
Here’s what it produces for Gitea (6,900 files → 507 tokens in 0.2 seconds):
---CHODE v2 @ a3f9c12---
---DNA---
@STACK go chi mysql jwt sqlite3
@FRONTEND ts vue esbuild vite (web_src/)
@CI github-actions
@PKG pnpm gomod uv
@TEST playwright vitest make test
@CONFIG ini
@ENTRY main.go
@ROUTES chi → routers/ (447 files)
@DATA models/migrations/ (305 migrations)
@AUTH openid pam password webauthn ldap oauth
@ARCH layered(cmd→routes→svc→mdl)
@PACKAGES modules/(968) models/(649) services/(479)
---CONTEXT---
@PURPOSE easiest, fastest, most painless way of setting up self-hosted Git service…
@CONVENTIONS Always run make fmt before committing to conform to Gitea’s styleguide.
@TESTING Run make lint before submitting. Test commands: make test | make test-backend
No source code is read. No LLM is invoked. It runs entirely offline. The git commit hash in the header lets you detect profile staleness with a single command.
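Because every fact sits behind an `@FIELD` label, the profile is also easy to consume programmatically. The sketch below is our own illustration, not part of CHODE: `parse_profile` is a hypothetical helper, and the regular expression assumes field names are uppercase letters and that values never contain `---`.

```python
import re

# A field is "@NAME value"; the value runs until the next field,
# a "---SECTION---" marker, or the end of the text.
FIELD_RE = re.compile(r"@([A-Z]+)\s+(.*?)(?=\s+@[A-Z]+\s|\s*---|$)", re.S)

def parse_profile(text: str) -> dict:
    """Return a mapping of field name to raw value string."""
    return {name: value.strip() for name, value in FIELD_RE.findall(text)}

# Shortened excerpt of the Gitea profile shown above.
SAMPLE = (
    "---CHODE v2 @ a3f9c12---\n"
    "---DNA--- @STACK go chi mysql jwt sqlite3 @CONFIG ini @ENTRY main.go "
    "@ROUTES chi → routers/ (447 files) "
    "---CONTEXT--- @PURPOSE self-hosted Git service"
)

profile = parse_profile(SAMPLE)
router = profile["ROUTES"].split()[0]  # first token of the @ROUTES field
```

The same expression handles profiles with one field per line, since it treats any whitespace as a separator.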
The Benchmark
We tested across 9 real-world repos spanning Go, Ruby, TypeScript, Python, PHP, Rust, and Elixir, using 6 models and 12 profile-dependent questions per repo. Questions were designed so the correct answer exists in the CHODE profile but would require specific knowledge to answer from training alone.
| Mode | Avg score | Avg input tokens |
|---|---|---|
| Baseline (training only) | 44% | ~300 |
| Self-profile (raw docs) | 40% | ~43,000 |
| CHODE | 90% | ~353 |
By repo:
| Repo | Baseline | CHODE | Δ | Profile size |
|---|---|---|---|---|
| gitea | 20% | 97% | +77pp | ~507 tok |
| rails | 38% | 93% | +55pp | ~548 tok |
| gh-cli | 39% | 94% | +55pp | ~451 tok |
| laravel | 39% | 88% | +49pp | ~237 tok |
| next.js | 24% | 77% | +53pp | ~567 tok |
| fastapi | 44% | 87% | +43pp | ~211 tok |
| django | 47% | 85% | +38pp | ~160 tok |
| ripgrep | 75% | 97% | +22pp | ~189 tok |
| phoenix | 69% | 86% | +17pp | ~306 tok |
For large repos like Next.js with 27,000 files, self-profiling requires 300,000+ tokens and exceeds most model context windows entirely. CHODE handles them in 567 tokens.
“Why Not Just Paste the README?”
We ran a second benchmark specifically to answer this. Four context strategies, same questions, same repos, same models.
| Technique | Score | Avg tokens | Notes |
|---|---|---|---|
| Baseline | 14% | ~300 | Training knowledge only |
| README-only | 18% | ~2,956 | ~7× CHODE's tokens, still fails |
| LLM summary | 25% | ~437 | Requires 2 API calls, summarization discards specific facts |
| CHODE | 90% | ~406 | Structured profile, offline, no LLM |
README-only failed on two repos completely (0%) despite being given the document. The Caddy README is pure overview prose — the entry point path, the HTTP router dependency, and the coding conventions are in CONTRIBUTING.md and source trees, not the README. The Zulip README is 80 lines of marketing copy. Auth methods (LDAP, SAML), frontend framework (Preact), and package managers (pnpm, uv) live in config files. The README is the project’s public face, optimized for first impressions — not for answering technical navigation questions.
The LLM summary result is more interesting. We asked the model to summarize the README first, producing a ~500-token structured summary, then answer questions from that summary. Same token budget as CHODE. Still only 25%. Summarization is lossy in exactly the wrong direction: it keeps the narrative and drops the specific names. “Supports multiple authentication methods” instead of “LDAP, SAML, OAuth2, password, WebAuthn.” The atoms that matter get collapsed into prose.
CHODE wins because it doesn’t summarize — it extracts. It reads the files most likely to contain each fact and writes the answer into a labeled field. When a model reads @ROUTES chi → routers/ (447 files), there’s nothing to filter. The fact is already isolated.
Cross-Domain Generalization
We extended the benchmark to 18 additional repos across 8 languages — browsers, chat apps, linters, web servers, AI models, graphics engines, and more. We split questions into two tiers:
| Tier | What it tests | Baseline | CHODE |
|---|---|---|---|
| Tier 1 — facts in the profile | Profile retrieval accuracy | 22% | 100% |
| Tier 2 — facts not in the profile | Hallucination resistance | 48% | 17%* |
*Tier 2: models correctly respond “Not in profile” for facts the profile doesn’t contain. Baseline gets 48% by guessing from training knowledge; CHODE gets 17% because it correctly scopes the model to the profile, which doesn’t include those facts. The 17% is the right behavior.
Tier 1: 100% accuracy. Every fact CHODE explicitly extracts was retrieved perfectly across every repo and model combination.
Tier 2 reveals the tradeoff CHODE makes: by instructing the model to answer only from the profile, you lose access to facts that the model knows from training but the profile doesn’t cover. That’s a conscious design choice. The profile is authoritative for what it contains; for everything else, the model should say so.
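To make the scoping concrete, here is a small sketch of how a profile-only prompt and an abstention check could be wired up. The instruction wording and both function names are assumptions for illustration; the benchmark's actual prompt is in the repo, not reproduced here.

```python
def build_scoped_prompt(profile: str, question: str) -> str:
    """Assemble a prompt that restricts the model to the profile.

    The instruction text is an assumed wording, not the benchmark's prompt.
    """
    return (
        "Answer using only the repository profile below. "
        'If the profile does not contain the answer, reply exactly "Not in profile".\n\n'
        f"PROFILE:\n{profile.strip()}\n\n"
        f"QUESTION: {question.strip()}"
    )

def is_abstention(answer: str) -> bool:
    """True when the model declined to answer from the profile."""
    return answer.strip().strip('."').lower() == "not in profile"

prompt = build_scoped_prompt("@ROUTES chi → routers/ (447 files)",
                             "What HTTP router does this project use?")
```

Scoring Tier 2 then comes down to counting abstentions as correct when the fact is genuinely absent from the profile.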
How It Works
CHODE reads two categories of information from a repo:
Structural (DNA section) — filenames only, no content
File tree analysis produces @PACKAGES and @STRUCT — the top directories weighted by file count. Anchor files (package.json, go.mod, Cargo.toml, pyproject.toml, etc.) are parsed for dependency names and test commands. No source code is read.
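The directory weighting can be approximated in a few lines. This is a rough sketch under our own assumptions (top-level directories only, ranked purely by file count); `top_directories` and `format_packages` are hypothetical names, and the real generator may weight things differently.

```python
from collections import Counter
from pathlib import PurePosixPath

def top_directories(paths, depth=1, limit=3):
    """Rank directories at the given depth by how many files they contain."""
    counts = Counter()
    for p in paths:
        parts = PurePosixPath(p).parts
        if len(parts) > depth:  # skip files that sit above the requested depth
            counts["/".join(parts[:depth]) + "/"] += 1
    return counts.most_common(limit)

def format_packages(ranked):
    """Render the ranking in the style of the @PACKAGES field."""
    return "@PACKAGES " + " ".join(f"{name}({n})" for name, n in ranked)

files = ["modules/a.go", "modules/b/c.go", "models/x.go", "main.go"]
line = format_packages(top_directories(files))
```

Only path strings are inspected, which matches the claim that file contents are never read for this section.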
Semantic (CONTEXT section) — doc files only
README, CONTRIBUTING, and similar files are scanned. Section headings are mapped to fields (@PURPOSE, @CONVENTIONS, @TESTING, @ENV). Content is compressed with a deterministic algorithm that strips filler words, markdown noise, and repeated boilerplate, with a hard cap of 200 characters per field.
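A deterministic compressor of this kind might look like the sketch below. The filler-word list and the markdown patterns are illustrative guesses, not CHODE's actual rules; only the 200-character cap comes from the description above.

```python
import re

# Hypothetical filler list; the real tool's list is not documented here.
FILLER = {"a", "an", "the", "please", "just", "very", "really", "simply"}

def compress_field(text: str, cap: int = 200) -> str:
    """Strip markdown noise and filler words, then truncate to `cap` chars."""
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)  # [label](url) -> label
    text = re.sub(r"[`*#>]+", "", text)                   # drop markdown markers
    words = [w for w in text.split() if w.lower() not in FILLER]
    return " ".join(words)[:cap].rstrip()

short = compress_field("Please **always** run `make fmt` before the commit.")
```

The same input always yields the same output, so two runs on an unchanged repo produce identical fields.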
The git commit hash in the header gives consumers a direct staleness signal:
git rev-list $(head -1 .chode | grep -o '[a-f0-9]\{7\}')..HEAD --count
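The same check can be scripted. The sketch below assumes the header looks like the `---CHODE v2 @ a3f9c12---` line shown earlier and that the profile is stored as `.chode` in the repo root; `profile_commit` and `commits_behind` are our own helper names.

```python
import re
import subprocess
from pathlib import Path

# Assumed header shape, taken from the sample profile.
HEADER_RE = re.compile(r"^---CHODE v\d+ @ ([0-9a-f]{7,40})---$")

def profile_commit(header_line: str):
    """Extract the commit hash from a profile header, or None if absent."""
    match = HEADER_RE.match(header_line.strip())
    return match.group(1) if match else None

def commits_behind(repo_dir: str) -> int:
    """Count commits made since the profile was generated."""
    with open(Path(repo_dir) / ".chode", encoding="utf-8") as fh:
        commit = profile_commit(fh.readline())
    if commit is None:
        raise ValueError("no commit hash found in profile header")
    result = subprocess.run(
        ["git", "rev-list", "--count", f"{commit}..HEAD"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())
```

A result of 0 means the profile matches HEAD; anything higher suggests regenerating it.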
Two Retrieval Modes
We ran a poison profile test — deliberately filling a .chode profile with wrong information — and found that model behavior splits cleanly by how well-known the repo is.
For famous repos (Caddy, ruff): both models rejected the wrong profile content and answered from training knowledge. The model’s semantic retrieval path has strong priors for well-known projects and overrides the profile.
For obscure repos (PocketBase on Gemini Flash): the model trusted the profile completely, including the planted wrong answer. No training prior to override it.
This is expected and, we’d argue, correct behavior. CHODE is designed as a prior-filling mechanism: it’s most powerful and most trusted exactly where training knowledge is absent — obscure internal repos, private codebases, newly released projects. For widely-documented open-source projects, the model’s semantic retrieval fills in gaps that the profile might miss.
The Efficiency Numbers
We measure comprehension efficiency as score per token of input:
CE = Score / Input_tokens
For Gitea + GPT-4o:
| Method | Score | Tokens | CE |
|---|---|---|---|
| Raw docs (self-profile) | 36% | 82,918 | 0.0000043 |
| CHODE | 100% | 507 | 0.00197 |
CHODE delivers 454× better comprehension per token for this repo. Averaged across all 9 benchmark repos and 6 models, self-profiling consumes ~43,000 tokens per query while scoring lower than CHODE’s ~353 token profile.
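The ratio follows directly from the table. As a quick check of the arithmetic, with scores written as fractions (the function name is ours):

```python
def comprehension_efficiency(score: float, input_tokens: int) -> float:
    """CE = score / input tokens, with score as a fraction between 0 and 1."""
    if input_tokens <= 0:
        raise ValueError("input_tokens must be positive")
    return score / input_tokens

# Figures from the Gitea + GPT-4o table above.
raw_docs_ce = comprehension_efficiency(0.36, 82_918)
chode_ce = comprehension_efficiency(1.00, 507)
ratio = chode_ce / raw_docs_ce
```

Rounding `ratio` to the nearest integer gives the 454× figure quoted above.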
The Landscape
There are excellent tools for dumping repo contents into an LLM context window — Repomix (23k stars), GitIngest, Code2Prompt. They’re all solving a different problem: raw context for a specific targeted task like code review or one-shot editing. They produce 50k–300k token outputs by design.
There’s also LLMLingua (Microsoft Research), which compresses token-by-token using a smaller LM. Interesting approach — but it requires LLM inference at generation time and optimizes for general text, not structured repo facts.
CHODE occupies a gap none of these fill: a fixed ~500-token structured hierarchical profile, generated offline in under a second, optimized for AI model orientation rather than task execution. It’s not trying to give a model everything. It’s trying to give a model exactly what it needs to understand where it is.
Try CHODE
Profile any repo in under a second. Works offline, no LLM required.
npm install -g chode
What’s Next
A few threads we’re actively working on:
- VS Code extension: auto-attach .chode to AI chat context on session start
- chode.dev hosted service: profile any public GitHub repo without cloning
- --query flag: inline model call with the profile as context, so chode /repo --query "where do I add a new endpoint?" just works
- Unconventional repo benchmark: monorepos, embedded systems repos, config-only repos, stress-testing the generator’s structural assumptions
The benchmark suite and all result files are in the repo under benchmarks/. Questions, scoring logic, and raw results are open — run it yourself against any model you have access to.
Read the Full Technical Report
16 sections covering architecture, failure mode analysis, logprob evidence, ablations, and cross-domain generalization across 27 repos.
If I want to make wooden assholes for hobby horses, I will. Nobody gets to decide what’s worth building except the person building it. Some of the best things ever made started as a joke, a bad idea, or a project someone’s smarter friend told them not to bother with. This started as a one-liner I wrote at 2am because I was annoyed. Now it’s a 17-section technical paper and an npm package. I have no regrets. Till next time, ya nerds.