blog
Measuring math-glyph token compression
· spec-driven-development · claude-code · ai-coding · benchmarks
I claim in the previous post that math-glyph notation lets a full project spec — invariants, tasks, bug history — sit in one file Claude reads at session start. That claim only works if the compression is real and durable. So I wrote a benchmark.
TLDR
Across 30 rows from SPEC.md, math-glyph encoding is ~30% denser
than a minimal prose decode (pure notation savings) and ~90% denser
than what a human reviewer actually reads via /sdd:explain (notation
savings plus deferred sibling-context expansion). The benchmark lives
in pilot-skills/benchmarks/glyph
and re-runs against the current SPEC.md so the numbers stay honest
as the encoding evolves.
Why measure it at all
“Compressed” was a word I’d been using without a number behind it. That’s fine in a blog post but not fine in a design decision. I needed to defend two claims to myself:
- The encoding actually saves tokens vs the same content in English.
- The savings are not an artifact of one cherry-picked row.
A repeatable benchmark, tracked in git, is the only honest way to keep those claims alive as the skill evolves. The encoder lives at pilot-skills/pilot-spec/skills/glyph, the decoder at pilot-skills/pilot-spec/skills/explain. If either drifts, the benchmark catches it on the next run.
Two decoders, two numbers
The interesting part of the design isn’t the encoder. It’s that there are two reasonable ways to decode, and they answer different questions.
Minimal decoder. System prompt: Expand to plain English. Preserve every fact. Output prose only, no preamble. Strips ∀ to “for all”,
→ to “implies”, and stops. Same facts, no surrounding context. This
is the fairest comparison against a hypothetical English-only spec
that uses the same level of detail per row.
Canonical decoder. The /sdd:explain skill from the plugin itself
— heading, row quote, plain-English restatement, walk of cited
siblings, bottom-line summary. This is the form a human reviewer reads
when they need to understand a row well enough to act on it.
If you only compare against the minimal decoder, you’re measuring how
compact the symbols are. If you only compare against the canonical
decoder, you’re measuring symbols plus everything else /sdd:explain
pulls in — and that “everything else” is decoder behavior, not
encoder cleverness. So the benchmark reports both.
The grand totals
n = 30 (first 10 rows from each of §V, §T, §B):
- Minimal: +0.29. Pure encoding overhead. Same content, fewer tokens.
- Canonical: +0.89. Notation savings plus deferred sibling-context expansion.
The 60-point gap is the part most readers miss when they look at “how
compressed is this.” It’s not the encoder doing more work in the
canonical case — it’s the decoder pulling in cited rows from
elsewhere in SPEC.md so the reader can act on what they’re reading.
A traditional human-readable spec without §V.<n> and §T.<n>
citations would have to inline that context on every related row,
multiplying file size and creating its own drift problem when an
invariant amends. Math-glyph keeps references as references and pays
the inlining cost only at decode time, only for the row being read.
Worked examples
§V.6 — stand-alone row
Take §V.6 from the pilot-skills SPEC.md:
plugin name ≠ dir name — resolve plugin → dir via
.claude-plugin/marketplace.jsonplugins[].source; ⊥ hardcode dir paths in cmd ∨ skill bodies.
85 tokens in glyph form. The minimal decoder produces 140 tokens of fluent English saying the same thing — a 39% reduction. The canonical decoder produces 683 tokens: heading, restatement, cited siblings, status note, bottom line — an 88% reduction over the same row.
The canonical output is eight times the size of the row. That
eightfold isn’t the encoder. It’s /sdd:explain walking the cited
§T.37 and pulling in the surrounding invariants so a reviewer who
didn’t write the spec can still act on it.
§B.7 — context-heavy row
§V.6 was clean: short body, short expansion, the numeric story tells
itself. §B.7 is the opposite case — a row whose meaning is
held almost entirely in its references, where minimal-decoder output
reads fluent but leaves a reader unable to act.
Glyph form in SPEC.md, 139 tokens:
2026-05-10|T8 V3-sweep ran ∀ literal
§V.Nform only; bareV.N-form pinned numerics ∈ pedagogical examples @ pilot-spec/skills/{spec,backprop}/SKILL.md persisted (8 sites) ∴ V27 violation; /sdd:check audit surfaced 2026-05-10|V32
Four refs, none inlined: §T.8 (the prior sweep task), §V.27 (the
invariant violated), §V.32 (the fix invariant in the trailing pipe),
and implicitly §V.3 (what V3-sweep enforced).
Minimal decoder — 207 tokens, faithful but isolated:
On May 10, 2026, during the T8 V3-sweep, the process ran only against literals matching the form
§V.N(that is, version references prefixed with the section symbol). As a result, bareV.N-form pinned numeric references that appear within pedagogical examples in the files atpilot-spec/skills/spec/SKILL.mdandpilot-spec/skills/backprop/SKILL.mdpersisted unchanged. There are eight such sites in total. This constitutes a violation of V27. The/sdd:checkaudit surfaced this issue on May 10, 2026, under V32.
A 33% reduction — same notation-cost story as §V.6. But a
reader who doesn’t already know what §T.8 was supposed to do, what
§V.27 forbids, or what §V.32 prescribes still cannot tell what
went wrong, why it matters, or whether the class is closed. Fluent,
content-free.
Canonical decoder — 1117 tokens that walk the cited chain. Excerpt:
§B.7 — derivative leak in V3 sweep (pre-fix)
In plain English: an earlier task (
§T.8) was supposed to clean up “pinned” spec-citation numerics from published skill bodies — these are forbidden because shared artifacts travel between repos and a number likeV3only makes sense relative to this repo’sSPEC.md. The sweep was written to find only the dotted long form (literally§V.N) and missed the bare short form (literallyV.N,V3,T1,B5etc.). Eight such bare-form citations survived inside pedagogical examples in the spec and backprop skill files. […]Cited invariants:
§V.32— a§Trow whose purpose is to remediate a§Vviolation must declare its scope as a literal grep pattern (or a vocab table) covering all forms of the violation […]Related (not directly cited but mechanically linked):
§V.27— the invariant that was actually violated: published-body examples must use placeholder citation form (§V.<n>, etc.) […]§T.8— the original sweep task that closed prematurely because its scope was too narrow.§B.6,§B.8— sibling bugs in the same recurrence class (sweep marked done while subforms remained). Together with§B.7these motivated§V.32.
An 88% reduction over a 139-token row. The eightfold growth isn’t the
encoder doing more work — it’s the decoder pulling in §T.8,
§V.27, §V.32, and the sibling bugs in the same recurrence class
so the reviewer can act on the row. A traditional human-readable spec
would have to inline that same context on every related row,
multiplying file size and creating its own drift problem when any
cited invariant amends.
What surprised me
A few results from the run worth flagging:
§Tand§Bcompress better than§Vunder the minimal decoder. I’d expected the opposite —∀and→look denser than pipe-table rows. The reason: pipe rows likeT<n>|<status>|<desc>|<cites>gain connective tissue when expanded (“This task has status complete and cites V1 and V23, namely…”), while§Vrows are already prose-like under the hood, so unwrapping them adds less.- A few
§Vrows go slightly negative under the minimal decoder.V3andV4produce minimal-decoder output of nearly the same length as the row itself. The encoding doesn’t compress everything equally, and the benchmark is honest about that. - The canonical p25–p75 band is narrow — about 0.85–0.93 across all sections. Canonical output is dominated by sibling expansion, which doesn’t scale much with source row size. Short row or long row, the decoder pulls in roughly the same context.
How a run works
Per row, the pipeline is five steps:
- Count tokens of the row body via Anthropic’s
/v1/messages/count_tokensendpoint, modelclaude-opus-4-7. - Decode via the minimal-decoder system prompt; capture prose output.
- Decode via the canonical decoder — the
/sdd:explainskill body verbatim as system prompt, with the fullSPEC.mdattached inline so the decoder can resolve citations without filesystem access. - Count tokens of both prose outputs.
- Compute
reduction = 1 - n_glyph / n_prose_*for each decoder.
30 rows × (1 token-count + 2 decodes + 2 token-counts) = 150 API
calls per run. The script self-bootstraps uv from PEP 723 inline
metadata so the host needs no Python deps. Results append to
glyph-bench-results.json,
tracked in git so the trend is visible across commits.
I run it after any structural change to the glyph skill or the
explain skill — the two pieces that move the numbers. So far
the trend has been flat, which is the point.
Why bother
I can’t claim “the spec is small enough to live in context” without a number behind it. The minimal-decoder number lets me defend the encoding choice against “you could just write English.” The canonical-decoder number lets me defend the cite-don’t-inline choice against “why not spell it out in every row.” Both are answers to objections I’ve actually gotten.
The benchmark code is at
kborovik/pilot-skills.
It’s not packaged for general use — the corpus is specifically
the SPEC.md from the same repo — but the methodology is
portable. If you’re building any notation that claims to compress
against prose, you probably want to measure both numbers, not just
the one that looks better.