
What one AI email actually costs

Tags: ai-agents · llm-engineering · roi · cost · observability · pydantic-ai

The question behind the headlines

If you run a business, the AI story you have been reading lately is a horror story. One company reportedly burned through hundreds of millions of dollars in a single month after turning an AI assistant loose on its staff without limits. The industry even has a name for the behavior that got them there — “tokenmaxxing,” the race to consume as many AI tokens as possible on the theory that more usage means more productivity. And the most-cited research of the year found that roughly 95% of enterprise AI pilots showed no measurable financial return.

So the question on the table for any CIO or CFO is fair and blunt: is AI in production a money pit?

Here is the thing the horror stories miss. Token volume was never the metric. A company can burn a fortune in tokens and get nothing, or spend almost nothing and get real work done. What matters is cost per useful outcome — cost per ticket resolved, per document processed, per email answered. Almost nobody publishes that number from a real production system, because measuring it requires instrumenting the system properly in the first place.

This post publishes one.

The system

MailPilot is a production AI agent I built end to end. It reads an incoming business email, searches a knowledge base of product documentation, and replies with a sourced answer in under a minute. If the question is about something not in its knowledge base, it declines and says so rather than guessing. You can email it yourself — it is a live system, not a demo video.

The scenario is a common one: a customer or prospect emails a detailed product question (“what is the permeate flow rate and motor rating for model X?”), and someone on your team has to find the answer in the documentation and write a sourced reply. It maps cleanly onto pre-sales support, RFP responses, and tier-one customer questions. The same pattern generalizes to other high-volume, lookup-heavy work, but the numbers in this post were measured on this scenario only.

The number

Every model call MailPilot makes is traced. Answering one fresh product question, grounded in the documentation and sent back to the customer, costs about $0.03 and takes roughly 20 seconds.

That is the operating cost — the cost to run one email through the system. It is not the cost to build the system; more on that below.

The risk/reward, in numbers you can check

The honest comparison is against what answering that same email costs today. Assume a knowledgeable person at a loaded rate of $40/hour. Answering one sourced spec question means reading it, finding the right model in the documentation, and writing the reply: call it 7 minutes once you account for lookup time and the occasional correction. At $40/hour, 7 minutes is about $4.67, so call it $5 per email, and the customer waits hours for it.

| | MailPilot (measured) | A person (assumption) |
|---|---|---|
| Cost per sourced reply | ~$0.03 | ~$5.00 |
| Response time | ~20 seconds | hours |
| Out-of-scope questions | declines, no guessing | varies |
| At 1,000 emails / month | ~$30 | ~$5,000 |

That is more than a 100× difference in cost, and hours-to-seconds in speed.
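The arithmetic behind that comparison is small enough to rerun with your own inputs. The sketch below uses the figures from this post ($40/hour, 7 minutes, $0.03 per email, 1,000 emails a month); the function and constant names are illustrative and not part of MailPilot.

```python
# Illustrative cost-per-outcome arithmetic; names are hypothetical.

AGENT_COST_PER_EMAIL = 0.03   # measured operating cost from this post, USD
HOURLY_RATE = 40.0            # assumed loaded rate, USD/hour
MINUTES_PER_EMAIL = 7.0       # assumed handling time per sourced reply
EMAILS_PER_MONTH = 1_000


def human_cost_per_email(hourly_rate: float, minutes: float) -> float:
    """Cost of one reply written by a person, in USD."""
    return hourly_rate * minutes / 60.0


def monthly_cost(cost_per_email: float, emails_per_month: int) -> float:
    """Total monthly cost at a given per-email cost, in USD."""
    return cost_per_email * emails_per_month


human = human_cost_per_email(HOURLY_RATE, MINUTES_PER_EMAIL)
ratio = human / AGENT_COST_PER_EMAIL

print(f"Human cost per email: ${human:.2f}")
print(f"Agent cost per email: ${AGENT_COST_PER_EMAIL:.2f}")
print(f"Ratio (human / agent): {ratio:.0f}x")
print(f"Agent, monthly: ${monthly_cost(AGENT_COST_PER_EMAIL, EMAILS_PER_MONTH):,.0f}")
print(f"Human, monthly: ${monthly_cost(human, EMAILS_PER_MONTH):,.0f}")
```

The unrounded human figure is about $4.67 per email, so the ratio comes out slightly lower than it does with the rounded $5 in the table. Either way it stays above 100×. Substitute your own rate and handling time to see where the ratio lands for your team.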

The point is not “replace people.” You cannot fire 1/100th of an employee, and a CFO knows it. The point is that a category of repetitive, lookup-heavy email stops consuming thousands of dollars of skilled time each month and gets answered instantly and consistently — freeing that person for work that actually needs them.

The risk side

The real risk with AI email is not cost. It is an agent confidently inventing a spec, a price, or a model number and sending it to a customer over your name. That is a brand and liability problem, not a budget line.

This is the part most pilots skip, and it is where the engineering discipline lives. MailPilot is built to ground every answer in a source document and to cite the file it used, so any answer can be checked against the original. When a question falls outside its knowledge base, it is built to decline rather than improvise.

That behavior is measured, not asserted. In a recent test run of 29 inbound emails, every one of the 5 out-of-scope questions was correctly declined — none produced a fabricated specification. The answered questions each cited their source document. Under a burst of concurrent traffic the system hit 2 transient tool errors, both contained — the emails still went out, and nothing crashed. I am not going to claim an AI is never wrong. I am going to claim this one refuses to invent facts when it lacks a source, and that the claim is checkable in the logs.

Why most AI spend fails, and what makes the difference

The 95% of pilots that show no return tend to share a pattern: nobody measured the cost per outcome, nobody capped usage, nobody instrumented what the system was actually doing, and nobody designed it to fail safely. They measured token volume, watched the bill climb, and could not connect the spend to value.

The disciplined version is the opposite, and none of it is exotic. Measure cost per outcome. Cap usage so a bug cannot run up a fortune overnight. Instrument every call so you can see where the money goes. Ground answers in real sources and decline when uncertain. That is the difference between the 95% and the 5% — and it is mostly a question of how the system is built, not which model it uses.
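To make "cap usage" concrete, here is a minimal sketch of a spend cap that is checked before each model call and updated after it. It is not MailPilot's actual guard: the class name, the limit, and the per-call cost are all hypothetical, and a real system would persist the counter and reset it on a schedule.

```python
# Minimal spend-cap sketch; SpendCap and its values are hypothetical.
from dataclasses import dataclass


@dataclass
class SpendCap:
    limit_usd: float        # maximum spend allowed in the current window
    spent_usd: float = 0.0  # spend recorded so far in the window

    def allows_another_call(self) -> bool:
        """True while recorded spend is still below the limit."""
        return self.spent_usd < self.limit_usd

    def record(self, cost_usd: float) -> None:
        """Add the measured cost of a completed call."""
        if cost_usd < 0:
            raise ValueError("cost must be non-negative")
        self.spent_usd += cost_usd


cap = SpendCap(limit_usd=0.10)
handled = 0
for _ in range(10):                 # ten emails waiting in the queue
    if not cap.allows_another_call():
        break                       # stop processing instead of running up the bill
    cap.record(0.03)                # assumed cost of one email, taken from the trace
    handled += 1

print(f"handled {handled} emails, spent ${cap.spent_usd:.2f}")
```

Because the check happens before the call and the cost is only known afterwards, the cap can overshoot by at most one call's cost. That is usually an acceptable trade for a guard this simple.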

What this costs to build

The $0.03 is operating cost. Building a production agent — the retrieval, the grounding, the guardrails, the observability, the deployment that scales to zero when idle so you are not paying for it overnight — is the one-time investment. That is the work I do. I am stating the distinction plainly because an ROI number that hides the build cost is exactly the kind of claim the AI hype cycle has earned its skepticism for.

The honest boundaries

So you can weigh this properly: the $0.03 figure is the cost of a fresh question; an ongoing back-and-forth thread costs more, because the conversation history is re-sent each turn. The $5 human figure is my assumption, laid out above so you can substitute your own. And the accuracy numbers are from an initial test run — I am compiling 2 weeks of data for a fuller picture and will update this post.

If you are evaluating whether AI can do real work in your business without becoming a money pit, that is precisely the question I build for and measure against. If you want to see the cost-per-outcome math on a workflow in your own operation, book a call.


For technical reviewers

The number is read from OpenTelemetry traces in Logfire. Each agent run is a Pydantic AI invoke_agent span; per-call cost and token usage come from the gen_ai.usage.* and operation.cost attributes, summed over the tool-use loop (search the knowledge base → read the source document → compose and send the reply). A single fresh in-scope question runs ~17K input tokens — most of it system prompt, tool definitions, and the retrieved source document, with about half served from the prompt cache — and a few hundred output tokens.
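The summing step can be sketched as below. This is an illustration, not the query MailPilot runs against Logfire. It assumes the spans have been exported as plain dictionaries with an `attributes` mapping, and that the keys are `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, and `operation.cost`. The sample values are made up to add up to round numbers and are not measured data.

```python
# Sketch: total tokens and cost for one agent run from exported span dicts.
# The dict shape, attribute keys, and sample values are assumptions.
from typing import Any, Iterable, Mapping


def summarize_run(spans: Iterable[Mapping[str, Any]]) -> dict[str, float]:
    """Sum token usage and cost across every span in one agent run."""
    totals = {"input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
    for span in spans:
        attrs = span.get("attributes", {})
        # Spans without usage attributes (e.g. tool calls) contribute zero.
        totals["input_tokens"] += int(attrs.get("gen_ai.usage.input_tokens", 0))
        totals["output_tokens"] += int(attrs.get("gen_ai.usage.output_tokens", 0))
        totals["cost_usd"] += float(attrs.get("operation.cost", 0.0))
    return totals


# Illustrative spans for one run: three model calls and one tool call.
example_spans = [
    {"name": "chat", "attributes": {"gen_ai.usage.input_tokens": 5000,
                                    "gen_ai.usage.output_tokens": 100,
                                    "operation.cost": 0.009}},
    {"name": "tool", "attributes": {}},
    {"name": "chat", "attributes": {"gen_ai.usage.input_tokens": 6000,
                                    "gen_ai.usage.output_tokens": 100,
                                    "operation.cost": 0.010}},
    {"name": "chat", "attributes": {"gen_ai.usage.input_tokens": 6000,
                                    "gen_ai.usage.output_tokens": 200,
                                    "operation.cost": 0.011}},
]

run_totals = summarize_run(example_spans)
print(run_totals)
```

Dividing the summed cost by the number of emails that actually received a sourced reply gives the cost per outcome, as opposed to a raw token count.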

The per-email cost scales with thread length: as a conversation accumulates history, each turn re-sends it, so input tokens (and cost) climb. In a stress test that fired 25+ emails into one mailbox, per-email cost ramped from ~$0.027 on a fresh thread to ~$0.17 once the history grew to 50-plus messages. The representative figure for the “customer asks a question, gets an answer” scenario is the fresh-thread cost. The ramp is not a fixed cost: it can be controlled with history truncation and summarization.
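History truncation, the simpler of those two controls, can be sketched as follows. This is not MailPilot's implementation: the function names are hypothetical, and the token estimate (about four characters per token) is a rough assumption standing in for a real tokenizer.

```python
# Sketch: keep only the newest messages that fit a token budget.
# estimate_tokens is a crude heuristic, not a real tokenizer.


def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about four characters per token."""
    return max(1, len(text) // 4)


def truncate_history(messages: list[str], max_tokens: int) -> list[str]:
    """Return the most recent messages whose estimated tokens fit the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break                        # everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()                       # restore chronological order
    return kept


# Fifty synthetic messages of 400 characters each (~100 estimated tokens).
history = [f"{i:03d}" + "x" * 397 for i in range(50)]
recent = truncate_history(history, max_tokens=1000)

print(len(history), "messages in thread,", len(recent), "re-sent to the model")
```

With a fixed budget the input size stops growing with thread length, so per-email cost flattens instead of ramping. The trade-off is that older context is lost, which is why summarizing the dropped messages is the usual companion to truncation.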

The agent, the smoke-test harness, and the grounding logic are on GitHub: github.com/kborovik/mailpilot.

Trace screenshots

The Logfire timeline below lists successive MailPilot email-handling spans — sync.send_email, sync.loop.iteration, run.account.run, run.execute_task — with per-span duration down the right column.

[Screenshot: Logfire trace timeline view in the lab5.ca/mailpilot project, showing a list of MailPilot email-handling spans with timestamps and per-span durations.]

Drilling into a single email-handling run opens the span tree: routing.route_email calls agent.classify_email, which starts the mailpilot.classifier run span, which in turn wraps the underlying chat claude-sonnet-4-6 LLM call. Each chat row shows input and output token counts, which are the data the per-email cost figure is computed from.

[Screenshot: Logfire span tree drill-down, showing nested routing, classification, and chat claude-sonnet-4-6 LLM calls with input/output token counts per row.]