The real cost of cheap AI: what agencies leave out of the demo


The demo costs €4,000. Production costs something else. A concrete look at the line items missing from almost every AI automation proposal.


The demo always works. The email comes in, the agent replies, the ticket gets routed, everyone nods. The first month’s bill in production is what catches people off guard.

The problem isn’t OpenAI or ChatGPT; it’s how AI automation gets quoted. A handful of line items almost never show up in the proposal, and they end up being the bulk of the real cost.

Worth listing them before you sign.


What the proposal actually covers

A typical agency proposal includes discovery, design, integration, deployment, and a month of close support. The numbers swing between €3,000 and €30,000 depending on scope.

That part isn’t usually where things break. What breaks is what isn’t there.


[Image: flat-style iceberg. The small visible tip is the demo; the large submerged mass is the real cost.]

The demo is the tip. What pays for the project lives below the waterline.

What the proposal almost never covers

Real production model cost

The demo runs on a couple of hundred calls. Production is thousands or tens of thousands. And per-token costs on the strong models (GPT-4.1, Claude Sonnet 4.5, Gemini 2.5 Pro) add up fast once you have long context or chained tool calls.

The number in the proposal usually reflects the cheaper model. The demo runs on the most capable one. The gap between them can be 10x. If the agency doesn’t spell out which model goes to production and what it’ll cost at the expected volume, expect the first surprise bill.

Ask for it in the proposal as an explicit calculation: expected requests per month, average tokens per request, which model, estimated monthly cost, and a margin of error.
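
As a sketch, the back-of-the-envelope version of that calculation looks like this (every number below is an illustrative placeholder, not a quote):

```python
# Back-of-the-envelope monthly model cost. Every figure here is an
# illustrative placeholder -- substitute your own volumes and the
# provider's current price list.
requests_per_month = 20_000
input_tokens_per_request = 3_000      # prompt + context + tool results
output_tokens_per_request = 500

price_in_per_m = 3.00                 # € per 1M input tokens (placeholder)
price_out_per_m = 15.00               # € per 1M output tokens (placeholder)

monthly_input = requests_per_month * input_tokens_per_request / 1e6 * price_in_per_m
monthly_output = requests_per_month * output_tokens_per_request / 1e6 * price_out_per_m

print(f"input:  €{monthly_input:,.0f}/month")   # €180 with these numbers
print(f"output: €{monthly_output:,.0f}/month")  # €150 with these numbers
print(f"total:  €{monthly_input + monthly_output:,.0f}/month, before margin of error")
```

Run it twice: once with the demo model’s prices, once with the production model’s. That 10x gap is the number to surface before signing.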

Observability and evals

LangSmith, Langfuse, Helicone, or your own instrumentation built on OpenTelemetry. Without it, when something breaks in production, nobody can see where. And it will break.

Eval tools (promptfoo, Braintrust, Patronus AI) cost time to set up and keep current. A serious agency wires in observability and a minimal eval suite from day one. A cheap agency leaves it for “phase two” — which rarely arrives before the first incident.

Realistic cost: €100–€500 a month in tooling, plus the time to set it up well.
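
The floor, before any paid tooling, is a thin logging wrapper around every model call. A minimal sketch, assuming the OpenAI Python SDK (any provider SDK that returns usage metadata works the same way):

```python
import logging
import time
import uuid

from openai import OpenAI  # pip install openai

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def traced_completion(model: str, messages: list[dict]) -> str:
    """Call the model and log what an observability tool would capture:
    a trace id, latency, token usage, and the model actually served."""
    trace_id = uuid.uuid4().hex[:8]
    start = time.monotonic()
    resp = client.chat.completions.create(model=model, messages=messages)
    elapsed = time.monotonic() - start
    log.info(
        "trace=%s model=%s latency=%.2fs in_tokens=%d out_tokens=%d",
        trace_id, resp.model, elapsed,
        resp.usage.prompt_tokens, resp.usage.completion_tokens,
    )
    return resp.choices[0].message.content
```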

Prompt maintenance

Prompts drift. Not because you change them, but because the provider updates the model underneath. What worked on GPT-4 stopped working the same way on GPT-4o. What worked on Claude 3.5 changed character on Claude 4. Anthropic has deprecated versions; OpenAI has done quiet rollouts.

Any critical prompt needs a test suite that runs every time the model changes. And it needs someone watching when things fail.
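
A minimal sketch of what that suite can look like, using pytest. `run_prompt` is a hypothetical stand-in for whatever invokes your production prompt; the cases and routes are illustrative:

```python
import pytest

from myagent import run_prompt  # hypothetical: your production prompt call

# Golden cases: re-run whenever the provider ships a new model version,
# before it reaches production.
CASES = [
    ("Where is my order #1234?", "order_status"),
    ("I want to cancel my subscription", "cancellation"),
    ("asdf qwerty", "escalate_to_human"),
]

@pytest.mark.parametrize("user_message,expected_route", CASES)
def test_routing_survives_model_update(user_message, expected_route):
    assert run_prompt(user_message).route == expected_route
```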

This usually sits outside the standard maintenance contract. Ask: what happens when OpenAI deprecates the model we’re using?

Integrations that break

Every connector to an external system (CRM, ERP, helpdesk, your email, your calendar) has its own failure rate. Salesforce changes its API. HubSpot changes webhook payloads. Google modifies OAuth scopes. Microsoft does something strange with Graph every quarter.

If the proposal doesn’t budget a fixed share of maintenance time for integrations, assume every break will come back as a separately billed intervention. Automations heavy on connectors (Make, Zapier, n8n) push that maintenance ratio up — they’re fragile by design.
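
What that maintenance time buys, in miniature: retries with backoff, and a loud failure instead of a silent one. A sketch; the endpoint and the alerting hook are placeholders:

```python
import time

import requests

def notify_oncall(message: str) -> None:
    # Placeholder: wire this to Slack, PagerDuty, or email.
    print(f"ALERT: {message}")

def call_crm(payload: dict, retries: int = 3) -> dict:
    """Post to an external API, retrying transient failures with backoff.
    This is the code path that breaks when the vendor changes its API."""
    for attempt in range(retries):
        try:
            resp = requests.post(
                "https://crm.example.com/api/tickets",  # placeholder endpoint
                json=payload,
                timeout=10,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            if attempt == retries - 1:
                notify_oncall(f"CRM integration failing: {exc}")
                raise
            time.sleep(2 ** attempt)  # 1s, then 2s, before the last try
```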

PII handling and security

If your agent touches personal data, there’s invisible work that almost nobody quotes:

  • PII redaction or masking before sending to the model (Presidio, scrubadub, in-house solutions; a minimal sketch follows below).
  • Log retention and deletion policy.
  • Prompt injection and output filters (Lakera, Protect AI, NeMo Guardrails, or your own filtering).
  • Subprocessor audit and data residency.

This isn’t optional under GDPR. The question is whether you’re billed for it separately or it’s included. Look for it in the proposal.
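
For the first item on that list, a minimal sketch using Microsoft’s Presidio (it also needs a spaCy language model installed, and entity coverage has to be tuned per language and domain):

```python
# pip install presidio-analyzer presidio-anonymizer
# plus a spaCy model, e.g.: python -m spacy download en_core_web_lg
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def redact(text: str) -> str:
    """Mask detected PII before the text ever reaches the model provider."""
    findings = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

print(redact("Call John Smith at +1 212 555 0101 about his invoice"))
# -> "Call <PERSON> at <PHONE_NUMBER> about his invoice"
```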

Edge cases and fallback

What does the agent do when the model is down? When the output is ambiguous? When a user tries to abuse the system?

The Air Canada case in 2024 is instructive: a British Columbia tribunal ruled the airline liable for incorrect information given by its chatbot — and the “it’s an independent agent” defence didn’t hold. The bill for not defining the agent’s boundaries properly ended up legal, not technical.

DPD went through something similar in 2024: after an update, its chatbot swore at a customer and wrote a poem criticising the company itself. They had to pull it within hours.

Designing the fallback well (when to escalate to a human, when to stay silent, when to ask for confirmation) is real work. It’s rarely quoted.
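
One way to make those boundaries explicit. A sketch; the thresholds, topics, and the idea of a confidence score are illustrative assumptions, not a recipe:

```python
from dataclasses import dataclass

@dataclass
class AgentOutput:
    answer: str
    confidence: float  # 0.0-1.0, however your pipeline scores it
    topic: str

# Topics the agent must never answer on its own -- illustrative list.
ALWAYS_ESCALATE = {"refunds", "legal_threat", "account_closure"}

def decide(output: AgentOutput) -> str:
    """An explicit fallback policy: escalate, confirm, or answer.
    The point is that these boundaries are design decisions someone
    has to make (and quote), not emergent model behaviour."""
    if output.topic in ALWAYS_ESCALATE:
        return "escalate_to_human"
    if output.confidence < 0.5:
        return "escalate_to_human"    # too uncertain: stay silent, hand off
    if output.confidence < 0.8:
        return "ask_user_to_confirm"  # ambiguous: confirm before acting
    return "send_answer"
```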

The cost of going back

Klarna announced in 2024 that its AI assistant was doing the work of 700 human agents. By 2025, its CEO admitted they had cut too far and started rehiring — quality had measurably dropped and customers noticed.

McDonald’s killed its IBM AI drive-thru pilot in June 2024, after three years and viral videos of the system misfiring.

Builder.ai, valued at $1.5 billion, collapsed in May 2025. Part of the story: what they sold as “AI building apps” was, in practice, engineering teams in India without the promised automation underneath.

What this teaches in practice: AI projects carry a real cost of going back. Migrating from a failed agent back to the prior process isn’t trivial — flows changed, the people who knew the old work got reorganised, the data is tangled. Any serious proposal should include a credible exit plan.


The simple rule

Ask for the proposal broken into four distinct lines:

  • Build — discovery, design, integration, deployment.
  • Run — monthly cost of models, infrastructure, observability tooling, evals.
  • Maintain — contracted hours per month for incidents, model regressions, prompt and integration evolution.
  • Exit — what happens to code, data, and embeddings when the contract ends.

If an agency struggles to give you these four lines separately, it isn’t a transparency problem: they haven’t thought them through.

And if they have thought about them but didn’t quote them in the proposal, you’re going to pay for them anyway. Just later, in small invoices that add up.


The first step to evaluating a proposal well is knowing what you want to automate and what that work is worth today. Start with the free diagnostic at canihireanai.com — before asking for quotes.