12 questions to ask an AI agency before you sign
The questions that separate an engineering team from a Make shop with a pitch deck. Concrete, technical, and answerable in ways you can verify.
Most AI agency proposals look the same. A few slides of use cases, a block diagram with arrows, a price that feels reasonable, and a promise of “end-to-end automation.”
The proposal isn’t the problem. The problem is that very different stacks hide behind nearly identical decks. Some agencies build agents on LangGraph, with proper observability and per-token cost controls. Others chain webhooks together in Make or n8n, call it an “agent,” and bill it like engineering.
These twelve questions don’t guarantee you’ll pick the right partner. They will filter out the ones who don’t have a clear answer.
Architecture and stack
1. What do you build in code, and what in no-code?
An honest answer draws a clear line between the two. No-code is fine for simple, low-criticality, low-volume flows. Once sensitive data, custom integrations, or non-trivial business logic show up, you need code. If the answer is “everything in n8n / Make / Zapier because it’s faster,” assume you’ll pay the technical debt later — usually when you try to scale or change vendor.
2. Which orchestrator or framework do you use for agents?
You want to hear concrete names: LangGraph, CrewAI, AutoGen, Semantic Kernel, Pydantic AI, or an in-house orchestrator. “We just use prompts and direct API calls” is fine for small cases — not for an agent making chained decisions in production. “We use custom GPTs” isn’t an architecture.
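If you want to picture the difference, here is a minimal, framework-free sketch of the loop that orchestrators like LangGraph formalize: explicit state, tool dispatch, a hard step limit. Everything in it, call_model included, is a stub for illustration, not any framework’s real API.

```python
# A toy agent loop: plan -> act -> observe, with explicit state and a step cap.
# Orchestrators formalize exactly this, plus retries, branching, and persistence.

def call_model(state: dict) -> dict:
    """Stub: a real system would call an LLM here and parse its decision."""
    if state["results"]:
        return {"action": "finish", "input": None}
    return {"action": "search", "input": state["task"]}

TOOLS = {
    "search": lambda query: f"top result for {query!r}",  # stub tool
}

def run_agent(task: str, max_steps: int = 5) -> dict:
    state = {"task": task, "results": [], "steps": 0}
    while state["steps"] < max_steps:        # hard limit: no runaway loops
        decision = call_model(state)
        if decision["action"] == "finish":
            break
        tool = TOOLS[decision["action"]]     # chained decision -> tool dispatch
        state["results"].append(tool(decision["input"]))
        state["steps"] += 1
    return state

print(run_agent("find the vendor's SLA terms"))
```

An agency that can sketch their version of this on a whiteboard, with their actual framework’s names in it, is giving you a real answer.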
3. How do you handle agent state and memory across runs?
It’s a simple question with a telling answer. If the agent only reacts to an input and remembers nothing, that’s a workflow, not an agent. If the answer mentions vector stores (Pinecone, Weaviate, pgvector), synced knowledge bases, or explicit context management, you’re in good hands.
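As a rough illustration of the mechanics, here is what vector-store-backed memory reduces to: store embeddings, retrieve by similarity. The embed function is a toy stand-in for a real embedding model, and the in-memory list stands in for pgvector or Pinecone.

```python
import math

def embed(text: str) -> list[float]:
    """Toy embedding: letter-frequency vector. Real systems call an embedding model."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class Memory:
    """Stand-in for a vector store: remember text, recall the k nearest items."""
    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def remember(self, text: str) -> None:
        self.items.append((text, embed(text)))

    def recall(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

memory = Memory()
memory.remember("Customer prefers invoices in PDF")
memory.remember("Ticket #412 was escalated twice")
print(memory.recall("how does the customer want invoices?"))
```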
4. Which models do you use, and why those over others?
An agency that only ever ships OpenAI has a blind spot. Each family has its own trade-offs in cost, latency, reasoning quality, and EU availability. A reasonable answer mentions several — Claude for long-form reasoning and code, GPT for general-purpose generation, Mistral or Gemini for cost and latency, self-hosted open-source when data can’t leave the perimeter. And there should be an argument for why this case fits this model, not just “it works best.”
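The argument can be made concrete with a routing table. A sketch; the model names and rationales below are illustrative placeholders, not recommendations:

```python
# Route each task profile to the model whose trade-offs fit it.
# Names and rationales are illustrative, not an endorsement of specific models.

ROUTES = {
    "long_reasoning": ("claude-sonnet", "long context, strong reasoning and code"),
    "general":        ("gpt-4o", "balanced general-purpose generation"),
    "high_volume":    ("mistral-small", "low cost and latency at scale"),
    "sensitive":      ("self-hosted-llama", "data never leaves the perimeter"),
}

def pick_model(task_profile: str) -> str:
    model, rationale = ROUTES[task_profile]
    print(f"{task_profile}: {model} ({rationale})")
    return model

pick_model("high_volume")
pick_model("sensitive")
```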
Operations and cost
5. How do you monitor agents in production?
You’re listening for specific tools: LangSmith, Langfuse, Arize, Helicone, or Datadog with custom instrumentation. Without observability, when something breaks nobody will know why. And it will break — models change, providers update APIs, prompts drift.
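Custom instrumentation starts with one structured record per model call, which is roughly what the tools above give you out of the box. A sketch, with call_model as a stub and token counts that would normally come from the provider’s response:

```python
import json
import time
import uuid

def call_model(prompt: str) -> dict:
    """Stub: a real call returns text plus usage data from the provider."""
    return {"text": "stub answer", "input_tokens": 120, "output_tokens": 45}

def instrumented_call(prompt: str, feature: str) -> str:
    start = time.monotonic()
    response = call_model(prompt)
    record = {
        "trace_id": str(uuid.uuid4()),
        "feature": feature,            # which part of the product made the call
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
        "input_tokens": response["input_tokens"],
        "output_tokens": response["output_tokens"],
        "prompt_chars": len(prompt),
    }
    print(json.dumps(record))          # in production, ship this to your log pipeline
    return response["text"]

instrumented_call("Summarise this ticket...", feature="ticket-triage")
```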
6. How do you control per-request cost and monthly spend?
The answer should include some combination of: per-user or per-tenant limits, spend alerts, response caching, routing simple tasks to cheaper models, and per-feature tracking. If nobody has thought about this before launch, the first surprise OpenAI bill arrives in month three.
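A per-tenant budget guard can be this small. The per-token prices below are made-up placeholders; the point is that limits and alerts exist before launch, not after the bill:

```python
from collections import defaultdict

BUDGET_EUR = 50.0                  # monthly cap per tenant
ALERT_AT = 0.8                     # alert at 80% of budget
spend = defaultdict(float)         # tenant -> EUR spent this month

def charge(tenant: str, input_tokens: int, output_tokens: int) -> None:
    # Placeholder prices; real numbers come from your provider's price sheet.
    cost = input_tokens * 3e-6 + output_tokens * 15e-6
    if spend[tenant] + cost > BUDGET_EUR:
        raise RuntimeError(f"{tenant}: monthly budget exhausted")
    spend[tenant] += cost
    if spend[tenant] > BUDGET_EUR * ALERT_AT:
        print(f"ALERT: {tenant} at {spend[tenant]:.2f} of {BUDGET_EUR} EUR")

charge("acme", input_tokens=2000, output_tokens=800)
print(f"acme spend so far: {spend['acme']:.4f} EUR")
```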
7. What happens when the model changes or the provider goes down?
Anthropic deprecates versions. OpenAI had several global outages in 2025. Azure OpenAI has hit regional quota limits. A serious agency answers with: a provider abstraction layer, regression tests on critical prompts when the model changes, and a fallback plan. If the answer is “we’ll deal with it when it happens,” assume that dealing with it will arrive later as an invoice.
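A provider abstraction layer can start as small as the sketch below. Both provider functions are stubs and the outage is simulated; the structure, an ordered list of interchangeable backends, is the part that matters:

```python
class ProviderDown(Exception):
    pass

def primary(prompt: str) -> str:
    raise ProviderDown("simulated outage")   # pretend the main provider is down

def fallback(prompt: str) -> str:
    return "answer from fallback provider"

PROVIDERS = [("primary", primary), ("fallback", fallback)]

def complete(prompt: str) -> str:
    for name, provider in PROVIDERS:         # try backends in priority order
        try:
            return provider(prompt)
        except ProviderDown:
            print(f"{name} unavailable, trying next")
    raise RuntimeError("all providers down")

print(complete("Draft a reply to this complaint"))
```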
8. How do you test prompts and outputs?
The serious practice is automated evaluation — promptfoo, Braintrust, Patronus, or in-house test sets. The question underneath: how do you know a change improves the system rather than quietly breaking it? “We test it manually” only scales until it doesn’t.
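An in-house test set can start as a handful of input/assertion pairs run on every prompt change. The sketch below stubs the model call; the shape, a pass rate gating the deploy, is what tools like promptfoo automate:

```python
def call_model(prompt: str) -> str:
    """Stub: real runs would hit the live system."""
    return "You can cancel within 14 days at no cost."

EVAL_SET = [
    # (input, predicate the output must satisfy)
    ("What is the cancellation window?", lambda out: "14 days" in out),
    ("Is cancellation free?",            lambda out: "no cost" in out.lower()),
]

def run_evals(threshold: float = 0.9) -> bool:
    passed = sum(1 for prompt, check in EVAL_SET if check(call_model(prompt)))
    rate = passed / len(EVAL_SET)
    print(f"pass rate: {rate:.0%} ({passed}/{len(EVAL_SET)})")
    return rate >= threshold       # a prompt change ships only if this holds

assert run_evals(), "regression: do not ship this prompt change"
```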
Compliance, data, and accountability
9. Where is data processed, and what processor agreement do you sign?
Direct question, with a verifiable answer. You’re listening for: EU data residency where it applies, a signed GDPR processor agreement, a list of subprocessors, and a retention policy. If the agency can’t tell processor and controller apart, you already have your answer.
10. What are you doing about the AI Act?
Since February 2025, certain systems are prohibited. Since August 2025, general-purpose model providers carry obligations. In August 2026, the obligations for high-risk systems kick in — and many HR, credit scoring, healthcare, and employee management use cases land squarely in that category. A reasonable answer identifies whether your case is high-risk and, if so, what risk-management documentation, human oversight, and technical logging they’re going to deliver.
11. How do you log agent decisions for auditing?
Any automation that makes decisions about people (filtering CVs, prioritising tickets, approving applications) has to be explainable. The answer should describe structured logs of every model call, persistence of intermediate reasoning where it applies, and a way for a human to review and override. Without this, you can’t defend the system when someone challenges it — internally or externally.
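What that looks like in practice is an append-only record per decision. A sketch with an illustrative schema; the exact fields matter less than having the trail and the override slot:

```python
import json
import time

def log_decision(path: str, record: dict) -> None:
    record["timestamp"] = time.time()
    with open(path, "a") as f:               # append-only JSONL audit trail
        f.write(json.dumps(record) + "\n")

log_decision("decisions.jsonl", {
    "case_id": "cv-2024-0117",               # illustrative identifiers
    "input_summary": "candidate CV, senior backend role",
    "model": "stub-model-v1",
    "reasoning": "8 years relevant experience; required stack present",
    "decision": "advance_to_interview",
    "human_override": None,                  # filled in if a reviewer disagrees
})

with open("decisions.jsonl") as f:
    print(f.read())
```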
12. What happens to the code and data when the contract ends?
Ask it specifically: does the code live in a repository you control? Are the prompts and configurations there too? What about embeddings and vector databases? Too many projects end with the critical logic living inside a Make account or a custom GPT the agency owns. That dependency is invisible until you try to switch vendor.

How to read the answers
A solid agency doesn’t have a perfect answer for everything. They’ll admit some decisions depend on the case, that they pick certain tools late, that observability gets built in stages. What matters is that the conversation stays concrete — tool names, technical decisions with reasoning behind them, examples of past projects where something broke and they learned from it.
A weak agency answers in generalities. “We use the best tools.” “We have a robust stack.” “We comply with all regulations.” When the conversation stays at that level, you already have your answer.
These twelve questions aren’t an exam. They’re a conversation about how something that’s going to make decisions for your business will be built and run. If the conversation doesn’t hold up, the underlying problem isn’t technical, it’s judgment.
Before putting these questions to an agency, it’s worth knowing exactly what you want to automate. Start with the free diagnostic at canihireanai.com — and walk into those meetings with data instead of intuition.