"Why do you need my phone number? — wait, what was the rate again?"
Will it hold up with real customers?
Why most AI agents fall apart — and why ours doesn't.
We took two ways of building an AI agent (ours, and the typical "one big prompt" approach) and ran 400 conversations with the kind of stressed, distracted, contradictory customers your team handles all day. Here's what happened.
In high-stakes work, "pretty good" is still dangerous.
An AI agent that hallucinates details, forgets key questions, or doesn't know when to stop is not a helpful assistant. It's a liability, one that costs leads, money, and reputation. We compared two ways of building an agent.
OnePrompt
- All rules and edge cases stuffed into one giant instruction.
- The model holds everything in its head at once.
- When chaos hits, it forgets rules, skips steps, or wanders off.
MagicBlocks
- A built-in checklist knows where we are, what must happen next, and when it's truly done.
- Only the logic for the current block is loaded into the prompt.
- When chaos hits, the agent snaps back to the checklist — not to memory.
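To make the contrast concrete, here is a minimal sketch of the checklist idea in Python. It is an illustration under stated assumptions, not the MagicBlocks implementation: the `Block` and `Checklist` classes, the field names, and the instruction strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    """One step of the conversation (hypothetical structure)."""
    name: str
    instructions: str            # the only logic loaded while this block is active
    required_fields: tuple[str, ...]


@dataclass
class Checklist:
    """Tracks what has been collected and decides which block is active."""
    blocks: list[Block]
    collected: dict[str, str] = field(default_factory=dict)

    def current_block(self) -> Optional[Block]:
        # The first block that still has a missing required field is active.
        for block in self.blocks:
            if any(name not in self.collected for name in block.required_fields):
                return block
        return None  # nothing is missing, so the job is done

    def record(self, field_name: str, value: str) -> None:
        # A later answer overwrites an earlier one ("actually wait, I'm W-2").
        self.collected[field_name] = value

    def prompt_for_turn(self) -> str:
        # Completion is decided by the checklist state, not by the model's memory.
        block = self.current_block()
        return "FINISHED" if block is None else block.instructions


checklist = Checklist(blocks=[
    Block("contact", "Ask for the caller's phone number.", ("phone",)),
    Block("employment", "Ask whether the caller is W-2 or self-employed.",
          ("employment_type",)),
])
checklist.record("phone", "555-0100")
print(checklist.prompt_for_turn())  # the employment block's instructions
```

In this sketch an off-topic turn changes nothing in `collected`, so the next prompt is still driven by the first unfinished block.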
We stress-tested both agents on the customer from hell.
Not friendly, cooperative users. The kind of stressed, half-distracted, contradiction-prone caller a real mortgage broker deals with all day. Each run is a full conversation with a "User" AI instructed to behave badly on purpose.
The chaos profile
"Anyway, did you see what the Fed did this week? Crazy housing market right now…"
"I'm self-employed — actually wait, no. I'm W-2. Sorry."
"You know what — forget the purchase. Let's refinance instead."
Did the agent collect every required field and finish the job?
Under real conditions — not the friendly test-deck demos — the difference isn't 5 points or 10. It's a full 38.5 points. That's the gap between an agent your team can trust to handle leads, and one you have to watch over.
In plain business terms.
Send 1,000 leads through each system. Same marketing budget. Same traffic. Very different outcome.
MagicBlocks: ~25 lost to agent errors. The agent knows when a file is complete, and when it isn't.
OnePrompt: ~410 lost opportunities. Same spend. Same traffic. The architecture decided the outcome.
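The two figures follow from the measured success rates (97.5% for MagicBlocks, 59.0% for OnePrompt), assuming every conversation that does not end in success counts as a lost lead. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def lost_leads(total_leads: int, success_rate: float) -> int:
    """Leads lost if every non-successful conversation is a lost lead (assumption)."""
    return round(total_leads * (1 - success_rate))


print(lost_leads(1000, 0.975))  # 25
print(lost_leads(1000, 0.590))  # 410
```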
OnePrompt often did the work — and then forgot to submit it.
We split success into two questions: did the agent collect every required field (zero-miss), and did it also realize it was done (success). The space between is where leads die quietly.
MagicBlocks signals FINISHED when the checklist is actually complete. No wandering. No forgetting.
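The two questions can be scored per run and the gap read off directly. The sketch below assumes a hypothetical `RunResult` record with one boolean per question; the ten sample runs are made up for illustration and are not the study's data.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    all_fields_collected: bool   # zero-miss: nothing required was skipped
    signaled_finished: bool      # the agent realized it was done


def rates(runs: list[RunResult]) -> tuple[float, float, float]:
    """Return (zero-miss rate, success rate, gap between them)."""
    n = len(runs)
    zero_miss = sum(r.all_fields_collected for r in runs) / n
    success = sum(r.all_fields_collected and r.signaled_finished for r in runs) / n
    return zero_miss, success, zero_miss - success


# Illustrative data: 6 clean finishes, 2 runs that did the work but never
# signaled completion, 2 runs that missed a required field.
runs = (
    [RunResult(True, True)] * 6
    + [RunResult(True, False)] * 2
    + [RunResult(False, False)] * 2
)
zero_miss, success, gap = rates(runs)
print(f"zero-miss {zero_miss:.1%}, success {success:.1%}, gap {gap:.1%}")
```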
Agents don't just fail by stopping wrong. They fail by asking wrong.
We measured the Agent Hallucination Rate — the share of runs where the agent asked at least one question that didn't logically apply to the user's situation. Things like demanding a business name from a W-2 employee, or asking about rental income from a single-home owner.
OnePrompt: 100%. In every single simulation, the agent eventually hallucinated — asking for something that didn't apply to the customer.
MagicBlocks: 55%, and when it did slip, the failures were soft edge cases (e.g. asking about gift amount because the user hinted at gift funds), not the embarrassing ones. Because only the current block's logic is loaded into the prompt, the agent has far fewer openings to ask a question that doesn't fit the customer's situation.
MagicBlocks limits the damage. OnePrompt doesn't.
When users go off the rails, MagicBlocks recovers faster — and gets more done per turn.
We measured two behavioural metrics: how many consecutive stalled turns each agent allows before pulling the user back, and how many validated data points it collects per turn of conversation.
Avg max consecutive stalls
Stalled turns = off-topic, repetitive, or unproductive. Lower means the agent steers back faster.
MagicBlocks (4.55): acknowledges the user's concern, then firmly returns to the next required question. OnePrompt (5.61): more likely to get dragged into long side-conversations before remembering what it was doing.
Validated data points / turn
MagicBlocks (0.34) collects more usable data per turn than OnePrompt (0.31), despite identical chaos. At scale, this is real money in compute and saved leads.
At small scale this looks like a rounding error. Across hundreds of conversations a week, it's the difference between leads your team gets to follow up on and leads dropped halfway through.
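Both behavioural metrics are simple to compute once each turn is labelled. The sketch below assumes a per-turn boolean flag marking stalled turns and a count of validated data points; the labelling itself and the sample conversation are hypothetical.

```python
def max_consecutive_stalls(stalled: list[bool]) -> int:
    """Length of the longest unbroken run of stalled turns."""
    longest = current = 0
    for is_stalled in stalled:
        current = current + 1 if is_stalled else 0
        longest = max(longest, current)
    return longest


def data_points_per_turn(validated_points: int, turns: int) -> float:
    """Validated data points collected per turn of conversation."""
    return validated_points / turns if turns else 0.0


# Hypothetical 8-turn conversation: True marks a stalled turn.
turns = [False, True, True, False, True, True, True, False]
print(max_consecutive_stalls(turns))                               # 3
print(data_points_per_turn(validated_points=3, turns=len(turns)))  # 0.375
```

Averaging the first value across runs would give a figure comparable to "avg max consecutive stalls".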
Meet Charlie — anxious, privacy-concerned, easily distracted.
The agent needs Charlie's phone number, along with other mandatory details. Same scenario, same model, two architectures. Watch what happens.
This is the difference between a smart chatbot and a reliable system.
Every metric. Side by side.
MagicBlocks wins on every dimension we measured, and on the headline metrics the margins aren't close.
| Metric | MagicBlocks | OnePrompt | Delta |
|---|---|---|---|
| Success rate (job fully done, "FINISHED" signaled) | 97.5% | 59.0% | +38.5 pts |
| Zero-miss rate (all required fields collected) | 97.5% | 80.5% | +17.0 pts |
| Zombie-mode gap (did the work, forgot to submit) | 0% | 21.5% | −21.5 pts |
| Hallucination rate (asked ≥ 1 irrelevant question) | 55% | 100% | −45 pts |
| Avg max consecutive stalls (turns, lower is better) | 4.55 | 5.61 | −19% |
| Progress efficiency (validated data points per turn) | 0.34 | 0.31 | +10% |
| Statistical confidence vs OnePrompt (N = 400) | Z = 9.33 | baseline | p < 0.00001 |
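
The reported Z value is consistent with a pooled two-proportion z-test on the success rates, if the 400 conversations were split evenly at 200 per architecture. Both the choice of test and the even split are assumptions here; the page states neither. A sketch of that check:

```python
import math


def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic for H0: both arms share one success rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err


def one_sided_p(z: float) -> float:
    """Upper-tail p-value of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Assumption: 200 runs per arm.
# 97.5% of 200 = 195 successes; 59.0% of 200 = 118 successes.
z = two_proportion_z(195, 200, 118, 200)
print(f"z = {z:.2f}")                       # z = 9.33
print(f"p below 0.00001: {one_sided_p(z) < 1e-5}")
```

Under these assumptions every percentage in the table also corresponds to a whole number of runs out of 200.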
An agent you can actually trust with your leads.
Most AI agents on the market today are built around one big prompt and a hope the model "figures it out." Our 400 conversations say that approach doesn't hold up the moment real customers get on the line — the kind who interrupt, change their minds, or wander off-topic.
You can tweak a prompt forever and still not fix the underlying problem: you're asking a chatty assistant to behave like a careful process — without giving it the structure that makes "careful" possible.
MagicBlocks gives it that structure. Three things make the difference in practice:
If you want AI that holds up when conversations get messy — and still finishes the job — it's not about a fancier prompt. It's about an agent designed for it.
Read the full research, on us.
400 stress-tested conversations, the full methodology, the Charlie transcripts, the architecture that makes it work. Drop your email and we'll unlock the page — no spam, no follow-ups unless you ask.