May 2026 · N=400

Will it hold up with real customers?

Why most AI agents fall apart — and why ours doesn't.

We took two ways of building an AI agent (ours, and the typical "one big prompt" approach) and ran 400 simulated conversations, 200 per approach, with the kind of stressed, distracted, contradictory customers your team deals with all day. Here's what happened.

MagicBlocks Research · May 2026 · 7-min read
Headline result · Overall success rate
MagicBlocks
97.5%
195 of 200 conversations finished cleanly — every required field collected, "FINISHED" signaled.
OnePrompt
59.0%
118 of 200. The other 82 collapsed under pressure — forgotten fields, runaway tangents, or no exit.
+38.5 pts absolute reliability gap · Z=9.33, p<0.00001
01 The premise

In high-stakes work, "pretty good" is still dangerous.

An AI agent that hallucinates details, forgets key questions, or doesn’t know when to stop is not a helpful assistant. It's a liability — one that costs leads, money, and reputation. We compared two ways of building one.

OnePrompt

The typical AI agent · one big prompt

Like giving a trainee a 20-page script on day one and saying: "read this once, then handle every customer forever."

SYSTEM PROMPT
You are a mortgage assistant. Be empathetic, professional, and concise. Collect: full name, phone, email, date of birth, address (current and prior 2 years if < 2y), employment status, employer, occupation, monthly gross income, years employed, additional income sources, loan purpose (purchase / refi / cash-out / construction), property type, occupancy, estimated value, down payment, source of down payment, gift funds (if any) with donor relationship and gift amount, existing mortgage balance, lien position, credit score range, declared bankruptcies in past 7 years, foreclosures, judgments, citizenship status, marital status, dependents, military status, co-borrower info if applicable… Never ask for SSN. Always verify employment if self-employed. If the user is W-2 do not ask for business name. If user mentions rental income ask for property addresses and gross monthly rent. If…
  • All rules and edge cases stuffed into one giant instruction.
  • The model holds everything in its head at once.
  • When chaos hits, it forgets rules, skips steps, or wanders off.

MagicBlocks

Built in focused blocks · one job at a time

A digital checklist with guardrails. The agent only ever sees the current block, what's collected, and what's missing.

Contact: phone · email
Address: current
Employment: W-2 detected
Loan purpose: refi · cash-out
Property: type · value
  • A built-in checklist knows where we are, what must happen next, and when it's truly done.
  • Only the logic for the current block is loaded into the prompt.
  • When chaos hits, the agent snaps back to the checklist — not to memory.
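
To make the block idea concrete, here is a minimal Python sketch. It is not MagicBlocks' actual implementation: the `Block` and `Checklist` names, the field names, and the prompt layout are all assumptions made for illustration. It only shows the principle described above, where the prompt is rebuilt each turn from the current block, what's collected, and what's missing.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Block:
    """One focused step of the conversation (hypothetical structure)."""
    name: str
    required: list[str]
    instructions: str


@dataclass
class Checklist:
    blocks: list[Block]
    collected: dict[str, str] = field(default_factory=dict)

    def current_block(self) -> Block | None:
        """First block that still has a missing required field, else None."""
        for block in self.blocks:
            if any(f not in self.collected for f in block.required):
                return block
        return None

    def missing(self) -> list[str]:
        """Required fields of the current block that are not yet collected."""
        block = self.current_block()
        if block is None:
            return []
        return [f for f in block.required if f not in self.collected]

    def build_prompt(self) -> str:
        """Prompt holds only the current block, what's collected, what's missing."""
        block = self.current_block()
        if block is None:
            return "FINISHED"  # nothing left on the checklist
        return (
            f"Current block: {block.name}\n"
            f"Instructions: {block.instructions}\n"
            f"Collected so far: {sorted(self.collected)}\n"
            f"Still missing: {self.missing()}"
        )


# Hypothetical two-block flow, loosely following the diagram above.
checklist = Checklist(blocks=[
    Block("Contact", ["phone", "email"], "Collect contact details."),
    Block("Address", ["current_address"], "Collect the current address."),
])
checklist.collected["phone"] = "415-555-0142"
print(checklist.build_prompt())
```

Because the prompt is derived from state rather than from the model's memory of a long instruction, a tangent in the conversation cannot change which field is asked for next.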
02 The micro-gauntlet

We stress-tested both agents on the customer from hell.

Not friendly, cooperative users. The kind of stressed, half-distracted, contradiction-prone caller a real mortgage broker takes calls from all day. Each run is a full conversation with a "User" AI instructed to behave badly on purpose.

Total simulations
400
MagicBlocks runs
200 conversations
OnePrompt runs
200 conversations

The chaos profile

85%
The Interrupter
Breaks the flow.

"Why do you need my phone number? — wait, what was the rate again?"

60%
The Wanderer
Drifts off-topic.

"Anyway, did you see what the Fed did this week? Crazy housing market right now…"

35%
The Flip-Flopper
Contradicts themselves.

"I'm self-employed — actually wait, no. I'm W-2. Sorry."

Event
The Pivot
Switches mid-flow.

"You know what — forget the purchase. Let's refinance instead."

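One way to picture how a chaos profile like this could drive a simulated user is sketched below. The page does not say how the percentages were applied, so treating them as independent per-conversation probabilities is an assumption, as are the names `CHAOS_PROFILE` and `sample_persona`. The Pivot is listed as an event rather than a percentage and is left out of the sketch.

```python
import random

# Shares taken from the chaos profile above. Treating them as independent
# per-conversation probabilities is an assumption of this sketch.
CHAOS_PROFILE = {
    "interrupter": 0.85,
    "wanderer": 0.60,
    "flip_flopper": 0.35,
}


def sample_persona(rng: random.Random) -> set[str]:
    """Pick which chaos behaviours a simulated user exhibits in one run."""
    return {name for name, p in CHAOS_PROFILE.items() if rng.random() < p}


rng = random.Random(42)  # fixed seed so the sample is reproducible
personas = [sample_persona(rng) for _ in range(200)]
share = sum("interrupter" in p for p in personas) / len(personas)
print(f"Interrupter share in this sample: {share:.0%}")
```
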
03 Core result · reliability

Did the agent collect every required field and finish the job?

Under real conditions — not the friendly test-deck demos — the difference isn't 5 points or 10. It's a full 38.5 points. That's the gap between an agent your team can trust to handle leads, and one you have to watch over.

MagicBlocks · modular, state-aware
97.5%
195 / 200 conversations · success
Enterprise-grade reliability under hostile conditions.
OnePrompt · one big prompt
59.0%
118 / 200 conversations · success
Collapses under pressure. 41% of attempts failed.
Statistical confidence · Z-score 9.33 · p < 0.00001 · N = 400 runs · Not luck: architecture.
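
The page does not state which test produced the Z-score. A pooled two-proportion z-test is a common choice for comparing two success rates, and, as an assumption, it is the one sketched here. Fed the reported counts (195 of 200 and 118 of 200), it gives a Z of about 9.33, matching the headline figure.

```python
import math


def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic for H0: both success rates are equal."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / std_err


# Reported counts: MagicBlocks 195/200, OnePrompt 118/200.
z = two_proportion_z(195, 200, 118, 200)

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"Z = {z:.2f}, p = {p_value:.1e}")
```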

In plain business terms.

Send 1,000 leads through each system. Same marketing budget. Same traffic. Very different outcome.

MagicBlocks · 1,000 leads in · ~25 lost
975 / 1,000
Leads completed

~25 lost to agent errors. The agent knows when a file is complete — and when it isn't.

OnePrompt · 1,000 leads in · ~410 lost
590 / 1,000
Leads completed

~410 lost opportunities. Same spend. Same traffic. The architecture decided the outcome.
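
The projection is plain arithmetic on the measured success rates. A short sketch, assuming the rates observed in the 200-run samples carry over unchanged to 1,000 real leads (the function name `project_leads` is ours):

```python
def project_leads(success_rate: float, leads: int = 1000) -> tuple[int, int]:
    """Return (completed, lost) leads for a given success rate."""
    completed = round(success_rate * leads)
    return completed, leads - completed


for name, rate in {"MagicBlocks": 0.975, "OnePrompt": 0.590}.items():
    completed, lost = project_leads(rate)
    print(f"{name}: {completed} completed, {lost} lost")
```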

04 The closing gap

OnePrompt often did the work — and then forgot to submit it.

We split success into two questions: did the agent collect every required field (zero-miss), and did it also realize it was done (success). The space between is where leads die quietly.

OnePrompt Zombie gap 21.5%
All runs 100%
Zero-miss 80.5%
Success 59.0%
⚠ Zombie mode
21.5% of conversations collected every required field — and then kept chatting, re-asking questions, chasing distractions, or timing out. You paid for the conversation. The customer gave you everything. The agent didn’t submit.
MagicBlocks Zombie gap 0%
All runs 100%
Zero-miss 97.5%
Success 97.5%
✓ No gap
0% zombie mode. The agent tracks which fields are required, marks them off one by one, and only declares FINISHED when the checklist is actually complete. Every run that collected all required fields also signaled FINISHED.
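
The split between zero-miss and success can be expressed as a small classification over run outcomes. The sketch below is illustrative: the `Run` fields are assumed names, and the run counts are the ones implied by the OnePrompt percentages over 200 conversations.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Run:
    all_fields_collected: bool  # zero-miss: every required field captured
    finished_signaled: bool     # the agent emitted "FINISHED"


def classify(run: Run) -> str:
    if run.all_fields_collected and run.finished_signaled:
        return "success"
    if run.all_fields_collected:
        return "zombie"         # did the work, never submitted
    return "missed_fields"


# Counts implied by the OnePrompt percentages over 200 runs:
# 59.0% success = 118, 80.5% zero-miss = 161, so 161 - 118 = 43 zombie runs.
# Whether the remaining 39 runs signaled FINISHED is not reported; False is
# a placeholder and does not affect the rates below.
runs = (
    [Run(True, True)] * 118
    + [Run(True, False)] * 43
    + [Run(False, False)] * 39
)

counts = Counter(classify(r) for r in runs)
zero_miss_rate = (counts["success"] + counts["zombie"]) / len(runs)
success_rate = counts["success"] / len(runs)
zombie_gap = zero_miss_rate - success_rate

print(f"zero-miss {zero_miss_rate:.1%}, success {success_rate:.1%}, gap {zombie_gap:.1%}")
```
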
05 The logic gap

Agents don't just fail by stopping wrong. They fail by asking wrong.

We measured the Agent Hallucination Rate — the share of runs where the agent asked at least one question that didn't logically apply to the user's situation. Things like demanding a business name from a W-2 employee, or asking about rental income from a single-home owner.

−45 pts
Hallucination rate
OnePrompt · 100%
MagicBlocks · 55%

OnePrompt: 100%. In every single simulation, the agent eventually hallucinated — asking for something that didn't apply to the customer.

MagicBlocks: 55%. When it did slip, the failures were soft edge cases (e.g. asking about gift amount because the user hinted at gift funds), not the embarrassing ones. Loading only the current block's logic makes it much harder for the agent to ask a question that doesn't fit the customer's situation, though it does not eliminate the problem.

Hallucination · OnePrompt
Asking "Business Name?" from a salaried W-2 employee.
Hallucination · OnePrompt
Demanding rental property income from a single-home owner.
Hallucination · OnePrompt
Requesting documents and fields that don't apply to the customer's situation.

MagicBlocks limits the damage. OnePrompt doesn't.
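
A rough sketch of how a per-run hallucination flag could be computed is shown below. The rule table, field names, and profile keys are hypothetical and modelled only on the examples in the text; the actual scoring used in the study is not described on this page.

```python
from typing import Callable

Profile = dict[str, str]

# Each question carries a predicate saying when it applies (hypothetical
# rules modelled on the examples in the text).
APPLIES: dict[str, Callable[[Profile], bool]] = {
    "business_name": lambda p: p.get("employment") == "self_employed",
    "rental_income": lambda p: p.get("owns_rental_property") == "yes",
    "monthly_gross_income": lambda p: True,
}


def is_hallucination(question: str, profile: Profile) -> bool:
    """True if the question does not logically apply to this user."""
    return not APPLIES[question](profile)


def run_hallucinated(questions_asked: list[str], profile: Profile) -> bool:
    """A run counts once if at least one asked question did not apply."""
    return any(is_hallucination(q, profile) for q in questions_asked)


w2_single_home = {"employment": "w2", "owns_rental_property": "no"}
print(run_hallucinated(["monthly_gross_income", "business_name"], w2_single_home))  # True
print(run_hallucinated(["monthly_gross_income"], w2_single_home))                   # False
```

Note that the metric is binary per run: one irrelevant question in a long conversation counts the same as ten.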

06 Grace under pressure

When users go off the rails, MagicBlocks recovers faster — and gets more done per turn.

We measured two behavioural metrics: how many consecutive stalled turns each agent allows before pulling the user back, and how many validated data points it collects per turn of conversation.

Recovery speed

Avg max consecutive stalls

Stalled turns = off-topic, repetitive, or unproductive. Lower means the agent steers back faster.

MagicBlocks 4.55
OnePrompt 5.61

MagicBlocks (4.55): acknowledges the user's concern, then firmly returns to the next required question. OnePrompt (5.61): more likely to get dragged into long side-conversations before remembering what it was doing.
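
For readers who want the metric spelled out: "max consecutive stalls" is the longest unbroken streak of stalled turns in a conversation, averaged over all runs. A minimal sketch, assuming each turn has already been labelled stalled or not (how that labelling was done is not covered here):

```python
def max_consecutive_stalls(turn_is_stalled: list[bool]) -> int:
    """Length of the longest unbroken streak of stalled turns in one run."""
    longest = current = 0
    for stalled in turn_is_stalled:
        current = current + 1 if stalled else 0
        longest = max(longest, current)
    return longest


def average_max_stalls(runs: list[list[bool]]) -> float:
    """Mean of the per-run longest streaks."""
    return sum(max_consecutive_stalls(r) for r in runs) / len(runs)


# Two made-up runs, for illustration only.
example_runs = [
    [False, True, True, False, True],        # longest streak: 2
    [True, True, True, False, False, True],  # longest streak: 3
]
print(average_max_stalls(example_runs))  # 2.5
```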

Progress efficiency

Validated data points / turn

More usable data per turn, despite identical chaos. At scale, this is real money in compute and saved leads.

MagicBlocks 0.34
OnePrompt 0.31

At small scale this looks like a rounding error: 0.34 versus 0.31 is roughly 10% more validated data per turn. Across hundreds of conversations a week, that adds up to fewer turns per completed file and fewer leads dropped halfway through.
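
To put the two efficiency figures in perspective, the sketch below computes the relative uplift and a rough turns-per-file estimate. The 20-field file size is a hypothetical number chosen for illustration, and the estimate assumes the per-turn rate stays constant through a conversation.

```python
def turns_needed(required_points: int, points_per_turn: float) -> float:
    """Rough turn count to collect a file at a constant per-turn rate."""
    return required_points / points_per_turn


MAGICBLOCKS_RATE = 0.34  # validated data points per turn (reported)
ONEPROMPT_RATE = 0.31    # validated data points per turn (reported)

uplift = MAGICBLOCKS_RATE / ONEPROMPT_RATE - 1
print(f"Relative uplift: {uplift:.1%}")

FIELDS_PER_FILE = 20  # hypothetical file size, not a figure from the study
saved = turns_needed(FIELDS_PER_FILE, ONEPROMPT_RATE) - turns_needed(
    FIELDS_PER_FILE, MAGICBLOCKS_RATE
)
print(f"Turns saved per {FIELDS_PER_FILE}-field file: {saved:.1f}")
```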

07 A tale of two agents · Run #400

Meet Charlie — anxious, privacy-concerned, easily distracted.

The agent needs Charlie's phone number, along with other mandatory details. Same scenario, same model, two architectures. Watch what happens.

OnePrompt · run #400 · Failed
AI: What is your phone number?
Charlie: Why do you need that? I'm worried about privacy.
AI: [explains privacy policy for 3 long paragraphs]
Charlie: Okay. By the way, are rates going up next week?
AI: [explains market trends at length] So, do you want to proceed?
Charlie: Yes.
AI: Great! Do you own or rent?
Outcome: The agent forgot it hadn’t gotten the phone number and moved on. The file is incomplete even though Charlie agreed to proceed. It over-focused on "be helpful" and dropped its checklist.
MagicBlocks · run #400 · Completed
AI: What is your phone number?
Charlie: Why do you need that? I'm worried about privacy.
AI: [briefly explains privacy in plain language] I still need your phone number to continue your application.
Charlie: Are rates going up?
AI: Rates can move around, but first I need your phone number so we can proceed with your application. What's the best number to reach you on?
Charlie: Okay… 415-555-0142.
AI: Thanks Charlie. Next: do you own or rent?
Outcome: The agent refused to move on until it got Charlie's phone number. It can still be empathetic and flexible in how it asks, but it can't skip a required field.

This is the difference between a smart chatbot and a reliable system.
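
The behaviour in the second transcript, where the agent keeps returning to the phone number until it gets one, can be approximated with a simple gate. This is a sketch under stated assumptions, not the product's code: the `next_field` helper, the field names, and the US-style phone pattern are all hypothetical, and only the phone field is validated here.

```python
import re

# Simple US-style pattern; a real validator would be stricter (assumption).
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def next_field(current: str, user_message: str, order: list[str]) -> str:
    """Advance only when the current required field was actually provided."""
    if current == "phone" and not PHONE_RE.search(user_message):
        return current  # acknowledge the tangent, then re-ask
    idx = order.index(current)
    return order[min(idx + 1, len(order) - 1)]


order = ["phone", "own_or_rent"]
current_field = "phone"

# Charlie's replies from the MagicBlocks transcript above.
for message in [
    "Why do you need that? I'm worried about privacy.",
    "Are rates going up?",
    "Okay… 415-555-0142.",
]:
    current_field = next_field(current_field, message, order)
    print(f"{message!r} -> next question: {current_field}")
```

The wording of each reply is still left to the model; the gate only decides which question the reply has to end on.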

08 The full scoreboard

Every metric. Side by side.

MagicBlocks wins on every dimension we measured. The margins are widest on success rate (+38.5 pts) and hallucination rate (−45 pts), and narrowest on progress efficiency (+10%).

Metric | MagicBlocks | OnePrompt | Delta
Success rate (job fully done · "FINISHED" signaled) | 97.5% | 59.0% | +38.5 pts
Zero-miss rate (all required fields collected) | 97.5% | 80.5% | +17.0 pts
Zombie-mode gap (done the work, forgot to submit) | 0% | 21.5% | −21.5 pts
Hallucination rate (asked ≥ 1 irrelevant question) | 55% | 100% | −45 pts
Avg max consecutive stalls (turns · lower is better) | 4.55 | 5.61 | −19%
Progress efficiency (validated data points per turn) | 0.34 | 0.31 | +10%
Statistical confidence (vs OnePrompt · N = 400) | Z = 9.33 | baseline | p < 0.00001

What this means for your team

An agent you can actually trust with your leads.

Most AI agents on the market today are built around one big prompt and a hope the model "figures it out." Our 400 conversations say that approach doesn't hold up the moment real customers get on the line — the kind who interrupt, change their minds, or wander off-topic.

You can tweak a prompt forever and still not fix the underlying problem: you're asking a chatty assistant to behave like a careful process — without giving it the structure that makes "careful" possible.

MagicBlocks gives it that structure. Three things make the difference in practice:

One job at a time
Each step is its own focused block — qualifying, handling objections, booking — instead of one AI trying to do all of it at once.
Knows where it is
Tracks what's been collected and what's still missing. Won't forget. Won't accidentally "finish" a half-done file.
Asks what fits
Won't ask a salaried W-2 employee for a business name. Won't demand rental income from a single-home owner. Only the questions that apply.

If you want AI that holds up when conversations get messy — and still finishes the job — it's not about a fancier prompt. It's about an agent designed for it.

See MagicBlocks in action · 20-minute demo · run it on a lead from your CRM