Human-on-the-Loop AI Control

Felipe Hlibco

I was in a meeting last month where someone from a financial services team walked us through their AI oversight process. Every automated trading decision above a certain threshold gets routed to a human reviewer. The reviewer checks the decision, approves or rejects, and the system proceeds.

“How many decisions per day?” I asked.

“About forty thousand.”

“And how many reviewers?”

“Three.”

Nobody laughed. I think that's because everyone in the room already knew the answer before I asked. The math doesn’t work. It never did, really, but we’re only now being honest about it.

Three Tiers of Oversight #

The terminology gets confusing fast, so let me lay out the three models as I understand them.

Human-in-the-loop (HITL): The human approves each action before execution. The AI proposes; the human disposes. Every single time. This is what most people picture when they hear “AI oversight,” and, to be fair, what most regulatory language seems to assume.

Human-on-the-loop (HOTL): The AI acts on its own, but a human monitors in real time and can intervene. Think supervisory role. Not approving every step, but watching the dashboard, ready to hit the brakes.

Human-over-the-loop (HOVL): The human sets policies, boundaries, guardrails. The AI operates within those constraints. Humans review outcomes periodically but don’t monitor in real time.

Most organizations claim to do HITL. What they actually do (once you look at the volume and the staffing) falls somewhere between HOTL and HOVL, with HITL applied to a thin slice of high-value decisions. I’ve seen this pattern at three different companies now. The gap between what leadership reports and what operations actually looks like is… significant.

Why HITL Falls Apart #

The problem isn’t conceptual. HITL sounds great. Having a human verify every AI decision sounds responsible, thorough, safe. The problem is purely operational.

AI systems now make millions of decisions per second across fraud detection, algorithmic trading, logistics optimization, cybersecurity threat response, content moderation. The scale of autonomous decision-making has outrun human capacity to review it. Not by a little. By orders of magnitude.

Take fraud detection at a major bank. The system evaluates transactions in real time: is this purchase legitimate? Should this wire transfer proceed? The answer needs to come back in milliseconds. A human reviewer adds seconds at best, minutes at worst. In that window, the fraudulent transfer completes or the legitimate customer’s card gets declined at the grocery store. Neither outcome wins.

Or cybersecurity. A system detects a potential intrusion; the response window might be measured in single-digit seconds. Route that alert to a human queue and wait for review? The attacker already pivoted laterally while someone pulled up the ticket.

The cognitive load problem makes everything worse. Humans reviewing thousands of AI decisions per day develop what researchers call “automation bias”—they start rubber-stamping because the AI gets it right 99.7% of the time. That remaining 0.3% of errors, the ones the human exists to catch? They sail right through. The reviewer is exhausted, pattern-numbed, and rationally convinced that the AI probably got this one right too.

HITL doesn’t just fail at scale. It actively degrades the quality of the human oversight it promises.

The Regulatory Tension #

Here’s where it gets uncomfortable. The EU AI Act (Article 14) mandates human oversight for high-risk AI systems, with full enforcement hitting in August 2026. The regulation requires some form of meaningful human control over AI decisions in domains like credit scoring, hiring, law enforcement, and critical infrastructure.

“Meaningful” is doing a lot of heavy lifting in that sentence.

The regulation doesn’t specify HITL vs. HOTL vs. HOVL explicitly; it requires that human oversight be “effective.” But what does effective oversight look like when the system makes ten thousand decisions before a human finishes reviewing one? I don’t have a clean answer. I’m not sure the regulators do either.

This tension needs engineering leaders’ attention right now. Not in 2026 when enforcement begins, but now, because designing oversight architectures takes time and the regulatory clock runs whether teams are ready or not.

Enter Human-on-the-Loop #

HOTL is the pragmatic middle ground. The AI operates autonomously, and the human’s role shifts from gatekeeper to supervisor. Instead of reviewing every decision, the human monitors patterns, watches for anomalies, intervenes when something looks wrong.

The core insight (and I keep coming back to this when I think about it): not all decisions carry equal risk. A fraud detection system blocking a $12 coffee purchase doesn’t need human review. The same system flagging a $500,000 wire transfer to a new international account probably does.

This is where dynamic HITL comes in, and I think it’s the most interesting development in the space right now. The AI itself decides when to involve the human, based on its own confidence level and the assessed risk of the decision.

Low confidence + high stakes = escalate to human (HITL mode). High confidence + low stakes = proceed autonomously (HOVL mode). Everything in between = human monitors and can intervene (HOTL mode).
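That routing matrix is simple enough to sketch in a few lines. This is a minimal illustration, not any particular framework's API; the threshold values and field names are placeholders you'd tune for your own system:

```python
from dataclasses import dataclass
from enum import Enum


class OversightMode(Enum):
    HITL = "escalate_to_human"        # human approves before execution
    HOTL = "proceed_with_monitoring"  # autonomous, human can intervene
    HOVL = "proceed_autonomously"     # autonomous within guardrails


@dataclass
class Decision:
    confidence: float  # model's self-assessed confidence, 0.0-1.0
    stakes: float      # assessed risk/impact of the decision, 0.0-1.0


def route(decision: Decision,
          confidence_floor: float = 0.90,
          stakes_ceiling: float = 0.70) -> OversightMode:
    """Pick an oversight tier from confidence and stakes.

    The thresholds are illustrative; in practice they're tunable
    parameters that humans own.
    """
    if decision.confidence < confidence_floor and decision.stakes >= stakes_ceiling:
        return OversightMode.HITL   # low confidence + high stakes
    if decision.confidence >= confidence_floor and decision.stakes < stakes_ceiling:
        return OversightMode.HOVL   # high confidence + low stakes
    return OversightMode.HOTL       # everything in between
```

The point isn't the specific cutoffs; it's that the oversight level becomes a function of the decision, not a fixed property of the system.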

Microsoft’s Magentic-UI research explores exactly this pattern for agentic systems. The framework lets AI agents operate autonomously on routine tasks but surfaces decision points to human operators when the agent encounters uncertainty or high-risk actions. Not a fixed oversight level; it’s adaptive. I’m honestly surprised more teams haven’t adopted something similar already.

What This Looks Like in Practice #

I’ve been thinking about this in the context of messaging infrastructure, the kind of systems I deal with daily. Billions of messages flow through. Nobody can review each one for policy compliance. But you can build systems that flag anomalies (sudden volume spikes, unusual sender patterns, content that triggers risk classifiers) and surface those to human operators.

The practical architecture, at least as I’d frame it:

Policy guardrails come first. Hardcoded rules the AI cannot override: rate limits, content filters, regulatory boundaries. Humans define the policies; the system enforces them without runtime human involvement. This is the HOVL layer, and honestly it catches more problems than people expect.
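A guardrail layer can be as unglamorous as a function of hard rules checked before any AI-proposed action executes. A hypothetical sketch (the limits, field names, and blocklist are invented for illustration):

```python
# Illustrative HOVL guardrail layer: humans define these constants as
# policy; the system enforces them with no runtime human involvement,
# and the AI cannot override them.

MAX_TRANSFERS_PER_HOUR = 100                      # hypothetical rate limit
BLOCKED_DESTINATIONS = {"sanctioned-corridor-1"}  # hypothetical regulatory boundary


def passes_guardrails(action: dict, transfers_this_hour: int) -> bool:
    """Return False if any hard rule is violated, regardless of how
    confident the model is in the action."""
    if transfers_this_hour >= MAX_TRANSFERS_PER_HOUR:
        return False
    if action.get("destination") in BLOCKED_DESTINATIONS:
        return False
    return True
```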

Confidence-based routing sits on top. The AI assesses its own confidence in each decision. Below a threshold, the decision gets escalated. Above it, the system proceeds. The threshold itself is a tunable parameter that humans control, and tuning it well matters more than most teams realize.

Anomaly monitoring runs in parallel. Real-time dashboards and alerting for patterns that suggest the AI is making systematic errors or encountering novel situations. This is where the HOTL operator lives, watching the system’s behavior at an aggregate level rather than individual decisions.
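One simple way to surface aggregate anomalies to that operator is a rolling statistical check on decision volume. A sketch, assuming per-interval counts and an arbitrary z-score threshold (real deployments would use something more sophisticated, but the shape is the same):

```python
import statistics
from collections import deque


class VolumeMonitor:
    """Rolling z-score alert for sudden volume spikes: a minimal example
    of monitoring aggregate behavior rather than individual decisions."""

    def __init__(self, window: int = 60, z_threshold: float = 4.0):
        self.history = deque(maxlen=window)  # recent per-interval counts
        self.z_threshold = z_threshold

    def observe(self, count: int) -> bool:
        """Record one interval's decision count; True means 'alert a human'."""
        alert = False
        if len(self.history) >= 10:  # need some baseline before alerting
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history) or 1.0
            alert = (count - mean) / stdev > self.z_threshold
        self.history.append(count)
        return alert
```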

Audit happens after the fact. Post-hoc analysis of AI decisions. Not every decision, but statistically sampled subsets plus any decisions that resulted in complaints, errors, or anomalies. This feeds back into policy updates. It’s the learning loop that makes everything else better over time.

None of these layers require a human to approve every decision. All of them maintain meaningful human oversight. At least, that’s the argument, and I think it holds up, though I’ll admit the “meaningful” part depends heavily on how well the anomaly detection actually works.

The Leadership Question #

If you’re building AI-powered systems, the question isn’t whether to implement human oversight. Regulation mandates it, and good engineering demands it regardless. The question is which oversight model fits your risk profile.

I’d start with a risk taxonomy. Classify your AI decisions by impact severity and reversibility. A content recommendation that’s slightly off? Low impact, fully reversible. HOVL works fine. An autonomous trading decision that could move markets? High impact, potentially irreversible. HITL for those, or at minimum a very tight HOTL with aggressive escalation thresholds.
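Even the taxonomy can start as a table small enough to fit in one function. A sketch of the mapping described above; the categories and the mapping itself are a starting point to argue with, not a standard:

```python
def oversight_tier(severity: str, reversible: bool) -> str:
    """Map a decision class to an oversight tier.

    'severity' is one of "low", "medium", "high" (illustrative buckets).
    """
    if severity == "high" and not reversible:
        return "HITL"   # or a very tight HOTL with aggressive escalation
    if severity == "low" and reversible:
        return "HOVL"
    return "HOTL"       # everything in between gets supervised autonomy
```

Writing it down this explicitly has a side benefit: it forces the team to say, on the record, which decisions nobody will review individually.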

Then design the escalation paths. What happens when the AI isn’t confident? Where does the decision go? How fast does it need to get there? If your escalation path adds 30 seconds of latency to a decision that needs to happen in 100 milliseconds, you haven’t built oversight; you’ve built a bottleneck that everyone will route around eventually.

The hardest part—and nobody seems to want to talk about this—is accepting that you can’t review everything. Engineering leaders who grew up with code reviews, QA gates, and approval workflows have a visceral resistance to letting systems operate autonomously. I get it. I had the same instinct. But the alternative (pretending you have HITL when you actually have rubber-stamping) is worse. At least HOTL is honest about what the human actually does.

Where This Goes #

The industry moves toward adaptive oversight models whether regulators explicitly endorse them or not, because operational reality forces the issue. Organizations that figure out HOTL well (genuine anomaly detection, meaningful escalation, rigorous audit) will satisfy both the spirit of regulations like the EU AI Act and the operational constraints of running AI systems in production.

The ones clinging to HITL theater, claiming human oversight while their reviewers rubber-stamp at a rate that makes review impossible, will have a much harder conversation when regulators come asking what “effective oversight” actually looks like.

Build the oversight architecture now. Retrofitting it after enforcement begins sounds like exactly the kind of project that takes twice as long as anyone estimates.