Rasa: Minimizing Complexity in Generative AI Bots
I’ve spent the past month evaluating conversational AI frameworks at DreamFlare. We’re building a GenAI entertainment platform, and the bot layer sits at the center of everything. So when Rasa launched CALM—that’s Conversational AI with Language Models—back in October, I paid close attention.
After a couple of weeks digging through docs and hacking together a proof-of-concept, I’ve got a verdict. Rasa made the smartest architectural bet in conversational AI right now. Not the flashiest. Not the most technically impressive. The smartest.
The Problem CALM Solves #
Let me paint you a picture of enterprise bot development in late 2023. It’s a mess—genuinely.
On one side you’ve got the old guard: Dialogflow CX, Amazon Lex, Microsoft Bot Framework. These are essentially state machines with hand-crafted intents and entity extractors. Predictable? Sure. Auditable? Absolutely. But brittle as hell. Every user utterance that doesn’t match your training data triggers a fallback. And maintaining the intent model becomes a full-time job the moment your bot grows past “hello” and “goodbye.”
On the other side there’s the “just use GPT-4” crowd. Wire an LLM to your knowledge base, let it generate responses on the fly, ship it. Fast to prototype, handles weird inputs gracefully. But it hallucinates. It spouts plausible-sounding nonsense when it doesn’t know the answer—and for enterprise use cases like booking flights, processing refunds, or checking account balances, hallucination isn’t a minor inconvenience. It’s a liability.
CALM sits somewhere in the middle. And the architecture is clever enough that it’s worth breaking down.
How CALM Works #
The core idea: use LLMs for language understanding, but keep conversation flow deterministic.
In a CALM bot, the LLM’s job is figuring out what the user wants and pulling out the relevant details. The conversation logic—what actually happens after intent is understood—runs on Rasa’s existing rules engine. The LLM never generates business logic responses. It doesn’t tell the user “your refund has been processed” unless the backend actually processed the refund.
The split is clean. The LLM handles the messy, ambiguous, natural-language stuff—exactly what LLMs are good at. The deterministic engine handles the transactional, must-be-correct stuff—exactly what LLMs are bad at.
Here’s a concrete example. A user says: “I bought a jacket last week but it’s too small, can I get a different size?”
In a traditional system, you’d need training examples for this exact phrasing. In a pure LLM approach, GPT-4 might generate a response about return policies that sounds right but doesn’t match your actual policy.
In CALM, the LLM's output is a set of structured commands rather than free text: start the exchange flow (the intent) and fill in the relevant details (jacket, size issue, purchased last week). The rules engine then follows the actual exchange flow: look up the order, check eligibility, confirm the new size, process the exchange. Every step is deterministic and auditable.
The Patterns CALM Handles Out of the Box #
What impressed me most isn’t the LLM integration itself—you can bolt an LLM onto any dialogue framework. It’s the conversation patterns that ship built-in.
Topic changes. The user is halfway through a booking flow and suddenly asks “what’s your cancellation policy?” CALM handles this without custom code. The LLM recognizes the topic shift; the engine parks the current flow, answers the policy question, and returns to the booking.
Interruptions. “Actually, never mind, I want to change my departure city.” Mid-flow corrections that would require explicit state management in Dialogflow CX are handled automatically.
Cancellations. “Forget it, I don’t want to book anything.” The engine abandons the current flow gracefully, without leaving orphaned state.
Clarifications. The user says something ambiguous. CALM generates a clarifying question, collects the answer, and continues. No hand-crafted disambiguation dialogs.
These patterns represent maybe 40% of the engineering effort in traditional bot development. Having them out of the box is a meaningful reduction in complexity—not a toy reduction, but a “that’s two fewer engineers on the project” kind of reduction.
The Model Choice: Smaller Is Smarter #
Rasa made a deliberate choice to optimize CALM for smaller, fine-tuned models. Their benchmarks cite Llama 8B-class models rather than GPT-4. This is strategic, not a technical limitation.
GPT-4 is better at language understanding—no question. But cost and latency matter for enterprise deployment. A GPT-4 API call takes 1-3 seconds and costs roughly $0.03-0.06 per 1K tokens. A self-hosted Llama model running on modest GPU hardware responds in 200-500ms with no per-request cost after the infrastructure investment.
For a bot handling thousands of concurrent conversations, the economics are stark. At 100K conversations per day with an average of 10 turns each, that's 30 million LLM calls a month; at GPT-4 list prices, the bill climbs into the hundreds of thousands of dollars monthly. A self-hosted model is a fixed infrastructure cost regardless of volume.
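A back-of-envelope version of that calculation, using the volumes above. The per-call token counts are my assumptions for illustration (real prompts vary with conversation length); the per-1K rates are GPT-4 list pricing as of late 2023.

```python
# Back-of-envelope GPT-4 API cost at the volumes cited in the text.
CONVS_PER_DAY = 100_000
TURNS_PER_CONV = 10          # one LLM call per turn
DAYS_PER_MONTH = 30

# Assumed per-call usage: the prompt carries history plus instructions,
# the completion is short. These are illustrative guesses.
PROMPT_TOKENS = 800
COMPLETION_TOKENS = 100

# GPT-4 list pricing circa late 2023, USD per 1K tokens.
PROMPT_RATE = 0.03
COMPLETION_RATE = 0.06

calls_per_month = CONVS_PER_DAY * TURNS_PER_CONV * DAYS_PER_MONTH
cost_per_call = (PROMPT_TOKENS / 1000) * PROMPT_RATE \
              + (COMPLETION_TOKENS / 1000) * COMPLETION_RATE
monthly_api_cost = calls_per_month * cost_per_call

print(f"{calls_per_month:,} calls/month at ${cost_per_call:.3f} each "
      f"-> ${monthly_api_cost:,.0f}/month")
```

Even if you shrink the assumed token counts by an order of magnitude, the monthly bill stays in five to six figures, while a self-hosted model's cost is flat with volume.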
More importantly, self-hosted models address the data privacy concern that blocks many enterprise deployments. Financial services, healthcare, government—these sectors can’t send customer conversations to a third-party API. Period. Not negotiable. CALM’s architecture lets you run the LLM on your own infrastructure, behind your own firewall, with your own data retention policies.
The fine-tuning angle is interesting too. Rasa can fine-tune a Llama-class model specifically for dialogue understanding tasks, stripping out the general-purpose capabilities that a chatbot doesn’t need. A specialized 8B model can outperform a general-purpose 70B model on the specific task of extracting intents and entities from conversational text. Smaller, faster, cheaper, better at the actual job.
What’s Missing #
CALM isn’t perfect, and I want to be straight about the gaps I found during evaluation.
LLM-generated responses for informational queries. CALM shines when the user’s intent maps to a business flow—book, cancel, exchange, check status. It’s less great when the user asks an open-ended question without a defined flow. “Tell me about your sustainability initiatives” or “What makes your product different from Competitor X?” These knowledge-base queries benefit from LLM generation, and CALM’s strict separation between understanding and logic makes it awkward to let the LLM generate informational responses.
Rasa addresses this with “enterprise search”—RAG-style retrieval from a knowledge base—but it feels bolted on rather than integrated. The retrieval quality depends heavily on the knowledge base structure and the embedding model quality. In my testing, it handled FAQ-style questions well but struggled with multi-hop reasoning (“What’s the cheapest option that also includes free cancellation?”).
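A toy example of the multi-hop gap. Naive word overlap stands in for a real embedding model, and the three-document knowledge base is invented for illustration; the failure mode is the same with better scoring.

```python
# Single-shot retrieval: score each doc against the query, return the best.
DOCS = {
    "refunds": "Refunds are processed within 5 business days.",
    "fares": "Economy Light is the cheapest fare.",
    "cancellation": "Free cancellation is included with Economy Flex.",
}

def retrieve(query: str) -> str:
    words = set(query.lower().split())
    return max(DOCS.values(),
               key=lambda doc: len(words & set(doc.lower().split())))

# FAQ-style query: one document contains the whole answer.
print(retrieve("how long do refunds take"))

# Multi-hop query: answering correctly requires combining the fares doc
# and the cancellation doc, but top-1 retrieval surfaces only one of them.
print(retrieve("cheapest option with free cancellation"))
```

The second query returns the cancellation document, which never names a fare, so any answer generated from it can't actually identify the cheapest eligible option.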
Multi-modal input. CALM is text-first. If your bot needs to process images—receipts, product photos, damage documentation—you’re integrating external services yourself. This isn’t a CALM-specific criticism; most conversational AI frameworks are text-first. But it’s a gap that matters for certain use cases.
Testing and debugging tooling. Rasa’s conversation testing framework exists but it’s cumbersome. Writing test conversations as YAML stories and stepping through them isn’t the developer experience I’d expect from a framework that costs real money. (Rasa Pro pricing is opaque, which is its own issue.) I want a visual conversation debugger that shows me: “the LLM extracted these entities, the rules engine matched this flow, the response came from this template.” Right now, debugging requires reading logs.
Ecosystem lock-in. CALM is a Rasa-specific abstraction. The skills you build, the flows you define, the model fine-tuning you do—none of it is portable to another framework. If you decide to leave Rasa in two years, you’re rebuilding from scratch. This is true of every conversational AI platform, but it’s worth acknowledging.
The Business Signal #
Rasa doubled their ARR in 2023 and secured a $30M Series C from StepStone, PayPal Ventures, a16z, and Accel. That investor list is interesting: PayPal Ventures investing in a conversational AI company signals that large fintech players see bot automation as a strategic priority, not a nice-to-have.
The ARR doubling is the more interesting metric. Enterprise sales cycles for conversational AI platforms run 6-12 months, so the revenue recognized in 2023 reflects deals that started in 2022 or early 2023—before CALM existed. This means the growth is driven by Rasa’s existing product, and CALM is the bet on accelerating that trajectory.
If I’m reading the market correctly, Rasa is positioning CALM as the answer to a very specific enterprise pain: “We want to use LLMs but we can’t afford hallucination.” That’s a message that resonates with every regulated industry, and there are a lot of regulated industries.
My Assessment as a CTO #
For DreamFlare, we’re not using Rasa. Our use case is entertainment, not enterprise transactions, so the hallucination-prevention architecture is less relevant. We want generative, creative, personality-driven conversations—the opposite of deterministic flows.
But if I were building a customer service bot for a bank? A booking assistant for an airline? A claims processing agent for an insurance company? Rasa CALM would be at the top of my shortlist.
The “LLM for understanding, rules for logic” architecture isn’t sexy. It won’t demo as impressively as a GPT-4-powered bot that generates eloquent responses on the fly. But it’s the architecture that will survive the first audit, the first compliance review, the first time a customer says “your bot told me I qualified for a refund that I didn’t qualify for.”
Enterprise conversational AI isn’t about what’s possible. It’s about what’s safe. CALM makes safety the default rather than the afterthought, and that’s a positioning advantage that’s hard to replicate.
The conversational AI market is going to split. Consumer-facing bots will lean into generative, personality-rich, LLM-native experiences. Enterprise bots will lean into controlled, auditable, deterministic-with-LLM-understanding hybrids. Rasa is betting on the enterprise side, and I think that bet is correct.