Collaborative Retrieval for Conversational RecSys

Felipe Hlibco

Recommender systems have a split personality problem.

On one side, you’ve got LLMs that can hold a conversation — parse nuance, understand when someone says “something like Inception but weirder” and actually get it. On the other, you’ve got collaborative filtering: decades of behavioral data showing that users who liked X also liked Y. Both are powerful. Neither talks to the other.

CRAG fixes that.

It’s a joint effort from the University of Virginia’s VAST LAB, Cornell, and Netflix (published at WWW 2025). CRAG stands for Collaborative Retrieval Augmented Generation — the first conversational recommender system that actually combines LLM context understanding with collaborative filtering retrieval. Not in a hand-wavy “we use both” sense; in a structured, two-step mechanism that pulls collaborative filtering knowledge into the LLM’s prompt at inference time.

Why This Matters #

Most conversational recommender systems today fall into one of two camps.

Camp one: pure LLM. You feed the conversation history into GPT-4 or similar and ask for recommendations. Works surprisingly well for popular items — fails badly for anything niche or recent. The model recommends based on its training data, not actual user behavior.

Camp two: classical collaborative filtering wrapped in a chat interface. Great at “users like you also watched…” but terrible at understanding conversational context. If I say “I loved the cinematography in Blade Runner 2049 but the pacing lost me,” a collaborative filter has no idea what to do with that.

CRAG sits between them. It uses the LLM to understand the conversation, then retrieves relevant collaborative filtering signals to augment the LLM’s response. The key innovation is what the authors call a “context-aware reflection mechanism” — the system reflects on the conversation to determine what kind of collaborative knowledge would be useful, then retrieves it.

How It Works #

The architecture has two phases.

First, the LLM processes the conversation and generates a reflection: what does this user seem to want? What are the key preference signals? This isn’t just summarization; it’s structured reasoning about user intent.

Second, that reflection drives a retrieval step into collaborative filtering data. The system pulls item-level and user-level signals that match the reflected preferences. These get injected back into the LLM prompt as additional context.

Think of it as giving the LLM a cheat sheet. “Hey, based on what this user is describing, here are the items that behaviorally similar users gravitated toward.” The LLM can then weigh those signals against the conversational context and produce recommendations that are both contextually appropriate and grounded in real behavior patterns.
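The two phases can be sketched in a few lines. This is a toy illustration of the reflect-then-retrieve flow, not the paper's implementation: `reflect`, `retrieve`, the similarity table, and every movie title here are illustrative stand-ins (in CRAG, reflection is an LLM reasoning step and retrieval hits real collaborative filtering data).

```python
from typing import Dict, List

# Toy collaborative-filtering signal: for each item, behaviorally similar
# items ranked by co-engagement. In practice this comes from interaction logs.
ITEM_NEIGHBORS: Dict[str, List[str]] = {
    "Blade Runner 2049": ["Arrival", "Annihilation", "Ex Machina"],
    "Inception": ["Tenet", "Interstellar", "Memento"],
}

def reflect(conversation: str) -> List[str]:
    """Phase 1 stand-in: an LLM would reason about the conversation and emit
    structured preference anchors. Here we just spot known titles."""
    return [title for title in ITEM_NEIGHBORS if title in conversation]

def retrieve(anchors: List[str], k: int = 3) -> List[str]:
    """Phase 2: pull collaborative signals matching the reflected anchors."""
    candidates: List[str] = []
    for title in anchors:
        for neighbor in ITEM_NEIGHBORS.get(title, [])[:k]:
            if neighbor not in candidates:
                candidates.append(neighbor)
    return candidates

def augment_prompt(conversation: str) -> str:
    """Inject the retrieved candidates back into the LLM prompt as context —
    the 'cheat sheet' the model weighs against the conversation itself."""
    candidates = retrieve(reflect(conversation))
    return (
        f"Conversation:\n{conversation}\n\n"
        f"Behaviorally similar users also engaged with: {', '.join(candidates)}\n"
        "Recommend movies that fit this conversation, weighing those signals."
    )

prompt = augment_prompt("I loved the cinematography in Blade Runner 2049.")
```

The point of the structure is the clean interface: the reflection output is the only thing the retrieval step sees, so each component stays swappable.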

The Results #

The authors evaluated CRAG on two movie recommendation datasets: a refined version of Reddit-v2 and the ReDial dataset. It outperforms both zero-shot LLM baselines and naive RAG approaches on item coverage and recommendation accuracy.

The most interesting finding? The biggest gains show up on recently released movies. That’s exactly where you’d expect collaborative filtering to shine and pure LLMs to struggle. A model trained six months ago doesn’t know that a movie released last week is getting strong engagement from sci-fi fans who also loved Arrival. Collaborative filtering does.

Item coverage improvements matter too. Pure LLM recommendations tend to cluster around the same popular titles — the “recommend Shawshank Redemption for everything” problem. CRAG’s collaborative retrieval pulls in a broader set of candidates, which means more diverse and useful recommendations.

My Take #

I’ve been watching the LLM-meets-RecSys space closely, and most attempts feel bolted together. Someone takes an existing recommender and adds a chat layer, or takes an LLM and fine-tunes it on recommendation data. CRAG’s approach is more principled: let each component do what it’s good at, and build a clean interface between them.

The Netflix co-authorship isn’t incidental. Netflix has arguably the most sophisticated recommendation infrastructure in the world, and they’re clearly thinking about how LLMs fit into that stack. The fact that they’re publishing on this — rather than just shipping it — suggests the problem space is genuinely hard and they want the research community working on it.

One thing I’d watch: the reflection mechanism is the linchpin. If the LLM reflects poorly on user intent, the collaborative retrieval pulls irrelevant signals, and the whole pipeline degrades. The paper shows strong results on movie recommendations, where preference articulation maps pretty cleanly to item attributes. I’m curious how well this generalizes to domains where preferences are harder to articulate.

The broader signal here is that the “just throw it at an LLM” era of recommendations is giving way to hybrid architectures. LLMs are remarkable at understanding what you want; collaborative filtering is remarkable at knowing what exists and who liked it. CRAG demonstrates that connecting these two capabilities is both feasible and beneficial.

That’s a result worth paying attention to.