Does Reinforcement Learning Need Labeled Data?
RL vs Supervised Learning
Supervised learning requires a dataset of inputs paired with correct outputs. For retrieval, this would mean a dataset of (query, ideal_ranking) pairs: for each possible query, a human expert specifies the perfect ordering of results. Creating this dataset is expensive (you need domain experts to judge thousands of query-result pairs), quickly becomes outdated (as content changes, the ideal rankings change), and often requires subjective judgments (different experts rank differently).
Reinforcement learning skips this requirement entirely. Instead of being told the correct answer, the system tries an answer, observes the outcome, and adjusts its strategy based on whether the outcome was good or bad. The system never needs to see a "correct" ranking. It only needs to observe whether its current ranking produced results the user found useful.
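As a minimal, self-contained sketch of that loop (not Adaptive Recall's actual mechanism), the epsilon-greedy bandit below chooses between two hypothetical ranking strategies and improves using nothing but a simulated reward signal; no labeled rankings appear anywhere:

```python
import random

# Toy sketch: an epsilon-greedy bandit learns which of two hypothetical
# ranking strategies works better, from reward alone. The strategy names
# and reward probabilities are illustrative assumptions.

strategies = ["rank_by_recency", "rank_by_frequency"]
value = {s: 0.0 for s in strategies}   # running estimate of each strategy's reward
count = {s: 0 for s in strategies}
EPSILON = 0.1                          # exploration rate

def observe_reward(strategy: str) -> float:
    """Stand-in for a real outcome signal (e.g. did the user's task complete?).
    Simulates a world where frequency-based ranking happens to work better."""
    p_good = 0.7 if strategy == "rank_by_frequency" else 0.4
    return 1.0 if random.random() < p_good else 0.0

for _ in range(10_000):
    # Explore occasionally; otherwise exploit the best-known strategy.
    if random.random() < EPSILON:
        chosen = random.choice(strategies)
    else:
        chosen = max(strategies, key=value.get)
    reward = observe_reward(chosen)
    count[chosen] += 1
    value[chosen] += (reward - value[chosen]) / count[chosen]  # incremental mean

print(value)  # the better strategy's estimate converges toward its true reward rate
```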
This distinction matters enormously for practical deployment. Labeled data is a one-time investment that depreciates as the world changes. Reward signals from user interactions are generated continuously and always reflect current user needs. A system that learns from rewards adapts to changing content, changing user behavior, and changing requirements automatically, without anyone updating a training dataset.
Types of Reward Signals
Reward signals vary in quality (how accurately they reflect true user satisfaction) and density (how often they are available). Understanding the tradeoffs between different signal types helps you design a learning system that makes the most of the feedback available.
Access patterns (highest density, lowest quality per signal). Every retrieval event generates access data: which memories were candidates, which were returned, and which the model referenced. This signal is available for 100% of interactions with zero user effort. But individual access events are noisy: a memory might be returned and referenced for reasons unrelated to its actual quality. The value of access patterns comes from aggregation: the law of large numbers turns noisy individual signals into reliable patterns over thousands of interactions.
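A tiny simulation illustrates the aggregation point: each individual access event is a noisy vote, but the average over many events converges on the underlying rate (the 0.6 "true usefulness" below is an arbitrary illustrative value):

```python
import random

# Each access event is a noisy vote: the memory "deserved" the access only
# 60% of the time. Averaging over many events recovers the true rate.

random.seed(0)
TRUE_USEFULNESS = 0.6
for n in (10, 100, 10_000):
    votes = [1 if random.random() < TRUE_USEFULNESS else 0 for _ in range(n)]
    print(n, sum(votes) / n)   # the estimate tightens around 0.6 as n grows
```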
Behavioral signals (high density, moderate quality). User behavior after receiving results provides richer signal than raw access patterns. Query reformulation (the user rephrased their question) suggests the results were insufficient. Session continuation (the user kept working) suggests the results were adequate. Task completion (the user finished what they were doing) suggests the results were helpful. These signals require basic interaction tracking but no explicit user input.
Implicit model feedback (moderate density, moderate quality). In memory-augmented LLM applications, the model's response itself contains signal. If the model's response references specific injected memories by incorporating their facts or language, those memories were useful. If the model ignores injected memories and answers from its own knowledge, those memories were not useful for this query. This signal is available for every interaction but requires comparing the response against the injected context.
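One cheap way to perform that comparison is sketched below with a hypothetical token-overlap heuristic; a production system might use embedding similarity or span matching instead:

```python
# Rough sketch: token overlap as a cheap proxy for "the response referenced
# this injected memory". The threshold and tokenization are illustrative.

def token_set(text: str) -> set[str]:
    return {w.lower().strip(".,!?") for w in text.split() if len(w) > 3}

def memory_was_referenced(memory: str, response: str, threshold: float = 0.3) -> bool:
    mem_tokens = token_set(memory)
    if not mem_tokens:
        return False
    overlap = len(mem_tokens & token_set(response)) / len(mem_tokens)
    return overlap >= threshold

memory = "The team migrated the billing service to PostgreSQL in 2023."
response = "Billing runs on PostgreSQL since the 2023 migration of the service."
print(memory_was_referenced(memory, response))  # True: high token overlap
```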
Explicit user feedback (lowest density, highest quality). Thumbs up/down, star ratings, "that's wrong" corrections, and other direct feedback are the clearest indicators of quality. But they are available for only 2-10% of interactions, because most users never provide explicit feedback. When available, explicit feedback should be weighted heavily because it directly reflects user judgment rather than requiring behavioral inference.
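A sketch of how the four signal types above might fold into a single reward, with explicit feedback dominating when present; the weights and field names are illustrative assumptions, not Adaptive Recall's values:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical reward aggregation across the four signal types. Weights are
# illustrative: explicit feedback, when present, overrides the implicit channels.

@dataclass
class InteractionSignals:
    memory_accessed: bool            # access pattern: returned and referenced
    query_reformulated: bool         # behavioral: user rephrased the question
    model_referenced_memory: bool    # implicit model feedback
    explicit_rating: Optional[float] = None  # explicit: +1 thumbs up, -1 thumbs down

def reward(sig: InteractionSignals) -> float:
    # Explicit feedback, when available, dominates the other channels.
    if sig.explicit_rating is not None:
        return 3.0 * sig.explicit_rating
    r = 0.0
    r += 0.2 if sig.memory_accessed else 0.0
    r -= 0.5 if sig.query_reformulated else 0.0   # reformulation = results insufficient
    r += 0.5 if sig.model_referenced_memory else 0.0
    return r

print(reward(InteractionSignals(True, False, True)))                        # 0.7
print(reward(InteractionSignals(True, True, False, explicit_rating=-1.0)))  # -3.0
```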
How Label-Free Learning Works in Practice
Adaptive Recall demonstrates label-free learning through ACT-R activation dynamics. The system requires zero labeled data, zero explicit user feedback, and zero training datasets. Learning happens entirely through access patterns.
When a memory is retrieved, an access timestamp is recorded. The ACT-R base-level activation equation computes the memory's current activation from all stored timestamps: B = ln(sum_j t_j^(-d)), where t_j is the time elapsed since the j-th access and d is the decay parameter. This equation naturally encodes both recency (recent accesses contribute more) and frequency (more accesses contribute a larger total). A memory that is retrieved often across many sessions has high activation. A memory that was retrieved once months ago has low activation.
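The equation translates directly into code. The sketch below uses d = 0.5, ACT-R's conventional default decay; the function name and timestamp layout are illustrative:

```python
import math
import time

# Base-level activation: B = ln( sum over accesses j of t_j^(-d) ),
# where t_j is the seconds elapsed since the j-th access of this memory.

def base_level_activation(access_times: list[float], now: float, d: float = 0.5) -> float:
    """access_times: timestamps (seconds) of each past retrieval of this memory."""
    return math.log(sum((now - t) ** (-d) for t in access_times))

now = time.time()
HOUR, MONTH = 3600.0, 30 * 24 * 3600.0

frequently_used = [now - k * HOUR for k in (1, 5, 24, 72)]   # four recent accesses
stale = [now - 3 * MONTH]                                    # one access, months ago

print(base_level_activation(frequently_used, now))  # higher activation
print(base_level_activation(stale, now))            # lower activation
```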
No labels are needed because the act of retrieval itself is the signal. A memory that is consistently retrieved across different queries, different sessions, and different contexts is probably useful. A memory that is never retrieved, despite being in the store, is probably not. The activation scores capture this distinction purely from the pattern of which memories get accessed, without anyone annotating which memories are "good" or "bad."
When Labels Accelerate Learning
While RL does not require labeled data, labels can accelerate learning when they are available. Think of labels as a high-bandwidth feedback channel that supplements the low-bandwidth implicit signals. A single explicit correction ("no, we switched to PostgreSQL last year") provides as much information as hundreds of implicit access events, because it directly identifies a specific memory as outdated and provides the correct replacement.
The practical approach is a hybrid system that learns continuously from implicit signals (available for every interaction) and accelerates learning with explicit signals when they appear (available for 2-10% of interactions). The implicit learning provides the baseline improvement trajectory. The explicit feedback provides targeted corrections that the implicit system would take much longer to discover on its own.
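Structurally, the hybrid might look like the sketch below: a small step size for the ever-present implicit channel and a much larger one for rare explicit corrections. Step sizes and names are hypothetical, chosen only to illustrate the two-channel structure:

```python
# Hypothetical hybrid learner: every interaction nudges a memory's quality
# score via the implicit channel; rare explicit feedback applies a large update.

quality: dict[str, float] = {}          # per-memory quality estimate

IMPLICIT_STEP = 0.02                    # small: applied to ~100% of interactions
EXPLICIT_STEP = 0.5                     # large: applied to the 2-10% with feedback

def update_implicit(memory_id: str, reward: float) -> None:
    q = quality.get(memory_id, 0.0)
    quality[memory_id] = q + IMPLICIT_STEP * (reward - q)

def update_explicit(memory_id: str, rating: float) -> None:
    q = quality.get(memory_id, 0.0)
    quality[memory_id] = q + EXPLICIT_STEP * (rating - q)

# One explicit thumbs-down moves the estimate about as far as dozens of
# implicit events pointing the same way.
update_implicit("mem-42", reward=1.0)     # small nudge upward
update_explicit("mem-42", rating=-1.0)    # large correction downward
print(quality["mem-42"])
```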
For Adaptive Recall, explicit feedback comes through the update and forget tools. When a user or application explicitly updates a memory (correcting outdated information) or forgets a memory (removing incorrect content), these actions provide direct, high-quality learning signals that supplement the automatic activation-based learning. The combination of continuous implicit learning and occasional explicit correction produces faster convergence than either channel alone.
Comparison to Supervised Approaches
Supervised approaches to retrieval optimization (learning to rank with labeled data) can produce high-quality initial rankings if the labeled data is comprehensive and current. But they degrade over time because the labels are static snapshots of a changing world. A labeled dataset created in January does not reflect content added in March or user behavior changes in May.
RL approaches maintain their quality over time because the feedback signal is always current. The access patterns from today reflect today's content and today's user needs. This self-updating property means RL systems require less maintenance than supervised systems: no periodic relabeling, no dataset refresh cycles, no retraining on new data. The system adapts automatically as the world changes.
For teams that have existing labeled data (from past relevance judgments, A/B tests, or quality audits), the best approach is to use the labeled data for initial system calibration and then switch to RL-based learning for ongoing improvement. This gives you the best of both: a strong starting point from supervised learning and continuous adaptation from reinforcement learning.
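One way to realize that handoff, sketched with hypothetical features: a least-squares fit to the labeled judgments provides the initial ranking weights, and online reward-driven updates take over from there:

```python
import numpy as np

# Calibrate-then-adapt sketch. Feature columns (recency, frequency, similarity)
# and the labeled judgments are illustrative placeholders.

X = np.array([[0.9, 0.1, 0.8],      # per-result feature rows
              [0.2, 0.7, 0.4],
              [0.5, 0.5, 0.9]])
y = np.array([1.0, 0.3, 0.8])        # human relevance judgments
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # supervised starting point

LEARNING_RATE = 0.01

def online_update(features: np.ndarray, reward: float) -> None:
    """RL-style correction: move the score for these features toward the reward."""
    global w
    error = reward - float(features @ w)
    w = w + LEARNING_RATE * error * features

online_update(np.array([0.4, 0.6, 0.7]), reward=1.0)  # ongoing adaptation
print(w)
```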
No labels needed. Adaptive Recall learns from access patterns automatically, improving retrieval quality with every interaction.