
How to Implement Online Learning for Retrieval

Online learning updates ranking parameters after each user interaction rather than waiting for batch training cycles. This enables real-time adaptation to changing user behavior, content updates, and distribution shifts. The tradeoff is stability: each individual update is based on a single noisy observation, so the system needs safeguards to prevent oscillation and overfitting.

Before You Start

You need a retrieval system with a parameterized ranking function (weights for similarity, recency, frequency, confidence, or other factors) and a feedback mechanism that provides reward signals quickly enough for near-real-time updates. If feedback is delayed by hours or days (like task completion metrics), batch learning is more appropriate. Online learning works best when feedback is available within seconds to minutes of the retrieval event.

Step-by-Step Implementation

Step 1: Choose updateable ranking parameters.
Identify which components of your ranking function can be adjusted incrementally without destabilizing the system. Good candidates are the relative weights of ranking factors (how much weight to give similarity versus recency), per-memory quality scores (boosting or demoting individual memories based on usage), and threshold values (minimum similarity score to include a result).
class OnlineRanker:
    def __init__(self):
        # Ranking weights that can be updated online
        self.weights = {
            "similarity": 0.4,
            "recency": 0.2,
            "frequency": 0.2,
            "confidence": 0.2,
        }
        # Per-memory quality adjustments
        self.memory_boosts = {}
        # Learning parameters
        self.lr = 0.01
        self.update_count = 0

    def score(self, memory, query_similarity):
        base = (
            self.weights["similarity"] * query_similarity
            + self.weights["recency"] * memory["recency_score"]
            + self.weights["frequency"] * memory["freq_score"]
            + self.weights["confidence"] * memory["confidence"]
        )
        boost = self.memory_boosts.get(memory["id"], 0.0)
        return base + boost
Step 2: Implement incremental updates.
After each interaction, compute the reward and adjust the parameters that contributed to the outcome. If a retrieval event produced a positive reward, increase the weights of the factors that were dominant in the ranking. If it produced a negative reward, decrease them.
def update(self, query_similarity, memory, reward):
    self.update_count += 1
    # Gradient direction: which factors contributed
    # most to the ranking score for this memory?
    factors = {
        "similarity": query_similarity,
        "recency": memory["recency_score"],
        "frequency": memory["freq_score"],
        "confidence": memory["confidence"],
    }
    # Update weights proportional to factor contribution
    for factor, value in factors.items():
        gradient = reward * value
        self.weights[factor] += self.lr * gradient
    # Normalize weights to sum to 1.0
    total = sum(self.weights.values())
    if total > 0:
        for k in self.weights:
            self.weights[k] /= total
    # Update per-memory quality boost
    mid = memory["id"]
    current_boost = self.memory_boosts.get(mid, 0.0)
    self.memory_boosts[mid] = current_boost + (
        self.lr * reward * 0.1
    )
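To make the flow concrete, here is a minimal sketch of how score and update might be wired into a retrieval loop. The handle_query and handle_feedback functions, the (similarity, memory) candidate pairs, and the +1/-1 reward mapping are illustrative assumptions rather than part of the ranker itself; adapt them to your own retrieval and feedback plumbing.

# Hypothetical glue code around OnlineRanker; retrieval and feedback
# plumbing are placeholders for your own system.
ranker = OnlineRanker()

def handle_query(candidates_with_similarity, top_k=5):
    # candidates_with_similarity: list of (query_similarity, memory) pairs,
    # where similarity comes from your vector index
    scored = [
        (ranker.score(memory, sim), sim, memory)
        for sim, memory in candidates_with_similarity
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]

def handle_feedback(sim, memory, was_useful):
    # Map the observed outcome to a scalar reward; +1 / -1 is one simple choice
    reward = 1.0 if was_useful else -1.0
    ranker.update(sim, memory, reward)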
Step 3: Add a learning rate schedule.
The learning rate should decrease over time as the system accumulates more experience. Early on, with few observations, larger updates are appropriate because the system is still exploring the parameter space. As experience accumulates and the parameters approach optimal values, smaller updates prevent overshooting.
def get_learning_rate(self):
    # Inverse decay schedule: fast early, slow later
    return self.lr / (1 + 0.001 * self.update_count)

def update_with_schedule(self, query_similarity, memory, reward):
    current_lr = self.get_learning_rate()
    factors = {
        "similarity": query_similarity,
        "recency": memory["recency_score"],
        "frequency": memory["freq_score"],
        "confidence": memory["confidence"],
    }
    for factor, value in factors.items():
        gradient = reward * value
        self.weights[factor] += current_lr * gradient
    total = sum(self.weights.values())
    if total > 0:
        for k in self.weights:
            self.weights[k] /= total
    self.update_count += 1
Step 4: Add stability bounds.
Constrain each parameter to a valid range to prevent extreme updates from dominating the ranking. No single weight should drop to zero (eliminating a factor entirely) or dominate at 1.0 (making all other factors irrelevant). Clip parameter updates to stay within bounds.
# Class attributes on OnlineRanker
MIN_WEIGHT = 0.05
MAX_WEIGHT = 0.70
MAX_BOOST = 0.5
MIN_BOOST = -0.3

def clip_and_normalize(self):
    for k in self.weights:
        self.weights[k] = max(
            self.MIN_WEIGHT,
            min(self.MAX_WEIGHT, self.weights[k])
        )
    total = sum(self.weights.values())
    for k in self.weights:
        self.weights[k] /= total

def clip_boost(self, memory_id):
    if memory_id in self.memory_boosts:
        self.memory_boosts[memory_id] = max(
            self.MIN_BOOST,
            min(self.MAX_BOOST, self.memory_boosts[memory_id])
        )
Step 5: Handle non-stationarity.
User behavior changes over time. A ranking strategy that worked well last month might not work this month because the user's project changed, new content was added, or their preferences evolved. Detect distribution shifts by monitoring the running average reward. If the average drops significantly, increase the learning rate temporarily to allow faster adaptation.
def detect_shift(self, recent_rewards, window=50):
    if len(recent_rewards) < window * 2:
        return False
    old_avg = sum(recent_rewards[-window * 2:-window]) / window
    new_avg = sum(recent_rewards[-window:]) / window
    # Significant drop suggests distribution shift
    return new_avg < old_avg * 0.8

def handle_shift(self):
    # Raise the learning rate to allow faster adaptation
    self.lr = self.lr * 5  # Schedule will bring it back down over time
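One way to wire shift detection into the feedback path is to keep a bounded list of recent rewards and check it on every update. The sketch below assumes the update_with_schedule, detect_shift, and handle_shift methods defined above; the record_feedback helper and the 1,000-reward history cap are illustrative choices.

# Hypothetical sketch: track recent rewards and react to shifts
# inside the feedback handler.
recent_rewards = []

def record_feedback(ranker, sim, memory, reward):
    ranker.update_with_schedule(sim, memory, reward)
    recent_rewards.append(reward)
    if len(recent_rewards) > 1000:
        del recent_rewards[:-1000]  # keep a bounded history
    if ranker.detect_shift(recent_rewards):
        ranker.handle_shift()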
Step 6: Monitor convergence.
Track the magnitude of parameter updates over time. A converging system shows decreasing update magnitudes as parameters stabilize. An oscillating system shows persistent large updates, which indicates the learning rate is too high or the reward signal is too noisy.

Plot the rolling average of update magnitudes alongside the reward trend. Both should stabilize over time. If the reward is increasing but updates are still large, the system is learning but has not converged. If the reward is flat but updates are large, the system is oscillating. If both are stable, the system has converged to an effective policy.
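A small monitoring helper can maintain these rolling averages without any plotting dependencies. This is a sketch rather than part of the ranker: the ConvergenceMonitor class, its window size, and the L1 measure of update magnitude are illustrative choices. Call record with a copy of the weights taken before each update, and log snapshot periodically to whatever dashboard you already use.

from collections import deque

class ConvergenceMonitor:
    # Hypothetical helper: rolling averages of update magnitude and reward
    def __init__(self, window=100):
        self.update_magnitudes = deque(maxlen=window)
        self.rewards = deque(maxlen=window)

    def record(self, old_weights, new_weights, reward):
        # L1 change in the weight vector for this update
        magnitude = sum(
            abs(new_weights[k] - old_weights[k]) for k in new_weights
        )
        self.update_magnitudes.append(magnitude)
        self.rewards.append(reward)

    def snapshot(self):
        if not self.update_magnitudes:
            return None
        return {
            "avg_update_magnitude": (
                sum(self.update_magnitudes) / len(self.update_magnitudes)
            ),
            "avg_reward": sum(self.rewards) / len(self.rewards),
        }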

Online Learning in Adaptive Recall

Adaptive Recall implements online learning natively through ACT-R activation dynamics. Every retrieval event updates the activation history of each involved memory immediately, without waiting for batch processing. The activation equation computes current activation from the complete access history, weighted by recency, which provides natural learning rate decay (recent events have more influence) and non-stationarity handling (old events fade automatically). This is mathematically equivalent to online learning with exponential recency weighting, but it is expressed as a single equation rather than an incremental update procedure.
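For intuition, the standard ACT-R base-level learning equation computes activation as the log of summed, decayed access strengths. The sketch below illustrates that equation in isolation; it is not Adaptive Recall's implementation, and the decay value of 0.5 is simply the conventional ACT-R default.

import math
import time

def base_level_activation(access_times, decay=0.5, now=None):
    # ACT-R-style base-level activation from an access history:
    # each past access contributes (time since access) ** -decay,
    # so recent accesses dominate and older ones fade automatically.
    now = now if now is not None else time.time()
    strengths = [(now - t) ** (-decay) for t in access_times if now > t]
    if not strengths:
        return float("-inf")  # no usable access history
    return math.log(sum(strengths))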

Get real-time retrieval improvement without building online learning infrastructure. Adaptive Recall's activation dynamics update with every interaction.

Get Started Free