
How to Use Multi-Armed Bandits for Search Ranking

Multi-armed bandits provide a principled framework for balancing exploration (trying new ranking strategies) with exploitation (using the strategy that has performed best so far). Applied to search ranking, bandits let the system discover optimal ranking parameters automatically while maintaining quality for users during the learning process.

Before You Start

You need a retrieval system with at least two viable ranking strategies (or a parameterized ranking function with tunable values) and a feedback mechanism that provides reward signals after results are served. The bandit framework selects which strategy to use for each query and learns from the outcomes. If you only have one strategy and no parameters to tune, bandits do not apply.

Step-by-Step Implementation

Step 1: Define the arms.
Each arm represents a distinct ranking strategy or parameter configuration. In a memory retrieval system, arms might represent different weight combinations for the ranking factors: how much weight to give similarity, recency, access frequency, and confidence. Each arm is a specific set of weights that produces a different ranking order.
arms = {
    "similarity_heavy": {"similarity": 0.7, "recency": 0.1, "frequency": 0.1, "confidence": 0.1},
    "recency_heavy":    {"similarity": 0.3, "recency": 0.4, "frequency": 0.1, "confidence": 0.2},
    "balanced":         {"similarity": 0.4, "recency": 0.2, "frequency": 0.2, "confidence": 0.2},
    "confidence_heavy": {"similarity": 0.3, "recency": 0.1, "frequency": 0.1, "confidence": 0.5},
}
Step 2: Choose a bandit algorithm.
The three most practical algorithms for retrieval ranking are epsilon-greedy (simplest), Upper Confidence Bound (UCB, best for stationary environments), and Thompson sampling (best overall for most applications).

Epsilon-greedy selects the best-known arm with probability 1-epsilon and a random arm with probability epsilon. It is simple to implement and understand, but the exploration rate is fixed: the system explores as much after 10,000 interactions as it did after 10, which is wasteful once the best arm is well established.
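As a minimal sketch, an epsilon-greedy selector over the same arm names might look like the following. The EpsilonGreedyBandit class and its default epsilon value are illustrative choices, not part of the system described in this guide.

import random

class EpsilonGreedyBandit:
    def __init__(self, arm_names, epsilon=0.1):
        self.arms = list(arm_names)
        self.epsilon = epsilon                          # fixed exploration rate
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}   # running mean reward per arm

    def select_arm(self):
        # Explore with probability epsilon, otherwise exploit the best-known arm
        if random.random() < self.epsilon:
            return random.choice(self.arms)
        return max(self.values, key=self.values.get)

    def update(self, arm, reward):
        # Incremental update of the running mean reward
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

Because epsilon never decays, this sketch keeps exploring at the same rate forever; a decaying epsilon schedule is a common refinement.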

UCB (Upper Confidence Bound) selects the arm with the highest upper confidence bound on its estimated reward. Arms that have been tried fewer times have wider confidence intervals, which drives exploration naturally. As more data accumulates, confidence intervals narrow, and the algorithm increasingly exploits the best arm. UCB works best when the arm values are stable over time.
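A comparable UCB1 sketch, assuming rewards have already been normalized to the [0, 1] range; the class name and structure are illustrative rather than part of the system above.

import math

class UCB1Bandit:
    def __init__(self, arm_names):
        self.arms = list(arm_names)
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}   # running mean reward per arm
        self.total_pulls = 0

    def select_arm(self):
        # Pull each arm once before applying the confidence-bound rule
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        # Otherwise pick the arm with the highest upper confidence bound
        def upper_bound(arm):
            bonus = math.sqrt(2 * math.log(self.total_pulls) / self.counts[arm])
            return self.values[arm] + bonus
        return max(self.arms, key=upper_bound)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total_pulls += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

The bonus term shrinks as an arm accumulates pulls, which is the mechanism behind the narrowing confidence intervals described above.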

Thompson sampling maintains a probability distribution over each arm's reward rate and samples from these distributions to make selections. Arms with uncertain values get explored because high samples are possible. Arms with well-established values get exploited because the distribution is concentrated around the known mean. Thompson sampling adapts its exploration rate automatically and handles non-stationary environments better than UCB.

Step 3: Implement the selection policy.
Build the logic that chooses which arm to use for each incoming query. The implementation depends on the algorithm chosen in step 2.
import numpy as np

class ThompsonBandit:
    def __init__(self, arm_names):
        self.arms = arm_names
        # Beta distribution parameters for each arm
        self.alpha = {arm: 1.0 for arm in arm_names}
        self.beta = {arm: 1.0 for arm in arm_names}
        self.counts = {arm: 0 for arm in arm_names}

    def select_arm(self):
        # Draw one sample per arm and pick the arm with the highest draw
        samples = {}
        for arm in self.arms:
            samples[arm] = np.random.beta(self.alpha[arm], self.beta[arm])
        return max(samples, key=samples.get)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Normalize reward to [0, 1] range (raw rewards assumed to fall in [-3, 3])
        normalized = max(0.0, min(1.0, (reward + 3) / 6))
        # Update Beta distribution parameters
        self.alpha[arm] += normalized
        self.beta[arm] += (1 - normalized)

    def get_estimates(self):
        return {
            arm: self.alpha[arm] / (self.alpha[arm] + self.beta[arm])
            for arm in self.arms
        }
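A brief usage sketch, assuming the arms dictionary from step 1; the reward value shown is a placeholder.

bandit = ThompsonBandit(list(arms.keys()))

arm = bandit.select_arm()            # e.g. "balanced"
# ... serve results ranked with arms[arm], then collect feedback ...
bandit.update(arm, reward=1.5)       # placeholder raw reward; normalized inside update()

print(bandit.get_estimates())        # current mean reward estimate per arm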
Step 4: Track arm performance.
Log which arm was selected for each query along with the resulting reward. This data serves dual purposes: it feeds the bandit's update mechanism, and it provides audit trails for understanding which strategies work best and why.
def serve_with_bandit(query, user_id, bandit, arms_config):
    # Select which ranking strategy to use
    selected_arm = bandit.select_arm()
    params = arms_config[selected_arm]

    # Retrieve results using the selected strategy
    results = retrieve_with_params(query, user_id, params)

    # Log the selection for later update
    event_id = log_bandit_event(
        query=query,
        arm=selected_arm,
        params=params,
        results=results
    )

    return results, event_id, selected_arm
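log_bandit_event is left undefined in this guide. A minimal, hypothetical version might write the selection to an in-memory store keyed by a generated event ID; a production system would persist this to a database or event log.

import time
import uuid

BANDIT_EVENTS = {}   # illustrative in-memory store; a real system would persist events

def log_bandit_event(query, arm, params, results):
    event_id = str(uuid.uuid4())
    BANDIT_EVENTS[event_id] = {
        "event_id": event_id,
        "timestamp": time.time(),
        "query": query,
        "arm": arm,
        "params": params,
        "result_ids": [r.get("id") for r in results],   # assumes each result is a dict with an "id"
    }
    return event_id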
Step 5: Update arm estimates.
After feedback signals are collected for each interaction, update the bandit's estimate of the selected arm's value. The update shifts the arm's distribution toward the observed reward, gradually refining the system's understanding of which strategy works best.
def process_feedback_batch(bandit, events):
    for event in events:
        reward = compute_reward(event["event_id"])
        bandit.update(event["arm"], reward)

    # Log current arm estimates for monitoring
    estimates = bandit.get_estimates()
    log_arm_estimates(estimates)
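compute_reward is also left undefined here. One hedged sketch, assuming feedback signals are stored per event and that raw rewards land in the [-3, 3] range the Thompson update above normalizes from; the signal names, weights, and get_feedback_signals lookup are all hypothetical.

# Hypothetical signal weights; the actual signals depend on your feedback pipeline
SIGNAL_WEIGHTS = {
    "result_clicked": 1.0,
    "result_used_in_answer": 2.0,
    "session_abandoned": -2.0,
    "explicit_thumbs_down": -3.0,
}

def compute_reward(event_id):
    signals = get_feedback_signals(event_id)   # assumed lookup into your feedback store
    raw = sum(SIGNAL_WEIGHTS.get(signal, 0.0) for signal in signals)
    # Clamp to the [-3, 3] range that ThompsonBandit.update() normalizes from
    return max(-3.0, min(3.0, raw))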
Step 6: Add contextual features.
A contextual bandit conditions arm selection on features of the current query and user. Different query types might benefit from different ranking strategies: factual queries might perform best with similarity-heavy ranking, while exploratory queries might perform best with recency-heavy ranking. Contextual bandits learn separate arm preferences for each context.
class ContextualBandit:
    def __init__(self, arm_names, context_features):
        self.arms = arm_names
        # Separate bandit per context bucket
        self.contexts = {}
        self.default = ThompsonBandit(arm_names)

    def _get_context_key(self, features):
        # Simple context bucketing
        query_type = features.get("query_type", "unknown")
        user_type = features.get("user_type", "unknown")
        return f"{query_type}_{user_type}"

    def select_arm(self, features):
        key = self._get_context_key(features)
        if key not in self.contexts:
            self.contexts[key] = ThompsonBandit(self.arms)
        return self.contexts[key].select_arm()

    def update(self, features, arm, reward):
        key = self._get_context_key(features)
        if key in self.contexts:
            self.contexts[key].update(arm, reward)
        self.default.update(arm, reward)
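Tying the pieces together, one possible serving path with the contextual variant. The extract_features helper is hypothetical; retrieve_with_params and log_bandit_event are the functions referenced in steps 4 and 5.

bandit = ContextualBandit(list(arms.keys()), context_features=["query_type", "user_type"])

def serve_contextual(query, user_id, arms_config):
    features = extract_features(query, user_id)   # hypothetical feature extraction
    selected_arm = bandit.select_arm(features)
    params = arms_config[selected_arm]
    results = retrieve_with_params(query, user_id, params)
    event_id = log_bandit_event(query=query, arm=selected_arm, params=params, results=results)
    return results, event_id, selected_arm, features

# Later, once feedback for the event has been collected:
# bandit.update(features, selected_arm, compute_reward(event_id))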

Bandits in Adaptive Recall

Adaptive Recall's cognitive scoring system achieves a similar outcome to contextual bandits through a different mechanism. Rather than selecting between discrete ranking strategies, ACT-R scoring continuously adjusts the relative influence of recency, frequency, spreading activation, and confidence based on accumulated access patterns. The spreading activation component provides contextual sensitivity: the same memory gets different activation levels depending on which entities are active in the current query, creating context-dependent ranking without explicit contextual bandit machinery.
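For reference, the textbook ACT-R activation combines base-level activation (driven by recency and frequency of past accesses) with spreading activation from the current context. The sketch below shows that standard formulation in simplified form; it is not a description of Adaptive Recall's internal implementation, and it assumes all access times are strictly in the past and that context strengths arrive pre-weighted.

import math

def actr_activation(access_times, now, context_strengths, decay=0.5):
    # Base-level activation: each past access contributes (now - t) ** -decay,
    # so recent and frequent accesses raise the base level
    base_level = math.log(sum((now - t) ** (-decay) for t in access_times))
    # Spreading activation: pre-weighted strengths from entities active in the current query
    spreading = sum(context_strengths)
    return base_level + spreading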

Get adaptive ranking without building bandit infrastructure. Adaptive Recall's cognitive scoring adapts to usage patterns automatically.

Get Started Free