How to Track Which AI Predictions Were Correct
Before You Start
Not every AI output is a trackable prediction. A conversational response that says "let me look into that" is not verifiable. A response that says "the API rate limit is 100 requests per minute" is verifiable because you can check it. Before building prediction tracking, audit your system's outputs and identify which categories produce claims that can be checked against reality. Focus on those categories first and expand later.
You also need a way to observe outcomes. In some domains this is straightforward: a coding assistant suggests a fix, and you can check whether the code compiles and tests pass. In other domains it requires more infrastructure: a customer support bot suggests a troubleshooting step, and you need access to the ticket resolution data to know whether it worked. If you cannot observe outcomes for a given prediction type, you cannot track accuracy for it.
Step-by-Step Implementation
A prediction is any AI output that makes a verifiable claim about the world. This includes factual assertions ("the timeout is 30 seconds"), recommendations ("try restarting the service"), retrieved information presented as relevant ("based on past incidents, this is usually a DNS issue"), and confidence-weighted rankings ("the most likely cause is X"). For each prediction type, define the verification method (how you will check if it was correct), the verification window (how long after the prediction you need to wait for the outcome), and the accuracy categories (correct, partially correct, incorrect, unverifiable). Start with two or three prediction types that have clear verification methods and expand from there.
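As a concrete starting point, these definitions can live in a small registry shared by the logging and verification code. The sketch below is illustrative: the type names, windows, and verification-method labels are assumptions to swap for your own.

from datetime import timedelta

# Illustrative prediction-type registry. The type names, windows, and
# verification-method labels here are assumptions, not prescriptions.
PREDICTION_TYPES = {
    "factual_assertion": {
        "verification_method": "check_against_source_of_truth",
        "verification_window": timedelta(days=7),
    },
    "recommendation": {
        "verification_method": "check_ticket_resolution",
        "verification_window": timedelta(days=3),
    },
}

# Per-type windows used by log_prediction below.
VERIFICATION_WINDOWS = {
    name: cfg["verification_window"] for name, cfg in PREDICTION_TYPES.items()
}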
Every time the system produces a trackable prediction, log it as a structured record. The record should include a unique prediction ID, the prediction text, the type, the timestamp, the memory IDs that informed the prediction, the confidence score of each contributing memory at the time, and the expected verification window end. Logging must happen synchronously with the prediction; you cannot reconstruct which memories informed a prediction after the fact because confidence scores and retrieval rankings change over time.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PredictionRecord:
    prediction_id: str
    prediction_text: str
    prediction_type: str
    timestamp: datetime
    verification_deadline: datetime
    contributing_memories: list[dict]  # {memory_id, confidence_at_time, relevance_score}
    outcome: str = "pending"  # "pending", "correct", "partial", "incorrect", "unverifiable"
    outcome_evidence: str = ""
    outcome_timestamp: datetime | None = None

# generate_id, extract_claims, and store_prediction are your own ID,
# claim-extraction, and persistence helpers. VERIFICATION_WINDOWS is the
# per-type window mapping from the registry sketched above.
def log_prediction(response, retrieved_memories, prediction_type):
    record = PredictionRecord(
        prediction_id=generate_id(),
        prediction_text=extract_claims(response),
        prediction_type=prediction_type,
        timestamp=datetime.now(),
        verification_deadline=datetime.now() + VERIFICATION_WINDOWS[prediction_type],
        contributing_memories=[
            {"memory_id": m.id, "confidence_at_time": m.confidence, "relevance_score": m.score}
            for m in retrieved_memories
        ],
    )
    store_prediction(record)
    return record

Outcome collection depends on your domain. For coding assistants, hook into the build and test pipeline: when the user applies a suggestion, check whether the build succeeds and tests pass. For customer support, hook into the ticket system: when a ticket is resolved, check whether the resolution matches the suggested approach. For information retrieval, collect user feedback on whether the retrieved information was accurate. Store outcomes as updates to the prediction record, linking the outcome evidence to the original prediction ID. Process outcome collection asynchronously; outcomes often arrive minutes, hours, or days after the prediction.
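A minimal outcome-recording hook might look like the following sketch. The storage helpers fetch_prediction and save_prediction are assumed, not part of any particular library; your CI pipeline, ticket-system webhook, or feedback handler would call this function.

from datetime import datetime

def record_outcome(prediction_id, outcome, evidence):
    """Attach an observed outcome to a pending prediction (called asynchronously)."""
    record = fetch_prediction(prediction_id)  # assumed storage helper
    if record is None or record.outcome != "pending":
        return  # unknown prediction, or already resolved
    record.outcome = outcome            # "correct", "partial", or "incorrect"
    record.outcome_evidence = evidence  # e.g. "build passed, all tests green"
    record.outcome_timestamp = datetime.now()
    save_prediction(record)             # assumed persistence helper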
When an outcome arrives, match it to the pending prediction and update the record. Then trace from the prediction back to the contributing memories. For each contributing memory, record whether it was associated with a correct or incorrect prediction. This creates an accuracy history for each memory: over time, each memory accumulates a track record of how often predictions that used it turned out to be correct. The matching pipeline should handle ambiguous outcomes where the prediction was partially correct. In these cases, assign a fractional accuracy score (0.0 for wrong, 0.5 for partial, 1.0 for correct) rather than forcing a binary classification.
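A sketch of that trace-back step, assuming a hypothetical append_accuracy_event helper that grows each memory's track record; the fractional scores match the mapping described above.

# Fractional accuracy scores for resolved outcomes.
OUTCOME_SCORES = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}

def propagate_outcome(record):
    """Append this prediction's result to each contributing memory's history."""
    score = OUTCOME_SCORES.get(record.outcome)
    if score is None:
        return  # "pending" and "unverifiable" outcomes carry no signal
    for contribution in record.contributing_memories:
        append_accuracy_event(          # assumed storage helper
            memory_id=contribution["memory_id"],
            prediction_id=record.prediction_id,
            score=score,
            timestamp=record.outcome_timestamp,
        )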
Compute a rolling accuracy rate for each memory based on its prediction history. A memory that contributed to 20 predictions, 18 of which were correct, has an accuracy rate of 0.9. A memory that contributed to 5 predictions, only 2 of which were correct, has an accuracy rate of 0.4. Use the accuracy rate to adjust the memory's confidence during the periodic consolidation review. Memories with accuracy rates above a threshold (0.7 is a reasonable starting point) receive a small confidence boost. Memories with accuracy rates below a threshold (0.3) receive a confidence penalty. Memories with too few predictions (fewer than 3) do not receive any accuracy-based adjustment because the sample is too small to be meaningful.
def apply_accuracy_feedback(memory, min_predictions=3, boost_threshold=0.7, penalty_threshold=0.3):
    """Nudge a memory's confidence based on its resolved prediction track record."""
    predictions = get_predictions_for_memory(memory.id)
    # Unverifiable predictions carry no signal, so exclude them along with pending ones.
    resolved = [p for p in predictions if p.outcome in ("correct", "partial", "incorrect")]
    if len(resolved) < min_predictions:
        return 0.0  # sample too small for a meaningful adjustment
    accuracy = sum(
        1.0 if p.outcome == "correct" else 0.5 if p.outcome == "partial" else 0.0
        for p in resolved
    ) / len(resolved)
    delta = 0.0  # no adjustment in the neutral band between the thresholds
    if accuracy >= boost_threshold:
        delta = 0.1 * (accuracy - boost_threshold) / (1.0 - boost_threshold)
    elif accuracy <= penalty_threshold:
        delta = -0.2 * (penalty_threshold - accuracy) / penalty_threshold
    memory.confidence = min(10.0, max(1.0, memory.confidence + delta))
    return delta

Surface prediction accuracy metrics so the team can monitor the system's learning trajectory. Key views include: overall accuracy rate over time (is the system getting more accurate?), accuracy by prediction type (which categories is the system good at, and which is it struggling with?), accuracy by memory source (are memories from certain sources more reliable?), and individual memory accuracy histories (which specific memories consistently lead to wrong predictions?). The dashboard should also flag memories with declining accuracy rates, ones that used to contribute to correct predictions but have started contributing to incorrect ones, which may indicate that the information has become outdated.
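As one possible shape for the per-type view, the sketch below folds resolved predictions into a fractional accuracy per prediction type, reusing the OUTCOME_SCORES mapping from earlier; the other views follow the same pattern with different group-by keys.

from collections import defaultdict

def accuracy_by_type(predictions):
    """One dashboard view: fractional accuracy rate grouped by prediction type."""
    counts = defaultdict(int)
    scores = defaultdict(float)
    for p in predictions:
        score = OUTCOME_SCORES.get(p.outcome)
        if score is None:
            continue  # skip pending and unverifiable predictions
        counts[p.prediction_type] += 1
        scores[p.prediction_type] += score
    return {t: scores[t] / counts[t] for t in counts}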
Handling Edge Cases
Predictions with no outcome. Some predictions will never receive an outcome signal because the user did not follow up, the verification system was unavailable, or the prediction was about a hypothetical scenario. Mark these as "unverifiable" after the verification window expires and exclude them from accuracy calculations. If more than 50% of predictions end up unverifiable, your outcome collection infrastructure needs improvement.
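A periodic sweep can enforce the expiry; find_pending_past_deadline and save_prediction are assumed storage helpers, named here only for illustration.

from datetime import datetime

def expire_stale_predictions(now=None):
    """Mark pending predictions unverifiable once their window has passed."""
    now = now or datetime.now()
    for record in find_pending_past_deadline(now):  # assumed storage helper
        record.outcome = "unverifiable"
        record.outcome_timestamp = now
        save_prediction(record)  # assumed persistence helper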
Multiple memories contributing to one prediction. When a prediction draws from several memories, it is not always clear which memory was responsible for the prediction being correct or incorrect. The simplest approach is to credit or penalize all contributing memories equally. A more sophisticated approach weights the credit by relevance score: a memory that scored 0.95 relevance contributed more to the prediction than one that scored 0.55, so it should receive more credit or blame.
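A sketch of the relevance-weighted variant, building on the propagate_outcome helper above. Weights are normalized so each prediction distributes one unit of credit regardless of how many memories contributed; the weight field on the accuracy event is an assumed extension.

def propagate_outcome_weighted(record):
    """Distribute credit or blame in proportion to each memory's relevance score."""
    score = OUTCOME_SCORES.get(record.outcome)
    if score is None:
        return  # no signal from pending or unverifiable outcomes
    total_relevance = sum(c["relevance_score"] for c in record.contributing_memories)
    if total_relevance == 0:
        return
    for c in record.contributing_memories:
        append_accuracy_event(          # assumed storage helper
            memory_id=c["memory_id"],
            prediction_id=record.prediction_id,
            score=score,
            weight=c["relevance_score"] / total_relevance,  # assumed extra field
            timestamp=record.outcome_timestamp,
        )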
Temporal drift. A memory that was accurate six months ago may not be accurate today. Weight recent prediction outcomes more heavily than older ones when computing accuracy rates. A rolling window of 30 to 90 days for accuracy calculations ensures that the system responds to changing accuracy rather than being anchored to historical performance.
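One way to implement the window, as a sketch: filter the per-memory prediction history by outcome timestamp before averaging. The 60-day default is an arbitrary midpoint of the suggested range.

from datetime import datetime, timedelta

def rolling_accuracy(predictions, window_days=60, now=None):
    """Compute accuracy over only the recent window so stale history fades out."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=window_days)
    recent = [
        p for p in predictions
        if p.outcome in OUTCOME_SCORES
        and p.outcome_timestamp is not None
        and p.outcome_timestamp >= cutoff
    ]
    if not recent:
        return None  # no recent signal; skip accuracy-based adjustment
    return sum(OUTCOME_SCORES[p.outcome] for p in recent) / len(recent)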
Adaptive Recall tracks memory accuracy through its evidence-gated learning pipeline. Confidence scores automatically reflect real-world accuracy over time.