How to Compress Context Without Losing Meaning
Understanding What You Are Compressing
Before applying compression, you need to know where your tokens are being spent. Most LLM applications have three major token consumers: the system prompt and instructions, the conversation history, and retrieved context from RAG or memory systems. Each responds to different compression techniques because its information structure is different.
System prompts contain authored instructions that are relatively stable across calls. They tend to be verbose because developers write them in natural language with explanations and examples. Conversation history grows linearly with conversation length, accumulating both useful context and noise (greetings, acknowledgments, tangential discussions). Retrieved context comes from external documents and memories that may contain redundant information when multiple retrieved items cover overlapping topics.
Step-by-Step Compression Pipeline
Measure the token count of each prompt section across a representative sample of your API calls. Calculate the average, 90th percentile, and maximum for each section. This tells you where compression will have the most impact. If your system prompt uses 4,000 tokens but conversation history regularly exceeds 20,000, focus your compression effort on conversation history because even a 50% reduction in the system prompt saves only 2,000 tokens while a 30% reduction in history saves 6,000.
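A minimal sketch of this measurement step, assuming you log each call's sections separately and that an OpenAI-style tokenizer (tiktoken) approximates your model's tokenization closely enough for budgeting; the section names here are illustrative:
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    return len(enc.encode(text))

def section_stats(calls):
    # calls: list of dicts such as {"system": ..., "history": ..., "retrieved": ...}
    stats = {}
    for section in ("system", "history", "retrieved"):
        counts = [count_tokens(call[section]) for call in calls]
        stats[section] = {
            "avg": float(np.mean(counts)),
            "p90": float(np.percentile(counts, 90)),
            "max": int(np.max(counts)),
        }
    return stats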
Rewrite your system prompt to remove filler phrases, redundant modifiers, and verbose explanations that do not change model behavior. LLMs respond to the information content of instructions, not their politeness or verbosity. "You are a helpful assistant that always tries to provide accurate and detailed responses to user questions" compresses to "Provide accurate, detailed responses" with identical behavior in most models.
# Before: 47 tokens
system_prompt = """You are a helpful customer support assistant for
our software product. You should always be polite and professional
in your responses. When a customer asks a question, you should try
to provide a comprehensive and accurate answer based on our product
documentation and knowledge base."""
# After: 19 tokens
system_prompt = """Customer support assistant for [Product].
Answer from product docs and knowledge base.
Tone: professional."""

When RAG retrieves multiple documents, they often contain overlapping information. Two documents about the same API endpoint might both describe the request format, parameters, and response structure with slightly different wording. Semantic deduplication detects this overlap using embedding similarity and merges redundant content into a single representation.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def deduplicate_sentences(sentences, threshold=0.85):
    embeddings = model.encode(sentences)
    keep = []  # indices of sentences retained so far
    for i in range(len(sentences)):
        is_duplicate = False
        for j in keep:
            # Cosine similarity between candidate i and an already-kept sentence j
            sim = np.dot(embeddings[i], embeddings[j])
            sim /= (np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j]))
            if sim > threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            keep.append(i)
    return [sentences[i] for i in keep]

Set the similarity threshold based on your domain. A threshold of 0.85 catches near-duplicate sentences with different wording. A threshold of 0.90 catches only very close paraphrases. Lower thresholds risk merging sentences that are related but not truly redundant.
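As a quick check on the threshold, run the function above on a few overlapping sentences and compare what survives at different settings. The sentences below are made up for illustration:
retrieved_sentences = [
    "The /users endpoint accepts a POST request with a JSON body.",
    "Send a POST request with a JSON payload to the /users endpoint.",
    "Rate limits are 100 requests per minute per API key.",
]

# A stricter threshold keeps close paraphrases; a looser one merges them.
print(deduplicate_sentences(retrieved_sentences, threshold=0.90))
print(deduplicate_sentences(retrieved_sentences, threshold=0.80))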
Extractive compression keeps only the most information-dense sentences from each retrieved document and discards the rest. Unlike summarization, which rewrites the content, extractive compression preserves the exact original phrasing, which is important when the content contains specific values, code snippets, or technical terms that must not be altered.
def extractive_compress(text, query, keep_ratio=0.5):
    # Reuses the SentenceTransformer model defined above
    sentences = text.split('. ')
    query_embedding = model.encode([query])[0]
    sent_embeddings = model.encode(sentences)
    scores = []
    for i, emb in enumerate(sent_embeddings):
        # Cosine similarity between the query and each sentence
        relevance = np.dot(query_embedding, emb)
        relevance /= (np.linalg.norm(query_embedding) * np.linalg.norm(emb))
        scores.append((i, relevance))
    scores.sort(key=lambda x: -x[1])
    keep_count = max(1, int(len(sentences) * keep_ratio))
    # Keep the top-scoring sentences, restored to their original order
    kept_indices = sorted([s[0] for s in scores[:keep_count]])
    return '. '.join(sentences[i] for i in kept_indices)

Not all conversation turns are equally important. Recent turns contain the immediate context the model needs. Turns where the user made decisions, set preferences, or provided requirements contain information that must be preserved even if they are old. Turns with greetings, confirmations, or tangential discussions can be safely removed or summarized.
A practical approach preserves the last 3 to 5 turns verbatim, identifies "anchor" turns containing decisions or requirements using keyword detection, and summarizes everything else into a brief context paragraph. This typically reduces conversation history by 60 to 80% while preserving all critical context.
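A minimal sketch of that approach, assuming each turn is a dict with "role" and "content" keys. The keyword list is illustrative, and summarize_turns is a placeholder you would replace with an LLM call or the extractive compressor above:
ANCHOR_KEYWORDS = ("decide", "must", "require", "always", "never", "prefer", "deadline")

def summarize_turns(turns):
    # Placeholder summarizer: swap in an LLM call or extractive_compress;
    # here we simply truncate each turn to keep the sketch runnable.
    return " ".join(t["content"][:80] for t in turns)

def compress_history(turns, keep_recent=4):
    # turns: list of {"role": ..., "content": ...}, oldest first
    recent = turns[-keep_recent:]
    older = turns[:-keep_recent]

    anchors, rest = [], []
    for turn in older:
        text = turn["content"].lower()
        if any(kw in text for kw in ANCHOR_KEYWORDS):
            anchors.append(turn)  # decisions, preferences, requirements: keep verbatim
        else:
            rest.append(turn)     # greetings, confirmations, tangents: summarize

    compressed = []
    if rest:
        compressed.append({"role": "system",
                           "content": "Earlier context: " + summarize_turns(rest)})
    compressed.extend(anchors)
    compressed.extend(recent)
    return compressed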
Run your standard evaluation benchmark with the full uncompressed context and again with the compressed context. Compare the output quality metrics (accuracy, relevance, completeness) to verify that compression does not degrade the model's performance. A small quality drop (under 3%) is acceptable if the token savings are significant. A larger drop means your compression is too aggressive and you should increase the keep ratio or preserve more anchor turns.
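A sketch of that comparison, where run_benchmark stands in for your own evaluation harness: it takes the eval cases and a context-building function and returns a score between 0 and 1:
def compression_is_acceptable(eval_cases, run_benchmark,
                              build_full_context, build_compressed_context,
                              max_drop=0.03):
    # Score the same benchmark twice: once with full context, once compressed.
    full_score = run_benchmark(eval_cases, build_full_context)
    compressed_score = run_benchmark(eval_cases, build_compressed_context)
    drop = full_score - compressed_score
    print(f"full={full_score:.3f} compressed={compressed_score:.3f} drop={drop:.3f}")
    return drop <= max_drop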
When to Use Compression vs External Memory
Compression is a tactical solution for reducing token usage within a single API call. External memory is an architectural solution that eliminates the need to include persistent knowledge in the context at all. If your token usage is dominated by conversation history or accumulated knowledge, external memory is more effective than compression because it removes the information from the context entirely rather than making it smaller.
Compression and external memory work well together. Use compression for the dynamic content that must be in the context (system prompt, recent turns, retrieved results) and use external memory for persistent knowledge that can be retrieved on demand. This combination minimizes token usage while maximizing the knowledge available to the model.
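Sketched together, assuming a hypothetical memory object exposing a retrieve method, and reusing the compression helpers defined earlier:
def build_prompt(system_prompt, turns, query, memory):
    # memory is a hypothetical external store with retrieve(query, top_k);
    # deduplicate_sentences and compress_history are defined above.
    retrieved = memory.retrieve(query, top_k=5)     # persistent knowledge, fetched on demand
    retrieved = deduplicate_sentences(retrieved)    # compress what must sit in the context
    history = compress_history(turns)               # recent turns plus anchor turns only
    context = "\n".join(retrieved)
    user_message = f"Context:\n{context}\n\nQuestion: {query}"
    return system_prompt, history, user_message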
Adaptive Recall stores knowledge outside the context window and retrieves only what matters for each query. No compression needed for persistent information.