
Semantic Compression vs Text Compression

Text compression removes characters and words to make content shorter. Semantic compression removes meaning-units (sentences, paragraphs) based on their information density and relevance. Text compression preserves everything but says it more concisely. Semantic compression preserves the most important content and discards the rest. Each operates at a different level of the content hierarchy, and combining them achieves the best results.

Text Compression: Token-Level Reduction

Text compression operates on the syntax of the content. It removes tokens that do not carry unique information: filler phrases, redundant modifiers, verbose constructions, and syntactic padding. The goal is to express the same meaning in fewer tokens.

Common text compression techniques include removing filler phrases ("in order to" becomes "to"), replacing verbose constructions with concise equivalents ("due to the fact that" becomes "because"), eliminating redundant modifiers ("completely unique" becomes "unique"), and compacting lists ("option A, option B, option C, and option D" becomes "options A through D" when appropriate).
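These substitutions can be sketched as a small lookup-driven pass. The phrase table below is illustrative, not exhaustive, and the function name `text_compress` is our own:

```python
import re

# Illustrative phrase table: verbose construction -> concise equivalent.
REPLACEMENTS = {
    r"\bin order to\b": "to",
    r"\bdue to the fact that\b": "because",
    r"\bcompletely unique\b": "unique",
}

def text_compress(text: str) -> str:
    """Apply each phrase substitution, case-insensitively."""
    for pattern, concise in REPLACEMENTS.items():
        text = re.sub(pattern, concise, text, flags=re.IGNORECASE)
    # Collapse any doubled spaces left behind by the substitutions.
    return re.sub(r"\s{2,}", " ", text).strip()
```

For example, `text_compress("We retry in order to recover.")` yields "We retry to recover." A production version would use a much larger phrase table and guard against substitutions inside quoted text or code.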

Text compression preserves all factual content. A text-compressed version of a paragraph contains every claim, every value, and every entity from the original. What it loses is readability, natural flow, and sometimes nuance that was carried by the verbose phrasing. For LLM context, this tradeoff is almost always worthwhile because the model processes information content, not literary quality.

The limitation of text compression is its modest reduction ratio. Typical text compression achieves 15 to 30% token reduction. For a 50k-token context that needs to fit in 20k tokens, text compression alone cannot achieve the required 60% reduction.
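The arithmetic behind that example, as a quick sanity check:

```python
original = 50_000  # tokens in the source context
target = 20_000    # tokens available in the window

# Reduction needed to fit: 1 - 20k/50k = 60%
required_reduction = 1 - target / original

# Even at the optimistic 30% rate, text compression leaves 35,000 tokens,
# still well over the 20,000-token budget.
best_text_only = original * (1 - 0.30)
```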

Semantic Compression: Meaning-Level Reduction

Semantic compression operates on the information structure of the content. It evaluates which sentences and paragraphs carry the most unique, relevant information and removes those that contribute the least. The goal is to keep the highest-value content and discard the lowest-value content.

Semantic compression uses embedding models to measure two properties of each sentence:

Information density measures how much unique information a sentence contributes relative to its neighbors. A sentence that says something new and specific (like a configuration value or a deadline) has high density. A sentence that restates what the previous sentence said in different words has low density. Density is measured by computing the embedding similarity between each sentence and its neighbors: low similarity to neighbors means high density (unique content); high similarity means low density (redundant content).

Relevance to query measures how related the sentence is to the current question. Sentences with high relevance scores are critical to answering the query. Sentences with low relevance scores are tangential and can be safely removed. This makes semantic compression query-dependent, which is both its strength and its requirement.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_compress(text, query, keep_ratio=0.5):
    sentences = text.split('. ')
    embeddings = model.encode(sentences)
    query_emb = model.encode([query])[0]

    scores = []
    for i, emb in enumerate(embeddings):
        # relevance: cosine similarity between sentence and query
        relevance = np.dot(query_emb, emb)
        relevance /= (np.linalg.norm(query_emb) * np.linalg.norm(emb))

        # information density: low similarity to neighbors = high density
        neighbors = []
        if i > 0:
            sim = np.dot(emb, embeddings[i - 1])
            sim /= (np.linalg.norm(emb) * np.linalg.norm(embeddings[i - 1]))
            neighbors.append(sim)
        if i < len(embeddings) - 1:
            sim = np.dot(emb, embeddings[i + 1])
            sim /= (np.linalg.norm(emb) * np.linalg.norm(embeddings[i + 1]))
            neighbors.append(sim)
        density = (1.0 - sum(neighbors) / len(neighbors)) if neighbors else 1.0

        combined = 0.7 * relevance + 0.3 * density
        scores.append((i, combined))

    # keep the top-scoring sentences, restored to their original order
    scores.sort(key=lambda x: -x[1])
    keep_count = max(1, int(len(sentences) * keep_ratio))
    kept_indices = sorted(s[0] for s in scores[:keep_count])
    return '. '.join(sentences[i] for i in kept_indices)

Semantic compression achieves 40 to 70% token reduction while preserving the most important content. The tradeoff is that some information is permanently removed. If the removed sentences contained details that are needed later, those details are gone.

Comparison

Property          | Text Compression                  | Semantic Compression
What it removes   | Redundant words and phrases       | Low-value sentences and paragraphs
Typical reduction | 15-30%                            | 40-70%
Information loss  | None (same facts, fewer words)    | Some (low-priority content removed)
Query-dependent   | No                                | Yes (relevance scoring needs the query)
Computation cost  | Negligible (string operations)    | Moderate (embedding computation)
Best for          | System prompts, static instructions | Retrieved documents, conversation history

Combining Both Approaches

The most effective compression pipeline applies both techniques in sequence. First, semantic compression removes low-value sentences based on the current query. Then, text compression tightens the remaining sentences by removing verbose phrasing. The combined result is both focused (only relevant content remains) and concise (the remaining content is expressed efficiently).
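A minimal end-to-end sketch of that ordering. To keep it self-contained and runnable without an embedding model, sentence relevance here is scored by word overlap with the query, a crude stand-in for the embedding-based scoring described above; the phrase table and all function names are illustrative:

```python
import re

PHRASES = {r"\bin order to\b": "to", r"\bdue to the fact that\b": "because"}

def text_pass(text):
    """Token-level pass: replace verbose constructions with concise ones."""
    for pattern, concise in PHRASES.items():
        text = re.sub(pattern, concise, text, flags=re.IGNORECASE)
    return text

def semantic_pass(text, query, keep_ratio=0.5):
    """Meaning-level pass: keep the sentences sharing the most words with
    the query (word overlap stands in for embedding similarity)."""
    sentences = [s for s in text.split('. ') if s]
    q_words = set(query.lower().split())
    scored = [(i, len(q_words & set(s.lower().split())))
              for i, s in enumerate(sentences)]
    scored.sort(key=lambda x: -x[1])
    keep = max(1, int(len(sentences) * keep_ratio))
    kept = sorted(i for i, _ in scored[:keep])
    return '. '.join(sentences[i] for i in kept)

def compress(text, query, keep_ratio=0.5):
    """Semantic pass first (drop low-value sentences),
    then text pass (tighten what remains)."""
    return text_pass(semantic_pass(text, query, keep_ratio))
```

Run against a four-sentence document with the query "why did the database connection time out", the semantic pass drops the off-topic sentences and the text pass then rewrites "due to the fact that" and "in order to" in the survivors. The ordering matters: running the semantic pass first means the text pass only pays its (small) cost on sentences that survive.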

A 10,000-token document processed through this pipeline might drop to 5,000 tokens after semantic compression (keep_ratio=0.5), then to roughly 4,000 tokens after text compression trims another 20% from the surviving sentences, a combined reduction of about 60%.

When Neither Is Enough

Both compression techniques are strategies for fitting more information into a fixed context window. They work well for moderate reduction needs (up to 60 to 70%) but struggle when the knowledge base is orders of magnitude larger than the context window. If your application needs access to 100,000 documents but can fit 5 in context at a time, no compression technique bridges that gap. You need a different architecture: external memory with on-demand retrieval.

Adaptive Recall stores knowledge persistently and retrieves the specific memories relevant to each query. The context window never needs to hold more than the system prompt, the current query, and the 5 to 10 most relevant memories. Compression of those memories is rarely needed because the retrieval is already selective, returning focused, relevant content rather than broad documents that need to be trimmed.

Skip the compression pipeline. Adaptive Recall retrieves focused, relevant memories for each query, keeping context compact by design.

Try It Free