How Much Do Embedding API Calls Cost at Scale?
Per-Token Pricing Comparison
Model | Price per 1M tokens
-----------------------------|--------------------
OpenAI text-embedding-3-small | $0.020
OpenAI text-embedding-3-large | $0.130
Cohere embed-v4 | $0.100
Voyage voyage-3 | $0.060
Voyage voyage-code-3 | $0.180
Voyage voyage-law-2          | $0.180

Corpus Embedding Costs
Embedding your initial corpus is a one-time cost. Subsequent costs are only for new documents added and queries processed. The average document in a knowledge base or documentation corpus contains 300 to 800 tokens. Support tickets and chat messages average 50 to 200 tokens. Code files average 200 to 1,500 tokens depending on length.
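This one-time calculation is easy to sketch. A minimal helper, assuming the text-embedding-3-small and text-embedding-3-large prices from the pricing table above:

```python
# Sketch: estimate the one-time cost of embedding a corpus from the
# document count, average tokens per document, and price per 1M tokens.
# Prices mirror the comparison table above; treat them as illustrative.

PRICE_PER_M = {
    "text-embedding-3-small": 0.020,
    "text-embedding-3-large": 0.130,
}

def corpus_cost(num_docs: int, avg_tokens: int, model: str) -> float:
    """One-time cost in dollars to embed the full corpus."""
    total_tokens = num_docs * avg_tokens
    return total_tokens / 1_000_000 * PRICE_PER_M[model]

# 100K documents at 500 tokens each with the small model:
print(round(corpus_cost(100_000, 500, "text-embedding-3-small"), 2))  # 1.0
```

The result matches the 100K row of the table below: roughly $1.00 one-time for 50M tokens.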
# Initial corpus embedding costs (one-time)
# Assuming 500 avg tokens per document, OpenAI text-embedding-3-small
Documents | Tokens | Cost (small) | Cost (large)
-----------|-------------|--------------|-------------
10K | 5M | $0.10 | $0.65
100K | 50M | $1.00 | $6.50
500K | 250M | $5.00 | $32.50
1M | 500M | $10.00 | $65.00
5M | 2.5B | $50.00 | $325.00
10M        | 5B          | $100.00      | $650.00

Ongoing Monthly Costs
Monthly embedding costs come from two sources: new documents being added to the corpus and queries being embedded for search. For most applications, query embedding costs are negligible because queries are short (10 to 50 tokens) and query volume is moderate. The calculation changes at high volume.
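The two-source arithmetic can be sketched as a small helper, assuming the $0.020 per 1M token price for text-embedding-3-small:

```python
# Sketch: monthly embedding spend = (new-document tokens + query tokens)
# per month, priced at $0.020 per 1M tokens (text-embedding-3-small).

PRICE_PER_M_TOKENS = 0.020  # dollars per 1M tokens, illustrative

def monthly_cost(new_docs_per_day: int, doc_tokens: int,
                 queries_per_day: int, query_tokens: int,
                 days: int = 30) -> float:
    """Approximate monthly embedding cost in dollars."""
    doc_tokens_month = new_docs_per_day * doc_tokens * days
    query_tokens_month = queries_per_day * query_tokens * days
    return (doc_tokens_month + query_tokens_month) / 1_000_000 * PRICE_PER_M_TOKENS

# Moderate volume: 1,000 new docs/day (500 tokens) + 5,000 queries/day (30 tokens)
print(round(monthly_cost(1_000, 500, 5_000, 30), 2))  # 0.39
```

Plugging in the higher-volume scenarios below reproduces the $3.90 and $39.00 monthly figures.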
# Monthly ongoing costs with text-embedding-3-small
# New documents: 1,000/day x 500 tokens = 15M tokens/month = $0.30
# Queries: 5,000/day x 30 tokens = 4.5M tokens/month = $0.09
# Total: ~$0.39/month
# At higher volume:
# New documents: 10,000/day x 500 tokens = 150M tokens/month = $3.00
# Queries: 50,000/day x 30 tokens = 45M tokens/month = $0.90
# Total: ~$3.90/month
# At very high volume:
# New documents: 100,000/day x 500 tokens = 1.5B tokens/month = $30.00
# Queries: 500,000/day x 30 tokens = 450M tokens/month = $9.00
# Total: ~$39.00/month

When Self-Hosting Becomes Cheaper
Self-hosting an open-source embedding model (BGE-large, E5-large, GTE-large) eliminates per-token API costs. The cost becomes GPU hosting: a single NVIDIA A10G instance costs $1 to $2 per hour on major cloud providers. With batching, this instance embeds roughly 500 to 1,000 documents per second, processing millions of tokens per hour.
The break-even calculation depends on your volume. If your monthly API embedding cost is under $50, self-hosting is more expensive because the GPU instance alone costs $720 to $1,440 per month running continuously. However, you do not need to run the GPU continuously. For batch embedding jobs, you can spin up a GPU instance, embed your documents in hours, and shut it down. For query embedding, you can use a smaller CPU instance or a spot GPU instance.
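As a quick sketch of this break-even math, using the GPU and API prices assumed above:

```python
# Sketch: monthly token volume at which continuous GPU rental matches
# API spend. GPU hourly price and API price per 1M tokens are the
# assumptions stated in the text, not quotes from any provider.

def breakeven_m_tokens_per_month(gpu_dollars_per_hour: float,
                                 api_price_per_m: float = 0.020,
                                 hours_per_month: float = 24 * 30) -> float:
    """Millions of tokens/month where GPU rental equals API cost."""
    gpu_monthly = gpu_dollars_per_hour * hours_per_month
    return gpu_monthly / api_price_per_m

# On-demand A10G at $1.50/hour vs text-embedding-3-small:
print(breakeven_m_tokens_per_month(1.50))  # ~54,000M tokens, i.e. 54B/month
```

Dropping the hourly price to spot levels (~$0.50/hour) lowers the break-even to roughly 18B tokens per month, as the block below shows.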
Approximate break-even with on-demand GPU pricing:
# Self-hosted GPU costs (NVIDIA A10G, on-demand)
# ~$1.50/hour = ~$1,080/month continuous
# Can run only during ingestion + serving hours
# Break-even with OpenAI text-embedding-3-small:
# $1,080/month / $0.02 per M tokens = 54 billion tokens/month
# That is ~108M documents at 500 tokens each
# With spot/preemptible pricing (~$0.50/hour):
# $360/month / $0.02 per M tokens = 18 billion tokens/month
# That is ~36M documents at 500 tokens each
# Conclusion: self-hosting only saves money at very high volume
# or if you need embeddings for other tasks too (clustering, etc.)

Cost Optimization Strategies
Use the small model unless quality demands the large. OpenAI's text-embedding-3-small is 6.5x cheaper than the large model. For many applications, the quality difference is not significant enough to justify the cost increase. Test both on your data before committing to the more expensive option.
Cache query embeddings. If users submit repeated or similar queries, cache the embeddings with a short TTL. This eliminates redundant API calls for popular queries.
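A minimal TTL cache sketch; `embed_fn` is a hypothetical stand-in for your real embedding API call:

```python
# Sketch: a tiny in-memory TTL cache for query embeddings. Repeated
# queries inside the TTL window are served from the cache instead of
# triggering another paid API call.
import time

class EmbeddingCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store = {}  # query text -> (timestamp, vector)

    def get_or_embed(self, query: str, embed_fn):
        now = time.time()
        hit = self._store.get(query)
        if hit and now - hit[0] < self.ttl:
            return hit[1]  # cache hit: no API call made
        vector = embed_fn(query)  # cache miss: pay for one embedding
        self._store[query] = (now, vector)
        return vector

# Usage with a fake embed function that counts API calls:
counter = {"calls": 0}
def fake_embed(text):
    counter["calls"] += 1
    return [0.0] * 4

cache = EmbeddingCache(ttl_seconds=60)
cache.get_or_embed("refund policy", fake_embed)
cache.get_or_embed("refund policy", fake_embed)  # served from cache
print(counter["calls"])  # 1
```

In production you would bound the cache size (e.g. an LRU) and normalize queries (lowercasing, whitespace) before using them as keys.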
Batch embedding requests. Most embedding APIs support batch requests (up to 2,048 inputs per call for OpenAI). Batching reduces per-request overhead and is typically faster than sending individual requests.
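A chunking helper makes batching straightforward; 2,048 is OpenAI's documented per-request input limit, and other providers have their own caps:

```python
# Sketch: split a list of texts into batches of at most `batch_size`
# so each embedding API call carries many inputs instead of one.

def batched(texts, batch_size: int = 2048):
    """Yield consecutive slices of `texts`, each at most `batch_size` long."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

docs = [f"doc {i}" for i in range(5000)]
sizes = [len(b) for b in batched(docs)]
print(sizes)  # [2048, 2048, 904]
```

Each yielded batch would then be passed as the input list of a single embedding API request.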
Truncate long documents before embedding. If your documents are very long, the first 512 tokens often contain enough information for accurate retrieval. Embedding the first 512 tokens instead of 2,000 tokens reduces costs by 75% with minimal quality impact for many content types.
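A truncation sketch; a real pipeline would count tokens with the model's own tokenizer (e.g. tiktoken for OpenAI models), but a whitespace split serves as a rough stand-in here:

```python
# Sketch: cap documents at a token budget before embedding. Whitespace
# splitting is a crude proxy for real tokenization, used only to keep
# this example self-contained.

def truncate_to_tokens(text: str, max_tokens: int = 512) -> str:
    """Keep roughly the first `max_tokens` tokens of `text`."""
    tokens = text.split()  # crude proxy for the model tokenizer
    return " ".join(tokens[:max_tokens])

long_doc = "word " * 2000          # ~2,000 "tokens"
short = truncate_to_tokens(long_doc, 512)
print(len(short.split()))  # 512
```

At 512 tokens instead of 2,000, the embedding bill for that document drops by about 75%, matching the figure above.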
Adaptive Recall includes embedding and vector search in its managed pricing. No separate embedding API costs to calculate or optimize.
Try It Free