RAG pipeline cost calculator

Cost out a retrieval-augmented generation pipeline: embedding queries, retrieving chunks, and paying for the context-heavy chat call on top.

A RAG pipeline pays for tokens twice: once to embed the query and retrieved chunks, and again — much more expensively — to feed that context into a chat model that actually answers. The second call is where 80-95% of the budget goes.

The default shape below is a mid-sized RAG setup: 8 retrieved chunks of 500 tokens each (4,000 tokens of context), a 200-token user question, and a 400-token answer. Tune it to match your own embedding model, chunk size, and k.
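The default workload above reduces to a few multiplications. Here is a minimal sketch of that arithmetic; the per-million-token rates are illustrative placeholders, not any provider's official pricing, so substitute your own model's rates:

```python
# Illustrative rates in $/1M tokens -- placeholders, not official pricing.
EMBED_RATE = 0.02   # small embedding model
INPUT_RATE = 0.05   # chat-model input (assumed)
OUTPUT_RATE = 0.40  # chat-model output (assumed)

def rag_request_cost(query_tokens=200, k=8, chunk_tokens=500, answer_tokens=400):
    """Return (embedding_cost, chat_cost, total) in dollars for one request."""
    context = k * chunk_tokens  # 8 x 500 = 4,000 tokens of retrieved context
    embed_cost = (query_tokens + context) * EMBED_RATE / 1e6
    chat_cost = ((query_tokens + context) * INPUT_RATE
                 + answer_tokens * OUTPUT_RATE) / 1e6
    return embed_cost, chat_cost, embed_cost + chat_cost

embed, chat, total = rag_request_cost()
# Even at these cheap assumed rates, the chat call is ~80% of the total.
```

With these placeholder rates the chat call already takes roughly four-fifths of each request's cost, which is the low end of the 80-95% range quoted above; pricier chat models push the share higher.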

Workload parameters

Costs update live across every model in the table below.

Top 8 cheapest models for this workload

Sorted by total cost per request (input + output, with tokenizer and long-context adjustments applied).

| Model | Provider | Per request | Per day | Monthly (×30) |
| --- | --- | --- | --- | --- |
| GPT-5 nano | openai | $0.00036 | $0.18 | $5.40 |
| Gemini 2.5 Flash-Lite | google | $0.00056 | $0.28 | $8.40 |
| GPT-4o mini | openai | $0.00084 | $0.42 | $12.60 |
| GPT-5.4 nano | openai | $0.00130 | $0.65 | $19.50 |
| Gemini 3.1 Flash-Lite (preview) | google | $0.00160 | $0.80 | $24.00 |
| GPT-5 mini | openai | $0.00180 | $0.90 | $27.00 |
| Gemini 2.5 Flash | google | $0.00220 | $1.10 | $33.00 |
| GPT-4.1 mini | openai | $0.00224 | $1.12 | $33.60 |

Scaling GPT-5 nano

What the cheapest option costs as your traffic grows.

  • baseline: $5.40 per month
  • 2× volume: $10.80 per month
  • 5× volume: $27.00 per month
  • 10× volume: $54.00 per month
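Cost scales linearly with traffic, so the figures above are one multiplication: per-request cost × requests/day × 30. The table's own numbers imply a baseline of 500 requests/day ($0.18 ÷ $0.00036):

```python
# Linear traffic scaling: monthly = per_request * requests/day * multiplier * 30.
per_request = 0.00036    # GPT-5 nano per-request cost, from the table above
requests_per_day = 500   # implied by the table: $0.18/day / $0.00036/request

def monthly_cost(multiplier=1):
    """Monthly cost in dollars at a given traffic multiplier."""
    return per_request * requests_per_day * multiplier * 30

for m in (1, 2, 5, 10):
    print(f"{m}x volume: ${monthly_cost(m):.2f}/month")
```

Plug in your own per-request figure and daily volume; nothing else in the projection changes.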

Optimization tips

  • Use prompt caching on your system prompt AND your retrieved context if chunks repeat across queries. Anthropic's 5-minute TTL catches most "same user asks follow-up" patterns.
  • Rerank and shrink. A good reranker lets you drop from k=20 chunks to k=5 with equal or better answer quality — an instant 4× cost reduction.
  • Use a small embedding model (e.g. OpenAI text-embedding-3-small at $0.02/1M). The chat model dominates the bill; embeddings are rounding error.
  • Watch the long-context tier. Gemini 2.5 Pro doubles its rate above 200K input tokens; if your context grows with history, you can cross the threshold unintentionally.
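The last tip hides a pricing cliff worth sketching: with tiered long-context pricing, a request that crosses the threshold can be billed at the higher rate for all of its input tokens, not just the overflow. The rates and the bill-the-whole-request behavior below are assumptions for illustration; check your provider's actual tier rules:

```python
# Tiered input pricing sketch. Rates are placeholders, and the
# "whole request billed at the higher tier" rule is an assumption;
# verify both against your provider's pricing page.
BASE_RATE = 1.25        # $/1M input tokens below the threshold (assumed)
LONG_RATE = 2.50        # $/1M input tokens above the threshold (assumed)
THRESHOLD = 200_000     # long-context cutoff in input tokens

def input_cost(tokens):
    """Input cost in dollars, applying one rate to the entire request."""
    rate = LONG_RATE if tokens > THRESHOLD else BASE_RATE
    return tokens * rate / 1e6

below = input_cost(190_000)  # just under the threshold
above = input_cost(210_000)  # ~10% more tokens, but more than 2x the cost
```

Because the higher rate applies to every token, adding ~10% more context here more than doubles the input bill. That is why unbounded chat history plus retrieved chunks is a dangerous combination.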

Frequently asked

What's the biggest cost in a RAG pipeline?

The chat-completion call with retrieved context injected. Embedding costs are typically under 5% of the total, and retrieval itself (the vector search) adds nothing to the API bill.

How many chunks should I retrieve?

Start with k=5 to k=8 after reranking. Beyond that range, more chunks increase input cost linearly without a proportional quality gain. If you need more context, a better reranker beats more chunks.

Should I use a cheaper model for the generation step?

If answer quality is acceptable, yes — RAG shifts "reasoning" burden to the retriever, so mid-tier models (Sonnet, GPT-5-mini, Gemini Flash) often perform as well as frontier models for extractive Q&A.

How do cached inputs affect RAG cost?

Dramatically. Anthropic's cache-read is 10% of the input rate; OpenAI's is 25-50%. If the same chunks show up across queries within the cache TTL, input cost collapses. Design your chunk IDs to be cache-friendly.
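The blended input rate under caching is a one-line formula. A minimal sketch, using the cache-read multipliers quoted above (Anthropic ~10%, OpenAI 25-50%) and ignoring any cache-write surcharge:

```python
# Blended input rate with prompt caching. hit_rate is the fraction of
# input tokens served from cache; cache-write surcharges are ignored here.
def effective_input_rate(base_rate, cache_read_multiplier, hit_rate):
    """Blended $/1M-token input rate for a given cache hit rate."""
    return base_rate * (hit_rate * cache_read_multiplier + (1 - hit_rate))

# e.g. a $3.00/1M base rate (placeholder), cache reads at 10% of base,
# and 80% of input tokens cached:
rate = effective_input_rate(3.00, 0.10, 0.80)  # -> $0.84/1M, a 72% cut
```

At an 80% hit rate with a 10% cache-read multiplier, the effective input rate drops to 28% of list price, which is what "input cost collapses" means in practice.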

What if my context exceeds the model's context window?

The call fails — there's no partial-context fallback. Track your context budget and cap retrieval before you hit the window. Most modern models give 128K-2M tokens of context; check the /models page for per-model limits.
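Capping retrieval before the call is a simple greedy loop over ranked chunks. This sketch is illustrative (the function and its parameters are not from any specific SDK): it keeps chunks in ranked order until a token budget, minus room reserved for the question, system prompt, and answer, is spent:

```python
# Hypothetical helper: cap retrieved context to a token budget before
# building the chat request. Each chunk is a (text, token_count) pair,
# already sorted by relevance.
def fit_chunks(chunks, budget_tokens, reserved=1_000):
    """Keep ranked chunks until the budget (minus `reserved` tokens) is spent."""
    kept, used = [], 0
    for text, tokens in chunks:
        if used + tokens > budget_tokens - reserved:
            break  # next chunk would overflow; stop rather than fail the call
        kept.append(text)
        used += tokens
    return kept

chunks = [("a", 500), ("b", 500), ("c", 500)]
fit_chunks(chunks, budget_tokens=2_000)  # keeps the first two chunks
```

Dropping the lowest-ranked chunks degrades gracefully; letting the request exceed the window fails the whole call.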

Need a precise number for your actual prompt?

Paste a real prompt into the estimator and get token-accurate costs — input tokens counted with the provider's own tokenizer, output tokens predicted by our regression model.