RAG pipeline cost calculator
Cost out a retrieval-augmented generation pipeline: embedding queries, retrieving chunks, and paying for the context-heavy chat call on top.
A RAG pipeline pays for tokens twice: once to embed the query and retrieved chunks, and again — much more expensively — to feed that context into a chat model that actually answers. The second call is where 80-95% of the budget goes.
The default shape below is a mid-sized RAG setup: 8 retrieved chunks of 500 tokens each (4,000 tokens of context), a 200-token user question, and a 400-token answer. Tune it to match your own embedding model, chunk size, and k.
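The default shape reduces to simple arithmetic: input tokens are chunks × chunk size + question, and cost is a linear function of input and output token counts. A minimal sketch, with placeholder per-token rates (the `IN_RATE`/`OUT_RATE` values are illustrative assumptions, not any model's actual pricing):

```python
# Placeholder rates in $ per token -- substitute your model's real pricing.
IN_RATE = 0.05 / 1_000_000   # hypothetical $0.05 per 1M input tokens
OUT_RATE = 0.40 / 1_000_000  # hypothetical $0.40 per 1M output tokens

# Default workload shape from above.
chunks, chunk_tokens = 8, 500
question_tokens, answer_tokens = 200, 400

input_tokens = chunks * chunk_tokens + question_tokens  # 4,000 context + 200 question
cost_per_request = input_tokens * IN_RATE + answer_tokens * OUT_RATE
print(f"${cost_per_request:.6f} per request")
```

Swap in your own chunk size and k to see how the input side, which dominates in RAG, moves the total.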
Workload parameters
Costs update live across every model in the table below.
Top 8 cheapest models for this workload
Sorted by total cost per request (input + output, with tokenizer and long-context adjustments applied).
| Model | Provider | Per request | Per day | Monthly (×30) |
|---|---|---|---|---|
| GPT-5 nano | OpenAI | $0.00036 | $0.18 | $5.40 |
| Gemini 2.5 Flash-Lite | Google | $0.00056 | $0.28 | $8.40 |
| GPT-4o mini | OpenAI | $0.00084 | $0.42 | $12.60 |
| GPT-5.4 nano | OpenAI | $0.00130 | $0.65 | $19.50 |
| Gemini 3.1 Flash-Lite (preview) | Google | $0.00160 | $0.80 | $24.00 |
| GPT-5 mini | OpenAI | $0.00180 | $0.90 | $27.00 |
| Gemini 2.5 Flash | Google | $0.00220 | $1.10 | $33.00 |
| GPT-4.1 mini | OpenAI | $0.00224 | $1.12 | $33.60 |
Scaling GPT-5 nano
What the cheapest option costs as your traffic grows.
| Volume | Monthly cost |
|---|---|
| Baseline | $5.40 |
| 2× | $10.80 |
| 5× | $27.00 |
| 10× | $54.00 |
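Scaling is purely linear: monthly cost = per-request cost × daily requests × 30 × volume multiplier. The table's $0.18/day at $0.00036/request implies a 500-request/day baseline, which reproduces every row:

```python
per_request = 0.00036   # GPT-5 nano, from the table above
requests_per_day = 500  # implied by $0.18/day at $0.00036/request

def monthly(volume_multiplier: float) -> float:
    """Monthly cost at a given multiple of baseline traffic."""
    return per_request * requests_per_day * volume_multiplier * 30

for mult in (1, 2, 5, 10):
    print(f"{mult}x volume: ${monthly(mult):.2f}/month")
```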
Optimization tips
- Use prompt caching on your system prompt AND your retrieved context if chunks repeat across queries. Anthropic's 5-minute TTL catches most "same user asks follow-up" patterns.
- Rerank and shrink. A good reranker lets you drop from k=20 chunks to k=5 with equal or better answer quality — an instant 4× cost reduction.
- Use a small embedding model (e.g. OpenAI text-embedding-3-small at $0.02/1M). The chat model dominates the bill; embeddings are rounding error.
- Watch the long-context tier. Gemini 2.5 Pro doubles its rate above 200K input tokens; if your context grows with history, you can cross the threshold unintentionally.
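The rerank-and-shrink tip is worth quantifying: because the question tokens are fixed, cutting k from 20 to 5 reduces input cost by slightly less than 4×. A quick sketch with a hypothetical input rate:

```python
chunk_tokens, question_tokens = 500, 200
in_rate = 0.05 / 1_000_000  # hypothetical input $/token

def input_cost(k: int) -> float:
    """Input-side cost for k retrieved chunks plus the user question."""
    return (k * chunk_tokens + question_tokens) * in_rate

before, after = input_cost(20), input_cost(5)
print(f"k=20 -> k=5 is {before / after:.1f}x cheaper on input")  # ~3.8x
```

The ratio is independent of the rate itself, so the savings hold for any model; only the fixed question tokens keep it just under the naive 4×.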
Frequently asked
What's the biggest cost in a RAG pipeline?
The chat-completion call with retrieved context injected. Embedding costs are typically under 5% of the total, and retrieval itself adds nothing to the API bill.
How many chunks should I retrieve?
Start with k=5 to k=8 after reranking. Retrieving more chunks increases input cost linearly without a proportional quality gain above that range. If you need more context, a better reranker beats more chunks.
Should I use a cheaper model for the generation step?
If answer quality is acceptable, yes — RAG shifts "reasoning" burden to the retriever, so mid-tier models (Sonnet, GPT-5-mini, Gemini Flash) often perform as well as frontier models for extractive Q&A.
How do cached inputs affect RAG cost?
Dramatically. Anthropic's cache-read is 10% of the input rate; OpenAI's is 25-50%. If the same chunks show up across queries within the cache TTL, input cost collapses. Design your chunk IDs to be cache-friendly.
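The effect of a cache hit rate on the input bill can be sketched directly. Here the 10% cache-read multiplier follows the Anthropic figure above; the input rate and 80% hit rate are illustrative assumptions:

```python
input_tokens = 4200          # default workload shape
in_rate = 3.00 / 1_000_000   # hypothetical $3 per 1M input tokens
cache_read_frac = 0.10       # cache-read priced at 10% of the input rate
hit_rate = 0.80              # assumed fraction of input tokens served from cache

uncached = input_tokens * in_rate
# Cache misses pay full rate; hits pay the discounted cache-read rate.
cached = input_tokens * ((1 - hit_rate) + hit_rate * cache_read_frac) * in_rate
print(f"uncached ${uncached:.4f} -> cached ${cached:.4f} "
      f"({1 - cached / uncached:.0%} saved on input)")
```

At an 80% hit rate the effective input multiplier is 0.28, i.e. a 72% cut to the input side of the bill.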
What if my context exceeds the model's context window?
The call fails — there's no partial-context fallback. Track your context budget and cap retrieval before you hit the window. Most modern models give 128K-2M tokens of context; check the /models page for per-model limits.
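Capping retrieval against a token budget is a few lines. A minimal sketch (the fixed-size-chunk assumption and the function name are illustrative; with variable-length chunks you would sum per-chunk token counts instead):

```python
def cap_chunks(ranked_chunks: list, budget_tokens: int,
               chunk_tokens: int = 500) -> list:
    """Keep top-ranked chunks until the context token budget is exhausted."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        if used + chunk_tokens > budget_tokens:
            break  # next chunk would overflow the budget
        kept.append(chunk)
        used += chunk_tokens
    return kept

# e.g. a 2,200-token budget fits four 500-token chunks.
print(len(cap_chunks(list(range(10)), budget_tokens=2200)))
```

Set the budget to the model's context window minus your system prompt, question, and reserved output tokens, so a deep retrieval can never push the call over the limit.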
Need a precise number for your actual prompt?
Paste a real prompt into the estimator and get token-accurate costs — input tokens counted with the provider's own tokenizer, output tokens predicted by our regression model.