RAG pipeline cost calculator

Cost out a retrieval-augmented generation pipeline: embedding queries, retrieving chunks, and paying for the context-heavy chat call on top.

A RAG pipeline pays for tokens twice: once to embed the query and retrieved chunks, and again — much more expensively — to feed that context into a chat model that actually answers. The second call is where 80-95% of the budget goes.

The default shape below is a mid-sized RAG setup: 8 retrieved chunks of 500 tokens each (4,000 tokens of context), a 200-token user question, and a 400-token answer. Tune it to match your own embedding model, chunk size, and k.
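The default workload above reduces to a few multiplications. Here is a minimal sketch of that arithmetic; the per-million-token rates are illustrative placeholders, not any provider's official pricing, so substitute your own model's rates:

```python
# Illustrative rates in $/1M tokens -- placeholders, not official pricing.
EMBED_RATE = 0.02   # small embedding model
INPUT_RATE = 0.05   # chat-model input (assumed)
OUTPUT_RATE = 0.40  # chat-model output (assumed)

def rag_request_cost(query_tokens=200, k=8, chunk_tokens=500, answer_tokens=400):
    """Return (embedding_cost, chat_cost, total) in dollars for one request."""
    context = k * chunk_tokens  # 8 x 500 = 4,000 tokens of retrieved context
    embed_cost = (query_tokens + context) * EMBED_RATE / 1e6
    chat_cost = ((query_tokens + context) * INPUT_RATE
                 + answer_tokens * OUTPUT_RATE) / 1e6
    return embed_cost, chat_cost, embed_cost + chat_cost

embed, chat, total = rag_request_cost()
# Even at these cheap assumed rates, the chat call is ~80% of the total.
```

With these placeholder rates the chat call already takes roughly four-fifths of each request's cost, which is the low end of the 80-95% range quoted above; pricier chat models push the share higher.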

Workload parameters

Costs update live across every model in the table below.

Top 8 cheapest models for this workload

Sorted by total cost per request (input + output, with tokenizer and long-context adjustments applied).

| Model | Provider | Per request | Per day | Monthly (×30) |
| --- | --- | --- | --- | --- |
| GPT-5 nano | openai | $0.00036 | $0.18 | $5.40 |
| Gemini 2.5 Flash-Lite | google | $0.00056 | $0.28 | $8.40 |
| GPT-4o mini | openai | $0.00084 | $0.42 | $12.60 |
| GPT-5.4 nano | openai | $0.00130 | $0.65 | $19.50 |
| Gemini 3.1 Flash-Lite (preview) | google | $0.00160 | $0.80 | $24.00 |
| GPT-5 mini | openai | $0.00180 | $0.90 | $27.00 |
| Gemini 2.5 Flash | google | $0.00220 | $1.10 | $33.00 |
| GPT-4.1 mini | openai | $0.00224 | $1.12 | $33.60 |

Scaling GPT-5 nano

What the cheapest option costs as your traffic grows.

  • baseline: $5.40 per month
  • 2× volume: $10.80 per month
  • 5× volume: $27.00 per month
  • 10× volume: $54.00 per month
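Cost scales linearly with traffic, so the figures above are one multiplication: per-request cost × requests/day × 30. The table's own numbers imply a baseline of 500 requests/day ($0.18 ÷ $0.00036):

```python
# Linear traffic scaling: monthly = per_request * requests/day * multiplier * 30.
per_request = 0.00036    # GPT-5 nano per-request cost, from the table above
requests_per_day = 500   # implied by the table: $0.18/day / $0.00036/request

def monthly_cost(multiplier=1):
    """Monthly cost in dollars at a given traffic multiplier."""
    return per_request * requests_per_day * multiplier * 30

for m in (1, 2, 5, 10):
    print(f"{m}x volume: ${monthly_cost(m):.2f}/month")
```

Plug in your own per-request figure and daily volume; nothing else in the projection changes.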

Optimization tips

  • Use prompt caching on your system prompt AND your retrieved context if chunks repeat across queries. Anthropic's 5-minute TTL catches most "same user asks follow-up" patterns.
  • Rerank and shrink. A good reranker lets you drop from k=20 chunks to k=5 with equal or better answer quality — an instant 4× cost reduction.
  • Use a small embedding model (e.g. OpenAI text-embedding-3-small at $0.02/1M). The chat model dominates the bill; embeddings are rounding error.
  • Watch the long-context tier. Gemini 2.5 Pro doubles its rate above 200K input tokens; if your context grows with history, you can cross the threshold unintentionally.
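The last tip hides a pricing cliff worth sketching: with tiered long-context pricing, a request that crosses the threshold can be billed at the higher rate for all of its input tokens, not just the overflow. The rates and the bill-the-whole-request behavior below are assumptions for illustration; check your provider's actual tier rules:

```python
# Tiered input pricing sketch. Rates are placeholders, and the
# "whole request billed at the higher tier" rule is an assumption;
# verify both against your provider's pricing page.
BASE_RATE = 1.25        # $/1M input tokens below the threshold (assumed)
LONG_RATE = 2.50        # $/1M input tokens above the threshold (assumed)
THRESHOLD = 200_000     # long-context cutoff in input tokens

def input_cost(tokens):
    """Input cost in dollars, applying one rate to the entire request."""
    rate = LONG_RATE if tokens > THRESHOLD else BASE_RATE
    return tokens * rate / 1e6

below = input_cost(190_000)  # just under the threshold
above = input_cost(210_000)  # ~10% more tokens, but more than 2x the cost
```

Because the higher rate applies to every token, adding ~10% more context here more than doubles the input bill. That is why unbounded chat history plus retrieved chunks is a dangerous combination.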

Frequently asked

What's the biggest cost in a RAG pipeline?

The chat-completion call with retrieved context injected. Embedding costs are typically under 5% of the total, and retrieval itself (the vector search) adds nothing to the API bill.

How many chunks should I retrieve?

Start with k=5 to k=8 after reranking. Beyond that range, more chunks increase input cost linearly without a proportional quality gain. If you need more context, a better reranker beats more chunks.

Should I use a cheaper model for the generation step?

If answer quality is acceptable, yes — RAG shifts "reasoning" burden to the retriever, so mid-tier models (Sonnet, GPT-5-mini, Gemini Flash) often perform as well as frontier models for extractive Q&A.

How do cached inputs affect RAG cost?

Dramatically. Anthropic's cache-read is 10% of the input rate; OpenAI's is 25-50%. If the same chunks show up across queries within the cache TTL, input cost collapses. Design your chunk IDs to be cache-friendly.
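The blended input rate under caching is a one-line formula. A minimal sketch, using the cache-read multipliers quoted above (Anthropic ~10%, OpenAI 25-50%) and ignoring any cache-write surcharge:

```python
# Blended input rate with prompt caching. hit_rate is the fraction of
# input tokens served from cache; cache-write surcharges are ignored here.
def effective_input_rate(base_rate, cache_read_multiplier, hit_rate):
    """Blended $/1M-token input rate for a given cache hit rate."""
    return base_rate * (hit_rate * cache_read_multiplier + (1 - hit_rate))

# e.g. a $3.00/1M base rate (placeholder), cache reads at 10% of base,
# and 80% of input tokens cached:
rate = effective_input_rate(3.00, 0.10, 0.80)  # -> $0.84/1M, a 72% cut
```

At an 80% hit rate with a 10% cache-read multiplier, the effective input rate drops to 28% of list price, which is what "input cost collapses" means in practice.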

What if my context exceeds the model's context window?

The call fails — there's no partial-context fallback. Track your context budget and cap retrieval before you hit the window. Most modern models give 128K-2M tokens of context; check the /models page for per-model limits.
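Capping retrieval before the call is a simple greedy loop over ranked chunks. This sketch is illustrative (the function and its parameters are not from any specific SDK): it keeps chunks in ranked order until a token budget, minus room reserved for the question, system prompt, and answer, is spent:

```python
# Hypothetical helper: cap retrieved context to a token budget before
# building the chat request. Each chunk is a (text, token_count) pair,
# already sorted by relevance.
def fit_chunks(chunks, budget_tokens, reserved=1_000):
    """Keep ranked chunks until the budget (minus `reserved` tokens) is spent."""
    kept, used = [], 0
    for text, tokens in chunks:
        if used + tokens > budget_tokens - reserved:
            break  # next chunk would overflow; stop rather than fail the call
        kept.append(text)
        used += tokens
    return kept

chunks = [("a", 500), ("b", 500), ("c", 500)]
fit_chunks(chunks, budget_tokens=2_000)  # keeps the first two chunks
```

Dropping the lowest-ranked chunks degrades gracefully; letting the request exceed the window fails the whole call.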

Need a precise number for your actual prompt?

Paste a real prompt into the estimator and get token-accurate costs — input tokens counted with the provider's own tokenizer, output tokens predicted by our regression model.