Guide

What is pre-flight LLM cost estimation?

A practical guide for engineers who want to know what an API call costs before they send it - covering token counting, output prediction, and the per-model maths that turns “a prompt” into a dollar figure.

The problem: the bill arrives after the call

Every commercial LLM bills you per token, on input and output separately, with rates that vary by an order of magnitude across models. A single Claude Opus call on a long prompt can cost thirty cents; the same prompt on Gemini Flash costs less than a tenth of a cent. The catch is that you only see the bill after the response is back and the tokens have been counted.

That asymmetry is fine when you're running ten experimental requests a day. It's a problem when an agent loops, a batch job runs over a million inputs, or a new feature ships to a million users - by the time the invoice comes in, the money is already spent.

What “pre-flight” means here

Pre-flight estimation predicts the dollar cost of an API call before the call goes out. Three pieces have to line up:

  • Input tokens. The exact byte-pair count the provider will charge you for, computed locally with the same tokenizer that provider uses.
  • Output tokens. A predicted count, since you don't actually have the response yet - derived from the prompt's structure, length, and intent signals.
  • Pricing. The per-million-token rate for the target model, kept current with each provider's published rate card.

Multiply the three together and you get a number you can show in a UI, log, or budget alert before any API key is touched.
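As a sketch, that multiplication is just two products and a sum. The model names and per-million-token rates below are placeholders for illustration, not Calcis's live rate card:

```python
# Illustrative pre-flight cost estimate. Rates are examples, not real prices.
PRICE_PER_MTOK = {
    # model: (input $/M tokens, output $/M tokens) - placeholder figures
    "example-large": (15.00, 75.00),
    "example-small": (0.10, 0.40),
}

def estimate_cost(model: str, input_tokens: int, predicted_output_tokens: int) -> float:
    """Dollar cost = input_tokens * input_rate + predicted_output * output_rate."""
    in_rate, out_rate = PRICE_PER_MTOK[model]
    return (input_tokens * in_rate + predicted_output_tokens * out_rate) / 1_000_000

# A 2,000-token prompt with a predicted 500-token reply:
cost = estimate_cost("example-large", 2_000, 500)  # → 0.0675, i.e. about 7 cents
```

Because input and output are priced separately, the same prompt can land at very different costs depending on how long the predicted reply is - which is why the output forecast matters.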

Input vs output tokens (and why they price differently)

Input tokens are deterministic. Once you know the prompt string and the model, you can count them locally without ever talking to the provider. Calcis uses the same byte-pair encoder each provider uses (tiktoken for OpenAI, a calibrated approximation for Anthropic, sentencepiece for Google) so the count matches what the provider invoices.

Output tokens are the unknown. The model decides how long its reply will be - anywhere from a single word to the full context window. Output is also typically priced 2-5x higher per token than input, so getting the prediction right matters more than getting the input count right.

Calcis predicts output length from the prompt itself. A prompt that says “in one sentence” gets scored differently from one that says “explain step by step.” A request that pastes 20k tokens of source code and asks for a refactor gets scored against the volume of the input. The prediction is a number plus an interval - our P10 and P90 bounds - so the UI can show you the range, not just the point estimate.
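The keyword-signal idea can be sketched as a small regex cascade. The base guess and multipliers below are invented for illustration and are far cruder than Calcis's fitted model:

```python
import re

# Hypothetical base guess and keyword multipliers - illustrative only.
BASE_OUTPUT_TOKENS = 250

SIGNALS = [
    (re.compile(r"\bin one sentence\b|\bbriefly\b", re.I), 0.2),  # brevity cues
    (re.compile(r"\bstep by step\b|\bin detail\b", re.I), 3.0),   # verbosity cues
    (re.compile(r"\blist\b", re.I), 1.5),
]

def predict_output_tokens(prompt: str) -> int:
    estimate = BASE_OUTPUT_TOKENS
    for pattern, multiplier in SIGNALS:
        if pattern.search(prompt):
            estimate *= multiplier
    # Long pasted inputs (e.g. code to refactor) tend to draw long answers,
    # so floor the estimate against a fraction of the input size:
    approx_input_tokens = len(prompt) // 4  # rough chars-per-token approximation
    return int(max(estimate, 0.3 * approx_input_tokens))
```

Under this sketch, “Answer briefly.” scores far lower than “Explain step by step.”, which is the behaviour the paragraph above describes.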

How accurate is it?

Input tokens are exact (or as exact as the provider's own tokenizer allows). Output tokens are a forecast.

Calcis runs a four-layer prediction stack:

  • LLM predictor - available on Pro plans and above; calls Haiku to predict how long the answer will be. Most accurate, and costs us a fraction of a cent per call.
  • Fitted regression model - a gradient-boosted forest trained on ~33k real (prompt, response) pairs, evaluated locally. Free for everyone. Holdout RMSE in log-space is around 0.42, meaning typical predictions land within 1.8x of the real length.
  • Bayesian posterior - a Normal-Inverse-Gamma model that combines a category prior with detected keyword modifiers (“briefly”, “in detail”, “list”) and gives back a P10–P90 confidence interval.
  • Heuristic fallback - a regex cascade for cold starts and edge cases.
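The P10–P90 band can be sketched as quantiles of a lognormal predictive distribution around the point estimate. The sigma below is invented for illustration; the real layer derives its spread from the Normal-Inverse-Gamma posterior:

```python
from math import exp, log
from statistics import NormalDist

def p10_p90(point_estimate: float, log_sigma: float) -> tuple[int, int]:
    """Interval around a point estimate, assuming log(output_tokens) ~ Normal."""
    mu = log(point_estimate)
    z = NormalDist()  # standard normal, used for quantile lookup
    lo = exp(mu + z.inv_cdf(0.10) * log_sigma)
    hi = exp(mu + z.inv_cdf(0.90) * log_sigma)
    return round(lo), round(hi)

# A 400-token point estimate with a log-space spread of 0.42:
band = p10_p90(400, 0.42)
```

Because the interval is symmetric in log-space, it widens multiplicatively: an ambiguous prompt with a large sigma produces a visibly wider band, not just a shifted number.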

You see the result of whichever layer is strongest for your tier, plus the confidence band from the Bayesian layer underneath the headline number. When the band is narrow (most prompts) you get a tight estimate. When the prompt is ambiguous, the band widens - honestly.

Try it

The whole product is free to use without an account. Open the estimator, paste a prompt, pick a target model, and you'll see the input count, the predicted output, and the dollar cost - for that single call and projected at 1k and 100k requests a month.

Open the estimator →

Frequently asked

What does pre-flight LLM cost estimation mean?
It means predicting the input and output token cost of an LLM API call before you actually send the request. Calcis tokenizes your prompt with the same encoder the provider uses, predicts how many tokens the model will reply with, and multiplies both by the published per-million-token price.
How accurate is Calcis at predicting output tokens?
Input tokens are exact for OpenAI, Anthropic, and Google models because Calcis uses each provider's own tokenizer. Output tokens are predicted from the prompt itself using a fitted regression model trained on ~33k real conversations, with a Bayesian posterior layered on top to give a P10–P90 confidence interval. Most predictions land within 1.8x of the real response length.
Which models does Calcis support?
Calcis currently supports the major commercial frontier models: GPT-5, GPT-5 mini, Claude Sonnet 4.6, Claude Haiku 4.5, Claude Opus 4, Gemini 2.5 Pro, and Gemini 2.5 Flash, plus their long-context tiers. Pricing is sourced directly from each provider's published rate card and updated when changes ship.
Why estimate cost before calling the API?
Because the bill arrives after the call, not before it. Pre-flight estimation lets you pick a cheaper model when the prompt allows, set per-request budget caps, prevent runaway agentic loops, and forecast monthly spend at scale before you commit a feature to production.