
Words to tokens converter

Convert word counts to LLM token counts. 1 word ≈ 1.33 tokens for English prose. Works for any model family — the ratio is roughly consistent.

Assumes English prose. Code and non-English text will have a higher token-per-word ratio.

Quick math: 1,000 words ≈ 1,330 tokens
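The quick math above can be sketched as a one-line helper. This is an estimation sketch, not a tokenizer: the function name and the default 1.33 ratio come from this page.

```python
def words_to_tokens(words: int, ratio: float = 1.33) -> int:
    """Estimate LLM token count from an English-prose word count.

    The default ratio of ~1.33 tokens per word holds for English prose;
    pass a higher ratio for code or non-English text.
    """
    return round(words * ratio)

print(words_to_tokens(1_000))  # -> 1330
print(words_to_tokens(500))    # -> 665
```

For exact counts you still need the model's real tokenizer; this helper is only for back-of-envelope sizing.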

How the conversion works

English prose averages roughly 1.33 tokens per word across every major tokenizer (OpenAI o200k_base, cl100k_base, Claude's proprietary tokenizer, Google's SentencePiece). The ratio is surprisingly stable across providers.

Why 1.33? Tokenizers split common words into one token ("the", "and", "is") but split less common or longer words into two or three ("pneumonia" → "pn"/"eu"/"monia"). The long tail of rarer words averages out to ~4/3 tokens per word.

This ratio breaks down for code (2-3 tokens per "word"), JSON (similar), non-English text (often 2-3× higher), and for documents with many proper nouns, acronyms, or technical jargon.
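The content-type ratios above can be folded into a small lookup. The category names and the midpoint values below are assumptions for this sketch, chosen from the ranges stated on this page, not tokenizer-measured constants.

```python
# Illustrative tokens-per-word ratios drawn from the ranges on this page.
# "code" uses the midpoint of the 2-2.5 range quoted for JavaScript/Python;
# "european" uses the ~1.8 figure quoted for Spanish/French/German.
RATIOS = {
    "english_prose": 1.33,
    "code": 2.25,
    "european": 1.8,
}

def estimate_tokens(word_count: int, content_type: str = "english_prose") -> int:
    """Estimate tokens for a word count, using a per-content-type ratio."""
    return round(word_count * RATIOS[content_type])

print(estimate_tokens(1_000))          # -> 1330
print(estimate_tokens(1_000, "code"))  # -> 2250
```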

Common word counts

1 sentence (~15 words) ≈ 20 tokens
1 paragraph (~100 words) ≈ 133 tokens
1 page (~500 words) ≈ 665 tokens
1 blog post (~1,000 words) ≈ 1,330 tokens
1 short document (~3,000 words) ≈ 4,000 tokens
1 book chapter (~10,000 words) ≈ 13,300 tokens
1 novel (~80,000 words) ≈ 106,400 tokens

Frequently asked

Is the 1.33 ratio the same for every model?

Close enough for estimation. OpenAI's o200k_base is slightly more efficient than cl100k_base (smaller ratio), Claude's tokenizer is broadly similar to o200k, and Gemini's SentencePiece runs a hair higher. For back-of-envelope estimation, use 1.33; for precision, use our token counter tools.

Why is the ratio higher for code?

Code has lots of punctuation, indentation, and variable names that don't appear in the tokenizer's main vocabulary. "function" is usually one token, but "useState" or "__proto__" often splits into 2-3. Expect 2-2.5 tokens per "word" for typical JavaScript/Python code.

How many tokens are in a typical email?

A 200-word email is about 265 tokens. A longer support ticket or well-structured business email at 400 words is about 530 tokens. For costing, assume 300 tokens as a reasonable default for business email.

How should I count tokens for non-English text?

Most tokenizers were trained primarily on English, so non-English text runs higher. For Spanish/French/German expect ~1.8 tokens per word; for CJK (Chinese/Japanese/Korean), count characters instead of words — the ratio is roughly 1 character ≈ 1-2 tokens.
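The count-characters-for-CJK advice above can be sketched as a mixed-text estimator. The Unicode ranges and the 1.5 tokens-per-character midpoint (from the 1-2 range quoted above) are assumptions for illustration.

```python
def is_cjk(ch: str) -> bool:
    """Rough CJK check via common Unicode blocks (an approximation)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana + Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )

def estimate_tokens_mixed(text: str) -> int:
    """Count CJK characters at ~1.5 tokens each, other words at ~1.33."""
    cjk_chars = sum(1 for ch in text if is_cjk(ch))
    non_cjk_words = len("".join(ch for ch in text if not is_cjk(ch)).split())
    return round(cjk_chars * 1.5 + non_cjk_words * 1.33)

print(estimate_tokens_mixed("hello world"))  # -> 3
```

Real tokenizers handle mixed scripts far more subtly; treat this as a rough budgeting heuristic only.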

Does this count include the system prompt?

No — this converts words you paste to an approximate token count. In a real API call, your system prompt, conversation history, and any tool schemas all count toward input tokens. For production cost estimation, multiply this by the number of messages or use the Calcis /estimator tool.

Ready to estimate a real prompt?

Paste your actual text into the estimator for exact token counts and dollar costs across every model.