Converter

Characters to tokens converter

Convert character counts to approximate LLM token counts. 1 token ≈ 4 characters for English prose.

Assumes English prose. Code runs closer to 3 chars/token; CJK scripts closer to 1.5.

Quick math: 4,000 chars ≈ 1,000 tokens
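The quick math above is just a division by four; a minimal sketch in Python:

```python
def estimate_tokens(char_count: int) -> int:
    """Token estimate for English prose: 1 token ≈ 4 characters."""
    return round(char_count / 4)

estimate_tokens(4_000)  # → 1000
```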

How the conversion works

The most widely cited rule of thumb for LLM tokenization: 1 token ≈ 4 characters of English text. This is roughly where major BPE (and SentencePiece) tokenizers settle for typical English prose once the merge process has run.

The ratio varies by content: English prose runs ~4 chars/token, code and JSON run ~3, CJK scripts run ~1.5, and emoji take 1-4 tokens each depending on rarity.
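Those per-content ratios can be folded into a small lookup; a sketch using this page's approximate figures (the category names are made up for illustration):

```python
# Rule-of-thumb ratios from the paragraph above (all approximate).
CHARS_PER_TOKEN = {
    "prose": 4.0,  # English prose
    "code": 3.0,   # code and JSON
    "cjk": 1.5,    # Chinese/Japanese/Korean
}

def estimate_by_content(char_count: int, content: str) -> int:
    """Rough token estimate using a per-content chars/token ratio."""
    return round(char_count / CHARS_PER_TOKEN[content])

estimate_by_content(3_000, "prose")  # → 750, one "page" of English
```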

For production systems, never use this ratio alone — it's an estimate for planning. Exact counts come from tokenizer libraries (js-tiktoken for GPT) or provider countTokens endpoints (Anthropic, Google).

Common character counts

1 tweet (~280 chars) → ~70 tokens
1 SMS (~160 chars) → ~40 tokens
1 paragraph (~500 chars) → ~125 tokens
1 page (~3,000 chars) → ~750 tokens
1 short document (~15,000 chars) → ~3,750 tokens
1 long document (~50,000 chars) → ~12,500 tokens
1 book (~500,000 chars) → ~125,000 tokens

Frequently asked

Why is 1 token ≈ 4 characters?

Tokenizers learn their vocabulary from training data. For English, the BPE merge process naturally settles on subword units averaging ~4 characters because that's the most efficient split across the training corpus. The ratio is remarkably stable across providers.
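The merge process can be illustrated with a toy sketch: count the most frequent adjacent symbol pair in a tiny corpus and merge it everywhere. This is not a real tokenizer, just the core BPE step on a hand-picked corpus:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol tuples."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(tuple(out))
    return merged

corpus = [tuple("the"), tuple("then"), tuple("there"), tuple("cat")]
pair = most_frequent_pair(corpus)   # ('t', 'h') — appears in 3 words
corpus = merge_pair(corpus, pair)   # "the" becomes ('th', 'e'), etc.
```

Repeating this step thousands of times is what grows tokens toward the ~4-character subwords the rule of thumb relies on.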

Is the ratio different for whitespace?

Modern tokenizers (o200k_base, cl100k_base, Claude's) merge whitespace with the following word, so spaces contribute minimally to token count. The 4 chars/token rule already factors this in.

How does code tokenize?

Roughly 3 characters per token, sometimes worse. Code has more punctuation, identifiers that don't appear in the model's learned vocabulary ("useQuery", "__init__"), and lots of short symbols. Expect 25-40% more tokens for code than the equivalent amount of prose.
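The 25-40% overhead follows directly from the two ratios; a quick check with a made-up file size:

```python
def tokens_at(chars: int, chars_per_token: float) -> int:
    """Rule-of-thumb token estimate at a given chars/token ratio."""
    return round(chars / chars_per_token)

chars = 1_200                      # a ~1,200-character source file (illustrative)
as_prose = tokens_at(chars, 4.0)   # 300 tokens at the prose ratio
as_code = tokens_at(chars, 3.0)    # 400 tokens at the code ratio
overhead = as_code / as_prose - 1  # ≈ 0.33, i.e. ~33% more tokens
```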

What about CJK (Chinese, Japanese, Korean)?

Much worse. CJK characters often tokenize at 1-2 tokens each, so the char-to-token ratio inverts: roughly 1-1.5 characters per token for common Chinese text, with rare characters costing several tokens. A 500-character Chinese document is roughly 350-1,000 tokens, not 125.
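One way to handle mixed-language text is to weight CJK characters separately; a sketch using this page's approximate ratios (the Unicode ranges below cover only the common blocks and are a simplification, not a complete CJK check):

```python
def is_cjk(ch: str) -> bool:
    """Rough check against common CJK Unicode blocks (not exhaustive)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x30FF   # Hiragana + Katakana
            or 0xAC00 <= cp <= 0xD7AF)  # Hangul syllables

def estimate_mixed(text: str) -> int:
    """Estimate tokens: ~1.5 chars/token for CJK, ~4 for everything else."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    other = len(text) - cjk
    return round(cjk / 1.5 + other / 4)
```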

Does punctuation count?

Yes. Periods, commas, and other common punctuation usually merge with the preceding or following word (no extra token). Rare symbols and long runs of punctuation split into their own tokens.

Ready to estimate a real prompt?

Paste your actual text into the estimator for exact token counts and dollar costs across every model.