Converter

Characters to tokens converter

Convert character counts to approximate LLM token counts. 1 token ≈ 4 characters for English prose.

Assumes English prose. Code runs closer to 3 chars/token; CJK scripts closer to 1.5.

Quick math: 4,000 chars ≈ 1,000 tokens
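The quick math above is just a division by four; a minimal sketch in Python:

```python
def estimate_tokens(char_count: int) -> int:
    """Token estimate for English prose: 1 token ≈ 4 characters."""
    return round(char_count / 4)

estimate_tokens(4_000)  # → 1000
```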

How the conversion works

The most widely cited rule of thumb for LLM tokenization: 1 token ≈ 4 characters of English text. This is roughly where major BPE (and SentencePiece) tokenizers settle for typical English prose once the merge process has run.

The ratio varies by content: English prose runs ~4 chars/token, code and JSON run ~3, CJK scripts run ~1.5, and emoji take 1-4 tokens each depending on rarity.
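Those per-content ratios can be folded into a small lookup; a sketch using this page's approximate figures (the category names are made up for illustration):

```python
# Rule-of-thumb ratios from the paragraph above (all approximate).
CHARS_PER_TOKEN = {
    "prose": 4.0,  # English prose
    "code": 3.0,   # code and JSON
    "cjk": 1.5,    # Chinese/Japanese/Korean
}

def estimate_by_content(char_count: int, content: str) -> int:
    """Rough token estimate using a per-content chars/token ratio."""
    return round(char_count / CHARS_PER_TOKEN[content])

estimate_by_content(3_000, "prose")  # → 750, one "page" of English
```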

For production systems, never use this ratio alone — it's an estimate for planning. Exact counts come from tokenizer libraries (js-tiktoken for GPT) or provider countTokens endpoints (Anthropic, Google).

Common character counts

1 tweet (~280 chars) → ~70 tokens
1 SMS (~160 chars) → ~40 tokens
1 paragraph (~500 chars) → ~125 tokens
1 page (~3,000 chars) → ~750 tokens
1 short document (~15,000 chars) → ~3,750 tokens
1 long document (~50,000 chars) → ~12,500 tokens
1 book (~500,000 chars) → ~125,000 tokens

Frequently asked

Why is 1 token ≈ 4 characters?

Tokenizers learn their vocabulary from training data. For English, the BPE merge process naturally settles on subword units averaging ~4 characters because that's the most efficient split across the training corpus. The ratio is remarkably stable across providers.
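The merge process can be illustrated with a toy sketch: count the most frequent adjacent symbol pair in a tiny corpus and merge it everywhere. This is not a real tokenizer, just the core BPE step on a hand-picked corpus:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across a corpus of symbol tuples."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(tuple(out))
    return merged

corpus = [tuple("the"), tuple("then"), tuple("there"), tuple("cat")]
pair = most_frequent_pair(corpus)   # ('t', 'h') — appears in 3 words
corpus = merge_pair(corpus, pair)   # "the" becomes ('th', 'e'), etc.
```

Repeating this step thousands of times is what grows tokens toward the ~4-character subwords the rule of thumb relies on.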

Is the ratio different for whitespace?

Modern tokenizers (o200k_base, cl100k_base, Claude's) merge whitespace with the following word, so spaces contribute minimally to token count. The 4 chars/token rule already factors this in.

How does code tokenize?

Roughly 3 characters per token, sometimes worse. Code has more punctuation, identifiers that don't appear in the model's learned vocabulary ("useQuery", "__init__"), and lots of short symbols. Expect 25-40% more tokens for code than the equivalent amount of prose.
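The 25-40% overhead follows directly from the two ratios; a quick check with a made-up file size:

```python
def tokens_at(chars: int, chars_per_token: float) -> int:
    """Rule-of-thumb token estimate at a given chars/token ratio."""
    return round(chars / chars_per_token)

chars = 1_200                      # a ~1,200-character source file (illustrative)
as_prose = tokens_at(chars, 4.0)   # 300 tokens at the prose ratio
as_code = tokens_at(chars, 3.0)    # 400 tokens at the code ratio
overhead = as_code / as_prose - 1  # ≈ 0.33, i.e. ~33% more tokens
```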

What about CJK (Chinese, Japanese, Korean)?

Much worse. CJK characters often tokenize at 1-2 tokens each, so the char-to-token ratio inverts: roughly 1-1.5 characters per token for common Chinese text, with rare characters costing several tokens. A 500-character Chinese document is roughly 350-1,000 tokens, not 125.
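One way to handle mixed-language text is to weight CJK characters separately; a sketch using this page's approximate ratios (the Unicode ranges below cover only the common blocks and are a simplification, not a complete CJK check):

```python
def is_cjk(ch: str) -> bool:
    """Rough check against common CJK Unicode blocks (not exhaustive)."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
            or 0x3040 <= cp <= 0x30FF   # Hiragana + Katakana
            or 0xAC00 <= cp <= 0xD7AF)  # Hangul syllables

def estimate_mixed(text: str) -> int:
    """Estimate tokens: ~1.5 chars/token for CJK, ~4 for everything else."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    other = len(text) - cjk
    return round(cjk / 1.5 + other / 4)
```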

Does punctuation count?

Yes. Periods, commas, and other common punctuation usually merge with the preceding or following word (no extra token). Rare symbols and long runs of punctuation split into their own tokens.

Ready to estimate a real prompt?

Paste your actual text into the estimator for exact token counts and dollar costs across every model.