Reducing token use in LLMs

Why tokens matter

Everything you type into an LLM chat window is broken down into tokens, which the model uses behind the scenes to predict the next word it shows you. Tokens aren't just an internal detail, though: they are also the unit you are billed on.

A token corresponds to roughly three-quarters of a word on average (so 100 tokens is about 75 words). That means everything you paste into a chat, whether instructions, system/context files, prior messages, or examples, consumes tokens and therefore context budget and cost. Knowing how tokenisation works and measuring token counts makes the difference between a cheap, precise workflow and one that spirals into wasted tokens and hallucinations.

High-level strategy

  1. Measure first, then cut. Always measure token counts for representative prompts and context. Use token counters while you iterate so you know which changes save real tokens. Lunary has a number of token counters on its website, including an Anthropic tokeniser and other token calculators.
  2. Design a two-pass workflow. Plan/architect (design, then context summary, then success/failure criteria), then implement (detailed prompt or code). This avoids repeatedly sending long context during early exploration.
  3. Context scoping and chunking. Replace a huge monolithic context with short, issue-specific context files (and summary headers). Feed only the minimal nearby code or text needed for the task.
  4. Cache and reuse. Persist validated snippets, canonical prompts, and small summaries in a local store rather than re-sending large blocks each chat. Use hashed identifiers so the model can retrieve “read-only” context by ID if your toolchain supports it.
  5. Summarise, don’t paste everything. Summaries dramatically reduce tokens while preserving the essentials. Keep a strict policy: raw logs only when absolutely necessary.
  6. Prefer structured, compact formats. Use short-labelled JSON, YAML, or bullet lists instead of long prose. Labels let the model find what matters without reading every sentence.
  7. Ask the model to be concise. Explicitly instruct brevity and token-efficiency (examples below).
  8. Fine-tune or few-shot wisely. If you have a recurring specialised task, fine-tuning (or instruct-tuning) can reduce the need for verbose prompts, but weigh the cost and maintenance.

1. Measure token use at every step

You can't optimise what you don't measure. Yes, I know, it's not always practical to measure everything, but for the more serious work, paste sample prompts and context into a token counter and compare before and after. Keep a short log of expensive prompts and what reduced them. Use tools such as Lunary's Anthropic tokeniser or another token calculator site.
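
As a quick illustration, here is a minimal Python sketch using the tiktoken library (OpenAI's tokeniser) to compare a verbose prompt with a compressed rewrite. Counts from other providers such as Anthropic will differ, so treat the numbers as relative comparisons rather than exact billing figures; the example prompts are made up.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # OpenAI tokeniser; other providers differ

def count_tokens(text: str) -> int:
    """Approximate token count for a prompt or context block."""
    return len(enc.encode(text))

# Compare a verbose prompt with its compressed rewrite (toy strings here;
# in practice paste real prompts or read your context files).
verbose = ("Please could you very thoroughly and carefully explain, in as much "
           "detail as possible, how JWT refresh tokens work.")
compact = "Explain JWT refresh tokens in 5 bullet points."
print(count_tokens(verbose), "->", count_tokens(compact))
```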

2. Use a two-pass workflow: design then implement

This avoids iterating with full context until you have a settled plan, which saves tokens and time.

  • Phase 1 (Design): Ask the LLM to produce a compact architecture/plan (max 250–400 tokens). Ask for 3 options and quick trade-offs.
  • Review, refine and choose.
  • Phase 2 (Implement): Ask for the actual code or patch with only the minimal context required.
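
Putting both passes together, here is a rough Python sketch of the pattern. ask_llm is a placeholder for whatever chat client you use, and the rate-limiting task is invented; the point is that the cheap design pass never carries the full context, and the implementation pass sends only the chosen plan plus a minimal snippet.

```python
def ask_llm(prompt: str) -> str:
    # Placeholder: substitute a call to your chat client here.
    return "<model response>"

# Phase 1: cheap design pass with a hard brevity cap.
design = ask_llm(
    "Propose 3 options for adding rate limiting to our API gateway. "
    "For each: a 2-sentence description and one trade-off. Max 400 tokens."
)

# Human step: review `design` and pick an option.
chosen = "Option 2: token bucket per API key"  # example choice

# Phase 2: implementation pass, sending only the chosen plan plus the
# minimal code snippet it needs, not the whole exploration history.
patch = ask_llm(
    f"Implement: {chosen}. Context: FastAPI gateway; the relevant 40-line "
    "middleware is pasted below. Return only a unified diff.\n"
    "<paste the 40-line snippet here>"
)
```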

3. Create and use tiny context documents

One way I find of improving token usage is to write a detailed context document, then ask an AI, acting in the role of an LLM prompting expert, to produce an efficient expert prompt from that detailed document. You can then place the result in a tiny, issue-specific context document, such as problem-fix.md or specific-issue-context.md, and have the AI work from that file. Large monolithic files waste tokens and cause the model to drop sections or hallucinate, so break context down into a number of smaller context documents and keep each one to a few hundred tokens or less when possible.
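
A simple way to keep those documents honest is a token budget check. The sketch below assumes a hypothetical context/ folder of small markdown files and uses a rough 4-characters-per-token heuristic; swap in a real tokeniser (see the measurement sketch above) for accurate counts.

```python
from pathlib import Path

MAX_TOKENS = 300  # keep each issue-specific context file to a few hundred tokens

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough rule of thumb; use a real tokeniser for accuracy

for doc in Path("context").glob("*.md"):  # e.g. problem-fix.md, specific-issue-context.md
    tokens = estimate_tokens(doc.read_text())
    verdict = "OK" if tokens <= MAX_TOKENS else "TOO BIG: split or summarise"
    print(f"{doc.name}: ~{tokens} tokens ({verdict})")
```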

4. Use summaries and progressive disclosure

A short summary is often enough; reserve long raw traces for deep debugging only. Summarise with the LLM itself. As I mentioned above, you can ask it to summarise a big document into 120 to 200 tokens, then use the summary for iterative prompts.
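
Here is a sketch of that summarise-once, reuse-many pattern, again with ask_llm standing in for your chat client: the big document is sent exactly once, and only the short summary travels with every later prompt.

```python
def ask_llm(prompt: str) -> str:
    return "<model response>"  # placeholder for your chat client

big_doc = "<thousands of tokens of raw logs>"  # in practice, read from a file

# Send the large document exactly once, asking for a tight summary.
summary = ask_llm(
    "Summarise the following logs in 120-200 tokens. Keep only error codes, "
    "stack traces and anomalous events:\n" + big_doc
)

# Every follow-up prompt reuses the summary instead of the raw logs.
answer = ask_llm(
    f"Context summary:\n{summary}\n\n"
    "Suggest the 3 most likely root causes, max 120 tokens."
)
```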

5. Use structured inputs (JSON / YAML / labelled bullets)

Structured inputs are easier for LLMs to parse and typically shorter than verbose prose. Use keys like objective, constraints, examples, expected_output. This reduces back-and-forth clarification noise.
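
For example, a task description that might take several paragraphs of prose can be packed into a handful of labelled keys. The task below is invented; only the shape of the payload matters.

```python
import json

payload = {
    "objective": "Write a regex that validates UK postcodes",
    "constraints": ["Python re module", "case-insensitive", "no lookbehind"],
    "examples": {"valid": ["SW1A 1AA", "m1 1ae"], "invalid": ["12345"]},
    "expected_output": "one regex string plus a 1-line explanation",
}

# Serialise the structured task into a terse, unambiguous prompt.
prompt = "Complete the task described in this JSON:\n" + json.dumps(payload)
print(prompt)
```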

6. Cache validated answers and canonical prompts

If you validated a prompt/response pair, reuse it instead of re-asking the model from scratch. Keep a small local library of canonical prompts and tiny context snippets.
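
One lightweight way to do this is a local cache keyed by a hash of the prompt, so a validated answer is returned without re-sending any tokens. This is a minimal sketch, not a production cache (no expiry or invalidation); ask_llm is whatever client call you use.

```python
import hashlib
import json
from pathlib import Path

CACHE = Path("prompt_cache.json")

def cached_ask(prompt: str, ask_llm) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()[:16]  # short, stable ID
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    if key in cache:              # validated answer already on disk: zero tokens spent
        return cache[key]
    answer = ask_llm(prompt)      # only pay for tokens on a cache miss
    cache[key] = answer
    CACHE.write_text(json.dumps(cache, indent=2))
    return answer
```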

7. Ask for token-aware outputs

Models can often compress their own output if you request that explicitly.
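
Two levers work well together: an explicit brevity instruction inside the prompt, and the hard output cap most chat APIs expose (commonly a parameter named max_tokens, though the exact name varies by provider). A rough sketch:

```python
def ask_llm(prompt: str, max_tokens: int = 150) -> str:
    # Placeholder: pass `max_tokens` through to your client's output cap.
    return "<model response>"

reply = ask_llm(
    "In at most 80 tokens, list the 3 main risks of storing JWTs in "
    "localStorage. Bullet points only, no preamble.",
    max_tokens=100,  # hard stop slightly above the requested budget
)
```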

8. Prefer few-shot examples over long instruction when appropriate

A couple of high-quality examples (few-shot) often guide behaviour better than multiple paragraphs of instruction and usually cost fewer tokens than repeating the same long instruction. But provide 1 to 3 examples only, as any more is likely to follow the law of diminishing returns and cost unnecessary tokens.
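
Here is a sketch of assembling such a prompt: two short examples define the format and labelling, replacing paragraphs of instruction. The support-ticket task is made up.

```python
examples = [
    ("refund not received after 10 days", "category: billing, priority: high"),
    ("how do I change my avatar?", "category: account, priority: low"),
]

# Render the few-shot examples in the exact format the answer should follow.
shots = "\n".join(f"Ticket: {q}\nLabel: {a}" for q, a in examples)

prompt = (
    "Label each support ticket with a category and priority, as in the examples.\n"
    f"{shots}\n"
    "Ticket: app crashes on launch after the latest update\nLabel:"
)
print(prompt)
```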

9. Reduce verbosity in system messages and responses

Default system prompts or role descriptions often balloon token counts. Keep them minimal and precise.

A bad prompt might be: “You are a helpful assistant who always tries to be super thorough and explain everything.”

A better prompt might be: “You are an assistant. Be concise, factual, and provide a 2-sentence summary and 3-step action list specifically on ...”

10. Use external indexing/embedding systems for large knowledge

Don’t shove your entire knowledge base into the chat. Instead:

  • Create your own index documents with embeddings
  • At query time, retrieve just the most relevant 2 to 5 passages
  • Provide those passages as context

This is how production retrieval-augmented generation (RAG) systems scale without bloating tokens.
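
A bare-bones sketch of that retrieval step: embed the query, rank pre-embedded passages by similarity, and pass only the top few to the chat. The embed function here is a stand-in that returns a dummy unit vector so the code runs end to end; substitute your real embedding model.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in: dummy unit vector so the sketch runs; use a real embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)

passages = ["auth service config", "token refresh flow", "gateway timeout settings"]
passage_vecs = np.stack([embed(p) for p in passages])  # precompute and store offline

def top_k(query: str, k: int = 2) -> list[str]:
    scores = passage_vecs @ embed(query)  # cosine similarity (vectors are unit length)
    best = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in best]

# Only these 2-5 short passages go into the chat prompt, not the whole knowledge base.
context = "\n".join(top_k("users intermittently get 401s"))
```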

Debugging & failure modes

  • Ask for a failure-mode mini-report: “Before giving an implementation, list 3 ways this could fail and how you’d detect each.” This reduces cycles and costly re-prompts.
  • When the model re-suggests prior failed fixes: make the model compare its new suggestion to the history and justify novelty. Store failed fixes in a short problemfix.md file and instruct: “Do not repeat entries from problemfix.md unless you supply a new variation and reason.” (This pattern reduces circular loops. See my other article.) A small sketch follows below.
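
In practice that instruction can be assembled automatically from the history file. A minimal sketch, assuming problemfix.md is the short failed-fix log kept alongside the issue:

```python
from pathlib import Path

history = Path("problemfix.md")  # the short failed-fix log kept for this issue
failed_fixes = history.read_text() if history.exists() else "(no failed fixes yet)"

prompt = (
    "Before giving an implementation, list 3 ways it could fail and how you'd detect each.\n"
    "Do not repeat any entry from the failed-fix history below unless you supply "
    "a new variation and the reason it should now work.\n"
    f"--- problemfix.md ---\n{failed_fixes}"
)
```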

An example, before and after prompt compression

Before (long)

“I have this massive authentication system across microservices. Users sometimes get 401s when they shouldn’t. Here are 2,300 lines of logs, three config files, and a service class. Please debug.”

After (compressed, measured)

  1. Use the model as a summariser first: “Summarise these logs in ≤120 tokens. Highlight only error codes, stack traces, and anomalous events.” This turns thousands of tokens into a compact summary you can reuse.
  2. Paste only the most relevant snippet (e.g. 20 lines around the first 401 error).
  3. Now ask for targeted debugging:

“Given the 120-token summary and the following 20-line snippet, suggest 3 probable causes in less than 120 tokens and a 3-step reproduction checklist.”

By making the LLM do the summarisation step itself, you cut token use by 60 to 90% and still preserve all the useful signal.

Conclusion

Token efficiency is not just about saving money; it's about making LLMs more reliable and less prone to hallucination. The highest-impact moves are measuring, scoping context, using a two-pass design/implement workflow, and building a small cache of canonical prompts and summaries. Incremental discipline (measure, then summarise, then feed minimal context, then ask for concise answers) will pay compounding dividends.
