Reducing token use in LLMs
Why tokens matter
Sentences you write in an LLM chat window are broken down into tokens, which the model uses behind the scenes to predict the next word to show you. Prediction isn't the only thing tokens are used for, though: they are also the unit you are billed on.
As a rule of thumb, a token corresponds to roughly three quarters of a word (about 100 tokens for every 75 words of English). That means everything you paste into a chat, whether instructions, system/context files, prior messages, or examples, consumes tokens and therefore context budget and cost. Knowing how tokenisation works, and measuring token counts, makes the difference between a cheap, precise workflow and one that spirals into wasted tokens and hallucinations.
High-level strategy
1. Measure token use at every step
You can't optimise what you don't measure. Yes, I know, it's not always practical to measure everything, but for the more serious work, paste sample prompts and context into a token counter and compare before and after. Keep a short log of expensive prompts and what reduced them. Use tools such as Lunary's Anthropic tokeniser or any other token-calculator site.
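If you'd rather measure in code, here is a minimal sketch using OpenAI's tiktoken library. Counts are exact only for models that use the chosen encoding; for Anthropic and other vendors treat the numbers as an approximation, and the prompt file names are just examples.

```python
# pip install tiktoken
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Return how many tokens `text` occupies under the given encoding."""
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

# Compare a prompt before and after trimming (file names are illustrative).
before = open("prompt_v1.txt").read()
after = open("prompt_v2.txt").read()
print(f"before: {count_tokens(before)} tokens, after: {count_tokens(after)} tokens")
```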
2. Use a two-pass workflow: design then implement
Sketch the design first using only a compact problem statement, and feed full implementation-level context only once the plan is settled. This avoids iterating with full context before you have a plan, which saves tokens and time.
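A minimal sketch of the two passes, assuming the OpenAI Python SDK; the model name and file names (issue-summary.md, auth_service.py) are placeholders for whatever you actually use.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model name

# Pass 1: design. Only a short problem summary goes in, not the whole codebase.
plan = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "You are a software architect. Reply with a numbered plan only."},
        {"role": "user", "content": open("issue-summary.md").read()},
    ],
).choices[0].message.content

# Review and edit the plan by hand, then run pass 2.

# Pass 2: implement. The settled plan plus only the one file it touches.
patch = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Implement the plan. Output a unified diff only."},
        {"role": "user", "content": plan + "\n\n" + open("auth_service.py").read()},
    ],
).choices[0].message.content
print(patch)
```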
3. Create and use tiny context documents
One way I find of improving token usage is to write a detailed context document, then ask an AI, acting in the role of an LLM prompting expert, to turn it into an efficient expert prompt. Place that prompt in a tiny issue-specific context document, such as problem-fix.md or specific-issue-context.md, and get the AI to work from that document instead. Large monolithic files waste tokens and cause the model to drop sections or hallucinate, so break context down into a number of smaller context documents and keep each one to less than a few hundred tokens when possible.
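A sketch of that distillation step, again assuming the OpenAI Python SDK; the model name is a placeholder and detailed-context.md stands in for your own long document.

```python
from openai import OpenAI

client = OpenAI()

detailed = open("detailed-context.md").read()  # the long, detailed context document

# Ask the model, acting as a prompting expert, to distil the document.
compressed = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are an expert prompt engineer. Rewrite the user's context as the shortest prompt that preserves every constraint. Target under 300 tokens."},
        {"role": "user", "content": detailed},
    ],
).choices[0].message.content

# Save the distilled prompt as a tiny, issue-specific context document.
with open("specific-issue-context.md", "w") as f:
    f.write(compressed)
```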
4. Use summaries and progressive disclosure
A short summary is often enough; reserve long raw traces for deep debugging only. Summarise with the LLM itself. As I mentioned above, you can ask it to summarise a big document into 120 to 200 tokens, then use the summary for iterative prompts.
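Progressive disclosure can be as simple as choosing which file goes into the prompt. A small sketch, with trace-summary.md and debug.log as hypothetical file names:

```python
summary = open("trace-summary.md").read()  # the 120-200 token summary prepared earlier
RAW_TRACE = "debug.log"                    # stays on disk; not pasted by default

def build_prompt(question: str, deep_debug: bool = False) -> str:
    """Progressive disclosure: send the summary by default, raw logs only on escalation."""
    context = summary
    if deep_debug:  # escalate only if the summary-based answer was not enough
        context += "\n\n" + open(RAW_TRACE).read()
    return f"{context}\n\n{question}"

prompt = build_prompt("Why are users intermittently getting 401s?")
```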
5. Use structured inputs (JSON / YAML / labelled bullets)
Structured inputs are easier for LLMs to parse and typically shorter than verbose prose. Use keys like objective, constraints, examples, expected_output. This reduces back-and-forth clarification noise.
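For instance, the debugging request used later in this article could be expressed as labelled fields; the field values here are purely illustrative.

```python
import json

# The same request as labelled fields rather than paragraphs of prose.
request = {
    "objective": "Diagnose intermittent 401s from the auth service",
    "constraints": ["suggest at most 3 causes", "answer in under 120 tokens"],
    "examples": ["user A: 401 at 09:14, immediately after a token refresh"],
    "expected_output": "numbered list of causes, each with a one-line check",
}
prompt = json.dumps(request, indent=2)
```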
6. Cache validated answers and canonical prompts
If you validated a prompt/response pair, reuse it instead of re-asking the model from scratch. Keep a small local library of canonical prompts and tiny context snippets.
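A tiny local cache keyed by a hash of the prompt is enough to start with; this is a sketch, and prompt_cache.json plus the ask_model callable are hypothetical names.

```python
import hashlib
import json
import os

CACHE_PATH = "prompt_cache.json"  # hypothetical local cache file

def cached_answer(prompt: str, ask_model) -> str:
    """Return a previously validated answer if one exists; otherwise call the model once."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    cache = json.load(open(CACHE_PATH)) if os.path.exists(CACHE_PATH) else {}
    if key not in cache:
        cache[key] = ask_model(prompt)  # only pay for these tokens the first time
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f, indent=2)
    return cache[key]
```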
7. Ask for token-aware outputs
Models can often compress their own output if you request that explicitly.
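You can combine the instruction with a hard cap on output length. A sketch assuming the OpenAI Python SDK, with the model and design-doc.md as placeholders:

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    max_tokens=150,       # hard cap on the response as a safety net
    messages=[{
        "role": "user",
        "content": "Summarise the design doc below in at most 120 tokens, as bullet points.\n\n"
                   + open("design-doc.md").read(),
    }],
)
print(resp.choices[0].message.content)
```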
8. Prefer few-shot examples over long instructions when appropriate
A couple of high-quality examples (few-shot) often guide behaviour better than multiple paragraphs of instruction, and usually cost fewer tokens than repeating the same long instruction. But provide 1 to 3 examples only; beyond that you hit diminishing returns and spend unnecessary tokens.
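In the chat-message format most APIs share, few-shot just means prefixing the real query with short worked examples; the ticket-classification task here is invented for illustration.

```python
# Two short, high-quality examples instead of a long instruction block.
messages = [
    {"role": "system", "content": "Classify each support ticket as billing, auth, or other."},
    {"role": "user", "content": "I was charged twice this month."},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "Login keeps failing with a 401 after my password reset."},
    {"role": "assistant", "content": "auth"},
    {"role": "user", "content": "The mobile app crashes when I open settings."},  # the real query
]
```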
9. Reduce verbosity in system messages and responses
Default system prompts or role descriptions often balloon token counts. Keep them minimal and precise.
A bad prompt might be: “You are a helpful assistant who always tries to be super thorough and explain everything.”
A better prompt might be: “You are an assistant. Be concise, factual, and provide a 2-sentence summary and 3-step action list specifically on ...”
10. Use external indexing/embedding systems for large knowledge
Don’t shove your entire knowledge base into the chat. Instead:
- Chunk your documents and store their embeddings in a vector index, outside the chat.
- At query time, retrieve only the handful of chunks most relevant to the question.
- Pass just those chunks, plus the question, to the model.
This is how production retrieval-augmented generation (RAG) systems scale without bloating tokens.
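A stripped-down sketch of the retrieval step, assuming the OpenAI embeddings API and plain cosine similarity in NumPy; the embedding model name and the chunk contents are placeholders, and a real system would use a proper vector store.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed texts with a hosted embedding model (model name is a placeholder)."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Chunk your knowledge base once, offline, and keep the vectors around.
chunks = ["auth service config ...", "token refresh flow ...", "billing webhooks ..."]
chunk_vecs = embed(chunks)

def top_k(query: str, k: int = 3) -> list[str]:
    """Return only the k most relevant chunks to place in the prompt."""
    q = embed([query])[0]
    scores = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

relevant = top_k("Why do users intermittently get 401s after a token refresh?")
```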
Debugging & failure modes
An example: before and after prompt compression
Before (long)
“I have this massive authentication system across microservices. Users sometimes get 401s when they shouldn’t. Here are 2,300 lines of logs, three config files, and a service class. Please debug.”
After (compressed, measured)
“Given the 120-token summary and the following 20-line snippet, suggest 3 probable causes in less than 120 tokens and a 3-step reproduction checklist.”
By making the LLM do the summarisation step itself, you can cut token use by 60 to 90% while preserving most of the useful signal.
Conclusion
Token efficiency is not just about saving money; it’s about making LLMs more reliable and less prone to hallucination. The highest-impact moves are measuring, scoping context, using a two-pass design/implement workflow, and building a small cache of canonical prompts and summaries. Incremental discipline (measure, summarise, feed minimal context, ask for concise answers) pays compounding dividends.