TOON: A Deep Dive into Token-Oriented Object Notation
Optimizing structured data for large language models
1. Introduction
Large Language Models (LLMs) are not just language processors; they are token processors. Every interaction we have with them, from passing instructions to embedding complex data, is measured in tokens.
Each token costs compute, latency, and money. When dealing with large structured datasets - user profiles, logs, or tabular data - most teams still use JSON or YAML as serialization formats. Both are readable and expressive, but also verbose. The repetition of keys across objects, heavy use of punctuation, and strict quoting lead to unnecessary token bloat.
TOON (Token-Oriented Object Notation) emerges as a clever, minimal, and LLM-aware alternative to JSON. Designed by Toon Format contributors, it aims to represent structured data as compactly as possible without losing human readability or machine interpretability.
2. Why TOON Exists
JSON: Readable but Redundant
Take a simple list of users:
[
  {"id": 1, "name": "Alice", "role": "admin"},
  {"id": 2, "name": "Bob", "role": "user"}
]
This structure is clear - but wasteful. The field names id, name, and role are repeated for every entry. To a tokenizer, that’s needless duplication. In large arrays, JSON spends more tokens on field labels than on the data itself.
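To make the redundancy concrete, here is a quick back-of-the-envelope check in TypeScript. Bytes stand in as a rough proxy for tokens, and `keyBytes` is an illustrative helper, not part of any TOON tooling:

```typescript
// Quick illustration: how much of a serialized JSON array is spent on
// repeated field names rather than on data? Bytes approximate tokens here.
const users = [
  { id: 1, name: "Alice", role: "admin" },
  { id: 2, name: "Bob", role: "user" },
];

const json = JSON.stringify(users);

// Every object repeats each quoted key plus its colon: `"id":`, `"name":`, `"role":`
const keyBytes =
  users.length *
  ["id", "name", "role"].reduce((sum, k) => sum + `"${k}":`.length, 0);

console.log(`${keyBytes} of ${json.length} bytes are field labels`); // exactly half here
```

For this two-element array, the field labels account for exactly half of the serialized payload, and the ratio of label bytes to data bytes stays constant no matter how many rows you add.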
YAML: Readable but Inconsistent
YAML improves readability but lacks a stable schema. It’s not ideal for LLM parsing, which relies on consistent patterns.
CSV: Compact but Loses Semantics
CSV excels at compactness but loses nested structure and metadata. You can’t easily nest or mix objects and primitives.
TOON combines the strengths of all three - JSON’s structure, YAML’s readability, and CSV’s compactness - while eliminating their inefficiencies.
3. Core Design Philosophy
TOON’s central idea is simple:
Data should be token-efficient, human-readable, and structure-preserving.
It achieves this through:
- Indentation instead of braces and brackets for nesting
- Declaring field names once for uniform arrays, followed by compact data rows
- Leaving strings unquoted wherever quoting is unnecessary
- Explicit array lengths and field lists that give the model a predictable schema
4. The Syntax in Detail
4.1 Basic Objects
A TOON object is expressed via indentation and colons:
user:
  id: 1
  name: Alice
  role: admin
Equivalent JSON:
{"user": {"id": 1, "name": "Alice", "role": "admin"}}
The omission of braces and quotes drastically cuts token count while keeping clarity.
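The indentation rule can be sketched in a few lines. This `toToon` function is a hypothetical illustration for objects of primitives, not the official encoder:

```typescript
// Minimal sketch (not the official library): serialize a nested object of
// primitives into TOON-style "key: value" lines, two spaces per level.
type Primitive = string | number | boolean;
type Nested = { [key: string]: Primitive | Nested };

function toToon(obj: Nested, depth = 0): string {
  const pad = "  ".repeat(depth);
  return Object.entries(obj)
    .map(([key, value]) =>
      typeof value === "object"
        ? `${pad}${key}:\n${toToon(value, depth + 1)}` // nested object: recurse
        : `${pad}${key}: ${value}` // primitive: one line
    )
    .join("\n");
}

console.log(toToon({ user: { id: 1, name: "Alice", role: "admin" } }));
// user:
//   id: 1
//   name: Alice
//   role: admin
```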
4.2 Arrays
TOON distinguishes between uniform and non-uniform arrays.
(a) Uniform Arrays of Objects
If all objects share the same fields:
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
Equivalent JSON:
{
  "users": [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"}
  ]
}
This pattern achieves massive token savings because LLMs need to process field names only once.
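The tabular form is easy to picture as code. This `encodeUniform` function is an illustrative sketch under the assumption that every object shares the first object's fields; it is not the official encoder:

```typescript
// Illustrative sketch (not the official library): emit a uniform array of
// objects in TOON's tabular form — header declared once, one row per object.
function encodeUniform(
  key: string,
  rows: Record<string, string | number>[],
): string {
  const fields = Object.keys(rows[0]); // assumes a non-empty, uniform array
  const header = `${key}[${rows.length}]{${fields.join(",")}}:`;
  const body = rows.map((r) => "  " + fields.map((f) => r[f]).join(","));
  return [header, ...body].join("\n");
}

console.log(encodeUniform("users", [
  { id: 1, name: "Alice", role: "admin" },
  { id: 2, name: "Bob", role: "user" },
]));
// users[2]{id,name,role}:
//   1,Alice,admin
//   2,Bob,user
```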
(b) Non-Uniform Arrays
If elements vary in structure:
data[3]:
  - 1
  - text
  - id: 99
    value: test
TOON gracefully falls back to YAML-like syntax here, trading compactness for expressiveness.
4.3 Primitive Arrays
tags[3]: red,green,blue
Equivalent JSON:
{"tags": ["red", "green", "blue"]}
Optional: use [#3] to emphasize array length explicitly.
4.4 String Rules
Strings are unquoted unless necessary, e.g., when they:
- contain the active delimiter or a colon
- have leading or trailing whitespace, or are empty
- could be mistaken for a number, boolean, or null
Otherwise, plain strings remain bare - saving yet more tokens.
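These conditions can be captured as a predicate. This `needsQuoting` sketch mirrors the informal list above, not the normative spec, so treat the exact rules as an assumption:

```typescript
// Hedged sketch of the quoting rule: quote only when a bare string would be
// ambiguous to a parser. The real spec's rules may differ in detail.
function needsQuoting(s: string, delimiter = ","): boolean {
  if (s === "" || s !== s.trim()) return true; // empty or padded with whitespace
  if (s.includes(delimiter) || s.includes(":")) return true; // structural chars
  if (["true", "false", "null"].includes(s)) return true; // keyword-like
  if (!Number.isNaN(Number(s))) return true; // number-like, e.g. "3.14"
  return false;
}

console.log(needsQuoting("Alice")); // false — stays bare
console.log(needsQuoting("3.14")); // true — would parse as a number
```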
4.5 Custom Delimiters
Default delimiter: comma (,). Optional alternatives: tab (\t) or pipe (|).
Example with tabs (the separators below are literal tab characters):
users[2]{id	name	role}:
  1	Alice	admin
  2	Bob	user
Tabs minimize visible punctuation and sometimes tokenize more efficiently than commas, depending on the LLM’s tokenizer.
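As a sketch, here is the same table rendered under each delimiter; note that the header's field list uses the active delimiter too. `renderRows` is an illustrative helper, not library API:

```typescript
// Sketch: render one table with each delimiter TOON supports.
function renderRows(rows: (string | number)[][], delimiter: string): string[] {
  return rows.map((r) => r.map(String).join(delimiter));
}

const table: (string | number)[][] = [
  ["id", "name", "role"], // header fields
  [1, "Alice", "admin"],
  [2, "Bob", "user"],
];

for (const delimiter of [",", "\t", "|"]) {
  const [head, ...body] = renderRows(table, delimiter);
  console.log(`users[2]{${head}}:`); // header uses the same delimiter
  for (const line of body) console.log("  " + line);
}
```

Which variant is cheapest depends on the target model's tokenizer, so it is worth measuring rather than assuming.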
5. Benchmarks & Token Efficiency
According to the official repository benchmarks:
| Dataset | JSON Tokens | TOON Tokens | Savings |
|--------------------------|-------------|--------------|----------|
| GitHub Repos (100 items) | 15,145 | 8,745 | 42% |
| Books Sample | 6,013 | 3,689 | 39% |
| Uniform Logs | 12,460 | 6,800 | 45% |
Result: less token waste, faster inference, and often more consistent model behavior.
6. Implementation Overview
The official implementation is in TypeScript.
Installation
npm install @toon-format/toon
Encoding Example
import { encode } from '@toon-format/toon';
const data = {
  users: [
    { id: 1, name: 'Alice', role: 'admin' },
    { id: 2, name: 'Bob', role: 'user' }
  ]
};
console.log(encode(data));
Output:
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
CLI Usage
npx @toon-format/cli encode data.json > data.toon
npx @toon-format/cli decode data.toon > data.json
You can also tweak delimiter (--delimiter "\t") and indentation (--indent 2) to suit your workflow.
7. Integration with LLM Workflows
When feeding data into an LLM, embed the TOON payload directly in the prompt:
products[3]{id,name,price}:
  1,Widget,19.99
  2,Gizmo,24.50
  3,Thingamajig,14.75
Then prompt:
“From the TOON data below, return only products with price > 20 as TOON.”
This approach improves:
- Token cost: both the payload and the expected output are compact
- Latency: fewer tokens to ingest and generate
- Consistency: the tabular pattern nudges the model toward well-formed rows
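The workflow above can be wired up as a simple prompt template. The payload string here is hand-written to mirror the example; in practice you would generate it with a TOON encoder:

```typescript
// Sketch: combine a plain-language instruction with a TOON payload into a
// single prompt string ready to send to a model.
const toonPayload = [
  "products[3]{id,name,price}:",
  "  1,Widget,19.99",
  "  2,Gizmo,24.50",
  "  3,Thingamajig,14.75",
].join("\n");

const instruction =
  "From the TOON data below, return only products with price > 20 as TOON.";

const prompt = [instruction, "", toonPayload].join("\n");
console.log(prompt);
```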
8. When Not to Use TOON
TOON shines for uniform, tabular, semi-structured data, but it’s not universal.
Avoid it when:
- Data is deeply nested or highly non-uniform, where TOON falls back to its more verbose YAML-like syntax and the savings evaporate
- Downstream tooling expects JSON or YAML, and you would pay conversion costs on every hop
- Broad ecosystem support (validators, parsers, schemas) matters more than token savings
In such cases, JSON or YAML remain more appropriate.
9. Theoretical Angle: Why TOON Works So Well
LLMs operate over token sequences. Reducing punctuation, quotes, and repeated substrings directly compresses context length. This compression yields two benefits:
- Lower cost and latency, since every request carries fewer tokens
- More usable context window, leaving room for instructions and additional data
TOON doesn’t merely save tokens - it guides the model to reason structurally.
10. Future Directions
If TOON gains adoption, it could become the de facto compact format for LLM-native data pipelines.
11. Conclusion
TOON redefines how we serialize structured data for AI systems. It brings together:
- JSON's structure
- YAML's readability
- CSV's compactness
In an era where every token counts, TOON offers a pragmatic step forward - not by changing the model, but by changing how we talk to it.
In summary: If you’re building systems that feed structured data to LLMs - from RAG pipelines to dataset summarizers - try encoding your data with TOON. The difference in both cost and model clarity might surprise you.