TOON: A Deep Dive into Token-Oriented Object Notation

Optimizing structured data for large language models


1. Introduction

Large Language Models (LLMs) are not just language processors; they are token processors. Every interaction we have with them, from passing instructions to embedding complex data, is measured in tokens.

Each token costs compute, latency, and money. When dealing with large structured datasets - user profiles, logs, or tabular data - most teams still use JSON or YAML as serialization formats. Both are readable and expressive, but also verbose. The repetition of keys across objects, heavy use of punctuation, and strict quoting lead to unnecessary token bloat.

TOON (Token-Oriented Object Notation) emerges as a clever, minimal, and LLM-aware alternative to JSON. Designed by Toon Format contributors, it aims to represent structured data as compactly as possible without losing human readability or machine interpretability.

GitHub Repository → toon-format/toon

2. Why TOON Exists

JSON: Readable but Redundant

Take a simple list of users:

[
  {"id": 1, "name": "Alice", "role": "admin"},
  {"id": 2, "name": "Bob", "role": "user"}
]        

This structure is clear - but wasteful. The field names id, name, and role are repeated for every entry. To a tokenizer, that’s needless duplication. In large arrays, JSON can spend more tokens on field labels than on the values themselves.
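To see the duplication concretely, a few lines can measure how much of that JSON payload is spent on key names alone (character counts only approximate token counts, which depend on the tokenizer):

```typescript
// Rough illustration: how much of a JSON payload is key labels?
const users = [
  { id: 1, name: "Alice", role: "admin" },
  { id: 2, name: "Bob", role: "user" },
];

const json = JSON.stringify(users);

// Each row repeats every key, quotes and colon included: "id": "name": "role":
const keyChars =
  users.length *
  ['"id":', '"name":', '"role":'].reduce((sum, k) => sum + k.length, 0);

console.log(json.length, keyChars); // 76 38 → key labels are half the payload
```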

YAML: Readable but Inconsistent

YAML improves readability, but it allows many ways to express the same structure. That inconsistency makes it a poor fit for LLM parsing, which relies on stable, repeated patterns.

CSV: Compact but Loses Semantics

CSV excels at compactness but loses nested structure and metadata. You can’t easily nest or mix objects and primitives.

TOON combines the strengths of all three - JSON’s structure, YAML’s readability, and CSV’s compactness - while eliminating their inefficiencies.

3. Core Design Philosophy

TOON’s central idea is simple:

Data should be token-efficient, human-readable, and structure-preserving.

It achieves this through:

  • Structural minimalism: whitespace and indentation instead of braces.
  • Field reuse: for uniform arrays, field names appear once.
  • Selective quoting: only when necessary.
  • Flexible delimiters: to further reduce token overhead.

4. The Syntax in Detail

4.1 Basic Objects

A TOON object is expressed via indentation and colons:

user:
  id: 1
  name: Alice
  role: admin        

Equivalent JSON:

{"user": {"id": 1, "name": "Alice", "role": "admin"}}
        

The omission of braces and quotes drastically cuts token count while keeping clarity.
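As a sketch (not the official encoder), rendering a flat object in this style takes only a few lines. The function name and the string/number restriction here are my own simplifications:

```typescript
// Minimal sketch of TOON-style output for a flat object. Assumes values are
// strings or numbers that need no quoting; the real format has more rules.
function encodeFlat(name: string, obj: Record<string, string | number>): string {
  const lines = [`${name}:`];
  for (const [key, value] of Object.entries(obj)) {
    lines.push(`  ${key}: ${value}`); // one indented "key: value" line per field
  }
  return lines.join("\n");
}

console.log(encodeFlat("user", { id: 1, name: "Alice", role: "admin" }));
// user:
//   id: 1
//   name: Alice
//   role: admin
```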

4.2 Arrays

TOON distinguishes between uniform and non-uniform arrays.

(a) Uniform Arrays of Objects

If all objects share the same fields:

users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user        

Equivalent JSON:

{
  "users": [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"}
  ]
}        

  • [2] declares the number of elements (optional but recommended).
  • {id,name,role} defines the schema once.
  • Rows follow in simple CSV form.

This pattern achieves massive token savings because LLMs need to process field names only once.
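The tabular form is mechanical enough to sketch in a few lines. This is an illustrative sketch, not the library's implementation; it assumes every row shares the same keys and no value needs quoting:

```typescript
// Sketch: tabular TOON for a uniform array of objects. Assumes all rows have
// identical keys and values that need no quoting.
function encodeUniform(
  name: string,
  rows: Record<string, string | number>[]
): string {
  const fields = Object.keys(rows[0]);
  // Header declares length and schema once: users[2]{id,name,role}:
  const header = `${name}[${rows.length}]{${fields.join(",")}}:`;
  // Each row is a plain CSV line, indented under the header.
  const body = rows.map((r) => "  " + fields.map((f) => r[f]).join(","));
  return [header, ...body].join("\n");
}

const toon = encodeUniform("users", [
  { id: 1, name: "Alice", role: "admin" },
  { id: 2, name: "Bob", role: "user" },
]);
console.log(toon);
// users[2]{id,name,role}:
//   1,Alice,admin
//   2,Bob,user
```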

(b) Non-Uniform Arrays

If elements vary in structure:

data[3]:
  - 1
  - text
  - id: 99
    value: test        

TOON gracefully falls back to YAML-like syntax here, trading compactness for expressiveness.
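An encoder has to decide which form applies. One way to sketch that check (my own helper, not part of the library): the tabular form qualifies only when every element is a plain object with an identical key list.

```typescript
// Sketch: does this array qualify for the tabular form?
function isUniform(arr: unknown[]): boolean {
  if (arr.length === 0) return false;
  // Reduce each element to its key list, or null if it isn't a plain object.
  const keyLists = arr.map((el) =>
    el !== null && typeof el === "object" && !Array.isArray(el)
      ? Object.keys(el as object).join(",")
      : null
  );
  // Uniform ⇔ every element is an object and all key lists match the first.
  return keyLists.every((k) => k !== null && k === keyLists[0]);
}

console.log(isUniform([{ id: 1 }, { id: 2 }])); // true  → tabular form
console.log(isUniform([1, "text", { id: 99 }])); // false → list fallback
```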

4.3 Primitive Arrays

tags[3]: red,green,blue        

Equivalent JSON:

{"tags": ["red", "green", "blue"]}        

Optional: use [#3] to emphasize array length explicitly.

4.4 String Rules

Strings are unquoted unless necessary, e.g., when:

  • They contain special characters, commas, or leading/trailing spaces.
  • They resemble numbers or booleans.
  • They include the delimiter character.

Otherwise, plain strings remain bare - saving yet more tokens.
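The rules above can be sketched as a quoting predicate. This is a rough approximation for illustration; the actual spec's rules are more precise:

```typescript
// Sketch of a quoting heuristic following the rules above.
function needsQuoting(s: string, delimiter = ","): boolean {
  if (s !== s.trim()) return true; // leading/trailing spaces
  if (s.includes(delimiter) || /[:"\n]/.test(s)) return true; // structural chars
  if (/^(true|false|null)$/.test(s)) return true; // resembles a literal
  if (s !== "" && !Number.isNaN(Number(s))) return true; // resembles a number
  return false; // plain string stays bare
}

console.log(needsQuoting("Alice")); // false
console.log(needsQuoting("42"));    // true
console.log(needsQuoting("a, b"));  // true
```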

4.5 Custom Delimiters

The default delimiter is the comma (,). Optional alternatives: tab (\t) or pipe (|).

Example with tabs:

users[2]{id	name	role}:
  1	Alice	admin
  2	Bob	user        

Tabs minimize visible punctuation and sometimes tokenize more efficiently than commas, depending on the LLM’s tokenizer.
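Because the delimiter is just a join character, swapping it is trivial. A small parameterized sketch (the helper name is mine); which delimiter tokenizes best is worth measuring with your model's own tokenizer rather than assuming:

```typescript
// Sketch: the same row emitted with different delimiters.
function joinRow(values: (string | number)[], delimiter: string): string {
  return values.map(String).join(delimiter);
}

console.log(joinRow([1, "Alice", "admin"], ","));  // 1,Alice,admin
console.log(joinRow([1, "Alice", "admin"], "\t")); // tab-separated variant
console.log(joinRow([1, "Alice", "admin"], "|"));  // 1|Alice|admin
```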

5. Benchmarks & Token Efficiency

According to the official repository benchmarks:

| Dataset                  | JSON Tokens | TOON Tokens | Savings |
|--------------------------|-------------|-------------|---------|
| GitHub Repos (100 items) | 15,145      | 8,745       | 42%     |
| Books Sample             | 6,013       | 3,689       | 39%     |
| Uniform Logs             | 12,460      | 6,800       | 45%     |

Result: less token waste, faster inference, and often more consistent model behavior.
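The savings column is simply the relative reduction, (JSON − TOON) / JSON, rounded to a whole percent. A one-liner reproduces the table's figures:

```typescript
// Percent saved when a JSON payload of `json` tokens shrinks to `toon` tokens.
const savings = (json: number, toon: number): number =>
  Math.round((1 - toon / json) * 100);

console.log(savings(15145, 8745)); // 42
console.log(savings(6013, 3689));  // 39
console.log(savings(12460, 6800)); // 45
```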

6. Implementation Overview

The official implementation is in TypeScript.

Installation

npm install @toon-format/toon        

Encoding Example

import { encode } from '@toon-format/toon';

const data = {
  users: [
    { id: 1, name: 'Alice', role: 'admin' },
    { id: 2, name: 'Bob', role: 'user' }
  ]
};

console.log(encode(data));        

Output:

users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user        

CLI Usage

npx @toon-format/cli encode data.json > data.toon
npx @toon-format/cli decode data.toon > data.json        

You can also tweak delimiter (--delimiter "\t") and indentation (--indent 2) to suit your workflow.

7. Integration with LLM Workflows

When feeding data into an LLM:

products[3]{id,name,price}:
  1,Widget,19.99
  2,Gizmo,24.50
  3,Thingamajig,14.75        

Then prompt:

“From the TOON data below, return only products with price > 20 as TOON.”

This approach improves:

  • Parsing stability (LLMs handle columns predictably)
  • Token economy (reduces overhead)
  • Response consistency (output often mirrors input format cleanly)
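Assembling such a prompt is plain string work; nothing here is specific to any particular LLM API (the variable names are illustrative):

```typescript
// Sketch: embedding TOON data in a prompt string.
const toonData = [
  "products[3]{id,name,price}:",
  "  1,Widget,19.99",
  "  2,Gizmo,24.50",
  "  3,Thingamajig,14.75",
].join("\n");

const prompt =
  "From the TOON data below, return only products with price > 20 as TOON.\n\n" +
  toonData;

console.log(prompt); // instruction followed by the compact table
```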

8. When Not to Use TOON

TOON shines for uniform, tabular, semi-structured data, but it’s not universal.

Avoid it when:

  • Your data is deeply nested.
  • Each object has varying field sets.
  • You rely on strict JSON schemas or type validation.
  • Token cost isn’t critical (e.g., one-off single-prompt tasks).

In such cases, JSON or YAML remain more appropriate.

9. Theoretical Angle: Why TOON Works So Well

LLMs operate over token sequences. Reducing punctuation, quotes, and repeated substrings directly compresses context length. This compression yields two benefits:

  1. Longer effective context window – More data fits within the same token limit.
  2. Improved structural priors – Consistent indentation and headers act as cues, helping the model align tokens to schema semantics.

TOON doesn’t merely save tokens - it guides the model to reason structurally.

10. Future Directions

  • Python and Rust libraries: for broader ecosystem support.
  • Schema validation: automatic enforcement of uniform array structures.
  • Hybrid JSON-TOON adapters: enabling partial compression for mixed data.
  • Tokenizer-aware tuning: choosing delimiters per model (e.g., tab vs comma).

If TOON gains adoption, it could become the de facto compact format for LLM-native data pipelines.

11. Conclusion

TOON redefines how we serialize structured data for AI systems. It brings together:

  • Human readability
  • Machine efficiency
  • Structural clarity

In an era where every token counts, TOON offers a pragmatic step forward - not by changing the model, but by changing how we talk to it.

In summary: If you’re building systems that feed structured data to LLMs - from RAG pipelines to dataset summarizers - try encoding your data with TOON. The difference in both cost and model clarity might surprise you.
