TOON: A Deep Dive into Token-Oriented Object Notation

Optimizing structured data for large language models


1. Introduction

Large Language Models (LLMs) are not just language processors; they are token processors. Every interaction we have with them, from passing instructions to embedding complex data, is measured in tokens.

Each token costs compute, latency, and money. When dealing with large structured datasets - user profiles, logs, or tabular data - most teams still use JSON or YAML as serialization formats. Both are readable and expressive, but also verbose. The repetition of keys across objects, heavy use of punctuation, and strict quoting lead to unnecessary token bloat.

TOON (Token-Oriented Object Notation) emerges as a clever, minimal, and LLM-aware alternative to JSON. Designed by Toon Format contributors, it aims to represent structured data as compactly as possible without losing human readability or machine interpretability.

GitHub Repository → toon-format/toon

2. Why TOON Exists

JSON: Readable but Redundant

Take a simple list of users:

[
  {"id": 1, "name": "Alice", "role": "admin"},
  {"id": 2, "name": "Bob", "role": "user"}
]        

This structure is clear - but wasteful. The field names id, name, and role are repeated for every entry. To a tokenizer, that’s needless duplication. In large arrays, JSON can spend more tokens on field labels than on the values themselves.
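To see the duplication concretely, a few lines can measure how much of that JSON payload is spent on key names alone (character counts only approximate token counts, which depend on the tokenizer):

```typescript
// Rough illustration: how much of a JSON payload is key labels?
const users = [
  { id: 1, name: "Alice", role: "admin" },
  { id: 2, name: "Bob", role: "user" },
];

const json = JSON.stringify(users);

// Each row repeats every key, quotes and colon included: "id": "name": "role":
const keyChars =
  users.length *
  ['"id":', '"name":', '"role":'].reduce((sum, k) => sum + k.length, 0);

console.log(json.length, keyChars); // 76 38 → key labels are half the payload
```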

YAML: Readable but Inconsistent

YAML improves readability, but it allows many ways to express the same structure. That inconsistency makes it a poor fit for LLM parsing, which relies on stable, repeated patterns.

CSV: Compact but Loses Semantics

CSV excels at compactness but loses nested structure and metadata. You can’t easily nest or mix objects and primitives.

TOON combines the strengths of all three - JSON’s structure, YAML’s readability, and CSV’s compactness - while eliminating their inefficiencies.

3. Core Design Philosophy

TOON’s central idea is simple:

Data should be token-efficient, human-readable, and structure-preserving.

It achieves this through:

  • Structural minimalism: whitespace and indentation instead of braces.
  • Field reuse: for uniform arrays, field names appear once.
  • Selective quoting: only when necessary.
  • Flexible delimiters: to further reduce token overhead.

4. The Syntax in Detail

4.1 Basic Objects

A TOON object is expressed via indentation and colons:

user:
  id: 1
  name: Alice
  role: admin        

Equivalent JSON:

{"user": {"id": 1, "name": "Alice", "role": "admin"}}
        

The omission of braces and quotes drastically cuts token count while keeping clarity.
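As a sketch (not the official encoder), rendering a flat object in this style takes only a few lines. The function name and the string/number restriction here are my own simplifications:

```typescript
// Minimal sketch of TOON-style output for a flat object. Assumes values are
// strings or numbers that need no quoting; the real format has more rules.
function encodeFlat(name: string, obj: Record<string, string | number>): string {
  const lines = [`${name}:`];
  for (const [key, value] of Object.entries(obj)) {
    lines.push(`  ${key}: ${value}`); // one indented "key: value" line per field
  }
  return lines.join("\n");
}

console.log(encodeFlat("user", { id: 1, name: "Alice", role: "admin" }));
// user:
//   id: 1
//   name: Alice
//   role: admin
```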

4.2 Arrays

TOON distinguishes between uniform and non-uniform arrays.

(a) Uniform Arrays of Objects

If all objects share the same fields:

users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user        

Equivalent JSON:

{
  "users": [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"}
  ]
}        

  • [2] declares the number of elements (optional but recommended).
  • {id,name,role} defines the schema once.
  • Rows follow in simple CSV form.

This pattern achieves massive token savings because LLMs need to process field names only once.
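The tabular form is mechanical enough to sketch in a few lines. This is an illustrative sketch, not the library's implementation; it assumes every row shares the same keys and no value needs quoting:

```typescript
// Sketch: tabular TOON for a uniform array of objects. Assumes all rows have
// identical keys and values that need no quoting.
function encodeUniform(
  name: string,
  rows: Record<string, string | number>[]
): string {
  const fields = Object.keys(rows[0]);
  // Header declares length and schema once: users[2]{id,name,role}:
  const header = `${name}[${rows.length}]{${fields.join(",")}}:`;
  // Each row is a plain CSV line, indented under the header.
  const body = rows.map((r) => "  " + fields.map((f) => r[f]).join(","));
  return [header, ...body].join("\n");
}

const toon = encodeUniform("users", [
  { id: 1, name: "Alice", role: "admin" },
  { id: 2, name: "Bob", role: "user" },
]);
console.log(toon);
// users[2]{id,name,role}:
//   1,Alice,admin
//   2,Bob,user
```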

(b) Non-Uniform Arrays

If elements vary in structure:

data[3]:
  - 1
  - text
  - id: 99
    value: test        

TOON gracefully falls back to YAML-like syntax here, trading compactness for expressiveness.
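An encoder has to decide which form applies. One way to sketch that check (my own helper, not part of the library): the tabular form qualifies only when every element is a plain object with an identical key list.

```typescript
// Sketch: does this array qualify for the tabular form?
function isUniform(arr: unknown[]): boolean {
  if (arr.length === 0) return false;
  // Reduce each element to its key list, or null if it isn't a plain object.
  const keyLists = arr.map((el) =>
    el !== null && typeof el === "object" && !Array.isArray(el)
      ? Object.keys(el as object).join(",")
      : null
  );
  // Uniform ⇔ every element is an object and all key lists match the first.
  return keyLists.every((k) => k !== null && k === keyLists[0]);
}

console.log(isUniform([{ id: 1 }, { id: 2 }])); // true  → tabular form
console.log(isUniform([1, "text", { id: 99 }])); // false → list fallback
```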

4.3 Primitive Arrays

tags[3]: red,green,blue        

Equivalent JSON:

{"tags": ["red", "green", "blue"]}        

Optional: use [#3] to emphasize array length explicitly.

4.4 String Rules

Strings are unquoted unless necessary, e.g., when:

  • They contain special characters, commas, or leading/trailing spaces.
  • They resemble numbers or booleans.
  • They include the delimiter character.

Otherwise, plain strings remain bare - saving yet more tokens.
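The rules above can be sketched as a quoting predicate. This is a rough approximation for illustration; the actual spec's rules are more precise:

```typescript
// Sketch of a quoting heuristic following the rules above.
function needsQuoting(s: string, delimiter = ","): boolean {
  if (s !== s.trim()) return true; // leading/trailing spaces
  if (s.includes(delimiter) || /[:"\n]/.test(s)) return true; // structural chars
  if (/^(true|false|null)$/.test(s)) return true; // resembles a literal
  if (s !== "" && !Number.isNaN(Number(s))) return true; // resembles a number
  return false; // plain string stays bare
}

console.log(needsQuoting("Alice")); // false
console.log(needsQuoting("42"));    // true
console.log(needsQuoting("a, b"));  // true
```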

4.5 Custom Delimiters

The default delimiter is the comma (,). Optional alternatives: tab (\t) or pipe (|).

Example with tabs:

users[2]{id	name	role}:
  1	Alice	admin
  2	Bob	user        

Tabs minimize visible punctuation and sometimes tokenize more efficiently than commas, depending on the LLM’s tokenizer.
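Because the delimiter is just a join character, swapping it is trivial. A small parameterized sketch (the helper name is mine); which delimiter tokenizes best is worth measuring with your model's own tokenizer rather than assuming:

```typescript
// Sketch: the same row emitted with different delimiters.
function joinRow(values: (string | number)[], delimiter: string): string {
  return values.map(String).join(delimiter);
}

console.log(joinRow([1, "Alice", "admin"], ","));  // 1,Alice,admin
console.log(joinRow([1, "Alice", "admin"], "\t")); // tab-separated variant
console.log(joinRow([1, "Alice", "admin"], "|"));  // 1|Alice|admin
```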

5. Benchmarks & Token Efficiency

According to the official repository benchmarks:

| Dataset                  | JSON Tokens | TOON Tokens | Savings |
|--------------------------|-------------|-------------|---------|
| GitHub Repos (100 items) | 15,145      | 8,745       | 42%     |
| Books Sample             | 6,013       | 3,689       | 39%     |
| Uniform Logs             | 12,460      | 6,800       | 45%     |

Result: less token waste, faster inference, and often more consistent model behavior.
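The savings column is simply the relative reduction, (JSON − TOON) / JSON, rounded to a whole percent. A one-liner reproduces the table's figures:

```typescript
// Percent saved when a JSON payload of `json` tokens shrinks to `toon` tokens.
const savings = (json: number, toon: number): number =>
  Math.round((1 - toon / json) * 100);

console.log(savings(15145, 8745)); // 42
console.log(savings(6013, 3689));  // 39
console.log(savings(12460, 6800)); // 45
```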

6. Implementation Overview

The official implementation is in TypeScript.

Installation

npm install @toon-format/toon        

Encoding Example

import { encode } from '@toon-format/toon';

const data = {
  users: [
    { id: 1, name: 'Alice', role: 'admin' },
    { id: 2, name: 'Bob', role: 'user' }
  ]
};

console.log(encode(data));        

Output:

users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user        

CLI Usage

npx @toon-format/cli encode data.json > data.toon
npx @toon-format/cli decode data.toon > data.json        

You can also tweak delimiter (--delimiter "\t") and indentation (--indent 2) to suit your workflow.

7. Integration with LLM Workflows

When feeding data into an LLM:

products[3]{id,name,price}:
  1,Widget,19.99
  2,Gizmo,24.50
  3,Thingamajig,14.75        

Then prompt:

“From the TOON data below, return only products with price > 20 as TOON.”

This approach improves:

  • Parsing stability (LLMs handle columns predictably)
  • Token economy (reduces overhead)
  • Response consistency (output often mirrors input format cleanly)
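Assembling such a prompt is plain string work; nothing here is specific to any particular LLM API (the variable names are illustrative):

```typescript
// Sketch: embedding TOON data in a prompt string.
const toonData = [
  "products[3]{id,name,price}:",
  "  1,Widget,19.99",
  "  2,Gizmo,24.50",
  "  3,Thingamajig,14.75",
].join("\n");

const prompt =
  "From the TOON data below, return only products with price > 20 as TOON.\n\n" +
  toonData;

console.log(prompt); // instruction followed by the compact table
```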

8. When Not to Use TOON

TOON shines for uniform, tabular, semi-structured data, but it’s not universal.

Avoid it when:

  • Your data is deeply nested.
  • Each object has varying field sets.
  • You rely on strict JSON schemas or type validation.
  • Token cost isn’t critical (e.g., one-off single-prompt tasks).

In such cases, JSON or YAML remain more appropriate.

9. Theoretical Angle: Why TOON Works So Well

LLMs operate over token sequences. Reducing punctuation, quotes, and repeated substrings directly compresses context length. This compression yields two benefits:

  1. Longer effective context window – More data fits within the same token limit.
  2. Improved structural priors – Consistent indentation and headers act as cues, helping the model align tokens to schema semantics.

TOON doesn’t merely save tokens - it guides the model to reason structurally.

10. Future Directions

  • Python and Rust libraries: for broader ecosystem support.
  • Schema validation: automatic enforcement of uniform array structures.
  • Hybrid JSON-TOON adapters: enabling partial compression for mixed data.
  • Tokenizer-aware tuning: choosing delimiters per model (e.g., tab vs comma).

If TOON gains adoption, it could become the de facto compact format for LLM-native data pipelines.

11. Conclusion

TOON redefines how we serialize structured data for AI systems. It brings together:

  • Human readability
  • Machine efficiency
  • Structural clarity

In an era where every token counts, TOON offers a pragmatic step forward - not by changing the model, but by changing how we talk to it.

In summary: If you’re building systems that feed structured data to LLMs - from RAG pipelines to dataset summarizers - try encoding your data with TOON. The difference in both cost and model clarity might surprise you.
