LangChain Structure-Aware Text Splitters: Split Markdown, JSON, Code, and HTML for Smarter RAG Pipelines

Your RAG Pipeline Is Only as Good as Your Chunking Strategy

Here is the problem most developers run into when building RAG applications.

They load a document, split it with a generic character-based splitter, embed the chunks, and wonder why retrieval quality is inconsistent. Some chunks contain half a function. Others mix unrelated sections. A JSON object gets sliced right through a nested value.

The root cause is simple: generic splitters do not understand the structure of what they are splitting.

LangChain solves this with structure-aware text splitters. These splitters understand the format of your data and split it along natural boundaries, whether those boundaries are markdown headers, JSON objects, function definitions, or HTML tags.

In this article, you will learn how to use four structure-aware splitters from the langchain-text-splitters package:

  1. MarkdownHeaderTextSplitter: Splits markdown documents by headers while preserving section metadata
  2. RecursiveJsonSplitter: Traverses JSON structures depth-first and splits while keeping objects intact
  3. RecursiveCharacterTextSplitter.from_language(): Splits code files using language-specific separators for 25+ programming languages
  4. HTMLHeaderTextSplitter and HTMLSemanticPreservingSplitter: Splits HTML documents while preserving semantic structure, tables, and metadata

By the end of this article, you will have working Python code for each splitter and understand exactly when and why to use each one.

Anthropic's research on contextual retrieval showed that enriching chunks with document-level context before embedding can dramatically improve retrieval accuracy. When combined with BM25 and reranking, their approach reduced the top-20 chunk retrieval failure rate by 67%. Structure-aware splitting is the first step toward achieving that level of precision.

Architecture Overview: How Structure-Aware Splitting Works

Before we dive into each splitter, here is the mental model for how structure-aware splitting differs from generic splitting:

  • Generic splitting treats your document as a flat string. It counts characters or tokens and cuts wherever the limit hits, regardless of what the content actually contains.
  • Structure-aware splitting reads the format of your document first. It identifies natural boundaries (headers, tags, function definitions, JSON keys) and splits along those boundaries.

The result is chunks that preserve meaning, maintain metadata about where they came from, and produce better embeddings for retrieval.

Here is how each splitter approaches its format:

  • Markdown Splitter: Reads header levels (#, ##, ###), groups content under each header, and attaches header hierarchy as metadata to each chunk
  • JSON Splitter: Walks the JSON tree depth-first, keeps objects and arrays intact when possible, and only splits when a value exceeds the size limit
  • Code Splitter: Uses language-specific separators (class definitions, function definitions, blank lines) to split code at syntactic boundaries
  • HTML Splitter: Parses the DOM tree, splits at header tags or section boundaries, and can preserve tables, lists, and media elements intact

Table of Contents

  1. Phase 1: Setting Up Your Environment
  2. Phase 2: Splitting Markdown Documents
  3. Phase 3: Splitting JSON Data
  4. Phase 4: Splitting Source Code
  5. Phase 5: Splitting HTML Documents
  6. Evaluation: Comparing Splitter Outputs
  7. How to Improve It Further


Phase 1: Setting Up Your Environment

Workflow: Install the required package, then import the classes we will use throughout the article.

Every splitter we cover lives in a single package: langchain-text-splitters. This is a lightweight package that does not pull in the full LangChain dependency chain.

pip install -qU langchain-text-splitters
        

This installs the latest version of the text splitters package. The -q flag keeps pip's output quiet, and -U upgrades the package if an older version is already installed.

Now let us import the classes we will use:

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
    RecursiveJsonSplitter,
    HTMLHeaderTextSplitter,
    HTMLSectionSplitter,
    HTMLSemanticPreservingSplitter,
    Language,
)
        

Here is what each import does:

  • MarkdownHeaderTextSplitter: The splitter we will use for markdown files. It reads header syntax (#, ##, ###) and groups content by section.
  • RecursiveCharacterTextSplitter: A general-purpose splitter that also has a .from_language() factory method for code splitting. We will use it in Phase 4.
  • RecursiveJsonSplitter: The splitter for JSON data. It walks JSON trees depth-first and respects object boundaries.
  • HTMLHeaderTextSplitter: Splits HTML by header tags (h1, h2, h3) with metadata, similar to the markdown splitter.
  • HTMLSectionSplitter: Splits HTML by sections using XSLT transformations.
  • HTMLSemanticPreservingSplitter: The most advanced HTML splitter. It preserves tables, lists, images, and supports custom handlers.
  • Language: An enum that lists all 25+ supported programming languages for code splitting.

All of these splitters return LangChain Document objects with page_content and metadata fields. This means you can pipe the output directly into any LangChain embedding or retrieval chain.
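To make the Document shape concrete, here is a minimal sketch (langchain-core is a dependency of langchain-text-splitters, so the import is available):

from langchain_core.documents import Document

doc = Document(
    page_content="Supervised learning uses labeled data to train models.",
    metadata={"Header 1": "Machine Learning Basics"},
)
print(doc.page_content)  # the chunk text
print(doc.metadata)      # the structural context attached by a splitter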

This concludes Phase 1. With the package installed and imports ready, let us start splitting real documents.


Phase 2: Splitting Markdown Documents

Workflow: Define which headers to split on, create the splitter, pass in a markdown string, and inspect the resulting chunks with their metadata.

Markdown is one of the most common formats in RAG pipelines. README files, documentation, knowledge base articles, and notes are all written in markdown. The challenge is that a markdown document has a natural hierarchy defined by its headers, and a good splitter should respect that hierarchy.

The MarkdownHeaderTextSplitter does exactly this. It reads the header levels you specify, groups all content under each header into a chunk, and attaches the full header path as metadata.

Step 1: Define the Headers to Split On

First, you tell the splitter which header levels matter:

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
        

This is a list of tuples. Each tuple has two values:

  • The first value ("#") is the markdown header syntax to look for.
  • The second value ("Header 1") is the metadata key that will be attached to each chunk.

So when the splitter finds a ## header, it will create a metadata entry like {"Header 2": "Section Title"} on the resulting chunk.

Step 2: Create the Splitter and Process a Document

Let us create a sample markdown document and split it:

markdown_document = """
# Machine Learning Basics

## Supervised Learning
Supervised learning uses labeled data to train models.
The model learns to map inputs to known outputs.

### Classification
Classification predicts discrete categories.
Examples include spam detection and image recognition.

### Regression
Regression predicts continuous values.
Examples include house price prediction and temperature forecasting.

## Unsupervised Learning
Unsupervised learning finds patterns in unlabeled data.
Clustering and dimensionality reduction are common techniques.
"""

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

chunks = markdown_splitter.split_text(markdown_document)

for chunk in chunks:
    print(f"Content: {chunk.page_content}")
    print(f"Metadata: {chunk.metadata}")
    print("---")
        

Let us walk through what happens here:

  • We define a markdown string with a clear hierarchy: one # header, two ## headers, and two ### headers.
  • We create the MarkdownHeaderTextSplitter with our header configuration.
  • We call .split_text() to process the markdown string.
  • The splitter returns a list of Document objects. Each one has page_content (the text) and metadata (the header hierarchy).

Expected output:

Content: Supervised learning uses labeled data to train models.
The model learns to map inputs to known outputs.
Metadata: {'Header 1': 'Machine Learning Basics', 'Header 2': 'Supervised Learning'}
---
Content: Classification predicts discrete categories.
Examples include spam detection and image recognition.
Metadata: {'Header 1': 'Machine Learning Basics', 'Header 2': 'Supervised Learning', 'Header 3': 'Classification'}
---
Content: Regression predicts continuous values.
Examples include house price prediction and temperature forecasting.
Metadata: {'Header 1': 'Machine Learning Basics', 'Header 2': 'Supervised Learning', 'Header 3': 'Regression'}
---
Content: Unsupervised learning finds patterns in unlabeled data.
Clustering and dimensionality reduction are common techniques.
Metadata: {'Header 1': 'Machine Learning Basics', 'Header 2': 'Unsupervised Learning'}
---
        

Notice how each chunk carries the full header path. The "Classification" chunk knows it belongs to "Supervised Learning" under "Machine Learning Basics". This metadata is extremely valuable for retrieval because it gives the LLM context about where a chunk came from in the original document.

This is important because when you embed these chunks, the retriever can use the metadata to filter results or provide context to the LLM. For example, you could filter chunks to only those under "Supervised Learning" before passing them to the model.
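For example, a minimal filtering sketch in plain Python (no vector store involved yet):

# Keep only chunks that fall under the "Supervised Learning" section
supervised_chunks = [
    chunk for chunk in chunks
    if chunk.metadata.get("Header 2") == "Supervised Learning"
]
print(len(supervised_chunks))  # 3: the section plus its two subsections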

Step 3: Keep Headers in the Content

By default, the splitter strips header text from the page_content (the headers only appear in metadata). If you want headers to remain in the content as well, set strip_headers=False:

markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,
)

chunks = markdown_splitter.split_text(markdown_document)
print(chunks[0].page_content)
        

Expected output:

# Machine Learning Basics
## Supervised Learning
Supervised learning uses labeled data to train models.
The model learns to map inputs to known outputs.
        

Now the header text appears in both the page_content and the metadata. Note that the top-level # Machine Learning Basics header, which has no body text of its own, is carried into the first chunk beneath it. This is useful when you want the embedded chunk to contain the full context without relying on metadata.

Step 4: Combine with Character Splitting for Size Control

In production, markdown sections can be very long. A single ## section might contain thousands of words. You need to constrain chunk sizes to fit your embedding model's context window.

The pattern is: first split by headers, then split the resulting chunks by character count:

# Step 1: Split by headers
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,
)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Step 2: Split large sections by character count
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=30,
)

final_chunks = text_splitter.split_documents(md_header_splits)

for chunk in final_chunks:
    print(f"Size: {len(chunk.page_content)} chars")
    print(f"Metadata: {chunk.metadata}")
    print()
        

Two critical details here:

  1. We use split_documents() (not split_text()) in the second step. This preserves the metadata from the header splitting step.
  2. The chunk_overlap=30 parameter means 30 characters of overlap between adjacent chunks within the same section. Overlap does NOT cross header boundaries.

In production, you would tune chunk_size and chunk_overlap based on your embedding model. A common starting point in production systems is 10-20% overlap relative to chunk size (for example, 50-100 characters of overlap on a 500-character chunk). Measure retrieval quality and adjust from there.
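As a concrete configuration (the numbers are an illustrative starting point, not a rule):

# Roughly 15% overlap on a 500-character chunk
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=75,
)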

This concludes Phase 2. You now know how to split markdown by headers, preserve metadata, and constrain chunk sizes. Next, we tackle a very different format: JSON.


Phase 3: Splitting JSON Data

Workflow: Create a JSON structure, configure the splitter with a max chunk size, then split using three different output methods (JSON objects, Documents, or strings).

JSON is the format of APIs, configuration files, and structured data. The challenge with splitting JSON is that you cannot just cut at a character boundary. Cutting in the middle of a nested object breaks the structure entirely.

The RecursiveJsonSplitter handles this by traversing the JSON tree depth-first. It tries to keep objects and arrays intact. When a value is too large, it splits at the deepest possible level to minimize structural damage.

Step 1: Prepare Sample JSON Data

import json

json_data = {
    "company": "TechCorp",
    "departments": {
        "engineering": {
            "team_lead": "Alice Chen",
            "projects": [
                {
                    "name": "Search Engine",
                    "stack": ["Python", "Elasticsearch", "React"],
                    "status": "active",
                    "description": "Building a semantic search engine with vector embeddings and hybrid retrieval."
                },
                {
                    "name": "Data Pipeline",
                    "stack": ["Spark", "Airflow", "PostgreSQL"],
                    "status": "active",
                    "description": "Real-time data pipeline processing 10M events per day."
                }
            ]
        },
        "marketing": {
            "team_lead": "Bob Martinez",
            "campaigns": [
                {"name": "Product Launch", "budget": 50000},
                {"name": "Brand Refresh", "budget": 30000}
            ]
        }
    }
}
        

This JSON has multiple levels of nesting: a top-level object containing departments, each with teams and projects. This is a realistic structure you might get from an API response.

Step 2: Create the Splitter and Split

The RecursiveJsonSplitter takes one key parameter: max_chunk_size, which sets the target maximum character count per chunk.

splitter = RecursiveJsonSplitter(max_chunk_size=300)
        

Now you have three different methods to split:

Method 1: Get JSON objects (Python dicts)

json_chunks = splitter.split_json(json_data=json_data)

for i, chunk in enumerate(json_chunks):
    print(f"Chunk {i}: {json.dumps(chunk, indent=2)[:200]}...")
    print()
        

This returns a list of Python dictionaries. Each dictionary is a valid JSON object that preserves the structure of the original data. This is useful when you need to process the data programmatically before embedding.

Method 2: Get LangChain Documents

docs = splitter.create_documents(texts=[json_data])

for doc in docs:
    print(f"Content: {doc.page_content[:150]}...")
    print(f"Metadata: {doc.metadata}")
    print()
        

This returns Document objects with the JSON serialized as a string in page_content. This is the method you will use most often in a LangChain RAG pipeline because Documents can be passed directly to embedding models and vector stores.
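For instance, here is a hypothetical downstream step (it assumes the langchain-openai, langchain-community, and faiss-cpu packages, which are not part of this article's setup):

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Embed the JSON chunks and index them in an in-memory FAISS store
vector_store = FAISS.from_documents(docs, OpenAIEmbeddings())
results = vector_store.similarity_search("Which projects use Python?", k=2)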

Method 3: Get strings

texts = splitter.split_text(json_data=json_data)

for text in texts:
    print(f"Length: {len(text)} chars")
    print(text[:150])
    print()
        

This returns plain JSON strings. It is useful for debugging or when you need raw text output.

The splitter preserves the hierarchical path when it decomposes structures. If it splits the engineering department's projects, each chunk will still contain enough context (like the department name) to be meaningful on its own.

Step 3: Handle Lists with convert_lists

By default, the splitter does not convert list items to dictionary entries. If you want lists to be split more granularly, use convert_lists=True:

texts_with_list_conversion = splitter.split_text(
    json_data=json_data,
    convert_lists=True,
)

for text in texts_with_list_conversion:
    print(f"Length: {len(text)} chars")
    print()
        

When convert_lists=True, the splitter transforms arrays like ["Python", "Elasticsearch", "React"] into dictionary format {"0": "Python", "1": "Elasticsearch", "2": "React"}. This allows finer-grained splitting within lists while preserving the index information.

Important Behavioral Notes

  1. Chunks may exceed max_chunk_size: If a single JSON value (like a long string) is larger than the limit, the splitter cannot split it without breaking the JSON structure. It will keep the value intact. If you need hard size limits, pair the JSON splitter with a RecursiveCharacterTextSplitter as a second pass (see the sketch after this list).
  2. Depth-first traversal: The splitter walks from the root to the deepest nested values first. This means it tries to keep deeply nested objects together before splitting at higher levels.
  3. Structural integrity: Every chunk produced is valid JSON. You will never get a chunk with an unclosed bracket or a broken key-value pair.
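
Here is a minimal sketch of that second pass. The caveat: pieces produced by the character-level pass are no longer guaranteed to be valid JSON, so use it only when a hard size ceiling matters more than structural integrity.

docs = splitter.create_documents(texts=[json_data])

# Second pass: enforce a hard 300-character ceiling on any oversized chunk.
# Warning: character-split pieces may no longer be valid JSON.
char_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=0)
hard_capped_docs = char_splitter.split_documents(docs)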

This concludes Phase 3. JSON splitting preserves structural integrity while respecting size constraints. Next, we move to something completely different: source code.


Phase 4: Splitting Source Code

Workflow: Choose a programming language from the Language enum, create a language-specific splitter, split source code, and inspect how it respects syntactic boundaries like classes and functions.

Splitting code is uniquely challenging. Unlike prose, code has strict syntactic rules. A function split in the middle is useless. A class definition separated from its methods loses all context.

LangChain handles this with RecursiveCharacterTextSplitter.from_language(). This factory method creates a splitter pre-configured with language-specific separators. For Python, it knows to split on class definitions, function definitions, and blank lines, in that order of priority.

Step 1: Explore Available Languages

First, let us see what languages are supported and what separators they use:

# See all supported languages
print("Supported languages:")
for lang in Language:
    print(f"  - {lang.value}")
        

Expected output (partial):

Supported languages:
  - cpp
  - go
  - java
  - kotlin
  - js
  - ts
  - php
  - python
  - ruby
  - rust
  - scala
  - swift
  - markdown
  - latex
  - html
  - sol
  - csharp
  - haskell
  - elixir
  - powershell
  ...
        

You can inspect the actual separators for any language:

python_separators = RecursiveCharacterTextSplitter.get_separators_for_language(
    Language.PYTHON
)
print("Python separators:", python_separators)
        

Expected output:

Python separators: ['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
        

This tells you the priority order. The splitter first tries to split at '\nclass ' boundaries (class definitions). If a chunk is still too large, it falls back to '\ndef ' (function definitions), then '\n\tdef ' (method definitions inside classes), then blank lines, then single newlines, then spaces, and finally individual characters as a last resort.

This cascading separator strategy is what makes the code splitter so effective. It always tries to split at the most meaningful boundary first. LangChain's documentation on code splitting emphasizes that this approach preserves syntactic boundaries across all 25+ supported languages.

Step 2: Split Python Code

python_code = """
class DocumentProcessor:
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.processed_count = 0

    def process(self, document: str) -> dict:
        result = self._analyze(document)
        self.processed_count += 1
        return result

    def _analyze(self, text: str) -> dict:
        words = text.split()
        return {
            "word_count": len(words),
            "char_count": len(text),
            "model": self.model_name,
        }

def load_documents(file_path: str) -> list:
    with open(file_path, 'r') as f:
        content = f.read()
    return content.split('\\n\\n')

def main():
    processor = DocumentProcessor("gpt-4")
    docs = load_documents("data.txt")
    for doc in docs:
        result = processor.process(doc)
        print(result)
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=200,
    chunk_overlap=0,
)

chunks = python_splitter.create_documents([python_code])

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content)
    print()
        

Let us trace what happens:

  1. The splitter receives the Python code string.
  2. It first tries to split on '\nclass ' boundaries. This separates the DocumentProcessor class from the standalone functions.
  3. If any resulting piece is still over 200 characters, it splits on '\ndef ' boundaries, separating individual functions.
  4. If still too large, it splits on '\n\tdef ' boundaries, separating methods within the class.
  5. The process continues down the separator priority list until all chunks are under 200 characters.

The result is chunks where each class or function stays together when possible.

Step 3: Split JavaScript Code

Let us see how a different language works:

js_code = """
function fetchUserData(userId) {
    return fetch(`/api/users/${userId}`)
        .then(response => response.json())
        .then(data => {
            console.log('User data:', data);
            return data;
        });
}

class UserService {
    constructor(baseUrl) {
        this.baseUrl = baseUrl;
    }

    async getUser(id) {
        const response = await fetch(`${this.baseUrl}/users/${id}`);
        return response.json();
    }

    async updateUser(id, data) {
        const response = await fetch(`${this.baseUrl}/users/${id}`, {
            method: 'PUT',
            body: JSON.stringify(data),
        });
        return response.json();
    }
}
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS,
    chunk_size=250,
    chunk_overlap=0,
)

js_chunks = js_splitter.create_documents([js_code])

for i, chunk in enumerate(js_chunks):
    print(f"--- JS Chunk {i} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content)
    print()
        

The JavaScript splitter uses different separators, including function, class, const, let, var, and other JS-specific keywords. The same cascading logic applies: split at the highest-level boundary first, then fall back to more granular boundaries.
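
You can verify this the same way we inspected the Python separators:

js_separators = RecursiveCharacterTextSplitter.get_separators_for_language(
    Language.JS
)
print("JavaScript separators:", js_separators)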

Step 4: Using Code Splitting in a RAG Pipeline

In a real RAG pipeline for code search, you would combine the code splitter with embeddings:

# In production, this is how you would wire it up:
# 1. Split code files by language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)

# 2. Create documents from your codebase
code_documents = python_splitter.create_documents(
    texts=[python_code],
    metadatas=[{"source": "document_processor.py", "language": "python"}],
)

# 3. Each document now has page_content (the code chunk)
# and metadata (source file and language)
# You can pass these directly to a LangChain vector store
for doc in code_documents:
    print(f"Source: {doc.metadata['source']}")
    print(f"Content preview: {doc.page_content[:80]}...")
    print()
        

Notice the metadatas parameter. This lets you attach additional metadata (like the source file path and language) to each chunk. When a user searches for code, the retriever can use this metadata to filter by file or language.

In production, you would set chunk_size between 500 and 1500 characters for code, depending on your embedding model. Smaller chunks improve precision (finding the exact function), while larger chunks improve recall (understanding the broader context of a class).

This concludes Phase 4. Code splitting gives you language-aware chunking that respects function and class boundaries. Now let us tackle the most complex format: HTML.


Phase 5: Splitting HTML Documents

Workflow: Start with header-based HTML splitting, then move to section-based splitting, and finally explore the semantic-preserving splitter that keeps tables, lists, and media intact.

HTML is the most complex format to split because it combines content, structure, and presentation in a single document. A naive character-based split can break tags, separate a table header from its rows, or cut an image reference in half.

LangChain provides three HTML splitters, each with increasing sophistication:

  1. HTMLHeaderTextSplitter: Splits by header tags, analogous to the markdown splitter
  2. HTMLSectionSplitter: Splits by sections with XSLT support and font-size awareness
  3. HTMLSemanticPreservingSplitter: The most advanced option, preserving tables, lists, media, and supporting custom handlers

Splitter 1: HTMLHeaderTextSplitter

This splitter works almost identically to the MarkdownHeaderTextSplitter. It reads HTML header tags (<h1>, <h2>, <h3>) and groups content under each header.

html_string = """
<html>
<body>
    <h1>LangChain Framework</h1>
    <p>LangChain is a framework for building applications with LLMs.</p>
    
    <h2>Core Components</h2>
    <p>The framework provides several key building blocks.</p>
    <p>Each component handles a specific part of the pipeline.</p>
    
    <h3>Chains</h3>
    <p>Chains connect multiple components into a sequence.</p>
    
    <h3>Agents</h3>
    <p>Agents use LLMs to decide which tools to call and in what order.</p>
    
    <h2>Integrations</h2>
    <p>LangChain integrates with vector stores, embedding models, and APIs.</p>
</body>
</html>
"""

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)

for chunk in html_header_splits:
    print(f"Content: {chunk.page_content}")
    print(f"Metadata: {chunk.metadata}")
    print("---")
        

Here is what happens step by step:

  1. The splitter parses the HTML and identifies all <h1>, <h2>, and <h3> tags.
  2. It groups the content (paragraphs, text) that falls under each header.
  3. Each chunk gets metadata showing the full header hierarchy, just like the markdown splitter.

Expected output:

Content: LangChain is a framework for building applications with LLMs.
Metadata: {'Header 1': 'LangChain Framework'}
---
Content: The framework provides several key building blocks.
Each component handles a specific part of the pipeline.
Metadata: {'Header 1': 'LangChain Framework', 'Header 2': 'Core Components'}
---
Content: Chains connect multiple components into a sequence.
Metadata: {'Header 1': 'LangChain Framework', 'Header 2': 'Core Components', 'Header 3': 'Chains'}
---
Content: Agents use LLMs to decide which tools to call and in what order.
Metadata: {'Header 1': 'LangChain Framework', 'Header 2': 'Core Components', 'Header 3': 'Agents'}
---
Content: LangChain integrates with vector stores, embedding models, and APIs.
Metadata: {'Header 1': 'LangChain Framework', 'Header 2': 'Integrations'}
---
        

The HTMLHeaderTextSplitter also supports loading HTML directly from a URL or a file:

# Split from a URL (fetches the HTML automatically)
# splits = html_splitter.split_text_from_url("https://example.com/docs")

# Split from a local file
# splits = html_splitter.split_text_from_file("path/to/file.html")
        

Combining with character splitting for size control:

chunk_size = 500
chunk_overlap = 30

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

final_splits = text_splitter.split_documents(html_header_splits)
        

This two-step pattern (structure-aware split first, size-constrained split second) is the same approach we used in Phase 2 for markdown. It is the recommended pattern across all structure-aware splitters.

Splitter 2: HTMLSectionSplitter

The HTMLSectionSplitter takes a different approach. Instead of splitting purely by header tags, it uses XSLT transformations to identify sections. It also considers font sizes when determining section boundaries.

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_section_splitter = HTMLSectionSplitter(
    headers_to_split_on=headers_to_split_on
)

section_splits = html_section_splitter.split_text(html_string)

for chunk in section_splits:
    print(f"Content: {chunk.page_content[:100]}...")
    print(f"Metadata: {chunk.metadata}")
    print("---")
        

The key difference from HTMLHeaderTextSplitter:

  • It uses XSLT transformations internally to detect section boundaries
  • It considers font sizes when determining section boundaries, which helps with HTML that uses <span style="font-size: 24px"> instead of proper <h1> tags
  • It automatically applies RecursiveCharacterTextSplitter internally for sections that exceed the size limit

Note: In older versions of langchain-text-splitters, the HTMLSectionSplitter accepted an xslt_path parameter for custom XSLT files. This parameter was removed in a security update to prevent XXE (XML External Entity) attacks. The current version uses a hardcoded, safe default XSLT file internally.

Use this splitter when your HTML does not follow standard header conventions or when sections are defined by styling rather than semantic tags.

Splitter 3: HTMLSemanticPreservingSplitter (The Advanced Option)

This is the most powerful HTML splitter. It solves a critical problem: what happens when your HTML contains tables, ordered lists, code blocks, or embedded media? A naive splitter might cut a table in half, destroying the relationship between headers and data rows.

The HTMLSemanticPreservingSplitter keeps these elements intact, even if it means exceeding the target chunk size.

from bs4 import Tag

html_with_table = """
<html>
<body>
    <h1>Model Comparison</h1>
    <p>Here is a comparison of popular LLM models.</p>
    
    <table>
        <tr><th>Model</th><th>Parameters</th><th>Context Window</th></tr>
        <tr><td>GPT-4</td><td>1.8T</td><td>128K tokens</td></tr>
        <tr><td>Claude 3</td><td>Unknown</td><td>200K tokens</td></tr>
        <tr><td>Gemini Ultra</td><td>Unknown</td><td>1M tokens</td></tr>
    </table>
    
    <h2>Performance Benchmarks</h2>
    <p>Each model excels in different areas.</p>
    
    <ul>
        <li>GPT-4 leads in creative writing tasks</li>
        <li>Claude 3 leads in long-context analysis</li>
        <li>Gemini Ultra leads in multimodal tasks</li>
    </ul>
    
    <h2>Code Examples</h2>
    <code data-lang="python">from langchain_openai import ChatOpenAI</code>
</body>
</html>
"""

# Define a custom handler for code elements
def code_handler(element: Tag) -> str:
    data_lang = element.get("data-lang", "text")
    code_format = f"<code:{data_lang}>{element.get_text()}</code>"
    return code_format

semantic_splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[
        ("h1", "Header 1"),
        ("h2", "Header 2"),
    ],
    separators=["\n\n", "\n", ". ", "! ", "? "],
    max_chunk_size=500,
    preserve_images=True,
    preserve_videos=True,
    elements_to_preserve=["table", "ul", "ol", "code"],
    denylist_tags=["script", "style", "head"],
    custom_handlers={"code": code_handler},
)

semantic_chunks = semantic_splitter.split_text(html_with_table)

for i, chunk in enumerate(semantic_chunks):
    print(f"--- Semantic Chunk {i} ---")
    print(f"Content: {chunk.page_content}")
    print(f"Metadata: {chunk.metadata}")
    print()
        

Let us break down each parameter:

  • headers_to_split_on: Same as the other HTML splitters. Defines where to create chunk boundaries.
  • separators: The fallback separators for text content. Order matters: the splitter tries "\n\n" first, then "\n", then ". ", and so on. Note: use ". " (with a space) instead of "." to avoid splitting at periods inside URLs.
  • max_chunk_size=500: The target chunk size in characters. This is a soft limit: if a preserved element (like a table) is larger than 500 characters, the splitter will keep it intact rather than breaking it.
  • preserve_images=True: Image references are kept as-is in the output.
  • preserve_videos=True: Video references are kept as-is.
  • elements_to_preserve: A list of HTML tags that should never be split. Tables, unordered lists, ordered lists, and code blocks all stay intact.
  • denylist_tags: HTML tags to completely remove before processing. Scripts, stylesheets, and head content are stripped out.
  • custom_handlers: A dictionary mapping tag names to Python functions. When the splitter encounters a <code> tag, it calls our code_handler function instead of using default processing.

The max_chunk_size being a soft limit is a deliberate design choice. It is better to have an oversized chunk that contains a complete table than a correctly sized chunk that contains half a table. When the LLM receives a complete table, it can reason about relationships between rows and columns. A partial table is worse than useless.

Comparison Table: Which HTML Splitter to Use

Feature                      | HTMLHeaderTextSplitter | HTMLSectionSplitter | HTMLSemanticPreservingSplitter
Header-based splitting       | Yes                    | Yes                 | Yes
Preserves tables and lists   | No                     | No                  | Yes
Header metadata              | Yes                    | Yes                 | Yes
Custom tag handlers          | No                     | No                  | Yes
Media preservation           | No                     | No                  | Yes
Font size awareness          | No                     | Yes                 | No
XSLT support                 | No                     | Yes                 | No
Best for                     | Well-structured docs   | Non-standard HTML   | Rich content with tables and media

Selection guide:

  • Use HTMLHeaderTextSplitter when your HTML has clean, well-defined header hierarchy and you just need text content with metadata.
  • Use HTMLSectionSplitter when your HTML uses styling (font sizes) instead of semantic header tags, or when you need XSLT transformations.
  • Use HTMLSemanticPreservingSplitter when your HTML contains tables, lists, code blocks, or embedded media that must stay intact.

This concludes Phase 5. You now have three levels of HTML splitting in your toolkit, from simple header-based splitting to full semantic preservation.


Evaluation: Comparing Splitter Outputs

Let us put all four splitters side by side and compare what they produce:

print("=== MARKDOWN SPLITTER ===")
print(f"Input type: Markdown string")
print(f"Output type: Documents with header hierarchy metadata")
print(f"Best for: Documentation, README files, knowledge base articles")
print(f"Key feature: Full header path in metadata")
print()

print("=== JSON SPLITTER ===")
print(f"Input type: Python dict / JSON object")
print(f"Output type: Dicts, Documents, or strings")
print(f"Best for: API responses, config files, structured data")
print(f"Key feature: Preserves JSON structure integrity")
print()

print("=== CODE SPLITTER ===")
print(f"Input type: Source code string + Language enum")
print(f"Output type: Documents")
print(f"Best for: Codebase search, code documentation")
print(f"Key feature: 25+ language-specific separator hierarchies")
print()

print("=== HTML SPLITTER ===")
print(f"Input type: HTML string, URL, or file path")
print(f"Output type: Documents with header metadata")
print(f"Best for: Web pages, HTML documentation, scraped content")
print(f"Key feature: Three splitters with increasing sophistication")
        

Key observations:

  1. All splitters return LangChain Documents. This means you can mix and match them in the same RAG pipeline. A single vector store can contain chunks from markdown, JSON, code, and HTML sources.
  2. Metadata is the differentiator. The markdown and HTML splitters attach structural metadata (header hierarchies). The code splitter relies on syntactic boundaries. The JSON splitter preserves object structure. This metadata is what makes retrieval smart.
  3. The two-step pattern works everywhere. Split by structure first, then constrain by size. This pattern applies to markdown, HTML, and (less commonly) JSON.
  4. Choose the right splitter for your data format. Do not use a generic character splitter when you have structured data. The structure IS the context.

In production RAG systems, teams consistently find that hybrid strategies (combining multiple chunking approaches based on content type) outperform single-strategy approaches. This is exactly the pattern you should follow: detect the format of each document, route it to the appropriate splitter, and combine the results.

How to Improve It Further

Now that you have working code for all four structure-aware splitters, here are ways to build on this foundation:

  1. Build a format-aware routing pipeline. Create a function that detects whether an incoming document is markdown, JSON, code, or HTML, and automatically routes it to the correct splitter (a sketch follows this list). In LangGraph, this would be a conditional edge in a StateGraph that routes based on document format.
  2. Add contextual retrieval. Anthropic's research showed that adding a brief context summary to each chunk before embedding can reduce retrieval errors by up to 67%. After splitting, use an LLM to generate a one-sentence context summary for each chunk, then prepend it to the chunk before embedding.
  3. Experiment with chunk sizes. There is no universal "best" chunk size. Start with 500-1000 characters for prose and 1000-1500 for code. Measure retrieval precision and recall, then adjust. A common production baseline is 10-20% overlap relative to your chunk size.
  4. Combine with semantic chunking. For documents where structural boundaries are not sufficient, consider using LangChain's SemanticChunker as a second pass after structure-aware splitting. This uses embeddings to detect semantic shifts within large sections.
  5. Add metadata enrichment. After splitting, enrich each chunk's metadata with additional context: source URL, document title, creation date, document type, and even a brief summary. This metadata enables powerful filtering during retrieval.
  6. Build evaluation metrics. Track chunk quality metrics like average chunk size, metadata completeness, and retrieval hit rates per splitter type. This data-driven approach lets you continuously optimize your chunking strategy.
  7. Production caching pattern. For large document collections, cache the split results. Check if a document has been split before (using a hash of the content), and load the cached chunks instead of re-splitting. This is especially important for JSON and HTML documents that may come from API calls.
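
Here is a minimal routing sketch for point 1. The file-extension heuristic and the split_by_format name are illustrative assumptions, not a LangChain API:

import json

def split_by_format(path: str, text: str):
    # Hypothetical router: pick a splitter based on the file extension
    if path.endswith(".md"):
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
        ).split_text(text)
    if path.endswith(".json"):
        return RecursiveJsonSplitter(max_chunk_size=300).create_documents(
            texts=[json.loads(text)]
        )
    if path.endswith(".py"):
        return RecursiveCharacterTextSplitter.from_language(
            language=Language.PYTHON, chunk_size=1000, chunk_overlap=100
        ).create_documents([text])
    if path.endswith((".html", ".htm")):
        return HTMLHeaderTextSplitter(
            headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
        ).split_text(text)
    # Fallback: generic character splitting
    return RecursiveCharacterTextSplitter(chunk_size=500).create_documents([text])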

LangChain's text splitters are part of the broader langchain-text-splitters package, which is designed to be lightweight and framework-agnostic. You can use these splitters even outside of a full LangChain pipeline, in any Python application that needs intelligent document chunking.
