LangChain Structure-Aware Text Splitters: Split Markdown, JSON, Code, and HTML for Smarter RAG Pipelines
Your RAG Pipeline Is Only as Good as Your Chunking Strategy
Here is the problem most developers run into when building RAG applications.
They load a document, split it with a generic character-based splitter, embed the chunks, and wonder why retrieval quality is inconsistent. Some chunks contain half a function. Others mix unrelated sections. A JSON object gets sliced right through a nested value.
The root cause is simple: generic splitters do not understand the structure of what they are splitting.
LangChain solves this with structure-aware text splitters. These splitters understand the format of your data and split it along natural boundaries, whether those boundaries are markdown headers, JSON objects, function definitions, or HTML tags.
In this article, you will learn how to use the structure-aware splitters in the langchain-text-splitters package for four formats:
- MarkdownHeaderTextSplitter for markdown documents
- RecursiveJsonSplitter for nested JSON data
- RecursiveCharacterTextSplitter.from_language() for source code
- the HTML splitters (HTMLHeaderTextSplitter, HTMLSectionSplitter, and HTMLSemanticPreservingSplitter) for web content
By the end of this article, you will have working Python code for each splitter and understand exactly when and why to use each one.
Anthropic's research on contextual retrieval showed that enriching chunks with document-level context before embedding can dramatically improve retrieval accuracy. When combined with BM25 and reranking, their approach reduced the top-20 chunk retrieval failure rate by 67%. Structure-aware splitting is the first step toward achieving that level of precision.
Architecture Overview: How Structure-Aware Splitting Works
Before we dive into each splitter, here is the mental model for how structure-aware splitting differs from generic splitting. A generic splitter treats a document as a flat stream of characters: it counts up to a target size and cuts, wherever that cut happens to land. A structure-aware splitter parses the format first, finds the natural boundaries the format defines (headers, objects, function definitions, tags), cuts only at those boundaries, and carries the structural context along as metadata.
The result is chunks that preserve meaning, maintain metadata about where they came from, and produce better embeddings for retrieval.
The phases that follow show how each splitter applies this idea to its own format: markdown is split on headers, JSON on object boundaries, source code on class and function definitions, and HTML on header tags or semantic elements.
Phase 1: Setting Up Your Environment
Workflow: Install the required package, then import the classes we will use throughout the article.
Every splitter we cover lives in a single package: langchain-text-splitters. This is a lightweight package that does not pull in the full LangChain dependency chain.
pip install -qU langchain-text-splitters
This installs the latest version of the text splitters package. The -qU flags mean quiet mode and upgrade to the latest version if already installed.
Now let us import the classes we will use:
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
    RecursiveJsonSplitter,
    HTMLHeaderTextSplitter,
    HTMLSectionSplitter,
    HTMLSemanticPreservingSplitter,
    Language,
)
Here is what each import does:
- MarkdownHeaderTextSplitter: splits markdown on the header levels you choose and records the header path as metadata.
- RecursiveCharacterTextSplitter: the general-purpose, size-based splitter; its from_language() factory also powers code splitting.
- RecursiveJsonSplitter: splits nested JSON while keeping objects intact wherever possible.
- HTMLHeaderTextSplitter: splits HTML on header tags, mirroring the markdown splitter.
- HTMLSectionSplitter: splits HTML into sections using an XSLT transformation.
- HTMLSemanticPreservingSplitter: splits HTML while keeping tables, lists, code, and media intact.
- Language: an enum of the programming languages supported by the code splitter.
All of these splitters return LangChain Document objects with page_content and metadata fields. This means you can pipe the output directly into any LangChain embedding or retrieval chain.
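For reference, here is a minimal sketch of the shape of one of these objects. Document comes from langchain-core, which langchain-text-splitters depends on; the values shown are just placeholders.
from langchain_core.documents import Document

# The shape every splitter in this article returns
chunk = Document(
    page_content="Supervised learning uses labeled data to train models.",
    metadata={"Header 1": "Machine Learning Basics", "Header 2": "Supervised Learning"},
)
print(chunk.page_content)  # the text that gets embedded
print(chunk.metadata)      # the structural context attached by the splitter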
This concludes Phase 1. With the package installed and imports ready, let us start splitting real documents.
Phase 2: Splitting Markdown Documents
Workflow: Define which headers to split on, create the splitter, pass in a markdown string, and inspect the resulting chunks with their metadata.
Markdown is one of the most common formats in RAG pipelines. README files, documentation, knowledge base articles, and notes are all written in markdown. The challenge is that a markdown document has a natural hierarchy defined by its headers, and a good splitter should respect that hierarchy.
The MarkdownHeaderTextSplitter does exactly this. It reads the header levels you specify, groups all content under each header into a chunk, and attaches the full header path as metadata.
Step 1: Define the Headers to Split On
First, you tell the splitter which header levels matter:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]
This is a list of tuples. Each tuple has two values:
- the header prefix to look for in the markdown source (for example "#" or "##")
- the metadata key that will hold that header's text on each resulting chunk
So when the splitter finds a ## header, it will create a metadata entry like {"Header 2": "Section Title"} on the resulting chunk.
Step 2: Create the Splitter and Process a Document
Let us create a sample markdown document and split it:
markdown_document = """
# Machine Learning Basics
## Supervised Learning
Supervised learning uses labeled data to train models.
The model learns to map inputs to known outputs.
### Classification
Classification predicts discrete categories.
Examples include spam detection and image recognition.
### Regression
Regression predicts continuous values.
Examples include house price prediction and temperature forecasting.
## Unsupervised Learning
Unsupervised learning finds patterns in unlabeled data.
Clustering and dimensionality reduction are common techniques.
"""
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
chunks = markdown_splitter.split_text(markdown_document)

for chunk in chunks:
    print(f"Content: {chunk.page_content}")
    print(f"Metadata: {chunk.metadata}")
    print("---")
Let us walk through what happens here: the splitter scans the document, starts a new chunk each time it reaches one of the headers you registered, collects everything up to the next header into that chunk's page_content, and records the currently active header path in the chunk's metadata.
Expected output:
Content: Supervised learning uses labeled data to train models.
The model learns to map inputs to known outputs.
Metadata: {'Header 1': 'Machine Learning Basics', 'Header 2': 'Supervised Learning'}
---
Content: Classification predicts discrete categories.
Examples include spam detection and image recognition.
Metadata: {'Header 1': 'Machine Learning Basics', 'Header 2': 'Supervised Learning', 'Header 3': 'Classification'}
---
Content: Regression predicts continuous values.
Examples include house price prediction and temperature forecasting.
Metadata: {'Header 1': 'Machine Learning Basics', 'Header 2': 'Supervised Learning', 'Header 3': 'Regression'}
---
Content: Unsupervised learning finds patterns in unlabeled data.
Clustering and dimensionality reduction are common techniques.
Metadata: {'Header 1': 'Machine Learning Basics', 'Header 2': 'Unsupervised Learning'}
---
Notice how each chunk carries the full header path. The "Classification" chunk knows it belongs to "Supervised Learning" under "Machine Learning Basics". This metadata is extremely valuable for retrieval because it gives the LLM context about where a chunk came from in the original document.
This is important because when you embed these chunks, the retriever can use the metadata to filter results or provide context to the LLM. For example, you could filter chunks to only those under "Supervised Learning" before passing them to the model.
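For instance, here is a minimal sketch of that kind of metadata filter, applied directly to the chunks we just created:
# Keep only chunks that fall under the "Supervised Learning" section
supervised_chunks = [
    chunk for chunk in chunks
    if chunk.metadata.get("Header 2") == "Supervised Learning"
]
for chunk in supervised_chunks:
    print(chunk.metadata)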
Step 3: Keep Headers in the Content
By default, the splitter strips header text from the page_content (the headers only appear in metadata). If you want headers to remain in the content as well, set strip_headers=False:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,
)
chunks = markdown_splitter.split_text(markdown_document)
print(chunks[0].page_content)
Expected output:
## Supervised Learning
Supervised learning uses labeled data to train models.
The model learns to map inputs to known outputs.
Now the header text appears in both the page_content and the metadata. This is useful when you want the embedded chunk to contain the full context without relying on metadata.
Step 4: Combine with Character Splitting for Size Control
In production, markdown sections can be very long. A single ## section might contain thousands of words. You need to constrain chunk sizes to fit your embedding model's context window.
The pattern is: first split by headers, then split the resulting chunks by character count:
# Step 1: Split by headers
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on,
    strip_headers=False,
)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Step 2: Split large sections by character count
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=30,
)
final_chunks = text_splitter.split_documents(md_header_splits)

for chunk in final_chunks:
    print(f"Size: {len(chunk.page_content)} chars")
    print(f"Metadata: {chunk.metadata}")
    print()
Two critical details here:
- Step 2 calls split_documents (not split_text), so the header metadata attached in step 1 is carried over to every size-constrained chunk.
- chunk_size and chunk_overlap are measured in characters by default, so choose values that fit comfortably within your embedding model's input limit.
In production, you would tune chunk_size and chunk_overlap based on your embedding model. A common starting point in production systems is 10-20% overlap relative to chunk size (for example, 50-100 characters of overlap on a 500-character chunk). Measure retrieval quality and adjust from there.
This concludes Phase 2. You now know how to split markdown by headers, preserve metadata, and constrain chunk sizes. Next, we tackle a very different format: JSON.
Phase 3: Splitting JSON Data
Workflow: Create a JSON structure, configure the splitter with a max chunk size, then split using three different output methods (JSON objects, Documents, or strings).
JSON is the format of APIs, configuration files, and structured data. The challenge with splitting JSON is that you cannot just cut at a character boundary. Cutting in the middle of a nested object breaks the structure entirely.
The RecursiveJsonSplitter handles this by traversing the JSON tree depth-first. It tries to keep objects and arrays intact. When a value is too large, it splits at the deepest possible level to minimize structural damage.
Step 1: Prepare Sample JSON Data
import json

json_data = {
    "company": "TechCorp",
    "departments": {
        "engineering": {
            "team_lead": "Alice Chen",
            "projects": [
                {
                    "name": "Search Engine",
                    "stack": ["Python", "Elasticsearch", "React"],
                    "status": "active",
                    "description": "Building a semantic search engine with vector embeddings and hybrid retrieval."
                },
                {
                    "name": "Data Pipeline",
                    "stack": ["Spark", "Airflow", "PostgreSQL"],
                    "status": "active",
                    "description": "Real-time data pipeline processing 10M events per day."
                }
            ]
        },
        "marketing": {
            "team_lead": "Bob Martinez",
            "campaigns": [
                {"name": "Product Launch", "budget": 50000},
                {"name": "Brand Refresh", "budget": 30000}
            ]
        }
    }
}
This JSON has multiple levels of nesting: a top-level object containing departments, each with teams and projects. This is a realistic structure you might get from an API response.
Step 2: Create the Splitter and Split
The RecursiveJsonSplitter takes one key parameter: max_chunk_size, which sets the target maximum character count per chunk.
splitter = RecursiveJsonSplitter(max_chunk_size=300)
Now you have three different methods to split:
Method 1: Get JSON objects (Python dicts)
json_chunks = splitter.split_json(json_data=json_data)

for i, chunk in enumerate(json_chunks):
    print(f"Chunk {i}: {json.dumps(chunk, indent=2)[:200]}...")
    print()
This returns a list of Python dictionaries. Each dictionary is a valid JSON object that preserves the structure of the original data. This is useful when you need to process the data programmatically before embedding.
Method 2: Get LangChain Documents
docs = splitter.create_documents(texts=[json_data])

for doc in docs:
    print(f"Content: {doc.page_content[:150]}...")
    print(f"Metadata: {doc.metadata}")
    print()
This returns Document objects with the JSON serialized as a string in page_content. This is the method you will use most often in a LangChain RAG pipeline because Documents can be passed directly to embedding models and vector stores.
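As a quick sketch of that hand-off (this assumes the langchain-openai package and an OpenAI API key; any embedding model and vector store would work the same way):
# Sketch only: swap in whatever embedding model and vector store you actually use
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import OpenAIEmbeddings

vector_store = InMemoryVectorStore(embedding=OpenAIEmbeddings())
vector_store.add_documents(docs)  # the Documents produced by create_documents above

results = vector_store.similarity_search("Which projects use Elasticsearch?", k=2)
for doc in results:
    print(doc.page_content[:100])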
Method 3: Get strings
texts = splitter.split_text(json_data=json_data)

for text in texts:
    print(f"Length: {len(text)} chars")
    print(text[:150])
    print()
This returns plain JSON strings. It is useful for debugging or when you need raw text output.
The splitter preserves the hierarchical path when it decomposes structures. If it splits the engineering department's projects, each chunk will still contain enough context (like the department name) to be meaningful on its own.
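You can verify this by inspecting which top-level keys each chunk retains:
# Each chunk from split_json is still a valid dict rooted at the original structure
for i, chunk in enumerate(json_chunks):
    print(f"Chunk {i} top-level keys: {list(chunk.keys())}")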
Step 3: Handle Lists with convert_lists
By default, the splitter does not convert list items to dictionary entries. If you want lists to be split more granularly, use convert_lists=True:
texts_with_list_conversion = splitter.split_text(
    json_data=json_data,
    convert_lists=True,
)

for text in texts_with_list_conversion:
    print(f"Length: {len(text)} chars")
    print()
When convert_lists=True, the splitter transforms arrays like ["Python", "Elasticsearch", "React"] into dictionary format {"0": "Python", "1": "Elasticsearch", "2": "React"}. This allows finer-grained splitting within lists while preserving the index information.
Important Behavioral Notes
A few behaviors are worth keeping in mind:
- max_chunk_size is a target, not a hard guarantee. The splitter does not cut inside an individual string value, so one very long value can still produce an oversized chunk.
- Dictionaries are what get traversed. Lists are kept whole as values unless you enable convert_lists=True, which rewrites them as index-keyed dictionaries so they can be split further.
- Every chunk keeps the key path from the root of the document, which is why department and project names stay attached to their content.
This concludes Phase 3. JSON splitting preserves structural integrity while respecting size constraints. Next, we move to something completely different: source code.
Phase 4: Splitting Source Code
Workflow: Choose a programming language from the Language enum, create a language-specific splitter, split source code, and inspect how it respects syntactic boundaries like classes and functions.
Splitting code is uniquely challenging. Unlike prose, code has strict syntactic rules. A function split in the middle is useless. A class definition separated from its methods loses all context.
LangChain handles this with RecursiveCharacterTextSplitter.from_language(). This factory method creates a splitter pre-configured with language-specific separators. For Python, it knows to split on class definitions, function definitions, and blank lines, in that order of priority.
Step 1: Explore Available Languages
First, let us see what languages are supported and what separators they use:
# See all supported languages
print("Supported languages:")
for lang in Language:
    print(f" - {lang.value}")
Expected output (partial):
Supported languages:
- cpp
- go
- java
- kotlin
- js
- ts
- php
- python
- ruby
- rust
- scala
- swift
- markdown
- latex
- html
- sol
- csharp
- haskell
- elixir
- powershell
...
You can inspect the actual separators for any language:
python_separators = RecursiveCharacterTextSplitter.get_separators_for_language(
    Language.PYTHON
)
print("Python separators:", python_separators)
Expected output:
Python separators: ['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
This tells you the priority order. The splitter will first try to split at \nclass boundaries (class definitions). If the chunk is still too large, it falls back to \ndef (function definitions). Then \n\tdef (method definitions inside classes). Then blank lines, then single newlines, then spaces, and finally individual characters as a last resort.
This cascading separator strategy is what makes the code splitter so effective. It always tries to split at the most meaningful boundary first. LangChain's documentation on code splitting emphasizes that this approach preserves syntactic boundaries across all 25+ supported languages.
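Because the cascade is just an ordered list of separators, you can apply the same idea to a language the Language enum does not cover by supplying your own list. The separators below are illustrative, not an official list for any real language:
# Hypothetical cascading splitter for an unsupported language
custom_splitter = RecursiveCharacterTextSplitter(
    separators=["\nmodule ", "\nfn ", "\n\n", "\n", " ", ""],
    chunk_size=300,
    chunk_overlap=0,
)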
Step 2: Split Python Code
python_code = """
class DocumentProcessor:
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.processed_count = 0

    def process(self, document: str) -> dict:
        result = self._analyze(document)
        self.processed_count += 1
        return result

    def _analyze(self, text: str) -> dict:
        words = text.split()
        return {
            "word_count": len(words),
            "char_count": len(text),
            "model": self.model_name,
        }


def load_documents(file_path: str) -> list:
    with open(file_path, 'r') as f:
        content = f.read()
    return content.split('\\n\\n')


def main():
    processor = DocumentProcessor("gpt-4")
    docs = load_documents("data.txt")
    for doc in docs:
        result = processor.process(doc)
        print(result)
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=200,
    chunk_overlap=0,
)
chunks = python_splitter.create_documents([python_code])

for i, chunk in enumerate(chunks):
    print(f"--- Chunk {i} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content)
    print()
Let us trace what happens: with chunk_size=200, the splitter first tries to cut at \nclass boundaries, so the DocumentProcessor class starts its own chunk. Because the class body is larger than 200 characters, the splitter falls back to finer boundaries inside it (method definitions and blank lines), and the standalone load_documents and main functions are cut at \ndef boundaries.
The result is chunks where each class or function stays together when possible.
Step 3: Split JavaScript Code
Let us see how a different language works:
js_code = """
function fetchUserData(userId) {
    return fetch(`/api/users/${userId}`)
        .then(response => response.json())
        .then(data => {
            console.log('User data:', data);
            return data;
        });
}

class UserService {
    constructor(baseUrl) {
        this.baseUrl = baseUrl;
    }

    async getUser(id) {
        const response = await fetch(`${this.baseUrl}/users/${id}`);
        return response.json();
    }

    async updateUser(id, data) {
        const response = await fetch(`${this.baseUrl}/users/${id}`, {
            method: 'PUT',
            body: JSON.stringify(data),
        });
        return response.json();
    }
}
"""
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS,
    chunk_size=250,
    chunk_overlap=0,
)
js_chunks = js_splitter.create_documents([js_code])

for i, chunk in enumerate(js_chunks):
    print(f"--- JS Chunk {i} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content)
    print()
The JavaScript splitter uses different separators, including function, class, const, let, var, and other JS-specific keywords. The same cascading logic applies: split at the highest-level boundary first, then fall back to more granular boundaries.
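You can inspect the JavaScript separator list the same way we did for Python:
js_separators = RecursiveCharacterTextSplitter.get_separators_for_language(Language.JS)
print("JavaScript separators:", js_separators)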
Step 4: Using Code Splitting in a RAG Pipeline
In a real RAG pipeline for code search, you would combine the code splitter with embeddings:
# In production, this is how you would wire it up:

# 1. Split code files by language
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100,
)

# 2. Create documents from your codebase
code_documents = python_splitter.create_documents(
    texts=[python_code],
    metadatas=[{"source": "document_processor.py", "language": "python"}],
)

# 3. Each document now has page_content (the code chunk)
#    and metadata (source file and language).
#    You can pass these directly to a LangChain vector store.
for doc in code_documents:
    print(f"Source: {doc.metadata['source']}")
    print(f"Content preview: {doc.page_content[:80]}...")
    print()
Notice the metadatas parameter. This lets you attach additional metadata (like the source file path and language) to each chunk. When a user searches for code, the retriever can use this metadata to filter by file or language.
In production, you would set chunk_size between 500 and 1500 characters for code, depending on your embedding model. Smaller chunks improve precision (finding the exact function), while larger chunks improve recall (understanding the broader context of a class).
This concludes Phase 4. Code splitting gives you language-aware chunking that respects function and class boundaries. Now let us tackle the most complex format: HTML.
Phase 5: Splitting HTML Documents
Workflow: Start with header-based HTML splitting, then move to section-based splitting, and finally explore the semantic-preserving splitter that keeps tables, lists, and media intact.
HTML is the most complex format to split because it combines content, structure, and presentation in a single document. A naive character-based split can break tags, separate a table header from its rows, or cut an image reference in half.
LangChain provides three HTML splitters, each with increasing sophistication:
Splitter 1: HTMLHeaderTextSplitter
This splitter works almost identically to the MarkdownHeaderTextSplitter. It reads HTML header tags (<h1>, <h2>, <h3>) and groups content under each header.
html_string = """
<html>
<body>
<h1>LangChain Framework</h1>
<p>LangChain is a framework for building applications with LLMs.</p>
<h2>Core Components</h2>
<p>The framework provides several key building blocks.</p>
<p>Each component handles a specific part of the pipeline.</p>
<h3>Chains</h3>
<p>Chains connect multiple components into a sequence.</p>
<h3>Agents</h3>
<p>Agents use LLMs to decide which tools to call and in what order.</p>
<h2>Integrations</h2>
<p>LangChain integrates with vector stores, embedding models, and APIs.</p>
</body>
</html>
"""
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)

for chunk in html_header_splits:
    print(f"Content: {chunk.page_content}")
    print(f"Metadata: {chunk.metadata}")
    print("---")
Here is what happens step by step: the splitter parses the HTML, walks the element tree, starts a new chunk whenever it reaches one of the header tags you registered, gathers the text of the elements that follow under that header, and records the full header hierarchy in each chunk's metadata.
Expected output:
Content: LangChain is a framework for building applications with LLMs.
Metadata: {'Header 1': 'LangChain Framework'}
---
Content: The framework provides several key building blocks.
Each component handles a specific part of the pipeline.
Metadata: {'Header 1': 'LangChain Framework', 'Header 2': 'Core Components'}
---
Content: Chains connect multiple components into a sequence.
Metadata: {'Header 1': 'LangChain Framework', 'Header 2': 'Core Components', 'Header 3': 'Chains'}
---
Content: Agents use LLMs to decide which tools to call and in what order.
Metadata: {'Header 1': 'LangChain Framework', 'Header 2': 'Core Components', 'Header 3': 'Agents'}
---
Content: LangChain integrates with vector stores, embedding models, and APIs.
Metadata: {'Header 1': 'LangChain Framework', 'Header 2': 'Integrations'}
---
The HTMLHeaderTextSplitter also supports loading HTML directly from a URL or a file:
# Split from a URL (fetches the HTML automatically)
# splits = html_splitter.split_text_from_url("https://example.com/docs")
# Split from a local file
# splits = html_splitter.split_text_from_file("path/to/file.html")
Combining with character splitting for size control:
chunk_size = 500
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)
final_splits = text_splitter.split_documents(html_header_splits)
This two-step pattern (structure-aware split first, size-constrained split second) is the same approach we used in Phase 2 for markdown. It is the recommended pattern across all structure-aware splitters.
Splitter 2: HTMLSectionSplitter
The HTMLSectionSplitter takes a different approach. Instead of splitting purely by header tags, it uses XSLT transformations to identify sections. It also considers font sizes when determining section boundaries.
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]
html_section_splitter = HTMLSectionSplitter(
    headers_to_split_on=headers_to_split_on
)
section_splits = html_section_splitter.split_text(html_string)

for chunk in section_splits:
    print(f"Content: {chunk.page_content[:100]}...")
    print(f"Metadata: {chunk.metadata}")
    print("---")
The key difference from HTMLHeaderTextSplitter: instead of relying purely on header tags, the HTMLSectionSplitter runs the document through an XSLT transformation to detect section boundaries, and it can treat text styled with a large font size as a section header even when it is not marked up as h1 or h2. This makes it more forgiving of HTML that does not follow semantic conventions.
Note: In older versions of langchain-text-splitters, the HTMLSectionSplitter accepted an xslt_path parameter for custom XSLT files. This parameter was removed in a security update to prevent XXE (XML External Entity) attacks. The current version uses a hardcoded, safe default XSLT file internally.
Use this splitter when your HTML does not follow standard header conventions or when sections are defined by styling rather than semantic tags.
Splitter 3: HTMLSemanticPreservingSplitter (The Advanced Option)
This is the most powerful HTML splitter. It solves a critical problem: what happens when your HTML contains tables, ordered lists, code blocks, or embedded media? A naive splitter might cut a table in half, destroying the relationship between headers and data rows.
The HTMLSemanticPreservingSplitter keeps these elements intact, even if it means exceeding the target chunk size.
from bs4 import Tag
html_with_table = """
<html>
<body>
<h1>Model Comparison</h1>
<p>Here is a comparison of popular LLM models.</p>
<table>
<tr><th>Model</th><th>Parameters</th><th>Context Window</th></tr>
<tr><td>GPT-4</td><td>1.8T</td><td>128K tokens</td></tr>
<tr><td>Claude 3</td><td>Unknown</td><td>200K tokens</td></tr>
<tr><td>Gemini Ultra</td><td>Unknown</td><td>1M tokens</td></tr>
</table>
<h2>Performance Benchmarks</h2>
<p>Each model excels in different areas.</p>
<ul>
<li>GPT-4 leads in creative writing tasks</li>
<li>Claude 3 leads in long-context analysis</li>
<li>Gemini Ultra leads in multimodal tasks</li>
</ul>
<h2>Code Examples</h2>
<code data-lang="python">from langchain_openai import ChatOpenAI</code>
</body>
</html>
"""
# Define a custom handler for code elements
def code_handler(element: Tag) -> str:
    data_lang = element.get("data-lang", "text")
    code_format = f"<code:{data_lang}>{element.get_text()}</code>"
    return code_format
semantic_splitter = HTMLSemanticPreservingSplitter(
    headers_to_split_on=[
        ("h1", "Header 1"),
        ("h2", "Header 2"),
    ],
    separators=["\n\n", "\n", ". ", "! ", "? "],
    max_chunk_size=500,
    preserve_images=True,
    preserve_videos=True,
    elements_to_preserve=["table", "ul", "ol", "code"],
    denylist_tags=["script", "style", "head"],
    custom_handlers={"code": code_handler},
)
semantic_chunks = semantic_splitter.split_text(html_with_table)

for i, chunk in enumerate(semantic_chunks):
    print(f"--- Semantic Chunk {i} ---")
    print(f"Content: {chunk.page_content}")
    print(f"Metadata: {chunk.metadata}")
    print()
Let us break down each parameter:
- headers_to_split_on: the header tags that define chunk boundaries, just like the other HTML splitters.
- separators: the fallback text boundaries used when a section still needs to be subdivided for size.
- max_chunk_size: the target chunk size in characters, treated as a soft limit (more on this below).
- preserve_images and preserve_videos: keep image and video references in the chunk text rather than dropping them.
- elements_to_preserve: tags that must never be split across chunks; here tables, lists, and code blocks.
- denylist_tags: tags whose content is removed entirely, such as scripts and styles.
- custom_handlers: functions that control how specific tags are rendered as text; our code_handler rewrites <code> elements to include their data-lang attribute.
The max_chunk_size being a soft limit is a deliberate design choice. It is better to have an oversized chunk that contains a complete table than a correctly-sized chunk that has half a table. When the LLM receives a complete table, it can reason about relationships between rows and columns. A partial table is worse than useless.
Comparison Table: Which HTML Splitter to Use
| Feature | HTMLHeaderTextSplitter | HTMLSectionSplitter | HTMLSemanticPreservingSplitter |
|---|---|---|---|
| Header-based splitting | Yes | Yes | Yes |
| Preserves tables and lists | No | No | Yes |
| Header metadata | Yes | Yes | Yes |
| Custom tag handlers | No | No | Yes |
| Media preservation | No | No | Yes |
| Font size awareness | No | Yes | No |
| XSLT support | No | Yes | No |
| Best for | Well-structured docs | Non-standard HTML | Rich content with tables and media |
Selection guide:
- Use HTMLHeaderTextSplitter for well-structured pages with proper h1/h2/h3 tags.
- Use HTMLSectionSplitter when the HTML does not follow standard header conventions and sections are defined by styling.
- Use HTMLSemanticPreservingSplitter when the content contains tables, lists, code, or media that must not be broken apart.
This concludes Phase 5. You now have three levels of HTML splitting in your toolkit, from simple header-based splitting to full semantic preservation.
Evaluation: Comparing Splitter Outputs
Let us put all four splitters side by side and compare what they produce:
print("=== MARKDOWN SPLITTER ===")
print(f"Input type: Markdown string")
print(f"Output type: Documents with header hierarchy metadata")
print(f"Best for: Documentation, README files, knowledge base articles")
print(f"Key feature: Full header path in metadata")
print()
print("=== JSON SPLITTER ===")
print(f"Input type: Python dict / JSON object")
print(f"Output type: Dicts, Documents, or strings")
print(f"Best for: API responses, config files, structured data")
print(f"Key feature: Preserves JSON structure integrity")
print()
print("=== CODE SPLITTER ===")
print(f"Input type: Source code string + Language enum")
print(f"Output type: Documents")
print(f"Best for: Codebase search, code documentation")
print(f"Key feature: 25+ language-specific separator hierarchies")
print()
print("=== HTML SPLITTER ===")
print(f"Input type: HTML string, URL, or file path")
print(f"Output type: Documents with header metadata")
print(f"Best for: Web pages, HTML documentation, scraped content")
print(f"Key feature: Three splitters with increasing sophistication")
Key observations:
- All four splitters return the same Document interface, so your downstream embedding and retrieval code does not change when you switch splitters.
- Each splitter attaches format-specific metadata (header paths, source files, languages) that you can use for filtering at retrieval time.
- Each can be combined with RecursiveCharacterTextSplitter in a second pass when you need hard size limits.
In production RAG systems, teams consistently find that hybrid strategies (combining multiple chunking approaches based on content type) outperform single-strategy approaches. This is exactly the pattern you should follow: detect the format of each document, route it to the appropriate splitter, and combine the results.
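A minimal routing sketch along those lines; the extension-to-splitter mapping here is an assumption you would adapt to your own corpus:
# Hypothetical router: pick a splitter based on file extension
def split_by_format(path: str, content: str):
    if path.endswith(".md"):
        return MarkdownHeaderTextSplitter(
            headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
        ).split_text(content)
    if path.endswith(".py"):
        return RecursiveCharacterTextSplitter.from_language(
            language=Language.PYTHON, chunk_size=1000, chunk_overlap=100
        ).create_documents([content])
    if path.endswith(".html"):
        return HTMLHeaderTextSplitter(
            headers_to_split_on=[("h1", "Header 1"), ("h2", "Header 2")]
        ).split_text(content)
    # Fallback: generic character-based splitting
    return RecursiveCharacterTextSplitter(
        chunk_size=500, chunk_overlap=50
    ).create_documents([content])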
How to Improve It Further
Now that you have working code for all four structure-aware splitters, here are ways to build on this foundation:
- Route each document to the right splitter automatically based on its format, as sketched above.
- Enrich chunks with document-level context (for example, the header path already sitting in your metadata) before embedding, the contextual retrieval approach referenced at the start of this article.
- Tune chunk_size and chunk_overlap per content type and measure retrieval quality instead of guessing.
- Use chunk metadata for filtered retrieval, restricting searches to specific sections, source files, or languages.
LangChain's text splitters are part of the broader langchain-text-splitters package, which is designed to be lightweight and framework-agnostic. You can use these splitters even outside of a full LangChain pipeline, in any Python application that needs intelligent document chunking.
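As a final sketch, standalone usage looks like this; split_text on the character splitter returns plain strings you can hand to any downstream system, with no other LangChain machinery involved:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)
plain_chunks = splitter.split_text("Any long document text you want to chunk for a downstream system.")
print(plain_chunks)  # a plain list of strings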