Inside Hugging Face tokenizers' Native Async Implementation
On August 29, 2024, Hugging Face merged PR #1843 into their tokenizers library. The change looks simple enough - you can now write await tokenizer.async_encode(text) in Python. But the implementation? That's where things get interesting. They had to bridge two completely different async systems - Python's asyncio and Rust's Tokio.
Let me back up a bit. Tokenizers break text into smaller units called tokens. They turn "Hello world" into something like [1, 2, 3] - an array of numbers that LLMs can actually work with. The problem is, this process is surprisingly expensive. Most modern tokenizers use BPE (Byte Pair Encoding) or some variant of it. BPE repeatedly scans through your entire text, finding the most common character pairs and merging them into single tokens. If "th" appears a lot, it becomes one token. Then maybe "the" becomes its own token. This requires multiple passes through the entire text.
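To make that concrete, here's a deliberately naive sketch of a single BPE merge pass - not the library's actual implementation, which is heavily optimized, just the idea: count every adjacent pair, pick the most frequent, rewrite the whole sequence, and repeat until no merges remain.
from collections import Counter

def bpe_merge_pass(symbols):
    # Count every adjacent pair in the sequence
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return symbols, None
    best = max(pairs, key=pairs.get)  # most frequent pair, e.g. ('t', 'h')
    merged = []
    i = 0
    while i < len(symbols):
        # Replace each occurrence of the best pair with one merged symbol
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged, best

symbols = list("the theater then")
symbols, best = bpe_merge_pass(symbols)
print(best, symbols)  # every adjacent "th" has become a single symbol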
For short sentences, we're talking milliseconds. But look at what modern LLMs support - Claude handles 200k tokens, Gemini goes up to 2M tokens. That's entire books worth of text. Tokenizing something that long can take seconds, sometimes tens of seconds.
import time
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("gpt2")
# Short text is fast
start = time.time()
tokens = tokenizer.encode("Hello world")
print(f"Short text: {time.time() - start:.4f}s") # 0.0001s
# Long text is slow
long_text = "tokenizer test " * 50000 # ~100k tokens
start = time.time()
tokens = tokenizer.encode(long_text)
print(f"Long text: {time.time() - start:.4f}s") # 2-3s
This becomes a disaster in Python asyncio servers. asyncio runs multiple coroutines on a single thread, switching between them when one waits for I/O. It's great for handling many concurrent requests efficiently. But when something like tokenization hogs the CPU without yielding, the entire event loop freezes. Every LLM serving framework - vLLM, TGI, sglang - hits this problem. One user sends a research paper as input, and suddenly every other user is stuck waiting. The server looks dead.
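If you want to see the freeze for yourself, a minimal repro looks something like this - time.sleep stands in for slow synchronous tokenization, and the heartbeat task stops ticking for the full three seconds because the loop never gets control back:
import asyncio
import time

async def heartbeat():
    # Should tick roughly every 100 ms while the loop is healthy
    while True:
        print(f"tick {time.monotonic():.1f}")
        await asyncio.sleep(0.1)

async def handle_request():
    # Stand-in for synchronous tokenization of a huge input:
    # nothing else on the event loop can run until this returns
    time.sleep(3)

async def main():
    hb = asyncio.create_task(heartbeat())
    await asyncio.sleep(0.3)  # let a few ticks through
    await handle_request()    # ticks stop for ~3 seconds
    hb.cancel()

asyncio.run(main())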
People tried various workarounds. First option: use asyncio.run_in_executor to run tokenization in a separate thread. The event loop stays unblocked, sure, but there's a catch. Python's GIL (Global Interpreter Lock) means only one thread can execute Python bytecode at a time. Now, tokenizers is written in Rust, and Rust code can release the GIL, so you do get some parallelism. But creating Python threads and context switching between them adds significant overhead.
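The workaround looks roughly like this:
import asyncio
from concurrent.futures import ThreadPoolExecutor
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("gpt2")
pool = ThreadPoolExecutor(max_workers=4)

async def encode_off_loop(text: str):
    loop = asyncio.get_running_loop()
    # encode() runs on a pool thread; the Rust core can release the GIL,
    # but thread startup and context switches still cost something
    return await loop.run_in_executor(pool, tokenizer.encode, text)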
Second option: use process pools. Processes don't share the GIL, so you get true parallelism. But now you're serializing data between processes, which is expensive. Plus each process needs its own memory space.
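A process-pool version might look like this - note that the text and the returned ids get pickled across the process boundary, and every worker holds its own copy of the tokenizer:
import asyncio
from concurrent.futures import ProcessPoolExecutor
from tokenizers import Tokenizer

_worker_tokenizer = None

def encode_in_worker(text: str):
    # Each worker process keeps its own tokenizer instance in memory
    global _worker_tokenizer
    if _worker_tokenizer is None:
        _worker_tokenizer = Tokenizer.from_pretrained("gpt2")
    # Return plain ids so the result pickles cheaply on the way back
    return _worker_tokenizer.encode(text).ids

async def encode_in_process(pool: ProcessPoolExecutor, text: str):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, encode_in_worker, text)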
vLLM went with a more sophisticated approach - they used Ray to spawn dedicated tokenizer workers. Ray's a powerful distributed computing framework, but it's overkill for this. You need to set up a Ray cluster, the dependencies are heavy, configuration is complex. All that just to make tokenization async? Come on.
Now with PR #1843, here's what you can do:
# Single text encoding
tokens = await tokenizer.async_encode("Hello there", add_special_tokens=True)
# Batch encoding
texts = ["First sentence", "Second sentence", "Third sentence"]
results = await tokenizer.async_encode_batch(texts)
# Faster batch encoding (no offset tracking)
fast_results = await tokenizer.async_encode_batch_fast(texts)
# Decoding works too
token_ids = [[1, 2, 3], [4, 5, 6]]
texts = await tokenizer.async_decode_batch(token_ids)
Quick note on async_encode_batch_fast - by default, tokenizers track where each token came from in the original text. If "Hello world" becomes [1, 2, 3], it stores that token 1 maps to characters 0-5 ("Hello"), token 2 maps to 6-11 ("world"), and so on. This is useful for error messages or highlighting specific tokens. But often you just need the tokens and nothing else. Turning off offset tracking makes things 30-40% faster.
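You can see those offsets on an ordinary synchronous encode. The exact splits and spans depend on the vocabulary, but roughly:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("gpt2")
enc = tokenizer.encode("Hello world")
print(enc.tokens)   # e.g. ['Hello', 'Ġworld']
print(enc.offsets)  # e.g. [(0, 5), (6, 11)] - character spans in the input
print(enc.ids)      # the token ids themselves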
Here's the thing - Python and Rust have completely different async models. Python's asyncio is event-loop based. You define coroutines with async def, the event loop schedules them, and when you hit await, you yield control back to the loop so it can run other tasks. Everything happens on a single thread.
import asyncio

async def python_coroutine():
    print("Starting")
    await asyncio.sleep(1)  # Yields control to the event loop
    print("Done")

# The event loop runs the coroutine
asyncio.run(python_coroutine())
Rust's async/await is Future-based. An async fn returns a Future, and calling .await polls that Future. But Futures don't do anything by themselves - they need an executor to actually run them. That's where Tokio comes in.
use std::time::Duration;

async fn rust_future() {
    println!("Starting");
    tokio::time::sleep(Duration::from_secs(1)).await; // Polls the sleep Future
    println!("Done");
}

// The Tokio runtime executes the Future
#[tokio::main]
async fn main() {
    rust_future().await;
}
So how do you connect these two worlds? Enter pyo3-async-runtimes. It provides two key functions. future_into_py converts a Rust Future into a Python awaitable. Python code can just await it like any other async object. into_future does the opposite - it converts a Python awaitable into a Rust Future, so Rust can await Python coroutines.
Under the hood, future_into_py does some heavy lifting. It needs to create an object that implements Python's awaitable protocol. This object needs an __await__ method, and every time Python's event loop polls it, it needs to check the Rust Future's status and relay the result. The actual implementation is way more complex - it has to preserve Python's contextvars, handle exceptions, propagate cancellation signals, the works.
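As a rough sketch of what exposing an async function through this crate can look like - simplified, and not the actual tokenizers bindings; perform_tokenization here is just a stand-in for the real encoding work:
use pyo3::prelude::*;

// Hypothetical CPU-bound function standing in for real tokenization
fn perform_tokenization(text: String) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

#[pyfunction]
fn async_encode<'py>(py: Python<'py>, text: String) -> PyResult<Bound<'py, PyAny>> {
    // Returns a Python awaitable backed by a Rust Future
    pyo3_async_runtimes::tokio::future_into_py(py, async move {
        let ids = tokio::task::spawn_blocking(move || perform_tokenization(text))
            .await
            .map_err(|e| pyo3::exceptions::PyRuntimeError::new_err(e.to_string()))?;
        Ok(ids)
    })
}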
+-----------------------+                 +-----------------------------+
| Python (asyncio)      |      await      | Rust (Tokio runtime)        |
| event loop (main)     |  <------------  | global Lazy<Runtime>        |
|                       |     (PyAny)     |   (once_cell::Lazy)         |
| await tokenizer.*     |                 |                             |
|  └─ async_* methods   |                 | future_into_py(...)         |
|     (awaitable)       |                 |  └─ async {                 |
|                       |                 |       spawn_blocking(||     |
|                       |                 |         encode_batch...     |
|                       |                 |       ).await               |
|                       |                 |       -> PyResult<T>        |
+-----------------------+                 +-----------------------------+
            ^                                           |
            |              into_future(...)             |
            +-------------------------------------------+
The key to tokenizers' implementation is tokio::task::spawn_blocking. Tokio manages two kinds of threads. Worker threads run async tasks, and there's a separate pool for blocking operations. Worker threads are limited to the number of CPU cores, but blocking threads are spawned dynamically (up to 512 by default).
// tokenizers implementation (conceptual)
async fn encode_async(text: String) -> Vec<u32> {
    // Send CPU-intensive work to the blocking thread pool
    let handle = tokio::task::spawn_blocking(move || {
        // This closure runs on a dedicated OS thread and
        // doesn't block Tokio worker threads
        perform_tokenization(text) // Actual tokenization
    });
    // Await the JoinHandle to wait for completion
    handle.await.unwrap()
}
Why do we need spawn_blocking? If you run CPU-intensive work directly in an async function, it monopolizes a Tokio worker thread. Other Futures can't make progress. If you have 4 worker threads and 4 tokenization tasks running, your entire Tokio runtime looks frozen. spawn_blocking moves this work to a separate thread pool. Worker threads can immediately move on to other Futures. When the blocking work completes, it sends the result back to a worker thread.
tokenizers uses a global Tokio runtime. Creating a runtime is expensive - you're spawning threads, initializing schedulers, setting up timers, preparing I/O drivers. So they use once_cell::Lazy to create it exactly once.
use once_cell::sync::Lazy;
use std::sync::Arc;
use tokio::runtime::Runtime;

static TOKIO_RUNTIME: Lazy<Arc<Runtime>> = Lazy::new(|| {
    Arc::new(
        tokio::runtime::Builder::new_multi_thread()
            .worker_threads(4)
            .enable_all()
            .build()
            .expect("Failed to create Tokio runtime"),
    )
});
once_cell::Lazy guarantees lazy initialization. The first time someone accesses TOKIO_RUNTIME, the closure runs and creates the runtime. Subsequent accesses return the same runtime. Even if multiple threads access it simultaneously, it only initializes once. This is harder than it sounds - you need to handle race conditions, initialization failures, all that jazz. Lazy handles it all.
Internally, Lazy uses OnceCell, which relies on atomic operations and thread parking to ensure exactly-once initialization. Say two threads access it simultaneously - thread A starts initializing, thread B waits for completion, thread A finishes, thread B wakes up and uses the initialized value. If initialization panics, the OnceCell becomes poisoned and propagates the panic to future accesses.
When working with Python objects in PyO3, you need the GIL (Global Interpreter Lock). It's a global mutex that protects the Python interpreter's internal state. You must hold it to create or access Python objects.
// Operations that need the GIL
Python::with_gil(|py| {
    // Access Python objects through `py`
    let list = PyList::new(py, &[1, 2, 3]);
    list.append(4)?;
    Ok(())
})

// Release the GIL for Rust operations
Python::with_gil(|py| {
    let data = get_data_from_python(py)?;
    // Pure Rust computation without Python objects
    py.allow_threads(|| {
        // GIL released - other Python threads can run
        expensive_rust_computation(data)
    })
})
In tokenizers' async implementation, the tokenization itself doesn't need the GIL. It's pure Rust computation. They only acquire the GIL when converting results back to Python objects. This improves overall concurrency since other Python threads can run during tokenization.
Python user code
  └─ await tokenizer.async_encode_batch_fast([...])
                  |
                  v
pyo3-async-runtimes::tokio::future_into_py
                  |
                  v
Rust async block
  └─ TOKIO_RT.spawn_blocking(|| { tokenizer.encode_batch_fast(...) })
                  |
                  v
(CPU-bound encoding on blocking thread pool)
                  |
                  v
JoinHandle.await → PyObject result (via GIL section)
                  |
                  v
back to Python event loop (result ready)
What happens when Python cancels a task? pyo3-async-runtimes propagates Python cancellations to Rust. But here's the thing - Tokio cancellation is "cooperative". Futures check for cancellation at .await points and terminate themselves. spawn_blocking tasks don't have .await points, so they can't be interrupted mid-execution.
import asyncio

async def tokenize_with_timeout():
    try:
        result = await asyncio.wait_for(
            tokenizer.async_encode("extremely long text" * 100000),
            timeout=1.0,
        )
    except asyncio.TimeoutError:
        print("Timed out!")
        # But the Rust task might still be running
In tokenizers' case, the tokenization itself can't be stopped. Python sees it as cancelled and ignores the result, but the Rust task runs to completion in the background and the result gets dropped. To properly handle this, you'd need to chunk the work and check cancellation tokens periodically, but tokenizers didn't implement that level of control. Most tokenization is fast enough that it doesn't matter.
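For the curious, a cooperative-cancellation version might look something like the sketch below. It's not part of tokenizers - just an illustration using tokio_util's CancellationToken, with a hypothetical encode_chunk helper and naive byte-based chunking (which real BPE can't actually tolerate, since merges cross chunk boundaries):
use tokio_util::sync::CancellationToken;

// Hypothetical chunk-level encoder standing in for real tokenization
fn encode_chunk(chunk: &str) -> Vec<u32> {
    chunk.bytes().map(u32::from).collect()
}

async fn encode_cancellable(text: String, token: CancellationToken) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    for chunk in text.as_bytes().chunks(64 * 1024) {
        if token.is_cancelled() {
            return None; // stop between chunks; the current chunk still ran to completion
        }
        ids.extend(encode_chunk(&String::from_utf8_lossy(chunk)));
        tokio::task::yield_now().await; // give the runtime a chance to observe cancellation
    }
    Some(ids)
}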
Data transfer between Python and Rust is another consideration. You're sending text to Rust and getting token arrays back. PyO3 provides various optimizations. Taking a Python string as &str avoids copying, but taking it as String copies. Converting a Rust Vec to a NumPy array can be zero-copy, but converting to a Python list requires copying. tokenizers mostly uses the copying approach. The text gets transformed during processing anyway, and returning Python lists keeps the API consistent.
The PR author's benchmarks show bigger gains with more concurrent requests. Single requests might actually be slower due to async/await overhead, Future creation costs, and inter-thread communication. But real servers rarely handle single requests. vLLM and similar inference servers batch requests by default. When tokenization blocks, the entire pipeline stalls.
Why is native async faster than thread pools? First, no thread creation overhead - the Tokio runtime already exists and the blocking thread pool is reused. Second, minimal context switching - Tokio task switching is lighter than Python thread switching. Third, no GIL contention - Rust code runs completely outside the GIL. Fourth, better memory locality - Tokio's thread pool makes more efficient use of CPU caches.
In production, there are things to watch out for. First, limit concurrent execution. spawn_blocking's thread pool can grow to 512 threads. Throwing unlimited CPU-bound work at it can cause thread explosion.
import asyncio
from asyncio import Semaphore
from tokenizers import Tokenizer

class TokenizerService:
    def __init__(self, max_concurrent=10):
        self.tokenizer = Tokenizer.from_pretrained("gpt2")
        self.semaphore = Semaphore(max_concurrent)

    async def encode(self, text: str):
        async with self.semaphore:
            # Max 10 concurrent operations
            return await self.tokenizer.async_encode(text)
Second, monitor memory usage. Processing lots of long texts can cause memory spikes. Third, set appropriate timeouts. Fourth, tune batch sizes. Too large and you blow out memory, too small and overhead dominates.
vLLM used to have this complex Ray setup. Create Ray actors, communicate via RPC, serialize/deserialize results. Now it's just await tokenizer.async_encode(text). One line. No Ray dependency, simpler config, lower latency.
When things go wrong, how do you debug? First, check Tokio runtime state with RUST_LOG=tokio=debug. Second, enable asyncio's blocking detection with loop.set_debug(True) - it warns when something blocks for over 100ms. Third, check for memory leaks using weakref to track object lifecycles.
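The asyncio part of that checklist is only a couple of lines - slow_callback_duration sets the threshold for the "blocked too long" warning, and it defaults to 0.1 seconds:
import asyncio
import logging

logging.basicConfig(level=logging.WARNING)

async def main():
    loop = asyncio.get_running_loop()
    loop.set_debug(True)               # enable asyncio's debug mode
    loop.slow_callback_duration = 0.1  # warn if a callback blocks for > 100 ms
    ...  # run your server / tokenization workload here

asyncio.run(main())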
This PR is just the beginning. There's room for improvement. Streaming tokenization would let you process text as it arrives. GPU acceleration using CUDA or Metal could make things way faster. Fine-grained control with priorities, cancellation tokens, progress reporting - all possible additions.
tokenizers isn't alone in this approach. Polars is having similar discussions about reading large CSVs without blocking the event loop. They're looking at pyo3-async-runtimes too. Pydantic-core is preparing async versions for slow JSON parsing and validation.
This trend will spread across the Python ecosystem. Python is easy but slow. Rust is fast but complex. Combine them and you get libraries that are both easy and fast. PyO3 and pyo3-async-runtimes are the bridges making this possible.
PR #1843 looks simple but the implementation was complex. They used pyo3-async-runtimes to bridge Python's asyncio and Rust's Tokio, isolated CPU-intensive work with spawn_blocking, and managed the global runtime with once_cell::Lazy. The lesson is clear - you can leverage each language's strengths across boundaries. Python provides developer-friendly APIs, Rust handles high-performance implementation. This shows it's possible even in the complex world of async programming.
There are caveats for real-world use. You need to limit concurrent execution, cancellation isn't immediate, short tasks might actually be slower. But despite these limitations, real workloads like LLM serving see huge benefits. You can handle thousands of concurrent users, latency drops, resource usage decreases. And behind it all is this sophisticated bridge between Rust and Python.