Scaling Batch Processing with Spring Batch: A Practical Guide to Big Data & Thread Management

Processing millions of records is a common challenge in enterprise systems—from financial settlements to ETL pipelines and analytics workloads. Spring Batch provides a powerful, production-ready framework for building scalable and fault-tolerant batch jobs.

In this article, I’ll explain how Spring Batch handles large data sets and manages threads efficiently, and walk through a practical real-world example.


🔹 Spring Batch Architecture in Brief

Spring Batch’s core structure:

  • Job → top-level container
  • Step → reader → processor → writer
  • Chunk-Oriented Processing → processes small groups of items per transaction

This makes large-scale data processing stable, memory-efficient, and fault-tolerant.
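
In code, these pieces hang off a single configuration class. Here is a minimal sketch of that shape (class and bean names like TransactionBatchConfig are illustrative; it uses the same Spring Batch 4 factory style as the steps below, whereas Spring Batch 5 would use JobBuilder/StepBuilder with an explicit JobRepository):

@Configuration
@EnableBatchProcessing
public class TransactionBatchConfig {

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Bean
    public Job transactionJob(Step processStep) {
        // Job = top-level container: it simply runs its Steps in order
        return jobBuilderFactory.get("transactionJob")
            .incrementer(new RunIdIncrementer()) // new run id on each launch
            .start(processStep)
            .build();
    }

    // taskExecutor / reader / processStep beans follow in Steps 1-3
}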


🔹 Real Example: Processing 10 Million Records with Multi-Threading

Imagine you have a table with 10 million transactions, and you want to:

  • read the data in a streaming manner
  • process them
  • write results
  • run all of this using multiple threads to speed up execution

Below is a real Spring Batch example.


Step 1 – Thread Pool Configuration

@Bean
public TaskExecutor taskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(4);      // threads kept alive even when idle
    executor.setMaxPoolSize(8);       // upper bound under load
    executor.setQueueCapacity(2000);  // tasks queue here when all core threads are busy;
                                      // extra threads (up to max) start only if the queue fills
    executor.setThreadNamePrefix("batch-worker-");
    executor.initialize();
    return executor;
}

Attaching this executor to the step (Step 3) lets Spring Batch run whole chunks in parallel, one chunk per worker thread.

Step 2 – Streaming Reader (Paging for Big Data)

@Bean
public JdbcPagingItemReader<Transaction> reader(DataSource dataSource) {
    JdbcPagingItemReader<Transaction> reader = new JdbcPagingItemReader<>();

    reader.setDataSource(dataSource);
    reader.setPageSize(1000);   // rows fetched per paging query
    reader.setSaveState(false); // restart state can't be tracked reliably
                                // when the step is multi-threaded

    // Paging query provider: each page is a separate query, ordered by a unique key
    MySqlPagingQueryProvider queryProvider = new MySqlPagingQueryProvider();
    queryProvider.setSelectClause("SELECT id, amount, created_at");
    queryProvider.setFromClause("FROM transaction");
    queryProvider.setSortKeys(Collections.singletonMap("id", Order.ASCENDING));

    reader.setQueryProvider(queryProvider);
    reader.setRowMapper(new TransactionRowMapper());
    return reader;
}

Paging ensures we never load all rows into memory. Just as important for Step 3: JdbcPagingItemReader is thread-safe, unlike the cursor-based JdbcCursorItemReader, so it can safely be shared by a multi-threaded step.
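
The TransactionRowMapper referenced above isn't shown in the original snippet; a minimal sketch might look like this (assuming a plain Transaction POJO with id, amount, and createdAt fields, since the domain class is also an assumption here):

public class TransactionRowMapper implements RowMapper<Transaction> {

    @Override
    public Transaction mapRow(ResultSet rs, int rowNum) throws SQLException {
        // Map one row of the current page to the domain object
        Transaction tx = new Transaction();
        tx.setId(rs.getLong("id"));
        tx.setAmount(rs.getBigDecimal("amount"));
        tx.setCreatedAt(rs.getTimestamp("created_at").toLocalDateTime());
        return tx;
    }
}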

Step 3 – Define the Multi-Threaded Step

@Bean
public Step processStep(JdbcPagingItemReader<Transaction> reader) {
    return stepBuilderFactory.get("processStep")
        .<Transaction, ProcessedTransaction>chunk(1000) // 1000 items per transaction
        .reader(reader)                                 // injected bean, instead of the reader(null) trick
        .processor(new TransactionProcessor())
        .writer(new TransactionWriter())
        .taskExecutor(taskExecutor())                   // makes this a multi-threaded step
        .throttleLimit(8)                               // maximum parallel threads
        .build();
}

A chunk size of 1000 is a good starting point for multi-million-record workloads: large enough to amortize per-transaction overhead, small enough to bound memory use and rollback cost. The best value depends on row size and writer cost, so benchmark rather than assume.
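
TransactionProcessor and TransactionWriter are referenced but not shown here; one plausible shape for them is sketched below, using Spring Batch 4 signatures (ProcessedTransaction and its constructor are assumptions):

public class TransactionProcessor
        implements ItemProcessor<Transaction, ProcessedTransaction> {

    @Override
    public ProcessedTransaction process(Transaction tx) {
        // Runs once per item; returning null filters the item out of the chunk
        return new ProcessedTransaction(tx.getId(), tx.getAmount());
    }
}

public class TransactionWriter implements ItemWriter<ProcessedTransaction> {

    @Override
    public void write(List<? extends ProcessedTransaction> items) {
        // Called once per chunk (up to 1000 items here) inside one transaction;
        // a real writer would batch-insert, e.g. via JdbcTemplate.batchUpdate(...)
    }
}

Note that in a multi-threaded step these components are shared across worker threads, so they must be stateless or otherwise thread-safe.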

What This Achieves

Using the example above:

  • 4–8 parallel threads increase throughput
  • database reads are paged, not loaded into memory
  • each chunk is fully transactional
  • throughput on 10M rows can improve 3–5x depending on hardware
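
For completeness, a job wired this way can be launched on demand, for example from a scheduler (a sketch; transactionJob refers to the illustrative job bean from the configuration skeleton above):

@Autowired
private JobLauncher jobLauncher;

@Autowired
private Job transactionJob;

public void runBatch() throws Exception {
    // A unique parameter value lets the same job be launched repeatedly
    JobParameters params = new JobParametersBuilder()
        .addLong("startedAt", System.currentTimeMillis())
        .toJobParameters();
    jobLauncher.run(transactionJob, params);
}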


Why This Matters

Many teams run batch jobs that take hours—often because they are processed sequentially or load too much data into memory.

With Spring Batch:

  • multi-threading
  • chunk processing
  • database paging
  • fault tolerance

you can often turn multi-hour jobs into minutes.

Final Thoughts

Spring Batch continues to be a reliable choice for large-scale enterprise workloads. With proper tuning, such as multi-threaded steps, partitioning, and the right chunk size, you can safely process millions of records with excellent performance.
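
Partitioning deserves an article of its own, but as a taste, here is a minimal sketch of a locally partitioned step. The ColumnRangePartitioner is hypothetical (modeled on the 10-million-row example, with the id range hard-coded for brevity); each worker step would read only its own minId..maxId slice:

public class ColumnRangePartitioner implements Partitioner {

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // Split the id space into gridSize contiguous ranges
        long min = 1, max = 10_000_000;
        long rangeSize = (max - min + 1) / gridSize;
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctx = new ExecutionContext();
            ctx.putLong("minId", min + i * rangeSize);
            ctx.putLong("maxId", i == gridSize - 1 ? max : min + (i + 1) * rangeSize - 1);
            partitions.put("partition" + i, ctx);
        }
        return partitions;
    }
}

@Bean
public Step partitionedStep(Step workerStep) {
    return stepBuilderFactory.get("partitionedStep")
        .partitioner("workerStep", new ColumnRangePartitioner())
        .step(workerStep)             // each partition runs its own copy of the worker step
        .gridSize(8)                  // number of partitions
        .taskExecutor(taskExecutor()) // partitions execute in parallel
        .build();
}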
