Processing Large Amounts of CSV Data Using Java

Have you ever worked with gigabytes of CSV data under memory constraints?

This might help you.

  1. First things first: we need to parse the CSV records

Usual (OpenCSV) way ==> Records are parsed as List<List<String>>. This is expensive because of all the String and List-of-List allocations, and loading everything at once may also exhaust the heap entirely.

Optimized way ==> Don't use a library; instead, read your CSV line by line using Java NIO (InputStream and BufferedReader) and parse each line as a List<String>.
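A minimal sketch of this line-by-line approach (the file name and per-record handling are hypothetical; note a plain split does not handle quoted fields that contain commas):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class CsvLineReader {

    // Parse one CSV line into its columns; limit -1 keeps trailing empty fields.
    static List<String> parseLine(String line) {
        return Arrays.asList(line.split(",", -1));
    }

    public static void main(String[] args) throws IOException {
        Path csv = Path.of("data.csv"); // hypothetical input file
        // Files.newBufferedReader is the NIO way to obtain a BufferedReader;
        // only the current line is held in memory at any time.
        try (BufferedReader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                List<String> columns = parseLine(line);
                // ... process one record, then let it be garbage collected
            }
        }
    }
}
```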

2. Filtering unique records based on particular columns

Usual way ==> To find unique records based on particular columns, we do the following:

1. Maintain a cache keyed by column_columnValue

2. Find the column's index

3. Read each record and split it with the CSV separator (,)

4. Check the cache and drop the record if it is a duplicate

5. Populate the cache

This works fine, but it can take considerable time depending on your data volume and number of columns, since String.split(",") iterates over and splits every column even when you only need one.
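The split-based steps above might be sketched like this (the class name and column choice are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

public class SplitBasedFilter {

    private final Set<String> cache = new HashSet<>();
    private final int columnIndex;

    SplitBasedFilter(int columnIndex) {
        this.columnIndex = columnIndex;
    }

    // Returns true the first time a column value is seen, false for duplicates.
    // Splits the whole record even though only one column is needed.
    boolean isUnique(String record) {
        String[] columns = record.split(",", -1);
        // Cache key follows the column_columnValue pattern from the article
        return cache.add(columnIndex + "_" + columns[columnIndex]);
    }
}
```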

Optimized way ==>

1. Maintain a cache keyed by column_columnValue

2. Find the column's index

3. Read each record, locate the column's start and end indices based on the CSV separator, and use String.substring(startIndex, endIndex)

4. Check the cache and drop the record if it is a duplicate

5. Populate the cache

This is faster than the usual way because String.substring extracts only the column we need instead of iterating over every column.
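A sketch of the substring approach, assuming a single target column and a comma separator (the class and method names are hypothetical):

```java
import java.util.HashSet;
import java.util.Set;

public class UniqueFilter {

    // Extract one column's value without splitting the whole line:
    // walk separators up to the target column, then substring just that slice.
    static String columnValue(String line, int columnIndex, char separator) {
        int start = 0;
        for (int i = 0; i < columnIndex; i++) {
            start = line.indexOf(separator, start) + 1;
            if (start == 0) return ""; // record has fewer columns than expected
        }
        int end = line.indexOf(separator, start);
        return end == -1 ? line.substring(start) : line.substring(start, end);
    }

    private final Set<String> seen = new HashSet<>();
    private final int columnIndex;

    UniqueFilter(int columnIndex) {
        this.columnIndex = columnIndex;
    }

    // Returns true the first time a column value appears, false for duplicates.
    boolean isUnique(String line) {
        // Cache key follows the column_columnValue pattern from the article
        return seen.add(columnIndex + "_" + columnValue(line, columnIndex, ','));
    }
}
```

Like the simple split, this sketch assumes columns never contain quoted separators; for fully quoted CSV a real parser is still needed.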

This change might look small, but it reduced our data processing time from hours to minutes.


More articles by Divagar Carlmarx
