Processing Large Amounts of CSV Data Using Java

Have you ever worked with gigabytes of CSV data under memory constraints?

This might help you.

  1. First things first: we need to parse the CSV records

Usual (OpenCSV) way ==> Records are parsed as List<List<String>>. This is expensive because of all the String and List-of-List allocations, and loading everything at once may also exhaust the heap entirely.

Optimized way ==> Don't use a library; instead, read your CSV line by line using Java NIO (InputStream and BufferedReader) and parse each line as a List<String>.
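A minimal sketch of this line-by-line approach (the file name and per-record handling are hypothetical; note a plain split does not handle quoted fields that contain commas):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class CsvLineReader {

    // Parse one CSV line into its columns; limit -1 keeps trailing empty fields.
    static List<String> parseLine(String line) {
        return Arrays.asList(line.split(",", -1));
    }

    public static void main(String[] args) throws IOException {
        Path csv = Path.of("data.csv"); // hypothetical input file
        // Files.newBufferedReader is the NIO way to obtain a BufferedReader;
        // only the current line is held in memory at any time.
        try (BufferedReader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                List<String> columns = parseLine(line);
                // ... process one record, then let it be garbage collected
            }
        }
    }
}
```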

2. Filtering unique records based on particular columns

Usual way ==> To find unique records based on particular columns, we do the following:

1. Maintain a cache keyed by column_columnValue

2. Find the column's index

3. Read each record and split it with the CSV separator (,)

4. Check the cache and drop the record if it is a duplicate

5. Populate the cache

This works fine, but it can take considerable time depending on your data volume and number of columns, since String.split(",") iterates over and splits every column even when you only need one.
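The split-based steps above might be sketched like this (the class name and column choice are made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

public class SplitBasedFilter {

    private final Set<String> cache = new HashSet<>();
    private final int columnIndex;

    SplitBasedFilter(int columnIndex) {
        this.columnIndex = columnIndex;
    }

    // Returns true the first time a column value is seen, false for duplicates.
    // Splits the whole record even though only one column is needed.
    boolean isUnique(String record) {
        String[] columns = record.split(",", -1);
        // Cache key follows the column_columnValue pattern from the article
        return cache.add(columnIndex + "_" + columns[columnIndex]);
    }
}
```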

Optimized way ==>

1. Maintain a cache keyed by column_columnValue

2. Find the column's index

3. Read each record, locate the column's start and end indices based on the CSV separator, and use String.substring(startIndex, endIndex)

4. Check the cache and drop the record if it is a duplicate

5. Populate the cache

This is faster than the usual way because String.substring extracts only the column we need instead of iterating over every column.
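A sketch of the substring approach, assuming a single target column and a comma separator (the class and method names are hypothetical):

```java
import java.util.HashSet;
import java.util.Set;

public class UniqueFilter {

    // Extract one column's value without splitting the whole line:
    // walk separators up to the target column, then substring just that slice.
    static String columnValue(String line, int columnIndex, char separator) {
        int start = 0;
        for (int i = 0; i < columnIndex; i++) {
            start = line.indexOf(separator, start) + 1;
            if (start == 0) return ""; // record has fewer columns than expected
        }
        int end = line.indexOf(separator, start);
        return end == -1 ? line.substring(start) : line.substring(start, end);
    }

    private final Set<String> seen = new HashSet<>();
    private final int columnIndex;

    UniqueFilter(int columnIndex) {
        this.columnIndex = columnIndex;
    }

    // Returns true the first time a column value appears, false for duplicates.
    boolean isUnique(String line) {
        // Cache key follows the column_columnValue pattern from the article
        return seen.add(columnIndex + "_" + columnValue(line, columnIndex, ','));
    }
}
```

Like the simple split, this sketch assumes columns never contain quoted separators; for fully quoted CSV a real parser is still needed.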

This change might look small, but it reduced our data processing time from hours to minutes.


More articles by Divagar Carlmarx
