Processing large amounts of CSV data using Java
Have you worked with large amounts of CSV data, in the GBs?
And with memory constraints?
This might help you.
1. Reading the CSV file
Usual & OpenCSV way ==> Records will be parsed as List<List<String>>. This is more expensive because of the String & List-of-List allocations, and it may also completely exhaust your memory.
Optimized way ==> don't use any library; instead, read your CSV line by line using Java NIO & a BufferedReader over an InputStream, & parse each line as a List<String>
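The line-by-line read above might look roughly like this minimal sketch (class & method names are illustrative, not from the original; assumes a simple CSV with no quoted fields):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class CsvStreamReader {

    // Processes one CSV row at a time; only a single line is ever held in
    // memory, instead of the whole file as a List<List<String>>.
    static long countRows(Path csv) throws IOException {
        long rows = 0;
        try (BufferedReader reader = Files.newBufferedReader(csv, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Parse the current line as a List<String>
                List<String> columns = Arrays.asList(line.split(",", -1));
                // ... process 'columns' here instead of collecting them ...
                rows++;
            }
        }
        return rows;
    }
}
```

Files.newBufferedReader is the NIO entry point; the try-with-resources block closes the stream even if parsing fails.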
2. Filtering Unique records based on particular columns
Usual way ==> To find unique records based on particular columns, we do the following:
1. Maintain a cache with key column_columnValue
2. Find the column index
3. Read each record & split it with the CSV separator (,)
4. Check the cache & filter if not unique
5. Populate the cache
This works fine, but it may take more time depending on your volume of data & number of columns, since String.split(",") iterates over & splits each & every column.
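The split-based steps above can be sketched like this (names are illustrative; assumes no quoted fields containing commas):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SplitDedup {

    // Keeps the first row seen for each value of the key column (steps 1-5 above).
    static List<String> filterUnique(List<String> rows, int keyColumnIndex) {
        Set<String> cache = new HashSet<>();       // step 1: cache keyed on column_columnValue
        List<String> unique = new ArrayList<>();
        for (String row : rows) {
            String[] columns = row.split(",", -1); // step 3: splits EVERY column of the row
            String key = keyColumnIndex + "_" + columns[keyColumnIndex];
            if (cache.add(key)) {                  // steps 4-5: filter & populate the cache
                unique.add(row);
            }
        }
        return unique;
    }
}
```

Note that split(",") allocates a String for every column of every row, even though only one column is needed for the key.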
Optimized way ==>
1. Maintain a cache with key column_columnValue
2. Find the column index
3. Read each record, find the column's start & end index based on the CSV separator, & use String.substring(startIndex, endIndex)
4. Check the cache & filter if not unique
5. Populate the cache
This will be faster than the usual way, since String.substring extracts only that particular column instead of iterating over & splitting every column.
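The substring-based extraction might be sketched like this (names are illustrative; same no-quoted-fields assumption as above):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SubstringDedup {

    // Walks separators with indexOf to locate the key column's start & end,
    // then extracts just that column; the other columns are never materialized.
    static String columnAt(String row, int columnIndex, char separator) {
        int start = 0;
        for (int i = 0; i < columnIndex; i++) {
            start = row.indexOf(separator, start) + 1;
            if (start == 0) return null; // row has fewer columns than expected
        }
        int end = row.indexOf(separator, start);
        return end == -1 ? row.substring(start) : row.substring(start, end);
    }

    static List<String> filterUnique(List<String> rows, int keyColumnIndex) {
        Set<String> cache = new HashSet<>();  // cache keyed on column_columnValue
        List<String> unique = new ArrayList<>();
        for (String row : rows) {
            String key = keyColumnIndex + "_" + columnAt(row, keyColumnIndex, ',');
            if (cache.add(key)) {
                unique.add(row);
            }
        }
        return unique;
    }
}
```

Per row this allocates only one small String for the key column, instead of one per column as split(",") does.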
This change might look small, but it reduced our data processing time from hours to minutes O_o
Now I am working on an application that generates a CSV file. How can I generate the CSV file faster? It contains around 40K rows, and the expected time is about 2 to 5 minutes.
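For reference, the usual baseline for writing is the mirror of the reading advice above: one BufferedWriter, one pass, one row at a time. A minimal sketch (class, method, & row shape are assumptions, not from the original; no quoting/escaping handled):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CsvWriterSketch {

    // Writes every row through a single BufferedWriter, so data reaches the
    // file in large chunks instead of one small write per row.
    static void writeCsv(Path target, List<String[]> rows) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            StringBuilder line = new StringBuilder();
            for (String[] row : rows) {
                line.setLength(0); // reuse one builder instead of concatenating Strings
                for (int i = 0; i < row.length; i++) {
                    if (i > 0) line.append(',');
                    line.append(row[i]);
                }
                writer.write(line.toString());
                writer.newLine();
            }
        }
    }
}
```

With buffered output like this, 40K plain rows is typically a sub-second job on modern hardware, so multi-minute times usually point at per-row flushing, String concatenation, or slow upstream data rather than the file write itself.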