🔗 Continuing from my last post here: https://lnkd.in/gDpPdCAr

As a note, I've been using KNIME to support my daily data entry tasks. Now I've run a small experiment comparing it with Python, going from raw files to the database and then to verification.

🗃️ 𝐃𝐚𝐭𝐚𝐬𝐞𝐭:
▪️ I'm using data with ~370k rows per day
▪️ In this experiment, I used 4 days of data

📚 𝐃𝐚𝐭𝐚 𝐢𝐧𝐠𝐞𝐬𝐭𝐢𝐨𝐧 𝐟𝐥𝐨𝐰:
The flow itself is not too complex:

1️⃣ 𝐏𝐮𝐭 𝐝𝐚𝐭𝐚 𝐢𝐧 𝐚 𝐝𝐞𝐟𝐢𝐧𝐞𝐝 𝐟𝐨𝐥𝐝𝐞𝐫
KNIME/Python checks the folder and extracts the file(s) that match a specific filename

2️⃣ 𝐃𝐚𝐭𝐚 𝐓𝐫𝐚𝐧𝐬𝐟𝐨𝐫𝐦𝐚𝐭𝐢𝐨𝐧
The transformation in this part is simple:
- change the date format to "𝘥𝘥/𝘮𝘮/𝘺𝘺"
- arrange the column names
- sort the data

3️⃣ 𝐑𝐮𝐧 𝐭𝐡𝐞 𝐏𝐨𝐬𝐭𝐠𝐫𝐞𝐒𝐐𝐋 𝐞𝐧𝐠𝐢𝐧𝐞 𝐚𝐧𝐝 𝐜𝐨𝐧𝐧𝐞𝐜𝐭 𝐭𝐨 𝐭𝐡𝐞 𝐝𝐚𝐭𝐚𝐛𝐚𝐬𝐞
Set up the database environment and connect KNIME/Python so it can interact with the database

4️⃣ 𝐈𝐧𝐠𝐞𝐬𝐭 𝐝𝐚𝐭𝐚 𝐢𝐧𝐭𝐨 𝐭𝐡𝐞 𝐝𝐚𝐭𝐚𝐛𝐚𝐬𝐞
Store the data in the defined tables and verify the upload results

5️⃣ 𝐃𝐚𝐭𝐚 𝐯𝐞𝐫𝐢𝐟𝐢𝐜𝐚𝐭𝐢𝐨𝐧
Check the imported data with a line chart. Your data is successfully ingested into the database 🎉

📊 𝐑𝐞𝐬𝐮𝐥𝐭 𝐟𝐫𝐨𝐦 𝐭𝐡𝐢𝐬 𝐞𝐱𝐩𝐞𝐫𝐢𝐦𝐞𝐧𝐭
▪️ As shown in the video, KNIME Analytics Platform needs 115s and Python needs 87s to ingest the data from raw files through to verification. Python is ~25% faster than KNIME
▪️ The advantage of KNIME is that we don't need to master coding, because it works by simple "drag and drop"
▪️ With Python, on the other hand, we at least need to understand the algorithm and be able to translate it into a script

----------

🔆 𝐋𝐞𝐬𝐬𝐨𝐧𝐬 𝐥𝐞𝐚𝐫𝐧𝐞𝐝 𝐟𝐫𝐨𝐦 𝐭𝐡𝐢𝐬 𝐞𝐱𝐩𝐞𝐫𝐢𝐦𝐞𝐧𝐭:
The lessons learned were mostly about writing the Python code. I ran into many errors, but each one taught me how to fix it (minimal sketches of these fixes are at the end of this post). For example:

▪️ The data types of the extracted results differ from those in the database.
𝑺𝒐𝒍𝒖𝒕𝒊𝒐𝒏: Match the data types.

▪️ Following the method from the previous post takes a long time, because it sends the data to the database one row at a time (or in very small chunks), which creates a lot of "network chatter" between Python and Postgres.
𝑺𝒐𝒍𝒖𝒕𝒊𝒐𝒏: Create a function that uses Postgres' built-in COPY command. This is the fastest way to move data into Postgres.

▪️ A 𝘔𝘦𝘮𝘰𝘳𝘺𝘌𝘳𝘳𝘰𝘳 appeared, because converting large columns and applying regular expressions consumes a lot of RAM.
𝑺𝒐𝒍𝒖𝒕𝒊𝒐𝒏: Use "chunking": break the rows into smaller chunks so RAM only has to handle one small chunk at a time, while you still get the high speed of the COPY method.

----------

You can check the details in my git: https://lnkd.in/gM9MCUpb

Regards,
Fatwa Rafiudin

#DataEngineering #Python #PostgreSQL #KNIME #ETL #VSCode #JupyterNotebook
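To make the post a bit more concrete, here are a few minimal Python sketches of the steps above. First, steps 1️⃣ and 2️⃣: pick up the files and apply the simple transformation. The folder, filename pattern, and column names (date_col, value_col) are placeholder assumptions; the real ones are in the repo.

```python
import glob
import pandas as pd

# 1) Pick up the file(s) from the defined folder that match a specific filename
#    ("raw_data/export_*.csv" is a placeholder pattern)
files = glob.glob("raw_data/export_*.csv")

frames = []
for path in files:
    df = pd.read_csv(path)

    # 2a) Arrange the column names to match the target table (hypothetical names)
    df = df.rename(columns={"Date": "date_col", "Value": "value_col"})

    # 2b) Sort the data while the dates are still real datetimes,
    #     then change the date format to dd/mm/yy
    df["date_col"] = pd.to_datetime(df["date_col"])
    df = df.sort_values("date_col")
    df["date_col"] = df["date_col"].dt.strftime("%d/%m/%y")

    frames.append(df)

data = pd.concat(frames, ignore_index=True)
```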
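For the first lesson learned (mismatched data types), the fix is simply to cast the DataFrame columns so they line up with the table schema before loading. Column names and types here are made-up examples, not the actual schema.

```python
# Cast columns so the pandas dtypes match the Postgres column types
# (hypothetical names and types for illustration only)
data = data.astype({"value_col": "float64"})        # -> NUMERIC / DOUBLE PRECISION
data["date_col"] = data["date_col"].astype(str)     # -> TEXT / VARCHAR
```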
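For the "network chatter" and 𝘔𝘦𝘮𝘰𝘳𝘺𝘌𝘳𝘳𝘰𝘳 lessons, here is a sketch of the kind of bulk-load function I mean: COPY instead of row-by-row INSERTs, fed one chunk at a time. The connection details, table name, and chunk size are assumptions; the actual implementation is in the repo.

```python
import io
import psycopg2

def copy_in_chunks(df, table, conn, chunk_size=50_000):
    """Stream a DataFrame into Postgres with COPY, one chunk at a time.

    COPY avoids the per-row round trips ("network chatter") of plain INSERTs,
    and chunking means RAM only ever holds one small CSV buffer at a time.
    """
    with conn.cursor() as cur:
        for start in range(0, len(df), chunk_size):
            chunk = df.iloc[start:start + chunk_size]

            # Serialize just this chunk to an in-memory CSV buffer
            buf = io.StringIO()
            chunk.to_csv(buf, index=False, header=False)
            buf.seek(0)

            # COPY ... FROM STDIN is the fastest way to move data into Postgres
            cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()

# Usage ("data" is the transformed DataFrame from the sketch above;
# connection parameters and table name are placeholders)
conn = psycopg2.connect(host="localhost", dbname="mydb",
                        user="postgres", password="secret")
copy_in_chunks(data, "my_table", conn)
```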
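And for step 5️⃣, a minimal verification sketch: query a per-day summary back out of the database and plot it as a line chart to confirm that every day was ingested. Table and column names are again placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2

# Re-open the connection (placeholder parameters)
conn = psycopg2.connect(host="localhost", dbname="mydb",
                        user="postgres", password="secret")

# Pull a daily row count back out of the database
check = pd.read_sql(
    "SELECT date_col, COUNT(*) AS n_rows "
    "FROM my_table GROUP BY date_col ORDER BY date_col",
    conn,
)

# A line chart makes missing days or odd row counts easy to spot
check.plot(x="date_col", y="n_rows", kind="line", marker="o")
plt.title("Rows ingested per day")
plt.tight_layout()
plt.show()

conn.close()
```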