A Newcomer's Perspective on the Data Cleaning Process
Hello readers! My name is Urmimala Mandal, and I have just embarked on a journey into the world of data. As I take my first steps, I find myself standing at the threshold of a fascinating yet somewhat intimidating realm. With a background in mathematics and computer science and a passion for uncovering meaningful insights from data, I am excited to share my experiences and discoveries as a newcomer. I have faced various challenges along the way and thought I would share them with you in writing. Some I have overcome through lots of trial and error and with the help of various e-learning platforms; some are still with me. My objective in writing is to get help, advice, and insight while navigating these issues, and to help fellow beginners who may face the same ones.

Today's focus is data cleaning, a crucial step in preparing data for processing. At first, I questioned why it was necessary to dedicate so much time and effort to this seemingly mundane task. However, I soon grasped its pivotal importance. I will not lie: at the beginning, I tended to delete troublesome rows out of frustration, just to make the dataset look the way I wanted. What I came to understand within a few months is that a proper cleaning process produces a dataset that is as accurate and consistent as possible, with as few errors as possible, which simplifies the analysis and keeps it aligned with predefined objectives.
As I dived into the vast sea of data, Excel became my new best friend. Its familiarity as part of the Microsoft family and its user-friendly interface made the process less daunting. Then one fine day I met the dexterous and slick beauty named Power Query, an advanced data cleaning tool in Excel. This tool opened up new horizons for me as a newcomer. Power Query is also available in Power BI, Microsoft's visualization tool. With my few months of experience, I have not yet found any differences between the Power Query tools in these two Microsoft products; I expect I will discover some down the line. Tableau, another popular visualization tool (now owned by Salesforce), comes with an automatic data cleaning feature, Data Interpreter, which I found very easy to use. I also explored other accessible data cleaning tools, such as Google Sheets and OpenRefine (formerly known as Google Refine), though I have not used them as extensively as the tools described above. For those with some knowledge of basic coding in Python, the Pandas library provides extensive capabilities for data cleaning and transformation. In my limited experience so far, the drawback is that Pandas has felt slower to work with than the other tools.

Regardless of the tool, backing up the dataset at each step has been one of the most rewarding habits for me; I learned this from several mistakes. Documentation is another way to keep track of what was done to the original dataset, and it also helps with transparency and with future cleaning or analysis work by others.

However, I am still working out how to handle missing data in the cleaning process. This part has been the most challenging for me so far. I am looking for answers and insight into the two popular methods of handling missing data: removal and imputation.
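To make the two options concrete, here is a minimal sketch in Pandas. It assumes a small made-up table; the column names and values are hypothetical and not taken from any real dataset. Using the median and the most frequent value as fill-ins is only one possible choice among many, not a recommendation.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value.
raw = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dina"],
    "age": [25.0, None, 31.0, 40.0],
    "city": ["Pune", "Leeds", None, "Pune"],
})

# Count missing values per column before deciding anything.
missing_counts = raw.isna().sum()

# Option 1, removal: drop every row with at least one missing value.
# dropna returns a new DataFrame, so `raw` stays intact as a backup.
removed = raw.dropna()

# Option 2, imputation: fill each gap with a substitute value.
# Work on a copy so the original data is never overwritten.
imputed = raw.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(missing_counts)
print(removed)
print(imputed)
```

Removal keeps only fully observed rows, at the cost of losing half of this tiny table, while imputation keeps every row but introduces values that were never actually recorded. Keeping `raw` untouched also follows the backup habit described above.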
Having just begun my journey in this field, I understand the importance of sharing insights from a beginner's standpoint. Through this article, I hope to share my experiences and newfound insights into the art of preparing data for analysis. If you are a fellow newcomer, please share your experience with the data cleaning process with me, and if you are a seasoned data analyst, please share your comments, suggestions, and insight as well.