Data Cleaning and Integration using Microsoft Excel
Most data sets need cleaning before analysis, a process known as data preparation. The quality of the data is critical for meaningful results: wrong entries, duplicates and inconsistent values can distort an analysis, even when you are dealing with small or medium-sized data sets. In this article, I describe a six-step approach to data preparation and the Excel functions that will help you carry it out.
Data Import
Importing and exporting data in Excel may seem complicated, but it’s actually pretty easy once you’ve done it a couple of times. When you import data, make sure your entries are cleanly delimited. Excel offers several delimiter options during import that help ensure your data is split into the right columns. Excel also provides three functions (LEFT, RIGHT and MID) that extract a specified number of characters from the text in a cell.
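As a small illustration of the three extraction functions, assume (hypothetically) that cell A2 holds the text INV-2023-07 after import. The cell reference and the sample value are made up for this sketch, and your Excel may expect semicolons instead of commas between arguments, depending on regional settings.

```excel
=LEFT(A2, 3)
=MID(A2, 5, 4)
=RIGHT(A2, 2)
```

With that sample value, the first formula returns INV (the first three characters), the second returns 2023 (four characters starting at position 5), and the third returns 07 (the last two characters). All three results are text, even when they look like numbers.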
Data Cleaning
If the data set contains unwanted leading or trailing spaces, Excel provides functions for that too. The TRIM function removes leading and trailing spaces from text and also "normalizes" multiple spaces between words to a single space character. If you also need to remove line breaks from cells, you can add the CLEAN function. CLEAN does not remove every non-printing character, however, notably the non-breaking space, which appears in Excel as CHAR(160). By adding the SUBSTITUTE function to the formula, you can remove specific characters like this one as well.
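The three functions can be nested into one formula. The sketch below assumes the raw text sits in cell A2 (a hypothetical reference). It first replaces non-breaking spaces with ordinary spaces, then strips non-printing characters, then trims the result.

```excel
=TRIM(CLEAN(SUBSTITUTE(A2, CHAR(160), " ")))
```

The order matters: SUBSTITUTE runs first so that TRIM can treat the former non-breaking spaces like any other space. To strip a different character, swap CHAR(160) for that character and use "" as the replacement if it should disappear entirely.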
Data Formatting
To make sure that your data is analysed correctly, store it in the correct format. Formatting in Excel changes how a cell or its contents are displayed. There are two kinds of cell formatting: the content type of the cell (its number format) and its visual appearance. A number, for example, can be shown as a whole number, a number with decimals, a percentage or a currency amount. Dates can likewise be displayed in many different ways.
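Formats are usually set through the Format Cells dialog, but a few functions are handy when imported values arrive as the wrong type. The formulas below are a sketch with hypothetical cell references: A2 holds a number such as 0.256, B2 holds a real date, C2 holds a number stored as text, and D2 holds a date stored as text. The format codes assume an English-language Excel.

```excel
=TEXT(A2, "0.0%")
=TEXT(B2, "yyyy-mm-dd")
=VALUE(C2)
=DATEVALUE(D2)
```

TEXT returns a text string in the requested display format, so use it for labels rather than for values you still want to calculate with. VALUE and DATEVALUE go the other way and convert text into a number or a date serial number. Which text layouts they recognize depends on your regional settings.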
Correct Inconsistencies
Finding inconsistent data in a large spreadsheet can be time-consuming. The key is to apply a Filter to the column: the filter drop-down lists every unique value in the column, which makes it easier to spot and isolate incorrect values.
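A helper column can complement the filter. This is an additional technique, not a replacement for the filter. The sketch below assumes the values to check are in B2:B500 (a hypothetical range). The first formula counts how often each value occurs, so rare spellings stand out, and the second produces a normalized version of the entry.

```excel
=COUNTIF($B$2:$B$500, B2)
=PROPER(TRIM(B2))
```

Note that COUNTIF ignores case, so "texas" and "Texas" are counted together. It helps with misspellings and variants more than with capitalization. PROPER capitalizes the first letter of each word, which may not suit every field, abbreviations for example.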
Combine Data Sets
If you want to analyze more than a single data set, the next step in getting your data ready for analysis is to combine the sets. You could simply copy columns from one sheet to another if both are sorted identically, but it is more reliable to identify match criteria and pull data from one table into another based on them. If you have one match criterion that is unique for each row of your data set (a so-called unique identifier), you can use it to perform a lookup in Excel, for example with VLOOKUP. If you have more than one match criterion, you have to use a combination of INDEX and MATCH.
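Both cases can be sketched as follows. The sheet name Sheet2 and all ranges are hypothetical, and VLOOKUP is used here as one common way to do the single-criterion lookup. In the first formula, the unique identifier is in A2 and in column A of Sheet2. In the second, the two criteria are in A2 and B2 and in columns A and B of Sheet2. In both, the value to pull is in column C of Sheet2.

```excel
=VLOOKUP(A2, Sheet2!$A$2:$C$100, 3, FALSE)
=INDEX(Sheet2!$C$2:$C$100, MATCH(1, (Sheet2!$A$2:$A$100=A2)*(Sheet2!$B$2:$B$100=B2), 0))
```

The FALSE argument forces an exact match, which is usually what you want with identifiers. The second formula multiplies the two comparisons so that only rows meeting both criteria produce 1. In Excel versions without dynamic arrays it has to be entered as an array formula with Ctrl+Shift+Enter. Both formulas return #N/A when no match is found.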
Remove Duplicates
Duplicates arise when records are entered into the spreadsheet more than once, or when you combine records from multiple sources. You might also find repeated values in one or more fields, such as a name or state field, and want to keep only the records that are unique for those fields. You can remove duplicates directly with Excel’s Remove Duplicates command on the Data tab of the ribbon, which lets you choose the columns to compare.
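Because Remove Duplicates deletes rows immediately, it can be useful to flag duplicates in a helper column first and review them. The formulas below are a sketch that assumes the data starts in row 2, with the fields to compare in columns A and B (hypothetical references). Enter the formula in row 2 and fill it down.

```excel
=COUNTIF($A$2:A2, A2)>1
=COUNTIFS($A$2:A2, A2, $B$2:B2, B2)>1
```

The first formula checks a single field, the second a combination of two fields. Because the range grows as the formula is filled down, the first occurrence of a value shows FALSE and every later repeat shows TRUE, so you can filter on TRUE to inspect the rows that would be removed.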
You will probably have plenty of questions about which function best serves your data integration task. Excel has many powerful functions that can save you time and get you the results you need. It’s all about finding the right one at the right time.