Data normalization

This post continues last week’s theme on matching and false positives.  Data normalization is a term you might not be familiar with, but it is key to virtually every industrial-scale matching exercise, and is in fact fundamental to database theory in general.  Anyone who has tried an analytic involving phone numbers is probably already familiar with the concept: try determining the area code when the numbers haven’t been strictly formatted.  One entry might be “(604) 437-7873” while another might be “+1 604 4377873” or “604-437-7873”.  They are obviously the same number, but they are still hard to match.
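The fix is to strip the formatting before comparing.  Here is a minimal Python sketch of the idea (the function name and the choice to drop a leading country code of 1 are my own illustration, not a feature of any particular product):

```python
import re

def normalize_phone(raw: str) -> str:
    """Keep only the digits, then drop a leading North American country code."""
    digits = re.sub(r"\D", "", raw)  # remove everything that isn't a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

# All three formats from the example collapse to the same key.
variants = ["(604) 437-7873", "+1 604 4377873", "604-437-7873"]
print({normalize_phone(v) for v in variants})  # {'6044377873'}
```

Once every number is reduced to the same ten digits, the area code is simply the first three characters.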

As discussed last week, the idea with data normalization is to eliminate as many “insignificant” differences as possible.  This lets you (and the processing) focus instead on the “significant” differences.  Data normalization is key whenever you are comparing or joining data, particularly from different sources.  You might want to match addresses or names (that you purchased, or perhaps collected from your web site) to your own customers, for example, or you might be looking for duplicates within a single database.

OK, what exactly do I mean by "insignificant" differences?  In my phone number example the difference is mostly formatting (and is easily addressed with an Include function), but normalization is much broader, and typically much harder.  I covered capitalization, punctuation and the like in the previous post, but a particularly vexing area is abbreviations and “noise” words.  The address "123 Main Street" and "123 Main St." clearly refer to the same address, but matching them automatically can be challenging.  Similarly, just imagine trying to match the identical addresses "210-123 W Ridge Rd." and "123 Ridge Road West, Appt. 210".

This is hard for a number of reasons.  First, there are literally hundreds of abbreviations in addresses (street, road, drive, …; the list we use from the USPS includes about 500 variations).  There is also a significant number of noise words (appt, unit, floor, …) that may not be significant for matching.  Second (and very importantly), you need to apply this analysis on a “whole word” basis.  If you simply Replace(address, “ROAD”, “RD”) then “BROADWAY” becomes “BRDWAY”, and if you Omit(address, “FLOOR”) then “JONES FLOORING” becomes “JONES ING”.  Third, the components of each address might appear in a different order, as in the example in the previous paragraph.
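To see why whole-word handling matters, here is a small Python sketch contrasting a naive substring replace with a whole-word one (the three-entry table is a made-up fragment for illustration; a real USPS-derived list would hold roughly 500 entries):

```python
import re

# Hypothetical miniature abbreviation table, for illustration only.
ABBREVIATIONS = {"STREET": "ST", "ROAD": "RD", "WEST": "W"}

def replace_whole_words(address: str, table: dict) -> str:
    """Canonicalize abbreviations, matching complete words only."""
    def canon(match):
        word = match.group(0)
        return table.get(word, word)
    return re.sub(r"[A-Z0-9]+", canon, address.upper())

# A substring replace corrupts embedded words:
print("123 BROADWAY".replace("ROAD", "RD"))                   # 123 BRDWAY
# Whole-word replacement leaves them intact:
print(replace_whole_words("123 BROADWAY", ABBREVIATIONS))     # 123 BROADWAY
print(replace_whole_words("123 Main Street", ABBREVIATIONS))  # 123 MAIN ST
```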

Creating expressions that correctly handle whole-word replacements or omissions can be more challenging than you might think.  For example, you might simplistically Omit(address, “FLOOR ”), adding a blank after the item to be omitted.  This isn’t a bad approach, but it will fail if the omitted term is the last thing on the line, as in “123 Main St., 2nd Floor”, or is merely the last part of a longer word in the address.  So a more complex process may be appropriate, and it must then be repeated perhaps many hundreds of times.
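One way to make the omission robust is a regex word boundary rather than a trailing blank.  A Python sketch of the technique (an illustration only, not how any particular product implements it):

```python
import re

def omit_word(address: str, word: str) -> str:
    """Remove a noise word on a whole-word basis, wherever it appears."""
    # \b matches a word boundary, so a trailing "Floor" is still removed,
    # while "FLOORING" (no boundary after FLOOR) is left alone.
    cleaned = re.sub(rf"\b{word}\b", "", address, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", cleaned).strip()  # tidy leftover blanks

# The trailing-blank approach misses a term at the end of the line:
print("123 Main St., 2nd Floor".replace("Floor ", ""))  # unchanged
# The word-boundary version handles both cases:
print(omit_word("123 Main St., 2nd Floor", "FLOOR"))    # 123 Main St., 2nd
print(omit_word("JONES FLOORING", "FLOOR"))             # JONES FLOORING
```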

And that’s just for addresses.  Virtually any complex field has the same issues.  For example, names can be similarly abbreviated (Dick vs. Richard, or Bob vs. Robert).  Our “nickname” lists include over 700 entries each, for men and women.  Even different types of names have different characteristics.  With personal names you might want to omit “Jr.” and “Junior”, but for company names it might be “Inc.” or “LLP”.  With a female name, “Sam” might be “Samantha”, whereas with a male name it might be “Samuel”.  Normalization isn’t as easy as it first sounds, at least not to do well.
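A nickname table can be sketched the same way.  The fragment below is entirely made up for illustration (real lists hold 700+ entries each); note how a gender column is needed to disambiguate entries like “Sam”:

```python
# Hypothetical fragment of a nickname table: (nickname, gender) -> canonical name.
NICKNAMES = {
    ("DICK", "M"): "RICHARD",
    ("BOB", "M"): "ROBERT",
    ("SAM", "M"): "SAMUEL",
    ("SAM", "F"): "SAMANTHA",
}

def normalize_given_name(name: str, gender: str) -> str:
    """Map a nickname to its canonical form; unknown names pass through."""
    key = name.strip().upper()
    return NICKNAMES.get((key, gender), key)

print(normalize_given_name("Sam", "F"))   # SAMANTHA
print(normalize_given_name("Sam", "M"))   # SAMUEL
print(normalize_given_name("Mary", "F"))  # MARY
```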

As with noise words, abbreviations can be addressed with a large set of (somewhat complex) expressions (generally using “Replace” instead of “Omit”), but maintaining these can be a nightmare, as it embeds the parameters in your script logic.  I think you can imagine what is involved in addressing the many hundreds of required transformations, and re-ordered components (“123 W Main St.” vs. “123 Main St. W”) are particularly vexing.
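Re-ordered components yield to a simple idea: canonicalize each token, drop the noise words, then sort the tokens so their order no longer matters.  A rough Python sketch of that idea (the tiny table is invented for illustration, and this is not a description of how Analyzer implements its normalization internally):

```python
import re

# Hypothetical fragment: abbreviations map to a canonical form,
# noise words map to the empty string (i.e., they are dropped).
CANON = {"STREET": "ST", "ROAD": "RD", "WEST": "W", "SUITE": ""}

def sort_normalize(address: str) -> str:
    """Canonicalize, drop noise words, and sort so token order is irrelevant."""
    tokens = re.findall(r"[A-Z0-9]+", address.upper())
    canon = [CANON.get(t, t) for t in tokens]
    return " ".join(sorted(t for t in canon if t))

print(sort_normalize("123 W Main St.") == sort_normalize("123 Main St. W"))  # True
```

Sorting is a blunt instrument on its own (it discards word order entirely), which is why the replacement and omission lists have to be right first.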

Imagine, if you will, that you could automatically perform all of these normalizations, and more.  Normalization so powerful that two addresses as different as "#200 - 1234 W Main Street" and "1234 MAIN ST. WEST, Suite 200" match perfectly every time, and with just one command!  This is exactly what Arbutus Analyzer offers.  Analyzer’s ground-breaking Normalize and SortNormalize functions automatically apply all of the above transformations and more, ensuring maximum performance for your matching and joining analytics.  A computed field as simple as: SortNormalize(address, “USPS”) will accomplish this remarkably powerful normalization, which can then be used to dramatically improve your results when matching addresses.

Analyzer’s proprietary normalization technology is fully user-configurable, automatically whole-word based, and allows either the replacement or elimination of any words or abbreviations.  As well, it is designed for industrial use, so you can easily handle hundreds (or even thousands) of distinct terms, and create separate lists for each distinct type of normalization.  Finally, since the normalization terms are kept distinct from the scripts, they are easy to maintain.

You won’t believe how easy normalization can be, and how valuable the results, until you try it.

