Data normalization

This post continues last week’s theme on matching and false positives.  Data normalization is a term you might not be familiar with, but it is key to virtually every industrial-scale matching exercise, and is in fact fundamental to database theory in general.  Anyone who has tried an analytic involving phone numbers is probably already familiar with the concept: try determining the area code when the numbers haven’t been strictly formatted.  One entry might be “(604) 437-7873” while another might be “+1 604 4377873” or “604-437-7873”.  They are obviously the same number, but they are still hard to match.
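The fix is to strip the formatting before comparing.  Here is a minimal Python sketch of the idea (the function name and the choice to drop a leading country code of 1 are my own illustration, not a feature of any particular product):

```python
import re

def normalize_phone(raw: str) -> str:
    """Keep only the digits, then drop a leading North American country code."""
    digits = re.sub(r"\D", "", raw)  # remove everything that isn't a digit
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

# All three formats from the example collapse to the same key.
variants = ["(604) 437-7873", "+1 604 4377873", "604-437-7873"]
print({normalize_phone(v) for v in variants})  # {'6044377873'}
```

Once every number is reduced to the same ten digits, the area code is simply the first three characters.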

As discussed last week, the idea with data normalization is to eliminate as many “insignificant” differences as possible.  This lets you (and the processing) focus instead on the “significant” differences.  Data normalization is key whenever you are comparing or joining data, particularly from different sources.  You might want to match addresses or names (that you purchased, or perhaps collected from your web site) to your own customers, for example, or you might be looking for duplicates within a single database.

OK, what exactly do I mean by "insignificant" differences?  In my phone number example the difference is mostly formatting (and is easily addressed with an Include function), but normalization is much broader, and typically much harder.  I covered capitalization, punctuation and the like in the previous post, but a particularly vexing area is abbreviations and “noise” words.  The address "123 Main Street" and "123 Main St." clearly refer to the same address, but matching them automatically can be challenging.  Similarly, just imagine trying to match the identical addresses "210-123 W Ridge Rd." and "123 Ridge Road West, Appt. 210".

This is hard for a number of reasons.  First, there are literally hundreds of abbreviations in addresses (street, road, drive, …; the list we use from the USPS includes about 500 variations).  There is also a significant number of noise words (appt, unit, floor, …) that may not be significant for matching.  Second (and very importantly), you need to apply this analysis on a “whole word” basis.  If you simply Replace(address, “ROAD”, “RD”) then “BROADWAY” becomes “BRDWAY”, and if you Omit(address, “FLOOR”) then “JONES FLOORING” becomes “JONES ING”.  Third, the components of each address might appear in a different order, as in the example in the previous paragraph.
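To see why whole-word handling matters, here is a small Python sketch contrasting a naive substring replace with a whole-word one (the three-entry table is a made-up fragment for illustration; a real USPS-derived list would hold roughly 500 entries):

```python
import re

# Hypothetical miniature abbreviation table, for illustration only.
ABBREVIATIONS = {"STREET": "ST", "ROAD": "RD", "WEST": "W"}

def replace_whole_words(address: str, table: dict) -> str:
    """Canonicalize abbreviations, matching complete words only."""
    def canon(match):
        word = match.group(0)
        return table.get(word, word)
    return re.sub(r"[A-Z0-9]+", canon, address.upper())

# A substring replace corrupts embedded words:
print("123 BROADWAY".replace("ROAD", "RD"))                   # 123 BRDWAY
# Whole-word replacement leaves them intact:
print(replace_whole_words("123 BROADWAY", ABBREVIATIONS))     # 123 BROADWAY
print(replace_whole_words("123 Main Street", ABBREVIATIONS))  # 123 MAIN ST
```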

Creating expressions that correctly handle whole-word replacements or omissions can be more challenging than you might think.  For example, you might simplistically Omit(address, “FLOOR ”), adding a blank after the item to be omitted.  This isn’t a bad approach, but it will fail if the omitted term is the last thing on the line, as in “123 Main St., 2nd Floor”, or is merely the last part of a longer word in the address.  So a more complex process may be appropriate, and it must then be repeated perhaps many hundreds of times.
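One way to make the omission robust is a regex word boundary rather than a trailing blank.  A Python sketch of the technique (an illustration only, not how any particular product implements it):

```python
import re

def omit_word(address: str, word: str) -> str:
    """Remove a noise word on a whole-word basis, wherever it appears."""
    # \b matches a word boundary, so a trailing "Floor" is still removed,
    # while "FLOORING" (no boundary after FLOOR) is left alone.
    cleaned = re.sub(rf"\b{word}\b", "", address, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", cleaned).strip()  # tidy leftover blanks

# The trailing-blank approach misses a term at the end of the line:
print("123 Main St., 2nd Floor".replace("Floor ", ""))  # unchanged
# The word-boundary version handles both cases:
print(omit_word("123 Main St., 2nd Floor", "FLOOR"))    # 123 Main St., 2nd
print(omit_word("JONES FLOORING", "FLOOR"))             # JONES FLOORING
```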

And that’s just for addresses.  Virtually any complex field has the same issues.  For example, names can be similarly abbreviated (Dick vs. Richard, or Bob vs. Robert).  Our “nickname” lists include over 700 entries each, for men and women.  Even different types of names have different characteristics.  With personal names you might want to omit “Jr.” and “Junior”, but for company names it might be “Inc.” or “LLP”.  With a female name, “Sam” might be “Samantha”, whereas with a male name it might be “Samuel”.  Normalization isn’t as easy as it first sounds, at least not to do well.
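A nickname table can be sketched the same way.  The fragment below is entirely made up for illustration (real lists hold 700+ entries each); note how a gender column is needed to disambiguate entries like “Sam”:

```python
# Hypothetical fragment of a nickname table: (nickname, gender) -> canonical name.
NICKNAMES = {
    ("DICK", "M"): "RICHARD",
    ("BOB", "M"): "ROBERT",
    ("SAM", "M"): "SAMUEL",
    ("SAM", "F"): "SAMANTHA",
}

def normalize_given_name(name: str, gender: str) -> str:
    """Map a nickname to its canonical form; unknown names pass through."""
    key = name.strip().upper()
    return NICKNAMES.get((key, gender), key)

print(normalize_given_name("Sam", "F"))   # SAMANTHA
print(normalize_given_name("Sam", "M"))   # SAMUEL
print(normalize_given_name("Mary", "F"))  # MARY
```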

As with noise words, abbreviations can be addressed with a large set of (somewhat complex) expressions (generally using “Replace” instead of “Omit”), but maintaining these can be a nightmare, as it embeds the parameters in your script logic.  I think you can imagine what is involved in addressing the many hundreds of required transformations, and re-ordered components (“123 W Main St.” vs. “123 Main St. W”) are particularly vexing.
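Re-ordered components yield to a simple idea: canonicalize each token, drop the noise words, then sort the tokens so their order no longer matters.  A rough Python sketch of that idea (the tiny table is invented for illustration, and this is not a description of how Analyzer implements its normalization internally):

```python
import re

# Hypothetical fragment: abbreviations map to a canonical form,
# noise words map to the empty string (i.e., they are dropped).
CANON = {"STREET": "ST", "ROAD": "RD", "WEST": "W", "SUITE": ""}

def sort_normalize(address: str) -> str:
    """Canonicalize, drop noise words, and sort so token order is irrelevant."""
    tokens = re.findall(r"[A-Z0-9]+", address.upper())
    canon = [CANON.get(t, t) for t in tokens]
    return " ".join(sorted(t for t in canon if t))

print(sort_normalize("123 W Main St.") == sort_normalize("123 Main St. W"))  # True
```

Sorting is a blunt instrument on its own (it discards word order entirely), which is why the replacement and omission lists have to be right first.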

Imagine, if you will, that you could automatically perform all of these normalizations, and more.  Normalization so powerful that two addresses as different as "#200 - 1234 W Main Street" and "1234 MAIN ST. WEST, Suite 200" match perfectly every time, and with just one command!  This is exactly what Arbutus Analyzer offers.  Analyzer’s ground-breaking Normalize and SortNormalize functions automatically apply all of the above transformations and more, ensuring maximum performance for your matching and joining analytics.  A computed field as simple as: SortNormalize(address, “USPS”) will accomplish this remarkably powerful normalization, which can then be used to dramatically improve your results when matching addresses.

Analyzer’s proprietary normalization technology is fully user-configurable, automatically whole-word based, and allows either the replacement or elimination of any words or abbreviations.  As well, it is designed for industrial use, so you can easily handle hundreds (or even thousands) of distinct terms, and create separate lists for each distinct type of normalization.  Finally, since the normalization terms are kept distinct from the scripts, they are easy to maintain.

You won’t believe how easy normalization can be, and how valuable the results, until you try it.

