The big challenge of Small Data

We're getting used to dealing with Big Data in our organizations. It has its challenges -- Big Data can be too big to ship around easily, too big to back up, too big to aggregate with traditional tools, too big to join quickly to the metadata we'd like to use. But on the other hand, since it's big and important, you usually know where it is. A typical Big Data item in my sector, banking, might consist of a year's worth of end-of-day risk measures under various scenarios, and it might have a life cycle roughly like this:

Figure 1: End of day risk data flow (greatly simplified)

It's not an easy pipeline to manage, but at least there's only one of it.

Big Data is a challenge, but in my work in data management and governance, I've probably seen more actual money and person-hours wasted by Small Data than by Big. Small Data consists of data sets that are small enough to:

  • Distribute by email (effectively creating a new copy each time)
  • Edit by hand (effectively creating an ungoverned, unmanaged change history)
  • Join, filter and structure via Excel and other end user tools (effectively creating undocumented lineage)

A typical Small Data item in my sector might consist of, say, a list of the ISO country codes, and it might have a life cycle something like this:

Figure 2: ISO country code list data flow (even more greatly simplified)

Data such as this is small yet highly intractable, rather like a really angry shrew. We can be fairly sure there's only one set of end-of-day risk in the enterprise, but we can be fairly sure there are many, many lists of country codes. And each of those unmanaged, uncounted lists can potentially cause data quality issues in our Big Data, further down the pipeline.

Managing Small Data is hard. On the technical side, one useful step is to centralize, so that at least the emailing of files and the storing of them on local laptops is curtailed. But it's not really a technical problem. Unlike Big Data, Small Data can go through its life cycle without IT checks and balances and without passing through a chain of formally owned data services; instead, the links in the chain are individual people.

And that makes managing Small Data a cultural and operating model challenge rather than primarily a technical one. Once that's understood, it's possible to gradually rein in the Small Data within an organization.

I'd write more on this, but I don't have time at the moment -- I have to pull together a single, true, final list of our currency codes. When I'm done, I'll mail it out to everyone; that should solve the problem for a while.


More articles by Benjamin Peterson
