The Data Lifecycle: Data Management in the Enterprise

Overview

In the modern enterprise, effective use of data to run operations and improve business results is a fundamental competency. Yet the volume and diversity of data, and the sheer number of available solutions, have conspired to make data management a significant source of complexity and even risk. This brief summarizes the main data problems facing enterprises, especially emerging ones; reviews the types and sources of data; discusses ways in which the diversity of data can be handled; and outlines some practical approaches.

In this brief, no attempt is made to discuss database technology in detail; instead, the focus is on providing a broad perspective on enterprise data and presenting a framework. With that context set, decision makers can better understand how to direct their organizations’ priorities.

The Data Challenge for Emerging Enterprises

Data has become the lifeblood of most enterprises, both small and large. Data about users, customers, operations, resources and other activities helps an enterprise add value, maintain competitive advantage and grow. Data comes from many sources, but most importantly from an enterprise’s own, usually proprietary, listening systems: its website, customer call centers, product instrumentation, sales systems and so on. It also comes from third parties and intermediaries, such as agencies and tracking systems.

The problems with all this data boil down to four basic issues—the 4 V’s:

  • Volume: The sheer quantity of data requires designing appropriate repositories to consume and manage it within appropriate performance or service level standards. Enterprises find that technologies that work well with smaller data sets may not scale up cost-effectively, or sometimes at all.
  • Variety: Data might be unstructured (for example, raw text, Twitter feeds, audio, etc.), semi-structured or structured. To derive insights from it, the data must either be transformed to give it a coherent structure or managed in an entirely different way using unstructured approaches.
  • Veracity: Data collection is increasingly automated, but there are still potential problems with the correctness of the data, such as quality issues, missing values, redundancy and uncertain pedigree. In most cases, it is necessary to develop processes for cleansing data and enhancing its quality.
  • Velocity: Enterprise data flows in streams that can swell and subside, sometimes quite dramatically. When the flow accelerates, legacy systems may struggle to keep up; when it slackens, they can be prohibitively expensive to operate. More importantly, as the velocity of data changes, the rate at which insights can be found should also change, allowing for faster response times.

Sources of Data

Transactional Data

Since the emergence of business computing more than 50 years ago, the volume of data generated by applications in finance, manufacturing, commerce, business operations and other areas has risen dramatically. Most companies are extensive users of business applications that interface with ERP, CRM, SCM and other systems. The data generated reflects the ebb and flow of a business’s activities as it interacts with customers, vendors and partners. For example, one typical kind of transactional data is sales orders from an online Ecommerce website.

This data is considered transactional since it arises from the operations of the business. The bulk of transactional data is organized into relational databases, which are highly structured and defined by a schema, though some transactional data can be unstructured. The most common problems with transactional data are its volume and velocity. A key requirement is that transactional data be collected from business operations efficiently, with minimal (or in some cases no) error.
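
As a minimal sketch of the idea, the snippet below records a sales order inside a database transaction so that a failure leaves no partial record behind; the table and column names are invented for illustration, and SQLite stands in for a production-grade RDB.

```python
import sqlite3

# Hypothetical orders table; a real system would use a server-class RDB.
conn = sqlite3.connect("shop.db")
conn.execute("""CREATE TABLE IF NOT EXISTS orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    total_cents INTEGER NOT NULL,
    created_at  TEXT NOT NULL
)""")
conn.commit()

def record_order(customer_id: int, total_cents: int) -> None:
    """Insert an order atomically: commit on success, roll back on error."""
    with conn:  # opens a transaction; commits, or rolls back on exception
        conn.execute(
            "INSERT INTO orders (customer_id, total_cents, created_at) "
            "VALUES (?, ?, datetime('now'))",
            (customer_id, total_cents),
        )

record_order(customer_id=42, total_cents=1999)
```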

Unstructured Data

Unstructured data, though prevalent, is a relative newcomer to the data management scene. In the beginning, it was hardly considered worthwhile (or even possible) to collect, since storage was too expensive to expend on something of uncertain value. As the cost of permanent storage declined, especially after the 1990s, cost was no longer the prime obstacle, though the value of the data was still unclear. With the emergence of standards and tools to organize the data, this constraint too was lifted. One of the first, and still most powerful, tools for bringing out the value of unstructured data was, of course, search.

Unstructured data is more accurately described as data that may be either schema-less or schema-ful. Schema-less data is often visualized as key-value pairs; its main characteristic is that it doesn’t conform to a predefined, fixed pattern or schema. The main advantage of schema-less data is the dynamic way in which the data store can be constructed, adding more data types as they are encountered. An example of schema-less data might be the sentences in a verbatim response to a survey. JSON is a popular standard for structuring schema-less data.
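
As a small illustration of schema-less data, the sketch below stores survey responses as JSON documents; the field names are invented, and the point is that records need not share a fixed shape.

```python
import json

# Two survey responses with no fixed schema: fields vary per record.
responses = [
    {"respondent": "r-001", "verbatim": "Checkout was confusing.", "nps": 6},
    {"respondent": "r-002", "verbatim": "Love the new app!",
     "channel": "mobile", "tags": ["praise", "app"]},
]

# A schema-less store can persist each document as-is...
with open("responses.jsonl", "w") as f:
    for doc in responses:
        f.write(json.dumps(doc) + "\n")

# ...and readers cope with missing keys at read time rather than schema time.
with open("responses.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        print(doc["respondent"], doc.get("nps", "no score"))
```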

Schema-ful data is often associated with relational databases, but it also includes schema-driven data structures such as XML with XSD (perhaps a bit confusingly, XML without a corresponding XSD can be considered schema-less). An example of schema-ful data is a structure such as customer (which could include name, address and phone number) or order (which could include an order number, a reference to a product and other information). Schema-ful data requires more care in design, but it better supports queries, transactional requirements such as consistency, and operations such as joining data sets. All four of the V’s are at play with unstructured data.
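
To make the contrast concrete, here is a minimal sketch of schema-ful data, using the customer and order structures mentioned above (table and column names are invented, and SQLite is used for brevity); the predefined schema is what makes the join dependable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    phone       TEXT
);
CREATE TABLE "order" (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    product     TEXT NOT NULL
);
INSERT INTO customer VALUES (1, 'Ada Lovelace', '555-0100');
INSERT INTO "order" VALUES (10, 1, 'Analytical Engine');
""")

# The fixed schema is what makes this join possible and reliable.
for name, product in conn.execute("""
    SELECT c.name, o.product
    FROM customer c JOIN "order" o ON o.customer_id = c.customer_id
"""):
    print(name, product)
```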

Warehouse Data

Data warehouses are built from highly structured, schema-ful data as well as from unstructured, schema-less data. Typically, transactional data from business systems is extracted, transformed and loaded (ETL) into a warehouse. In some cases it may be sufficient to extract and load first, deferring transformation to a later stage (ELT). Either way, the data is moved from one or more sources into a specially designed database, usually designated a warehouse (though there are intermediate concepts such as data marts). Usually, the data must be massaged, cleansed and summarized before it can be stored in the warehouse.
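
A minimal ETL sketch follows, reusing the hypothetical orders table from the earlier transactional example: rows are extracted from the operational store, summarized, and loaded into a warehouse table (all names are invented).

```python
import sqlite3

source = sqlite3.connect("shop.db")        # operational system (hypothetical)
warehouse = sqlite3.connect("warehouse.db")

warehouse.execute("""CREATE TABLE IF NOT EXISTS daily_sales (
    day TEXT, order_count INTEGER, revenue_cents INTEGER
)""")

# Extract (and, here, transform): summarize raw orders by day in SQL.
# Further cleansing (deduplication, fixing missing values) would go here.
rows = source.execute(
    "SELECT date(created_at), COUNT(*), SUM(total_cents) "
    "FROM orders GROUP BY date(created_at)"
).fetchall()

# Load: write the summarized rows into the warehouse table.
with warehouse:
    warehouse.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", rows)
```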

With warehouse data, the key issues are variety, veracity and velocity.

Backup Data

Transaction data from business operations and warehouse data must be backed up and stored in a safe environment so that they can be reconstituted should the need arise. The primary reason for managing backup data is business continuity and disaster recovery (BCDR). Equally important is the ability of an enterprise to use this data efficiently to restart operations in the event of a catastrophic failure.

Generally, all the data that an enterprise generates or transforms as part of its operations should be backed up. This is primarily a concern when the data is managed on-premise, though merely storing the data in the Cloud may not be sufficient to satisfy BCDR requirements. The key problem with backup data is its volume.

Managing the Diversity of Data

Although the sources or types of data often suggest the best means to manage it (e.g., structured data typically belongs in relational databases), in practice a heterogeneous approach is most common. In addition, few organizations have the luxury of starting from a clean slate; legacy data sources must often be supported and sustained.

Relational Databases

Since their emergence in the early 1980s, relational databases (RDBs) have become the standard database model, supplanting network databases and file-based systems. They are ideal for the transaction processing inherent in business operations. To facilitate this, an RDB is designed to reflect the semantics of a business problem; that is, it acts as a data model of the business world. For example, an RDB might model a business with tables representing business entities such as customers, sales, orders and products.

For various reasons, this semantic representation must then be “normalized” to eliminate redundancy in the data model (i.e., the same data element represented in more than one place). Normalization enhances data integrity (reducing the possibility that data becomes inconsistent under real-world conditions) and can improve update performance.
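
The sketch below illustrates the idea with invented tables: in the denormalized form the customer’s details repeat on every sale, so a change of address must touch many rows; the normalized form records them once and references them by key.

```python
import sqlite3

# Denormalized: customer details repeat on every sale (an update anomaly
# waiting to happen if the customer moves).
DENORMALIZED = """
CREATE TABLE sales (
    sale_id       INTEGER PRIMARY KEY,
    customer_name TEXT,
    customer_city TEXT,
    amount_cents  INTEGER
);
"""

# Normalized: customer details live in one place; sales reference the key.
NORMALIZED = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    city        TEXT
);
CREATE TABLE sales (
    sale_id      INTEGER PRIMARY KEY,
    customer_id  INTEGER REFERENCES customer(customer_id),
    amount_cents INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(NORMALIZED)
```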

Relational databases are ideal for applications such as Ecommerce, website content management, ERP, CRM and countless other business solutions. There are numerous strong RDBs in the market: Oracle Database, Microsoft SQL Server, MySQL, IBM DB2, Ingres and others.

Persistent Caches

It’s conceptually easier to architect applications in which databases appear monolithic and accessible on demand, with no latency or other dependencies. Unfortunately, real-world applications are typically highly distributed, whether because users are spread across more than one (or a few) locales or because the hosting strategy is deliberately decentralized. In addition, the nature of an enterprise’s application may require accessing data that is inherently dispersed, such as summarizing data from different company offices or stores.

To keep online systems performing well, caching can be applied at virtually every layer. Depending on the expected lifespan of a piece of content, caches can begin at the user’s device and end only inside the server responding to a request. For example, static content such as logos that change infrequently can be cached in a user’s browser; stock quotes or news headlines, on the other hand, can be cached on a content server for many users to see and then periodically refreshed (anywhere from every few seconds to every few hours).

A well-designed application necessarily takes a caching strategy into account, not only to deliver timely content to applications, but also to write through data that changes in the real world. In addition, most applications will have more than one cache; managing how these caches update and synchronize with each other and with the master data store is a key technical issue. The design, build and test of a caching solution can therefore be quite challenging, and depends heavily on business needs and many other factors.
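
As a minimal sketch of these ideas, the class below combines cache-aside reads (fill the cache on a miss) with write-through updates (write the master store first, then the cache), using per-entry TTLs; a Python dict stands in for the master database.

```python
import time

class WriteThroughCache:
    """Cache-aside reads plus write-through updates, with per-entry TTLs."""

    def __init__(self, backing_store: dict, ttl_seconds: float):
        self.store = backing_store          # stands in for the master database
        self.ttl = ttl_seconds
        self.entries = {}                   # key -> (value, expires_at)

    def get(self, key):
        value, expires_at = self.entries.get(key, (None, 0.0))
        if time.monotonic() < expires_at:
            return value                    # cache hit
        value = self.store[key]             # miss: fetch from the master store
        self.entries[key] = (value, time.monotonic() + self.ttl)
        return value

    def put(self, key, value):
        self.store[key] = value             # write through to the master first
        self.entries[key] = (value, time.monotonic() + self.ttl)

db = {"quote:ACME": 101.50}
cache = WriteThroughCache(db, ttl_seconds=5.0)
print(cache.get("quote:ACME"))   # first read misses, then is cached
cache.put("quote:ACME", 102.00)  # master store and cache stay in sync
```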

Examples of commercially available caches include Amazon ElastiCache, Redis and Memcached.

NoSQL Databases

NoSQL databases are often contrasted with relational databases, and indeed they are better suited to unstructured or less structured data. As noted in the discussion of unstructured data, though, NoSQL databases are not synonymous with schema-less (and therefore non-relational) databases. NoSQL databases are best suited to relatively simple data models where the application puts a premium on scalability, performance and availability. This is partly because NoSQL databases store and retrieve data efficiently, usually indexed with a system of lookup keys. They are often optimized for particular data models, such as columnar, document, key-value, graph and hybrids of these. Examples include Amazon DynamoDB and MemcacheDB (key-value) and MongoDB (document).
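
As a sketch of the key-value style, the snippet below uses boto3 against a hypothetical DynamoDB table named sessions with partition key session_id (both names are invented, and the table and AWS credentials are assumed to already exist): reads and writes are simple key lookups, and attributes can vary per item.

```python
import boto3

# Assumes an existing DynamoDB table "sessions" keyed by "session_id"
# (invented names) and configured AWS credentials.
table = boto3.resource("dynamodb").Table("sessions")

# Write: no schema beyond the key; attributes can vary from item to item.
table.put_item(Item={"session_id": "abc123", "user": "jdoe", "cart_size": 2})

# Read: a direct lookup by key, the operation NoSQL stores optimize for.
resp = table.get_item(Key={"session_id": "abc123"})
print(resp.get("Item"))
```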

Relational Warehouses

Data warehouses are typically constructed from relational databases; the relational model is flexible and well suited to such applications. A central concept in relational warehouses is to make the data readily available for rapid retrieval, but only along well-defined, prescribed lines. In this respect, getting the warehouse design right at the outset is essential, since it can be very difficult to recover from a design oversight or error. In addition, the data is usually not as highly normalized as it is in relational transactional models.

Warehouses are built around the facts that underlie the business being modeled, such as sales, purchase orders, shipments and payments. Each kind of fact is organized into its own fact table, which carries the numeric measures (such as sales amount). Associated with the facts are dimensions, which together characterize or fully describe them. For example, the sales fact table might refer to dimensions such as product, customer, salesperson and geography, all of which are aspects of a sale. It’s not uncommon for a fact to have 20 or 30 related dimensions.

Identifying the correct facts, measures and dimensions is a key part of the challenge and skill in designing a data warehouse. Warehouse data is distinct from transactional data in that it is not usually a direct output of business operations. The requirement for warehouse data is that it be easily accessed and efficiently pulled into reports or compiled into other insights.
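
A compact star-schema sketch follows, with an invented fact table and one dimension (SQLite again for brevity): the query aggregates a measure (sales amount) sliced by a dimension (product category), the typical shape of a warehouse report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    sale_id      INTEGER PRIMARY KEY,
    product_id   INTEGER REFERENCES dim_product(product_id),
    sale_date    TEXT,
    amount_cents INTEGER              -- the measure
);
INSERT INTO dim_product VALUES (1, 'hardware'), (2, 'software');
INSERT INTO fact_sales VALUES (1, 1, '2024-01-05', 5000),
                              (2, 2, '2024-01-05', 9900);
""")

# Typical warehouse query: aggregate a measure, sliced by a dimension.
for category, revenue in conn.execute("""
    SELECT p.category, SUM(f.amount_cents) / 100.0
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category
"""):
    print(category, revenue)
```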

Business Continuity and Disaster Recovery (BCDR)

BCDR is a complex and growing field that cannot be easily summarized. The main concepts, however, are to define an enterprise’s Recovery Point Objective (the maximum period for which data may be lost), its Minimum Acceptable Level of Service (the service level below which the business is effectively out of operation), and its Maximum Acceptable Outage (the longest tolerable loss of operations). Once these are defined, the recovery processes and the backup and recovery strategy can be determined.

The advent of Cloud services can address many of an enterprise’s BCDR needs. However, unless a service is fully managed, has its own BCDR strategy and is transparently tested, it may not be sufficient to ensure the enterprise’s own BCDR needs are met. Sometimes it’s necessary to create database snapshots and propagate them to different geographic regions within a Cloud provider’s network to explicitly ensure DR backups. Amazon AWS and other Cloud vendors provide infrastructure to support BCDR, but it’s incumbent on the enterprise to understand its business needs, to design and build a system of both process and technology to support BCDR, and to regularly audit, test and update that system.
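
As a sketch of the snapshot-propagation step, the snippet below uses boto3 to copy an RDS snapshot into a second region so that a regional outage cannot destroy both the database and its backups; all identifiers are invented, and credentials and IAM permissions are assumed to be in place.

```python
import boto3

# Copy an existing RDS snapshot from us-east-1 into us-west-2 (hypothetical
# account ID and snapshot names; assumes credentials and IAM permissions).
rds_west = boto3.client("rds", region_name="us-west-2")

rds_west.copy_db_snapshot(
    SourceDBSnapshotIdentifier=(
        "arn:aws:rds:us-east-1:123456789012:snapshot:orders-2024-01-05"
    ),
    TargetDBSnapshotIdentifier="orders-2024-01-05-dr-copy",
    SourceRegion="us-east-1",
)
```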

Practical Approaches

Most enterprises are heavily invested in and dependent on transactional data generated by business applications, stored transiently in caches and more persistently in relational databases. Increasingly, an enterprise’s data also arises as a by-product of its business, such as from listening systems in social media or the Internet of Things (IoT). All of this data has a role to play in helping the enterprise develop insights about its customers or business; one of the most common ways to achieve this is to marshal and summarize the transactional and other data into warehouses for analysis and reporting. In the enterprise information lifecycle, data is generated during the course of business, then summarized and analyzed to provide insights for the enterprise, thereby improving business operations.

When business data is managed in this way, internal users of the data can then count on a single, dependable source of truth upon which to make decisions, build plans and grow the business.

Hosted Databases

While enterprises have long been comfortable with on-premise database servers, especially for line-of-business applications, they have gradually moved their RDBs to the Cloud. The first step in this evolution was on-premise virtualization, which helped enterprises wean themselves away from a hardware-centric orientation. Then shared data centers (co-location facilities) further broke down the fear of losing direct physical control of hardware.

Virtualization in the Cloud was then a relatively painless transition, whereby the enterprise would still handle database software updates, replication, capacity planning, backups and patches, but no longer worry about hardware and infrastructure. The next step in this steady progression away from direct control is “Database as a Service” (DaaS), a fully managed service. With DaaS, the last remnants of the on-premise paradigm are shed and the enterprise focuses on the application. In this category, Microsoft Azure SQL Database, Amazon AWS RDS and Google Cloud SQL are all fully hosted DaaS offerings.

Hosted Caches

Caches have always been a critical part of practical computer systems architecture. The first caches were of course implemented in hardware. Indeed, the organization of computer systems could arguably be viewed as the efficient management of successively larger caches, from registers in a microprocessor to virtual caches in applications to content distribution networks to archival storage.

Designing an application around fast, highly available caches is a typical requirement for complex systems today. In the on-premise era, such an option was largely out of reach for all but the most sophisticated and well-resourced organizations. In recent years, open source projects such as Redis and Memcached have brought this technology to a wider range of developers. Further, the availability of scalable, hosted caches such as Amazon’s AWS ElastiCache or Microsoft’s Azure Redis Cache has made caching relatively easy to adopt. Indeed, no application developer should be satisfied with a design until they have considered how a database cache can alleviate bottlenecks and improve performance.
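
As a sketch, using the redis-py client against a placeholder endpoint (a hosted cache such as ElastiCache would simply supply a different host name), the snippet below caches a short-lived value with a TTL, as one might for a stock quote:

```python
import redis

# Placeholder endpoint; a hosted cache would supply its own host name.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Cache a frequently requested value with a 30-second time-to-live.
r.setex("quote:ACME", 30, "101.50")

print(r.get("quote:ACME"))  # fast path: served from the cache until expiry
```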

Scalable Warehouses

Once a data warehouse is designed and deployed, the practical challenge is scaling it efficiently and maintaining high levels of performance as it grows and business needs evolve. For an enterprise with users in different locales, global replication can be an added issue. In addition, optimizations for warehouse-oriented applications are usually needed to improve performance (sometimes by orders of magnitude) in terms of both time and cost. These optimizations include columnar storage, zone maps and data compression; parallelism and distributed computing are also often required at scale. These are complex, ever-evolving technologies for any enterprise to master. Warehouses delivered as a service can alleviate some of these challenges; among such services are Amazon AWS Redshift, Google Cloud BigQuery and Microsoft Azure SQL Data Warehouse.
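
To make a couple of these optimizations concrete, here is a hedged sketch of Redshift-style table DDL (invented table and column names): DISTKEY co-locates rows that join together, SORTKEY lets zone maps skip blocks during scans, and ENCODE applies columnar compression. Executing it requires a live cluster.

```python
# Redshift-style DDL illustrating warehouse optimizations (invented names).
CREATE_FACT_SALES = """
CREATE TABLE fact_sales (
    sale_id      BIGINT,
    product_id   INTEGER ENCODE az64,   -- columnar compression
    sale_date    DATE,
    amount_cents BIGINT  ENCODE az64
)
DISTKEY (product_id)                    -- co-locate rows joined on product
SORTKEY (sale_date);                    -- zone maps can skip date ranges
"""

# With a live cluster one would run it via a Postgres-protocol client, e.g.:
# import psycopg2
# conn = psycopg2.connect(host="<cluster-endpoint>", port=5439,
#                         dbname="analytics", user="...", password="...")
# conn.cursor().execute(CREATE_FACT_SALES)
```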

Conclusion

Data can transform the modern enterprise. To attain this transformation, enterprises must develop a data strategy and an architecture that manage the lifecycle of data as it wends its way through the enterprise. The problems posed by data range from purely transactional issues of capturing real-time events efficiently to processing data so that it yields information, and then insights, for the organization. Fortunately, the complexity of data management at scale can be reduced by leveraging the infrastructure and investments of the leading Cloud service providers. Even this approach, however, requires considerable expertise and a deep appreciation of the available technologies in order to make the optimal tradeoffs.
