Big Data
The data analysts at Data Crunchers introduced you to the concepts of analytics and the importance of visualizations. Now it is time to work alongside the company’s data engineers to learn about big data and the role engineers play in building and maintaining the organization’s data infrastructure and ensuring that it is available and reliable.
Big data is a term used to describe the massive volumes of digital data generated, collected, and processed. It refers to data that is moving too quickly, is too large, or is too complex to be stored, processed, or analyzed with traditional data storage and analytics applications. Examples of big data include postings to social media platforms such as Facebook and Twitter and the ratings given to products on e-commerce sites such as Amazon Marketplace.
Size is only one of the characteristics that define big data. Other criteria include the speed at which data is generated and the variety of data collected and stored.
Big Data Characteristics
Volume describes the amount of data transported and stored. According to International Data Corporation (IDC) experts, finding ways to process the ever-increasing amount of data generated each day is a challenge; they predict data volume will increase at a compound annual growth rate of 23% over the next five years. While traditional data storage systems can, in theory, handle large amounts of data, they struggle to keep up with the high-volume demands of big data.
Variety describes the many forms data can take, most of which are rarely in a ready state for processing and analysis. A significant contributor to big data is unstructured data, such as video, images, and text documents, which is estimated to represent 80 to 90% of the world’s data. These formats are too complex for traditional data warehouse architectures; unstructured data does not fit into the rows and columns of a traditional relational storage system.
Velocity describes the rate at which data is generated. For example, the data generated by a billion shares sold on the New York Stock Exchange cannot simply be stored for later analysis; it must be analyzed and reported immediately. The data infrastructure must instantly respond to the demands of applications accessing and streaming the data, because big data arrives continuously and analysis often needs to occur in real time.
Veracity refers to the accuracy and trustworthiness of data, and to the work of preventing inaccurate data from spoiling your data sets. For example, when people sign up for an online account, they often provide false contact information. Much of this inaccurate information must be “scrubbed” from the data before it can be used in analysis. Improving veracity at the point of collection reduces the amount of data cleaning required later.
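To make the idea of scrubbing concrete, here is a minimal sketch of a veracity check in Python using pandas. The sign-up data, column names, and validation rules are all hypothetical; a real pipeline would apply far more thorough checks.

```python
import pandas as pd

# Hypothetical sign-up records, some with obviously bogus contact information.
signups = pd.DataFrame({
    "email": ["ana@example.com", "asdf@asdf", "lee@example.org"],
    "phone": ["555-0100", "000-0000", "555-0199"],
})

# Keep rows whose email looks structurally valid and whose phone number
# is not an obvious placeholder.
valid_email = signups["email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", regex=True)
placeholder_phone = signups["phone"].eq("000-0000")
clean = signups[valid_email & ~placeholder_phone]
print(clean)
```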
Big Data Management
Data Pipelines
Putting all of this data to use requires managing it, and data engineers are the professionals who take on that management. Their work includes developing the infrastructure and systems to ingest data, clean and transform it, and finally store it in ways that make it easy for others in the organization to access and query it to answer business questions.
What is a data pipeline?
The best way to understand what data engineers do is to picture a data pipeline: data flows through it much like water flowing through pipes. The figure below is a simplified representation of data flowing through the three phases of a data pipeline: ingestion, transformation, and storage.
Data engineers ingest data from two primary kinds of sources: batches of data from servers or databases (batch ingestion) and real-time events streaming in from devices and applications (streaming ingestion). An example of batch ingestion is a game company that wants to examine the relationship between subscription renewals and customer support tickets. It could ingest all the related data on a daily or weekly basis; it doesn’t need to access and analyze data immediately after a support ticket is closed or a subscription is renewed. An example of streaming ingestion is requesting a ride from a ride-share service. The company combines streams of data (e.g., historical data, real-time traffic data, and location tracking) to make sure you get a ride from the driver who is closest to you at the time, as illustrated in the sketch below.
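The following Python sketch contrasts the two ingestion styles. The file path, field names, and the in-memory event generator are stand-ins for real sources such as a database export or a message queue.

```python
import pandas as pd

# Batch ingestion: pull an entire export (e.g., yesterday's support tickets) in one pass.
def ingest_batch(path: str) -> pd.DataFrame:
    # In a real pipeline this might read from a database or object store instead of a file.
    return pd.read_csv(path)

# Streaming ingestion: handle events one at a time as they arrive.
def ingest_stream(events):
    # `events` stands in for a real source such as a message queue or event stream.
    for event in events:
        yield {"rider_id": event["rider_id"], "location": event["location"]}

# Example usage with a simulated stream of ride-share location events:
simulated_events = [
    {"rider_id": 1, "location": (40.71, -74.01)},
    {"rider_id": 2, "location": (40.76, -73.98)},
]
for record in ingest_stream(simulated_events):
    print(record)
```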
After housing the ingested data in temporary storage, we’re ready to go, right? Well, not quite. Data nearly always needs to be transformed to be useful for later analyses, and there are two main issues to deal with. One, data often needs to be cleaned up: values may be missing, dates may be in the wrong format, and data quickly becomes outdated (you might have gathered data on individuals who have since changed roles or companies).
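Here is a minimal cleaning sketch for that kind of problem, assuming pandas 2.x and a small, hypothetical table of contacts with a missing value and inconsistent date formats.

```python
import pandas as pd

# Hypothetical raw records with missing values and mixed date formats.
contacts = pd.DataFrame({
    "name": ["Ana", "Ben", None],
    "company": ["Acme", None, "Globex"],
    "last_contacted": ["2023-01-15", "Feb 20, 2023", "2023-03-01"],
})

contacts = contacts.dropna(subset=["name"])                  # drop rows missing a key field
contacts["company"] = contacts["company"].fillna("Unknown")  # fill non-critical gaps
contacts["last_contacted"] = pd.to_datetime(                 # normalize mixed date formats
    contacts["last_contacted"], format="mixed"               # format="mixed" needs pandas 2.x
)
print(contacts)
```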
The other major issue involves transforming your data so that its structure supports the analyses you want to run. For example, you might want to identify your company’s best-selling products each month, but the data may only contain each product’s sale date. You would need to transform the data by creating, for example, a sales-per-month variable, as in the sketch below.
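One way to build that variable with pandas is shown here. The sales records, product names, and dates are hypothetical; the point is the reshaping step itself.

```python
import pandas as pd

# Hypothetical sales records that only contain a product name and a sale date.
sales = pd.DataFrame({
    "product": ["widget", "widget", "gadget", "widget"],
    "sale_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-30", "2024-02-03"]),
})

# Derive a month from each sale date, then count sales per product per month.
sales["month"] = sales["sale_date"].dt.to_period("M")
sales_per_month = (
    sales.groupby(["product", "month"])
         .size()
         .reset_index(name="sales_per_month")
)
print(sales_per_month)
```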
After transformation, data needs to be stored in places and formats that make it easy for analysts to run reports on weekly sales and for data scientists to build predictive recommendation models. Storage also involves data security: managing access so that people who should be able to reach the data can do so efficiently, while people who shouldn’t are kept out.
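As one illustration of analysis-friendly storage, the sketch below writes the transformed table to Parquet files partitioned by month, so a monthly report only scans the data it needs. It assumes the pyarrow library is installed, and the output path is hypothetical.

```python
import pandas as pd

# The transformed table from the previous sketch, recreated here so this runs on its own.
sales_per_month = pd.DataFrame({
    "product": ["gadget", "widget", "widget"],
    "month": ["2024-01", "2024-01", "2024-02"],
    "sales_per_month": [1, 2, 1],
})

# One folder of Parquet files per month keeps monthly queries fast.
sales_per_month.to_parquet(
    "warehouse/sales",          # hypothetical location in shared storage
    partition_cols=["month"],
    index=False,
)
```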
There are two primary locations for businesses to store their data: on-premises or in the cloud. Often, companies use a hybrid of both.
The term “on-premises” refers to servers and infrastructure that an organization owns and operates, usually physically on site. In the past, on-premises storage was the only option available for storing data: the organization deployed more servers as storage needs increased, and over time organizations had entire rooms or data centers full of servers hosting the databases that stored all the data. This model carried significant direct costs for server hardware and licenses, as well as indirect costs for power, cooling, and off-site backup services. The company also had to keep IT staff on hand to maintain and manage the servers.
Today, however, businesses are increasingly moving their data storage to the cloud. Cloud storage sounds mysterious, but it just means storing data on servers maintained by providers such as Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP), and Alibaba Cloud. The cloud service provider purchases, installs, and maintains all hardware, software, and supporting infrastructure in its data centers. Using cloud services, an organization avoids the enormous costs of building and supporting the infrastructure necessary to store the vast amounts of data it collects. Instead, the cloud service provider charges a “pay-as-you-go” (monthly) subscription fee.
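To show what “storing data in the cloud” can look like in practice, here is a minimal sketch that uploads one of the Parquet files from earlier to Amazon S3 using boto3. The bucket name and file paths are hypothetical, and AWS credentials are assumed to be configured in the environment.

```python
import boto3

# Upload a local Parquet file to an S3 bucket managed by the cloud provider.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="warehouse/sales/month=2024-01/part-0.parquet",  # local file (name is illustrative)
    Bucket="example-company-data-lake",                       # hypothetical bucket name
    Key="sales/month=2024-01/part-0.parquet",                 # object key within the bucket
)
```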