Big Data

Curt Long

Published Nov 11, 2015

Most of us, unfortunately, do not have our business data truly organized. We have unstructured data sources: this is the data creation component. Typically, this is data that is not, or cannot be, stored in a structured, relational database. It includes both semi-structured and unstructured data sources. Example sources include email, social data, XML data, videos, audio files, photos, GPS, satellite images, sensor data, spreadsheets, web log data, mobile data, RFID tags and PDF docs.

Try Hadoop: The Hadoop Distributed File System (HDFS) is the data storage component of the open source Apache Hadoop project. It can store any type of data – structured, semi-structured and unstructured. It is designed to run on low-cost commodity hardware and is able to scale out quickly and cheaply across thousands of machines.

There are also big data applications: this is the data action component. These are the applications, tools and utilities that have been natively built for users to access, interact with, analyze and make decisions using data in Hadoop and other non-relational storage systems. This category does not include traditional BI/analytics applications or tools that have been extended to support Hadoop.

MapReduce is the resource management and processing component of Hadoop. It allows Hadoop developers to write optimized programs that can process large volumes of data, structured and unstructured, in parallel across clusters of machines in a reliable and fault-tolerant manner. For instance, a programmer can use MapReduce to find friends or calculate the average number of contacts in a social network application, or to process web access logs to analyze web traffic volume and patterns.
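To make the "average number of contacts" example concrete, here is a minimal sketch of the map, shuffle and reduce steps in plain Python. It runs on one machine and does not use Hadoop's API; the record layout and the function names (`map_phase`, `shuffle`, `reduce_phase`) are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical input: one (user, contact) pair per record.
records = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "alice"),
    ("carol", "alice"),
    ("carol", "bob"),
    ("carol", "dave"),
]

def map_phase(record):
    """Emit (user, 1) for every contact record."""
    user, _contact = record
    yield user, 1

def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the ones to get the contact count for a single user."""
    return key, sum(values)

# Run the three steps in sequence on the sample records.
mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

# Final aggregation: average contacts per user.
average_contacts = sum(counts.values()) / len(counts)
print(counts, average_contacts)
```

On a real cluster the map and reduce functions would run in parallel on many machines, each working on its own slice of the input; the sketch only shows how the work is divided between the two functions.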

Another benefit of MapReduce is that it processes the data where it resides (in HDFS) instead of moving it around, as is sometimes the case in a traditional enterprise data warehouse (EDW) system. It also comes with a built-in recovery system: if one machine goes down, MapReduce knows where to go to get another copy of the data. Although MapReduce processing is fast on large data volumes when compared to more traditional methods, its jobs must be run in batch mode. This has proven to be a limitation for organizations that need to process data more frequently or closer to real time. The good news is that with the release of Hadoop 2.0, the resource management functionality has been packaged separately (it is called YARN), so that MapReduce does not get bottlenecked and can stay focused on what it does best: processing data.
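The idea of processing data where it resides, and falling back to another copy when a machine goes down, can be pictured with a toy model. The sketch below is not Hadoop's scheduler; the function `pick_replica` and the node names are hypothetical and only illustrate the behaviour described above.

```python
def pick_replica(block_locations, live_nodes, local_node=None):
    """Choose a node to read a data block from.

    block_locations: nodes that hold a copy of the block (hypothetical names).
    live_nodes: nodes currently reachable.
    local_node: the node the task runs on, preferred when it holds a live copy.
    """
    candidates = [node for node in block_locations if node in live_nodes]
    if not candidates:
        raise RuntimeError("no live replica of this block")
    if local_node in candidates:
        return local_node   # process the data where it resides
    return candidates[0]    # otherwise read another copy

block_locations = ["node1", "node2", "node3"]

# All machines are up: the task reads its local copy.
first_choice = pick_replica(block_locations, {"node1", "node2", "node3"}, local_node="node2")

# node2 goes down: the read falls back to a surviving copy.
after_failure = pick_replica(block_locations, {"node1", "node3"}, local_node="node2")

print(first_choice, after_failure)
```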
