Big Data

Most of us, unfortunately, do not have our business data truly organized. We have unstructured data sources: this is the data creation component. Typically, this is data that is not, or cannot be, stored in a structured relational database. It includes both semi-structured and unstructured sources. Examples include email, social data, XML data, videos, audio files, photos, GPS, satellite images, sensor data, spreadsheets, web log data, mobile data, RFID tags and PDF docs.

Try Hadoop: The Hadoop Distributed File System (HDFS) is the data storage component of the open source Apache Hadoop project. It can store any type of data – structured, semi-structured and unstructured. It is designed to run on low-cost commodity hardware and is able to scale out quickly and cheaply across thousands of machines.

There are also big data applications: this is the data action component. These are the applications, tools and utilities that have been natively built for users to access, interact with, analyze and make decisions using data in Hadoop and other non-relational storage systems. This category does not include traditional BI/analytics applications or tools that have merely been extended to support Hadoop.

MapReduce is the resource management and processing component of Hadoop. It allows Hadoop developers to write programs that process large volumes of data, structured and unstructured, in parallel across clusters of machines in a reliable and fault-tolerant manner. For instance, a programmer can use MapReduce to find friends or calculate the average number of contacts in a social network application, or to process web access logs to analyze web traffic volume and patterns.
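To make the "average number of contacts" example concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps in plain Python. It does not use Hadoop's API: the `EDGES` data and the function names are made up for illustration, and a real job would spread the map and reduce work across the cluster.

```python
from collections import defaultdict

# Hypothetical toy input: one (user, contact) record per connection.
EDGES = [
    ("ann", "bob"), ("ann", "cat"),
    ("bob", "ann"),
    ("cat", "ann"), ("cat", "bob"), ("cat", "dan"),
]

def map_phase(record):
    """Map: emit (user, 1) for every contact record."""
    user, _contact = record
    yield user, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: add up the ones to get the contact count for one user."""
    return key, sum(values)

def contacts_per_user(edges):
    mapped = (pair for record in edges for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())

def average_contacts(edges):
    """Average over users that have at least one contact record."""
    counts = contacts_per_user(edges)
    return sum(counts.values()) / len(counts)

if __name__ == "__main__":
    print(contacts_per_user(EDGES))
    print(average_contacts(EDGES))
```

Because each map call looks at one record and each reduce call looks at one key, both steps can run independently on different machines, which is what lets the same pattern scale to large data sets.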

Another benefit of MapReduce is that it processes the data where it resides (in HDFS) instead of moving it around, as is sometimes the case in a traditional enterprise data warehouse (EDW) system. It also comes with a built-in recovery system – so if one machine goes down, MapReduce knows where to go to get another copy of the data. Although MapReduce can work through large data sets much faster than more traditional methods, its jobs must be run in batch mode. This has proven to be a limitation for organizations that need to process data more frequently and/or closer to real time. The good news is that with the release of Hadoop 2.0, the resource management functionality has been packaged separately (it’s called YARN) so that MapReduce does not get bottlenecked and can stay focused on what it does best: processing data.
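The recovery idea, going to another copy of the data when a machine is down, can be illustrated with a toy sketch. Everything here is an assumption made for illustration (the node names, the `BLOCK_REPLICAS` and `NODE_STORAGE` tables, and the `read_block` helper); it is not how Hadoop exposes replicas to a program.

```python
# Hypothetical cluster state: which nodes hold a copy of each block.
BLOCK_REPLICAS = {"block-1": ["node-a", "node-b", "node-c"]}

# Hypothetical per-node storage holding identical copies of the block.
NODE_STORAGE = {
    "node-a": {"block-1": "GET /index.html 200"},
    "node-b": {"block-1": "GET /index.html 200"},
    "node-c": {"block-1": "GET /index.html 200"},
}

def read_block(block_id, live_nodes, replicas=BLOCK_REPLICAS, storage=NODE_STORAGE):
    """Return (node, data) from the first live node that holds the block."""
    for node in replicas[block_id]:
        if node in live_nodes:
            return node, storage[node][block_id]
    raise RuntimeError(f"no live replica for {block_id}")

if __name__ == "__main__":
    # node-a has gone down, so the read falls back to node-b.
    print(read_block("block-1", live_nodes={"node-b", "node-c"}))
```

The point of the sketch is only that a failed machine does not fail the job as long as some other machine still holds a copy of the same data.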

More articles by Curt Long
