Massively Parallel Processing (MPP) database on Hadoop

Massively Parallel Processing (MPP) database on Hadoop:

In Massively Parallel Processing (MPP) databases, data is partitioned across multiple servers or nodes, with each server/node having its own memory and processors to process data locally. All communication is via a network interconnect; there is no disk-level sharing or contention to be concerned with (i.e., it is a 'shared-nothing' architecture).

Pivotal HAWQ:

Apache HAWQ (incubating) combines exceptional MPP-based analytics performance, robust ANSI SQL compliance, Hadoop ecosystem integration and manageability, and flexible data-store format support. All natively in Hadoop. No connectors required.

HAWQ is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Pivotal HAWQ is a Massively Parallel Processing (MPP) database that uses several Postgres database instances and HDFS storage. Think of your regular MPP databases like Teradata/Greenplum/Netezza, except that instead of using local storage it uses HDFS to store data files. Each of the processing nodes still has its own CPU, memory and storage.

Core components of Pivotal HAWQ:

There are four core components, part of the PXF (Pivotal Extension Framework) layer, that you can extend for your data format or source; a sketch of how they come together follows the list.

Fragmenter: The Fragmenter is responsible for passing data source metadata back to HAWQ. Each data fragment describes a piece of the requested data, including the data source name and hostname of where it is located. For example, if the source is a file in HDFS, the Fragmenter returns a list of data fragments containing HDFS file blocks. Each fragment includes the location of the block. If the source data is an HBase table, the Fragmenter returns information about table regions, including their locations.

 Accessor: The Accessor retrieves specific fragments and passes records back to the Resolver. For example, the Accessor creates a FileInputFormat and a RecordReader for the specific fragments (HDFS blocks) it was requested to process and sends this to the Resolver. In the case of HBase and Hive files, the Accessor returns single rows from an HBase or Hive table.

 Resolver: The Resolver deserializes records passed from the Accessor (Java OneRow objects) and serializes them to a list of field objects (OneField). PXF automatically converts this list of OneField objects into HAWQ-readable GPDBWritable format.

Analyzer: The Analyzer provides PXF statistical data for the HAWQ query optimizer. Statistics include the size of a data fragment, the number of fragments, and the number of rows in the data. These statistics should be quick estimates, as scanning an entire data set for exact statistics can take some time. Analyzers are optional, and a number of analyzers are built into the framework for use (e.g., for files stored in HDFS).
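To give a feel for how these components are exercised from SQL, here is a minimal sketch of a PXF external table in HAWQ. The built-in HdfsTextSimple profile bundles a Fragmenter, Accessor and Resolver for delimited HDFS files; the table name, columns, host, port and path are illustrative assumptions, and the exact pxf:// URL syntax varies between HAWQ/PXF versions.

-- Hedged sketch: a readable external table backed by PXF.
-- 'namenode_host', the port and the HDFS path are placeholders.
CREATE EXTERNAL TABLE sales_ext (
    sale_id    integer,
    sale_date  date,
    amount     numeric
)
LOCATION ('pxf://namenode_host:51200/data/sales/*.csv?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

Behind the scenes, the profile's Fragmenter reports the HDFS blocks for the matching files, the Accessor reads each block, and the Resolver turns the records into fields HAWQ can consume.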

 

Let's look at some components of HAWQ and how they work together, using a sample physical architecture for Pivotal HAWQ (a catalog query for inspecting a live cluster follows the component list).

Physical architecture for Pivotal HAWQ:

HAWQ Master Server – A separate physical server with its own OS/CPU/storage/memory that hosts the HAWQ master database. The master database stores no user data, only metadata about database objects and segments (think system tables). It is also the entry point to HAWQ: it manages user requests, generates query plans, and submits the requests to the HAWQ segments.

HDFS Namenode – A separate physical server with its own OS/CPU/storage/memory; it is the centerpiece of the HDFS system. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

Segment Server – Individual physical servers with their own OS/CPU/storage/memory. Each hosts a HAWQ segment process and an HDFS DataNode. HAWQ segment servers do not store any user data on the local file system.

Interconnect switch – Segment servers and DataNodes communicate through an interconnect switch.
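To see how a given cluster maps onto this layout, you can query the Greenplum-derived system catalog from the master. This is a hedged sketch: gp_segment_configuration exists in HAWQ, but its exact columns differ between releases, so SELECT * is used rather than naming specific columns.

-- List the hosts (master and segments) registered with the cluster
SELECT *
FROM gp_segment_configuration
ORDER BY hostname;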

Since HAWQ is an MPP database, data is distributed across the HAWQ segments to achieve data and processing parallelism. This is done by creating a database table with a DISTRIBUTED BY clause; using this clause, data is automatically distributed across segment servers. In HAWQ you can use either hash or round-robin distribution.

In this example we will use hash distribution to spread the data across HAWQ segments. To get an even distribution of data across segments, it makes sense to use the primary key as the distribution key.

CREATE TABLE product_dim
(product_id     integer,
product_name     varchar(200))
DISTRIBUTED BY (product_id) ;

Once the table is created, let's insert 9 records and see how the data is distributed and stored based on hashing of the product_id.

insert into product_dim values (1,'Apple MacBook Pro');
insert into product_dim values (2,'Apple Iphone 4s');
insert into product_dim values (3,'Apple Ipad 3');
insert into product_dim values (4,'Samsung Galaxy Tab 10.1');
insert into product_dim values (5,'Blackberry Bold');
insert into product_dim values (6,'Amazon Kindle Fire');
insert into product_dim values (7,'Google Android Tablet 9');
insert into product_dim values (8,'Nook Color');
insert into product_dim values (9,'Lenovo IdeaPad');

Data Distribution in MPP Nodes:

As the distribution key was product_id, data is hashed on that key and spread out evenly across the HAWQ segment servers. Now let's look at how the data is stored, since the underlying storage is HDFS.
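One way to check the spread is to group by the hidden gp_segment_id column that HAWQ, like Greenplum, exposes on every table. This is a quick sketch; the exact per-segment counts depend on the hash function and the number of segments in your cluster.

-- Count rows per segment to see how the hash distribution spread the nine rows
SELECT gp_segment_id, count(*) AS row_count
FROM product_dim
GROUP BY gp_segment_id
ORDER BY gp_segment_id;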

Case I: Insert data into HAWQ

When a record needs to be inserted into a table, the HAWQ master decides, based on the distribution key, which HAWQ segment server to send the record to. It also communicates with the HDFS NameNode to get an HDFS block address at which to store the data. The HDFS NameNode assigns an HDFS data block, preferably on the same DataNode as the HAWQ segment, and returns the HDFS URL to the HAWQ master.

The HAWQ master passes this information to the HAWQ segment, which in turn stores the data in the HDFS block.

Case II: Select data from HAWQ

select * from product_dim where product_id = 5 ;

When this query is issued, the HAWQ master parses it and builds a query plan. Since the query is filtered on a distribution key column, the plan is sent to HAWQ segment 2 for execution (remember, the master stores metadata about the data distribution of each table). The master also sends the HDFS URL for the required HDFS data block. After the query is sent to HAWQ segment 2, the segment uses the HDFS URL to read the data and sends the result set back to the HAWQ master.
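You can confirm this behaviour by asking the master for the plan itself. EXPLAIN output differs between HAWQ versions, so treat this as a sketch, but a lookup on the distribution key should show that only the segment owning the matching hash bucket does useful work, rather than a full scan on every segment.

-- Inspect the plan the master builds for a distribution-key lookup
EXPLAIN SELECT * FROM product_dim WHERE product_id = 5;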
