The Right Big Data Tools
I have talked to a few people who are new to big data, and they are overwhelmed by the range of tools available. Often they make decisions based on the only tool they know. Others are led astray by an advocate or salesperson for a particular tool.
This is a guide, based on my experience with big data tools, to which tools are appropriate for various tasks.
Hold Your Horses
Before investing time and money in a big data solution, hold back and ask yourself some questions.
The most important thing to ask is what analysis you want to undertake; it may be worthwhile hiring a data scientist, on contract or permanently, to help work that out. What kind of reports do you want? Realistically, how quickly do you need to retrieve that information?
What data do you have, and how much of it? If the size of the data is less than about 1TB, it might be worthwhile to set up a dedicated server with a traditional database such as PostgreSQL, perhaps with memcached or Redis alongside it.
Can you use existing databases and data sources to extract interesting information?
PostgreSQL and MySQL
PostgreSQL has good performance as a key / value store - on some benchmarks it can beat Cassandra. It is not as good as Cassandra on replication though.
MySQL has good performance as a key / value store. Facebook uses MySQL with memcached to power their website.
PostgreSQL can index JSON and XML, so it may be used as an alternative to MongoDB. PostgreSQL automatically makes use of available system memory for caching, or a table can be put on a RAM filesystem.
The advantage of using PostgreSQL is that in a pinch you still have access to the power of a relational database.
Both PostgreSQL and MySQL are free software, so you only have to pay for the servers they run on.
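To make that concrete, here is a minimal sketch of PostgreSQL as a key/value and JSON store, driven from Python with psycopg2. The database, table and key names are my own invention for illustration, and it assumes a reasonably recent PostgreSQL (jsonb needs 9.4, ON CONFLICT needs 9.5):

```python
# A rough sketch of PostgreSQL as a key/value + JSON store.
# The database, table and key names are illustrative assumptions.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=analytics")
cur = conn.cursor()

# A jsonb column can be GIN-indexed, which is what lets PostgreSQL
# stand in for MongoDB on many document workloads.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        key text PRIMARY KEY,
        doc jsonb NOT NULL
    )
""")
cur.execute("CREATE INDEX IF NOT EXISTS events_doc_idx ON events USING gin (doc)")

# Key/value style write and read.
cur.execute(
    "INSERT INTO events (key, doc) VALUES (%s, %s) "
    "ON CONFLICT (key) DO UPDATE SET doc = EXCLUDED.doc",
    ("user:42", Json({"country": "AU", "visits": 7})),
)
cur.execute("SELECT doc FROM events WHERE key = %s", ("user:42",))
print(cur.fetchone()[0])
conn.commit()
```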
Pros:
- Full ACID compliance available
- The work is already done for cross-referencing information (i.e. JOIN) or summarising information (i.e. GROUP BY)
- Some databases include complex data types, such as JSON or XML, suitable for storing a de-normalised copy of data
- Some databases allow indexing on JSON or XML fragments
- Good if you want to access data over a range, e.g. a date range
- Full data integrity checks available
Cons:
- Politics involved in using existing databases
- Existing databases may be under-provisioned
- Management inertia against spending compared to a more "sexy" technology
- Inertia against de-normalising a copy of the data to minimise the need for JOINs
Hive on Hadoop
If you have log files that you may never analyse, or are likely to analyse only once, Hive is a good option.
Hive allows an SQL query to be run on Hadoop without needing to write a specific Hadoop program. Hive on AWS supports directly accessing log files stored in S3.
Hive "CREATE TABLE" supports CSV, JSON and regex. A custom plugin may be written to handle other log formats - such as binary formats. The delimiter for CSV may be changed.
Pros:
- With AWS Hive you basically pay per query, which is great if you only have a small number of queries per data set or may never query a data set (e.g. research or debugging)
- Great for parsing logs: CSV, JSON and regex are supported. Extra formats can be added using plugins
- Great for large data sets that you will only look at once, particularly debugging
- Very flexible - great for debugging and custom reports
- Good if you want to access data over a range, e.g. a date range
- Queries are SQL
Cons:
- With AWS you basically pay per query which is terrible if you have a large number of queries per data set
- Even a private Hadoop system will only support a limited number of queries, so a large number of queries per data set will give you a bad time
- Hive has very high latency. A query will take a minimum of a few minutes and could easily take an hour or more
Data Warehouse - Redshift, Netezza, Teradata and Others
Data warehouse appliances are truly amazing. When I was using Netezza at 24/7 Media, a query on terabytes of atomic data would take less than a minute, compared to a few minutes to query just the summarised data on a well-optimised Oracle server.
The disadvantage is that they are a long-term commitment. Netezza or Teradata will cost you in $100,000 increments, while Redshift is leased on a per-GB, per-month basis - although if you turn a Redshift instance off, you have to load the data again when you turn it back on. Either way, data warehouses are a longer-term solution.
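For a feel of the workflow, here is a minimal sketch of that loading step. Redshift speaks the PostgreSQL wire protocol, so psycopg2 works; the endpoint, credentials, table and IAM role below are placeholders:

```python
# A rough sketch of (re)loading a Redshift table from S3.
# The endpoint, credentials, table and IAM role are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="warehouse",
    user="analyst",
    password="secret",
)
cur = conn.cursor()

# Redshift is loaded in bulk with COPY rather than row-by-row INSERTs.
# This is also the step you repeat if you turned the cluster off and
# need the data back.
cur.execute("""
    COPY impressions
    FROM 's3://my-log-bucket/impressions/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV
""")
conn.commit()
```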
Pros:
- Pay per data set, not per query
- Great if you have data sets that will be queried frequently
- Relatively low latency. Usually you can expect your query result in a few seconds to a few minutes
- Good if you want to access data over a range, e.g. a date range
- Queries are SQL
- Full ACID compliance
- Custom hardware makes queries very fast
Cons:
- Terrible if you have data sets that will be queried once or never
- Can be expensive if you have a large data set. One strategy is to keep frequently accessed data (say, the last 30 days) in the data warehouse, with older data accessed through Hive or as summarised data
- Limited data integrity checks
- Indexing not available
- Expensive to move from IAAS to your own data warehouse appliance
- Disks can be unreliable - make sure you keep backups or your original sources
Greenplum
Greenplum is an open-source data warehouse implemented in software rather than hardware. I haven't used Greenplum, but it looks like a good option when a self-hosted solution is desired rather than an IAAS one.
Greenplum will need a build-out of servers, and you won't have the advantage of the custom hardware found in the appliance-based data warehouse solutions.
Pros:
- Smaller cost increments than hardware solution
- May work out cheaper by using commodity hardware
- Queries are SQL
- ACID compliant
Cons:
- IAAS has even smaller cost increments
- Performance would be worse than hardware data warehouse solutions
- Terrible if you have data sets that will be queried once or never
- Can be expensive if you have a large data set. One strategy is to keep frequently accessed data (say, the last 30 days) in the data warehouse, with older data accessed through Hive or as summarised data
- Limited data integrity checks
- Indexing not available
MongoDB
MongoDB is great if you want a de-normalised copy of data. Mongo records are based on JSON, so they have structure, and a single record can contain multiple pieces of referenced information.
The problem with MongoDB is that when you eventually need a JOIN, you have to do it in the application - and if you need that JOIN in a dozen places, you will re-implement it in all those places.
MongoDB is atomic at the row level, but it is not possible to atomically update rows in different tables.
MongoDB does not provide any transaction constructs (i.e. BEGIN, COMMIT, ROLLBACK).
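Here is a minimal sketch of what that application-side JOIN ends up looking like with pymongo. The orders and customers collections are invented for illustration:

```python
# A rough sketch of the "JOIN in the application" problem.
# The 'orders' and 'customers' collections are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.shop

def orders_with_customers():
    # What a relational database does in a single JOIN, the
    # application must stitch together itself, one lookup per order.
    for order in db.orders.find():
        customer = db.customers.find_one({"_id": order["customer_id"]})
        yield order, customer
```

Every call site that needs this cross-reference repeats some variant of that loop.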
Pros:
- Great for de-normalised data
- Low latency
- CID compliant (Consistency, Isolation and Durability, but no Atomicity across rows)
- Updating a row is atomic
- JSON formatting is checked
- Indexing available
Cons:
- No JOIN - it must be implemented by the application
- No GROUP BY - it must be implemented by the application
- Updating a row is atomic - no other transactions are supported
- No data integrity checks
Cassandra
Cassandra is a key-value store. It works really well if you want a data store that is very low latency and can tolerate temporary inconsistency, and where you do not need ranges, do not need to search through the value data, and do not need to cross-reference data.
Cassandra is very low latency - when we were using Cassandra at Brandscreen, the latency was dominated by the 7ms seek times of the disks we were using, and once Brandscreen switched to SSDs the results were even better.
Many people disregard the limitation that Cassandra is eventually consistent: the data may be temporarily inconsistent, so a row may be missing, or an old version of the row may be returned.
Storing information about known visitors for real-time bidding in advertising is a good use case for Cassandra. The data is strictly one-to-one, and if one visitor is missed there will be 200,000 more in the next second.
It doesn't work as well for things like financial information or most reference information where your users will expect that the data will be consistent.
Another mistake is to use Cassandra for data that needs to be queried over a range - for example, a range of dates. Because each key must be requested individually, Cassandra will be hit with an enormous number of requests, many of which may return no data.
If you need to cross-reference (a JOIN), summarise (a GROUP BY) or search through the values, you will have to implement that yourself.
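As a sketch of the sweet spot, here is the single-key access pattern Cassandra is built for, using the DataStax Python driver. The keyspace, table and values are invented, and the table is assumed to already exist:

```python
# A rough sketch of Cassandra as a pure key/value store.
# The keyspace, table and data are illustrative assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("adserving")

# One write and one read, each addressed by the full key - the
# access pattern Cassandra is built for.
session.execute(
    "INSERT INTO visitors (visitor_id, segments) VALUES (%s, %s)",
    ("v-12345", "sports,travel"),
)
row = session.execute(
    "SELECT segments FROM visitors WHERE visitor_id = %s",
    ("v-12345",),
).one()
print(row.segments)

# A date-range scan, by contrast, would fan out into a huge number
# of single-key requests - the anti-pattern described above.
```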
Pros:
- Very low latency (although PostgreSQL and MySQL may be able to give it a run for its money)
- Good for key / value data
- Good if your data can be temporarily inconsistent (i.e. missing or out-of-date)
Cons:
- Not good for querying ranges of data
- Not good for data that is expected to always be consistent
- Not good if you need to cross reference (i.e. JOIN)
- Not good if you need to summarise (i.e. GROUP BY)
- Not good if you need to search values
Conclusion
By thinking about your data, both big and small, you can get much better results for a smaller outlay. Don't just jump into a big data project and expect the "why" to sort itself out.