The Right Big Data Tools
I have talked to a few people who are new to big data, and they are overwhelmed by the range of tools available. Often they make decisions based on the only tool they know. Others are led astray by an advocate or salesperson for a particular tool.
This is a guide, based on my experience with big data tools, to which tools are appropriate for various tasks.
Hold Your Horses
Before investing time and money in a big data solution, hold back and ask yourself some questions.
The most important thing to ask is what analysis you want to undertake; it may be worthwhile hiring a data scientist, on contract or permanently, to help work that out. What kind of reports do you want? Realistically, how quickly do you need to retrieve that information?
What data do you have, and how much of it? If the size of the data is less than about 1TB, it might be worthwhile to set up a dedicated server with a traditional database such as PostgreSQL, perhaps with memcached or Redis alongside it.
Can you use existing databases and data sources to extract interesting information?
PostgreSQL and MySQL
PostgreSQL has good performance as a key / value store - on some benchmarks it can beat Cassandra. It is not as good as Cassandra on replication though.
MySQL has good performance as a key / value store. Facebook uses MySQL with memcached to power their website.
PostgreSQL can index JSON and XML, so it may be used as an alternative to MongoDB. PostgreSQL automatically makes use of available system memory for caching, or a table can be put on a RAM filesystem.
The advantage of using PostgreSQL is that in a pinch you still have access to the power of a relational database.
Both PostgreSQL and MySQL are free software, so you only have to pay for the servers they run on.
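To make that concrete, here is a minimal sketch of PostgreSQL as a key/value and JSON store, driven from Python with psycopg2. The database, table and key names are my own invention for illustration, and it assumes a reasonably recent PostgreSQL (jsonb needs 9.4, ON CONFLICT needs 9.5):

```python
# A rough sketch of PostgreSQL as a key/value + JSON store.
# The database, table and key names are illustrative assumptions.
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=analytics")
cur = conn.cursor()

# A jsonb column can be GIN-indexed, which is what lets PostgreSQL
# stand in for MongoDB on many document workloads.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        key text PRIMARY KEY,
        doc jsonb NOT NULL
    )
""")
cur.execute("CREATE INDEX IF NOT EXISTS events_doc_idx ON events USING gin (doc)")

# Key/value style write and read.
cur.execute(
    "INSERT INTO events (key, doc) VALUES (%s, %s) "
    "ON CONFLICT (key) DO UPDATE SET doc = EXCLUDED.doc",
    ("user:42", Json({"country": "AU", "visits": 7})),
)
cur.execute("SELECT doc FROM events WHERE key = %s", ("user:42",))
print(cur.fetchone()[0])
conn.commit()
```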
Pros:
- Full ACID compliance available
- The work is already done for cross-referencing information (i.e. JOIN) or summarising information (i.e. GROUP BY)
- Some databases include complex data types, such as JSON or XML, suitable for storing a de-normalised copy of data
- Some databases allow indexing on JSON or XML fragments
- Good if you want to access data over a range, e.g. a date range
- Full data integrity checks available
Cons:
- Politics involved in using existing databases
- Existing databases may be under-provisioned
- Management inertia against spending compared to a more "sexy" technology
- Inertia against de-normalising a copy of the data to minimise the need for JOINs
Hive on Hadoop
If you have log files that you may never analyse, or are likely to analyse only once, Hive is a good option.
Hive allows an SQL query to be run on Hadoop without needing to write a specific Hadoop program. Hive on AWS supports directly accessing log files stored in S3.
Hive "CREATE TABLE" supports CSV, JSON and regex. A custom plugin may be written to handle other log formats - such as binary formats. The delimiter for CSV may be changed.
Pros:
- With AWS Hive you basically pay per query, which is great if you only have a small number of queries per data set or may never query a data set (e.g. research or debugging)
- Great for parsing logs: CSV, JSON and regex are supported. Extra formats can be added using plugins
- Great for large data sets that you will only look at once, particularly debugging
- Very flexible - great for debugging and custom reports
- Good if you want to access data over a range, e.g. a date range
- Queries are SQL
Cons:
- With AWS you basically pay per query which is terrible if you have a large number of queries per data set
- Even a private Hadoop system will only support a limited number of queries, so a large number of queries per data set will give you a bad time
- Hive has very high latency. A query will take a minimum of a few minutes and could easily take an hour or more
Data Warehouse - Redshift, Netezza, Teradata and Others
Data warehouse appliances are truly amazing. When I was using Netezza at 24/7 Media, a query on terabytes of atomic data would take less than a minute, compared to a few minutes to query just the summarised data on a well-optimised Oracle server.
The disadvantage is that they are a long-term commitment. Netezza or Teradata will cost you in $100,000 increments, while Redshift is leased on a per-GB, per-month basis - although if you turn a Redshift instance off, you have to load the data again when you turn it back on. Either way, data warehouses are a longer-term solution.
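For a feel of the workflow, here is a minimal sketch of that loading step. Redshift speaks the PostgreSQL wire protocol, so psycopg2 works; the endpoint, credentials, table and IAM role below are placeholders:

```python
# A rough sketch of (re)loading a Redshift table from S3.
# The endpoint, credentials, table and IAM role are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="warehouse",
    user="analyst",
    password="secret",
)
cur = conn.cursor()

# Redshift is loaded in bulk with COPY rather than row-by-row INSERTs.
# This is also the step you repeat if you turned the cluster off and
# need the data back.
cur.execute("""
    COPY impressions
    FROM 's3://my-log-bucket/impressions/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    CSV
""")
conn.commit()
```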
Pros:
- Pay per data set, not per query
- Great if you have data sets that will be queried frequently
- Relatively low latency. Usually you can expect your query result in a few seconds to a few minutes
- Good if you want to access data over a range, e.g. a date range
- Queries are SQL
- Full ACID compliance
- Custom hardware makes queries very fast
Cons:
- Terrible if you have data sets that will be queried once or never
- Can be expensive if you have a large data set. One strategy is to keep frequently accessed data (say, the last 30 days) in the data warehouse, with older data accessed through Hive or as summarised data
- Limited data integrity checks
- Indexing not available
- Expensive to move from IAAS to your own data warehouse appliance
- Disks can be unreliable - make sure you keep backups or your original sources
Greenplum
Greenplum is an open-source data warehouse implemented in software rather than hardware. I haven't used Greenplum, but it looks like a good option when a self-hosted solution is desired rather than an IAAS one.
Greenplum will need a build-out of servers, and you won't have the advantage of the custom hardware found in the appliance-based data warehouse solutions.
Pros:
- Smaller cost increments than hardware solution
- May work out cheaper by using commodity hardware
- Queries are SQL
- ACID compliant
Cons:
- IAAS has even smaller cost increments
- Performance would be worse than hardware data warehouse solutions
- Terrible if you have data sets that will be queried once or never
- Can be expensive if you have a large data set. One strategy is to keep frequently accessed data (say, the last 30 days) in the data warehouse, with older data accessed through Hive or as summarised data
- Limited data integrity checks
- Indexing not available
MongoDB
MongoDB is great if you want a de-normalised copy of data. Mongo records are based on JSON, so they have structure, and a single record can contain multiple pieces of referenced information.
The problem with MongoDB is that when you eventually need a JOIN, you have to do it in the application - and if you need that JOIN in a dozen places, you will re-implement it in all those places.
MongoDB is atomic at the row level, but it is not possible to atomically update rows in different tables.
MongoDB does not provide any transaction constructs (i.e. BEGIN, COMMIT, ROLLBACK).
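Here is a minimal sketch of what that application-side JOIN ends up looking like with pymongo. The orders and customers collections are invented for illustration:

```python
# A rough sketch of the "JOIN in the application" problem.
# The 'orders' and 'customers' collections are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.shop

def orders_with_customers():
    # What a relational database does in a single JOIN, the
    # application must stitch together itself, one lookup per order.
    for order in db.orders.find():
        customer = db.customers.find_one({"_id": order["customer_id"]})
        yield order, customer
```

Every call site that needs this cross-reference repeats some variant of that loop.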
Pros:
- Great for de-normalised data
- Low latency
- CID compliant (Consistency, Isolation and Durability, but no Atomicity across rows)
- Updating a row is atomic
- JSON formatting is checked
- Indexing available
Cons:
- No JOIN - it must be implemented by the application
- No GROUP BY - it must be implemented by the application
- Updating a row is atomic - no other transactions are supported
- No data integrity checks
Cassandra
Cassandra is a key-value store. It works really well if you want a data store that is very low latency and can tolerate temporary inconsistency, and where you do not need ranges, do not need to search through the value data, and do not need to cross-reference data.
Cassandra is very low latency - when we were using Cassandra at Brandscreen, the latency was dominated by the 7ms seek times of the disks we were using, and once Brandscreen switched to SSDs the results were even better.
Many people disregard the limitation that Cassandra is eventually consistent: the data may be temporarily inconsistent, so a row may be missing, or an old version of the row may be returned.
Storing information about known visitors for real-time bidding in advertising is a good use case for Cassandra. The data is strictly one-to-one, and if one visitor is missed there will be 200,000 more in the next second.
It doesn't work as well for things like financial information or most reference information where your users will expect that the data will be consistent.
Another mistake is to use Cassandra for data that needs to be queried over a range - for example, a range of dates. Because each key must be requested individually, Cassandra will be hit with an enormous number of requests, many of which may return no data.
If you need to cross-reference (a JOIN), summarise (a GROUP BY) or search through the values, you will have to implement that yourself.
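As a sketch of the sweet spot, here is the single-key access pattern Cassandra is built for, using the DataStax Python driver. The keyspace, table and values are invented, and the table is assumed to already exist:

```python
# A rough sketch of Cassandra as a pure key/value store.
# The keyspace, table and data are illustrative assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("adserving")

# One write and one read, each addressed by the full key - the
# access pattern Cassandra is built for.
session.execute(
    "INSERT INTO visitors (visitor_id, segments) VALUES (%s, %s)",
    ("v-12345", "sports,travel"),
)
row = session.execute(
    "SELECT segments FROM visitors WHERE visitor_id = %s",
    ("v-12345",),
).one()
print(row.segments)

# A date-range scan, by contrast, would fan out into a huge number
# of single-key requests - the anti-pattern described above.
```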
Pros:
- Very low latency (although PostgreSQL and MySQL may be able to give it a run for its money)
- Good for key / value data
- Good if your data can be temporarily inconsistent (i.e. missing or out-of-date)
Cons:
- Not good for querying ranges of data
- Not good for data that is expected to always be consistent
- Not good if you need to cross reference (i.e. JOIN)
- Not good if you need to summarise (i.e. GROUP BY)
- Not good if you need to search values
Conclusion
By thinking about your data, both big and small, you can get much better results for a smaller outlay. Don't just jump into a big data project and expect the "why" to sort itself out.