What is "Functional" in Big Data?
Functional: in operation; working.
Yeah... but what is "working"?
This is something I have been asking myself a lot over the past few months. LOCALLY has just gone through a hardcore period of feature building and we are getting to grips with what "Functional" means for different departments.
There is a fine line between micromanaging a specific feature and letting people do their job, but there are times when something slips through the cracks. In this case it was because development and operations had a difference of opinion about what "Functional" actually means. These definitions take on real nuance when you start talking in terms of Big Data. Let me explain:
LOCALLY manages a data warehouse that contains around 500-750TB of "Live" data at any one time, split across several different systems that each transform the data and "serve" it in some way. Because of this it makes sense to have a unified schema, so that schema changes aren't something we have to deal with on top of all of the other data pipeline work.
It seemed like a great idea, so some time back we decided to test it and implement it in the query engine that drives our platform. In short, this required us to "nest" multiple records under a single row (think an array of records). The migration went pretty smoothly and everything seemed to be fine. This was at a point where we were geospatially processing around 1 billion rows a day. Not a small number by any means, but still manageable. Queries were running in sub-100ms time frames, which is usually enough to return the data to the front end fast enough for human interaction.
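To make the nesting concrete, here is a minimal Python sketch of the shape change involved. The field names (place_id, visit_ts, dwell_s) are invented for illustration and are not our actual schema; in Hive/Presto terms this is the difference between one flat row per record and a single row holding an array of nested struct/row values.

```python
# Minimal sketch (not LOCALLY's actual schema): collapsing many flat rows
# that share a key into one row holding an array of nested records -- the
# shape an ARRAY<STRUCT<...>> column gives you in Hive, or ARRAY(ROW(...)) in Presto.
from collections import defaultdict

flat_rows = [
    {"place_id": 1, "visit_ts": "2019-03-01T10:00", "dwell_s": 120},
    {"place_id": 1, "visit_ts": "2019-03-01T11:30", "dwell_s": 45},
    {"place_id": 2, "visit_ts": "2019-03-01T09:15", "dwell_s": 300},
]

nested = defaultdict(list)
for row in flat_rows:
    nested[row["place_id"]].append(
        {"visit_ts": row["visit_ts"], "dwell_s": row["dwell_s"]}
    )

# One row per place, with all of its visit records nested inside it.
for place_id, visits in nested.items():
    print({"place_id": place_id, "visits": visits})
```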
Then one month our data increased 100x, and all of a sudden everything that was previously working "Fine" decided to show a pretty ugly side of itself. When you are working at this scale, empirical measurement is absolutely crucial: a 50ms difference in a query at 1x might be barely noticeable to a developer (or noticed and accepted, if that developer is used to a normal RDBMS), but at 100x that is 5 seconds, and 5 seconds at the scale of billions of rows suddenly becomes unmanageable.
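The arithmetic is blunt but worth spelling out. The numbers below are simply the ones from the paragraph above, and the sketch assumes query time grows roughly linearly with the data scanned.

```python
# Back-of-the-envelope: a per-query overhead that is invisible at 1x data
# volume becomes user-visible once the volume grows 100x, assuming
# (simplistically) that query time scales linearly with the data scanned.
overhead_ms_at_1x = 50
scale_factor = 100

overhead_at_100x_ms = overhead_ms_at_1x * scale_factor
print(f"{overhead_at_100x_ms / 1000:.1f} s per query")  # -> 5.0 s per query
```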
That department's idea of "Functional" was "Can we get this schema working with our platforms?"
The answer was Yes, they could (after all, they are pretty great developers!) - so they did.
The system worked exactly as designed with the new schema at the existing data volumes. However, once the volume increased significantly, queries slowed down, the user experience suffered pretty badly, and the only way to improve it was to run expensive clusters to handle the pre-processing of the data faster.
The task should have been communicated as "Can we get this schema working with our platforms at the same speed as the existing schema?"
Big Data has the same constraints as every other project: you want it done quickly, cheaply and effectively.
The triangle above represents pretty well how these tenets of Big Data affect each other: if any of the vertices changes, the others are going to be affected in some way.
Increased Speed requires increased Cost or lower Volume
Increased Volume requires increased Cost or lower Speed
Decreased Cost requires lower Speed or lower Volume (or both!)
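A toy model makes the triangle's arithmetic explicit. The linear scaling assumption and the function estimated_cluster_cost are my own simplification for illustration, not how anyone actually budgets clusters.

```python
# Toy model of the Speed / Volume / Cost triangle: assume throughput scales
# linearly with cluster size, so cost is proportional to data volume divided
# by the latency you are willing to accept. Purely illustrative numbers.
def estimated_cluster_cost(volume_tb: float, target_latency_ms: float,
                           cost_per_unit: float = 1.0) -> float:
    """More data or a tighter latency target both push the cost up."""
    return cost_per_unit * volume_tb / target_latency_ms

baseline = estimated_cluster_cost(volume_tb=500, target_latency_ms=100)
after_growth = estimated_cluster_cost(volume_tb=500 * 100, target_latency_ms=100)
print(after_growth / baseline)  # -> 100.0: holding Speed at 100x the Volume costs ~100x
```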
So what did we learn?
Namely, that when you are working at the scale of hundreds of billions of records, the tiniest anomaly in response time really matters. A 2ms difference in fetching a small dataset gets wildly blown out when you are dealing with tens of billions of rows.
As part of these changes we now have a robust experiment-design checklist, which means we gather data on every affected part of our Big Data architecture to ensure that no inefficiencies have crept in (see the next article for a retrospective on our choices).
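As a rough illustration of the kind of measurement that checklist forces, here is a sketch of a before/after query benchmark. run_query is a stand-in for whatever client actually executes SQL against the engine, and the queries themselves are placeholders, not our real workload.

```python
# Hypothetical benchmark harness: time the same workload against the old and
# new schema and compare medians *and* tail latency, instead of eyeballing it.
import statistics
import time

def benchmark(run_query, sql: str, runs: int = 20) -> dict:
    """Return simple latency statistics (in milliseconds) for a query."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(sql)  # run_query is whatever client executes SQL for you
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[max(0, int(0.95 * len(samples)) - 1)],
    }

# old = benchmark(run_query, "SELECT ... FROM flat_schema WHERE ...")
# new = benchmark(run_query, "SELECT ... FROM nested_schema WHERE ...")
# Sign the change off as "Functional" only if both sets of numbers hold up.
```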
We also learned that while you can throw many servers at the problem to increase speed, it is often not cost-effective to do so. Instead, we researched and developed some pretty hefty Presto, Hive and ORC optimisations as part of this work, to ensure that we would never have to do this again. (Another article on that soon!)
In short, as my Grandad always said, "Measure twice, cut once": the only good decisions are data-driven decisions.