Storage Polyglot Persistence
This isn't a post where I rant or have an epiphany, just an observation / thinking out loud with no conclusion (helpful, I know!). I've been thinking about this for some time, but two things have brought it to the front of my mind. Firstly, I have a good friend who is a data scientist, and we were discussing this recently; I wasn't aware of the term at the time. Secondly, I've recently been an outsider looking in at the storage and infrastructure world at large, and I think this approach is entirely appropriate.
For those that don't know the term, Polyglot Persistence has become popular with the advent of NoSQL and the multitude of new databases now available (I think it might originally come from developers referring to the correct language to use for a specific application requirement). Each database seemingly has its area of strength, and I'd nervously suggest none of them does everything perfectly. So developers and IT organisations end up putting in the right database for the right use-case. I won't show my ignorance by trying to give exhaustive examples, but: use an object-based database for objects, a time-series database for time-series data (I did learn recently that relational data actually isn't that great in a relational database!), and so on. A rough sketch of the idea follows below.
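To make that concrete, here's a minimal Python sketch of the pattern. The workload names and datastore choices are entirely my own illustrative inventions, not recommendations for any specific product; the point is simply that each data shape gets routed to the store that suits it:

```python
# A toy illustration of Polyglot Persistence: route each data shape
# to the store best suited to it. Mappings are illustrative only.
WORKLOAD_TO_STORE = {
    "session-cache": "Redis (key-value, in-memory)",
    "sensor-metrics": "InfluxDB (time-series)",
    "order-ledger": "PostgreSQL (relational)",
    "media-blobs": "S3-compatible object store",
    "social-graph": "Neo4j (graph)",
}

def pick_store(workload: str) -> str:
    """Return the datastore this hypothetical architecture assigns to a workload."""
    # Fall back to a general-purpose relational store for anything unclassified.
    return WORKLOAD_TO_STORE.get(workload, "PostgreSQL (relational)")

if __name__ == "__main__":
    for w in WORKLOAD_TO_STORE:
        print(f"{w:15} -> {pick_store(w)}")
```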
So why wouldn't the same rule apply to infrastructure in general? We do it with servers (virtual or physical): when I need something memory-intensive I give it lots of memory; CPU-intensive, lots of CPUs or cores; disk-intensive, lots of disk. And yet storage is still positioned by the vast majority of vendors as a single consolidated array, one appliance to rule them all. This became popular when arrays were expensive; new kids came along in the '90s and '00s and brought us fully functional multi-protocol arrays that could effectively do anything. The reality was (and I think still is) that while they can take any workload, they typically don't excel at any single workload, so you end up with heavily customised LUNs or disk volumes designed to squeeze the most from the backend in order to support the front-end.
One of the big challenges is that it's still too expensive to do this: an array is anywhere between £20k and £2m, so buying two or more is simply unheard of for most organisations. Some do it for VDI and some high-performance databases, where performance is so critical that it doesn't make sense not to. Many modern applications that don't require monolithic shared storage do give you options here: applications developed on cloud platforms, for example, where you specifically provision the exact type of storage you require for that workload (a hedged sketch of this follows below). The trouble is, I don't see any storage vendor being able to provide this out of a single array for those people with non-cloud solutions.
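As a hedged example of what that cloud-side provisioning looks like, here's a sketch using AWS EBS via boto3. The region, availability zone, sizes and IOPS figure are assumptions of mine, and it presumes AWS credentials are already configured; the point is picking a different volume type per workload rather than one class of storage for everything:

```python
import boto3

# Illustrative only: provision a different EBS volume type per workload.
ec2 = boto3.client("ec2", region_name="eu-west-1")  # region is an assumption

# Capacity-oriented workload: cheap throughput-optimised HDD.
ec2.create_volume(AvailabilityZone="eu-west-1a", Size=500, VolumeType="st1")

# Latency-sensitive database: provisioned-IOPS SSD.
ec2.create_volume(AvailabilityZone="eu-west-1a", Size=100,
                  VolumeType="io2", Iops=10000)
```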
I guess this is vaguely where I hope software-defined storage comes into its own. If you can build a scale-out array with a combination of fast, capacity and super-fast storage, and categorise your workloads accordingly, then you'd actually be able to achieve a Polyglot Persistence storage architecture within a single appliance. In theory this is something an SDS array could handle that a monolithic appliance could not, purely because the "controller" unit is quite small, so you can happily build this type of architecture out. Additionally, as SDS is inherently cheaper (no bundled hardware), you'd have the flexibility of choice to build an array with this level of diversity. I've looked a lot at SDS recently, and unfortunately I'm not 100% convinced anyone does this today (happy to be corrected). I definitely don't think any traditional storage array vendor does (and don't give me QoS or policy control, that's a very poor answer to this).
Bottom line: I'd love to be able to have the right type of storage for specific applications: fast, capacity, high performance, 2x copies, 4x copies, cross-DC replication, geo-namespace, and so on. Not have to compromise because my array only does x, and not need multiple massive arrays just because each has its sweet spot. I'd also love to do this without introducing siloed storage (a rough sketch of what I'm imagining follows below). Maybe I'm asking for too much :-)
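If I sketch what I mean in rough control-plane terms, it's something like this (Python, with entirely hypothetical application names, tiers and policy fields of my own making): each application declares a storage policy, and a single scale-out system honours it, rather than one array per sweet spot:

```python
from dataclasses import dataclass

@dataclass
class StoragePolicy:
    tier: str            # e.g. "nvme", "ssd", "capacity" (hypothetical tiers)
    copies: int          # number of synchronous replicas
    cross_dc: bool       # replicate to a second data centre
    geo_namespace: bool  # present one namespace across geographies

# Per-application policies, all served from one notional scale-out system.
POLICIES = {
    "vdi-boot": StoragePolicy(tier="nvme", copies=2, cross_dc=False, geo_namespace=False),
    "oltp-db":  StoragePolicy(tier="ssd", copies=4, cross_dc=True, geo_namespace=False),
    "archive":  StoragePolicy(tier="capacity", copies=2, cross_dc=True, geo_namespace=True),
}

def provision(app: str) -> StoragePolicy:
    """Return the policy an SDS control plane would apply for this application."""
    return POLICIES[app]

if __name__ == "__main__":
    print(provision("oltp-db"))
```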
As I say, just a random observation / thinking out loud; maybe just an excuse to use Polyglot Persistence because it sounds cool, and to use a massive cliché image :-)
More on Polyglot Persistence if you're interested:
http://www.jamesserra.com/archive/2015/07/what-is-polyglot-persistence/
Yup, not new thinking, but I'm hoping technology might be close to catching up with it!
A timely post, and it fully aligns with my thought processes in regard to data management from an infrastructure perspective.