Big Data and the Cloud
I've been thinking a lot lately about big data, and that inevitably gets me thinking about cloud-based storage. Increasingly, we are all moving to that, whether the cloud is an internal cloud within our organization, or an external cloud hosted by a vendor or solution provider.
What do we mean by a cloud? It's a bit of a nebulous term, but for purposes of this discussion I'll characterize it like this: an IT storage system that is distributed by architecture, by geography, or both, such that where a piece of information is stored is determined not by assigning it to a particular server or disk, but by resource allocations made by the system itself.
If we take this, or something like it, as the working definition of a cloud, we see that it has some immediate characteristics that may be of interest to us:
- If the system is making automatic allocations of disk space and other storage parameters based upon its own calculated needs, a bit of data could wind up any place in the system as determined by whatever algorithm is making the decision.
- A larger block of data could potentially be diced up and parked in more than one location on the system.
- If the system has data redundancy built in, different copies of a data object could reside in different places on the system. Likewise, if it's a large data object that's already been chopped into pieces, the pieces of the multiple copies could potentially be in a great number of places.
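To make those three characteristics concrete, here is a minimal sketch in Python (3.9 or later) of a system that dices an object into pieces and lets a hash decide where each copy of each piece lands. Everything in it is an assumption made for illustration: the node names, the chunk size, the number of copies and the hash-based placement rule are invented, and real storage systems use far more elaborate allocation logic.

```python
import hashlib

# Hypothetical storage nodes; a real system would discover and weigh these itself.
NODES = ["manila-1", "california-1", "frankfurt-1", "virginia-1"]


def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Dice a data object into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(object_id: str, chunk_index: int, copies: int) -> list[str]:
    """Pick `copies` distinct nodes for one chunk, driven by a hash, not by a person."""
    if copies > len(NODES):
        raise ValueError("cannot place more distinct copies than there are nodes")
    digest = hashlib.sha256(f"{object_id}:{chunk_index}".encode()).digest()
    start = int.from_bytes(digest[:8], "big") % len(NODES)
    # Walk the node list from the hashed starting point so the copies differ.
    return [NODES[(start + k) % len(NODES)] for k in range(copies)]


def plan_storage(object_id: str, data: bytes, chunk_size: int = 4,
                 copies: int = 2) -> dict[int, list[str]]:
    """Map each chunk index to the nodes that would hold a copy of it."""
    chunks = split_into_chunks(data, chunk_size)
    return {i: place_replicas(object_id, i, copies) for i in range(len(chunks))}


if __name__ == "__main__":
    plan = plan_storage("invoice-2012-03", b"some larger block of data")
    for chunk_index, nodes in plan.items():
        print(chunk_index, nodes)
```

Even in this toy version, nobody decides where a given piece goes. The algorithm does, and a single small object ends up with pieces on several nodes.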
In a certain sense, this is old news. For a very long time now a computer hard drive has written data to the space it has available, based upon some management algorithm in the operating system and firmware, so that the data object may not be stored in a contiguous fashion. Nor may related data objects be stored anywhere near each other on the disk. So, some of the characteristics of cloud-based storage are really just an old paradigm now being applied in a very much larger storage area that occupies a very much larger physical space.
From an administrative management standpoint, this distributed storage may not in and of itself be problematic. If whatever software is managing all of this is doing its job correctly, the bits and pieces are seamlessly pulled together whenever you need them, and the fact that one piece came from a server in the Philippines and the next one came from a server in California is largely irrelevant. And to the extent that the system employs data redundancy, that's a good thing: losing a chunk of data due to a server failure or some other problem isn't fatal, and may not even be a problem at all. You might not even notice it.
For other purposes, though, this kind of distributed storage can be problematic. One of the most obvious issues is legal compliance. There is a great assortment of laws governing much of that data, including laws governing where you can put it, what you can do with it, who can see it, and how long you can or should keep it. Each one of these poses some sort of issue in anything but the most localized environment, and the more distributed the environment within which the data is stored, the more problematic these issues become. So, although it may not matter for business purposes that some of your data is being parked in the Philippines and some of it is being parked in California, if the data in question comes from Europe, and it falls into one of a number of regulated categories such as tax and accounting information or sensitive information about people, you may find yourself with a legal compliance situation that is very challenging to resolve.
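The location problem can be sketched as a simple check: given a record of where the system put each piece of an object, which nodes sit somewhere the data's category is not supposed to go? The Python (3.9 or later) below is only an illustration. The category names, the region labels and the rule table are invented placeholders, not a statement of what any actual law requires, and the node names are hypothetical.

```python
# Hypothetical rule table: which regions each category of data may be stored in.
ALLOWED_REGIONS = {
    "eu-personal": {"EU"},
    "general": {"EU", "US", "APAC"},
}

# Hypothetical mapping from storage node to the region it sits in.
NODE_REGION = {
    "manila-1": "APAC",
    "california-1": "US",
    "frankfurt-1": "EU",
}


def residency_violations(category: str,
                         placements: dict[int, list[str]]) -> list[str]:
    """Return the nodes holding any piece of the object outside its allowed regions."""
    allowed = ALLOWED_REGIONS[category]
    offending = {
        node
        for nodes in placements.values()
        for node in nodes
        if NODE_REGION[node] not in allowed
    }
    return sorted(offending)


if __name__ == "__main__":
    # Chunk index -> nodes holding a copy, as the storage system might report it.
    placements = {0: ["frankfurt-1", "manila-1"], 1: ["california-1", "frankfurt-1"]}
    print(residency_violations("eu-personal", placements))
```

The check itself is trivial. The hard part in practice is that it depends on the system being able and willing to tell you where every piece actually is.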
There are other challenges as well – if you're implementing a retention and disposition policy in a cloud-based environment, applying it in these many distributed locations is problematic – there are a lot of rocks to look under to make sure you've destroyed every copy of every piece of a data object, and in such environments it may well be impossible to absolutely assure that you've deleted every copy of the data object in question.
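As a rough sketch of what "looking under every rock" means, the Python (3.9 or later) below compares a record of where copies were placed against the deletions that have actually been confirmed, and lists whatever is left. The data shapes and names are assumptions made for illustration, not any vendor's API.

```python
def remaining_copies(placements: dict[int, list[str]],
                     confirmed_deleted: set[tuple[int, str]]) -> list[tuple[int, str]]:
    """List every (chunk index, node) pair with no confirmed deletion yet."""
    return [
        (chunk_index, node)
        for chunk_index, nodes in sorted(placements.items())
        for node in nodes
        if (chunk_index, node) not in confirmed_deleted
    ]


if __name__ == "__main__":
    # Hypothetical inventory: chunk index -> nodes that were given a copy.
    placements = {0: ["manila-1", "california-1"], 1: ["california-1", "frankfurt-1"]}
    # Deletions the system has confirmed so far.
    confirmed = {(0, "manila-1"), (0, "california-1"), (1, "california-1")}
    print(remaining_copies(placements, confirmed))
```

Note the limit of this approach: it can only account for copies that appear in the inventory. Anything the system copied without recording it, such as a backup or a cache, never shows up in the list, which is exactly why absolute assurance is so hard to come by.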
Because of these and many other challenges, it's worth asking yourself some questions, and finding some answers to them, before you move to a cloud-based environment. I'm going to be conducting a webinar sponsored by Archive Systems on March 29, at which we will look at these questions in some detail and discuss some potential answers. The webinar will be food for thought for those of you who have already moved to a cloud-based environment, and for those of you contemplating such a move, it will provide you with some questions to ask of your vendor before you make it. You can sign up for the webinar here:
http://www.archivesystems.com/resources/webinars
I hope to see you there.