To cloud or not to cloud - that is the question.
Moving services to the cloud has many benefits - scale, availability, security - but cost should rarely be one of them, even though it is often cited as a reason. It’s not usually the initial sticker shock that gets people with cloud costs, but the gradual “boil”, like the proverbial frog in the pot.
When the heat gets too much, the saying goes, get out of the kitchen. There have been opinions out there that moving workloads from the cloud back to on-prem (or rented/owner-operated datacenter space) - a move known as “repatriation” - can save upwards of 30%. Sarah Wang and Martin Casado, whom I very much admire, wrote one of the seminal articles on this, and looking around LinkedIn or Twitter one can see many advocating for the same (DHH being one of the most recent I’ve seen). But I don’t buy that anyone beyond a select few can do it and actually save money.
Let’s first state something very obvious - CSPs (cloud service providers - AWS/Azure/GCP) are for-profit entities, and they don’t do what they do altruistically. It’s hard to get specific numbers about their costs, but looking at those companies’ financial filings, the profit margins on their cloud business units hover around 30%. So, if you feel that you can do the same as Amazon, Microsoft, or Google for a third less, then go for it.
CSPs have amassed a huge amount of know-how, talent, and partnerships in building out and operating their clouds. Everything from land deals for the data center footprints where all that hardware will go, to acquiring the power (including backup generators and resupplying them with fuel when needed) to run all that hardware, to supply chains that deliver enough hard disks, CPUs, etc. to build (and maintain/replace) servers, to hiring (and retaining) engineers with the experience and skill to develop and operate services and infrastructure that scale reliably - these are all finite resources you’d be competing with the CSPs for.
The cloud is a great place to build new services - the ability to get resources on demand and use the PaaS/SaaS building blocks CSPs provide to develop quickly, all on the VC’s credit card, has helped so many startups. The elasticity of the cloud - the ability to “rightsize”, growing (and shrinking) as needed - is a huge benefit when you’re building something for the first time. But once you have a handle on the size (and operating costs) of a given workload, does it make sense to keep running it in the cloud? Can, or should, you leave these workloads there and continue stuffing others’ coffers, or do you go it alone?
Let’s take one of the examples cited in the article above - Dropbox. They have a very specific product offering: data storage. Surely they could write their services to run on “bare metal”, and they are at a scale where even a small percentage saved through “repatriation” from the cloud makes a material difference.
Maybe they wouldn’t go all out and build physical datacenter(s) themselves, but instead rent space from one of the many “co-lo” facilities, sidestepping the power and network connectivity concerns (although even then, they would still have to negotiate a share of what is available). But they would then have to build and operate the hardware and manage the supply chain. Hard disks and CPUs fail at surprising rates at scale, and replacements have to keep pace. Most of the time that should be straightforward, but when there’s a crunch in hard-disk or CPU availability, you’d need supplier relationships and commitments as good as Amazon’s, Google’s, or Microsoft’s to get your orders fulfilled - doubtful. This doesn’t even consider the operators needed to go in and actually perform the maintenance on those servers.
Storage systems aren’t easy to write at high scale and reliability, but there are mature examples out there whose authors have at least written up their approach, which can minimally be taken as “inspiration” - so there’s not necessarily a lot of “invention” required (though certainly some “innovation”), and that takes some of the difficulty away. That’s only one part of the software stack, though; there’s also routing, queueing, load balancing, etc. Again, there’s open source software out there, but someone has to configure, test, and operate it - which CSPs (mostly) do for you.
Also, consider security - you’d have to staff and train not only engineering, but also monitoring, threat hunting, incident response, and compliance, and keep up with all the new TTPs (tactics, techniques, and procedures) that attackers are using. Not just at the application and service layer (which you have to take care of under the shared responsibility model all CSPs have anyway), but all the way down to the infrastructure and physical levels. Microsoft, as an example, spends over a billion dollars a year on security R&D and sees all kinds of attacks across its businesses - so who would I bet on to see and know more, and thus be better able to detect and respond? It’s either your CSP, you, or some 3rd party with expertise (e.g. Crowdstrike et al).
Overall, you are getting much more than a place to run containers or store bytes when running in the cloud. The shared responsibility models mentioned earlier show this very clearly: everything on those lower “orange” levels is the CSP’s to worry about, not yours. And it’s not just the “infrastructure” the CSPs provide, but large chunks of the higher levels as well. For example, IAM is the customer’s responsibility to configure, but its correctness, operation, and security are the CSP’s - no small feat, and a building block given to you for “free”.
The cloud is expensive, there’s no getting away from that, but you have to be pretty sophisticated and well past a given scaling point (and maturity) to start thinking about doing it yourself.
Let’s go back to DHH’s post, as he gives very clear numbers (and I thank him for that). Yes, he gets all his hardware, plus the space, power, and I assume network bandwidth, for $1.5 million less per year than his current cloud bill. But there’s nothing in there about budgeting for hardware replacement - say $150,000 per year (the 41-month average MTTF from the article linked above implies replacing roughly a third of the fleet annually; call it an even 25% of the original $600,000 hardware outlay) - nor for 2x operators to do the work ($50,000 fully loaded each), bringing us to $250,000. Then add, say, just four more engineers to cover infrastructure, security, monitoring, and compliance - that’s another $1,000,000. This leaves $250,000 in savings/profit/margin, and I’m sure there are many more things I’ve forgotten or left off that could easily eat that up. The nice thing, though, is that he gets to depreciate that hardware outlay, moving those costs from OpEx to CapEx on the balance sheet (something CFOs have spent the last few years trying to do the opposite of, so they have more short-term cash on hand to play with - but maybe the pendulum is swinging the other way now).
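The arithmetic above is easy to lose in prose, so here it is as a back-of-the-envelope sketch. Every dollar figure is an assumption from the discussion above (DHH's claimed savings, the 25% replacement rate, my guesses at fully-loaded salaries), not measured data:

```python
# Back-of-the-envelope check of the repatriation math above.
# All dollar figures are assumptions carried over from the article.

cloud_savings = 1_500_000        # claimed annual savings vs. the cloud bill
hardware_outlay = 600_000        # one-time hardware purchase

# Annual replacement budget: ~25% of the outlay, a rounded-down proxy
# for the ~41-month hardware MTTF cited above.
replacement = 0.25 * hardware_outlay           # $150,000/year

operators = 2 * 50_000                         # two operators, fully loaded
extra_engineers = 4 * 250_000                  # infra, security, monitoring, compliance

remaining = cloud_savings - (replacement + operators + extra_engineers)
print(f"Remaining margin: ${remaining:,.0f}")  # Remaining margin: $250,000
```

Change any one of those assumed salaries or the replacement rate and the $250,000 cushion evaporates quickly, which is the point.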
My advice to anyone is to look at and manage your cloud costs first - make them predictable, and know what (and where) you are paying for the things you need versus what comes along for the ride. There are some dark arts to that (as Corey Quinn and The Duckbill Group will attest), and only then can you get an idea of what may be stable enough to repatriate. But even if you do, remember that built into that fraction of a cent per GB, or those pennies per vCPU-hour, is a whole lot of intangibles that must be accounted for if you’re going to go it alone.
You make good points. The top 3 things that are inherently cheaper about a cloud datacenter (not necessarily *priced lower*, but cheaper to run) are: 1) economies of scale - it's easier to provide, say, backup power for a 100MW datacenter *per Watt* than it is for a 5MW datacenter. 2) uncorrelated loads - it's easier to 'use' all of the (usually mostly idle) resources of a server when you have many uncorrelated loads that you can spread across the same resources. This makes the 'peak' usage you have to size for closer to the average than it would be with correlated loads. 3) all the stuff you mentioned about expertise and specialty people and whatnot. Now, when you buy from a cloud provider, you get things like "5 minute provisioning" and "redundant network connections" and a PUE of 1.1, and so on. If you tried to replicate that yourself, it would cost at least 2x as much. Whether cloud provider *prices* are actually reasonable has everything to do with competition. We know the marginal and average costs of providing the resources are way lower. As long as there is at least some competition (and there is), customers will realize at least part of those cost savings.
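The "uncorrelated loads" point above can be illustrated numerically. The simulation below uses a synthetic, purely illustrative workload of my own (mostly idle servers that occasionally burst to full utilization): when 100 such independent loads share a pool, the observed peak of the *combined* load sits far below the sum of the individual peaks each owner would otherwise have to size for:

```python
# Illustration of statistical multiplexing with uncorrelated loads.
# Workload shape (5% chance of a 100-unit burst, else 10 units idle)
# is an invented assumption, not data from any real datacenter.
import random

random.seed(42)
n_servers, n_samples = 100, 10_000

def load():
    """One workload's instantaneous demand: rare bursts, mostly idle."""
    return 100.0 if random.random() < 0.05 else 10.0

# Observe the peak of the combined load over many independent samples.
peak_combined = 0.0
for _ in range(n_samples):
    total = sum(load() for _ in range(n_servers))
    peak_combined = max(peak_combined, total)

sum_of_peaks = n_servers * 100.0                      # sizing each load alone
avg_combined = n_servers * (0.05 * 100 + 0.95 * 10)   # expected total = 1450

print(f"sum of individual peaks:  {sum_of_peaks:.0f}")
print(f"peak of combined load:    {peak_combined:.0f}")
print(f"average of combined load: {avg_combined:.0f}")
```

The combined peak lands only modestly above the 1450-unit average, while sizing each workload for its own peak would require capacity for 10,000 units - the gap a large provider gets to pocket (or price against).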
Great write-up, Mike. On “cloud is expensive, there’s no getting away from that, but you have to be pretty sophisticated and well past a given scaling point (and maturity) to start thinking about doing it yourself” - it seems like new services/companies are in a catch-22. Invest in the overhead of infrastructure, with the hope of finding or building the expertise and capabilities into the foundation of the design… or go with a CSP and forever sacrifice 30% of revenue in cost, but take advantage of the fact that CSPs have a deep bench of talent and years of experience doing it. Someone once told me, “well, just write your architecture in a way that it can move, ya know… Terraform”… good for moving between the big CSPs, but going into a private cloud, less so. Now you’ve got solutions like Arc trying to present a cloud bridge/hybrid cloud… but still locking a business into a specific mode of operation. When thinking about businesses making the call on what to do, do you feel that industry plays a heavy role in the decision (driven at times by the associated compliance and regulatory costs)?
First, if you're old, I'm retiring. Second, I appreciate this article, because it's pointing at things that are much more subtle than the headlines. Finally, while I love Ruby, I find DHH detestable - I almost couldn't get over that to finish.
Thanks Mike - learned from reading this.