TIPS AND TRAPS FOR CLOUD MIGRATIONS, PART THREE (OF THREE)

Context

We discussed the migration to microservices and separate databases in Part I, and the migration of applications from a single datacenter to an active/active configuration in multiple datacenters in Part II. This segment describes our migration to multi-region, active/active cloud providers. Cloud migration is an immense topic, so we will concentrate here on the application impacts. I recommend Cloud Native Patterns for an overview of the many topics in this space.

Vendor Lock-In

(WARNING) Each cloud vendor offers a wide array of features that you can use to build and run your applications. Hidden amongst the offerings are sexy vendor-specific features that seem intentionally designed to make it difficult to move off the platform.

The more of these vendor-specific features you use, the harder it will be to move to an alternate cloud provider. As an architect, why should you care about that?

(TIP) As architects, we are supposed to make decisions that benefit the corporation. Tying our software assets to a single vendor limits our ability to negotiate better cost structures for our compute processing.

It’s a lot easier to negotiate better terms if our applications can be easily ported to an alternate cloud vendor. For example, if I decide that our applications will use an open-source distributed database system on one cloud that is also compatible with other clouds, then my management is in a much better position to negotiate for better terms.

(WARNING) I had this very conversation with our Fearless Leader. Our Fearless Leader not only chose a specific vendor, but we were also instructed to use that vendor’s “lock-in” features if it got the job done faster. Alas. So much for the sorry lot of architects. At least we can have some minor revenge by classifying this as an anti-pattern.

Management Cloud Choice vs Technical Merits

(TIP) There are a handful of very capable cloud providers out there and it helps to understand their similarities and differences. It will come in handy, regardless of what I’m going to tell you in the next warning.

One question to answer early: are they potential competitors? For example, our management would not allow us to work with AWS because they were a potential competitor. Certain cloud providers can be removed from the short list quickly because of management strictures like these.

(WARNING) You can spend an awfully long time researching the best-possible cloud provider for your company. I know that we did. But your management is going to make that decision. And after all that research, you will be lucky if they consult you about the cloud providers’ technical merits. In our case they didn’t ask us for any input.

(TIP) Find out your management's decision, then research THAT cloud provider in detail. Sad, but true. You're an architect, not the CIO or CTO. They have the money. You do not. They own the cloud vendor relationships. You do not. Really, you'll save yourself a LOT of grief and anguish.

If, against all reason, they do want your input, rejoice (briefly) and start researching immediately. You’re not going to have that large a window in which to make your recommendations.

(TIP) Even though this happened to us, I'm glad we did the research. We learned a great deal about the cloud provider space. Especially useful was understanding the gaps in features at our chosen cloud provider; we were prepared to compensate for those shortcomings when the time came. For example, at the time the vendor did not have a generally available Kubernetes implementation, so we knew we would have to put extra effort and resources into that area.

Support – “You Gonna Need It”

Unless you have already hired cloud-savvy technical staff to help with your cloud migration, you are going to need some immediate expertise. (TIP) Do not be afraid to admit that you are going to need help with the migration.

Cloud migrations are immensely complicated, especially if they involve containers and container orchestration (e.g., Docker and Kubernetes), which are difficult to install, configure, and operate.

If you do not have in-house staff with experience setting up these kinds of infrastructure, you'd better plan on getting some serious, professional support.

(WARNING) What I’ve encountered in this open-source age is an attitude that “We have the source; we can support it ourselves.” This is a complete pile of [expletive deleted]. These infrastructure systems are incredibly complex, involving hundreds of modules and millions of lines of code. You’re not going to succeed without some serious support from experienced consultants/companies.

And I say this knowing that any time you bring in a set of vendor consultants, you are relinquishing some of your design independence. Well, get over it and admit that you need the help. You’re still going to learn a lot in the process.

CI/CD and Evolution to Multi-Datacenter

Fortunately for us, the CI/CD program had been maturing for about 6 years before the multi-datacenter program began. 

(WARNING) If you don’t have a mature CI/CD pipeline already, think twice before proceeding to the cloud. We discuss the reason for this next…

(TIP) One lesson from DevOps and their CI/CD experiences that helped all of us immensely is the need to automate everything. Builds, tests, deployments, configurations… everything must be automated. This is your only hope at succeeding with a multi-datacenter roll-out. With hundreds of instances of microservices being deployed to thousands of containers/pods/virtual machines… you cannot do that work manually.

I smell a whole other article in this area. Maybe we can convince our DevOps veterans to write that one…?

So we began our migration knowing that the CI/CD pipeline would require two enhancements. One was minor. The other was major.

Code Deployments

Code deployments were relatively simple once the CI/CD pipeline was in place. All that was then needed for code deployments to the cloud was to add additional deployment targets to the CD platform.

One nuance is the requirement to deploy multiple versions of applications to these endpoints. For example, blue-green deployments require that each deployment endpoint host both the old and new versions of the applications being deployed.

(TIP) If you can, try to get away from blue-green deployments that are controlled by network routers. Although that is the current norm, it requires that you involve network support staff to make it happen. That, in turn, requires that you must schedule these deployments with yet-another external organization.

Ideally, we would have used the internal service-routing capabilities of our container orchestration infrastructure to control blue vs. green routing. That would have limited the impact of our deployments to the application and DevOps staff (who were already working hand-in-hand on deployments) without involving the network staff. We did not get to that point while I was involved in this migration.
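The idea can be sketched in a few lines. This is only a hedged illustration of routing-layer blue/green switching, not our actual implementation: the router class, backend names, and round-robin policy are hypothetical stand-ins for whatever your orchestration layer provides.

```python
# Minimal sketch of application-layer blue/green routing (names hypothetical).
# Both "colors" of an application stay deployed; one atomic switch decides
# which color receives live traffic -- no network-router change is needed.

class BlueGreenRouter:
    def __init__(self, blue_backends, green_backends, live="blue"):
        self.backends = {"blue": list(blue_backends), "green": list(green_backends)}
        self.live = live
        self._next = 0

    def pick_backend(self):
        """Round-robin over the currently live color's backends."""
        pool = self.backends[self.live]
        backend = pool[self._next % len(pool)]
        self._next += 1
        return backend

    def switch(self):
        """Atomically flip live traffic to the other color."""
        self.live = "green" if self.live == "blue" else "blue"

router = BlueGreenRouter(["blue-1", "blue-2"], ["green-1", "green-2"])
print(router.pick_backend())  # a blue backend
router.switch()
print(router.pick_backend())  # a green backend
```

Because the switch lives inside the application infrastructure, flipping colors is a deployment-team action rather than a scheduled network change.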

Datacenter-Specific Configuration

The big adjustment for multi-datacenter deployments is the impact on application configuration. Simple things like database connection properties will vary from application to application, from pod to pod, from region to region, etc.

(TIP) You will need a configuration database (clustered to make it reliable since it’s so critical). We started with etcd since it also sits at the core of K8S and we thought we could leverage our experience and its resilience. Later we switched to Consul at the request (insistence) of the DevOps team. 

(TIP) This switch was easy because our application infrastructure team created an application startup shim that provided a transparent interface between the configuration database and the application properties. Applications had to rebuild and retest with the interface module, but otherwise did not have to change their configuration processing.
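A minimal sketch of such a shim, assuming Consul's KV HTTP API (the endpoint, key names, and fallback behavior here are illustrative, not our actual module): the application asks for properties by name, and where the value lives is hidden behind one function, so swapping etcd for Consul only changes this layer.

```python
import base64
import json
import urllib.request

# Hypothetical local defaults so the same interface works in dev/test
# environments that have no configuration database at all.
LOCAL_DEFAULTS = {"db.host": "localhost", "db.port": "5432"}

def get_property(key, consul_url="http://localhost:8500"):
    """Fetch a property from Consul's KV HTTP API; Consul returns the
    value base64-encoded. Fall back to local defaults when no config
    service is reachable."""
    try:
        with urllib.request.urlopen(f"{consul_url}/v1/kv/{key}", timeout=2) as resp:
            entry = json.load(resp)[0]
            return base64.b64decode(entry["Value"]).decode()
    except OSError:
        return LOCAL_DEFAULTS.get(key)

print(get_property("db.host"))
```

The point of the indirection is exactly what made our etcd-to-Consul switch cheap: applications rebuild against the shim, but their configuration-handling code does not change.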

(WARNING) Since DevOps configuration management is going to manipulate the application properties, the applications had to start using symbolics in their property file templates so that the DevOps CM tooling could process any substitutions required for different instances of virtual machines, pods, databases, etc. I'm not entirely sure that we achieved a completely transparent implementation (i.e., property file templates with symbolics should work in local test environments as well as shared test and production environments). We'll leave that as an exercise for the reader, as they say in the math textbooks.
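As a sketch of what those symbolics might look like (the property names and values are made up), Python's string.Template does the kind of substitution the CM tooling performs: one template, a different substitution map per datacenter, pod, or local workstation.

```python
from string import Template

# A property-file template with symbolics (hypothetical property names).
# The template never changes; only the per-environment value map does.
template = Template(
    "db.url=jdbc:postgresql://${DB_HOST}:${DB_PORT}/orders\n"
    "cache.nodes=${CACHE_NODES}"
)

east_values = {"DB_HOST": "db.east.example.com", "DB_PORT": "5432",
               "CACHE_NODES": "cache-east-1,cache-east-2"}
local_values = {"DB_HOST": "localhost", "DB_PORT": "5432",
                "CACHE_NODES": "localhost"}

print(template.substitute(east_values))   # shared/production rendering
print(template.substitute(local_values))  # same template, local test rendering
```

Rendering the same template with a local value map is what makes the scheme transparent in local test environments as well as shared ones.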

Touchstone for True Cloud Applications

In one word: AUTOSCALING

You should be able to add (or subtract) capacity (i.e., virtual machines) and utilize that new capacity immediately. Having to manually configure new capacity means that opportunities to handle volume spikes and lulls will be missed.

Leveraging the services feature of Kubernetes is an example of how operations can expand or contract the capacity of an application. (TIP) Each application needs to be tested beforehand to ensure that autoscaling will work. Don’t assume that the load balancing features of K8S’ services will work for every application. Infrastructure tests should include adding or taking away containers or pods to check that there is an automatic adjustment to the available capacity.

Global Load Balancing

Now, this is more of a pure infrastructure concern, but still one that requires a lot of up-front preparation and design. How will you load balance all your network traffic between datacenters?

Ideally you want to balance the workload based on the up-to-the-minute capacity at each of the datacenters. Even more finely grained, you'd ideally want to do this on an application-by-application basis. (TIP) Start simple. We used a round-robin approach. (WARNING) With round robin, we expected a 50/50 mix of request traffic between the two datacenters and spent too much time trying to figure out why we never hit 50/50. I'd advise your management not to worry about it.
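A toy simulation (with entirely made-up session lengths) shows why session-based round robin rarely produces a 50/50 request mix: sessions are assigned to datacenters evenly, but sessions vary wildly in how many requests they generate.

```python
from itertools import cycle

# Made-up requests-per-session numbers; real traffic is even more skewed.
session_lengths = [1, 12, 3, 7, 2, 25, 4, 6]
requests = {"dc1": 0, "dc2": 0}

# New sessions alternate between datacenters (round robin), but each
# session drags its own request volume along with it.
for dc, length in zip(cycle(["dc1", "dc2"]), session_lengths):
    requests[dc] += length

total = sum(requests.values())
for dc, n in requests.items():
    print(f"{dc}: {n} requests ({100 * n / total:.0f}% of traffic)")
```

Here each datacenter receives exactly four sessions, yet the request traffic splits roughly 17/83; the session split and the request split are simply different metrics.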

Quiescing Traffic to a Datacenter

I’ve been focusing on emergency failover so far. What if you want to take a datacenter out of rotation for planned maintenance? We introduced a change to our CDN that allowed us to stop the creation of new sessions on the datacenter that we wanted to rotate out of service.

For us, this quiesce process usually took 10-15 minutes. Typically, after that timeframe, there are no more active sessions at that datacenter, and we could proceed with the maintenance.
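The drain step can be sketched as a simple polling loop (the function names and timings here are hypothetical, not our actual tooling): once the CDN stops sending new sessions to the datacenter, poll the active-session count until it reaches zero or a timeout expires, then proceed with maintenance.

```python
import time

def quiesce(active_session_count, poll_seconds=30, timeout_seconds=15 * 60):
    """Poll until the datacenter has drained (no active sessions).
    Returns True when drained, False if the timeout expires first."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if active_session_count() == 0:
            return True
        time.sleep(poll_seconds)
    return False

# Example with a fake counter that drains after a few polls:
remaining = [3]
def fake_counter():
    remaining[0] -= 1
    return remaining[0]

print(quiesce(fake_counter, poll_seconds=0))  # True
```

The 15-minute default timeout mirrors the 10-15 minute drain window we typically observed; a False return is the signal that some sessions are stuck and need investigation before maintenance starts.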

The specifics of how you will implement that will depend on the routing technology you use. (TIP) Since this will have to be a global routing function (i.e., logically “above” all your datacenters), your CDN will be a good place to start your research.

I Welcome Your Comments

I’d like to thank Ramanuj Bhardwaj (now at Google) and Tony Hsieh (not-the-Zappos-Tony, the one at Blue Shield) for reviewing this article. Of course, all the inevitable errors are my own. I’m reminded of the saying: “Experience isn’t the best teacher; it is the only teacher.”

And, yes, call me “lazy.” I should have written this last year. Hey! What good is retirement if you can’t kick back??? 😊
