Infrastructure as a Service
This is the fourth article in a series of five describing my journey to build an engineering setup in financial services. Read the preface.
Infrastructure is the foundation of any product or service platform. It spans everything from the lowest level, bare metal, up through servers, storage, and networking. In a cloud setup, all of these resources are abstracted, and you treat them as service components served through either a console or APIs. Hence, it becomes easy to apply the best practices of software development to building your infrastructure.
The infrastructure team was formed following our core guiding principle - X-as-a-service. They are service providers to our feature development squads, platform engineering squads, marketing squads, security/compliance squads, and to themselves. I don’t think the term IaaS team requires an introduction, but let me propose my definition.
Infrastructure as a Service
A team that abstracts the complexities of infra operability (networking, computing, storage, monitoring/observability, scalability, high availability), provides on-demand self-serve capability to bootstrap new infra components, and is responsible for infrastructure support for all environments with a pre-defined SLA & uptime.
Guiding principles
Birth of Snipers
I like initiatives and teams to have names (attractive ones) that reflect their context and responsibilities. I have observed the following benefits:
This is in alignment with Project Aristotle, a research study by Google to discover the secrets of effective teams. It set out to answer the question: what makes a team effective at Google?
Great teams are not driven by free lunches, benefits, or wonderful salaries. Psychological safety, dependability, structure & clarity, meaning, and impact are the five traits of a highly effective team.
Infrastructure is one of the core pillars of the engineering setup at DKatalis, both architecturally and by design. Hence, the name of the team should reflect a set of specialists who are highly skilled in what they do (which doesn't undermine the specialisation of other teams).
Snipers, as the name suggests, are highly skilled and dependable in the job they do. This led to the formation and naming of a team of highly skilled systems engineers who would build cloud-native, highly reliable infrastructure for the bank.
The array of cloud providers
When we started (this was early 2019), one of the important considerations was the data center and where to host the bank's products and services. Based on my experience, I was clear that we should avoid building a physical data center in Jakarta (Indonesia). Unfortunately, the only cloud provider available in Indonesia then was Alibaba Cloud, and none of us had previous experience working on that platform. AWS & GCP were on their way to establishing regional setups in Indonesia, but it was unclear when their operations would start so that we could build our data center on their platforms.
AWS Outposts or GCP Anthos with our own hardware were good options
It was tempting to consider Outposts, as it falls between having a DC and a cloud-native setup. This would also have helped make regulators comfortable, given we were building for a highly regulated entity - a bank (a full cloud setup of a bank in this region is still a BIG thing). But AWS wasn't ready to ship the Outposts hardware by the time we started. Anthos was another option for us to consider, but there were too many unknowns. With a clear intention not to have a physical data center, the only option left was Alibaba Cloud.
It was Alibaba Cloud
With some back-and-forth, and with scale in mind, we settled on Alibaba Cloud as our hosting setup. The guiding principles were indicators of how the team was going to operate. From a cloud provider perspective, Alibaba Cloud is similar to AWS (at least in its cloud platform APIs), and since some of us had exposure to AWS, building the cloud platform felt relatively familiar.
Challenges with Alibaba Cloud
Given the similarity with AWS and the availability of the Terraform (TF) provider for Aliyun, we got off to a quick start building infrastructure as code (IaC). However, Alibaba Cloud was not widely adopted in the Indonesia region, so many expected features, such as the internal DNS service, did not work by default (and this was not documented). After hours of debugging, we concluded that the service was not enabled in the Indonesia region. Raising a ticket with the Aliyun service desk sorted it out in a day. The story was similar for a few other services as well.
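A region-level gap like the internal DNS one can be caught early with a small smoke test that runs right after provisioning. The sketch below is only an illustration of that idea, not what the team actually ran: the hostnames are hypothetical placeholders, and the resolver is injectable so the check can be exercised without touching a real network.

```python
import socket
from typing import Callable, Dict, Iterable, List

# Hypothetical private-zone hostnames; replace with names from your own setup.
INTERNAL_HOSTNAMES = ["db.internal.example", "cache.internal.example"]


def system_resolver(hostname: str) -> List[str]:
    """Resolve a hostname to its IP addresses using the system resolver."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def check_dns(
    hostnames: Iterable[str],
    resolver: Callable[[str], List[str]] = system_resolver,
) -> Dict[str, List[str]]:
    """Map each hostname to its resolved addresses (empty list if resolution fails)."""
    results: Dict[str, List[str]] = {}
    for name in hostnames:
        try:
            results[name] = resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = []
    return results


def unresolved(results: Dict[str, List[str]]) -> List[str]:
    """Return the hostnames that did not resolve to any address."""
    return sorted(name for name, addrs in results.items() if not addrs)
```

Calling `unresolved(check_dns(INTERNAL_HOSTNAMES))` from a host inside the private network and failing the pipeline on a non-empty result would surface a disabled DNS service in minutes rather than after hours of debugging.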
Designing for failure (and hence scale)
While designing the network topology and scale for Infrastructure, it was important to keep in mind learnings from the past. The guiding principles for Infrastructure design were as follows:
Compliance & regulatory requirements
Compliance requirements are one of the important concerns for financial service institutions. The traditional compliance model was designed in a different era and with a different purpose in mind, and it doesn't fit the current expectations of a digital setup. At the same time, the digital setup has introduced bigger, unknown risks. This article is limited to some of the compliance & regulatory requirements for a bank, seen through an infrastructure engineering lens.
The Mega Cloud Migration
We launched the Bank Platform on Alibaba Cloud. But, as I said, other global players were establishing a footing in Indonesia. Although Aliyun was good enough for us, we wanted to migrate to Google Cloud Platform for longevity, stability, and strategic partnership. Any migration activity requires meticulous planning for:
service components, data (integrity, storage, logs, backups, migration, protection, encryption), business drivers, integration to internal & external services, traffic routing, security and threat coverage, performance, scalability, high availability, zero downtime, rollbacks, avoiding vendor lock-in, compliance & regulatory approvals, checklists, rebuilding infrastructure components, runbooks, war rooms, automation (avoid any manual process or step), testing, failing in lower environments, customer communication
The keywords above are very important to keep in mind when planning a migration. With all of that in play, migrating a running bank with transacting customers was a massive undertaking. Jeevan has summarised the whole process in a wonderful article - a must-read.
The Snipers team played a linchpin role in anchoring the migration activity. A new set of TF scripts (providers, provisioners, and service migration) was written for GCP, and the migration was carried out over a span of ~6 months. This included making sure all system components were monitored and alerts were generated in the new cloud setup.
The cloud migration of the bank was completed in less than 6 months with almost no customer impact and no P1 tickets (keep in mind, this was an operational bank with customers transacting on the platform).
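The requirement that every system component be monitored and alerting in the new cloud can be expressed as a simple coverage check before cutover. This is a minimal sketch under stated assumptions: the article does not describe how the inventories were gathered, so the three input collections (all components, those with monitoring, those with alert rules) and every name below are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CoverageReport:
    """Components that are missing monitoring or alert rules."""
    unmonitored: List[str]
    without_alerts: List[str]

    @property
    def ready(self) -> bool:
        # Ready for cutover only when nothing is missing on either axis.
        return not self.unmonitored and not self.without_alerts


def coverage_gaps(
    components: Iterable[str],
    monitored: Iterable[str],
    alerting: Iterable[str],
) -> CoverageReport:
    """Compare the full component inventory against monitoring and alerting inventories."""
    all_components = set(components)
    return CoverageReport(
        unmonitored=sorted(all_components - set(monitored)),
        without_alerts=sorted(all_components - set(alerting)),
    )
```

A migration runbook could block the cutover step whenever `coverage_gaps(...).ready` is false and list the offending components in the war room.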
The service mindset
The majority of product development is happening in a SaaS-first era, hence it is important to build a SaaS mindset (both internally and externally). With X-as-a-service as our guiding principle, we wanted the Snipers team to develop this paradigm from its inception.
For Snipers, the customers were all internal (stakeholders who work within your company and require assistance from another individual or department to get their job done): product development squads, business/marketing squads, and compliance/risk/security squads. All of these squads need infrastructure components to carry out their development and operational responsibilities - their success metrics are closely linked to how effectively Snipers provides capabilities to them.
Some learnings building the service mindset:
Journey continues
The journey to build a true IaaS provider for our internal customers continues. We are working on streamlining our learnings and improving the team.