What could AWS Improve?
In the shadow of the AWS Public Sector Summit in Washington, DC, a friend asked, "Is AWS perfect? What improvements could they make to be great?" AWS holds 33% of the cloud market and generates $5.4B in cloud sales annually, so they must be doing most things right. While visiting Europe soon after college, one of many trips to the Vatican yielded an interesting encounter with a role that seemed to exist only in hyperbole. The Congregation for the Causes of Saints staffs an official position: the Devil's Advocate. The person holding this role prepares a case against anyone the Pope puts forward for sainthood. Individuals up for beatification lived exemplary lives, yet the Vatican, a monarchy, still hears opposing views and designates a lawyer to present them.
When asked what AWS could do to improve, this role of Devil's Advocate came to mind. How can one argue fault with AWS? One would only find issues by running workloads at scale, under constant attack from nefarious actors. While one could easily list hundreds of features to love about AWS, a few areas for improvement remain. For any Amazonians who have overcome these, please leave feedback: with the breadth, flexibility, and ever-evolving services, there may already be features or alternative configurations that address these challenges. As Maya Angelou put it:
> Do the best you can until you know better. Then when you know better, do better.
When rolling out AWS at scale for public sector clients, one quickly realizes the tremendous risk agencies face from bad actors. Agencies also use their own server names and must resolve both cloud and on-premises assets seamlessly in a hybrid environment. They further require the agility to integrate into DevSecOps processes so they can deliver services to citizens faster, as championed by the United States Digital Service. Its founder, Mikey Dickerson, advocated for cloud adoption and the increased velocity that accompanies it.
The three improvements AWS could make to help agencies include:
- Integrate IAM to authenticate OS users. One can create key pairs to authenticate users accessing instance operating systems, and these keys provide strong encryption and protection. However, users typically save these keys on their desktop computers, and often insecurely in S3 buckets or GitHub repositories (ugh!). Desktops survive in a hostile environment with direct exposure to email and web browsing, the top two threat vectors into any enterprise, and stealing keys from a desktop can be done with relative ease. By default, OS users provisioned at instance creation do not require passwords, so the key represents a single, easily stolen factor of authentication. While agencies will typically join servers to an Active Directory domain or deploy their own IDM tools, AWS provides a robust IAM service with easy integration with CAC, PIV, Active Directory, and OAuth. Why not leverage this powerful tool to push these identities to provisioned instances by default, while still allowing agencies to run their own IDM should they desire? Extending an agency's IDM to the cloud typically requires a risk assessment and slows down the stand-up process, and one can never guarantee that all provisioned assets follow the enterprise's policy.
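The identity push described above can be approximated today through instance user data. Here is a minimal sketch, assuming the agency pulls usernames and SSH public keys from whatever IDM source it trusts (the helper name, user names, and key material are all hypothetical), that renders a cloud-init document creating OS users at provision time:

```python
def build_cloud_config(users):
    """Render cloud-init user data that creates OS users with centrally
    managed SSH public keys, so no long-lived private key sits on a desktop.

    `users` is a list of (username, ssh_public_key) pairs pulled from the
    agency's IDM source (hypothetical)."""
    lines = ["#cloud-config", "users:"]
    for name, pubkey in users:
        lines += [
            f"  - name: {name}",
            "    ssh_authorized_keys:",
            f"      - {pubkey}",
        ]
    return "\n".join(lines)

# Example: two hypothetical users whose keys come from the agency IDM.
doc = build_cloud_config([
    ("alice", "ssh-ed25519 AAAAC3...example alice@agency.example"),
    ("bob", "ssh-ed25519 AAAAC3...example bob@agency.example"),
])
print(doc.splitlines()[0])  # "#cloud-config"
```

The rendered document would be passed as user data at launch; rotating a key then means re-rendering and reprovisioning rather than chasing down .pem files.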
- Domain Name Service can be a challenge to integrate in a hybrid environment. AWS requires the instance ID when calling AWS APIs to make changes, so tools like Ansible and other DevSecOps tooling need it too. One must link the instance ID to the hostname (FQDN) to uniquely identify the host across different tools. Ideally, AWS would let users pass in the customer's FQDN and use the DNS features of the VPC to derive the instance ID. Even looking up the AWS FQDN is complicated because of how access to the internal horizon of Route 53 works: a DNS query against the internal horizon must originate from within your account to resolve any of the private names. This means one must provision a DNS forwarder in each AWS account in the enterprise to resolve private entries, and since one typically demands redundancy, this becomes a lot of infrastructure. Route 53 should allow an ACL that lets an enterprise's DNS servers resolve the internal horizon when explicitly permitted. Separately, when implementing the best practice of ubiquitous encryption, one receives certificate errors on load balancer connections to destination services in the account because those services use enterprise-signed certs. Most load balancers have a feature that lets one set a "host header" to eliminate these errors. For example, if my enterprise certificate server is authoritative internally for "mycompany.local" and I have web servers with a certificate for www.mycompany.local, I should be able to tell the load balancer to send www.mycompany.local when connecting to the destination services, and ideally to error if it cannot connect using that certificate.
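The per-account forwarder workaround mentioned above has to target the Amazon-provided resolver inside each VPC, which AWS places at the base of the VPC's network range plus two (e.g. 10.0.0.2 for a 10.0.0.0/16 VPC). A small helper for deriving that address, assuming you know each account's VPC CIDR:

```python
import ipaddress

def vpc_resolver_address(vpc_cidr: str) -> str:
    """Return the Amazon-provided DNS resolver address for a VPC.

    AWS reserves the network base address plus two of every VPC CIDR for
    its internal resolver; a forwarder in that account must send queries
    for Route 53 private-zone names to this address."""
    network = ipaddress.ip_network(vpc_cidr)
    return str(network.network_address + 2)

print(vpc_resolver_address("10.0.0.0/16"))    # 10.0.0.2
print(vpc_resolver_address("172.31.0.0/16"))  # 172.31.0.2
```

Each redundant forwarder pair in each account then forwards the private zones to this address, which is exactly the infrastructure sprawl the bullet complains about.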
- Marketplace add-ons could be improved. Another popular cloud provider has the concept of extensions one can set in instance metadata to provision marketplace items; one sets values like the unique key the service requires to register the add-on into a user's console. The competition also allows these services to ride the internal network, which keeps DevSecOps configuration changes from removing the ports and ACLs critical to communicating with the console. When users muck with those ACLs or security groups, they often break critical security services like scanning agents, anti-virus, or host intrusion detection systems, and security teams constantly have to run down connectivity issues with disparate teams. AWS could allow these services to communicate like DNS or NTP, traversing a private AWS network seamlessly to ensure connectivity.
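Until AWS provides such a carve-out, one defensive pattern is to lint proposed security-group changes in the DevSecOps pipeline before applying them. A minimal sketch, assuming the agent reaches its console over TCP 443 (the port, function name, and rule shape are all hypothetical):

```python
AGENT_PORT = 443  # hypothetical port the security agent uses to reach its console

def breaks_agent_connectivity(egress_rules):
    """Return True if a proposed egress rule set would cut off the
    security agent (scanner, AV, HIDS) from its console.

    Each rule is a dict like:
      {"protocol": "tcp", "from_port": 0, "to_port": 65535}"""
    return not any(
        rule["protocol"] == "tcp"
        and rule["from_port"] <= AGENT_PORT <= rule["to_port"]
        for rule in egress_rules
    )

# A proposed lockdown that forgot the agent's port would be flagged:
proposed = [{"protocol": "tcp", "from_port": 22, "to_port": 22}]
print(breaks_agent_connectivity(proposed))  # True
```

Failing the pipeline on such a rule set is cruder than a protected AWS-internal path, but it at least catches the breakage before the scanner goes dark.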
Those represent the top three issues encountered so far that appear to still exist. Inter-region VPC peering restrictions used to be on this list, but AWS fixed that, and they seem to be moving more and more services into the VPC that once existed only outside it. They should also consider more robust networking, with stateful ACLs on VPCs as well as better support for high-availability clustering of network devices and servers. Gratuitous ARPs take 20 seconds to a minute to fail over devices, at which point devices like firewalls drop sessions. AWS recommends deploying these as separate devices split across availability zones for redundancy, but that means when a device fails, one loses half the environment, so one must build in extra auto scaling to account for it. These would probably be harder to fix.
The postings on this site are my opinions and do not necessarily represent CGI’s strategies, views or opinions.