Container Security
Full Stack Telemetry & Actionable Observability is the key!
Authored by: Sriram Krishnamachari
I am writing this blog because the past few weeks have posed some tough questions for enterprises, especially for fans of containers (Docker/Kubernetes) and for cyber security teams, following the VFEmail hack. I share my point of view, along with the management lessons & takeaways that CIOs/CSOs should consider as they look to adopt containers/Kubernetes rapidly.
Kubernetes is awesome for its efficient container orchestration. It is well known as a great tool for supporting cloud native and container migration goals, and is estimated to be adopted by over 50-60% of enterprises at varying levels. However, 2018 was the year Kubernetes/Docker faced their first real security attacks, and backdoored malicious images were found on Docker Hub. The major holes, especially the recent runC root exposure, are a good reminder & reason enough to look deeper and question the controls and guardrails that you have in place.
It is a vast topic, with more questions than answers. I intend to touch upon 4 key takeaways that reflect the current state & the aspects you may want to consider:
✓ DevSecOps through Automation Platform
✓ Telemetry and Actionable Observability
✓ Rapidly Evolving Kubernetes Ecosystem
✓ Security is everyone’s responsibility within the Enterprise
➢ DevSecOps through Automation Platforms:
The container vulnerability (CVE-2019-5736) seemingly affected all container platforms that use runC, a standardized runtime that allows the creation and running of containers. The vulnerability apparently affected Docker, Kubernetes, and even Apache Mesos, which does not use runC. It is a case wherein a bad actor can gain control of your host by exploiting privileged containers.
The impact of losing root control is indeed severe and substantial for business operations. The sheer impact potential alone warrants a harder, closer look at your container / cloud security posture:
- What is your business exposure & impact potential with the widened, scaled-up threat surface of containers & cloud? Do you measure your impact minutes?
- What is your resiliency to spring back up? Do you measure your Mean Time to Recover & Resolve (MTTR²)?
- Have you re-baselined your threat surface and exposure controls?
- Hard labor and toil simply cannot scale to keep up with enterprise cloud-scale demands. Do you have the tools and platforms that can help you with resiliency today?
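To make the "impact minutes" and MTTR questions above concrete, here is a minimal sketch of how the two numbers could be derived from incident records. It assumes each incident is simply a (detected, resolved) timestamp pair; the function names and the sample incidents are hypothetical and only for illustration, not taken from any real tool or outage.

```python
from datetime import datetime
from typing import List, Tuple

# Assumption: an incident is a (detected_at, resolved_at) pair of timestamps.
Incident = Tuple[datetime, datetime]


def impact_minutes(incidents: List[Incident]) -> float:
    """Total minutes during which some incident was open (summed per incident)."""
    return sum((end - start).total_seconds() for start, end in incidents) / 60


def mean_time_to_recover(incidents: List[Incident]) -> float:
    """Average minutes from detection to resolution across incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    return impact_minutes(incidents) / len(incidents)


# Hypothetical sample data: a 45-minute and a 135-minute incident.
sample = [
    (datetime(2019, 2, 11, 9, 0), datetime(2019, 2, 11, 9, 45)),
    (datetime(2019, 2, 12, 14, 0), datetime(2019, 2, 12, 16, 15)),
]

print(impact_minutes(sample))        # 180.0
print(mean_time_to_recover(sample))  # 90.0
```

Tracking these two numbers per quarter is one simple way to see whether automation is actually improving resiliency.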
It was indeed a great reminder that cloud providers who provide you with DevSecOps automation tools/frameworks may take only limited responsibility or accountability for the incident & recovery. It is ultimately the responsibility of the enterprise operator to ensure their security posture is current and updated.
➢ Telemetry and Actionable Observability:
Containers run on a shared kernel, which greatly limits the level of isolation, and they require dynamic networking. Both make it harder to have visibility and control over the runtime environment.
Think through the embedded observability that you need to correlate business transactions with the app transactions that cut across the monoliths, microservices, mesh networks and mainframe farms you may have. Are you securing your container farms & runtimes with telemetry and the 3 key tenets of observability: logs, metrics and tracing? Can you take specific actions based on the alerts in real time?
Signature-based and network-perimeter-based intrusion detection is fast becoming a thing of the past, especially for cloud-scale enterprises, given the advanced intrusion patterns they face on their widened threat surface.
- Have you embedded your observability right into your container-to-container (service-to-service) communications, and ensured your apps/services are secure?
- Are you looking into anomalous behaviors in container usage patterns, and do you have the ability to trace them all the way back to the root cause? Signals from infra alone are not enough anymore; they need to cut across services & business transactions.
- As your threat surface has widened, what is your Telemetry & Actionable Observability strategy? Do you have a framework that learns from patterns across the value chain, from detection to prevention, and pushes the boundaries to predict incidents based on in-service & social contextual learning?
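As a small illustration of what "looking into anomalous behaviors in container usage patterns" can mean in practice, here is a minimal sketch that flags outliers in a series of per-container CPU readings using a plain z-score. This is a deliberately simple stand-in technique, not what any particular observability product does; the function name, the threshold and the sample readings are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def find_anomalies(samples: List[float], threshold: float = 2.0) -> List[int]:
    """Return indexes of samples more than `threshold` standard deviations from the mean."""
    if len(samples) < 2:
        return []
    mu = mean(samples)
    sigma = pstdev(samples)  # population standard deviation
    if sigma == 0:
        return []  # all readings identical, nothing stands out
    return [i for i, value in enumerate(samples)
            if abs(value - mu) / sigma > threshold]


# Hypothetical CPU % readings for one container; the 95 is the odd one out.
cpu_percent = [12, 11, 13, 12, 14, 11, 12, 13, 95, 12]

print(find_anomalies(cpu_percent))  # [8]
```

A flagged index is only a starting signal. Tracing it back to a root cause still requires correlating it with the service and business transaction data discussed above.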
➢ Rapidly evolving architectures & ecosystem:
Stateless apps ... Sure! It is a near Disney ride with Kubernetes: fun and safe!
Think hard about how exactly you will run stateful apps with Kubernetes. With StatefulSets, you could surely discover the pod, get it back up and running, and tackle the scheduling problem (i.e. presuming you are still in control of the root :-) ). What about storage orchestration? Storage responsibilities were relegated to the underlying engine until recently, i.e. Kubernetes (1.8), so there are solutions like 'Portworx' to address the white space. Kubernetes introduced the CSI (Container Storage Interface) quite recently, a couple of weeks ago; its adoption still needs to be tested in the real world, and it is still evolving.
With such a rapidly evolving community, how do you plan to secure your sidecars and ensure exposure controls within services? How mature is your SDN for enabling service-to-service communication & fine-grained security policies across clusters?
With open core abstractions around Kubernetes, enterprises must consider that upgrading Kubernetes at runtime, without downtime, is a non-trivial activity, especially in a multi-cluster environment. To keep pace with the releases, you are probably looking at 4 major upgrades a year and 10+ minor updates, as experts point out. Cloud Foundry tooling like BOSH, or a lightweight runtime like CRI-O that is purpose-built for Kubernetes, offers a more hardened path for adoption. Abstractions like Pivotal Container Service (PKS), or any of the 'x'KS offerings from the respective public cloud providers, make it super efficient and easy to enforce guardrails for your multi-cloud, each with its limitations.
➢ Security is everyone’s responsibility:
In digital delivery models, empowering developers is critical, as they are increasingly directly accountable for the experiences they deliver to the customer. They now equally own the responsibility to secure the apps/services & their customer data, and to ensure these are consistent and compliant with enterprise policies.
CSOs/Operators therefore need the capability to code & roll out security through their platform chassis. This includes recognizing attack vector patterns and alerts, & leveraging intelligent detection models that enable the users and owners of the system to stop an attack pre-emptively.
How do CIOs/CSOs enforce guardrails for 1000+ developers spanning multiple teams and clusters, especially as app velocity increases in the cloud native world? How much of it can you automate and patch through elegant platform abstractions? This indeed becomes a critical question to ask.
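As one example of a guardrail that can be automated rather than policed by hand, here is a minimal sketch that scans pod manifests for containers requesting privileged mode. It assumes the manifests have already been parsed into dictionaries that follow the Kubernetes pod layout (`spec.containers[].securityContext.privileged`); the function names and sample pods are hypothetical.

```python
from typing import Any, Dict, List


def privileged_containers(pod: Dict[str, Any]) -> List[str]:
    """Return the names of containers in one pod manifest that request privileged mode."""
    spec = pod.get("spec") or {}
    flagged = []
    # Check both regular and init containers.
    for key in ("containers", "initContainers"):
        for container in spec.get(key) or []:
            security = container.get("securityContext") or {}
            if security.get("privileged") is True:
                flagged.append(container.get("name", "<unnamed>"))
    return flagged


def audit(pods: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map pod name -> privileged container names, listing only offending pods."""
    report = {}
    for pod in pods:
        flagged = privileged_containers(pod)
        if flagged:
            name = (pod.get("metadata") or {}).get("name", "<unnamed>")
            report[name] = flagged
    return report


# Hypothetical manifests: only the "debug" pod asks for a privileged container.
pods = [
    {"metadata": {"name": "web"},
     "spec": {"containers": [{"name": "nginx",
                              "securityContext": {"privileged": False}}]}},
    {"metadata": {"name": "debug"},
     "spec": {"containers": [{"name": "shell",
                              "securityContext": {"privileged": True}},
                             {"name": "sidecar"}]}},
]

print(audit(pods))  # {'debug': ['shell']}
```

In a real cluster, a check like this would more likely be enforced by the platform at admission time than by a standalone script, but the logic is the same: the default is "not privileged", and exceptions are surfaced automatically.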
Platform admins/developers, by embedding full stack observability, must bring in security much earlier in the software development cycle. That means informing the larger community/teams in real time about the nature of attacks, when and where they are occurring, and the targets they are hitting, & shortening the mean time to detect and fix vulnerabilities.
In summary:
All of this points to the need for a well-defined & holistic container security strategy, leaning more and more on efficient automation platforms to execute it. I do see some enterprise teams preferring 'upstream Kubernetes' for all its commercial benefits. But as you make key decisions, do think about Day 2 operations, your security posture as a whole, your journey to the cloud & the implications of downtime, and the telemetry & actionable observability you need to scale en masse.
Security has undoubtedly become a Board-level consideration, as the threat surface & exposure have widened substantially over the past few years & the impact of a breach can be substantial as well.
The good news seems to be that the 3Ps of security are not going away anytime soon ... Patch, Patch, Patch!
Good post, Sriram - having a platform that has security built in is one way to deal with this - our product PKS did a good job of protecting customers (though they could have gone against the defaults and turned on privileged containers, which we turn off by default, in which case they'd have to patch).