Network Function Decomposition
The industry has been driving towards network function virtualization. This has disaggregated hardware from software, enabling the use of COTS hardware for network functions. Network function virtualization (NFV) has been a net positive for operator business: it opens up the possibility of leveraging cloud agility and cloud economics in the network domain.
But have the benefits of NFV been fully realized? Custom hardware has been replaced with x86 COTS servers. COTS hardware may be less costly, but it comes with accelerated upgrade cycles, and hardware utilization may not be optimal. Let me explain why.
What’s missing in Virtual NFs
First, current VNF deployments are virtual machine based, and VMs carry OS overhead. Cloud stacks built on OpenStack are also famously bloated.
Even though network functions have been virtualized, these network functions can in reality be large monoliths composed of many functionalities/modules. When it comes to scaling, the most resource-intensive module determines scale-out. This means that when we scale out a network function, we scale out not only the modules that needed more capacity, but also those that did not.
Their monolithic architecture tightly couples the modules inside the network function. This limits a DevOps team's flexibility in introducing code changes, and deployments remain a major undertaking. One small code commit in one module may cause unforeseen problems in a related module. Hence, efficient DevOps CI/CD cycles are not the prevalent experience. Deployment practice is still that of traditional telco software: code is released by vendors, regression tested, then field tested, and only after a thorough First Office Application (FoA) do we finally reach commercial deployment. The difference is that we are now deploying on COTS hardware, and perhaps the mechanics of deployment are more efficient once we have a final good build of the software. Deployment of VNFs continues to be a high-risk activity.
MME Example
Let’s take an example from LTE core network domain. MME is a key network function for LTE. It is the control plane node that performs some important functions like:
Network Access Control, Session management, GW selection and Mobility management.
A commercial MME pool can handle millions of connections to the LTE network. Each UE has to be authenticated and authorized, its sessions need to be managed, and any mobility not handled by the RAN needs to be handled in the MME. Some of these functions are concurrent and require capacity scaling; others are less demanding. Within a vMME VNF, these different modules have different scaling requirements. When we run out of capacity on a vMME, it may be because the session management module has run out of resources. We may need to scale out that module, not the GW selection module. But with a single VNF for the vMME, we are forced to scale everything out.
Let's say our operator wants to introduce Cellular IoT. The operator expects the number of subscriptions to double, which affects the session management and network access modules; however, the impact on mobility management is imperceptible, since most IoT devices are stationary. If a doubling of vMME instances is needed, a large portion of the code, related to mobility management, may rarely be executed.
Let's look at a hypothetical scenario. The operator comes up with a new service called Secure Identity, which allows users to securely log into apps and even authenticate their identity over the web. This new service forces a rewrite of the access control module that violates many dependencies in the session management module. The dependencies run so deep that a separate identity service module has to be written, with its own dependencies on access control and session management. The underlying architecture is not elegant, but the service is what everyone needed; it is selling like hot cakes. The problem now is having to maintain the new module separately. We need to scale out even more vMME instances, because we added a new module to all existing and new instances. With more dependencies and more VNFs to manage, friction results and the velocity of innovation suffers.
Now suppose the operator's marketing department determines that overseas demand for this Secure Identity service could be several times the domestic demand. The product team wants to build a Secure Identity product for overseas customers. The technical team determines that they could reuse most of the code from the access control module; however, they don't need anything else from the vMME for this overseas service. Since there isn't a way to decouple the access control module from the vMME, the technical team concludes that provisioning an overseas Secure Identity service is not possible under the current cost constraints. And so a lucrative business opportunity is lost.
Towards Network Function Decomposition
These are the problems associated with our tightly coupled monolithic VNFs. We are operating in the cloud but cannot take full advantage of cloud economics and agility.
The solution to these problems is decomposition of large monolithic VNFs into their constituent modules. These modules can be developed, deployed and scaled independently, using modern DevOps practices. Let's call these modules constituent VNFs, or just CNFs. There are some important considerations.
How to chop up the monolithic VNF
The principle is separation of concerns, with self-contained, loosely interacting domains. We divide the functionality of our monolithic VNF into CNFs, each forming a well-defined chunk of functionality. We strive to eliminate dependencies between CNFs, so we can continually develop, integrate and deploy each CNF independently of the others. Take the MME example: session management is quite distinct from mobility management, so the two can form separate CNFs. In fact, for some use cases, such as Fixed Wireless Access, the load on the mobility management CNF may be so low that only minimal instantiations of it are required.
Figure 1: MME VNF split into multiple CNFs
In the early days of implementation, the CNFs are likely to be flat and high level. With time, there will be opportunities to decompose CNFs further, perhaps into a hierarchy of microservices for each initial CNF. For instance, session management could be split into session management for LTE and session management for NB-IoT. Session management for LTE could in turn be split into default bearer setup, dedicated bearer setup, and so on. Note that creating new microservices from CNFs increases overall complexity, so it should be undertaken with a view to balancing costs and benefits, and with sound business and technical rationale.
Figure 2: VNF decomposition can proceed towards increasing depth
Handling CNF communication
Of course, CNFs may need other CNFs to complete the task at hand. How do we let one CNF invoke procedures in another CNF while still avoiding dependencies? One way is to use the popular distributed-systems messaging pattern. Here CNFs operate as microservices, each responsible for performing certain tasks. All inter-CNF communication happens asynchronously over a microservice bus. Each CNF can publish messages on the bus, as well as subscribe to certain messages. A CNF publishes a message when it needs the system to do something for it. A CNF also subscribes to the messages relevant to the task it is assigned to handle, picks them up from the bus, and performs the required processing. A subscribed message was probably posted by another CNF, so the consuming CNF publishes a completion message, with success or failure, to inform the publisher. The messaging pattern decouples CNFs and eliminates dependencies: CNFs work by publishing messages and listening for asynchronous responses, as well as subscribing to messages for the tasks they handle.
Figure 3: CNFs communicate with other CNFs via messages on a microservices bus
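The publish/subscribe flow above can be sketched in a few lines. This is a minimal in-process illustration, not a real MME implementation: the bus class, topic names and the `SessionMgmtCNF` handler are all assumptions made for the example, and a production system would use an actual message broker.

```python
# Sketch of inter-CNF communication over a microservice bus.
# MicroserviceBus, topic names and SessionMgmtCNF are illustrative
# assumptions; in production the bus would be a real message broker.
from collections import defaultdict

class MicroserviceBus:
    """In-process stand-in for the microservice bus."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the message to every CNF subscribed to this topic.
        for handler in self.subscribers[topic]:
            handler(message)

class SessionMgmtCNF:
    """Consumes session setup requests and publishes completion messages."""
    def __init__(self, bus):
        self.bus = bus
        bus.subscribe("session.create", self.on_create)

    def on_create(self, msg):
        # ... perform session setup work here ...
        self.bus.publish("session.created",
                         {"ue": msg["ue"], "status": "success"})

bus = MicroserviceBus()
SessionMgmtCNF(bus)

# The requesting CNF listens for the asynchronous completion message.
results = []
bus.subscribe("session.created", results.append)
bus.publish("session.create", {"ue": "imsi-001"})
```

The requesting CNF never calls the session management CNF directly; it only sees the completion message on the bus, which is what keeps the two decoupled.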
The asynchronous communication architecture of a microservice bus with message publish/consume is a crucial piece of the microservices architecture. It allows looser coupling, enables parallel processing and reduces cascading failures.
There may be multiple instances of a CNF running at any one time. Cloud platforms such as Kubernetes allow auto-scaling of CNFs. A proxy may be placed in front of a group of CNF instances to ingest the messages intended for them. The proxy simplifies service/CNF discovery and also load balances across CNF instances.
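The proxy role can be sketched as a simple round-robin dispatcher. `CNFProxy` is a hypothetical name for illustration; in a real deployment a Kubernetes Service or a service mesh would play this role rather than hand-written code.

```python
# Sketch of a proxy that load-balances messages over CNF instances.
# CNFProxy and the instance handlers are illustrative assumptions.
import itertools

class CNFProxy:
    """Ingests messages for a CNF group and spreads them over instances."""
    def __init__(self, instances):
        self.instances = instances
        self._next = itertools.cycle(range(len(instances)))

    def ingest(self, message):
        # Simple round robin; a real proxy might pick the least-loaded
        # instance or respect session affinity instead.
        return self.instances[next(self._next)](message)

# Three instances of the same CNF; each just reports its index here.
instances = [lambda msg, i=i: i for i in range(3)]
proxy = CNFProxy(instances)
dispatch_order = [proxy.ingest(f"task-{k}") for k in range(4)]
```

Because callers only know the proxy, CNF instances can be added or removed by the auto-scaler without any other CNF noticing.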
Do we need a master orchestrator CNF?
Not really. If all the CNFs in the system are implemented as microservices working with a common microservice bus, we don't need a central CNF orchestrating the flow of tasks between them. The sequence of message publications and consumptions carries the overall state of the system forward. Not having a master CNF also enhances the resiliency of the system. If there were a master CNF and it failed, it would take down the entire system with it (or part of the system capacity, if the master were duplicated). In our case, with no master CNF, if a CNF crashes we move the workload to another CNF and, if needed, instantiate a new one. The state of the system survives: since the crashed CNF never publishes its completion message, the requesting CNF retries. Each CNF is stateless in itself; state is maintained by the sequence of message flows between CNFs.
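The retry behaviour described above can be sketched as follows. `dispatch_with_retry` and `flaky_publish` are hypothetical names invented for this example; the point is only that a missing completion message, not a central orchestrator, is what triggers recovery.

```python
# Sketch of retry-on-missing-completion, assuming the publish call
# returns the completion message, or None if the handling CNF crashed
# before publishing it (an illustrative simplification of bus timeouts).
def dispatch_with_retry(publish, task, max_attempts=3):
    """Publish a task and retry until a completion message arrives."""
    for attempt in range(1, max_attempts + 1):
        result = publish(task)
        if result is not None:
            # Completion message received; system state moves forward.
            return result, attempt
    raise RuntimeError("task not completed after retries")

# Simulate a CNF instance crashing once, then a fresh instance succeeding.
attempts = {"n": 0}
def flaky_publish(task):
    attempts["n"] += 1
    if attempts["n"] == 1:
        return None  # crashed before publishing completion
    return {"task": task, "status": "success"}

result, attempt = dispatch_with_retry(flaky_publish, "attach-ue")
```

No state is lost in the crash because the requesting CNF holds the task until it sees a completion message, which is exactly why the individual CNFs can stay stateless.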
A key-value logging database may be implemented alongside the microservice bus to log all messages, with the CNF ID as the key, plus the process ID and a timestamp. This would allow retracing the processing steps to an arbitrary point in time. In such a case, the database acts as a store of system state.
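A minimal sketch of such a message log, assuming an in-memory store for illustration (`MessageLog` and its field names are invented here; a production system would use a real key-value database):

```python
# Sketch of a key-value message log keyed by CNF ID, with process ID
# and timestamp, allowing replay up to an arbitrary point in time.
import time
from collections import defaultdict

class MessageLog:
    """In-memory stand-in for a key-value logging database."""
    def __init__(self):
        self.entries = defaultdict(list)  # CNF ID -> list of log entries

    def record(self, cnf_id, process_id, message, ts=None):
        ts = time.time() if ts is None else ts
        self.entries[cnf_id].append(
            {"process": process_id, "ts": ts, "msg": message})

    def replay(self, cnf_id, until_ts):
        # Retrace this CNF's messages up to the given point in time.
        return [e for e in self.entries[cnf_id] if e["ts"] <= until_ts]

log = MessageLog()
log.record("session-mgmt", "p1", "session.create", ts=100.0)
log.record("session-mgmt", "p1", "session.created", ts=101.0)
log.record("session-mgmt", "p2", "session.create", ts=102.0)
trace = log.replay("session-mgmt", until_ts=101.0)
```

Replaying the log for a CNF ID reconstructs the sequence of message flows, which is where the system state actually lives.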
How to deploy CNFs in production
It is not necessary to replace the monolithic VNF with CNFs all at once on a set cut-off date. One pattern that slowly decomposes a VNF is the strangler pattern. Here the existing VNF is placed behind an API and its functionality is frozen; we implement new functionality in our CNFs, making API calls to the monolithic VNF for older functionality when necessary. Over time, we migrate the older functionality into CNFs until the VNF is no longer useful, and then it is decommissioned. The inspiration for the strangler pattern comes from strangler vines. They grow in the upper branches of trees, gradually working their way down until, after many years, they completely strangle and kill the original tree. Strangler vines themselves form intricate shapes as they enclose the original tree.
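The API facade at the heart of the strangler pattern can be sketched like this. `StranglerFacade` and the operation names are illustrative assumptions; the essential idea is a routing table that grows as functionality migrates out of the monolith.

```python
# Sketch of the strangler pattern: an API facade routes each call to a
# migrated CNF if one exists, otherwise falls back to the frozen VNF.
class StranglerFacade:
    """API in front of the legacy VNF during gradual decomposition."""
    def __init__(self, legacy_vnf):
        self.legacy_vnf = legacy_vnf  # frozen monolith behind the API
        self.migrated = {}            # operations re-implemented as CNFs

    def migrate(self, operation, cnf_handler):
        # As functionality moves into a CNF, register its handler here.
        self.migrated[operation] = cnf_handler

    def call(self, operation, *args):
        handler = self.migrated.get(operation)
        if handler is not None:
            return handler(*args)      # new path: served by a CNF
        return self.legacy_vnf(operation, *args)  # old path: monolith

# Hypothetical setup: session creation has been migrated, mobility not yet.
legacy = lambda op, *a: f"legacy:{op}"
facade = StranglerFacade(legacy)
facade.migrate("session.create", lambda *a: "cnf:session.create")
```

Callers never see the migration: the facade's routing table is the only thing that changes as the vine tightens around the monolith.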
Figure 4: Strangler vine (credits Wikipedia)
Leveraging cloud platform
Some functions in the monolithic VNF are common functions and could simply be moved to the platform. For instance, both a vMME and a vPGW may have their own logging and performance management functions. These could become common platform functions as part of the cloud infrastructure telemetry. The cloud platform may also provide a CI/CD pipeline, again common to all CNFs. If the cloud system is implemented with Kubernetes or Docker Swarm, these platforms have native support for service discovery that our VNF decomposition scheme can leverage.
Where to run the CNFs
In containers, natively on Kubernetes or Docker Swarm, of course. Docker containers lend themselves to DevOps, and container-based deployments have lower overhead compared to VMs. Simpler CNFs might even be deployed as serverless lambda functions, with still better hardware utilization efficiency.
What about process isolation? We need VMs, some would say; running Docker containers in Kubernetes pods is not sufficient isolation. But does this really matter if we are talking about a single-tenant telco cloud? Security should be built into every level of the stack, but it has to be balanced against the actual risks.
Unleashing DevOps
Architectures define (and sometimes constrain) how we work. A microservices-based architecture and container deployment naturally enable efficient CI/CD pipelines, DevOps ways of working, and more frequent, less risky code deployments. CNFs could be the key to this in the network function virtualization domain, and could accelerate the rate of innovation in operator networks.
Unlike a monolithic VNF, CNFs do not all need to be developed in the same programming language or framework. Dev teams can choose the best tool for the task at hand. For instance, a session management CNF may be developed with a concurrency-oriented framework like Erlang/OTP, whereas a call data record CNF may just use SQL.
Are there any downsides?
The flexibility of the above CNF scheme comes with the complexity of a microservices architecture. Not every VNF is a candidate for decomposition. If the functionality of a VNF remains unchanged, there is probably little need for it to undergo decomposition. Even for VNFs that take on new requirements and undergo frequent code changes, the extra effort may not be worthwhile if the VNF's functionality is fairly simple.
Summary
In this article, I discussed how large monolithic VNFs may not be the complete answer for the cloud economics and cloud agility that network operators are looking for. We looked at decomposing VNFs into constituent network functions, which we called CNFs. The principle behind this decomposition is separation of concerns. CNFs can be implemented as microservices running in containers. We also looked in more detail at how to decompose a VNF and how to deploy the resulting CNFs.
Could VNF decomposition become the next step in virtualized network evolution?