IT Automation, Build a Framework
Platform Request -- A Real-World Example
Instead of starting with why an IT automation framework is needed, we will start with the platform request (hardware provisioning) example we identified earlier. In Figure 1, we showed a high-level workflow chart for hardware provisioning. Roughly, there are four main stages in this workflow:
- Provisioning -- create a new base build on a VM (virtual machine) or a bare-metal box
- Configuration Management -- configure the base build to a company-standard platform
- Monitoring -- add the newly built platform to the existing monitoring setup
- Optimization -- based on monitoring feedback, an optimized configuration is applied through the Configuration Management stage again
To complete the full infrastructure resource life cycle, you will need all four components integrated. Most IT organizations will at least try to implement the Provisioning and Configuration Management portions as a fully automated process.
As we pointed out earlier, automation tools are a critical part of most automation projects. For each stage, we list a couple of sample tools used in the workflow. While your choices could vary, we list these tools simply because we used them in our project. The automation tools we used in each stage:
- Provisioning – Cisco UCS Director, VMware vSphere
- Configuration Management – Puppet
- Monitoring – Solarwinds
- Optimization – in-house developed tool, part of the framework's automation layer
The starting and ending stages sit within an ITSM system such as ServiceNow, because that is where every automated workflow starts and ends. End users initiate their requests through the ITSM system, and once the automated workflow finishes, it closes the request and reports the result back to the ITSM system.
To translate this high-level workflow into a real-world automated workflow, we removed the monitoring and optimization stages for now to keep the workflow chart easy to read. Moreover, most IT organizations will consider the monitoring and optimization stages an add-on, not a must-have, in the initial implementation.
Figure 2 shows our automated workflow to provision a VM server. Here we kick off the platform request from ServiceNow, gathering the VM specs from user input and a CMDB (ServiceNow Configuration Management Database) query. A REST call from ServiceNow then initiates the provisioning workflow within UCS Director. The UCS Director workflow provisions the VM using VM templates and installs the Puppet agent at the end. (The detailed UCS Director workflow is shown in Figure 3.) When control is handed over to Puppet, the newly built server nodes are added to your Puppet network, and manifests are pushed to those nodes based on the role and profile defined for the server class we are building.
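To make the ServiceNow-to-UCS-Director hand-off concrete, here is a minimal sketch of the kind of REST call that kicks off the provisioning sub-workflow. The endpoint path, header, payload fields, and response field are hypothetical placeholders for illustration, not the actual UCS Director or ServiceNow API.

```python
import requests

# Hypothetical endpoint and credentials; the real UCS Director REST API,
# workflow names, and parameters will differ in your environment.
UCSD_URL = "https://ucsd.example.com/api/workflows/provision-vm"
API_KEY = "REPLACE_WITH_API_KEY"

def kick_off_provisioning(ritm_number: str, vm_spec: dict) -> str:
    """Submit a provisioning request and return the tool's job/run id."""
    payload = {
        "requestItem": ritm_number,    # ServiceNow request that initiated the workflow
        "template": vm_spec["template"],
        "cpu": vm_spec["cpu"],
        "memoryGB": vm_spec["memory_gb"],
        "installPuppetAgent": True,    # last step of the sub-workflow in Figure 3
    }
    resp = requests.post(
        UCSD_URL,
        json=payload,
        headers={"X-Api-Key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["jobId"]        # assumed response field

# Example: specs gathered from user input and a CMDB query
job_id = kick_off_provisioning(
    "RITM0012345",
    {"template": "rhel8-base", "cpu": 4, "memory_gb": 16},
)
print("provisioning job started:", job_id)
```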
In Figure 2, we can see that control of the workflow changes hands several times. Many automation tools have their own workflow orchestration function, so we end up with sub-workflows within the main workflow. The issue arises when those sub-workflows come from different tools: that creates dependencies we do not want in a systematic approach.
For example, if, due to cost or other functional requirement changes, we must change the provisioning-stage tool from UCS Director to VMware vRA (vRealize Automation), we must change not only the provisioning-stage workflow but also the interface to the underlying sub-workflows. The issue is not just the complexity but also the dependency, since not all workflow dependencies are necessarily linear.
No two tools' functions are completely identical. In our example, the vRA workflow will not handle the physical (bare-metal) boxes, only VMs deployed through vRA. Therefore, if UCS Director has to be replaced, a new sub-workflow needs to be put in place to handle bare-metal provisioning, which requires changes not only to the sub-workflow but also to the workflow above it.
As another example, Figure 3 shows control being handed over to Puppet within the UCS Director workflow. Since that interface is handled within the UCS Director workflow, it will need to change if the underlying tool changes.
At the architecture level, it is preferable to have one orchestration tool coordinate all the workflows across multiple technology towers. However, most automation tools have their own orchestration functions and back-end data repositories. Therefore, as we showed in our platform request workflow, there is implicit dependency among those different tools. The existing approach, although process-driven, has the following problems:
- Most automation tools serve a function within a technology tower. To coordinate the process activities, they usually have their own orchestration functions
- Multiple orchestrated workflows create complexity and dependency
- No clear interfaces defined among those automation tools
- Built-in tool support sometimes makes changing or replacing a tool even more difficult
The Solution -- Build A Framework
By taking an architectural approach to IT automation, IT organizations can simplify integration, increase reusability, improve reliability, and reduce the effort needed for upgrading and scaling. On the application side, automated test frameworks are well established and practiced by many IT organizations.
Most of those test automation frameworks are either based on a test library architecture or are data/keyword driven; compared with these more mature test automation frameworks, we could call ours process-driven. The idea is to build an automation layer over the existing automation tools, so that the multiple automation tools within the framework get input from and send output to the automation layer instead of each other (a minimal sketch of this layer follows the list below). The framework is built on the following principles:
- Use automation tools to accomplish the process function
- Build an automation layer in the framework to provide a library of functions and policy actions independent of tools
- Process-based modular design makes it easy to scale and to replace the underlying automation tools if necessary
- Unified interface
- Centralized data repository for reporting and future analytic ability
- Full integration with ITSM tools to kick off all the automated workflows
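As a rough illustration of the tool-independent automation layer, the sketch below defines one unified provisioning interface with interchangeable tool adapters. The class and function names are ours for illustration only; they are not part of UCS Director, vRA, or any other tool's API.

```python
from abc import ABC, abstractmethod

class Provisioner(ABC):
    """Unified interface the automation layer exposes for the provisioning stage."""

    @abstractmethod
    def provision(self, spec: dict) -> str:
        """Provision a server from a spec and return a tracking id."""

class UcsDirectorProvisioner(Provisioner):
    def provision(self, spec: dict) -> str:
        # Call the UCS Director sub-workflow here (VMs and bare metal).
        return "ucsd-job-123"

class VraProvisioner(Provisioner):
    def provision(self, spec: dict) -> str:
        # Call the vRA sub-workflow here (VMs only).
        return "vra-deploy-456"

def run_platform_request(spec: dict, provisioner: Provisioner) -> str:
    """Workflow logic depends only on the interface, never on a specific tool."""
    return provisioner.provision(spec)

# Swapping tools becomes a configuration change in the framework,
# not a rewrite of the workflow above it.
job = run_platform_request({"template": "rhel8-base", "cpu": 4}, UcsDirectorProvisioner())
print(job)
```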
When we look at our platform request example, there is no central point of control. In our framework (Figure 4), there are three tiers:
- The data layer -- collects infrastructure monitoring data and ITSM tool data inputs to feed the central automation data repository
- The automation layer -- its two main components are the central data repository and the rules/policy library
- The ITSM layer -- the client-facing part and the start and end point for all automated workflows. The ITSM orchestration function is the main workflow control; all automation-tool orchestration is hidden behind the automation layer
Instead of having company-specific automated workflow data stored in the respective automation tools, the central data repository is independent of any underlying tool. In the platform request example, the automation layer kicks off the platform request using the rules/policies we defined. Each automation tool's workflow acts like a 'black box' within the framework. No matter how complex the overall workflow is, it will always have only two levels of workflows: one main workflow runs within the ITSM tool, and all the sub-workflows sit at the same level below the main ITSM workflow. There are no dependencies among the automation tools themselves.
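A minimal sketch of that two-level structure, assuming each tool's sub-workflow is wrapped behind a single call; all names and return values are illustrative, not actual tool APIs.

```python
# The list stands in for the central automation data repository.
central_repository = []

def provision_subworkflow(spec):
    """Black box: UCS Director (or vRA) provisions the server and installs the agent."""
    return {"stage": "provisioning", "status": "success", "node": "sqlprod01"}

def configure_subworkflow(node, role):
    """Black box: Puppet applies the manifests for the node's role and profile."""
    return {"stage": "configuration", "status": "success", "node": node, "role": role}

def platform_request_main_workflow(request):
    """The main workflow lives at the ITSM level; every sub-workflow sits one level below."""
    prov = provision_subworkflow(request["spec"])
    conf = configure_subworkflow(prov["node"], request["role"])
    # Run data goes into the central repository, not into the individual tools.
    central_repository.extend([prov, conf])
    # Close the loop: report the result back to the ITSM request that started it.
    return {"request": request["number"], "status": "closed", "results": [prov, conf]}

ticket = platform_request_main_workflow(
    {"number": "RITM0012345", "spec": {"template": "rhel8-base"}, "role": "mssql_server"}
)
print(ticket["status"])   # closed
```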
The framework we developed is by no means perfect. With a top-down, policy-driven strategy, a lot of work needs to be done in the planning stages. But there are benefits later from having established guidelines and practices for incorporating new technology and adapting to changes.
With this architectural approach, automation processes can be managed from a unified construct rather than having to oversee multiple levels of complexity. Ultimately, this makes it possible to seamlessly integrate new technologies, operate with better governance, transparency, and control, and ensure more strategic resource utilization.
The central data repository also makes future analytic/AI (artificial intelligence) capability easier to build than combining different data sources from multiple automation tools.
Figure 5 shows an in-house developed tool reporting from our centralized automation data repository.
The Benefits of A Framework
As we pointed out in the "IT Automation, Tools and Talents" article, the proliferation of automation tools with overlapping functions creates new silos. One major benefit of implementing an IT automation framework is that it introduces a great degree of reusability. Common libraries can be used when required, with no need to develop them every time. Common sub-workflows can be used across different automation tools to avoid redundant tasks.
It does mean taking a step back and committing to additional work upfront: you may need to take inventory of all the automation tool processes, identify inputs and outputs, consolidate the number of automation processes needed, and develop a coordinated approach. The benefits have proven to be worth it.
Another major benefit lies in the centralized automation data repository. This centralized data store keeps all the automated workflow data in house, not within each automation tool itself. Even if you change tools down the road, you will not lose the critical run information for that specific process.
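As an illustration, a stage-level run record in the central repository might look like the sketch below; the field names and values are ours, chosen so the same record works no matter which tool executed the stage.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class WorkflowRunRecord:
    """Tool-independent record of one automated workflow stage."""
    request_number: str   # ITSM request that started the workflow
    stage: str            # provisioning, configuration, monitoring, optimization
    tool: str             # whichever tool ran the sub-workflow (UCS Director, Puppet, ...)
    node: str
    status: str
    started: datetime
    finished: datetime

record = WorkflowRunRecord(
    request_number="RITM0012345",
    stage="provisioning",
    tool="UCS Director",
    node="sqlprod01",
    status="success",
    started=datetime(2019, 5, 1, 9, 0),
    finished=datetime(2019, 5, 1, 9, 42),
)
```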
At this time, we do not have any analytic/AI function implemented. However, as AI technology matures, we could explore adding behavior pattern/analytic functions in the future. A simple example in our platform request scenario:
- Configuration Management -- configure a medium-size MS SQL box with 8 CPUs
- Monitoring -- CPU utilization crosses 80%, send an alert
- Optimization -- consistent CPU utilization alerts; a request to increase the number of CPUs is sent back to Configuration Management
Suppose there is a batch window on that production MS SQL server after the market closes. During the batch window, CPU utilization on that production box will spike above 80% every time due to the heavy workload. Once we mine the data in our repository, we can establish a pattern from the collected workflow information, and the system can conclude that there is no need to increase the number of CPUs, since the spike only happens during that specific batch window.
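A minimal sketch of the kind of rule that repository data could support, assuming alerts are stored with timestamps; the batch window, field names, and threshold are illustrative assumptions, not data from our environment.

```python
from datetime import time

# Assumed post-market batch window for the MS SQL server in the example
BATCH_WINDOW = (time(16, 30), time(19, 0))

def in_batch_window(alert_time):
    start, end = BATCH_WINDOW
    return start <= alert_time <= end

def recommend_cpu_increase(cpu_alerts):
    """Recommend more CPUs only if alerts occur outside the known batch window."""
    outside = [a for a in cpu_alerts if not in_batch_window(a["time"])]
    return len(outside) > 0

# Alerts mined from the central repository for the MS SQL box
alerts = [
    {"node": "sqlprod01", "time": time(17, 5), "cpu_pct": 92},
    {"node": "sqlprod01", "time": time(18, 10), "cpu_pct": 88},
]
print(recommend_cpu_increase(alerts))   # False: spikes occur only during the batch window
```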
On the other hand, the rules and policy actions that contain the business logic of the workflows, derived from the central repository, could also be used for IT security/compliance auditing.