Using GitLab Runners in Network Pipelines
For the past few days I've been looking into GitLab CI/CD for creating network pipelines. I considered Jenkins, TravisCI and Concourse, but decided to go with GitLab since I already have most of my repos there and it's pretty straightforward to use. GitLab's documentation on its CI/CD tooling is great, so I won't go into much detail describing how to set it up.
My idea behind this article is to demonstrate a simple pipeline with two stages, using GitLab runners to execute our jobs:
- Build: Using the Docker executor, we spin up a container on a remote host to validate/lint YAML syntax; if there's a problem, the linter prints an error message and exits with a non-zero code, which our CI system flags as a failure. In the same container we then run an Ansible playbook that deploys a given configuration to our lab.
- Test: Using the shell executor, which runs builds locally on the machine where the runner is installed, we run a Batfish validation against our network to make sure the number of established BGP sessions meets an expected value.
Let's start by creating the following repo structure:
Directories and Files:
- batfish-snapshots: this is where our playbook dumps full router configs for Batfish to analyze.
- deploy: location of the deploy files generated by the playbook
- templates: Jinja2 templates
- tests: our Python script to validate YAML syntax goes in here
- vars: YAML variable files
- .gitlab-ci.yml: pipeline configuration file; this is where we define our stages and jobs
- interface-config.yml: our Ansible playbook
GitLab Runners:
As you can see, we have two runners activated for our project. The use of tags (the blue labels in the picture) is important here, as tags are how we indicate which runners will pick up our different jobs.
Defining our pipeline:
Inside our .gitlab-ci.yml file we define the structure and order of the pipeline: what to execute on our runners, and what decisions to make when specific conditions are encountered:
stages:
  - build
  - test

build_job:
  stage: build
  tags:
    - build
  script:
    - cd tests/ && python validate_yaml.py
    - cd ../ && ansible-playbook -i hosts interface-config.yml --extra-vars "wf_ticket=22046"

test_job:
  stage: test
  tags:
    - test
  script:
    - python3 /home/lab/jromero-batfish/batfish-assertion.py
Both of our stages are defined here; note the use of tags, and the script entries that become the jobs run within each stage.
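Since the CI file is itself YAML, you can sanity-check its structure the same way we'll lint the variable files. A quick illustrative check (assuming PyYAML is installed; the config string below mirrors the file above):

```python
import yaml

# The pipeline definition from .gitlab-ci.yml above
ci_config = """
stages:
  - build
  - test

build_job:
  stage: build
  tags:
    - build
  script:
    - cd tests/ && python validate_yaml.py

test_job:
  stage: test
  tags:
    - test
  script:
    - python3 /home/lab/jromero-batfish/batfish-assertion.py
"""

pipeline = yaml.safe_load(ci_config)

# Every job must reference a declared stage and carry at least one tag,
# so a runner with a matching tag picks it up
jobs = {k: v for k, v in pipeline.items() if k != "stages"}
for name, job in jobs.items():
    assert job["stage"] in pipeline["stages"], name
    assert job["tags"], name

print(sorted(jobs))  # ['build_job', 'test_job']
```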
Our Ansible playbook, Batfish assertion, and YAML linter:
interface-config.yml
---
- name: Generate deploy files from vars
  hosts: neteng-lab
  connection: local
  gather_facts: no
  roles:
    - juniper.junos
  vars_files:
    - vars/{{ inventory_hostname }}.yml
  tasks:
    - name: Generating Deploy Files..
      template:
        src: "{{ item.src }}"
        dest: "{{ item.dest }}"
        mode: 0777
      with_items:
        - { src: 'templates/interface-config.j2', dest: 'deploy/{{ wf_ticket }}-{{ inventory_hostname }}.conf' }
      delegate_to: localhost

    - name: Push Configuration to Lab
      juniper_junos_config:
        config_mode: "exclusive"
        load: "merge"
        src: "deploy/{{ wf_ticket }}-{{ inventory_hostname }}.conf"
        commit: false
      register: response

    - name: Print the complete response.
      debug:
        var: response

    - name: Pull configs for batfish analysis
      juniper_junos_config:
        retrieve: "committed"
        dest: "batfish-snapshots/configs/{{ inventory_hostname }}"
      register: response

    - name: Print the complete response.
      debug:
        var: response
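The first task above is plain Jinja2 templating driven by the per-host vars file. As a rough stdlib-only sketch of the substitution it performs (the real templates/interface-config.j2 isn't shown in this article, so the template body and variable values below are hypothetical; string.Template's $var stands in for Jinja2's {{ var }}):

```python
from string import Template

# Hypothetical stand-in for templates/interface-config.j2
TEMPLATE = Template(
    "interfaces {\n"
    "    $name {\n"
    "        description \"$description\";\n"
    "        unit 0 { family inet { address $address; } }\n"
    "    }\n"
    "}\n"
)

# Values that would come from vars/<inventory_hostname>.yml
iface = {"name": "ge-0/0/1", "description": "PTP to vMX-02", "address": "10.70.0.1/30"}

# The dest naming scheme from the playbook's with_items entry
wf_ticket = "22046"
inventory_hostname = "vmx-01"
deploy_file = "deploy/%s-%s.conf" % (wf_ticket, inventory_hostname)

config = TEMPLATE.substitute(iface)
print(deploy_file)  # deploy/22046-vmx-01.conf
print(config)
```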
batfish-assertion.py - this is a pybatfish assertion that runs in our shell executor; it analyzes the config snapshots and checks that the BGP session between our two lab routers is established.
import pandas as pd
from pybatfish.client.commands import *
from pybatfish.datamodel import Edge, Interface
from pybatfish.datamodel.answer import TableAnswer
from pybatfish.datamodel.flow import (HeaderConstraints, PathConstraints)
from pybatfish.question import bfq, load_questions
# batfish host
bf_session.host = "localhost"
load_questions()
bf_set_network('neteng-lab')
bf_init_snapshot('/home/lab/repos/network-ci-pipeline/batfish-snapshots', name='neteng-lab', overwrite=True)
pd.set_option('display.min_rows', 400)
pd.set_option('display.max_rows', 400)
# Ask Batfish for established BGP sessions between our two lab routers
bgpSessStat = bfq.bgpSessionStatus(nodes='vmx-01', remoteNodes='vmx-02', status='Established').answer().frame()
print(bgpSessStat)
# Fail with a non-zero exit code if the session is not established
established = bgpSessStat[bgpSessStat.Established_Status == "ESTABLISHED"]
assert len(established) == 1, "BGP session Down"
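The filter-and-assert at the end is plain pandas, so the pattern can be tried without a running Batfish service. Here it is against a hand-built frame shaped like a simplified bgpSessionStatus answer (the real answer has more columns, but the assertion only needs one):

```python
import pandas as pd

# Hand-built stand-in for the bgpSessionStatus answer frame
frame = pd.DataFrame({
    "Node": ["vmx-01"],
    "Remote_Node": ["vmx-02"],
    "Established_Status": ["ESTABLISHED"],
})

# Keep only rows whose session is actually established, then assert
# on the count; an AssertionError gives the CI job a non-zero exit
established = frame[frame.Established_Status == "ESTABLISHED"]
assert len(established) == 1, "BGP session Down"
print(len(established))  # 1
```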
validate_yaml.py
#!/usr/bin/env python
import os
import sys

import yaml

# YAML_DIR is the location of the directory where the YAML files are kept
YAML_DIR = "%s/../vars/" % os.path.dirname(os.path.abspath(__file__))

# Loop over the YAML files and try to load them
for filename in os.listdir(YAML_DIR):
    yaml_file = "%s%s" % (YAML_DIR, filename)
    if os.path.isfile(yaml_file) and yaml_file.endswith(".yml"):
        try:
            with open(yaml_file) as yamlfile:
                configdata = yaml.safe_load(yamlfile)
        # If there was a problem importing the YAML, print an error
        # message and quit with a non-zero exit code (which will
        # trigger our CI system to indicate failure)
        except Exception:
            print("%s failed YAML import" % yaml_file)
            sys.exit(1)
sys.exit(0)
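To see the linter's logic in action without the repo layout, the same loop can be pointed at a temporary directory holding one good and one bad file (a self-contained sketch; the failure path returns 1 instead of calling sys.exit so it's easy to exercise):

```python
import os
import tempfile

import yaml

def lint_yaml_dir(yaml_dir):
    """Return 0 if every .yml file in yaml_dir parses, else 1."""
    for filename in os.listdir(yaml_dir):
        yaml_file = os.path.join(yaml_dir, filename)
        if os.path.isfile(yaml_file) and yaml_file.endswith(".yml"):
            try:
                with open(yaml_file) as yamlfile:
                    yaml.safe_load(yamlfile)
            except Exception:
                print("%s failed YAML import" % yaml_file)
                return 1
    return 0

with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.yml"), "w") as f:
        f.write("interfaces:\n  - name: ge-0/0/1\n")
    good_rc = lint_yaml_dir(tmp)

    # A tab used as indentation is invalid YAML and should fail the lint
    with open(os.path.join(tmp, "bad.yml"), "w") as f:
        f.write("interfaces:\n\t- name: ge-0/0/1\n")
    bad_rc = lint_yaml_dir(tmp)

print(good_rc, bad_rc)  # 0 1
```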
Testing pipeline:
As soon as we commit to our master branch, our pipeline starts executing the jobs defined in our CI config file (this depends on your environment; you may prefer to have it run on merge requests instead):
We'll start by committing an invalid YAML variable file to see if it's picked up; this should trigger a pipeline run:
---
interfaces:
  - address: 10.70.0.1/30
    description: "PTP to vMX-02"
    name :ge-0/0/1
Towards the bottom you can see the job failed because our script was unable to load the badly formatted YAML file.
Our pipeline also shows what stage failed, with our test stage being skipped as a result:
We'll fix this, and next try to break our BGP session in order to test our second-stage Batfish validation job:
We configure the wrong BGP peer IP on the interface, causing our session to go down
---
interfaces:
  - address: 10.70.0.1/30
    description: "PTP to vMX-02"
    name: ge-0/0/1
Similar to our previous test case, we can see this job failed with an assertion error, as the Batfish analysis did not return what we expected.
Pipeline showing stage failed for the given run:
Let's correct our BGP peer IP in our variable file and commit to our repo again.
Upon committing we can now see both stages passed:
There are things this article doesn't cover, like setting up the Ansible container, runner server configuration, etc.; there are plenty of how-tos online for that already. I hope this is helpful to other network engineers on this journey.