Taste of Distributed Virtual Routing

Most people have heard of Distributed Virtual Routing (DVR) before, but not many of us have actually deployed it in a real production environment. Recently we got a chance to deploy DVR on top of RHOSP 10, and we found it quite simple and beneficial for our use case. What follows is a quick summary of our experience and configuration.

DVR Overview

* Based on the Red Hat OpenStack Platform Networking Guide; please refer to that document if you need more details.

Overview of OpenStack Tenant Network

OpenStack provides multi-tenancy support through its project (a.k.a. tenant) network concept. Each project network is isolated in its own namespace, so the instances in a project network can only communicate with one another over a shared L2 broadcast domain. To talk to any endpoint outside that network (e.g. an instance in another network, an IP-based network service, etc.), a project router is required to perform the routing. A tenant can create a router and attach it to a tenant network to allow the instances to reach other project networks or upstream (if an external gateway is defined for the router).
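
As a quick illustration, a project network and router are typically wired up with a few CLI calls. This is only a minimal sketch: the network, subnet, router, and external network names below are hypothetical and depend on your deployment.

# Create a project network plus subnet (names and CIDR are examples)
openstack network create demo-net
openstack subnet create demo-subnet --network demo-net --subnet-range 192.168.10.0/24

# Create a router, attach the subnet, and set the external gateway
openstack router create demo-router
openstack router add subnet demo-router demo-subnet
openstack router set demo-router --external-gateway external-net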

Based on the routing flow, OpenStack tenant traffic can be categorized into three types:

  • East-West routing - routing of traffic between different networks in the same tenant. This traffic does not leave the OpenStack deployment.
  • North-South routing with floating IPs - floating IP addressing is best described as a one-to-one NAT that can be modified and floats between instances. Floating IPs are implemented through association with a neutron router that performs the NAT translation, and the addresses themselves are taken from the uplink network that provides the router with its external connectivity. As a result, instances can communicate with external resources (such as endpoints in the external network) and vice versa (see the CLI sketch after this list).
  • North-South routing without floating IPs (also known as SNAT) - neutron offers a default port address translation (PAT) service for instances that have not been allocated floating IPs. With this service, instances can communicate with external endpoints through the router, but not the other way around. For example, an instance can browse a website in the external network, but a web browser in the external network cannot reach a website hosted within the instance.
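
For the floating IP case, the one-to-one NAT is set up by allocating an address from the external network and associating it with an instance. A minimal sketch, assuming a hypothetical external network external-net and instance demo-vm:

# Allocate a floating IP from the external (uplink) network
openstack floating ip create external-net

# Associate it with an instance (the address shown is an example)
openstack server add floating ip demo-vm 203.0.113.25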

Default Centralized Routing

By default, neutron was designed with a centralized routing model where a project's virtual routers, managed by the neutron L3 agent, are all deployed on a centralized network node or cluster (in most deployments, the network role is placed on the OpenStack Controller nodes). This means that each time a routing function is required (east/west, floating IPs or SNAT), traffic traverses a dedicated node in the topology, which introduces multiple challenges and results in sub-optimal traffic flows. For example:

Traffic between instances flows through a Controller node - when two instances need to communicate with each other over L3, traffic has to hit the Controller node, even if the instances are scheduled on the same Compute node.

Instances with floating IPs receive and send packets through the Controller node - the external network gateway interface is available only on the Controller node, so traffic between a tenant instance and the external network has to flow through it. This affects performance and scalability, especially in a large deployment.
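
You can observe this centralization directly. A minimal sketch with the neutron CLI (the router name is hypothetical):

# List all neutron agents and where they run
neutron agent-list

# Show which L3 agent hosts a given router - under centralized
# routing, this is always an agent on the Controller node
neutron l3-agent-list-hosting-router demo-router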

Diagram: Routing under Centralized vRouter

Alternative: Distributed Virtual Routing (DVR)

DVR is an alternative routing architecture that is intended to isolate the failure domain of the Controller node and optimize network traffic by deploying the L3 agent and scheduling routers on every Compute node. When using DVR (see the verification sketch after this list):

  • East-West traffic is routed directly on the Compute nodes in a distributed fashion.
  • North-South traffic with floating IP is distributed and routed on the Compute nodes.
  • North-South traffic without floating IP (SNAT use case) is not distributed and still requires a dedicated Controller node.
  • The neutron metadata agent is distributed and deployed on all Compute nodes. The metadata proxy service is hosted on all the distributed routers.
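
A quick way to verify this layout after deployment is to check the L3 agent mode on each node type. This is a sketch based on the default configuration paths; on our RHOSP 10 setup the compute nodes run in dvr mode and the controllers in dvr_snat mode:

# On a compute node - expect "agent_mode = dvr"
grep agent_mode /etc/neutron/l3_agent.ini

# On a controller node - expect "agent_mode = dvr_snat"
grep agent_mode /etc/neutron/l3_agent.ini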

The new routing model is depicted below:

Diagram: Routing under Distributed vRouter

Benefits from DVR

In our use case, we deploy big data analytics workloads into OpenStack projects. Big data analytics applications, such as Apache Kafka and Hadoop, drive a lot of network traffic through real-time data streaming and batch data injection. We observed two major benefits from DVR:

1. Eliminate the Potential Network Bottleneck and Enhance Scalability

As DVR distributes network traffic directly to each compute node (through floating IPs), it eliminates the potential network bottleneck at the controller nodes. For example, if we deploy 3 controller nodes and 15 compute nodes, each with 20 Gbps of network bandwidth:

Total north-south network bandwidth without DVR:

  • 3 controller nodes x 20 Gbps / node = 60 Gbps, if the workloads are spread across at least three tenants (each tenant router is scheduled to a single network node, so it takes three routers to use all three controllers);
  • 1 controller node x 20 Gbps / node = 20 Gbps, if all workloads are deployed under a single tenant project;

Total north-south network bandwidth with DVR:

  • 15 compute nodes x 20 Gbps / node = 300 Gbps
  • Most importantly, the throughput can be scaled out horizontally by adding more compute nodes.

2. Isolate the Controller Failure Domain for Higher Reliability

In addition, with DVR the data plane is independent of the control plane. During our testing, we simulated a total control plane outage (e.g. a rack power failure, or controller node upgrade / maintenance) by powering off all three controller nodes. The workloads on the compute nodes kept working normally, since floating IP traffic flows directly into and out of the compute nodes without going through the controllers. This gives us higher reliability and more convenience in daily operations.
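
The outage test itself is easy to reproduce. A minimal sketch: run a probe like the loop below against a workload's floating IP (the address is hypothetical) from outside the cloud, then power off the controllers and watch the probe keep succeeding.

# Continuously probe a floating IP; with DVR this keeps returning OK
# even while all controller nodes are powered off
while true; do
  ping -c 1 -W 2 203.0.113.25 >/dev/null && echo "$(date) OK" || echo "$(date) FAIL"
  sleep 1
done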

DVR Deployment through Red Hat OpenStack Director

Enabling DVR is actually quite simple and fully integrated into Red Hat OpenStack Platform Director. The following steps are required to enable DVR:

1). During the physical network design and deployment, the external network has to be connected to each and every Compute node for distributed routing. A bridge with an interface for external network traffic must also be created on the Compute nodes, just as on the Controller nodes. For example, include the external network VLAN in your nic-configs/compute.yaml file:

-
  type: ovs_bridge
  name: {get_input: bridge_name}
  members:
    -
      type: linux_bond
      name: bond0
      bonding_options: {get_param: BondInterfaceOvsOptions}
      members:
        -
          type: interface
          name: enp6s0
          primary: true
        -
          type: interface
          name: enp7s0
    # add external NW here for DVR
    -
      type: vlan
      device: bond0
      vlan_id: {get_param: ExternalNetworkVlanID}
      addresses:
        -
          ip_netmask: {get_param: ExternalIpSubnet}
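
After deployment you can confirm the external VLAN landed on the compute node's bridge. A minimal sketch, assuming the bridge name input resolves to the usual br-ex:

# List the ports attached to the external bridge
sudo ovs-vsctl list-ports br-ex

# Check that the external IP was assigned to the VLAN device
ip addr | grep -A2 vlan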

2). DVR requires a port on the external network for each compute node. This can be configured by OSP Director via the following parameters in network-isolation.yaml:

# Enable the creation of Neutron networks for isolated Overcloud
# traffic and configure each role to assign ports (related
# to that role) on these networks.
resource_registry:
  # Port assignments for the compute role
  # Enable the external port for DVR
  OS::TripleO::Compute::Ports::ExternalPort: ../default/network/ports/external.yaml

3). If you assign predictable IPs, remember to include external IPs for your compute nodes. For example, in ips-from-pool-all.yaml:

# Environment file demonstrating how to pre-assign IPs to all node types
resource_registry:
  OS::TripleO::Compute::Ports::ExternalPort: ../default/network/ports/external_from_pool.yaml
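
Since the compute role now has an external port, its IP pool entry needs an external list as well. A minimal sketch of the corresponding parameter_defaults in the same file (the addresses are placeholders; use values from your own external subnet):

parameter_defaults:
  ComputeIPs:
    # add external IPs for the compute nodes here for DVR
    external:
    - 10.0.0.21
    - 10.0.0.22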

4). Prepare an additional environments/neutron-ovs-dvr.yaml file to enable the L3 and metadata agents on compute nodes, for example:

# A Heat environment file that enables DVR in the overcloud.
# This works by configuring L3 and Metadata agents on the
# compute nodes.
resource_registry:
  OS::TripleO::Services::ComputeNeutronL3Agent: ../default/puppet/services/neutron-l3-compute-dvr.yaml
  OS::TripleO::Services::ComputeNeutronMetadataAgent: ../default/puppet/services/neutron-metadata.yaml

parameter_defaults:
  # DVR requires that the L2 population feature is enabled
  NeutronMechanismDrivers: ['openvswitch', 'l2population']
  NeutronEnableL2Pop: 'True'

  # Setting NeutronEnableDVR enables distributed routing support in the
  # ML2 plugin and agents that support this feature
  NeutronEnableDVR: true

  # We also need to set the proper agent mode for the L3 agent. This will only
  # affect the agent on the controller node.
  NeutronL3AgentMode: 'dvr_snat'

  # L3 HA isn't supported for DVR enabled routers.
  NeutronL3HA: false
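
With DVR enabled this way, routers created after deployment should be distributed by default. As an admin you can verify a router's type, or create a distributed router explicitly (router names hypothetical):

# Check whether an existing router is distributed
neutron router-show demo-router -F distributed

# Or create a distributed router explicitly
neutron router-create --distributed True demo-router2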

5). Include all your parameter files in the deployment command, for example:

openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates \
-e ~/templates/environments/network-isolation.yaml \
-e ~/templates/environments/ips-from-pool-all.yaml \
-e ~/templates/environments/network-environment.yaml \
-e ~/templates/environments/neutron-ovs-dvr.yaml \
--control-scale 3 --compute-scale 15 \
-t 120
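
Once the deployment finishes, a quick sanity check (assuming the usual overcloudrc in the stack user's home directory) is to confirm that L3 and metadata agents now appear on every compute node:

source ~/overcloudrc
neutron agent-list | grep -E "L3 agent|Metadata agent"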

That is all for DVR. Isn't it simple and powerful?

Reference:

* Red Hat OpenStack Platform 10 Networking Guide
