VxLAN EVPN - A Simple Explanation
Ashok Kumar

VxLAN EVPN is a hot technology in the datacenter segment right now; many drafts are still in progress & many RFCs are already published. In this article, I will try to bring everything together while keeping it short & concise.

RFC 8365, which came out in early 2018, provides the first ever official specification of EVPN's use with VxLAN. Originally, VxLAN (RFC 7348) was designed to address the 4K segment limit (the 12-bit dot1q VLAN tag), STP issues & multi-tenancy. RFC 8365 additionally emphasizes its use for DCI (Data Center Interconnect).

VxLAN VNI

VNI, the VxLAN Network Identifier, is a unique 24-bit tag carried in the VxLAN header. Its purpose is just what its name says: it identifies a VxLAN network segment.
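To put the 24-bit tag in perspective, a quick back-of-the-envelope comparison with the 12-bit dot1q VLAN tag:

```python
# The VNI is a 24-bit field, so a VxLAN fabric can distinguish
# 2**24 segments, versus 2**12 = 4096 VLANs with a 12-bit 802.1Q tag.
VNI_BITS = 24
VLAN_BITS = 12

vni_segments = 2 ** VNI_BITS
vlan_segments = 2 ** VLAN_BITS

print(vni_segments)   # 16777216 (~16M segments)
print(vlan_segments)  # 4096 (the "4K limit")
```

This is the "4K limit" that VxLAN was originally designed to lift.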


There are two types of VNI: L2VNI & L3VNI. An L2VNI is used to create a MAC-VRF, while an L3VNI is used for an IP-VRF, which we sometimes call a Tenant-VRF; however, it is essentially just an IP-VRF, as in the case of MPLS L3 VPN.

VxLAN EVPN Analogy to Normal Networking

To better understand & correlate with our day-to-day networking terms, the analogies below should be helpful:

  • VLANs / SVIs

When a host pings an IP address, it checks its own IP, subnet mask & the destination IP to figure out whether it needs to reach its default gateway (the SVI, aka interface VLAN) or whether the destination is within its own VLAN. Based on this, the host will do one of the following:

  1. If the destination is within the same VLAN, there is no need to reach the default gateway; the frame is flooded within the VLAN if needed. In VxLAN, we send it to the L2VNI of that VLAN.
  2. If the destination is outside the VLAN, the packet is sent to the default gateway / SVI. In VxLAN EVPN, it is sent via the L3VNI of that VRF/tenant.
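The two-way decision above can be sketched as a tiny function (a hypothetical helper for illustration, not vendor code): same subnet means bridging via the L2VNI, different subnet means routing via the gateway & the L3VNI.

```python
# Minimal sketch of the host/fabric decision: intra-subnet traffic is
# bridged in the L2VNI, inter-subnet traffic is routed over the L3VNI.
import ipaddress

def pick_vni(host_ip: str, prefix_len: int, dest_ip: str) -> str:
    host_net = ipaddress.ip_network(f"{host_ip}/{prefix_len}", strict=False)
    if ipaddress.ip_address(dest_ip) in host_net:
        return "L2VNI"   # same VLAN/subnet: flood/bridge within the segment
    return "L3VNI"       # different subnet: send to gateway, route in the tenant VRF

print(pick_vni("10.1.1.10", 24, "10.1.1.20"))  # L2VNI
print(pick_vni("10.1.1.10", 24, "10.2.2.20"))  # L3VNI
```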
  • MPLS

EVPN (RFC 7432) was originally designed as an alternative to VPLS & similar technologies. VxLAN EVPN brings in almost the same framework with slight modifications. A PE in MPLS is equivalent to a VTEP (or NVE) in VxLAN. In MPLS L3 VPN there are two MPLS labels, transport & VPN; in VxLAN EVPN, the 24-bit VNI tag is equivalent to the MPLS VPN label.

  • IRB (Integrated Routing & Bridging)

In classical networking, an L3 switch already performs both routing & switching. Configuration-wise, we need two components for it: an L2 VLAN (L2VNI) & its SVI (L3VNI). We don't have any fancy name for this in classical networking, but in the VxLAN EVPN world, IRB is simply the close match of the L3 function of a switch.

VxLAN Packet

Above is a VxLAN-encapsulated packet. A VxLAN packet is identified by a VTEP by its UDP destination port number, 4789, as designated by IANA; however, it is still left configurable because some vendor implementations existed even before the RFC came out. The I bit (in green) in the Flags field of the VxLAN header marks a valid VNI in the packet. Below is a packet capture of VxLAN traffic.

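The 8-byte VxLAN header layout (RFC 7348) just described can be sketched as a small parser: 1 byte of flags (where the I bit, 0x08, marks a valid VNI), 3 reserved bytes, a 3-byte VNI & 1 more reserved byte.

```python
# Minimal sketch of parsing the RFC 7348 VxLAN header.
import struct

VXLAN_PORT = 4789  # IANA-assigned UDP destination port

def parse_vxlan_header(data: bytes):
    # "!B3xI": flags byte, 3 reserved bytes skipped, then 4 bytes
    # whose top 24 bits carry the VNI (last byte is reserved).
    flags, vni_and_rsvd = struct.unpack("!B3xI", data[:8])
    i_bit_set = bool(flags & 0x08)  # I bit: VNI field is valid
    vni = vni_and_rsvd >> 8
    return i_bit_set, vni

# Build a header for VNI 10100 with the I bit set, then parse it:
hdr = bytes([0x08, 0, 0, 0]) + (10100 << 8).to_bytes(4, "big")
print(parse_vxlan_header(hdr))  # (True, 10100)
```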

First-Hop Redundancy Technology & Multi-Homing

First-Hop Redundancy Technology: VxLAN EVPN uses the Anycast L3 Gateway technology. Here, we have two redundancy modes:

  1. All-Active: All anycast L3 gateways are actively used & load-balance traffic. For BUM (Broadcast, Unknown unicast & Multicast) traffic, split-horizon is employed to avoid duplication.
  2. Single-Active: Only one of the anycast gateways is active & the rest are backups; there is no need for split-horizon.

Multi-Homing: Hosts/endpoints are usually dual-connected or multi-homed to the VTEPs. Multi-homed networks always need split-horizon mechanisms for loop prevention.

With redundancy in an L2 domain comes a very old, well-known problem: forwarding loops. It is the same problem that was addressed by our well-known Spanning-Tree Protocol (STP). At the end of the day, VxLAN is still an L2 domain extension & is very much vulnerable to this problem, so we need a mechanism to deal with it.

The funny thing is, the mechanism is not very different from STP at a conceptual level. In STP, we elected a Root Bridge among the L2 switches; Cisco then optimised it per VLAN & implemented it as PVST (Per-VLAN Spanning-Tree). In EVPN, we elect a DF (Designated Forwarder) at per-EVI/ESI granularity for load-balancing, via a distributed algorithm called "Service Carving".

Therefore, it is very important that, in the case of multihomed hosts/servers, only one of the VTEPs is used to send BUM traffic to them. There are two ways to achieve this:

  1. MC-LAG: We can use vPC/MC-LAG technology for dual-connected hosts. These technologies are usually vendor-proprietary & are enhanced/tweaked for the VxLAN EVPN framework.
  2. Ethernet Segment Route (BGP EVPN Route Type 4): A Type 4 route is advertised only if the ESI (Ethernet Segment Identifier) is set to a non-zero value. Type 4 routes are originated only for sites where multihoming is configured.

Multi-homing with vPC/MC-LAG: Since the Anycast L3 gateway is used, all SVIs have the same IP address, & the same MAC is used for all the SVIs on all the VTEPs, called the Anycast Gateway MAC (AGM).




From a connected host's point of view, the AGM is the single default-gateway MAC for everything.

For Northbound Traffic: Traffic coming from the host & going towards the VxLAN fabric is northbound traffic. In All-Active redundancy mode, if vPC is configured for a dual-connected host, the vPC peers behave like a single logical device to which the host has a normal port-channel. Here, the port-channel hashing algorithm determines where the host sends the traffic.

And since it is All-Active mode, traffic flows are load-balanced among the vPC peers.


If a VTEP fails, one of the member links of the port-channel goes down & traffic is simply hashed over to the remaining member of the port-channel, connected to the redundant vPC peer.

For Southbound Traffic: Traffic coming from the VxLAN fabric & going towards the host is southbound traffic. Some configuration/enhancement is needed in the vPC domain, as a dual-connected host must be seen as advertised from both VTEPs (vPC peers). We configure a vPC VTEP Anycast IP, or VIP, on both vPC peers; all dual-connected hosts behind this vPC pair are then seen as advertised from the vPC VTEP Anycast IP. Since it is an anycast IP, it must be the same on both vPC peers, which is done by configuring the same secondary IP on the NVE/VTEP interface.


The spine (shown as a green switch) will have ECMP (Equal-Cost Multi-Path) routes to the vPC VTEP Anycast IP & will therefore load-balance flows across the ECMP paths.
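The flow-wise ECMP load-balancing just described can be sketched as follows. The hash function & path names here are purely illustrative, not any vendor's actual algorithm: the point is that a given flow's 5-tuple always hashes to the same path (no packet reordering), while different flows spread across the equal-cost paths.

```python
# Minimal sketch of flow-based ECMP hashing toward the vPC VTEP Anycast IP.
import zlib

def pick_ecmp_path(flow_tuple, paths):
    # Deterministic per-flow: same 5-tuple -> same next hop.
    key = repr(flow_tuple).encode()
    return paths[zlib.crc32(key) % len(paths)]

paths = ["to-vPC-peer-1", "to-vPC-peer-2"]  # two equal-cost routes on the spine
flow = ("10.1.1.10", "10.2.2.20", 6, 49152, 443)  # src, dst, proto, sport, dport

chosen = pick_ecmp_path(flow, paths)
print(chosen in paths)                              # True
print(chosen == pick_ecmp_path(flow, paths))        # True: stable per flow
```

If one path is withdrawn (a vPC peer fails), the same function simply hashes over the remaining list, which mirrors the failover behaviour described below.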


If either of the vPC peers fails, traffic is diverted to the available ECMP routes, towards the redundant vPC peer.

VxLAN Flood & Learn

Very quickly, we'll recap the basics of how VxLAN flood & learn works, as this is the foundation of VxLAN, & then we will deep-dive into its EVPN part.


In flood & learn, or data-plane learning, flooding is the only way a VTEP/NVE learns about endpoints. When a VM first comes online or tries to reach some host, the ingress VTEP floods the ARP (or any other BUM traffic) across the VxLAN fabric. Shown above is a typical flow of ARP traffic in the VxLAN flood & learn approach. As you can see, in a fairly large cloud datacenter environment this could easily create problems in terms of scalability. Therefore, EVPN was seen as an alternative to this approach.

VxLAN EVPN CONCEPTS

VxLAN EVPN has adopted the BGP MPLS-based EVPN (RFC 7432) framework with some slight changes. A quick view of the BGP EVPN routes & their purposes:

BGP EVPN Routes:


Type 1: Ethernet Auto-Discovery (Fast Convergence): In a BGP-based control-plane environment, if a PE experiences a failure of its attached Ethernet segment, network convergence time is really a function of the number of MAC/IP advertisements. In a high-scale environment, this can be slow enough to create problems. To address this, a new route type was introduced.

Upon failure of the attached Ethernet segment, the NVE/VTEP withdraws these BGP EVPN Ethernet A-D routes, triggering an UPDATE in the BGP EVPN control plane & causing other NVEs/VTEPs either to update the next hop of the MAC entries for the Ethernet segment in question (if another NVE/VTEP also advertised that segment) or simply to invalidate them.


In MPLS-based EVPN, an MPLS label called the ESI label (acting as a "site of origin") is used for split-horizon when encapsulating the packet. Since the VxLAN encapsulation does not include the ESI label, other means of performing the split-horizon filtering function must be devised.


Every VTEP tracks the IP addresses of the other VTEPs that have the same ES attached. When an NVE receives a multi-destination frame from the overlay network, it examines the source IP address in the tunnel header (which corresponds to the ingress NVE where the packet was encapsulated) & drops the frame on all locally attached interfaces belonging to the same ES.
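That source-IP-based split-horizon check can be sketched in a few lines. The addresses & data structures are illustrative only; real implementations do this per Ethernet segment in the forwarding plane.

```python
# Minimal sketch of split-horizon filtering for BUM frames in VxLAN EVPN:
# if the tunnel source IP belongs to a peer VTEP sharing the same ES,
# don't forward the frame onto our local interfaces of that ES.
def should_forward_to_es(tunnel_src_ip: str, es_peer_vteps: set) -> bool:
    # The peer VTEP on the same ES already delivered the frame locally,
    # so forwarding it again would duplicate traffic to the host.
    return tunnel_src_ip not in es_peer_vteps

es1_peers = {"192.0.2.11", "192.0.2.12"}  # VTEPs that share ES1 with us

print(should_forward_to_es("192.0.2.11", es1_peers))  # False: drop on ES1 ports
print(should_forward_to_es("192.0.2.99", es1_peers))  # True: from an unrelated VTEP
```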

Type 2: MAC/IP Advertisement Route (MAC Learning): This is the most common VxLAN EVPN route type. It makes possible the control-plane learning of the hosts hanging off VTEPs & eliminates the need for flooding in the VxLAN EVPN fabric. When a VM or host comes online, it sends out a GARP. The VTEP snoops this, creates a local ARP cache entry & advertises it to all remote VTEPs via a BGP UPDATE. The remote VTEPs learn about this host & create a remote ARP cache entry. Therefore, if any other host wants to ping this newly learnt host, its VTEP already knows which VTEP it is connected to. This builds up the MAC-VRF, which is basically a tight coupling of MAC-IP entries.
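The Type-2 learning sequence can be sketched with a simple table standing in for the MAC-VRF; the dictionary & function names are illustrative, not a real BGP implementation.

```python
# Minimal sketch of EVPN Type-2 control-plane learning as seen by a
# remote VTEP: BGP UPDATEs populate the MAC-VRF, so later lookups go
# straight to the right egress VTEP instead of flooding.
mac_vrf = {}  # (MAC, IP) -> advertising VTEP's tunnel IP

def on_bgp_update(mac: str, ip: str, origin_vtep: str):
    mac_vrf[(mac, ip)] = origin_vtep  # install the remote MAC/ARP entry

def lookup_vtep(mac: str, ip: str):
    return mac_vrf.get((mac, ip))  # None would mean "unknown: flood"

# A host comes online behind VTEP 192.0.2.11 and sends a GARP; that
# VTEP snoops it and advertises the binding via a Type-2 route:
on_bgp_update("00:50:56:aa:bb:cc", "10.1.1.10", "192.0.2.11")
print(lookup_vtep("00:50:56:aa:bb:cc", "10.1.1.10"))  # 192.0.2.11
```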


Type 3: Inclusive Multicast Ethernet Tag (IMET) Route (Peer Discovery): We know that EVPN combined with VxLAN eliminates most of the need to flood, but if a VTEP does not have the destination MAC entry, it still needs to flood. The IMET route is used to discover whom we need to flood to; in other words, it performs "peer discovery". We normally have the option of statically configuring peer VTEPs, but this is the automated, scalable & standard way of doing it. How does it work? This brings in the concept of P-Tunnels, or P-Multicast Service Interface (PMSI) tunnels.


P-Tunnels use a BGP path attribute called PMSI_TUNNEL_ATTRIBUTE. Note that MPLS-based EVPN uses different MPLS VPN labels for unicast & multicast traffic; in VxLAN, however, we use the same VPN label, which is again the same VNI tag. All BUM traffic in VxLAN EVPN is carried over these multicast tunnels.

Type 4: Ethernet Segment Route (Multi-Homing): These routes are used to elect the designated forwarder (DF) for BUM traffic toward the CE. Once an ESI has been assigned to the Ethernet segment of a multihomed CE, the ESI is advertised by the PE as a BGP Type 4 route. PEs where the ESI matches import each other's ES routes & thereby auto-discover each other. Type 4 routes are originated only by PEs connected to multihomed CEs & imported only by PEs connected to the same Ethernet segment.

DF election: The DF election algorithm is based on a modulus operation. The VTEPs to which the ES is multihomed (DF election is carried out per EVI) form an ordered (ordinal) list, in ascending order of VTEP IP address value.

For example, if there are N PEs, PE0, PE1, ... PE(N-1), ranked by increasing IP address in the ordinal list, then for each VLAN with Ethernet Tag V configured on ES1, PEx is the DF for VLAN V on ES1 when x equals (V mod N).


In the above example, VTEP1 becomes the DF for Ethernet Tag ID 100 & VTEP2 becomes the DF for Ethernet Tag ID 101.
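The (V mod N) election can be sketched directly. The VTEP IP addresses below are illustrative; with two VTEPs, even tags land on the first entry of the ordinal list & odd tags on the second, matching the example above.

```python
# Minimal sketch of the RFC 7432 modulo-based DF election:
# sort the VTEPs sharing the ES by IP, then index with (tag mod N).
import ipaddress

def elect_df(es_vteps, ethernet_tag):
    ordinal = sorted(es_vteps, key=lambda ip: int(ipaddress.ip_address(ip)))
    return ordinal[ethernet_tag % len(ordinal)]

vteps = ["192.0.2.12", "192.0.2.11"]  # VTEP2, VTEP1 (order doesn't matter)

print(elect_df(vteps, 100))  # 192.0.2.11 -> "VTEP1" is DF for tag 100
print(elect_df(vteps, 101))  # 192.0.2.12 -> "VTEP2" is DF for tag 101
```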

Type 5: IP Prefix Advertisement (IP Routing): How do we forward a packet destined to a different subnet/VLAN? Well, this is routing, not switching; we need routing information for this, & that's where we leverage Type 5 routes. They build the IP-VRF for the EVI.

Notice that the tightly-bound MAC-IP route advertisements (Type 2) are not always desirable here; for example, with floating IPs, NAT etc., the IP/MAC bindings may change. This usually happens when firewalls or load balancers in the datacenter come into the data path of a traffic flow & alter the IP/MAC bindings.

IP Prefix advertisement route NLRI:


    +---------------------------------------+
    |      RD   (8 octets)                  |
    +---------------------------------------+
    |Ethernet Segment Identifier (10 octets)|
    +---------------------------------------+
    |  Ethernet Tag ID (4 octets)           |
    +---------------------------------------+
    |  IP Prefix Length (1 octet)           |
    +---------------------------------------+
    |  IP Prefix (4 or 16 octets)           |
    +---------------------------------------+
    |  GW IP Address (4 or 16 octets)       |
    +---------------------------------------+
    |  MPLS Label (3 octets)                |
    +---------------------------------------+
 
  • The RD, Ethernet Tag ID and MPLS Label fields are used as defined in [RFC7432].
  • The GW IP (Gateway IP Address) acts as an overlay index for the IP prefixes.
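The IPv4 form of the NLRI laid out above can be packed byte-for-byte to check the field sizes: RD (8) + ESI (10) + Ethernet Tag (4) + prefix length (1) + prefix (4) + GW IP (4) + label/VNI (3) = 34 octets. The RD, ESI & VNI values below are illustrative placeholders.

```python
# Minimal sketch that serialises an IPv4 EVPN Type-5 NLRI per the
# layout above, carrying the 24-bit L3VNI in the label field.
import ipaddress
import struct

def pack_type5_nlri(rd: bytes, esi: bytes, eth_tag: int,
                    prefix: str, gw_ip: str, vni: int) -> bytes:
    net = ipaddress.ip_network(prefix)
    return (rd + esi
            + struct.pack("!I", eth_tag)          # Ethernet Tag ID (4)
            + struct.pack("!B", net.prefixlen)    # IP Prefix Length (1)
            + net.network_address.packed          # IP Prefix (4)
            + ipaddress.ip_address(gw_ip).packed  # GW IP Address (4)
            + vni.to_bytes(3, "big"))             # 24-bit VNI in the label field

nlri = pack_type5_nlri(b"\x00" * 8, b"\x00" * 10, 0,
                       "10.2.0.0/16", "0.0.0.0", 50001)
print(len(nlri))  # 34 octets for the IPv4 case
```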
