VXLAN Underlay eBGP Two-AS

This topic is to discuss the following lesson:

Thanks Rene, great lesson! Regarding “disable-peer-as-check”, could the “as-override” command be used as another option? If not, why?

Hello Mohammed

You bring up a very good point. These two commands are similar, but they do two different things.

disable-peer-as-check: Disables the peer-AS check that BGP performs when advertising routes. By default, a router will not advertise a route to an eBGP peer if that peer’s AS number already appears in the AS_PATH, because the peer would be expected to reject it as a loop. This matters in designs where routes must pass between routers that share an AS number (e.g., data center fabrics where every Leaf uses the same AS).
as-override: This is used primarily in MPLS/VPN setups to replace a customer’s AS number with the provider’s AS in the AS_PATH. This allows customer sites that share the same AS number to accept each other’s routes, because the receiving site no longer sees its own AS in the AS_PATH and the standard eBGP loop check passes.

as-override modifies the AS_PATH (replacing the peer’s AS with the local AS), whereas disable-peer-as-check preserves the original AS_PATH and only disables the check on the advertising router.
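
To make the difference concrete, here is a minimal Python sketch of the two behaviors. It is a toy model, not vendor code: the AS numbers 65001 and 65002 are hypothetical, and it assumes the peer-AS check is applied on the advertising router, which is my understanding of how NX-OS treats disable-peer-as-check.

```python
# Toy model of the two knobs. AS numbers are hypothetical, not from the lesson.

def peer_as_check_allows(as_path, peer_as, disable_peer_as_check=False):
    """Sender-side check: by default, do not advertise a route to an eBGP
    peer whose AS number already appears in the AS_PATH."""
    return disable_peer_as_check or peer_as not in as_path

def as_override(as_path, peer_as, local_as):
    """Replace every occurrence of the peer's AS with the local AS."""
    return [local_as if asn == peer_as else asn for asn in as_path]

def advertise(as_path, local_as):
    """eBGP prepends the local AS when advertising to an external peer."""
    return [local_as] + as_path

SPINE_AS, LEAF_AS = 65001, 65002
route_from_leaf = [LEAF_AS]  # AS_PATH of a route a Spine learned from one Leaf

# Default behavior: the Spine will not send this route to another Leaf in 65002.
blocked = not peer_as_check_allows(route_from_leaf, LEAF_AS)

# disable-peer-as-check: the route is sent and the AS_PATH is left intact.
allowed = peer_as_check_allows(route_from_leaf, LEAF_AS, disable_peer_as_check=True)
preserved_path = advertise(route_from_leaf, SPINE_AS)

# as-override: the Leaf AS is rewritten before the route is sent.
overridden_path = advertise(as_override(route_from_leaf, LEAF_AS, SPINE_AS), SPINE_AS)

print(blocked, allowed, preserved_path, overridden_path)
```

In the preserved case the receiving Leaf still sees its own AS in the path, so it needs its own mechanism to accept the route. In the overridden case the original AS information is lost.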

For VXLAN underlays using eBGP like the one in the lesson, disable-peer-as-check is preferred. In such a two-AS eBGP underlay, routes are exchanged directly between Spines and Leafs, and all Leafs share one AS number. The primary issue is that a Spine will not, by default, advertise a route to a Leaf whose AS already appears in the AS_PATH, which is the case for every route learned from another Leaf. disable-peer-as-check solves this by allowing Spines to advertise routes to Leafs with the same AS number. as-override would rewrite the AS_PATH and is unnecessary in this topology, since the original AS_PATH can simply be preserved. The following lesson shows an example of where as-override is useful.

I hope this has been helpful!

Laz

Hi Rene,

Thank you for the VXLAN series.
Do you have a lesson that explains how a VXLAN fabric can have a border leaf configuration with eBGP to an external WAN router?

Thanks
Richita

Hello Richita

At this point, Rene doesn’t have a lesson that describes a border leaf configuration with an eBGP connection to an external WAN. However, if this is a lesson you’d like to see on the site, you can make your suggestion at this Member Ideas page:

There you may find that others have made similar suggestions, and you can add your voice to theirs.

I hope this has been helpful!

Laz

Hello,

I would like to ask why the LEAF2 switch won’t even forward an ARP request from S1 to S2 before I configure the next-hop to remain unchanged. I have done a packet capture and seen that the ARP request is sent properly to a multicast address and reaches LEAF2, but the capture from the link between LEAF2 and S2 shows that the ARP request is not forwarded onto that link. Why exactly does this happen? Is it because the VXLAN-encapsulated ARP request comes from the 3.3.3.3 address while the NVE peer table shows only 1.1.1.1 at that point, so LEAF2 considers the packet erroneous and discards it?

Thanks,
Alex

Hello Aleksander

That’s an excellent and very insightful question. Your troubleshooting process is perfect, and your hypothesis is correct. You have pinpointed the reason for the failure.

Let’s break down the “why” in detail.

The behavior you’re observing is a direct result of two interacting mechanisms: the default next-hop-self rule of eBGP and the VTEP source validation check performed by the VXLAN data plane.

The Control Plane Problem: eBGP next-hop-self

In an eBGP EVPN fabric, your leaf switches are in different ASes.

  • When LEAF1 advertises its EVPN routes (like the Type-3 route for BUM traffic handling) to LEAF2, the eBGP protocol rules dictate that it must change the NEXT_HOP attribute to its own source IP address used for the peering.
  • LEAF1’s VTEP source is its Loopback0 IP (3.3.3.3) and its physical link IP for peering is 192.168.13.3.
    • LEAF1 sends the BGP update, but changes the next-hop from 3.3.3.3 to 192.168.13.3.
    • LEAF2 receives this update and installs the route. It now populates its list of valid remote VTEPs (the NVE peer list) for that VNI with the IP 192.168.13.3. It has no knowledge of 3.3.3.3 from the control plane.
  • You can verify this on LEAF2 before the fix with:
    • show bgp l2vpn evpn all (Notice the Next Hop for routes from LEAF1)
    • show nve peers (Notice that LEAF1’s VTEP IP 3.3.3.3 is missing)

The Data Plane Impact: VTEP Source Validation

Now, for the data plane:

  • When a server like S1 sends an ARP request, LEAF1 encapsulates it in a VXLAN packet. The outer source IP of this packet is its VTEP address, which is 3.3.3.3 (from Loopback0).
    This packet is sent to the VNI’s multicast group and is received by LEAF2.
  • Upon receipt, LEAF2 performs a critical security and loop-prevention check: “Did this packet come from a valid, known VTEP for this VNI?”
    • It checks the outer source IP (3.3.3.3) against its NVE peer list.
    • As we established, the list only contains 192.168.13.3. The check fails because 3.3.3.3 is an unknown source. The packet is silently dropped.
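
The check described above can be sketched in a few lines of Python. This is only a model of the logic, not how any switch implements it. The function name is hypothetical, the addresses are the ones from this thread, and the "after" peer list assumes that keeping the next-hop unchanged makes LEAF2 learn 3.3.3.3 as the peer.

```python
from ipaddress import ip_address

def accept_vxlan_packet(outer_src_ip, nve_peers):
    """Return True if the outer source IP belongs to a known remote VTEP."""
    return ip_address(outer_src_ip) in nve_peers

# Before the fix: the peer was learned from the rewritten BGP next hop.
peers_before = {ip_address("192.168.13.3")}

# After the next hop is left unchanged (assumption): the VTEP loopback is learned.
peers_after = {ip_address("3.3.3.3")}

# LEAF1 always sources its VXLAN packets from Loopback0 (3.3.3.3).
dropped = not accept_vxlan_packet("3.3.3.3", peers_before)
forwarded = accept_vxlan_packet("3.3.3.3", peers_after)

print(dropped, forwarded)
```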

This is precisely what you observed with your packet capture. The packet enters LEAF2 but is dropped before it can be decapsulated and sent to the local interface. Good going in your diagnosis! If you need any other help or clarification, let us know!

I hope this has been helpful!

Laz

Hello!

Thanks for an excellent lesson.

I have a question: in the lesson, Rene configures maximum-paths on the underlay BGP to be able to “ECMP” traffic. From my understanding, and please correct me if I’m wrong, it is possible to ECMP multicast traffic (flooded ARP requests, etc.) because the Anycast address of the RP is reachable via all SPINE switches, so that different LEAF switches send their PIM Joins to different SPINE switches. Is this correct?

However, from a single VNI perspective, how is VXLAN-encapsulated traffic going to be ECMP’d? The source and destination IP are always the same (VXLAN tunnel source and destination). So is ECMP considered among traffic from multiple VNIs then (different VXLAN IP destinations)?

What about fragmentation? I have talked to a lot of network engineers who are wary of ECMP because they say that, if fragmentation happens, things can go bad with packet reordering. Is that why Rene recommends increasing the MTU in one of his lessons?

Thanks,

Jose

Hello Jose

It looks like you have a good understanding of the concepts. When you configure Anycast RP on your SPINE switches (all SPINEs advertise the same RP IP address), enabling BGP maximum-paths creates multiple equal-cost routes to that RP address (and to all RPs that have that address) in the underlay routing table.

Now this introduces an apparent paradox. If the outer source IP (local VTEP) and outer destination IP (remote VTEP) are always the same, how can ECMP work? Well, the way it works is by using UDP source port entropy as described in RFC 7348, where it specifically states:

  -  Source Port:  It is recommended that the UDP source port number
     be calculated using a hash of fields from the inner packet --
     one example being a hash of the inner Ethernet frame's headers.
     This is to enable a level of entropy for the ECMP/load-
     balancing of the VM-to-VM traffic across the VXLAN overlay.
     When calculating the UDP source port number in this manner, it
     is RECOMMENDED that the value be in the dynamic/private port
     range 49152-65535

Explained another way, when a LEAF encapsulates an inner Ethernet frame into VXLAN, it doesn’t use a fixed UDP source port. Instead, the ingress VTEP calculates a hash based on the inner packet headers (inner src/dst IP, inner src/dst TCP/UDP ports, protocol, etc.). This hash value is inserted into the outer UDP source port field, so different inner flows generally get different outer UDP source ports.

This ensures a level of randomness that enables ECMP. The underlay switches (SPINEs) perform standard 5-tuple hashing:

{Outer Src IP, Outer Dst IP, Protocol, Outer UDP Dst Port (4789), Outer UDP Src Port}

While the first four values are constant for a given VTEP pair, the fifth value (UDP source port) varies per inner flow. Different inner flows thus hash to different ECMP paths. This results in the following behavior:

  • Single TCP flow: Takes one consistent path (good - preserves ordering)
  • Many flows: Distributed across all available ECMP paths (good - load balancing)
  • Per-VNI ECMP: You don’t get ECMP based on VNI alone. Entropy comes from inner flow diversity, which often correlates with different VNIs carrying different endpoint traffic. So ECMP works at a per-flow granularity, not per-VNI or per-VTEP-pair, thanks to the outer UDP source port mechanism.
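
The mechanism above can be sketched in Python as follows. Treat it as an illustration only. Real switches use their own hardware hash functions, so SHA-256 here is just a convenient stand-in. The function names, the VTEP addresses in the outer header, and the inner flow values are all hypothetical.

```python
import hashlib

PORT_MIN, PORT_MAX = 49152, 65535  # dynamic/private range recommended by RFC 7348

def _hash(fields):
    """Stand-in hash over a tuple of header fields (not a real ASIC hash)."""
    digest = hashlib.sha256("|".join(map(str, fields)).encode()).digest()
    return int.from_bytes(digest[:4], "big")

def vxlan_source_port(inner_flow):
    """Ingress VTEP: derive the outer UDP source port from the inner headers."""
    return PORT_MIN + _hash(inner_flow) % (PORT_MAX - PORT_MIN + 1)

def ecmp_path(outer_5tuple, num_paths):
    """Underlay switch: pick one of the equal-cost paths from the outer 5-tuple."""
    return _hash(outer_5tuple) % num_paths

def path_for_flow(inner_flow, num_paths=2):
    sport = vxlan_source_port(inner_flow)
    # Outer IPs, protocol (17 = UDP) and destination port 4789 never change.
    outer = ("3.3.3.3", "1.1.1.1", 17, sport, 4789)
    return sport, ecmp_path(outer, num_paths)

# Inner flow: (src IP, dst IP, protocol, src port, dst port)
flow = ("10.0.0.1", "10.0.0.2", 6, 40000, 443)
print(path_for_flow(flow))  # the same flow always yields the same port and path
```

Calling path_for_flow with many different inner source ports spreads the results across the available paths, while any single flow stays pinned to one path.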

Now this does present an issue with fragmentation and packet reordering. This is a concern whenever ECMP is involved, whether in a VXLAN environment or otherwise. But when applied to VXLAN:

  • When an IP packet is fragmented, the first fragment contains the full IP + UDP/TCP headers (the complete 5-tuple)
  • Subsequent fragments contain only IP headers and Layer 4 port information is missing
  • If underlay switches hash on the 5-tuple, the first fragment and subsequent fragments may hash differently
  • Thus, fragments can take different paths, arriving out-of-order or timing out, causing reassembly failures and severe performance degradation
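
The effect can be modeled in Python under one loud assumption: the switch hashes on the 5-tuple when Layer 4 ports are present and falls back to a 3-tuple when they are not. Actual platforms differ in how they treat fragments, and SHA-256 is only a stand-in for a hardware hash. The function names and addresses are hypothetical.

```python
import hashlib

def _bucket(fields, num_paths):
    """Stand-in hash mapping header fields to one of the ECMP paths."""
    digest = hashlib.sha256("|".join(map(str, fields)).encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

def ecmp_path_for_packet(src_ip, dst_ip, proto, sport=None, dport=None, num_paths=4):
    """First fragments carry L4 ports (5-tuple hash). Later fragments do not,
    so this model falls back to hashing only the 3-tuple (assumption)."""
    if sport is None or dport is None:
        return _bucket((src_ip, dst_ip, proto), num_paths)
    return _bucket((src_ip, dst_ip, proto, sport, dport), num_paths)

# Many VXLAN flows between the same VTEP pair, differing only in UDP source port.
flows = [("3.3.3.3", "1.1.1.1", 17, sport, 4789) for sport in range(49152, 49352)]

# Paths taken by the first fragment of each flow.
first_paths = {ecmp_path_for_packet(*f) for f in flows}

# Every non-first fragment between this VTEP pair hashes to the same single path.
later_path = ecmp_path_for_packet("3.3.3.3", "1.1.1.1", 17)

# Flows whose first fragment took a different path than their remaining fragments.
split_flows = [f for f in flows if ecmp_path_for_packet(*f) != later_path]
print(len(first_paths), later_path, len(split_flows))
```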

The solution to this, as you have suggested, is increasing the MTU in the underlay. This prevents fragmentation of VXLAN packets rather than trying to manage it. VXLAN encapsulation adds about 50 bytes of overhead with an IPv4 underlay, so the underlay MTU must be at least that much larger than the MTU used by the endpoints. It is generally the best practice for mitigating fragmentation. It’s not optional: it’s a hard requirement to avoid the fragmentation + ECMP reordering problem you’ve correctly identified.
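
As a rough guide to how much larger the MTU needs to be, here is a small Python calculation. It assumes an IPv4 underlay and standard header sizes. The function name is hypothetical, and your platform documentation should be the final word on what its MTU setting actually counts.

```python
# Header sizes in bytes
INNER_ETHERNET = 14   # inner Ethernet header carried inside the VXLAN payload
DOT1Q_TAG = 4         # optional 802.1Q tag on the inner frame
VXLAN_HEADER = 8
OUTER_UDP = 8
OUTER_IPV4 = 20       # assumes an IPv4 underlay without IP options

def required_underlay_ip_mtu(inner_ip_mtu=1500, inner_vlan_tag=False):
    """Smallest underlay IP MTU that carries a full-size inner packet unfragmented."""
    inner_frame = INNER_ETHERNET + (DOT1Q_TAG if inner_vlan_tag else 0) + inner_ip_mtu
    return OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + inner_frame

print(required_underlay_ip_mtu())                     # 1550
print(required_underlay_ip_mtu(inner_vlan_tag=True))  # 1554
```

Many deployments avoid the exact arithmetic by enabling jumbo frames on all underlay links, which leaves plenty of headroom.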

I hope this has been helpful!

Laz

Thanks a lot LAZ for an excellent explanation!