Cisco SD-WAN Service VPN

Hello Manikanth

Yes, this is a valid design approach. TLOC Extension allows one SD-WAN device (e.g., C8300 Edge) to extend its transport reachability to another device (e.g., vEdge) when one of them has no direct transport access. This is useful when only one device has a transport link, and you want the second device to use it for connectivity.

We don’t currently have any lessons that deal with TLOC extension in an SD-WAN environment. However, you can go to the Member Ideas page below and make your suggestion. You may find that others have had similar suggestions, and you can add your voice to theirs:

In the meantime, the Cisco SD-WAN design guide is an excellent source for understanding how this feature works.

I hope this has been helpful!

Laz

Hi,

I had the same issue as Manoj: all BFD sessions between different colors (i.e. biz-pub, pub-biz) were down with 2 x static summary routes of 10.0.0.0/8. When I changed them to default routes (0.0.0.0/0), everything came up.

I would appreciate your comment on this, as I can’t seem to find an answer to the original question.

Thanks, Maciej Stanecki

Hello Maciej

Thanks for revisiting the post issued by Manoj. You’re right, it wasn’t addressed. Since you reproduced Manoj’s results, this behavior can be considered confirmed.

Let’s see if we can break down and explain the reasoning for this behavior. You and Manoj observed partial WAN connectivity, and BFD sessions were down between different TLOC color combinations. The introduction of a default route seems to fix the issue. However, when using a static summary route, the issue persists and BFD remains down.

Without having performed the lab directly, I can share with you my thoughts about why this takes place. Let’s look at BFD in more detail. BFD relies on establishing a tunnel between the local TLOC and the remote TLOC IP. The BFD session cannot come up if there is no valid route.

Since BFD comes up with the default route set but not with just the summary routes, I would further investigate reachability when just the summary routes are used. Each color pair must have bi-directional reachability to the public IP (TLOC) of the other side, and this routing should be symmetric. Also, check whether the IPsec SA is established successfully when just the summary routes are used. Since BFD runs inside the IPsec tunnel, it could be that the tunnel itself is not fully operational. Check these parameters and let us know how you get along in your troubleshooting.
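As a rough way to reason about the reachability check, here is a minimal Python sketch of a longest-prefix-match lookup against the transport routes. The TLOC addresses are hypothetical (they are not from Maciej's or Manoj's labs), and the sketch assumes the "pub" color uses addresses outside 10.0.0.0/8, which may or may not match your topology.

```python
import ipaddress

def best_route(routes, destination):
    """Return the longest-prefix-match route covering destination, or None."""
    dest = ipaddress.ip_address(destination)
    networks = [ipaddress.ip_network(r) for r in routes]
    matches = [n for n in networks if dest in n]
    return max(matches, key=lambda n: n.prefixlen, default=None)

# Hypothetical remote TLOC addresses, for illustration only.
REMOTE_TLOCS = {
    "biz": "10.20.0.1",      # assumed to sit inside 10.0.0.0/8
    "pub": "203.0.113.10",   # assumed to sit outside 10.0.0.0/8
}

def unreachable_tlocs(routes):
    """List the remote colors whose TLOC IP has no matching route."""
    return sorted(
        color for color, ip in REMOTE_TLOCS.items()
        if best_route(routes, ip) is None
    )

print(unreachable_tlocs(["10.0.0.0/8"]))  # summary route only
print(unreachable_tlocs(["0.0.0.0/0"]))   # default route
```

If a remote TLOC shows up as unreachable with only the summary route, no tunnel (and therefore no BFD session) can form toward that color, which would be consistent with the behavior you both observed.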

I hope this has been helpful!

Laz

Hi Laz,

Many thanks for coming back - appreciated.

I have already investigated this further, and the only thing that could make sense is asymmetric routing. As I have checked in the SD-WAN documentation, this could indeed cause BFD sessions to be down.

I think this explanation is good enough and I can’t seem to see any other reason anyways.

BR, Maciej Stanecki

Hello Laz and Rene,

Back again with a new challenge. I have a Firewall in my LAB sitting at DC1. I have Color Restrict enabled. In my topology, there is a site categorized as an “Untrusted ISP Site”. All traffic coming from this site over public color TLOCs has to traverse the Firewall to get inspected. So I inserted a Service Route called netsvc2 that points to the OUTSIDE interface of the Firewall. Then, I applied a Centralized Control Policy containing a Topology that inserts this service into any OMP Routes whose color is a public one, and I applied that Topology outbound towards the Untrusted site. This way, when the untrusted site routers try to reach any remote destinations over a public transport, the traffic always gets inspected by the Firewall.

Everything works fine. I can see the traffic going through the Firewall. To avoid the Firewall dropping packets due to stateless connections from its perspective, I configured a second Topology in the same Centralized Control Policy that is applied outbound towards all the trusted sites. It inserts netsvc1 into any OMP Routes whose Site-ID matches the Untrusted site and whose color is a public one. This service points to the INSIDE interface of the Firewall, and it effectively avoids the asymmetries that were causing the packets to get dropped.

Now here comes the challenge: I have 1 private transport, MPLS. This transport is available on both Untrusted ISP sites and Trusted ISP sites. Because it is a private transport owned by the Enterprise, there is no risk of ISP attacks on it. Therefore, I don’t want any traffic coming from the Untrusted ISP Sites over MPLS to go through the Firewall, since that defeats the speed and low latency of MPLS by forcing traffic to go all the way to the DC and through the Firewall.

I have got this to work on all the sites EXCEPT the Data Center where the Firewall is sitting at. Traffic from different sites directed to the Untrusted ISP site going through MPLS never touches the Firewall, and vice versa.

But for traffic originated from the Data Center where the Firewall sits, I have not found a way to achieve this: the problem is that when the Data Center router has to INITIALLY send traffic to the Untrusted ISP Site over a public color transport, then such traffic does NOT traverse the INSIDE interface of the Firewall. I haven’t found a way to insert the Service route netsvc1 locally so that before traffic originated on the Data Center gets sent over the WAN on public transports, it goes through the Firewall INSIDE interface. This causes the Firewall to drop the return traffic from the Untrusted ISP site because the return traffic goes through the OUTSIDE interface of the Firewall when using a public transport, as expected.

I need to tell the Data Center router that hosts the Firewall the following:

“Whenever you are going to initiate a connection to an Untrusted ISP site and reach such site via a public transport, send all traffic to the INSIDE interface of the Firewall. If you are going to use a private transport, send all traffic directly to the destination”.

I thought about configuring a static route for the Untrusted ISP Sites prefixes pointing to the Firewall INSIDE Interface. But this will also result in traffic sent to Untrusted ISP sites over MPLS having to traverse the INSIDE interface of the Firewall, and getting dropped due to a stateless connection.

Any ideas please?

Hello Jose

Just to clarify, this is a topology and a deployment of the network with which I don’t have direct hands-on experience. However, I can offer some suggestions or share thoughts that may help point you in the right direction for troubleshooting and a resolution.

It looks like this has to do with the following SD-WAN limitation: centralized control policies work perfectly for transit traffic, but don’t apply to locally-originated traffic on the same router hosting the service.

Your centralized control policy correctly inserts:

  • netsvc2 (FW OUTSIDE) for OMP routes toward Untrusted sites over public colors
  • netsvc1 (FW INSIDE) for OMP routes from Untrusted sites over public colors toward trusted sites

This works for all remote sites because their traffic transits through the DC edge, which performs FIB lookups on OMP routes with service attributes.

However, when traffic originates from the DC router itself, the service-insertion logic doesn’t apply the same way. The DC router has the OMP routes with service attributes in its database, but locally-originated traffic doesn’t hairpin through the firewall based on those attributes. It goes directly out of the WAN interface. Am I describing it correctly so far?

The result is that outbound packets bypass FW INSIDE, but return packets hit FW OUTSIDE (due to netsvc2 on the Untrusted site), creating asymmetry and firewall drops.
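To make the asymmetry concrete, here is a toy Python model of a stateful firewall's session table. It is only a sketch under simplifying assumptions: the class, method names, and host addresses are hypothetical, and a real firewall tracks far more state (ports, protocol, timers, policy).

```python
class StatefulFirewall:
    """Toy model: return traffic on OUTSIDE is allowed only if the
    outbound flow was first seen on INSIDE and created a session."""

    def __init__(self):
        self.sessions = set()

    def outbound_via_inside(self, src, dst):
        # Outbound packet crosses INSIDE -> OUTSIDE and creates session state.
        self.sessions.add((src, dst))

    def return_via_outside(self, src, dst):
        # Return packet arrives on OUTSIDE; permit it only if a session
        # exists for the reverse direction of this flow.
        return (dst, src) in self.sessions

# Hypothetical hosts, for illustration only.
DC_HOST = "10.1.0.10"
UNTRUSTED_HOST = "10.9.0.10"

# Case 1: a remote trusted site; netsvc1 steers the flow through INSIDE.
fw = StatefulFirewall()
fw.outbound_via_inside(DC_HOST, UNTRUSTED_HOST)
print(fw.return_via_outside(UNTRUSTED_HOST, DC_HOST))

# Case 2: DC-originated traffic goes straight out the WAN interface,
# so the firewall never sees the outbound packet.
fw = StatefulFirewall()
print(fw.return_via_outside(UNTRUSTED_HOST, DC_HOST))
```

In the second case the return packet finds no session and is dropped, which matches the symptom you describe.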

To resolve this, several approaches can be used. However, I suggest focusing on creating a separate service node for the DC LAN. This way, you can keep your existing service site as is, where the firewall resides, but ensure that you have no user or LAN subnets behind it. Create a second site that has a separate SD-WAN edge (with a different site ID or physical device from the service site) and treat it as a trusted site in your centralized policy.

If you want to avoid creating a separate edge node and you need to use the same hardware, you can use a VPN separation approach, where you create an additional VPN on the same edge hardware and introduce route leaking between VPNs. But be aware that with this solution, the fundamental issue remains: you cannot easily say “route prefix X via firewall when using public color, but direct when using MPLS” within a single routing table. The VPN separation approach requires either:

  • Different prefixes for FW-required vs MPLS-only destinations, OR
  • Additional VPNs (e.g., VPN 30 for MPLS-only with direct routes), OR
  • Complex policy-based routing (version-specific and difficult to scale and maintain)

For a cleaner solution, creating a separate node is the best option. Some Cisco resources that may be helpful can be found at this site containing a series of SD-WAN design guides.

Let us know how you get along, and if we can help you any further in a particular direction…

I hope this has been helpful!

Laz

Hello Laz,

Thank you for the help. Your description of the issue that I am having is completely correct.

The solution you proposed is effective and gets the job done. However, implementing it will require extra devices, maintenance, software updates for the extra router, extra hardware for HA designs, …

I wanted to avoid the cost of having to deploy extra gear to achieve the objective. So, this is the solution I employed, hoping that someone with the same issue finds it useful or as another possibility that is verified to work:

LABSDW006-Service_FW.drawio.pdf (502.1 KB)

There is one more thing to keep in mind: for the traffic originated in the Data Center, we need to ensure that it goes through the INSIDE of the Firewall when using public transports to reach Untrusted ISP Sites. So we configured a static route that aggregates all the Untrusted ISP Site prefixes and points to the INSIDE Interface, with an AD of 252. This means that if MPLS is up and used, the Firewall will not be traversed (OMP’s AD of 251 beats 252), but if MPLS is down, then DC-ORIGINATED TRAFFIC will be forwarded following the static route and therefore via the INSIDE Interface of the Firewall, and then via a public transport on VPN 6137. (I know that this partially defeats the SVC Injection of netsvc1, but that’s why I used an AD of 252. Please let me know if you would do it differently.)
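The AD tie-break described above can be sketched in a few lines of Python. This is a simplified illustration, not router behavior in full: the `Route` type and next-hop labels are hypothetical, and it assumes the OMP route and the static route compete for the same prefix (on a real router, longest-prefix match is evaluated before administrative distance, so a more specific OMP route would win regardless of AD).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    source: str
    next_hop: str
    admin_distance: int

def select_route(candidates):
    """Pick the candidate with the lowest administrative distance."""
    return min(candidates, key=lambda r: r.admin_distance, default=None)

# AD values taken from the post; next-hop labels are placeholders.
OMP_VIA_MPLS = Route("omp", "mpls-direct", 251)
STATIC_VIA_FW = Route("static", "fw-inside", 252)

def dc_next_hop(mpls_up):
    """Next hop for DC-originated traffic toward the Untrusted site."""
    candidates = [STATIC_VIA_FW]
    if mpls_up:
        # The OMP route over MPLS is only a candidate while MPLS is up.
        candidates.append(OMP_VIA_MPLS)
    return select_route(candidates).next_hop

print(dc_next_hop(mpls_up=True))
print(dc_next_hop(mpls_up=False))
```

With MPLS up, the OMP route wins and the Firewall is bypassed; with MPLS down, only the static route remains and traffic is sent to the INSIDE interface.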

The limitation is that transport asymmetry will cause traffic to be dropped in some combinations. It is complex to ensure that traffic is returned from a remote site via the same transport it was sent on: I have AAR policies with SLA Classes, different transports at different sites… Ideally, in a perfect network, all sites would have the same number of transports and the same SLA compliance at all times over all transports. But realistically, I cannot guarantee that, and therefore transport asymmetries occur. For example, when the Public Internet transport is not compliant with the SLA Class for certain traffic on a Trusted Site, MPLS will be used for such traffic, but on the Untrusted Site MPLS might be out of compliance and the Public Internet will be used. This results in the FW seeing stateless connections and dropping them, as I tested.

If you have any ideas or solutions to ensure this transport symmetry, or you would correct anything I said, I will love to hear it all.

Thanks Laz!

Hello Jose

Yes, I understand. Typically, the proper solution, and often the Cisco-certified solution, involves a more expensive arrangement, but there you go. It’s understandable that you want to resolve this with the equipment and infrastructure you have. Thanks for sharing your solution with us. I myself don’t have the experience with this kind of setup to share anything more useful. I will, however, let @ReneMolenaar know to take a look as well to see if he has any more insight into the situation.

However, I would like to share that SD-WAN transport selection is inherently per-direction and per-site. Each edge router makes independent decisions based on local SLA measurements (loss, latency, jitter, availability). There is no built-in mechanism to “guarantee” end-to-end symmetric transport selection between arbitrary sites when AAR is actively managing path choice. You said that there may be transport asymmetry when SLAs are not met, and this is precisely correct. But keep in mind that this is a fundamental architectural constraint, and not a configuration mistake. I hope Rene will have some more insight for you…
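A small Python sketch may help illustrate why independent per-site decisions lead to asymmetry. The function, the preference order, and the SLA values below are hypothetical and simply mirror the example you gave; real AAR behavior when no transport is compliant depends on the policy configuration and is not modeled here.

```python
def pick_transport(preferred_order, sla_compliant):
    """Return the first preferred transport that locally meets the SLA.

    Returns None when no transport is compliant (fallback not modeled).
    """
    for transport in preferred_order:
        if sla_compliant.get(transport, False):
            return transport
    return None

# Each site evaluates SLA compliance from its own measurements.
trusted_site_sla = {"public-internet": False, "mpls": True}
untrusted_site_sla = {"public-internet": True, "mpls": False}

# Hypothetical preference order shared by both sites.
preference = ["public-internet", "mpls"]

forward_path = pick_transport(preference, trusted_site_sla)
return_path = pick_transport(preference, untrusted_site_sla)

print(forward_path, return_path, forward_path == return_path)
```

Here the Trusted Site sends over MPLS while the Untrusted Site replies over the Public Internet, so a stateful firewall in only one of those paths sees half of the conversation.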

I hope this has been helpful!

Laz

Thanks Laz.

The truth is that Cisco (and many courses and people) sell service insertion with firewalls as something simple and easy with SD-WAN. But in a big enterprise with many transports and a wide variety of requirements, like the scenario we have in my organization (and that I recreated in my lab), things are not as simple to implement as just “a few clicks” or a “quick workflow”.

Hopefully Rene can provide some insight on the issue I was having.

I thought about separating VPNs per transport on the branch side too, peering with the core layer of the LAN via BGP, and learning LAN prefixes on both VPNs, for every site on the fabric. The LAN default route to each VPN can be tuned as desired to prefer the private or public transport (avoiding load balancing, for sure). This way, a “public color users VPN” and a “private color users VPN” could be a good way to insert the FW on the OMP Routes of the “public color users” VPN only, since transport symmetry by private or public color becomes pretty solid. I will let you know if I manage to get that to work.

Thanks again!

Hello Jose

Indeed, your frustration with the deployment is understandable. There’s often a gap between “simple service insertion” and the actual complex reality of large enterprise SD-WAN deployments with multiple transports and diverse requirements. Introducing stateful firewalls that require symmetric routing can often be very challenging.

Your idea to separate service VPNs per transport (e.g., VPN 10 for private/MPLS users, VPN 20 for public/Internet users etc) and use BGP peering to the LAN core to control which traffic enters which VPN is technically sound. However, it does introduce some characteristics that eliminate the advantages of SD-WAN automation. For example:

  1. You lose native failover: if MPLS fails, VPN 10 loses connectivity (unless you build complex BGP conditional advertisements or route leaking).
  2. You add operational overhead by doubling the VPNs, subnets, and BGP sessions.
  3. You increase overall complexity, which makes troubleshooting harder.

These are just some general things to think about as you approach the problem. I’ll ping @ReneMolenaar to take a look when he has a chance so he can add his expertise to the conversation…

I hope this has been helpful!

Laz
