Multicast RPF Select on Cisco 3850

I’m having a very strange issue with a 3850 right now. We use 3850s as building routers on campus. Each building gets its own 3850, and then there is a variety of 4500, 2960X, and 9300 switches deployed for layer 2.

On each 3850 we have the traffic separated into two VRFs: let’s call them Campus and Dist. I was recently asked to look at multicast configurations. We had a history of PIM sparse-mode configs that no one had ever really examined. Long story short, it seems that PIM sparse-mode was put on interfaces, but no RP address was ever configured, nor was any thought given to passing multicast from one VRF to the other, if ever required.

Changes that I made:

Added an RP for each VRF using the loopback interface address for that VRF.

Created a standard ACL called MCAST-GRP and added the address I was testing with: 239.1.1.5

Added a config line to steer the RPF process so as not to drop traffic coming from one VRF to the other:

ip multicast vrf Dist rpf select vrf Campus group-list MCAST-GRP

Confirmed that multicast-routing was enabled for each vrf:

ip multicast-routing vrf Campus
ip multicast-routing vrf Dist
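
Pulled together, the changes amount to something like the following. The RP addresses are elided as x.x.x.x in this thread (the .128/.129 values come from the later posts), and the ACL body is an assumption based on the description:

```
! enable multicast routing per VRF
ip multicast-routing vrf Campus
ip multicast-routing vrf Dist
!
! the group under test
ip access-list standard MCAST-GRP
 permit 239.1.1.5
!
! one static RP per VRF, using that VRF's loopback address
ip pim vrf Campus rp-address x.x.x.129
ip pim vrf Dist rp-address x.x.x.128
!
! steer the RPF lookup for MCAST-GRP traffic toward vrf Campus
ip multicast vrf Dist rpf select vrf Campus group-list MCAST-GRP
```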

I then used a command-line utility to test (multicast-tester). This is a simple Python-based multicast testing tool that I use on Macs. On Windows or Linux, the CisTech NetSpanner is a fantastic tool for testing multicast, but alas there is no Mac version.
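For anyone without such a tool handy, the core of what these testers do is just a pair of UDP sockets. Here is a minimal, hypothetical Python sketch (not the actual multicast-tester code) using the group from this thread; the port number is an arbitrary assumption:

```python
import socket
import struct

GROUP = "239.1.1.5"   # the test group from this thread
PORT = 5001           # arbitrary test port (assumption)

def make_sender(ttl: int = 16) -> socket.socket:
    """UDP socket for sourcing a multicast stream.

    The TTL must be high enough to survive every L3 hop between
    sender and receiver; a TTL of 1 stays on the local subnet.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return s

def make_receiver(group: str = GROUP, port: int = PORT) -> socket.socket:
    """Bind to the group's port and send an IGMP join for the group."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", port))
    # INADDR_ANY lets the kernel pick the interface for the join
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return s
```

Usage is then `make_sender().sendto(b"mcast-test", (GROUP, PORT))` on the source host, and `make_receiver().recvfrom(1024)` on the host that should join the group.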

I ran a test and it worked fine. I was able to start a multicast stream on a host in vrf Campus and join that group on a host in vrf Dist. After that I was looking at some show commands and ran:

sh ip pim vrf Campus rp mapping

This immediately hung my console and I had to open another session. There was no high CPU or memory usage to explain the hung console. After that, my rpf select multicast tests failed. I removed all of the VRF-based multicast config that I had added and went home. I came back this morning, laid the commands back in, and tested. Again, it worked fine. OK, good. I am, in fact, not crazy. I then ran that same show command and got the same result: my console hung (but at least this time it did show the mapping), and my multicast tests failed again.

I can still ssh in from a new session and the router is fine. But this seems like pretty strange behaviour. And I just tested again and multicast is still broken from one vrf to the other. My console session did eventually come back. It took a few minutes. Very odd.

Anyone have any potential insight here?


Hello Don

Wow, that’s a strange behavior! A show command that causes a switch to freeze, and affects the operation of the actual multicast mechanism!

I can share with you some thoughts about this, but I can’t guarantee that the information I give is verified. It should only be used as a guideline.

This looks and feels like a bug. According to various reports, the 3850 platform does have a history of bugs related to PIM in VRF contexts. Specifically, this behavior matches the characteristics of a control-plane process deadlock or crash in the PIM/multicast subsystem, not a configuration error. Your approach of steering RPF lookups across VRF boundaries looks syntactically correct and is the recommended way to approach this requirement.

I have been unable to find any specific bugs that match this behavior, but the best bet would be to open a TAC case with Cisco to dig deeper into the intricacies of this behavior. Let us know how your troubleshooting goes!

I hope this has been helpful!

Laz

Hi, thanks for the reply.

I seem to have stabilized this today. I added one additional rp-address global command:

ip pim vrf Dist rp-address x.x.x.x MCAST-GRP (the rp-address for Campus vrf)

Now it works. I can run VRF-based PIM show commands, and while the console still hangs a bit after the initial output, it no longer breaks the inter-VRF multicast traffic.

So now I have three rp-address global commands instead of two. I also still need the vrf select global command in place. All I did to stabilize this was add the third rp-address global command.

This addition should ensure that both RP trees will have an entry for 239.1.1.5.

It looks like this now:

ip pim vrf Campus rp-address x.x.x.129
ip pim vrf Dist rp-address x.x.x.128
ip pim vrf Dist rp-address x.x.x.129 MCAST-GRP

Hello Don

Ah, this may give some insight into what was actually happening and why the router seemed to just hang. This is just an assumption, but maybe it will help you understand the behavior better.

The key here is that PIM RP mappings are maintained per-VRF. For inter-VRF multicast to work correctly, each VRF involved must have consistent knowledge of which RP serves which multicast groups.

Before your fix, in VRF Campus you had group 239.1.1.5 mapped to RP x.x.x.129 (its default RP for all groups), while in VRF Dist, group 239.1.1.5 was mapped to RP x.x.x.128 (that VRF’s default RP for all groups). This may have created an RP mapping mismatch: when inter-VRF multicast tried to build PIM state across the VRFs, one VRF was trying to build (*,G) shared trees toward RP .129 while the other was using RP .128. This inconsistency may have caused RPF check failures and unstable PIM Join/Prune behavior, producing the kind of state churn that could explain the console hang.

But after your fix, by adding ip pim vrf Dist rp-address x.x.x.129 MCAST-GRP, you ensured that group 239.1.1.5 now maps to RP .129 in both VRFs. The group-list mapping is more specific than the default RP statement, so IOS now selects RP .129 consistently across VRFs.
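The selection rule at work here (a group-list-scoped RP mapping beats the default, catch-all RP) can be modeled in a few lines of illustrative Python. This is a toy model of the behavior described above, not IOS code; the addresses mirror the thread:

```python
# Per-VRF static RP mappings after the fix, as (rp, acl) pairs.
# acl is None for a default (all-groups) RP, or the set of groups
# permitted by the group-list ACL (here, MCAST-GRP).
RP_MAPPINGS = {
    "Campus": [("x.x.x.129", None)],                  # default RP, all groups
    "Dist":   [("x.x.x.129", {"239.1.1.5"}),          # group-list MCAST-GRP
               ("x.x.x.128", None)],                  # default RP, all groups
}

def select_rp(vrf: str, group: str) -> str:
    """Return the RP a VRF uses for a group: ACL match wins over default."""
    mappings = RP_MAPPINGS[vrf]
    for rp, acl in mappings:
        if acl is not None and group in acl:
            return rp          # the group-list mapping is more specific
    for rp, acl in mappings:
        if acl is None:
            return rp          # fall back to the catch-all RP
    raise LookupError(f"no RP for {group} in vrf {vrf}")

# Both VRFs now agree on the RP for the tested group:
assert select_rp("Campus", "239.1.1.5") == select_rp("Dist", "239.1.1.5")
```

Any other group in Dist still falls through to the default RP .128, which is why only the MCAST-GRP groups needed the third rp-address statement.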

This insight may shed some more light on the topic, and the reasons behind the behavior you observe. Thanks for keeping us updated with your progress, it is invaluable for the forum.

I hope this has been helpful!

Laz

Hi, thank you.

Yes, that is my thinking as well.

Also, Catalyst 3850s are likely not the best choice here. We will be replacing them with Cat 9Ks over the next couple of years.

I’d rather work with full-service routers such as an ASR, but we are a university campus and don’t have much in the way of a traditional WAN. It’s a large campus network, but it’s all layer 3 switched at line speed. So layer 3 switches make sense: they are more cost-effective and have more interfaces, but there is always a trade-off. You also don’t get the deep NetFlow analysis that you would get with an ASR when you use an L3 switch instead.

Hello Don

Thanks for the extra info, that all makes sense. Given the campus architecture you described, with a high-speed L3 switching environment, your approach is very reasonable, especially from a cost and port-density perspective.

Platforms like ASRs do offer deeper telemetry and flow analysis, but the trade-offs don’t always justify them in a non-traditional WAN scenario like this. Looking forward to seeing how the move to the 9Ks shapes things over time!

Thanks again!

Laz