BGP Down in GRE VRF IPSEC VPN

kensmithh95 · September 20, 2023, 3:28am

Hello Community,

I am having the following issue with a GRE over IPSEC tunnel.
The syslogs from the router are indicating this:

Aug 15 16:34:59.302: %SEC_LOGIN-5-LOGIN_SUCCESS: Login Success [user: jadams] [Source: 10.28.65.221] [localport: 22] at 16:34:59 UTC Tue Aug 15 2023
Aug 15 16:35:40.338: %LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel50, changed state to down
Aug 15 16:35:40.582: %LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel50, changed state to up
Aug 15 16:35:42.527: %SYS-5-CONFIG_I: Configured from console by jadams on vty0 (10.28.65.221)
Aug 15 16:35:55.787: %LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel50, changed state to down
Aug 15 16:35:55.787: %LINK-5-CHANGED: Interface Tunnel50, changed state to administratively down
Aug 15 16:35:57.682: %LINEPROTO-5-UPDOWN: Line protocol on Interface Tunnel50, changed state to up
Aug 15 16:35:58.035: %BGP-5-NBR_RESET: Neighbor 169.254.43.14 active reset (BGP Notification sent)
Aug 15 16:35:58.035: %BGP-5-ADJCHANGE: neighbor 169.254.43.14 vpn vrf VODAFONE003 Up
Aug 15 16:35:58.035: BGP(4): (base) 169.254.43.14 send UPDATE (format) 217.171.101.0/24, next 169.254.43.13, metric 0, path Local, extended community RT:65028:503209
Aug 15 16:35:59.748: %SYS-5-CONFIG_I: Configured from console by jadams on vty0 (10.28.65.221)
Aug 15 16:51:20.903: %BGP-3-NOTIFICATION: received from neighbor 169.254.43.14 6/3 (Peer De-configured) 0 bytes

The tunnel flaps once an hour. I don’t have any keepalives configured and standard TTL 255 configured. Any suggestion to troubleshoot this?

thanks for taking the time to answer.

lagapidis · September 23, 2023, 5:43am

Hello Johan

Let’s go through some of the significant syslog messages that we see and decipher what they are telling us:

Aug 15 16:34:59.302: A successful login is registered by a user jadams from IP 10.28.65.221.
Aug 15 16:35:40.338: The line protocol for Tunnel50 goes down and then comes back up after a short interval.
Aug 15 16:35:42.527: Configuration changes are made by jadams from the console. This could mean that the user might be making changes related to the tunnel or its underlying infrastructure (like the crypto map, IPSec profile, etc.).
Aug 15 16:35:55.787: The line protocol for Tunnel50 goes down again, and it appears the interface itself is administratively shut down and then comes back up shortly afterward.
Aug 15 16:35:58.035: BGP neighborship with 169.254.43.14 resets due to the tunnel interface flapping. It then re-establishes and exchanges prefixes.
Aug 15 16:35:59.748: Additional configuration changes are made by jadams.
Aug 15 16:51:20.903: A BGP notification is received that the neighbor 169.254.43.14 has been de-configured.

From the above sequence of events, it seems that the tunnel is somewhat unstable. The reason for this instability is not very clear from the info we have, however, we can deduce the following:

There were several configuration changes made by jadams. The Tunnel50 interface status changed after the user logged in and made configurations. Similarly, the BGP peering also failed after configuration changes. This suggests that the changes made might have inadvertently affected the tunnel stability.
There’s no indication from the provided logs about any issues with the underlying physical interfaces or IPSec security associations, which are common reasons for tunnel flaps and instability. Since keepalives aren’t configured, we don’t have messages indicating keepalive failures either.
The BGP notification indicates a neighbor was de-configured. Ensure that BGP neighbor configurations are consistent and correct on both ends of the tunnel.

It is also noteworthy that the BGP neighbor’s IP address belongs to the 169.254.0.0/16 range, which makes it an IPv4 link-local address. It’s somewhat unusual to see a link-local address used in the context of a GRE tunnel endpoint or BGP neighbor configuration, as the logs you provided indicated. However, they can be used in scenarios where direct link communications are needed without requiring a routable IP. But I see that is the link to the mobile connection so it may not be a problem. Just mentioning it as a possible problematic point.

Now having said all that, from the information shared, we can’t come to a definitive conclusion as to the reason for the flapping, and especially the flapping that occurs on the hour. Hopefully this will be one step in the right direction for you to ultimately troubleshoot and resolve the issue. Let us know how you get along and if we can be of further help.

I hope this has been helpful!

Laz

kensmithh95 · September 25, 2023, 10:59pm

Hello Laz,

Thanks for taking the time to check on this.
I was able to collect some logs which I am attaching here
20230921-111620499_log.txt (146.2 KB)
It seems to be something on the remote end but not sure about the root cause.
Thanks for taking a look into this.

kensmithh95 · September 27, 2023, 4:42am

Hello Laz,

Thanks for reading me. I took some debugs about the whole issue and found this:

type or pasteSep 21 15:06:59.986: IPSEC:(SESSION ID = 16225) (ident_save_delete_notify_kmi) updated peer 85.205.30.159 current outbound sa to SPI 0
Sep 21 15:06:59.987: IPSEC:(SESSION ID = 16225) (delete_sa) deleting SA,
  (sa) sa_dest= 10.28.56.71, sa_proto= 50,
    sa_spi= 0x8F226333(2401395507),
    sa_trans= esp-aes 256 esp-sha256-hmac , sa_conn_id= 7583
    sa_lifetime(k/sec)= (4608000/3600),
  (identity) local= 10.28.56.71:0, remote= 85.205.30.159:0,
    local_proxy= 0.0.0.0/0.0.0.0/256/0,
    remote_proxy= 0.0.0.0/0.0.0.0/256/0
Sep 21 15:06:59.987: IPSEC:(SESSION ID = 16225) (delete_sa) SA found saving DEL kmi
Sep 21 15:06:59.988: IPSEC:(SESSION ID = 16225) (delete_sa) deleting SA,
  (sa) sa_dest= 85.205.30.159, sa_proto= 50,
    sa_spi= 0x91A7BD02(2443689218),
    sa_trans= esp-aes 256 esp-sha256-hmac , sa_conn_id= 7584
    sa_lifetime(k/sec)= (4608000/3600),
  (identity) local= 10.28.56.71:0, remote= 85.205.30.159:0,
    local_proxy= 0.0.0.0/0.0.0.0/256/0,
    remote_proxy= 0.0.0.0/0.0.0.0/256/0
Sep 21 15:06:59.988: IPSEC:(SESSION ID = 16225) (update_current_outbound_sa) updated peer 85.205.30.159 current outbound sa to SPI 0
Sep 21 15:06:59.988: IPSEC:(SESSION ID = 16225) (delete_sa) SA found saving DEL kmi
Sep 21 15:06:59.988: IPSEC(sibling_delete_notify_ident_action): Ident down, not sending DECR/DELETE
Sep 21 15:06:59.989: ipsec_out_sa_hash_idx: sa=0x7F78D36B9D00, hash_idx=401, port=4500/4500, addr=0x0A1C3847/0x55CD1E9F
Sep 21 15:06:59.991: IPSEC:(SESSION ID = 16225) (ident_update_final_flow_stats) Collect Final Stats and update MIB
IPSEC get IKMP peer index from peer 0x7F78D36B6B70 ikmp handle 0x0
[ident_update_final_flow_stats] : Flow delete complete event received for flow id 0x240015CF,peer index 0

Sep 21 15:06:59.994: IPSEC(key_engine): got a queue event with 1 KMI message(s)
Sep 21 15:06:59.994: IDB is NULL : in crypto_ipsec_key_engine_delete_sas (), 6145
Sep 21 15:06:59.994: IPSEC:(SESSION ID = 16225) (key_engine_delete_sas) rec'd delete notify from ISAKMP
Sep 21 15:06:59.994: IPSEC(key_engine): got a queue event with 1 KMI message(s)
Sep 21 15:06:59.994: IDB is NULL : in crypto_ipsec_key_engine_delete_sas (), 6145
Sep 21 15:06:59.994: IPSEC:(SESSION ID = 16225) (key_engine_delete_sas) rec'd delete notify from ISAKMP
Sep 21 15:07:00.012: IPSEC(key_engine): got a queue event with 1 KMI message(s)
Sep 21 15:07:00.012: IDB is NULL : in crypto_ipsec_key_engine_delete_sas (), 6145
Sep 21 15:07:00.012: IPSEC:(SESSION ID = 16225) (key_engine_delete_sas) rec'd delete notify from ISAKMP
Sep 21 15:07:00.012: IPSEC(key_engine): got a queue event with 1 KMI message(s)
Sep 21 15:07:00.012: IDB is NULL : in crypto_ipsec_key_engine_delete_sas (), 6145
Sep 21 15:07:00.012: IPSEC:(SESSION ID = 16225) (key_engine_delete_sas) rec'd delete notify from ISAKMP
Sep 21 15:07:01.984: %LINK-5-CHANGED: Interface Tunnel50, changed state to administratively down
Sep 21 15:07:02.173: insert of map into mapdb AVL failed, map + ace pair already exists on the mapdb
Sep 21 15:07:02.174: IPSEC(sa_request): ,
  (key eng. msg.) OUTBOUND local= 10.28.56.71:500, remote= 85.205.30.159:500,
    local_proxy= 0.0.0.0/0.0.0.0/256/0,
    remote_proxy= 0.0.0.0/0.0.0.0/256/0,
    protocol= ESP, transform= esp-aes 256 esp-sha256-hmac  (Tunnel), esn= FALSE,
    lifedur= 3600s and 4608000kb,
    spi= 0x0(0), conn_id= 0, keysize= 256, flags= 0x0
Sep 21 15:07:02.227: IPSEC(key_engine): got a queue event with 1 KMI message(s)
Sep 21 15:07:02.227: IPSEC(validate_proposal_request): proposal part #1,
  (key eng. msg.) INBOUND local= 10.28.56.71:0, remote= 85.205.30.159:0,
    local_proxy= 0.0.0.0/0.0.0.0/256/0,
    remote_proxy= 0.0.0.0/0.0.0.0/256/0,
    protocol= ESP, transform= esp-aes 256 esp-sha256-hmac  (Tunnel), esn= FALSE,
    lifedur= 0s and 0kb,
    spi= 0x0(0), conn_id= 0, keysize= 256, flags= 0x0
Sep 21 15:07:02.228: Crypto mapdb : proxy_match
        src addr     : 0.0.0.0
        dst addr     : 0.0.0.0
        protocol     : 0
        src port     : 0
        dst port     : 0
Sep 21 15:07:02.228: (ipsec_process_proposal)Map Accepted: Tunnel50-head-0, 65537
Sep 21 15:07:02.228: IPSEC(key_engine): got a queue event with 1 KMI message(s)
Sep 21 15:07:02.228: Crypto mapdb : proxy_match
        src addr     : 0.0.0.0
        dst addr     : 0.0.0.0
        protocol     : 256
        src port     : 0
        dst port     : 0
Sep 21 15:07:02.228: IPSEC:(SESSION ID = 16382) (crypto_ipsec_create_ipsec_sas) Map found Tunnel50-head-0, 65537
Sep 21 15:07:02.228: IPSEC:(SESSION ID = 16382) (crypto_ipsec_sa_find_ident_head) reconnecting with the same proxies and peer 85.205.30.159
Sep 21 15:07:02.229: IPSEC:(SESSION ID = 16382) (get_old_outbound_sa_for_peer) No outbound SA found for peer 7F78D36B6B70
Sep 21 15:07:02.229: IPSEC:(SESSION ID = 16382) (create_sa) sa created,
  (sa) sa_dest= 10.28.56.71, sa_proto= 50, code here

Seems to be a phase 1 issue, correct?
After taking the debugs, I talked to the engineer who is the direct responsible of the router and he responded:
I have to flap the interface or admin shut it and no shut for the interface to reset and come up along with the BGP. We use those IPs for BGP because the CSR is in an AWS cloud environment and it works for the other IPS we are connected to. If you look at the time from when the interface comes up to the time it shuts it is an hour exactly.
From the previous logs, I am not sure if this might be related to a local interface problem or the VPN. What would you suggest me to investigate?

Thanks again!

lagapidis · September 28, 2023, 5:29am

Hello Johan

Thanks for the additional information. Troubleshooting such issues is always difficult to do without having hands-on access to the equipment. However, I can comment on some of the output of the logs that may guide you in the right direction.

From the logs you’ve provided, we can gather the following information:

IPSec Session: This is related to an IPSec session (SESSION ID = 16225) between local address 10.28.56.71 and a remote peer at 85.205.XX.XX. (sanitized for security)
SA Deletion: There’s an event where the Security Associations (SAs) for this IPSec tunnel are being deleted, which is seen from the (delete_sa) logs. This typically happens during rekeying or when the tunnel goes down.
IDB is NULL: This error (IDB is NULL : in crypto_ipsec_key_engine_delete_sas ()) appears multiple times. It indicates an issue with the Interface Descriptor Block (IDB), which may lead to the deletion of the IPSec SAs. This may suggest a misconfiguration or an issue with the device software.
Tunnel State: There’s a state change for Interface Tunnel50 that goes to administratively down and then later its line protocol changes to up. This suggests that someone may have administratively taken down the tunnel and then brought it back up or it could be due to a reconnection attempt.
IPSec SA Creation: Later in the logs, it appears that a new IPSec session (SESSION ID = 16382) is being established with new SAs being created.

From the above points, it seems there’s an issue with the IPSec tunnel between 10.28.56.71 and 85.205.XX.XX. The tunnel seems to have been restarted, and there appears to be an issue with the IDB which might be contributing to the problem.

The issue with the IDB seems to be the most significant one. The IDB is a data structure used internally on Cisco IOS routers to store configuration and operational details about each network interface on a router, including tunnel interfaces. The IDB contains specific details about the tunnel and its configuration. The fact that it seems to be empty (NULL) may be a helpful clue for your troubleshooting.

I hope this has been helpful!

Laz

lagapidis · September 29, 2023, 6:18am

Hello Johan

Based on the logs you’ve shared, it seems like the issue lies within the IPsec VPN tunnel itself. The logs indicate that the SAs are being deleted and recreated. This could be due to a mismatch in the phase 1 or phase 2 parameters between the two VPN endpoints.

However, the fact that the tunnel works for exactly an hour before going down, and then comes back up after a manual reset, suggests that there might be a problem with the lifetimes configured for the SAs. It’s possible that one side is not recognizing the rekeying process and thus dropping the tunnel after the lifetime expires.

I would suggest checking the following:

Ensure that the phase 1 and phase 2 parameters match on both VPN endpoints. This includes the encryption and hashing algorithms, Diffie-Hellman group, and authentication method.
Verify the lifetimes for both phase 1 and phase 2. They should be the same on both sides. If they are not, the VPN tunnel could drop when the shorter lifetime expires.
If it is possible, look at the logs on the other VPN endpoint to see if there are any messages that could help identify the problem.

Let us know how you get along with the troubleshooting process…

I hope this has been helpful!

Laz