Tools to diagnose network slowness (latency)

daoudes80 · January 17, 2022, 11:07am

Hi everyone,

I just want to ask whether anyone knows which tools I can use to diagnose slowness in my network. Thanks in advance for your answers.

lagapidis · January 19, 2022, 6:53am

Hello Daoud

Troubleshooting a slow network can be tricky. There is no single process to follow, since many different things can cause it. A slow network can be due to:

network congestion resulting from incorrect QoS processes, routing, or simply due to a time of high network usage
data corruption - a high number of corrupted packets/frames may be arriving at ports that simply drop them, forcing applications to resend a lot of data
collisions - due to an incorrectly configured network (for example, a duplex mismatch), once again forcing applications to resend data
faulty physical infrastructure such as cables
too many STP topology change notifications (TCNs), often triggered by flapping interfaces
non-optimal routing configuration
hardware failure

There are several strategies you can use to diagnose these including:

Ping and traceroute will allow you to see the measured delays as well as the paths that are taken. Delays from end to end as well as to each individual hop will give you an idea of where the problem may be.
Monitoring suites such as SolarWinds and LibreNMS can use protocols including SNMP to monitor specific aspects of a network and its devices, identifying errors and events that can help you troubleshoot.
Other tools such as NBAR and NetFlow are also useful to gain insight into what is happening on your network.
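
As a rough illustration of the per-hop reasoning above, the sketch below takes round-trip times you have already collected for each hop (for example, read off traceroute output) and reports the hop where latency increases the most. The function name, the hop addresses, and the sample values are all hypothetical, and this is only one simple heuristic, not a complete diagnosis.

```python
from typing import List, Optional, Tuple


def largest_latency_jump(hops: List[Tuple[str, float]]) -> Optional[Tuple[str, float]]:
    """Return (hop, increase_ms) for the hop with the largest RTT increase
    over the previous hop, or None if fewer than two hops are given."""
    if len(hops) < 2:
        return None
    best_hop, best_jump = hops[1][0], hops[1][1] - hops[0][1]
    for (_, prev_rtt), (name, rtt) in zip(hops[1:], hops[2:]):
        jump = rtt - prev_rtt
        if jump > best_jump:
            best_hop, best_jump = name, jump
    return best_hop, best_jump


# Hypothetical averaged RTTs (ms) per hop, e.g. read from traceroute output.
sample_hops = [
    ("10.0.0.1", 0.6),
    ("10.0.1.1", 1.1),
    ("10.0.2.1", 48.3),
    ("10.0.3.1", 49.0),
]
print(largest_latency_jump(sample_hops))
```

With the sample data, the function points at 10.0.2.1. Keep in mind that a single hop showing a high delay is only meaningful if the later hops stay high as well, since routers may answer traceroute probes at low priority.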

Having a network monitoring system is critical for such situations. It’s not easy to troubleshoot such problems using the CLI, beyond very basic diagnostic tasks. Using a monitoring system increases visibility, and it will warn you whenever any thresholds you set, such as the upper limit of allowed latency, are surpassed.
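
To make the threshold idea concrete, here is a minimal sketch of the kind of check a monitoring system performs: it averages the most recent latency samples and flags when that average exceeds a limit. The function name, the window size, the 20 ms limit, and the sample values are assumptions chosen for illustration; real monitoring suites have their own configuration for this.

```python
from statistics import mean
from typing import List


def latency_alert(samples_ms: List[float], threshold_ms: float, window: int = 5) -> bool:
    """Return True when the mean of the last `window` samples exceeds the threshold."""
    if not samples_ms:
        return False
    recent = samples_ms[-window:]
    return mean(recent) > threshold_ms


# Hypothetical RTT samples in milliseconds and a hypothetical 20 ms limit.
samples = [4.0, 5.0, 4.5, 30.0, 35.0, 40.0, 38.0, 42.0]
print(latency_alert(samples, threshold_ms=20.0))
```

Averaging over a short window rather than alerting on a single sample avoids raising a warning for one isolated slow reply.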

If you’d like us to go over something more specific, please let us know.

I hope this has been helpful!

Laz

daoudes80 · January 20, 2022, 3:06pm

Thank you for your feedback, lagapidis.
Let me explain the problem I am having.
I have 8 ESX servers connected to a Nexus 5000 switch via FEX modules. I have a Nagios supervisor that monitors the ESX servers, and it returns “ERROR CRC” errors for all 8 ESX servers. But when I check the interfaces of the Nexus switch to which the 8 ESX servers are connected, I don’t see any anomaly: no CRC errors on the interfaces. So I don’t know where these errors come from. Sending data to the ESX servers is also slow. Do you have any idea how to solve the CRC errors, or how to identify their cause? Thanks in advance for your feedback.

lagapidis · January 22, 2022, 5:58am

Hello Daoud

CRC errors can definitely make a network slow or sluggish, so the errors Nagios is showing are a plausible explanation for the slowness. However, what is Nagios actually reporting? Is it monitoring the servers or the switch, or both? And where does the ERROR CRC take place, on the switch or on the ESX server?

If all eight servers are returning the same errors, you should take a look and see what commonalities there are between them. Could it be a common setting on their NICs? Dig a little deeper by looking at the stats on the ESX NICs as well.

In addition, you could take a look at the way the FEX is handling traffic. There are some cases where CRC errors are logged but are not counted on the expected interface. This has to do with the Nexus 5K using cut-through switching, that is, it starts forwarding a frame before the CRC has been checked, so the actual CRC errors may be recorded on another interface.

Take a look at this Cisco community thread which may shed some more light on your issue.

Keep us posted on how your troubleshooting is going and how you’re getting along.

I hope this has been helpful!

Laz

daoudes80 · February 1, 2022, 2:06pm

Once again, thank you for your feedback. I was a bit busy with other challenges, otherwise I would have replied sooner. Here are the answers to your questions:
1- Nagios displays CRC errors
2- Nagios only monitors the ESX server interfaces
3- At the Nexus switch level, when I check, I don’t see any errors; everything looks fine
4- At the ESX level, on the other hand, the check shows errors. Here is the result for one of the interfaces:

[root:~] esxcli network nic stats get -n vmnic1
NIC statistics for vmnic1
   Packets received: 810661
   Packets sent: 50367
   Bytes received: 37641693004
   Bytes sent: 165401052
   Receive packets dropped: 0
   Transmit packets dropped: 5
   Multicast packets received: 189214093
   Broadcast packets received: 214728114
   Multicast packets sent: 711173
   Broadcast packets sent: 537
   Total receive errors: 242025
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 242025
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0

So I think that the problem may be on the side of ESX interfaces and not FEX modules.

Do you think that, for example, updating the operating system of the Nexus could improve anything?

Thanks in advance for your feedback

Sincerely

lagapidis · February 4, 2022, 10:33am

Hello Daoud

Thanks for the additional information. It looks like the issue is on the hardware NIC of the ESX server, as you suggested. Based on the output you shared, 810661 packets were received and 242025 had CRC errors, which is roughly 30%. This is significant. Since you see similar behavior on all 8 ESX server NICs, it is unlikely to be a cabling issue. And since you don’t see any errors on the Nexus devices, it’s not a problem with corrupted packets exiting the Nexus devices.
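
If you want to repeat this calculation across all eight hosts, a small script can do it. The sketch below assumes you have saved the text output of `esxcli network nic stats get` and that it uses the same "Name: value" layout as the output posted above; the function names are hypothetical, and only the two counters needed for the ratio are included in the sample.

```python
import re
from typing import Dict


def parse_nic_stats(text: str) -> Dict[str, int]:
    """Parse 'Name: value' lines from `esxcli network nic stats get` output."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?):\s*(\d+)\s*$", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def crc_error_ratio(stats: Dict[str, int]) -> float:
    """Receive CRC errors as a fraction of packets received (0.0 if none received)."""
    received = stats.get("Packets received", 0)
    if received == 0:
        return 0.0
    return stats.get("Receive CRC errors", 0) / received


# Excerpt of the vmnic1 output posted earlier in this thread.
output = """\
NIC statistics for vmnic1
   Packets received: 810661
   Receive CRC errors: 242025
"""
ratio = crc_error_ratio(parse_nic_stats(output))
print(f"{ratio:.1%}")
```

For the vmnic1 counters this prints 29.9%. Running it against each vmnic makes it easy to see whether every host is affected to the same degree.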

Based on a post at this site, I suggest you do the following:

Check the MTU value on the Nexus side as well as on the ESX side. According to this post concerning ESX:

rx_crc_errors are caused either by faults in layer 1, or issues with jumbo frames on the network. If that packet has an MTU over what is configured on the interface, it will cut off the packet at the designated MTU, causing the server to receive a malformed packet, which will throw a CRC error.

Are your connections copper or fiber? If they’re fiber, there may be a fiber/SFP type mismatch on all of the links, making all the links behave in the same way.
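
For the MTU check suggested above, once you have noted the MTU configured at each point along a path, a quick comparison shows where a smaller value could be cutting off larger frames. This is a minimal sketch: the interface names and MTU values below are hypothetical and would need to be collected by hand from the ESX host and the Nexus.

```python
from typing import Dict, List


def mtu_mismatches(mtus: Dict[str, int]) -> List[str]:
    """Return the names of interfaces whose MTU is below the largest MTU on the path."""
    if not mtus:
        return []
    largest = max(mtus.values())
    return sorted(name for name, mtu in mtus.items() if mtu < largest)


# Hypothetical values collected by hand from each device on one path.
path_mtus = {
    "esx-vmnic1": 1500,
    "esx-vswitch0": 9000,
    "nexus-port": 9000,
}
print(mtu_mismatches(path_mtus))
```

An empty list means every point on the path uses the same MTU, in which case the layer 1 causes mentioned in the quote become the more likely suspects.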

For more info about MTUs in networking, take a look at this lesson:

I hope this has been helpful!

Laz