I have a customer with an issue in vSAN Health Service - Network Health - vSAN: MTU check, which raised an alert from time to time. Normally, the check is green, as depicted in the screenshot below.
The same can be checked from CLI via esxcli.
However, my customer was experiencing intermittent yellow and red alerts, and the only remedy was to retest the Skyline test suite. After retesting, the check sometimes switched back to green and sometimes did not.
During problem isolation, we identified that the problem occurs only on vSAN clusters with witness nodes (2-node clusters, stretched clusters). Another indication was that the problem appeared only between vSAN data nodes and the vSAN witness. Network communication between data nodes was always OK.
How does this particular vSAN health check work?
It is important to understand that “vSAN: MTU check (ping with large packet size)”
- does not use the “don’t fragment” bit to test the end-to-end MTU configuration
- does not use the manually reconfigured (decreased) MTU of the vSAN witness vmkernel interfaces, which is leveraged in my customer's environment. The check uses a static large packet size to see how the network handles it.
- sends the large packet between ESXi hosts (vSAN nodes) and evaluates packet loss based on the following thresholds:
- 0% <-> 32% packet loss => green
- 33% <-> 66% packet loss => yellow
- 67% <-> 100% packet loss => red
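The thresholds above can be expressed as a small classification function. This is only a minimal sketch to illustrate the mapping, not the actual health service code. The function name is hypothetical, and treating fractional values (for example 32.5%) as belonging to the lower band is an assumption, because the list only gives whole percentages.

```python
def classify_packet_loss(loss_percent: float) -> str:
    """Map a packet loss percentage to a health colour.

    Bands follow the list above: 0-32 green, 33-66 yellow, 67-100 red.
    Assumption: fractional values below 33 count as green and
    fractional values below 67 count as yellow.
    """
    if not 0 <= loss_percent <= 100:
        raise ValueError("packet loss must be between 0 and 100")
    if loss_percent < 33:
        return "green"
    if loss_percent < 67:
        return "yellow"
    return "red"


if __name__ == "__main__":
    # Print the colour for each band boundary from the list above.
    for loss in (0, 32, 33, 66, 67, 100):
        print(f"{loss:>3}% packet loss => {classify_packet_loss(loss)}")
```

With these bands, a moderate and fluctuating amount of loss on a single path can be enough to move the result from green to yellow and back between test runs.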
So what's the problem?
Feature Request
- Test #1: “vSAN: MTU check (ping with large packet size) between data nodes”
- Test #2: “vSAN: MTU check (ping with large packet size) between data nodes and witness”
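To illustrate what splitting the check into two tests could look like, here is a rough sketch that evaluates packet loss separately for data-node pairs and for data-node-to-witness pairs. Everything in it is hypothetical: the function names, the shape of the input tuples, and the host names are assumptions made for the example, and it reuses the thresholds listed earlier. It is not how the health service is implemented.

```python
def classify_packet_loss(loss_percent):
    """Thresholds from the list earlier: 0-32 green, 33-66 yellow, 67-100 red."""
    if loss_percent < 33:
        return "green"
    if loss_percent < 67:
        return "yellow"
    return "red"


def split_mtu_check(results, witness_hosts):
    """Evaluate ping results per path type instead of as one combined test.

    results: iterable of (source, destination, packets_sent, packets_lost)
             tuples (hypothetical input format).
    witness_hosts: set of host names that are vSAN witnesses.
    """
    # [packets_sent, packets_lost] accumulated per path type
    totals = {"data-data": [0, 0], "data-witness": [0, 0]}
    for src, dst, sent, lost in results:
        is_witness_path = src in witness_hosts or dst in witness_hosts
        key = "data-witness" if is_witness_path else "data-data"
        totals[key][0] += sent
        totals[key][1] += lost

    report = {}
    for key, (sent, lost) in totals.items():
        if sent == 0:
            report[key] = "skipped"  # no pings recorded for this path type
        else:
            report[key] = classify_packet_loss(100 * lost / sent)
    return report


if __name__ == "__main__":
    sample = [
        ("esx1", "esx2", 3, 0),     # data node to data node, no loss
        ("esx1", "witness", 3, 2),  # data node to witness, 2 of 3 lost
        ("esx2", "witness", 3, 1),  # data node to witness, 1 of 3 lost
    ]
    print(split_mtu_check(sample, {"witness"}))
```

Reporting the two path types separately would make it immediately visible that the data network is healthy and only the witness path is affected.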