I was observing unexpected behavior in my vSAN ESA cluster. I have a 6-node vSAN ESA cluster and a VM with a Storage Policy configured for RAID-5 (Erasure Coding). Based on the cluster size, I would expect a 4+1 stripe configuration. However, the system is using 2+1 striping, which typically applies to clusters with only 3 to 5 nodes.
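RAID-5 (2+1) striping consumes 150% of the VM data size in raw capacity (3 components stored for every 2 components of data).
RAID-5 (4+1) striping consumes 125% of the VM data size in raw capacity (5 components stored for every 4 components of data).
A difference of 25 percentage points is worth investigating.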
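The overhead of any data+parity layout follows from the ratio (data + parity) / data. Here is a minimal sketch of that arithmetic; the function name `ec_overhead` is my own and is not part of any vSAN tool or API.

```python
def ec_overhead(data: int, parity: int) -> float:
    """Raw capacity consumed per unit of VM data for a data+parity stripe.

    Hypothetical helper for illustration only; it is plain arithmetic
    and does not query vSAN.
    """
    if data <= 0 or parity < 0:
        raise ValueError("data must be positive and parity non-negative")
    return (data + parity) / data


# Compare the two RAID-5 layouts that vSAN ESA can choose between.
for data, parity in [(2, 1), (4, 1)]:
    ratio = ec_overhead(data, parity)
    print(f"RAID-5 {data}+{parity}: {ratio:.0%} of the VM data size")
```

For example, a 1 TB VM would consume about 1.5 TB of raw capacity with 2+1 and about 1.25 TB with 4+1.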
Screenshots from my environment are attached below ...
[Screenshot: 6-Node vSAN ESA Cluster]
[Screenshot: Storage Policy with RAID-5 Erasure Coding]
[Screenshot: VM with RAID-5 Storage Policy on 6-node vSAN Cluster. Why 2+1 and not 4+1?]
Is there something I’m misunderstanding or doing incorrectly?
No, I was not doing anything incorrectly. Here is what happened.
One host had been in maintenance mode for a long time, so vSAN proactively rebuilt the RAID-5 objects from 4+1 to 2+1, because that is how RAID-5 works on a 5-node vSAN ESA cluster. This is expected behavior, and I was fine with it.
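The rule described so far (2+1 for 3 to 5 available hosts, 4+1 for 6 or more) can be written down as a small lookup. This is only a sketch of the rule as I understand it; `expected_raid5_layout` is a hypothetical helper, not a vSAN API, and it deliberately ignores the delay before vSAN actually changes the layout.

```python
def expected_raid5_layout(available_hosts: int) -> str:
    """Return the RAID-5 layout expected for a given number of available hosts.

    Hypothetical model of the vSAN ESA rule described in this post:
    3 to 5 hosts -> 2+1, 6 or more hosts -> 4+1.
    """
    if available_hosts < 3:
        raise ValueError("RAID-5 needs at least 3 available hosts")
    return "4+1" if available_hosts >= 6 else "2+1"


# A 6-node cluster with one host in long-term maintenance behaves like 5 nodes.
total_hosts = 6
hosts_in_maintenance = 1
print(expected_raid5_layout(total_hosts - hosts_in_maintenance))
print(expected_raid5_layout(total_hosts))
```

The point of the model is that a 6-node cluster sits exactly on the boundary, so a single host in maintenance is enough to flip the expected layout.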
When the sixth host was added back to the cluster, it took 24 hours before the objects were rebuilt back to 4+1. I simply did not wait long enough :-) I expected it to take a while and checked the status after 22 hours, but that was too early.
After 24 hours, the vSAN object striping was 4+1, as shown in the screenshot below.
[Screenshot: The problem solved itself after 24 hours]
Conclusion
The issue resolved itself after approximately 24 hours (the delay before vSAN started restriping and rebalancing) plus 3 hours (the actual data resync and rebalance in my environment). It is important to be aware of this behavior, because it can affect capacity planning and design decisions for a 6-node vSAN ESA cluster.
I’m planning to scale out to a 7-node vSAN cluster soon, which will enable the use of RAID-6 with a consistent 4+2 erasure coding scheme. However, documenting this kind of adaptive RAID-5 behavior in 6-node configurations could be valuable for other VMware users relying on vSAN ESA and similar storage policies.
What still confuses me is that the same degraded RAID-5 policy continued to be applied to newly created vSAN objects during the 24 hours after the host exited long-term maintenance mode.
 