This is a follow-up to my recent blog post about benchmark results of VMware vSAN ESA.
It is obvious and logical that every computer I/O requires CPU cycles. This is not (or, better said, should not be) a surprise for any infrastructure professional. However, computers evolve year after year, so rules of thumb should be validated, and sometimes redefined, from time to time.
Every bit transmitted/received over the TCP/IP network requires CPU cycles. The same applies to storage I/O. vSAN is a hyper-converged software-defined enterprise storage system, therefore, it requires TCP/IP networking for data striping across nodes (vSAN is RAIN - Redundant Array of Independent Nodes) and storage I/Os to local NVMe disks.
What is the CPU Clock Cycle?
- 1 GHz = 1 billion (1,000,000,000) cycles per second.
- 3.6 GHz = 3.6 billion (3,600,000,000) cycles per second.
Let's start with networking.
In the past, there was a rule of thumb that sending or receiving 1 bit/s requires 1 Hz (1 CPU cycle on a 1 GHz CPU). This general rule of thumb is mentioned in the book “VMware vSphere 6.5 Host Resources Deep Dive” by Frank Denneman and Niels Hagoort. By the way, it is a great book, and anybody designing or operating VMware vSphere should have it on the bookshelf.
In my testing, I found that
- 1 b/s of receive traffic requires ~0.373 Hz
- 1 b/s of transmit traffic requires ~0.163 Hz
Let’s average it and redefine the old rule of thumb a little bit.
My current rule of thumb for TCP/IP networking is ...
"1 bit/s sent or received over a TCP/IP datacenter network requires ~0.25 Hz of a general-purpose x86-64 CPU (Intel Emerald Rapids)"
There is a difference between receive traffic and transmit traffic. So, if we would like to be more precise, the rule of thumb should be ...
"1 bit/s received over a TCP/IP datacenter network requires ~0.37 Hz, and 1 bit/s sent requires ~0.16 Hz of a general-purpose x86-64 CPU"
Unfortunately, the latter rule of thumb (~0.37 Hz and ~0.16 Hz) is not as easy to remember as the first simplified, averaged rule of thumb (~0.25 Hz per bit per second).
The mention of "general-purpose x86-64 CPU" is also pretty important, because the numbers vary across CPU architectures, and specialized ASICs are a totally different topic.
It is worth mentioning that my results were tested and observed on an Intel Xeon Gold 6544Y 16C @ 3.6 GHz (aka Emerald Rapids).
So let's use the simplified rule (0.25 Hz per bit per second). 10 Gbps of received or transmitted traffic then requires 2,500,000,000 Hz = 2.5 GHz, which is ~0.7 of one of my CPU cores @ 3.6 GHz, or a whole CPU core running at 2.5 GHz.
In my testing, vSAN ESA under a load of 720k IOPS (32 KB) or 20 GB/s aggregated throughput across 6 ESXi hosts typically used between ~27 and ~34 Gbps of network traffic per ESXi host (vSAN node). Let's assume ~30 Gbps on average in each direction; applying the 0.25 Hz rule to both received and transmitted traffic gives ~15 GHz, i.e. 3 x 1.5 CPU cores = ~4.5 CPU cores per ESXi host just for TCP/IP networking!
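The networking math above is easy to sketch in a few lines of Python. This is just a back-of-the-envelope helper built on the two constants from this post (the averaged ~0.25 Hz per bit/s rule and the 3.6 GHz base clock of the Xeon Gold 6544Y); it is not a vSAN tool of any kind.

```python
# Back-of-the-envelope calculator for the networking rule of thumb above.
# Constants come from the measurements in this post; adjust for your CPU.

HZ_PER_BPS = 0.25   # averaged rule of thumb (RX ~0.37 Hz, TX ~0.16 Hz)
CORE_GHZ = 3.6      # Intel Xeon Gold 6544Y base clock

def net_cpu_cores(gbps: float, hz_per_bps: float = HZ_PER_BPS) -> float:
    """CPU cores consumed by `gbps` of TCP/IP traffic in one direction."""
    hz = gbps * 1e9 * hz_per_bps
    return hz / (CORE_GHZ * 1e9)

print(f"10 Gbps one way  -> {net_cpu_cores(10):.2f} cores")   # ~0.69 core
print(f"30 Gbps each way -> {2 * net_cpu_cores(30):.2f} cores")  # ~4.2 cores
```

Counting both directions of a ~30 Gbps flow lands at roughly 4.2 cores, which matches the ~4.5 cores per host mentioned above once rounded up.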
Nice. That’s networking.
Let's continue with hyper-converged storage traffic
Now we can do similar math for hyper-converged vSAN storage and network CPU consumption.
Note that I used RAID-5 with compression enabled. Results will be different for RAID-0 and RAID-1. RAID-6 should be similar to RAID-5, but with some slight differences (4+1 versus 4+2).
I would assume that compression also plays a role in CPU usage, even though modern CPUs support compression/decompression offloading (Intel QAT, AMD CCX/CPX).
During my testing, the following CPU consumptions were observed …
- ~2.5 Hz to read 1 bit/s of vSAN data (32 KB IO, random)
- ~7.13 Hz to write 1 bit/s of vSAN data (32 KB IO, random)
- ~1.86 Hz to read 1 bit/s of vSAN data (1 MB IO, sequential)
- ~2.78 Hz to write 1 bit/s of vSAN data (1 MB IO, sequential)
Let’s average it and define another rule of thumb ...
"3.5 Hz of general-purpose x86-64 CPU (Intel Emerald Rapids) is required to read or write 1 bit/s from vSAN ESA RAID-5 with compression enabled"
If we would like to be more precise, we should split the oversimplified rule of 3.5 Hz per bit per second into separate rules based on I/O size and read/write operation.
Anyway, if we stick with the simple rule of thumb that 3.5 Hz is required per bit per second, then 1 GB/s (8,000,000,000 b/s) would require 28 GHz (~7.8 CPU cores @ 3.6 GHz).
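The same kind of sketch works for the storage rule of thumb. This hypothetical helper just applies the averaged 3.5 Hz per bit/s figure from this post to an arbitrary throughput:

```python
# Sketch of the vSAN storage rule-of-thumb math: 3.5 Hz per bit per second,
# averaged across the four measured workloads in this post.

HZ_PER_BPS_VSAN = 3.5
CORE_GHZ = 3.6

def vsan_cpu_cores(gbytes_per_s: float) -> float:
    """CPU cores needed to read/write `gbytes_per_s` GB/s via vSAN ESA."""
    bits_per_s = gbytes_per_s * 8e9
    return bits_per_s * HZ_PER_BPS_VSAN / (CORE_GHZ * 1e9)

ghz = 1 * 8 * HZ_PER_BPS_VSAN                # 1 GB/s -> 28 GHz
print(f"1 GB/s -> {ghz:.0f} GHz, {vsan_cpu_cores(1):.1f} cores")  # 28 GHz, 7.8 cores
```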
In my 6-node vSAN ESA cluster, I was able to achieve the following sustainable throughput in VM guests with a 2 ms response time …
- ~22.5 GB/s (32 KB IO, 100% read, 100% random)
- ~ 8.9 GB/s (32 KB IO, 100% write, 100% random)
- ~22.5 GB/s (1 MB IO, 100% read, 100% sequential)
- ~15.2 GB/s (1 MB IO, 100% write, 100% sequential)
Therefore, if we average it, it is ~17.2 GB/s read or write.
17.2 GB/s of vSAN ESA reads or writes (563,600 IOPS at 32 KB, or 17,612 IOPS at 1 MB) translates to ~481 GHz (~134 CPU cores @ 3.6 GHz).
In the 6-node vSAN cluster, I have 192 cores.
- Each ESXi has 2x CPU Intel Xeon Gold 6544Y 16C @ 3.6 GHz (32 CPU Cores, 115.2 GHz capacity)
- The total cluster capacity is 192 CPU Cores and 691.2 GHz
That means ~70% (134 / 192) of the cluster's CPU capacity is consumed by vSAN storage and network operations under a storage workload of 17.2 GB/s (563,600 IOPS at 32 KB I/O size, or 17,612 IOPS at 1 MB I/O size).
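Putting the cluster-level numbers together, the ~70% figure follows directly from the rule of thumb and the cluster's core count. A minimal reproduction of that arithmetic, using only values stated above:

```python
# Cluster-level math: 17.2 GB/s average throughput at 3.5 Hz per bit/s
# against a 6-node cluster of 2x 16-core 3.6 GHz CPUs per host.

THROUGHPUT_GBS = 17.2
HZ_PER_BPS = 3.5
CORE_GHZ = 3.6
CLUSTER_CORES = 6 * 2 * 16          # 6 hosts x 2 sockets x 16 cores = 192

ghz_needed = THROUGHPUT_GBS * 8 * HZ_PER_BPS     # ~481 GHz
cores_needed = ghz_needed / CORE_GHZ             # ~134 cores
utilization = cores_needed / CLUSTER_CORES       # ~0.70

print(f"{cores_needed:.0f} cores = {utilization:.0%} of the cluster")
```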
Well, your 6-node vSAN storage system will probably not deliver ~600,000 IOPS continuously, but it is good to understand that every I/O requires CPU cycles.
What about RAM?
Thanks to ChatGPT, I have found these numbers.
- L1 Cache: Access in 3–5 cycles.
- L2 Cache: Access in 10–20 cycles.
- L3 Cache: Access in 30–50 cycles.
- RAM: Access in 100–200+ cycles, depending on the system
So, for our rule of thumb, let's stick with 250 CPU cycles per 4 KB memory page. A page is 32,768 bits (4 x 1024 x 8), so at a cost of 250 CPU cycles that is ~0.008 CPU cycles per bit; let's generalize it to 0.01 Hz.
"0.01 Hz of general-purpose x86-64 CPU (Intel Emerald Rapids) is required to read or write 1 bit/s from DDR5 RAM"
What about CPU offload technologies?
Yes, it is obvious that our rules of thumb are oversimplified and we are not talking about hardware offloading technologies like TOE, RSS, LSO, RDMA, Intel AES-NI, Intel QuickAssist Technology (QAT), and similar AMD technologies like AMD AES Acceleration.
My testing environment was Intel-based, and almost all of the above CPU offload technologies were enabled on my testing infrastructure. The only missing one is RDMA (RoCEv2), which is not supported by the Cisco UCS VIC 15230 for vSAN. I'm in touch with Cisco and VMware about it and hope it is just a matter of testing, but as long as it is unsupported, it cannot be used in an enterprise production environment.
RDMA (RoCEv2)
- Non-RDMA > 939790.5 IOPS @ cpu.usage 96.54%, cpu.utilization 80.88%
- RDMA > 1135752.6 IOPS @ cpu.usage 96.59%, cpu.utilization 81.53%
- Gains by RDMA over Non-RDMA: +21% IOPS, -0.1% cpu.usage, -0.8% cpu.utilization
- RDMA CPU usage savings 21%
- Non-RDMA > 633421.4 IOPS @ cpu.usage 95.45%, cpu.utilization 78.57%
- RDMA > 779107.8 IOPS @ cpu.usage 93.38%, cpu.utilization 77.46%
- Gains by RDMA over Non-RDMA: +23% IOPS, -2% cpu.usage, -1.5% cpu.utilization
- RDMA CPU usage savings 25%
- Non-RDMA > 468353.5 IOPS @ cpu.usage 95.43%, cpu.utilization 77.71%
- RDMA > 576964.7 IOPS @ cpu.usage 90.18%, cpu.utilization 73.57%
- Gains by RDMA over Non-RDMA: +23% IOPS, -6% cpu.usage, -6% cpu.utilization
- RDMA CPU usage savings 29%
- Non-RDMA > 24112.6 IOPS @ cpu.usage 64.04%, cpu.utilization 41.21%
- RDMA > 24474.6 IOPS @ cpu.usage 56.7%, cpu.utilization 35.2%
- Gains by RDMA over Non-RDMA: +1.5% IOPS, -11% cpu.usage, -14.6% cpu.utilization
- RDMA CPU usage savings 12.5%
"RDMA can save between 10% and 30% of CPU depending on I/O size"
- cpu.utilization - Provides statistics for physical CPUs.
- cpu.usage - Provides statistics for logical CPUs. This is based on CPU Hyperthreading.
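One way to sanity-check the "10% to 30%" claim is to normalize each run to CPU usage per delivered IOPS. The sketch below does exactly that with the (IOPS, cpu.usage) pairs copied from the four runs above; note that the resulting savings differ slightly from the post's per-run figures, which appear to combine the IOPS gain with the usage drop instead.

```python
# Normalize the RDMA results above to cpu.usage per delivered IOPS.
# Each entry is ((non_rdma_iops, non_rdma_usage), (rdma_iops, rdma_usage)),
# copied from the four test runs listed above.
runs = [
    ((939790.5, 96.54), (1135752.6, 96.59)),
    ((633421.4, 95.45), (779107.8, 93.38)),
    ((468353.5, 95.43), (576964.7, 90.18)),
    ((24112.6, 64.04), (24474.6, 56.70)),
]

savings = []
for (n_iops, n_cpu), (r_iops, r_cpu) in runs:
    cost_n = n_cpu / n_iops          # cpu.usage % per IOPS, non-RDMA
    cost_r = r_cpu / r_iops          # cpu.usage % per IOPS, RDMA
    savings.append(1 - cost_r / cost_n)

for s in savings:
    print(f"RDMA CPU cost per IOPS saving: {s:.0%}")   # 17%, 20%, 23%, 13%
```

Either way of counting lands inside the 10-30% band quoted below.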
Conclusion
It is always good to have rules of thumb, even if they are averaged and oversimplified.
If we work with the following rules of thumb ...
- 1 b/s for RAM transfer requires ~0.01 Hz
- 1 b/s for network transfer requires ~0.25 Hz
- 1 b/s for general local NVMe storage requires ~0.5 Hz
- 1 b/s for vSAN storage requires ~3.5 Hz
... we can see that
- STORAGE vs NETWORK
- general local NVMe storage transfer requires 2x more CPU cycles than network transfer
- vSAN storage transfer requires 14x more CPU cycles than network transfer
- STORAGE vs RAM
- general local NVMe storage transfer requires 50x more CPU cycles than RAM transfer
- vSAN storage transfer requires 350x more CPU cycles than RAM transfer
- NETWORK vs RAM
- network transfer requires 25x more CPU cycles than RAM transfer
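The comparison ratios above follow directly from the four per-bit costs, as this tiny sketch shows:

```python
# The STORAGE/NETWORK/RAM ratios above, derived from the rules of thumb.
HZ_PER_BPS = {"RAM": 0.01, "network": 0.25, "local NVMe": 0.5, "vSAN": 3.5}

for a in ("local NVMe", "vSAN"):
    for b in ("network", "RAM"):
        print(f"{a} vs {b}: {HZ_PER_BPS[a] / HZ_PER_BPS[b]:.0f}x")
print(f"network vs RAM: {HZ_PER_BPS['network'] / HZ_PER_BPS['RAM']:.0f}x")
# -> 2x, 50x, 14x, 350x, 25x
```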
General CPU compute processing involves operations related to RAM, network, and storage. Nothing else, correct?
That’s the reason why I believe that around 60% of CPU cycles could be leveraged for hyper-converged storage I/O and the rest (40%) could be good enough for data manipulation (calculations, transformations, etc.) in RAM and for front-end network communication.
IT infrastructure is here for data processing. Storage and network operations are a significant part of data processing, therefore keeping vSAN storage I/O at 60% CPU usage and using the rest (40%) for real data manipulation and ingress/egress traffic from/to the vSphere/vSAN cluster is probably a good infrastructure trade-off, and it helps us increase the utilization of common x86-64 CPUs and improve our datacenter infrastructure ROI.
All of the above are just assumptions based on synthetic testing. The workload reality could be different; however, there is not a big risk in planning 60% of CPU for storage transfers, because if the initial assumption about CPU usage distribution among RAM, network, and storage isn't met, you can always scale out. That's the advantage and beauty of scale-out systems like VMware vSAN. And if you cannot scale out for various reasons, you can leverage vSAN IOPS limits to decrease storage stress and lower CPU usage. Of course, this means that your storage response time will increase and you intentionally decrease the storage performance quality. That's where various VM Storage Policies come into play, and you can do per-vDisk storage tiering.
It is obvious that CPU offload technologies are helping us to use our systems to the maximum. It is beneficial to leverage technologies like TOE, RSS, LSO, RDMA, etc.
Double-check that your NIC driver/firmware supports TOE, LSO, RSS, and RDMA (RoCE).
It seems that RDMA/RoCEv2 can significantly reduce CPU usage (up to 30%), so if you can, enable it.
It is even more important when software-defined networking (NSX) comes into play. There are other hardware offload technologies (SR-IOV, NetDump, GENEVE offload) that help decrease CPU cycles spent on network operations.