Monday, December 09, 2024

Every I/O requires CPU Cycles - vSAN ESA is no different

This is a follow-up to my recent blog post about "benchmark results of VMware vSAN ESA".

It is obvious and logical that every computer I/O requires CPU cycles. This is not (or, better said, should not be) a surprise for any infrastructure professional. However, computers evolve year after year, so rules of thumb should be validated and sometimes redefined from time to time.

Every bit transmitted/received over a TCP/IP network requires CPU cycles. The same applies to storage I/O. vSAN is a hyper-converged software-defined enterprise storage system, therefore it requires both TCP/IP networking for data striping across nodes (vSAN is RAIN - a Redundant Array of Independent Nodes) and storage I/O to local NVMe disks.

What is the CPU Clock Cycle? 

A clock cycle is the smallest unit of time in which a CPU performs a task. It acts like a metronome, synchronizing operations such as fetching, decoding, executing instructions, and transferring data.

I have two Intel Xeon Gold 6544Y 16C @ 3.6 GHz CPUs.
That is 32 CPU Cores in total:
16 CPU Cores in one socket (NUMA node) and 16 CPU Cores in the other socket (NUMA node).

That means that every CPU Core runs at a 3.6 GHz frequency:
  • 1 GHz = 1 billion (1,000,000,000) cycles per second.
  • 3.6 GHz = 3.6 billion (3,600,000,000) cycles per second.
My single CPU Core can execute 3.6 billion (3,600,000,000) clock cycles every second.
For our simplified computer math, we can say that 1 Hz = 1 CPU cycle per second.
We have to simplify, otherwise we will not be able to define rules of thumb.
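
For clarity, here is that simplification expressed as a tiny Python sketch (my own illustration, not part of any benchmark tooling):

  # Simplified clock-cycle math for one test host: 2x Intel Xeon Gold 6544Y 16C @ 3.6 GHz.
  # Assumption used throughout this post: 1 Hz ~ 1 CPU cycle per second.

  CORE_FREQUENCY_HZ = 3.6e9    # 3.6 GHz = 3.6 billion cycles per second per core
  CORES_PER_SOCKET = 16
  SOCKETS = 2

  total_cores = CORES_PER_SOCKET * SOCKETS
  total_hz = total_cores * CORE_FREQUENCY_HZ

  print(f"Cycles per second per core: {CORE_FREQUENCY_HZ:,.0f}")   # 3,600,000,000
  print(f"Total CPU Cores: {total_cores}")                         # 32
  print(f"Total cycles per second per host: {total_hz:,.0f}")      # 115,200,000,000 (115.2 GHz)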

Let's start with networking. 

In the past, there was a rule of thumb that sending or receiving 1 bit/s requires 1 Hz (1 CPU cycle per second). This general rule of thumb is mentioned in the book “VMware vSphere 6.5 Host Resources Deep Dive” by Frank Denneman and Niels Hagoort. By the way, it is a great book and anybody designing or operating VMware vSphere should have it on the bookshelf.

In my testing, I found that:

  • 1 b/s of receive traffic requires 0.373 Hz
  • 1 b/s of transmit traffic requires 0.163 Hz

Let’s average it and redefine the old rule of thumb a little bit.

My current rule of thumb for TCP/IP networking is ... 

"1 bit send or receive over TCP/IP datacenter network requires ~0.25 Hz of general purpose x86-64 CPU (Intel Emerald Rapids)"

There is a difference between receive traffic and transmit traffic. So, if we would like to be more precise, the rule of thumb should be ...

"1 bit send over TCP/IP datacenter network requires ~0.37 Hz and 1 bit send over TCP/IP datacenter network requires ~0.16 Hz of general purpose x86-64 CPU"

Unfortunately, the latter rule of thumb (~0.37 Hz and ~0.16 Hz) is not as easy to remember as the first, simplified and averaged rule of thumb (~0.25 Hz per bit per second).

Also, the mention of a "general purpose x86-64 CPU" is pretty important because the cost varies across CPU architectures, and specialized ASICs are a totally different topic.

It is worth mentioning that my results were measured on an Intel Xeon Gold 6544Y 16C @ 3.6 GHz (aka Emerald Rapids).

So let's use the simplified rule (0.25 Hz per bit per second) and we can say that 10 Gbps of received or transmitted traffic requires 2,500,000,000 Hz = 2.5 GHz, which is ~0.7 of my CPU Core @ 3.6 GHz, or an entire CPU Core running at 2.5 GHz.

In my testing, vSAN ESA under a load of 720k IOPS (32 KB) or 20 GB/s aggregated throughput across 6 ESXi hosts typically used between ~27 and ~34 Gbps of network traffic per ESXi host (vSAN node). Let’s average it and assume ~30 Gbps, both received and transmitted, per host. Using the measured per-bit costs (0.37 Hz receive + 0.16 Hz transmit ≈ 0.54 Hz per bit/s, i.e. ~1.5 CPU Cores per 10 Gbps), it consumes 3 x 1.5 CPU Cores = 4.5 CPU Cores per ESXi host just for TCP/IP networking!
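
Here is a minimal Python sketch of this networking math, assuming the measured per-bit costs above and roughly symmetric receive/transmit traffic per vSAN node; the helper name network_cpu_cores is mine, just for illustration:

  # Rough network CPU cost model based on the measured per-bit costs above.
  # Assumptions: ~0.373 Hz per received bit/s, ~0.163 Hz per transmitted bit/s,
  # 3.6 GHz cores, and roughly symmetric RX/TX traffic on a vSAN node.

  RX_HZ_PER_BIT = 0.373
  TX_HZ_PER_BIT = 0.163
  CORE_HZ = 3.6e9

  def network_cpu_cores(rx_bps: float, tx_bps: float) -> float:
      """Estimated CPU cores consumed by TCP/IP processing."""
      return (rx_bps * RX_HZ_PER_BIT + tx_bps * TX_HZ_PER_BIT) / CORE_HZ

  # 10 Gbps with the simplified averaged rule (~0.25 Hz per bit/s):
  print(10e9 * 0.25 / CORE_HZ)           # ~0.7 of a 3.6 GHz CPU Core

  # ~30 Gbps received and ~30 Gbps transmitted per vSAN node:
  print(network_cpu_cores(30e9, 30e9))   # ~4.5 CPU Cores per ESXi host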

Nice. That’s networking.

Let's continue with hyper-converged storage traffic.

Now we can do similar math for hyper-converged vSAN storage and network CPU consumption.

Note that I used RAID-5 with compression enabled. The numbers will be different for RAID-0 and RAID-1. RAID-6 should be similar to RAID-5, with some slight differences (4+1 versus 4+2).

I would assume that compression also plays a role in CPU usage, even though modern CPUs support compression/decompression offloading (Intel QAT, AMD CCX/CPX).

During my testing, the following CPU costs were observed …

  • ~2.5 Hz to read 1 bit of vSAN data (32KB IO, random) 
  • ~7.13 Hz to write 1 bit of vSAN data (32KB IO, random)
  • ~1.86 Hz to read 1 bit of vSAN data (1MB IO, sequential) 
  • ~2.78 Hz to write 1 bit of vSAN data (1MB IO, sequential)

Let’s average it and define another rule of thumb ... 

"3.5 Hz - of general purpose x86-64 CPU (Intel Emerald Rapids)- is required to read or write 1 bit/s from vSAN ESA RAID-5 with compression enabled"

If we would like to be more precise, we should split the oversimplified rule of 3.5 Hz per bit per second into separate values based on I/O size and read or write operation.

Anyway, if we stick with the simple rule of thumb that 3.5 Hz is required per bit per second, then 1 GB/s (8,000,000,000 b/s) would require 28 GHz (7.8 CPU Cores @ 3.6 GHz).
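
Here is a minimal Python sketch of that calculation under the simplified 3.5 Hz per bit/s assumption; the helper name vsan_cpu_cores is mine, just for illustration:

  # CPU cost of vSAN ESA RAID-5 (compression enabled) I/O using the simplified
  # rule of thumb of 3.5 Hz per bit/s. Assumptions: 1 GB/s = 8,000,000,000 bit/s,
  # 3.6 GHz cores.

  VSAN_HZ_PER_BIT = 3.5
  CORE_HZ = 3.6e9

  def vsan_cpu_cores(throughput_gb_per_s: float) -> float:
      """Estimated CPU cores consumed by vSAN ESA storage I/O."""
      bits_per_second = throughput_gb_per_s * 8e9
      return bits_per_second * VSAN_HZ_PER_BIT / CORE_HZ

  print(vsan_cpu_cores(1.0))   # ~7.8 CPU Cores @ 3.6 GHz for 1 GB/s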

In my 6-node vSAN ESA cluster, I was able to achieve the following sustainable throughput in VM guests with a 2 ms response time …

  • ~22.5 GB/s (32 KB IO, 100% read, 100% random) 
  • ~ 8.9 GB/s  (32 KB IO, 100% write, 100% random) 
  • ~22.5 GB/s (1 MB IO, 100% read, 100% sequential)
  • ~15.2 GB/s (1 MB IO, 100% write, 100% sequential)

Therefore, if we average it, it is ~17.2 GB/s of read or write throughput.

17.2 GB/s of vSAN ESA reads or writes (563,600 IOPS @ 32 KB, or 17,612 IOPS @ 1 MB) translates into 481 GHz (134 CPU Cores @ 3.6 GHz).

In the 6-node vSAN cluster, I have 192 cores.

  • Each ESXi host has 2x Intel Xeon Gold 6544Y 16C @ 3.6 GHz CPUs (32 CPU Cores, 115.2 GHz capacity)
  • The total cluster capacity is 192 CPU Cores and 691.2 GHz

That means ~70% (134 / 192 cores) of the vSAN cluster's CPU capacity is consumed by vSAN storage and network operations under a storage workload of 17.2 GB/s - 563,600 IOPS (32 KB I/O size) or 17,612 IOPS (1 MB I/O size).
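
For completeness, here is the same cluster-level arithmetic as a short Python sketch, based on my simplified 3.5 Hz per bit/s assumption:

  # How much of the 6-node cluster's CPU capacity the averaged ~17.2 GB/s
  # vSAN workload would consume under the 3.5 Hz per bit/s rule of thumb.

  CORE_HZ = 3.6e9
  CORES_PER_HOST = 32
  HOSTS = 6
  VSAN_HZ_PER_BIT = 3.5

  cluster_cores = CORES_PER_HOST * HOSTS          # 192 CPU Cores
  cluster_hz = cluster_cores * CORE_HZ            # 691.2 GHz

  throughput_bits = 17.2 * 8e9                    # 17.2 GB/s in bit/s
  vsan_hz = throughput_bits * VSAN_HZ_PER_BIT     # ~481 GHz
  vsan_cores = vsan_hz / CORE_HZ                  # ~134 CPU Cores

  print(f"vSAN CPU demand: {vsan_hz / 1e9:.1f} GHz = {vsan_cores:.0f} CPU Cores")
  print(f"Share of cluster capacity: {vsan_hz / cluster_hz:.0%}")   # ~70%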

Well, your 6-node vSAN storage system will probably not deliver ~600,000 IOPS continuously, but it is good to understand that every I/O requires CPU cycles.

What about RAM?

Thanks to ChatGPT, I have found these numbers.

  • L1 Cache: Access in 3–5 cycles.
  • L2 Cache: Access in 10–20 cycles.
  • L3 Cache: Access in 30–50 cycles.
  • RAM: Access in 100–200+ cycles, depending on the system.
... and we agreed that in a modern DDR5 system, the CPU cycles required to transfer a 4 KB RAM page typically range between 160 and 320 cycles for high-performance configurations (dual-channel DDR5-6400 or better). Optimizations like multi-channel setups, burst transfers, and caching mechanisms can push this closer to the lower bound.

So, for our rule of thumb, let's stick with 250 CPU cycles per 4 KB RAM page. That is 32,768 bits (4 x 1024 x 8) at a cost of 250 CPU cycles, which means ~0.008 CPU cycles per bit, so let's generalize it to 0.01 Hz.

"0.01 Hz - of general purpose x86-64 CPU (Intel Emerald Rapids) - is required to read or write 1 bit/s from RAM DDR5"

What about CPU offload technologies?

Yes, it is obvious that our rules of thumb are oversimplified, and we have not yet talked about hardware offloading technologies like TOE, RSS, LSO, RDMA, Intel AES-NI, Intel QuickAssist Technology (QAT), and similar AMD technologies like AMD AES acceleration.

My testing environment was on Intel, and almost all of the above CPU offload technologies were enabled on my testing infrastructure. The only missing one is RDMA (RoCEv2), which is not supported by the Cisco UCS VIC 15230 for vSAN. I'm in touch with Cisco and VMware about it and hope it is just a matter of testing, but as long as it is not supported, it cannot be used in an enterprise production environment.

RDMA (RoCEv2) 

I’m wondering how much RDMA (RoCEv2) would help to decrease the CPU usage of my vSAN ESA storage system. Unfortunately, I cannot test it because Cisco doesn’t support vSAN over RDMA/RoCE with Cisco UCS VIC 15230 even though it seems supported for vSphere 8.0 U3 :-(

I'm in touch with Cisco and waiting for more details. My assumption was a CPU usage reduction between 10% and 30%.

Thanks to Reddit user u/irzyk27, who shared his data from a 4-node vSAN ESA cluster benchmark, I was able to estimate the RDMA/RoCEv2 benefit.

Here is the summary.

TEST - 8vmdk_100ws_4k_100rdpct_100randompct_4threads (4KB IO, 100% read, 100% random)
  • Non-RDMA > 939790.5 IOPS @ cpu.usage 96.54%, cpu.utilization 80.88%
  • RDMA > 1135752.6 IOPS @ cpu.usage 96.59%, cpu.utilization 81.53%
  • Gains by RDMA over Non-RDMA: +21% IOPS, -0.1% cpu.usage, -0.8% cpu.utilization
  • RDMA CPU usage savings: 21%
TEST - 8vmdk_100ws_4k_70rdpct_100randompct_4threads (4KB IO, 70% read, 100% random)
  • Non-RDMA > 633421.4 IOPS @ cpu.usage 95.45%, cpu.utilization 78.57%
  • RDMA > 779107.8 IOPS @ cpu.usage 93.38%, cpu.utilization 77.46%
  • Gains by RDMA over Non-RDMA: +23% IOPS, -2% cpu.usage, -1.5% cpu.utilization
  • RDMA CPU usage savings: 25%
TEST - 8vmdk_100ws_8k_50rdpct_100randompct_4threads (8KB IO, 50% read, 100% random)
  • Non-RDMA > 468353.5 IOPS @ cpu.usage 95.43%, cpu.utilization 77.71%
  • RDMA > 576964.7 IOPS @ cpu.usage 90.18%, cpu.utilization 73.57%
  • Gains by RDMA over Non-RDMA: +23% IOPS, -6% cpu.usage, -6% cpu.utilization
  • RDMA CPU usage savings: 29%
TEST - 8vmdk_100ws_256k_0rdpct_0randompct_1threads (256KB IO, 100% write, 100% sequential)
  • Non-RDMA > 24112.6 IOPS @ cpu.usage 64.04%, cpu.utilization 41.21%
  • RDMA > 24474.6 IOPS @ cpu.usage 56.7%, cpu.utilization 35.2%
  • Gains by RDMA over Non-RDMA: +1.5% IOPS, -11% cpu.usage, -14.6% cpu.utilization
  • RDMA CPU usage savings: 12.5%
If I read and calculate u/irzyk27's results correctly, RDMA delivers more IOPS (~20%) for almost the same CPU usage when small I/Os (4 KB, 8 KB) are used for random workloads. So, for smaller I/Os (4 KB, 8 KB random), the estimated RDMA CPU saving is between ~20% and ~30%.

For bigger I/Os (256 KB) and sequential workloads, similar storage performance/throughput is achieved with ~11% less CPU usage. So, for larger I/Os (256 KB sequential workloads), the RDMA gain is around 11%.
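
To make the estimated savings above reproducible, here is a small Python sketch of how I combined u/irzyk27's numbers (relative IOPS gain plus relative cpu.usage reduction); this is a back-of-the-envelope metric of mine, not an official benchmark methodology:

  # u/irzyk27's published results: (label, non-RDMA IOPS, non-RDMA cpu.usage %,
  #                                 RDMA IOPS, RDMA cpu.usage %)
  tests = [
      ("4KB 100% read random",   939790.5, 96.54, 1135752.6, 96.59),
      ("4KB 70% read random",    633421.4, 95.45,  779107.8, 93.38),
      ("8KB 50% read random",    468353.5, 95.43,  576964.7, 90.18),
      ("256KB 100% write seq.",   24112.6, 64.04,   24474.6, 56.70),
  ]

  for label, iops0, cpu0, iops1, cpu1 in tests:
      iops_gain = iops1 / iops0 - 1.0        # how much more work RDMA did
      cpu_saving = (cpu0 - cpu1) / cpu0      # how much less cpu.usage RDMA needed
      combined = iops_gain + cpu_saving      # rough combined RDMA saving
      print(f"{label}: IOPS gain {iops_gain:+.1%}, cpu.usage reduction {cpu_saving:+.1%}, "
            f"estimated RDMA saving ~{combined:.0%}")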

This leads me to the validation of my assumption and the conclusion that ...

"RDMA can save between 10% and 30% of CPU depending on I/O size"

So, is it worth enabling RDMA (RoCEv2)?

It depends, but if your infrastructure supports RDMA (RoCEv2) and it is not a manageability issue in your environment/organization, I would highly recommend enabling RDMA, because saving between 10% and 30% of CPU has a very positive impact on TCO.

Fewer CPU Cores required to handle vSAN traffic also means less hardware, less power consumption, and fewer VMware VCF/VVF licenses, so in the end, it means less money and I'm pretty sure no one will be angry if you save money. 

Unfortunately, my Cisco UCS infrastructure with the VIC 15230 doesn’t currently support RDMA for vSAN even though RDMA/RoCE is supported for vSphere. I have to ask Cisco what the reason is and what workloads other than vSAN can benefit from RDMA.

Note: For anyone interested in the difference between cpu.utilization and cpu.usage ...
  • cpu.utilization - Provides statistics for physical CPUs.
  • cpu.usage - Provides statistics for logical CPUs. This is based on CPU Hyperthreading.

Conclusion

It is always good to have some rules of thumb, even if they are averaged and oversimplified.

If we work with the following rules of thumb ...

  • 1 b/s for RAM transfer requires ~0.01 Hz
  • 1 b/s for network transfer requires ~0.25 Hz
  • 1 b/s for general local NVMe storage requires ~0.5 Hz
  • 1 b/s for vSAN storage requires ~3.5 Hz

... we can see that  

  • STORAGE vs NETWORK
    • general local NVMe storage transfer requires 2x more CPU cycles than network transfer 
    • vSAN storage transfer requires 14x more CPU cycles than network transfer 
  • STORAGE vs RAM
    • general local NVMe storage transfer requires 50x more CPU cycles than RAM transfer
    • vSAN storage transfer requires 350x more CPU cycles than RAM transfer 
  • NETWORK vs RAM
    • network transfer requires 25x more CPU cycles than RAM transfer  
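
The ratios above follow directly from the per-bit costs; here is a quick sanity check in Python (my own illustration):

  # Rules of thumb in Hz per bit/s and the resulting ratios.
  HZ_PER_BIT = {"RAM": 0.01, "network": 0.25, "local NVMe": 0.5, "vSAN": 3.5}

  print(HZ_PER_BIT["local NVMe"] / HZ_PER_BIT["network"])   # 2x
  print(HZ_PER_BIT["vSAN"] / HZ_PER_BIT["network"])         # 14x
  print(HZ_PER_BIT["local NVMe"] / HZ_PER_BIT["RAM"])       # 50x
  print(HZ_PER_BIT["vSAN"] / HZ_PER_BIT["RAM"])             # 350x
  print(HZ_PER_BIT["network"] / HZ_PER_BIT["RAM"])          # 25x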

General CPU compute processing involves operations related to RAM, network, and storage. Nothing else, correct?

That’s the reason why I believe that around 60% of CPU cycles could be leveraged for hyper-converged storage I/O and the rest (40%) could be good enough for data manipulation (calculations, transformations, etc.) in RAM and for front-end network communication. 

IT infrastructure is here for data processing. Storage and network operations are a significant part of data processing; therefore, keeping vSAN storage I/O at around 60% of CPU usage and using the rest (40%) for real data manipulation and ingress/egress traffic from/to the vSphere/vSAN cluster is probably a good infrastructure design target that helps us increase the usage of a common x86-64 CPU and improve our datacenter infrastructure ROI.
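
To make the 60/40 split concrete, here is a small planning sketch; the 60% storage CPU budget and the 3.5 Hz per bit/s rule are my own assumptions from above, not VMware guidance:

  # How much vSAN ESA throughput per host fits into a 60% CPU budget,
  # using the 3.5 Hz per bit/s vSAN rule of thumb and 32 cores @ 3.6 GHz per host.

  CORE_HZ = 3.6e9
  CORES_PER_HOST = 32
  VSAN_HZ_PER_BIT = 3.5
  STORAGE_CPU_BUDGET = 0.60

  host_hz = CORES_PER_HOST * CORE_HZ               # 115.2 GHz per host
  budget_hz = host_hz * STORAGE_CPU_BUDGET         # ~69 GHz reserved for storage I/O
  max_bits_per_s = budget_hz / VSAN_HZ_PER_BIT

  print(f"~{max_bits_per_s / 8e9:.1f} GB/s of vSAN ESA I/O per host fits into a 60% CPU budget")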

All of the above are just assumptions based on synthetic testing. The workload reality could be different; however, there is not a big risk in planning 60% of CPU for storage transfers, because if the initial assumption about CPU usage distribution among RAM, network, and storage isn't met, you can always scale out. That's the advantage and beauty of scale-out systems like VMware vSAN. And if you cannot scale out for various reasons, you can leverage vSAN IOPS limits to decrease the storage stress and lower the CPU usage. Of course, it means that your storage response time will increase and you intentionally decrease the storage performance quality. That's where various VM Storage Policies come into play and you can do per-vDisk storage tiering.

It is obvious that CPU offload technologies are helping us to use our systems to the maximum. It is beneficial to leverage technologies like TOE, RSS, LSO, RDMA, etc. 

Double-check that your NIC driver/firmware supports TOE, LSO, RSS, and RDMA (RoCE).

It seems that RDMA/RoCEv2 can significantly reduce CPU usage (up to 30%) so if you can, enable it. 

It is even more important when software-defined networking (NSX) comes into play. There are other hardware offload technologies (SR-IOV, NetDump, GENEVE-Offload) helping to decrease CPU cycles for network operations.
