Thursday, March 20, 2025

VMware PowerCLI (PowerShell) on Linux

VMware PowerCLI is a very handy and flexible automation tool that allows automation of almost all VMware features. It is based on Microsoft PowerShell. I do not have any Microsoft Windows system in my home lab, but I would still like to use Microsoft PowerShell. Fortunately, Microsoft PowerShell (Core) is available for Linux. Here is my latest runbook on how to leverage PowerCLI on a Linux management workstation using Docker application packaging.

Install Docker in your Linux Workstation

Installing Docker is out of the scope of this runbook.

Pull official and verified Microsoft Powershell

sudo docker pull mcr.microsoft.com/powershell:latest

Now you can run the PowerShell container interactively (-i) with an allocated pseudo-TTY (-t). The --rm option stands for "Automatically remove the container when it exits".

List container images

sudo docker image ls

Run powershell container

sudo docker run --rm -it mcr.microsoft.com/powershell

You can also skip the explicit image pull and run the PowerShell container directly; the image will be pulled automatically on the first run.
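
Note that with --rm the container is discarded on exit, so anything you install inside it (including PowerCLI) is gone as well. If you want installed modules to survive between runs, one option is to mount a named Docker volume over the default PowerShell user module path (a sketch, assuming the container runs as root and uses the default module location):

sudo docker run --rm -it -v powercli-modules:/root/.local/share/powershell/Modules mcr.microsoft.com/powershell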

Install PowerCLI in PowerShell

Install-Module -Name VMware.PowerCLI -Scope CurrentUser -Force
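
The installation can take a while. Afterwards, you can verify that the module is available, for example:

Get-Module -ListAvailable -Name VMware.PowerCLI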

Allow Untrusted Certificates

Set-PowerCLIConfiguration -InvalidCertificateAction Ignore -Confirm:$false

Now you can connect to vCenter and list VMs

Connect-VIServer -Server <vcenter-server> -User <username> -Password <password>

Get-VM
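
Putting it all together, a minimal session may look like the sketch below (the server name and credentials are placeholders; the properties are standard PowerCLI VM object properties):

Connect-VIServer -Server vcsa01.home.lab -User administrator@vsphere.local -Password 'VMware1!'

# List VMs with a few useful properties
Get-VM | Select-Object Name, PowerState, NumCpu, MemoryGB | Sort-Object Name

# Disconnect when done
Disconnect-VIServer -Server * -Confirm:$false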


Saturday, March 15, 2025

How to update ESXi with unsupported CPU?

I have old, unsupported servers in my lab running ESXi 8.0.3. In such a configuration, you cannot update ESXi by the default procedure in the GUI.

vSphere Cluster Update doesn't allow remediation

ESXi host shows unsupported CPU

The solution is to allow legacy CPUs and update ESXi from the shell with esxcli.

Allow legacy CPU

The option allowLegacyCPU is not available in the ESXi GUI (DCUI or vSphere Client). It must be enabled using the ESXi shell or SSH. Below is the command to allow legacy CPU.

esxcli system settings kernel set -s allowLegacyCPU -v TRUE

You can verify it with the command ...

esxcli system settings kernel list | grep allowLegacyCPU

If the above procedure fails, the other option is to edit the file /bootbank/boot.cfg and add allowLegacyCPU=true to the end of the kernelopt line.

In my case, it looks like ...

kernelopt=autoPartition=FALSE allowLegacyCPU=true

After modifying /bootbank/boot.cfg, the ESXi configuration should be saved to make the change persistent across reboots.

 /sbin/auto-backup.sh

A reboot of ESXi is obviously required to make the kernel option active.

reboot

After the reboot, you can follow the standard system update procedure using the ESXCLI method as documented below.

ESXi update procedure (ESXCLI method)

  1. Download the appropriate ESXi offline depot. You can find the URL of the depot in the Release Notes of the particular ESXi version. You will need Broadcom credentials to download it from the Broadcom support site.
  2. Upload the ESXi offline depot to some datastore (leveraging the Datastore File Browser, scp, WinSCP, etc.)
    • in my case /vmfs/volumes/vsanDatastore/TMP
  3. List profiles in the ESXi depot
    • esxcli software sources profile list -d /vmfs/volumes/vsanDatastore/TMP/VMware-ESXi-8.0U3d-24585383-depot.zip
  4. Update ESXi to the particular profile with hardware warnings suppressed
    • esxcli software profile update -d /vmfs/volumes/vsanDatastore/TMP/VMware-ESXi-8.0U3d-24585383-depot.zip -p ESXi-8.0U3d-24585383-no-tools --no-hardware-warning
  5. Reboot ESXi
    • reboot
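
After the reboot, you can verify that the host is running the new build, for example ...

esxcli system version get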

Hope this helps other folks in their home labs with unsupported CPUs.

Friday, February 07, 2025

Broadcom (VMware) Useful Links for Technical Designer and/or Architect

A lot of URLs have changed after the Broadcom acquisition of VMware. That's the reason I have started to document some links that are useful for me.

VMware Product Configuration Maximums - https://configmax.broadcom.com

Network (IP) ports Needed by VMware Products and Solutions - https://ports.broadcom.com/

VMware Compatibility Guide - https://compatibilityguide.broadcom.com/ (aka https://www.vmware.com/go/hcl)

VMware Product Lifecycle - https://support.broadcom.com/group/ecx/productlifecycle (aka https://lifecycle.vmware.com/)

Product Interoperability Matrix - https://interopmatrix.broadcom.com/Interoperability

VMware Hands-On Lab - https://labs.hol.vmware.com/HOL/catalog

Broadcom (VMware) Education / Learning - https://www.broadcom.com/education

VMware Validated Solutions - https://vmware.github.io/validated-solutions-for-cloud-foundation/

If you are an independent consultant and have to open a support ticket related to VMware Education or Certification, you can use the form at https://broadcomcms-software.wolkenservicedesk.com/web-form

VMware Health Analyzer

Do you know about any other helpful links? Use the comments below to let me know. Thanks.

Tuesday, February 04, 2025

How is my Microsoft Windows OS syncing the time?

This is a very short post with the procedure for checking the time synchronization of a Microsoft Windows OS running in a VMware virtual machine.

There are two options for how time can be synchronized:

  1. via NTP 
  2. via VMware Tools, with the ESXi host where the VM is running 

The command w32tm /query /status shows the current configuration of time sync.

 Microsoft Windows [Version 10.0.20348.2582]  
 (c) Microsoft Corporation. All rights reserved.  
 C:\Users\david.pasek>w32tm /query /status  
 Leap Indicator: 0(no warning)  
 Stratum: 6 (secondary reference - syncd by (S)NTP)  
 Precision: -23 (119.209ns per tick)  
 Root Delay: 0.0204520s  
 Root Dispersion: 0.3495897s  
 ReferenceId: 0x644D010B (source IP: 10.77.1.11)  
 Last Successful Sync Time: 2/4/2025 10:14:10 AM  
 Source: DC02.example.com  
 Poll Interval: 7 (128s)  
 C:\Users\david.pasek>   

If the Windows OS is joined to Active Directory (which is my case), it synchronizes time with AD via NTP by default. This is visible in the output of the w32tm /query /status command.

You are dependent on Active Directory Domain Controllers; therefore, correct time on the Active Directory Domain Controllers is crucial. I blogged about how to configure time in a virtualized Active Directory Domain Controller back in 2011. It is a very old post, but it should still work.

To check if VMware Tools is syncing time with the ESXi host, use the following command

 C:\>"c:\Program Files\VMware\VMware Tools\VMwareToolboxCmd.exe" timesync status  
 Disabled  

VMware Tools time sync is disabled by default, which is the VMware best practice. It is highly recommended not to synchronize time with the underlying ESXi host and to leverage NTP sync over the network with a trusted time provider. This will help you in case someone makes a configuration mistake and time is not configured properly on a particular ESXi host.
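
If you want to dig deeper into where the time is actually coming from, the standard w32tm queries below may help (run them in an elevated command prompt); w32tm /resync forces an immediate synchronization.

 w32tm /query /source  
 w32tm /query /peers  
 w32tm /query /configuration  
 w32tm /resync  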

Hope you find this useful.

Friday, December 20, 2024

CPU cycles required for general storage workload

I recently published a blog post about the CPU cycles required for network and VMware vSAN ESA storage workloads. I realized it would be nice to test and quantify the CPU cycles needed for a general storage workload without vSAN ESA backend operations like RAID/RAIN and compression.

Performance testing is always tricky as it depends on the guest OS, firmware, drivers, and application, but we are not looking for exact numbers, and approximations are good enough for a general rule of thumb that helps an infrastructure designer during capacity planning.

My test environment was an old Dell PowerEdge R620 (Intel Xeon CPU E5-2620 @ 2.00GHz) with ESXi 8.0.3 and Windows Server 2025 in a virtual machine (2 vCPU @ 2 GHz, 1x para-virtualized SCSI controller/PVSCSI, 1x vDisk). The storage subsystem was a VMware VMFS datastore on a local consumer-grade NVMe disk (Kingston SNVS1000GB flash).

Storage tests were done using the good old Iometer.

The test VM had a total CPU capacity of 4 GHz (4,000,000,000 Hz aka CPU clock cycles per second).

Below are some test results to help me define another rule of thumb.

TEST - 512 B, 100% read, 100% random - 4,040 IOPS @ 2.07 MB/s @ avg response time 0.25 ms

  • 15.49% CPU = 619.6 MHz
  • 619.6 MHz  (619,600,000 CPU cycles) is required to deliver 2.07 MB/s (16,560,000 b/s)
    • 37.42 Hz to read 1 b/s
    • 153.4 KHz for reading 1 IOPS (512 B, random)
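
The per-bit and per-IOPS numbers in this and all following tests are derived the same way from the measured CPU percentage, throughput, and IOPS. A minimal PowerShell sketch of the math (variable names are mine), using the first test as input:

$vmCapacityMHz = 4000                            # 2 vCPU @ 2 GHz
$cpuPct = 15.49                                  # measured CPU usage
$throughputMBps = 2.07                           # measured throughput
$iops = 4040                                     # measured IOPS

$cpuHz = $vmCapacityMHz * 1e6 * $cpuPct / 100    # 619,600,000 cycles/s (619.6 MHz)
$bps = $throughputMBps * 8e6                     # 16,560,000 b/s

$hzPerBit = $cpuHz / $bps                        # ~37.4 Hz per 1 b/s
$kHzPerIops = $cpuHz / $iops / 1000              # ~153.4 KHz per IOPS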

TEST - 512 B, 100% write, 100% random - 4,874 IOPS @ 2.50 MB/s @ avg response time 0.2 ms

  • 19.45% CPU = 778 MHz
  • 778 MHz  (778,000,000 CPU cycles) is required to deliver 2.50 MB/s (20,000,000 b/s)
    • 38.9 Hz to write 1 b/s
    • 159.6 KHz for writing 1 IOPS (512 B, random)

TEST - 4 KiB, 100% read, 100% random - 3,813 IOPS @ 15.62 MB/s @ avg response time 0.26 ms

  • 13.85% CPU = 554.0 MHz
  • 554.0 MHz  (554,000,000 CPU cycles) is required to deliver 15.62 MB/s (124,960,000 b/s)
    • 4.43 Hz to read 1 b/s
    • 145.3 KHz for reading 1 IOPS (4 KiB, random)

TEST - 4 KiB, 100% write, 100% random - 4,413 IOPS @ 18.08 MB/s @ avg response time 0.23 ms

  • 21.84% CPU = 873.6 MHz
  • 873.6 MHz  (873,600,000 CPU cycles) is required to deliver 18.08 MB/s (144,640,000 b/s)
    • 6.039 Hz to write 1 b/s
    • 197.9 KHz for writing 1 IOPS (4 KiB, random)

TEST - 32 KiB, 100% read, 100% random - 2,568 IOPS @ 84.16 MB/s @ avg response time 0.39 ms

  • 10.9% CPU = 436 MHz
  • 436 MHz  (436,000,000 CPU cycles) is required to deliver 84.16 MB/s (673,280,000 b/s)
    • 0.648 Hz to read 1 b/s
    • 169.8 KHz for reading 1 IOPS (32 KiB, random)

TEST - 32 KiB, 100% write, 100% random - 2,873 IOPS @ 94.16 MB/s @ avg response time 0.35 ms

  • 14.16% CPU = 566.4 MHz
  • 566.4 MHz  (566,400,000 CPU cycles) is required to deliver 94.16 MB/s (753,280,000 b/s)
    • 0.752 Hz to write 1 b/s
    • 197.1 KHz for writing 1 IOPS (32 KiB, random)

TEST - 64 KiB, 100% read, 100% random - 1,826 IOPS @ 119.68 MB/s @ avg response time 0.55 ms

  • 9.06% CPU = 362.4 MHz
  • 362.4 MHz  (362,400,000 CPU cycles) is required to deliver 119.68 MB/s (957,440,000 b/s)
    • 0.37 Hz to read 1 b/s
    • 198.5 KHz for reading 1 IOPS (64 KiB, random)

TEST - 64 KiB, 100% write, 100% random - 2,242 IOPS @ 146.93 MB/s @ avg response time 0.45 ms

  • 12.15% CPU = 486.0 MHz
  • 486.0 MHz  (486,000,000 CPU cycles) is required to deliver 146.93 MB/s (1,175,440,000 b/s)
    • 0.41 Hz to write 1 b/s
    • 216.7 KHz for writing 1 IOPS (64 KiB, random)

TEST - 256 KiB, 100% read, 100% random - 735 IOPS @ 192.78 MB/s @ avg response time 1.36 ms

  • 6.66% CPU = 266.4 MHz
  • 266.4 MHz  (266,400,000 CPU cycles) is required to deliver 192.78 MB/s (1,542,240,000 b/s)
    • 0.17 Hz to read 1 b/s
    • 362.4 KHz for reading 1 IOPS (256 KiB, random)

TEST - 256 KiB, 100% write, 100% random - 703 IOPS @ 184.49 MB/s @ avg response time 1.41 ms

  • 7.73% CPU = 309.2 MHz
  • 309.2 MHz  (309,200,000 CPU cycles) is required to deliver 184.49 MB/s (1,475,920,000 b/s)
    • 0.21 Hz to write 1 b/s
    • 439.9 KHz for writing 1 IOPS (256 KiB, random)

TEST - 256 KiB, 100% read, 100% seq - 2784 IOPS @ 730.03 MB/s @ avg response time 0.36 ms

  • 15.26% CPU = 610.4 MHz
  • 610.4 MHz  (610,400,000 CPU cycles) is required to deliver 730.03 MB/s (5,840,240,000 b/s)
    • 0.1 Hz to read 1 b/s
    • 219.25 KHz for reading 1 IOPS (256 KiB, sequential)

TEST - 256 KiB, 100% write, 100% seq - 1042 IOPS @ 273.16 MB/s @ avg response time 0.96 ms

  • 9.09% CPU = 363.6 MHz
  • 363.6 MHz  (363,600,000 CPU cycles) is required to deliver 273.16 MB/s (2,185,280,000 b/s)
    • 0.17 Hz to write 1 b/s
    • 348.4 KHz for writing 1 IOPS (256 KiB, sequential)

TEST - 1 MiB, 100% read, 100% seq - 966 IOPS @ 1013.3 MB/s @ avg response time 1 ms

  • 9.93% CPU = 397.2 MHz
  • 397.2 MHz  (397,200,000 CPU cycles) is required to deliver 1013.3 MB/s (8,106,400,000 b/s)
    • 0.05 Hz to read 1 b/s
    • 411.18 KHz for reading 1 IOPS (1 MiB, sequential)

TEST - 1 MiB, 100% write, 100% seq - 286 IOPS @ 300.73 MB/s @ avg response time 3.49 ms

  • 10.38% CPU = 415.2 MHz
  • 415.2 MHz  (415,200,000 CPU cycles) is required to deliver 300.73 MB/s (2,405,840,000 b/s)
    • 0.17 Hz to write 1 b/s
    • 1.452 MHz for writing 1 IOPS (1 MiB, sequential)

Observations

We can see that the CPU cycles required to read 1 b/s vary based on I/O size, Read/Write, and Random/Sequential pattern.

  • Small I/O (512 B, random) can consume almost 40 Hz to read or write 1 b/s. 
  • Normalized I/O (32 KiB, random) can consume around 0.7 Hz to read or write 1 b/s
  • Large I/O (1 MiB, sequential) can consume around 0.1 Hz to read or write 1 b/s

If we use the same approach as for vSAN and average 32 KiB I/O (random) and 1 MiB I/O (sequential), we can define the following rule of thumb ...
"0.5 Hz of general purpose x86-64 CPU (Intel Sandy Bridge) is required to read or write 1 bit/s from local NVMe flash disk"

If we compare it with the 3.5 Hz rule of thumb for vSAN ESA RAID-5 with compression, we can see that vSAN ESA requires 7x more CPU cycles, but that makes perfect sense because vSAN ESA does a lot of additional processing on the backend. Such processing mainly involves data protection (RAID-5/RAIN-5) and compression.

I was curious how many CPU cycles a non-redundant storage workload requires, and the observed numbers IMHO make sense.

Hope this helps others during infrastructure design exercises. 

Wednesday, December 11, 2024

VMware Desktop Products direct download links

Main URL for all desktop products: https://softwareupdate.vmware.com/cds/vmw-desktop/

VMware Fusion: https://softwareupdate.vmware.com/cds/vmw-desktop/fusion/

VMware Workstation: https://softwareupdate.vmware.com/cds/vmw-desktop/ws/

VMware Remote Console (VMRC): https://softwareupdate.vmware.com/cds/vmw-desktop/vmrc/

You do not need to have a Broadcom account. All VMware desktop products are directly downloadable without signing in.

VMware Health Analyzer - how to download and register the tool

Are you looking for VMware Health Analyzer? It is not easy to find, so here are the links to download the tool and register it to get the license.

Full VHA download: https://docs.broadcom.com/docs/VHA-FULL-OVF10

Collector VHA download: https://docs.broadcom.com/docs/VHA-COLLECTOR-OVF10

Full VHA license Register Tool: https://pstoolhub.broadcom.com/

I publish this mainly for my own reference, but I hope other VMware community folks find it useful.

Monday, December 09, 2024

Every I/O requires CPU Cycles - vSAN ESA is not different

This is the follow-up blog post to my recent blog post about "benchmark results of VMware vSAN ESA".

It is obvious and logical that every computer I/O requires CPU Cycles. This is not (or better to say should not be) a surprise for any infrastructure professional. Anyway, computers are evolving year after year, so some rules of thumb should be validated and sometimes redefined from time to time.

Every bit transmitted/received over the TCP/IP network requires CPU cycles. The same applies to storage I/O. vSAN is a hyper-converged software-defined enterprise storage system, therefore, it requires TCP/IP networking for data striping across nodes (vSAN is RAIN - Redundant Array of Independent Nodes) and storage I/Os to local NVMe disks. 

What is the CPU Clock Cycle? 

A clock cycle is the smallest unit of time in which a CPU performs a task. It acts like a metronome, synchronizing operations such as fetching, decoding, executing instructions, and transferring data.

I have two CPUs Intel Xeon Gold 6544Y 16C @ 3.6 GHz.
I have 32 CPU Cores in total. 
16 CPU Cores in one socket (NUMA node) and 16 CPU Cores in another socket (NUMA node).

That means that every CPU Core runs at a 3.6 GHz frequency:
  • 1 GHz = 1 billion (1,000,000,000) cycles per second.
  • 3.6 GHz = 3.6 billion (3,600,000,000) cycles per second.
My single CPU Core can execute 3.6 billion (3,600,000,000) clock cycles every second.
For our simplified computer math we can say that 1 Hz = 1 CPU cycle.
We have to simplify, otherwise we will not be able to define rules of thumb.

Let's start with networking. 

In the past, there was a rule of thumb that sending or receiving 1 bit/s requires 1 Hz (1 CPU cycle @ 1 GHz CPU). This general rule of thumb is mentioned in the book “VMware vSphere 6.5 Host Resources Deep Dive” by Frank Denneman and Niels Hagoort. Btw, it is a great book and anybody designing or operating VMware vSphere should have it on their bookshelf.

In my testing, I found that 

  • 1 b/s receive traffic requires 0.373 Hz 
  • 1 b/s transmit traffic requires 0.163 Hz

Let’s average it and redefine the old rule of thumb a little bit.

My current rule of thumb for TCP/IP networking is ... 

"1 bit send or receive over TCP/IP datacenter network requires ~0.25 Hz of general purpose x86-64 CPU (Intel Emerald Rapids)"

There is a difference between receive traffic and transmit traffic. So, if we would like to be more precise, the rule of thumb should be ...

"1 bit send over TCP/IP datacenter network requires ~0.37 Hz and 1 bit send over TCP/IP datacenter network requires ~0.16 Hz of general purpose x86-64 CPU"

Unfortunately, the latter rule of thumb (~0.37 Hz and ~0.16 Hz) is not as easy to remember as the first simplified and averaged rule of thumb (~0.25 Hz per bit per second).

Also, the mention of "general purpose x86-64 CPU" is pretty important because the numbers can vary across CPU architectures, and specific ASICs are a totally different topic.

It is worth mentioning that my results were tested and observed on Intel Xeon Gold 6544Y 16C @ 3.6 GHz (aka Emerald Rapids).

So let's use the simplified rule (0.25 Hz per 1 bit per second) and we can say that 10 Gbps of received or transmitted traffic requires 2,500,000,000 Hz = 2.5 GHz, which is ~0.75 of one of my CPU cores @ 3.6 GHz, or one whole CPU core running at 2.5 GHz.

In my testing, vSAN ESA under a load of 720k IOPS (32 KB) or 20 GB/s aggregated throughput across 6 ESXi hosts typically used between ~27 and ~34 Gbps of network traffic per particular ESXi host (vSAN node). Let’s assume ~30 Gbps on average; therefore, it consumes 3 x 1.5 CPU cores = 4.5 CPU cores per ESXi host just for TCP/IP networking!

Nice. That’s networking.

Let's continue with hyper-converged storage traffic 

Now we can do similar math for hyper-converged vSAN storage and network CPU consumption.

Note that I used RAID-5 with compression enabled. It will be different for RAID-0 and RAID-1. RAID-6 should be similar to RAID-5 but with some slight differences (4+1 versus 4+2). 

I would assume that compression also plays a role in CPU usage, even though modern CPUs support compression/decompression offloading (Intel QAT, AMD CCX/CPX).

During my testing, the following CPU consumptions were observed …

  • ~2.5 Hz to read 1 bit of vSAN data (32KB IO, random) 
  • ~7.13 Hz to write 1 bit of vSAN data (32KB IO, random)
  • ~1.86 Hz to read 1 bit of vSAN data (1MB IO, sequential) 
  • ~2.78 Hz to write 1 bit of vSAN data (1MB IO, sequential)

Let’s average it and define another rule of thumb ... 

"3.5 Hz - of general purpose x86-64 CPU (Intel Emerald Rapids)- is required to read or write 1 bit/s from vSAN ESA RAID-5 with compression enabled"

If we wanted to be more precise, we should split the oversimplified rule of 3.5 Hz per bit per second into something based on I/O size and read or write operation.

Anyway, if we stick with the simple rule of thumb that 3.5 Hz is required per bit per second then 1 GB/s (8,000,000,000 b/s) would require 28 GHz (7.8 CPU Cores @ 3.6 GHz).

In my 6-node vSAN ESA cluster I was able to achieve the following sustainable throughput in VM guests with a 2 ms response time … 

  • ~22.5 GB/s (32 KB IO, 100% read, 100% random) 
  • ~ 8.9 GB/s  (32 KB IO, 100% write, 100% random) 
  • ~22.5 GB/s (1 MB IO, 100% read, 100% sequential)
  • ~15.2 GB/s (1 MB IO, 100% write, 100% sequential)

Therefore, if we average it, it is ~17.2 GB/s read or write.

17.2 GB/s (563,600 IOPS at 32 KB or 17,612 IOPS at 1 MB) of vSAN ESA reads or writes requires 481 GHz (134 CPU Cores @ 3.6 GHz).

In the 6-node vSAN cluster, I have 192 cores.

  • Each ESXi has 2x CPU Intel Xeon Gold 6544Y 16C @ 3.6 GHz (32 CPU Cores, 115.2 GHz capacity)
  • The total cluster capacity is 192 CPU Cores and 691.2 GHz

That means ~70% (134 / 192) of the vSAN cluster CPU capacity is consumed by vSAN storage and network operations under a storage workload of 17.2 GB/s (563,600 IOPS at 32 KB I/O size or 17,612 IOPS at 1 MB I/O size).
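
The ~70% figure follows directly from the rule of thumb and the cluster capacity above; a quick PowerShell sketch of the same math:

$hzPerBit = 3.5                                     # rule of thumb for vSAN ESA RAID-5 + compression
$throughputGBps = 17.2                              # averaged measured read/write throughput

$cpuGHz = $throughputGBps * 8e9 * $hzPerBit / 1e9   # ~481.6 GHz
$cores = $cpuGHz / 3.6                              # ~134 CPU cores @ 3.6 GHz
$clusterShare = $cores / 192                        # ~0.70 of the 192-core cluster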

Well, your 6-node vSAN storage system will probably not deliver ~600,000 IOPS continuously, but it is good to understand that every I/O requires CPU cycles.

What about RAM?

Thanks to ChatGPT, I have found these numbers.

  • L1 Cache: Access in 3–5 cycles.
  • L2 Cache: Access in 10–20 cycles.
  • L3 Cache: Access in 30–50 cycles.
  • RAM: Access in 100–200+ cycles, depending on the system
... and we agreed that in a modern DDR5 system, the CPU cycles required to transfer a 4 KB RAM page typically range between 160 and 320 cycles for high-performance configurations (dual-channel DDR5-6400 or better). Optimizations like multi-channel setups, burst transfers, and caching mechanisms can push this closer to the lower bound.

So, for our rule of thumb, let's stick with 250 CPU cycles per 4 KB RAM page. That is 32,768 bits (4x1024x8) at a cost of 250 CPU cycles, which means ~0.008 CPU cycles per bit, so let's generalize it to 0.01 Hz.

"0.01 Hz - of general purpose x86-64 CPU (Intel Emerald Rapids) - is required to read or write 1 bit/s from RAM DDR5"

What about CPU offload technologies?

Yes, it is obvious that our rules of thumb are oversimplified and we are not talking about hardware offloading technologies like TOE, RSS, LSO, RDMA, Intel AES-NI, Intel QuickAssist Technology (QAT), and similar AMD technologies like AMD AES Acceleration.

My testing environment was on Intel, and almost all of the above CPU offload technologies were enabled on my testing infrastructure. The only missing one is RDMA (RoCEv2), which is not supported by the Cisco UCS VIC 15230 for vSAN. I'm in touch with Cisco and VMware about it and hope it is just a matter of testing, but until it is supported, it cannot be used in an enterprise production environment.

RDMA (RoCEv2) 

I’m wondering how much RDMA (RoCEv2) would help to decrease the CPU usage of my vSAN ESA storage system. Unfortunately, I cannot test it because Cisco doesn’t support vSAN over RDMA/RoCE with the Cisco UCS VIC 15230, even though it seems to be supported for vSphere 8.0 U3 :-(

I'm in touch with Cisco and waiting for more details as I have assumed a CPU usage reduction between 10% and 30%.

Thanks to Reddit user u/irzyk27, who shared his data from a 4-node vSAN ESA cluster benchmark, I was able to estimate the RDMA/RoCEv2 benefit.

Here is the summary.

TEST - 8vmdk_100ws_4k_100rdpct_100randompct_4threads (4KB IO, 100% read, 100% random)
  • Non-RDMA > 939790.5 IOPS @ cpu.usage 96.54% , cpu.utilization 80.88%
  • RDMA > 1135752.6 IOPS @ cpu.usage 96.59% , cpu.utilization 81.53%
  • Gains by RDMA over Non-RDMA +21% IOPS, -0.1% cpu.usage, -0.8% cpu.utilization
  • RDMA CPU usage savings 21%
TEST - 8vmdk_100ws_4k_70rdpct_100randompct_4threads (4KB IO, 70% read, 100% random)
  • Non-RDMA > 633421.4 IOPS @ cpu.usage 95.45% , cpu.utilization 78.57%
  • RDMA > 779107.8 IOPS @ cpu.usage 93.38% , cpu.utilization 77.46%
  • Gains by RDMA over Non-RDMA +23% IOPS, -2% cpu.usage, -1.5% cpu.utilization
  • RDMA CPU usage savings 25%
TEST - 8vmdk_100ws_8k_50rdpct_100randompct_4threads (8KB IO, 50% read, 100% random)
  • Non-RDMA > 468353.5 IOPS @ cpu.usage 95.43% , cpu.utilization 77.71%
  • RDMA > 576964.7 IOPS @ cpu.usage 90.18% , cpu.utilization 73.57%
  • Gains by RDMA over Non-RDMA +23% IOPS, -6% cpu.usage, -6% cpu.utilization
  • RDMA CPU usage savings 29%
TEST - 8vmdk_100ws_256k_0rdpct_0randompct_1threads (256KB IO, 100% write, 100% sequent.)
  • Non-RDMA > 24112.6 IOPS @ cpu.usage 64.04% , cpu.utilization 41.21%
  • RDMA > 24474.6 IOPS @ cpu.usage 56.7% , cpu.utilization 35.2%
  • Gains by RDMA over Non-RDMA +1.5% IOPS, -11% cpu.usage, -14.6% cpu.utilization
  • RDMA CPU usage savings 12.5%

If I read and calculate u/irzyk27's results correctly, RDMA allows more IOPS (~20%) for almost the same CPU usage when small I/Os (4 KB, 8 KB) are used for random workloads. So, for smaller I/Os (4k, 8k random), the gain is between ~25% and 30%.

For bigger I/Os (256 KB) and sequential workloads, similar storage performance/throughput is achieved with ~11% less CPU usage. So, for larger I/Os (256k sequential workload) the RDMA gain is around 11%.

This leads me to the validation of my assumption and the conclusion that ...

"RDMA can save between 10% and 30% of CPU depending on I/O size"

So, is it worth enabling RDMA (RoCEv2)?

It depends, but if your infrastructure supports RDMA (RoCEv2) and it is not a manageability issue in your environment/organization, I would highly recommend enabling RDMA, because saving between 10% and 30% of CPU has a very positive impact on TCO.

Fewer CPU Cores required to handle vSAN traffic also means less hardware, less power consumption, and fewer VMware VCF/VVF licenses, so in the end, it means less money and I'm pretty sure no one will be angry if you save money. 

Unfortunately, my Cisco UCS infrastructure with the VIC 15230 doesn’t currently support RDMA for vSAN, even though RDMA/RoCE is supported for vSphere. I have to ask Cisco what the reason is and what other workloads than vSAN can benefit from RDMA. 

Note: For anyone interested in the difference between cpu.utilization and cpu.usage ...
  • cpu.utilization - Provides statistics for physical CPUs.
  • cpu.usage - Provides statistics for logical CPUs. This is based on CPU Hyperthreading.

Conclusion

It is always good to have some rules of thumb, even if they are averaged and oversimplified.

If we work with the following rules of thumb ...

  • 1 b/s for RAM transfer requires ~0.01 Hz
  • 1 b/s for network transfer requires ~0.25 Hz
  • 1 b/s for general local NVMe storage requires ~0.5 Hz
  • 1 b/s for vSAN storage requires ~3.5 Hz

... we can see that  

  • STORAGE vs NETWORK
    • general local NVMe storage transfer requires 2x more CPU cycles than network transfer 
    • vSAN storage transfer requires 14x more CPU cycles than network transfer 
  • STORAGE vs RAM
    • general local NVMe storage transfer requires 50x more CPU cycles than RAM transfer
    • vSAN storage transfer requires 350x more CPU cycles than RAM transfer 
  • NETWORK vs RAM
    • network transfer requires 25x more CPU cycles than RAM transfer  

General CPU compute processing involves operations related to RAM, network, and storage. Nothing else, correct?

That’s the reason why I believe that around 60% of CPU cycles could be leveraged for hyper-converged storage I/O and the rest (40%) could be good enough for data manipulation (calculations, transformations, etc.) in RAM and for front-end network communication. 

IT infrastructure is here for data processing. Storage and network operations are a significant part of data processing, therefore keeping vSAN storage I/O at 60% CPU usage and using the rest (40%) for real data manipulation and ingress/egress traffic from/to the vSphere/vSAN cluster is probably a good infrastructure use case that helps us increase the usage of common x86-64 CPUs and improve our datacenter infrastructure ROI.

All of the above are just assumptions based on some synthetic testing. The workload reality could be different; however, there is not a big risk in planning 60% of CPU for storage transfers, because if the initial assumption about CPU usage distribution among RAM, network, and storage isn't met, you can always scale out. That's the advantage and beauty of scale-out systems like VMware vSAN. And if you cannot scale out for various reasons, you can leverage vSAN IOPS limits to decrease storage stress and lower CPU usage. Of course, it means that your storage response time will increase and you intentionally decrease the storage performance quality. That's where various VM Storage Policies come into play and you can do per-vDisk storage tiering.

It is obvious that CPU offload technologies are helping us to use our systems to the maximum. It is beneficial to leverage technologies like TOE, RSS, LSO, RDMA, etc. 

Double-check that your NIC driver/firmware supports TOE, LSO, RSS, and RDMA (RoCE). 

It seems that RDMA/RoCEv2 can significantly reduce CPU usage (up to 30%) so if you can, enable it. 

It is even more important when software-defined networking (NSX) comes into play. There are other hardware offload technologies (SR-IOV, NetDump, GENEVE offload) that help decrease CPU cycles for network operations.