Friday, December 20, 2024

CPU cycles required for general storage workload

I recently published a blog post about CPU cycles required for network and VMware vSAN ESA storage workload. I realized it would be nice to test and quantify CPU cycles needed for general storage workload without vSAN ESA backend operations like RAID/RAIN and compression.

Performance testing is always tricky as it depends on the guest OS, firmware, drivers, and application, but we are not looking for exact numbers; approximations are good enough for a general rule of thumb that helps an infrastructure designer during capacity planning.

My test environment was an old Dell PowerEdge R620 (Intel Xeon CPU E5-2620 @ 2.00GHz) with ESXi 8.0.3 and Windows Server 2025 in a virtual machine (2 vCPU @ 2 GHz, 1x para-virtualized SCSI controller/PVSCSI, 1x vDisk). The storage subsystem was a VMware VMFS datastore on a local consumer-grade NVMe disk (Kingston SNVS1000GB flash).

Storage tests were done using the good old Iometer.

The test VM had a total CPU capacity of 4 GHz (4,000,000,000 Hz, aka CPU clock cycles per second).

Below are some test results to help me define another rule of thumb.
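All the per-test numbers below come from the same simple arithmetic, so here is a minimal Python sketch of it. The input values are taken from the first 512 B read test below; the helper function name is mine, purely for illustration.

# Minimal sketch of the arithmetic used in every test below:
# CPU% of the 4 GHz VM capacity, divided by delivered bits/s and by IOPS.
VM_CPU_CAPACITY_HZ = 4_000_000_000  # 2 vCPU @ 2 GHz

def cpu_cost(cpu_percent, iops, mb_per_s):
    """Return (Hz per bit/s, Hz per IOPS) for one Iometer test."""
    used_hz = VM_CPU_CAPACITY_HZ * cpu_percent / 100.0  # e.g. 15.49% -> 619.6 MHz
    bits_per_s = mb_per_s * 1_000_000 * 8               # MB/s -> b/s (decimal MB)
    return used_hz / bits_per_s, used_hz / iops

hz_per_bit, hz_per_io = cpu_cost(cpu_percent=15.49, iops=4040, mb_per_s=2.07)
print(f"{hz_per_bit:.2f} Hz per b/s, {hz_per_io / 1000:.1f} KHz per IOPS")
# -> 37.42 Hz per b/s, 153.4 KHz per IOPS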

TEST - 512 B, 100% read, 100% random - 4,040 IOPS @ 2.07 MB/s @ avg response time 0.25 ms

  • 15.49% CPU = 619.6 MHz
  • 619.6 MHz  (619,600,000 CPU cycles) is required to deliver 2.07 MB/s (16,560,000 b/s)
    • 37.42 Hz to read 1 b/s
    • 153.4 KHz for reading 1 IOPS (512 B, random)

TEST - 512 B, 100% write, 100% random - 4,874 IOPS @ 2.50 MB/s @ avg response time 0.2 ms

  • 19.45% CPU = 778 MHz
  • 778 MHz  (778,000,000 CPU cycles) is required to deliver 2.50 MB/s (20,000,000 b/s)
    • 38.9 Hz to write 1 b/s
    • 159.6 KHz for writing 1 IOPS (512 B, random)

TEST - 4 KiB, 100% read, 100% random - 3,813 IOPS @ 15.62 MB/s @ avg response time 0.26 ms

  • 13.85% CPU = 554.0 MHz
  • 554.0 MHz  (554,000,000 CPU cycles) is required to deliver 15.62 MB/s (124,960,000 b/s)
    • 4.43 Hz to read 1 b/s
    • 145.3 KHz for reading 1 IOPS (4 KiB, random)

TEST - 4 KiB, 100% write, 100% random - 4,413 IOPS @ 18.08 MB/s @ avg response time 0.23 ms

  • 21.84% CPU = 873.6 MHz
  • 873.6 MHz  (873,600,000 CPU cycles) is required to deliver 18.08 MB/s (144,640,000 b/s)
    • 6.039 Hz to write 1 b/s
    • 197.9 KHz for writing 1 IOPS (4 KiB, random)

TEST - 32 KiB, 100% read, 100% random - 2,568 IOPS @ 84.16 MB/s @ avg response time 0.39 ms

  • 10.9% CPU = 436 MHz
  • 436 MHz  (436,000,000 CPU cycles) is required to deliver 84.16 MB/s (673,280,000 b/s)
    • 0.648 Hz to read 1 b/s
    • 169.8 KHz for reading 1 IOPS (32 KiB, random)

TEST - 32 KiB, 100% write, 100% random - 2,873 IOPS @ 94.16 MB/s @ avg response time 0.35 ms

  • 14.16% CPU = 566.4 MHz
  • 566.4 MHz  (566,400,000 CPU cycles) is required to deliver 94.16 MB/s (753,280,000 b/s)
    • 0.752 Hz to write 1 b/s
    • 197.1 KHz for writing 1 IOPS (32 KiB, random)

TEST - 64 KiB, 100% read, 100% random - 1,826 IOPS @ 119.68 MB/s @ avg response time 0.55 ms

  • 9.06% CPU = 362.4 MHz
  • 362.4 MHz  (362,400,000 CPU cycles) is required to deliver 119.68 MB/s (957,440,000 b/s)
    • 0.37 Hz to read 1 b/s
    • 198.5 KHz for reading 1 IOPS (64 KiB, random)

TEST - 64 KiB, 100% write, 100% random - 2,242 IOPS @ 146.93 MB/s @ avg response time 0.45 ms

  • 12.15% CPU = 486.0 MHz
  • 486.0 MHz  (486,000,000 CPU cycles) is required to deliver 146.93 MB/s (1,175,440,000 b/s)
    • 0.41 Hz to write 1 b/s
    • 216.7 KHz for writing 1 IOPS (64 KiB, random)

TEST - 256 KiB, 100% read, 100% random - 735 IOPS @ 192.78 MB/s @ avg response time 1.36 ms

  • 6.66% CPU = 266.4 MHz
  • 266.4 MHz  (266,400,000 CPU cycles) is required to deliver 192.78 MB/s (1,542,240,000 b/s)
    • 0.17 Hz to read 1 b/s
    • 362.4 KHz for reading 1 IOPS (256 KiB, random)

TEST - 256 KiB, 100% write, 100% random - 703 IOPS @ 184.49 MB/s @ avg response time 1.41 ms

  • 7.73% CPU = 309.2 MHz
  • 309.2 MHz  (309,200,000 CPU cycles) is required to deliver 184.49 MB/s (1,475,920,000 b/s)
    • 0.21 Hz to write 1 b/s
    • 439.9 KHz for writing 1 IOPS (256 KiB, random)

TEST - 256 KiB, 100% read, 100% seq - 2784 IOPS @ 730.03 MB/s @ avg response time 0.36 ms

  • 15.26% CPU = 610.4 MHz
  • 610.4 MHz  (610,400,000 CPU cycles) is required to deliver 730.03 MB/s (5,840,240,000 b/s)
    • 0.1 Hz to read 1 b/s
    • 219.25 KHz for reading 1 IOPS (256 KiB, sequential)

TEST - 256 KiB, 100% write, 100% seq - 1042 IOPS @ 273.16 MB/s @ avg response time 0.96 ms

  • 9.09% CPU = 363.6 MHz
  • 363.6 MHz  (363,600,000 CPU cycles) is required to deliver 273.16 MB/s (2,185,280,000 b/s)
    • 0.17 Hz to write 1 b/s
    • 348.4 KHz for writing 1 IOPS (256 KiB, sequential)

TEST - 1 MiB, 100% read, 100% seq - 966 IOPS @ 1013.3 MB/s @ avg response time 1 ms

  • 9.93% CPU = 397.2 MHz
  • 397.2 MHz  (397,200,000 CPU cycles) is required to deliver 1013.3 MB/s (8,106,400,000 b/s)
    • 0.05 Hz to read 1 b/s
    • 411.18 KHz for reading 1 IOPS (1 MiB, sequential)

TEST - 1 MiB, 100% write, 100% seq - 286 IOPS @ 300.73 MB/s @ avg response time 3.49 ms

  • 10.38% CPU = 415.2 MHz
  • 415.2 MHz  (415,200,000 CPU cycles) is required to deliver 300.73 MB/s (2,405,840,000 b/s)
    • 0.17 Hz to write 1 b/s
    • 1.452 MHz for writing 1 IOPS (1 MiB, sequential)

Observations

We can see that the CPU cycles required to read 1 b/s vary based on I/O size, Read/Write, and Random/Sequential pattern.

  • Small I/O (512 B, random) can consume almost 40 Hz to read or write 1 b/s. 
  • Normalized I/O (32 KiB, random) can consume around 0.7 Hz to read or write 1 b/s
  • Large I/O (1 MiB, sequential) can consume around 0.1 Hz to read or write 1 b/s
If we use the same approach as for vSAN and average the 32 KiB (random) and 1 MiB (sequential) I/O results, we can define the following rule of thumb:
"0.5 Hz of general purpose x86-64 CPU (Intel Sandy Bridge) is required to read or write 1 bit/s from local NVMe flash disk"

If we compare it with the 3.5 Hz rule of thumb for vSAN ESA RAID-5 with compression, we can see the vSAN ESA requires 7x more CPU cycles, but it makes perfect sense because vSAN ESA does a lot of additional processing on the backend. Such processing mainly involves data protection (RAID-5/RAIN-5) and compression.  

I was curious how many CPU cycles a non-redundant storage workload requires, and the observed numbers IMHO make sense.

Hope this helps others during infrastructure design exercises. 

Wednesday, December 11, 2024

VMware Desktop Products direct download links

Main URL for all desktop products: https://softwareupdate.vmware.com/cds/vmw-desktop/

VMware Fusion: https://softwareupdate.vmware.com/cds/vmw-desktop/fusion/

VMware Workstation: https://softwareupdate.vmware.com/cds/vmw-desktop/ws/

VMware Remote Console (VMRC): https://softwareupdate.vmware.com/cds/vmw-desktop/vmrc/

You do not need to have a Broadcom account. All VMware desktop products are directly downloadable without signing in.

VMware Health Analyzer - how to download and register the tool

Are you looking for VMware Health Analyzer? It is not easy to find, so here are the links to download the tool and register it to get the license.

Full VHA download: https://docs.broadcom.com/docs/VHA-FULL-OVF10

Collector VHA download: https://docs.broadcom.com/docs/VHA-COLLECTOR-OVF10

Full VHA license Register Tool: https://pstoolhub.broadcom.com/

I publish it mainly for my own reference but I hope other VMware community folks find it useful.

Monday, December 09, 2024

Every I/O requires CPU Cycles - vSAN ESA is no different

This is the follow-up blog post to my recent blog post about "benchmark results of VMware vSAN ESA".

It is obvious and logical that every computer I/O requires CPU Cycles. This is not (or better to say should not be) a surprise for any infrastructure professional. Anyway, computers are evolving year after year, so some rules of thumb should be validated and sometimes redefined from time to time.

Every bit transmitted/received over the TCP/IP network requires CPU cycles. The same applies to storage I/O. vSAN is a hyper-converged software-defined enterprise storage system, therefore, it requires TCP/IP networking for data striping across nodes (vSAN is RAIN - Redundant Array of Independent Nodes) and storage I/Os to local NVMe disks. 

What is the CPU Clock Cycle? 

A clock cycle is the smallest unit of time in which a CPU performs a task. It acts like a metronome, synchronizing operations such as fetching, decoding, executing instructions, and transferring data.

I have two CPUs, Intel Xeon Gold 6544Y 16C @ 3.6 GHz.
I have 32 CPU Cores in total:
16 CPU Cores in one socket (NUMA node) and 16 CPU Cores in the other socket (NUMA node).

That means that every CPU Core runs at a 3.6 GHz frequency:
  • 1 GHz = 1 billion (1,000,000,000) cycles per second.
  • 3.6 GHz = 3.6 billion (3,600,000,000) cycles per second.
My single CPU Core can execute 3.6 billion (3,600,000,000) clock cycles every second.
For our simplified computer math we can say that 1 Hz = 1 CPU cycle.
We have to simplify, otherwise we will not be able to define rules of thumb.

Let's start with networking. 

In the past, there was a rule of thumb that 1 bit/s to send or receive requires 1 Hz (1 CPU cycle @ 1 GHz CPU). This general rule of thumb is mentioned in the book “VMware vSphere 6.5 Host Resources Deep Dive” by Frank Denneman and Niels Hagoort. Btw, it is a great book and anybody designing or operating VMware vSphere should have it on the bookshelf.

In my testing, I found that 

  • 1 b/s receive traffic requires 0.373 Hz 
  • 1 b/s transmit traffic requires 0.163 Hz

Let’s average it and redefine the old rule of thumb a little bit.

My current rule of thumb for TCP/IP networking is ... 

"1 bit send or receive over TCP/IP datacenter network requires ~0.25 Hz of general purpose x86-64 CPU (Intel Emerald Rapids)"

There is a difference between receive traffic and transmit traffic. So, if we would like to be more precise, the rule of thumb should be ...

"1 bit send over TCP/IP datacenter network requires ~0.37 Hz and 1 bit send over TCP/IP datacenter network requires ~0.16 Hz of general purpose x86-64 CPU"

Unfortunately, the latter rule of thumb (~0.37 Hz and ~0.16 Hz) is not as easy to remember as the first simplified and averaged rule of thumb (~0.25 Hz per bit per second).

Also, the mention of a "general purpose x86-64 CPU" is pretty important because the numbers can vary across CPU architectures, and specialized ASICs are a totally different topic.

It is worth mentioning that my results were tested and observed on Intel Xeon Gold 6544Y 16C @ 3.6 GHz (aka Emerald Rapids).

So let's use the simplified rule (0.25 Hz for 1 bit per second) and we can say that 10 Gbps of received or transmitted traffic requires 2,500,000,000 Hz = 2.5 GHz, which is ~0.7 of my CPU Core @ 3.6 GHz, or a whole CPU Core running at 2.5 GHz.

In my testing, vSAN ESA under a load of 720k IOPS (32 KB) or 20 GB/s aggregated throughput across 6 ESXi hosts typically used between ~27 and ~34 Gbps of network traffic in each direction per ESXi host (vSAN node). Let's assume ~30 Gbps transmitted plus ~30 Gbps received on average; each 10 Gbps of such bidirectional traffic costs ~5 GHz (~1.5 CPU Cores @ 3.6 GHz), therefore it consumes 3 x 1.5 CPU Cores = 4.5 CPU Cores per ESXi host just for TCP/IP networking!
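Here is a minimal sketch of that network math using the simplified 0.25 Hz per bit/s rule and my 3.6 GHz cores; the ~4.5 cores figure above comes from rounding ~1.4 cores per 10 Gbps of bidirectional traffic up to ~1.5.

# CPU cost of network traffic using the simplified 0.25 Hz per bit/s rule
# (3.6 GHz cores assumed, decimal Gbps).
HZ_PER_BIT = 0.25
CORE_HZ = 3_600_000_000

def cores(gbps):
    return gbps * 1_000_000_000 * HZ_PER_BIT / CORE_HZ

print(f"10 Gbps one way:          ~{cores(10):.2f} cores")      # ~0.69
print(f"10 Gbps in + 10 Gbps out: ~{cores(10 + 10):.2f} cores")  # ~1.39 (rounded up to ~1.5)
print(f"30 Gbps in + 30 Gbps out: ~{cores(30 + 30):.2f} cores")  # ~4.2 per vSAN node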

Nice. That’s networking.

Let's continue with hyper-converged storage traffic 

Now we can do similar math for hyper-converged vSAN storage and network CPU consumption.

Note that I used RAID-5 with compression enabled. It will be different for RAID-0 and RAID-1. RAID-6 should be similar to RAID-5 but with some slight differences (4+1 versus 4+2).

I would assume that compression also plays a role in CPU usage even though modern CPUs support compression/decompression offloading (Intel QAT, AMD CCX/CPX).

During my testing, the following CPU consumptions were observed …

  • ~2.5 Hz to read 1 bit of vSAN data (32KB IO, random) 
  • ~7.13 Hz to write 1 bit of vSAN data (32KB IO, random)
  • ~1.86 Hz to read 1 bit of vSAN data (1MB IO, sequential) 
  • ~2.78 Hz to write 1 bit of vSAN data (1MB IO, sequential)

Let’s average it and define another rule of thumb ... 

"3.5 Hz - of general purpose x86-64 CPU (Intel Emerald Rapids)- is required to read or write 1 bit/s from vSAN ESA RAID-5 with compression enabled"

If we would like to be more precise, we should split the oversimplified rule of 3.5 Hz per bit per second into separate rules based on I/O size and read or write operation.

Anyway, if we stick with the simple rule of thumb that 3.5 Hz is required per bit per second then 1 GB/s (8,000,000,000 b/s) would require 28 GHz (7.8 CPU Cores @ 3.6 GHz).

In my 6-node vSAN ESA cluster, I was able to achieve the following sustainable throughput in VM guests with a 2 ms response time …

  • ~22.5 GB/s (32 KB IO, 100% read, 100% random) 
  • ~ 8.9 GB/s  (32 KB IO, 100% write, 100% random) 
  • ~22.5 GB/s (1 MB IO, 100% read, 100% sequential)
  • ~15.2 GB/s (1 MB IO, 100% write, 100% sequential)

Therefore, if we average it, it is ~17.2 GB/s read or write.

17.2 GB/s of vSAN ESA reads or writes (563,600 IOPS at 32 KB or 17,612 IOPS at 1 MB) therefore requires ~481 GHz (134 CPU Cores @ 3.6 GHz).

In the 6-node vSAN cluster, I have 192 cores.

  • Each ESXi has 2x CPU Intel Xeon Gold 6544Y 16C @ 3.6 GHz (32 CPU Cores, 115.2 GHz capacity)
  • The total cluster capacity is 192 CPU Cores and 691.2 GHz

That means ~70% (134 / 192) of the vSAN cluster's CPU capacity is consumed by vSAN storage and network operations under a storage workload of 17.2 GB/s, i.e., 563,600 IOPS (32 KB I/O size) or 17,612 IOPS (1 MB I/O size).
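A minimal sketch of the same cluster-level chain of arithmetic (differences of a GHz or a core here and there are just rounding):

# Throughput -> Hz -> cores -> share of the 6-node cluster (192 cores @ 3.6 GHz),
# using the simplified 3.5 Hz per bit/s rule for vSAN ESA RAID-5 with compression.
HZ_PER_BIT_VSAN = 3.5
CORE_HZ = 3_600_000_000
CLUSTER_CORES = 192

throughput_bits = 17.2 * 8_000_000_000          # 17.2 GB/s in bits per second
required_hz = throughput_bits * HZ_PER_BIT_VSAN
required_cores = required_hz / CORE_HZ
print(f"~{required_hz / 1e9:.0f} GHz, ~{required_cores:.0f} cores, "
      f"~{100 * required_cores / CLUSTER_CORES:.0f}% of the cluster")
# -> ~482 GHz, ~134 cores, ~70% of the cluster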

Well, your 6-node vSAN storage system will probably not deliver ~600,000 IOPS continuously, but it is good to understand that every I/O requires CPU cycles.

What about RAM?

Thanks to ChatGPT, I have found these numbers.

  • L1 Cache: Access in 3–5 cycles.
  • L2 Cache: Access in 10–20 cycles.
  • L3 Cache: Access in 30–50 cycles.
  • RAM: Access in 100–200+ cycles, depending on the system
... and we agreed that in a modern DDR5 system, the CPU cycles required to transfer a 4 KB RAM page typically range between 160 and 320 cycles for high-performance configurations (dual-channel DDR5-6400 or better). Optimizations like multi-channel setups, burst transfers, and caching mechanisms can push this closer to the lower bound.

So, for our rule of thumb, let's stick with 250 CPU cycles for a 4 KB RAM (memory) page. That is 32,768 bits (4x1024x8) at a cost of 250 CPU cycles, which means ~0.008 CPU cycles per bit, so let's generalize it to 0.01 Hz.

"0.01 Hz - of general purpose x86-64 CPU (Intel Emerald Rapids) - is required to read or write 1 bit/s from RAM DDR5"

What about CPU offload technologies?

Yes, it is obvious that our rules of thumb are oversimplified and we are not talking about hardware offloading technologies like TOE, RSS, LSO, RDMA, Intel AES-NI, Intel QuickAssist Technology (QAT), and similar AMD technologies like AMD AES Acceleration.

My testing environment was on Intel, and almost all of the above CPU offload technologies were enabled on my testing infrastructure. The only missing one is RDMA (RoCEv2), which is not supported by the Cisco UCS VIC 15230 for vSAN. I'm in touch with Cisco and VMware about it and hope it is just a matter of testing, but until it is officially supported, it cannot be used in an enterprise production environment.

RDMA (RoCEv2) 

I’m wondering how much RDMA (RoCEv2) would help to decrease the CPU usage of my vSAN ESA storage system. Unfortunately, I cannot test it because Cisco doesn’t support vSAN over RDMA/RoCE with Cisco UCS VIC 15230 even though it seems supported for vSphere 8.0 U3 :-(

I'm in touch with Cisco and waiting for more details as I have assumed a CPU usage reduction between 10% and 30%.

Thanks to Reddit user u/irzyk27, who shared his data from a 4-node vSAN ESA cluster benchmark, I was able to estimate the RDMA/RoCEv2 benefit.

Here is the summary.

TEST - 8vmdk_100ws_4k_100rdpct_100randompct_4threads (4KB IO, 100% read, 100% random)
  • Non-RDMA > 939790.5 IOPS @ cpu.usage 96.54% , cpu.utilization 80.88%
  • RDMA > 1135752.6 IOPS @ cpu.usage 96.59% , cpu.utilization 81.53%
  • Gains by RDMA over Non-RDMA +21% IOPS, -0.1% cpu.usage, -0.8% cpu.utilization
  • RDMA CPU usage savings 21%
TEST - 8vmdk_100ws_4k_70rdpct_100randompct_4threads (4KB IO, 70% read, 100% random)
  • Non-RDMA > 633421.4 IOPS @ cpu.usage 95.45% , cpu.utilization 78.57%
  • RDMA> 779107.8 IOPS @ cpu.usage 93.38% , cpu.utilization 77.46%
  • Gains by RDMA over Non-RDMA +23% IOPS, -2% cpu.usage, -1.5% cpu.utilization
  • RDMA CPU usage savings 25%
TEST - 8vmdk_100ws_8k_50rdpct_100randompct_4threads (8KB IO, 50% read, 100% random)
  • Non-RDMA > 468353.5 IOPS @ cpu.usage 95.43% , cpu.utilization 77.71%
  • RDMA> 576964.7 IOPS @ cpu.usage 90.18% , cpu.utilization 73.57%
  • Gains by RDMA over Non-RDMA +23% IOPS, -6% cpu.usage, -6% cpu.utilization
  • RDMA CPU usage savings 29%
TEST - 8vmdk_100ws_256k_0rdpct_0randompct_1threads (256KB IO, 100% write, 100% sequent.)
  • Non-RDMA > 24112.6 IOPS @ cpu.usage 64.04% , cpu.utilization 41.21%
  • RDMA> 24474.6 IOPS @ cpu.usage 56.7% , cpu.utilization 35.2%
  • Gains by RDMA over Non-RDMA +1.5% IOPS, -11% cpu.usage, -14.6% cpu.utilization
  • RDMA CPU usage savings 12.5%
If I read and calculate u/irzyk27's results correctly, RDMA allows more IOPS (~20%) for almost the same CPU usage when small I/Os (4 KB, 8 KB) are used for random workloads. So, for smaller I/Os (4k, 8k random) the gain is between ~25% and 30%.

For bigger I/Os (256 KB) and sequential workloads, similar storage performance/throughput is achieved with ~11% less CPU usage. So, for larger I/Os (256k sequential workload) the RDMA gain is around 11%.
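Here is a minimal sketch of how such a savings number can be derived: compare CPU usage per IOPS with and without RDMA. The inputs are from the 4 KB / 100% read test above; the other tests give roughly the percentages listed, although the exact baseline used for the original "savings" figures may differ slightly.

# CPU usage per IOPS with and without RDMA; the savings is how much more CPU the
# non-RDMA run needs for the same amount of work. Inputs: 4 KB, 100% read, 100% random.
non_rdma = {"iops": 939_790.5, "cpu_usage": 96.54}
rdma     = {"iops": 1_135_752.6, "cpu_usage": 96.59}

cpu_per_iops_non_rdma = non_rdma["cpu_usage"] / non_rdma["iops"]
cpu_per_iops_rdma     = rdma["cpu_usage"] / rdma["iops"]
savings = cpu_per_iops_non_rdma / cpu_per_iops_rdma - 1
print(f"non-RDMA needs ~{savings * 100:.0f}% more CPU per IOPS")  # ~21%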

This leads me to the validation of my assumption and conclusion that ... 
"RDMA can save between 10% and 30% of CPU depending on I/O size"
So, is it worth enabling RDMA (RoCEv2)?

It depends, but if your infrastructure supports RDMA (RoCEv2) and it is not a manageability issue in your environment/organization, I would highly recommend enabling it, because savings between 10% and 30% of CPU have a very positive impact on TCO.

Fewer CPU Cores required to handle vSAN traffic also means less hardware, less power consumption, and fewer VMware VCF/VVF licenses, so in the end, it means less money and I'm pretty sure no one will be angry if you save money. 

Unfortunately, my Cisco UCS infrastructure with VIC 15230 doesn't currently support RDMA for vSAN even though RDMA/RoCE is supported for vSphere. I have to ask Cisco what the reason is and what other workloads than vSAN can benefit from RDMA.

Note: For anyone interested in the difference between cpu.utilization and cpu.usage ...
  • cpu.utilization - Provides statistics for physical CPUs.
  • cpu.usage - Provides statistics for logical CPUs. This is based on CPU Hyperthreading.

Conclusion

It is always good to have some rules of thumb even if they are averaged and oversimplified.

If we work with the following rules of thumb ...

  • 1 b/s for RAM transfer requires ~0.01 Hz
  • 1 b/s for network transfer requires ~0.25 Hz
  • 1 b/s for general local NVMe storage requires ~0.5 Hz
  • 1 b/s for vSAN storage requires ~3.5 Hz

... we can see that  

  • STORAGE vs NETWORK
    • general local NVMe storage transfer requires 2x more CPU cycles than network transfer 
    • vSAN storage transfer requires 14x more CPU cycles than network transfer 
  • STORAGE vs RAM
    • general local NVMe storage transfer requires 50x more CPU cycles than RAM transfer
    • vSAN storage transfer requires 350x more CPU cycles than RAM transfer 
  • NETWORK vs RAM
    • network transfer requires 25x more CPU cycles than RAM transfer  

General CPU compute processing involves operations related to RAM, network, and storage. Nothing else, right?

That’s the reason why I believe that around 60% of CPU cycles could be leveraged for hyper-converged storage I/O and the rest (40%) could be good enough for data manipulation (calculations, transformations, etc.) in RAM and for front-end network communication. 

IT infrastructure is here for data processing. Storage and network operations are a significant part of data processing, therefore keeping vSAN storage I/O at 60% CPU usage and using the rest (40%) for real data manipulation and ingress/egress traffic from/to the vSphere/vSAN cluster is probably a good infrastructure use case; it helps us increase the utilization of common x86-64 CPUs and improve our datacenter infrastructure ROI.
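To make the 60% budget tangible, here is a minimal sketch of how much vSAN ESA throughput such a budget would support on my hosts, using the 3.5 Hz per bit/s rule and 115.2 GHz per host; this is purely an estimate under those assumptions.

# How much vSAN ESA throughput fits into a 60% CPU budget per host,
# using the 3.5 Hz per bit/s rule (host = 2x 16 cores @ 3.6 GHz = 115.2 GHz).
HOST_HZ = 115.2e9
STORAGE_CPU_SHARE = 0.60
HZ_PER_BIT_VSAN = 3.5

budget_hz = HOST_HZ * STORAGE_CPU_SHARE
gb_per_s_per_host = budget_hz / HZ_PER_BIT_VSAN / 8 / 1e9
print(f"~{gb_per_s_per_host:.1f} GB/s per host, "
      f"~{6 * gb_per_s_per_host:.1f} GB/s for a 6-node cluster")
# -> ~2.5 GB/s per host, ~14.8 GB/s for the cluster under these assumptions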

All of the above are just assumptions based on synthetic testing. The workload reality could be different; however, there is not a big risk in planning 60% of CPU for storage transfers, because if the initial assumption about CPU usage distribution among RAM, network, and storage isn't met, you can always scale out. That's the advantage and beauty of scale-out systems like VMware vSAN. And if you cannot scale out for various reasons, you can leverage vSAN IOPS limits to decrease storage stress and lower CPU usage. Of course, it means that your storage response time will increase and you intentionally decrease the storage performance quality. That's where various VM Storage Policies come into play and you can do per-vDisk storage tiering.

It is obvious that CPU offload technologies are helping us to use our systems to the maximum. It is beneficial to leverage technologies like TOE, RSS, LSO, RDMA, etc. 

Double-check that your NIC driver/firmware supports TOE, LSO, RSS, and RDMA (RoCE).

It seems that RDMA/RoCEv2 can significantly reduce CPU usage (up to 30%) so if you can, enable it. 

It is even more important when software-defined networking (NSX) comes into play. There are other hardware offload technologies (SR-IOV, NetDump, GENEVE-Offload) that help decrease CPU cycles for network operations.

Friday, December 06, 2024

VMware vSAN ESA - storage performance testing

I have just finished my first VMware vSAN ESA Plan, Design, and Implement project and had a chance to test vSAN ESA performance. By the way, every storage system should be stress-tested and benchmarked before being put into production. VMware's software-defined hyperconverged storage (vSAN) is no different. It is even more important because the server's CPU, RAM, and network, usually used only for VM workloads, are leveraged to emulate enterprise-class storage.

vSAN ESA Environment

All storage performance tests were performed on
  • 6-node vSAN ESA Cluster (6x ESXi hosts) 
    • ESXi Specification
      • OS: VMware ESXi 8.0 U3 (8.0.3 Build: 24280767)
      • Server Model: Cisco UCS X210c M7
      • CPU: 32 CPU Cores - 2x CPU Intel Xeon Gold 6544Y 16C @ 3.6 GHz
        • 115.2 GHz capacity
      • RAM: 1.5 TB
      • NIC: Cisco VIC 15230 - 2x 50Gbps
        • vSAN vmknic is active/standby, therefore active on one 50 Gbps NIC (vmnic)
          • 50 Gbps is physically two 25G-KR (transceiver modules)
      • Storage: 5x NVMe 6.4 TB 2.5in U.2 P5620 NVMe High Perf High Endurance
        • The usable raw capacity of one disk is 5.82 TB; that's the difference between the vendor's "sales" capacity and reality, almost a 0.6 TB difference :-(
  • Storage benchmark software - HCIBench 2.8.3
    • 18 test VMs (8x data vDisk, 2 workers per vDisk) evenly distributed across the vSAN Cluster
      • 3 VMs per ESXi host
    • fio target storage latency 2.5 ms (2,500 us)
  • vSAN Storage Policy:
    • RAID-5
    • compression enabled
    • IOPS Limit 5,000 (to not totally overload the server's CPU)

The vSphere/vSAN storage architecture is depicted in the diagram below.

vSphere/vSAN storage architecture


The Physical network topology is dictated by the Cisco UCS blade system and is depicted below.

Cisco UCS Network Topology

And the diagram of vSphere virtual networking on top of Cisco UCS. 

vSphere Network Architecture


Test Cases


Random storage workloads


32KB IO, 100% read, 100% random

Test Case Name: fio-8vmdk-90ws-32k-100rdpct-100randompct-2500lt-1732885897

Performance Result
Datastore: CUST-1001-VSAN
=============================
JOB_NAME: job0
Number of VMs: 18
I/O per Second: 721,317.28 IO/S
Throughput: 22,540.00 MB/s
Read Latency: 2.03 ms
Write Latency: 0.00 ms
95th Percentile Read Latency: 2.00 ms
95th Percentile Write Latency: 0.00 ms

ESXi Host CPU Usage during test 78 GHz (1 GHz is used in idle)
vSAN vmnic4 transmit traffic ~3.4 GB/s (27.2 Gb/s)
vSAN vmnic4 receive traffic ~3.4 GB/s (27.2 Gb/s)
Storage IOPS per ESXi: 120,220 IOPS (721,317 IOPS / 6 ESXi hosts)

ESXi CPU Usage due to vSAN Storage + vSAN Network Traffic
120,220 Storage IOPS + 27.2 Gb/s Network transmit traffic + 27.2 Gb/s Network receive traffic requires 77 GHz
That means 1 vSAN read 32 KB I/O operation (including TCP network traffic) requires ~640 KHz.
In other words, 640,000 Hz for 32 KB read I/O (256,000 bits) means ~2.5 Hz to read 1 bit of data.

ESXi CPU Usage due to vSAN network traffic
I have tested that
9.6 Gb/s of transmit pure network traffic requires 1681 MHz (1.68 GHz) of CPU usage
That means
10,307,921,510 b/s transmit traffic requires 1,681,000,000 Hz
1 b/s transmit traffic requires 0.163 Hz
1 Gb/s transmit traffic requires 163 MHz

I have also tested that
10 Gb/s of receive pure network traffic requires 4000 MHz (4 GHz) of CPU usage
That means
10,737,418,240 b/s receive traffic requires 4,000,000,000 Hz
1 b/s receive traffic requires 0.373 Hz
1 Gb/s receive traffic requires 373 MHz 

vSAN ESXi host reports transmitting network traffic of  27.2 Gb/s, thus it requires ~ 4.43 GHz CPU 
vSAN ESXi host reports receiving network traffic of 27.2 Gb/s, thus it requires ~ 10.15 GHz CPU

ESXi CPU Usage due to vSAN Storage without vSAN network traffic
We can deduct 14.58 GHz (4.43 + 10.15) CPU usage (the cost of bidirectional network traffic) from 77 GHz total ESXi CPU usage. That means we need 62.42 GHz CPU usage for vSAN storage operations without network transfers.  
We were able to achieve 120,220 IOPS on the ESXi host at 62.42 GHz (62,420,000,000 Hz)
That means 1 NVMe read 32 KB I/O operation without a TCP network traffic requires ~519 KHz.
In other words, 519,000 CPU Hz for 32 KB read I/O (256,000 bits) means ~2 Hz to read 1 bit of data
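The whole deduction above can be summarized in a few lines of Python. Inputs are from this 32 KB read test; the per-Gb network costs are the 0.163/0.373 Hz per bit figures measured earlier.

# Deduct the CPU cost of vSAN network traffic from total ESXi CPU usage and express
# the remainder per I/O and per bit. Inputs: 32 KB, 100% read, 100% random test.
total_ghz = 77.0
tx_gbps, rx_gbps = 27.2, 27.2
iops_per_host = 120_220
bits_per_io = 256_000  # 32 KB I/O counted as 32,000 bytes

network_ghz = tx_gbps * 0.163 + rx_gbps * 0.373  # ~14.58 GHz
storage_ghz = total_ghz - network_ghz            # ~62.42 GHz
hz_per_io = storage_ghz * 1e9 / iops_per_host    # ~519 KHz per 32 KB read I/O
print(f"network ~{network_ghz:.2f} GHz, storage ~{storage_ghz:.2f} GHz, "
      f"~{hz_per_io / 1000:.0f} KHz per I/O, ~{hz_per_io / bits_per_io:.2f} Hz per bit")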

32k IO, 100% write, 100% random

Test Case Name: fio-8vmdk-90ws-32k-0rdpct-100randompct-2500lt-1732885897

Performance Result
Datastore: CUST-1001-VSAN
=============================
JOB_NAME: job0
Number of VMs: 18
I/O per Second: 285,892.55 IO/S
Throughput: 8,934.00 MB/s
Read Latency: 0.00 ms
Write Latency: 1.74 ms
95th Percentile Read Latency: 0.00 ms
95th Percentile Write Latency: 2.00 ms

ESXi Host CPU Usage during test 88 GHz (1 GHz is used in idle)
vSAN vmnic4 transmit traffic ~4.44 GB/s (35.5 Gb/s)
vSAN vmnic4 receive traffic ~5 GB/s (40 Gb/s)
Storage IOPS per ESXi: 47,650 IOPS (285,892 IOPS / 6 ESXi hosts)

ESXi CPU Usage due to vSAN Storage + vSAN Network Traffic
47,650 Storage IOPS + 35.5 Gb/s Network transmit traffic + 40 Gb/s Network receive traffic requires 87 GHz
That means 1 vSAN write 32 KB I/O operation (including TCP network traffic) requires ~1,825 KHz.
In other words, 1,825,000 CPU Hz for 32 KB write I/O (256,000 bits) means ~7.13 Hz to write 1 bit of data.

ESXi CPU Usage due to vSAN network traffic
1 Gb/s transmit traffic requires 163 MHz
1 Gb/s receive traffic requires 373 MHz 

vSAN ESXi host reports transmitting network traffic of  35.5 Gb/s, thus it requires ~ 5.79 GHz CPU 
vSAN ESXi host reports receiving network traffic of 40 Gb/s, thus it requires ~ 14.92 GHz CPU

ESXi CPU Usage due to vSAN Storage without vSAN network traffic
We can deduct 20.71 GHz (5.79 + 14.92) CPU usage (the cost of bidirectional network traffic) from 87 GHz total ESXi CPU usage. We need 66.29 GHz CPU usage for vSAN storage operations without network transfers.  
We were able to achieve 47,650 IOPS on the ESXi host at 66.29 GHz (66,290,000,000 Hz)
That means 1 NVMe write 32 KB I/O operation without a TCP network traffic requires ~1,391 KHz.
In other words, 1,391,000 CPU Hz for 32 KB write I/O (256,000 bits) means ~5.43 Hz to write 1 bit of data

32k IO, 70% read - 30% write, 100% random

Test Case Name: fio-8vmdk-90ws-32k-70rdpct-100randompct-2500lt-1732908719

Performance Result
Datastore: CUST-1001-VSAN
=============================
JOB_NAME: job0
Number of VMs: 18
I/O per Second: 602,702.73 IO/S
Throughput: 18,834.00 MB/s
Read Latency: 1.55 ms
Write Latency: 1.99 ms
95th Percentile Read Latency: 2.00 ms
95th Percentile Write Latency: 2.00 ms

ESXi Host CPU Usage during test 95 GHz (1 GHz is used in idle)
vSAN vmnic4 transmit traffic ~4.5 GB/s (36 Gb/s)
vSAN vmnic4 receive traffic ~4.7 GB/s (37.6 Gb/s)
Storage IOPS per ESXi: 100,450 IOPS (602,702 IOPS / 6 ESXi hosts)

Sequential storage workloads


1024k IO, 100% read, 100% sequential

Test Case Name: fio-8vmdk-90ws-1024k-100rdpct-0randompct-2500lt-1732911329

Performance Result
Datastore: CUST-1001-VSAN
=============================
JOB_NAME: job0
Number of VMs: 18
I/O per Second: 22,575.50 IO/S
Throughput: 22,574.00 MB/s
Read Latency: 6.38 ms
Write Latency: 0.00 ms
95th Percentile Read Latency: 6.00 ms
95th Percentile Write Latency: 0.00 ms

ESXi Host CPU Usage during test 60 GHz (1 GHz is used in idle)
vSAN vmnic4 transmit traffic ~3.4 GB/s (27.2 Gb/s)
vSAN vmnic4 receive traffic ~3.2 GB/s (25.6 Gb/s)
Storage IOPS per ESXi: 3,762 IOPS (22,574 IOPS / 6 ESXi hosts)
Throughput per ESXi: 3,762.00 MB/s (22,574.00 MB/s / 6 ESXi hosts)

ESXi CPU Usage due to vSAN Storage + vSAN Network Traffic
3,762 Storage IOPS + 27.2 Gb/s Network transmit traffic + 25.6 Gb/s Network receive traffic requires 59 GHz
That means 1 vSAN read 1024 KB I/O operation (including TCP network traffic) requires ~15,683 KHz.
In other words, 15,640,000 CPU Hz for 1024 KB read I/O (8,388,608 bits) means ~1.86 Hz to read 1 bit of data.

ESXi CPU Usage due to vSAN network traffic
1 Gb/s transmit traffic requires 163 MHz
1 Gb/s receive traffic requires 373 MHz 

vSAN ESXi host reports transmitting network traffic of  27.2 Gb/s, thus it requires ~4.43 GHz CPU 
vSAN ESXi host reports receiving network traffic of 25.6 Gb/s, thus it requires ~9.55 GHz CPU

ESXi CPU Usage due to vSAN Storage without vSAN network traffic
We can deduct 13.98 GHz (4.43 + 9.55) CPU usage (the cost of bidirectional network traffic) from 59 GHz total ESXi CPU usage. That means we need 45.02 GHz CPU usage for vSAN storage operations without network transfers.  
We were able to achieve 3,162 IOPS on the ESXi host at 45.02 GHz (45,020,000,000 Hz)
That means 1 NVMe read 1 MB I/O operation without a TCP network traffic requires ~14,238 KHz.
In other words, 14,238,000 CPU Hz for 1024 KB read I/O (8,388,608 bits) means ~ 1.69 Hz to read 1 bit of data


1024k IO, 100% write, 100% sequential

Test Case Name: fio-8vmdk-90ws-1024k-0rdpct-0randompct-2500lt-1732913825

Performance Result
Datastore: CUST-1001-VSAN
=============================
JOB_NAME: job0
Number of VMs: 18
I/O per Second: 15,174.08 IO/S
Throughput: 15,171.00 MB/s
Read Latency: 0.00 ms
Write Latency: 8.30 ms
95th Percentile Read Latency: 0.00 ms
95th Percentile Write Latency: 12.00 ms

ESXi Host CPU Usage during test 60 GHz (1 GHz is used in idle)
vSAN vmnic4 transmit traffic ~3.9 GB/s (31.2 Gb/s)
vSAN vmnic4 receive traffic ~3.9 GB/s (31.2 Gb/s)
Storage IOPS per ESXi: 2,529 IOPS (15,171.00  IOPS / 6 ESXi hosts)
Throughput per ESXi: 2,529 MB/s (15,171.00 MB/s / 6 ESXi hosts)

ESXi CPU Usage due to vSAN Storage + vSAN Network Traffic
2,529 Storage IOPS + 31.2 Gb/s Network transmit traffic + 31.2 Gb/s Network receive traffic requires 59 GHz
That means 1 vSAN 1024 KB write I/O operation (including TCP network traffic) requires ~23,329 KHz.
In other words, 23,329,000 CPU Hz for 1024 KB write I/O (8,388,608 bits) means ~2.78 Hz to write 1 bit of data.

ESXi CPU Usage due to vSAN network traffic
1 Gb/s transmit traffic requires 163 MHz
1 Gb/s receive traffic requires 373 MHz 

vSAN ESXi host reports transmitting network traffic of  27.2 Gb/s, thus it requires ~4.43 GHz CPU 
vSAN ESXi host reports receiving network traffic of 25.6 Gb/s, thus it requires ~9.55 GHz CPU

ESXi CPU Usage due to vSAN Storage without vSAN network traffic
We can deduct 13.98 GHz (4.43 + 9.55) CPU usage (the cost of bidirectional network traffic) from 59 GHz total ESXi CPU usage. That means we need 45.02 GHz CPU usage for vSAN storage operations without network transfers.  
We were able to achieve 2,259 IOPS on the ESXi host at 45.02 GHz (45,020,000,000 Hz)
That means 1 NVMe 1024 KB write I/O operation without a TCP network traffic requires ~19,929 KHz.
In other words, 19,929,000 CPU Hz for 1024 KB write I/O (8,388,608 bits) means ~ 2.37 Hz to write 1 bit of data

1024k IO, 70% read - 30% write, 100% sequential

Performance Result
Datastore: CUST-1001-VSAN
=============================
JOB_NAME: job0
Number of VMs: 18
I/O per Second: 19,740.90 IO/S
Throughput: 19,738.00 MB/s
Read Latency: 5.38 ms
Write Latency: 8.68 ms
95th Percentile Read Latency: 7.00 ms
95th Percentile Write Latency: 12.00 ms

ESXi Host CPU Usage during test 62 GHz (1 GHz is used in idle)
vSAN vmnic4 receive traffic ~4.15 GB/s (33.2 Gb/s)
vSAN vmnic4 transmit traffic ~4.3 GB/s (34.4 Gb/s)
Storage IOPS per ESXi: 3,290 IOPS (19,740.90 IOPS / 6 ESXi hosts)
Throughput per ESXi: 3,290 MB/s (19,738.00  MB/s / 6 ESXi hosts)


Observations and explanation

Observation 1 - Storage and network workload requires CPU resources.

This is obvious and logical, however, here is some observed data from our storage performance benchmark exercise.

32K, 100% read, 100% random (721,317.28 IOPS in VM guest,  22,540.00 MB/s in VM guest)
    => CPU Usage ~77 GHz
    => ~2.5 Hz to read 1 bit of data (storage + network) 
    => ~2 Hz to read 1 bit of data (storage only)
    => 25% goes to network traffic

32K, 70%read 30%write, 100% random (602,702.73 IOPS in VM guest, 18,834.00 MB/s in VM guest)
    => CPU Usage ~94 GHz << THIS IS STRANGE, WHY IS IT MORE CPU THAN 100% WRITE? I DON'T KNOW.

32K, 100% write, 100% random (285,892.55 IOPS in VM guest, 8,934.00 MB/s in VM guest) 
    => CPU Usage ~87 GHz
    => ~7.13 Hz to write 1 bit of data  (storage + network)
    => ~5.43 Hz to write 1 bit of data (storage only)
    => 31% goes to network traffic

1M, 100% read, 100% sequential (22,575.50 IOPS in VM guest, 22,574.00 MB/s in VM guest)
    => CPU Usage ~60 GHz
    => ~1.86 Hz to read 1 bit of data  (storage + network)
    => ~1.69 Hz to read 1 bit of data (storage only)
    => 10% goes to network traffic

1M, 70% read 30% write, 100% sequential (19,740.90 IOPS in VM guest, 19,738.00 MB/s in VM guest) 
    => CPU Usage ~61 GHz

1M, 100% write, 100% sequential (15,174.08 IOPS in VM guest, 15,171.00 MB/s in VM guest) 
    => CPU Usage ~59 GHz
    => ~2.78 Hz to write 1 bit of data  (storage + network)
    => ~2.37 Hz to write 1 bit of data (storage only)
    => 17% goes to network traffic

Reading 1 bit of information from vSAN hyper-converged storage requires roughly between ~1.86 Hz (1024 KB I/O size) and 2.5 Hz (32 KB I/O size).

Writing 1 bit of information to vSAN hyper-converged storage requires roughly between ~2.78 Hz (1024 KB I/O size) and 7.13 Hz (32 KB I/O size).

The above numbers are not set in stone but it is good to observe system behavior. 

When I had no IOPS limits in vSAN Storage Policies, I was able to fully saturate the ESXi CPUs.

CPU usage of -16.52 GHz - interesting, right?


That's a clear sign that neither the storage subsystem (NVMe NAND flash disks) nor the Ethernet/IP network (up to 50 Gbps via a single vmnic4) is the bottleneck. The bottleneck in this case is the CPU. Remember, there is always some bottleneck, and we are not looking for maximum storage performance but for predictable and consistent storage performance without a negative impact on other resources (CPU, network, disks). 

That's the reason why it is really good to know at least these rough numbers to do some capacity/performance planning of the hyper-converged vSAN solution.

With an IOPS limit of 5,000, 144 vDisks @ 5,000 IOPS can have a sustainable response time of around 2 ms (32 KB I/O). The vSphere/vSAN infrastructure is designed for ~150 VMs, so that's perfectly balanced. We have two other VM Storage Policies (10,000 IOPS limit and 15,000 IOPS limit) for more demanding VMs hosting SQL Servers and other storage-intensive workloads.

That's about 720,000 IOPS aggregated in total. Pretty neat for a 6-node vSAN cluster, isn't it? 

Observation 2 - Between 10% and 30% CPU is consumed due to TCP network traffic

vSAN is a hyper-converged (compute, storage, network) software-defined storage solution striping data across ESXi hosts, thus heavily leveraging a standard Ethernet network and TCP/IP for transporting storage data across vSAN nodes (ESXi hosts). vSAN RAID (Redundant Array of Independent Disks) is actually RAIN (Redundant Array of Independent Nodes), therefore the network is highly utilized during heavy storage load. You can see the numbers above in the test results.

Since I planned, designed, and implemented vSAN on Cisco UCS infrastructure with 100Gb networking (partitioned into 2x32Gb FCoE, 2x10Gb Ethernet, 2x10Gb Ethernet, 2x50Gb Ethernet), RDMA over Converged Ethernet (RoCE) would be great to use to decrease CPU requirements and even improve latency and I/O response time. RoCE v2 is supported on vSphere 8.0 U3 for my network interface card, Cisco VIC 15230 (driver nenic version 2.0.11), but Cisco is not listed among the vendors supporting vSAN over RDMA. I will ask somebody at Cisco why and whether they have something on the roadmap.
  


Monday, December 02, 2024

What is the core dump size for ESXi 8.0 U3?

Nine years ago, I wrote the blog post "How large is my ESXi core dump partition?". Back then, it was about core dumps in ESXi 5.5. Over the years, a lot has changed in ESXi, and that is true for core dumps too. 

Let's write a new blog post about the same topic, but this time for ESXi 8.0 U3. The behavior should be the same in ESXi 7.0. In this blog post, I will use some data from ESXi 7.0 U3 because we are still running ESXi 7.0 U3 in production and I am planning and designing the upgrade to vSphere 8. That's why I have ESXi 8.0 U3 just in the lab, where some hardware configurations are unavailable. We use ESXi hosts with 1.5 TB RAM in production, but I don't have hosts with such memory capacity in my lab.

What is a core dump? It boils down to PSOD. ESXi host Purple Screen of Death (PSOD) happens when VMkernel experiences a critical failure. This can be due to hardware issues, driver problems, deadlock, etc. During the PSOD event, the ESXi hypervisor captures a core dump to help diagnose the cause of the failure. Here’s what happens during this process:

After a PSOD, ESXi captures a core dump, which includes a snapshot of the hypervisor memory and the state of the virtual machines. The core dump is stored based on the host configuration (core dump partition, file, or network), and it helps diagnose the cause of the critical failure by providing insights into the state of the system at the time of the crash. A core dump is crucial for troubleshooting and resolving the issues leading to PSOD. And here is the change. In ESXi 6.7, the core dump was stored in a disk partition but since ESXi 7, it has been stored in the precreated file.

For the detailed vSphere design, I would like to know the typical core dump file size to allocate optimal storage space for core dumps potentially redirected to shared datastore (by default, in ESXi 7 and later, the core dumps are stored in ESX-OSData partition, typically on boot disk). Of course, the core dump size depends on multiple factors, but the main factor should be the memory used by vmKernel.   

ESXi host memory usage is split into three buckets

  1. vmKernel memory usage (core hypervisor)
  2. Other memory usage
    • BusyBox Console including
      • Core BusyBox Utilities (e.g., ls, cp, mv, ps, top, etc.)
      • Networking and Storage Tools (ifconfig, esxcfg-nics, esxcfg-vswitch, esxcli, etc.)
      • Direct Console User Interface (DCUI)
      • Management Agents and Daemons (hostd, vpxa, network daemons like SSH, DNS, NTP, and network file copy aka NFC)
  3. Free memory

So let's go to the lab and test it. Here is data from three different ESXi host configurations I have access to. 

ESXi, 8.0.3 (24022510) with 256 GB (262 034 MB) physical RAM

vSAN is disabled, NSX is installed

In Production mode running 10 Powered On VMs having 24 GB vRAM:

  • vmKernel memory usage:  1544 MB
  • Other memory usage: 21 498 MB
  • Free memory: 238 991 MB
In Maintenance mode (no VMs):
  • vmKernel memory usage:  1453 MB
  • Other memory usage: 4 207 MB
  • Free memory: 256 373 MB
Let's try PSOD on the ESXi host in maintenance mode.

In ESXi 8.0.3 with 256 GB RAM, the core dump is set to be stored in a 3.6 GB file (3,882,876,928 bytes) on the ESX-OSData partition.
 [root@dp-esx02:~] esxcli system coredump file list
 Path                                                                                                      Active  Configured        Size
 --------------------------------------------------------------------------------------------------------  ------  ----------  ----------
 /vmfs/volumes/66d993b7-e9cd83a8-b129-0025b5ea0e15/vmkdump/00000000-00E0-0000-0000-000000000008.dumpfile    true        true  3882876928

It is configured and active. 

 [root@dp-esx02:~] esxcli system coredump file get  
   Active: /vmfs/volumes/66d993b7-e9cd83a8-b129-0025b5ea0e15/vmkdump/00000000-00E0-0000-0000-000000000008.dumpfile  
   Configured: /vmfs/volumes/66d993b7-e9cd83a8-b129-0025b5ea0e15/vmkdump/00000000-00E0-0000-0000-000000000008.dumpfile  

The core dump file is 3.6 GB in size:
 [root@dp-esx02:~] ls -lah /vmfs/volumes/66d993b7-e9cd83a8-b129-0025b5ea0e15/vmkdump/00000000-00E0-0000-0000-000000000008.dumpfile  
 -rw-------  1 root   root    3.6G Oct 29 13:07 /vmfs/volumes/66d993b7-e9cd83a8-b129-0025b5ea0e15/vmkdump/00000000-00E0-0000-0000-000000000008.dumpfile  

Now let's try the first PSOD on the ESXi host in maintenance mode and watch what happens. Below is the command to initiate the PSOD:
 vsish -e set /reliability/crashMe/Panic 1  

VMware Support will ask you for a zdump file (a VMware proprietary binary file), which can be generated by the esxcfg-dumppart command:
 [root@dp-esx02:~] esxcfg-dumppart --file --copy --devname /vmfs/volumes/66d993b7-e9cd83a8-b129-0025b5ea0e15/vmkdump/00000000-00E0-0000-0000-000000000008.dumpfile --zdumpname /vmfs/volumes/DP-STRG02-Datastore01/zdump/zdump-coredump.dp-esx02.1  
 Created file /vmfs/volumes/DP-STRG02-Datastore01/zdump/zdump-coredump.dp-esx02.1.1  
 [root@dp-esx02:~] ls -lah /vmfs/volumes/DP-STRG02-Datastore01/zdump/zdump-coredump.dp-esx02.1.1  
 -rw-r--r--  1 root   root   443.9M Oct 29 13:07 /vmfs/volumes/DP-STRG02-Datastore01/zdump/zdump-coredump.dp-esx02.1.1  
The extracted VMkernel zdump file from the first PSOD has 443.9 MB.

Now let's try the second PSOD.
 vsish -e set /reliability/crashMe/Panic 1  

Let's extract the core dump.
 [root@dp-esx02:~] esxcfg-dumppart --file --copy --devname /vmfs/volumes/66d993b7-e9cd83a8-b129-0025b5ea0e15/vmkdump/00000000-00E0-0000-0000-000000000008.dumpfile --zdumpname /vmfs/volumes/DP-STRG02-Datastore01/zdump/zdump-coredump.dp-esx02.2  
 Created file /vmfs/volumes/DP-STRG02-Datastore01/zdump/zdump-coredump.dp-esx02.2.1  
 [root@dp-esx02:~] ls -lah /vmfs/volumes/DP-STRG02-Datastore01/zdump/zdump-coredump.dp-esx02.2.1  
 -rw-r--r--    1 root     root      311.2M Nov  4 09:33 /vmfs/volumes/DP-STRG02-Datastore01/zdump/zdump-coredump.dp-esx02.2.1
The extracted VMkernel zdump file from the 2nd PSOD has 311.2 MB.

Only one core dump exists in the core dump file; therefore, multiple core dumps are not stored in the system. Thus the best practice is to extract every core dump ( esxcfg-dumppart --file --copy)  from the core dump file to an external storage location to allow core dump analysis of older PSODs. 
 
Let's continue our PSOD test with additional PSODs to find out how the system manages sequential core dumps with a total size bigger than 3.6 GB, which is the size of the core dump file. Assuming a single core dump is always around 300 MB, we would need 13 PSODs to exceed it, so let's do 14 PSODs. 

_3rd PSOD:    304.9 MB
_4th PSOD:    285.4 MB
_5th PSOD:    303.2 MB
_6th PSOD:    316.6 MB
_7th PSOD:    322.9 MB
_8th PSOD:    288.3 MB
_9th PSOD:    283.0 MB
10th PSOD:    276.7 MB
11th PSOD:    292.5 MB
12th PSOD:    289.7 MB
13th PSOD:    281.8 MB
14th PSOD:    290.3 MB
TOTAL (all 14 PSODs):  4.2 GB

So even though we have a 3.6 GB core dump file for the ESXi host with 256 GB RAM, we can collect more than 4 GB of core dumps. This proves that the core dump file holds a single core dump and the next core dump overwrites the old one. 

ESXi, 8.0.3 (24022510) with 128 GB (131 008 MB) physical RAM

vSAN is disabled, NSX is not installed

In Maintenance mode (no VMs):
  • vmKernel memory usage:  694 MB
  • Other memory usage: 1 660 MB
  • Free memory: 128 653 MB
In ESXi 8.0.3 with 128 GB RAM, the core dump is set to be stored into a 2.27 GB file (2,441,084,928 bytes) at the ESX-OSData partition.
 [root@esx21:~] esxcli system coredump file list  
 Path                                                                                                     Active  Configured        Size
-------------------------------------------------------------------------------------------------------  ------  ----------  ----------
/vmfs/volumes/6727594d-c447be9c-5a0e-90b11c13fc14/vmkdump/4C4C4544-0054-5810-8033-B3C04F48354A.dumpfile    true        true  2441084928

Now let's try PSOD on ESXi host in maintenance mode.
 vsish -e set /reliability/crashMe/Panic 1  

Let's extract the core dump.
[root@esx21:~] esxcfg-dumppart --file --copy --devname /vmfs/volumes/6727594d-c447be9c-5a0e-90b11c13fc14/vmkdump/4C4C4544-0054-5810-8033-B3C04F48354A.dumpfile --zdumpname /vmfs/volumes/ESX21
-FLASH-01/coredump.esx21.1
Created file /vmfs/volumes/ESX21-FLASH-01/coredump.esx21.1.1
[root@esx21:~] ls -lah /vmfs/volumes/ESX21-FLASH-01/coredump.esx21.1.1
-rw-r--r--    1 root     root      111.0M Dec  2 08:26 /vmfs/volumes/ESX21-FLASH-01/coredump.esx21.1.1
The VMkernel zdump file extracted from the PSOD on an ESXi 8.0 U3 host with 128 GB of RAM is 111 MB in size, which is significantly smaller than the zdump file from an ESXi 8.0 U3 host with 256 GB of RAM.

Let's compare it to ESXi host with 256 GB RAM
  • 128 GB RAM is half of 256 GB RAM
  • vmKernel memory usage 694 MB is ~half of 1453 MB
  • coredump file 2.27 GB file is ~60% of 3.6 GB file
  • zdump file 111 MB is  ~4x smaller than 443.9 MB

ESXi, 7.0.3 (23794027) with 512 GB (524 178 MB) physical RAM

In Production mode running 38 Powered On VMs having 310.37 GB vRAM:
  • vmKernel memory usage:  3 227 MB
  • Other memory usage: 366 140 MB
  • Free memory: 154 810 MB
In Maintenance mode (no VMs):
  • vmKernel memory usage:  2 776 MB
  • Other memory usage: 25 402 MB
  • Free memory: 495 998 MB
In ESXi 7.0.3 with 512 GB RAM, the core dump is set to be stored into an 8.16 GB file at the ESX-OSData partition.
 [root@prg03t0-esx05:~] esxcli system coredump file list
 Path                                                                                                                 Active  Configured        Size
 -------------------------------------------------------------------------------------------------------------------  ------  ----------  ----------
 /vmfs/volumes/6233a3c2-58e4bf62-94e7-0025b5ea0e13/vmkdump/00000000-00E0-0000-0000-000000000006-8162115584.dumpfile    true        true  8162115584

Now let's try PSOD on ESXi host in maintenance mode.
 vsish -e set /reliability/crashMe/Panic 1  


Let's extract the core dump.
[root@prg03t0-esx05:~] esxcfg-dumppart --file --copy --devname /vmfs/volumes/6233a3c2-58e4bf62-94e7-0025b5ea0e13/vmkdump/00000000-00E0-0000-0000-000000000006-8162115584.dumpfile --zdumpname /vmfs/volumes/PRG03T0-HDD01/coredump.esx05.1
Created file /vmfs/volumes/PRG03T0-HDD01/coredump.esx05.1.1  
[root@prg03t0-esx05:~] ls -lah /vmfs/volumes/PRG03T0-HDD01/coredump.esx05.1.1
-rw-r--r--    1 root     root        4.6G Nov  5 18:11 /vmfs/volumes/PRG03T0-HDD01/coredump.esx05.1.1
The VMkernel zdump file extracted from the PSOD on an ESXi 7.0 U3 host with 512 GB of RAM is 4.6 GB in size, which is significantly larger than the zdump file from an ESXi 8.0 U3 host with 256 GB of RAM.

Let's compare it to ESXi 8.0 U3 with 256 GB RAM
  • 512 GB RAM is 2x bigger than 256 GB RAM
  • vmKernel memory usage 2 776 MB is ~2x larger than 1453 MB (it makes sense)
  • coredump file 8.16 GB file is ~2.25x larger than 3.6 GB file (it makes sense)
  • zdump file 4.6 GB (4 710 MB) is ~10x larger than 443.9 MB (hmm, interesting)
Why is the zdump file 10x bigger on ESXi 7.0 U3 with 512 GB RAM and not just 2x bigger, as I would expect? To be honest, I don't know. I have to retest it on ESXi 8.0 U3 with 512 GB RAM and 1.5 TB RAM when possible.

ESXi, 7.0.3 (23794027) with 1.5 TB (1 571 489 MB) physical RAM

This ESXi host (1.5 TB RAM) exists only in production, so it is managed by the operations team and we want to avoid testing PSOD in production. However, we checked the memory usage and core dump file size. 

In Maintenance mode (no VMs):
  • vmKernel memory usage:  2 705 MB
  • Other memory usage: 2 705 MB
  • Free memory: 1 561 570 MB
In ESXi 7.0.3 with 1.5 TB RAM, the core dump is set to be stored into a 16.1 GB file at the ESX-OSData partition.
 [root@prg0301-esx36:~] esxcli system coredump file list
 Path                                                                                                      Active  Configured         Size
 --------------------------------------------------------------------------------------------------------  ------  ----------  -----------
 /vmfs/volumes/5dec0956-3d83cd8b-de10-0025b52ae000/vmkdump/00000000-0021-0000-0000-000000000024.dumpfile    true        true  16106127360
I cannot test PSOD in the production system by myself, so I have to wait until our operation team schedules the vSphere 8 upgrade and we can test it together. 

Conclusion

Core dumps for ESXi 6.7 and lower are stored in a disk partition. ESXi 7 and higher store core dumps in a core dump file. The core dump file holds only a single core dump, therefore it should be extracted (esxcfg-dumppart --file --copy) by the vSphere administrator immediately after a PSOD, otherwise it will be lost when another PSOD occurs.

In the current ESXi 8, the core dump file is located on the ESX-OSData partition, which can be on the boot disk or on an additional disk.

If the boot disk is larger than 128 GB, the standard ESXi 8 partition layout is
  1. 101 MB   - Boot Loader partition
  2. 4 GB        - Boot Bank 1 partition
  3. 4 GB        - Boot Bank 2 partition
  4. 119.9 GB - ESX-OSData partition
This is the disk usage of ESXi 8.0 U3 with 256 GB RAM and 128 GB boot disk - vSAN disabled, NSX installed.
 Filesystem  Size  Used Available Use% Mounted on  
 VMFSOS   119.8G  5.2G  114.6G  4% /vmfs/volumes/OSDATA-66d98185-2bceed00-72c5-0025b5ea0e0d  
 vfat     4.0G 274.1M   3.7G  7% /vmfs/volumes/BOOTBANK1  
 vfat     4.0G 338.9M   3.7G  8% /vmfs/volumes/BOOTBANK2  
As you see, 5.2 GB is used in ESX-OSData partition. 

We use ESXi hosts with 1.5 TB RAM booting from SAN (Fibre Channel) in our production environment. The boot disk (a LUN on shared storage) is 32 GB. In such a case, the partition layout looks as described below
  1. 101 MB - Boot Loader partition
  2. 4 GB      - Boot Bank 1 partition
  3. 4 GB      - Boot Bank 2 partition
  4. 23.9 GB - ESX-OSData partition
The ESX-OSData volume takes on the role of the legacy /scratch partition, the locker partition for VMware Tools, and the core dump destination. With 23.9 GB of ESX-OSData, there is still space for the core dump file (16.1 GB), log files, and trace files. If we want to keep 20% free space on the ESX-OSData partition, we have 19.1 GB available: 16.1 GB is preallocated for the core dump file and 3 GB is available for logs and traces. This should be enough. 
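The same space check in a few lines of Python (values are from this design; adjust them for your own boot device):

# ESX-OSData space check for a 32 GB boot LUN: 23.9 GB OSData partition,
# 16.1 GB core dump file, and a 20% free-space reserve.
osdata_gb = 23.9
free_space_reserve = 0.20
coredump_gb = 16.1

usable_gb = osdata_gb * (1 - free_space_reserve)
left_for_logs_gb = usable_gb - coredump_gb
print(f"usable ~{usable_gb:.1f} GB, ~{left_for_logs_gb:.1f} GB left for logs and traces")
# -> usable ~19.1 GB, ~3.0 GB left for logs and traces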

Note: Logs and traces are also configured to be sent to a remote syslog server (Aria Operations for Logs / aka LogInsight).

Even though core dumps can be redirected to a shared datastore (by changing Scratch Partition location), keeping the core dump in the boot device is a relatively good design choice when using a 32 GB or higher capacity durable boot disk device (HDD, NVMe, SATADOM, etc.). 

The VMware minimum recommended boot disk size is 32 GB, while 128 GB is considered ideal. VMware also recommends using durable boot devices, such as local disks or NVMe drives, instead of SD cards or USB sticks for ESXi 7.0 and later versions.

Note: In my home lab, I still boot from USB and have ESX-OSData on NVMe disk because my old equipment does not support booting from NVMe.



Wednesday, October 30, 2024

IPv4 Addresses Cheat Sheet

Below is my cheat sheet about IPv4 addresses and subnetting.

IPv4_Address_Cheat_Sheet

The cheat sheet is primarily for myself :-), but somebody else can find it helpful and use it.

Description: The binary math representation of IP octets (bytes) and its relation to subnetting.

Keywords: Class Addressing, Classless Addressing
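As a small companion to the cheat sheet, Python's standard ipaddress module can show the same subnetting math for any prefix; the example network below is chosen arbitrarily.

# Subnetting math with the Python standard library (example network chosen arbitrarily).
import ipaddress

net = ipaddress.ip_network("192.168.10.0/26")
print("netmask:     ", net.netmask)             # 255.255.255.192
print("usable hosts:", net.num_addresses - 2)   # 62
print("broadcast:   ", net.broadcast_address)   # 192.168.10.63
print("binary mask: ", bin(int(net.netmask)))   # 0b11111111111111111111111111000000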


Wednesday, July 06, 2022

Monolithic versus Microservices application architecture consideration

Microservices application architecture is very popular nowadays; however, it is important to understand that everything has advantages and drawbacks. I absolutely understand the advantages of microservices application architecture, however, there is at least one drawback. Of course, there are more, but let's show at least the potential impact on performance, specifically on latency.

A monolithic application calls functions (aka procedures) locally within a single compute node's memory (RAM). The latency of RAM is approximately 100 ns (0.0001 ms), and a Python function call on a decent computer has a latency of ~370 ns (0.00037 ms). Note: You can test Python function call latency on your computer with the code available at https://github.com/davidpasek/function-latency/tree/main/python
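For illustration, here is a minimal sketch of such a measurement using only the standard library; this is not the exact code from the linked repository, and the ~370 ns figure will vary by machine and Python version.

# Measure the latency of a trivial local Python function call with timeit.
import timeit

def noop():
    return None

calls = 1_000_000
total_s = timeit.timeit(noop, number=calls)
print(f"~{total_s / calls * 1e9:.0f} ns per local function call")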

A microservices application uses remote procedure calls (aka RPC) over the network, typically as REST or gRPC calls over HTTPS, therefore it has to traverse the network. Even though the latency of a modern 25GE Ethernet network is approximately 480 ns (0.00048 ms, which is still ~5x slower than RAM latency) and RDMA over Converged Ethernet latency can be ~3,000 ns (0.003 ms), the latency of a microservice gRPC function call is somewhere between 40 and 300 ms. [source

Conclusion

Python local function call latency is ~370 ns. Python remote function call latency is ~280 ms. That's roughly six orders of magnitude (10^6) higher latency for the microservices application. RPC in low-level programming languages like C++ can be 10x faster, but it is still ~10^5 times slower than a local Python function call.
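A quick check of the order-of-magnitude claim (the ~280 ms remote call latency is the figure from the cited source):

# Order-of-magnitude comparison of local vs. remote (gRPC) function call latency.
local_call_s = 370e-9   # ~370 ns local Python function call
remote_call_s = 280e-3  # ~280 ms remote gRPC call (from the cited source)
print(f"remote call is ~{remote_call_s / local_call_s:,.0f}x slower")  # ~756,757x (~10^6)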

I'm not saying that microservices applications are bad. I just recommend considering this negative performance impact during your application design and the specification of application services.