I have just realized that PureStorage has 150TB DirectFlash Modules.
That got me thinking.
Flash capacity is increasing year by year. What are performance/capacity ratios?
This will be a quick blog post, prompted by another question I received about VMware virtual NIC link speed. In this blog post I’d like to demonstrate that the virtual link speed shown in operating systems is merely a reported value and not an actual limit on throughput.
I have two Linux Mint (Debian-based) systems, mlin01 and mlin02, virtualized in VMware vSphere 8.0.3. Each system has a VMXNET3 NIC. Both virtual machines are hosted on the same ESXi host, so they are not constrained by the physical network. Let's test network bandwidth between these two systems with iperf.
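Below is a minimal sketch of the test, assuming iperf is installed on both systems and that mlin01 has the hypothetical IP address 192.168.1.11.

# On mlin01 - run the iperf server
iperf -s

# On mlin02 - run the client against mlin01 (hypothetical IP 192.168.1.11),
# 60-second test with results reported every 5 seconds
iperf -c 192.168.1.11 -t 60 -i 5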
In VMware vSphere environments, even the most critical business applications are often virtualized. Occasionally, application owners may report high disk latency issues. However, disk I/O latency can be a complex topic because it depends on several factors, such as the size of the I/O operations, whether the I/O is a read or a write and in which ratio, and of course, the performance of the underlying storage subsystem.
One of the most challenging aspects of any storage troubleshooting is understanding what size of I/O workload is being generated by the virtual machine. Storage workload I/O size is a significant factor in response time. Response times differ for 4 KB I/O and 1 MB I/O. Here are examples from my vSAN ESA performance testing.
You can see that response times vary based on the storage profile. However, application owners very often do not know the storage profile of their application workload and just complain that the storage is slow.
As one storage expert (I think it was Howard Marks [1] [2]) once said, there are only two types of storage performance: good enough and not good enough. Fortunately, on an ESXi host, we have a useful tool called vscsiStats. We just need to know on which ESXi host the VM is running and SSH into that particular ESXi host.
The vscsiStats monitoring procedure is documented in the VMware KB article Using vscsiStats to collect IO and Latency stats on Virtual Disks.
Let's test it in the lab.
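A minimal sketch of the procedure from the ESXi Shell follows; the world group ID (shown here as the hypothetical value 123456) is the one reported for your VM by the list command.

# List running VMs and their world group IDs
vscsiStats -l

# Start collection for the VM's world group ID (hypothetical value 123456)
vscsiStats -s -w 123456

# Print the I/O length histogram to see the workload's I/O size distribution
vscsiStats -p ioLength -w 123456

# Print the latency histogram
vscsiStats -p latency -w 123456

# Stop collection when done
vscsiStats -x -w 123456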
Here is what happened with VMware Site Recovery Manager. It was repackaged into VMware Live Recovery.
VMware Live Recovery is the latest version of disaster and ransomware recovery from VMware. It combines VMware Live Site Recovery (previously Site Recovery Manager) with VMware Live Cyber Recovery (previously VMware Cloud Disaster Recovery) under a single shared management console and a single license. Customers can protect applications and data from modern ransomware and other disasters across VMware Cloud Foundation environments on-premises and in public clouds with flexible licensing for changing business needs and threats.
For more details see the VMware Live Recovery FAQ and the VMware Live Recovery resource page.
In this blog post I will just copy information from the Site Recovery Manager FAQ PDF, because that's what the good old on-prem SRM is, and it is good to have it in HTML form in case the Broadcom/VMware PDF disappears for whatever reason.
Here you have it ...
iperf is a great tool to test network throughput. There is an iperf3 binary on the ESXi host, but there are restrictions and you cannot run it directly.
Here is the trick.
First of all, you have to set the ESXi advanced option execInstalledOnly to 0. This enables you to run executable binaries that were not preinstalled by VMware.
The second step is to make a copy of the iperf3 binary, because the installed version is restricted and cannot be run.
The third step is to disable the ESXi firewall to allow communication between the iperf client and the iperf server across ESXi hosts.
After finishing the performance testing, you should clean up the ESXi environment.
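A minimal sketch of the whole workflow, assuming the iperf3 binary lives in /usr/lib/vmware/vsan/bin/ (paths and option names may differ between ESXi builds):

# Allow execution of binaries not installed by VMware
esxcli system settings advanced set -o /User/execInstalledOnly -i 0

# Work around the restriction on the installed binary by copying it
cp /usr/lib/vmware/vsan/bin/iperf3 /usr/lib/vmware/vsan/bin/iperf3.copy

# Disable the ESXi firewall for the duration of the test
esxcli network firewall set --enabled false

# Run the server on one ESXi host and the client on the other
/usr/lib/vmware/vsan/bin/iperf3.copy -s
/usr/lib/vmware/vsan/bin/iperf3.copy -c [IP-OF-IPERF-SERVER]

# Clean up after testing
rm /usr/lib/vmware/vsan/bin/iperf3.copy
esxcli network firewall set --enabled true
esxcli system settings advanced set -o /User/execInstalledOnly -i 1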
When we want to enable Jumbo Frames on VMware vSphere, they must be enabled end-to-end: on the physical switches, on the virtual switches (vSS/vDS), and on the VMkernel ports.
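On the vSphere side, a minimal sketch with esxcli might look like this (the vSwitch and VMkernel interface names are assumptions for illustration):

# Set MTU 9000 on a standard vSwitch (hypothetical name vSwitch1)
esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000

# Set MTU 9000 on the vMotion VMkernel interface (hypothetical name vmk1)
esxcli network ip interface set --interface-name=vmk1 --mtu=9000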
Let's assume it is configured by the network and vSphere administrators and we want to validate that the vMotion network between two ESXi hosts supports Jumbo Frames. Let's say we have these two ESXi hosts.
[root@esx11:~] ping -I vmk1 -S vmotion -s 8972 -d 10.160.22.112
PING 10.160.22.112 (10.160.22.112): 8972 data bytes
8980 bytes from 10.160.22.112: icmp_seq=0 ttl=64 time=0.770 ms
8980 bytes from 10.160.22.112: icmp_seq=1 ttl=64 time=0.637 ms
8980 bytes from 10.160.22.112: icmp_seq=2 ttl=64 time=0.719 ms
We can see a successful test of large ICMP packets without fragmentation. We validated that ICMP packets with a payload of 8972 bytes can be transferred over the network without fragmentation. 8972 bytes is the maximum ICMP payload for MTU 9000 (9000 minus 20 bytes of IP header and 8 bytes of ICMP header), so this indicates that Jumbo Frames (MTU 9000) are enabled end-to-end.
Now let's try to carry ICMP packets with a payload of 8973 bytes.
[root@esx11:~] ping -I vmk1 -S vmotion -s 8973 -d 10.160.22.112
PING 10.160.22.112 (10.160.22.112): 8973 data bytes
sendto() failed (Message too long)
sendto() failed (Message too long)
sendto() failed (Message too long)
Almost 10 years ago, I gave a presentation at the local VMware User Group (VMUG) meeting in Prague, Czechia, on Metro Cluster High Availability and SRM Disaster Recovery. The slide deck is available here on Slideshare. I highly recommend reviewing the slide deck, as it clearly explains the fundamental concepts and terminology of Business Continuity and Disaster Recovery (BCDR), along with the VMware technologies used to plan, design, and implement effective BCDR solutions.
Let me briefly outline the key BCDR concepts, documents, and terms below.
RPO (Recovery Point Objective) and RTO (Recovery Time Objective) are the best-known terms in the Business Continuity and Disaster Recovery world, and I hope all IT professionals know at least these two. However, repetition is the mother of wisdom, so let's repeat what RPO and RTO are. A picture is worth 1,000 words, so look at the picture below.
[Figure: RPO and RTO]
RPO - The maximum acceptable amount of data loss measured in time. In other words, "How far back in time can we afford to go in our backups?"
RTO - The maximum acceptable time to restore systems and services (aka infrastructure) after a disaster.
WRT - The time needed after systems are restored (post-RTO) to make applications fully operational (e.g., data validation, restarting services). It’s a subset of MTD and follows RTO.
MTD - The maximum time the business can accept before the company is back up and running after a disaster. In other words, it is the total time a business can be unavailable before suffering irrecoverable damage or significant impact.
Easy, right? Not really. Disaster Recovery projects are, based on my experience, the most complex projects in IT infrastructure.
The web service available at https://ifconfig.me/ will expose the client's public IP address. This is useful when you do not know your public IP address because you are behind NAT (Network Address Translation) on some public Wi-Fi access point, or even at home behind CGNAT (Carrier-Grade NAT), which is very often used by Internet Service Providers with IPv4.
It is easy. Just use the good old fetch command, available on FreeBSD by default, to retrieve a file by Uniform Resource Locator (URL).
Oneliner:
fetch -qo - https://ifconfig.me | grep "ip_addr:"
It is easy. I use Debian, and the good old wget command is available by default. It can be used to retrieve a file by Uniform Resource Locator (URL), similarly to fetch on FreeBSD.
Oneliner:
wget -o /dev/null -O - https://ifconfig.me | grep "ip_addr:"
In Unix-like operating systems, it is very easy to leverage standard tools to get the job done.
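If curl happens to be installed, a similar one-liner should work on either system (a sketch, assuming the /ip endpoint of ifconfig.me keeps returning the plain address):

curl -s https://ifconfig.me/ip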
I have two VMware vSphere home labs with relatively old hardware (10+ years old). Even though I have upgraded the old hardware to use local SATA SSDs or even NVMe disks, the old systems do not support booting from NVMe. That's the reason I still boot my homelab ESXi hosts from USB flash disks, even though it is highly recommended not to use USB flash disks or SD cards as boot media for ESXi 7 and later. The reason for this recommendation is to keep the boot medium in a healthy state, because ESXi 7 and later writes to the boot media more frequently than earlier ESXi versions. If you want to know more details about the reasons for this recommendation, please read my older post vSphere 7 - ESXi boot media partition layout changes.
During the ESXi installation you can choose a USB disk as the boot medium and a local NVMe disk as the disk for the ESX OSDATA partition, which will be used to write ESXi system data like config backups, logs (ESX, vSAN, NSX), vSAN traces, core dumps, VM tools (directory /productLocker), and other ephemeral or persistent system files.
What does it look like, and how can you identify the ESX OSDATA partition on local disks?
Interestingly enough, if you check the disk partition layout from vCenter, you will see this 128 GB partition identified as "Legacy MBR". See the screenshot below.
[Figure: ESX OSDATA partition details in vSphere Client - Legacy MBR (128 GB)]
This is, IMHO, a little bit misleading.
You can connect directly to a particular ESXi host and check the partition diagram in the ESXi host web management. See the screenshot below.
[Figure: ESXi OSDATA partition details on an NVMe disk - VMFSL (128 GB)]
There is no information about a "Legacy MBR" partition; instead, it is identified as a "VMFSL" partition. So again, there is no explicit information about the ESXi OSDATA partition, but it is identified as the VMFSL file system. What is VMFSL? VMFSL in ESXi stands for VMware File System Logical. It is not an officially documented term in VMware's main literature, but it refers to an internal filesystem layer used by ESXi for managing system partitions and services that are not exposed like traditional datastores (VMFS or NFS).
This is, IMHO, better information than the vSphere Client provides via vCenter.
We can also list partition tables in ESXi Shell.
[root@esx22:~] partedUtil getptbl /vmfs/devices/disks/eui.0000000001000000e4d25cc9a0325001
gpt
62260 255 63 1000215216
7 2048 268435455 4EB2EA3978554790A79EFAE495E21F8D vmfsl 0
8 268435456 1000212480 AA31E02A400F11DB9590000C2911D1B8 vmfs 0
It is also listed as a VMFSL partition, but we can also see the partition type GUID.
The GUID (Globally Unique Identifier) identifies the partition type. It is not unique to an individual partition; it corresponds to a specific partition format (e.g., VMFS, VMFSL/OSDATA, etc.).
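If you want to map a partition type GUID to a human-readable name directly on the ESXi host, partedUtil can list the GUIDs it knows about (a quick sketch; the output is trimmed and its format may differ slightly between ESXi versions):

[root@esx22:~] partedUtil showGuids
 Partition Type       GUID
 vmfs                 AA31E02A400F11DB9590000C2911D1B8
 vmkDiagnostic        9D27538040AD11DBBF97000C2911D1B8
 ...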
And that's it. Now you know all the details about the ESX OSDATA partition and how to identify it.
In PART 1, I compared default installations of FreeBSD 14.2 and Debian 10.2 and performed some basic network tuning of FreeBSD to approach Debian's TCP throughput, which is, based on my testing, higher than the network throughput of FreeBSD. The testing in PART 1 was performed on Cisco UCS enterprise servers with 2x Intel Xeon CPU E5-2680 v4 @ 2.40GHz with ESXi 8.0.3. This is an approximately 9-year-old server with an Intel Xeon server-family CPU.
In this PART 2, I will continue the deep dive into network throughput tuning with some additional context and more advanced network tuning of FreeBSD and Debian. Tests will be performed on a 9-year-old consumer PC (Intel NUC 6i3SYH) with 1x Intel Core i3-6100U CPU @ 2.30GHz with ESXi 8.0.3.
The VM hardware used for the iperf tests has the following specification
I run iperf -s on VM01 and iperf -c [IP-OF-VM01] -t600 -i5 on VM02. I use the iperf parameters -P1, -P2, -P3, and -P4 to test the impact of more parallel client threads and watch the results, because I have realized that more parallel client threads have a positive impact on FreeBSD network throughput and no impact or a slightly negative impact on Debian (Linux).
I test network throughput with and without the following hardware offload capabilities
When two VMs talk to each other on the same ESXi host:
RSS (Receive Side Scaling) is another important network technology for achieving high network traffic and throughput. RSS spreads incoming network traffic across multiple CPU cores by using a hash of the packet headers (IP, TCP, UDP).
Without RSS:
With RSS:
In this exercise we test the network throughput of single-vCPU virtual machines, therefore RSS would not help us anyway. I will focus on multi-vCPU VMs in the future.
Anyway, it seems that RSS is not implemented in FreeBSD's vmx driver for the VMXNET3 network card and is only partly implemented in the VMXNET3 driver in Linux. The reason is that full RSS would add overhead inside a VM.
Implementing RSS would:
In most cases, multiqueue + interrupt steering gives enough performance inside a VM without the cost of full RSS.
FreeBSD blacklists MSI/MSI-X (Message Signaled Interrupts) for some virtual and physical devices to avoid bugs or instability. In VMware VMs, this means that MSI-X (which allows multiple interrupt vectors per device) is disabled by default, limiting performance — especially for multiqueue RX/TX and RSS (Receive Side Scaling).
With MSI-X enabled, you get:
This setting affects all PCI devices, not just vmx, so it should be tested carefully in production VMs. On ESXi 6.7+ and FreeBSD 12+, MSI-X is generally stable for vmxnet3.
This is another potential improvement for multi-vCPU VMs, but it should not help us with a single-vCPU VM.
I have tested it and it really does not help with a single-vCPU VM. I will definitely test this setting along with RSS and RX/TX queues in future parts of this series of articles about FreeBSD network throughput, when I test the impact of multiple vCPUs and network queues.
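For reference, the knob involved is the MSI blacklist tunable. This is a sketch of what I would set in /boot/loader.conf when experimenting; verify it against your FreeBSD release before using it in production, since it affects all PCI devices.

# /boot/loader.conf
# Do not honor the PCI MSI/MSI-X blacklist, so vmxnet3 can use MSI-X vectors
hw.pci.honor_msi_blacklist="0"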
By default, FreeBSD uses a single thread to process all network traffic in accordance with the strong ordering requirements found in some protocols, such as TCP.
In order to increase potential packet-processing concurrency, net.isr.maxthreads can be defined as "-1", which automatically enables as many netisr threads as there are CPU cores in the machine. Then all CPU cores can be used for packet processing and the system is not limited to a single thread running on a single CPU core.
As we are testing TCP network throughput on a single-CPU-core machine, this is not going to help us.
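For completeness, this is a boot-time tunable; a sketch of the /boot/loader.conf lines one would use on a multi-core machine (net.isr.bindthreads is an optional companion setting, not something the original tuning required):

# /boot/loader.conf
net.isr.maxthreads="-1"   # one netisr thread per CPU core
net.isr.bindthreads="1"   # pin each netisr thread to its CPU core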
The net.isr.defaultqlimit setting in FreeBSD controls the queue length for Interrupt Service Routines (ISR), which are part of the network packet processing pipeline. Specifically, this queue holds incoming network packets that are being processed in the interrupt handler before being passed up to higher layers (e.g., the TCP/IP stack). The ISR queues help ensure that network packets are processed efficiently without being dropped prematurely when the network interface card (NIC) is receiving packets at a high rate.
We can experiment with different values. The default value is 256, but for high-speed networks, we might try values like 512 or 1024.
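This is also a loader tunable; a sketch of the /boot/loader.conf line for experimenting with a larger queue:

# /boot/loader.conf
net.isr.defaultqlimit="1024"   # default is 256; try 512 or 1024 for high-speed networks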
The net.isr.dispatch sysctl in FreeBSD controls how inbound network packets are processed in relation to the netisr (network interrupt service routines) system. This is central to FreeBSD's network stack parallelization.
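The dispatch policy can be inspected and changed at runtime with sysctl; the accepted values are direct, hybrid, and deferred (a sketch; check sysctl -d net.isr.dispatch on your release for details).

# Show the current dispatch policy
sysctl net.isr.dispatch

# Switch to deferred dispatch (packets are queued to netisr threads instead of
# being processed directly in the interrupt context)
sysctl net.isr.dispatch=deferred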
The net.link.ifqmaxlen sysctl in FreeBSD controls the maximum length of the interface output queue, i.e., how many packets can be queued for transmission on a network interface before packets start getting dropped.
Every network interface in FreeBSD has an output queue for packets that are waiting to be transmitted. net.link.ifqmaxlen defines the default maximum number of packets that can be held in this queue. If the queue fills up (e.g., due to a slow link or CPU bottleneck), additional packets are dropped until space becomes available again.
The default value is typically 50, which can be too low for high-throughput scenarios.
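net.link.ifqmaxlen is a boot-time tunable, so it goes into /boot/loader.conf (a sketch; the value 2048 is just a common starting point, not a recommendation):

# /boot/loader.conf
net.link.ifqmaxlen="2048"   # default is 50; raise for high-throughput interfaces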
FreeBSD lets you set the number of queues via loader.conf if supported by the driver.
With only 1 core, there's no benefit (and typically no support) for having more than 1 TX and 1 RX queue. FreeBSD’s vmx driver will automatically limit the number of queues to match the number of cores available.
At the moment we test network throughput of single vCPU machines, therefore, we do not tune this setting.
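For later multi-vCPU testing, the iflib-based vmx driver exposes per-device queue overrides. The sketch below shows the tunables I expect to use; the names follow the iflib convention and should be verified against your FreeBSD version.

# /boot/loader.conf
# iflib per-device overrides for the first vmxnet3 NIC (multi-vCPU VMs only)
dev.vmx.0.iflib.override_ntxqs="4"
dev.vmx.0.iflib.override_nrxqs="4"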
In this chapter I will consider additional FreeBSD network tuning described for example at https://calomel.org/freebsd_network_tuning.html and other resources over the Internet.
soreceive_stream() can significantly reduce CPU usage and lock contention when receiving fast TCP streams. Additional gains are obtained when the receiving application, like a web server, uses SO_RCVLOWAT to batch up some data before a read (and wakeup) is done.
How to enable soreceive_stream?
Add following line to /boot/loader.conf
net.inet.tcp.soreceive_stream="1" # (default 0)
How to check the status of soreceive_stream?
sysctl -a | grep soreceive
soreceive_stream is disabled by default. During my testing I have not seen any increase in network throughput with soreceive_stream enabled, therefore we can keep it at the default - disabled.
There are several TCP congestion control algorithms in FreeBSD and Debian. FreeBSD offers cubic (default), newreno, htcp (Hamilton TCP), vegas, cdg, and chd. Debian offers cubic (default) and reno. Cubic is the default TCP congestion control algorithm in both FreeBSD and Debian. I tested all of these algorithms on FreeBSD, and cubic is optimal.
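To check or switch the algorithm on both systems, something like the following should work (a sketch; on FreeBSD, non-default algorithms must first be loaded as cc_* kernel modules):

# FreeBSD: list available algorithms and switch to H-TCP
sysctl net.inet.tcp.cc.available
kldload cc_htcp
sysctl net.inet.tcp.cc.algorithm=htcp

# Debian (Linux): list available algorithms and switch back to cubic
sysctl net.ipv4.tcp_available_congestion_control
sysctl -w net.ipv4.tcp_congestion_control=cubic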
FreeBSD 14.2 currently supports three TCP stacks: the default freebsd stack, RACK, and BBR.
I found that the default FreeBSD TCP stack has the highest throughput in a data center network, therefore changing the TCP stack does not help increase network throughput.
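For the record, this is how the available stacks can be listed and switched (a sketch; the tcp_rack and tcp_bbr modules must be loaded, and depending on the FreeBSD build they may require a kernel compiled with the extra TCP stacks):

# List TCP stacks compiled in or loaded, and show the current default
sysctl net.inet.tcp.functions_available
sysctl net.inet.tcp.functions_default

# Load the RACK stack and make it the default for new connections
kldload tcp_rack
sysctl net.inet.tcp.functions_default=rack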