Friday, May 02, 2025

Network throughput and CPU efficiency of FreeBSD 14.2 and Debian 10.2 in VMware - PART 2

In PART 1, I compared default installations of FreeBSD 14.2 and Debian 10.2 and performed some basic network tuning of FreeBSD to approach Debian's TCP throughput, which, based on my testing, is higher than the network throughput of FreeBSD. The testing in PART 1 was performed on Cisco UCS enterprise servers with 2x Intel Xeon E5-2680 v4 CPUs @ 2.40GHz running ESXi 8.0.3. This is an approximately 9-year-old server with Intel Xeon server-class CPUs.

In this PART 2, I will continue with a deep dive into network throughput tuning, with some additional context and advanced network tuning of FreeBSD and Debian. Tests will be performed on a 9-year-old consumer PC (Intel NUC 6i3SYH) with 1x Intel Core i3-6100U CPU @ 2.30GHz running ESXi 8.0.3.

VM hardware

The VM hardware used for iperf tests has the following specification:

  • 1 vCPU (artificially limited by the hypervisor to 2000 MHz, to be able to easily recalculate Hz to bit/s)
  • 1 GB RAM
  • vNIC type is vmxnet3
  • VM hardware / Compatibility: ESXi 8.0 U2 and later (VM version 21)

iperf parameters

I run iperf -s on VM01 and iperf -c [IP-OF-VM01] -t600 -i5 on VM02. I use the iperf parameters -P1, -P2, -P3, -P4 to test the impact of more parallel client threads and watch the results, because I realized that more parallel client threads have a positive impact on FreeBSD network throughput and no or slightly negative impact on Debian (Linux).
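
For clarity, the exact commands look like this ([IP-OF-VM01] is a placeholder for the server VM's IP address):

iperf -s                               # on VM01 (server)
iperf -c [IP-OF-VM01] -t600 -i5 -P2    # on VM02 (client, here with 2 parallel threads)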

Network Offload Features

I test network throughput with and without the following hardware offload capabilities (see the example ifconfig commands after this list):

  • rxcsum (Receive Checksum Offload) - Offloads checksum validation of incoming packets from the OS to the NIC, which improves performance (less CPU usage), network throughput (faster packet processing), and efficiency (especially on high-speed interfaces like 10G/25G).
  • rxcsum_ipv6 - the IPv6 version of rxcsum
  • txcsum (Transmit Checksum Offload) - The NIC computes checksums for outgoing packets (TCP, UDP, IP) instead of the CPU. This reduces CPU load and improves performance, especially when handling lots of network traffic.
  • txcsum_ipv6 - the IPv6 version of txcsum
  • tso4 (TCP Segmentation Offload for IPv4) - TSO (TCP Segmentation Offload) allows the NIC (or virtual NIC) to split large TCP packets into smaller segments on its own, instead of making the CPU do it. Without tso4, the CPU breaks a large packet (say 64 KB) into many small 1500-byte segments before sending. With tso4, the CPU sends one big chunk to the NIC (e.g., 64 KB), and the NIC does the chopping into MTU-sized pieces. This brings less CPU usage, higher throughput, better efficiency on high-speed links or when sending large files. !!! Requires txcsum to work !!!
  • tso6 (TCP Segmentation Offload for IPv6)
  • lro (Large Receive Offload) - LRO allows the network interface (or virtual NIC) to combine multiple incoming TCP packets into one large packet before passing them up the stack to the OS. It is good because it means fewer interrupts, fewer packets for the CPU to process, higher throughput (especially on high-speed networks), better performance in TCP-heavy applications.
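
These offload features can be toggled at runtime with ifconfig. A minimal sketch, assuming the guest interface is vmx0 (adjust the interface name; the flags can be added to ifconfig_vmx0 in /etc/rc.conf to make them persistent):

ifconfig vmx0 -rxcsum -rxcsum6 -txcsum -txcsum6 -tso4 -tso6 -lro   # disable offloads
ifconfig vmx0 rxcsum rxcsum6 txcsum txcsum6 tso4 tso6 lro          # enable offloads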

What is Offload Feature Impact for VM ↔ VM on same ESXi?

When two VMs talk to each other on the same ESXi host:

  • Packets never leave the host, but the vNIC emulation layer still processes them.
  • VMXNET3 is a paravirtualized driver, so it's optimized, but offloads still reduce CPU cycles in the guest.
  • With txcsum + rxcsum, you're telling FreeBSD not to burn CPU on checksums → lower guest load.
  • With tso4 + lro, the guest kernel deals with fewer packets → fewer context switches and interrupts.

RSS (Receive Side Scaling)

RSS is another important network technology for achieving high network throughput. RSS spreads incoming network traffic across multiple CPU cores by using a hash of the packet headers (IP, TCP, UDP).

Without RSS:

  • All packets are processed by a single CPU core (usually core 0).
  • Can cause bottlenecks on high-throughput systems.

With RSS:

  • Incoming packets from different connections go to different cores.
  • Much better use of multi-core CPUs.
  • Improves parallelism, reduces interrupt contention, and increases throughput.

In this exercise we test network throughput of single vCPU virtual machines, therefore RSS would not help us anyway. I will focus on multi-vCPU VMs in the future.

Anyway, it seems that RSS is not implemented in FreeBSD's vmx driver for the VMXNET3 network card and is only partly implemented in the Linux VMXNET3 driver. The reason is that RSS would add overhead inside a VM.

Implementing RSS would:

  • Require hashing each packet in software
  • Possibly break hypervisor assumptions
  • Add extra CPU work in the guest, often worse than the performance gains

In most cases, multiqueue + interrupt steering gives enough performance inside a VM without the cost of full RSS.

MSI-X

FreeBSD blacklists MSI/MSI-X (Message Signaled Interrupts) for some virtual and physical devices to avoid bugs or instability. In VMware VMs, this means that MSI-X (which allows multiple interrupt vectors per device) is disabled by default, limiting performance — especially for multiqueue RX/TX and RSS (Receive Side Scaling).

With MSI-X enabled, you get:

  • Multiple interrupt vectors per device
  • Support for multiqueue RX and TX paths
  • Better support for RSS (load balancing packet processing across CPUs)
  • Improved throughput and reduced latency — especially on multi-core VMs
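
The blacklist is controlled by a loader tunable. A minimal sketch of the /boot/loader.conf entry typically used to enable MSI-X in VMware guests (hw.pci.honor_msi_blacklist; verify it against your FreeBSD version before relying on it):

hw.pci.honor_msi_blacklist="0" # (default 1, blacklist honored)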

This setting affects all PCI devices, not just vmx, so it should be tested carefully in production VMs. On ESXi 6.7+ and FreeBSD 12+, MSI-X is generally stable for vmxnet3.

This is another potential improvement for multi vCPU VMs but it should not help us in single vCPU VM.

I have tested it and it really does not help in a single vCPU VM. I will definitely test this setting along with RSS and RX/TX queues in future parts of this series of articles about FreeBSD network throughput, when I test the impact of multiple vCPUs and network queues.

Interrupt Service Routines (net.isr)

By default, FreeBSD uses a single thread to process all network traffic in accordance with the strong ordering requirements found in some protocols, such as TCP. 

net.isr.maxthreads

In order to increase potential packet processing concurrency, net.isr.maxthreads can be defined as "-1", which will automatically enable a number of netisr threads equal to the number of CPU cores in the machine. Then all CPU cores can be used for packet processing and the system will not be limited to a single thread running on a single CPU core.
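
For completeness, a sketch of the /boot/loader.conf entry (not applied in this single vCPU test; -1 means one netisr thread per CPU core):

net.isr.maxthreads="-1"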

As we are testing TCP network throughput on a single CPU core machine, this is not going to help us.

net.isr.defaultqlimit

The net.isr.defaultqlimit setting in FreeBSD controls the queue length for Interrupt Service Routines (ISR), which are part of the network packet processing pipeline. Specifically, this queue holds incoming network packets that are being processed in the Interrupt Handler before being passed up to higher layers (e.g., the TCP/IP stack).

The ISR queues help ensure that network packets are processed efficiently without being dropped prematurely when the network interface card (NIC) is receiving packets at a high rate.

We can experiment with different values. The default value is 256, but for high-speed networks, we might try values like 512 or 1024.

How to set defaultqlimit?
Add the following line to /boot/loader.conf
net.isr.defaultqlimit="1024" # (default 256)
 
How to check defaultqlimit setting?
sysctl net.isr.defaultqlimit
 
The net.isr.defaultqlimit default value is 256. During my testing I have not seen any increase in network throughput when the value is set to 512 or 1024, therefore the default value is OK.

net.isr.dispatch

The net.isr.dispatch sysctl in FreeBSD controls how inbound network packets are processed in relation to the netisr (network interrupt service routines) system. This is central to FreeBSD's network stack parallelization.

How to check the status of net.isr.dispatch?
sysctl net.isr.dispatch
 
How to set net.isr.dispatch?
You can add it to /boot/loader.conf or use sysctl to change it on a running system. 
sysctl net.isr.dispatch=direct (default)
sysctl net.isr.dispatch=deferred
 
Interrupt Service Routines (ISRs) are tightly tied to how VMXNET3 RX/TX queues are handled in the guest OS, especially for distributing packet processing across CPU cores. This is not beneficial for a single CPU machine.

By the way, the FreeBSD command vmstat -i | grep vmx shows us the interrupts associated with VMXNET3, so I will definitely test it in future network throughput testing of multi-CPU machines.

net.link.ifqmaxlen

The net.link.ifqmaxlen sysctl in FreeBSD controls the maximum length of the interface output queue, i.e., how many packets can be queued for transmission on a network interface before packets start getting dropped. 

Every network interface in FreeBSD has an output queue for packets that are waiting to be transmitted. net.link.ifqmaxlen defines the default maximum number of packets that can be held in this queue. If the queue fills up (e.g., due to a slow link or CPU bottleneck), additional packets are dropped until space becomes available again.

The default value is typically 50, which can be too low for high-throughput scenarios.

How to check ifqmaxlen setting?
sysctl net.link.ifqmaxlen
 
How to set net.link.ifqmaxlen?
Add the following line to /boot/loader.conf
net.link.ifqmaxlen="1024" # (default 50)
 
And reboot the system.
 
The net.link.ifqmaxlen default value is 50. During my testing I have not seen any increase in network throughput when the value is set to 1024 or 20000, therefore the default value is OK.

VMXNET3 RX/TX queues

FreeBSD lets you set the number of queues via loader.conf if supported by the driver.

It can be set in /boot/loader.conf
hw.vmx.rx_queues=8
hw.vmx.tx_queues=8
 
VMXNET3 queues are designed to distribute packet processing across multiple CPU cores. When a VM with 2 vCPUs is used, the queues can be set to 2 for RX and 2 for TX. Such a configuration requires MSI-X enabled. The maximum number of queues is 8 for RX and 8 for TX. Such a configuration requires 4+ vCPUs.

With only 1 core, there's no benefit (and typically no support) for having more than 1 TX and 1 RX queue. FreeBSD’s vmx driver will automatically limit the number of queues to match the number of cores available.

At the moment we are testing network throughput of single vCPU machines, therefore we do not tune this setting.

Other FreeBSD Network Tuning

In this chapter I will consider additional FreeBSD network tuning described, for example, at https://calomel.org/freebsd_network_tuning.html and other resources on the Internet.

soreceive_stream

soreceive_stream() can significantly reduce CPU usage and lock contention when receiving fast TCP streams. Additional gains are obtained when the receiving application, like a web server, is using SO_RCVLOWAT to batch up some data before a read (and wakeup) is done.

How to enable soreceive_stream?

Add the following line to /boot/loader.conf

net.inet.tcp.soreceive_stream="1" # (default 0)

How to check the status of soreceive_stream?

sysctl -a | grep soreceive

soreceive_stream is disabled by default. During my testing I have not seen any increase in network throughput when soreceive_stream is enabled, therefore we can keep it at the default - disabled.

Congestion Control Algorithms

There are several TCP congestion control algorithms in FreeBSD and Debian. In FreeBSD they are cubic (default), newreno, htcp (Hamilton TCP), vegas, cdg, and chd. In Debian they are cubic (default) and reno. Cubic is the default TCP congestion control algorithm in both FreeBSD and Debian. I tested all these algorithms in FreeBSD and cubic is optimal.
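
For reference, a minimal sketch of how the active algorithm can be checked and changed on both systems (cc_htcp is shown as an example; the non-default FreeBSD algorithms ship as similarly named cc_* kernel modules):

FreeBSD:
sysctl net.inet.tcp.cc.available         # list available algorithms
kldload cc_htcp                          # load an alternative algorithm module
sysctl net.inet.tcp.cc.algorithm=htcp    # switch the active algorithm

Debian:
sysctl net.ipv4.tcp_available_congestion_control
sysctl -w net.ipv4.tcp_congestion_control=reno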

TCP Stack

FreeBSD 14.2 currently supports three TCP stacks.

  1. freebsd (default FreeBSD TCP stack)
  2. rack (RACK-TLP Loss Detection Algorithm for TCP aka Recent ACKnowledgment is modern loss recovery stack)
  3. bbr (Bottleneck Bandwidth and Round-Trip Time Algorithm)

I found out that the default FreeBSD TCP stack has the highest throughput in a data center network, therefore changing the TCP stack does not help to increase network throughput.
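
For reference, a minimal sketch of how the TCP stack can be inspected and switched on FreeBSD (the RACK stack ships as the tcp_rack kernel module; the default applies to new connections):

sysctl net.inet.tcp.functions_available      # list available TCP stacks
kldload tcp_rack                             # load the RACK stack module
sysctl net.inet.tcp.functions_default=rack   # use RACK for new connections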

Here are IPv4 iperf TCP test results on FreeBSD

All Network Hardware Offload Features disabled

soreceive_stream disabled (default), net.isr.defaultqlimit = 256 (default) 

Test conditions:  
  • vmx interface -rxcsum -rxcsum6 -txcsum -txcsum6 -tso4 -tso6 -lro -vlanhwtso mtu 1500
  • soreceive_stream = 0 (disabled, default)
  • net.isr.defaultqlimit = 256 (default)
1.51 Gb/s with CPU 60% on server and 95% on client (iperf -P1)
1.49 Gb/s with CPU 60% on server and 95% on client (iperf -P2)
1.47 Gb/s with CPU 60% on server and 95% on client (iperf -P3)

Almost all Network Hardware Offload Features enabled but LRO disabled (FreeBSD default)

soreceive_stream disabled (default), net.isr.defaultqlimit = 256 (default) 

Test conditions:  
  • vmx interface rxcsum rxcsum6 txcsum txcsum6 tso4 tso6 -lro vlanhwtso mtu 1500
  • soreceive_stream = 0 (disabled, default)
  • net.isr.defaultqlimit = 256 (default)
2.1 Gb/s with CPU 70% on server and 70% on client (iperf -P1)
2.15 Gb/s with CPU 75% on server and 75% on client (iperf -P2)
2.15 Gb/s with CPU 75% on server and 75% on client (iperf -P3)
2.15 Gb/s with CPU 75% on server and 75% on client (iperf -P4)

All Network Hardware Offload Features enabled, including LRO

soreceive_stream disabled (default), net.isr.defaultqlimit = 256 (default) 

Test conditions:  
  • vmx interface rxcsum rxcsum6 txcsum txcsum6 tso4 tso6 lro vlanhwtso mtu 1500
  • soreceive_stream = 0 (disabled, default)
  • net.isr.defaultqlimit = 256 (default)
7.7 Gb/s with CPU 40% on server and 30% on client (iperf -P1)
9.49 Gb/s with CPU 65% on server and 40% on client (iperf -P2)
10.4 Gb/s with CPU 70% on server and 50% on client (iperf -P3)
11.1 Gb/s with CPU 75% on server and 60% on client (iperf -P4)
11.1 Gb/s with CPU 75% on server and 60% on client (iperf -P5)
11.3 Gb/s with CPU 75% on server and 60% on client (iperf -P6)
11.2 Gb/s with CPU 75% on server and 60% on client (iperf -P7)
11.3 Gb/s with CPU 75% on server and 60% on client (iperf -P8)

All Network Hardware Offload Features enabled, including LRO + Jumbo Frames

soreceive_stream disabled (default), net.isr.defaultqlimit = 256 (default) 

Test conditions:  
  • vmx interface rxcsum rxcsum6 txcsum txcsum6 tso4 tso6 lro vlanhwtso mtu 9000
  • soreceive_stream = 0 (disabled, default)
  • net.isr.defaultqlimit = 256 (default)
10.2 Gb/s with CPU 55% on server and 35% on client (iperf -P1)
11.6 Gb/s with CPU 70% on server and 60% on client (iperf -P2)
12.2 Gb/s with CPU 70% on server and 65% on client (iperf -P3)
12.4 Gb/s with CPU 75% on server and 70% on client (iperf -P4)
12.3 Gb/s with CPU 75% on server and 70% on client (iperf -P5)
11.6 Gb/s with CPU 75% on server and 75% on client (iperf -P6)

Here are IPv4 iperf TCP test results on Debian

... TBD ...

