In PART 1, I compared FreeBSD 14.2 and Debian 10.2 default installations and performed some basic network tuning of FreeBSD to bring its TCP throughput closer to Debian's, which, based on my testing, is higher than the network throughput of FreeBSD. The testing in PART 1 was performed on Cisco UCS enterprise servers with 2x Intel Xeon E5-2680 v4 CPUs @ 2.40GHz running ESXi 8.0.3. This is an approximately 9-year-old server with Intel Xeon family server CPUs.
In this PART 2, I continue the deep dive into network throughput tuning, with some additional context and more advanced network tuning of FreeBSD and Debian. Tests are performed on a 9-year-old consumer PC (Intel NUC 6i3SYH) with a single Intel Core i3-6100U CPU @ 2.30GHz running ESXi 8.0.3.
VM hardware
The VM hardware used for the iperf tests has the following specification:
- 1 vCPU (artificially limited by the hypervisor to 2000 MHz, to make it easy to recalculate Hz to bit/s)
- 1 GB RAM
- vNIC type is vmxnet3
- VM hardware / Compatibility: ESXi 8.0 U2 and later (VM version 21)
iperf parameters
I run iperf -s on VM01 and iperf -c [IP-OF-VM01] -t600 -i5 on VM02. I use the iperf parameters -P1, -P2, -P3, and -P4 to test the impact of more parallel client threads, because I realized that more parallel client threads have a positive impact on FreeBSD network throughput and no or a slightly negative impact on Debian (Linux).
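For reference, a typical test run looks like this (the IP address placeholder and thread count are taken from the description above):

# on VM01 (iperf server)
iperf -s

# on VM02 (iperf client): 600-second test, 5-second reporting interval,
# repeated with -P1, -P2, -P3 and -P4 parallel client threads
iperf -c [IP-OF-VM01] -t600 -i5 -P2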
Network Offload Features
I test network throughput with and without the following hardware offload capabilities (example commands for toggling them follow the list):
- rxcsum (Receive Checksum Offload) - offloads checksum validation of incoming packets from the OS to the NIC, which reduces CPU usage, speeds up packet processing, and improves efficiency (especially on high-speed interfaces like 10G/25G)
- rxcsum_ipv6 - rxcsum, IPv6 version
- txcsum (Transmit Checksum Offload) - the NIC computes checksums for outgoing packets (TCP, UDP, IP) instead of the CPU. This reduces CPU load and improves performance, especially when handling lots of network traffic.
- txcsum_ipv6 - txcsum, IPv6 version
- tso4 (TCP Segmentation Offload for IPv4) - TSO (TCP Segmentation Offload) allows the NIC (or virtual NIC) to split large TCP packets into smaller segments on its own, instead of making the CPU do it. Without tso4, the CPU breaks a large packet (say 64 KB) into many small 1500-byte segments before sending. With tso4, the CPU sends one big chunk to the NIC (e.g., 64 KB), and the NIC does the chopping into MTU-sized pieces. This brings less CPU usage, higher throughput, better efficiency on high-speed links or when sending large files. !!! Requires txcsum to work !!!
- tso6 (TCP Segmentation Offload for IPv6)
- lro (Large Receive Offload) - LRO allows the network interface (or virtual NIC) to combine multiple incoming TCP packets into one large packet before passing them up the stack to the OS. It is good because it means fewer interrupts, fewer packets for the CPU to process, higher throughput (especially on high-speed networks), better performance in TCP-heavy applications.
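A sketch of how these offloads can be toggled, assuming the FreeBSD interface is vmx0 and the Debian interface is ens192 (your interface names may differ):

# FreeBSD: disable all hardware offloads on vmx0 at runtime
ifconfig vmx0 -rxcsum -rxcsum6 -txcsum -txcsum6 -tso4 -tso6 -lro -vlanhwtso

# FreeBSD: re-enable them
ifconfig vmx0 rxcsum rxcsum6 txcsum txcsum6 tso4 tso6 lro vlanhwtso

# Debian: comparable toggling via ethtool
ethtool -K ens192 rx off tx off tso off lro off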
What is Offload Feature Impact for VM ↔ VM on same ESXi?
When two VMs talk to each other on the same ESXi host:
- Packets never leave the host, but the vNIC emulation layer still processes them.
- VMXNET3 is a paravirtualized driver, so it's optimized, but offloads still reduce CPU cycles in the guest.
- With txcsum + rxcsum, you're telling FreeBSD not to burn CPU on checksums → lower guest load.
- With tso4 + lro, the guest kernel deals with fewer packets → fewer context switches and interrupts.
RSS (Receive Side Scaling)
RSS is another important network technology to achieve high network traffic and throughput. RSS spreads incoming network traffic across multiple CPU cores by using a hash of the packet headers (IP, TCP, UDP).
Without RSS:
- All packets are processed by a single CPU core (usually core 0).
- Can cause bottlenecks on high-throughput systems.
With RSS:
- Incoming packets from different connections go to different cores.
- Much better use of multi-core CPUs.
- Improves parallelism, reduces interrupt contention, and increases throughput.
In this exercise we test the network throughput of single-vCPU virtual machines, so RSS would not help us anyway. I will focus on multi-vCPU VMs in the future.
Anyway, it seems that RSS is not implemented in FreeBSD's vmx driver for the VMXNET3 network card and is only partly implemented in the Linux VMXNET3 driver. The reason is that RSS would add overhead inside a VM.
Implementing RSS would:
- Require hashing each packet in software
- Possibly break hypervisor assumptions
- Add extra CPU work in the guest — often worse than the performance gains
In most cases, multiqueue + interrupt steering gives enough performance inside a VM without the cost of full RSS.
MSI-X
FreeBSD blacklists MSI/MSI-X (Message Signaled Interrupts) for some virtual and physical devices to avoid bugs or instability. In VMware VMs, this means that MSI-X (which allows multiple interrupt vectors per device) is disabled by default, limiting performance — especially for multiqueue RX/TX and RSS (Receive Side Scaling).
With MSI-X enabled, you get:
- Multiple interrupt vectors per device
- Support for multiqueue RX and TX paths
- Better support for RSS (load balancing packet processing across CPUs)
- Improved throughput and reduced latency — especially on multi-core VMs
Overriding the MSI-X blacklist affects all PCI devices, not just vmx, so it should be tested carefully on production VMs. On ESXi 6.7+ and FreeBSD 12+, MSI-X is generally stable for vmxnet3.
This is another potential improvement for multi-vCPU VMs, but it should not help us in a single-vCPU VM.
I have tested it and it really does not help in a single-vCPU VM. I will definitely test this setting along with RSS and RX/TX queues in future parts of this series about FreeBSD network throughput, when I test the impact of multiple vCPUs and network queues.
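For reference, the loader tunable typically used to override the blacklist is shown below; as noted, it affects all PCI devices, so test it carefully:

# /boot/loader.conf
hw.pci.honor_msi_blacklist="0"   # allow MSI/MSI-X even for blacklisted devices (default 1)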
Interrupt Service Routines (net.isr)
By default, FreeBSD uses a single thread to process all network traffic in accordance with the strong ordering requirements found in some protocols, such as TCP.
net.isr.maxthreads
In order to increase potential packet-processing concurrency, net.isr.maxthreads can be defined as "-1", which automatically enables as many netisr threads as there are CPU cores in the machine. All CPU cores can then be used for packet processing and the system is no longer limited to a single thread running on a single CPU core.
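A minimal sketch of the boot-time tunables; net.isr.bindthreads is commonly set together with it to pin each netisr thread to a core:

# /boot/loader.conf
net.isr.maxthreads="-1"   # one netisr thread per CPU core (default 1)
net.isr.bindthreads="1"   # bind each netisr thread to a CPU core (default 0)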
As we are testing TCP network throughput on a single-core machine, this is not going to help us.
net.isr.defaultqlimit
The net.isr.defaultqlimit setting in FreeBSD controls the queue length for Interrupt Service Routines (ISR), which are part of the network packet processing pipeline. Specifically, this queue holds incoming network packets that are being processed in the interrupt handler before being passed up to higher layers (e.g., the TCP/IP stack). The ISR queues help ensure that network packets are processed efficiently without being dropped prematurely when the network interface card (NIC) is receiving packets at a high rate.
We can experiment with different values. The default value is 256, but for high-speed networks, we might try values like 512 or 1024.
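This is a boot-time tunable, so it goes into /boot/loader.conf; the value below is an example, not a measured optimum:

# /boot/loader.conf
net.isr.defaultqlimit="1024"   # netisr per-protocol queue length (default 256)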
net.isr.dispatch
The net.isr.dispatch sysctl in FreeBSD controls how inbound network packets are processed in relation to the netisr (network interrupt service routines) system. This is central to FreeBSD's network stack parallelization.
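The dispatch policy can be checked and changed at runtime; the recognized values are direct, hybrid and deferred:

# show the current dispatch policy (direct is the usual default)
sysctl net.isr.dispatch

# queue packets to netisr threads instead of processing them in the interrupt context
sysctl net.isr.dispatch=deferred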
net.link.ifqmaxlen
The net.link.ifqmaxlen sysctl in FreeBSD controls the maximum length of the interface output queue, i.e., how many packets can be queued for transmission on a network interface before packets start getting dropped.
Every network interface in FreeBSD has an output queue for packets that are waiting to be transmitted. net.link.ifqmaxlen defines the default maximum number of packets that can be held in this queue. If the queue fills up (e.g., due to a slow link or CPU bottleneck), additional packets are dropped until space becomes available again.
The default value is typically 50, which can be too low for high-throughput scenarios.
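This is again a boot-time tunable; a common suggestion for Gigabit and faster links is to raise it substantially (the value below is an example, not a measured optimum):

# /boot/loader.conf
net.link.ifqmaxlen="2048"   # default interface output queue length (default 50)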
VMXNET3 RX/TX queues
FreeBSD lets you set the number of queues via loader.conf if supported by the driver.
hw.vmx.tx_queues=8
With only 1 core, there's no benefit (and typically no support) for having more than 1 TX and 1 RX queue. FreeBSD’s vmx driver will automatically limit the number of queues to match the number of cores available.
At the moment we test network throughput of single vCPU machines, therefore, we do not tune this setting.
Other FreeBSD Network Tuning
In this chapter I consider additional FreeBSD network tuning described, for example, at https://calomel.org/freebsd_network_tuning.html and in other resources on the Internet.
soreceive_stream
soreceive_stream() can significantly reduce CPU usage and lock contention when receiving fast TCP streams. Additional gains are obtained when the receiving application, like a web server, uses SO_RCVLOWAT to batch up some data before a read (and wakeup) is done.
How to enable soreceive_stream?
Add the following line to /boot/loader.conf:
net.inet.tcp.soreceive_stream="1" # (default 0)
How to check the status of soreceive_stream?
sysctl -a | grep soreceive
soreceive_stream is disabled by default. During my testing I did not see any increase in network throughput with soreceive_stream enabled, so we can keep it at the default - disabled.
Congestion Control Algorithms
There are several TCP congestion control algorithms in FreeBSD and Debian. FreeBSD offers cubic (default), newreno, htcp (Hamilton TCP), vegas, cdg, and chd. Debian offers cubic (default) and reno. Cubic is the default TCP congestion control algorithm in both systems; I tested all of these algorithms in FreeBSD and cubic is optimal.
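A quick sketch of how to list and switch the algorithm on both systems (alternative FreeBSD algorithms may need their cc_* kernel module loaded first):

# FreeBSD
sysctl net.inet.tcp.cc.available        # list available algorithms
kldload cc_htcp                         # load an alternative algorithm module
sysctl net.inet.tcp.cc.algorithm=htcp   # switch the active algorithm

# Debian (Linux)
sysctl net.ipv4.tcp_available_congestion_control
sysctl -w net.ipv4.tcp_congestion_control=reno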
TCP Stack
FreeBSD 14.2 currently supports three TCP stacks.
- freebsd (default FreeBSD TCP stack)
- rack (RACK-TLP Loss Detection Algorithm for TCP aka Recent ACKnowledgment is modern loss recovery stack)
- bbr (Bottleneck Bandwidth and Round-Trip Time Algorithm)
I found out that the default FreeBSD TCP stack has the highest throughput in a data center network, so changing the TCP stack does not help to increase network throughput.
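For completeness, switching stacks on FreeBSD looks roughly like this, assuming the rack and bbr modules are available for the installed kernel (they depend on the TCP high-precision timer system, tcphpts):

kldload tcp_rack                                # or tcp_bbr
sysctl net.inet.tcp.functions_available         # list registered TCP stacks
sysctl net.inet.tcp.functions_default=rack      # use rack for new connections
sysctl net.inet.tcp.functions_default=freebsd   # revert to the default stack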
Here are the IPv4 iperf TCP test results on FreeBSD:
All Network Hardware Offload Features disabled
soreceive_stream disabled (default), net.isr.defaultqlimit = 256 (default)
- vmx interface -rxcsum -rxcsum6 -txcsum -txcsum6 -tso4 -tso6 -lro -vlanhwtso mtu 1500
- soreceive_stream = 0 (disabled, default)
- net.isr.defaultqlimit = 256 (default)
Almost all Network Hardware Offload Features enabled but LRO disabled (FreeBSD default)
soreceive_stream disabled (default), net.isr.defaultqlimit = 256 (default)
- vmx interface rxcsum rxcsum6 txcsum txcsum6 tso4 tso6 -lro vlanhwtso mtu 1500
- soreceive_stream = 0 (disabled, default)
- net.isr.defaultqlimit = 256 (default)
All Network Hardware Offload Features including LRO
soreceive_stream disabled (default), net.isr.defaultqlimit = 256 (default)
- vmx interface rxcsum rxcsum6 txcsum txcsum6 tso4 tso6 lro vlanhwtso mtu 1500
- soreceive_stream = 0 (disabled, default)
- net.isr.defaultqlimit = 256 (default)
All Network Hardware Offload Features including LRO + Jumbo Frames
soreceive_stream disabled (default), net.isr.defaultqlimit = 256 (default)
- vmx interface rxcsum rxcsum6 txcsum txcsum6 tso4 tso6 lro vlanhwtso mtu 9000
- soreceive_stream = 0 (disabled, default)
- net.isr.defaultqlimit = 256 (default)