I have just realized that PureStorage has 150TB DirectFlash Modules.
That got me thinking.
Flash capacity is increasing year by year. What are performance/capacity ratios?
This will be a quick blog post, prompted by another question I received about VMware virtual NIC link speed. In this blog post I’d like to demonstrate that the virtual link speed shown in operating systems is merely a reported value and not an actual limit on throughput.
I have two Linux Mint (Debian-based) systems, mlin01 and mlin02, virtualized in VMware vSphere 8.0.3. Each system has a VMXNET3 NIC. Both virtual machines are hosted on the same ESXi host, so they are not constrained by the physical network. Let's test network bandwidth between these two systems with iperf.
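Below is a minimal sketch of the test, assuming iperf is installed on both systems and that mlin01 has the hypothetical IP address 192.168.1.11.

# On mlin01 - run the iperf server
iperf -s

# On mlin02 - run the client against mlin01 (hypothetical IP 192.168.1.11),
# 60-second test with results reported every 5 seconds
iperf -c 192.168.1.11 -t 60 -i 5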
In VMware vSphere environments, even the most critical business applications are often virtualized. Occasionally, application owners may report high disk latency issues. However, disk I/O latency can be a complex topic because it depends on several factors, such as the size of the I/O operations, whether the I/O is a read or a write and in which ratio, and of course, the performance of the underlying storage subsystem.
One of the most challenging aspects of any storage troubleshooting is understanding what size of I/O workload is being generated by the virtual machine. Storage workload I/O size is a significant factor in response time. Response times differ for 4 KB I/O and 1 MB I/O. Here are examples from my vSAN ESA performance testing.
You can see that response times vary based on the storage profile. However, application owners very often do not know the storage profile of their application workload and just complain that the storage is slow.
As one storage expert (I think it was Howard Marks [1] [2]) once said, there are only two types of storage performance: good enough and not good enough. Fortunately, on an ESXi host, we have a useful tool called vscsiStats. We just need to know on which ESXi host the VM is running and SSH into that particular ESXi host.
The vscsiStats monitoring procedure is documented in the VMware KB article Using vscsiStats to collect IO and Latency stats on Virtual Disks.
Let's test it in the lab.
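A minimal sketch of the procedure from the ESXi Shell follows; the world group ID (shown here as the hypothetical value 123456) is the one reported for your VM by the list command.

# List running VMs and their world group IDs
vscsiStats -l

# Start collection for the VM's world group ID (hypothetical value 123456)
vscsiStats -s -w 123456

# Print the I/O length histogram to see the workload's I/O size distribution
vscsiStats -p ioLength -w 123456

# Print the latency histogram
vscsiStats -p latency -w 123456

# Stop collection when done
vscsiStats -x -w 123456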
Here is what happened with VMware Site Recovery Manager. It was repackaged into VMware Live Recovery.
VMware Live Recovery is the latest version of disaster and ransomware recovery from VMware. It combines VMware Live Site Recovery (previously Site Recovery Manager) with VMware Live Cyber Recovery (previously VMware Cloud Disaster Recovery) under a single shared management console and a single license. Customers can protect applications and data from modern ransomware and other disasters across VMware Cloud Foundation environments on-premises and in public clouds with flexible licensing for changing business needs and threats.
For more details see the VMware Live Recovery FAQ and the VMware Live Recovery resource page.
In this blog post I will just copy information from the Site Recovery Manager FAQ PDF, because that's what the good old on-prem SRM is, and it is good to have it in HTML form in case the Broadcom/VMware PDF disappears for whatever reason.
Here you have it ...
iperf is a great tool to test network throughput. There is an iperf3 binary on the ESXi host, but there are restrictions and you cannot run it directly.
Here is the trick.
First of all, you have to set the ESXi advanced option execInstalledOnly to 0. This enables you to run executable binaries that were not preinstalled by VMware.
The second step is to make a copy of the iperf3 binary, because the installed version is restricted and cannot be run.
The third step is to disable the ESXi firewall to allow communication between the iperf client and the iperf server across ESXi hosts.
After finishing the performance testing, you should clean up the ESXi environment.
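A minimal sketch of the whole workflow, assuming the iperf3 binary lives in /usr/lib/vmware/vsan/bin/ (paths and option names may differ between ESXi builds):

# Allow execution of binaries not installed by VMware
esxcli system settings advanced set -o /User/execInstalledOnly -i 0

# Work around the restriction on the installed binary by copying it
cp /usr/lib/vmware/vsan/bin/iperf3 /usr/lib/vmware/vsan/bin/iperf3.copy

# Disable the ESXi firewall for the duration of the test
esxcli network firewall set --enabled false

# Run the server on one ESXi host and the client on the other
/usr/lib/vmware/vsan/bin/iperf3.copy -s
/usr/lib/vmware/vsan/bin/iperf3.copy -c [IP-OF-IPERF-SERVER]

# Clean up after testing
rm /usr/lib/vmware/vsan/bin/iperf3.copy
esxcli network firewall set --enabled true
esxcli system settings advanced set -o /User/execInstalledOnly -i 1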
When we want to enable Jumbo Frames on VMware vSphere, they must be enabled end-to-end: on the physical switches, on the virtual switches (vSS/vDS), and on the VMkernel ports.
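On the vSphere side, a minimal sketch with esxcli might look like this (the vSwitch and VMkernel interface names are assumptions for illustration):

# Set MTU 9000 on a standard vSwitch (hypothetical name vSwitch1)
esxcli network vswitch standard set --vswitch-name=vSwitch1 --mtu=9000

# Set MTU 9000 on the vMotion VMkernel interface (hypothetical name vmk1)
esxcli network ip interface set --interface-name=vmk1 --mtu=9000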
Let's assume it is configured by the network and vSphere administrators and we want to validate that the vMotion network between two ESXi hosts supports Jumbo Frames. Let's say we have these two ESXi hosts.
[root@esx11:~] ping -I vmk1 -S vmotion -s 8972 -d 10.160.22.112
PING 10.160.22.112 (10.160.22.112): 8972 data bytes
8980 bytes from 10.160.22.112: icmp_seq=0 ttl=64 time=0.770 ms
8980 bytes from 10.160.22.112: icmp_seq=1 ttl=64 time=0.637 ms
8980 bytes from 10.160.22.112: icmp_seq=2 ttl=64 time=0.719 ms
We can see a successful test of large ICMP packets without fragmentation. We validated that ICMP packets with a payload of 8972 bytes can be transferred over the network without fragmentation. 8972 bytes is the maximum ICMP payload for MTU 9000 (9000 minus 20 bytes of IP header and 8 bytes of ICMP header), so this indicates that Jumbo Frames (MTU 9000) are enabled end-to-end.
Now let's try to carry ICMP packets with a payload of 8973 bytes.
[root@esx11:~] ping -I vmk1 -S vmotion -s 8973 -d 10.160.22.112
PING 10.160.22.112 (10.160.22.112): 8973 data bytes
sendto() failed (Message too long)
sendto() failed (Message too long)
sendto() failed (Message too long)
Almost 10 years ago, I gave a presentation at the local VMware User Group (VMUG) meeting in Prague, Czechia, on Metro Cluster High Availability and SRM Disaster Recovery. The slide deck is available here on Slideshare. I highly recommend reviewing the slide deck, as it clearly explains the fundamental concepts and terminology of Business Continuity and Disaster Recovery (BCDR), along with the VMware technologies used to plan, design, and implement effective BCDR solutions.
Let me briefly outline the key BCDR concepts, documents, and terms below.
RPO (Recovery Point Objective) and RTO (Recovery Time Objective) are the best-known terms in the Business Continuity and Disaster Recovery world, and I hope all IT professionals know at least these two. However, repetition is the mother of wisdom, so let's repeat what RPO and RTO are. A picture is worth 1,000 words, so look at the picture below.
[Figure: RPO and RTO]
RPO - The maximum acceptable amount of data loss measured in time. In other words, "How far back in time can we afford to go in our backups?"
RTO - The maximum acceptable time to restore systems and services (aka infrastructure) after a disaster.
WRT - The time needed after systems are restored (post-RTO) to make applications fully operational (e.g., data validation, restarting services). It’s a subset of MTD and follows RTO.
MTD - The maximum time the business can accept before the company is back up and running after a disaster. In other words, it is the total time a business can be unavailable before suffering irrecoverable damage or significant impact.
Easy, right? Not really. Disaster Recovery projects are, based on my experience, the most complex projects in IT infrastructure.
The web service available at https://ifconfig.me/ will expose the client's public IP address. This is useful when you do not know your public IP address because you are behind NAT (Network Address Translation) on some public Wi-Fi access point, or even at home behind CGNAT (Carrier-Grade NAT), which is very often used by Internet Service Providers with IPv4.
It is easy. Just use the good old fetch command, available on FreeBSD by default, to retrieve a file by Uniform Resource Locator (URL).
Oneliner:
fetch -qo - https://ifconfig.me | grep "ip_addr:"
It is easy. I use Debian, and the good old wget command is available by default. It can be used to retrieve a file by Uniform Resource Locator (URL), similarly to fetch on FreeBSD.
Oneliner:
wget -o /dev/null -O - https://ifconfig.me | grep "ip_addr:"
In Unix-like operating systems, it is very easy to leverage standard tools to get the job done.
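If curl happens to be installed, a similar one-liner should work on either system (a sketch, assuming the /ip endpoint of ifconfig.me keeps returning the plain address):

curl -s https://ifconfig.me/ip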
I have two VMware vSphere home labs with relatively old hardware (10+ years old). Even though I have upgraded the old hardware to use local SATA SSDs or even NVMe disks, the old systems do not support booting from NVMe. That's the reason I still boot my homelab ESXi hosts from USB flash disks, even though it is highly recommended not to use USB flash disks or SD cards as boot media for ESXi 7 and later. The reason for this recommendation is to keep the boot medium in a healthy state, because ESXi 7 and later writes to the boot media more frequently than earlier ESXi versions. If you want to know more details about the reasons for this recommendation, please read my older post vSphere 7 - ESXi boot media partition layout changes.
During the ESXi installation you can choose a USB disk as the boot medium and a local NVMe disk as the disk for the ESX OSDATA partition, which will be used to write ESXi system data like config backups, logs (ESX, vSAN, NSX), vSAN traces, core dumps, VM tools (directory /productLocker), and other ephemeral or persistent system files.
What does it look like, and how can you identify the ESX OSDATA partition on local disks?
Interestingly enough, if you check the disk partition layout from vCenter, you will see this 128 GB partition identified as "Legacy MBR". See the screenshot below.
[Figure: ESX OSDATA partition details in vSphere Client - Legacy MBR (128 GB)]
This is, IMHO, a little bit misleading.
You can connect directly to a particular ESXi host and check the partition diagram in the ESXi host web management. See the screenshot below.
[Figure: ESXi OSDATA partition details on an NVMe disk - VMFSL (128 GB)]
There is no information about a "Legacy MBR" partition; instead, it is identified as a "VMFSL" partition. So again, there is no explicit information about the ESXi OSDATA partition, but it is identified as the VMFSL file system. What is VMFSL? VMFSL in ESXi stands for VMware File System Logical. It is not an officially documented term in VMware's main literature, but it refers to an internal filesystem layer used by ESXi for managing system partitions and services that are not exposed like traditional datastores (VMFS or NFS).
This is, IMHO, better information than the vSphere Client provides via vCenter.
We can also list partition tables in ESXi Shell.
[root@esx22:~] partedUtil getptbl /vmfs/devices/disks/eui.0000000001000000e4d25cc9a0325001
gpt
62260 255 63 1000215216
7 2048 268435455 4EB2EA3978554790A79EFAE495E21F8D vmfsl 0
8 268435456 1000212480 AA31E02A400F11DB9590000C2911D1B8 vmfs 0
It is also listed as a VMFSL partition, but we can also see the partition type GUID.
The GUID (Globally Unique Identifier) identifies the partition type. It is not unique to an individual partition; it corresponds to a specific partition format (e.g., VMFS, VMFSL/OSDATA, etc.).
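If you want to map a partition type GUID to a human-readable name directly on the ESXi host, partedUtil can list the GUIDs it knows about (a quick sketch; the output is trimmed and its format may differ slightly between ESXi versions):

[root@esx22:~] partedUtil showGuids
 Partition Type       GUID
 vmfs                 AA31E02A400F11DB9590000C2911D1B8
 vmkDiagnostic        9D27538040AD11DBBF97000C2911D1B8
 ...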
And that's it. Now you know all the details about the ESX OSDATA partition and how to identify it.
In PART 1, I compared default installations of FreeBSD 14.2 and Debian 10.2 and performed some basic network tuning of FreeBSD to approach Debian's TCP throughput, which is, based on my testing, higher than the network throughput of FreeBSD. The testing in PART 1 was performed on Cisco UCS enterprise servers with 2x Intel Xeon CPU E5-2680 v4 @ 2.40GHz with ESXi 8.0.3. This is an approximately 9-year-old server with an Intel Xeon server-family CPU.
In this PART 2, I will continue the deep dive into network throughput tuning with some additional context and more advanced network tuning of FreeBSD and Debian. Tests will be performed on a 9-year-old consumer PC (Intel NUC 6i3SYH) with 1x Intel Core i3-6100U CPU @ 2.30GHz with ESXi 8.0.3.
The VM hardware used for the iperf tests has the following specification
I run iperf -s on VM01 and iperf -c [IP-OF-VM01] -t600 -i5 on VM02. I use the iperf parameters -P1, -P2, -P3, and -P4 to test the impact of more parallel client threads and watch the results, because I have realized that more parallel client threads have a positive impact on FreeBSD network throughput and no impact or a slightly negative impact on Debian (Linux).
I test network throughput with and without the following hardware offload capabilities
When two VMs talk to each other on the same ESXi host:
RSS (Receive Side Scaling) is another important network technology for achieving high network traffic and throughput. RSS spreads incoming network traffic across multiple CPU cores by using a hash of the packet headers (IP, TCP, UDP).
Without RSS:
With RSS:
In this exercise we test the network throughput of single-vCPU virtual machines, therefore RSS would not help us anyway. I will focus on multi-vCPU VMs in the future.
Anyway, it seems that RSS is not implemented in FreeBSD's vmx driver for the VMXNET3 network card and is only partly implemented in the VMXNET3 driver in Linux. The reason is that full RSS would add overhead inside a VM.
Implementing RSS would:
In most cases, multiqueue + interrupt steering gives enough performance inside a VM without the cost of full RSS.
FreeBSD blacklists MSI/MSI-X (Message Signaled Interrupts) for some virtual and physical devices to avoid bugs or instability. In VMware VMs, this means that MSI-X (which allows multiple interrupt vectors per device) is disabled by default, limiting performance — especially for multiqueue RX/TX and RSS (Receive Side Scaling).
With MSI-X enabled, you get:
This setting affects all PCI devices, not just vmx, so it should be tested carefully in production VMs. On ESXi 6.7+ and FreeBSD 12+, MSI-X is generally stable for vmxnet3.
This is another potential improvement for multi-vCPU VMs, but it should not help us with a single-vCPU VM.
I have tested it and it really does not help with a single-vCPU VM. I will definitely test this setting along with RSS and RX/TX queues in future parts of this series of articles about FreeBSD network throughput, when I test the impact of multiple vCPUs and network queues.
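For reference, the knob involved is the MSI blacklist tunable. This is a sketch of what I would set in /boot/loader.conf when experimenting; verify it against your FreeBSD release before using it in production, since it affects all PCI devices.

# /boot/loader.conf
# Do not honor the PCI MSI/MSI-X blacklist, so vmxnet3 can use MSI-X vectors
hw.pci.honor_msi_blacklist="0"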
By default, FreeBSD uses a single thread to process all network traffic in accordance with the strong ordering requirements found in some protocols, such as TCP.
In order to increase potential packet-processing concurrency, net.isr.maxthreads can be defined as "-1", which automatically enables as many netisr threads as there are CPU cores in the machine. Then all CPU cores can be used for packet processing and the system is not limited to a single thread running on a single CPU core.
As we are testing TCP network throughput on a single-CPU-core machine, this is not going to help us.
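For completeness, this is a boot-time tunable; a sketch of the /boot/loader.conf lines one would use on a multi-core machine (net.isr.bindthreads is an optional companion setting, not something the original tuning required):

# /boot/loader.conf
net.isr.maxthreads="-1"   # one netisr thread per CPU core
net.isr.bindthreads="1"   # pin each netisr thread to its CPU core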
The net.isr.defaultqlimit setting in FreeBSD controls the queue length for Interrupt Service Routines (ISR), which are part of the network packet processing pipeline. Specifically, this queue holds incoming network packets that are being processed in the interrupt handler before being passed up to higher layers (e.g., the TCP/IP stack). The ISR queues help ensure that network packets are processed efficiently without being dropped prematurely when the network interface card (NIC) is receiving packets at a high rate.
We can experiment with different values. The default value is 256, but for high-speed networks, we might try values like 512 or 1024.
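This is also a loader tunable; a sketch of the /boot/loader.conf line for experimenting with a larger queue:

# /boot/loader.conf
net.isr.defaultqlimit="1024"   # default is 256; try 512 or 1024 for high-speed networks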
The net.isr.dispatch sysctl in FreeBSD controls how inbound network packets are processed in relation to the netisr (network interrupt service routines) system. This is central to FreeBSD's network stack parallelization.
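The dispatch policy can be inspected and changed at runtime with sysctl; the accepted values are direct, hybrid, and deferred (a sketch; check sysctl -d net.isr.dispatch on your release for details).

# Show the current dispatch policy
sysctl net.isr.dispatch

# Switch to deferred dispatch (packets are queued to netisr threads instead of
# being processed directly in the interrupt context)
sysctl net.isr.dispatch=deferred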
The net.link.ifqmaxlen sysctl in FreeBSD controls the maximum length of the interface output queue, i.e., how many packets can be queued for transmission on a network interface before packets start getting dropped.
Every network interface in FreeBSD has an output queue for packets that are waiting to be transmitted. net.link.ifqmaxlen defines the default maximum number of packets that can be held in this queue. If the queue fills up (e.g., due to a slow link or CPU bottleneck), additional packets are dropped until space becomes available again.
The default value is typically 50, which can be too low for high-throughput scenarios.
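net.link.ifqmaxlen is a boot-time tunable, so it goes into /boot/loader.conf (a sketch; the value 2048 is just a common starting point, not a recommendation):

# /boot/loader.conf
net.link.ifqmaxlen="2048"   # default is 50; raise for high-throughput interfaces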
FreeBSD lets you set the number of queues via loader.conf if supported by the driver.
With only 1 core, there's no benefit (and typically no support) for having more than 1 TX and 1 RX queue. FreeBSD’s vmx driver will automatically limit the number of queues to match the number of cores available.
At the moment we test network throughput of single vCPU machines, therefore, we do not tune this setting.
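For later multi-vCPU testing, the iflib-based vmx driver exposes per-device queue overrides. The sketch below shows the tunables I expect to use; the names follow the iflib convention and should be verified against your FreeBSD version.

# /boot/loader.conf
# iflib per-device overrides for the first vmxnet3 NIC (multi-vCPU VMs only)
dev.vmx.0.iflib.override_ntxqs="4"
dev.vmx.0.iflib.override_nrxqs="4"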
In this chapter I will consider additional FreeBSD network tuning described for example at https://calomel.org/freebsd_network_tuning.html and other resources over the Internet.
soreceive_stream() can significantly reduce CPU usage and lock contention when receiving fast TCP streams. Additional gains are obtained when the receiving application, like a web server, uses SO_RCVLOWAT to batch up some data before a read (and wakeup) is done.
How to enable soreceive_stream?
Add following line to /boot/loader.conf
net.inet.tcp.soreceive_stream="1" # (default 0)
How to check the status of soreceive_stream?
sysctl -a | grep soreceive
soreceive_stream is disabled by default. During my testing I have not seen any increase in network throughput with soreceive_stream enabled, therefore we can keep it at the default - disabled.
There are several TCP congestion control algorithms in FreeBSD and Debian. FreeBSD offers cubic (default), newreno, htcp (Hamilton TCP), vegas, cdg, and chd. Debian offers cubic (default) and reno. Cubic is the default TCP congestion control algorithm in both FreeBSD and Debian. I tested all of these algorithms on FreeBSD, and cubic is optimal.
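To check or switch the algorithm on both systems, something like the following should work (a sketch; on FreeBSD, non-default algorithms must first be loaded as cc_* kernel modules):

# FreeBSD: list available algorithms and switch to H-TCP
sysctl net.inet.tcp.cc.available
kldload cc_htcp
sysctl net.inet.tcp.cc.algorithm=htcp

# Debian (Linux): list available algorithms and switch back to cubic
sysctl net.ipv4.tcp_available_congestion_control
sysctl -w net.ipv4.tcp_congestion_control=cubic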
FreeBSD 14.2 currently supports three TCP stacks: the default freebsd stack, RACK, and BBR.
I found that the default FreeBSD TCP stack has the highest throughput in a data center network, therefore changing the TCP stack does not help increase network throughput.
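For the record, this is how the available stacks can be listed and switched (a sketch; the tcp_rack and tcp_bbr modules must be loaded, and depending on the FreeBSD build they may require a kernel compiled with the extra TCP stacks):

# List TCP stacks compiled in or loaded, and show the current default
sysctl net.inet.tcp.functions_available
sysctl net.inet.tcp.functions_default

# Load the RACK stack and make it the default for new connections
kldload tcp_rack
sysctl net.inet.tcp.functions_default=rack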