VCDX #200 The Ultimate Way to Virtualize
Blog of one VMware Infrastructure Designer: 2014

Wednesday, December 31, 2014

esxcli formatters

I have just read the great blog post "Hidden esxcli Command Output Formats You Probably Don’t Know", in which the author (Steve Jin) exposes undocumented esxcli options for choosing different output formatters. The following esxcli formatters are available:

xml
csv
keyvalue

Here is an example of one particular esxcli command without a formatter.

~ # esxcli system version get
   Product: VMware ESXi
   Version: 5.5.0
   Build: Releasebuild-1746018
   Update: 1

Here is an example with the csv formatter

~ # esxcli --formatter=csv system version get
Build,Product,Update,Version,
Releasebuild-1746018,VMware ESXi,1,5.5.0,

Here is an example with the keyvalue formatter

~ # esxcli --formatter=keyvalue system version get
VersionGet.Build.string=Releasebuild-1746018
VersionGet.Product.string=VMware ESXi
VersionGet.Update.string=1
VersionGet.Version.string=5.5.0
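
To illustrate why these formats are easier to consume in scripts, here is a minimal Python sketch that parses the two sample outputs shown above. The function names are my own, and the assumption that keyvalue keys always follow the Struct.Field.type pattern is based only on this one sample.

```python
import csv
import io

# Sample outputs copied from the examples above.
CSV_OUTPUT = """Build,Product,Update,Version,
Releasebuild-1746018,VMware ESXi,1,5.5.0,
"""

KEYVALUE_OUTPUT = """VersionGet.Build.string=Releasebuild-1746018
VersionGet.Product.string=VMware ESXi
VersionGet.Update.string=1
VersionGet.Version.string=5.5.0
"""


def parse_csv(text):
    """Return one dict per data row, ignoring the empty field after the trailing comma."""
    rows = list(csv.reader(io.StringIO(text)))
    header = [name for name in rows[0] if name]
    # zip() stops at the shorter sequence, so the trailing empty field is dropped.
    return [dict(zip(header, row)) for row in rows[1:] if row]


def parse_keyvalue(text):
    """Map the field name in 'Struct.Field.type=value' to its value."""
    result = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        parts = key.split(".")
        # Assumption: keys look like <Struct>.<Field>.<type>, as in the sample.
        field = parts[1] if len(parts) >= 3 else key
        result[field] = value
    return result


csv_rows = parse_csv(CSV_OUTPUT)
kv = parse_keyvalue(KEYVALUE_OUTPUT)
print(csv_rows[0]["Version"], kv["Build"])
```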

It is quite obvious what the xml formatter output looks like, right? All these formatters help with output parsing, which is very useful for automation scripts. Even more formatters are available when the debug option is used. The following esxcli debug formatters are available:

python
json
html
table
tree
simple

Here is an example with the json formatter

~ # esxcli --debug --formatter=json system version get
{"Build": "Releasebuild-1746018", "Product": "VMware ESXi", "Update": "1", "Version": "5.5.0"}
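
The json output can be loaded directly with a standard JSON parser. A minimal Python sketch, using the sample line above as input (the variable names are mine, and splitting the build string on the last dash is an assumption based on this one sample):

```python
import json

# Sample output copied from the json formatter example above.
JSON_OUTPUT = '{"Build": "Releasebuild-1746018", "Product": "VMware ESXi", "Update": "1", "Version": "5.5.0"}'

version_info = json.loads(JSON_OUTPUT)

# All values are strings in this sample, so convert explicitly where a number is needed.
update_level = int(version_info["Update"])
# Assumption: the build number is the part after the last dash.
build_number = int(version_info["Build"].rsplit("-", 1)[1])

print(version_info["Version"], update_level, build_number)
```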

Conclusion

esxcli formatters can be very handy for those who do automation and need easier parsing of esxcli output. You probably know that esxcli can also be used from PowerCLI, so formatters can significantly simplify the script code.

Friday, December 19, 2014

Galera Cluster for MySQL vs MySQL (NDB) Cluster

In the past, I used the MySQL database for a lot of projects, so I was really interested in the progress of MySQL clustering technologies and in what is possible today. That is why I attended a very interesting webinar about MySQL clustering possibilities. The official webinar name is "Galera Cluster for MySQL vs MySQL (NDB) Cluster. A High Level Comparison" and the Webinar Replay & Slides are available online here.

I was very impressed by what is possible today. Even though I have moved from software architecture and development to datacenter infrastructure architecture and consulting, it is good to know what database clustering options are available nowadays.

Monday, December 15, 2014

Disk queue depth in an ESXi environment

There is a lot of information and many documents on the internet about ESXi and disk queue depth, but I didn't find a single document that explains all the details I would like to know in an easily consumable format. Different vendors have their specific recommendations and best practices, but without a deeper explanation of the principles and without a holistic view. Some documents are incomplete and others contain, in my opinion, wrong or at least strange information. That's the reason I decided to write this blog post, mainly for my own purposes, but I hope it helps other people from the VMware community.

During my research I found Frank Denneman's blog post from March 2009, where almost everything is explained very well and accurately, but the technology has changed a little over time. Another great document is here on VMware Community Docs.

Disclaimer: I work for DELL Services, so I work on a lot of projects with DELL Compellent Storage Systems, DELL Servers and QLogic HBAs. This blog post is therefore based on "DELL Compellent Best Practices with VMware vSphere 5.x" and "Qlogic Best Practices for VMware vSphere 5.x". I have all this equipment in the lab, so I was able to test it and verify what's going on. Although this post contains information about specific hardware, the general principles are the same for any other HBAs and storage systems.

Some sections are simply copied and pasted from the public documents mentioned above, so all credit goes there. Other sections are based on my lab tests, experiments and thoughts.

What is queue depth?

Queue depth is defined as the number of disk transactions that are allowed to be “in flight” between an initiator and a target (parallel I/O transaction), where the initiator is typically an ESXi host HBA port/iSCSI initiator and the target is typically the Storage Center front-end port. Since any given target can have multiple initiators sending it data, the initiator queue depth is generally used to throttle the number of parallel transactions being sent to a target to keep it from becoming “flooded”. When this happens, the transactions start to pile up causing higher latencies and degraded performance. That being said, while increasing the queue depth can sometimes increase performance, if it is set too high, there is an increased risk of over-driving (over-loading and saturating) the storage array. As data travels between the application and the storage array, there are several places that the queue depth can be set to throttle the number of concurrent (parallel) disk transactions. The most common places queue depth can be modified are:

The application itself (Default=dependent on application)
The virtual SCSI card driver in the guest (Default=32)
The VMFS layer (DSNRO) (Default=32)
The HBA VMkernel Module driver (Default=64)
The HBA BIOS (Default=Varies)

HBA Queue Depth = AQLEN

(Default Varies but in my QLogic card it is 2176)

Here are the Compellent recommended BIOS settings for a QLogic Fibre Channel card:

1. The “connection options” field should be set to 1 for point to point only
2. The “login retry count” field should be set to 60 attempts
3. The “port down retry” count field should be set to 60 attempts
4. The “link down timeout” field should be set to 30 seconds
5. The “queue depth” (or “Execution Throttle”) field should be set to 255. This queue depth can be set to 255 because the ESXi VMkernel driver module and DSNRO can more conveniently control the queue depth

I don't think point (5) is correct and up to date in the Compellent Best Practices. According to the QLogic document, an HBA "Queue Depth" setting doesn't exist in the QLogic BIOS and "Execution Throttle" is not used anymore. The QLogic adapter queue depth is managed only by the driver, and the default value is 64 for vSphere 5.x. For more info look here.

I believe the QLogic Host Bus Adapter has a queue depth of 2176 because this number is visible in esxtop as AQLEN (Adapter Queue Length) for each particular HBA (vmhba).

HBA LUN Queue Depth

(Default=64)
HBA LUN Queue Depth is 64 by default but can be changed via the VMkernel module driver. Compellent recommends setting the HBA max queue depth to 255, but this queue depth is used only when a single VM is running on a single LUN (disk). If more VMs are running on the LUN, then DSNRO takes precedence. However, when SIOC (Storage I/O Control) is used, DSNRO is not used at all and the HBA VMkernel module driver queue depth value is used for each device queue depth (DQLEN). I think SIOC is beneficial in this situation because you can have deeper queues with dynamic queue management across all ESXi hosts connected to a particular LUN. Please note that special attention has to be paid to the correct SIOC latency threshold, especially on LUNs with sub-LUN tiering.

If you want to check your Disk (aka LUN) Queue Depth value it is nicely visible in esxtop as DQLEN.

So here is the procedure for changing the HBA VMkernel module driver queue depth.

Find the appropriate driver name for the module that is loaded for QLogic HBA:

~ # esxcli system module list | grep ql
qlnativefc true true

In our case, the name of the module for the QLogic driver is qlnativefc.

Set the QLogic driver queue depth and timeouts using the esxcli command:

esxcli system module parameters set -m qlnativefc -p "ql2xmaxqdepth=255 ql2xloginretrycount=60 qlport_down_retry=60"

You have to reboot the ESXi host to apply the module changes.

Below is a description of the QLogic parameters we have just changed:

ql2xmaxqdepth (int) - Maximum queue depth to report for target devices.
ql2xloginretrycount (int) - Specify an alternate value for the NVRAM login retry count.
qlport_down_retry (int) - Maximum number of command retries to a port that returns a PORT-DOWN status.

You can verify QLogic driver options by the following command:

esxcfg-module --get-options qlnativefc

The effect of the change (after the ESXi reboot) is visible in esxtop on disk devices as DQLEN.

VMFS Layer DSNRO

(Default=32)
When two or more virtual machines share a LUN (logical unit number), this parameter controls the total number of outstanding commands permitted from all virtual machines collectively on the host to that LUN (this setting is not per virtual machine). For more information see KB

The HBA LUN queue depth (DQLEN) determines how many commands the HBA is willing to accept and process per LUN and target. If a single virtual machine is issuing IO, the HBA LUN Queue Depth (DQLEN) setting is applicable. When multiple VMs are simultaneously issuing IOs to the LUN, the ESX parameter Disk.SchedNumReqOutstanding (DSNRO) value becomes the leading parameter, and HBA LUN Queue Depth is ignored. The above statement is only true when SIOC is not enabled on the LUN (datastore).

The parameter can be set per device to a value between 1 and 256 using the -O option of the following command:

esxcli storage core device set -d -O

The number of outstanding IOs with competing worlds (DSNRO) can be listed for each device:

~ # esxcli storage core device list -d naa.6000d3100025e7000000000000000098
naa.6000d3100025e7000000000000000098
Display Name: COMPELNT Fibre Channel Disk (naa.6000d3100025e7000000000000000098)
Has Settable Display Name: true
Size: 1048576
Device Type: Direct-Access
Multipath Plugin: NMP
Devfs Path: /vmfs/devices/disks/naa.6000d3100025e7000000000000000098
Vendor: COMPELNT
Model: Compellent Vol
Revision: 0504
SCSI Level: 5
Is Pseudo: false
Status: on
Is RDM Capable: true
Is Local: false
Is Removable: false
Is SSD: false
Is Offline: false
Is Perennially Reserved: false
Queue Full Sample Size: 0
Queue Full Threshold: 0
Thin Provisioning Status: yes
Attached Filters:
VAAI Status: supported
Other UIDs: vml.02000200006000d3100025e7000000000000000098436f6d70656c
Is Local SAS Device: false
Is Boot USB Device: false
No of outstanding IOs with competing worlds: 32
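
If you want to read the DSNRO value in a script rather than by eye, the listing above can be turned into a dictionary. Below is a minimal Python sketch that works on a shortened copy of the output. The function name is mine, and the rule used to tell device header lines from property lines (a property contains a colon followed by a space or ends with a colon) is an assumption derived from this listing only.

```python
# Shortened copy of the "esxcli storage core device list" output shown above.
DEVICE_LIST_OUTPUT = """naa.6000d3100025e7000000000000000098
Display Name: COMPELNT Fibre Channel Disk (naa.6000d3100025e7000000000000000098)
Vendor: COMPELNT
Attached Filters:
Queue Full Sample Size: 0
Queue Full Threshold: 0
No of outstanding IOs with competing worlds: 32
"""

DSNRO_KEY = "No of outstanding IOs with competing worlds"


def parse_device_list(text):
    """Return {device_id: {property: value}} from the plain-text device listing."""
    devices = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Assumption: property lines contain ": " or end with ":"; anything else
        # is a device identifier (identifiers may themselves contain colons).
        if ": " in line or line.endswith(":"):
            if current is None:
                continue
            key, _, value = line.partition(":")
            devices[current][key.strip()] = value.strip()
        else:
            current = line
            devices[current] = {}
    return devices


devices = parse_device_list(DEVICE_LIST_OUTPUT)
for device_id, props in devices.items():
    print(device_id, "DSNRO =", int(props[DSNRO_KEY]))
```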

The issue with DSNRO is that it must be set per device (LUN) on all ESXi hosts, which has a negative impact on implementation and overall manageability. It is usually left at the default value anyway, because 32 outstanding (async) IOs is good enough on low latency SANs and properly designed storage devices with good response times. Bigger queues are beneficial on higher latency SANs and/or storage systems which need more parallelism (more parallel threads) to draw out as many I/Os as the storage is physically capable of. It doesn't help in situations when the storage doesn't have enough back-end performance (usually spindles). Bigger queues can generate more outstanding IOs to the storage system, which can overload the storage and make everything even worse. That's the reason why I think it is a good idea to leave DSNRO at the default value.

And once again. DSNRO is not used when SIOC is enabled because in that case "HBA LUN queue depth" is the leading parameter for disk queue depth (DQLEN).

Storage Target Port Queue Depth

Queues exist in a lot of places and you have to understand the path from end to end. A queue exists on the storage array controller port as well; this is called the “Target Port Queue Depth“. Modern midrange storage arrays, like most EMC and HP arrays, can handle around 2048 outstanding IOs.
2048 IOs sounds like a lot, but most of the time multiple servers communicate with the storage controller at the same time. Because a port can only service one request at a time, additional requests are placed in the queue, and when the storage controller port receives more than 2048 IO requests, the queue gets flooded. When the queue depth is reached (this status is called QFULL), the storage controller issues an IO throttling command to the host to suspend further requests until space in the queue becomes available. The ESXi host accepts the IO throttling command and decreases the LUN queue depth to the minimum value, which is 1! The QFULL condition is handled by the QLogic driver itself; the QFULL status is not returned to the OS. But some storage devices return BUSY rather than QFULL. BUSY errors are logged in /var/log/vmkernel. Not every BUSY error is a QFULL error! Check the SCSI sense codes indicated in the vmkernel message to determine what type of error it is. The VMkernel checks every 2 seconds whether the QFULL condition is resolved. If it is, the VMkernel slowly increases the LUN queue depth back to its normal value; this can take up to 60 seconds.

Compellent SC8000 front-end ports act as targets and have a Target Port Queue Depth of 1900+. What does 1900+ mean? Compellent is an excellent storage operating system running on top of commodity servers, IO cards, disk enclosures and disks. By the way, is it software-defined storage or not? But back to queues: 1900+ means that the queue depth depends on the IO card used for the front-end port, but 1900 is the guaranteed minimum. So for our case let's assume a Storage Target Port Queue Depth of 2048.

All together will give us the holistic view
Here are the end to end disk queue parameters:

DSNRO is 32 - this is the default ESXi value.
HBA LUN Queue Depth is 64 by default in QLogic HBAs. Compellent recommends changing it to 255 via a parameter of the HBA VMkernel module driver.
HBA Queue Depth is 2176.
Compellent Target Port Queue Depths are 2048.

Everything is depicted in the figure below.

Disk Queue Depth - Holistic View

What is the optimal HBA LUN Queue Depth?

We already know 64 is the default value for QLogic and you can change it via VMkernel Module driver. So what is the optimal HBA LUN Queue Depth?

To prevent flooding the target port queue, the product of the number of host paths, the HBA LUN Queue Depth and the number of LUNs presented through the host port must not exceed the target port queue depth. In short T >= P * Q * L

T = Target Port Queue Depth (2048)
P = Paths connected to the target port (1)
Q = HBA LUN Queue Depth (?)
L = number of LUN presented to the host through this port (20)

So when I have 20 LUNs exposed from the Compellent storage system, the HBA LUN Queue Depth should be at most Q = T / P / L = 2048 / 1 / 20 = 102.4 for a single ESXi host. When I have 16 hosts, this value must be divided by 16, so the value for this particular scenario is 6.4.
This number is without any overbooking, which is not realistic. The current QLogic HBA LUN Queue Depth default value (64) introduces a fan-in / fan-out ratio of 10:1, which is probably based on long-term experience with virtualized workloads. In the past, QLogic had lower default values, 16 in ESX 3.5 and 32 in ESX 4.x, which gave lower fan-in / fan-out ratios (2.5:1 and 5:1).
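
The arithmetic above is easy to repeat for your own environment. Here is a minimal Python sketch of the T >= P * Q * L calculation with the numbers from this example. The function names are mine, and dividing evenly across hosts is the same simplification used in the text.

```python
def max_lun_queue_depth(target_port_qd, paths, luns, hosts=1):
    """Largest Q that satisfies T >= P * Q * L when the port is shared by `hosts` hosts."""
    return target_port_qd / (paths * luns * hosts)


def fan_in_ratio(configured_qd, non_overbooked_qd):
    """How many times the configured queue depth overbooks the target port."""
    return configured_qd / non_overbooked_qd


single_host = max_lun_queue_depth(2048, paths=1, luns=20)
sixteen_hosts = max_lun_queue_depth(2048, paths=1, luns=20, hosts=16)
print("single host:", round(single_host, 1))
print("16 hosts:", round(sixteen_hosts, 1))

# Historical and current QLogic defaults mentioned in the text.
for label, qd in (("ESX 3.5", 16), ("ESX 4.x", 32), ("vSphere 5.x", 64)):
    print(label, "ratio", round(fan_in_ratio(qd, sixteen_hosts), 1), ": 1")
```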

With the Compellent-recommended QLogic HBA LUN Queue Depth of 255 we will have, in this particular case, a fan-in / fan-out ratio of about 40:1. But only when SIOC is used. And when SIOC is used and the normalized datastore latency is higher than the SIOC threshold, dynamic queue management kicks in and disk queues are automatically throttled. So all is good.

What is my real Disk Queue Depth?

VMkernel Disk Queue Depth (DQLEN) is the number that matters. However, this number is dynamic and it depends on several factors. Here are scenarios you can have:

Scenario 1 - SIOC enabled
When you have a datastore with SIOC enabled, your DQLEN will be set to the HBA LUN Queue Depth. It is 64 by default, but Compellent recommends setting it to 255. However, Compellent doesn't recommend using SIOC.
Scenario 2 - SIOC disabled and you have only one VM on the datastore
When you have just one VM on the datastore, your DQLEN will be set to the HBA LUN Queue Depth. But how often do you have just one VM on a datastore?
Scenario 3 - SIOC disabled and you have two or more VMs on the datastore
In this scenario your DQLEN will be set to DSNRO, which is 32 by default and has precedence over the HBA LUN Queue Depth.
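
The three scenarios can be restated as a small decision function. This Python sketch only encodes the rules as described in this post; it is not how the VMkernel implements them, and the function and parameter names are mine.

```python
def effective_dqlen(sioc_enabled, vms_on_datastore, hba_lun_qdepth=64, dsnro=32):
    """Return the expected DQLEN according to the three scenarios above."""
    if sioc_enabled:
        # Scenario 1: with SIOC the HBA LUN queue depth is the leading parameter.
        return hba_lun_qdepth
    if vms_on_datastore <= 1:
        # Scenario 2: a single VM is not limited by DSNRO.
        return hba_lun_qdepth
    # Scenario 3: two or more VMs compete, so DSNRO takes precedence.
    return dsnro


print(effective_dqlen(sioc_enabled=True, vms_on_datastore=10, hba_lun_qdepth=255))
print(effective_dqlen(sioc_enabled=False, vms_on_datastore=1))
print(effective_dqlen(sioc_enabled=False, vms_on_datastore=10))
```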

ESXi Disk Queue Management - None, Adaptive Queuing or Storage I/O Control?

By default, ESXi doesn't use any Disk Queue Length (DQLEN) throttling mechanism and your DQLEN is set to DSNRO (32) when two or more VM (vDisks) are on the datastore. When only one VM disk is on the datastore your DQLEN will be set to HBA LUN Queue Depth (default=64, Compellent recommends 255).

When you have enabled Adaptive Queuing or Storage I/O Control your Disk Queue Length (DQLEN) will be throttled automatically when I/O congestion occurs. The difference between Adaptive Queuing and SIOC is how I/O congestion is detected.

Adaptive queuing waits for a SCSI sense code of BUSY or QUEUE FULL status from the storage on the I/O path. For adaptive queueing, there are some advanced parameters that must be set on a per-host basis. From the Configuration tab, select Software Advanced Settings. Navigate to Disk; the two parameters you need are Disk.QFullSampleSize and Disk.QFullThreshold. By default, QFullSampleSize is set to 0, meaning that adaptive queueing is disabled. When this value is set, adaptive queueing will kick in and halve the queue depth when this number of queue full conditions is reported by the array. The QFullThreshold is the number of good statuses to receive before incrementing the queue depth once again.
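
To make the halving and incrementing behaviour more tangible, here is a toy Python model of the description above. It is a simplified sketch and not the VMkernel algorithm: the class name, the counter resets and the sample values (sample size 2, threshold 3) are my own assumptions chosen only for illustration.

```python
class AdaptiveQueue:
    """Toy model: halve the depth after N queue-full reports, grow by 1 after M good statuses."""

    def __init__(self, max_depth=32, qfull_sample_size=0, qfull_threshold=0):
        self.max_depth = max_depth
        self.depth = max_depth
        self.qfull_sample_size = qfull_sample_size  # 0 means adaptive queuing is disabled
        self.qfull_threshold = qfull_threshold
        self.full_count = 0
        self.good_count = 0

    def on_status(self, queue_full):
        """Feed one I/O completion status and return the resulting queue depth."""
        if self.qfull_sample_size == 0:
            return self.depth
        if queue_full:
            self.good_count = 0
            self.full_count += 1
            if self.full_count >= self.qfull_sample_size:
                self.depth = max(1, self.depth // 2)  # never go below 1
                self.full_count = 0
        else:
            self.good_count += 1
            if self.good_count >= self.qfull_threshold and self.depth < self.max_depth:
                self.depth += 1
                self.good_count = 0
        return self.depth


queue = AdaptiveQueue(max_depth=32, qfull_sample_size=2, qfull_threshold=3)
statuses = [True, True, False, False, False]  # two QUEUE FULL reports, then three good statuses
print([queue.on_status(s) for s in statuses])
```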

Storage I/O Control uses the concept of a congestion threshold, which is based on latency. This is not the plain latency of the datastore on one particular ESXi host; a sophisticated algorithm computes a normalized datastore latency from the datastore latencies of all ESXi hosts using that particular datastore. On top of this datastore-wide normalization, there is another type of normalization which takes into account just "normal" I/O sizes. What is a normal IO size? ~~I cannot find this detail but I think I read somewhere that the normal I/O size is between 2KB and 16KB.~~ Update 2016-10-07: Now that I work for VMware I have access to the ESXi source code, and if I read the code correctly, it seems that IO size normalization works a little bit differently. I cannot publish the exact formula as it is VMware intellectual property and an implementation detail, but the whole idea is that, for example, 1MB I/Os would significantly skew latency against the latency values normally accepted in the storage industry; therefore, the latency of bigger I/Os is adjusted based on I/O size.

Information in the above section is based on Cormac Hogan's blog post.

Conclusion

Based on the information above, I think that the default ESXi and QLogic values are the best for general workloads, and tuning queues is not something I would recommend. It is important to know that queue tuning does not help with performance in most cases. Queues are used for latency improvements during transient I/O bursts. If you have storage performance issues, it is usually because of an overloaded storage system and not because of queues in the network path. A bigger queue depth can help you to send more parallel I/Os to the storage subsystem. This could be beneficial when your target storage system can handle more I/Os. It cannot help in situations when the target storage system is overloaded and you expect more performance just by increasing queue depth.

The question I still need to answer for myself is why Compellent doesn't recommend VMware Storage I/O Control and Adaptive Queuing for dynamic queue management, which is in my opinion a very good thing.

UPDATE 2014-12-16: I have downloaded the latest Compellent Best Practices and there is a new statement about SIOC. SIOC can be enabled but you must know what's happening and if it is beneficial for you. Here is the snippet from the document ...

SIOC is a feature that was introduced in ESX/ESXi 4.1 to help VMware administrators regulate storage performance and provide fairness across hosts sharing a LUN. Due to factors such as Data Progression and the fact that Storage Center uses a shared pool of disk spindles, it is recommended that caution is exercised when using this feature. Due to how Data Progression migrates portions of volumes into different storage tiers and RAID levels at the block level, this could ultimately affect the latency of the volume, and trigger the resource scheduler at inappropriate times. Practically speaking, it may not make sense to use SIOC unless pinning particular volumes into specific tiers of disk.

I still believe SIOC is the way to go, and special attention has to be paid to the SIOC latency threshold. Compellent recommends keeping it at the default value of 30 milliseconds, which makes perfect sense. The storage system will do all the hard work for you, but when there is heavy congestion and the normalized latency is too high, dynamic ESXi disk queue management can kick in. It makes sense to me.

I also believe that Adaptive Queuing is a really good and practical safety mechanism when your storage array has full queues on storage front-end ports or LUNs. However, it is applicable only with storage arrays that support Adaptive Queuing and send SCSI sense codes about the queue full condition back to ESXi. If Adaptive Queueing is not used, even SIOC cannot help you with LUN/datastore issues (datastore disconnections), because the SIOC algorithm is based on device response time and not on queue full responses from the storage. Therefore, I strongly recommend enabling SIOC together with Adaptive Queuing unless your storage vendor has a really good justification not to do so. SIOC will help you with storage traffic throttling during high device response times, and Adaptive Queuing when storage array queues are full and the device cannot accept new I/Os. However, Adaptive Queuing should be configured in concert with your storage vendor. For more information on how to enable Adaptive Queuing, read VMware KB 1008113. Please note that SIOC and Adaptive Queuing are just safety mechanisms to mitigate the impact of storage issues; the root cause is the capacity planning on the storage array.

UPDATE 2020-10-22: SIOC throttles queues when a datastore latency threshold is hit. This is the disk (device) latency, not including any additional latency induced by VMkernel queueing. The minimum threshold at which SIOC can kick in is a 5 ms response time. This can be good for traditional storage arrays with rotational disks; however, in the all-flash era you expect response times between 1 and 3 ms, right? Sometimes even sub-millisecond response times. SIOC will not help you to achieve fairness and a defined SLA/OLA in such all-flash environments, but it can at least do dynamic queue management in situations where response times rise above 5 ms.

If any of you with a deep understanding of vSphere and storage architectures see an error in my analysis, please let me know so that I can correct it appropriately.

Useful links:

VirtualGeek : VMware I/O queues, “micro-bursting”, and multipathing
Cormac Hogan : Adaptive Queueing vs. Storage I/O Control
VMware KB: Checking the queue depth of the storage adapter and the storage device
Mariusz Kaczorek : What is Storage Queue Depth (QD) and why is it so important?
Duncan Epping : Why Queue Depth matters!
Duncan Epping : Storage I/O Fairness
Duncan Epping : DQLEN changes, what is going on?
Cody Hosterman : UNDERSTANDING VMWARE ESXI QUEUING AND THE FLASHARRAY
Jason Massae, Cody Hosterman : Core Storage Best Practice Deep Dive (VMworld 2020 - HCI1691)
HPE : HPE 3PAR Storage System - Handling Port Target_qlength_above_threshold Alerts on ESX Servers

Wednesday, December 03, 2014

Force10: How to prepare logs and configs for DELL tech support

The command show tech-support prints all configurations and logs required for troubleshooting to the console. That is usually not what you want, because you have to transfer the support file somewhere. Instead, you can simply save it to the internal flash device as a file and transfer it via ftp, tftp or scp to some computer.

F10-S4810-A#show tech-support | save flash://tech-supp.2014-12-03
Start saving show command report .......

The file tech-supp.2014-12-03 is created in the flash device and you can list it.

F10-S4810-A#dir
Directory of flash:
1 drwx 4096 Jan 01 1980 00:00:00 +00:00 .
2 drwx 3072 Dec 04 2014 03:49:32 +00:00 ..
3 drwx 4096 Mar 01 2004 21:28:04 +00:00 TRACE_LOG_DIR
4 drwx 4096 Mar 01 2004 21:28:04 +00:00 CORE_DUMP_DIR
5 d--- 4096 Mar 01 2004 21:28:04 +00:00 ADMIN_DIR
6 drwx 4096 Mar 01 2004 21:28:06 +00:00 RUNTIME_PATCH_DIR
7 drwx 4096 Nov 04 2014 01:09:36 +00:00 CONFIG_TEMPLATE
8 -rwx 6731 Dec 04 2014 01:46:00 +00:00 startup-config
9 -rwx 6285 Nov 07 2013 04:41:32 +00:00 startup-config.bak
10 -rwx 25614609 Feb 21 2013 17:24:22 +00:00 FTOS-SE-8.3.12.1.bin
11 -rwx 524528 Feb 21 2013 17:40:42 +00:00 U-boot.1.2.0.2.bin
12 -rwx 7202 Nov 07 2013 05:35:14 +00:00 david.pasek-config
13 drwx 4096 May 07 2014 04:06:42 +00:00 CONFD_LOG_DIR
14 -rwx 4094 Jul 18 2014 02:25:28 +00:00 inetd.conf
15 -rwx 0 Jul 19 2014 00:14:56 +00:00 pdtrc.lo0
16 -rwx 6125 Jul 15 2014 02:05:52 +00:00 backup-config.dp
17 -rwx 80 Jul 18 2014 21:45:22 +00:00 memtrc.lo0
18 -rwx 199770 Dec 04 2014 01:46:08 +00:00 confd_cdb.tar.gz
19 -rwx 0 Jul 18 2014 21:44:52 +00:00 pdtrc3.lo0
20 drwx 4096 Jul 19 2014 00:14:56 +00:00 bgp-trc
21 -rwx 6058 Oct 25 2014 05:13:10 +00:00 config-vrf-no-vrrp
22 -rwx 6110 Oct 25 2014 05:05:08 +00:00 config-vrf-vrrp
23 -rwx 135698 Dec 04 2014 03:50:16 +00:00 tech-supp.2014-12-03

Now it is very easy to transfer this single file somewhere via ssh (scp), ftp or tftp. You can use a command similar to

copy tech-supp.2014-12-03 scp://

I don't wish trouble on anybody, but I hope this helps with troubleshooting when needed ...

Force10 switch port iSCSI configuration

Here is a snippet of the Force10 switch port configuration for a port facing a storage front-end port or a host NIC port dedicated just to iSCSI. In other words, this is a non-DCB switch port configuration.

interface TenGigabitEthernet 0/12
no ip address
mtu 12000
switchport
flowcontrol rx on tx off
spanning-tree rstp edge-port
spanning-tree rstp edge-port bpduguard shutdown-on-violation
storm-control broadcast 100 in
storm-control unknown-unicast 50 in
storm-control multicast 50 in
no shutdown

MTU is set to 12000, which is the supported maximum and allows the use of Jumbo Frames. I know an MTU of 9216 is enough for iSCSI; however, modern switches are capable of larger MTUs without performance overhead, so why not use it?

Flow-control is enabled for receive traffic so the switch can send a PAUSE frame to the initiator or target, asking it to wait a while in case the switch port buffers are full. Transmit flow-control is not enabled because the switch port buffers are not deep enough to help.

The switch port is set to edge-port to take it out of the spanning-tree algorithm because the connected devices are edge devices - a server (iSCSI initiator) or storage (iSCSI target).

[Updated at 2015-01-19 based on Martin's comment]
Bpduguard can optionally be enabled on the switch port as prevention against cabling mistakes (also known as the human factor) because iSCSI storage should not send any BPDUs.

[/Updated]

Storm control of broadcasts, multicasts and unknown unicasts is enabled to eliminate unwanted storms. Please note that unicast storm control must not be enabled because iSCSI generates a lot of traffic which could be identified as a unicast storm.

That's it for the iSCSI network configuration of dedicated ports. When you use shared ports for iSCSI and LAN traffic, you should use DCB. The DCB configuration is similar to the configuration above, but flow-control is replaced with priority-flow-control (PFC) and link QoS technology (ETS) based on 802.1p CoS is leveraged. You can check the DCB design justification and a sample configuration for iSCSI here.

And as always, any comment is really appreciated.

Saturday, November 29, 2014

The ZALMAN ZM VE200 SATA hard disk caddy with DVD/HDD/FDD emulation

I have just bought an external USB drive with DVD emulation from an ISO file. That should be pretty handy for OS installs. I'm looking forward to my first ESXi installation directly from an ISO file.

Here is a nice and useful tutorial on how to use it.

Friday, November 21, 2014

Announcing the VMware Learning Zone

As a VMware vExpert I had a chance to use beta access to VMware Learning Zone. I blogged about my experience here. VMware Learning Zone has been officially announced today.

VMware Learning Zone is a new subscription-based service that gives you a full year of unlimited, 24/7 access to official VMware video-based training. Top VMware experts and instructors discuss solutions, provide tips and give advice on a variety of advanced topics. Your VMware Learning Zone subscription gives you:

Easy to consume training on the latest products and technologies
Powerful search functionality to find the answers you need fast
Content that delivers exactly the knowledge you need
Mobile access for on the go viewing
Much more

Learn more here.

iSCSI Best Practices

General iSCSI Best Practices

Separate VLAN for iSCSI traffic.
Two separate networks or VLANs for multipath iSCSI.
Two separate IP subnets for the separate networks or VLANs in multipath iSCSI.
Gigabit (or better) Full Duplex connectivity between storage targets (storage front-end ports) and all storage initiators (server ports)
Auto-Negotiate for all switches that will correctly negotiate Full Duplex
Full Duplex hard set for all iSCSI ports for switches that do not correctly negotiate
Bi-Directional Flow Control enabled for all Switch Ports that servers or controllers are using for iSCSI traffic.
Bi-Directional Flow Control enabled for all ports that handle iSCSI traffic. This includes all devices between two sites that are used for replication.
Unicast storm control disabled on every switch that handles iSCSI traffic.
Multicast disabled at the switch level for any iSCSI VLANs.
Broadcast disabled at the switch level for any iSCSI VLANs.
Routing disabled between the regular network and iSCSI VLANs.
Do not use Spanning Tree (STP or RSTP) on ports that connect directly to end nodes (the server or storage iSCSI ports.) If you must use it, enable the Cisco PortFast option or equivalent on these ports so that they are configured as edge ports.
Ensure that any switches used for iSCSI are of a non-blocking design.
When deciding which switches to use, remember that you are running SCSI traffic over it. Be sure to use a quality managed enterprise-class networking equipment. It is not recommended to use SBHO (small business/home office) class equipment outside of lab/test environments.

For Jumbo Frame Support

Some switches have limited buffer sizes and can only support Flow Control or Jumbo Frames, but not both at the same time. It is strongly recommended to choose Flow Control.
All devices connected through iSCSI need to support 9k jumbo frames.
All devices used to connect iSCSI devices need to support it.
This means every switch, router, WAN Accelerator, and any other network device that will handle iSCSI traffic needs to support 9k Jumbo Frames.
If you are not 100% positive that every device in the iSCSI network supports 9k Jumbo Frames, then do NOT turn on Jumbo Frames.
Because devices on both sides (server and SAN) need Jumbo Frames enabled, switching Jumbo Frames from disabled to enabled is recommended during a maintenance window. If servers have it enabled first, the Storage System will not understand their packets. If the Storage System enables it first, servers will not understand its packets.

VMware ESXi iSCSI tuning

Disabling "TCP Delayed ACK" (esxcli iscsi adapter param set -A vmhba33 -k DelayedAck -v 0 - command not tested)
Adjust iSCSI Login Timeout (esxcli iscsi adapter param set -A vmhba33 -k LoginTimeout -v 60)
Disable large receive offload (LRO) (esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0 or esxcfg-advcfg -s 0 /Net/TcpipDefLROEnabled)
Ensure Jumbo Frames are configured end to end (esxcli network vswitch standard set -m 9000 -v vSwitch2 and esxcli network ip interface set -m 9000 -i vmk1)
Set up appropriate multi pathing based on iSCSI storage system
FlowControl is enabled on ESXi by default. To display FlowControl settings use ethtool --show-pause vmnic0 or esxcli system module parameters list --module e1000 | grep "FlowControl"

If you know of any other best practice, tuning setting or recommendation, don't hesitate to leave a comment below this blog post.

Related documents:

[1] VMware. Best Practices For Running VMware vSphere On iSCSI. In: core.vmware.com, URL: https://core.vmware.com/resource/best-practices-running-vmware-vsphere-iscsi

Saturday, November 15, 2014

How to quickly get changed ESXi advanced settings?

Below is the esxcli command to list ESXi Advanced Settings that have been changed from the system defaults:

esxcli system settings advanced list -d

Here is a real example from my ESXi host in the lab ...

~ # esxcli system settings advanced list -d
Path: /UserVars/SuppressShellWarning
Type: integer
Int Value: 1
Default Int Value: 0
Min Value: 0
Max Value: 1
String Value:
Default String Value:
Valid Characters:
Description: Don't show warning for enabled local and remote shell access

You can see that I'm suppressing Shell Warning because I really want to have SSH enabled and running on my lab ESXi all the time.

If you want to list kernel settings, there is another command

esxcli system settings kernel list

and you can also use the -d option to list just the settings changed from their defaults.
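
For scripting, the output of the -d listing can be turned into structured data. Below is a minimal Python sketch, assuming the output has been saved as text in exactly the "Key: value" layout shown above; the function names parse_settings and summarize are my own and not part of esxcli.

```python
SAMPLE = """\
Path: /UserVars/SuppressShellWarning
Type: integer
Int Value: 1
Default Int Value: 0
Min Value: 0
Max Value: 1
String Value:
Default String Value:
Valid Characters:
Description: Don't show warning for enabled local and remote shell access
"""

def parse_settings(text):
    """Split 'Key: value' lines into one dict per setting (a new dict at each 'Path')."""
    settings, current = [], None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # blank line or line without a key
        if key == "Path":
            current = {}
            settings.append(current)
        if current is not None:
            current[key] = value.strip()
    return settings

def summarize(setting):
    """Return (path, current value, default value) for one parsed setting."""
    if setting["Type"] == "integer":
        return setting["Path"], setting["Int Value"], setting["Default Int Value"]
    return setting["Path"], setting["String Value"], setting["Default String Value"]

for s in parse_settings(SAMPLE):
    print(summarize(s))
```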

Friday, November 14, 2014

Virtualisation Design & Project Framework

Gareth Hogarth wrote an excellent high-level plan (aka methodology or framework) on how to properly deliver a virtualization project as a turnkey solution. I use a very similar approach, not only for virtualization projects but for any IT project where I have the role of Lead Architect. I have never written a blog post about this particular topic because it is usually the internal intellectual property of a consulting organization. So if you have never seen a similar methodology, look at Gareth's post to get an idea of the project phases and the overall project process. It is good to note that all these methodologies are just frameworks, and frameworks are usually good starting points which don't stop you from improving them to fulfill all specific project requirements and make your project successful.

Friday, November 07, 2014

40Gb over existing LC fiber optics

Did you know that DELL has a QSFP+ LM4 transceiver allowing 40Gb traffic up to 160m on LC OM4 MMF (multi-mode fiber) or up to 2km on LC SMF (single-mode fiber)?

Use Case:

This optic has an LC connection and is ideal for customers who want to use existing LC fiber. It can be used for 40GbE traffic up to 160m on MultiMode Fiber OR 2km on Single Mode fiber.

Specification

Peripheral Type: DELL QSFP+ LM4

Connection: LC Connection, Duplex Multi-Mode Fiber or Duplex Single-Mode Fiber
Max Distance: 140m OM3 or 160m OM4 MMF, 2km SMF
Transmitter Output Wavelength (nm): 1270 to 1330
Transmit Output Power (dBm): -7.0 to 3.5 [avg power per lane]
Receive Input Power (dBm): -10.0 to 3.5 [avg power per lane]
Temperature: 0 to 70C
Power: 3.5W max

Based on the wavelength range of 1270 to 1330 nm, I assume 40Gb is achieved as 4 x 10Gb lanes leveraging coarse wavelength-division multiplexing (CWDM) on the following wavelengths:

1270 nm
1290 nm
1310 nm
1330 nm
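
The arithmetic behind this assumption is simple and can be sketched in a few lines of Python. The 20 nm lane spacing and the 10Gb per-lane rate are inferred from the wavelength list above, not taken from a DELL specification.

```python
LANES = 4                   # assumed number of optical lanes
LANE_RATE_GBPS = 10         # assumed rate per lane
FIRST_WAVELENGTH_NM = 1270  # lower end of the specified transmitter range
SPACING_NM = 20             # spacing implied by the wavelength list above

# Lane center wavelengths and the aggregate rate across all lanes
wavelengths = [FIRST_WAVELENGTH_NM + i * SPACING_NM for i in range(LANES)]
aggregate_gbps = LANES * LANE_RATE_GBPS

print(wavelengths)
print(aggregate_gbps)
```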

Thursday, November 06, 2014

ESXi Network Troubleshooting

Introduction

As a VMware vExpert, I had the chance and privilege to use VMware Learning Zone, which offers excellent training videos. Today I would like to blog about useful commands covered in the training video “Network Troubleshooting at the ESXi Command Line”. If you ask me, VMware Learning Zone has very valuable content which comes in really handy during real troubleshooting.

UPDATE 2020-10-17: I have just found the blog post "ESXi Network Troubleshooting Tools" containing a lot of useful tools and insights.

NIC Adapters Information

To see network interface card information, you can run the following command

~ # /usr/lib/vmware/vm-support/bin/nicinfo.sh | more

Network Interface Cards Information.

Name PCI Device Driver Link Speed Duplex MAC Address MTU Description

----------------------------------------------------------------------------------------

vmnic0 0000:001:00.0 bnx2 Up 1000 Full 14:fe:b5:7d:8d:05 1500 Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX

vmnic1 0000:001:00.1 bnx2 Up 1000 Full 14:fe:b5:7d:8d:07 1500 Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX

vmnic2 0000:002:00.0 bnx2 Up 1000 Full 14:fe:b5:7d:8d:6d 1500 Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX

vmnic3 0000:002:00.1 bnx2 Up 1000 Full 14:fe:b5:7d:8d:6f 1500 Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX

NIC: vmnic0

NICInfo:

Advertised Auto Negotiation: true

Advertised Link Modes: 1000baseT/Full, 2500baseT/Full

Auto Negotiation: true

Cable Type: FIBRE

Current Message Level: -1

Driver Info:

NICDriverInfo:

Bus Info: 0000:01:00.0

Driver: bnx2

Firmware Version: 7.8.53 bc 7.4.0 NCSI 2.0.13

Version: 2.2.3t.v55.7

Link Detected: true

Link Status: Up

Name: vmnic0

PHY Address: 2

Pause Autonegotiate: false

Pause RX: true

Pause TX: true

Supported Ports: TP, FIBRE

Supports Auto Negotiation: true

Supports Pause: true

Supports Wakeon: true

Transceiver: internal

Wakeon: MagicPacket(tm)

Ring parameters for vmnic0:

Pre-set maximums:

RX: 4080

RX Mini: 0

RX Jumbo: 16320

TX: 255

Current hardware settings:

RX: 255

RX Mini: 0

RX Jumbo: 0

TX: 255

…

The output above is snipped to show just vmnic0. You can see useful information like PCI device ID, driver, link status, speed, duplex and MTU for each vmnic.

It also shows detailed driver information, FlowControl (pause frame) status, cable type, etc.

To find the PCI vendor and device IDs of a particular vmnic, use the vmkchdev command

~ # vmkchdev -l | grep vmnic0

0000:01:00.0 14e4:163a 1028:02dc vmkernel vmnic0

PCI Slot: 0000:01:00.0

VID (Vendor ID): 14e4

DID (Device ID): 163a

SVID (Sub-Vendor ID): 1028

SSID (Sub-Device ID): 02dc
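
The field breakdown above can also be done programmatically. Here is a minimal Python sketch, assuming each vmkchdev -l line has exactly the five whitespace-separated fields seen in the example; the function name and dictionary keys are my own.

```python
def parse_vmkchdev(line):
    """Split one 'vmkchdev -l' line into PCI slot, IDs, owner and device name."""
    slot, ids, sub_ids, owner, name = line.split()
    vid, did = ids.split(":")
    svid, ssid = sub_ids.split(":")
    return {
        "slot": slot,
        "vid": vid,    # Vendor ID
        "did": did,    # Device ID
        "svid": svid,  # Sub-Vendor ID
        "ssid": ssid,  # Sub-Device ID
        "owner": owner,
        "name": name,
    }

device = parse_vmkchdev("0000:01:00.0 14e4:163a 1028:02dc vmkernel vmnic0")
print(device)
```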

You can use the PCI vendor and device IDs to find the latest drivers in the VMware Compatibility Guide (http://www.vmware.com/go/hcl/).

Below is another command, which shows full details of all PCI devices.

esxcli hardware pci list

If you are interested in the PCI details of just one particular vmnic, the command below can be used.

~ # esxcli hardware pci list | grep -B 6 -A 29 vmnic0

000:001:00.0

Address: 000:001:00.0

Segment: 0x0000

Bus: 0x01

Slot: 0x00

Function: 0x00

VMkernel Name: vmnic0

Vendor Name: Broadcom Corporation

Device Name: Broadcom NetXtreme II BCM5709S 1000Base-SX

Configured Owner: Unknown

Current Owner: VMkernel

Vendor ID: 0x14e4

Device ID: 0x163a

SubVendor ID: 0x1028

SubDevice ID: 0x02dc

Device Class: 0x0200

Device Class Name: Ethernet controller

Programming Interface: 0x00

Revision ID: 0x20

Interrupt Line: 0x0f

IRQ: 15

Interrupt Vector: 0x2b

PCI Pin: 0x75

Spawned Bus: 0x00

Flags: 0x0201

Module ID: 4125

Module Name: bnx2

Chassis: 0

Physical Slot: 0

Slot Description: Embedded NIC 1

Passthru Capable: true

Parent Device: PCI 0:0:1:0

Dependent Device: PCI 0:0:1:0

Reset Method: Link reset

FPT Sharable: true

Note: the same command can be used for HBA cards by substituting vmhba0 for vmnic0.

VLAN Sniffing

The commands below enable VLAN statistics collection on a particular vmnic. The statistics can then be displayed and used for troubleshooting.

esxcli network nic vlan stats set --enabled=true -n vmnic0

~ # esxcli network nic vlan stats get -n vmnic0

VLAN 0

Packets received: 22

Packets sent: 0

VLAN 22

Packets received: 21

Packets sent: 10

VLAN 201

Packets received: 28

Packets sent: 0

VLAN 202

Packets received: 28

Packets sent: 0

VLAN 204

Packets received: 5

Packets sent: 0

VLAN 205

Packets received: 5

Packets sent: 0

Don’t forget to disable VLAN statistics after troubleshooting.

esxcli network nic vlan stats set --enabled=false -n vmnic0
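
If you collect these statistics repeatedly, it helps to parse them into a dictionary. Below is a minimal Python sketch, assuming the text layout shown above; the sample is a shortened copy of that output, and parse_vlan_stats is a name I made up.

```python
import re

SAMPLE = """\
VLAN 0
   Packets received: 22
   Packets sent: 0
VLAN 22
   Packets received: 21
   Packets sent: 10
VLAN 201
   Packets received: 28
   Packets sent: 0
"""

def parse_vlan_stats(text):
    """Return {vlan_id: {counter_name: value}} from 'vlan stats get' output."""
    stats, vlan = {}, None
    for line in text.splitlines():
        line = line.strip()
        match = re.fullmatch(r"VLAN (\d+)", line)
        if match:
            vlan = int(match.group(1))
            stats[vlan] = {}
        elif vlan is not None and ":" in line:
            key, value = line.split(":", 1)
            stats[vlan][key.strip()] = int(value)
    return stats

stats = parse_vlan_stats(SAMPLE)
# VLANs seen on the wire where this host only receives and never sends
receive_only = [v for v, c in stats.items() if c["Packets sent"] == 0]
print(receive_only)
```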

VMkernel Arp Cache

To work with the ESXi ARP cache, you can use the command

esxcli network ip neighbor

Below is an example of how to list ARP entries …

~ # esxcli network ip neighbor list

Neighbor   Mac Address        Vmknic   Expiry  State  Type
---------  -----------------  ------  -------  -----  -------
10.2.22.1  5c:26:0a:ae:5a:c6  vmk0    933 sec         Unknown

You can see just the default gateway 10.2.22.1 there.

Let’s ping some other device in the same broadcast domain and look at ARP entries again.

~ # ping 10.2.22.51

PING 10.2.22.51 (10.2.22.51): 56 data bytes

64 bytes from 10.2.22.51: icmp_seq=0 ttl=128 time=0.802 ms

~ # esxcli network ip neighbor list

Neighbor    Mac Address        Vmknic    Expiry  State  Type
----------  -----------------  ------  --------  -----  -------
10.2.22.51  00:0c:29:4a:5b:ba  vmk0    1195 sec         Unknown
10.2.22.1   5c:26:0a:ae:5a:c6  vmk0     878 sec         Unknown

Now you can see the entry for device 10.2.22.51 in the ARP table as well. Below is another command, which removes an ARP entry from the ARP table.

~ # esxcli network ip neighbor remove -v 4 -a 10.2.22.51

… and let’s check that the ARP entry has been removed.

~ # esxcli network ip neighbor list

Neighbor   Mac Address        Vmknic   Expiry  State  Type
---------  -----------------  ------  -------  -----  -------
10.2.22.1  5c:26:0a:ae:5a:c6  vmk0    817 sec         Unknown

Note: The ESXi ARP timeout is 1200 seconds, so the remove command can be handy in some situations.
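
To check the ARP cache from a script, the neighbor list can be parsed into records. This is a minimal Python sketch, assuming IPv4 or IPv6 addresses in the first column and the expiry in seconds in the fourth, as in the output above; the State and Type columns are ignored and the function name is my own.

```python
import ipaddress

SAMPLE = """\
Neighbor Mac Address Vmknic Expiry State Type
---------- ----------------- ------ -------- ----- -------
10.2.22.51 00:0c:29:4a:5b:ba vmk0 1195 sec Unknown
10.2.22.1 5c:26:0a:ae:5a:c6 vmk0 878 sec Unknown
"""

def parse_neighbors(text):
    """Return one dict per ARP entry; header and separator lines are skipped."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not a data row
        entries.append({
            "neighbor": fields[0],
            "mac": fields[1],
            "vmknic": fields[2],
            "expiry_sec": int(fields[3]),
        })
    return entries

neighbors = parse_neighbors(SAMPLE)
print([n["neighbor"] for n in neighbors])
```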

VMkernel Routing

Since vSphere 5.1 it is possible to have more than one networking stack. Normally you work with the default networking stack.

To show the ESXi routing table, you can use the command

esxcli network ip route ipv4 list

~ # esxcli network ip route ipv4 list

Network    Netmask        Gateway    Interface  Source
---------  -------------  ---------  ---------  ------
default    0.0.0.0        10.2.22.1  vmk0       MANUAL
10.2.22.0  255.255.255.0  0.0.0.0    vmk0       MANUAL

You can see the default gateway 10.2.22.1 used for the default networking stack.

The command esxcli network ip connection list shows all IP network connections from and to the ESXi host.

~ # esxcli network ip connection list

Proto Recv Q Send Q Local Address Foreign Address State World ID CC Algo World Name

----- ------ ------ ------------------------------- ------------------ ----------- -------- ------- ---------------

tcp 0 0 127.0.0.1:8307 127.0.0.1:54854 ESTABLISHED 34376 newreno hostd-worker

tcp 0 0 127.0.0.1:54854 127.0.0.1:8307 ESTABLISHED 570032 newreno rhttpproxy-work

tcp 0 0 127.0.0.1:443 127.0.0.1:54632 ESTABLISHED 570032 newreno rhttpproxy-work

tcp 0 0 127.0.0.1:54632 127.0.0.1:443 ESTABLISHED 1495503 newreno python

tcp 0 0 127.0.0.1:8307 127.0.0.1:61173 ESTABLISHED 34806 newreno hostd-worker

tcp 0 0 127.0.0.1:61173 127.0.0.1:8307 ESTABLISHED 570032 newreno rhttpproxy-work

tcp 0 0 127.0.0.1:80 127.0.0.1:60974 ESTABLISHED 34267 newreno rhttpproxy-work

tcp 0 0 127.0.0.1:60974 127.0.0.1:80 ESTABLISHED 35402 newreno sfcb-vmware_bas

tcp 0 0 10.2.22.101:80 10.44.44.110:50351 TIME_WAIT 0

tcp 0 0 127.0.0.1:5988 127.0.0.1:14341 FIN_WAIT_2 35127 newreno sfcb-HTTP-Daemo

tcp 0 0 127.0.0.1:14341 127.0.0.1:5988 CLOSE_WAIT 1473527 newreno hostd-worker

tcp 0 0 127.0.0.1:8307 127.0.0.1:45011 ESTABLISHED 34806 newreno hostd-worker

tcp 0 0 127.0.0.1:45011 127.0.0.1:8307 ESTABLISHED 570032 newreno rhttpproxy-work
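
On a busy host this list gets long, so a quick summary by TCP state is useful. Below is a minimal Python sketch, assuming the column order shown above and handling only tcp rows (I am assuming here that rows for other protocols may not carry a State column); the sample contains three rows copied from the output above.

```python
from collections import Counter

SAMPLE = """\
tcp 0 0 127.0.0.1:8307 127.0.0.1:54854 ESTABLISHED 34376 newreno hostd-worker
tcp 0 0 10.2.22.101:80 10.44.44.110:50351 TIME_WAIT 0
tcp 0 0 127.0.0.1:14341 127.0.0.1:5988 CLOSE_WAIT 1473527 newreno hostd-worker
"""

def parse_tcp_connections(text):
    """Return one dict per tcp row of 'esxcli network ip connection list'."""
    rows = []
    for line in text.splitlines():
        f = line.split()
        if len(f) < 6 or f[0] != "tcp":
            continue  # header, separator, blank line or non-tcp row
        rows.append({
            "local": f[3],
            "foreign": f[4],
            "state": f[5],
            # World name is missing on rows without an owning world
            "world_name": f[8] if len(f) > 8 else "",
        })
    return rows

connections = parse_tcp_connections(SAMPLE)
by_state = Counter(c["state"] for c in connections)
print(by_state)
```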

NetCat

The Netcat program (nc) is available on ESXi and can test TCP connectivity to an IP target.

~ # nc -v 10.2.22.100 80

Connection to 10.2.22.100 80 port [tcp/http] succeeded!

TraceNet

Tracenet is a very handy program available in ESXi which also identifies latencies inside the VMkernel IP stack.

~ # tracenet 10.2.22.51

Using interface vmk0 ...

Time 0.068 0.023 0.019 ms

Location: ESXi-Firewall

Time 0.070 0.025 0.020 ms

Location: VLAN_InputProcessor@#

Time 0.073 0.027 0.022 ms

Location: vSwitch0: port 0x2000004

Time 0.089 0.030 0.024 ms

Location: VLAN_OutputProcessor@#

Time 0.090 0.031 0.025 ms

Location: DC01

Endpoint: 10.2.22.51

Roundtrip Time: 0.417 0.195 0.196 ms
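
To see where the time is spent, the per-location times can be converted into increments between consecutive locations. The Python sketch below assumes that the reported times are cumulative along the path, which is only my reading of the increasing values in the output above, and that each "Time" line is followed by its "Location" line.

```python
SAMPLE = """\
Time 0.068 0.023 0.019 ms
Location: ESXi-Firewall
Time 0.070 0.025 0.020 ms
Location: VLAN_InputProcessor@#
Time 0.073 0.027 0.022 ms
Location: vSwitch0: port 0x2000004
Time 0.089 0.030 0.024 ms
Location: VLAN_OutputProcessor@#
Time 0.090 0.031 0.025 ms
Location: DC01
"""

def parse_tracenet(text):
    """Return a list of (location, [probe times in ms]) in path order."""
    hops, times = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Time "):
            times = [float(x) for x in line.split()[1:-1]]  # drop 'Time' and 'ms'
        elif line.startswith("Location:") and times is not None:
            hops.append((line.split(":", 1)[1].strip(), times))
            times = None
    return hops

def hop_deltas(hops, probe=0):
    """Time added at each location for one probe, assuming cumulative times."""
    deltas, previous = [], 0.0
    for location, times in hops:
        deltas.append((location, round(times[probe] - previous, 3)))
        previous = times[probe]
    return deltas

for location, delta in hop_deltas(parse_tracenet(SAMPLE)):
    print(f"{location}: +{delta} ms")
```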

Dropped packets

This section covers commands to check for dropped packets at different places in the VMkernel IP stack.

The command net-stats -l lists all devices (clients: NIC ports, vmk ports and VM ports) connected to a VMware virtual switch. It lets you easily identify the vSwitch port number (PortNum) each device is connected to.

~ # net-stats -l

PortNum   Type  SubType  SwitchName  MACAddress         ClientName
33554434  4     0        vSwitch0    14:fe:b5:7d:8d:05  vmnic0
33554436  3     0        vSwitch0    14:fe:b5:7d:8d:05  vmk0
33554437  5     9        vSwitch0    00:0c:29:4a:5b:ba  DC01
33554438  5     9        vSwitch0    00:0c:29:f0:df:4c  VC01

Note: SubType is VM Hardware Version
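
net-stats prints port numbers in decimal, while the tracenet output earlier in this post shows a port in hexadecimal (vSwitch0: port 0x2000004). A short Python sketch of the conversion, using the port numbers from the listing above:

```python
# PortNum values taken from the net-stats -l output above
ports = {
    "vmnic0": 33554434,
    "vmk0": 33554436,
    "DC01": 33554437,
    "VC01": 33554438,
}

def client_for_hex_port(hex_port, port_table):
    """Return the client name whose decimal PortNum matches a hex port ID."""
    wanted = int(hex_port, 16)
    for name, port_num in port_table.items():
        if port_num == wanted:
            return name
    return None

for name, port_num in ports.items():
    print(name, port_num, hex(port_num))

print(client_for_hex_port("0x2000004", ports))
```

So the vSwitch port reported by tracenet is the port of vmk0, the interface tracenet was using.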

vSwitch port numbers are important for the following commands.

The command esxcli network port stats get -p shows statistics for a particular vSwitch port.

~ # esxcli network port stats get -p 33554434

Packet statistics for port 33554434

Packets received: 2346445

Packets sent: 5853

Bytes received: 295800113

Bytes sent: 1225842

Broadcast packets received: 1440669

Broadcast packets sent: 336

Multicast packets received: 896958

Multicast packets sent: 120

Unicast packets received: 8818

Unicast packets sent: 5397

Receive packets dropped: 0

Transmit packets dropped: 0

You can also show filter statistics for the ESXi firewall with the command esxcli network port filter stats get -p 33554436

~ # esxcli network port filter stats get -p 33554436

Filter statistics for ESXi-Firewall

Filter direction: Receive

Packets in: 5801

Packets out: 5660

Packets dropped: 141

Packets filtered: 150

Packets faulted: 0

Packets queued: 0

Packets injected: 0

Packet errors: 0

Filter statistics for ESXi-Firewall

Filter direction: Transmit

Packets in: 4893

Packets out: 4887

Packets dropped: 6

Packets filtered: 6

Packets faulted: 0

Packets queued: 0

Packets injected: 0

Packet errors: 0
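
In the filter statistics above, the number of dropped packets equals "Packets in" minus "Packets out" in both directions (5801 - 5660 = 141 and 4893 - 4887 = 6). The Python sketch below parses the output per direction and checks this relation; treat it as an observation on this sample, not a documented invariant. The sample is a shortened copy of the output above.

```python
SAMPLE = """\
Filter statistics for ESXi-Firewall
Filter direction: Receive
Packets in: 5801
Packets out: 5660
Packets dropped: 141
Packets filtered: 150
Filter statistics for ESXi-Firewall
Filter direction: Transmit
Packets in: 4893
Packets out: 4887
Packets dropped: 6
Packets filtered: 6
"""

def parse_filter_stats(text):
    """Return {direction: {counter_name: value}} from 'port filter stats get'."""
    blocks, current = {}, None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue  # e.g. the 'Filter statistics for ...' title line
        value = value.strip()
        if key == "Filter direction":
            current = blocks.setdefault(value, {})
        elif current is not None and value.isdigit():
            current[key] = int(value)
    return blocks

filter_stats = parse_filter_stats(SAMPLE)
for direction, counters in filter_stats.items():
    lost = counters["Packets in"] - counters["Packets out"]
    print(direction, lost, counters["Packets dropped"])
```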

To show physical NIC statistics, use the command esxcli network nic stats get -n vmnic0

~ # esxcli network nic stats get -n vmnic0

NIC statistics for vmnic0

Packets received: 2350559

Packets sent: 8083

Bytes received: 312690659

Bytes sent: 5791889

Receive packets dropped: 0

Transmit packets dropped: 0

Total receive errors: 0

Receive length errors: 0

Receive over errors: 0

Receive CRC errors: 0

Receive frame errors: 0

Receive FIFO errors: 0

Receive missed errors: 0

Total transmit errors: 0

Transmit aborted errors: 0

Transmit carrier errors: 0

Transmit FIFO errors: 0

Transmit heartbeat errors: 0

Transmit window errors: 0
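
With this many counters it is easy to overlook a non-zero one. Below is a minimal Python sketch which keeps only the drop and error counters greater than zero. The first sample is a shortened copy of the output above; the second one uses made-up values purely for illustration.

```python
CLEAN_SAMPLE = """\
NIC statistics for vmnic0
Packets received: 2350559
Packets sent: 8083
Receive packets dropped: 0
Transmit packets dropped: 0
Total receive errors: 0
Receive CRC errors: 0
"""

# Hypothetical values, not taken from a real host
FAULTY_SAMPLE = """\
NIC statistics for vmnic0
Packets sent: 8083
Receive packets dropped: 12
Total receive errors: 3
Receive CRC errors: 3
Transmit carrier errors: 0
"""

def nonzero_problem_counters(text):
    """Return the 'dropped' and 'errors' counters whose value is above zero."""
    problems = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        if not sep or not value.isdigit():
            continue
        if ("dropped" in key or "errors" in key) and int(value) > 0:
            problems[key] = int(value)
    return problems

print(nonzero_problem_counters(CLEAN_SAMPLE))
print(nonzero_problem_counters(FAULTY_SAMPLE))
```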

Packet Capture

If you want to do deeper network troubleshooting, you can capture packets on the ESXi host. Two tools are available for packet capturing:

tcpdump-uw (example: tcpdump-uw -i vmk0 -s0 -C100M -W 10 -w /var/tmp/test.pcap)
pktcap-uw

pktcap-uw examples:

pktcap-uw --uplink vmnicX --capture UplinkRcv
pktcap-uw --uplink vmnicX --capture UplinkSnd
You can filter for ICMP with --proto 0x01 or for beacon probes with --ethtype 0x8922

Another example, based on [SOURCE] https://kb.fortinet.com/kb/documentLink.do?externalID=FD47845

In case of a connectivity issue between a VM and other VMs, it is worth sniffing traffic on the hypervisor side in order to isolate the issue.

In order to sniff traffic on an ESXi server, it is necessary to perform the steps below:

- Enable SSH access on ESXi.

- SSH to ESXi.

- Run net-stats -l | grep <VM name> in the CLI in order to find the virtual switch port of the VM.

- In vSphere 6.5 or earlier, it is necessary to specify the direction of sniffing (either input or output).

- The switch port number for a particular VM can be found using the net-stats command.

- -o defines the path and file name of the pcap file to be created.

- --dir specifies the direction (either input or output):

pktcap-uw --switchport 123 -o /tmp/in.pcap --dir input

pktcap-uw --switchport 123 -o /tmp/out.pcap --dir output

- In vSphere 6.7 or later it is possible to sniff traffic in both directions by setting --dir 2:

pktcap-uw --switchport 123 -o /tmp/both.pcap --dir 2

- Press Ctrl-C in the CLI in order to stop sniffing.

- Download created pcap file/s over ssh from ESXi.
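
The version-dependent choice between the capture commands above can be sketched as a small helper. This Python sketch only assembles the command lines quoted in the steps above and does not run anything; the function name and its parameters are hypothetical.

```python
def pktcap_commands(switchport, supports_both_directions, directory="/tmp"):
    """Build the pktcap-uw command lines for one vSwitch port.

    supports_both_directions: True for vSphere 6.7 or later (--dir 2),
    False for vSphere 6.5 or earlier (one capture per direction).
    """
    base = f"pktcap-uw --switchport {switchport}"
    if supports_both_directions:
        return [f"{base} -o {directory}/both.pcap --dir 2"]
    return [
        f"{base} -o {directory}/in.pcap --dir input",
        f"{base} -o {directory}/out.pcap --dir output",
    ]

for command in pktcap_commands(123, supports_both_directions=False):
    print(command)
```

On vSphere 6.5 or earlier the two captures have to be run separately, for example in two SSH sessions.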
