Wednesday, January 28, 2015

The fastest vMotion over Force10 MXL?

Here is a question I received yesterday:
My customer has two M1000e chassis in a single rack with MXL blade switches in fabrics A and B.  MXL fabric B is connected to a 10G EQL SAN.  The goal is to allow vMotion to occur very fast between the two chassis using fabric A without going to the top-of-rack 10G switch.  The question is: what interconnect between the A fabrics in both chassis is best?
Is VLT or stacking preferred?
Is it best to vertically stack chassis 1 MXL A1 to chassis 2 MXL A1 and then LACP to TOR S4810?
Or is it better to horizontally stack chassis 1 MXL A1 to chassis 1 MXL A2 and then LACP to TOR S4810?
Let's do a consultative design exercise.

Requirements
  • R1 - vMotion to occur very fast between two chassis
  • R2 - vMotion over fabric A without going to the TOR switch
  • R3 - use fabric B only for iSCSI 
Constraints
  • C1 - 2x Blade chassis DELL M1000e
  • C2 - Each blade chassis has Force10 MXL blade switch IO modules on fabrics A1 and A2 for Ethernet/IP traffic
  • C3 - Each blade chassis has Force10 MXL blade switch IO modules on fabrics B1 and B2 for iSCSI traffic
  • C4 - VMware vSphere ESXi hypervisor on each blade server
  • C5 - maximum 8 vMotions per ESXi host
Assumptions
  • A1 - Blade servers have 2x 10Gb NICs connected to fabric A (A1, A2)
  • A2 - Blade servers have 2x 10Gb NICs connected to fabric B (B1, B2)
  • A3 - 16x half-height blade servers are used in each blade chassis
  • A4 - Each ESXi server has NIC teaming with dual homing to the A1 and A2 IO modules
Design decision and justification
  • MXL switches in fabrics A1 and A2 are stacked vertically across the two blade chassis (A1 with A1, A2 with A2) via a 160Gb interconnect for east/west traffic, giving a fan-in/fan-out ratio of 1:1
  • Vertical stacking allows single-point management of the two switches in fabric A1 but still allows non-disruptive firmware upgrades, because fabric A2 is an independent fault zone and NIC teaming handles automated failover. That's the reason horizontal stacking is not used.
  • Northbound connectivity for north/south traffic is done via a VLT port-channel giving 80Gb of total upstream bandwidth for each MXL, which is a fan-in/fan-out ratio of 2:1.
  • Top-of-rack switches (2x Force10 S4810) are formed into a single VLT domain (aka virtual chassis) to have a loop-free topology and utilize the full upstream bandwidth.
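The fan-in/fan-out ratios above can be checked with simple arithmetic. The following is a minimal sketch, assuming that each of the 16 half-height blades (assumption A3) presents one 10Gb NIC to each MXL; the function and constant names are only illustrative.

```python
def fan_ratio(server_facing_gbps: float, interconnect_gbps: float) -> float:
    """Return the fan-in/fan-out ratio as server-facing / interconnect bandwidth."""
    return server_facing_gbps / interconnect_gbps


BLADES_PER_CHASSIS = 16   # assumption A3
NIC_SPEED_GBPS = 10       # one 10Gb NIC per blade towards each MXL (assumption A1)
STACK_GBPS = 160          # vertical stacking interconnect
UPLINK_GBPS = 80          # VLT port-channel towards the TOR switches

# Total bandwidth the blades can push into a single MXL
server_facing_gbps = BLADES_PER_CHASSIS * NIC_SPEED_GBPS

east_west = fan_ratio(server_facing_gbps, STACK_GBPS)
north_south = fan_ratio(server_facing_gbps, UPLINK_GBPS)

print(f"east/west   {east_west:g}:1")
print(f"north/south {north_south:g}:1")
```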
Design impact
  • vMotion VMkernel interfaces have to be configured on the same physical NIC (vmnic) on each ESXi host. This keeps vMotion traffic inside fabric A without unwanted TOR switch traffic.
  • Multi-NIC vMotion cannot be used, otherwise vMotion traffic between A1 and A2 would potentially go through the TOR switch, which is against requirement R2.
  • LACP/EtherChannel teaming cannot be used on the ESXi hosts because the A1 and A2 MXL switches are not in the same stack or VLT domain. Therefore IP hash based load balancing cannot be used, traffic of a single VM will always go over a single physical NIC, and a single VM will not be able to use more than 10Gb.
  • VM traffic for particular portgroups (L2 segments) should be configured as active/standby consistently across servers. This gives optimal east-west traffic within a particular VLAN in the non-degraded state and eliminates VM traffic flow across the TOR switch.
  • L3 traffic will be routed over the TOR switches.
Alternative
  • Leveraging vSphere Multi-NIC vMotion can improve vMotion performance (R1) but would be against R2 because vMotion communication would flow through the TOR switch.
Design decision qualities
  • Availability: Great
  • Performance: Very good for vMotion, Good for VM Traffic 
  • Manageability: Good - just two logical switches to manage
  • Scalability: max 6 members in stack
Logical design drawing


Monday, January 26, 2015

DELL 13G servers with PERC H730 finally certified for VSAN

I'm reading and learning a lot about VMware's VSAN. I really believe there will be a lot of use cases for software-defined distributed storage in the future. However, I don't see VSAN momentum right now because of several factors. The three most obvious ones are listed below:

  • Maturity
  • TCO
  • Single point of support - compared to the support offered by traditional SAN-based storage vendors

That's the reason I haven't had the chance or time to play with VMware VSAN so far, but I'm getting a lot of questions from colleagues, DELL partners, customers and folks from the VMware community about the right DELL storage controller for VSAN on the latest DELL server generation.

The DELL 13th server generation was unveiled on September 8, 2014. Since then, no DELL storage controller for DELL 13G servers has been officially supported by VMware for VSAN.
Today I got the information that the DELL PERC H730 is officially supported by DELL and VMware for VSAN. For more information look here.
This is really great news for VSAN early adopters planning to use DELL servers. One little piece of advice to all VSAN enthusiasts: if you are not going to use officially supported VSAN nodes or an EVO:RAIL appliance and you are designing your own VSAN cluster, do it very carefully. Don't forget to do a PoC before or during the design phase, and perform design and operational validation tests (aka test plan) before putting VSAN into real production. Be sure you know something about the queue depth of adapters (AQLEN) and disks (DQLEN).

If you build your own software-defined storage then you are the storage architect, with somewhat higher risk and responsibility in comparison to a classic storage system (this is my opinion). That's the risk of any modern (aka emerging) technology before it becomes a commodity. On the other hand, this can be your added value to your customers, and there is no doubt there are some benefits.

But never forget why "data centers" are so important and business critical. It is because they usually hold very valuable data which must always be available with reasonable performance. Think about 99.999% storage uptime with a reasonable response time (3-20ms) for the expected IOPS workload.

I wish everybody a lot of success with hyper-converged systems like VSAN. Please leave a comment with your (hopefully successful) stories and use cases. And I'm still looking forward to my first VSAN project :-)

vCenter SSO: Active Directory as a LDAP Server

Recently I needed to add a second Active Directory (VPOD02.example.com) to the vCenter SSO in my lab, which is already integrated with Active Directory (VPOD01.example.com).

Here are several facts to give you a brief overview of my lab.

I have two independent vPODs in my lab. Each vPOD has everything needed for a VMware vSphere infrastructure: dedicated hardware (compute, storage, network), vSphere components like vCenter, SSO, ESXi hosts, Site Recovery Manager and vSphere Replication Appliance, and also Domain Controllers and DNS servers.

vCenter SSO placed in VPOD01 is using Integrated Windows Authentication with Microsoft Active Directory "VPOD01.example.com". Therefore another integration with Microsoft Active Directory "VPOD02.example.com" can be done only via LDAP. The configuration of the additional identity source is shown below.

SSO: Add identity source
Identity source type: Active Directory as a LDAP Server
Identity source settings:
  Name: vpod02.example.com
  Base DN for users: dc=vpod02,dc=example,dc=com
  Domain name: vpod02.example.com
  Domain alias: vpod02
  Base DN for groups: dc=vpod02,dc=example,dc=com
  Primary server URL: ldap://10.2.22.51:389
  Secondary server URL: empty
  Username: administrator@vpod02.example.com
I know that two Microsoft domains can be integrated via a domain trust, but because I'm not very familiar or experienced with Microsoft Active Directory, I think the vCenter Single Sign-On capability of multiple identity sources is another nice design option.

Simpler manageability for a non-Microsoft-oriented vSphere admin was the primary reason and justification for using this option in my vSphere lab :-)




Monday, January 19, 2015

DELL Force10 : DCB configuration - design decision justification and configuration

Introduction to DCB

Data Center Bridging (DCB) is a group of protocols providing a modern QoS mechanism on Ethernet networks. There are four key DCB protocols, described in more detail here. In this blog post I'll show you how to configure DCB ETS, PFC and DCBX on the Force10 S4810.

ETS (Enhanced Transmission Selection) is bandwidth management allowing reservation of link bandwidth when the link is congested. DCB QoS is based on 802.1p CoS (Class of Service), which can handle up to 8 classes of service (aka priority levels). Any QoS is always implemented via dedicated queues for the different classes of service and an I/O scheduler which understands the configured priorities.

The S4810 has 4 queues, and 802.1p CoS values are by default mapped as shown below:
DCSWCORE-A#show qos dot1p-queue-mapping
Dot1p Priority : 0  1  2  3  4  5  6  7
         Queue : 0  0  0  1  2  3  3  3
The command service-class dot1p-mapping can reconfigure the mapping, but let's use the default mapping for our example. Queue to CoS mapping:

  • CoS 0, 1 and 2 are mapped to Queue 0
  • CoS 3 is mapped to Queue 1
  • CoS 4 is mapped to Queue 2
  • CoS 5, 6 and 7 are mapped to Queue 3
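If you want to derive the queue to CoS list from the switch output automatically, the two-line table can be parsed with a few lines of Python. This is only a sketch: it assumes the output has exactly the "Dot1p Priority" and "Queue" rows shown above, and the function name is hypothetical.

```python
from collections import defaultdict

# Output of "show qos dot1p-queue-mapping" as shown above
SAMPLE_OUTPUT = """\
Dot1p Priority : 0  1  2  3  4  5  6  7
         Queue : 0  0  0  1  2  3  3  3
"""


def parse_dot1p_queue_mapping(text: str) -> dict:
    """Return {queue: [CoS values]} parsed from the two-row mapping table."""
    rows = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip prompts or blank lines
        label, values = line.split(":", 1)
        rows[label.strip()] = [int(v) for v in values.split()]

    queue_to_cos = defaultdict(list)
    for cos, queue in zip(rows["Dot1p Priority"], rows["Queue"]):
        queue_to_cos[queue].append(cos)
    return dict(queue_to_cos)


queue_map = parse_dot1p_queue_mapping(SAMPLE_OUTPUT)
for queue, cos_values in sorted(queue_map.items()):
    print(f"Queue {queue}: CoS {cos_values}")
```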

PFC (Priority Flow Control) is nothing more than the classic Ethernet flow control protocol applied to just one specific 802.1p CoS. The Force10 S4810 supports PFC on two queues.

Now, let's define the design requirements and constraints for our specific design decision.

Design decision justification

R1: 4Gb guarantee for iSCSI traffic on each 10Gb converged link is required.
R2: Lossless Ethernet is required for iSCSI traffic.
R3: 1Gb guarantee for Hypervisor Management network on each 10Gb converged link is required.
R4: 2Gb guarantee for Hypervisor Live Migration network on each 10Gb converged link is required.
R5: 3Gb guarantee for production networks on each 10Gb converged link is required.

C1: We have 10Gb links to edge devices (servers and storage)
C2: We have only four switch queues for DCB on DELL Force10 S4810
C3: We have DCB capable iSCSI storage DELL EqualLogic

A1: No other storage protocol than iSCSI is required
A2: No other network traffic type requires QoS
A3: We have iSCSI traffic in 802.1p CoS 4

Let's design the best DCB mapping based on the requirements, constraints and assumptions above. The following priority groups reflect all requirements and constraints.

  • PG0 - Hypervisor management; 10% reservation; lossy Ethernet;  CoS 0,1,2 -> Switch Queue 0
  • PG1 - Hypervisor live migrations; 20% reservation; lossy Ethernet;  CoS 3 -> Switch Queue 1
  • PG2 - iSCSI; 40% reservation; lossless Ethernet;  CoS 4 -> Switch Queue 2
  • PG3 - Production; 30% reservation; lossy Ethernet;  CoS 5,6,7 -> Switch Queue 3

Below is the Force10 configuration snippet of the DCB map with its 802.1p CoS assignment.
dcb-map converged
  priority-group 0 bandwidth 10 pfc off
  priority-group 1 bandwidth 20 pfc off
  priority-group 2 bandwidth 40 pfc on
  priority-group 3 bandwidth 30 pfc off
  priority-pgid 0 0 0 1 2 3 3 3
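The dcb-map above can be cross-checked against the requirements with a small script. The sketch below renders the same configuration lines from the priority group table and verifies that every CoS is assigned exactly once. The check that the reservations add up to 100% is an assumption of this sketch, and all names are illustrative.

```python
LINK_GBPS = 10  # constraint C1: 10Gb converged links

# (priority group id, bandwidth %, PFC enabled, 802.1p CoS values)
PRIORITY_GROUPS = [
    (0, 10, False, [0, 1, 2]),  # hypervisor management
    (1, 20, False, [3]),        # hypervisor live migration
    (2, 40, True, [4]),         # iSCSI, lossless
    (3, 30, False, [5, 6, 7]),  # production
]


def render_dcb_map(name: str, groups: list) -> str:
    """Validate the priority groups and render them as dcb-map config lines."""
    total = sum(bandwidth for _, bandwidth, _, _ in groups)
    if total != 100:  # assumption: reservations are expected to total 100%
        raise ValueError(f"bandwidth reservations sum to {total}%, not 100%")

    cos_to_pg = {}
    for pgid, _, _, cos_values in groups:
        for cos in cos_values:
            if cos in cos_to_pg:
                raise ValueError(f"CoS {cos} is assigned to more than one group")
            cos_to_pg[cos] = pgid
    if sorted(cos_to_pg) != list(range(8)):
        raise ValueError("every CoS 0-7 must be assigned to a priority group")

    lines = [f"dcb-map {name}"]
    for pgid, bandwidth, pfc, _ in groups:
        state = "on" if pfc else "off"
        lines.append(f"  priority-group {pgid} bandwidth {bandwidth} pfc {state}")
    lines.append("  priority-pgid " + " ".join(str(cos_to_pg[c]) for c in range(8)))
    return "\n".join(lines)


config = render_dcb_map("converged", PRIORITY_GROUPS)
print(config)

# Guaranteed bandwidth per priority group on a 10Gb link (R1, R3, R4, R5)
guaranteed_gbps = {pgid: LINK_GBPS * bw / 100 for pgid, bw, _, _ in PRIORITY_GROUPS}
print(guaranteed_gbps)
```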
The DCB map has to be applied to each particular Force10 switch port. A configuration snippet for one switch port is below.
interface TenGigabitEthernet 0/6
 no ip address
 mtu 12000
 switchport
 spanning-tree rstp edge-port
 dcb-map converged
!
 protocol lldp
  dcbx port-role auto-downstream
 no shutdown
The following technologies are configured on switch port Te 0/6 by the configuration snippet above.

  • DCB ETS and PFC defined in dcb-map converged
  • LLDP protocol streaming down the DCB information configured in the network
  • MTU 12000 (Force10 maximum) because Jumbo Frames are beneficial for iSCSI. iSCSI Jumbo Frames require a payload of 9000 bytes plus some Ethernet and TCP/IP protocol overhead. MTU 9216 would be enough, but why not set the maximum MTU in the core network? The performance overhead is negligible and we are ready for everything.
  • Edge port configuration for faster port transition to the forwarding state


Friday, January 16, 2015

BPDU filter and Forged Transmit on VMware vSwitch to prevent loops

Do you know there is a potential risk of a Spanning Tree loop when someone configures virtual bridging between two vNICs inside a VMware vSphere VM? Or that a rogue tool in the VM guest OS can send BPDUs from the VM to your physical network?

Let's assume we have Rapid STP enabled on our network. Below is a typical Force10 configuration snippet for server access ports.
interface TenGigabitEthernet 0/2
 no ip address
 switchport
 spanning-tree rstp rootguard 
[updated]
 spanning-tree rstp edge-port bpduguard shutdown-on-violation
 no shutdown
The same or similar configuration is usually used for ESXi servers as well. ESXi NICs are used as vSwitch uplinks. It is important to note that a VMware vSwitch is not a switch but a kind of port extender, so it cannot create a loop in your network and does not generate BPDUs at all. However, when a VM on top of ESXi generates BPDUs, these BPDUs will arrive at the switch ports and your ESXi access switch ports will be shut down by the bpduguard feature. That's good from the network stability point of view because this is what we want and what we configured on the switch, right?

But what will happen on ESXi? The VMware ESXi vSwitch will detect that the link is down and will fail over to another uplink in the vSwitch, connected to another physical switch port, which will eventually be disabled by bpduguard as well.

The problem is that in the end all ESXi physical NICs (vSwitch uplinks) will be down and all VMs running on top of that ESXi host will be disconnected from the network. That can be a serious problem. The best solution would be to have BPDU Guard functionality on the VMware vSwitch but, surprisingly, no such feature exists. There is a relatively new possibility (since ESXi 5.1) to use BPDU Filter, which can help us keep our shared switch port up and running because no BPDUs arrive at the physical switch port, but that's not all we need to protect against loops. VMware's BPDU Filter functionality has to be configured on each ESXi host by altering the advanced setting Net.BlockGuestBPDU.
Default setting is: Net.BlockGuestBPDU = 0
To enable BPDU Filter: Net.BlockGuestBPDU = 1
Look at ESXi 5.1 and BPDU Guard for the full article with all details about this topic.

By the way, there is yet another possibility to protect your network against unwanted attacks or misconfigurations. It is generally recommended to set the vSphere vSwitch security policy "Forged Transmits" to reject unauthorized MAC addresses. In that case only the burned-in MAC address (actually the virtual BIA assigned by VMware) will be allowed to communicate on the network. In-guest virtual networking (L2 frame forwarding) will therefore be effectively disabled, and your network will be protected against potential STP issues like simulating a root switch from the vSphere environment. For further information about STP attacks from a Linux guest look for example here and here.

Thursday, January 08, 2015

Can you please tell me more about VN-Link?

Back in 2010, when I worked for CISCO Advanced Services as a UCS Architect, Consultant and Engineer, I compiled a presentation about CISCO's point of view on virtual networking in enterprise environments. Later I published this presentation on Slideshare as "VMware Networking, CISCO Nexus 1000V, and CISCO UCS VM-FEX". I used this presentation to educate CISCO partners and customers because it was a really abstract topic for regular network specialists without server virtualization awareness. Please note that the term SDN (Software Defined Networking) was not yet known and abused at that time.

Yesterday I received the following Slideshare comment / question about this presentation from John A.
Hi David, Thanks for this great material. Can you please tell me more about VN-Link?
I have decided to write a blog post instead of a simple answer in the Slideshare comments.

Disclaimer: I haven't worked for CISCO for more than 3 years and I work for a competitor (DELL), so this blog post is my private opinion and my own understanding of CISCO technology. I might oversimplify some definitions or be inaccurate in some statements, but I believe I'm right conceptually, which is the most important thing for John and other readers interested in CISCO virtual networking technologies for enterprise environments.

VN-link was a CISCO marketing and conceptual term which has since been replaced with the new term VM-FEX. VM-FEX (Virtual Machine Fabric Extender) is in my opinion an easier term to understand for CISCO networking professionals familiar with CISCO FEX technology. However, VN-link/VM-FEX is a purely conceptual and abstract construct achievable by several different technologies or protocols. I have always imagined VN-link as a permanent virtual link between a virtual machine's virtual NIC (for example a VMware vNIC) and a CISCO switch port with virtualization capabilities (vEth). When I say switch port virtualization capabilities, there are several technologies which can be used to fulfill the conceptual idea of VN-link. The conceptual and logical idea of VN-link is always the same but the implementation differs. Generally it is some kind of network overlay, and each VN-link (virtual link) is a tunnel implemented by some standard protocol or proprietary technology. One tunnel end point of a CISCO VN-link is always the same: it is a vEth on some model of CISCO Nexus switch. It can be a physical Nexus switch (UCS FI, N5K, N7K, ...) or the virtual switch Nexus 1000v (N1K). The second tunnel end point (vNIC) can be terminated in several places of your virtualized infrastructure. Below is a conceptual view of VN-link, or virtual wire if you wish.



So let's take a deep dive into two different technologies for implementing CISCO VN-link tunnels.

Nexus 1000v  (VN-link in software)

VN-link can be implemented in software by CISCO Nexus 1000v. The first VN-link tunnel end point (vEth) in this particular case is in the Nexus 1000v VSM (Virtual Supervisor Module) and the second tunnel end point (vNIC) is instantiated in the CISCO virtual switch Nexus 1000v VEM (Virtual Ethernet Module) on the particular hypervisor. Nexus 1000v architecture is not in the scope of this blog post, but someone familiar with CISCO FEX technology can imagine the VSM as a parent Nexus switch and the VEM as a remote line card (aka FEX - Fabric Extender).

VN-link in software is hardware independent and everything is done in software. Is it Software Defined Networking? I can imagine the Nexus 1000v VSM as a form of SDN controller. However, when I spoke personally with Martin Casado about this analogy at VMworld 2012 he was against it. I agree that Nexus 1000v has lower scalability than the NSX controller, but conceptually this analogy works quite well for me. It always depends on what scalability, performance, availability and manageability are required for the particular environment. There are always some pros and cons to each technology.

CISCO UCS and hypervisor module (VN-link in hardware)

For VN-link in hardware you must have appropriate CISCO UCS (Unified Computing System) hardware supporting the 802.1Qbh protocol. 802.1Qbh (aka VN-TAG) allows virtualization of the physical switch port and the server NIC port, effectively establishing a virtual link over the physical link. This technology dynamically creates vEth interfaces on top of a physical switch interface (UCS FI). This vEth is one end point of the VN-link (virtual link, virtual wire) established between the CISCO UCS FI vEth and the virtual machine vNIC. The virtual machine can be a virtual server instance on top of any server virtualization platform (VMware vSphere, Microsoft Hyper-V, KVM, etc.) for which CISCO has a hypervisor plugin/module. The CISCO VN-TAG (802.1Qbh) protocol is conceptually similar to HP Multichannel VEPA (802.1Qbg), but the advantage of VN-TAG is that a virtual wire can be composed of several segments. This multi-segment advantage is leveraged in UCS because one virtual link is combined from the two virtual links below. The first virtual link is in hardware and the second is in software. The two segments of a single virtual link over UCS infrastructure are listed below.
  1. From UCS FI vEth to UCS VIC (Virtual Interface Card) logical NIC (it goes through UCS IOM which is effectively normal CISCO physical FEX) 
  2. From UCS VIC logical NIC to VM vNIC (it goes through hypervisor module - software FEX)  
The hardware components required for VN-link in hardware are listed below.
  • CISCO UCS FI (Fabric Interconnects). The UCS FI acts as the first VN-link tunnel end point, where vEths exist.
  • CISCO UCS IOM (I/O Module) on each UCS blade chassis, working as a regular FEX
  • CISCO VIC (Virtual Interface Card) on each server hosting the specific hypervisor, allowing NIC partitioning of a single physical adapter into logical NICs or HBAs.

Conclusion

I hope this explanation of VN-link answered John A.'s question and will help others who want to know what VN-link really is. I forgot to mention that the primary use case of VN-link is mostly about operational collaboration between virtualization and network admins. CISCO believes that VN-link allows virtual networking administration to stay with legacy network specialists. To be honest, I'm not very optimistic about this idea because it makes the infrastructure significantly more complex. In my opinion IT silos (network, compute, storage) have to be merged into one team, and modern datacenter administrators must be able to administer servers, storage and networking. However, I agree that this is a pretty big mental shift and it will definitely take some time, especially in big enterprise environments.

Dell Virtual Racks

Virtual racks with Dell equipment are available at http://esgvr.dell.com/

Dell Server Virtual Rack

Direct link to the DELL Server Virtual Rack where you can see what particular compute systems physically look like.

Dell Storage Virtual Rack

Direct link to the DELL Storage Virtual Rack where you can see what particular storage systems physically look like.

Dell Networking Virtual Rack

Direct link to the DELL Networking Virtual Rack where you can see what particular network systems physically look like.

Wednesday, December 31, 2014

esxcli formatters

I have just read a great blog post "Hidden esxcli Command Output Formats You Probably Don’t Know" where the author (Steve Jin) exposed undocumented esxcli options to choose different formatters of esxcli output. The following esxcli formatters are available:

  • xml
  • csv
  • keyvalue

Here is an example of one particular esxcli command without a formatter.
~ # esxcli system version get
   Product: VMware ESXi
   Version: 5.5.0
   Build: Releasebuild-1746018
   Update: 1
Here is an example with the csv formatter
~ # esxcli --formatter=csv system version get
Build,Product,Update,Version,
Releasebuild-1746018,VMware ESXi,1,5.5.0,

Here is an example with the keyvalue formatter
~ # esxcli --formatter=keyvalue system version get
VersionGet.Build.string=Releasebuild-1746018
VersionGet.Product.string=VMware ESXi
VersionGet.Update.string=1
VersionGet.Version.string=5.5.0
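To illustrate why these formats are easier to consume in scripts, here is a minimal Python sketch that parses the csv and keyvalue outputs shown above. It assumes the csv header ends with a trailing comma and that keyvalue keys have the shape Structure.Field.type, exactly as in these samples; the function names are hypothetical.

```python
import csv
import io

CSV_OUTPUT = """\
Build,Product,Update,Version,
Releasebuild-1746018,VMware ESXi,1,5.5.0,
"""

KEYVALUE_OUTPUT = """\
VersionGet.Build.string=Releasebuild-1746018
VersionGet.Product.string=VMware ESXi
VersionGet.Update.string=1
VersionGet.Version.string=5.5.0
"""


def parse_csv_output(text: str) -> list:
    """Parse csv formatter output into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(text))
    # The trailing comma creates an empty column name; drop it.
    return [{key: value for key, value in row.items() if key} for row in reader]


def parse_keyvalue_output(text: str) -> dict:
    """Parse keyvalue formatter output into {field: value}."""
    result = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Key shape assumed from the sample: Structure.Field.type
        field = key.split(".")[1]
        result[field] = value
    return result


csv_rows = parse_csv_output(CSV_OUTPUT)
keyvalue_fields = parse_keyvalue_output(KEYVALUE_OUTPUT)
print(csv_rows[0]["Version"], keyvalue_fields["Build"])
```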

It is quite obvious what the xml formatter output looks like, right? All these formatters can help with output parsing, which can be very helpful for automation scripts. Even more formatters are available when the debug option is used. The following esxcli debug formatters are available:

  • python
  • json
  • html
  • table
  • tree
  • simple


Here is an example with the json formatter
~ # esxcli --debug --formatter=json system version get
{"Build": "Releasebuild-1746018", "Product": "VMware ESXi", "Update": "1", "Version": "5.5.0"}

Conclusion

esxcli formatters might be very handy for those who do automation and need easier parsing of esxcli output. You probably know that esxcli can also be used from PowerCLI, so formatters can significantly simplify the script code.

Friday, December 19, 2014

Galera Cluster for MySQL vs MySQL (NDB) Cluster

In the past, I used the MySQL database for a lot of projects, so I was really interested in the progress of MySQL clustering technologies and in what is possible today. Therefore I attended a very interesting webinar about MySQL clustering possibilities. The official webinar name is "Galera Cluster for MySQL vs MySQL (NDB) Cluster. A High Level Comparison" and the webinar replay and slides are available online here.

I was very impressed by what is possible today. Even though I have moved from software architecture and development to datacenter infrastructure architecture and consulting, it is good to know what database clustering is available nowadays.

Monday, December 15, 2014

Disk queue depth in an ESXi environment

On the internet, there is a lot of information and many documents about ESXi and disk queue depth, but I didn't find any single document that explained all the details I would like to know in a format that is easy to consume. Different vendors have their specific recommendations and best practices but without a deeper explanation of the principles and a holistic view. Some documents are incomplete and some others have, in my opinion, wrong or at least strange information. That's the reason I have decided to write this blog post, mainly for my own purposes, but I hope it helps other people from the VMware community.

When I did my research I found a Frank Denneman blog post from March 2009 where almost everything is very well and accurately explained, but the technology has changed a little bit over time. Another great document is here on VMware Community Docs.

Disclaimer: I work for DELL Services, therefore I work on a lot of projects with DELL Compellent Storage Systems, DELL Servers and QLogic HBAs. Therefore this blog post is based on "DELL Compellent Best Practices with VMware vSphere 5.x" and "Qlogic Best Practices for VMware vSphere 5.x". I have all this equipment in the lab so I was able to test it and verify what's going on. However, although this blog post contains information about specific hardware, the general principles are the same for any other HBAs and storage systems.

Some sections are just copied and pasted from the public documents mentioned above, so all credit goes there. Other sections are based on my lab tests, experiments and thoughts.

What is queue depth?

Queue depth is defined as the number of disk transactions that are allowed to be “in flight” between an initiator and a target (parallel I/O transaction), where the initiator is typically an ESXi host HBA port/iSCSI initiator and the target is typically the Storage Center front-end port. Since any given target can have multiple initiators sending it data, the initiator queue depth is generally used to throttle the number of parallel transactions being sent to a target to keep it from becoming “flooded”. When this happens, the transactions start to pile up causing higher latencies and degraded performance. That being said, while increasing the queue depth can sometimes increase performance, if it is set too high, there is an increased risk of over-driving (over-loading and saturating) the storage array. As data travels between the application and the storage array, there are several places that the queue depth can be set to throttle the number of concurrent (parallel) disk transactions. The most common places queue depth can be modified are:
  • The application itself (Default=dependent on application)
  • The virtual SCSI card driver in the guest (Default=32)
  • The VMFS layer (DSNRO) (Default=32)
  • The HBA VMkernel Module driver (Default=64)
  • The HBA BIOS (Default=Varies)

HBA Queue Depth = AQLEN 

(Default Varies but in my QLogic card it is 2176)

Here are Compellent recommended BIOS settings for QLogic Fibre Channel card:
  1. The “connection options” field should be set to 1 for point to point only
  2. The “login retry count” field should be set to 60 attempts
  3. The “port down retry” count field should be set to 60 attempts
  4. The “link down timeout” field should be set to 30 seconds
  5. The “queue depth” (or “Execution Throttle”) field should be set to 255. This queue depth can be set to 255 because the ESXi VMkernel driver module and DSNRO can more conveniently control the queue depth
I don't think point (5) in the Compellent Best Practices is correct and up to date. Based on the QLogic document, HBA "Queue Depth" doesn't exist in the QLogic BIOS and "Execution Throttle" is not used anymore. The QLogic adapter queue depth is managed only by the driver and the default value is 64 for vSphere 5.x. For more info look here.

I believe the QLogic Host Bus Adapter has a queue depth of 2176 because this number is visible in esxtop as AQLEN (Adapter Queue Length) for each particular HBA (vmhba).

HBA LUN Queue Depth 

(Default=64)
HBA LUN Queue Depth is 64 by default but can be changed via the VMkernel module driver. Compellent recommends setting the HBA max queue depth to 255, but this queue depth applies only to a single VM running on a single LUN (disk). If more VMs are running on the LUN, then DSNRO takes precedence. However, when SIOC (Storage I/O Control) is used, DSNRO is not used at all and the HBA VMkernel module driver queue depth value is used for each device queue depth (DQLEN). I think SIOC is beneficial in this situation because you can have deeper queues with dynamic queue management across all ESXi hosts connected to a particular LUN. Please note that special attention has to be paid to the correct SIOC latency threshold, especially on LUNs with sub-LUN tiering.

If you want to check your disk (aka LUN) queue depth value, it is nicely visible in esxtop as DQLEN.

So here is the procedure to change the HBA VMkernel module driver queue depth.

Find the appropriate driver name for the module that is loaded for QLogic HBA:
~ # esxcli system module list | grep ql
qlnativefc                          true        true
In our case, the name of module for QLogic driver is qlnativefc.

Set the QLogic driver queue depth and timeouts using the esxcli command:
esxcli system module parameters set -m qlnativefc -p "ql2xmaxqdepth=255 ql2xloginretrycount=60 qlport_down_retry=60"
You have to reboot ESXi host to apply module changes.

Below is a description of the QLogic parameters we have just changed:
  • ql2xmaxqdepth (int) - Maximum queue depth to report for target devices. 
  • ql2xloginretrycount (int) - Specify an alternate value for the NVRAM login retry count. 
  • qlport_down_retry (int) - Maximum number of command retries to a port that returns a PORT-DOWN status.
You can verify QLogic driver options by the following command:
esxcfg-module --get-options qlnativefc 
The resulting change (after the ESXi reboot) is visible in esxtop on disk devices as DQLEN.

VMFS Layer DSNRO 

(Default=32)
When two or more virtual machines share a LUN (logical unit number), this parameter controls the total number of outstanding commands permitted from all virtual machines collectively on the host to that LUN (this setting is not per virtual machine). For more information see KB

The HBA LUN queue depth (DQLEN) determines how many commands the HBA is willing to accept and process per LUN and target. If a single virtual machine is issuing IO, the HBA LUN Queue Depth (DQLEN) setting is applicable. When multiple VMs are simultaneously issuing IOs to the LUN, the ESXi parameter Disk.SchedNumReqOutstanding (DSNRO) becomes the leading parameter, and HBA LUN Queue Depth is ignored. The above statement is only true when SIOC is not enabled on the LUN (datastore).

The parameter can be set per device to a value between 1 and 256 with the following command, where the device identifier follows -d and the new value follows -O:
esxcli storage core device set -d -O
The number of outstanding IOs with competing worlds (DSNRO) can be listed for each device:
~ # esxcli storage core device list -d naa.6000d3100025e7000000000000000098
naa.6000d3100025e7000000000000000098
   Display Name: COMPELNT Fibre Channel Disk (naa.6000d3100025e7000000000000000098)
   Has Settable Display Name: true
   Size: 1048576
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.6000d3100025e7000000000000000098
   Vendor: COMPELNT
   Model: Compellent Vol
   Revision: 0504
   SCSI Level: 5
   Is Pseudo: false
   Status: on
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Queue Full Sample Size: 0
   Queue Full Threshold: 0
   Thin Provisioning Status: yes
   Attached Filters:
   VAAI Status: supported
   Other UIDs: vml.02000200006000d3100025e7000000000000000098436f6d70656c
   Is Local SAS Device: false
   Is Boot USB Device: false
   No of outstanding IOs with competing worlds: 32
The issue with DSNRO is that it must be set per device (LUN) on every ESXi host, which complicates implementation and overall manageability. It is usually left at the default value anyway, because 32 outstanding (async) IOs is good enough on low-latency SANs and properly designed storage devices with good response times. Bigger queues are beneficial on higher-latency SANs and/or storage systems that need more parallelism (more parallel threads) to deliver the I/O the storage is physically capable of. They don't help when the storage doesn't have enough back-end performance (usually spindles). In that case bigger queues generate more outstanding IOs than the storage system can handle, overload it, and make everything even worse. That's why I think it is a good idea to leave DSNRO at the default value.
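Because DSNRO has to be checked device by device, a small parser can help when auditing many LUNs. The sketch below assumes the output layout shown above (an unindented device identifier followed by indented "Key: Value" lines) and that the shell prompt line has been removed. The function names and the shortened sample are made up for this example.

```python
DSNRO_KEY = "No of outstanding IOs with competing worlds"


def parse_device_list(text):
    """Parse 'esxcli storage core device list' style text into {device: {key: value}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line.startswith(" "):
            # An unindented line starts a new device record.
            current = line.strip()
            devices[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.strip().partition(":")
            devices[current][key.strip()] = value.strip()
    return devices


def dsnro_by_device(text):
    """Return {device: DSNRO} for every device that reports the value."""
    return {
        device: int(fields[DSNRO_KEY])
        for device, fields in parse_device_list(text).items()
        if DSNRO_KEY in fields
    }


# Shortened copy of the output shown above.
SAMPLE = """\
naa.6000d3100025e7000000000000000098
   Display Name: COMPELNT Fibre Channel Disk (naa.6000d3100025e7000000000000000098)
   Vendor: COMPELNT
   No of outstanding IOs with competing worlds: 32
"""

print(dsnro_by_device(SAMPLE))
```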

Once again: DSNRO is not used when SIOC is enabled, because in that case the "HBA LUN queue depth" is the leading parameter for the disk queue depth (DQLEN).

Storage Target Port Queue Depth

Queues exist in many places and you have to understand the path from end to end. A queue exists on the storage array controller port as well; this is called the “Target Port Queue Depth“. Modern midrange storage arrays, like most EMC and HP arrays, can handle around 2048 outstanding IOs.
2048 IOs sounds like a lot, but most of the time multiple servers communicate with the storage controller at the same time. Because a port can only service one request at a time, additional requests are placed in the queue, and when the storage controller port receives more than 2048 IO requests the queue gets flooded. When the queue depth is reached (this status is called QFULL), the storage controller issues an IO throttling command to the host to suspend further requests until space in the queue becomes available. The ESXi host accepts the IO throttling command and decreases the LUN queue depth to the minimum value, which is 1! The QFULL condition is handled by the QLogic driver itself; the QFULL status is not returned to the OS. Some storage devices, however, return BUSY rather than QFULL. BUSY errors are logged in /var/log/vmkernel. Not every BUSY error is a QFULL error! Check the SCSI sense codes in the vmkernel message to determine what type of error it is. The VMkernel checks every 2 seconds whether the QFULL condition is resolved. If it is, the VMkernel slowly increases the LUN queue depth back to its normal value, which can take up to 60 seconds.

Compellent SC8000 front-end ports act as targets and have a Target Port Queue Depth of 1900+. What does 1900+ mean? Compellent is an excellent storage operating system on top of commodity servers, IO cards, disk enclosures and disks. By the way, is it software-defined storage or not? But back to queues: 1900+ means that the queue depth depends on the IO card used for the front-end port, but 1900 is the guaranteed minimum. So for our case let's assume a Storage Target Port Queue Depth of 2048.

All together, this gives us the holistic view
Here are end to end disk queue parameters:
  • DSNRO is 32 - this is default ESXi value.
  • HBA LUN Queue Depth is 64 by default in QLogic HBAs. Compellent recommends changing it to 255 via a parameter of the HBA VMkernel module driver.
  • HBA Queue Depth is 2176.
  • Compellent Target Ports Queue Depths are 2048.
Everything is depicted in the figure below.
Disk Queue Depth - Holistic View

What is the optimal HBA LUN Queue Depth?

We already know 64 is the default value for QLogic and you can change it via VMkernel Module driver. So what is the optimal HBA LUN Queue Depth?

To prevent flooding the target port queue, the product of the number of host paths, the HBA LUN Queue Depth and the number of LUNs presented through the host port must not exceed the target port queue depth. In short T >= P * Q * L

T = Target Port Queue Depth (2048)
P = Paths connected to the target port (1)
Q = HBA LUN Queue Depth (?)
L = number of LUN presented to the host through this port (20)

So when I have 20 LUNs exposed from the Compellent storage system, the HBA LUN Queue Depth should be at most Q = T / P / L = 2048 / 1 / 20 = 102.4 for a single ESXi host. When I have 16 hosts, that value must be divided by 16, so the value for this particular scenario is 6.4.
This number is without any overbooking, which is not realistic. The current QLogic HBA LUN Queue Depth default value (64) introduces a fan-in / fan-out ratio of 10:1, which is probably based on long-term experience with virtualized workloads. In the past, QLogic had lower default values, 16 in ESX 3.5 and 32 in ESX 4.x, which gave lower fan-in / fan-out ratios (2.5:1 and 5:1).

With the Compellent-recommended QLogic HBA LUN Queue Depth of 255 we will have, in this particular case, a fan-in / fan-out ratio of about 40:1. But only when SIOC is used. And when SIOC is used and the normalized datastore latency is higher than the SIOC threshold, dynamic queue management kicks in and disk queues are automatically throttled. So all is good.
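The arithmetic in this section can be reproduced with a few lines of Python. This is just a sketch of the T >= P * Q * L rule using the numbers from this post; the function names are made up for this example and the hosts parameter models the "divide by 16 hosts" step.

```python
def max_hba_lun_queue_depth(target_port_qd, paths, luns, hosts=1):
    """Largest Q that satisfies T >= P * Q * L when `hosts` hosts share the target port."""
    return target_port_qd / (paths * luns * hosts)


def fan_in_ratio(hba_lun_qd, safe_qd):
    """How many times the configured queue depth oversubscribes the safe value."""
    return hba_lun_qd / safe_qd


single_host = max_hba_lun_queue_depth(2048, paths=1, luns=20)             # 102.4
sixteen_hosts = max_hba_lun_queue_depth(2048, paths=1, luns=20, hosts=16)  # 6.4

# Ratios for the historical QLogic defaults (16, 32, 64) and the Compellent value (255).
ratios = {qd: round(fan_in_ratio(qd, sixteen_hosts), 1) for qd in (16, 32, 64, 255)}

print(single_host, sixteen_hosts, ratios)
```

For a queue depth of 255 the ratio comes out slightly below 40, which the text above rounds to 40:1.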

What is my real Disk Queue Depth?

VMkernel Disk Queue Depth (DQLEN) is the number that matters. However, this number is dynamic and it depends on several factors. Here are scenarios you can have:
Scenario 1 - SIOC enabled
When you have a datastore with SIOC enabled, your DQLEN will be set to the HBA LUN Queue Depth. It is 64 by default, but Compellent recommends setting it to 255. However, Compellent doesn't recommend using SIOC.
Scenario 2 - SIOC disabled and only one VM on the datastore
When you have just one VM on the datastore, your DQLEN will be set to the HBA LUN Queue Depth. But how often do you have just one VM on a datastore?
Scenario 3 - SIOC disabled and two or more VMs on the datastore
In this scenario your DQLEN will be set to DSNRO, which is 32 by default and takes precedence over the HBA LUN Queue Depth.
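The three scenarios can be restated as a tiny decision function. This is only a restatement of the rules as described in this post, not of the actual VMkernel logic; the function name and its parameters are made up for this example.

```python
def effective_dqlen(sioc_enabled, vms_on_datastore, hba_lun_qd=64, dsnro=32):
    """Return the DQLEN expected for the three scenarios described above."""
    if sioc_enabled:
        return hba_lun_qd      # Scenario 1: SIOC enabled
    if vms_on_datastore <= 1:
        return hba_lun_qd      # Scenario 2: SIOC disabled, a single VM
    return dsnro               # Scenario 3: SIOC disabled, two or more VMs


print(effective_dqlen(True, 10, hba_lun_qd=255))   # scenario 1
print(effective_dqlen(False, 1))                   # scenario 2
print(effective_dqlen(False, 10))                  # scenario 3
```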

ESXi Disk Queue Management - None, Adaptive Queuing or Storage I/O Control?

By default, ESXi doesn't use any Disk Queue Length (DQLEN) throttling mechanism, and your DQLEN is set to DSNRO (32) when two or more VMs (vDisks) are on the datastore. When only one VM disk is on the datastore, your DQLEN is set to the HBA LUN Queue Depth (default=64, Compellent recommends 255).

When you have enabled Adaptive Queuing or Storage I/O Control your Disk Queue Length (DQLEN) will be throttled automatically when I/O congestion occurs. The difference between Adaptive Queuing and SIOC is how I/O congestion is detected.

Adaptive Queuing waits for the storage to return a SCSI sense code of BUSY or QUEUE FULL status on the I/O path. For Adaptive Queuing, there are advanced parameters that must be set on a per-host basis. From the Configuration tab, select Software Advanced Settings. Navigate to Disk; the two parameters you need are Disk.QFullSampleSize and Disk.QFullThreshold. By default, QFullSampleSize is set to 0, meaning that Adaptive Queuing is disabled. When this value is set, Adaptive Queuing kicks in and halves the queue depth when this number of queue full conditions is reported by the array. QFullThreshold is the number of good statuses to receive before incrementing the queue depth once again.
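To get a feeling for how the two parameters interact, here is a toy model of the behavior described above. It is not the VMkernel algorithm: the class name, the way the counters are reset, the increment of one per threshold, and the sample values 32 and 4 are all assumptions made for this illustration.

```python
class AdaptiveQueueModel:
    """Toy model: halve the depth after N queue-full statuses, grow by 1 after M good ones."""

    def __init__(self, max_depth, qfull_sample_size, qfull_threshold, min_depth=1):
        self.max_depth = max_depth
        self.min_depth = min_depth
        self.sample_size = qfull_sample_size   # 0 means the mechanism is disabled
        self.threshold = qfull_threshold
        self.depth = max_depth
        self.qfull_seen = 0
        self.good_seen = 0

    def on_status(self, status):
        """Feed one I/O completion status ('QFULL', 'BUSY' or 'GOOD'); return the new depth."""
        if self.sample_size == 0:
            return self.depth
        if status in ("QFULL", "BUSY"):
            self.good_seen = 0                 # assumption: congestion restarts the good counter
            self.qfull_seen += 1
            if self.qfull_seen >= self.sample_size:
                self.depth = max(self.min_depth, self.depth // 2)
                self.qfull_seen = 0
        else:
            self.good_seen += 1
            if self.good_seen >= self.threshold:
                self.depth = min(self.max_depth, self.depth + 1)
                self.good_seen = 0
        return self.depth


queue = AdaptiveQueueModel(max_depth=32, qfull_sample_size=32, qfull_threshold=4)
for _ in range(32):
    queue.on_status("QFULL")
depth_after_congestion = queue.depth           # halved once
for _ in range(4):
    queue.on_status("GOOD")
depth_after_recovery = queue.depth             # grown by one
print(depth_after_congestion, depth_after_recovery)
```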

Storage I/O Control uses the concept of a congestion threshold, which is based on latency. It is not the plain datastore latency seen by one particular ESXi host; a sophisticated algorithm computes a normalized datastore latency from the latencies observed by all ESXi hosts using that datastore. On top of this datastore-wide normalization, there is another type of normalization which takes into account just "normal" I/O sizes. What is a normal I/O size? I cannot find this detail, but I think I read somewhere that the normal I/O size is between 2KB and 16KB.  Update 2016-10-07: Now that I work for VMware I have access to the ESXi source code, and if I read the code correctly, I/O size normalization works a little bit differently. I cannot publish the exact formula as it is VMware intellectual property and an implementation detail, but the whole idea is that, for example, 1MB I/Os would significantly skew latency against the latency values normally accepted in the storage industry; therefore, latency for bigger I/Os is adjusted based on I/O size.

Information in the above section is based on Cormac Hogan's blog post.

Conclusion

Based on the information above, I think the default ESXi and QLogic values are the best for general workloads, and tuning queues is not something I would recommend. It is important to know that queue tuning does not help with performance in most cases. Queues are used for latency improvements during transient I/O bursts. If you have storage performance issues, it is usually because of an overloaded storage system and not because of queues in the network path. A bigger queue depth can help you handle more parallel I/Os to the storage subsystem. This can be beneficial when your target storage system can handle more I/Os. It cannot help when the target storage system is overloaded and you expect more performance just from increasing the queue depth.

The question I still need to answer for myself is why Compellent doesn't recommend VMware Storage I/O Control and Adaptive Queuing for dynamic queue management, which is in my opinion a very good thing.

UPDATE 2014-12-16: I have downloaded the latest Compellent Best Practices and there is a new statement about SIOC. SIOC can be enabled but you must know what's happening and if it is beneficial for you. Here is the snippet from the document ...
SIOC is a feature that was introduced in ESX/ESXi 4.1 to help VMware administrators regulate storage performance and provide fairness across hosts sharing a LUN. Due to factors such as Data Progression and the fact that Storage Center uses a shared pool of disk spindles, it is recommended that caution is exercised when using this feature. Due to how Data Progression migrates portions of volumes into different storage tiers and RAID levels at the block level, this could ultimately affect the latency of the volume, and trigger the resource scheduler at inappropriate times. Practically speaking, it may not make sense to use SIOC unless pinning particular volumes into specific tiers of disk.
I still believe SIOC is the way to go, and special attention has to be paid to the SIOC latency threshold. Compellent recommends keeping it at the default value of 30 milliseconds, which makes perfect sense. The storage system will do all the hard work for you, but when there is huge congestion and the normalized latency is too high, dynamic ESXi disk queue management can kick in. It makes sense to me.

I also believe that Adaptive Queuing is a really good and practical safety mechanism when your storage array has full queues on storage front-end ports or LUNs. However, it is applicable only with storage arrays that support Adaptive Queuing and send SCSI sense codes about the queue full condition back to ESXi. If Adaptive Queuing is not used, even SIOC cannot help you with LUN/datastore issues (datastore disconnections), because the SIOC algorithm is based on device response time, not on queue full responses from the storage. Therefore, I strongly recommend enabling SIOC together with Adaptive Queuing unless your storage vendor has a really good justification not to do so. SIOC will help you with storage traffic throttling during high device response times, and Adaptive Queuing helps when storage array queues are full and the device cannot accept new I/Os. However, Adaptive Queuing should be configured in concert with your storage vendor. For more information on how to enable Adaptive Queuing, read VMware KB 1008113. Please note that SIOC and Adaptive Queuing are just safety mechanisms to mitigate the impact of storage issues; the root cause has to be addressed by capacity planning on the storage array.

UPDATE 2020-10-22: SIOC throttles queues when a datastore latency threshold is hit. This is the disk (device) latency, not including any additional latency induced by VMkernel queueing. The minimum threshold SIOC accepts is a 5 ms response time, which can be good for traditional storage arrays with rotational disks; however, in the all-flash era you expect response times between 1 and 3 ms, right? Sometimes even sub-millisecond response times. SIOC will not help you achieve fairness and a defined SLA/OLA in such all-flash environments, but it can at least do dynamic queue management when response times rise above 5 ms. 

If any of you with a deep understanding of vSphere and storage architectures see an error in my analysis, please let me know so that I can correct it appropriately.


Wednesday, December 03, 2014

Force10: How to prepare logs and configs for DELL tech support

The command show tech-support shows all configurations and logs required for troubleshooting on the console. That is usually not what you want, because you have to transfer the support file somewhere. Therefore, you can simply save it to the internal flash device as a file and transfer it via ftp, tftp or scp to some computer.
F10-S4810-A#show tech-support | save flash://tech-supp.2014-12-03
Start saving show command report .......
The file tech-supp.2014-12-03 is created in the flash device and you can list it.
F10-S4810-A#dir
Directory of flash:
  1  drwx       4096   Jan 01 1980 00:00:00 +00:00 .
  2  drwx       3072   Dec 04 2014 03:49:32 +00:00 ..
  3  drwx       4096   Mar 01 2004 21:28:04 +00:00 TRACE_LOG_DIR
  4  drwx       4096   Mar 01 2004 21:28:04 +00:00 CORE_DUMP_DIR
  5  d---       4096   Mar 01 2004 21:28:04 +00:00 ADMIN_DIR
  6  drwx       4096   Mar 01 2004 21:28:06 +00:00 RUNTIME_PATCH_DIR
  7  drwx       4096   Nov 04 2014 01:09:36 +00:00 CONFIG_TEMPLATE
  8  -rwx       6731   Dec 04 2014 01:46:00 +00:00 startup-config
  9  -rwx       6285   Nov 07 2013 04:41:32 +00:00 startup-config.bak
 10  -rwx   25614609   Feb 21 2013 17:24:22 +00:00 FTOS-SE-8.3.12.1.bin
 11  -rwx     524528   Feb 21 2013 17:40:42 +00:00 U-boot.1.2.0.2.bin
 12  -rwx       7202   Nov 07 2013 05:35:14 +00:00 david.pasek-config
 13  drwx       4096   May 07 2014 04:06:42 +00:00 CONFD_LOG_DIR
 14  -rwx       4094   Jul 18 2014 02:25:28 +00:00 inetd.conf
 15  -rwx          0   Jul 19 2014 00:14:56 +00:00 pdtrc.lo0
 16  -rwx       6125   Jul 15 2014 02:05:52 +00:00 backup-config.dp
 17  -rwx         80   Jul 18 2014 21:45:22 +00:00 memtrc.lo0
 18  -rwx     199770   Dec 04 2014 01:46:08 +00:00 confd_cdb.tar.gz
 19  -rwx          0   Jul 18 2014 21:44:52 +00:00 pdtrc3.lo0
 20  drwx       4096   Jul 19 2014 00:14:56 +00:00 bgp-trc
 21  -rwx       6058   Oct 25 2014 05:13:10 +00:00 config-vrf-no-vrrp
 22  -rwx       6110   Oct 25 2014 05:05:08 +00:00 config-vrf-vrrp
 23  -rwx     135698   Dec 04 2014 03:50:16 +00:00 tech-supp.2014-12-03

Now it is very easy to transfer this single file somewhere via ssh (scp), ftp or tftp. You can use a command similar to
copy tech-supp.2014-12-03 scp://
I don't wish trouble on anybody, but I hope this helps with troubleshooting when needed ...
 



Force10 switch port iSCSI configuration

Here is a snippet of a Force10 switch port configuration for a port facing a storage front-end port or a host NIC port dedicated just to iSCSI. In other words, this is a non-DCB switch port configuration.
interface TenGigabitEthernet 0/12
  no ip address
  mtu 12000
  switchport
  flowcontrol rx on tx off
  spanning-tree rstp edge-port
  spanning-tree rstp edge-port bpduguard shutdown-on-violation
  storm-control broadcast 100 in
  storm-control unknown-unicast 50 in
  storm-control multicast 50 in
  no shutdown
MTU is set to 12000, which is the supported maximum and allows the use of Jumbo Frames. I know an MTU of 9216 is enough for iSCSI; however, modern switches are capable of larger MTUs without performance overhead, so why not use it?

Flow control is enabled in the receive direction (rx on), so the switch port honors PAUSE frames sent by the initiator or target and holds back traffic for a while when their buffers are full. Transmit flow control (tx off) is not enabled, so the switch does not send PAUSE frames itself, because switch port buffers are not deep enough to help.

The switch port is set as an edge port to bypass spanning-tree convergence delays, because connected devices are edge devices - a server (iSCSI initiator) or storage (iSCSI target).

[Updated at 2015-01-19 based on Martin's comment]
Bpduguard can optionally be enabled on the switch port as protection against a cabling mistake (also known as the human factor), because iSCSI storage should not send any BPDUs. 
[/Updated]

Storm control of broadcasts, multicasts and unknown unicasts is enabled to eliminate unwanted storms. Please note that unicast storm control must not be enabled, because iSCSI generates a lot of traffic which could be identified as a unicast storm.

That's it for the iSCSI network configuration of dedicated ports. When you use shared ports for iSCSI and LAN traffic you should use DCB. The DCB configuration is similar to the configuration above, but it replaces flow control with priority flow control (PFC) and leverages link QoS technology (ETS) based on 802.1p CoS. You can check DCB design justification and sample configuration for iSCSI here

And as always, any comment is really appreciated. 

Saturday, November 29, 2014

The ZALMAN ZM VE200 SATA hard disk caddy with DVD/HDD/FDD emulation

I have just bought an external USB drive with DVD emulation from an ISO file. That should be pretty handy for OS installs. I'm looking forward to my first ESXi installation directly from an ISO file.

Here is a nice and useful tutorial on how to use it.

Friday, November 21, 2014

Announcing the VMware Learning Zone

As a VMware vExpert I had a chance to use beta access to VMware Learning Zone. I blogged about my experience here. VMware Learning Zone has been officially announced today.

VMware Learning Zone is a new subscription-based service that gives you a full year of unlimited, 24/7 access to official VMware video-based training. Top VMware experts and instructors discuss solutions, provide tips and give advice on a variety of advanced topics. Your VMware Learning Zone subscription gives you:

  • Easy to consume training on the latest products and technologies
  • Powerful search functionality to find the answers you need fast
  • Content that delivers exactly the knowledge you need
  • Mobile access for on the go viewing
  • Much more

Learn more here.

iSCSI Best Practices

General iSCSI Best Practices
  • Separate VLAN for iSCSI traffic. 
  • Two separate networks or VLANs for multipath iSCSI. 
  • Two separate IP subnets for the separate networks or VLANs in multipath iSCSI. 
  • Gigabit (or better) Full Duplex connectivity between storage targets (storage front-end ports) and all storage initiators (server ports) 
  • Auto-Negotiate for all switches that will correctly negotiate Full Duplex 
  • Full Duplex hard set for all iSCSI ports for switches that do not correctly negotiate 
  • Bi-Directional Flow Control enabled for all Switch Ports that servers or controllers are using for iSCSI traffic. 
  • Bi-Directional Flow Control enabled for all ports that handle iSCSI traffic. This includes all devices between two sites that are used for replication. 
  • Unicast storm control disabled on every switch that handles iSCSI traffic. 
  • Multicast disabled at the switch level for any iSCSI VLANs. 
  • Broadcast disabled at the switch level for any iSCSI VLANs. 
  • Routing disabled between the regular network and iSCSI VLANs. 
  • Do not use Spanning Tree (STP or RSTP) on ports that connect directly to end nodes (the server or storage iSCSI ports.) If you must use it, enable the Cisco PortFast option or equivalent on these ports so that they are configured as edge ports. 
  • Ensure that any switches used for iSCSI are of a non-blocking design. 
  • When deciding which switches to use, remember that you are running SCSI traffic over them. Be sure to use quality managed enterprise-class networking equipment. It is not recommended to use SBHO (small business/home office) class equipment outside of lab/test environments. 
For Jumbo Frame Support
  • Some switches have limited buffer sizes and can only support Flow Control or Jumbo Frames, but not both at the same time. It is strongly recommended to choose Flow Control. 
  • All devices connected through iSCSI need to support 9k jumbo frames. 
  • All devices used to connect iSCSI devices need to support it. 
  • This means every switch, router, WAN Accelerator, and any other network device that will handle iSCSI traffic needs to support 9k Jumbo Frames. 
  • If you are not 100% positive that every device in the iSCSI network supports 9k Jumbo Frames, then do NOT turn on Jumbo Frames. 
  • Because devices on both sides (server and SAN) need Jumbo Frames enabled, switching Jumbo Frames from disabled to enabled is recommended during a maintenance window. If servers have it enabled first, the Storage System will not understand their packets. If the Storage System enables it first, servers will not understand its packets.

VMware ESXi iSCSI tuning
  • Disabling "TCP Delayed ACK" (esxcli iscsi adapter param set -A vmhba33 -k DelayedAck -v 0 - command not tested)
  • Adjust iSCSI Login Timeout (esxcli iscsi adapter param set -A vmhba33 -k LoginTimeout -v 60)
  • Disable large receive offload (LRO) (esxcli system settings advanced set -o /Net/TcpipDefLROEnabled -i 0 or esxcfg-advcfg -s 0 /Net/TcpipDefLROEnabled)
  • Make sure Jumbo Frames are configured end to end (esxcli network vswitch standard set -m 9000 -v vSwitch2 and esxcli network ip interface set -m 9000 -i vmk1)
  • Set up appropriate multi pathing based on iSCSI storage system
  • FlowControl is enabled on ESXi by default. To display FlowControl settings use ethtool --show-pause vmnic0 or esxcli system module parameters list --module e1000 | grep "FlowControl"
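A common way to check that Jumbo Frames really work end to end is a ping with the don't-fragment bit set and a payload that exactly fills the MTU. The sketch below only computes the payload size and builds a vmkping command line. The helper names are made up for this example, the target address 192.0.2.10 is a placeholder, and vmk1 is the interface used in the list above.

```python
IPV4_HEADER_BYTES = 20   # IPv4 header without options
ICMP_HEADER_BYTES = 8    # ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits into one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def vmkping_command(interface, target_ip, mtu=9000):
    """Build a vmkping call with don't-fragment (-d) and a payload sized for the MTU."""
    return "vmkping -d -s {} -I {} {}".format(
        icmp_payload_for_mtu(mtu), interface, target_ip
    )


command = vmkping_command("vmk1", "192.0.2.10")
print(command)
```

If such a ping fails while a ping with the default size works, some device on the path is most likely not configured for Jumbo Frames.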
If you know about some other best practice, tuning setting or recommendation don't hesitate to leave a comment below this blog post. 
 
Related documents:
[1] VMware. Best Practices For Running VMware vSphere On iSCSI. In: core.vmware.com, URL: https://core.vmware.com/resource/best-practices-running-vmware-vsphere-iscsi

Saturday, November 15, 2014

How to quickly get changed ESXi advanced settings?

Below is esxcli command to list ESXi Advanced Settings that have changed from the system defaults:
esxcli system settings advanced list -d
Here is a real example from my ESXi host in the lab ...
~ # esxcli system settings advanced list -d
   Path: /UserVars/SuppressShellWarning
   Type: integer
   Int Value: 1
   Default Int Value: 0
   Min Value: 0
   Max Value: 1
   String Value:
   Default String Value:
   Valid Characters:
   Description: Don't show warning for enabled local and remote shell access
You can see that I'm suppressing Shell Warning because I really want to have SSH enabled and running on my lab ESXi all the time.

If you want to list kernel settings, there is another command
esxcli system settings kernel list
and you can also use the option -d to get just the settings changed from their defaults.
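When a host has many changed settings, the listing becomes long, so a short summary per setting is handy. The sketch below parses text in the layout of the advanced settings example above (indented "Key: Value" lines, each record starting with Path) and prints the path with its current and default value. The function names are made up for this example.

```python
def parse_advanced_settings(text):
    """Parse 'esxcli system settings advanced list' style text into a list of dicts."""
    settings = []
    for line in text.splitlines():
        key, separator, value = line.strip().partition(":")
        if not separator:
            continue
        key, value = key.strip(), value.strip()
        if key == "Path":
            settings.append({})     # every record starts with its Path line
        if settings:
            settings[-1][key] = value
    return settings


def changed_values(settings):
    """Return (path, current, default) tuples, using the integer or string fields."""
    rows = []
    for setting in settings:
        if setting.get("Type") == "integer":
            rows.append((setting["Path"],
                         int(setting["Int Value"]),
                         int(setting["Default Int Value"])))
        else:
            rows.append((setting["Path"],
                         setting.get("String Value", ""),
                         setting.get("Default String Value", "")))
    return rows


# Copy of the example output shown above.
SAMPLE = """\
   Path: /UserVars/SuppressShellWarning
   Type: integer
   Int Value: 1
   Default Int Value: 0
   Min Value: 0
   Max Value: 1
   String Value:
   Default String Value:
   Valid Characters:
   Description: Don't show warning for enabled local and remote shell access
"""

for path, current, default in changed_values(parse_advanced_settings(SAMPLE)):
    print(path, current, "(default {})".format(default))
```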

Friday, November 14, 2014

Virtualisation Design & Project Framework

Gareth Hogarth wrote an excellent high-level plan (aka methodology, framework) for properly delivering a virtualization project as a turnkey solution. I use a very similar approach, not only for virtualization projects but for any IT project where I have the role of Lead Architect. I have never written a blog post about this particular topic because it is usually the internal intellectual property of a consulting organization. So if you have never seen a similar methodology, look at Gareth's post to get an idea of the project phases and the overall project process. It is good to note that all these methodologies are just frameworks, and frameworks are usually good starting points which don't stop you from improving them to fulfill all specific project requirements and make your project successful.

Friday, November 07, 2014

40Gb over existing LC fiber optics

Do you know DELL has a QSFP+ LM4 transceiver allowing 40Gb traffic up to 160m on LC OM4 MMF (multi-mode fiber) or up to 2km on LC SMF (single-mode fiber)?


Use Case:  

This optic has an LC connection and is ideal for customers who want to use existing LC fiber.  It can be used for 40GbE traffic up to 160m on MultiMode Fiber OR 2km on Single Mode fiber.

Specification

Peripheral Type: DELL QSFP+ LM4
Connection: LC Connection, Duplex Multi-Mode Fiber or Duplex Single-Mode Fiber
Max Distance: 140m OM3 or 160m OM4 MMF, 2km SMF
Transmitter Output Wavelength (nm): 1270 to 1330
Transmit Output Power (dBm): -7.0 to 3.5 [avg power per lane]
Receive Input Power (dBm): -10.0 to 3.5 [avg power per lane]
Temperature: 0 to 70C
Power:  3.5W max

Based on the wavelength range 1270 to 1330, I assume 40Gb is achieved as 4 x 10Gb leveraging coarse wavelength-division multiplexing (CWDM) on the following wavelengths:

  • 1270 nm
  • 1290 nm
  • 1310 nm
  • 1330 nm


Thursday, November 06, 2014

ESXi Network Troubleshooting

Introduction

As a VMware vExpert, I had the chance and privilege to use VMware Learning Zone. There are excellent training videos. Today I would like to blog about useful commands covered in the video training “Network Troubleshooting at the ESXi Command Line”.  If you ask me, I have to say that VMware Learning Zone has very valuable content and it comes in really handy during real troubleshooting. 

UPDATE 2020-10-17: I have just found the blog post "ESXi Network Troubleshooting Tools" containing a lot of useful tools and insights.

NIC Adapters Information

To see Network Interface Cards Information you can run the following command
~ # /usr/lib/vmware/vm-support/bin/nicinfo.sh | more
Network Interface Cards Information.

Name    PCI Device     Driver  Link  Speed  Duplex  MAC Address        MTU   Description
----------------------------------------------------------------------------------------
vmnic0  0000:001:00.0  bnx2    Up     1000  Full    14:fe:b5:7d:8d:05  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX
vmnic1  0000:001:00.1  bnx2    Up     1000  Full    14:fe:b5:7d:8d:07  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX
vmnic2  0000:002:00.0  bnx2    Up     1000  Full    14:fe:b5:7d:8d:6d  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX
vmnic3  0000:002:00.1  bnx2    Up     1000  Full    14:fe:b5:7d:8d:6f  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX

NIC:  vmnic0

NICInfo:
   Advertised Auto Negotiation: true
   Advertised Link Modes: 1000baseT/Full, 2500baseT/Full
   Auto Negotiation: true
   Cable Type: FIBRE
   Current Message Level: -1
   Driver Info:
      NICDriverInfo:
         Bus Info: 0000:01:00.0
         Driver: bnx2
         Firmware Version: 7.8.53 bc 7.4.0 NCSI 2.0.13
         Version: 2.2.3t.v55.7
   Link Detected: true
   Link Status: Up
   Name: vmnic0
   PHY Address: 2
   Pause Autonegotiate: false
   Pause RX: true
   Pause TX: true
   Supported Ports: TP, FIBRE
   Supports Auto Negotiation: true
   Supports Pause: true
   Supports Wakeon: true
   Transceiver: internal
   Wakeon: MagicPacket(tm)
Ring parameters for vmnic0:
Pre-set maximums:
RX:             4080
RX Mini:        0
RX Jumbo:       16320
TX:             255
Current hardware settings:
RX:             255
RX Mini:        0
RX Jumbo:       0
TX:             255

The output above is snipped to show just vmnic0. You can see useful information like PCI Device ID, Driver, Link Status, Speed, Duplex and MTU for each vmnic.
It also shows detailed driver information, FlowControl (Pause Frame) status, cable type, etc.
To find a particular vmnic's PCI Vendor IDs use the command vmkchdev
~ # vmkchdev -l | grep vmnic0
0000:01:00.0 14e4:163a 1028:02dc vmkernel vmnic0

PCI Slot: 0000:01:00.0
VID (Vendor ID): 14e4
DID (Device ID): 163a
SVID (Sub-Vendor ID): 1028
SSID (Sub-Device ID): 02dc

You can use PCI devices Vendor ID’s  to find the latest drivers at VMware Compatibility Guide (http://www.vmware.com/go/hcl/).
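If you collect these IDs from many hosts, splitting the vmkchdev line programmatically saves some typing. The sketch below assumes the five-column layout shown above (PCI slot, VID:DID, SVID:SSID, owner, device name); the function name is made up for this example.

```python
def parse_vmkchdev_line(line):
    """Split one 'vmkchdev -l' line into its PCI identifiers."""
    pci_slot, vendor_device, sub_ids, owner, name = line.split()
    vid, did = vendor_device.split(":")
    svid, ssid = sub_ids.split(":")
    return {
        "pci_slot": pci_slot,
        "vid": vid,      # Vendor ID
        "did": did,      # Device ID
        "svid": svid,    # Sub-Vendor ID
        "ssid": ssid,    # Sub-Device ID
        "owner": owner,
        "name": name,
    }


# Line copied from the vmkchdev output shown above.
ids = parse_vmkchdev_line("0000:01:00.0 14e4:163a 1028:02dc vmkernel vmnic0")
print(ids)
```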


Below is another command to find full details of all PCI devices.
esxcli hardware pci list
If you are interested in PCI details of just one particular vmnic, the command below can be used.
~ # esxcli hardware pci list | grep -B 6 -A 29 vmnic0
000:001:00.0
   Address: 000:001:00.0
   Segment: 0x0000
   Bus: 0x01
   Slot: 0x00
   Function: 0x00
   VMkernel Name: vmnic0
   Vendor Name: Broadcom Corporation
   Device Name: Broadcom NetXtreme II BCM5709S 1000Base-SX
   Configured Owner: Unknown
   Current Owner: VMkernel
   Vendor ID: 0x14e4
   Device ID: 0x163a
   SubVendor ID: 0x1028
   SubDevice ID: 0x02dc
   Device Class: 0x0200
   Device Class Name: Ethernet controller
   Programming Interface: 0x00
   Revision ID: 0x20
   Interrupt Line: 0x0f
   IRQ: 15
   Interrupt Vector: 0x2b
   PCI Pin: 0x75
   Spawned Bus: 0x00
   Flags: 0x0201
   Module ID: 4125
   Module Name: bnx2
   Chassis: 0
   Physical Slot: 0
   Slot Description: Embedded NIC 1
   Passthru Capable: true
   Parent Device: PCI 0:0:1:0
   Dependent Device: PCI 0:0:1:0
   Reset Method: Link reset
   FPT Sharable: true

Note: the same command can be used for HBA cards by replacing vmnic0 with vmhba0

VLAN Sniffing

The commands below enable VLAN statistics collection on particular vmnic which can be shown and used for troubleshooting.  
esxcli network nic vlan stats set --enabled=true -n vmnic0
~ # esxcli network nic vlan stats get -n vmnic0
VLAN 0
   Packets received: 22
   Packets sent: 0

VLAN 22
   Packets received: 21
   Packets sent: 10

VLAN 201
   Packets received: 28
   Packets sent: 0

VLAN 202
   Packets received: 28
   Packets sent: 0

VLAN 204
   Packets received: 5
   Packets sent: 0

VLAN 205
   Packets received: 5
   Packets sent: 0

Don’t forget to disable VLAN statistics after troubleshooting.

esxcli network nic vlan stats set --enabled=false -n vmnic0
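The VLAN statistics are easy to post-process, for example to spot VLANs on which the host only receives traffic. The sketch below parses text in the layout shown above ("VLAN <id>" followed by indented counters); the function name and the shortened sample are made up for this example.

```python
def parse_vlan_stats(text):
    """Parse 'esxcli network nic vlan stats get' style text into {vlan_id: {counter: value}}."""
    stats = {}
    vlan = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("VLAN "):
            vlan = int(line.split()[1])
            stats[vlan] = {}
        elif vlan is not None and ":" in line:
            counter, _, value = line.partition(":")
            stats[vlan][counter.strip()] = int(value)
    return stats


# Shortened copy of the output shown above.
SAMPLE = """\
VLAN 0
   Packets received: 22
   Packets sent: 0

VLAN 22
   Packets received: 21
   Packets sent: 10

VLAN 201
   Packets received: 28
   Packets sent: 0
"""

vlan_stats = parse_vlan_stats(SAMPLE)
# VLANs seen on the wire on which this host has not sent anything.
receive_only = sorted(v for v, c in vlan_stats.items() if c["Packets sent"] == 0)
print(receive_only)
```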


VMkernel Arp Cache

To work with the ESXi ARP cache you can use the command
esxcli network ip neighbor  
Below is an example of how to list ARP entries …
~ # esxcli network ip neighbor list
Neighbor   Mac Address        Vmknic   Expiry  State  Type
---------  -----------------  ------  -------  -----  -------
10.2.22.1  5c:26:0a:ae:5a:c6  vmk0    933 sec         Unknown

You can see just the default gateway 10.2.22.1 there.
Let’s ping some other device in the same broadcast domain and look at the ARP entries again.
~ # ping 10.2.22.51
PING 10.2.22.51 (10.2.22.51): 56 data bytes
64 bytes from 10.2.22.51: icmp_seq=0 ttl=128 time=0.802 ms

~ # esxcli network ip neighbor list
Neighbor    Mac Address        Vmknic    Expiry  State  Type
----------  -----------------  ------  --------  -----  -------
10.2.22.51  00:0c:29:4a:5b:ba  vmk0    1195 sec         Unknown
10.2.22.1   5c:26:0a:ae:5a:c6  vmk0     878 sec         Unknown

Now you can see the entry for device 10.2.22.51 in the ARP table as well. Below is another command to remove an ARP entry from the ARP table.
~ # esxcli network ip neighbor remove -v 4 -a 10.2.22.51
… and let’s check if ARP entry has been removed.
~ # esxcli network ip neighbor list
Neighbor   Mac Address        Vmknic   Expiry  State  Type
---------  -----------------  ------  -------  -----  -------
10.2.22.1  5c:26:0a:ae:5a:c6  vmk0    817 sec         Unknown

Note: the ESXi ARP timeout is 1200 seconds, therefore the remove command can be handy in some situations.

VMkernel Routing

Since vSphere 5.1 it is possible to have more than one networking stack. Normally you work with default networking stack.
To show ESXi routing table you can use command
esxcli network ip route ipv4 list  
~ # esxcli network ip route ipv4 list
Network    Netmask        Gateway    Interface  Source
---------  -------------  ---------  ---------  ------
default    0.0.0.0        10.2.22.1  vmk0       MANUAL
10.2.22.0  255.255.255.0  0.0.0.0    vmk0       MANUAL

You can see the default gateway 10.2.22.1 used by the default networking stack.
The command esxcli network ip connection list shows all IP network connections from and to the ESXi host.
~ # esxcli network ip connection list
Proto  Recv Q  Send Q  Local Address                    Foreign Address     State        World ID  CC Algo  World Name
-----  ------  ------  -------------------------------  ------------------  -----------  --------  -------  ---------------
tcp         0       0  127.0.0.1:8307                   127.0.0.1:54854     ESTABLISHED     34376  newreno  hostd-worker
tcp         0       0  127.0.0.1:54854                  127.0.0.1:8307      ESTABLISHED    570032  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:443                    127.0.0.1:54632     ESTABLISHED    570032  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:54632                  127.0.0.1:443       ESTABLISHED   1495503  newreno  python
tcp         0       0  127.0.0.1:8307                   127.0.0.1:61173     ESTABLISHED     34806  newreno  hostd-worker
tcp         0       0  127.0.0.1:61173                  127.0.0.1:8307      ESTABLISHED    570032  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:80                     127.0.0.1:60974     ESTABLISHED     34267  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:60974                  127.0.0.1:80        ESTABLISHED     35402  newreno  sfcb-vmware_bas
tcp         0       0  10.2.22.101:80                   10.44.44.110:50351  TIME_WAIT           0
tcp         0       0  127.0.0.1:5988                   127.0.0.1:14341     FIN_WAIT_2      35127  newreno  sfcb-HTTP-Daemo
tcp         0       0  127.0.0.1:14341                  127.0.0.1:5988      CLOSE_WAIT    1473527  newreno  hostd-worker
tcp         0       0  127.0.0.1:8307                   127.0.0.1:45011     ESTABLISHED     34806  newreno  hostd-worker
tcp         0       0  127.0.0.1:45011                  127.0.0.1:8307      ESTABLISHED    570032  newreno  rhttpproxy-work
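
Most entries in the listing above are loopback connections between local services. A minimal sketch for hiding them, assuming the column layout shown above (two header lines, local address in column 4, foreign address in column 5); the function name external_connections is hypothetical and not part of ESXi:

```shell
# Hypothetical helper: reads `esxcli network ip connection list` output on
# stdin and prints only connections where neither endpoint is 127.x.x.x.
external_connections() {
    # NR > 2 skips the header and the dashed separator line;
    # NF >= 5 skips blank or truncated lines.
    awk 'NR > 2 && NF >= 5 && $4 !~ /^127\./ && $5 !~ /^127\./'
}
```

Usage: esxcli network ip connection list | external_connections
For the listing above, this leaves only the 10.2.22.101:80 connection in TIME_WAIT.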

NetCat

The netcat program (nc) is available on ESXi and can test TCP connectivity to an IP target.
~ # nc -v 10.2.22.100 80
Connection to 10.2.22.100 80 port [tcp/http] succeeded!

TraceNet

Tracenet is a very handy program available in ESXi that also identifies latencies inside the VMkernel IP stack.
~ # tracenet 10.2.22.51
Using interface vmk0 ...
Time         0.068 0.023 0.019 ms
Location:    ESXi-Firewall
Time         0.070 0.025 0.020 ms
Location:    VLAN_InputProcessor@#
Time         0.073 0.027 0.022 ms
Location:    vSwitch0: port 0x2000004
Time         0.089 0.030 0.024 ms
Location:    VLAN_OutputProcessor@#
Time         0.090 0.031 0.025 ms
Location:    DC01
Endpoint:       10.2.22.51
Roundtrip Time: 0.417 0.195 0.196 ms


Dropped packets

This section lists commands to check for dropped packets at different places in the VMkernel IP stack.
The command net-stats -l lists all devices (clients: NIC ports, vmk ports, VM ports) connected to a VMware virtual switch, so you can easily identify the vSwitch port number (PortNum) each device is connected to.
~ # net-stats -l
PortNum          Type SubType SwitchName       MACAddress         ClientName
33554434            4       0 vSwitch0         14:fe:b5:7d:8d:05  vmnic0
33554436            3       0 vSwitch0         14:fe:b5:7d:8d:05  vmk0
33554437            5       9 vSwitch0         00:0c:29:4a:5b:ba  DC01
33554438            5       9 vSwitch0         00:0c:29:f0:df:4c  VC01

Note: SubType is VM Hardware Version
vSwitch port numbers are important for the following commands.
The command esxcli network port stats get -p shows statistics for a particular vSwitch port.
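
Note that net-stats and esxcli print port numbers in decimal (33554436 for vmk0 above), while the tracenet output earlier shows the same port in hexadecimal (vSwitch0: port 0x2000004). A small sketch for converting between the two; the function names are hypothetical helpers, not ESXi commands, and the hex-to-decimal direction assumes the shell's printf accepts 0x-prefixed arguments:

```shell
# Hypothetical helpers (the names are illustrative, not part of ESXi).

# Decimal vSwitch port number -> hex form as printed by tracenet.
port_to_hex() {
    printf '0x%x\n' "$1"
}

# Hex port ID -> decimal form used by net-stats -l and esxcli -p.
port_to_dec() {
    printf '%d\n' "$1"
}
```

For example, port_to_hex 33554436 prints 0x2000004 and port_to_dec 0x2000004 prints 33554436.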
~ # esxcli network port stats get -p 33554434
Packet statistics for port 33554434
   Packets received: 2346445
   Packets sent: 5853
   Bytes received: 295800113
   Bytes sent: 1225842
   Broadcast packets received: 1440669
   Broadcast packets sent: 336
   Multicast packets received: 896958
   Multicast packets sent: 120
   Unicast packets received: 8818
   Unicast packets sent: 5397
   Receive packets dropped: 0
   Transmit packets dropped: 0

You can also show filter statistics for the ESXi firewall with the command esxcli network port filter stats get -p 33554436
~ # esxcli network port filter stats get -p 33554436
Filter statistics for ESXi-Firewall
   Filter direction: Receive
   Packets in: 5801
   Packets out: 5660
   Packets dropped: 141
   Packets filtered: 150
   Packets faulted: 0
   Packets queued: 0
   Packets injected: 0
   Packet errors: 0

Filter statistics for ESXi-Firewall
   Filter direction: Transmit
   Packets in: 4893
   Packets out: 4887
   Packets dropped: 6
   Packets filtered: 6
   Packets faulted: 0
   Packets queued: 0
   Packets injected: 0
   Packet errors: 0
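
The raw counters are easier to judge as a ratio. Below is a minimal sketch that turns the filter statistics into a drop percentage per direction. It assumes the output format shown above; filter_drop_rate is a hypothetical helper name, not an ESXi command:

```shell
# Hypothetical helper: reads `esxcli network port filter stats get` output on
# stdin and prints "<direction> <dropped>/<in> <percent>%" for each filter block.
filter_drop_rate() {
    awk -F': *' '
        /Filter direction:/ { dir = $2 }
        /Packets in:/       { pin = $2 }   # the colon avoids matching "Packets injected"
        /Packets dropped:/  {
            pct = (pin > 0) ? 100 * $2 / pin : 0
            printf "%s %d/%d %.2f%%\n", dir, $2, pin, pct
        }
    '
}
```

Usage: esxcli network port filter stats get -p 33554436 | filter_drop_rate
For the counters above this prints "Receive 141/5801 2.43%" and "Transmit 6/4893 0.12%".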

To show physical NIC statistics, use the command esxcli network nic stats get -n vmnic0
~ # esxcli network nic stats get -n vmnic0
NIC statistics for vmnic0
   Packets received: 2350559
   Packets sent: 8083
   Bytes received: 312690659
   Bytes sent: 5791889
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Total receive errors: 0
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0
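
When checking several uplinks, it can help to show only the counters that are not zero. A minimal sketch, assuming the output format shown above; nic_nonzero_errors is a hypothetical helper name:

```shell
# Hypothetical helper: reads `esxcli network nic stats get` output on stdin
# and prints only the "dropped" and "errors" counters with a non-zero value.
nic_nonzero_errors() {
    awk -F': *' '
        /dropped|errors/ && $2 + 0 != 0 {
            sub(/^ +/, "", $1)      # strip the leading indentation
            print $1 ": " $2
        }
    '
}
```

Usage: esxcli network nic stats get -n vmnic0 | nic_nonzero_errors
For the healthy vmnic0 shown above, the helper prints nothing.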

Packet Capture

For deeper network troubleshooting, you can capture packets on the ESXi host. Two tools are available for packet capturing:

  • tcpdump-uw (example: tcpdump-uw -i vmk0 -s0 -C100M -W 10 -w /var/tmp/test.pcap)
  • pktcap-uw
pktcap-uw examples:
  • pktcap-uw --uplink vmnicX --capture UplinkRcv
  • pktcap-uw --uplink vmnicX --capture UplinkSnd
  • you can filter for ICMP with --proto 0x01 or for beacon probes with --ethtype 0x8922

Another example, based on this source: https://kb.fortinet.com/kb/documentLink.do?externalID=FD47845

In case of a connectivity issue between a VM and other VMs, it is worth sniffing traffic on the hypervisor side in order to isolate the issue.
To sniff traffic on an ESXi server, perform the steps below:

- Enable ssh access on ESXi.
- Ssh to ESXi.
- Run net-stats -l | grep <VM name> in the CLI to find the virtual switchport of the VM.
- In vSphere 6.5 or earlier, it is necessary to specify the direction of sniffing (either input or output).

- The switchport number for a particular VM can be found using the net-stats command.
- -o defines the path where the pcap file will be created and specifies the file name.
- --dir specifies the direction (either input or output):

    pktcap-uw --switchport 123 -o /tmp/in.pcap --dir input
    pktcap-uw --switchport 123 -o /tmp/out.pcap --dir output

- In vSphere 6.7 or later it is possible to sniff traffic in both directions by setting --dir 2:

    pktcap-uw --switchport 123 -o /tmp/both.pcap --dir 2

- Press Ctrl-C in the CLI to stop sniffing.
- Download the created pcap file(s) from ESXi over ssh.
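
The port lookup and the capture can be combined. Below is a minimal sketch, assuming the net-stats -l layout shown earlier in this post (PortNum in the first column, ClientName in the last) and a VM name without spaces; vm_switchport is a hypothetical helper name, not an ESXi command:

```shell
# Hypothetical helper: reads `net-stats -l` output on stdin and prints the
# PortNum of the client whose ClientName exactly matches the given VM name.
vm_switchport() {
    awk -v vm="$1" '$NF == vm { print $1 }'
}

# Example use on an ESXi host (DC01 is the VM from the listing above):
#   port=$(net-stats -l | vm_switchport DC01)
#   pktcap-uw --switchport "$port" -o /tmp/in.pcap --dir input
```

An exact match on the last column avoids the false hits that grep can produce when one VM name is a prefix of another.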