Thursday, November 06, 2014

ESXi Network Troubleshooting

Introduction

As a VMware vExpert, I had the chance and privilege to use the VMware Learning Zone, which offers excellent training videos. Today I would like to blog about useful commands taught in the video training “Network Troubleshooting at the ESXi Command Line”. If you ask me, the VMware Learning Zone has very valuable content and it comes in really handy during real troubleshooting.

UPDATE 2020-10-17: I have just found the blog post "ESXi Network Troubleshooting Tools" containing a lot of useful tools and insights.

NIC Adapters Information

To see Network Interface Cards information, you can run the following command:
~ # /usr/lib/vmware/vm-support/bin/nicinfo.sh | more
Network Interface Cards Information.

Name    PCI Device     Driver  Link  Speed  Duplex  MAC Address        MTU   Description
----------------------------------------------------------------------------------------
vmnic0  0000:001:00.0  bnx2    Up     1000  Full    14:fe:b5:7d:8d:05  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX
vmnic1  0000:001:00.1  bnx2    Up     1000  Full    14:fe:b5:7d:8d:07  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX
vmnic2  0000:002:00.0  bnx2    Up     1000  Full    14:fe:b5:7d:8d:6d  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX
vmnic3  0000:002:00.1  bnx2    Up     1000  Full    14:fe:b5:7d:8d:6f  1500  Broadcom Corporation Broadcom NetXtreme II BCM5709S 1000Base-SX

NIC:  vmnic0

NICInfo:
   Advertised Auto Negotiation: true
   Advertised Link Modes: 1000baseT/Full, 2500baseT/Full
   Auto Negotiation: true
   Cable Type: FIBRE
   Current Message Level: -1
   Driver Info:
      NICDriverInfo:
         Bus Info: 0000:01:00.0
         Driver: bnx2
         Firmware Version: 7.8.53 bc 7.4.0 NCSI 2.0.13
         Version: 2.2.3t.v55.7
   Link Detected: true
   Link Status: Up
   Name: vmnic0
   PHY Address: 2
   Pause Autonegotiate: false
   Pause RX: true
   Pause TX: true
   Supported Ports: TP, FIBRE
   Supports Auto Negotiation: true
   Supports Pause: true
   Supports Wakeon: true
   Transceiver: internal
   Wakeon: MagicPacket(tm)
Ring parameters for vmnic0:
Pre-set maximums:
RX:             4080
RX Mini:        0
RX Jumbo:       16320
TX:             255
Current hardware settings:
RX:             255
RX Mini:        0
RX Jumbo:       0
TX:             255

The output above is snipped to show just vmnic0. You can see useful information like the PCI device ID, driver, link status, speed, duplex and MTU for each vmnic.
It also shows detailed driver information, flow control (pause frame) status, cable type, etc.
To find a particular vmnic's PCI vendor and device IDs, use the vmkchdev command:
~ # vmkchdev -l | grep vmnic0
0000:01:00.0 14e4:163a 1028:02dc vmkernel vmnic0

PCI Slot: 0000:01:00.0
VID (Vendor ID): 14e4
DID (Device ID): 163a
SVID (Sub-Vendor ID): 1028
SSID (Sub-Device ID): 02dc

You can use the PCI vendor and device IDs to find the latest drivers in the VMware Compatibility Guide (http://www.vmware.com/go/hcl/).
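The vmkchdev output can also be split out programmatically before the Compatibility Guide lookup. Below is a minimal Python sketch (the parsing logic and function name are my own illustration, not a VMware tool) that extracts the IDs from the sample line above:

```python
def parse_vmkchdev(line):
    """Split a 'vmkchdev -l' line into PCI slot, VID, DID, SVID and SSID."""
    slot, vid_did, svid_ssid, _owner, name = line.split()
    vid, did = vid_did.split(":")
    svid, ssid = svid_ssid.split(":")
    return {"device": name, "slot": slot,
            "vid": vid, "did": did, "svid": svid, "ssid": ssid}

# Sample line taken verbatim from the output above.
ids = parse_vmkchdev("0000:01:00.0 14e4:163a 1028:02dc vmkernel vmnic0")
print(ids["vid"], ids["did"], ids["svid"], ids["ssid"])  # -> 14e4 163a 1028 02dc
```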


Below is another command that lists full details of all PCI devices.
esxcli hardware pci list
If you are interested only in a particular vmnic's PCI details, the command below can be used.
~ # esxcli hardware pci list | grep -B 6 -A 29 vmnic0
000:001:00.0
   Address: 000:001:00.0
   Segment: 0x0000
   Bus: 0x01
   Slot: 0x00
   Function: 0x00
   VMkernel Name: vmnic0
   Vendor Name: Broadcom Corporation
   Device Name: Broadcom NetXtreme II BCM5709S 1000Base-SX
   Configured Owner: Unknown
   Current Owner: VMkernel
   Vendor ID: 0x14e4
   Device ID: 0x163a
   SubVendor ID: 0x1028
   SubDevice ID: 0x02dc
   Device Class: 0x0200
   Device Class Name: Ethernet controller
   Programming Interface: 0x00
   Revision ID: 0x20
   Interrupt Line: 0x0f
   IRQ: 15
   Interrupt Vector: 0x2b
   PCI Pin: 0x75
   Spawned Bus: 0x00
   Flags: 0x0201
   Module ID: 4125
   Module Name: bnx2
   Chassis: 0
   Physical Slot: 0
   Slot Description: Embedded NIC 1
   Passthru Capable: true
   Parent Device: PCI 0:0:1:0
   Dependent Device: PCI 0:0:1:0
   Reset Method: Link reset
   FPT Sharable: true

Note: the same command can be used for HBA cards by substituting vmhba0 for vmnic0.

VLAN Sniffing

The commands below enable VLAN statistics collection on a particular vmnic; the collected statistics can then be displayed and used for troubleshooting.
esxcli network nic vlan stats set --enabled=true -n vmnic0
~ # esxcli network nic vlan stats get -n vmnic0
VLAN 0
   Packets received: 22
   Packets sent: 0

VLAN 22
   Packets received: 21
   Packets sent: 10

VLAN 201
   Packets received: 28
   Packets sent: 0

VLAN 202
   Packets received: 28
   Packets sent: 0

VLAN 204
   Packets received: 5
   Packets sent: 0

VLAN 205
   Packets received: 5
   Packets sent: 0

Don’t forget to disable VLAN statistics collection after troubleshooting.

esxcli network nic vlan stats set --enabled=false -n vmnic0
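The stats output is easy to post-process. As a hedged sketch (the parser assumes exactly the output format shown above), the following Python snippet turns it into a dictionary, which makes it quick to spot VLANs that are trunked to the vmnic but carry no traffic:

```python
import re

def parse_vlan_stats(text):
    """Parse 'esxcli network nic vlan stats get' output into {vlan: (rx, tx)}."""
    stats = {}
    vlan = None
    for line in text.splitlines():
        m = re.match(r"VLAN (\d+)", line.strip())
        if m:
            vlan = int(m.group(1))
            stats[vlan] = [0, 0]
        elif "Packets received:" in line:
            stats[vlan][0] = int(line.split(":")[1])
        elif "Packets sent:" in line:
            stats[vlan][1] = int(line.split(":")[1])
    return {k: tuple(v) for k, v in stats.items()}

# Sample taken from the output above.
sample = """VLAN 22
   Packets received: 21
   Packets sent: 10

VLAN 201
   Packets received: 28
   Packets sent: 0
"""
print(parse_vlan_stats(sample))  # -> {22: (21, 10), 201: (28, 0)}
```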


VMkernel Arp Cache

To work with the ESXi ARP cache, you can use the command
esxcli network ip neighbor
Below is an example of how to list ARP entries.
~ # esxcli network ip neighbor list
Neighbor   Mac Address        Vmknic   Expiry  State  Type
---------  -----------------  ------  -------  -----  -------
10.2.22.1  5c:26:0a:ae:5a:c6  vmk0    933 sec         Unknown

You can see just the default gateway 10.2.22.1 there.
Let’s ping another device in the same broadcast domain and look at the ARP entries again.
~ # ping 10.2.22.51
PING 10.2.22.51 (10.2.22.51): 56 data bytes
64 bytes from 10.2.22.51: icmp_seq=0 ttl=128 time=0.802 ms

~ # esxcli network ip neighbor list
Neighbor    Mac Address        Vmknic    Expiry  State  Type
----------  -----------------  ------  --------  -----  -------
10.2.22.51  00:0c:29:4a:5b:ba  vmk0    1195 sec         Unknown
10.2.22.1   5c:26:0a:ae:5a:c6  vmk0     878 sec         Unknown

Now you can see an entry for the device 10.2.22.51 in the ARP table as well. Below is another command, to remove an ARP entry from the ARP table.
~ # esxcli network ip neighbor remove -v 4 -a 10.2.22.51
… and let’s check whether the ARP entry has been removed.
~ # esxcli network ip neighbor list
Neighbor   Mac Address        Vmknic   Expiry  State  Type
---------  -----------------  ------  -------  -----  -------
10.2.22.1  5c:26:0a:ae:5a:c6  vmk0    817 sec         Unknown

Note: the ESXi ARP timeout is 1200 seconds, therefore the remove command can be handy in some situations.
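Because the 1200-second timeout drives the Expiry column, you can compute how long ago each entry was refreshed. A small Python sketch (assuming the column layout shown above; the helper is my own illustration):

```python
ARP_TIMEOUT = 1200  # seconds, ESXi default

def arp_entry_ages(text):
    """Return {neighbor_ip: seconds_since_refresh} from 'esxcli network ip neighbor list'."""
    ages = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows look like: IP  MAC  vmkN  <expiry> sec  [State]  Type
        if len(parts) >= 5 and parts[4] == "sec":
            ages[parts[0]] = ARP_TIMEOUT - int(parts[3])
    return ages

sample = """Neighbor   Mac Address        Vmknic   Expiry  State  Type
---------  -----------------  ------  -------  -----  -------
10.2.22.1  5c:26:0a:ae:5a:c6  vmk0    933 sec         Unknown
"""
print(arp_entry_ages(sample))  # -> {'10.2.22.1': 267}
```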

VMkernel Routing

Since vSphere 5.1 it is possible to have more than one networking stack. Normally you work with the default networking stack.
To show the ESXi routing table, you can use the command
esxcli network ip route ipv4 list
~ # esxcli network ip route ipv4 list
Network    Netmask        Gateway    Interface  Source
---------  -------------  ---------  ---------  ------
default    0.0.0.0        10.2.22.1  vmk0       MANUAL
10.2.22.0  255.255.255.0  0.0.0.0    vmk0       MANUAL

You can see the default gateway 10.2.22.1 used for the default networking stack.
The command esxcli network ip connection list shows all IP network connections to and from the ESXi host.
~ # esxcli network ip connection list
Proto  Recv Q  Send Q  Local Address                    Foreign Address     State        World ID  CC Algo  World Name
-----  ------  ------  -------------------------------  ------------------  -----------  --------  -------  ---------------
tcp         0       0  127.0.0.1:8307                   127.0.0.1:54854     ESTABLISHED     34376  newreno  hostd-worker
tcp         0       0  127.0.0.1:54854                  127.0.0.1:8307      ESTABLISHED    570032  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:443                    127.0.0.1:54632     ESTABLISHED    570032  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:54632                  127.0.0.1:443       ESTABLISHED   1495503  newreno  python
tcp         0       0  127.0.0.1:8307                   127.0.0.1:61173     ESTABLISHED     34806  newreno  hostd-worker
tcp         0       0  127.0.0.1:61173                  127.0.0.1:8307      ESTABLISHED    570032  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:80                     127.0.0.1:60974     ESTABLISHED     34267  newreno  rhttpproxy-work
tcp         0       0  127.0.0.1:60974                  127.0.0.1:80        ESTABLISHED     35402  newreno  sfcb-vmware_bas
tcp         0       0  10.2.22.101:80                   10.44.44.110:50351  TIME_WAIT           0
tcp         0       0  127.0.0.1:5988                   127.0.0.1:14341     FIN_WAIT_2      35127  newreno  sfcb-HTTP-Daemo
tcp         0       0  127.0.0.1:14341                  127.0.0.1:5988      CLOSE_WAIT    1473527  newreno  hostd-worker
tcp         0       0  127.0.0.1:8307                   127.0.0.1:45011     ESTABLISHED     34806  newreno  hostd-worker
tcp         0       0  127.0.0.1:45011                  127.0.0.1:8307      ESTABLISHED    570032  newreno  rhttpproxy-work
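Output like the listing above is often easier to digest as a per-state summary. A hedged Python sketch (the state names and column layout are assumed from the output above), handy for spotting e.g. piles of CLOSE_WAIT connections:

```python
from collections import Counter

TCP_STATES = {"ESTABLISHED", "TIME_WAIT", "CLOSE_WAIT", "FIN_WAIT_2",
              "LISTEN", "SYN_SENT", "SYN_RECEIVED", "LAST_ACK", "CLOSING"}

def connection_state_counts(text):
    """Count TCP states in 'esxcli network ip connection list' output."""
    counts = Counter()
    for line in text.splitlines():
        for token in line.split():
            if token in TCP_STATES:
                counts[token] += 1
    return counts

# Abbreviated sample rows from the output above.
sample = """tcp  0  0  127.0.0.1:8307   127.0.0.1:54854  ESTABLISHED  34376   newreno  hostd-worker
tcp  0  0  127.0.0.1:5988   127.0.0.1:14341  FIN_WAIT_2   35127   newreno  sfcb-HTTP-Daemo
tcp  0  0  127.0.0.1:14341  127.0.0.1:5988   CLOSE_WAIT   1473527 newreno  hostd-worker
"""
print(connection_state_counts(sample))
```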

NetCat

The netcat program (nc) is available on ESXi and can test TCP connectivity to an IP target.
~ # nc -v 10.2.22.100 80
Connection to 10.2.22.100 80 port [tcp/http] succeeded!

TraceNet

Tracenet is a very handy program available in ESXi that can also identify latencies inside the VMkernel IP stack.
~ # tracenet 10.2.22.51
Using interface vmk0 ...
Time         0.068 0.023 0.019 ms
Location:    ESXi-Firewall
Time         0.070 0.025 0.020 ms
Location:    VLAN_InputProcessor@#
Time         0.073 0.027 0.022 ms
Location:    vSwitch0: port 0x2000004
Time         0.089 0.030 0.024 ms
Location:    VLAN_OutputProcessor@#
Time         0.090 0.031 0.025 ms
Location:    DC01
Endpoint:       10.2.22.51
Roundtrip Time: 0.417 0.195 0.196 ms
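If you want to compare hops programmatically, the alternating Time/Location lines can be paired up. A Python sketch assuming the tracenet output format shown above (the parser is my own illustration):

```python
def parse_tracenet(text):
    """Pair each 'Time' line with the following 'Location' line: [(location, [ms, ...])]."""
    hops, times = [], None
    for line in text.splitlines():
        if line.startswith("Time"):
            times = [float(t) for t in line.split()[1:-1]]  # drop 'Time' and 'ms'
        elif line.startswith("Location:") and times is not None:
            hops.append((line.split(":", 1)[1].strip(), times))
            times = None
    return hops

# Abbreviated sample from the output above.
sample = """Time         0.068 0.023 0.019 ms
Location:    ESXi-Firewall
Time         0.073 0.027 0.022 ms
Location:    vSwitch0: port 0x2000004
"""
for location, times in parse_tracenet(sample):
    print(location, times)
```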


Dropped packets

This section contains commands to verify dropped packets at different places in the VMkernel IP stack.
The command net-stats -l lists all devices (clients: nic-ports, vmk-ports, vm-ports) connected to a VMware virtual switch. You can easily identify which vSwitch port number (PortNum) a device is connected to.
~ # net-stats -l
PortNum          Type SubType SwitchName       MACAddress         ClientName
33554434            4       0 vSwitch0         14:fe:b5:7d:8d:05  vmnic0
33554436            3       0 vSwitch0         14:fe:b5:7d:8d:05  vmk0
33554437            5       9 vSwitch0         00:0c:29:4a:5b:ba  DC01
33554438            5       9 vSwitch0         00:0c:29:f0:df:4c  VC01

Note: SubType is the VM hardware version.
The vSwitch port numbers are important for the following commands.
The command esxcli network port stats get -p shows statistics for a particular vSwitch port.
~ # esxcli network port stats get -p 33554434
Packet statistics for port 33554434
   Packets received: 2346445
   Packets sent: 5853
   Bytes received: 295800113
   Bytes sent: 1225842
   Broadcast packets received: 1440669
   Broadcast packets sent: 336
   Multicast packets received: 896958
   Multicast packets sent: 120
   Unicast packets received: 8818
   Unicast packets sent: 5397
   Receive packets dropped: 0
   Transmit packets dropped: 0

You can also show filter statistics for the ESXi firewall with the command esxcli network port filter stats get -p 33554436
~ # esxcli network port filter stats get -p 33554436
Filter statistics for ESXi-Firewall
   Filter direction: Receive
   Packets in: 5801
   Packets out: 5660
   Packets dropped: 141
   Packets filtered: 150
   Packets faulted: 0
   Packets queued: 0
   Packets injected: 0
   Packet errors: 0

Filter statistics for ESXi-Firewall
   Filter direction: Transmit
   Packets in: 4893
   Packets out: 4887
   Packets dropped: 6
   Packets filtered: 6
   Packets faulted: 0
   Packets queued: 0
   Packets injected: 0
   Packet errors: 0

To show physical NIC statistics, you have to use the command esxcli network nic stats get -n vmnic0
~ # esxcli network nic stats get -n vmnic0
NIC statistics for vmnic0
   Packets received: 2350559
   Packets sent: 8083
   Bytes received: 312690659
   Bytes sent: 5791889
   Receive packets dropped: 0
   Transmit packets dropped: 0
   Total receive errors: 0
   Receive length errors: 0
   Receive over errors: 0
   Receive CRC errors: 0
   Receive frame errors: 0
   Receive FIFO errors: 0
   Receive missed errors: 0
   Total transmit errors: 0
   Transmit aborted errors: 0
   Transmit carrier errors: 0
   Transmit FIFO errors: 0
   Transmit heartbeat errors: 0
   Transmit window errors: 0
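When scanning these counter dumps, you usually only care about the non-zero dropped/error counters. A hedged Python sketch that works on the esxcli counter format shown above (the filtering heuristic is my own illustration):

```python
def nonzero_problem_counters(text):
    """Return {counter_name: value} for non-zero dropped/error counters."""
    problems = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, _, value = line.strip().partition(":")
        keyword = name.lower()
        if ("dropped" in keyword or "error" in keyword) and value.strip().isdigit():
            if int(value) != 0:
                problems[name] = int(value)
    return problems

# Abbreviated sample, with one counter made non-zero for illustration.
sample = """NIC statistics for vmnic0
   Packets received: 2350559
   Receive packets dropped: 0
   Receive CRC errors: 12
   Total transmit errors: 0
"""
print(nonzero_problem_counters(sample))  # -> {'Receive CRC errors': 12}
```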

Packet Capture

If you want to do deeper network troubleshooting, you can capture packets on the ESXi host. Two tools are available for packet capturing:

  • tcpdump-uw (example: tcpdump-uw -i vmk0 -s0 -C100M -W 10 -w /var/tmp/test.pcap)
  • pktcap-uw
pktcap-uw examples:
  • pktcap-uw --uplink vmnicX --capture UplinkRcv
  • pktcap-uw --uplink vmnicX --capture UplinkSnd
  • you can filter for ICMP with --proto 0x01 or for beacon probes with --ethtype 0x8922

Another example, based on https://kb.fortinet.com/kb/documentLink.do?externalID=FD47845

In case of a connectivity issue between a VM and other VM(s), it is worth sniffing traffic on the hypervisor side in order to isolate the issue.
To sniff traffic on an ESXi server, it is necessary to perform the steps below:

- Enable SSH access on ESXi.
- SSH to ESXi.
- Run net-stats -l | grep <VM name> in the CLI to find the virtual switchport of the VM.
- In vSphere 6.5 or earlier it is necessary to specify the direction of sniffing (either input or output).

- The switchport number for a particular VM can be found using the net-stats command.
- -o defines the path where the pcap file will be created and specifies the file name.
- --dir specifies the direction (either input or output):

    pktcap-uw --switchport 123 -o /tmp/in.pcap --dir input
    pktcap-uw --switchport 123 -o /tmp/out.pcap --dir output

- In vSphere 6.7 or later it is possible to sniff traffic in both directions by setting --dir 2:

    pktcap-uw --switchport 123 -o /tmp/both.pcap --dir 2

- Press Ctrl-C in the CLI to stop sniffing.
- Download the created pcap file(s) over SSH from ESXi.

DELL FX2 is coming

Michael Dell announced FX2 yesterday at DellWorld 2014.

FX2 is a new 2U flexible chassis for sleds. Sleds are basically hardware cartridges having one of the three roles listed below:
  • flexible servers (FC) - FC630, FC430, FC830
  • flexible micro servers (FM) - FM120X4
  • flexible disk enclosures (FD) - FD332
You can look at the FX2 overview video below. It is a marketing video, however it is a nice illustration of how revolutionary this platform is.


If you want a deeper FX2 review, I would recommend reading Kevin Houston's blog post "A First Look at Dell’s FX Architecture".

I personally really like this form factor, as it is more granular and therefore more flexible than a full blade chassis, and it still allows high density and simplified cable management. You can also use one or two low-profile PCIe cards per FX server. These PCIe cards are accessible from the rear of the chassis. Anybody who has worked with servers for a long time probably knows that the 2U form factor always was, and still is, the favorite compute form factor in datacenters.

The FX2 chassis leverages the DELL PowerEdge FN IO Aggregator, which is actually a new version of the blade DELL PowerEdge M1000e IO Aggregator redesigned for the FX2 form factor. Both of these IO Aggregators are based on Force10 technology.


If you ask me, I can see a lot of nice use cases for this platform, including classic VMware vSphere deployments with shared storage, VMware VSAN, EVO:RAIL, Nutanix, etc.

What do you think about this hardware platform?
Share your opinions in comments.
And as always any comment is welcome.

Monday, November 03, 2014

Resetting DELL Force10 to factory defaults

In OS 9.5, DELL introduced a new command to reset the switch to factory default mode:
Dell# restore factory-defaults stack-unit all clear-all
It does the following:

  • Deletes the startup configuration
  • Clears the NOVRAM and boot variables, depending on the arguments passed
  • Enables BMP
  • Resets the user ports to their default native modes (i.e., non-stacking, no 40G to 4x10G breakouts, etc.)
  • Removes all CLI users

The command then reloads the switch in a state similar to a brand new device. Restore does not change the current OS images or the partition from which the switch will boot up. Likewise, restore does not delete any of the files you store on the SD (except startup-config).

Monday, October 27, 2014

FreeBSD with multiple Serial Adapters acting as Access Console Server

I play a lot with network equipment like switches, routers and firewalls. It is very useful to have local serial access to the consoles of such devices. When I say local, I mean remote access to the local serial console. I could use a commercial access console server from a company like Avocent, but these devices are usually very expensive and don't do anything more than a Linux box with multiple serial ports accessible remotely via SSH or telnet.

So my idea was to use my favorite unix-like system (FreeBSD) with multiple serial ports. For such appliances I usually use Soekris or Alix boards with FreeBSD on flash. The question is how to get multiple serial (RS-232) ports. The simplest method nowadays is to use USB serial adapters. I know these USB serial converters have some issues, but they are really the simplest piece of hardware to buy, plug and play.

When you use some of these USB converters you should see new devices. In my case, dmesg shows the following devices:
uftdi0: on usbus1
uftdi1: on usbus1
uftdi2: on usbus1
uftdi3: on usbus1
To make the serial consoles work, you have to load the uftdi module: uftdi -- USB support for serial adapters based on the FTDI family of USB serial adapter chips.

The easiest way is to load this module during boot. You just need to add the following line to /boot/loader.conf:
uftdi_load="yes"
After the next boot you will have the following new devices in your /dev/ directory:
/dev/cuaU0
/dev/cuaU1
/dev/cuaU2
/dev/cuaU3
 ... and you can use the cu program to connect to a particular serial console. For example
cu -l /dev/cuaU0 -s 9600
connects to a console at a speed of 9600 baud.

Soekris NET4801-48 with USB reduction to 4xRS232

Friday, October 17, 2014

vCenter, Windows 2012 R2, .NET 3.5 issue


It is well known that vCenter Server 5.5 requires .NET Framework 3.5. It is quite easy to install it via the Server Manager GUI or with the following command:
dism /online /enable-feature /featurename:NetFX3 /all /Source:d:\sources\sxs /LimitAccess
The command above assumes the Windows 2012 DVD is in drive d:
 
... but I had an issue with the installation, getting the following error.
PS C:\Users\Administrator> dism /online /enable-feature /featurename:NetFX3 /all /Source:d:\sources\sxs /LimitAccess

Deployment Image Servicing and Management tool
Version: 6.3.9600.17031

Image Version: 6.3.9600.17031

Enabling feature(s)
[===========================66.4%======                    ]

Error: 0x800f081f

The source files could not be found.
Use the "Source" option to specify the location of the files that are required to restore the feature. For more informat
ion on specifying a source location, see http://go.microsoft.com/fwlink/?LinkId=243077.

The DISM log file can be found at C:\Windows\Logs\DISM\dism.log
PS C:\Users\Administrator>

I discussed this issue with our Microsoft specialist and he already knew the root cause and the fix. The root cause was a bad Windows update. It has already been fixed by Microsoft, and if you didn't update at the wrong time you should not experience this issue. However, when you hit this bug, the only solution is to run the following Microsoft fix.
NDPFixit-KB3005628-X64.exe

Some more information about this issue:

HowTo

Monday, October 13, 2014

Fibre Channel NPV and NPIV

I'm often asked by customers and colleagues what the difference is between NPV and NPIV. I don't want to rewrite information which is already well written and explained by someone else, so please read this Tony Bourke blog post, which is IMHO very well written.

Just a quick summary.

NPV is a Cisco term for the same thing as Brocade Access Gateway or DELL Force10 NPG (NPIV Proxy Mode). All these technologies put the Fibre Channel switch into a mode where it doesn't have a Fibre Channel Domain ID and therefore works as an absolutely transparent Fibre Channel multiplexer, or intelligent pass-through if you wish. It significantly simplifies SAN architectures and multivendor interoperability.

NPIV is a feature allowing a Fibre Channel switch to operate multiple FCIDs over a single Fibre Channel switch port. It effectively allows aggregation of multiple Fibre Channel nodes (N-Port IDs) per FC link.

Friday, October 10, 2014

Did you know? Mixing of FCoE and iSCSI on the same converged fabric...

Mixing of FCoE and iSCSI on the same converged fabric is not recommended and not supported by Dell.

Tuesday, September 16, 2014

Compellent Storage Center Live Volume and vSphere Metro Cluster

Are you interested in metro clusters (aka stretched clusters)?

Watch this video which introduces the new Synchronous Live Volume features available in Dell Compellent Storage Center 6.5.

And if you need a more technical deep dive, use this guide, which focuses on two main data protection and mobility features available in Dell Compellent Storage Center: synchronous replication and Live Volume. In this paper, each feature is discussed and sample use cases are highlighted where these technologies fit independently or together.

Compellent Live Volume currently doesn't support automated fail-over based on an arbiter on a third site, so that's the reason it is not certified as VMware vSphere Metro Cluster storage. Certification is just a matter of time. However, you can leverage Compellent Live Volume with vSphere. The only drawback is that the whole storage node fail-over has to be done manually, which can be enough, or even the preferred method, in some environments.


Wednesday, September 10, 2014

Tool for Network Assessment and Documentation

Do you need a tool for automated network assessment and documentation? Try NetBrain and let me know how you like it. I'm adding this tool to my todo list of things to test in my lab, so I'll write another blog post after the test.

NetBrain's deep network discovery will build a rich mathematical model of the network’s topology and underlying design. The data collected by the system is automatically embedded within every diagram and exportable to MS Visio, Word, or Excel.

NetBrain Personal Edition is the totally free version of NetBrain. It lets you discover up to 20 network devices and never expires.

iSCSI and Ethernet

Each manufacturer of Ethernet switches may implement features unique to their specific models. Below are some general tips to look for when implementing an iSCSI network infrastructure. Each tip may or may not apply to a specific installation. Be aware that this list is inspired by DELL Compellent iSCSI best practices and is not all-inclusive.
  • Bi-directional flow control enabled for all switch ports that carry iSCSI traffic, including any inter-switch links.
  • Separate networks or VLANs from data traffic.
  • Also separate the iSCSI multi-path traffic.
  • Unicast storm control disabled on every switch that handles iSCSI traffic.
  • Multicast disabled at the switch level for any iSCSI VLANs - multicast storm control enabled (if available) when multicast cannot be disabled.
  • Broadcast disabled at the switch level for any iSCSI VLANs - broadcast storm control enabled (if available) when broadcast cannot be disabled.
  • Routing disabled between regular network and iSCSI VLANs - use extreme caution if routing any storage traffic; performance of the network can be severely affected. This should only be done under controlled and monitored conditions.
  • Spanning Tree (STP or RSTP) disabled on ports which connect directly to end nodes (the server or Dell Compellent controller's iSCSI ports). You can do this by enabling the PortFast or EdgePort option on these ports so that they are configured as edge ports.
  • Ensure that any switches used for iSCSI are of a non-blocking design.
  • Hard set all switch ports and server ports to Gigabit full duplex, if applicable.
  • When deciding which switches to use, remember that you are running SCSI traffic over them. Be sure to use quality managed enterprise-class networking equipment. It is not recommended to use SBHO (small business/home office) class equipment outside of lab/test environments.
Do you want configuration examples for DELL PowerConnect and DELL Force10 switches? Leave a comment with particular switch model and firmware version and I'll try my best to prepare it for you.


Tuesday, September 09, 2014

DELL Force10 switch and NIC Teaming

NIC teaming is a feature that allows multiple network interface cards in a server to be represented by one MAC address and one IP address in order to provide transparent redundancy, balancing, and to fully utilize network adapter resources. If the primary NIC fails, traffic switches to the secondary NIC because they are represented by the same set of addresses.

Let's assume we have a host with two NICs, where the primary NIC is connected to Force10 switch port 0/1 and the secondary NIC to switch port 0/5. When you use NIC teaming, consider that the server MAC address is originally learned on port 0/1 of the switch and port 0/5 is the failover port. When the NIC fails, the system automatically sends an ARP request for the gateway or host NIC to resolve the ARP and refresh the egress interface. When the ARP is resolved, the same MAC address is learned on the port where the ARP is resolved (in the previous example, this location is port 0/5 of the switch). To ensure that the MAC address is disassociated with one port and re-associated with another port in the ARP table, configure the
mac-address-table station-move refresh-arp
command on the Dell Networking switch at the time that NIC teaming is being configured on the server.

! NOTE: If you do not configure the mac-address-table station-move refresh-arp command, traffic continues to be forwarded to the failed NIC until the ARP entry on the switch times out.

UPDATE 2015-03-16:
I have just discovered another FTOS command ...
arp learn-enable
   Enable ARP learning using gratuitous ARP.

NIC teaming solutions can leverage gratuitous ARP, so it is worth enabling it, in my opinion.

This command should be very beneficial in VMware environments where the VMware vSwitch sends a gratuitous ARP after a VM is vMotioned from one ESXi host to another. Strictly speaking, the ESXi host doesn't use gratuitous ARP but reverse ARP (aka RARP). Anyway, these two commands are beneficial for VMware vMotion.



Monday, September 08, 2014

Redirect ESXi syslog and coredump over network

Let's assume we have a syslog server at IP address [SYSLOG-SERVER] and a coredump server at [COREDUMP-SERVER]. Here are the CLI commands to quickly and effectively configure network redirection.

REDIRECT SYSLOG  
esxcli system syslog config set --loghost=udp://[SYSLOG-SERVER]
esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true
esxcli network firewall refresh
esxcli system syslog reload
VERIFY SYSLOG SETTINGS

esxcli system syslog config get


REDIRECT COREDUMP
esxcli system coredump network set --interface-name vmk0 --server-ipv4 [COREDUMP-SERVER] --server-port 6500
esxcli system coredump network set --enable true
VERIFY COREDUMP SETTINGS 
esxcli system coredump network check
By the way, do you know that the VMware vCenter Server Appliance works as a syslog and coredump server? So why not use it? It is free of charge.

vCenter Log Insight is a much better syslog server because you can easily search centralized logs and do some advanced analytics; however, that's another topic.

Sunday, September 07, 2014

vSphere HA Cluster Redundancy

All vSphere administrators and implementers know how easily a vSphere HA cluster can be configured. However, sometimes a quick and simple configuration doesn't do exactly what is expected. You can, and typically should, enable Admission Control in the vSphere HA cluster configuration settings. VMware vSphere HA Admission Control is a control mechanism checking whether another VM can be powered on in an HA-enabled cluster and still satisfy the redundancy requirement. So far so good; however, the complexity starts here, because you have several options for which algorithm to use to fulfill your spare-capacity redundancy requirement. So what options do you have?

Admission Control can be configured for following three algorithms:
  1. Define fail-over capacity by static number of hosts
  2. Define fail-over capacity by reserving a percentage of cluster resources
  3. Use dedicated fail-over hosts
Let's deep dive into each option ...

Algorithm 1 is generally N+X host redundancy
When N+X redundancy is required, most vSphere designers go with this option because it looks like the most suitable choice. However, it is important to know that this particular algorithm works with the HA slot size. The HA slot size is calculated based on the reservations defined on powered-on VMs. If you don't use CPU/MEM reservations per VM, then default reservation values (32 MHz, memory virtualization overhead) are used for the HA slot size calculation. By the way, VMware recommends setting reservations per resource pool and not per VM, so there is a relatively high probability that you don't have VM reservations and will end up with a very low HA slot size. That means Admission Control will allow powering on a lot of VMs, which introduces high resource over-allocation, and your N+1 redundancy can suffer significantly. On the other hand, if you have just one VM with huge CPU/MEM reservations, it can significantly skew the HA slot size, with a negative impact on your VM consolidation ratio.

How can we solve this problem? One solution is HA Cluster Advanced Options described below.

The maximum HA slot size can be limited with the following two advanced options.
  • das.slotcpuinmhz - Defines the maximum bound on the CPU slot size. If this option is used, the slot size is the smaller of this value or the maximum CPU reservation of any powered-on virtual machine in the cluster.
  • das.slotmeminmb - Defines the maximum bound on the memory slot size. If this option is used, the slot size is the smaller of this value or the maximum memory reservation plus memory overhead of any powered-on virtual machine in the cluster.
This helps in a situation where you have one VM with high CPU or RAM reservations. Such a VM will not increase the HA slot size; it will simply consume several smaller HA slots.

Default VM reservation values for the HA slot calculation can be defined by another two advanced options.
  • das.vmcpuminmhz - Defines the default CPU resource value assigned to a virtual machine if its CPU reservation is not specified or zero. This is used for the Host Failures Cluster Tolerates admission control policy. If no value is specified, the default is 32MHz.
  • das.vmmemoryminmb - Defines the default memory resource value assigned to a virtual machine if its memory reservation is not specified or zero. This is used for the Host Failures Cluster Tolerates admission control policy. If no value is specified, the default is 0 MB.
Default VM reservation values can help you define the HA slot size you want, but it doesn't automatically correspond to the required overbooking and planned spare fail-over capacity, because the HA slot size is not proportional to the VM sizes in a particular cluster. If you really want to have one real spare host of fail-over capacity, you have to go with option 3 (use dedicated fail-over hosts).
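To make the slot-size mechanics concrete, here is a simplified Python sketch of how the CPU slot size behaves (my own illustration of the behavior described above; it ignores memory slots and virtualization overhead):

```python
def cpu_slot_size_mhz(vm_cpu_reservations, das_vmcpuminmhz=32, das_slotcpuinmhz=None):
    """CPU slot size: the largest per-VM CPU reservation (default 32 MHz for
    VMs without a reservation), bounded above by das.slotcpuinmhz if set."""
    largest = max((r or das_vmcpuminmhz) for r in vm_cpu_reservations)
    if das_slotcpuinmhz is not None:
        largest = min(largest, das_slotcpuinmhz)
    return largest

# No reservations at all -> tiny 32 MHz slots, so Admission Control lets
# almost everything power on.
print(cpu_slot_size_mhz([0, 0, 0]))                          # -> 32
# One VM with a huge 8000 MHz reservation inflates the slot for everyone...
print(cpu_slot_size_mhz([0, 0, 8000]))                       # -> 8000
# ...while das.slotcpuinmhz caps the slot; the big VM just consumes more slots.
print(cpu_slot_size_mhz([0, 0, 8000], das_slotcpuinmhz=500)) # -> 500
```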

Algorithm 2 : percentage cluster spare capacity
This algorithm doesn't use the HA slot size. It simply calculates the total cluster CPU/MEM resources and decreases them by the spare capacity defined as a percentage. The remaining cluster resources are further decreased by powered-on VM reservations, and new VMs can be powered on only while some cluster resources are still available. Quite clear and simple, right? However, it also requires VM reservations; otherwise you will end up with an over-allocated cluster and an overbooking ratio so high that it can introduce performance issues. So once again, if you really want one real spare host of fail-over capacity without dealing with VM reservations, the best way is to go with option 3 (use dedicated fail-over hosts).
Note that algorithm 2 doesn't use the HA cluster advanced options related to the HA slot mentioned above. However, das.vmCpuMinMHz and das.vmMemoryMinMB can be used to set default reservations. For more details read this.
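For comparison, the percentage-based admission check can be sketched as follows (illustrative Python with made-up cluster numbers; the real HA agent also accounts for per-VM memory overhead):

```python
# Sketch of percentage-based admission control (Algorithm 2).
# Illustrative only; numbers and names are hypothetical.

def can_power_on(cluster_cpu_mhz, cluster_mem_mb,
                 reserved_cpu_mhz, reserved_mem_mb,
                 new_vm_cpu_res, new_vm_mem_res, spare_pct=25):
    """Admit the VM only if its reservation fits into cluster capacity
    reduced by the configured spare (fail-over) percentage."""
    usable_cpu = cluster_cpu_mhz * (100 - spare_pct) / 100
    usable_mem = cluster_mem_mb * (100 - spare_pct) / 100
    return (reserved_cpu_mhz + new_vm_cpu_res <= usable_cpu and
            reserved_mem_mb + new_vm_mem_res <= usable_mem)

# 4-node cluster: 4 x 16,000 MHz CPU, 4 x 96 GB RAM, 25% spare (roughly N+1).
print(can_power_on(64000, 393216, 40000, 200000, 4000, 16384))  # True
print(can_power_on(64000, 393216, 47000, 200000, 4000, 16384))  # False
```

The second call is rejected because adding the new VM's 4,000 MHz CPU reservation would exceed the 48,000 MHz usable after reserving 25% spare capacity.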

Algorithm 3 : dedicated fail-over hosts
This algorithm simply dedicates the specified hosts to stay unused during normal conditions and to be used only in case of an ESXi host failure. Multiple dedicated fail-over hosts are supported since vSphere 5.0. This algorithm keeps your capacity and performance absolutely predictable and independent of VM reservations. You get exactly what you configure.

UPDATE 2018-01-09: for some additional details about dedicated fail-over hosts read the blog post Admission Control - Dedicated fail-over hosts.

CONCLUSION
So what option to use? The correct answer is, as usual ... it depends :-)

However, if VM reservations are not used and absolutely predictable N+X redundancy is required, I currently recommend Option 3.

If you are uncomfortable leaving some ESXi hosts unused during a non-degraded cluster state (isn't that exactly what is required?), I recommend Option 1, but VM reservations must be used to get a realistic HA slot size. With this option, an artificial HA slot size can be designed by leveraging the advanced options.

If you don't want to deal with HA slot sizing and you want to use all ESXi hosts in the cluster, you can use Option 2, but VM reservations must be used to guarantee some capacity and avoid a high overbooking ratio.

FEATURE REQUEST
It would be great if VMware vSphere had some kind of cluster reservation policy for VMs. For example, if you wanted to guarantee a cluster resource overbooking of 2:1, you would set up 50% CPU and 50% RAM reservations for each VM running in the HA cluster. This policy should be dynamic, so if someone changes a VM's CPU or RAM size, the reservations are recalculated automatically.

Let's break down the example above. We assume the following HA cluster reservation policy assigned to our HA cluster: CPU 50%, RAM 50%. Let's power on a VM with 2 vCPUs and 6 GB RAM. The dynamic reservation calculation is quite easy from the RAM perspective: the memory reservation would be 3 GB (50% of 6 GB). It is a little more complicated from the CPU perspective, because the dynamic CPU reservation has to be calculated from the physical CPU where the VM is running. Let's assume an Intel Xeon E5-2450 @ 2.1 GHz. 50% of 2.1 GHz is 1.05 GHz, but we have 2 vCPUs, so we multiply by 2. Therefore the dynamic CPU reservation for our VM is 2.1 GHz. I believe that with such a dynamic reservation policy we would be able to guarantee the overbooking ratio and define cluster redundancy more predictably from the overbooking and performance degradation point of view.
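The arithmetic from this example can be sketched in Python (a hypothetical helper; no such dynamic reservation policy exists in vSphere, this just reproduces the calculation above):

```python
# Sketch of the proposed dynamic cluster reservation policy.
# Hypothetical feature - this only reproduces the worked example.

def dynamic_reservation(vcpus, vram_gb, pcore_ghz, cpu_pct=50, mem_pct=50):
    """Reservation = configured percentage of the VM's configured resources;
    CPU is scaled by the physical core frequency of the host."""
    cpu_res_ghz = vcpus * pcore_ghz * cpu_pct / 100
    mem_res_gb = vram_gb * mem_pct / 100
    return cpu_res_ghz, mem_res_gb

# 2 vCPUs, 6 GB RAM on an Intel Xeon E5-2450 @ 2.1 GHz, 50%/50% policy:
cpu, mem = dynamic_reservation(2, 6, 2.1)
print(cpu, mem)  # 2.1 GHz CPU reservation, 3.0 GB memory reservation
```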

FEEDBACK 
I would like to know your preferred HA cluster admission control setting. So don't hesitate to leave a comment and share your thoughts with the community. Any feedback is very welcome and highly appreciated.

Friday, September 05, 2014

EVO:RAIL Introduction Video

The EVO:RAIL introduction video is quite impressive. Check it out yourself at

https://www.youtube.com/watch?v=J30zrhEUvKQ

I'm really looking forward to the first EVO:RAIL implementation.

Friday, August 29, 2014

How to clear all jobs on DELL Lifecycle Controller via iDRAC

When you have a problem with DELL Lifecycle Controller jobs, you can delete all jobs with a single iDRAC command. This command

racadm -r <ip address> -u <user name> -p <password> jobqueue delete -i JID_CLEARALL_FORCE

deletes all of the jobs, including orphaned pending jobs, and restarts the Data Manager service on the iDRAC. It takes about 90-120 seconds before the iDRAC is able to process another job.

Occasionally the iDRAC may need to be reset after issuing the above command due to other issues outside the job queue processing. In such a case, issue the commands in the following order:
  • racadm -r <ip address> -u <user name> -p <password> jobqueue delete -i JID_CLEARALL_FORCE
  • Wait 120 seconds
  • racadm -r <ip address> -u <user name> -p <password> racreset




Sunday, August 17, 2014

Network communications between virtual machines

I was contacted by a colleague of mine who pointed to a very often quoted statement about network communication between virtual machines on the same ESXi host. One such statement is cited below.
"Network communications between virtual machines that are connected to the same virtual switch on the same ESXi host will not use the physical network. All the network traffic between the virtual machines will remain on the host."
He was discussing this topic within his team, and even though they are very skilled virtualization administrators, they had doubts about the real behavior. I generally agree with the statement above, but it is actually correct only in the specific situation when the virtual machines are in the same L2 segment (the same broadcast domain, usually a VLAN).

Figure 1 - L3 routing on physical network
I've prepared the drawing above to explain the real behavior clearly. Network communication between VM1 and VM2 stays on the same ESXi host because they are in the same L2 segment. However, communication between VM1 and VM3 has to go to the physical switch (pSwitch) to be routed between VLAN 100 and VLAN 200 and then return to the ESXi host and VM3.

The statement above can be slightly reformulated to be always correct.
"Network communications between virtual machines that are connected to the same virtual switch portgroup on the same ESXi host will not use the physical network. All the network traffic between the virtual machines will remain on the host."
Both the VMware standard and distributed vSwitch are dumb L2 switches, so L3 routing must be done somewhere else, typically on physical switches. However, there are two scenarios in which even L3 traffic between virtual machines on the same ESXi host can stay there and avoid the physical network.

The first scenario is when L3 routing is done by a virtual machine running on the same ESXi host. Examples of such virtual routers are VMware's vShield Edge, Brocade's Vyatta, Cisco's CSR, the open-source router pfSense, or some other general-purpose OS with routing services. This scenario, also known as network function virtualization, is depicted in Figure 2.

Figure 2 - L3 routing on virtual machine (Network Function Virtualization)
It is worth mentioning that L3 traffic between VM5 and VM6 will go through the physical network because the L3 router is on another ESXi host.

The second scenario is when a distributed virtual router like VMware's NSX is used. This scenario is depicted in Figure 3. Here, L2 and L3 traffic of all virtual machines running on the same ESXi host is optimized and remains on the host without touching the physical network.

Figure 3 - Distributed Virtual Routing (VMware NSX)

So in our particular scenario, L2 and L3 network communication among VM1, VM2, VM3 and VM4 stays on the same ESXi host. The same applies to VM5 and VM6.
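The decision logic from the three figures can be summarized in a small sketch (a simplified model with hypothetical names; real packet paths depend on the actual vSwitch, portgroup, and routing configuration):

```python
# Simplified model of when VM-to-VM traffic leaves the ESXi host.
# Illustrative only - names and the model itself are hypothetical.

def traffic_stays_on_host(vm_a, vm_b, distributed_routing=False, router_host=None):
    """vm_x = {"host": ..., "vlan": ...}. Returns True if traffic between
    the two VMs never touches the physical network."""
    if vm_a["host"] != vm_b["host"]:
        return False                # different hosts: always physical network
    if vm_a["vlan"] == vm_b["vlan"]:
        return True                 # same L2 segment: switched locally
    # Different VLANs need a router. Traffic stays local only if routing is
    # distributed (e.g. NSX) or the router VM runs on the same host.
    return bool(distributed_routing) or router_host == vm_a["host"]

vm1 = {"host": "esxi-01", "vlan": 100}
vm2 = {"host": "esxi-01", "vlan": 100}
vm3 = {"host": "esxi-01", "vlan": 200}
print(traffic_stays_on_host(vm1, vm2))                            # True  (same L2)
print(traffic_stays_on_host(vm1, vm3))                            # False (routed on pSwitch)
print(traffic_stays_on_host(vm1, vm3, router_host="esxi-01"))     # True  (router VM on host)
print(traffic_stays_on_host(vm1, vm3, distributed_routing=True))  # True  (distributed router)
```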

I hope I've covered all possible scenarios and that this blog post will help others during similar discussions in virtualization teams. As always, comments are very welcome.

Tuesday, July 29, 2014

Force10 VLT - Design Verification Test Plan


One of my philosophical rules is "Trust, but verify". A design verification test plan is a good way to be sure how the system you have designed actually behaves. A typical design verification test plan contains usability, performance, and reliability tests.

A Force10 VLT domain configuration is actually a two-node cluster (the system) providing L2/L3 network services. Which network services your VLT domain should provide depends on customer requirements. However, a typical VLT customer requirement is high availability: eliminating network downtime when a system component fails or is being maintained by an administrator. Planning and executing reliability tests is a good way to verify that the customer's high availability requirements have been achieved.

Below are some reliability tests I think are worth executing. When my gear is back in my lab, I'll try to find some time to run the tests described below and publish the real test results.
 
If you know about other tests which would make sense to perform, please don't be shy, leave a comment and I'll run them for you.

Test #1
Description: 
Simulate VLT Domain secondary node failure impact on Ethernet traffic. How long (in ms) is traffic disrupted?
Tasks:
Use system A and system B both connected via VLT link to VLT Domain  
Ping from system A to system B at least 10x per second
Power Off secondary VLT node
Measure network disruption
Expected Results:
It should be a sub-second failure.
Test Result:
TBD

Test #2
Description: 
Simulate VLT Domain primary node failure impact on Ethernet traffic. How long (in ms) is traffic disrupted?
Tasks:
Use system A and system B both connected via VLT link to VLT Domain  
Ping from system A to system B at least 10x per second
Power Off primary VLT node
Measure network disruption
Expected Results:
It should be a sub-second failure.
Test Result:
TBD

Test #3
Description: 
Simulate one link from VLTi (ISL) port-channel failure.
Tasks:
Use system A and system B both connected via VLT link to VLT Domain  
Ping from system A to system B at least 10x per second
Pull out one cable participating in VLTi static port-channel
Measure network disruption
Expected Results:
The VLT domain should keep working with no traffic impact.
Test Result:
TBD
Test #4
Description: 
Simulate all links from VLTi (ISL) port-channel failure.
Tasks:
Use system A and system B both connected via VLT link to VLT Domain  
Ping from system A to system B at least 10x per second
Pull out all cables participating in VLTi static port-channel
Measure network disruption
Expected Results:
The backup link should act as an arbiter. The VLT domain should keep working, but in split-brain mode, and only the primary VLT node should handle the traffic.
Test Result:
TBD

Test #5
Description: 
Simulate VLT Domain backup link failure. Backup link configured as IP heartbeat over out-of-band management.
Tasks:
Use system A and system B both connected via VLT link to VLT Domain  
Ping from system A to system B at least 10x per second
Pull out cable participating in backup link
Measure network disruption
Expected Results:
All traffic should work correctly, but VLT should report a backup link failure.
Test Result:
TBD 
Test #6
Description: 
Simulate one link failure on some virtual link trunk (aka VLT or virtual port-channel).
Tasks:
Use system A and system B both connected via VLT link to VLT Domain  
Ping from system A to system B at least 10x per second
Pull out cable participating in VLT
Measure network disruption
Expected Results:
Port channel should survive this failure.
Test Result:
TBD

These six tests should verify the basic high availability and resiliency of a Force10 VLT cluster.

All failures should be reported via SNMP and/or syslog to a central monitoring system, provided it is configured properly. That brings us to usability tests ... but that's another set of tests ...

And please remember that TOO MUCH TESTING WOULD NEVER BE ENOUGH :-)

Wednesday, July 23, 2014

CISCO UDLD alternative on Force10

I've been asked by a DELL System Engineer whether we support Cisco's UDLD feature, because it was required in some RFI. Well, the DELL Force10 Operating System has a similar feature solving the same problem, and it is called FEFD.

Here is the explanation from FTOS 9.4 Configuration Guide ...

FEFD (Far-end failure detection) is supported on the Force10 S4810 platform. FEFD is a protocol that senses remote data link errors in a network. FEFD responds by sending a unidirectional report that triggers an echoed response after a specified time interval. You can enable FEFD globally or locally on an interface basis. Disabling the global FEFD configuration does not disable the interface configuration.

Figure caption: Configuring Far-End Failure Detection

The report consists of several packets in SNAP format that are sent to the nearest known MAC address. In the event of a far-end failure, the device stops receiving frames and, after the specified time interval, assumes that the far-end is not available. The connecting line protocol is brought down so that upper layer protocols can detect the neighbor unavailability faster.

Update 2015-05-20:
If I understand it correctly, the main purpose of Cisco's UDLD is to detect potential unidirectional links and mitigate the risk of a loop in the network, because STP cannot help in this scenario. Force10 has another feature to prevent a loop in such a situation: STP loop guard.

The STP loop guard feature provides protection against Layer 2 forwarding loops (STP loops) caused by a hardware failure, such as a cable failure or an interface fault. When a cable or interface fails, a participating STP link may become unidirectional (STP requires links to be bidirectional) and an STP port does not receive BPDUs. When an STP blocking port does not receive BPDUs, it transitions to a Forwarding state. This condition can create a loop in the network.

Sunday, July 13, 2014

Heads Up! VMware virtual disk IOPS limit bad behavior in VMware ESX 5.5

I've been informed about strange behavior of VM virtual disk IOPS limits by one of my customers, for whom I recently did a vSphere design. If you don't know how VM vDisk IOPS limits can be useful in some scenarios, read my other blog post, "Why use VMware VM virtual disk IOPS limit?". And because I designed this technology for some of my customers, they are heavily impacted by the bad vDisk IOPS limit behavior in ESXi 5.5.

I've tested VM IOPS limits in my lab to see it for myself. Fortunately, I have two labs: an older vSphere 5.0 lab with Fibre Channel Compellent storage, and a newer vSphere 5.5 lab with EqualLogic iSCSI storage. First of all, let's look at how it works in ESX 5.0. The behavior is the same in ESX 5.1, and it makes perfect sense.

By default, VM vDisks don't have limits, as seen on the next screenshot.


When I run IOmeter with a single worker (thread) on an unlimited vDisk, I can achieve 4,846 IOPS. That's what the datastore (physical storage) is able to give to a single thread.


When I run IOmeter with two workers (threads) on an unlimited vDisk, I can achieve 7,107 IOPS. That's ok, because all shared storage arrays implement algorithms limiting per-thread performance. That's actually protection against a single thread consuming all of the storage performance.


Now let's set a 200 IOPS SIOC limit on both vDisks of the VM, as depicted in the picture below.


Due to the settings above, the IOmeter single-worker workload is limited to 400 IOPS (2 x 200) per whole VM, because all limit values are consolidated per virtual machine per LUN. For more info, look at http://kb.vmware.com/kb/1038241. So it behaves as expected: the IOmeter IOPS oscillated between 330 and 400, as you can see in the picture below.
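The consolidation rule from KB 1038241 can be expressed as a quick sketch (illustrative Python; it assumes all listed vDisks sit on the same LUN):

```python
# Effective IOPS limit per VM per LUN = sum of the individual vDisk limits
# (per VMware KB 1038241). Illustrative sketch; same-LUN vDisks assumed.

def effective_vm_limit(vdisk_limits):
    """None means an unlimited vDisk; one unlimited vDisk makes the
    whole VM's traffic on that LUN unlimited."""
    if any(limit is None for limit in vdisk_limits):
        return None
    return sum(vdisk_limits)

print(effective_vm_limit([200, 200]))   # 400 - matches the observed behavior
print(effective_vm_limit([200, None]))  # None (limit not enforced)
```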


We can observe similar behavior with two workers.


So in the ESX 5.0 lab, everything works as expected. Now let's move to the other lab, where I have vSphere 5.5 with iSCSI storage. First we run IOmeter without vDisk IOPS limits to see the maximal performance we can get. On the picture below, we can see that a single thread is able to get 1,741 IOPS.


... and two workers can get 3,329 IOPS.


So let's set vDisk IOPS limits of 200 on both vDisks of the VM, as in the ESX 5.0 test. This VM also has 2 disks. With these settings, the IOmeter single-worker workload should again be limited to 400 IOPS (2 x 200) per whole VM. But unfortunately it is not limited, and it can get 2,000 IOPS. That is strange and, in my opinion, bad behavior.

 

But even worse behavior can be observed when there are more threads. In the examples below, you can see the behavior with two and four workers (threads). The VM gets really slow performance.



The ESX 5.5 VM vDisk behavior is really strange, and because all typical OS storage workloads (even OS booting) are multi-threaded, the VM vDisk IOPS limit technology is unusable. My customer has opened a support request, so I believe it is a bug and VMware Support will help escalate it to VMware engineering.

UPDATE 2014-07-14 (Workaround): 
I tweeted about this issue, and Duncan immediately pointed me in the right direction. He revealed the secret to me ... ESXi has two disk schedulers, an old one and a new one (aka mClock). ESXi 5.5 uses the new one (mClock) by default. If you switch back to the old one, the disk scheduler behaves as expected. Below is the setting to switch to the old one.

Go to ESX Host Advanced Settings and set Disk.SchedulerWithReservation=0

This will switch back to the old disk scheduler.

Kudos to Duncan.

Switching back to the old scheduler is a good workaround which will probably appear in a VMware KB, but there is definitely some reason why VMware introduced the new disk scheduler in 5.5. I hope we will get more information from VMware engineering, so stay tuned for more details ...

UPDATE 2015-12-3:
Here are some references to more information about disk schedulers in ESXi 5.5 and above ...

ESXi 5.5
http://www.yellow-bricks.com/2014/07/14/new-disk-io-scheduler-used-vsphere-5-5/
http://cormachogan.com/2014/09/16/new-mclock-io-scheduler-in-vsphere-5-5-some-details/
http://anthonyspiteri.net/esxi-5-5-iops-limit-mclock-scheduler/

ESXi 6
http://www.cloudfix.nl/2015/02/02/vsphere-6-mclock-scheduler-reservations/

Saturday, July 12, 2014

Why use VMware VM virtual disk IOPS limit?

What is the VM IOPS limit? Here is the explanation from the VMware documentation ...
When you allocate storage I/O resources, you can limit the IOPS that are allowed for a virtual machine. By default, these are unlimited. If a virtual machine has more than one virtual disk, you must set the limit on all of its virtual disks. Otherwise, the limit will not be enforced for the virtual machine. In this case, the limit on the virtual machine is the aggregation of the limits for all virtual disks.
I really like this feature because the VM vDisk IOPS limit is an excellent mechanism to protect the physical storage back-end against overloading by disk-intensive VMs, and it allows you to set up a fair-use policy for storage performance. Somebody could argue for using the VM disk share mechanism instead. Yes, that's of course possible as well, and it can be complementary. However, with a share-based fair-use policy, your users will get high performance at the beginning, when the back-end storage has a lot of available performance, but their performance will decrease over time as more VMs use that particular datastore. It means that performance is not predictable, and users may complain.

Let's do a simple IOPS limit example. You have datastores provisioned from a storage pool with automated storage tiering which can serve up to 25,000 IOPS, and you have 100 virtual disks (vDisks) there. Setting a 250 IOPS limit on each virtual disk ensures that even if all VMs use all of their IOPS, the back-end datastores will not be overloaded. I agree it is a very strict limitation; VMs cannot use more IOPS even when performance is available in the physical storage. But that is a business decision, and the best vDisk limiting policy depends on your business model and company strategy. Below are two business models for virtual disk performance limits I have already used on some of my vSphere projects:

  • Service catalog strategy
  • Capacity/performance ratio strategy

The service catalog strategy allows customers (internal or external) to increase or decrease vDisk IOPS as needed and, of course, pay for it accordingly.

The capacity/performance ratio strategy calculates the ratio between physical storage capacity and performance and uses the same ratio for vDisks. So if you have 50 TB of storage with 25,000 front-end IOPS, you have 0.5 IOPS per 1 GB. Because you use shared storage, you should define and apply some overbooking ratio. Let's use a 3:1 ratio, and we get 150 IOPS for a 100 GB vDisk.
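The arithmetic behind the capacity/performance ratio strategy can be sketched as follows (assuming 1 TB = 1,000 GB for simplicity, as in the example):

```python
# Capacity/performance ratio strategy from the example above.
# Assumes 1 TB = 1,000 GB, as in the text.

def vdisk_iops_limit(storage_tb, storage_iops, vdisk_gb, overbooking=3):
    iops_per_gb = storage_iops / (storage_tb * 1000)   # physical ratio
    return vdisk_gb * iops_per_gb * overbooking        # apply overbooking ratio

# 50 TB array serving 25,000 front-end IOPS -> 0.5 IOPS/GB.
# With 3:1 overbooking, a 100 GB vDisk gets a 150 IOPS limit:
print(vdisk_iops_limit(50, 25000, 100))  # 150.0
```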

To be honest, I prefer the service catalog strategy, as it is what the real world needs: each workload is different, and a service catalog gives you a better way to define vDisks that match the workloads in your particular environment.

Summary
The VM vDisk IOPS limit approach is useful in environments where you want guaranteed and long-term predictable storage performance (response time) for VM vDisks. Please be aware that even this approach is not totally fair, because IOPS reality is much more complex, and the total number of IOPS on the back-end storage is not the static number we used in our example. In real physical storage, the number of front-end IOPS you can get is a function of several parameters like IO size, read/write ratio, RAID type, workload type (sequential or random), cache hit ratio, automated storage tiering algorithms, etc.

I hope VMware VVOLs will move this approach to the next level in the future. However, the vDisk IOPS limit is a technology we can use today.

Monday, July 07, 2014

How social media and community sharing help enterprise customers


I'm always happy when someone finds my blog article or a shared document useful. Here is one example of recent email communication from a DELL customer who Googled my DELL OME (Open Manage Essentials, basic system management for DELL hardware infrastructure) document explaining network communication flows among DELL OpenManage components.

All personal information is anonymized, so the customer's real name is changed to Mr. Customer.


VCDX Defense Timer

If you are preparing for the VCDX and you want to do a VCDX mock defense, you can use the exact timer which is used during the real VCDX defense.

The timer is available online at https://vcdx.vmware.com/vcdx-timer

Good luck with your VCDX journey!!!