Saturday, March 30, 2013

VMware VXLAN Deployment Guide

Vyenkatesh Deshpande recently published the "VMware Network Virtualization Design Guide", which can be downloaded here. However, the deployment guide linked here is very valuable if you really want to implement VXLAN in your environment.

Sunday, February 24, 2013

SG3_UTILS: How to send SCSI commands to devices

http://sg.danny.cz/sg/sg3_utils.html
http://linux.die.net/man/8/sg3_utils

The sg3_utils package contains utilities that send SCSI commands to devices. As well as devices on transports traditionally associated with SCSI (e.g. Fibre Channel (FCP), Serial Attached SCSI (SAS) and the SCSI Parallel Interface (SPI)), many other devices use SCSI command sets.
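To give a flavour of the toolkit, here are a few typical invocations on a Linux host with sg3_utils installed (illustrative only; the device node /dev/sg1 is an example and will differ on your system):

```shell
# List SCSI devices together with their sg device nodes and INQUIRY strings
sg_map -i

# Send a SCSI INQUIRY to a device (reports vendor, model, revision)
sg_inq /dev/sg1

# TEST UNIT READY -- quick check whether the device answers SCSI commands
sg_turs /dev/sg1

# READ CAPACITY -- report the block count and block size
sg_readcap /dev/sg1
```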


How the Cluster service reserves a disk and brings a disk online

http://support.microsoft.com/kb/309186

This article (link above) describes how the Microsoft Cluster service reserves and brings online disks that are managed by the Cluster service and related drivers.


Wednesday, February 20, 2013

PuppetLabs | Razor: Next-Generation Provisioning


System administrators require the same agility and productivity from their hardware infrastructure that they get from the cloud. In response, Puppet Labs and EMC collaboratively developed Razor, a next-generation physical and virtual hardware provisioning solution. Razor provides you with unique capabilities for managing your hardware infrastructure, including:
  • Auto-Discovered Real-Time Inventory Data
  • Dynamic Image Selection
  • Model-Based Provisioning
  • Open APIs and Plug-in Architecture
  • Metal-to-Cloud Application Lifecycle Management
Together, Razor and Puppet enable system administrators to automate every phase of the IT infrastructure lifecycle, from bare metal to fully deployed cloud applications.



Monday, February 18, 2013

Automated Storage Tiering - Sub-LUN tiering

Excellent comparison of the Automated Storage Tiering technologies of different vendors.
I personally believe automated storage tiering (AST) is really important for a dynamic virtualized datacenter, and because AST implementations differ among vendors, I'm going to collect important information for design considerations. I don't want to favor or offend any product. Each product has some advantages and disadvantages, and we as infrastructure architects have to fully and deeply understand the technology to be able to prepare a good design, which is the most important factor for a reliable and well-performing infrastructure.

Good mid-range storage products on the market (my personal opinion):

  • DELL Compellent
  • Hitachi HUS
  • EMC VNX

DELL Compellent
Tiers: SSD, SAS, NL-SAS (SATA)
AST Sub-LUN tiering block: 512 KB, 2 MB (default), 4 MB
Tiering optimisation analysis period: [TBD]
Tiering optimisation relocation period: [TBD]
Tiering algorithm: [TBD]
QoS per LUN: no

Hitachi HUS (HUS 110, HUS 130, HUS 150)
Tiers: SSD, SAS, NL-SAS (SATA)
AST Sub-LUN tiering block: 32MB
Tiering optimisation analysis period: 30 minutes
Tiering optimisation relocation period: [TBD]
Tiering algorithm: [TBD]
QoS per LUN: no

EMC VNX 
Tiers: SSD, SAS, NL-SAS (SATA)
AST Sub-LUN tiering block: 1GB
Tiering optimisation analysis period: 60 minutes
Tiering optimisation relocation period: user defined
Tiering algorithm:
During the user-defined relocation window, 1 GB slices are promoted according to both the rank ordering performed in the analysis stage and a tiering policy set by the user. During relocation, FAST VP relocates higher-priority slices to higher tiers; slices are relocated to lower tiers only if the space they occupy is required for a higher-priority slice. This way, FAST VP fully utilizes the highest-performing spindles first. Lower-tier spindles are utilized as capacity demand grows. Relocation can be initiated manually or by a user-configurable, automated scheduler. The relocation process aims to keep 10% free capacity in the highest tiers in the pool. Free capacity in these tiers is used for new slice allocations of high-priority LUNs between relocations.
QoS per LUN: yes


I've collected this information from several public resources, so if any of it is wrong, please let me know directly or via comments.



Wednesday, February 13, 2013

Understand SCSI, SCSI command responses and sense codes

During troubleshooting of VMware vSphere and storage-related issues, it is quite useful to understand SCSI command responses and sense codes.

Usually you can see in the log something like "failed H:0x8 D:0x0 P:0x0 Possible sense data: 0xA 0xB 0xC"

H: means host codes
D: means device codes
P: means plugin codes
A: is Sense Key
B: is Additional Sense Code
C: is Additional Sense Code Qualifier

Some host codes:
0x2 Bus state busy
0x3 Timeout for other reason
0x5 Told to abort for some other reason
0x8 Bus reset

Some device codes:
00h  GOOD
02h  CHECK CONDITION
04h  CONDITION MET
08h  BUSY
18h  RESERVATION CONFLICT
28h  TASK SET FULL
30h  ACA ACTIVE
40h  TASK ABORTED

Some plugin codes:
00h  No error.
01h  An unspecified error occurred. Note: The I/O cmd should be tried.
02h  The device is a deactivated snapshot. Note: The I/O cmd failed because the device is a deactivated snapshot and so the LUN is read-only.
03h  SCSI-2 reservation was lost.
04h  The plug-in wants to requeue the I/O back. Note: The I/O will be retried.
05h  The test and set data in the ATS request returned false for equality.
06h  Allocating more thin provision space. Device server is in the process of allocating more space in the backing pool for a thin provisioned LUN.
07h  Thin provisioning soft-limit exceeded.
08h  Backing pool for thin provisioned LUN is out of space.

Some SCSI Sense Keys:
SCSI Sense Keys appear in the Sense Data available when a command returns with a CHECK CONDITION status. The sense key contains all the information necessary to understand why the command has failed.

Code Name
0h   NO SENSE
1h   RECOVERED ERROR
2h   NOT READY
3h   MEDIUM ERROR
4h   HARDWARE ERROR
5h   ILLEGAL REQUEST
6h   UNIT ATTENTION
7h   DATA PROTECT
8h   BLANK CHECK
9h   VENDOR SPECIFIC
Ah   COPY ABORTED
Bh   ABORTED COMMAND
Dh   VOLUME OVERFLOW
Eh   MISCOMPARE
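As an illustration, the sense key table above can be turned into a tiny helper for decoding the hex values you see in the logs (a hypothetical sketch, not part of any VMware or Linux tooling):

```shell
# Hypothetical helper: map a SCSI sense key (hex) to its name,
# following the table above.
decode_sense_key() {
    case "$1" in
        0x0) echo "NO SENSE" ;;
        0x1) echo "RECOVERED ERROR" ;;
        0x2) echo "NOT READY" ;;
        0x3) echo "MEDIUM ERROR" ;;
        0x4) echo "HARDWARE ERROR" ;;
        0x5) echo "ILLEGAL REQUEST" ;;
        0x6) echo "UNIT ATTENTION" ;;
        0x7) echo "DATA PROTECT" ;;
        0x8) echo "BLANK CHECK" ;;
        0x9) echo "VENDOR SPECIFIC" ;;
        0xA) echo "COPY ABORTED" ;;
        0xB) echo "ABORTED COMMAND" ;;
        0xD) echo "VOLUME OVERFLOW" ;;
        0xE) echo "MISCOMPARE" ;;
        *)   echo "UNKNOWN" ;;
    esac
}

# In "Possible sense data: 0xA 0xB 0xC", the first value is the sense key
decode_sense_key 0x5   # ILLEGAL REQUEST
```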

There is a VMware KB with further details here.

It is worth reading the following documents:
http://www.tldp.org/LDP/khg/HyperNews/get/devices/scsi.html (a quite old document for programmers willing to write a SCSI driver)
http://en.wikipedia.org/wiki/SCSI
http://en.wikipedia.org/wiki/SCSI_contingent_allegiance_condition
http://en.wikipedia.org/wiki/SCSI_Request_Sense_Command

What is SCSI reservation
http://mrwhatis.com/scsi-reservation.html

SCSI-3 Persistent Group Reservation
http://scsi3pr.blogspot.cz/


Tuesday, February 12, 2013

Using the VMware I/O Analyzer v1.5: A Guide to Testing Multiple Workloads

I encourage you to watch a great video about good practices for using VMware I/O Analyzer (a VMware appliance bundling Iometer).

It mentions a very important step to get relevant results: increase the size of the second disk in the virtual machine (OVF appliance). The default size is 4 GB, which is not enough because it fits in the cache of almost any storage array, so the results are unrealistic and misleading.

Video is here
bit.ly/118kWs1 
or here
http://www.youtube.com/watch?v=zHJr957kN1s&feature=youtu.be

Enjoy.

Tuesday, January 22, 2013

HP Flex-10 Design, Plan, Implement, Test

Before the design phase of a VMware vSphere infrastructure, I recommend reading the blog post "Understanding HP Flex-10 Mappings with VMware ESX/vSphere" to get a general overview of the server infrastructure and the advanced network interconnect. During the design phase, prepare a detailed test plan (aka operational verification) and run it during the implementation phase. You can use the blog post "Testing Scenario's VMware / HP c-Class Infrastructure" as a template for your test plan. I don't doubt that you normally test your infrastructure before putting it into production :-)

Saturday, January 19, 2013

MSCS RDMs causing long boot of ESX

That's because an RDM LUN attached to an MSCS cluster has a permanent SCSI reservation initiated by the active node of the cluster.

In ESXi 5 you have to mark all such LUNs as perennially reserved, and your ESXi boot will be as fast as usual.

Here is the CLI command to mark a LUN:
esxcli storage core device setconfig -d naa.id --perennially-reserved=true

This has to be changed on all ESX hosts with visibility to the LUN.
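To confirm the flag took effect on a host, you can query the device configuration (a sketch; replace naa.id with the actual device identifier, as in the command above):

```shell
# Verify the flag -- the output should contain
# "Is Perennially Reserved: true"
esxcli storage core device list -d naa.id | grep -i "Perennially Reserved"
```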

More info at http://kb.vmware.com/kb/1016106

Wednesday, January 09, 2013

How to calculate storage performance from host perspective

Storage performance is usually quantified in IOPS (I/O transactions per second). Quantifying performance from the storage perspective is quite easy: it depends on the speed of each particular disk, also known as a spindle. Each disk has some speed, and below are the average values usually used for storage performance calculations:
  • SATA disk = 80 IOPS
  • SCSI DISK(SAS or FC) 10k RPM = 150 IOPS
  • SCSI DISK(SAS or FC) 15k RPM = 180 IOPS
  • SSD disk (SLC aka EFD) = 6000 IOPS
So when we need higher performance we have to bundle disks. Disks can be bundled with standard RAID technology.

Here are most common RAID types used on standard disk arrays:
  • RAID 0 - no redundancy, disk bundle, highest performance => WRITE PENALTY = 0
  • RAID 1 - disk mirror, max bundle of 2 disks, high performance => WRITE PENALTY = 2
  • RAID 10 - RAID 1 + RAID 0 for bundling disk pairs, max disk bundle depends on disk array limits, high performance => WRITE PENALTY = 2
  • RAID 5 - block level striping with rotated parity, max disk bundle depends on disk array limits, moderate performance => WRITE PENALTY = 4
  • RAID 6 - block level striping with double parity, max disk bundle depends on disk array limits, lower performance => WRITE PENALTY = 6


So performance from the storage perspective and from the host perspective are different. Performance from the storage perspective is simply the sum of the speeds of all disks in the RAID group. Performance from the host perspective depends on the selected RAID type.

To calculate the estimated storage performance from the host perspective, we need a formula with several variables.

First of all, let's define the variables:

P=write penalty of selected RAID type
R=Read % of disk workload
W=Write % of disk workload
H=IOPS from host perspective
S=IOPS from storage perspective

and now we can write the formula to calculate storage performance from the host perspective:
H = S / (R + W*P)


Do you want to know the steps to derive this formula? It is simple. Start from another formula, which describes the storage behavior:
R*(1*H) + W*(P*H) = S
The formula above says that each host read IOPS generates a single storage IOPS, but each write IOPS generates multiple storage IOPS based on the RAID type penalty (P).

Does it make sense? If not, an example may help you understand.

My RAID group has 9 SAS disks (600 GB, 15k RPM) and I use RAID 5 (8+1).
So from the storage perspective I have 9 disks, each of which can perform 180 IOPS, which means I have 1620 IOPS from the storage perspective. Let's assume I have an unusual read/write ratio of 20:80.

S = 1620
P = 4 (because of RAID 5)
R = 20% = 0.2
W= 80% = 0.8
I need to know H ... storage performance from host perspective.

H = 1620 / (0.2 + 0.8 * 4) = 1620 / 3.4 = 476.47 IOPS from host perspective.
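The worked example above can be verified quickly from a shell:

```shell
# H = S / (R + W*P) -- IOPS from the host perspective
S=1620   # storage IOPS: 9 disks x 180 IOPS
P=4      # RAID 5 write penalty
R=0.2    # 20% reads
W=0.8    # 80% writes

H=$(awk -v s="$S" -v p="$P" -v r="$R" -v w="$W" \
    'BEGIN { printf "%.2f", s / (r + w * p) }')
echo "H = $H IOPS"   # H = 476.47 IOPS
```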

Note: Modern disk arrays often offer AST (Automated Storage Tiering). The calculation described in this blog post is valid even for those disk arrays. You have to fully understand the internal architecture and design of the particular storage, but generally all storage pools are built from sub disk groups bundled and protected by some RAID type. So if you have 125 disks grouped five at a time in RAID 5 (4+1), the principle is the same: we have 125 spindles and the write penalty is 4 because of RAID 5.

Thursday, December 20, 2012

Set the Scratch Partition from the vSphere Client

If a scratch partition is not set up, you might want to configure one, especially if low memory is a concern. When a scratch partition is not present, vm-support output is stored in a ramdisk.
The directory to use for the scratch partition must exist on the host.

1. Use the vSphere Client to connect to the host.
2. Select the host in the Inventory.
3. In the Configuration tab, select Software.
4. Select Advanced Settings.
5. Select ScratchConfig. The field ScratchConfig.CurrentScratchLocation shows the current location of the scratch partition.
6. In the field ScratchConfig.ConfiguredScratchLocation, enter a directory path that is unique for this host. An example of a directory path is /vmfs/volumes/NFS-SYNOLOGY-SSD/scratch/esx21.home.uw.cz. In this example, I have a datastore named NFS-SYNOLOGY-SSD with a subdirectory scratch, which has another subdirectory esx21.home.uw.cz.
7. Reboot the host for the changes to take effect.

(copy from vSphere documentation)

For automated scratch partition configuration you can use vCLI or PowerCLI. For details see VMware KB 1033696.
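If you prefer the ESXi Shell over the vSphere Client, the same advanced option can be set from the command line (a sketch of the approach described in the KB; the datastore path is only an example and the change still requires a reboot):

```shell
# Set the configured scratch location for this host
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation \
    string /vmfs/volumes/NFS-SYNOLOGY-SSD/scratch/esx21.home.uw.cz
```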

And here is my PowerCLI script, inspired by the KB above, to set the scratch location on all ESXi hosts in particular vSphere clusters.

Wednesday, December 19, 2012

ESXi: strange storage-related log entries in /var/log/vmkernel.log


I've just found a lot of the following storage errors in /var/log/vmkernel.log:


2012-12-19T01:34:02.010Z cpu2:4098)NMP: nmp_ThrottleLogForDevice:2318: Cmd 0x93 (0x412401965f00, 5586) to dev "naa.60060e80102d5f500511c97d000000d4" on path "vmhba2:C0:T0:L2" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x96 0x32. Act:NONE
2012-12-19T01:34:02.010Z cpu2:4098)ScsiDeviceIO: 2322: Cmd(0x412401965f00) 0x93, CmdSN 0xc6fd5 from world 5586 to dev "naa.60060e80102d5f500511c97d000000d4" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x96 0x32.



The main part of log entry is "failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x96 0x32"

If I understand correctly:
D: 0x2 = CHECK CONDITION
Sense key 0x5 = ILLEGAL REQUEST

What is it? What does it mean?

I have ESXi 5.0 build 768111, storage HDS AMS 2300, CISCO UCS blade system, CISCO FC switches.

Update 1:
I've thought more about the root cause ... an important detail is that it happens when Storage vMotion or another data migration is running. So I have a hypothesis that it is related to VAAI. The storage is VAAI enabled and VAAI is supported. However, the disk block size differs between the datastores (we are just in the middle of a migration from VMFS-3 to VMFS-5).

So I have to do deeper diagnostics and root cause troubleshooting.

Stay tuned.


Update 2:
Solved: VAAI primitives must also be enabled in the HDS Host Masking settings. For more information check
http://www.hds.com/assets/pdf/optimizing-the-hitachi-ams-2000-family-in-vsphere-4-environments.pdf




Friday, December 07, 2012

Storage Queues and Performance

VMware recently published a paper titled Scalable Storage Performance that delivered a wealth of information on storage with respect to the ESX Server architecture. This paper contains details about the storage queues that are a mystery to many of VMware's customers and partners. I wanted to start a wiki article on some aspects of this paper that may be interesting to storage enthusiasts and performance freaks.

The blog post with more information is at http://communities.vmware.com/docs/DOC-6490

This information is very useful for a deep understanding of the full storage stack.

Wednesday, December 05, 2012

Best Practices for Faster vSphere SDK Scripts

Source at http://www.virtuin.com/2012/11/best-practices-for-faster-vsphere-sdk.html 
The VMware vSphere API is one of the more powerful vendor SDKs available in the Virtualization Ecosystem.  As adoption of VMware vSphere has grown over the years, so has the size of Virtual Infrastructure environments.  In many larger enterprises, the increasing number of VirtualMachines and HostSystems is driving the architectural requirement to deploy multiple vCenter Servers.
In response, the necessity for automation tooling has grown just as quickly.  Automation to create daily reports, perform bulk operations, and aggregate data from large, distributed Virtual Infrastructure environments is a common requirement for managing the increasing virtual sprawl.
In a Virtual Infrastructure comprised of thousands of objects, even a simple script to list all VirtualMachines and their associated HostSystem and Datastores can result in very slow runtime execution.  Developing automation with the following, simple best practices can take orders of magnitude off your vSphere API tool's runtime.


Monday, December 03, 2012

DELL Active System Manager

DELL Active System is managed by DELL Active System Manager. It is DELL's converged infrastructure solution (blade servers, networking, storage) aiming to be the "mainframe of the 21st century", leveraging server virtualization (hypervisors) to provide enough flexibility to achieve the required infrastructure SLAs.

http://www.youtube.com/watch?v=xU1I93wEHuU


Configuring a Chassis in Dell Active System Manager
http://www.youtube.com/watch?v=cRO0546yJ8U