Wednesday, June 05, 2019

How to get more IOPS from a single VM?

Yesterday, I got a typical storage performance question. Here it is ...
I am running a test with my customer to see how many IOPS we can get from a single VM working with an HDS all-flash array. The best I could get with IOmeter was 32K IOPS with 3 ms latency at 8 KB blocks. No matter what other block size or number of outstanding I/Os I choose, I am unable to get more than 32K. On the other hand, I can't find any bottleneck across the paths or the storage. I use the PVSCSI storage controller. Latency and queues look to be OK.
IOmeter is a good storage test tool. However, you have to understand basic storage principles to plan and interpret your storage performance tests properly. Storage is the most crucial component of any vSphere infrastructure, so I have some experience with IOmeter and storage performance testing in general. Here are my thoughts on this question.

First things first: every shared storage system uses specific I/O scheduling to NOT give the whole performance to a single worker. A storage worker is the compute process or thread sending storage I/Os down the storage subsystem. If you think about it, it makes perfect sense, as it mitigates the noisy-neighbor problem. When you invest a lot of money in a shared storage system, you most probably want to use it for multiple servers, right? It does not matter whether these servers are physical (ESXi hosts) or virtual (VMs). To get the most performance from shared storage, you must use multiple workers and optimally spread them across multiple servers and multiple storage devices (aka LUNs, volumes, datastores).

IOmeter allows you to use:

  • Multiple workers on a single server (aka Manager)
  • Outstanding I/Os within a single worker (asynchronous I/O to a disk queue without waiting for an acknowledgment)
  • Multiple Managers – a manager is a server generating the storage workload (multiple workers) and reporting results to a central IOmeter GUI. This is where IOmeter dynamos come into play.
To test the performance limits of a shared storage subsystem, it is always a good idea to use multiple servers (IOmeter managers), each running multiple workers (nowadays usually VMs), spread across multiple storage devices (datastores / LUNs). This gives you multiple storage queues, which means more parallel I/Os. Parallelism is what gives you more performance when such performance exists on the shared storage. If it does not, queueing will not boost performance. If you want, you can also leverage Outstanding I/Os to fill the disk queue(s) more quickly and put additional pressure on the storage subsystem, but it is not necessary if the number of workers equals the available queue depth. Outstanding I/Os can potentially help you generate more I/Os with fewer workers, but they do not get you more performance when your queues are full. You will just increase response times without any performance gain.

Just as an example of an IOmeter performance test, in the image below you can see the results of distributed IOmeter performance tests on a 2-node vSAN I planned, designed, implemented and tested recently for one of my customers. There is just one disk group (1x SSD cache, 4x SSD capacity).


The storage performance test above used 8 VMs, and each VM was running 8 storage workers.
I tested different storage patterns (I/O size, R/W ratio, 100% random access). The performance is pretty good, right? However, I would not be able to get such performance from a single VM with a single vDisk.
Note: vSAN has a significant advantage in comparison to traditional storage: you do not need to deal with LUN queueing (HBA Device Queue Depth) because there are no LUNs. On the other hand, with vSAN you have to think about the total performance available to a single vDisk, which boils down to the vSAN disk group(s) layout and the distribution of the vDisk object components across physical disks. But that's another topic, as the asker is using traditional storage with LUNs.

Unfortunately, using multiple VMs is not a solution for the asker, as he is trying to get all the I/Os from a single VM.

The question states that a single VM cannot get more than 32K IOPS with an observed I/O response time of 3 ms. The asker is curious why he cannot get more IOPS from a single VM.

Well, there can be multiple reasons, but let's assume the physical storage is capable of providing more than 32K IOPS. I think more IOPS cannot be achieved because only one VM is used and IOmeter is using a single vDisk, which has a single queue. The situation is depicted in the drawing below.


So, let’s do the simple math for this particular situation …
  • We have a single vDisk queue with a default queue depth of 64 (we use the Paravirtual SCSI adapter; non-paravirtualized SCSI adapters have queue depth 32)
  • We have a QLogic HBA with a default queue depth of 64 (other HBA vendors, like Emulex, have default queue depth 32, which would be another bottleneck on the storage path)
  • The storage has an average service time (response time) of around 3 ms
We have to understand the following basic principles:
  • IOPS is the number of I/O operations per second
  • Queue depth 64 = 64 I/O operations in parallel = 64 slots for I/O operations
  • Each of these 64 I/Os stays in the vDisk queue until the SCSI response comes back from the LUN
  • All other I/Os have to wait until there is a free slot in the queue.
And here is the math calculation ...

Q1: How many I/Os can be delivered in this situation per 1 millisecond?
A1: 64 (queue depth) / 3 (service time in ms) = 21.33 I/Os per millisecond
 
Q2: How many I/Os can be delivered per 1 second?
A2: It is easy: 1,000 times more than per millisecond. So, 21.33 x 1,000 = 21,333 I/Os per second ~= 21.3K IOPS
 
The asker claims he can get 32K IOPS at 3 ms response time, so it seems the real response time from the storage is better than 3 ms. The math above tells me the storage response time in this particular exercise is somewhere around 2 ms. There can also be other mechanisms boosting performance, for example I/O coalescing, but let's keep it simple.

If the storage were able to service an I/O in 1 ms, we would get ~64K IOPS.
If the storage were able to service an I/O in 2 ms, we would get ~32K IOPS.
If the storage were able to service an I/O in 3 ms, we would get ~21K IOPS.
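If it helps, the arithmetic above can be sketched in a few lines of Python (queue depth and service times are the example values from this post):

```python
def max_iops(queue_depth: int, service_time_ms: float) -> float:
    """Upper bound on IOPS for a single queue: in-flight I/Os divided by time per I/O."""
    return queue_depth / service_time_ms * 1000  # I/Os per ms -> I/Os per second

for svc in (1.0, 2.0, 3.0):
    print(f"service time {svc} ms -> ~{max_iops(64, svc):,.0f} IOPS")
```

The formula is just Little's Law applied to a full queue: throughput = concurrency / latency.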

The math above works if the END-2-END queue depth is 64. This is the case when a QLogic HBA is used, as it has an HBA LUN queue depth of 64. With an Emulex HBA, the HBA LUN queue depth is 32, so a higher vDisk queue depth (64) would not help.
 
Hope the principle is clear now.

So how can I boost storage performance for a single VM? If you really need to get more IOPS from a single VM, you have only the following three options:
  1. Increase the queue depth, not only on the vDisk itself but END-2-END. THIS IS GENERALLY NOT RECOMMENDED, as you really must know what you are doing and it can have a negative impact on the overall shared storage. However, if you need it and have the justification for it, you can try to tune the system.
  2. Use a storage system with a low service time (response time). For example, a sub-millisecond storage system (say 0.5 ms) will give you more IOPS at the same queue depth than a storage system with a higher service time (for example 3 ms).
  3. Leverage multiple vDisks spread across multiple vSCSI controllers and datastores (LUNs). This gives you more (total) queue depth in a distributed fashion. However, it puts additional requirements on your real application, as it needs a filesystem or other mechanism supporting multiple storage devices (vDisks).
I hope options 1 and 2 are clear. Option 3 is depicted in the figure below.
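Option 3 can also be sketched numerically. This is a hedged Python sketch: the four-vDisk layout and the assumption that the array and fabric can actually deliver the extra I/Os are mine, not the asker's.

```python
def max_iops(total_queue_depth: int, service_time_ms: float) -> float:
    """Upper bound on IOPS for a given end-to-end queue depth."""
    return total_queue_depth / service_time_ms * 1000

single_vdisk = max_iops(64, 3.0)       # one PVSCSI vDisk queue
four_vdisks  = max_iops(4 * 64, 3.0)   # 4 vDisks on 4 controllers/datastores

print(f"1 vDisk : ~{single_vdisk:,.0f} IOPS")
print(f"4 vDisks: ~{four_vdisks:,.0f} IOPS")
```

The aggregate scales linearly with the number of independent queues only as long as no other component on the path (HBA, LUN, array) becomes the bottleneck.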


CONCLUSION
In a typical VMware vSphere environment, the shared storage system is used by multiple ESXi hosts and multiple VMs with vDisks on multiple datastores (LUNs). That's why the default queue depth usually makes perfect sense: it provides fairness among all storage consumers. If you have a storage system with, let's say, a 2 ms response time and queue depth 32, you can still get around 16K IOPS. This should be good enough for any typical enterprise application, and I usually recommend using IOPS limits to restrict some VMs (vDisks) even further. This is how storage performance tiering can be achieved very simply on a VMware SDDC with unified infrastructure. If you need higher storage performance, your application is specific and you should do a specific design, leveraging specific technologies or tunings.

By the way, I like a statement from Howard Marks (@DeepStorageNet) that I heard on his storage technologies podcast "GrayBeards". It goes something like ...
"There are only two storage performance types - good enough and not good enough." 
This is very true.
 
Hope this writeup helps the broader VMware community.

Relevant articles:

Friday, May 24, 2019

Syslog.global.logHost is invalid or exceeds the maximum number of characters permitted

I have a customer who has a pretty decent vSphere environment and uses VMware vRealize Log Insight as a central syslog server for advanced troubleshooting and actionable logging. VMware vRealize Log Insight is tightly integrated with vSphere, so it configures syslog on ESXi hosts automatically through the vCenter API. Everything worked fine until one day the customer realized there was an issue with one, and only one, ESXi host.

He saw the following failed vCenter task in his vSphere Client.


The error message:
setting["Syslog.global.logHost"] is invalid or exceeds the maximum number of characters permitted
seemed very strange to me.

From the ESXi logs collected by Log Insight, it was evident that the ESXi advanced setting could not be configured through the API. The same or a similar issue can be reproduced with the esxcli command for setting the advanced parameter.

Command:
esxcli system settings advanced set -o /Syslog/global/logHost -s "udp://1.2.3.4:514"

Output:
Unable to find branch Syslog

The resolution was to configure syslog through the syslog configuration commands (as described in my older blog post here) instead of setting the advanced parameter /Syslog/global/logHost.

The command to configure the remote syslog is
esxcli system syslog config set --loghost='tcp://1.2.3.4:514'
and it helped the customer resolve the issue.

I have never seen this issue in any other vSphere environment, so I hope this helps at least one other person in the VMware community.

Thursday, May 16, 2019

The SPECTRE story continues ... now it is MDS

Last year (2018) started with the shocking Intel CPU vulnerabilities Spectre and Meltdown, and two days ago another SPECTRE variant, known as Microarchitectural Data Sampling (MDS), was published. It was obvious from the beginning that this was just the start and that other vulnerabilities would be found over time by security experts and researchers. All these vulnerabilities are collectively known as speculative execution vulnerabilities, aka SPECTRE variants.

Here is the timeline of particular SPECTRE variant vulnerabilities along with VMware Security Advisories.

2018-01-03 - Spectre (speculative execution exploiting bounds-check bypass and branch target injection) / Meltdown (speculative execution exploiting rogue data cache load) - VMSA-2018-0002.3
2018-05-21 - Speculative Store Bypass (SSB) - VMSA-2018-0012.1
2018-08-14 - L1 Terminal Fault - VMSA-2018-0020
2019-05-14 - Microarchitectural Data Sampling (MDS) - VMSA-2019-0008

I published several blog posts about SPECTRE topics in the past

The last two vulnerabilities, "L1 Terminal Fault (aka L1TF)" and "Microarchitectural Data Sampling (aka MDS)", are related to Intel CPU Hyper-Threading. As per the statement here, AMD is not vulnerable.

When we are talking about L1TF and MDS, a typical question from my customers with Intel CPUs is whether they are safe when Hyper-Threading is disabled in the BIOS. The answer is yes, but you would have to power cycle the physical system to reconfigure the BIOS settings, which can be pretty annoying and time-consuming in larger environments. That's why VMware recommends leveraging the SDDC concept and making it a software change - an ESXi hypervisor advanced setting. It is obviously much easier to change the two ESXi advanced settings VMkernel.Boot.hyperthreadingMitigation and VMkernel.Boot.hyperthreadingMitigationIntraVM to true and disable Hyper-Threading in the ESXi CPU scheduler without a physical server power cycle. You can do it with a PowerCLI one-liner in a few minutes, which is much more flexible than BIOS changes.

So that's it from the security point of view but what about performance?

It is simple and obvious. When Hyper-Threading is disabled, you obviously lose the CPU performance benefit of Hyper-Threading technology, which can be somewhere between 5 and 20% and depends heavily on the particular workload. Let's be absolutely clear here: until the issue is addressed inside the CPU hardware architecture, there will always be a tradeoff between security and performance. If I understand Intel's messaging correctly, the first hardware fix for Hyper-Threading is implemented in the Cascade Lake family. You can double-check it yourself here ...
Side Channel Mitigation by Product CPU Model
https://www.intel.com/content/www/us/en/architecture-and-technology/engineering-new-protections-into-hardware.html

You can get Hyper-Threading performance back, but only in VMware vSphere 6.7 U2. vSphere 6.7 U2 includes new scheduler options that secure it against the L1TF vulnerability while retaining as much performance as possible. The new scheduler introduced the ESXi advanced setting VMkernel.Boot.hyperthreadingMitigationIntraVM, which you can leave at FALSE (the default) to keep the Hyper-Threading benefit within a virtual machine while still isolating VMs from each other when VMkernel.Boot.hyperthreadingMitigation is set to TRUE. This possibility is not available in older ESXi hypervisors and there are no plans to backport it. For further info, read the paper "Performance of vSphere 6.7 Scheduling Options".

By the way, last year I spent significant time testing the performance impact of the SPECTRE and MELTDOWN vulnerability remediations. If you want to check the results of the performance tests of the 2018 Spectre/Meltdown variants, along with the conclusion, you can read my document published on SlideShare. It would be cool to perform the same tests for L1TF and MDS, but it would require an additional time investment, and I'm not going to do it unless sponsored by one of my customers. Anybody can do it themselves, though, as the test plan is described in the document below.



Friday, May 03, 2019

Storage and Fabric latencies - difference in order of magnitude

It is well known that the storage industry is in a big transformation. SSDs based on flash are changing the old storage paradigm and supporting the fast computing required by modern applications in digital transformation projects.

So flash is great, but it is also about the bus and the protocol over which the flash is connected.
We have the traditional storage protocols SCSI, SATA, and SAS, but these interface protocols were invented for magnetic disks; that's why flash behind these legacy interfaces cannot leverage its full potential. That's why NVMe (a new storage interface protocol over PCIe) and even 3D XPoint memory (Intel Optane) were introduced.

It is all about latency and available bandwidth. Total throughput depends on I/O size and achievable transactions per second (IOPS). The IOPS numbers below can be achieved on the particular storage media by a single worker with a random-access, 100% read, 4 KB I/O size workload. Multiple workers can achieve higher performance, but with higher latency.

Latency orders of magnitude:
  • ms - milliseconds - 0.001 s = 10^-3 s
  • μs - microseconds - 0.000001 s = 10^-6 s
  • ns - nanoseconds - 0.000000001 s = 10^-9 s
Storage Latencies

SATA - magnetic disk 7.2k RPM ~= 80 I/O per second (IOPS) = 1,000 ms / 80 = 12.5 ms
SAS - magnetic disk 15k RPM ~= 200 I/O per second (IOPS) = 1,000 ms / 200 = 5 ms

SAS - Solid State Disk (SSD) Mixed use SFF ~= 4,000 I/O per second (IOPS) = 1,000 ms / 4,000 = 0.25 ms = 250 μs

NVMe over RoCE - Solid State Disk (SSD) ~= TBT I/O per second (IOPS) = 1,000ms / ??? = 0.100 ms =  100 μs
NVMe - Solid State Disk (SSD) ~= TBT I/O per second (IOPS) = 1,000ms / ??? = 0.080 ms =  80 μs

DIMM - 3D XPoint memory (Intel Optane) ~=   the latency less than 500 ns (0.5 μs)
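The single-worker service times above follow directly from the 1,000 ms / IOPS formula; here is a quick Python check using the IOPS figures listed above:

```python
def service_time_ms(iops: float) -> float:
    """Average service time of one outstanding I/O: 1,000 ms / IOPS."""
    return 1000.0 / iops

media = {
    "SATA 7.2k RPM magnetic": 80,
    "SAS 15k RPM magnetic": 200,
    "SAS SSD mixed use": 4000,
}
for name, iops in media.items():
    print(f"{name}: ~{service_time_ms(iops):g} ms")
```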

Ethernet Fabric Latencies

Gigabit Ethernet - 125 MB/s ~= 25 ~ 65 μs
10G Ethernet - 1.25 GB/s ~=  μs (sockets application) / 1.3 μs (RDMA application)
40G Ethernet - 5 GB/s ~= μs (sockets application) / 1.3 μs (RDMA application)

InfiniBand and Omni-Path Fabrics Latencies

10Gb/s SDR - 1 GB/s  ~=  2.6 μs (Mellanox InfiniHost III)
20Gb/s DDR - 2 GB/s  ~=  2.6 μs (Mellanox InfiniHost III)
40Gb/s QDR - 4 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
40Gb/s FDR-10 - 5.16 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
56Gb/s FDR - 6.82 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
100Gb/s EDR - 12.08 GB/s  ~=  1.01 μs (Mellanox ConnectX-4)
100Gb/s Omni-Path - 12.36 GB/s  ~=  1.04 μs (Intel 100G Omni-Path)

RAM Latency

DIMM - DDR4 SDRAM ~=  75 ns (local NUMA access) - 120 ns (remote NUMA access)

Visualization

Latencies are very well visualized in the figure below.


Conclusion

It is good to realize what latencies we should expect on different infrastructure subsystems 
  • RAM ~= 100 ns
  • 3D XPoint memory ~= 500 ns
  • Modern Fabrics ~= 1-4 μs
  • NVMe ~= 80 μs
  • NVMe over RoCE ~= 100 μs
  • SAS SSD ~= 250 μs
  • SAS magnetic disks ~= 5-12 ms
The latency order of magnitude is important for several reasons. Let's focus on one of them - latency monitoring. It has always been a challenge to monitor traditional storage systems, as a 5-minute or even 1-minute interval is simply too long for millisecond latencies, and the average does not tell you anything about microbursts. In lower-latency (μs or even ns) systems, a 5-minute interval is like an eternity. The average, minimum and maximum over a 5-minute interval may not help you understand what is really happening. Much deeper mathematical statistics are needed for real, valuable visibility into telemetry data. Percentiles are good, but histograms can help even more ...
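A tiny synthetic example of why averages hide microbursts (the latency numbers below are made up for illustration):

```python
import statistics

# 980 fast I/Os at 0.2 ms plus a microburst of 20 I/Os at 50 ms
latencies_ms = [0.2] * 980 + [50.0] * 20

avg = statistics.mean(latencies_ms)
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms)) - 1]

print(f"average: {avg:.3f} ms")  # looks harmless
print(f"p99    : {p99:.1f} ms")  # exposes the burst
```

The average comes out close to 1 ms, while the 99th percentile shows the 50 ms burst; a histogram would show both populations at once.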
The Wavefront links above talk mainly about application monitoring, but do we have such telemetry granularity in hardware? Mellanox Spectrum claims real-time network visibility, but it seems to me to be an exception. Intel had an open source project, "The Snap Telemetry Framework"; however, it seems it was discontinued. And what about other components? To be honest, I do not know, and it seems to me that real-time visibility is not a big priority for the infrastructure industry. However, operating systems, hypervisors and software-defined storage could help here. The VMware vSphere Performance Manager, available via the vCenter SOAP API, can provide "real-time" monitoring. I put "real-time" in quotes because it provides 20-second samples (min, max, average) for metrics on leaf objects. Is that good enough? Well, not really. It is better than a 5-minute or 1-minute sample but still very long for sub-millisecond latencies. Minimum, maximum and average do not carry enough information for some decisions. Histograms could help here. ESXi has the good old tool vscsiStats, which supports histograms of I/O latency in microseconds (μs) for a virtual machine. Unfortunately, there is no officially supported vCenter API for this tool, so it is usually used for short-term manual performance troubleshooting and not for continuous latency monitoring. William Lam has published a blog post and scripts on how to leverage the ESXi API to get vscsiStats histograms. It would be great to be able to get histograms for some objects through vCenter in a supported way and expose such information to external monitoring tools. #FEATURE-REQUEST

Hope this is informative and educational.

Other sources:
Performance Characteristics of Common Network Fabrics: https://www.microway.com/knowledge-center-articles/performance-characteristics-of-common-network-fabrics/
Real-time Network Visibility: http://www.mellanox.com/related-docs/whitepapers/WP_Real-time_Network_Visibility.pdf
Johan van Amersfoort and Frank Denneman present a NUMA deep dive: https://youtu.be/VnfFk1W1MqE
Cormac Hogan : GETTING STARTED WITH VSCSISTATS: https://cormachogan.com/2013/07/10/getting-started-with-vscsistats/
William Lam : Retrieving vscsiStats Using the vSphere 5.1 API

Tuesday, April 09, 2019

What NSX-T Manager appliance size is good for your environment?

NSX-T 2.4 still has NSX Manager and NSX Controller logically separated, but they are physically integrated within a single virtual appliance, which can be clustered as a 3-node management/controller cluster. So the first typical question during an NSX-T design workshop, or before an NSX-T implementation, is what NSX-T Manager appliance size is right for my environment.

The NSX-T 2.4 documentation (NSX Manager VM System Requirements) documents the following NSX Manager appliance sizes.

Appliance Size           Memory   vCPU   Disk Space   VM Hardware Version
NSX Manager Extra Small  8 GB     2      200 GB       10 or later
NSX Manager Small VM     16 GB    4      200 GB       10 or later
NSX Manager Medium VM    24 GB    6      200 GB       10 or later
NSX Manager Large VM     48 GB    12     200 GB       10 or later
The documentation section above states that
  • The NSX Manager Extra Small VM resource requirements apply only to the Cloud Service Manager.
  • The NSX Manager Small VM appliance size is suitable for lab and proof-of-concept deployments.
So for NSX-T on-prem production usage, you can use the Medium or Large size. But which one? The NSX-T documentation section (NSX Manager VM System Requirements) has no more info to support your design or implementation decision. However, another part of the documentation (Overview of NSX-T Data Center) states that
  • The NSX Manager Medium appliance is targeted for deployments up to 64 hosts
  • The NSX Manager Large appliance for larger-scale environments.
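The guidance above boils down to a trivial decision rule; here is a sketch (the function name and the Small-for-lab shortcut are my own, based on the documented statements):

```python
def nsx_manager_size(esxi_hosts: int, production: bool = True) -> str:
    """Pick an NSX-T 2.4 Manager appliance size per the documented guidance."""
    if not production:
        return "Small"   # lab / proof-of-concept deployments only
    return "Medium" if esxi_hosts <= 64 else "Large"

print(nsx_manager_size(40))    # -> Medium
print(nsx_manager_size(200))   # -> Large
```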

Conclusion

Long story short, only the Medium and Large sizes are targeted at on-prem NSX-T production usage. The Medium size should be used in environments with up to 64 ESXi hosts. For larger environments, the Large size is the way to go.

Hope this helps with your NSX-T Plan, Design, and Implement exercise.

Friday, April 05, 2019

vSAN : Number of required ESXi hosts

As you have found this article, I would assume that you know what vSAN is. For those who are new to vSAN, below is the definition from https://searchvmware.techtarget.com/definition/VMware-VSAN-VMware-Virtual-SAN
VMware vSAN (formerly Virtual SAN) is a hyper-converged, software-defined storage (SDS) product developed by VMware that pools together direct-attached storage devices across a VMware vSphere cluster to create a distributed, shared data store. The user defines the storage requirements, such as performance and availability, for virtual machines (VMs) on a VMware vSAN cluster and vSAN ensures that these policies are administered and maintained.
VMware vSAN aggregates local or direct-attached storage devices to create a single storage pool shared across all ESXi hosts in the vSAN (aka vSphere) cluster. vSAN eliminates the need for external shared storage and simplifies storage configuration and virtual machine provisioning. Data are protected across ESXi hosts. To be more accurate, across failure domains, but let's assume we stick with the vSAN default failure domain, which is the ESXi host.

vSAN is policy-based storage, and the policy dictates how data will be made redundant, distributed, reserved, etc. You can treat a policy as a set of requirements you define, and the storage system will try to deploy and operate the storage object in compliance with these requirements. If it cannot satisfy the requirements defined in a policy, the object cannot be deployed, or, if already deployed, it becomes non-compliant and is therefore at risk.

vSAN is object storage; each object is composed of multiple components.

Let's start with RAID-1. For RAID-1, components can be replicas or witnesses.
Replicas are components containing the data.
Witnesses are components containing just metadata, used to avoid a split-brain scenario.

Object components are depicted in the screenshot below, where you can see three objects:
  1. VM Home
  2. VM Swap
  3. VM Disk
Each object has two components (data replicas) and one witness (a component containing just metadata).
vSAN Components

The key concept of data redundancy is FTT, the number of failures to tolerate. To tolerate failures, vSAN supports two methods of data distribution across vSAN nodes (i.e., ESXi hosts), often referred to as the FTM (Failure Tolerance Method). FTM can be
  • RAID-1 (aka Mirroring)
  • RAID-5/6 (aka Erasure Coding)
As data are distributed across nodes, not disks, to achieve redundancy, I'd rather call it RAIN than RAID. Anyway, vSAN terminology uses RAID, so let's stick with RAID.

In the table below, you can see how many hosts you need to achieve particular FTT for FTM RAID-1 (Mirroring):

FTT   Replicas   Witness components   Minimum # of hosts
0     1          0                    1
1     2          1                    3
2     3          2                    5
3     4          3                    7

In the table below, you can see how many hosts you need to achieve particular FTT for FTM RAID-5/6 (Erasure Coding):
FTT   Erasure coding   Redundancy      Minimum # of hosts
0     None             No redundancy   1
1     RAID-5           3D+1P           4
2     RAID-6           4D+2P           6
3     N/A              N/A             N/A
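The RAID-1 rows follow a simple pattern (FTT+1 replicas plus FTT witnesses, hence 2*FTT+1 hosts), while the erasure-coding minimums are fixed per scheme. A quick Python sketch of both tables:

```python
def min_hosts_raid1(ftt: int) -> int:
    """Mirroring: (FTT + 1) replicas + FTT witnesses = 2*FTT + 1 hosts."""
    return 2 * ftt + 1

# RAID-5 (3D+1P) and RAID-6 (4D+2P) minimum host counts per FTT
MIN_HOSTS_EC = {1: 4, 2: 6}

for ftt in range(4):
    print(f"FTT={ftt}: RAID-1 needs {min_hosts_raid1(ftt)} hosts")
```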

Design consideration:
The numbers of ESXi hosts above are minimums. What does that mean? In the case of longer ESXi host maintenance or a long-lasting server failure, vSAN will not be able to rebuild the components from the affected ESXi node somewhere else. That's why at least one additional ESXi host is highly recommended. Without it, there can be situations where your data are not redundant and therefore unprotected.

I have written this article mainly for myself, to use as a quick reference during conversations with customers. Hope you find it useful as well.

Friday, March 22, 2019

VMware SSO domain design and operational management

Before we deep dive into VMware SSO management, it is good to understand its architecture and discuss some design considerations. I highly recommend watching the following video.


If you have not watched the video yet, do NOT continue and watch it.

The video is great, but it is worth mentioning that vSphere 6.7 and 6.7 U1 come with a few significant improvements in terms of the PSC. You can read more about them in the article "External Platform Services Controller, A Thing of the Past". The overall concept stays the same, but the following enhancements were released:
  • vSphere 6.7 and vSphere 6.5 Update 2 introduced enhanced linked mode support for embedded PSC deployments.
  • The converge utility in vSphere 6.7 Update 1 allows customers with an external PSC deployment to migrate to an embedded PSC deployment.
  • vSphere 6.7 includes the repoint tool: a stand-alone embedded deployment can join or leave a vSphere SSO domain. Domain repoint is a feature available in vSphere 6.7 using the cmsso-util CLI command. You can repoint an external vCenter Server across vSphere SSO domains. New in vSphere 6.7 Update 1 is support for embedded deployment domain repoint.
So now you understand the VMware architectural basics, and we can deep dive into common management operations, which can also be used for design verification.

What is my SSO Domain Name?

It is good to know your SSO domain name. If I'm logged in to a PSC (or a VCSA with embedded PSC), the following command shows the SSO domain of this particular domain controller (aka PSC):
/usr/lib/vmware-vmafd/bin/vmafd-cli get-domain-name --server-name localhost

The output in my home lab is the following:

 root@vc01 [ ~ ]# /usr/lib/vmware-vmafd/bin/vmafd-cli get-domain-name --server-name localhost  
 uw.cz  

So my SSO domain is uw.cz

Where is my Lookup Service running?

The VCSA command
/usr/lib/vmware-vmafd/bin/vmafd-cli get-ls-location --server-name localhost
shows the location of the Lookup Service.

The output in my home lab is the following:

 root@vc01 [ ~ ]# /usr/lib/vmware-vmafd/bin/vmafd-cli get-ls-location --server-name localhost  
 https://vc01.home.uw.cz/lookupservice/sdk  

So my lookup service is located at  https://vc01.home.uw.cz/lookupservice/sdk  

What is the SSO Site Name?

The VCSA command
/usr/lib/vmware-vmafd/bin/vmafd-cli get-site-name --server-name localhost
shows the site name where the particular domain controller (aka PSC) is located.

The output in my home lab is the following:

 root@vc01 [ ~ ]# /usr/lib/vmware-vmafd/bin/vmafd-cli get-site-name --server-name localhost  
 ledcice  

So my PSC is in site ledcice, which is the name of the village where my home lab is located.

Domain replication agreements

If I have more PSCs in the SSO domain, I can determine the replication agreements and their status with the vdcrepadmin command, as shown below.

cd /usr/lib/vmware-vmdir/bin
./vdcrepadmin

Examples:
./vdcrepadmin -f showservers -h PSC_FQDN -u administrator -w Administrator_Password
./vdcrepadmin -f showpartners -h PSC_FQDN -u administrator -w Administrator_Password
./vdcrepadmin -f showpartnerstatus -h localhost -u administrator -w Administrator_Password
./vdcrepadmin -f createagreement -2 -h Source_PSC_FQDN -H New_PSC_FQDN_to_Replicate -u administrator -w Administrator_Password
./vdcrepadmin -f removeagreement -2 -h Source_PSC_FQDN \
-H PSC_FQDN_to_Remove_from_Replication -u administrator -w Administrator_Password


These procedures are documented in VMware KB "Determining replication agreements and status with the Platform Services Controller 6.x (2127057)" available at https://kb.vmware.com/kb/2127057

Domain repoint

Domain repoint is a feature available since vSphere 6.5 using the cmsso-util CLI command. You can repoint an external vCenter Server from one PSC to another PSC within the same vSphere SSO domain. No data migration is necessary for such repointing, as all data are replicated across all PSCs within the SSO domain. vSphere 6.7 U1 also supports repointing across different SSO domains, along with data migration.

With cmsso-util you can do the following operations; see the cmsso-util CLI command help below.

 root@vc01 [ ~ ]# cmsso-util  
 usage: cmsso-util [-h] {unregister,reconfigure,repoint,domain-repoint} ...  
 Tool for orchestrating unregister of a node from LS, reconfiguring a vCenter Server with embedded PSC and repointing a vCenter Server to an external  
 PSC in same as well as different domain.  
 positional arguments:  
  {unregister,reconfigure,repoint,domain-repoint}  
   unregister     Unregister node. Passing --node-pnid will unregister solution users, computer account and service endpoints. Passing --hostId  
             will unregister only service endpoints and solution users.  
   reconfigure     Reconfigure a vCenter with an embedded Platform Services Controller(PSC) to a vCenter Server. Then it repoints to the provided  
             external PSC node.  
   repoint       Repoints a vCenter with an external Platform Services Controller(PSC) to the provided external PSC node.  
   domain-repoint   Repoint Embedded vCenter Server from one vCenter Server to another given domain. The repoint operation will migrate Tags,  
             Authorization, License data to another Embedded node.  
 optional arguments:  
  -h, --help      show this help message and exit  

The command to unregister the system vc02.home.uw.cz would look like
cmsso-util unregister --node-pnid vc02.home.uw.cz --username administrator --passwd VMware1! 

How to decommission/remove a PSC from SSO domain?

You should use the cmsso-util unregister command to unregister the Platform Services Controller; however, it can sometimes fail with an error, so there is another way to unregister a failed PSC from the SSO database. You can use the command
/usr/lib/vmware-vmdir/bin/vdcleavefed -h hostname -u administrator -w PASSWORD
where hostname is the hostname of the PSC that must be removed.

Usage: vdcleavefed [ -h <hostname> ] -u <administrator> [ -w <password> ]
        implying offline mode if <hostname> is provided, and the server must have been down.

        implying online mode if <hostname> is not provided

It actually alters the SSO configuration and removes the federation.
 
How to list the services registered with Single Sign-On?
 
For vSphere 6.x
/usr/lib/vmidentity/tools/scripts/lstool.py --list 
 
For vSphere 7.x
/usr/lib/vmware-lookupsvc/tools/lstool.py list --url http://localhost:7090/lookupservice/sdk


How to converge VMware SSO domain topology?

Before vSphere 6.7 U1, there was no way to converge an existing SSO topology; vSphere 6.7 U1 introduced such convergence. If you have deployed or installed a vCenter Server instance with an external Platform Services Controller, you can convert it to a vCenter Server instance with an embedded Platform Services Controller using the converge utility vcsa-util. You can find the vcsa-util utility in the vcsa-converge-cli directory on the vCenter installation media (DVD).

With vcsa-converge-cli you can do the following operations


For further practical information and examples, you can read the following blog posts


Conclusion

I prefer simplicity over complexity, therefore I personally like all the improvements vSphere 6.7 U1 brings to the table. I always try to keep the SSO topology as simple as possible. However, in large environments with multiple sites across multiple regions, there can be requirements leading to more complex SSO topologies.
 
Update 2021/05/07: I have just been told about a very useful tool (lsdoctor) to address potential issues with data stored in the PSC database. See the VMware KB "Using the 'lsdoctor' Tool" https://kb.vmware.com/s/article/80469

Hope this blog post is useful for at least one other person than me. If you know some other commands or ways to manage a VMware SSO domain, please leave a comment below this blog post.

Thursday, March 14, 2019

How to transfer large ISO files to ESXi Datastore with USB disk?

I'm participating in one VMware virtualization PoC and we needed to transfer a large ISO file to a VMFS datastore on a standalone ESXi host. Normally you would upload ISO files over the network, but the PoC network was only 100 Mbps, so we wanted to use a USB disk to transfer the ISOs to the ESXi host.

There is William Lam's blog post "Copying files from a USB (FAT32 or NTFS) device to ESXi" describing how you can use a USB device with a FAT or NTFS filesystem to transfer ISOs, but it did not work for me, therefore I wanted to use the VMFS filesystem for the ISO transfer. I have VMware Fusion on my macOS laptop, so it is very easy to spin up a VM with ESXi 6.7 and have network access (local within the laptop) to ESXi. I use a USB stick connected to the laptop and passed through to the VM with ESXi. The USB disk is recognized by ESXi, but the only challenge is creating the VMFS datastore, because the web management (HTML5 Client) does not allow creating a new VMFS datastore on USB disks.

So, the only way is to create it from the command line.

By the way, all credits go to the blog post "Creating A VMFS Datastore On A USB Drive"; here is a quick installation procedure based on that post.

STOP USB Arbitrator

/etc/init.d/usbarbitrator status
/etc/init.d/usbarbitrator stop
/etc/init.d/usbarbitrator status

Find USB disk name

vdq -q
esxcfg-scsidevs -l

MYDISK="/vmfs/devices/disks/t10.SanDisk00Ultra00000000000000000000004C530001161026114003"
echo $MYDISK

Create 10GB VMFS datastore on USB disk

partedUtil getptbl $MYDISK
partedUtil mklabel $MYDISK gpt
partedUtil showGuids
partedUtil setptbl $MYDISK gpt "1 2048 20000000 AA31E02A400F11DB9590000C2911D1B8 0"
vmkfstools -C vmfs6 -S E2USB-ISO-Datastore ${MYDISK}:1
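The end sector in the setptbl line above determines the partition size; 20000000 sectors of 512 bytes is roughly 9.5 GiB. As a sanity check, here is the sector math, assuming 512-byte sectors and the usual aligned start at sector 2048 (my numbers, not partedUtil output):

```shell
# Sector math for the partedUtil setptbl line (assumes 512-byte sectors).
SIZE_GIB=10
START=2048                                   # common 1 MiB-aligned start sector
SECTORS=$(( SIZE_GIB * 1024 * 1024 * 2 ))    # 2 sectors per KiB -> sectors per GiB
END=$(( START + SECTORS - 1 ))
echo "end sector for ${SIZE_GIB} GiB: $END"
```

With these numbers, an exact 10 GiB partition would end at sector 20973567; the round 20000000 used above simply produces a slightly smaller (~9.5 GiB) datastore, which is fine for ISO storage.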

So the datastore E2USB-ISO-Datastore is created and you can upload ISO files to it. The upload goes over the virtual network within the laptop, so it is pretty fast.

Datastore usage on real ESXi host

When the ISO files are on the USB datastore, you can gracefully shut down the virtual ESXi, remove the USB disk from the laptop and connect it to the physical ESXi system. The USB Arbitrator on the physical ESXi system must be temporarily disabled with the command ...

/etc/init.d/usbarbitrator stop 

... otherwise the disk would not be usable within the ESXi host, because the USB device would be claimed for USB passthrough, which you do not want in this particular case. After transferring the data to a non-USB datastore, you can remove the USB disk and start the USB arbitrator again ...

/etc/init.d/usbarbitrator start 

Hope this procedure helps at least one other person in the VMware virtual community.

What motherboard chipset is used in VMware Virtual Hardware?

Today I was asked by one of my customers what motherboard chipset is used in VMware virtual hardware. The answer is clearly visible in the screenshot below ...

Motherboard chipset

The motherboard chipset is Intel 440BX (https://en.wikipedia.org/wiki/Intel_440BX). This chipset was released by Intel in April 1998. In the same year, VMware Inc. was founded.

The screenshot above was taken in Windows 10 running as the guest OS in a VM with hardware version 13, but the same chipset is used for VM hardware version 14, so I would assume all VM hardware versions use the same chipset and the differences among VM hardware versions are additional features like the maximum amount of RAM, number of NIC adapters, CPU features exposed from the physical CPU to the virtual CPU, etc.

In the two pictures below you can see the VM hardware difference between ESXi 3.5 and ESXi 4.0

ESXi 4.0

ESXi 3.5





Friday, March 01, 2019

VMware vSphere Memory Hot Add scalability limitation

VMware vSphere Hot Add CPU/Memory feature has specific requirements and limits. To mention some:
  • Virtual machine hardware must be version 7 or later.
  • It is not compatible with Fault Tolerance.
  • It requires a vSphere Enterprise Plus license.
  • Hot Remove is not supported.
  • Hot Add/Hot Plug must be supported by the guest operating system (check at http://vmware.com/go/hcl).
  • Guest OS technical and licensing limitations have to be taken into consideration.
However, it is good to know about another scalability limitation.

VMware has set a maximum value for hot-add memory. By default, this value is 16 times the amount of memory assigned to the virtual machine. For example, if the virtual machine memory is 2 GB, the maximum value for hot-add memory is 32 GB (2 x 16).
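The 16x rule is easy to compute; a trivial sketch (the multiplier 16 is the documented default, the configured 2 GB is just an example):

```shell
# Default ceiling for hot-add memory: 16x the memory configured at power-on.
CONFIGURED_GB=2
MAX_HOTADD_GB=$(( CONFIGURED_GB * 16 ))
echo "VM with ${CONFIGURED_GB} GB can be hot-extended up to ${MAX_HOTADD_GB} GB"
```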

Actually, this is a good safety mechanism and here is the reason for such restriction ...

When memory hot add is enabled, the guest operating system pre-allocates a large amount of kernel memory for the PFN database. The Windows operating system does not support dynamic PFN database allocation, so to make hot-added memory visible to the guest operating system, the PFN database must be sized in advance for the maximum possible memory. Limiting hot-add memory to 16x caps this kernel memory overhead.

Do you want to know more about "Page Frame Number (PFN) database"? Read this article.

This topic is documented in VMware KB https://kb.vmware.com/kb/2020846

Now there is another question. Does this limitation apply only to MS Windows, or does it apply to Linux OS as well? The short answer is yes, it applies to Linux as well. However, for Linux OS there is another limitation. If you are running a VM with Linux OS having less than 3 GB RAM, you can hot add memory only up to 3 GB RAM in total. If you need more, you have to power off the VM, increase the memory to, for example, 4 GB RAM and power it on again. When you are running Linux with more than 3 GB, you can use memory hot add, but again with the limit of increasing it to at most 16 times the configured memory.

Hope this is informative.

Memory Hot Add related VMware KBs:

Sunday, December 30, 2018

New Home Lab managed by containerized PowerCLI and RACADM

Christmas holidays are a perfect time to rebuild the home lab. I have got a "Christmas present" from my longtime colleague; we know each other from the times when we were both Dell employees. Thank you, Ondrej. He currently works for a local IT company (Dell partner), and because they did a hardware refresh for one of their customers, I have got from him 4 decommissioned, but still good enough, Dell PowerEdge R620 servers, each with a single populated CPU socket and 96 GB RAM. A perfect setup for a home lab, isn't it? My home lab environment is a topic for another blog post, but today I would like to write about the containerization of management CLIs (VMware PowerCLI and Dell RACADM), which will eventually help me automate home lab power off/on operations.

Before these new Dell servers, I had 4 Intel NUCs in my lab, which I'm replacing with the Dell PE R620s. Someone can argue that the Dell servers will consume significantly more electrical energy; however, it is not that bad. A single PE R620 server draws around 70-80 Watts. Yes, it is more than an Intel NUC, but only roughly 2 or 3 times more. Anyway, 4 x 80 Watts = 320 Watts, which is still around 45 EUR per month, so I have decided to keep the servers powered off and spin them up only on demand. The Dell servers have out-of-band management (iDRAC7), so it is easy to start and stop them automatically via the RACADM CLI. To gracefully shut down all virtual machines, put the ESXi hosts into maintenance mode and shut them down, I will leverage PowerCLI. I've decided to use one Intel NUC with ESXi 6.5 to keep some workloads up and running at all times. These workloads are the vCenter Server Appliance, Management Server, Backup Server, etc. All other servers can be powered off until I need to do some tests or demos in my home lab.
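The ~45 EUR figure can be sanity-checked with simple arithmetic; the 0.20 EUR/kWh electricity rate below is my assumption, your tariff will differ:

```shell
# Rough monthly electricity estimate for 4 servers drawing ~80 W each.
SERVERS=4
WATTS_EACH=80
TOTAL_W=$(( SERVERS * WATTS_EACH ))          # 320 W in total
KWH_MONTH=$(( TOTAL_W * 24 * 30 / 1000 ))    # ~230 kWh per month (24/7 runtime)
EUR_MONTH=$(( KWH_MONTH * 20 / 100 ))        # at an assumed 0.20 EUR/kWh
echo "${TOTAL_W} W -> ${KWH_MONTH} kWh -> ~${EUR_MONTH} EUR per month"
```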

I would like to have RACADM and PowerCLI also up and running to manage the Dell servers and vSphere via CLI or automation scripts. PowerCLI is available as an official VMware Docker image and there are also some unofficial RACADM Docker images available on Docker Hub, therefore I have decided to deploy Photon OS as a container host and run RACADM and PowerCLI in Docker containers.

In this blog post, I'm going to document steps and gotchas from this exercise.

DEPLOYMENT

Photon OS is available at GitHub as OVA, so deployment is very easy.

CHANGE ROOT PASSWORD

The first step after Photon OS deployment is to log in as root with the default password ("changeme" without the quotation marks) and change the root password.

CHANGE IP ADDRESS

By default, the IP address is assigned via DHCP. I want to use a static IP address, therefore I have to change the network settings. In Photon OS, the systemd-networkd process is responsible for the network configuration.

You can check its status by executing the following command:

systemctl status systemd-networkd

By default, systemd-networkd receives its settings from the configuration file 99-dhcp-en.network located in /etc/systemd/network/ folder.

Setting a Static IP Address is documented here.

I have created file /etc/systemd/network/10-static-en.network with the following content

==============================================
[Match]
Name=eth0

[Network]
DHCP=no
IPv6AcceptRA=no
Address=192.168.4.7/24
Gateway=192.168.4.254
DNS=192.168.4.4 192.168.4.20
Domains=home.uw.cz
NTP=time1.google.com time2.google.com ntp.cesnet.cz
==============================================

File permissions should be 644, so you can enforce them with the command
chmod 644 10-static-en.network

The new settings are applied with the command
systemctl restart systemd-networkd

CREATE USER FOR REMOTE ACCESS

It is always better to use a regular user instead of the root account, which has full administration rights on the system. Therefore, the next step is to add my personal account.

useradd -m -G sudo dpasek

-m creates the home directory, while -G adds the user to the sudo group.

Set password for this user

passwd dpasek

The next step is to edit the sudoers file with visudo. Search for %sudo and remove the '#' from that line. After that, you can log in with that account and run commands as root with 'sudo <command>'. Please note that sudo is not installed by default, therefore you have to install it yourself with a single command

tdnf install sudo

as described later in this post.

DISABLE PASSWORD EXPIRATION

If you want to disable password expiration, use the command chage

chage -M 99999 root
chage -M 99999 dpasek

ALLOW PING

Photon OS by default blocks ICMP, therefore you cannot ping it from outside. Ping is, IMHO, an essential network troubleshooting tool, therefore it should always be enabled. I do not think it is worth disabling it for the sake of better security. Here are the commands to enable ping ...

iptables -A INPUT -p ICMP -j ACCEPT
iptables -A OUTPUT -p ICMP -j ACCEPT

iptables-save > /etc/systemd/scripts/ip4save

UPDATE OS OR INSTALL ADDITIONAL SOFTWARE

The Photon OS package manager is tdnf, therefore an OS update is done with the command ...

tdnf update

If you need to install additional software, you can search for it and install it.

I have realized there is no sudo in the minimal installation from the OVA, therefore if you need it, you can search for sudo

tdnf search sudo

and install it

tdnf install sudo

START DOCKER DAEMON

I'm going to use Photon OS as a Docker host for two containers (PowerCLI and RACADM), therefore I have to start the Docker daemon ...

systemctl start docker

To start the Docker daemon on boot, use the command:

systemctl enable docker

ADD USER TO DOCKER GROUP

To run the docker command without sudo, I have to add my Linux user to the docker group.

usermod -a -G docker dpasek

POWERCLI DOCKER IMAGE

I already wrote a blog post about spinning up a PowerCLI Core container here. So let's quickly pull the PowerCLI Core image and instantiate a PowerCLI container.

docker pull vmware/powerclicore

Now, I can remotely log in (SSH) as a regular user (dpasek) and run any of my PowerCLI commands to manage my home lab environment.

docker run --rm -it vmware/powerclicore

The option --rm stands for "Automatically remove the container when it exits".

To work with PowerCLI, the following commands are necessary to initialize the PowerCLI configuration.

Set-PowerCLIConfiguration -Scope User -ParticipateInCEIP $true
Set-PowerCLIConfiguration -InvalidCertificateAction:ignore

The configuration persists within each container session; however, it disappears when the container is removed. Therefore, it is better to instantiate the container without the --rm option, set the PowerCLI configuration, keep the container in the system and start it next time to perform any other PowerCLI operation.

docker run -it -v "/home/dpasek/scripts/homelab:/tmp/scripts" --name homelab-powercli --entrypoint='/usr/bin/pwsh' vmware/powerclicore

The option --name is useful to set the name of the instantiated container, because the name can be used to restart the container and continue with PowerCLI.

Inside the container, we can initialize the PowerCLI configuration and use all other PowerCLI commands and scripts, eventually exit from the container back to the host, and later return with the command

docker start homelab-powercli -i

With this approach, the PowerCLI configuration persists.

RACADM DOCKER IMAGE

Another image I need in my home lab is Dell RACADM to manage the Dell iDRACs. Let's pull and instantiate the most downloaded RACADM image.

docker pull justinclayton/racadm

It can be used interactively, for example to get system information from the iDRAC with hostname esx21-oob:

docker run --rm justinclayton/racadm -r esx21-oob -u root -p calvin getsysinfo

INSTALL AND CONFIGURE GIT

I would like to store all my home lab scripts in a GitHub repository, synchronize it with my container host and leverage it to manage my home lab.

# install Git
sudo tdnf install git

# configure Git
git config --global user.name "myusrname"
git config --global user.email "mymail@example.com"

git clone https://github.com/davidpasek/homelab

# save Git credentials
git config credential.helper store

RUN POWERCLI SCRIPT STORED IN CONTAINER HOST

In case I do not want to use PowerCLI interactively and want to run some predefined PowerCLI scripts, the local script directory has to be mapped into the container as shown in the example below.

docker run -it --rm -v /home/dpasek/scripts/homelab:/tmp/scripts --entrypoint='/usr/bin/pwsh' vmware/powerclicore /tmp/scripts/get-vms.ps1

The option --rm is used to remove the container from the system after the PowerCLI script is executed.

The option -v is used to map the container host directory /home/dpasek/scripts/homelab to the container directory /tmp/scripts.

I was not able to run the PowerCLI script directly with the docker command without the option --entrypoint.

The whole toolset is up and running, so the rest of the exercise is to develop RACADM and PowerCLI scripts to effectively manage my home lab. The idea is to shut down all VMs and ESXi hosts when the lab is not needed. When I need the lab, I will simply power on some vSphere cluster and the VMs within these clusters having the vSphere tag "StartUp".

I'm planning to store all these scripts in a GitHub repository for two reasons
  1. The GitHub repository will be used as a backup solution 
  2. You can track the progress of my home lab automation project
Hope I will find some spare time to finish my idea and automate this process, which I have to do manually at the moment.
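The power-on part of that process could look something like the sketch below. The iDRAC hostnames follow the esx21-oob naming used earlier, racadm's serveraction powerup is a standard RACADM subcommand, and start-tagged-vms.ps1 is a hypothetical PowerCLI script I still have to write:

```shell
# Sketch: power on the lab servers via the RACADM container, then start
# tagged VMs via the PowerCLI container. Hostnames, credentials and the
# script name are illustrative.
for idrac in esx21-oob esx22-oob esx23-oob esx24-oob; do
  docker run --rm justinclayton/racadm -r "$idrac" -u root -p calvin serveraction powerup
done
docker run -it --rm -v /home/dpasek/scripts/homelab:/tmp/scripts \
  --entrypoint='/usr/bin/pwsh' vmware/powerclicore /tmp/scripts/start-tagged-vms.ps1
```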

Related resources:

Tuesday, December 11, 2018

VMware Change Block Tracking (CBT) and the issue with incremental backups

One of my customers is experiencing a weird issue when using a traditional enterprise backup (IBM TSM / Spectrum Protect in this particular case) leveraging VMware vSphere Storage APIs (aka VDDK) for image-level backups of vSphere 6.5 virtual machines. They observed strange behavior in the size of incremental backups. The IBM TSM backup solution should do a full backup once and incremental backups forever. This is a great approach to save space on backup (secondary) storage. However, my customer observed that some virtual machines, randomly over time, got almost full backups instead of the expected incremental backups. This obviously has a very negative impact on the capacity of the backup storage system and also on backup window times.

The customer has vSphere 6.5 U2 (build 9298722) and IBM TSM VE 8.1.4.1. They observed the problem only on VMs where the VM hardware was upgraded to version 13. The customer opened support cases with VMware GSS and IBM support.

IBM Support observed that the VADP/VDDK API function QueryChangedDiskAreas was failing with a TSM log message similar to ...

10/19/2018 12:04:26.230 [007260] [11900] : ..\..\common\vm\vmvisdk.cpp(2436): ANS9385W Error returned from VMware vStorage API for virtual machine 'VM-NAME' in vSphere API function __ns2__QueryChangedDiskAreas. RC=12, Detail message: SOAP 1.1 fault: "":ServerFaultCode[no subcode]
"Error caused by file /vmfs/volumes/583eb2d3-4345fd68-0c28-3464a9908b34/VM-NAME/VM-NAME.vmdk"

VMware Support (GSS) instructed my customer to reset CBT - https://kb.vmware.com/kb/2139574 or disable and re-enable CBT - https://kb.vmware.com/kb/1031873 and observe if it solves the problem.

A few days after CBT reset, the problem with backup occurred again, therefore it was not a resolution.

I did some research and found another KB - "CBT reports larger area of changed blocks than expected if guest OS performed unmap on a disk (59608)". We believe that this is the root cause, and the KB contains a workaround and the final resolution.

The root cause mentioned in VMware KB 59608 ...
When an unmap is triggered in the guest, the OS issues UNMAP requests to underlying storage. However, the requested blocks include not only unmapped blocks but also unallocated blocks. And all those blocks are captured by CBT and considered as changed blocks then returned to backup software upon calling the vSphere API queryChangedDiskAreas(changeId).
Workaround recommended in KB ...
Disable unmap in guest VM.
For example, in MS Windows operating systems, UNMAP can be disabled by the command

fsutil behavior set DisableDeleteNotify 1

and re-enabled by the command

fsutil behavior set DisableDeleteNotify 0

Warning! Disabling UNMAP in the guest OS can have a tremendously negative impact on storage space reclamation; therefore, fixing a space issue on secondary storage can cause a storage space issue on your primary storage. Check your specific design before the final decision on how to work around this issue.

Anyway, the final problem resolution has to be done by the backup software vendor ...
If you have VDDK 6.7 or later libraries, take the intersection of VixDiskLib_QueryAllocatedBlocks() and queryChangedDiskAreas(changeId) to calculate the actually changed blocks.
The backup software should not use just the API function QueryChangedDiskAreas but also the function QueryAllocatedBlocks, and calculate the disk blocks for incremental backups from the intersection. Based on the VDDK 6.7 Release Notes, VDDK 6.7 can be leveraged even for vSphere 6.5 and 6.0. For more info, read the Release Notes here.
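The essence of that fix is a per-extent intersection. A minimal sketch of the logic (the offset:length pairs are made-up illustrations, not real VDDK output; real backup code would call VixDiskLib_QueryAllocatedBlocks() and queryChangedDiskAreas(changeId) instead of using hard-coded lists):

```shell
# Intersect allocated extents with CBT-reported changed extents; only the
# overlap needs to go into the incremental backup.
allocated="0:4096 8192:4096"   # from QueryAllocatedBlocks (offset:length)
changed="2048:8192"            # from QueryChangedDiskAreas (offset:length)
extents=""
for a in $allocated; do
  a_start=${a%:*}; a_end=$(( a_start + ${a#*:} ))
  for c in $changed; do
    c_start=${c%:*}; c_end=$(( c_start + ${c#*:} ))
    s=$(( a_start > c_start ? a_start : c_start ))   # max of the two starts
    e=$(( a_end < c_end ? a_end : c_end ))           # min of the two ends
    if [ "$s" -lt "$e" ]; then
      # block range is both allocated AND changed -> back it up
      extents="$extents $s:$(( e - s ))"
    fi
  done
done
echo "extents to back up:$extents"
```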

I believe the problem occurs only under the following conditions
  • The virtual disk must be thin-provisioned.
  • VM hardware is version 11 or later - older VM hardware versions do not pass UNMAP SCSI commands through.
  • The guest operating system must be able to identify the virtual disk as thin and issue UNMAP SCSI commands down to the storage system.
Based on the conditions above, I personally believe that another workaround for this issue would be to not use thin-provisioned virtual disks and convert them into thick virtual disks. As far as I know, thick virtual disks do not pass UNMAP commands through the VM hardware, therefore they should not cause CBT issues.

My customer is not leveraging thin provisioning on the physical storage layer, therefore he is going to test the workaround recommended in KB 59608 (disable UNMAP in guest OSs) as a short-term solution and start investigating the long-term problem fix with IBM Spectrum Protect (aka TSM). It seems IBM Spectrum Protect Data Mover 8.1.6 leverages VDDK 6.7.1, so an upgrade from the current version 8.1.4 to 8.1.6 could solve the issue.

Friday, December 07, 2018

ESXi : This host is potentially vulnerable to issues described in CVE-2018-3646

This is a very short post in reaction to those who asked me recently.

When you update to the latest ESXi builds, you can see the warning message depicted in the screenshot below.

Warning message in ESXi Client User Interface (HTML5)
This message just informs you about the Intel CPU vulnerability described in VMware Security Advisory 2018-0020 (VMSA-2018-0020).

You have three choices

  • eliminate the security vulnerability
  • ignore the potential security risk and dismiss the warning
  • keep it as it is and ignore the warning in the user interface
Elimination of the "L1 Terminal Fault" security vulnerability is described in VMware KB 55806. It is configurable by the ESXi advanced option VMkernel.Boot.hyperthreadingMitigation. If you set the value to TRUE or 1, ESXi will be protected.

The warning message suppression is configurable by another ESXi advanced option, UserVars.SuppressHyperthreadWarning. A value of TRUE or 1 will suppress the warning message.
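For reference, both options can be set from the ESXi shell; a sketch based on KB 55806 (verify against the KB for your exact build, and note the kernel setting takes effect only after a host reboot):

```shell
# Enable the L1TF scheduler mitigation (requires a host reboot):
esxcli system settings kernel set -s hyperthreadingMitigation -v TRUE

# Or just suppress the warning in the user interface:
esxcli system settings advanced set -o /UserVars/SuppressHyperthreadWarning -i 1
```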

Thursday, December 06, 2018

VMware Metro Storage Cluster - is it DR solution?

Yesterday morning, I had a design discussion with one of my customers about HA and DR solutions. We discussed the VMware Metro Storage Cluster topic the same day in the afternoon within our internal team, so it inspired me to write this blog article and use it as a reference for future similar discussions. By the way, I presented this topic at a local VMUG meeting two years ago, so you can find the original slides here on SlideShare. In this blog post, I would like to document the topics, architectures, and conclusions I discussed today with several folks.

Stretched (aka active/active) clusters are a very popular infrastructure architecture pattern nowadays. The VMware implementation of such an active/active cluster pattern is vMSC (VMware Metro Storage Cluster). The official VMware vSphere Metro Storage Cluster Recommended Practices can be found here. Let's start with a definition of what vMSC is and is not from the HA (High Availability), DA (Disaster Avoidance) and DR (Disaster Recovery) perspectives.

vMSC (VMware Metro Storage Cluster) is
  • High Availability solution extending infrastructure high availability across two availability zones (sites in the metro distance)
  • Disaster Avoidance solution enabling live migration of VMs not only across ESXi hosts within single availability zone (local cluster) but also to another availability zone (another site)
vMSC (VMware Metro Storage Cluster) is a great High Availability and Disaster Avoidance technology, but it is NOT a pure Disaster Recovery solution, even though it can help with two specific disaster scenarios (failure of one of the two storage systems, single site failure). Why is it not a pure DR solution? Here are a few reasons
  • vMSC requires Storage Metro Cluster technology, which joins two storage systems into a single distributed storage system allowing stretched storage volumes (LUNs), but this creates a single fault zone for situations when LUNs are locked or badly served by the storage system. It is great for HA but not good for DR. Such a single fault zone can lead to a total cluster outage in situations like those described here - http://www-01.ibm.com/support/docview.wss?uid=ssg1S1005201 and https://kb.vmware.com/kb/2113956
  • The vMSC compute cluster (vSphere cluster) has to be stretched across two availability zones, which creates a single fault zone. Such a single fault zone can lead to a total cluster outage in situations like those described here - https://kb.vmware.com/kb/56492
  • DR is not only about infrastructure but also about applications, people and processes.
  • DR should be business service oriented therefore from IT perspective, DR is more about applications than infrastructure
  • DR should be tested on a regular basis. Can you afford to power off the whole site and test that all VMs will be restarted on the other site? Are you sure the applications (or more importantly, business services) will survive such a test? I know a few environments where they can afford it, but most enterprise customers cannot.
  • DR should allow going back into the past, therefore the solution should be able to leverage old data recovery points. Recoverability from old recovery points should be possible on the application group and not for the whole infrastructure.
Combination of HA and DR solutions
Any HA solution should be combined with some DR solution. At a minimum, such a DR solution is any classic backup solution with local or even remote-site backup repositories. The typical challenge with any backup solution is RTO (Recovery Time Objective), because
  • You must have the infrastructure hardware ready for workloads to be restored and powered on
  • Time to recover from traditional backup repositories is usually very long and may or may not fulfill the RTO requirement
That's the reason why orchestrated DR with storage replication and snapshots is usually a better DR solution than classic backup. vMSC can be safely combined with storage-based DR solutions with lower RTO SLAs. VMware has a specific Disaster Recovery product called Site Recovery Manager (SRM) to achieve orchestrated vSphere or storage replication and automated workload recovery. With such a combination, you can get Cross-Site High Availability and Cross-Site Disaster Avoidance provided by vMSC, and pure Disaster Recovery provided by SRM. Such a combination is not so common, at least in my region, because it is relatively expensive. That's the reason customers usually have to decide on only one solution. Now, let's think about why vMSC is preferred by infrastructure guys over a pure DR solution like SRM. Here are the reasons
  • It is "more simple" and much easier to implement and operate
  • No need to understand, configure and test application dependencies
  • Can be "wrongly" claimed as DR solution 
It is not very well known, but VMware SRM nowadays supports Disaster Recovery and Avoidance on top of stretched storage. It is described in the last architecture concept below.

So let's have a look at various architecture concepts for cross-site HA and DR with VMware products.

VMware Metro Storage Cluster (vMSC) - High Availability and Disaster Avoidance Solution

VMware Metro Storage Cluster (vMSC)
In the figure above, I have depicted the VMware Metro Storage Cluster consisting of
  • Two availability zones (Site A, Site B)
  • Single centralized vSphere Management (vCenter A)
  • Single stretched storage volume(s) distributed across two storage systems, each in a different availability zone (Site A, Site B)
  • VMware vSphere Cluster stretched across two availability zones (Site A, Site B)
  • A third location (Site C) for the storage witness. If a third site is not available, the witness can be placed in Site A or B, but then the storage administrator is the real arbiter in case of potential split-brain scenarios
Advantages of such architecture are
  • Cross-site high availability (positive impact on Availability, thus Business Continuity)
  • Cross-site vMotion (good for Disaster Avoidance)
  • Protects against single storage system (storage in one site) failure scenario
  • Protects against single availability zone (one site) failure scenario
  • Self-initiated fail-over procedure.
Drawbacks
  • vMSC is a tightly integrated distributed cluster system between the vSphere HA cluster and the Storage Metro Cluster, therefore it is a potential single fault zone. Stretched LUN(s) are a single fault zone for issues caused by the distributed storage system or bad behavior of the cluster filesystem (VMFS)
  • Typically, the third location is required for storage witness
  • It is usually very difficult to test HA
  • It is almost impossible to test DR
VMware Site Recovery Manager in Classic Architecture - Disaster Recovery Solution

VMware Site Recovery Manager - Classic Architecture
In the figure above, I have depicted the classic architecture of the VMware DR solution (Site Recovery Manager) consisting of
  • Two availability zones (Site A, Site B)
  • Two independent vSphere Management servers (vCenter A, vCenter B)
  • Two independent DR orchestration servers (SRM A, SRM B)
  • Two independent vSphere Clusters
  • Two independent storage systems. One in Site A, second in Site B
  • Synchronous or asynchronous data replication between storage systems
  • Snapshots (multiple recovery points) on backup site are optional but highly recommended if you do DR planning seriously.
Advantages of such architecture are
  • Cross-site disaster recoverability (positive impact on Recoverability, thus Business Continuity)
  • Maximal infrastructure independence, therefore we have two independent fault zones. The only connection between the two sites is storage (data) replication.
  • Human-driven and well-tested disaster recovery procedure.
  • Disaster Avoidance (migration of applications between sites) can be achieved, but only with business service downtime. A Protection Group has to be shut down on one site and restarted on the other site.
Drawbacks
  • Disaster Avoidance without service disruption is not available.
  • Usually, there is a huge level of effort with application dependency mapping, and application-specific recovery plans (automated or semi-automated run books) have to be planned, created and tested
VMware Site Recovery Manager in Stretched Storage Architecture - Disaster Recovery and Avoidance Solution

VMware Site Recovery Manager - Stretched Storage Architecture
In the last figure, I have depicted the new architecture of the VMware DR solution (Site Recovery Manager). In this architecture, SRM supports stretched storage volumes, but everything else is independent and specific to each site. The solution consists of
  • Two availability zones (Site A, Site B)
  • Two independent vSphere Management servers (vCenter A, vCenter B)
  • Two independent DR orchestration servers (SRM A, SRM B)
  • Two independent vSphere Clusters
  • Single distributed storage systems having storage volumes stretched across Site A and Site B
  • Snapshots (multiple recovery points) on backup site are optional but highly recommended if you do DR planning seriously.
Advantages of such an architecture are
  • Cross-site disaster recoverability (positive impact on Recoverability, and thus on Business Continuity)
  • Maximal infrastructure independence apart from storage; the two sites are connected only through the stretched storage system.
  • A human-driven and well-tested disaster recovery procedure.
  • Disaster Avoidance without service disruption, leveraging cross-vCenter vMotion technology.
Drawbacks
  • Usually, a huge level of effort is required: application dependency mapping has to be done, and application-specific recovery plans (automated or semi-automated run books) have to be planned, created, and tested.
  • The virtual machine's internal identifier (moRef ID) changes after cross-vCenter vMotion, so supporting solutions (backup software, monitoring software, etc.) must not depend on this identifier.
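The moRef dependency issue can be illustrated with a small sketch. The inventory records below are purely hypothetical; the point is that a tool keying its records on the moRef ID loses track of the VM after cross-vCenter vMotion, while a tool keyed on a stable identifier such as the vSphere instanceUuid does not:

```python
# Hypothetical inventory records for the same VM, as seen by a backup
# or monitoring tool before and after a cross-vCenter vMotion.
# The moRef ID changes with the migration; the instanceUuid does not.
before = {"moref": "vm-1021",
          "instance_uuid": "502e71fa-1b5c-4c2e-9a0d-0123456789ab",
          "name": "app01"}
after = {"moref": "vm-88",
         "instance_uuid": "502e71fa-1b5c-4c2e-9a0d-0123456789ab",
         "name": "app01"}

def same_vm_by_moref(a, b):
    # Fragile: the moRef is only unique within one vCenter inventory.
    return a["moref"] == b["moref"]

def same_vm_by_uuid(a, b):
    # Robust: the instanceUuid survives cross-vCenter vMotion.
    return a["instance_uuid"] == b["instance_uuid"]

print(same_vm_by_moref(before, after))  # False - moRef broke after vMotion
print(same_vm_by_uuid(before, after))   # True - identity preserved
```

Any supporting solution used in this architecture should therefore be validated against a cross-vCenter vMotion before the design is signed off.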
CONCLUSION

Infrastructure availability and recoverability are two independent infrastructure qualities. Both have a positive impact on business continuity, but each addresses a different situation. High Availability solutions increase the reliability of a system through redundancy and automated, self-healing failover among redundant system components. Recoverability solutions back up data from one system and allow a full recovery in another, independent system. Both approaches can and should be combined in compliance with SLA/OLA requirements.

VMware Metro Storage Cluster is a great High Availability technology, but it should not be used as a replacement for disaster recovery technology. VMware Metro Storage Cluster is not a Disaster Recovery solution, even though it can protect the system against two specific disaster scenarios (a single site failure, a single storage system failure). Likewise, you would not call a vSphere HA cluster a DR solution, even though it can protect you against a single ESXi host failure.

The final infrastructure architecture always depends on the specific use cases, requirements, and expectations of the particular customer, but expectations should be set correctly, and we should know what the designed system does and does not do. It is always better to have known risks than unknown risks. For known risks, a mitigation or contingency plan can be prepared and communicated to system users and business clients; for unknown risks, it cannot.

Other resources
There are other posts in the blogosphere explaining what vMSC is and is NOT.

“VMware vMSC can give organizations many of the benefits that a local high-availability cluster provides, but with geographically separate sites. Stretched clustering, also called distributed clustering, allows an organization to move virtual machines (VMs) between two data centers for failover or proactive load balancing. VMs in a metro storage cluster can be live migrated between sites with vSphere vMotion and vSphere Storage vMotion. The configuration is designed for disaster avoidance in environments where downtime cannot be tolerated, but should not be used as an organization's primary disaster recovery approach.”
Another very nice technical write-up about vMSC is here - The dark side of stretched clusters