Tuesday, June 15, 2021

vSphere 7 - ESXi boot media partition layout changes

VMware vSphere 7 is a major product release with a lot of design and architectural changes. Among these changes, VMware also reviewed and changed the layout of ESXi 7 storage partitions on boot devices. This change has some design implications, which I try to cover in this blog post. 

Note: Please be aware that almost all information in this blog post is sourced from external resources such as VMware documentation, VMware KBs, VMware blog posts, and also VMware community blog posts.

Let's start with ESXi 7 Storage Requirements

Here is the list of boot device storage requirements from VMware documentation - source [2]:
  • Installing ESXi 7.0 requires a boot device that is a minimum of 8 GB for USB or SD devices, and 32 GB for other device types.
  • Upgrading to ESXi 7.0 requires a boot device that is a minimum of 4 GB. 
  • When booting from a local disk, SAN or iSCSI LUN, a 32 GB disk is required to allow for the creation of system storage volumes, which include a boot partition, boot banks, and a VMFS-L based ESX-OSData volume. 
  • The ESX-OSData volume takes on the role of the legacy /scratch partition, locker partition for VMware Tools, and core dump destination.

Key changes between ESXi 6 and ESXi 7

Here are the key boot media partitioning changes between ESXi 6 and ESXi 7:
  • larger system boot partition
  • larger boot banks
  • introducing ESX OSData (ROM-data, RAM-data)
    • consolidation of coredump, tools and scratch into a single VMFS-L based ESX-OSData volume
    • coredumps default to a file in ESX-OSData
  • variable partition sizes based on boot media capacity

The biggest change to the partition layout is the consolidation of VMware Tools Locker, Core Dump and Scratch partitions into a new ESX-OSData volume (based on VMFS-L). This new volume can vary in size (up to 138GB). [4]

Official support for specifying the size of ESX-OSData was added in ESXi 7.0 Update 1c with a new ESXi kernel boot option called systemMediaSize, which takes one of four values [4]:

  • min = 25GB
  • small = 55GB
  • default = 138GB (default behavior)
  • max = Consumes all available space
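
To illustrate how the systemMediaSize option above can be supplied, here is a sketch; verify the exact procedure against the VMware KB for your build, since the partition layout is only created at installation time:

```shell
# At the ESXi installer boot prompt, press Shift+O and append the option
# to the boot command line, for example:
#   runweasel systemMediaSize=small

# On an installed host, the kernel setting can be inspected or set
# (this does not repartition already-installed media):
esxcli system settings kernel list -o systemMediaSize
esxcli system settings kernel set -s systemMediaSize -v small
```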

What is the ESX-OSData partition?

ESX-OSData is a new partition to store ESXi configuration, system state, and system or agent virtual machines. The OSData partition is divided into two sections:

  1. ROM-data
  2. RAM-data

ROM-data is not read-only, as the name may imply, but a section for data written to the disk infrequently. Examples of such data are VMware Tools ISOs, ESXi configurations, core dumps, etc.

RAM-data is for frequently written data like logs, VMFS global traces, vSAN EPD and traces, and live system state files.

How has the partition layout changed? 

Below is depicted the partition layout in vSphere 6.x and the consolidated partition layout in vSphere 7 [1].



Partition size variations

There are various partition sizes based on the boot device size. The only fixed size is the system boot partition, which is always 100 MB. All other variations are depicted in the picture below [1].

Note: If you use USB or SD storage devices, the ESX-OSData partition is created on an additional storage device such as an HDD or SSD. When an additional storage device is not available, ESX-OSData is created on the USB or SD device, but it is used only to store ROM-data, while RAM-data is stored on a RAM disk. [1]

What design options do I have? 

ESX-OSData is used as the unified location to store Scratch, Core Dump, and ProductLocker data. By default, it is located on the boot media (the ESX-OSData partition), but there are advanced settings allowing these types of data to be relocated to external locations.

Design Option #1 - Changing ScratchPartition location

In ESXi 7.0, a VMFS-L based ESX-OSData volume (where logs, coredumps, and configuration are stored) replaces the traditional scratch partition. During an upgrade, the configured scratch partition is converted to ESX-OSData. The settings described in VMware KB 1033696 [7] are still applicable for cases where you want to point the scratch path to another location. It is about the ESXi advanced setting ScratchConfig.ConfiguredScratchLocation. I wrote a blog post about changing the Scratch location here.
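As a sketch of this option (the datastore path is a placeholder; verify the exact method against KB 1033696, which also describes the vSphere Client procedure):

```shell
# Point the scratch location to a directory on a persistent datastore;
# a host reboot is required for the change to take effect.
esxcli system settings advanced set \
  -o /ScratchConfig/ConfiguredScratchLocation \
  -s "/vmfs/volumes/datastore1/.locker-esx01"
```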

Design Option #2 - Create a core dump file on a datastore

The core dump location can also be changed. To create a core dump file on a datastore, see KB article 2077516 [8].
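A hedged sketch of the commands from that KB (the datastore and file names are placeholders):

```shell
# Create a core dump file on a datastore and activate it:
esxcli system coredump file add -d datastore1 -f esx01-coredump
esxcli system coredump file set --smart --enable true
# Verify the configured dump file:
esxcli system coredump file list
```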

Design Option #3 - Changing ProductLocker location

To change the productLocker location from boot media to a directory on a datastore, see the VMware KB article 2129825 [10].
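As a sketch (the datastore path is a placeholder, and the advanced option name should be verified against KB 2129825 for your ESXi release):

```shell
# Copy the VMware Tools packages to a shared datastore first, then point
# the host at that location instead of the boot media:
esxcli system settings advanced set \
  -o /UserVars/ProductLockerLocation \
  -s "/vmfs/volumes/shared-datastore/productLocker"
```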

Applying all three options above can significantly reduce I/O operations to boot media with lower endurance, such as USB flash disks or SD cards. However, the hardware industry has improved over the last few years, and nowadays we have new boot media options such as SATA-DOM, M.2 slots for SSDs, or low-cost NVMe (PCIe SSD).

Note: I have not tested the above design options in my lab; therefore, I'm assuming they work as expected based on the VMware KBs referred to in each option.

Other known problems you can observe when using USB or SD media

There are other known issues with using USB or SD as boot media, but some of these issues are already addressed or will be addressed in future patches, as USB and SD media are officially supported.
 
 I'm aware of these issues:
  • ESXi hosts experience All Paths Down events on USB-based SD cards while using the vmkusb driver [5] [15]
    • Luciano Patrao blogged about this (or a similar) issue at [14] and found a workaround until the final VMware fix, which should be released in ESXi 7.0 U3. Luciano's workaround is to
      1. login to ESXi console (SSH or DCUI)
      2. execute command "esxcfg-rescan -d vmhba32" several times until it finishes without an error.
      3. You need to give it some minutes between each rerun of the command. Be patient and try again in 2-5 minutes.
      4. Once all errors are gone and the command finishes without any error, you should see in the logs that “mpx.vmhba32:C0:T0:L0” was mounted in rw mode, and you should be able to do some work on the ESXi host again.
      5. If you still have some issues, restart the management agents
        • /etc/init.d/hostd restart
        • /etc/init.d/vpxa restart   
      6. After this, you should be able to migrate your VMs to another ESXi host and reboot this one. Until it breaks again when someone tries to use VMware Tools.
  • VMFS-L Locker partition corruption on SD cards in ESXi 7.0 U1 and U2 [6] (should be fixed in future ESXi patch)
  • High frequency of read operations on VMware Tools image may cause SD card corruption [12]
    • This issue has been addressed in ESXi 6.7 U3 - changes were made to reduce the number of read operations being sent to the SD card, and an advanced parameter was introduced that allows you to migrate your VMware Tools image to a ramdisk on boot. This way, the information is read from the SD card only once per boot cycle.
      • However, it seems that the problem reoccurred in ESXi 7.x, because the ToolsRamdisk option is not available in ESXi 7.0.x releases [13]
    • The other vSphere design solution is, IMHO, the change of the ProductLocker location mentioned above, because then the VMware Tools image is not located on boot media.
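For ESXi 6.7 U3, the advanced parameter mentioned above can be set from the ESXi shell; a sketch, assuming the UserVars/ToolsRamdisk option referenced in [12] (verify availability on your build):

```shell
# Copy the VMware Tools image to a RAM disk at boot, so it is read from
# the SD card only once per boot cycle (reboot required):
esxcli system settings advanced set -o /UserVars/ToolsRamdisk -i 1
```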

Conclusion

ESXi 7 uses the ESX-OSData partition for various logging and debugging files. In addition, if vSAN and/or NSX is enabled in ESXi, there are additional trace files leading to even higher I/O. This ESXi system behavior requires higher boot media endurance than in the past. 

If you are defining a new hardware specification, it is highly recommended to use larger boot media (~150 GB or more) based on NAND flash technology and connected through modern buses like M.2 or PCIe. When larger boot media is in use, ESXi 7 will do all the magic required for correct partitioning of the boot media.

In case of existing hardware and no budget for additional hardware upgrades, you can still use SD cards or USB drives, but you should carefully design the boot media layout and consider relocating Scratch, Core Dump, and ProductLocker to external locations to mitigate the risk of boot media failure.

Hope this write-up helps. If you have some other findings or comments, do not hesitate to let me know via comments below the post, Twitter, or email.

Sources:

Saturday, May 15, 2021

AWS, FreeBSD AMIs and WebScale application FlexBook

I've started to play with AWS cloud computing. When I'm starting with any new technology, the best way to learn it is to use it for some project. And because I participate in one open-source project, where we develop a multi-cloud application that can run, scale, and auto-migrate among various cloud providers, I've decided to do a Proof of Concept in AWS. 

The open-source software I'm going to deploy is FlexBook, which is available on GitHub.

Below is the logical infrastructure design of the AWS infrastructure for the deployment of a web-scale application.

My first PoC uses the following AWS resources:

  • 1x AWS Region
  • 1x AWS VPC
  • 1x AWS Availability Zone
  • 1x AWS Internet Gateway
  • 1x AWS Public Segment
  • 1x AWS Private Segment
  • 1x AWS NAT Gateway
  • 6x EC2 Instances
    • 1x FlexBook Ingress Controller - NGINX used as an L7 load balancer redirecting ingress traffic to a particular FlexBook node
    • 1x WebPortal - NGINX used as web server for static portal page using JavaScript components leveraging REST API communication to FlexBook cluster (3 FlexBook nodes which can auto scale if necessary)
    • 1x FlexBook Manager - responsible for FlexBook cluster management including deployment, auto-scale, application distributed resource management, etc.
    • 3x FlexBook Node - this is where multi-tenant FlexBook application is running. App tenants can be migrated across FlexBook nodes.

For all EC2 instances I'm going to use my favorite operating system - FreeBSD.

I've realized that AWS EC2 instances do not support console access; therefore, SSH is the only way to log in to the servers. You can generate an SSH key pair during EC2 deployment and download the private key (PEM) to your computer. AWS shows you how to connect to your EC2 instance. This is what you see in the instructions:

ssh -i "flxb-mgr.pem" root@ec2-32-7-14-5.eu-central-1.compute.amazonaws.com

However, the command above does not work for FreeBSD. AWS tells you the following information ...

Note: In most cases, the guessed user name is correct. However, read your AMI usage instructions to check if the AMI owner has changed the default AMI user name. 
And that's the point. The default username for FreeBSD AWS AMIs is ec2-user; therefore, the following command will let you connect to an AWS EC2 FreeBSD instance.

ssh -i "flxb-mgr.pem" ec2-user@ec2-32-7-14-5.eu-central-1.compute.amazonaws.com

When you SSH in as ec2-user, you can su to the root account, which does not have any password.

Here are some best practices for production usage:

  • set a root password
  • remove the ec2-user account and create your own account with your own SSH keys
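
The steps above can be sketched as follows on the FreeBSD instance (run as root); the account name "myadmin" and the key path are placeholders:

```shell
passwd root                                   # set a root password
pw useradd -n myadmin -m -G wheel -s /bin/sh  # create your own account
mkdir -p /home/myadmin/.ssh
# install your own public key (path is a placeholder):
cp /tmp/mykey.pub /home/myadmin/.ssh/authorized_keys
chown -R myadmin:myadmin /home/myadmin/.ssh
chmod 700 /home/myadmin/.ssh
rmuser -y ec2-user                            # remove the default account
```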

That's it for now. I will continue with AWS discovery and potential production use of AWS for some FlexBook projects. 

 Sources and additional resources:

Wednesday, March 24, 2021

What's new in vSphere 7 Update 2

vSphere 7 is not only about server virtualization (Virtual Machines) but also about containers orchestrated by the Kubernetes orchestration engine. The VMware Kubernetes distribution and the broader platform for modern applications (also known as CNA - Cloud Native Applications, or Developer Ready Infrastructure) is called VMware Tanzu. Let's start with enhancements in this area and continue with more traditional areas like operational, scalability, and security improvements.

Developer Ready Infrastructure

vSphere with Tanzu - Integrated LoadBalancer

vSphere 7 Update 2 includes a fully supported, integrated, highly available, enterprise-ready load balancer for the Tanzu Kubernetes Grid Control Plane and Kubernetes Services of type LoadBalancer - NSX Advanced Load Balancer Essentials (formerly Avi Load Balancer). NSX Advanced Load Balancer Essentials is a scale-out load balancer. The data path for users accessing the VIPs is through a set of Service Engines that automatically scale out as workloads increase.

vSphere with Tanzu - Private Registry Support

If you are using a container registry with self-signed or private CA-signed certs, this allows them to be used with TKG clusters.

vSphere with Tanzu - Advanced security for container-based workloads in vSphere with Tanzu on AMD

For customers interested in running containers with as much security in place as possible, Confidential Containers provides full and complete register and memory isolation and encryption from Pod to Pod and Hypervisor to Pod.

  • Builds on vSphere’s industry-leading, easy-to-enable support for AMD SEV-ES data protections on 2nd & 3rd generation AMD EPYC CPUs
  • Each Pod is uniquely encrypted to protect applications and data in use within CPU and memory
  • Enabled with standard Kubernetes YAML annotation

Artificial Intelligence & Machine Learning

vSphere and NVIDIA: the new Ampere family of NVIDIA GPUs is supported on vSphere 7U2. This is part of a bigger effort between the two companies to build a full-stack AI/ML offering for customers.

  • Support for new NVIDIA Ampere family of GPUs
    • In the new Ampere family of GPUs, the A100 GPU is the new high-end offering. Previously the high-end GPU was the V100 – the A100 is about double the performance of the V100. 
  • Multi-Instance GPU (MIG) improves physical isolation between VMs & workloads
    • You can think of MIG as spatial separation, as opposed to the older form of vGPU, which did time-slicing to separate one VM from another on the GPU. MIG is used through a familiar vGPU profile assigned to the VM. You enable MIG at the vSphere host level first, using one simple command "nvidia-smi mig enable -I 0". This requires SR-IOV to be switched on in the BIOS (via the iDRAC on a Dell server, for example).  
  • Performance enhancements with GPUdirect & Address Translation Service in the hypervisor

Operational Enhancements

VMware vSphere Lifecycle Manager - support for Tanzu & NSX-T

  • vSphere Lifecycle Manager now handles vSphere with Tanzu “supervisor” cluster lifecycle operations
  • Uses declarative model for host management

VMware vSphere Lifecycle Manager Desired Image Seeding

Extract an image from an existing host

ESXi Suspend-to-Memory

Suspend to Memory introduces a new option to help reduce the overall ESXi host upgrade time.

  • Depends on Quick Boot
  • New option to suspend the VM state to memory during upgrades
  • Options defined in the Host Remediation Settings
  • Adds flexibility and reduces upgrade time

Availability & Efficiency

vSphere HA support for Persistent Memory Workloads

  • Use vSphere HA to automatically restart workloads with PMEM
  • Admission Control ensures NVDIMM failover capacity
  • Can be enabled with VM Hardware 19

Note: By default, vSphere HA will not attempt to restart a virtual machine using NVDIMM on another host. Allowing HA on host failure to fail over the virtual machine will restart the virtual machine on another host with a new, empty NVDIMM.

VMware vMotion Auto Scale

vSphere 7 U2 automatically tunes vMotion to the available network bandwidth, resulting in faster live migrations, faster outage avoidance, and less time spent on maintenance.

  • Faster live migration on 25, 40, and 100 GbE networks means faster outage avoidance and less time spent on maintenance
  • One vMotion stream capable of processing 15 Gbps+
  • vMotion automatically scales the number of streams to the available bandwidth
  • No more manual tuning to get the most from your network

AMD optimizations

As customers' trust in AMD increases, so does the performance of ESXi on modern AMD processors.

  • Optimized scheduler ​for AMD EPYC architecture
  • Better load balancing and cache locality
  • Enormous performance gains

Reduced I/O Jitter for Latency-sensitive Workloads

Under the hood vSphere kernel improvements in vSphere 7U2 allow for significantly improved I/O latency for virtual Telco 5G Radio Access Networks (vRAN) deployments.

  • Eliminate Jitter for Telco 5G Deployments
  • Significantly Improve I/O Latency
  • Reduce NIC Passthrough Interrupts

Security & Compliance

ESXi Key Persistence

ESXi Key Persistence helps eliminate dependency loops and creates options for encryption without the traditional infrastructure. It’s the ability to use a Trusted Platform Module, or TPM, on a host to store secrets. A TPM is a secure enclave for a server, and we strongly recommend customers install them in all of their servers because they’re an inexpensive way to get a lot of advanced security.

  • Helps Eliminate Dependencies
  • Enabled via Hardware TPM
  • Encryption Without vCenter Server

VMware vSphere Native Key Provider 

vSphere Native Key Provider puts data-at-rest protections in reach for all customers.

  • Easily enable vSAN Encryption, VM Encryption, and vTPM
  • Key provider integrated in vCenter Server & clustered ESXi hosts
  • Works with ESXi Key Persistence to eliminate dependencies
  • Adds flexible and easy-to-use options for advanced data-at-rest security

vSphere has some pretty heavy-duty data-at-rest protections, like vSAN Encryption, VM encryption, and virtual TPMs for workloads. One of the gotchas there is that customers need a third-party key provider to enable those features, traditionally known as a key management service or KMS. There are inexpensive KMS options out there, but they add significant complexity to operations. In fact, complexity has been a real deterrent to using these features… until now!

Storage

iSCSI path limits
 
ESXi has had a disparity in path limits between iSCSI and Fibre Channel: 32 paths for FC and 8 (8!) paths for iSCSI. As of ESXi 7.0 U2, this limit is now 32 paths. For further details read this.

File Repository on a vVol Datastore

VMware added a new feature that supports creating a custom size config vVol - while this was technically possible in earlier releases, it was not supported. For further details read this.

VMware Tools and Guest OS

Virtual Trusted Platform Module (vTPM) support on Linux & Windows

  • Easily enable in-guest security requiring TPM support
  • vTPM available for modern versions of Microsoft Windows and select Linux distributions
  • Does not require physical TPM
  • Requires VM Encryption, easy with Native Key Provider!

VMware Tools Guest Content Distribution

Guest Store enables customers to distribute various types of content to the VMs, like an internal CDN system.

  • Distribute content “like an internal CDN”
  • Granular control over participation
  • Flexibility to choose content

VMware Time Provider Plugin for Precision Time on Windows

With the introduction of a new plugin, vmwTimeProvider, shipped with VMware Tools, guests can synchronize directly with hosts over a low-jitter channel.

  • VMware Tools plugin to synchronize guest clocks with Windows Time Service
  • Added via custom install option in VMware Tools
  • Precision Clock device available in VM Hardware 18+
  • Supported on Windows 10 and Windows Server 2016+
  • High quality alternative to traditional time sources like NTP or Active Directory

Conclusion

vSphere 7 Update 2 is a nice evolution of the vSphere platform. If you ask me what the most interesting feature in this release is, I would probably answer VMware vSphere Native Key Provider, because it has a positive impact on manageability and simplification of the overall architecture. The second one is VMware vMotion Auto Scale, which reduces operational time during ESXi maintenance operations in environments where 25+ Gb NICs are already adopted.

Wednesday, February 17, 2021

VMware Short URLs

VMware has a lot of products and technologies. Here are a few interesting URL shortcuts to quickly get resources for a particular product, technology, or other information.

VMware HCL and Interop

https://vmware.com/go/hcl - VMware Compatibility Guide

https://vmwa.re/vsanhclc or https://vmware.com/go/vsanvcg - VMware Compatibility Guide vSAN 

https://vmware.com/go/interop - VMware Product Interoperability Matrices

VMware Partners

https://www.vmware.com/go/partnerconnect - VMware Partner Connect

VMware Customers

https://www.vmware.com/go/myvmware - My VMware Overview
 
https://www.vmware.com/go/customerconnect - Customer Connect Overview

https://www.vmware.com/go/patch - Customer Connect, where you can download VMware bits

http://vmware.com/go/skyline - VMware Skyline

http://vmware.com/go/skyline/download - Download VMware Skyline

VMware vSphere

http://vmware.com/go/vsphere - VMware vSphere

VMware CLIs

http://vmware.com/go/dcli - VMware Data Center CLI

VMware Software-Defined Networking and Security

https://vmware.com/go/vcn - Virtual Cloud Network

https://vmware.com/go/nsx - VMware NSX Data Center

https://vmware.com/go/vmware_hcx - Download VMware HCX

VVD

https://vmware.com/go/vvd-diagrams - Diagrams for VMware Validated Design

https://vmware.com/go/vvd-stencils - VMware Stencils for Visio and OmniGraffle

http://vmware.com/go/vvd-community - VVD Community

http://www.vmware.com/go/vvd-sddc - Download VMware Validated Design for Software-Defined Data Center

VCF

https://vmware.com/go/vcfrc - VMware Cloud Foundation Resource Center

http://vmware.com/go/cloudfoundation - VMware Cloud Foundation

http://vmware.com/go/cloudfoundation-community - VMware Cloud Foundation Discussions

http://vmware.com/go/cloudfoundation-docs - VMware Cloud Foundation Documentation

Tanzu Kubernetes Grid (TKG)

http://vmware.com/go/get-tkg - Download VMware Tanzu Kubernetes Grid

Hope this helps at least one person in the VMware community.

Sunday, February 14, 2021

Top Ten Things VMware TAM should have on his mind and use on a daily basis

The readers may or may not know, that I work for VMware as a TAM. For those who do not know, TAM stands for Technical Account Manager. VMware TAM is the billable consulting role available for VMware customers who want to have an on-site dedicated technical advisor/consultant/advocate for long term cooperation. VMware TAM organization historically belonged under VMware PSO (Professional Services Organization), however, recently has been moved under Customer Success Organization, which makes perfect sense if you ask me, because customer success is the key goal of a TAM role.

How does TAM engagement work? It is pretty easy. VMware Technical Account Managers have 5 slots (days) per week, which can be consumed by one or many VMware customers. There are Tier 1, Tier 2, and Tier 3 offerings, where the Tier 1 TAM service includes one day per week for the customer, Tier 2 has 2.5 days per week, and a Tier 3 TAM is fully dedicated.

The TAM job role is very flexible and customizable based on specific customer demand. I like the figure below, describing TAM Service standard Deliverables and On-Demand Activities.


VMware TAM is delivering standard deliverables like
  • Kickoff Meeting and TAM Business Reviews to continuously align with customer expectations
  • Standard Analytics and Reporting, including the report of the customer estate in terms of VMware products and technologies (we call it CI.Next), and a Best Practices Review report highlighting best practices violations against VMware Health Check’s recommended practices.
  • Technical Advisory Service about VMware Product Releases, VMware Security Advisories, Specific TAM Customer Technical Webinars, Events, etc.
However, what is the most interesting part of the VMware TAM job role, at least for me, are the On-Demand Activities, including
  • Technical Enablements, DeepDives, Roadmaps, etc.
  • Planning and Conceptual Designing of Technical Solutions and Transformation Project
  • Problem Management and Design Troubleshootings
  • Product Feature Request management
  • Etc.

And this is the reason why I love my job: I like technical planning, designing, coordinating technical implementations, and validating and testing implementations before they are handed over to production. I also like to communicate with operations teams and, after a while, reevaluate the implemented design and take the operational feedback back to architecture and engineering for continuous solution improvement. 
That’s the reason why the TAM role is my dream job at one of the best and most impactful IT companies in the world.

During the last One on One meeting with my manager, I was asked to write down the top ten things a VMware TAM should have on his mind and use on a daily basis in 2021. To be honest, the rules I will list are not specific to the year 2021 but very general, applying to any other year, and also easily reusable for any other human activity.

After 25 years in the IT industry, 15 years in professional consulting, and 5 years as a VMware TAM, I immodestly believe the 10 things below are the most important things to be a valuable VMware TAM for my customers. These are just my best practices, and it is good to know there are no best practices written in stone, therefore your opinion may vary. Anyway, take it or leave it. Here we go.

#1 Top-Down approach

I use the Top-Down approach to be able to split any project or solution into Conceptual, Logical, and Physical layers. I use Abstraction and Generalization. While abstraction reduces complexity by hiding irrelevant detail, generalization reduces complexity by replacing multiple entities that perform similar functions with a single construct. Do not forget, modern IT system complexity can be insane. Check the video “Powers of Ten” to understand details about other systems' complexity and how it can be visible at various levels.

#2 Correct Expectations

I always set correct expectations. Discovering the customer’s requirements, constraints, and specific use cases before going into any details or specific solutions is the key to customer success.

#3 Communication

Open and honest communication is the key to any long-term successful relationship. As a TAM, I have to be a communicator who can break barriers between various customer silos and teams - VMware, compute, storage, network, security, application, developers, DevOps, you name it. They have to trust you; otherwise, you cannot succeed.

#4 Assumptions

I do not assume. Sometimes we need some assumptions to not get stuck and move forward; however, we should validate those assumptions as soon as possible, because false assumptions lead to risks. And one of our primary goals as TAMs is to mitigate risks for our customers. 

#5 Digital Transformation

I leverage online and digital platforms. Nothing compares to personal meetings and whiteboarding; however, tools like Zoom, Miro.com, and Monday.com increase efficiency and help with communication, especially in COVID-19 times. This is probably the only point related to the year 2021, as COVID-19 challenges will stay with us for some time.

#6 Agile Methodologies

I use an agile consulting approach. Leveraging tools like Miro.com, Monday.com, etc. gives me a toolbox to apply agile software methodologies to technical infrastructure architecture design. In the past, when I worked as a software developer, software engineer, and software architect, I was a follower of Extreme Programming. I apply the same or similar concepts and methods to infrastructure architecture design and consulting. This approach helps me to keep up with the speed of IT and high business expectations.

#7 Documentation

I document everything. The documentation is essential. If it’s not written down, it doesn’t exist! I like "Eleven Rules of Design Documentation" by Greg Ferro.

#8 Resource Mobilization

I leverage resources. Internal and External. As TAMs, we have access to a lot of internal resources (GSS, Engineering, Product Management, Technical Marketing, etc.) which we can leverage for our TAM customers. We can also leverage external resources like partners, other vendors from the broader VMware ecosystem, etc. However, we should use resources efficiently. Do not forget, all human resources are shared, thus limited. And time is the most valuable resource, at least for humans, therefore Time Management is important. Anyway, resource mobilization is the true value of the VMware TAM program, therefore we must know how to leverage these resources. 

#9 Customer Advocacy

As a TAM, I work for VMware but also for TAM customers. Therefore, I do customer advocacy within VMware and VMware advocacy within the Customer organization. This is again about the art of communication.

#10 Technical Expertise

Last but not least, I must have technical expertise and competency. I’m a Technical Account Manager, therefore I try to have deep technical expertise in at least one VMware area and broader technical proficiency in a few other areas. This approach is often called Full Stack Engineering. I’m very aware of the fact that expertise and competency are very tricky and subjective. It is worth understanding the Dunning-Kruger effect, which describes the correlation between competence and confidence. In other words, I want to have real competence and not only false confidence about the competence. If I do not feel confident in some area, I honestly admit it and try to find another resource (see rule #8). The best approach to gain and validate my competency and expertise is to continuously learn and validate it through VMware advanced certifications.

Hope this write-up will be useful for at least one person in the VMware TAM organization.

Thursday, February 04, 2021

Back to basics - MTU & IP defragmentation

This is just a short blog post as it can be useful for other full-stack (compute/storage/network) infrastructure engineers.

I have just had a call from my customer with the following problem symptom. 

Symptom:

When an ESXi host (in a ROBO site) is connected to vCenter (in the datacenter), the TCP/IP communication overloads the 60 Mbps network link. In such a scenario, IP packets are fragmented and heavy packet retransmission is observed.

Design drawing:

Hypothesis:

IP fragmentation is happening in the physical network and the MTU is lower than 1280 bytes.

Planned test:

Find the smallest MTU in the end-2-end network path between ESXi and vCenter

vmkping -s 1472 -d VCENTER-IP

Decrease the -s parameter value until the ping is successful. This is the way to find the smallest MTU in the IP network path. 

Back to basics

IP fragmentation is an Internet Protocol (IP) process that breaks packets into smaller pieces (fragments), so that the resulting pieces can pass through a link with a smaller maximum transmission unit (MTU) than the original packet size. The fragments are reassembled by the receiving host. [source]

The vmkping command has some parameters you should know and use in this case:

-s to set the payload size

Syntax: vmkping -s size IP-address

With the parameter -s you can define the size of the ICMP payload. If you have defined an MTU size of e.g. 1500 bytes and use this size in your vmkping command, you may get a “Message too long” error. This happens because ICMP needs 8 bytes for its ICMP header and 20 bytes for the IP header:

The size you need to use in your command will be:

1500 (MTU size) – 8 (ICMP header) – 20 (IP header) = 1472 bytes for ICMP payload
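
The same arithmetic as quick shell math, for any MTU value:

```shell
# Largest ICMP payload that fits a given MTU:
# MTU minus IP header (20 bytes) minus ICMP header (8 bytes).
mtu=1500
payload=$((mtu - 20 - 8))
echo "$payload"   # 1472
```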

-d to disable IP fragmentation

Syntax: vmkping -d IP-address

Use the command “vmkping -s 1472 IP-address” to test your end-2-end network path.

Decrease -s parameter until the ping is successful.

Monday, January 11, 2021

Server rack design and capacity planning

Our VMware local SE team has got a great Christmas present from regional Intel BU. Four rack servers with very nice technical specifications and the latest Intel Optane technology. 

Here is the server technical spec: 

Node configuration:

  • CPU: 2x Intel Platinum 8280L (28 cores, max memory 4.5 TB)
  • DDR4 Memory: 768 GB DDR4 DRAM RDIMM (12x 64 GB)
  • Intel Persistent Memory: 3 TB Intel Persistent Memory (12x 256 GB)
  • Caching Tier: 2x Intel Optane SSD DC P4800X Series (750 GB, 2.5in PCIe x4, 3D XPoint)
  • Capacity Tier: 4x Intel SSD DC P4510 Series (4.0 TB, 2.5in PCIe 3.1 x4, 3D2, TLC)
  • Networking (+ transceivers, cables): 1x Intel® Ethernet Network Adapter XXV710-DA2 (25G, 2 ports)


These servers are vSAN Ready and the local VMware team is planning to use them for demonstration purposes of VMware SDDC (vSphere, vSAN, NSX, vRealize), therefore VMware Cloud Foundation (VCF) is a very logical choice.

Anyway, even Software-Defined Data Center requires power and cooling, so I've been asked to help with server rack design with proper power capacity planning. To be honest, the server rack plan and design is not rocket science. It is just simple math & elementary physics, however, you have to know the power consumption of each computer component. I did some research and here is the math exercise with a power consumption of each component:

  • CPU - 2x CPU Intel Platinum 8280L (110 W Idle, 150 W Computational,  360 W Peak load)
    • Estimation: 2x150 W = 300 W
  • RAM - 12x 64 GB DDR4 DRAM RDIMM (768 GB)
    • Estimation: 12x 24 Watt = 288 W
  • Persistent RAM - 12x 256GB (3TB) Intel Persistent Memory
    • Estimation: 12x 15 Watt = 180 W
  • vSAN Caching Tier - 2x Intel Optane SSD DC P4800X 750GB
    • Estimation: 2x18W =>  36W
  • vSAN Capacity Tier - 4x Intel SSD DC P4510 Series 4TB
    • Estimation: 4x 16W => 64 W
  • NIC - 1x Intel® Ethernet Network Adapter XXV710-DA2 (25G, 2 ports)
    • Estimation: 15 W

If we sum the power consumption above, we will get 883 Watt per single server.  
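The sum can be reproduced in a few lines (the wattages are the rough per-component estimates from the list above, not measured values):

```python
# Estimated power draw per component (W), from the estimates above
power_w = {
    "2x CPU Intel Platinum 8280L (150 W each)": 300,
    "12x 64 GB DDR4 RDIMM (24 W each)": 288,
    "12x 256 GB Intel Persistent Memory (15 W each)": 180,
    "2x Intel Optane P4800X (18 W each)": 36,
    "4x Intel SSD P4510 (16 W each)": 64,
    "1x Intel XXV710-DA2 NIC": 15,
}
total = sum(power_w.values())
print(f"Estimated consumption: {total} W per server")  # 883 W
```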

To validate the estimation above, I used the DellEMC Enterprise Infrastructure Planning Tool available at http://dell-ui-eipt.azurewebsites.net/#/, where you can place infrastructure devices and get the Power and Heating calculations. You can see the IDLE and COMPUTATIONAL consumptions below.

Idle Power Consumption


Computational Power Consumption

POWER CONSUMPTION
Based on the above calculations, the server power consumption ranges between 300 and 900 W, so it is good to plan a 1 kW power budget per server. In our case that means 4 kW / 17.4 A (at 230 V) per single power branch, which translates to one 32 A PDU for just 4 servers.

For a full 45U rack with 21 servers, it would be 21 kW / 91.3 A, which means three 32 A branches per rack.
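The current figures follow from simple power math. A small sketch, assuming single-phase 230 V (the voltage implied by the 4 kW / 17.4 A figure above):

```python
VOLTAGE = 230  # assumption: single-phase 230 V, consistent with 4 kW ~ 17.4 A above

def branch_current(servers, watts_per_server=1000, volts=VOLTAGE):
    """Current (A) drawn by a group of servers, each budgeted at 1 kW."""
    return servers * watts_per_server / volts

print(round(branch_current(4), 1))   # 4 servers  -> 17.4 A (one 32 A PDU)
print(round(branch_current(21), 1))  # full rack  -> 91.3 A (three 32 A branches)
```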

HEATING AND COOLING
Heating and cooling are other considerations. Based on the Dell Infrastructure Planning Tool, the temperature in the environment will rise by 9 °C (idle load) or even 15 °C (computational load). This also requires appropriate cooling and electricity planning.
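Since virtually all electrical power drawn by the servers ends up as heat, a rough cooling estimate can be derived directly from the power budget using the standard conversions (1 W ≈ 3.412 BTU/h; 1 ton of cooling = 12,000 BTU/h):

```python
BTU_PER_WATT_HOUR = 3.412     # 1 W of IT load produces ~3.412 BTU/h of heat
BTU_PER_COOLING_TON = 12_000  # 1 ton of refrigeration = 12,000 BTU/h

def cooling_tons(it_load_watts):
    """Approximate cooling capacity (tons) needed for a given IT load."""
    return it_load_watts * BTU_PER_WATT_HOUR / BTU_PER_COOLING_TON

print(round(cooling_tons(21_000), 1))  # full 21 kW rack -> ~6.0 tons of cooling
```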

Conclusion

1 kW per server is a pretty decent consumption. When you design your cool SDDC, do not forget the basics: power and cooling.

Sunday, November 29, 2020

Virtual Machine Advanced Configuration Options

First and foremost, it is worth mentioning that it is definitely not recommended to change any advanced settings unless you know what you are doing and are fully aware of all potential impacts. VMware default settings are the best for general use, covering the majority of use cases. However, when you have specific requirements, you might need to tune the VM and change some advanced virtual machine configuration options. In this blog post, I document the advanced configuration options I have found useful in specific design decisions.

Time synchronization

  • time.synchronize.tools.startup
    • Description: one-time clock synchronization when VMware Tools starts (typically after guest OS boot)
    • Type: Boolean
    • Values:
      • true / 1 (default)
      • false / 0
  • time.synchronize.restore
    • Description: one-time clock synchronization after reverting the VM to a snapshot
    • Type: Boolean
    • Values:
      • true / 1 (default)
      • false / 0
  • time.synchronize.shrink
    • Description: one-time clock synchronization after a virtual disk shrink operation
    • Type: Boolean
    • Values:
      • true / 1 (default)
      • false / 0
  • time.synchronize.continue
    • Description: one-time clock synchronization after snapshot operations
    • Type: Boolean
    • Values:
      • true / 1 (default)
      • false / 0
  • time.synchronize.resume.disk
    • Description: one-time clock synchronization after resuming a suspended virtual machine
    • Type: Boolean
    • Values:
      • true / 1 (default)
      • false / 0

Relevant resources:

Ethernet

Isolation

With the isolation option, you can restrict file operations between the virtual machine and the host system, and between the virtual machine and other virtual machines.

VMware virtual machines can work both in a vSphere environment and on hosted virtualization platforms such as VMware Workstation and VMware Fusion. Certain virtual machine parameters do not need to be enabled when you run a virtual machine in a vSphere environment. Disable these parameters to reduce the potential for vulnerabilities.

The following advanced settings are Booleans (true/false) with the default value false. You can disable the respective functionality by changing the value to true.

  • isolation.tools.unity.push.update.disable
  • isolation.tools.ghi.launchmenu.change
  • isolation.tools.memSchedFakeSampleStats.disable
  • isolation.tools.getCreds.disable
  • isolation.tools.ghi.autologon.disable
  • isolation.bios.bbs.disable
  • isolation.tools.hgfsServerSet.disable
  • isolation.tools.vmxDnDVersionGet.disable
  • isolation.tools.diskShrink.disable
  • isolation.tools.guestDnDVersionSet.disable
  • isolation.tools.unityActive.disable
  • isolation.tools.diskWiper.disable
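These options are plain key/value entries in the virtual machine's .vmx file (or advanced settings set via the API/PowerCLI). A small helper that renders the hardening lines above is sketched below; how you apply them (editing the .vmx while the VM is powered off, or New-AdvancedSetting in PowerCLI) is up to your change process:

```python
# Guest-operation isolation settings from the list above (deduplicated)
ISOLATION_OPTIONS = [
    "isolation.tools.unity.push.update.disable",
    "isolation.tools.ghi.launchmenu.change",
    "isolation.tools.memSchedFakeSampleStats.disable",
    "isolation.tools.getCreds.disable",
    "isolation.tools.ghi.autologon.disable",
    "isolation.bios.bbs.disable",
    "isolation.tools.hgfsServerSet.disable",
    "isolation.tools.vmxDnDVersionGet.disable",
    "isolation.tools.diskShrink.disable",
    "isolation.tools.guestDnDVersionSet.disable",
    "isolation.tools.unityActive.disable",
    "isolation.tools.diskWiper.disable",
]

def vmx_hardening_lines(options, value="TRUE"):
    """Render .vmx key/value lines that disable the listed guest operations."""
    return [f'{opt} = "{value}"' for opt in options]

for line in vmx_hardening_lines(ISOLATION_OPTIONS):
    print(line)
```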

Snapshots

Remote Display

Tuesday, November 24, 2020

vSAN 7 Update 1 - What's new in Cloud Native Storage

vSAN 7 U1 comes with new features also in the Cloud Native Storage area, so let's look at what's new.

PersistentVolumeClaim expansion

Kubernetes v1.11 introduced volume expansion by editing the PersistentVolumeClaim object. Please note that volume shrink is not supported and the expansion must be done offline. Online expansion is not supported in U1 but is planned on the roadmap.

Static Provisioning in Supervisor Cluster

This feature allows exposing an existing storage volume within a K8s cluster integrated within vSphere Hypervisor Cluster (aka Supervisor Cluster, vSphere with K8s, Project Pacific).

vVols Support for vSphere K8s and TKG Service

Supporting external storage deployments on vK8s and TKG using vVols.

Data Protection for Modern Applications

vSphere 7.0 U1 comes with support for Dell PowerProtect and Velero backup for Pacific Supervisor and TKG clusters. With Velero, the only option is to initiate snapshots from the supervisor Velero plugin and store them on S3.


vSAN Direct

vSAN Direct is a feature introducing Directly Attached Storage (typically physical HDDs) for object storage solutions running on top of vSphere.


There will not be a shared vSAN datastore as with typical vSAN. Instead, vSAN Direct datastores allow connecting physical disks directly to virtual appliances or containers running on top of a vSphere/vSAN cluster, providing object storage services while bypassing the traditional vSAN datapath.

Hope you find it useful.

Monday, November 23, 2020

Why is HTTPS faster than HTTP?

Recently, I was planning, preparing, and executing a network performance test plan, including TCP, UDP, HTTP, and HTTPS throughput benchmarks. The intention of the test plan was a network throughput comparison between two particular NICs:

  • Intel X710
  • QLogic FastLinQ QL41xxx

There was a reason for such an exercise (reproduction of specific NIC driver behavior) and I will probably write another blog post about it, but today I would like to raise another topic. During the analysis of the test results, I observed very interesting HTTPS throughput results in comparison to HTTP throughput. These results were observed on both types of NICs; therefore, it should not be an effect of specific NIC hardware or drivers.

Here is the Test Lab Environment:

  • 2x ESXi hosts
    • Server Platform: HPE ProLiant DL560 Gen10
    • CPU: Intel Cascade Lake based Xeon
    • BIOS: U34 | Date (ISO-8601): 2020-04-08
    • NIC1: Intel X710, driver i40en version: 1.9.5, firmware 10.51.5
    • NIC2: QLogic QL41xxx, driver qedentv version: 3.11.16.0, firmware mfw 8.52.9.0 storm 
    • OS/Hypervisor: VMware ESXi 6.7.0 build-16075168 (6.7 U3)
  • 1x Physical Switch
10Gb switch ports  <<  network bottleneck on purpose, because the customer is using 10Gb switch ports as well

Below are the observed interesting HTTP and HTTPS results.

HTTP


HTTPS


OBSERVATION, EXPLANATION, AND CONCLUSION

We have observed

  • HTTP throughput between 5 and 6 Gbps
  • HTTPS throughput between 8 and 9 Gbps

which means 50% higher throughput of HTTPS over HTTP. Normally, we would expect HTTP transfers to be faster than HTTPS, as HTTPS requires encryption, which should result in some CPU overhead. The encryption overhead is debatable, but nobody would expect HTTPS to be significantly faster than HTTP, right? That's the reason I was asking myself:

why did HTTPS outperform HTTP in the HPE lab with the latest Intel CPUs?

Here is my process of troubleshooting the "issue", or rather, root cause analysis.

Conclusion

  • In my home lab, I have old Intel CPU models (Intel Xeon CPU E5-2620 0 @ 2.00GHz); that's the reason HTTP and HTTPS throughputs are identical.
  • In the HPE test lab, there are the latest Intel CPU models; therefore, HTTPS can be offloaded and client/server communication can leverage asynchronous advantages for web servers using Intel® QuickAssist Technology, introduced in the Intel Xeon E5-2600 v3 product family.
  • It is worth mentioning that it is not only about CPU hardware acceleration but also about software, which must be written in a form that can leverage hardware acceleration for a positive impact on performance. This is the case with OpenSSL 1.1.0 and NGINX 1.10, which boost HTTPS server efficiency.

Lesson learned

When you are virtualizing network functions, it is worth considering the latest CPUs, as they can have a significant impact on overall system performance and throughput. It does not matter whether such network function virtualization is done by VMware NSX or another virtualization or containerization platform.

Investigation continues

To be honest, I do not know if I really fully understand the root cause of such behavior. I still wonder why HTTPS is 50% faster than HTTP, and if CPU offloading is the only factor for such performance gain.

I'll try to run the test plan on other hardware platforms, compare results, and do some further research to understand it more deeply. Unfortunately, I do not have direct access to the latest x86 servers of other vendors, so it can take a while. If you have access to some modern x86 hardware and want to run my test plan yourself, you can download the test plan document from here. If you invest some time into the testing, please share your results in the comments below this article or simply send me an e-mail.

Hope this blog post is informative, and as always, any comment or idea is very welcome. 

Saturday, November 21, 2020

Understanding vSAN Architecture Components for better troubleshooting

VMware vSAN becomes more and more popular and is thus more often used as primary storage in data centers and server rooms. Sometimes, as with any IT technology, it is necessary to do troubleshooting. Understanding the architecture and component interactions is essential for effective troubleshooting of vSAN. Over the years, I have collected some vSAN architectural information into a slide deck I made available at https://www.slideshare.net/davidpasek/vsan-architecture-components

In the slide deck are the slides with the following sections ...

vSAN Terminology

  • CMMDS - Cluster Monitoring, Membership, and Directory Service
  • CLOMD - Cluster Level Object Manager Daemon
  • OSFSD - Object Storage File System Daemon
  • CLOM - Cluster Level Object Manager
  • OSFS - Object Storage File System
  • RDT - Reliable Datagram Transport
  • VSANVP - Virtual SAN Vendor Provider
  • SPBM - Storage Policy-Based Management
  • UUID - Universally unique identifier
  • SSD - Solid-State Drive
  • MD - Magnetic disk
  • VSA - Virtual Storage Appliance
  • RVC - Ruby vSphere Console

Architecture components
  • CMMDS
    • Cluster Monitoring, Membership, and Directory Service
  • CLOM
    • Cluster Level Object Manager
  • DOM
    • Distributed Object Manager
    • Each object in a vSAN cluster has a DOM owner and a DOM client
  • LSOM
    • Local Log Structured Object Manager
    • LSOM works with local disks
  • RDT
    • Reliable Datagram Transport
Components interaction



Architecture & I/O Flow




Troubleshooting tools
  • RVC
    • vsan.observer
    • vsan.disks_info
    • vsan.disks_stats
    • vsan.disk_object_info
    • vsan.cmmds_find
  • ESXCLI
    • esxcli vsan debug disk list
  • Objects tools
    • /usr/lib/vmware/osfs/bin/objtool
How to use vSAN Observer

  • SSH to a machine where you have RVC; it can be, for example, VCSA or HCIBench
    • ssh root@[IP-ADDRESS-OF-VCSA]
  • Run the RVC command-line interface and connect to your vCenter where you have a vSphere cluster with the vSAN service enabled. RVC requires the password of the administrator of your vSphere domain.
    • rvc administrator@[IP-ADDRESS-OF-VCSA]
  • Start vSAN Observer on your vSphere cluster with vSAN service enabled
    • vsan.observer -r /localhost/[vDatacenter]/computers/[vSphere & vSAN Cluster]
  • Go to vSAN Observer web interface
    • vSAN Observer is available at https://[IP-ADDRESS-OF-VCSA]:8010
The slide deck includes a little more info, so download it from https://www.slideshare.net/davidpasek/vsan-architecture-components

If you have to troubleshoot vSAN, I highly recommend following the process documented in "Troubleshooting vSAN Performance".

Hope it helps the broader VMware community.

If you know some other detail or troubleshooting tool, please leave a comment below this post.

Thursday, November 05, 2020

NSX-T Edge Node performance profiles

It is good to know that the NSX-T Edge Node has multiple performance profiles. These profiles change the number of vCPUs dedicated to DPDK and thus leave more or fewer vCPUs for other services such as the load balancer (LB):

  • default (best for L2/L3 traffic)
  • LB TCP (best for L4 traffic)
  • LB HTTP (best for HTTP traffic)
  • LB HTTPS (best for HTTPS traffic)

Now you can ask how to choose Load Balancer Performance profile. SSH to the edge node and use CLI.

 nsx-edgebm3> set load-balancer perf-profile
   http    Performance profile type argument
   https   Performance profile type argument
   l4      Performance profile type argument

Note: You may be prompted to restart the dataplane or reboot the Edge Node if the profile change alters the number of cores used by the LB.

To go back to the default profile:

 nsx-edgebm3> clear load-balancer perf-profile

Changing from L4 to HTTP helped me achieve ~3x higher HTTP throughput through the L7 NSX-T load balancer. Hope this helps someone else as well.

Tuesday, September 22, 2020

vSAN - vLCM Capable ReadyNode

VMware vSphere Lifecycle Manager (aka vLCM) is one of the very interesting features in vSphere 7.  vLCM is a powerful new approach to simplified consistent lifecycle management for the hypervisor and the full stack of drivers and firmware for the servers powering your data center.

There are only a few server vendors who have implemented firmware management with vLCM.

At the moment of writing this article, these vendors are:

  • Dell and HPE for vSphere 7.0
  • Dell, HPE, Lenovo for vSphere 7.0 Update 1

Recently I have got the following question from one of my customers.

"Where I can find official information about certified vLCM server vendors?"

It is a very good question. I would expect such information in the VMware Compatibility Guides (VCG); however, there is no such information in the "Systems / Servers" VCG, but you can find it in the "vSAN" VCG.



The vSAN VCG contains "vSAN ReadyNodes Additional Features", where one feature is "vLCM Capable ReadyNode". There you can find server vendors that have successfully implemented firmware management integration with vLCM, but it is available only for vSAN Ready Nodes. I can imagine that in the future, vLCM capability may or may not become available for standard servers as well, not only for vSAN Ready Nodes.

Friday, September 11, 2020

Datacenter Network Topology - Dell OS10 MultiDomain VLT

Yesterday, I got the following e-mail from one of my blog readers ...

Hello David,

Let me introduce myself, I work in medium size company and we began to sell Dell Networking stuff to go along with VxRail. We do small deployments, not the big stuff with spine/leaf L3 BGP, you name it. For a Customer, I had to implement this solution. Sadly, we are having a bad time with STP as you can see on the design.

 

Customer design with STP challenge

Is there a way to be loop-free ? I thought about Multi Domain VLT LAG but it looks like it is not supported in OS10. 

I wonder how you would do this. Is SmartFabric the answer ?
Thank you

Well, first of all, thanks for the question. If you ask me, it all boils down to specific design factors - use cases, requirements, constraints, assumptions.

So let's write down design factors

Requirements:

  • Multi-site deployment
  • A small deployment with a single VLT domain per site.
  • Robust L2 networking for VxRail clusters

Constraints:

  • Dell Networking hardware with OS10
  • Networking for VMware vSphere/vSAN (VxRail)

Assumptions:
  • No more than a single VLT domain per site is required
  • No vSphere/vSAN (VxRail) Clusters are Stretched across sites
Any unfulfilled assumption is a potential risk. If an assumption proves false, the design should be reviewed and potentially reworked to fulfill the design factors.

Now, let's think about network topology options we have. 

The reader asked if DellEMC SmartFabric can help him. Well, SmartFabric can be an option, as it is a Leaf-Spine fabric fully managed by an external SmartFabric Orchestrator, something like Cisco ACI / APIC. SmartFabric uses EVPN, BGP, VXLAN, etc. for multi-rack deployments. I do not know the latest details, but AFAIK it was not multi-site ready a few months ago. The latest SmartFabric features should be validated with DellEMC. Anyway, SmartFabric can do L2 over L3 if you need to stretch L2 across racks. Eventually, it should be possible to stretch L2 even across sites.

However, because our design targets a small deployment, I think Leaf-Spine is overkill here, and I always prefer the KISS (Keep It Simple, Stupid) approach.

So, here are two final options of network topology I would consider and compare.

OPTION 1: Stretched L2 Loop-Free across sites 
OPTION 2: L3 across sites with L2/L3 boundary in TOR access switches 


 Option 1 Stretched L2 Loop-Free across sites


Option 2 - L3 across sites with L2/L3 boundary in TOR access switches 

So let's compare these two options. 

Option 1 - Stretched L2 Loop-Free across sites 

Benefits

  • Simplicity
  • Stretched L2 across sites allows workload (device, VM, container, etc.) migrations across sites without L2 over L3 network overlay (NSX, SmartFabric, etc.) and re-IP.

Drawbacks

  • Topology is not scalable to more TOR access switches (VLT domains), but this is acceptable given the design factors
  • Topology optimally requires 8 links across sites. Optionally, can be reduced to 4 links.
  • Only two routers. One per site.
  • Stretched L2 topology across sites also extends L2 network fault-domain across sites, therefore broadcast storms, unknown unicast flooding, and potential STP challenges are the potential risks.
  • This topology has L3 trombone by design - https://blog.ipspace.net/2011/02/traffic-trombone-what-it-is-and-how-you.html. This drawback can be accepted or mitigated by NSX distributed routing.

OPTION 2 - L3 across sites with L2/L3 boundary in TOR access switches 

Benefits

  • Better scalability, because other VLT domains (TOR access switches) can be connected to core routers. However, this benefit is not required by the design factors above. 
  • Topology optimally requires 4 links across sites. Optionally, can be reduced to 2 links. This is less than Option 1 requires.
  • Each site is local fault-domain from L2 networking point of view, as L2 fault-domain is not stretched across sites. L2 faults (STP, broadcast storms, unknown unicast flooding, etc.) are isolated within the site. 

Drawbacks

  • More complex routing configuration with ECMP and dynamic routing protocol like iBGP or OSPF
  • Four routers. Two per site.
  • L3 topology across sites restricts workload (device, VM, container, etc.) migrations across sites without L2 over L3 network overlay (NSX, SmartFabric, etc.) or changing the IP address of migrated workload.

Conclusion and Design Decision

Both considered design options are L2 loop-free topologies, and I hope they fit all design factors defined above. If you do not agree, please write a comment, because anybody can make an error in a design or fail to foresee all situations until the architecture design is implemented and validated.

If I should make a final design decision, it would depend on two other factors
  • Do I have VMware NSX in my toolbox or not?
  • What is the skillset level of network operators (Dynamic Routing, ECMP, VRRP) responsible for the operation?
If I did not have NSX and the network operators preferred routing high availability (VRRP) over dynamic routing with ECMP (high availability + scalability + performance), I would decide to implement Option 1.

In the case of NSX and willingness to use dynamic routing with ECMP, I would decide to implement Option 2.

The reader mentioned in his question that his company does not use spine/leaf L3 BGP; therefore, Option 1 is probably a better fit for him.

Disclaimer: I had no chance to test and validate any of the design options considered above; therefore, if you have any real experience, please speak out loudly in the comments.