Last week one of my customers experienced high latency on a vSphere datastore backed by an NFS mount. Generally, the usual root cause of high latency is too few disk spindles behind the particular datastore, but that was not the case here.
NFS datastore for vSphere
Although NFS was always understood as a lower storage tier, VMware and NFS vendors have been working very hard on NFS improvements in recent years. Another plus for NFS nowadays is that 10Gb Ethernet is already a commodity, which helps NFS significantly because it doesn't support multi-pathing (aka MPIO) as FC or iSCSI do. On the other hand, it is obvious that NFS is another abstract storage layer for vSphere, and other details like the NFS client implementation, Ethernet/IP queue management, QoS, and so on can impact the whole solution. Therefore, when someone tells me NFS for vSphere, I'm always cautious. Don't get me wrong, I really like abstractions, layering, unification and simplification, but it must not have any influence on stability and performance.
I don't want to discuss the advantages and disadvantages of each particular protocol, as it depends on the particular environment requirements and what someone wants to achieve. By the way, I have recently prepared one particular design decision protocol comparison for another customer here, so you can check it out and comment on it there.
In this case the customer had a really good reason to use NFS, but the latency issue was a potential show stopper.
I have to say that I also had a bad NFS experience back in 2010 when I was designing and implementing Vblock0 for one customer. Vblock0 used EMC Celerra, therefore NFS or iSCSI were the only options. NFS was the better choice because of the Celerra iSCSI implementation (that's another topic). We were not able to decrease disk response times below 30 ms, so in the end NFS (EMC Celerra) was used as Tier 3 storage and the customer bought another block storage array (EMC CLARiiON) for Tier 1. That is history now, because I was implementing the then-new vSphere 4.1 and SIOC had just been introduced, without broad knowledge about its benefits, especially for NFS.
A lot of things have changed with NFS since then, so take that just as history and field experience from one engagement. Let's go back to today's high latency problem on NFS and the troubleshooting steps we did with this customer.
TROUBLESHOOTING
Environment overview
The customer has vSphere 5.0 Update 2 (Enterprise Plus) patched to the latest version (ESXi build 1254542).
The NFS storage is a NetApp FAS with the latest ONTAP version (NetApp Release 8.2P2 7-Mode).
Compute is based on Cisco UCS and networking on top of UCS is based on Nexus 5500.
Step 1/ Check SIOC or MaxQueueDepth
I told the customer about the known NFS latency issue documented in KB article 2016122 and broadly discussed in Cormac Hogan's blog post here. Based on community and my own experience, I have a hypothesis that the problem is not related only to NetApp storage but is most probably an ESXi NFS client issue. This is just my opinion without any proof.
Enabled SIOC or /NFS/MaxQueueDepth set to 64 are the workarounds documented in the KB article mentioned earlier. Therefore I asked them if SIOC is enabled, as we discussed during the Plan & Design phase. The answer was yes, it is.
Hmm. Strange.
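Just for reference, the static queue depth part of the workaround can also be checked from the ESXi shell. This is only my own sketch using the esxcli namespaces from ESXi 5.x, so verify the option path in your build:

esxcli system settings advanced list -o /NFS/MaxQueueDepth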
Step 2/ NetApp firmware
Yes, this customer has a NetApp filer, and there is an update comment in the KB article that the latest NetApp firmware solves this issue. The customer has the latest 8.2 firmware, which should fix the issue, but it evidently doesn't help.
Hmm. Strange.
Step 3/ Open support case with NetApp and VMware
I suggested opening support cases and continuing with troubleshooting in parallel.
I don't know why, but customers in the Czech Republic are ashamed to use the support line, even though they are paying a significant amount of money for it. But it is like it is, and even this customer didn't engage VMware or NetApp support and continued with troubleshooting on their own. OK, I understand we can solve everything by ourselves, but why not ask for help? That's more of a social than a technical question, and I would like to know if this administrator behavior is a global habit or something special here in Central Europe. Don't be shy and speak out in the comments even about this more social subject.
Step 4/ Go deeper into SIOC troubleshooting
Check if storageRM (Storage Resource Management) is running
/etc/init.d/storageRM status
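If the daemon turns out not to be running, it should be possible to start it again with the same init script (the usual start/stop/restart arguments):

/etc/init.d/storageRM start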
Enable advanced logging in Software Advanced Settings -> Misc -> Misc.SIOControlLogLevel = 7
The default value is 0 and 7 is the maximum.
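The same setting should also be reachable from the ESXi shell. Just a hedged sketch, because the exact option path and capitalization may differ between builds, so list it first:

esxcli system settings advanced list | grep -i siocontrol
esxcli system settings advanced set -o /Misc/SIOControlLogLevel -i 7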
The customer found a strange log message in "/var/log/storagerm.log":
Open /vmfs/volumes/ /.iorm.sf/slotsfile (0x10000042, 0x0) failed: permission denied
There is no VMware KB for it, but Frank Denneman blogged about it here.
So the customer is experiencing the same issue as Frank did in his lab.
The solution is to change the *nix file permissions as Frank was instructed by VMware Engineering (that's the beauty of having direct access to engineering) ...
chmod 755 /vmfs/volumes/DATASTORE/.iorm.sf/slotsfile
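To double check, you can list the file permissions before and after the change (DATASTORE stands for the real datastore name, as above):

ls -l /vmfs/volumes/DATASTORE/.iorm.sf/slotsfile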
Changes take effect immediately and you can check it in "/var/log/storagerm.log":
...
DATASTORE: read 2406 KB in 249 ops, wrote 865 KB in 244 ops avgReadLatency
1.85, avgWriteLatency 1.42, avgLatency 1.64 iops = 116.59, throughput =
773.65 KBps
...
Advanced logging can be disabled in Software Advanced Settings -> Misc -> Misc.SIOControlLogLevel = 0.
After this fix the normalized latency is between 5 and 7 ms, which is quite normal.
Incident solved ... waiting for other incidents :-)
Problem management continues ...
Lessons learned from this case
SIOC is an excellent VMware technology helping with datastore-wide performance fairness. In this example its dynamic queue management helped us significantly with NFS response times.
However, even excellent technology can have bugs ...
SIOC can be leveraged only by customers with Enterprise Plus licenses.
Customers with lower licenses have to use a static queue value (/NFS/MaxQueueDepth) of 64 or even less, based on response times. BTW, the default NFS max queue depth value is 4294967295. I understand NFS.MaxQueueDepth as the equivalent of Disk.SchedNumReqOutstanding for block devices. The default value of Disk.SchedNumReqOutstanding is 32, which helps with sharing LUN queues that usually have a queue depth of 256. That is OK for usual situations, but if you have more disk-intensive VMs per LUN then this parameter can be tuned. This is where SIOC helps us with dynamic queue management, even across ESXi hosts sharing the same device (LUN, datastore).
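For completeness, here is a hedged sketch of how both static values could be set from the ESXi shell on vSphere 5.0/5.1. Double check the exact option paths and any reboot or remount requirement against KB 2016122 for your build, and note that in later releases Disk.SchedNumReqOutstanding became a per-device setting. The value 64 for the block device parameter is purely illustrative of tuning it upward:

esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64
esxcli system settings advanced set -o /Disk/SchedNumReqOutstanding -i 64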
For a deep-dive explanation of Disk.SchedNumReqOutstanding I suggest reading Jason Boche's blog post here.
Static queue management brings significant operational overhead and maybe other issues we don't know about right now. So go with SIOC if you can; if you have an enterprise environment, consider upgrading to Enterprise Plus. If you still have response time issues, troubleshoot SIOC to verify it really does what it is supposed to do.
Anyway, it would be nice if VMware could improve the NFS client behavior. SIOC is just one of the two workarounds we can use to mitigate the risk of high latency on NFS datastores.
Unfortunately, the customer didn't engage the VMware Global Support Organization, therefore nobody in VMware knows about this issue and cannot write a new KB article or update the existing one. I'll try to make some social network noise to help highlight the problem.