It is well known that the storage industry is going through a big transformation. SSDs based on Flash are changing the old storage paradigm and supporting the fast computing required nowadays by modern applications in digital transformation projects.
So Flash is great, but it is also about the bus and the protocol over which the Flash is connected.
We have the traditional storage protocols SCSI, SATA, and SAS, but these interface protocols were invented for magnetic disks, which is why Flash behind these legacy interfaces cannot leverage its full potential. That is why we have NVMe (a new storage interface protocol over PCIe) and even 3D XPoint memory (Intel Optane).
It is all about latency and available bandwidth. Total throughput depends on I/O size and on the achievable transactions per second (IOPS). The IOPS figures below can be achieved on the particular storage media by a single worker with a random access, 100% read, 4 KB I/O size workload; multiple workers can achieve higher throughput, but at higher latency. A short calculation sketch follows the list below.
Latency orders of magnitude:
- ms - milliseconds - 0.001 of a second = 10^-3
- μs - microseconds - 0.000001 of a second = 10^-6
- ns - nanoseconds - 0.000000001 of a second = 10^-9
SATA - magnetic disk 7.2k RPM ~= 80 I/O per second (IOPS) = 1,000 ms / 80 = 12.5 ms
SAS - magnetic disk 15k RPM ~= 200 I/O per second (IOPS) = 1,000ms / 200 = 5 ms
SAS - Solid State Disk (SSD) Mixed use SFF ~= 4,000 I/O per second (IOPS) = 1,000ms / 4,000 = 0.25 ms = 250 μs.
NVMe over RoCE - Solid State Disk (SSD) ~= TBT I/O per second (IOPS) = 1,000ms / ??? = 0.100 ms = 100 μs
NVMe - Solid State Disk (SSD) ~= TBT I/O per second (IOPS) = 1,000ms / ??? = 0.080 ms = 80 μs
DIMM - 3D XPoint memory (Intel Optane) ~= latency of less than 500 ns (0.5 μs)
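As a sanity check of the arithmetic above, here is a minimal Python sketch (an illustration, not a benchmark) assuming a single worker with one outstanding I/O (queue depth 1), where per-I/O latency ≈ 1,000 ms / IOPS:

# Sketch: the single-outstanding-I/O relationship between IOPS and latency.
# Assumption: one worker, one I/O in flight at a time (queue depth 1),
# so per-I/O latency (ms) ~= 1,000 ms / IOPS and vice versa.

def latency_ms_from_iops(iops: float) -> float:
    """Per-I/O latency in milliseconds for a queue-depth-1 worker."""
    return 1_000.0 / iops

def iops_from_latency_ms(latency_ms: float) -> float:
    """Achievable IOPS for a queue-depth-1 worker with the given latency."""
    return 1_000.0 / latency_ms

print(latency_ms_from_iops(80))      # SATA 7.2k RPM disk -> 12.5 ms
print(latency_ms_from_iops(200))     # SAS 15k RPM disk   -> 5.0 ms
print(latency_ms_from_iops(4_000))   # SAS SSD            -> 0.25 ms
print(iops_from_latency_ms(0.100))   # 100 us NVMe/RoCE   -> 10,000 IOPS (theoretical QD1)
print(iops_from_latency_ms(0.080))   # 80 us NVMe         -> 12,500 IOPS (theoretical QD1)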
Ethernet Fabric Latencies
Gigabit Ethernet - 125 MB/s ~= 25 ~ 65 μs
10G Ethernet - 1.25 GB/s ~= 4 μs (sockets application) / 1.3 μs (RDMA application)
40G Ethernet - 5 GB/s ~= 4 μs (sockets application) / 1.3 μs (RDMA application)
InfiniBand and Omni-Path Fabrics Latencies
10Gb/s SDR - 1 GB/s ~= 2.6 μs (Mellanox InfiniHost III)
20Gb/s DDR - 2 GB/s ~= 2.6 μs (Mellanox InfiniHost III)
40Gb/s QDR - 4 GB/s ~= 1.07 μs (Mellanox ConnectX-3)
40Gb/s FDR-10 - 5.16 GB/s ~= 1.07 μs (Mellanox ConnectX-3)
56Gb/s FDR - 6.82 GB/s ~= 1.07 μs (Mellanox ConnectX-3)
100Gb/s EDR - 12.08 GB/s ~= 1.01 μs (Mellanox ConnectX-4)
100Gb/s Omni-Path - 12.36 GB/s ~= 1.04 μs (Intel 100G Omni-Path)
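To put bandwidth and latency together, below is a minimal sketch estimating the time to move a single 4 KB I/O over the fabrics listed above, modeled simply as base latency plus serialization time (payload / usable bandwidth). Real protocol overhead, switch hops, and congestion are ignored, so treat the output as a rough order-of-magnitude illustration only.

# Sketch: rough transfer time of a 4 KB payload over different fabrics,
# modeled as base latency + serialization time (payload / usable bandwidth).
# The figures below are the ballpark numbers quoted in this post.

PAYLOAD_BYTES = 4 * 1024  # 4 KB I/O

# (name, usable bandwidth in bytes/s, base latency in seconds)
FABRICS = [
    ("Gigabit Ethernet (sockets)", 125e6,   45e-6),   # middle of the 25-65 us range
    ("10G Ethernet (RDMA)",        1.25e9,  1.3e-6),
    ("40G Ethernet (RDMA)",        5e9,     1.3e-6),
    ("56Gb/s FDR InfiniBand",      6.82e9,  1.07e-6),
    ("100Gb/s EDR InfiniBand",     12.08e9, 1.01e-6),
    ("100Gb/s Omni-Path",          12.36e9, 1.04e-6),
]

for name, bandwidth, latency in FABRICS:
    serialization = PAYLOAD_BYTES / bandwidth
    total_us = (latency + serialization) * 1e6
    print(f"{name:28s} ~ {total_us:6.2f} us for a 4 KB transfer")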
RAM Latency
DIMM - DDR4 SDRAM ~= 75 ns (local NUMA access) - 120 ns (remote NUMA access)
Visualization
It is good to realize what latencies we should expect from different infrastructure subsystems (a small plotting sketch follows the list below):
- RAM ~= 100 ns
- 3D XPoint memory ~= 500 ns
- Modern fabrics ~= 1-4 μs
- NVMe ~= 80 μs
- NVMe over RoCE ~= 100 μs
- SAS SSD ~= 250 μs
- SAS magnetic disks ~= 5-12 ms
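For readers who want to plot these orders of magnitude themselves, here is a minimal matplotlib sketch. The values are the ballpark figures from this post (where the post quotes a range, one representative point is taken), and the chart is illustrative only.

# Sketch: visualize the latency orders of magnitude on a log scale.
# Values are the ballpark figures from this post, in nanoseconds.
import matplotlib.pyplot as plt

latencies_ns = {
    "RAM (DDR4, local NUMA)": 100,
    "3D XPoint (Optane DIMM)": 500,
    "Modern fabric (RDMA)": 1_500,
    "NVMe SSD": 80_000,
    "NVMe over RoCE": 100_000,
    "SAS SSD": 250_000,
    "SAS 15k RPM disk": 5_000_000,
    "SATA 7.2k RPM disk": 12_500_000,
}

names = list(latencies_ns.keys())
values = list(latencies_ns.values())

plt.figure(figsize=(8, 4))
plt.barh(names, values)
plt.xscale("log")  # orders of magnitude need a logarithmic axis
plt.xlabel("Latency (ns, log scale)")
plt.title("Expected latency per infrastructure subsystem")
plt.tight_layout()
plt.show()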
Conclusion
The latency order of magnitude is important for several reasons. Let's focus on one of them - latency monitoring. It has always been a challenge to monitor traditional storage systems, because a 5-minute or even 1-minute sampling interval is simply too long for millisecond (ms) latencies, and the average does not tell you anything about microbursts. In lower-latency (μs or even ns) systems, a 5-minute interval is like an eternity. The average, minimum, and maximum over a 5-minute interval might not help you understand what is really happening there. Much deeper mathematical statistics are needed to get real, valuable visibility into telemetry data. Percentiles are good, but histograms can help even more (see the sketch below) ...
More on histograms:
- Wavefront Histograms - https://youtu.be/syIKQ2oZk9s
- How the Metric Histogram Type Works: True Visibility into High-Velocity Application Metrics - https://www.wavefront.com/metric-histogram-type-works-true-visibility-high-velocity-application-metrics-part-2-2-2-2/
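To illustrate the point about averages and microbursts, here is a minimal sketch with simulated latency samples. The traffic mix (2% slow microbursts), the bucket boundaries, and all the numbers are made up purely for illustration.

# Sketch: why a 5-minute average hides latency microbursts.
# Simulate 300,000 I/O latency samples (~1,000 IOPS over 5 minutes):
# most complete in ~100 us, but 2% hit a microburst and take ~10 ms.
import random
import statistics

random.seed(42)

samples_us = [
    random.gauss(10_000, 2_000) if random.random() < 0.02   # microburst: ~10 ms
    else random.gauss(100, 20)                               # normal: ~100 us
    for _ in range(300_000)
]

samples_us.sort()
avg = statistics.fmean(samples_us)
p50 = samples_us[int(0.50 * len(samples_us))]
p99 = samples_us[int(0.99 * len(samples_us))]
p999 = samples_us[int(0.999 * len(samples_us))]

print(f"average: {avg:8.1f} us")    # ~300 us: describes neither population
print(f"p50:     {p50:8.1f} us")    # ~100 us: the typical I/O
print(f"p99:     {p99:8.1f} us")    # ~10 ms: the microbursts show up here
print(f"p99.9:   {p999:8.1f} us")

# Simple fixed-bucket histogram (bucket upper bounds in us)
buckets = [100, 250, 500, 1_000, 10_000, 100_000]
counts = {b: 0 for b in buckets}
for s in samples_us:
    for b in buckets:
        if s <= b:
            counts[b] += 1
            break

for b in buckets:
    print(f"<= {b:>7} us: {counts[b]}")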
Hope this is informative and educational.
Other sources:
Performance Characteristics of Common Network Fabrics: https://www.microway.com/knowledge-center-articles/performance-characteristics-of-common-network-fabrics/
Real-time Network Visibility: http://www.mellanox.com/related-docs/whitepapers/WP_Real-time_Network_Visibility.pdf
Johan van Amersfoort and Frank Denneman present a NUMA deep dive: https://youtu.be/VnfFk1W1MqE
Cormac Hogan: Getting Started with vscsiStats: https://cormachogan.com/2013/07/10/getting-started-with-vscsistats/
William Lam: Retrieving vscsiStats Using the vSphere 5.1 API