ORIGINALLY POSTED 4th September 2018
5,688 views on developerworks
There is a lot of talk – dare I say hype – around NVMe, and I’ve found that a lot of people don’t understand what it means, not only in terms of the technology, but also why it is important from a business, getting-things-done perspective. We recently ran a Meet-up in Auckland for local storage professionals, partners and IBMers to explain more about NVMe: just what it is, how it is being used and why it matters.
‘SCSI – So Long and Thanks for All the Fish’
A not-so-subtle reference to the 4th book in Douglas Adams’ ‘trilogy’ – The Hitchhiker’s Guide to the Galaxy – it sums up what NVMe really is: it’s a protocol, just like SCSI, for performing reads and writes, plus all the associated discovery, error recovery etc. that we need in order to use a storage device. Non-Volatile Memory Express, as the name suggests, was designed to add storage to the memory hierarchy while maintaining our ‘block’-based addressing of devices. LUNs as we know them are now ‘namespaces’ – but I suspect the LUN or volume concept will live on, as it’s one we all know and understand.
However, the term is also being abused as an attachment type, i.e. ‘NVMe interface’ drives and ‘NVMe storage’ devices. That makes some sense, as we run NVMe primarily over a PCIe link – so inside a storage controller, or a server where it started, we have direct PCIe from the CPU complex out to the Flash, or to upcoming Storage Class Memory (SCM) technologies such as 3D XPoint and Z-SSD. Critically, we can perform Remote Direct Memory Access (RDMA) to the devices, with similarities to the Verbs protocol for InfiniBand, thus giving ultra-low latency as well as massive parallelism: up to 64K queues per device, and 64K entries per queue – spec’d anyway!
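To put those spec numbers in perspective, here is a quick back-of-the-envelope sketch in Python. It takes the 64K figures above at face value (64K = 65,536) and multiplies them out. The single-queue depth of 32 is purely an illustrative assumption for comparison, not a figure from any standard or product, and real devices expose far fewer queues than the spec ceiling.

```python
# Spec-level ceilings as quoted in the post (64K = 65,536).
# Real devices typically expose far fewer queues than this.
NVME_MAX_QUEUES = 64 * 1024
NVME_MAX_ENTRIES_PER_QUEUE = 64 * 1024

# ASSUMPTION: an illustrative single queue with a depth of 32,
# chosen only for comparison; not taken from any spec or product.
ILLUSTRATIVE_SINGLE_QUEUE_DEPTH = 32


def max_outstanding(queues: int, entries_per_queue: int) -> int:
    """Upper bound on commands in flight if every queue were completely full."""
    return queues * entries_per_queue


nvme_ceiling = max_outstanding(NVME_MAX_QUEUES, NVME_MAX_ENTRIES_PER_QUEUE)
single_queue = max_outstanding(1, ILLUSTRATIVE_SINGLE_QUEUE_DEPTH)

print(f"NVMe spec ceiling:         {nvme_ceiling:,} commands in flight")
print(f"Illustrative single queue: {single_queue:,} commands in flight")
```

Nobody will run anywhere near the ceiling in practice. The point is only that the protocol itself is no longer the limit on how much I/O can be in flight at once.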
One further adaptation you may have read about is NVMe over Fabrics, which comes in many flavors, including Fibre Channel, Ethernet and InfiniBand. The whole Ethernet area is huge, with two competing hardware infrastructure base technologies – Internet Wide-Area RDMA Protocol (iWARP) and RDMA over Converged Ethernet (RoCE), with v2 of RoCE providing routing and making it useful! I will save details of these for another day, but today let’s look at NVMe-FC – NVMe over Fibre Channel.
Fibre Channel is perfectly suited to NVMe, as today’s FC transport hardware (switches and HBAs) already performs a form of RDMA across the fabric. Where FCP encapsulates SCSI today, we can simply change this to encapsulate NVMe (at a basic level). So from a hardware perspective, the current generation of switches and HBAs can simply have a firmware/OS update and you are NVMe ready. The same goes for our storage controllers: all of our block storage devices are NVMe ready in terms of hardware, and very soon you will see software updates to make use of this. However, it does need software driver updates and OS updates in order to make full use of the NVMe benefits.
So what are the benefits? Reduced latency is the one most often quoted, and it is technically true: the protocol allows for massively parallel queuing (whereas SCSI was serial, with one queue), so there is scope to get parallel I/O running, which means less time queued behind other I/O. The reality is that, at the storage controller side anyway, we’ve already solved some of this in other ways. With NVMe and RDMA we can run a user space driver – so no kernel interrupts, and no copying of data from hardware to kernel to user space, just direct hardware to user space. However, that’s what we did with Spectrum Virtualize back in 2003: user space drivers, polling for work to do, and no interrupts. Similarly with scale out, batching, lock removal and binding – all techniques to reduce context switching inside a system. Multi-core ‘SCSI’ has therefore already been possible, and so the parallel nature of NVMe-FC gets minimal benefit when run on top of a well designed and scalable storage controller architecture.
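The "less time queued behind other I/O" argument can be shown with a deliberately simple toy model. The sketch below is not how any real driver or controller schedules work. It assumes a burst of commands arrives at once, is spread round-robin over some number of queues, and each queue is drained in order by its own core. The function name and the burst size are made up for illustration.

```python
def mean_commands_ahead(commands: int, queues: int) -> float:
    """Average number of commands sitting ahead of each command in its queue.

    Toy model: a burst of `commands` is spread round-robin across `queues`,
    and each queue is serviced in order by its own core.
    """
    if commands <= 0 or queues <= 0:
        raise ValueError("commands and queues must be positive")
    total_ahead = 0
    for q in range(queues):
        # How many commands land on queue q under round-robin placement.
        depth = commands // queues + (1 if q < commands % queues else 0)
        # The i-th command on a queue waits behind i earlier ones:
        # 0 + 1 + ... + (depth - 1).
        total_ahead += depth * (depth - 1) // 2
    return total_ahead / commands


# A burst of 64 commands on one queue versus spread over 8 queues.
single = mean_commands_ahead(64, 1)
spread = mean_commands_ahead(64, 8)
print(f"one queue:    {single} commands ahead on average")
print(f"eight queues: {spread} commands ahead on average")
```

In this model the average command goes from having 31.5 others in front of it to 3.5. A controller that already spreads work across cores internally has banked most of that gain, which is why the server side is where the difference shows.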
That said, the server side will see the biggest benefit. SCSI was first ratified as a standard in 1986, so the OS side of things has been relatively unchanged for over 30 years. This has resulted in an ‘ain’t broke, don’t fix it’ attitude, which is great, but means things like interrupt scaling / spreading take great skill and careful configuration planning in order to spread the workload across multiple cores. (irqbalance in Linux – if you’ve not looked at it, try it!)
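If you want to see how evenly (or not) interrupts are landing on your cores, Linux exposes the per-CPU counts in /proc/interrupts. The sketch below totals them per CPU. The sample text is made up for illustration, including the "hypothetical-hba" device names, and the parser assumes the usual layout of a CPU header row followed by one row per interrupt source.

```python
from typing import List

# Made-up sample in the layout of /proc/interrupts; the device names
# are hypothetical and the counts are invented for illustration.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
 24:       9000        500        300        200   PCI-MSI  hypothetical-hba-q0
 25:          0          0          0          0   PCI-MSI  hypothetical-hba-q1
NMI:         10         10         10         10   Non-maskable interrupts
ERR:          0
"""


def per_cpu_interrupts(text: str) -> List[int]:
    """Sum the interrupt counts in each CPU column."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        fields = line.split()[1:]  # drop the "24:" / "NMI:" label
        for cpu, field in enumerate(fields[:ncpu]):
            if not field.isdigit():
                break  # reached the description columns
            totals[cpu] += int(field)
    return totals


totals = per_cpu_interrupts(SAMPLE)
busiest_share = max(totals) / sum(totals)
print(totals)
print(f"busiest core handles {busiest_share:.0%} of interrupts")
```

On a real Linux host you could pass in `open("/proc/interrupts").read()` instead of the sample. One core taking the vast majority of the interrupts, as in this invented sample, is the kind of imbalance that interrupt spreading is meant to address.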
In order to explain more, and show the potential gains on the server side, I worked with our development teams in the UK and Israel during June and July to get a very early pre-release version of the NVMe-FC driver support into a FlashSystem 9100, so we could run a proof of technology demo at the recent Systems Technical University in Sydney. The video below was created to showcase the technology and explain how we will implement this – with the use of NPIV based NVMe WWPNs, so you can run SCSI and NVMe over the same physical connections. It also brings to light the potential savings in terms of server CPU loading, which in this simple demo gives back almost 50% of the processing needed to handle the I/O. That is processing you can use for real application work – or to potentially reduce costs with fewer core licenses, start to run more SDS based functionality on existing hardware, or simply run more VMs on the same box.
There is a great Redpaper covering a 101 on NVMe, which I urge you to read if you want more technical detail. It won’t be long before you have to start thinking about this for new installs.