Monday, April 10, 2023

NVMe & sysfs: a helper script

The Linux NVMe driver exports a really useful set of information to sysfs. While sysfs is no match for nvme-cli's capabilities, there are a few nice things about using the information it reports to do basic inspection of the NVMe drives connected to a system. First, the information is readable from user space, so superuser privileges are not required. Second, there are no dependencies to install. Lastly, you can get some information that's either not available in nvme-cli or would require running and parsing the output of several nvme-cli commands.

I'm often adding drives to different systems and am interested in their NUMA affinity and whether they negotiated the expected PCIe generation and link width. It's very easy to retrieve this information via sysfs and print it alongside other useful details about the drives, such as model numbers, firmware revisions, and namespace configurations.
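For example, the controller attributes live under /sys/class/nvme/, and the device symlink there points at the underlying PCI device, which carries the NUMA and link attributes. Reading a few of them by hand looks roughly like this (the values shown mirror the nvme0 drive in the output below; the exact formatting of the attribute contents can vary a bit by kernel version):

$ cat /sys/class/nvme/nvme0/device/numa_node
1
$ cat /sys/class/nvme/nvme0/device/current_link_speed
16.0 GT/s PCIe
$ cat /sys/class/nvme/nvme0/device/current_link_width
4
$ cat /sys/class/nvme/nvme0/model
Dell DC NVMe PE8010 RI U.2 960GB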

I have a small Bash function that I use in my own custom .rc file. The output looks like this:

$ lsnvme

   Controller (PCI Address)    : nvme0 (0000:e6:00.0)
   Model                       : Dell DC NVMe PE8010 RI U.2 960GB
   Controller Type & Transport : reserved over pcie
   NUMA Node                   : 1
   Link Information            : 16.0 GT/s PCIe x4
   Serial Number               : SSA9N4572I1309E1N
   Firmware Revision           : 1.1.0
   Attached Namespace Count    : 1

      Namespace nvme0n1 (nsid = 1)
         Formatted / Physical Sector Size : 512 / 512
         Size in GB (512-Byte Sectors)    : 960 (1875385008)
         Globally Unique ID - NGUID       : NA
         Universally Unique ID - UUID     : NA
         Worldwide ID - WWID              : eui.ace42e00162bbe86

   Controller (PCI Address)    : nvme1 (0000:17:00.0)
   Model                       : CSD-3310
   Controller Type & Transport : io over pcie
   NUMA Node                   : 0
   Link Information            : 16.0 GT/s PCIe x4
   Serial Number               : UE2237C0787M
   Firmware Revision           : U3219141
   Attached Namespace Count    : 1

      Namespace nvme1n1 (nsid = 1)
         Formatted / Physical Sector Size : 512 / 4096
         Size in GB (512-Byte Sectors)    : 3840 (7501476528)
         Globally Unique ID - NGUID       : 55453232-3337-4330-4f55-49013738374d
         Universally Unique ID - UUID     : 55453232-3337-4330-4f55-49013738374d
         Worldwide ID - WWID              : eui.55453232333743304f5549013738374d


The dirt-simple Bash function I use can be found here. I have an NVMe-centric way of looking at drives (controllers being very distinct entities from their attached namespaces) and often deal with multiple namespaces. The output format I prefer may not be optimal for someone who prefers a namespace-centric presentation of the same information. The main point is that if you find yourself working with NVMe drives frequently, it might be worth adding a sysfs-based NVMe inspection function to your shell that's suited to your needs.
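If you'd rather roll your own, a minimal sketch of the idea might look like the following. This is not the script linked above, just an illustration; it assumes PCIe-attached controllers and the non-multipath sysfs layout, where namespace directories sit directly under each controller:

lsnvme() {
    local ctrl ns
    for ctrl in /sys/class/nvme/nvme*; do
        [ -d "$ctrl" ] || continue
        # Controller-level attributes exported by the NVMe driver, plus
        # PCIe link details from the parent PCI device (the "device" symlink).
        echo "Controller (PCI Address) : $(basename "$ctrl") ($(cat "$ctrl/address"))"
        echo "Model                    : $(cat "$ctrl/model")"
        echo "NUMA Node                : $(cat "$ctrl/device/numa_node")"
        echo "Link Information         : $(cat "$ctrl/device/current_link_speed") x$(cat "$ctrl/device/current_link_width")"
        echo "Serial Number            : $(cat "$ctrl/serial")"
        echo "Firmware Revision        : $(cat "$ctrl/firmware_rev")"
        # Namespace (block device) attributes.
        for ns in "$ctrl"/nvme*n*; do
            [ -d "$ns" ] || continue
            echo "   Namespace $(basename "$ns") (nsid = $(cat "$ns/nsid"))"
            echo "      Formatted / Physical Sector Size : $(cat "$ns/queue/logical_block_size") / $(cat "$ns/queue/physical_block_size")"
            echo "      Size in 512-Byte Sectors         : $(cat "$ns/size")"
            echo "      Worldwide ID - WWID              : $(cat "$ns/wwid")"
        done
        echo
    done
}

Everything here is just cat-ing sysfs attributes, so it's trivial to add, drop, or re-order fields to taste.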

Tuesday, April 4, 2023

Why SSDs "Lie" About Flush

A Flush command requests that an SSD controller write any data held in its volatile write cache to the NAND media. The intent is to ensure that all writes the controller has completed are actually persisted to the media. I sometimes hear people complain that SSDs "lie" about the Flush command, implying some nefarious, lazy, or outright bad firmware design. The truth is that Flush is quite a pernicious operation for NAND-based SSDs, and SSD designers are forced to balance honoring Flush as intended against protecting the SSD from premature wear-out in the face of excessive Flush operations.

The root of the problem is the mismatch between host write units and the minimum write unit of the SSD, which is a NAND page (or larger, depending on the controller design). Hosts typically write to SSDs in units of 4kB, while a NAND page today is commonly 16kB. An SSD controller would like to accumulate enough data from the host to write out at least a single page of data to NAND. In other words, the controller would like to accumulate 4 x 4kB writes in order to fill a 16kB NAND page. If the host has written only a single 4kB sector and then issues a Flush, the SSD has only two unsavory options: honor the Flush by padding the data to the NAND page size and writing it out to the media, or ignore the Flush since there is insufficient data to fill a NAND page.

Consider a host that would like to Flush on every 4kB write. If the SSD faithfully honors every Flush command, there is an immediate 4x write amplification penalty, as only a quarter of each 16kB NAND page is used for host data. The SSD will also require garbage collection much earlier (once the host has written only about a quarter of the drive's capacity), which compounds the write amplification penalty. In short, there is no way for an SSD to both honor Flush at all times and maintain its rated endurance in the presence of a Flush-happy host.
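For the sake of illustration, the arithmetic above in shell form (toy numbers only; real controllers buffer and pack data far more cleverly):

host_write_kb=4
nand_page_kb=16
# Each flushed 4kB host write burns a full 16kB NAND page.
waf=$(( nand_page_kb / host_write_kb ))
echo "write amplification factor : ${waf}x"            # -> 4x
echo "host data per NAND page    : $(( 100 / waf ))%"  # -> 25%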

"Enterprise" or "data center" drives solve the Flush problem by using on-board back-up capacitors to provide enough power to flush the contents of the controller's caches under all circumstances. This makes the controller caches effectively non-volatile and Flush can be safely treated as a no-op. Consumer drives don't have the luxury of adding the amount of capacitance needed to achieve the same guarantees. It adds cost and there is very limited room (or no room at all) on an M.2 PCB for the required capacitor placements. 

Whether enterprise or consumer grade, all SSDs complete writes when the data arrives at the controller, not when it reaches the NAND media. Waiting for the write to be committed to NAND would be a costly way to solve the Flush problem: write latency would become dependent on host behavior. Ironically, slowly trickling writes would see the highest latency. Imagine 4kB writes arriving 500 milliseconds apart; it would take 1.5 seconds to accumulate enough writes to fill a typical NAND page, which starts creating issues with command timeouts. Also, every write would incur the 500+ microsecond page programming time, which could otherwise be pipelined away (at least until the NAND media bandwidth is saturated).

So, how do we get clarity about Flush behavior? The answer from SSD vendors might be to buy a class of drive that has back-up capacitors; from that perspective, it is a solved problem. There is certainly room for better cooperation between a host and an SSD without back-up capacitors to ensure that a Flush command is handled predictably, but that would require software changes. Perhaps the simplest solution is for consumer SSD vendors to offer different ways of handling Flush: one that is best effort and another that is an iron-clad guarantee, with the latter coming at the cost of endurance warranty coverage.

Then again, maybe it's simply ok to lie sometimes?

Three Things I Wish Every Storage Software Vendor Provided

In my work on SSDs, I have the opportunity to test a wide variety of storage software from parallel filesystems to high performance database...