Tuesday, March 19, 2024

Three Things I Wish Every Storage Software Vendor Provided

In my work on SSDs, I have the opportunity to test a wide variety of storage software, from parallel filesystems to high-performance databases. SSD device vendors are always interested in demonstrating compatibility and (ideally) great performance at the application level. In my time working on "the other side of the wire," there are three things I wish every storage software vendor provided to benefit the relationship between SSDs, the application, and ultimately end users:

1. Create a storage-specific microbenchmark. This benchmark should use as much of the application's IO stack as possible and focus on the aspect(s) of performance most relevant to the application. It should measure a very limited number of metrics (or even a single key metric). The hardware requirements for the benchmark should be kept modest, using the minimum quantity of SSDs and servers. If your benchmark requires 6+ servers and multiple clients, few will want to run it more than once (if at all), let alone run it regularly. A well-written storage microbenchmark helps SSD vendors articulate the value of their products in metrics that are important to the application and helps end users validate the performance claims of an SSD device vendor without the complexity and cost of testing at full application scale. Clever storage software vendors will create scores and scoreboards to encourage SSD vendors to promote their results. Microbenchmarks can also be put into the SQA flow as extended performance monitors to guard against application-specific performance regressions. One of the best examples of a well-executed storage microbenchmark is the Aerospike Certification Tool (ACT). It uses tail latency as its figure of merit.

2. Create a storage certification or compatibility program. This certification can be done in-house, provided as a self-certification to vendors, or enabled through a 3rd party. Each has its advantages. The ultimate goal is to help end users make device selections that improve their chance of first-pass success when deploying the application. Create a logo program and/or a certification/compatibility list so that SSD vendors can promote that they work with the application. The certification can be in written form (i.e., a test plan), but is ideally partially or fully automated. The more it is automated, the easier it is to run frequently or integrate into regular regression testing. VMware (like them or not these days) has a very complete certification program. Once passed, vendors get listed on a public certification site and gain access to a logo program (e.g., "VMware Ready"). Not every certification needs to be nearly that complex to add significant value. A simple compatibility test goes a long way toward ensuring first-pass customer success.

3. Create an SSD specification. The SSD specification should outline all the requirements needed to operate successfully in the application. Beyond hard requirements, an SSD specification can also provide guidance on what is desired in the future. This gives SSD device vendors something to consider when designing new products or prioritizing new feature development. The largest example of such a specification is the OCP Datacenter SSD Specification developed by Meta, Microsoft, and other large organizations (the latest version can be found in the OCP contribution database). While this specification is extremely large, even a couple of thoughtful pages provided by a storage software provider help SSD device vendors understand what is needed to operate successfully with the application. Requirements listed in an SSD specification can and should be validated by a certification or compatibility test.

TL;DR: My dream conversation with a storage software vendor would start with something like "here is my SSD specification, here is a microbenchmark you can run to see how you compare to other SSD devices, and here is a certification test you can run to get on our compatibility list."

Monday, July 10, 2023

The Weight of Data

Many of you have heard that SSDs become ever so slightly "heavier" as they're written, because programming a cell adds electrons to its floating gate. Electrons do have mass, however small. Let's do some arithmetic to see what all that data weighs on a modern SSD.

We know an electron weighs 9.1093837 x 10^-28 grams. We just need to estimate how many electrons are contained in a full SSD. Let's do this per user TB. Straight away, we'll skip the decimal conversion and start with 1TiB. This capacity does not count the spare area of NAND, so let's bump that capacity by about 10% to 1.1TiB. Yes, these small adjustments are silly, but let's just have fun. Because data must be scrambled before being written to NAND, we should expect that a fully written SSD has about the same quantity of "0" and "1" data internally. Most NAND these days stores more than one bit per cell, so let's restate this as saying that all programming states are distributed equally across all cells. If the maximum program state requires 300 electrons (a guesstimate that I hope someone corrects), then we can estimate that cells contain 150 electrons on average. Since most SSDs use TLC NAND, that's 50 electrons per bit.

So, 1.1TiB is 8.8TiBit. Multiply by 50 electrons/bit and we have 440 "TiElectrons" per user TB, or 440 x 1024^4 electrons. So finally we get (440 x 1024^4) * (9.1093837 x 10^-28) = 4.40698 x 10^-13 grams in one TB of user data stored on a TLC SSD. Per Wikipedia, this is about the weight of a "prochlorococcus cyanobacteria, the smallest (and possibly most plentiful) photosynthetic organism on Earth." A Zettabyte of such data would have the equivalent weight of a mosquito. How cool is that?
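
If you want to double-check the arithmetic, a throwaway one-liner using the same guesstimates reproduces the number:

   awk 'BEGIN {
       electrons = 440 * 1024^4          # 440 "TiElectrons" per user TB (TLC, ~50 electrons/bit)
       electron_mass = 9.1093837e-28     # electron rest mass in grams
       printf "%.5e grams\n", electrons * electron_mass
   }'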

Thursday, May 4, 2023

Mixups with rwmixread

I saw an interesting question about FIO on Twitter. I will paraphrase here:

I observed that tests run against a Ceph block device with FIO using the randrw parameter produce the exact same IOPS on the read and write channels, but in other IO tests with the same drive, I saw the same amount of write IOPS but much higher read IOPS. Why?

The glib answer is "because that's what you told FIO to do." When using rw=randrw, the default mix of reads and writes is 50/50. FIO will dutifully enforce this ratio in terms of IOPS. Outside of FIO, there is nothing forcing a 50/50 mix of read and write IOPS, and the SSD is free to service more reads if it has the bandwidth to do so.

The core issue is that read and write performance of SSDs is not symmetrical, but using the rwmixread parameter binds their performance together through IOPS. This tends to produce lower read IOPS than what the SSD is otherwise capable of servicing. Consider a workload that is 70% read and 30% write, but the reads are 4KiB while the writes are 256KiB. Running this against a Gen4 NVMe SSD, we get the following results:

read: IOPS=57.3k, BW=224MiB/s
write: IOPS=24.5k, BW=6122MiB/s  

The write throughput was saturated at 24.5k IOPS, and thus reads were limited to 57.3k IOPS to preserve the requested 70/30 IOPS ratio. If instead the write IOPS are fixed at 24.5k IOPS through the rate_iops option and the reads are allowed to be serviced as fast as possible, the results are much different:

read: IOPS=515k, BW=2011MiB/s
write: IOPS=24.5k, BW=6125MiB/s

Nearly 10x the number of read IOPS are achieved with the same write throughput. Of course, this is an extreme example using very different block sizes, but similar issues can arise even if reads and writes are the same block size, particularly in the presence of garbage collection.
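
For reference, the two runs can be approximated with fio invocations along these lines (device path, IO engine, queue depths, and runtime are illustrative, not the exact settings used above):

   # mixed job: 70/30 read/write enforced by IOPS, 4KiB reads and 256KiB writes
   fio --name=mixed --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
       --rw=randrw --rwmixread=70 --bs=4k,256k --iodepth=32 \
       --time_based --runtime=60

   # split jobs: writes rate-limited to 24.5k IOPS, reads free to run
   fio --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio --time_based --runtime=60 \
       --name=reads --rw=randread --bs=4k --iodepth=32 \
       --name=writes --rw=randwrite --bs=256k --iodepth=8 --rate_iops=24500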

A more pernicious problem that can be hidden with rwmixread is how different SSD firmware prioritizes reads and writes. Some firmware implementations may favor reads over writes. When the rwmixread parameter is used, FIO must hold off submissions of IOs in one direction until the number of completions in the other direction meets the desired ratio. Thus a drive with read prioritization is forced to complete writes it may otherwise have deferred in the presence of further reads. Such a drive will behave very differently under different conditions.

My personal opinion is that the rwmixread parameter is over-used when evaluating SSDs. I prefer to separate reads and writes into their own threads and measure their interaction under different loads. Alas, this is a subject for another day.

Monday, April 10, 2023

NVMe & sysfs: a helper script

The Linux NVMe driver exports a really useful set of information to sysfs. While there's absolutely no comparison to nvme-cli's capabilities, there are a few nice things about using the information reported by sysfs to do basic inspection of the NVMe drives connected to a system. First, the information is available in user space, so superuser privileges are not required. Second, there are no dependencies to download. Lastly, you can get some information that's not available in nvme-cli, or that would otherwise require a set of nvme-cli commands to execute and parse.

I'm often adding drives to different systems and am interested in the drives' NUMA affinity and whether they negotiated the proper PCIe generation and link width. It's very easy to retrieve this information via sysfs and print it along with other useful information about the drives, such as model numbers, firmware revisions, and namespace configurations.
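
For example, a controller's NUMA node and negotiated link are just a couple of cat commands away (the attribute paths below assume the standard sysfs layout for a PCIe-attached controller on a reasonably recent kernel):

   $ cat /sys/class/nvme/nvme0/device/numa_node
   1
   $ cat /sys/class/nvme/nvme0/device/current_link_speed
   16.0 GT/s PCIe
   $ cat /sys/class/nvme/nvme0/device/current_link_width
   4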

I have a small Bash function that I use in my own custom .rc file. The output looks like this:

$ lsnvme

   Controller (PCI Address)    : nvme0 (0000:e6:00.0)
   Model                       : Dell DC NVMe PE8010 RI U.2 960GB
   Controller Type & Transport : reserved over pcie
   NUMA Node                   : 1
   Link Information            : 16.0 GT/s PCIe x4
   Serial Number               : SSA9N4572I1309E1N
   Firmware Revision           : 1.1.0
   Attached Namespace Count    : 1

      Namespace nvme0n1 (nsid = 1)
         Formatted / Physical Sector Size : 512 / 512
         Size in GB (512-Byte Sectors)    : 960 (1875385008)
         Globally Unique ID - NGUID       : NA
         Universally Unique ID - UUID     : NA
         Worldwide ID - WWID              : eui.ace42e00162bbe86

   Controller (PCI Address)    : nvme1 (0000:17:00.0)
   Model                       : CSD-3310
   Controller Type & Transport : io over pcie
   NUMA Node                   : 0
   Link Information            : 16.0 GT/s PCIe x4
   Serial Number               : UE2237C0787M
   Firmware Revision           : U3219141
   Attached Namespace Count    : 1

      Namespace nvme1n1 (nsid = 1)
         Formatted / Physical Sector Size : 512 / 4096
         Size in GB (512-Byte Sectors)    : 3840 (7501476528)
         Globally Unique ID - NGUID       : 55453232-3337-4330-4f55-49013738374d
         Universally Unique ID - UUID     : 55453232-3337-4330-4f55-49013738374d
         Worldwide ID - WWID              : eui.55453232333743304f5549013738374d

The dirt simple Bash function I use can be found here. I have an NVMe-centric way of looking at drives (controllers being very distinct entities from their attached namespaces) and often deal with multiple namespaces. The output format I prefer may not be optimal for someone who prefers a namespace-centric presentation of the same information. The main point here is that if you find yourself needing to work with NVMe drives frequently, it might be worth adding a sysfs-based NVMe inspection function to your shell that's suited to your needs.
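
If you'd rather start from a sketch than lift my exact function, something along these lines covers the basics (the attribute names are the standard ones exported by the driver; error handling and multipath layouts are left as an exercise):

# minimal, user-space-only NVMe inventory built from sysfs attributes
lsnvme() {
    local ctrl ns
    for ctrl in /sys/class/nvme/nvme*; do
        [ -d "$ctrl" ] || continue
        echo "Controller (PCI Address) : $(basename "$ctrl") ($(cat "$ctrl/address"))"
        echo "Model                    : $(cat "$ctrl/model")"
        echo "Transport                : $(cat "$ctrl/transport")"
        echo "NUMA Node                : $(cat "$ctrl/device/numa_node" 2>/dev/null)"
        echo "Link Information         : $(cat "$ctrl/device/current_link_speed" 2>/dev/null)" \
             "x$(cat "$ctrl/device/current_link_width" 2>/dev/null)"
        echo "Serial Number            : $(cat "$ctrl/serial")"
        echo "Firmware Revision        : $(cat "$ctrl/firmware_rev")"
        for ns in "$ctrl"/nvme*n*; do
            [ -d "$ns" ] || continue
            echo "   Namespace $(basename "$ns") (nsid = $(cat "$ns/nsid"))"
            echo "      Logical / Physical Sector Size : $(cat "$ns/queue/logical_block_size") / $(cat "$ns/queue/physical_block_size")"
            echo "      Size (512-Byte Sectors)        : $(cat "$ns/size")"
            echo "      WWID                           : $(cat "$ns/wwid")"
        done
        echo
    done
}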

Tuesday, April 4, 2023

Why SSDs "Lie" About Flush

A Flush command requests that an SSD controller write any data held in its volatile write cache to the NAND media. The intent is to ensure that all writes completed by the controller are actually persisted to the media. I sometimes hear people complain that SSDs "lie" about the Flush command, implying some nefarious, lazy, or outright bad firmware design. The truth is that Flush is quite a pernicious operation for NAND-based SSDs, and SSD designers are forced to balance between honoring Flush as it is intended and protecting the SSD from premature wear-out in the face of excessive Flush operations.

The root of the problem is the mismatch between host write units and the minimum write unit of the SSD, which is a NAND page (it could be larger depending on the controller design). Hosts typically write to SSDs in units of 4kB while a NAND page today is commonly 16kB. An SSD controller would like to accumulate enough data from the host to write out at least a single page of data to NAND. In other words, the controller would like to accumulate 4 x 4kB writes in order to fill a 16kB NAND page. If the host has written only a single 4kB sector and then issues a Flush, the SSD has only two unsavory options: honor the Flush by padding the data to the NAND page size and writing it out to the media, or ignore the Flush since there is insufficient data to fill a NAND page.

Consider a host that would like to Flush on every 4kB write. If the SSD faithfully honors every Flush command, there is an immediate 4x write amplification penalty as only a quarter of each 16kB NAND page is used for host data. The SSD will also require garbage collection much earlier (at just 25% full), which compounds the write amplification penalty. In short, there is no way for an SSD to both honor Flush at all times and maintain its rated endurance in the presence of a Flush-happy host.
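
If you want to see this tension from the host side, fio makes it easy to generate a flush-per-write workload (illustrative command; it burns endurance by design, so don't aim it at a drive you care about):

   # 4KiB random writes with an fsync after every write; on a raw block
   # device the fsync is turned into a Flush command by the block layer
   fio --name=flushy --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
       --rw=randwrite --bs=4k --iodepth=1 --fsync=1 --time_based --runtime=60

Comparing the write IOPS and latency of this run against the same job without --fsync=1 says a lot about how a given drive actually treats Flush.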

"Enterprise" or "data center" drives solve the Flush problem by using on-board back-up capacitors to provide enough power to flush the contents of the controller's caches under all circumstances. This makes the controller caches effectively non-volatile and Flush can be safely treated as a no-op. Consumer drives don't have the luxury of adding the amount of capacitance needed to achieve the same guarantees. It adds cost and there is very limited room (or no room at all) on an M.2 PCB for the required capacitor placements. 

Whether enterprise or consumer grade, all SSDs complete writes when they arrive at the controller (not the NAND media). Waiting for the write to be committed to NAND would be a costly way to solve the Flush problem. Write latency would become dependent on host behavior. Ironically, slowly trickling writes would result in the highest latency. Imagine 4kB writes arriving 500 milliseconds apart. It would take 1.5 seconds to accumulate enough writes to fill a typical NAND page. This starts creating issues with command timeouts. Also, all writes would incur the 500+ microsecond page programming time, which could otherwise be pipelined away (at least until the NAND media bandwidth is saturated).  

So, how do we get clarity about Flush behavior? The answer from SSD vendors might be to buy a class of drive that supports back-up capacitors. From that perspective, it is a solved problem. There is certainly room for better cooperation between a host and an SSD without back-up capacitors to ensure that a Flush command is handled predictably, but that would require software changes. Perhaps the simplest solution is for consumer SSD vendors to offer different ways to handle Flush: one that is best effort and another that is an iron-clad guarantee, with the latter sacrificing endurance warranty claims.

Then again, maybe it's simply ok to lie sometimes?
