Tuesday, April 4, 2023

Why SSDs "Lie" About Flush

A Flush command requests that an SSD controller write any data held in its volatile write cache to the NAND media. The intent is to ensure that every write the controller has already completed is actually persisted to the media. I sometimes hear people complain that SSDs "lie" about the Flush command, implying some nefarious, lazy, or outright bad firmware design. The truth is that Flush is quite a pernicious operation for NAND-based SSDs, and SSD designers are forced to balance honoring Flush as intended against protecting the SSD from premature wear-out in the face of excessive Flush operations.

The root of the problem is the mismatch between host write units and the minimum write unit of the SSD, which is a NAND page (or larger, depending on the controller design). Hosts typically write to SSDs in units of 4kB, while a NAND page today is commonly 16kB. An SSD controller would like to accumulate enough data from the host to write out at least a single page of data to NAND; in other words, it would like to accumulate 4 x 4kB writes in order to fill a 16kB NAND page. If the host has written only a single 4kB sector and then issues a Flush, the SSD has only two unsavory options: honor the Flush by padding the data to the NAND page size and writing it out to the media, or ignore the Flush since there is insufficient data to fill a NAND page.
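
To make the trade-off concrete, here is a minimal sketch of the decision a controller faces when a Flush arrives with a partially filled page buffer. It is not any vendor's firmware; the sizes, names, and the pad-versus-defer split are assumptions chosen to match the example above.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define HOST_SECTOR_SIZE  4096u     /* typical host write unit                */
    #define NAND_PAGE_SIZE    16384u    /* common NAND page size today            */

    /* Accumulation buffer for one open NAND page (illustrative only). */
    struct page_buffer {
        uint8_t data[NAND_PAGE_SIZE];
        size_t  filled;                 /* bytes of host data accumulated so far  */
    };

    /* Hypothetical media call: program one full page to NAND. */
    extern void nand_program_page(const uint8_t *page);

    /* Option 1: honor the Flush by padding the partial page and programming it.
     * The data is persisted immediately, but the unfilled portion of the page
     * is wasted, inflating write amplification. */
    void flush_honor(struct page_buffer *buf)
    {
        if (buf->filled == 0)
            return;                     /* nothing cached, nothing to persist */
        memset(buf->data + buf->filled, 0, NAND_PAGE_SIZE - buf->filled);
        nand_program_page(buf->data);
        buf->filled = 0;
    }

    /* Option 2: defer. Program only when the page is full; a Flush that arrives
     * before then is acknowledged without the data actually reaching NAND. */
    void flush_defer(struct page_buffer *buf)
    {
        if (buf->filled == NAND_PAGE_SIZE) {
            nand_program_page(buf->data);
            buf->filled = 0;
        }
        /* else: data stays in the volatile write cache despite the Flush */
    }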

Consider a host that would like to Flush on every 4kB write. If the SSD faithfully honors every Flush command, there is an immediate 4x write amplification penalty, since only a quarter of each 16kB NAND page holds host data. The SSD will also require garbage collection much earlier (when only 25% of its capacity is useful host data), which compounds the write amplification penalty. In short, there is no way for an SSD to both honor Flush at all times and maintain its rated endurance in the presence of a Flush-happy host.
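
The arithmetic behind that penalty is simple, and the sketch below just spells out the numbers used above (the 16kB page and 4kB host write are the example sizes from this post, not constants of any particular drive):

    #include <stdio.h>

    int main(void)
    {
        const double nand_page_kb  = 16.0;  /* minimum program unit              */
        const double host_write_kb = 4.0;   /* one flushed host write            */

        /* Each honored Flush consumes a whole page for a single host sector. */
        double write_amplification = nand_page_kb / host_write_kb;      /* 4.0  */

        /* Only a quarter of each programmed page holds host data, so garbage
         * collection becomes necessary when just 25% of capacity is useful. */
        double useful_fraction = host_write_kb / nand_page_kb;          /* 0.25 */

        printf("write amplification: %.0fx, useful data per page: %.0f%%\n",
               write_amplification, useful_fraction * 100.0);
        return 0;
    }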

"Enterprise" or "data center" drives solve the Flush problem by using on-board back-up capacitors to provide enough power to flush the contents of the controller's caches under all circumstances. This makes the controller caches effectively non-volatile and Flush can be safely treated as a no-op. Consumer drives don't have the luxury of adding the amount of capacitance needed to achieve the same guarantees. It adds cost and there is very limited room (or no room at all) on an M.2 PCB for the required capacitor placements. 

Whether enterprise or consumer grade, all SSDs complete writes when the data arrives at the controller, not when it reaches the NAND media. Waiting for the write to be committed to NAND would be a costly way to solve the Flush problem: write latency would become dependent on host behavior, and, ironically, slowly trickling writes would see the highest latency. Imagine 4kB writes arriving 500 milliseconds apart. It would take 1.5 seconds to accumulate enough writes to fill a typical NAND page, which starts creating issues with command timeouts. Also, every write would incur the 500+ microsecond page programming time, which could otherwise be pipelined away (at least until the NAND media bandwidth is saturated).
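
Here is a back-of-the-envelope sketch of that trickle scenario. The 500 millisecond inter-arrival gap and 500 microsecond program time are the illustrative figures from above, not measurements of any particular drive:

    #include <stdio.h>

    int main(void)
    {
        const int    writes_per_page  = 16384 / 4096;   /* 4 host writes per NAND page */
        const double inter_arrival_ms = 500.0;          /* trickled 4kB writes         */
        const double program_time_us  = 500.0;          /* typical page program time   */

        /* If completion waited for NAND, the first write of a page could not
         * complete until the last write of that page arrives and the page is
         * programmed. */
        double worst_wait_ms = (writes_per_page - 1) * inter_arrival_ms
                             + program_time_us / 1000.0;

        printf("worst-case completion latency for the first write: ~%.1f ms\n",
               worst_wait_ms);   /* ~1500.5 ms -- flirting with command timeouts */
        return 0;
    }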

So, how do we get clarity about Flush behavior? The answer from SSD vendors might be to buy a class of drive that includes back-up capacitors; from that perspective, it is a solved problem. There is certainly room for better cooperation between a host and an SSD without back-up capacitors to ensure that a Flush command is handled predictably, but that would require software changes. Perhaps the simplest solution is for consumer SSD vendors to offer different ways to handle Flush: one that is best effort and another that is an iron-clad guarantee, with the latter sacrificing some of the endurance warranty.

Then again, maybe it's simply ok to lie sometimes?

1 comment:

  1. Well I think the negativity comes with inexperience, speaking from experience. Hah! Usually when it comes down to finding really low level details that aren't normally in my domain or apparent, I like to blame the system of naming, documentation, or in general user handling etc. because I like to pretend I'm a good user.
    For this "really really actually flush" feature I feel like you're hinting at, I would probably make a special new kmod flag called something like "storage_flush_when_told_performance_tradeoff_i_know_what_im_doing"
    I'm a big fan of overly verbose "I know what I'm doing" flags that I can't be mad about for breaking my system when I flip them. I know people hate more flags and signals though...

