Friday, November 10, 2006

flash for main storage

I was in a discussion about flash on a closed mailing list, so I'll post my comments here.

I believe that flash will soon be suitable for main storage on most desktop and laptop machines (which means replacing the vast majority of the hard drive market). Flash survives mechanical wear much better than hard drives (flash storage in a camera will usually survive the destruction of the camera), it produces less heat and less noise, and it has better seek times. It is more expensive, although the price is coming down, and the main remaining problem is the limited number of writes that can be made.

Flash is widely regarded as being slow for bulk IO (benchmark results I have seen approach 10MB/s - while 60MB/s is common for cheap desktop IDE disks). I am not sure how much of this is inherent to flash technology and how much is due to the interface used to access the flash. I often work with Gig-E networks, but for my home use I only have 100baseT, so I have little need for more than 10MB/s IO rates at home.

It is generally regarded that a sector of flash storage wears out at between 10,000 and 1,000,000 writes depending on how recent the hardware is and who you talk to (some vendors are more optimistic than others regarding the usable life of their devices).

Let's assume that you have a 32G flash module running JFFS2 with an average of 2G free (30G of long-term data that doesn't change and 2G of space that is used for new files). Let's assume that the most pessimistic prediction for flash reliability of 10,000 writes happens to be correct. So if 10,000 writes are to be made to that 2G of space that means 20T of data written! If we assume that the machine will be obsolete in 5 years then that allows us an average of just over 10G of data written per day (20,000/365/5=10.9). On my laptop iostat reports the following after 5 days of uptime:
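The arithmetic above can be reproduced with a short sketch. The constant and function names are illustrative, decimal units are used (1T = 1000G), and it assumes that wear-levelling spreads writes perfectly evenly over the 2G of free space, which a real filesystem will only approximate.

```python
# Figures taken from the paragraph above; names are illustrative only.
FREE_SPACE_GB = 2        # space that absorbs all new writes
WRITE_CYCLES = 10_000    # most pessimistic per-sector endurance estimate
LIFETIME_YEARS = 5       # time until the machine is assumed obsolete


def daily_write_budget_gb(free_gb, cycles, years):
    """Total writable volume, spread evenly over the machine's lifetime."""
    total_gb = free_gb * cycles          # 2G * 10,000 = 20,000G (20T)
    return total_gb / (years * 365)      # ignore leap days for simplicity


budget = daily_write_budget_gb(FREE_SPACE_GB, WRITE_CYCLES, LIFETIME_YEARS)
print(f"{budget:.2f} GB/day")
```

Changing WRITE_CYCLES to a more optimistic figure such as 100,000 scales the daily budget by the same factor.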

Device:    tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
hda       1.94         9.94        20.33    4614118    9439808

I believe this means that an average of 20 blocks per second were written over the last 5 days. With a block size of 4K (the page size), that is 6.6G per day. Clearly something is wrong with my laptop as there should not be so many writes, but even so I wouldn't expect it to wear out within 5 years if I used only flash storage. Incidentally, I do a lot of travelling and generally find that I'm lucky if a laptop hard drive lasts three years. So I could expect flash to last longer than a hard drive for my laptop use.
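The daily write volume depends heavily on the block size assumed for iostat's Blk_wrtn/s column. The sketch below converts the rate shown above under two assumptions: the 4K page size used in this post, and the 512-byte sector size that a commenter reports from the iostat man page. The function name is illustrative and the result is in binary gigabytes (GiB).

```python
SECONDS_PER_DAY = 86_400
BLK_WRTN_PER_SEC = 20.33   # Blk_wrtn/s from the iostat output above


def gib_written_per_day(blocks_per_sec, block_bytes):
    """Convert a block write rate into GiB written per day."""
    return blocks_per_sec * block_bytes * SECONDS_PER_DAY / 2**30


# The block size is the uncertain part, so try both candidates.
for block_bytes in (4096, 512):
    daily = gib_written_per_day(BLK_WRTN_PER_SEC, block_bytes)
    print(f"{block_bytes:>4}-byte blocks: {daily:.2f} GiB/day")
```

With 4K blocks the result is a little under 7 GiB per day; with 512-byte blocks it is under 1 GiB per day, which would leave far more headroom against the budget of about 10G per day.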

When flash fails I believe that only a small part of the data will be lost, which is better than the usual hard drive failure mode of losing everything!

Also there is nothing preventing you from creating a RAID-1 of flash devices. Last time I checked the JFFS2 kernel code didn't support such things but that could be fixed if there was suitable hardware.

Note that JFFS2 is vastly preferable to using Ext3 or similar filesystems on a flash device. Flash needs wear-levelling (spreading the write load over all parts of the device) for sane operation. JFFS2 has this built into the filesystem, while Ext3 and similar filesystems are designed to repeatedly write to the same parts of the disk. This means that to use Ext3 you need a mapping layer that does wear-levelling, which causes inefficiency. Also, JFFS2 has compression built in (the same method as gzip). This is good for smaller flash devices (e.g. the 32M storage that was common in iPaQs), and also reduces the wear on larger storage.
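To illustrate why wear-levelling matters, here is a toy sketch that counts erase/write cycles per physical block when one logical location is updated repeatedly. It is not how JFFS2 or any real translation layer works; the block count, write count, and round-robin policy are all hypothetical and chosen only to show the effect.

```python
from collections import Counter

NUM_BLOCKS = 8    # hypothetical tiny flash device
WRITES = 1_000    # repeated updates to a single logical location


def wear_in_place(writes):
    """No wear-levelling: every update hits the same physical block."""
    wear = Counter()
    for _ in range(writes):
        wear[0] += 1
    return wear


def wear_round_robin(writes, num_blocks):
    """Naive wear-levelling: each update goes to the next block in turn."""
    wear = Counter()
    for i in range(writes):
        wear[i % num_blocks] += 1
    return wear


# The most-worn block determines when the device starts to fail.
print("in place:   ", max(wear_in_place(WRITES).values()))
print("round robin:", max(wear_round_robin(WRITES, NUM_BLOCKS).values()))
```

In this toy model the worst-case wear drops from 1,000 cycles to 125, that is, by a factor equal to the number of blocks the writes are spread over.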

The biggest problem for me in using flash at the moment is the lack of support for XATTRs (needed for SE Linux) in JFFS2. KaiGai Kohei has been working on this. It's been a while since I checked on the progress, so I'm not sure if it has got into the repository yet.

Another problem with flash is that it is totally unsuitable for use as a swap device. This means that you need to have so much RAM that swap is not needed. Fortunately desktop machines with 2G of RAM are becoming common.


Anonymous said...

i did similar calculations recently for the 128MB CF card i use in my digital camera. given that i have now taken over 8000 photos with it i was interested to work out (using the same 10,000 writes figure you mention) how long it had to go. the bottom line was very little cause for concern, as even taking the high figure of ~2MB per 3.3Mpixel image (which is around the largest it seems to produce), i still have many, many, many writes left - i expect the camera will die long before the card does. i realise this is not directly comparable with typical OS throughput, but the basic argument is the same. Owen.

Luis Villa said...

This is already happening: Vista will come with support for what they call a 'ReadyDrive', which is a hard drive with several gigs of built-in flash to store time-sensitive bits on. Ars has some details here.

Martin-Éric said...

The project I'm working on involves 512Mb NAND flash storage devices with a Geode LX. I find the access speed to be excruciatingly slow. Part of the problem is the serial access method used, but the kernel's MTD subsystem also appears to lack the ability to mount and access large devices quickly enough to be usable.

Anonymous said...

Please provide iostat -k and/or iostat -x, because, according to man iostat, block size is 512 bytes, which makes your deduction even more interesting.

Russ said...

Just a quick note to others using comments: CF is not a good comparison or testbed. CF has its own flash translation layer and it does not benefit from JFFS2.

Aside from JFFS2, YAFFS2 also looks interesting for large NAND flash devices.

Anonymous said...

I have a Linksys WRT54G running OpenWrt, which uses JFFS2.

One thing I noticed is that after I have done some amount of editing (i.e., updates that hit the filesystem, like those in /etc/, or some amount of apt-get), it takes a very noticeably long time for it to "rebalance".

I am not sure if this is related to the speed of the flash used, but if it is not, I don't think it would be usable in a laptop environment.

Erich Schubert said...

Typical flash drives have a much more limited IO bandwidth than what you would probably get from a larger drive.
If you were to make a flash hard disk in the 40 GB range, you would probably use 10 drives with 4 GB each in a RAID-0-like setup (though you might even prefer a higher RAID level, which by the way is another way to counter wear; however, the RAID would need to be carefully combined with a translation layer, I guess).
But in such a setup, it should be easy to obtain higher transfer rates.

What I'd certainly like to see as a first step is using flash for e.g. journal storage, or for other metadata. There are other hybrid solutions possible. If you can collect the data that needs frequent rewriting, you could maybe put it on a small magnetic disc, and use the flash mostly for data that is rarely rewritten, just to be able to get by with a smaller magnetic disc. Maybe just use copy-on-write semantics to reduce the number of write cycles for the flash.
Aren't there flash types that are much better WRT write cycles? Maybe even NVRAM? Even having a 10 MB COW layer which will store e.g. key filesystem structures to reduce the number of write cycles might be an affordable combination.
This could maybe even be done in a controller on the disc, transparent to the operating system: some MB of copy-on-write NVRAM, some GB of flash, and some RAM to store write counters to optimize NVRAM usage.

Well, I'm not an expert.

Aaron Zitzer said...

FYI, Merrill Lynch recently issued a report on their view of the future of the HDD market including information on the potential threat of flash. You can get to the full report from here: