Deconstruct Seattle, WA - Thu & Fri, Apr 23-24 2020

← Back to 2019 talks


(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

[APPLAUSE] Hi. Let's talk about files. Most programmers seem to think files are generally pretty easy to deal with and straightforward. Just for example, let's look at the top comments from Reddit's r/programming from when Dropbox decided to remove support for all but the most common file system on Linux. The top comment reads, I'm a bit confused. Why do these applications have to support these file systems directly? The only differences I could possibly see between different file systems are file size limitations and permissions. But aren't most modern file systems about on par with each other?

The number two comment says the same thing. Someone does respond and says, oh, hey, this is actually more complicated than you think. File systems are a leaky abstraction. But since this is Reddit, someone basically calls them a moron and then explains that it's very easy, as everybody knows. Most comments are in this vein. The goal of this talk is to convince you that this part of the internet is wrong. These comments are actually not correct, and this is harder than people think.

To do this, we'll look at the file stack. We'll start at the top with the file API, and we'll see that it's basically impossible to use correctly. Then we'll look at the file system, and we'll see that it drops errors in ways that lead to data loss and data corruption. And then we'll look at disks and see that disks can easily corrupt data at a rate millions of times higher than claimed in vendor data sheets.

So to begin, let's look at writing one file. What do we have to do to write a file safely? And by safely, I mean we don't want to have data corruption. So we either want the write to complete or we want to have nothing happen. In this example, let's say we have a file that contains the text a space foo, and we'd like to overwrite foo with bar. So then it will contain a space bar. For this example, there will be a number of simplifications, one of which is, if you think a lot about disks, you can think of each one of these characters as a sector on disk. For the most part, I will not be calling out the simplifications, because the full version of this talk would take about three hours and we have 25 minutes.

Similarly, you will notice a reference to a paper in the bottom right. I will also not be calling these out because we don't have enough time. So how do we write to a file? We can use this function called pwrite. This is a syscall, a thing provided by the OS to let us interact with the file system. If you're used to something like Python, maybe you're not used to an interface like this. But ultimately, with something like Python or Ruby, when you call something to write a file, it will end up making a syscall like this to write to the file.

So anyway, the interface is going to take the file we want to write to, the data we want to write, in this case the text bar, how many bytes we want to write, in this case 3 bytes, and an offset, in this case offset 2. So if we do this, we can see what might happen. Maybe it will work, and we'll actually get a bar. We could crash while writing, though. If this happens, maybe we crash before we write and we still have a foo, or maybe we crash in the middle and we have data corruption.
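In Python, `os.pwrite` is a thin wrapper over this exact syscall, so the example from the talk can be sketched directly. The file name here is made up for illustration:

```python
import os

# A file containing "a foo", where we overwrite 3 bytes at offset 2
# with "bar", as in the talk's example.
path = "demo.txt"
with open(path, "wb") as f:
    f.write(b"a foo")

fd = os.open(path, os.O_WRONLY)
os.pwrite(fd, b"bar", 2)  # write the 3 bytes b"bar" at offset 2
os.close(fd)

with open(path, "rb") as f:
    data = f.read()  # b"a bar" -- if nothing went wrong along the way
os.remove(path)
```

The crash scenarios in the talk are exactly the cases this snippet cannot see: between `os.pwrite` returning and the data reaching disk, any mix of old and new bytes may be what actually survives a crash.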

So the pwrite operation is not guaranteed to be atomic on crash, so anything could happen. If we want to enforce atomicity, so we don't have data corruption, one standard technique is to make a copy of the data into what we call an undo log. We'll do the write to the main file and then we can delete the log file. If we crash in step one, this is OK, because we wouldn't have started writing the file yet, so it's OK that the undo log is not complete. If we crash during step two, this is also OK, because the undo log should be complete and we should be able to use it to restore the data to a known good state.

So to do this, we need to use a couple more syscalls. We'll use create, which is spelled creat, to make this log file. Sorry. That's just how it is. And then we can write into this log file what we need to do to restore. In this case, we're writing 2,3,foo. What this says is, to restore, please write 3 bytes at offset 2, and those bytes should be foo. And then we'll do the pwrite, which should overwrite the data. If we crash during this, we should be fine. And then we'll delete the log file.
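The three steps above can be sketched as a small function. This is the naive version, with no fsyncs yet; the function name and log-file naming are my own, and as the talk goes on to show, this alone is not actually safe:

```python
import os

def overwrite_with_undo_log(path, offset, new_bytes):
    """Naive undo-log protocol from the talk (no fsyncs yet)."""
    log_path = path + ".log"

    # Read the bytes we're about to clobber so a recovery tool could undo.
    with open(path, "rb") as f:
        f.seek(offset)
        old = f.read(len(new_bytes))

    # Step 1: creat() the log file and record offset, length, old bytes,
    # e.g. b"2,3,foo" for the talk's example.
    with open(log_path, "wb") as log:
        log.write(b"%d,%d," % (offset, len(old)) + old)

    # Step 2: pwrite() the new data into the main file.
    fd = os.open(path, os.O_WRONLY)
    os.pwrite(fd, new_bytes, offset)
    os.close(fd)

    # Step 3: delete the log file -- the overwrite is complete.
    os.unlink(log_path)
```

A recovery step (not shown) would run at startup: if a complete log file exists, replay it to put the old bytes back.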

And this may or may not work, depending on our file system and file system configuration. If we're using ext3 or ext4, very common file systems on Linux, and we're in the mode data=journal, we'll talk about data modes later, then it should work OK. Something we might see is that the log file is incomplete. It contains 2,3,f, and the original file has not been modified. When we try to restore after a crash, we can tell the log file is incomplete, because it says to write three bytes but there's only one byte there. So it cannot possibly be complete and we should not restore.

We might also see that the log file is complete and the original file has been modified. This is OK. This is what the log file is for. It'll help us restore to a known good state. However, if we're using the mode data=ordered, then this will not work. Something that can happen is this write and this pwrite can be reordered, and we can start modifying the file before the log file has actually been written to. We can fix this with another syscall, fsync. Fsync does a couple of things. First, it's a barrier. This means it prevents reordering. It also flushes caches, and we'll talk about what that means later.

But anyway, if we add these two fsyncs, it'll prevent the write and the pwrite from getting reordered, so this will work with data=ordered mode. However, with data=writeback, it can appear as if we wrote random garbage into the log file. We didn't actually write random garbage into the log file, but this write operation is not atomic in multiple ways. One way in which it's not atomic is that the length of the file can change before we finish writing. And so if we just naively read whatever bits were on disk, it will seem as if those were in the log file.

To fix this, we can add a checksum. A checksum is just some way to check that the data is good. If our checksum holds, then we'll know the log file is actually good. We could have one more problem. It can appear as if we have no log file. So even though we created this log file, wrote to it, and fsynced it, the directory metadata may not contain the log file. So when we restore from a crash and we ask, where's the log file that was here, we might not see a log file, and we'll think we actually finished that operation.

To fix this, we can fsync the containing directory. We should also fsync afterwards to make sure that if we crash later we don't undo our work, although technically that's not a correctness issue in this case. We also haven't checked for errors. Every single one of these syscalls can return an error. We'll talk about this in the section on file systems, because while this is not exactly an API difficulty issue, there is a bug that makes this impossible to handle correctly.
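Putting the whole sequence together, here is a sketch of the hardened version: a checksummed undo record, an fsync of the log, an fsync of the containing directory, the pwrite, and a final fsync. The function name, the record layout, and the CRC choice are all mine; this illustrates the ordering the talk builds up, not a guarantee of safety on every file system, and as discussed, a failed fsync may still be unrecoverable:

```python
import os
import zlib

def safe_overwrite(path, offset, new_bytes):
    """Sketch of the full sequence from the talk: checksummed undo log,
    fsync the log, fsync the directory, pwrite, final fsync."""
    dir_path = os.path.dirname(os.path.abspath(path))
    log_path = path + ".log"

    with open(path, "rb") as f:
        f.seek(offset)
        old = f.read(len(new_bytes))

    # Undo record plus a CRC so a torn or garbage log is detectable.
    record = b"%d,%d," % (offset, len(old)) + old
    record += b",%d" % zlib.crc32(record)

    log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(log_fd, record)
    os.fsync(log_fd)  # barrier + flush: log durable before we touch the file
    os.close(log_fd)

    dir_fd = os.open(dir_path, os.O_RDONLY)
    os.fsync(dir_fd)  # make the directory entry for the log durable
    os.close(dir_fd)

    fd = os.open(path, os.O_WRONLY)
    os.pwrite(fd, new_bytes, offset)
    os.fsync(fd)  # so a later crash can't silently undo the completed write
    os.close(fd)

    os.unlink(log_path)
```

Note how much ceremony this is for a three-byte overwrite, and that every one of these calls can still return an error we haven't handled.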

OK. So that's writing a file. This is kind of OK. I mean, except for this file system bug that makes it completely unsafe. But other than that, it's fine. And so, yeah, sorry. So if someone asks you in an interview, can you write a file safely, and you know all the rules you need to know, you can do this, right? But one question you might have is, let's say we have a real codebase of a million or ten million lines of code. Can we do this right every single time?

This is what the authors of this paper looked at. They took a number of programs that write to files. These are databases and version control systems, things you would hope write to files safely. And then they wrote static analysis tools to try to find bugs. A static analysis tool, rather than executing the code, just looks at what's there and tries to decide, in this case, whether people are using the file API correctly. You can find a few simple classes of bugs that suggest incorrect API usage.

When they did this, they found that every single program tested, except for SQLite in one mode, had at least one bug. The people who write this software are actually pretty good at this. They're certainly better than me at writing files. Probably they're, no offense, better than most of you in the audience. And they still couldn't do this correctly. Their testing is also quite rigorous. I'd say the testing for Postgres and SQLite is much more rigorous than the testing for most software. Yet they still ended up with bugs. So even experts cannot do this correctly. One natural follow-up question is, why is this so hard to use that even experts who've spent a lot of effort on this cannot do it correctly?

So one issue is that concurrency is hard, and using the file API is effectively concurrent programming. Right? If you look at what bugs people have when they write concurrent programs, one problem that comes up is that they incorrectly assume things are atomic which are not atomic. We've already seen two examples of this. People also incorrectly assume things execute in program order when they actually can be reordered. And we've also seen at least one example of this.

We also have infrequent, non-deterministic failures. This is actually worse for files than it is for concurrent programming. Sometimes people say this means it's OK, because when you turn your computer on, even though most software writes files unsafely, most of your files are still mostly there. This is, I guess, OK for some use cases. I personally live like this. But if you're writing a database, or writing something that tracks people's email or wedding photos, whatever, it's simply not OK, and you want to avoid having data corruption. And because failures are so infrequent, it's much harder to tell if your program is correct.

Another issue is that the API is inconsistent. We've already seen a couple of examples of this. If you look up any syscall and try to decide how to use it safely, the answer depends on your file system and file system mode. If you want to know if appends are atomic, the answer is they're usually atomic, but not with ext3 or ext4 in data=writeback mode, or ext2 in any mode. Or if you want to know if directory operations can be reordered, mostly they cannot be, but there are exceptions, such as ext2, not shown on this slide.

And this is true for any particular property you might care about, even properties that many people assume are fundamental, like single-sector writes being atomic. This is actually not enforced by most file systems; it just happens to be enforced by most current generations of disks. People have experimented with NVM where this would not be true, and if those devices enter wide use, you will see many, many bugs with respect to this. And by the way, these aren't bugs. This API behavior differs by design. Theoretically, when you're writing code that accesses files, you should read the file system spec, understand the whole thing completely, and then code to the spec. In practice, this is not what people do. People typically code to the disks and file systems they have. And then when you swap in another file system, you end up with bugs.

So another issue is documentation's unclear. Earlier we talked about these data modes. If we want to know what these mean, if you ask, people say, oh, you should read the man page. So let's look at the man page. Journal says all data is committed into the journal prior to being written into the main file system. Ordered says all data is forced directly out to the main system prior to its metadata being committed to the journal.

Writeback is quite long. I'm not going to read this whole thing, but I will note that it says this is rumored to be the highest-throughput option. The problem with these docs is, what does this even mean? You can only actually understand what this means if you not only know what a journaling file system is, but also know about the implementation decisions the ext developers made. These docs cannot tell you how to use your file system. They can only remind you if you already knew.

If you want to find English-language docs, LKML is probably the place to go. This is the Linux kernel mailing list. Here's an exchange from 2009 where someone finds out a very surprising thing, to them, about file systems. They ask, where is this documented? A developer helpfully responds, this mailing list. A core file system developer responds, oh, probably some six to eight years ago in email postings that I made. So if you want to understand how to use file systems safely, you're basically expected to have read the last five to ten years of LKML with respect to file systems.

One more problem is there's an inherent conflict between performance and correctness with file systems. So we said earlier, fsync is a barrier that prevents reordering and it also flushes caches. If you've ever worked on the design of a very high performance cache, like for example, a microprocessor cache, this is a very strange decision.

The reason this is strange is because you will often want to enforce ordering without paying the cost of flushing caches. Flushing caches tends to be expensive, right? And there's basically no good way to do this. What if you actually want to? You can write your own file system. This is what the authors of this paper did. They took ext4 and modified it to add an operation that lets you enforce ordering without flushing caches. And when they did this, they found a tremendous performance improvement. But this is impractical for most people writing user-mode software.

All right. So that's the file API. We've seen it's basically impossible to use correctly, even if you're an expert. Now let's look at file systems. One question you might have is, can file systems handle errors correctly? And the most basic test you might think of, not the sophisticated stuff maybe Kyle talked about yesterday, but really basic testing for file systems, is: let's say you write to a file and the disk immediately says, I didn't write anything. Will the file system handle this correctly?

So in 2005, the reference paper looked at this. And the answer is no. Basically no mainstream file system handled this correctly. The one exception was ReiserFS. It actually did OK. But ReiserFS is basically not used anymore for reasons beyond the scope of this talk. We've looked at this in 2017.

This is myself working with a collaborator. Things have improved for extremely basic error testing. Most file systems today will handle this correctly. JFS does not, but you probably won't work with JFS unless you work on IBM big iron systems. So for the most part, the file systems you're running today seem to work OK with respect to this.

Some more testing we might do is to look for more complicated error cases. One way to do this is to look at critical internal file system operations and see if their errors are handled correctly. An example of a function like this is sync_blockdev. The man page says sync_blockdev will write out and wait upon all dirty data associated with the block mapping of a device.

So this is saying that if this operation fails, maybe there's modified data that doesn't get written out, so you have data loss or data corruption. So they looked at a few functions like this that do something critical and return an error, to make it easy to tell whether the error is handled or not. And when they looked at this in 2008, they found these errors were ignored about two thirds to three quarters of the time.

We also did this again in 2017. Things had improved somewhat. Errors were now only ignored one third to two thirds of the time. So we've seen that simple errors tend to be handled correctly, but there are definitely code paths by which errors can get dropped. So where are errors dropped?

Before we get to that, here are a couple of comments that appear next to these dropped errors. One comment is: just ignore errors at this point. There is nothing we can do except try to keep going. This is from XFS, a widely used Linux file system. XFS is the default, by the way, on RHEL and CentOS.

On ext3, and I believe also ext4, there's a comment that says: error, skip block and hope for the best. Ext4 is the default file system for most Linux distros. So back to the first example, we can see an actual concrete case where we have problems. What if you get an error on fsync? For a long time with Linux, there was a very good chance the error would not be propagated at all. If it was propagated, it could be propagated to the wrong process.

Today this is somewhat better. You're likely to get the error in the right process. However, this is unrecoverable. It basically corrupts internal file system state in a way that you cannot fix. If you're using XFS or Btrfs, the data is basically lost and there's nothing you can do about it. If you're on ext4, the data is marked as clean, so the file system thinks it has not been modified even though there are modifications. And so it will just get evicted whenever there's memory pressure and the kernel wants to evict part of the page cache.

If you're very adventurous, you can try to save this data in some cases. You can say, I know I'm on ext4 and then you can either write it to another device, in which case, even though the data is clean, the file system will be forced to write it out, or you could attempt to go into the kernel and mark the pages as dirty. But this is not really recommended. Postgres actually crashes itself in this situation because there's no good way to handle this and you're expected to restore from a checkpoint. Most software does nothing and you just have data corruption.

So that's file systems. We've seen that there are many critical bugs that cause data loss and data corruption. So what about disks? Do disks work? We talked about flushing. We've seen that there are a number of cases where we want to flush data to make sure it actually gets written to disk.

Does this actually happen when we write to disk? So people looked at this in 2011. The authors of this paper were looking at this and they talked to people who work with file systems at Microsoft and someone honestly told them, some disks do not allow the file system to force writes to disk properly.

At Seagate, this is a disk manufacturer, someone told them the same thing, although they did claim that their own disks will do this correctly. NetApp has also seen the same thing. NetApp makes disk appliances, so they have a lot of data on different disks, and they found that some disks will sometimes just not flush. And if you're relying on this to actually save your data, you'll just lose data.

Another question you might have is about the actual error rate. We've seen that file systems will often do bad things when they get errors from disks. So the question is, is this theoretical, or do disks actually return a lot of errors?

If you look at a vendor data sheet for a consumer HDD, often called a spinning metal or spinning rust disk, the stated error rate will typically be 10 to the minus 14. This is saying that for every 10 to the 14 bits you read, in expectation you'll get one error. This is just an average. Errors are highly correlated, so you won't see exactly one error every 10 to the 14 bits. But on average, this is supposed to be true. SSDs typically state better error rates, depending on the class of drive: 10 to the minus 15 for a consumer drive, 10 to the minus 16 for an enterprise drive.

If you wonder what this means, a one terabyte hard drive, which is not that big today, has 10 to the 12 bytes. This is roughly, although not quite, 10 to the 13 bits. So in expectation, if you bought one terabyte of HDD and you read it 10 times, you should expect about one unrecoverable error. If you buy a 10 terabyte drive, not really that big today, then in expectation, every time you read the entire drive, you should expect to see about one error with an HDD.
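The arithmetic here is easy to check. A sketch, using the data-sheet figure of 10⁻¹⁴ errors per bit quoted above (the function name is mine):

```python
def expected_errors(drive_bytes, error_rate_per_bit):
    """Expected unrecoverable errors from reading a whole drive once."""
    return drive_bytes * 8 * error_rate_per_bit

one_tb = 10**12  # a 1 TB drive holds 10^12 bytes = 8 * 10^12 bits

# Consumer HDD at the data-sheet rate of 1e-14 errors per bit:
per_read = expected_errors(one_tb, 1e-14)            # ~0.08 errors per full read
after_ten_reads = 10 * per_read                      # ~0.8: read 1 TB ten times,
                                                     # expect about one error
per_read_10tb = expected_errors(10 * one_tb, 1e-14)  # ~0.8 errors per full read
```

So reading a 1 TB drive ten times, or a 10 TB drive once, gives close to one expected unrecoverable error, matching the figures in the talk.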

So this is data from data sheets. What actually happens in practice? This study looked at 1.5 million disks, and they found that with a bad model of drive, this is a consumer drive, you'd have roughly a 20% chance of a read error after two years. And this is a read error where there's no corresponding write error. So you write the data, the disk says everything is fine, and then you read it and it says, there's an error, I can't read this data anymore.

And one percent of the time, you'll also get silent data corruption. And this is for a bad model of drive. If you buy a very good model of enterprise drive, the stats are more like 4% for the first number and 0.05% for the second. People sometimes say that this is a problem of the past because SSDs have error correcting codes. The error correcting codes store redundant information, and they can fix up errors. This is, for example, the title of an article in The Register. It says, Flash banishes the spectre of the unrecoverable data error.

Let's look at actual measured error rates. Here's a study from Microsoft. They found actual error rates of 10 to the minus 11 to 6 times 10 to the minus 14. Depending on the class of drive, this is four to five orders of magnitude worse than claimed.

At Facebook, people observed the same thing. They saw something substantially worse: 2 times 10 to the minus 9 to 6 times 10 to the minus 11. This is a factor of 500,000 to 5 million times worse than claimed. 2 times 10 to the minus 9, that's on the order of a couple of gigabits, or 250 megabytes. This is saying that for roughly every 250 megabytes you write on a bad drive, not a bad particular instance of a drive but a bad model of drive, you expect to get one error.

And so when people say SSDs don't have these problems because they need ECC, or sorry, because they have ECC, the thing that argument forgets is that SSDs need ECC to work at all. If you look at raw error rates for flash devices, these folks did this in 2012, you'll find raw error rates of 10 to the minus 1 to 10 to the minus 8. 10 to the minus 1 is an error every 10 bits. 10 to the minus 8 is an error every, what is that, sorry, not a large number, 100 megabits, there we go, or 12.5 megabytes. So raw flash is basically unusable. ECC doesn't make the drive super reliable. It makes the drive merely usable.

Another thing people often claim is that SSDs are now pretty safe against power loss, because they have power loss protection. The claim is that the SSD has a capacitor or battery, so if you lose power or crash or whatever, it's OK, because the SSD will safely write out whatever is in its cache and everything will be fine.

So someone went and tested this. They bought six drives that claimed to have power loss protection, and they made a simple rig that would just let them toggle power while writing to the disk. And they found that four of the six models of drive, every drive apart from the Intel ones, had corrupted data.

One more thing to talk about is disk retention. I think people are often surprised that SSDs are supposed to be forgetful. This is by design. If you look at a relatively young drive, one that has 90% of its write life remaining, it's expected to hold data for 10 years after a write. For a drive close to its end of life, it's expected to hold data for between one year and three months after a write. I think people are often really surprised to find that it is correct behavior for a drive to lose data after three months.

And this is, by the way, from data sheets. As we've seen, data sheets tend toward the optimistic. My sense of what actually happens is, if you buy a drive and write it to its end of life, many drives will just brick themselves, and the rest will hold data for somewhere between three months and basically zero time. OK. So we've seen the file API is basically impossible to use correctly. File systems have critical bugs that prevent us from doing basically anything correctly. And disks also corrupt data at a much higher rate than is claimed.

Let's look at a couple of things that follow from this. One is, what can we do about this? That's a whole topic of its own, so I'm not going to talk about a lot of stuff here. Briefly, one suggestion: maybe use a database. If you use something like SQLite, you can use it in most places where you'd use a single file, and it's probably much safer than writing a file yourself.

I'm not saying you should always use a database. There is a tradeoff here. The programming language you like to use probably has file support built into the language or the standard library, and you can grep files; you can't really grep SQLite databases very easily. And yes, there's some overhead here. But if you have data you want to not have corrupted all the time, maybe use a database.
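To make the tradeoff concrete, here's a sketch of stashing a small blob in SQLite via Python's built-in sqlite3 module instead of writing a bare file. The database file name and table layout are made up for illustration. SQLite wraps each change in a transaction, so after a crash you should see either the old value or the new one, not a torn mix, which is the atomicity we spent so much effort trying to get from the raw file API:

```python
import sqlite3

db = "app_state.db"  # hypothetical database file
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value BLOB)")

# Each `with conn:` block is a transaction: commit on success, roll back on error.
with conn:
    conn.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", ("doc", b"a bar"))

value = conn.execute("SELECT value FROM kv WHERE key = ?", ("doc",)).fetchone()[0]
conn.close()
```

The overhead mentioned above is real (you now have a database file you can't grep), but all the undo-log and fsync ceremony from earlier is SQLite's problem instead of yours.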

Another thing is that supporting file systems is pretty hard. At the beginning of this talk, we saw this example where people said, hey, why would Dropbox not support every single file system in the universe? It's very easy because they're all the same. Turns out this is not the case. They're not the same.

Before we end, let's look at some frequently asked questions. While I was prepping this talk, I looked at a bunch of different internet discussions on this exact topic: how do you write to files safely? And if you read a non-specialist forum, so something outside of LKML or the Postgres mailing list, that is, the Postgres developers' mailing list, basically inevitably, multiple people will chime in to say, oh, why is everyone making this so complicated? You should just use this one weird trick and then everything will be fine. Stop making everything so messy.

And so we'll look at the most common one weird tricks from 2,000 internet comments. The first trick is, you can just rename. The idea is that if you don't ever overwrite anything and just rename instead, things are going to be OK. In our earlier example, for instance, we made a copy of the data we were going to overwrite somewhere else, and then we overwrote it.

This is saying do the opposite. Make a copy of the entire file, modify that, and then rename it back into place. And this is supposed to be safe. This, of course, is not safe. I think the reason people say this is safe is because rename is specified as atomic in the POSIX standard. However, that's only speaking to what happens during normal operation. It's not saying what happens during a crash.
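Here's what the rename trick looks like when done as carefully as possible: fsync the new copy, rename it over the original, then fsync the directory. The function name and temp-file naming are mine. Even this careful version is only as crash-safe as the file system and mode make it, which is exactly the talk's point:

```python
import os

def replace_via_rename(path, new_contents):
    """The 'just rename' trick, done carefully: write a complete new copy,
    fsync it, rename it over the original, then fsync the directory."""
    tmp_path = path + ".tmp"

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    os.write(fd, new_contents)
    os.fsync(fd)  # new copy durable before it can replace the original
    os.close(fd)

    os.rename(tmp_path, path)  # atomic w.r.t. concurrent readers, per POSIX

    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    os.fsync(dir_fd)  # push the rename itself out to disk
    os.close(dir_fd)
```

Whether a crash between these steps leaves the old file, the new file, or neither depends on the file system, as discussed below.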

If you look at what actually happens with file systems, most mainstream file systems today have at least one mode where rename is not atomic on crash. I think the one exception to this is Btrfs, but even then it's a little bit subtle.

With Btrfs, rename is only atomic on crash with respect to renaming to replace an existing file; it is not atomic on crash with respect to renaming to create a new file. So you have to be careful. Also, there have been many recent bugs found with rename atomicity on Btrfs. So even if you're writing code that only targets a particular file system, you still have to be careful about this, and basically go test Btrfs yourself.

The second most common one weird trick is similar in some sense. People say, oh, don't overwrite things, just append. This is supposed to be safe because you're not overwriting. This is also, of course, untrue. Appends do not guarantee ordering or atomicity. If you recall the study we looked at from 2014, many of the bugs were actually because people made incorrect assumptions about appends.

So, in conclusion, computers don't work. I'd like to note one thing before I finish here, which is that I think the underlying problem is actually not technical, right? We looked at a bunch of technical problems today, and there are solutions to technical problems, right?

If you work for a really big tech company, like Facebook, Microsoft, Amazon, Google, your writes to files are probably safe if you do them using the standard techniques used inside the company. And people will do things like make sure the disks are actually safe when you ask them to flush. And they will modify the OS or add extra checks so that errors are propagated correctly. And there will be a huge distributed storage group that makes sure everything is replicated properly, et cetera.

And if you go ask someone, hey, why do you do all this? Why do you spend a mind-boggling sum of money to make sure everything is safe? You'll get an answer like, oh, we have millions of computers, and so if we didn't do this, if you calculate the rate at which computers corrupt data, we'd have data corruption all the time, and that would cost us so much more money than we spend on all this stuff.

But the funny thing is, a really big tech company might have what, on the order of 10 million machines. Consumers have many, many more machines. So if you do the same math, it's actually much worse for consumers than it is for tech companies. And yet, if you look at companies that write consumer software, even some of which also do this big cloud stuff, they mostly write files very, very unsafely.

And so they could do the math and say, hey, consumers are losing data all the time. Why don't we fix this? And the problem is that in the first case I talked about, the cost of data corruption basically falls on the company. If a company loses its database of who likes which ad, this is very expensive. They don't want to do that. And if they lose user data, let's say Gmail just loses all its data, people know what happened, and they're like, oh, man, Gmail sucks. I'm going to use something else.

But when people lose data on their own computer, they're mostly not sophisticated enough to go write some kind of analysis tool, figure out that a file was written to unsafely, and figure out who did this. They're just like, computers kind of suck. That's unfortunate. But they don't really know who to blame. And so the cost basically falls on users. And companies can avoid paying the cost for this, and so they do.

Yesterday, Ramsey gave this talk where he described, I think very compellingly, a serious problem. And then at the end he said, oh, by the way, probably this problem won't be fixed. I think fundamentally it's sort of the same thing here. There is a serious problem. People pay a high cost for it. But there's no good way to make money solving this problem. So of course it's not going to be fixed.

We've seen with GDPR that regulation can sometimes force companies to do the right thing in some respects. But regulation is a very big hammer. When you look at the history of regulations like this, often the costs end up outweighing the benefits. With GDPR, I think it's too early to tell. It will often take years or decades before we can actually see what the impact is.

I think designing regulation for stuff like this is much harder than the problems we've talked about today, so I don't really have an answer for this. Thanks to all these people for help on this talk. I got a lot of help. That was really great. I apologize for talking too fast, I know it was a lot of material. If you want to go through it at your own pace, or if there was something you missed, you can go find the transcript. Thanks for your time.