DeconstructSeattle, WA - Thu & Fri, Apr 23-24 2020

← Back to 2019 talks


(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

[APPLAUSE] Hello, everyone. Welcome back. This is going to be Jepsen 11, Once More Unto the Breach.

My name is Kyle Kingsbury, but you might know me as aphyr. And I break databases in my day job. It's terrific. But before I broke databases, I used them. I was an API engineer at a number of public companies. The weird thing about--


Yes. The weird thing about APIs is that they're not physical objects. They're sort of illusions. They're rainbows in the sky. You can behold them, but you can't touch them.

And like all rainbows, they are supported by a physical structure, these steel girders, the code that we write to make the API happen. And those girders are in a firm wooden piling layer, like the VM and the libraries we choose, and those pilings are in turn sunk into a foundation of databases that we found lying around on SourceForge in 2003. Those are here represented by the pile of tires.

And as anybody who's ever run a database in production knows the pile of tires is usually on fire. And it is our job, as operation engineers, to engage in this practice called "information hiding" to make sure that nobody notices the fire. We're going to build the tower of abstraction higher and higher away from the flames.

So out of unreliable components, we want to make reliable systems. And I don't mean just databases. I mean discovery services, queues. These things strike fear and terror into the hearts of engineers, because they do weird things.

That get into split brain. You have broken foreign key constraints where some object you reference isn't present. You have temporal anomalies where you try to write something, and you can't read it, and then you can read it again. How do we know if these things happen?

I mean, you could ask the vendors. And of course, they'll tell you everything's fine. But I'm interested in really finding out. I want to measure our systems and figure out if they're really safe.

So to measure, I've been building this systems testing system. And it's called Jepsen. And it's a toolkit, a framework, for building a distributed systems test.

Your distributed system is going to be comprised of some Linux computers-- or BSD, or whatever else-- connected by some IP network. And they're going to run some processes, just like you would deploy in production. We're going to run an actual database on the system. And the environment is things like clients, things that make network requests to and from the system.

And at the boundary between clients and servers, there should be some invariants that hold. For example, if I something into a queue, it should eventually come out again somewhere. We hope at least once. That's like a nice invariant you'd want to check, right?

So we're going to run a bunch of clients outside the system, and they'll all be making requests into the database, which will itself be distributed. And then we should be able to verify that these invariant hold. But we don't actually know what the database is doing.

It could be written in some source code that we don't have access to. It could be in a language that we don't understand. Or maybe there's no formal model that we can apply. So a lot of the techniques that we would use to verify safety, maybe they're not fully applicable here.

So what we're going to try and do in addition to those formal verification techniques and source analysis, we're going to take this empirical view of the system. We're going to look at the client's perspective. So we're going to have clients generate randomized operations, like write the number five. We'll apply it to the system using a network client, and then we'll get back some response.

And depending on what happens, we'll log one of three possible outcomes. It could succeed, and we know that it took place. It could fail, and we know it did not take place. Or something else could happen.

We could get a timeout or maybe a crash. And in that case, we'll leave the invocation open for all time. Because at some future point, weeks from now perhaps, the operation could take place, and that client is basically stuck.

So we do this concurrently on many clients. We talk to many servers. We build up this concurrent history of invocations and completions. And that history is just a data structure. It's like a log, and we could write a function that looks at the log and tells us if it looks correct.

Maybe we could try to find a path through all the different transactions and system along which it looks like the system were single-threaded. Or maybe we try and verify that everything that goes into a queue comes out. So our general plan of attack is to generate randomized operations, record a history of those operations as they were applied to the real system, and then verify that that history is consistent with an abstract model of the system.

And while this is going on, we're going to introduce faults. We're going to get out our network shears and cut cables. We're going to make poison sandwiches and serve them to our men in our diner.

We're going to sneak into their homes and change their clocks around so they wake up at the wrong time. Because these faults happen in real computers-- not the sandwiches bit, but the network petitions.


So I'd like to talk to you tonight about three tonight. I'd like to talk to you today about three services that basically comprise the last year's worth of my investigations, starting off with FaunaDB at version 2.5.4. FaunaDB was a 100% ACID, according to its home page. It's based on Calvin, which is a replicated protocol for doing serialized transactions across multiple data centers.

And it was actually a snapshot isolated up to strict serializable, which is kind of the highest level of transactional safety depending on exactly what settings, and what types of indices you're using, and whether you're reading or writing. You can always opt up to strict serializable. We expect at a minimum, thing's should be snapshot isolated.

To warm us up, I want to show you a cute little bug. Imagine you're putting numbers into a collection-- -5, -3, 0, 3 5. You read them all back, and you get just 0, 3, 5. Where is -3? I don't know.

Well, this happens, because inside of FaunaDB, when it returns a set of collection, or a set of objects, it starts it somewhere. It says, give me everything from this position upwards. And the starting value that it choose is the integer value 0.

So you get 0, 1, 2, 3, all the way up to max long, but you never go back to the negative numbers. So you wouldn't return any negative values unless you explicitly asked to start at a negative position. This doesn't effect doubles or strings, because all doubles sort above all integers, and all strings sort above those. So this problem didn't really arise unless you happen to be doing requests for negative integers-- fixed in 2.5.5.

Let's get a little more complicated. You want to get a bunch of objects. But it would be a big query, maybe like more than 10 elements. So you're going to get pages of results, because you do want to tie up the database with a long query.

So you get the first, I don't know, 64 or 1,024 results, and you also get a pointer to the next page. And then you can request the next page, and go to the next page, and next page. And you get successively more and more of the data set.

This is how the original FaunaDB test that Fauna themselves wrote worked, and it's also how the JavaScript client works by default. This approach is simple, and beautiful, and also wrong, because each one of these requests happens in a different transaction. So if you're making a request for an object that has changed while you're traversing, maybe the object could jump from one page to another, and you could double or triple count it or never see it at all. Or maybe you observe part of a transaction, but not both parts, because they affect different keys in your result set.

So what you have to do instead is get a timestamp and make every query at that same timestamp. Because FaunaDB is temporal, you can query any point in time. And then you get a consistent view.

So we build the system. We put a bunch of numbers into a collection. We try it out.

We put in 1, 2, 3, 4 in one transaction. We issue a get request for everything, and what comes back is 1, 3, 7. Shouldn't there have been a 2? If we saw 1, 2 should be there as well, right?

Well, it gets worse. We start doing this bank test, where we've got a collection of bank accounts sort of simulated, and each one contains a certain balance. And we transfer money between these simulated accounts by reading both accounts, incrementing a decrement, and then by a fixed number, and then writing them back.

Under a snapshot isolation, this should mean that the total of all accounts is constant over time, but it's not. In FaunaDB, indexed reads would by default observe fluctuating values that were distributed around the correct total, and that's a little strange. That suggests that something like read skew is going on.

In fact, 60% of our reads by default, even without failures, observe partially applied transactions. The reason for this is because the index structure inside a FaunaDB stores temporal information. So it'll have like key x at timestamp 2 had the value 3.

Under certain conditions, when you updated x, it would actually replace the index structure instead of appending a new element. And then if you go to do a read, you wouldn't remember that there was a previous entry for that lower timestamp, and you could skip over some values. And that could allow you to get inconsistent results. This is fixed in 2.6.0-rc9.

Another strange phenomenon, when you put in the numbers 1, 2, 3 4, and you do a read query for everything, and what comes back is 1, 2, 3-- no 4. Every page has a few elements missing from the very end. That's odd.

The reason for this is, because when FaunaDB is constructing a result set for a page, it needs to skip over operations which haven't yet been applied. So it sees a transaction that's not written yet. It's like, oh, we won't put you in the collection. But I'm still going to increment the counter that tells you when I'm done.

So then my counter that tells me when I'm done fills up before I've traversed everything in the result set, and I get back a partial result set. So this bug was also fixed in 2.6.0-rc9. Basically, 2.6.0 fixed all the safety issues we found. It looks like a snapshot isolated to a strict serializable system. This is great news, and I'm really happy.

Now let's try something else, YugabyteDB, version 1.1.9, research from this winter and spring. Yugabyte is as a multi-model database. But the tests that we ran were for its snapshot isolated CQL interface. It makes it look like Cassandra. There's also serializable support and an SQL interface that are kind of works in progress that should be released shortly.

Now Yugabyte is linearizable on a per-key basis and snapshot isolated between keys. It also has linearizable indices, so you should be able to observe immediately consistent effects of your transactions in index structures. But this requires that you have accurate clocks, because multiple parts in the YugabyteDB algorithm requires well-synchronized or "proceeding at the same time" monotone clocks in order to ensure safety.

The first bug that we found, well, the first major one, was a memory leak-- all of these bytes lost, like tears in the rain. We'd have issues where at some random point during a test the server would start to consume like a gigabyte per second of RAM and then explode. We eventually linked this to a bug in libbacktrace, a concurrency error, which we think only manifested when Yugabyte was being run in debug mode. So it shouldn't have affected too many production users. Replacing that with glog fixed that issue.

Another problem, if you start initiating network partitions, latencies will jump and operations will timeout. This is common for databases. What's not common is that when the petition resolves, those latencies would stay elevated, and we continued to see timeouts forever. After a number of partitions, the database would just give up working, and it would never come back.

This occurred because of a race condition. When a node became a leader, it needed to stop running all the tasks that it was running as a follower. And so it aborts every task that's currently running and then waits for all the tasks that are currently running to complete.

But if someone sneaks in and inserts a new task between those two steps, then that task will continue to run and won't know that it supposed to abort. And you'll see it stuck forever, but not in a way that the system can detect and back out of. It still considers itself a leader.

It still responds to like heartbeat messages from other nodes. It just won't do any actual work, so you get stuck. This is fixed in 1.1.15.

Read skew, again, read skew is that thing where you read part of a transaction, but not another part. If you imagine laying down transactions like bricks, it's like one course of bricks is one transaction, read skew looks like a view through the structure which sees part of one course and part of another. It's like seeing x3, but not y3.

So we'd observed read skew in the bank test as well. The total of accounts would fluctuate over time, but in a worse way than Fauna. In Fauna, the fluctuations were distributed around the correct value.

But in Yugabyte, they drifted. They'd skew up or down, maybe doubling your money in 30 seconds or so. That's probably not what you want from a transactional system. And this behavior happened basically by default, just all the time it corrupted data.

This is due to a race condition inside of Yugabyte's homegrown replication mechanism. Where if transaction T1 wants to write to key k, it creates a transaction status record to track the transaction. It goes and writes a provisional record for k. It says, ah, I would like you to eventually take on this value. Not yet, but once I've committed.

Then we come back and we tell the transaction status record, all of my provisional rights are ready. Go ahead and commit. And then you can tell the client that T1 is committed. Asynchronously, you'll go and clean up your provisional record and clean up your status record.

If a read comes in and looks at key k, it could see that provisional record and say, ah, there's a transaction in progress. I need to check and see what's going on with that transaction. So it checks on transaction 1. But by the time the transaction 2 gets there to look, transaction 1's record is gone. It's been cleaned up.

And so transaction 1 says, ah, well, of course. It's been aborted. So the original record clearly did not take place.

Let's just ignore it. So that allows you to unobserve committed transaction state on a per-key basis, which leads to read skew. This is fixed in 1.1.10.

Another case, this one rare. We have lost inserts where you'd put like 3, 4, 5, 6, 7, 8, 9. And if you read everything, you'd see all of those numbers for a while.

And then after a few seconds, like 7 would just fall out. It was just there. Where did it go?

This is associated with a pair of partitions causing a membership change and a leader election. In response to network partitions, Yugabyte would automatically reconfigure the cluster topology. And if a partition happened at the right time during that process, it could lead to an invalid election.

The reason the election was invalid was because Yugabyte extended Raft with this extra notion of non-voting members. They receive updates about state, and they will acknowledge them, but they shouldn't be counted as voting members. They don't really like-- they're not real nodes in the sense of being able to vote on the outcome of the consensus algorithm, except that they could.

The non-voting member would receive an update and say, ah, got it. You're good to go. And then the leader would say, oh, I've got acknowledgment from these non-voting members. I must have majority acknowledgment.

I'm good to commit. But of course, the non-voting members don't actually voting. So a new leader could come to power with regular voting members and override that, causing you to back up and unexecute some of the log entries. This was fixable in 1.1.13.

Again, read skew, the cavalcade of errors continues. This is a rare case that we observed with only clock skew, and it required a fair bit of skew, fairly aggressive. But we would be able to induce jumps in the value over time. So this is another case of data corruption as opposed to just a pure read-only phenomenon.

This occurred because of something like a copy constructor for this internal operative called Lock Batch. Basically, when you copy to Lock Batch object, it would reset the status field to acquired no matter what the actual status should have been. So you would acquire a lock, and it would come back acquired even if you didn't actually acquire it. You could only trigger this during a timeout that was caused by this clock skew phenomenon. So that was fixed in 1.2.

As far as I know, all the safety issues that we found were addressed in 1.2.0, and YugabyteDB now looks like a snapshot isolated implementation of CQL. So that's pretty darn cool. The last database I want to talk about is TiDB at version 2.1.7.

It's an SQL database based on Google's percolator, also snapshot isolated. And like Yugabyte, it relies on wall-clocks for safety. It has a leader lease algorithm, which needs clocks to advance at the same time. Even if they're not perfectly synchronized, they have to count the same number of seconds per second.

Unfortunately, TiDB exhibited a bunch of weird behavior by default, even without failures, including read skew, other cases of G2 anti-dependency cycles, lost writes, you can rewrite history. It has anomalies I didn't even have names for until now. So again, you'd get cases in the bank test where like the value would fluctuate wildly, so it kind of corrupts state over time.

It would do this weird thing where we'd get like transactions that had strange cycles in them. Imagine that our values aren't just single numbers or single strings. Imagine they're in lists. So like value x is the list 1, 2.

I'm going to write r for read and a for append. So we're going to read x and see the value 1, 2, append number 3 to x, and then we'll read x again and see 1, 2, 3. Now, take this three set of transactions.

We've got a read of 2, 1, append to y5, append x4, transaction 2 appends x5, transaction 3 reads x and sees 2, 1, 5, 4. A quick show of hands, who thinks this is a legal execution under snapshot isolation? Who thinks this is not a legal execution? Who has no clue?

That's me, because like I don't know what these transactions mean. Like how am I going to figure it out? So let's look at this carefully.

In order for us to read the number 5, as long as all of our inserts are unique, we know that that read has to happen after the right of 5. So append x5, that had to precede read x 2, 1, 5, 4. And the same thing holds for the top transaction. That one we needed to execute before the bottom transaction. So far so good.

Now consider that reading x 2, 1, 5, 4 implies that the append of 5 happened before the append of 4. So that means that we've got this right-right precedents relation between the middle transaction and the top one. You have to append x5 first, and then you can append x4.

But because we got 2, 1, 5, the append of x5 had to happen immediately after the read of x 2, 1. It's kind of like that append x5 snuck into the middle of the top transaction. Like it both depends on and is also depended on by each other.

It's a cycle. This is illegal. And it gets worse.

We get reads like this-- 1, 2, 3, 4. And then we'll append 7 and get 1, 2, 3, 7. Hang on a second. How do you append and get two separate values both with four elements in them?

And then we do another read, and we get 1, 2, 3, 4, 7. What is this nonsense? In order to get both 1, 2, 3, 4 and 1, 2, 3, 7, we had to append 4 and 7 to separate copies. It's like the universe split in two.

So we've lost this idea of having a single thread of history. And moreover, when we get 1, 2, 3, 4, 7, that means the append of 7 had to either be merged back together or like somehow we applied the append of 7 twice, once in one fork of the history and once in another. Something is deeply wrong here.

This occurs, because in snapshot isolation what's supposed to happen is that the append of 4 sees 1, 2, 3, goes to commit, writes 1, 2, 3, 4. The append of 7 tries to do the same. It generates the list 1, 2, 3, 7. It goes to commit, but it fails.

It should abort, and the reason is because of a principle called "first committer wins." In a snapshot isolated system, you can't write back your changes if the things that you're writing have been altered since you read them. You don't want to step on someone else's toes.

So the append of 7 should abort, and it does. But in order to hide aborts from users, TiDB includes a retry mechanism by default, which just says let's try that write again and see what happens. So it reads the current value, 1, 2, 3 4. It tacks a 7 on the end and writes it back, and this time it commits.

Problem solved-- users never notice, except you've got these weird inconsistent transactions and corruption of state. But you can disable this. In fact, there's a documentation flag for it-- tidb_disable_txn_auto_retry=1. You set that value, and it does exactly the same thing, because--


I jest. It actually does a slightly different thing. There are two separate retry mechanisms which do exactly the same thing.

One of them applies all the time. And one of them only kicks in if a node is unreachable, or slow, or maybe just hiccups by a few milliseconds. Like it happens in healthy clusters too, but just less often.

So it's not enough to disable it with this flag. You actually have a second flag you need to alter. It's bad.

The fix here is to make sure the disable flag applies to both systems and to disable them by default, because it violates snapshot isolation. And this is true. In 3.0.0-rc2, both of these mechanisms are disabled by default. And TiDB, through all of our testing, appears to be snapshot isolated. So this is great.

To recap, what have we learned? When you choose a database, whether open source or commercial, you are essentially signing a contract with the vendor, here represented by Ursula, the sea witch. And the contract includes things like you need to provide a network that never partitions. That's in the RabbitMQ documentation.

Or maybe the system provides a snapshot isolation. But, by the way, did you remember there's a retry mechanism in there, which may or may not break all your guarantees? So you got to read these docs really carefully to find these cases. And then throw that all away and do it for yourself, because sometimes the documentation is wrong.

Look out for words like "strict" consistency, "strong" consistency, "ACID." These are words that marketing people love to put on databases that don't have well-defined meanings. So if you see these absent the context of some more formalized guarantee, like strict serialize ability or snapshot isolation, you can take a hunch that this might not be as safe as you'd like.

I want to encourage both vendors and users to be formal and specific in the guarantees they expect and provide. And also, to recognize that not everything needs 100% ACID, 100% serializable. Sometimes snapshot isolation or read uncommitted is perfectly sufficient. So think about when that can be the case.

When you test these systems, you can't just test on the happy case. You've got to test with failure, and that means things like process crashes, which you can simulate with kill-9. That means node failure. You can try AWS terminate. You can turn off the power. You can do clock skew with date or by injecting libfaketime, which is an LD_PRELOAD shim that lies about the clock to your process. It works some of the time. I know, I really wish it were better.

There's garbage collection, IO pauses, which you might think like everything garbage collects, right? Like we all use Go. We're cool kids.

But it turns out that garbage collection actually looks a lot like a network partition, because it creates this pause in execution. And if you spend some time digging through like the Elasticsearch mailing list, you can find all sorts of instances of data corruption caused by garbage collection. So use SIGSTOP and SIGCONT to simulate process pauses.

Network petitions I think are the gold standard for introducing failure to distributed systems, because they create large windows of concurrency and message passing. IP tables, NF tables-- any kind of firewall will allow you to introduce these. It's not enough for us to test our systems component by component. You can't just look at the Raft implementation alone.

You have to look at Raft plus the coupling mechanism that's does cross-shard transactions. You have to look at the I/O subsystem. You have to look at the cache. You have to look at the API that you glued onto the top of it. So try to think about making high-level and invariants that surround your entire system, because not all these systems have well-defined compositional laws.

Ultimately, I want us to do more property-based testing. So don't just say x equals 2. x should be 2. Try x equals random number, and then see that you can read the same thing back. Because then you'll find out that negative numbers can be read back.

Use high-level invariants on your system. And when you do this testing, use distributed systems failure modes. Because these things happen in production all the time.

Finally, if you look for perfection, you will never be content. We could spend years and years boiling the ocean trying to design better tests and formally guaranteed systems. But oftentimes, there's a lot of low-hanging fruit.

Several of these databases lost data without any failures at all. Just by running normally, they could corrupt state or return illegal results. So if we do a little bit of testing, we can clear out a lot of that low-hanging fruit and get better safer systems for everyone.

I want to make a special thank you to Fauna, to Yugabyte, and to PingCAP, who sponsored the analyses of FaunaDB, YugabyteDB, and TiDB, respectively, to Jessie Frazelle, Kit Patella, Tim Kordas, and Keyur Govande, who all provided technical support, and to Peter Bailis, Peter Alvaro, and Diego Ongara, who were instrumental in discussing a lot of these invariants and theoretical models. Thank you all very much for coming. If you'd to learn more about this research, you could do so on