
Transcript

(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

So I'm Caitie McCaffrey. I am a distributed systems engineer and I work at Twitter.

It's actually super cool to be back. Cause I used to live in Seattle for seven years and work on Xbox games in a past life.

But really, I am a feral concurrency control programmer. Because this more accurately describes what I do on a day to day basis.

So what is feral concurrency control? It's a great term that Peter Bailis coined in his 2015 paper of the same name. And basically-- you can go read the paper, it's great. But the gist of it is, feral concurrency control is this application-level mechanism for maintaining database integrity.

And when this paper first came out some people on the internet got mad. Cause that never happens. And they found the term kind of insulting. But I thought it was fucking hilarious. And I like, emailed Peter. And I'm like, oh my God. This is the best thing ever. Because it totally describes what I do on a day to day basis.

I go and I take a business task from like a PM or a customer, and then I translate it into a service or a system. Which reads and writes data to one or more NoSQL databases with pretty weak consistency guarantees. And I've built a lot of these in my career, right?

So I started out building the Halo 4 statistics service. Which has a fairly simple requirement. Which is, record aggregate statistics from every multiplayer Halo game a user plays. You know, you only want to process them once, because we don't want to double count things. And you don't want to not process a game, because that's always the best game the person has ever had. They get very mad.

But basically we made this Rube Goldberg machine to make sure that this happened. This involved the actor model-- actually productionizing an MSR research project. It involved the saga pattern, idempotency, carefully crafted document keys, and batch transactions within a partition. This is kind of insane.
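
Just to make the idempotency piece concrete, here's a minimal sketch of the idea-- processing each game at most once by deriving the document key from the game itself. This is not the actual Halo 4 code; the store client and field names are assumptions for illustration.

```python
# Hypothetical sketch of idempotent processing via crafted document keys.
# The `store` client and its methods are assumed for illustration.

def process_game_result(store, game_result):
    # Craft the document key from stable identifiers, so replays of the
    # same game always map to the same document.
    doc_key = f"stats/{game_result['player_id']}/{game_result['game_id']}"

    # If the document already exists, this game was processed before;
    # skip it instead of double counting.
    if store.get(doc_key) is not None:
        return

    aggregates = {
        "kills": game_result["kills"],
        "deaths": game_result["deaths"],
    }

    # Conditional insert: only succeeds if the key does not exist yet,
    # so two concurrent replays still produce a single write.
    store.put_if_absent(doc_key, aggregates)
```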

Let's look at another system that I've built. So I'm the tech lead-- or was the tech lead-- of Twitter's observability stack. I'm working on some of the abuse stuff now. But basically our stack-- when Twitter originally built this, we were in only one data center. And now we're globally distributed. So we wanted to make our dashboards and alerts always available, even if a single data center went down. Sounds easy. Just replicate it, right? No. Because this is what the Twitter observability stack looks like.

We do use Manhattan, which is our database. But we also have many bespoke, artisanally crafted indexing services. We also do roll-ups into HDFS.

So in order to actually replicate this correctly across multiple DCs-- and just show you a time series graph-- it involved a lot of idempotency, causal chaining of events in the application logic, and me pulling my hair out.

Or this service, which we just launched, because I'm helping with some of the abuse efforts at Twitter now. And we want to notify you when you receive an abuse report. Because we take that really seriously. And we want to give you more visibility into the process.

And then if someone you've reported gets actioned against, we also want to tell you about this. Right? So that we can sort of show you what's going on. And this sounds super simple. Like, just send a push notification to your phone.

But no, because it requires all of this garbage. Which involves a combination of compare-and-swap operations, eventually consistent immutable lists, reverse order [INAUDIBLE], causal chaining of events in the application. Getting the data model right on this actually took several iterations. It's one of the reasons I came over to help out. Because there is time, and idempotency, and all this crazy stuff.

So feral concurrency control is 100% the exact term for what I do on a day to day basis.

So let's take a step back. How did we end up in this world of feral concurrency control? Why is creating distributed systems so hard?

Once upon a time we constructed monolithic applications that ran at a scale that could safely fit inside a single relational database. The database was magical. It wrapped up all our consistency guarantees and tied them up with a nice little bow. And I didn't have to think about this. I just made a horizontally scalable service that we would just add more instances of. And you hoped that you could buy a big enough machine for the database to fit on.

In 2010 when I started working on the Halo services, we were in this world. Halo 1, 2, 3, and Reach had all relied on a single, massively vertically scaled SQL database to provide the consistency guarantees. And do all the things like show you your scores, and what weapons you had, and all that nice stuff.

And this was soon becoming untenable for the scale that we wanted to grow to, and all of the different things we wanted to provide. And this is sort of what happened at the same time in the rest of the industry.

So we saw the rise of NoSQL databases, because basically the industry decided that availability was queen and that we had scale requirements. This pushed us to use NoSQL and give up a lot of our consistency guarantees. But at the same time, we could now generally scale beyond a single machine. And we also had higher availability because we were replicating our data more.

And then NoSQL basically told us, take control. Take control of your transactions. Behold the instrument of your liberation. So this is him describing feral concurrency control back in 2014.

So that was the promise.

We also at the same time saw a re-emergence of service-oriented architecture, embraced as microservices. And this sort of came about for a lot of reasons. I think the advent of cloud and things like that was super critical. Because it just made it really easy to deploy a single machine that was doing its own thing. And this also came about because it allows teams to iterate and ship new code faster. It reduces coordination points between people. And allowed technical organizations to scale. So you could have totally different teams working on entirely different things, and shipping them independently of one another. Which may be problematic. But whatever.

So now we are living in this world. Of a bunch of microservices all running their own bespoke NoSQL database for the consistency guarantees that they want.

But these two things rose and became very popular at the same time, and it wasn't a coincidence, right? They both solved this problem of high availability. And when I say availability, I don't mean availability in the CAP sense. I mean availability in the industry sense, which means it has to return a value and it has to return it within a certain amount of time. Both require us to trade off strong consistency. Right? You've denormalized all of your data across microservices in different databases. But you get higher availability and greater flexibility.

But this comes at a cost. Now, because data denormalization is the norm between these services in different data stores, we have a lot of consistency challenges. Because you now have multiple copies in all these different places. But it's also nice because now you can store data in the format it needs to be queried in, and then that can be really fast.

So we've basically traded one set of problems for another one. So now we have to talk about consistency models.

So when I started learning about consistency models, I came across this sort of hierarchy. This is a tree with linearizability-- which is a total order-- at the top. This is kind of the gold standard. This is also from Peter Bailis. And this seems manageable. Right? Like there's a handful of them. You could reasonably go and read about them and understand them. Except for pipelined random access memory. Because no one fucking knows what PRAM is, and I still haven't heard a coherent definition of it.

But the rest of them are fine. But then you start diving down this rabbit hole. And this is really the current landscape as of 2016. So this is a diagram. If you can't read all of it, don't worry, because it's insane. Linearizability is still at the top. Weak eventual consistency is at the bottom. And these are all the different consistency models you could pick in between.

So A, this is terrifying because I don't want to learn all of these. And B, it's basically been a whole year since this was published. There are at least five to ten new ones, because people need to write PhD theses.

So-- So this was depressing, so I decided to take a break and play this meme that was going around on the internet when I was writing this talk. So they're like, post a picture of you from multiple years ago, and see how distributed systems and dealing with all these consistency models has basically ravaged me over time. So let's play.

This is me in 2010. I'm at a Halo Reach launch event. And I'm super bright-eyed and bushy-tailed, and I don't know what's about to hit me. And I was really excited. And then this is me in 2014, after successfully shipping Halo 4 and currently working on the new catalog metadata system for HBO. Which is not one of the Rube Goldbergs I showed you, but trust me, there's one there.

And then, you know, this is me today. But I guess my hypothesis was wrong. Because just like the inimitable Jenn Schiffer, I too just keep getting hotter and smarter. But I digress. Let's go back to distributed systems.

So basically, today I fundamentally believe that programming distributed systems is too expensive and time consuming. And we fundamentally aren't spending enough time as an industry or as a group trying to make it easier. Understanding the myriad of consistency models is complicated and cumbersome. It distracts developers from focusing on what matters-- building a great experience for users. Or enabling-- basically it needs to be a tool. Right? Like I want to see people create art. Or solve cancer. You know, prevent cancer. Or all these different things. And the fact that we have to have an army of developers just to show you a notification on your phone is ridiculous.

So how do we do this? And how do we make this less error prone? How can we do better? I want to take a step back and look at another industry that has sort of been revolutionized by technology. Finance and accounting used to be incredibly, incredibly expensive. And you didn't get that much out of it, other than you had a bunch of books and ledgers where you recorded every transaction. And you could sort of-- It was basically bookkeeping. You were just doing all this manual math. And you would know how much data-- or money you had collected and gained, and all this stuff.

One of the big protocol inventions, made in 1340 because this process of keeping ledgers was incredibly error prone, was double-entry bookkeeping. So basically, for every transaction, a debit and a credit must be applied for the same amount. So you have a left column and a right column. And the nice thing about this is that at any point in time those columns should add up to be exactly the same number. If they don't, then you know that you've made an error somewhere in your accounting.
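
Here's a tiny illustration of that invariant-- not from the talk's slides, just a sketch-- where every transaction posts matching debits and credits, and the balance check catches mistakes:

```python
# Tiny illustration of double-entry bookkeeping: every transaction posts
# equal debits and credits, so the two columns must always sum to the
# same total. (Purely illustrative numbers.)

ledger = [
    # (account, debit, credit)
    ("cash",     100, 0),    # customer pays us $100...
    ("revenue",  0,   100),  # ...recorded as revenue on the other side
    ("supplies", 30,  0),    # we buy $30 of supplies...
    ("cash",     0,   30),   # ...paid out of cash
]

total_debits = sum(debit for _, debit, _ in ledger)
total_credits = sum(credit for _, _, credit in ledger)

# If these differ, an error crept in somewhere in the books.
assert total_debits == total_credits
```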

So this is cool because it made it less error prone. Or at least helped you detect the error. But it also doubled the amount of work you had to do. Because now you had to add two columns. And so bookkeeping and accounting was basically this manual, by-hand work. It was very hard to do anything sort of future thinking, just because of the sheer amount of math and human effort that was required to calculate all these numbers.

So in 1979, VisiCalc was invented. This was the first digital spreadsheet program for personal computers. It came with functions like SUM. And you could create simple formulas using arithmetic and cell names. One of the creators, Dan Bricklin, actually created this while he was doing an MBA. And he basically noticed-- he's like, this is sort of ridiculous that people are on the whiteboard-- or the chalkboards, sorry-- drawing all this stuff. And if you made an error, you'd have to go back and erase all of your work. And fix it and redo it. And so you were less focused on the task at hand, like the actual problem you're trying to solve. And more focused on just doing this math correctly. Because that was all your brain could handle.

So he and his partner went and created VisiCalc, which is this digital spreadsheet. In a TED Talk he gave, he tells the story about how he was doing this while he was doing his MBA. And he didn't tell his professor he was making this digital spreadsheet. But he came in with the homework completed, and had done five-year forecasting models, and all of these projections. And the professor was super shocked. And he got an A. He was shocked because the sheer amount of work it would have taken to do that all by hand was just untenable. People didn't do that. People weren't super future-thinking. Because it just took too much work. So VisiCalc really reduced the human power necessary to do the manual task-- the actual task of bookkeeping. And allowed you to focus more on what finance was.

In 1983, Lotus 1-2-3 came onto the scene and added charting, plotting, database capabilities, and introduced named cells and ranges. It also added macros. This is where it really changed the game. Because now we had some compute that you could pass around. Also, copy-paste was a thing, which was a big game changer. Which sounds dumb. But there was a time where you literally just had to go and hand-enter everything in VisiCalc.

I also want to point out that this is like a floppy disk with jet engines strapped to it. I think we need more advertising like that. Oh, also-- in case you thought taking out full page ads to write really long notes to your competitors was something this generation of tech invented, think again. This is Microsoft Excel entering the scene in 1987, writing a note to Lotus 1-2-3 users to say, hey, please use Excel.

Excel came onto the scene in 1987. It supported macros and user-defined functions by default. I like this stat just because I think it's hilarious: it also supported 16 spreadsheets. You could have 16 different spreadsheets open at the same time, with 384 rows and 256 columns. So that was twice as many cells as anything else. And Microsoft offered this rich user interface. And so basically, they became the dominant player in the game and they really changed finance. So let's segue a little.

This is my mom. She actually started her career in finance. And it was interesting, because I was talking to her before this talk about this whole change in how digital spreadsheets changed everything. Because she started back in the world where they did a lot of stuff by hand. You would literally employ divisions of people just to maintain these books and double check them. Each division had their own way of keeping and creating financial reports. And so then you'd have to figure out how to make these-- I guess you could think of it as an API layer-- to roll them up in between.

And then she was telling me how much just copy and paste changed the game. And macros changed the game. And now you could share these templates and create custom standards. Excel wasn't prescriptive. But it did enable them to free themselves from the day to day things. Just declaratively say, here's the computation I want to do. And then worry about forecasting, and worry about how they could grow their business, and sort of change numbers there.

This was a huge win for the industry. Because basically, now you didn't have to hire as many people to just do the basics. And you actually got more value from every single person in your finance organization. And so this allows businesses to grow faster, and integrate faster. And so this was super huge.

So when I look at this, I'm like, OK. How did Excel do this? How did digital spreadsheets do this? What lessons can we take and apply to distributed consistency? Because I think we're still in that same world. Where it takes way too many people just to build a simple system, to answer a simple question.

So one of the big things that I think they did is they standardized on one tool. And I'm not saying one tool or one language has to win. But we have to start agreeing on some things. There's just way too many things. And this simplified the process-- at least within a single organization-- of creating and rolling up reports across the whole organization. And there was a single way to talk to each other.

The other big thing is sort of this idea of computer-aided computation. And I almost put declarative programming here, but I don't think that's exactly what I mean. Sure, you told it the numbers. But then macros were also part of this. One of the first big wins of digital spreadsheets was eliminating the need to perform error-prone, time-consuming calculations by hand. And then also letting you try things out by just changing numbers and letting it rerun. And then also defining common functions and macros. So this really changed the game.

So can we apply these same lessons to distributed programming? How do we standardize distributed systems, right? And consistency models in particular. Maybe there's one right consistency model. So distributed programming actually has some research done on this. This is one of my favorite papers that no one knows about. This is Argus. It's from Barbara Liskov. It was published in 1988 in the Communications of the ACM. And this is actually a summary paper of all of her work. There are actually seven or eight papers leading up to this, if you want to go read all of them. But basically the idea here is she actually tried to fundamentally solve this and add consistency models to a programming language and a runtime. Because she recognized that, hey, distributed systems are going to be really complicated. We don't want every single developer having to think about this. We just want to give them some basic guarantees.

So the language was built around the concept of guardians. So you can think of these as individual services that run. And handlers which are APIs on top of those services. So you basically had typed APIs, and then you could deploy multiple guardians throughout your system. And they would all sort of know how to talk to each other.

This system is incredibly opinionated. Liskov believed that the form of communication that is needed is remote procedure calls without [INAUDIBLE] semantics. And then she also very much believed that you had to provide atomicity and serializability for developers to even understand what was going on, and to make it conceivable and easy to work with. At the end of the day, the code you wrote if you wanted to do a transfer between two bank accounts was basically this. And the system would actually handle all the distribution for you. And that's pretty cool. This is my attempt to write Argus. It might be a little wrong, but it's based on what I could glean from the papers.

So basically the guarantee that Argus gave you is that either branch A removed $50 from account 123 and added it to account 789, or none of those things happened, and you never saw a world in which they had partially completed. So that's a pretty strong guarantee. Argus didn't take off. It was a little before its time. And it also has some problems, right? It is especially prone to deadlocks. And so there's still that complexity people have to deal with.
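
If you want a feel for that guarantee without the actual Argus syntax, here's a minimal, single-process sketch in Python-- the account numbers and balances are made up, and a real guardian-based system would be doing this across machines:

```python
# Not real Argus: a minimal, single-process sketch of the guarantee.
# Either both updates apply, or neither does; no partial transfer is
# ever visible. Account numbers and balances are made up.

accounts = {"123": 200, "789": 50}

def transfer(src, dst, amount):
    # Validate and stage all changes up front.
    if accounts[src] < amount:
        raise ValueError("insufficient funds; nothing was changed")
    staged = dict(accounts)
    staged[src] -= amount
    staged[dst] += amount
    # "Commit" by applying the staged state in one step, so no one ever
    # sees money that left one account but never arrived at the other.
    accounts.update(staged)

transfer("123", "789", 50)
assert accounts == {"123": 150, "789": 100}
```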

So let's go to modern times, and see if maybe continuing down this path of strong consistency is the way to go.

What about Spanner? Spanner is Google's globally distributed relational database. It provides highly available distributed transactions. And it relies on a TrueTime API and massive amounts of hardware. We're talking fiber between data centers, GPS, and atomic clocks installed on individual machines. Spanner is actually open source and available to the public. Which is cool. You can use it as part of-- sorry, it's not open source. You can pay to use it via Google Cloud. And it offers five nines of availability.

So you'll hear people talk about how it's beaten the CAP theorem. It hasn't. It just threw a lot of money, and hardware, at it. So that you have enough redundancy that you can actually provide very, very high availability guarantees. And Spanner is actually excellent. I'm a huge fan of this. This is a true feat of engineering.

I would also like to point out it works because the earth has a limited circumference. So when we go into space, all the bets about this are off. But it is truly a great feat. And one of the things that Google says is that it has made their developers' lives easier. Because now they get to program in the way that they're used to, with ACID and SQL transactions, and things like that. So maybe Spanner's the right way to go.

But then you have this competing-- or you have this paper that was actually published in 2015 from Facebook, where they discuss why they haven't adopted strong consistency-- including linearizability and causal consistency-- in their services. And it's a seven page paper that's kind of interesting. But it basically boils down to: the biggest barrier is that consistency mechanisms must integrate across many stateful services. And so basically this is, we denormalize data so that we can do effective querying. Because no single data layout is efficient for all workloads, especially at scale. Strong consistency across all their systems would result in higher latencies, because there's just no way to achieve strong consistency without talking to all of those different systems. And now you're basically bounded by the slowest machine in the cluster. And then they're also arguing that users will tolerate inconsistencies over higher latency.

And so basically what this comes back to is, microservices are once again fundamentally at odds with strong consistency. And I sort of agree with this, right? Strong consistency across systems would require a protocol to do distributed transactions. We have, once again, already done this. It's called two-phase commit. And basically it got abandoned because it doesn't scale and has a bunch of problems. So no one adopted this.
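
For reference, here's a bare-bones sketch of the two-phase commit flow-- purely illustrative, with in-memory participants. The pain shows up in everything this sketch leaves out: in a real system the coordinator or a participant can crash mid-protocol, and then everyone else blocks holding locks while they wait for the decision.

```python
# Bare-bones two-phase commit sketch with in-memory participants.

class Participant:
    def __init__(self, name):
        self.name = name
        self.prepared = False

    def prepare(self):
        # Phase 1: promise we can commit (locks taken, write logged).
        self.prepared = True
        return True

    def commit(self):
        assert self.prepared

    def abort(self):
        self.prepared = False

def two_phase_commit(participants):
    # Phase 1: ask everyone to vote.
    if all(p.prepare() for p in participants):
        # Phase 2: everyone voted yes, so tell them all to commit.
        for p in participants:
            p.commit()
        return "committed"
    # Any "no" vote (or, in a real system, a timeout) aborts everybody.
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit([Participant("orders"), Participant("payments")]))
```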

And then even fundamentally more than that, highly available, strongly consistent systems can still have huge problems like contention and deadlocks. These are nontrivial to solve. And you actually have to think about them and then tune your system. So it's basically, where do you spend your effort tuning it? Are you setting up a new service? Or are you going to go and try to throw everything into a transaction? Which, once more and more of the world is in this transaction, gets more expensive to compute. So I still think we're going to need strong consistency, but it's not going to solve all of our problems.

If we go in the total opposite direction and remove coordination completely, we have CRDTs. CRDTs are conflict-free replicated data types. And the idea here is no coordination is required. Writes can happen at any replica. And updates propagate in any order.

Each replica will always deterministically merge to the same thing. So this is nice. It's highly available, because as long as you can talk to one replica, updates can get propagated in any order, and all the replicas will end up being the same thing. So this is cool.
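
Here's a minimal sketch of one of the simplest CRDTs, a grow-only counter-- not any particular production implementation, just the shape of the idea. Each replica increments its own slot, and merge takes the element-wise max, which is why the merge order doesn't matter:

```python
# Minimal grow-only counter (G-Counter) CRDT sketch.

def increment(counter, replica_id, amount=1):
    counter = dict(counter)
    counter[replica_id] = counter.get(replica_id, 0) + amount
    return counter

def merge(a, b):
    # Element-wise max is commutative, associative, and idempotent,
    # which is what makes merging order-independent.
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

def value(counter):
    return sum(counter.values())

# Two replicas take writes independently...
r1 = increment({}, "replica-1")
r2 = increment({}, "replica-2", 2)

# ...and converge to the same value regardless of merge order.
assert value(merge(r1, r2)) == value(merge(r2, r1)) == 3
```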

There are also research languages like Lasp which are trying to provide a programming model around this. To make it a little easier for developers to grok, and not have to implement all of these on their own. And the research here is promising. And CRDTs have actually been pretty successful for something that came out of academia. Because they've actually been implemented in production.

CRDTs still have problems. Like, they grow monotonically unbounded-- that's proven mathematically. So typically in industry you just do your round of [INAUDIBLE] at some time and shrink the state. And then keep growing. But they also have another problem, because once you start working with them, it's very clear that you start wanting to do things like debit from a bank account and not go below a zero balance. Or like, assigning unique user IDs. And you can't do that with CRDTs. It's not a thing.

So basically what we're getting at here is there's no one-size-fits-all consistency model. So this attempt to maybe standardize on just one-- and you'll hear causality; I'm not going to talk about it in this talk. I still think there's problems there.

Distributed systems are fundamentally about trade-offs. And we need to allow people to make those trade-offs. Because really, in industry, availability is the most important thing. And so we need to be able to make trade-offs around that, along with our business requirements and things like that.

So one of the things that I think is interesting, and where I think the effort to standardize should go, is these mixed-consistency data stores. I super like this because it's basically this idea that data stores will actually implement multiple consistency models, and then you get to choose from them when you're implementing. And so while you still have to understand different distributed consistency models, it's super nice because you only have to learn one set of APIs. And you don't have to learn how to operate multiple different things.

So Twitter's Manhattan key-value store has this. It started as just something similar to Dynamo or Cassandra, where it's eventually consistent. But recently they've added strong consistency guarantees on a per-key basis. So you can do compare-and-swap operations. So that's kind of nice. Because we didn't have to teach our developers how to go and hook up to ZooKeeper to get the strong consistency. And so they can just use the database they already know, and it makes them more productive.
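
Here's a sketch of how that per-key compare-and-swap typically gets used-- read, modify, and only write if the value hasn't changed underneath you. The client API here is hypothetical, not Manhattan's actual interface:

```python
# Hypothetical compare-and-swap retry loop against a key-value client.
# The `client` methods are assumptions for illustration.

def add_follower(client, key, new_follower, max_retries=10):
    for _ in range(max_retries):
        current, version = client.get_with_version(key)
        updated = (current or []) + [new_follower]

        # Succeeds only if the value still has the version we read;
        # otherwise someone else won the race and we retry on fresh data.
        if client.compare_and_swap(key, expected_version=version, value=updated):
            return updated
    raise RuntimeError("too much contention on key: " + key)
```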

Microsoft's Azure DocumentDB is also very cool, and they do this as well. They actually provide four different consistency models for you to choose from: strong, bounded staleness, session, and eventual. So if you're interested in that. Once again, this is nice because you learn one API set. You don't have all this ramp-up time. And it allows you to make the trade-offs yourself.

I also think it's nice because it hopefully prevents you from building systems like this. I mean, this is a joke, but it's also kind of true. Because you basically are forced into this world right now, where you pick whatever database supplies the artisanal consistency guarantee that you need for your thing. And so if you need multiple, you have to stand up multiple databases. And that's a huge pain in the ass. So basically what these mixed-consistency systems give you, and where I think we can start standardizing, is lower cognitive overhead and less operational cost. Because now you just stand up one of these instead of eight of them.

OK. So what about computer-aided computation? Where can computers help us build distributed systems? What I talked about still places a ton of the load on the developer, even if we all standardized on a couple of different tool sets. Because you still need to understand the trade-offs you're making, and that's actually kind of error prone. Because there's a lot of little nuances and minutiae to picking the right one. And so, how do we make this less error prone? How do we help people pick the right thing? What if the computer can help us do this?

And this isn't actually super far off. There was this paper published at POPL in 2016 called "'Cause I'm Strong Enough." I like this paper because I'm pretty sure it's named after a Cher song. But I also like it because it's trying to help developers pick the right amount of consistency. So what you do is you basically specify your application. So say we wanted to write a bank application. You say, my bank account must be greater than 0. I'm going to define an invariant that has to hold true always. And then I say, I want to allow people to deposit money and withdraw money. So this is a replicated system. We want it to be highly available. So you probably want to be able to do these operations at different replica sets in the system.

So what [INAUDIBLE] provides you is a static analysis tool. And you input your invariants via Java annotations. And then you would describe the bank account here, where the invariant is that the balance must be greater than zero. I'm going to do debit and deposit. And then you just define some other things about your input parameters. Like, amount has to be greater than or equal to zero. So we're not going to do anything funny with adding negative numbers.

And so you would run this, and [INAUDIBLE] would actually return you an error and say, hey, your system is broken if you implemented it this way. Because coordination is required to do debit. It's actually impossible to keep the invariant you specified-- that the bank account must always be greater than zero-- and do debit successfully without it. Because we have to have all of the replica sets agree on what the current bank balance is, and then all of them have to agree to do this debit. Otherwise you have a race condition, and you can get the bank account to go negative.
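
Here's a tiny sketch of that race-- made-up numbers, nothing to do with the actual tool-- where each replica checks the invariant against only its local copy, both withdrawals pass, and the merged result still goes negative:

```python
# Sketch of the race the analysis flags: each replica validates the
# "balance >= 0" invariant locally, both debits pass, and the merged
# outcome still violates it. Numbers are made up for illustration.

balance = 100  # replicated starting balance

def local_debit(current_balance, amount):
    # Each replica only sees its own copy when it checks the invariant.
    if current_balance - amount >= 0:
        return -amount  # accepted: this replica records the debit
    return 0            # rejected locally

# Two replicas concurrently accept a withdrawal of 80.
delta_a = local_debit(balance, 80)   # replica A: 100 - 80 >= 0, accepted
delta_b = local_debit(balance, 80)   # replica B: 100 - 80 >= 0, accepted

# When the replicas sync up and apply both debits, the invariant breaks.
merged_balance = balance + delta_a + delta_b
assert merged_balance == -60  # below zero: coordination was required
```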

So this is nice because it's allowing you to focus more on the system you're building and less on the consistency models. And the tool will help you pick the right one. So you still have to go implement this. But I think this is a huge step in the right direction.

The world I would like us to live in is basically, we want to have people do more declarative programming. They're going to define their invariants and their actions. You're going to have a tool that helps determine the consistency required. And then maybe-- this is sort of my question mark. Can we get to code generation? I don't know. I actually don't think it's that hard. I think you could probably start doing research and building tools that, once we picked a consistency model, and for very basic things, can we generate the code that will actually just store that data for you? If you provide me a data transformation?

So that's the world I would like to live in. So this is the path I believe we need to follow to start simplifying distributed computing. The goal here is to turn distributed systems into just another tool that is accessible to the maximum number of people possible. I think the right direction is probably, in the long term, something like Argus. A programming language and a runtime that provides consistency as a core primitive.

But unlike Liskov, I think that the language needs to provide multiple consistency levels. I don't think there is just one.

We want to make distributed computation easy. Because honestly it's too time consuming, and the world just does not scale if we keep having software eat the world. If everyone has to be mired in the minutiae of picking the right consistency model, then they're not thinking about doing other interesting things like creating art, or solving cancer, or whatever your passion is. You shouldn't all have to know this garbage. So just like Excel simplified finance, we need to keep researching and experimenting to create that for distributed programming.

So thank you.

[AUDIENCE APPLAUSE]