Setting Up To Fail by Vaidehi Joshi

Transcript

(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

[APPLAUSE] So hello. Yes, I'm here. Very happy to be here. Really excited to be here. As Gary said, my name is Vaidehi. This is my first time at Deconstruct. I've heard so many great things about this conference, so I'm really thrilled to be here. Before I get into things, I want to tell you a little bit about myself.

So I work at a company in Portland, Oregon, and it's called Tilde. We work on a product called Skylight. Maybe some of you use it. Does anybody here use Skylight? Hey! One hand, that's all I need. [LAUGHS] You should come find me after, I'll get you a Skylight sticker.

So I'm actually not going to talk about Tilde or Skylight today, but if you are interested in performance or performance monitoring, you should come talk to me afterwards. It's actually a really interesting problem, and I'd love to chat with you about it. So the other thing that you should know about me is that I really love learning new things. If you follow me on Twitter, you might already know this, but I'm kind of obsessed with knowing stuff.

And a couple of years ago, I got this idea in my head that I wanted to know about computer science stuff, so I decided to teach myself. And I decided that I was going to write about it and blog about it along the way. So I created a little project, you might have heard of it. It's called Base.cs. Some of you have, cool.

And another thing you should know about me, not only do I love learning things, I also really love sharing what I've learned with other people. I actually-- I've been reflecting on this recently, and I think I like this better than the learning part. I don't know what that means-- maybe I just like to talk and tell people stuff. [LAUGHS]

But I think something that I really try to do whenever I learn something is think about how to make it accessible for other people. Mostly because a lot of the times when I learn things, I have a hard time finding resources that are accessible and friendly to begin with, and I don't want anybody else to go through that whole experience, so I try to make it easier for others.

So over the past few years, some of the ways that I've done that is through a couple of different side projects. I created a video series, the Base.cs video series, which is really just an excuse for me to nerd out about linked lists on camera. I also host a podcast with my friend Saron, the Base.cs podcast. You can listen to it on Spotify and Apple Podcasts and a couple other places. And I also created a project recently called Byte Sized, which, again, is just an excuse for me to nerd out about how computer science things were created. There's like a whole history to it, it's really interesting, and you should definitely learn about it if you're interested in that kind of thing.

But this year, I decided that I wanted to learn some new stuff. And I'm the type of person who really likes to challenge themselves and put themselves in really uncomfortable positions. So I decided I was going to learn something completely new that I knew nothing about. Specifically, I decided I wanted to learn distributed systems. So I created a project called Base.ds, which is learning-- which is uncovering the fundamentals of distributed systems.

So full disclosure-- if you asked me a year ago like really anything about distributed systems, I don't know what I would have said, because I really didn't know anything about them. I probably would have responded with something like, I don't know, something, something, Kubernetes? That's what a distributed system is, right? [LAUGHS] And at one point, I actually Google Imaged Kubernetes, and I found this and I got really overwhelmed, it's like the world's scariest flowchart. And I closed that tab and I was like, oh, this is too hard.

And honestly, I don't really know that much about Kubernetes, but that's OK. Because this talk is not about Kubernetes, it's about-- [LAUGHS]

[APPLAUSE]

Some Kubernetes fans here. All right. [LAUGHS] No, it's not about Kubernetes, it's about distributed systems. So that's what we're going to talk about. We're going to talk about specifically what they are and what happens when distributed systems go wrong. It turns out, a lot of stuff. So before we can really get into how distributed systems go wrong, we need to understand what they are. And conveniently, there have been so many great talks that happened yesterday that talked about this, so I'll do some callbacks to that.

We can define a distributed system by understanding first what it is not. So imagine that we have a single process. A single process on its own is just a single system. If it's not communicating with anyone else, it's not a distributed system. However, if you have multiple processes and they're all talking to each other by sending messages over a network, well, now you have a different story. Now you're dealing with a distributed system.

Really, a distributed system is just many processes that are talking to one another. And you can think about the entities in a distributed system in different ways. So they're all autonomous, they're all capable of doing their own work and performing their own tasks, but they still talk to one another through messages. So they perform their own tasks, but they also send and receive messages to one another.

Another term for the entities in a distributed system is nodes. And if you love graph data structures like I do, then this might be conveniently a nice little callback to graphs, because that's another way of thinking about distributed systems, too, as nodes in a graph. These nodes send each other messages, and they also do their own tasks, too.
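
(Editor's note: the sketch below is not from the talk. It is a minimal illustration of nodes as vertices in a graph that do their own work and exchange messages. The `Node` class and its `connect`, `send`, and `receive` methods are hypothetical names, and an in-memory queue stands in for a real network.)

```python
from collections import deque


class Node:
    """A hypothetical node that keeps an inbox and knows its neighbors."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()
        self.neighbors = {}

    def connect(self, other):
        # An undirected edge in the graph: each node can reach the other.
        self.neighbors[other.name] = other
        other.neighbors[self.name] = self

    def send(self, to, payload):
        # Stand-in for a network call. A real network can delay or drop this.
        self.neighbors[to].inbox.append((self.name, payload))

    def receive(self):
        # Return the oldest (sender, payload) pair, or None if nothing arrived.
        return self.inbox.popleft() if self.inbox else None


a, b = Node("a"), Node("b")
a.connect(b)
a.send("b", "ping")
```

Calling `b.receive()` at this point would hand back the message that `a` sent. Everything later in the talk is about what happens when that hand-off does not go as expected.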

And a node can really just be anything. I mean, like anything. Like it could be the kind of hardware, it could be software, it could be a server, it could be a computer, it could be a process. And a distributed system can have many different types of nodes. So it's not just really a graph that we're talking about, we're talking about a graph with different types of nodes that look differently, and yet, they all talk to one another and it works. Most of the time.

Another cool thing about the nodes in a distributed system is that they can live anywhere. And if you think about this, it sort of makes sense, right? Because the word distributed makes you think of something that's scattered about, dispersed. And when we're talking about distributed systems, it's the nodes, the components of that system that can be scattered about, sometimes quite literally geographically dispersed across the world.

So we have these multiple nodes and they're all different from one another, but what's the benefit of this? Well, one benefit is that you can add more nodes to a system. Since each of these nodes is just a component in the system, adding a node should be fairly straightforward. You can add new nodes or you can replicate the nodes that you already have. So let's see what that looks like.

When we replicate these nodes, one benefit we have is that now we have nodes that are capable of performing the same tasks, but there's many of them. And this effectively allows us to scale our system as our application grows. Scalability could be a whole talk in and of itself, but the idea is that your system will continue to be performant and reliable for your end users. You can add new types of nodes or you can just keep adding the types of nodes that you already have. So if you design your system correctly, theoretically, even if something terrible happens, the whole system won't come crumbling down.

So for example, if a node goes up in flames, you're only dealing with a single subsystem failure rather than an entire system that has come crashing down. And it's kind of nice to isolate that failure and be able to handle that failure on its own without completely bringing down the system for the rest of your end users.

However, on the flip side, we have all of these different components in our system, but to our end user, they think that they're dealing with one single computer. So a distributed system is really just an illusion. It creates this illusion that there are multiple parts, but to our end user, it appears that it's one single system. So if you're a distributed systems engineer, you can basically tell everybody that you just trick your users and feel OK about it.

Distributed systems present themselves as autonomous self-sufficient nodes, but masquerade as a single system to their end users. And this illusion, this masquerade is pretty important. Because this brings us to the concept of transparency, and it's pretty important to understand transparency when you talk about distributed systems. Transparency is what allows your end user to have no idea how many nodes are in the system, if something's getting added, if something's getting removed, and it also allows you to scale your application without your end user knowing about it.

I think if we start thinking about distributed systems in this context, in a simpler lens, you start to see that a lot of things are distributed. Web development is distributed computing. Even if you have just a single server and a single client, you have two different nodes and they're talking to each other by sending messages over a network. It's distributed.

Frontend development is distributed. This is my favorite-- possibly my favorite slide in this whole presentation, because sometimes people talk about frontend development being super easy. It's not easy. Handling asynchronous requests and shared mutable state across multiple different clients is an incredibly hard problem, and it's a distributed computing problem. And as I learned yesterday, thanks to Ayla, game development is also distributed. Who knew? Not me. Now I do, which is kind of cool.

So basically, everything that you interact with is probably a distributed system. And distributed computing is all around us. And as it becomes more apparent that everything is distributed, it also becomes apparent that it's extremely hard. There's this great quote from a computer scientist named Leslie Lamport, and he describes what a distributed system is. And I think it really does a good job of highlighting what makes these problems so hard.

He says-- there he is. He says, "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." Which is kind of great, because you're just happily going along and then something fails somewhere in like Norway and you're like, what?

I think this quote is really great because distributed computing is hard in a lot of ways, but the idea that something can fail and affect something in a different part of the system sometimes really dramatically is really interesting. And this idea of failure in one node impacting the rest of the system is really the crux of what makes thinking about distributed problems so tricky.

And I think what he's getting at here is that understanding that one part of the system can affect everything else should really shape the way that you think about distributed computing. In other words, failure can completely break the transparency of your system. And from what I've learned about distributed computing, understanding failure seems to be like the biggest step you have to take to understand how to really work and think about distributed systems.

So that's kind of what I want to talk to you about today. Failure in distributed systems. When something fails in a distributed system, it's because something unexpected happened. Another name for this is a fault. If you've heard the term fault tolerance-- I think Kyle actually mentioned it in his talk yesterday, if you've heard this term in the context of distributed systems, the idea is trying to build systems that can avoid faults and handle them as best you can.

So let's talk a little bit about faults. Faults are just unexpected things that happened. And unexpected things can happen in any node of your system. They could be related to the node's hardware or the software or even just its connection to the network. Unexpected things happen all the time.

The fault could be latent, which means that it exists in the system but hasn't manifested itself yet, or it could be active, which unfortunately means-- sorry, it's already blown up in your face. Oops. Whenever a fault happens, it causes some sort of unexpected behavior, which results in an incorrect result, and that incorrect result is something we call an error. So a fault causes an error. And if an error is not hidden or masked from a system-- from the rest of the system, rather, well, now it's manifested as a failure within that node of the system.
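
(Editor's note: the sketch below is not from the talk. It is one rough way to picture the chain from fault to error to failure, under the assumption that a Python exception stands in for the "incorrect result". The functions `average` and `node_reply` and the `mask_errors` flag are hypothetical.)

```python
def average(values):
    # Latent fault: nothing guards against an empty list.
    return sum(values) / len(values)


def node_reply(values, mask_errors):
    """Build the reply that the rest of the system would see from this node."""
    try:
        return {"ok": True, "value": average(values)}
    except ZeroDivisionError:
        # The fault became active and produced an error inside the node.
        if mask_errors:
            # Masked: the rest of the system never sees the error.
            return {"ok": True, "value": 0.0}
        # Not masked: the rest of the system observes a failure of this node.
        return {"ok": False, "value": None}
```

With a non-empty list the fault stays latent. With an empty list it becomes active, and whether it counts as a failure depends on whether the error is masked before it leaves the node.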

And of all these things, the failure part is the one that's the most interesting to me. Because when we talk about distributed systems and failure, something I learned is that thinking about failure in the rest of the system is super important. When we talk about failure, there's a whole language to it, there's a whole semantics. And one term that you might hear is failure modes. Failure modes are how we identify the types of failures that occur in our system, and in order to identify a failure, you need to think about how the rest of the system is going to perceive it.

So let's talk about how systems fail. What do those failure modes look like? We'll start with an example. So here we have the world's cutest distributed system, and we'll remember that nodes communicate with each other with messages through a network. But that network can sometimes be slow. And you have unexpected slowdowns and you have to try to guard against them, or at least think about them. So one strategy that you can use is you can create something called an expected time interval. And this is basically just an agreed-upon amount of time that you expect a node to respond within.

So for example, maybe we have an expected time interval of 5 to 5,000 milliseconds. All the nodes in the system expect that every other node will respond within this expected time interval. And that's fine. This red node, by the way, is like pretty inactive, he doesn't have much to say in this whole story, so sometimes he says like funny things. [LAUGHS] Or like not useful things at all. Sometimes you have nodes like that in your system, it's OK.

So a node that communicates or responds to other nodes within that expected time interval is considered a healthy node. On the other hand, if a node takes too long or an amount of time to respond that's not within that expected time interval, well, now we have a node that did something that it was not expected to do. We expected it to respond between 5 and 5,000 milliseconds, and if its response was not within that range, that's unexpected behavior. So in other words, we have a node that's behaving oddly and we have a fault in our system.

And this is known as a timing or performance failure. Notably, this comes in two different forms. So the more intuitive one is the one I sort of just mentioned, which is that a node exceeds the upper bound of the time interval. It takes more than 5,000 milliseconds to respond. It's too slow.

And if you start to think about it, it becomes obvious that a slow-to-respond node seems like a bad thing. Obviously we want it to be like in a reasonable amount of time. But there's a second part to this, too, which is that a node can respond such that it exceeds the lower bound of the time interval, which means that somehow, it responds in less than 5 milliseconds. And if you were here for Alison's talk yesterday, clock skew falls into this category.

If you have a node that responds too quickly, now you have like scheduling issues and timing problems, and it can cause some issues with the sequencing and ordering of events in a system that you might not ever have expected to happen. So performance failures come in two different shapes, and it's important to think about both of them.
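
(Editor's note: the sketch below is not from the talk. It checks a measured response time against the 5 to 5,000 millisecond interval used in the example and reports which bound, if any, was violated. The names `Timing` and `classify_response_time` are hypothetical, and treating the bounds as inclusive is an assumption.)

```python
from enum import Enum

LOWER_MS = 5      # lower bound from the talk's example
UPPER_MS = 5000   # upper bound from the talk's example


class Timing(Enum):
    HEALTHY = "within the expected time interval"
    TOO_FAST = "timing failure: below the lower bound"
    TOO_SLOW = "timing failure: above the upper bound"


def classify_response_time(elapsed_ms, lower_ms=LOWER_MS, upper_ms=UPPER_MS):
    """Classify a response time; both bounds are treated as inclusive."""
    if elapsed_ms < lower_ms:
        return Timing.TOO_FAST
    if elapsed_ms > upper_ms:
        return Timing.TOO_SLOW
    return Timing.HEALTHY
```

Note that the check never looks at the content of the response, which matches the point that a correct value delivered outside the interval still counts as a timing failure.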

Eventually, let's say that this node does deliver the right value. It kind of doesn't matter, because this is still considered a failure. Even if it delivers the correct content, it took either too little time or too long to respond, and that was unexpected, and so it still failed in that behavior. So we have timing failures, but we also have a subset of failures within that. So let's look at that.

Take, for example, a node that takes such a long time to respond, that it just never responds at all. Another way of thinking about this is that it's taking an infinitely long amount of time to respond. This is also known as an omission failure, and the idea here is that in an omission failure, the rest of the system sees that node as one that never sent a response ever. That node effectively failed to reply, it just omitted its response, and that's how it gets its name, omission failure.

And we can further categorize omission failures as well. If a node fails to send a response, that's a send omission failure, versus if a node fails to receive a response, that's a receive omission failure. And omission failures can have pretty big impacts on a system, because depending on how long a node takes to respond, well, some pretty bad things can happen. And there are a few ways to plan for these failures. You could try to be proactive and you could try to resend the message, or you could ask another node for whatever information that failing node did not provide us with.
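
(Editor's note: the sketch below is not from the talk. It illustrates the two strategies just mentioned, resending the message and then asking another node, under the assumption that each node is modeled as a plain function that returns `None` when it omits its response. In a real system that `None` would come from a timeout. All names here are hypothetical.)

```python
def ask_with_fallback(nodes, request, retries_per_node=2):
    """Try each node in order, resending a few times before moving on."""
    for node in nodes:
        for _ in range(retries_per_node):
            reply = node(request)  # None models an omitted response
            if reply is not None:
                return reply
    raise TimeoutError("no node responded to the request")


def silent_node(request):
    # Models an omission failure: this node never replies.
    return None


def replica_node(request):
    # A healthy replica that can answer the same request.
    return f"value for {request}"


answer = ask_with_fallback([silent_node, replica_node], "x")
```

Here the silent node is retried twice and then skipped in favor of the replica. If every node stays silent, the caller gets a `TimeoutError` and has to decide what to do next.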

But what if one node just keeps ignoring the rest of the nodes? And what if it just becomes completely unresponsive and just doesn't reply to anyone? Well, that's when we can say that a crash failure occurred. Just as omission failures are subsets of performance failures, a crash failure is a subset of an omission failure. To the rest of the system, this node at some point responded, then one day it omitted a response, and then it just stopped responding after that. And at that point, it's safe to say that that node crashed. Now I know the term crash sounds a little scary, but sometimes the way to deal with it is just to turn the node on and off again. [LAUGHS] So it's not that bad.
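
(Editor's note: the sketch below is not from the talk. It counts consecutive omitted responses per node and flags a node as suspected of crashing once a threshold is reached. The `CrashDetector` class and the threshold of three are assumptions for illustration. An observer like this can only suspect a crash, since a very slow node looks the same from the outside.)

```python
class CrashDetector:
    """Track consecutive omissions and flag nodes that stop responding."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = {}  # node name -> consecutive omitted responses

    def record(self, node, responded):
        # Any response resets the count; an omission increments it.
        self.missed[node] = 0 if responded else self.missed.get(node, 0) + 1

    def suspected_crashed(self, node):
        return self.missed.get(node, 0) >= self.max_missed


detector = CrashDetector(max_missed=3)
detector.record("purple", True)    # at some point it responded
detector.record("purple", False)   # then one day it omitted a response
detector.record("purple", False)
detector.record("purple", False)   # and it kept not responding
```

After the third consecutive omission the purple node is treated as crashed, which is the point where restarting it becomes a reasonable thing to try.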

So for all of these three failures that we've looked at, everything has either had an undelivered response or a delayed response. But it's worth noting that eventually for some of these failures, there was a response, and the content was correct, it was just not at the right time. The next failures we're going to look at do not do this.

So let's say that a node delivers a response, but the value of that response is actually incorrect. If you've ever gotten a response from a server and it looked weird in the client and you were like, that seems odd, chances are you were probably witnessing a response failure. The response came within the expected amount of time, but the value of it was incorrect. This is a value response failure. The content is actually wrong. And it usually tells you that there's something wrong in the internals of the node, maybe something with the control flow or the logic, and that's where you should probably start investigating.

However, we can also get a response from a node where the state of that node is actually incorrect, too. For example, if you have a node that asks for a state and the response from that node is incorrect and unexpected, well, now you have a state transition response failure. The unexpected behavior here is that the node that asked for information sees the response and realizes that the node that is giving the information somehow is in the wrong state. And this is also odd behavior. It's unexpected and it's a fault. And it usually means that, again, something's probably wrong with the internals of that node.

But for both of these two different response failures, for state transition responses and value response failures, the unexpected thing is the content of that response. But thankfully, some of you might actually have ways to deal with this. We can use error handling to deal with unexpected response values that come from nodes. So that's convenient. This may also be the only solution I provide to you in this whole talk. [LAUGHS] It's very hard to talk about problems and solutions in 35 minutes or 30 minutes, so maybe there's another talk in there somewhere.
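
(Editor's note: the sketch below is not from the talk. It shows one way error handling can catch both kinds of response failure, under the assumption that a response is a dictionary with a `state` and a `value`. The set of valid states, the rule that values are non-negative integers, and every name in the block are hypothetical.)

```python
VALID_STATES = {"idle", "working", "done"}  # hypothetical states for a node


class ResponseFailure(Exception):
    """Raised when the content of a response is not what was expected."""


def check_response(response):
    state = response.get("state")
    if state not in VALID_STATES:
        # Roughly a state transition response failure: the node is in a wrong state.
        raise ResponseFailure(f"unexpected state: {state!r}")
    value = response.get("value")
    if not isinstance(value, int) or value < 0:
        # Roughly a value response failure: the content itself is wrong.
        raise ResponseFailure(f"unexpected value: {value!r}")
    return value


def read_count(response, default=0):
    """Handle a response failure by falling back to a default value."""
    try:
        return check_response(response)
    except ResponseFailure:
        return default
```

Falling back to a default is only one option. Depending on the system, the caller might instead retry, ask a different node, or surface the error.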

So so far, we've looked at one flavor of failure. We've got timing failures, omission failures, crash and response failures, but they actually all share one characteristic, one quality-- they're all consistent. And what do I mean by this? I mean that if you have two nodes that are talking to a node that fails with any of these failures, both of those nodes will see the same thing. In other words, the rest of the system will perceive these failures in exactly the same way.

Now unfortunately, we have consistent failures, which would lead us to believe that maybe there is something that is not consistent. Unfortunately, those are inconsistent failures. They are not consistent, they are terrible, and they make you feel like this. So let's say that one node asks another node for some sort of value, and that node responds. But what happens if some other component in that system asks the same exact node for that same value, and I don't know, maybe that node responds in a completely different way?

Obviously this is probably pretty confusing for the two nodes that have been asking for information. We could have them talk to one another and the red node would be like, oh, everything's fine. And the yellow node would be like, banana? WTF? What am I supposed to do with this? They're probably confused and they wouldn't know what to believe, and it's hard from a programmatic perspective to think about how to resolve this inconsistency.
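
(Editor's note: the sketch below is not from the talk. It shows how two observers could notice an inconsistent failure by comparing what the same node told each of them. The function name and the shape of the `replies` dictionary are assumptions. Detecting the disagreement is the easy part; deciding which reply to believe is the hard part described above.)

```python
def replies_are_consistent(replies):
    """replies maps each observer's name to what the suspect node told it."""
    return len(set(replies.values())) <= 1


seen_by_observers = {"red": "everything's fine", "yellow": "banana"}
consistent = replies_are_consistent(seen_by_observers)
```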

And thus, we have come to the greatest failure of all, my personal favorite-- the arbitrary or Byzantine failure. Oh, I heard somebody sigh in the audience. That's how I feel, too. [LAUGHS] There's like a lot of research about Byzantine failures and faults, and you could probably do a whole talk on that. If you're interested in it, you should check out some of the papers. There are papers from the early '80s, and you can keep reading them and then feel sad about computers.

[LAUGHTER]

So in this kind of failure, a node can respond with arbitrary messages at arbitrary times. And to make matters worse, it can also forge messages and responses about other nodes. So for example, let's say that the red node asks for some information about the yellow node to this problematic fellow here, or the yellow node asks some information about the red node to the purple node. Well, you could-- in this failure, you couldn't really guarantee that it will say something that's true and valid. In other words, a node with the Byzantine failure can just lie, which is pretty untrustworthy, if you ask me. Another WTF moment. So many in distributed systems.
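
(Editor's note: the sketch below is not from the talk. It illustrates one simple idea for coping with a node that may lie: ask several nodes the same question and only accept an answer that a strict majority agrees on. This is an assumption-laden toy, not a Byzantine fault tolerant protocol, and real protocols need considerably more machinery. The function name `majority_value` is hypothetical.)

```python
from collections import Counter


def majority_value(reports):
    """Return the value reported by a strict majority, or None if there is none."""
    if not reports:
        return None
    value, count = Counter(reports).most_common(1)[0]
    return value if count > len(reports) / 2 else None


# Three nodes are asked whether the yellow node is up; one of them lies.
verdict = majority_value(["up", "up", "down"])
```

With only two reports that disagree, the function returns `None`, which reflects that a single liar among two nodes cannot be outvoted.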

So as we can see, there are so many different ways that a distributed system can fail, and some of them seem a little worse than others, and you could probably spend 30 minutes talking about each one. But a lot of the times, I think when we talk about faults, some people will just think about the hardware perspective of it, and the truth is, faults in hardware are actually-- they've reduced in terms of like what they were like back in the '80s. They've definitely gotten better as time has gone on.

However, you can't say the same thing about software. Software faults and the failures that come from them still account for 25% to 35% of system downtime. So I personally think they're really interesting to understand and learn about, and also, as a software conference, it probably applies to most of the work that we all do every day.

Software failures have another name, too. Bugs. Even the most well-tested and well-QA'd systems have some bugs that you just can't quite predict and maybe are sometimes unavoidable. And those are the ones that often result in significant system downtime. And all of them are going to fall into one of the failure modes that we've learned about today, or at least a subset of them. And the truth is, failure in distributed systems is just inevitable. Our systems are set up to fail. And they're going to fail eventually, and they're going to fail in different ways.

And if we know anything about human error, sometimes they will fail because of us. Sometimes they will be kind of silly and maybe bring down the entire internet with them. That last one is from earlier this month, and I think the thing that I find the most amusing is that 4.4 thousand people liked it? I don't know why. [LAUGHS] I think a couple of these were just caused by typos or bad regexes.

If nothing else, I think in my time learning about distributed systems-- and I'm still learning so much about them, one thing is definitely for sure. How a system works is just not as important as how it fails. We tend to think about the success modes a lot and not about the failure modes in a system. Does our system fail gracefully? How do we think about what it does and how it behaves when things go wrong?

And I didn't give you too many solutions, but the truth is that there are already a lot of solutions you're probably using to handle these kinds of failures, and there are a lot of tools and libraries and algorithms out there to handle these things. So I guess what I would hope that you take away from this is not so much the solutions to these things, but a paradigm shift of how you think about your own systems and how they fail.

I think we could all do a better job of understanding the problem of failure in our systems. We'd create better products for our end users, and we'd be better at our jobs for it, because at the end of the day, all we do is build systems to avoid failure and then deal with the failure and the fallout when it happens.

We could all spend more time thinking about the modes of failure and how to build systems to resist those. So I encourage all of you to think about what failure looks like in your system, whether that means building something from the ground up, or maybe you're entering a new database or-- a new-- not a new database-- a new codebase, and maybe you're taking on inheriting a new codebase that someone's worked on, and maybe there are still some failures that are embedded in that as well.

Ultimately, everything that we build is going to fail, so we shouldn't be too scared of it. We should accept it as an inevitability, and we should get out there and build things that are going to fail. Because as we all know now, eventually, they definitely will. But that's OK, I think, because we're going to be more equipped to understand them, to think about how to avoid them, and at least when you start a new project, now you'll know all the ways that it can go up in flames. Thank you so much.

[APPLAUSE]