A Fresh Look on Failure by Pablo Meier

Transcript

(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

[APPLAUSE] Hi, everyone. My name is Pablo Meier. You can read things I write on the internet at morepablo.com every now and then. You can follow me on Twitter @SrPablo. The title of my talk is "A Fresh Look on Failure." So let's talk about freshness.

Go back to the year 2008. A brand new site called Stack Overflow has just started. And back then, you could ask questions with lots of answers, and subjective answers. And someone asked the question, what significant new inventions in computing have we had since 1980?

Now, we all know that Stack Overflow would now moderate this with modern community standards. But oddly, this question is still up. It was asked by someone who worked at a printer company.

[LAUGHTER]

But the reason he asked this question was because there was another thread before where someone said, what is a significant invention in computing on the order of the hash table, on the order of O(n) runtime analysis? And people had trouble coming up with anything that had really come after 1980. So he asked specifically, after 1980, what has come up?

And I asked myself-- when I was learning things in undergraduate, I really had my world rocked every semester with new things. And in my decade as a software engineer, I hadn't thought of too much that I found in industry. But as Alan Kay is credited with saying, the best way to predict the future is to invent it. I thought I would present to you a problem domain that I find pretty vexing and a way that I've been thinking about how we could solve it together, as a computing community.

So what is my chosen problem domain? That chosen problem domain is distributed systems. This is a slide from a presentation by the Philips Light Bulb Company on how they power their smart light bulbs. How many programmers does it take to screw in a light bulb? It turns out it's a lot.

[LAUGHTER]

Boxes and arrows everywhere we go-- and furthermore, about eight months ago, I did the interview rounds. And about a decade ago, what would happen was that I'd shake a marker very tentatively on the whiteboard and reverse strings in place. And then I'd do it-- have to draw arrows with if statements for things that I forgot. But these days, I have to design a URL shortener. I did four on-sites. I designed four URL shorteners.

[LAUGHTER]

So I'm going to walk through what that interview looks like at a high level. But I'm going to need your help for this part. Whenever I give you the cue, when I throw my arms out like this, I need you all to yell, "But Pablo, it's not web scale." Can you do that?

[LAUGHTER]

Excellent. Pablo, can you design for me a URL shortener? I was like, OK. We'll have a client speak to an EC2 instance with their application, maybe, and it will speak to a back-end database. Is it an object store or relational store? Let's talk trade-offs. Great.

[LAUGHTER]

But Pablo, it's not web scale!

It's not? Oh my gosh. You're right. Well, luckily, we have a solution for this. We can put it behind an elastic load balancer. And we can horizontally scale the boxes so we can handle all this extra traffic.

But Pablo, it's not web scale!

Oh, you're right. What was I thinking? Thank you for bringing that up. We can add a Redis cache so we don't have to make that round trip to the database. And suddenly, we're closer to web scale.

But Pablo, it's not web scale!

But we're not web scale enough. Now we have to integrate with a third-party provider, who has a worse SLA than we do. And each request takes two full seconds. How can we do this without slowing down our service? And I say, well, we can do the zeitgeist. Let's add an asynchronous message queue, like Kafka.

Well, then how do we get the responses for this, because we do care if they worked or not? Well, we can use another message queue. And they say, well, what happens when the message queues fill up? And in the interview, the honest engineer says, we provision comically large queues.

[LAUGHTER]

But what is the problem with all of this? We've taken one of the most reliable abstractions in all of software, the humble function call. It is the mitochondria of software. It is the powerhouse of the cell.

[LAUGHTER]

And we replaced it with something that sucks.

[LAUGHTER]

You give a value. You get a value. But now you might not get a value. And it's going to cost you a lot of resources. And looking back at this diagram, every box represents one or several computers that can and will fail. And every arrow represents a network call that can and will fail.

So it's not like we have nothing to deal with this. But when you take what was a function call and you make it distributed, you suddenly have to do a lot more things. We have to add monitoring, for example. We have to suddenly throw more code, more time, more attention at whether or not our instances are healthy and they're OK. We have to add alerting because we need to wake people up at 4:00 in the morning to say, you're not making any revenue. Get back up to work. Normally, this involves turning it off and turning it on again.

[LAUGHTER]

Idempotent operations-- I really like this one because it is not so much about throwing new code at a problem or more attention. It is about changing what correctness means. For those who don't know, an idempotent operation is one where doing something once has the same effect as doing it several times.

So if I want to create a user and then delete that user, I should be allowed to delete it again. This is because if we don't get a response from the network, we want to be able to retry. But maybe it just took a long time. So maybe we get the same thing twice. We're turning correctness on its head in order to support these distributed systems.
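(Editor's sketch: the create-then-delete-twice scenario can be written down in a few lines of Go. The `UserStore` type and its methods are hypothetical names invented for this illustration, not anything from the talk; the point is only that a repeated `Delete` leaves the system in the same state as a single one.)

```go
package main

import "fmt"

// UserStore is a hypothetical in-memory store used only for illustration.
type UserStore struct {
	users map[string]string // id -> name
}

func NewUserStore() *UserStore {
	return &UserStore{users: make(map[string]string)}
}

// Create adds or overwrites a user, so repeating the call is harmless.
func (s *UserStore) Create(id, name string) {
	s.users[id] = name
}

// Delete removes a user. Deleting a missing user is not an error:
// the end state ("this user does not exist") is the same either way.
func (s *UserStore) Delete(id string) {
	delete(s.users, id)
}

// Exists reports whether the user is currently stored.
func (s *UserStore) Exists(id string) bool {
	_, ok := s.users[id]
	return ok
}

func main() {
	store := NewUserStore()
	store.Create("u1", "Ada")
	store.Delete("u1")
	store.Delete("u1") // a retried request arrives a second time; nothing breaks
	fmt.Println("u1 exists:", store.Exists("u1"))
}
```

A client that never saw the first response can retry `Delete` without first checking whether the earlier attempt went through.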

Error handling code-- this one will echo Sandi Metz's talk. Suddenly, everything might just fail. And you need to have a contingency plan in place. This is all code that you maintain. Deploying CI gets a lot harder. When I run integration test suites for the companies that I work at, suddenly, I'm downloading gigabytes of Docker containers, usually with their own databases. And you have to worry about sample data for those.

And lastly, latency requirements-- let's go back to 1980. Back in the day, people were thinking, does this have too many function calls? Am I doing too much pointer indirection? That's not a thing anymore. But suddenly, we have to worry so much about latency because if you have 20 calls behind every request you end up serving, one of those calls can go bad. And it can mess you up.

So let's walk through a scenario. Suppose you-- this is a contrived example to show you what the impacts of this are. Suppose you are talking to the Facebook API around 2009. And you want to get information that you have access to and permission for from a user. And you'd like to take your database model of that user, combine them into some richer object, and upsert it back up. You'd like to be able to write code like this. I'm going to call the Facebook API. I'm going to get the user data from my database. I'm going to combine them into this new object and upsert it.

In reality, this is not what it looks like at all. We can call the Facebook API. But then we have to check the status code. It might have 503'ed. They might be down. It might have 400'ed because it was Facebook API in 2009, when they just changed things without telling you.

We can't move on, though. You might not have gotten a response. You need to be able to track this, too, and handle this. And when this happens, your instruction pointer now goes back. How do you move it back if you want to retry? That's for you to figure out.

[LAUGHTER]

We can move forward. We're going to get the value from our database. But wait, that might go wrong, too. So again, instruction pointer back up-- but how do we deal with this? We'll figure that out later.

Finally, we're safe. We can combine our data. We have this new user object. And we try and upsert it. But did you notice that the upsert is in the same try block as the previous database operation?

So if you except it-- and this depends more on database frameworks-- you might have to move to a different try-except scope in order to handle this. And I kind of lied when I said you can combine these safely. The Facebook API might return different data that your combine function does not expect. So you need to handle application error logic on top of this.
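(Editor's sketch: the slides are not part of this transcript, and the talk describes try/except blocks, so the following is not the speaker's code. It is a rough Go rendering of the same four steps with the handling each one needs. Every name here, `fetchProfile`, `loadUser`, `combine`, `upsert`, `retry`, and the error values, is a hypothetical stand-in; none of it is a real Facebook or database API.)

```go
package main

import (
	"errors"
	"fmt"
)

type Profile struct{ Name string }
type User struct{ ID, Email string }
type RichUser struct{ ID, Email, Name string }

// Hypothetical error values: a transient failure (a 503, a timeout) and an
// application-level failure (the remote data is not what we expected).
var ErrUnavailable = errors.New("remote service unavailable")
var ErrBadProfile = errors.New("profile missing expected fields")

// retry runs op up to attempts times, retrying only transient failures.
func retry(attempts int, op func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil || !errors.Is(err, ErrUnavailable) {
			return err
		}
	}
	return err
}

// combine is the application-level step; it can fail on unexpected data.
func combine(u User, p Profile) (RichUser, error) {
	if p.Name == "" {
		return RichUser{}, ErrBadProfile
	}
	return RichUser{ID: u.ID, Email: u.Email, Name: p.Name}, nil
}

// syncUser is the four-line happy path plus the handling each step needs.
func syncUser(
	fetchProfile func() (Profile, error),
	loadUser func() (User, error),
	upsert func(RichUser) error,
) error {
	var p Profile
	if err := retry(3, func() (e error) { p, e = fetchProfile(); return }); err != nil {
		return fmt.Errorf("fetch profile: %w", err)
	}
	var u User
	if err := retry(3, func() (e error) { u, e = loadUser(); return }); err != nil {
		return fmt.Errorf("load user: %w", err)
	}
	rich, err := combine(u, p)
	if err != nil {
		return fmt.Errorf("combine: %w", err)
	}
	if err := retry(3, func() error { return upsert(rich) }); err != nil {
		return fmt.Errorf("upsert: %w", err)
	}
	return nil
}

func main() {
	calls := 0
	flakyFetch := func() (Profile, error) {
		calls++
		if calls < 3 {
			return Profile{}, ErrUnavailable // first two attempts fail
		}
		return Profile{Name: "Ada"}, nil
	}
	loadUser := func() (User, error) { return User{ID: "u1", Email: "ada@example.com"}, nil }
	upsert := func(r RichUser) error { fmt.Println("upserted", r.Name); return nil }

	fmt.Println("error:", syncUser(flakyFetch, loadUser, upsert))
}
```

Even in this compressed form, the four lines of intent are outnumbered by retry and error plumbing, which is the imbalance the talk is pointing at.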

I put it this way to make it fit into a slide. I idly looked at what it would actually look like just logging, but with more cases and dealing with scopes. But by the time you write integration tests for these things, it's just a bad time.

[LAUGHTER]

It is a lot of code. And we can't all be Margaret Hamilton. So how do I want to solve this? Languages. I'm a language enthusiast. And I love languages. So I'm going to give you a reductive, incomplete view of a history of programming language design.

Let's go back to 1958, which is a time before Gary's talk. A very long time ago, we have ALGOL, which is something of an Adam and Eve type language. I say that because if you read the Bible, there's pages of who begat who begat who begat who. ALGOL begat many languages. And then the last few lines of that chain are BCPL into C in 1972. So C changed everything, was a major ground breaker of the industry.

And if you look at the top 10 languages, according to the TIOBE index of programming popularity, many of them, I would say, trace back to C. So if you-- say, if you're a language friend, too, if you're someone who loves languages, you can say, Pablo, Objective-C, C++, JavaScript all mean something slightly different when you say "object." You can't really be comparing these languages.

I would say, maybe not directly. However, they do have a lot in common. They have these scoped lexical blocks, which is either curly braces or DO/END or colons. They tend to always have this identifier that can be assigned and mutated over time with an equal sign based on the expression on the right. The function names are always prefixed. We invoke those with parentheses. They're all statement-based languages. So many of them also have these semicolons. But either way, if they don't, they still go one after the other. We define our functions or our methods, for loops, if statements, using these blocks. Many of them even have goto.

So as a result, if you can read code in one, you can kind of read code in the other. And while they do change things around the fringes, I think, fundamentally, they still make a lot of the same assumptions C made. And this is why I want to talk about it.

C was built in a world where we had one processor. And it was getting about twice as fast every 18 months. And the processor time was the thing that was stopping us. So as a result, between the two cases of the monolithic call and the distributed call, the top one is a lot easier than the bottom one.

So I'm going to propose a bunch of features and ideas around language design. And the proposal is we might make the top row a little bit more challenging than before. But it will help us equalize on the bottom row and the big gulp there.

So here comes the freshness in several parts. The first is massive end-to-end process count. What do I mean by this? Let's look at the real world. Is the real world dominated by elephants and whales? It's not. It's mostly dominated by insects. We're better than the elephants and whales. But the insects outnumber us, too. And similarly, we're in this bind because Facebook and Google cannot power their services through a giant supercomputer they can pay millions of dollars for. They need to connect hundreds of thousands of commodity hardware machines.

With this in mind, instead of taking one major process and maybe spawning threads from it, let's put hundreds of thousands of concurrent, independent processes together. I want it to be as easy to make a new process as it is to make a new object. I've never gotten a code review where someone said to me, Pablo, I see a new keyword in a for loop. I think you might have, like, 100 objects here. It's just never happened. But I do think that if I see a code review with hundreds of processes in many conventional languages, I would be like, are you sure this is going to be OK? So it goes beyond just a distributed system.

[LAUGHTER]

Even if you have one machine with many cores, there are a lot of benefits to just modeling the problem in this way. What syntax can we use for this? Suppose you've got some function, just like we put the word "new" behind things, let's put the word "launch" behind it, or some other word to create a new concurrent process.

Value-based message passing-- great. So we have these processes. But they need to talk to each other, somehow. And I would argue that right now, just like the message queue was very useful in speaking between different actors, we can create our own abstraction that operates in a similar way between the various processes we're creating.

So I have this simple one. It's going to compute prime numbers. And every time it calls "next prime," it's going to generate a prime number and publish it. And we can have a bunch of other processes that can subscribe to the other end of this, just like the message queue example. And they will do whatever they want to it.

We don't currently use many common abstractions that operate just this way at the program level. So let's say that I call it receiver. And I can create this syntax and say, just push it onto the receiver. And anyone who has it suddenly receives it. They can block and wait for it in the meantime.

Great. Pauseless GC is something that can come out of this, too, if we really want it to. Let's think if you have this elephant language that has all these processes all going, but it has a centralized allocator-- so you've got the stack at the top and the heap at the bottom-- and every time we run this program, these little things are using-- going through that centralized allocator to allocate their memory. There will come a point when it runs out of memory, and we need to do a collection. And when that happens, we have to stop everything, which is not really good. And I don't think it would work if we took this to the scale that we're looking at, hundreds of thousands of machines all running one beautiful distributed program. Can you imagine having a centralized allocator for that? That would be bad news.

But to use the metaphor again, insects don't share one giant digestive system. They've all got their own. So if we give each of them their own stacks and their own heaps and they can collect themselves, we will still have stops in the allocations. We will still have occasional pauses, but not at a system level. These will all be very short. And the system, as a whole, will continue to move on.

I have to wait for the animation. There we go. So I have a sense of who comes to Deconstruct. It is people who agree with me that Hacker News is a terrible place to read comments. And you all still read it, anyway.

[LAUGHTER]

[APPLAUSE]

It is people who follow other programmers on Twitter and read blog posts. And none of you, I'm sure, have really heard of me. And it's unlikely, probabilistically, that I'm a singularly brilliant mind in the world of the most vexing problem to some of the most well-funded corporations on the planet.

These are not new ideas. And I'm probably not the person to bring them about. I'm not going to invent the future. You may think, very cleverly, that I am not describing a language that I would like to build, but one that exists. And you'd be absolutely right. I'm describing Go. Go has the things that I've described. And it's certainly a really fun language.

It was first released in 2008. So this syntax I gave you is done in Go this way. I change it up a little bit. This one's straight-up Go syntax. I put a semicolon at the end, though. So some of you got fooled.
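(Editor's sketch: the slides are not reproduced in this transcript, so the following is an approximation of the idea rather than the speaker's slide code. It shows the two pieces described earlier, launching a concurrent process and pushing values onto a "receiver", using Go's `go` statement and channels. The function names `isPrime`, `publishPrimes`, and `firstPrimes` are made up for this example.)

```go
package main

import "fmt"

// isPrime is a deliberately simple trial-division check.
func isPrime(n int) bool {
	if n < 2 {
		return false
	}
	for d := 2; d*d <= n; d++ {
		if n%d == 0 {
			return false
		}
	}
	return true
}

// publishPrimes sends the first count primes on receiver, then closes it.
func publishPrimes(count int, receiver chan<- int) {
	defer close(receiver)
	for n, sent := 2, 0; sent < count; n++ {
		if isPrime(n) {
			receiver <- n // blocks until a subscriber is ready to take the value
			sent++
		}
	}
}

// firstPrimes launches the publisher and collects what it sends.
func firstPrimes(count int) []int {
	receiver := make(chan int)
	go publishPrimes(count, receiver) // "launch": start a new concurrent process
	var primes []int
	for p := range receiver { // blocks and waits for each value
		primes = append(primes, p)
	}
	return primes
}

func main() {
	fmt.Println(firstPrimes(5))
}
```

The publisher and the subscriber share no variables; the only thing that crosses between them is the integer value sent on the channel.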

I promised you guys freshness. And I promised you something that really inspired me in a way that I was not used to. And we're going to have to go deeper than Go for that. I'm going to go somewhere else. So I'm not actually describing Go to you. We have to go to Erlang, which was first released in 1986. Also released in 1986-- me. The freshest thing I knew came out when I escaped the womb. This in Go looks like this in Erlang. And this in Go looks like this in Erlang.

So if you're wondering why I had a fish in here, it's a red herring. I don't know if this is what herrings look like. I want to take this point of the talk to tell you that this is absolutely not an anti-Go talk. I love Go. Go is a lovely language. And everything I've described is rocking the distributed systems world as promised.

I don't do a whole lot of DevOps. But my friend who does-- I don't know this company very well. But I know when they make a press release, he comes to me and hugs me and says, my life is easier. And they use a lot of Go for this. Docker is written in Go. Kubernetes is written in Go. They're using the things that I've described to you already to really change how distributed systems and how reliable they are and how well they work. But I want to show you a few program implications to some of the design decisions that Erlang made differently.

Erlang does a few things, small things, pretty differently that fundamentally alter how I think about programming. So I'm going to spend the rest of the talk going over that.

The first one's immutability. Erlang is an immutable language. And there's nothing stopping you from writing immutable code in Go or any other language. But it is more enforced in Erlang. And let's talk about how this changes concurrent processes speaking to each other.

So we've got a process on the left and a process on the right. And both these processes have cylinders that they can use to allocate, or these represent data variables that they can use to do the computations they need to. But they share these three green cylinders. And they need these at the start of the computation.

We hit Go. We hit Start. And these computations-- they both mutate the shared data. Now, failure in distributed systems is not an if it happens. It's a when it happens. Everything that can fail will. So at one point, one of these processes is going to die. So we learn that turning things off and turning them back on again is usually a good thing to do. But we needed three green cylinders for that. And now we've mutated them away.

Furthermore, if the one on the left was waiting for something on the right, it is now stopped. So if we can't start that other process again, we're in trouble. So if we separate these two out and not let them touch each other's-- touch the same data together, we still start with three green cylinders. But we'll give them each their own copy. And then they're able to work independently and communicate exclusively through message passing. So this is one major design difference that has bigger implications as we go forward.
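(Editor's sketch: Go does not enforce immutability the way Erlang does, so the isolation below is a convention the code follows, not something the language guarantees. Each worker gets its own copy of the starting data and reports back only through a channel. The names `worker`, `runWorkers`, and `result` are invented for this illustration.)

```go
package main

import "fmt"

// result is the value a worker sends back; nothing else is shared.
type result struct{ id, sum int }

// worker receives its own private copy of the starting data, mutates only
// that copy, and reports its result as a value on the results channel.
func worker(id int, data []int, results chan<- result) {
	sum := 0
	for i := range data {
		data[i] *= id // safe: no other goroutine can see this slice
		sum += data[i]
	}
	results <- result{id: id, sum: sum}
}

// runWorkers starts n workers from the same initial data and returns their
// sums indexed by worker id minus one. The caller's slice is never modified.
func runWorkers(initial []int, n int) []int {
	results := make(chan result)
	for id := 1; id <= n; id++ {
		own := append([]int(nil), initial...) // each worker gets its own copy
		go worker(id, own, results)
	}
	sums := make([]int, n)
	for i := 0; i < n; i++ {
		r := <-results
		sums[r.id-1] = r.sum
	}
	return sums
}

func main() {
	green := []int{1, 2, 3} // the "three green cylinders"
	fmt.Println(runWorkers(green, 2), green)
}
```

Because `green` is untouched when the workers finish or die, a replacement worker can be started from exactly the same inputs.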

The second is linked processes. I said I want to be able to make hundreds of thousands of processes at a time. Erlang will let you link two processes together because these are virtualized processes. These are not something else underneath. So they are able to see when something goes wrong. And in a linked process, when something happens to one, it will-- the other will be notified and be allowed to take action to it.

So again, when it happens, not if it happens, we blow one up. But now the other is notified. And they can choose to do something about it. In this case, what it typically does is restart it. Now, we needed three green cylinders for this. But we never mutated them away. So we have them right in place. And we can almost deterministically go back to where we were before because we linked them.

So if we have immutable processes and we have linked-- immutable data and linked processes, we can have supervisors. What is a supervisor? A supervisor is a process whose only job is to start new processes and watch them. And if anything happens, it restarts them.
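(Editor's sketch: Go has no built-in process links or supervisors, so this only imitates the idea, and it is not how Erlang implements it. A worker runs in its own goroutine, a crash is turned into a message, and a supervising loop restarts the worker from scratch. `run` and `supervise` are hypothetical helper names.)

```go
package main

import "fmt"

// run executes task in its own goroutine and reports how it ended on the
// returned channel, turning a panic into an error. This loosely imitates the
// notification a linked process would receive when its partner dies.
func run(task func() error) <-chan error {
	done := make(chan error, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				done <- fmt.Errorf("worker crashed: %v", r)
			}
		}()
		done <- task()
	}()
	return done
}

// supervise starts task and restarts it from scratch whenever it fails,
// giving up after maxRestarts. It returns the number of restarts performed.
func supervise(task func() error, maxRestarts int) (int, error) {
	restarts := 0
	for {
		err := <-run(task)
		if err == nil {
			return restarts, nil
		}
		if restarts == maxRestarts {
			return restarts, fmt.Errorf("restart limit reached: %w", err)
		}
		restarts++
	}
}

func main() {
	attempts := 0
	flaky := func() error {
		attempts++
		if attempts < 3 {
			panic("lost connection") // the worker simply dies
		}
		return nil
	}
	restarts, err := supervise(flaky, 5)
	fmt.Println("restarts:", restarts, "err:", err)
}
```

The worker contains no recovery logic of its own; the decision about what to do after a crash lives entirely in the supervising loop.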

My first job-- I was a leaf of the organizational tree to a very large company. And another coworker of mine says, Pablo, we may not seem like we're important. But we're the only ones in this company who build the thing that we sell. Everyone else-- they just talk to each other-- talk, talk, talk-- for their counterparts in other languages-- talk, talk, talk. And this is a-- this was a very reductive view on the value of management.

[LAUGHTER]

I have since managed. I never agreed with this assessment or the implied tone behind it. But this is actually how many Erlang programs are structured, where virtually every process is not working on your application logic except the leaves of the tree. And you build trees.

So we have this root process here. It's going to start three others. This top one is going to start an army of workers. And they're all going to-- in this example, going to concurrently process a large document.

The one in the middle is going to put these two large suns just on the right. I made them large to make it more visually interesting. But they don't really mean anything other than they are parts of your application domain. And you build this tree, deliberately thinking how it will fail.

The one at the bottom of the original three I am going to get back to in a moment. But in this case, the orange ones are the ones doing all your application logic. And the white ones are doing nothing else but watching, waiting, ready to restart.

So who's heard of Chaos Monkey? A lot of hands. Deconstruct is great. Chaos Monkey was built by Netflix for systems like this. And in this case, they figure, since it's a when it happens and not an if it happens, why not we be the harbingers of death? So bam, you don't have a cache. How does your application respond? Bam, you don't have a queue. Bam, you don't have a database.

A lot of programmers then have to spend a lot of time thinking to themselves, how does our system respond as a whole? How do we handle these crashes? If we look at supervision trees, this is almost certainly what they were built for. Bam, you lost an entire subsystem of your application. This is noted by the supervisor. And it brings it right back up.

This strategy is called a simple one-for-one. It took eight lines of code to declare. This is built on libraries that have already served billions of other people. I didn't have to do any thinking. Bam, we lost one of our processing nodes. Now, this job-- I've had to think ahead of time and say, does it make any sense, when this fails for one of them, for the rest to return? I'm going to say no. It doesn't make sense to get most of something processed and then some of it not done.

So we gave it a different strategy. And it's just going to straight up murder the other nine. And then it's going to restart. And we're back in business. A powerful short sentence about it is that restarting from failure is the same thing as starting from scratch at every point in the system. And if you build these systems the way they were designed to, you're almost free from Chaos Monkey by default.
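(Editor's sketch: in Erlang this choice is a declarative supervisor setting; the Go below is only an imitation of the "if one worker dies, stop its siblings and restart the whole group" behaviour described above, built from `context` and `sync`. `Worker`, `runAll`, `oneForAll`, and `failOnce` are names invented for the example.)

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// Worker processes one chunk; it should stop early if ctx is cancelled.
type Worker func(ctx context.Context, chunk int) error

// runAll runs one worker per chunk. If any worker fails, the shared context
// is cancelled so the siblings stop too, and the first error is returned.
func runAll(chunks int, w Worker) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	var wg sync.WaitGroup
	var once sync.Once
	var firstErr error
	for c := 0; c < chunks; c++ {
		wg.Add(1)
		go func(c int) {
			defer wg.Done()
			if err := w(ctx, c); err != nil {
				once.Do(func() { firstErr = err; cancel() })
			}
		}(c)
	}
	wg.Wait()
	return firstErr
}

// oneForAll restarts the whole group until it succeeds or maxRestarts is hit.
func oneForAll(chunks, maxRestarts int, w Worker) (restarts int, err error) {
	for {
		if err = runAll(chunks, w); err == nil || restarts == maxRestarts {
			return restarts, err
		}
		restarts++
	}
}

// failOnce returns a demo worker that dies the first time it sees chunk bad.
func failOnce(bad int) Worker {
	var once sync.Once
	return func(ctx context.Context, chunk int) error {
		select {
		case <-ctx.Done():
			return ctx.Err() // a sibling already failed; stop early
		default:
		}
		var err error
		if chunk == bad {
			once.Do(func() { err = fmt.Errorf("chunk %d: worker died", chunk) })
		}
		return err
	}
}

func main() {
	restarts, err := oneForAll(10, 3, failOnce(3))
	fmt.Println("restarts:", restarts, "err:", err)
}
```

The second round starts every worker again from nothing, which is the "restarting from failure is the same thing as starting from scratch" property in miniature.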

So let's talk about bug philosophy. Let's put the whole world of bugs into two categories, the bugs you can anticipate and the ones you can't. And let's split into two other categories, which is a little incorrectly labeled, reproducible bugs, bugs you can reasonably be expected to write code to catch and integration test for, and those not so much. Think of seven layers deep of things just not being on the right time in leap year.

So on the top left quadrant, this is how we mostly think about software errors. We think of the bugs that we can think of ahead of time. We write try-catches. We write status codes. We're in pretty good shape for this. But the bugs we can't anticipate, by definition, will bite us.

On the whole, we're OK, though. This is what most of our jobs are-- is that we see a bug. We go, oh gosh. We write the test case. It's in the top row. We can kind of expect that we should handle this and maintain that handling code. And we push it back up.

Bugs you can anticipate, but ones that are just really too far into the weeds-- they still happen. But that's kind of where that other slide of monitoring, latency requirements, availability requirements-- just, this is where a lot of the work of distributed systems comes in. The bottom right, the bugs you cannot anticipate, and really, you're not able to conceivably, within a reasonable amount of time, fix, is just a-- bad news. That's sad users.

I'm not going to-- emoji doesn't have the emotional granularity for me to make this not look like I'm just making everything happy. I wanted the top right to be a little nervous. I wanted the bottom right to be a little relieved and then the bottom right to be tepidly enjoying the beer. I say this to say that when you have supervisors in place, you have an answer to some of the cases of all three of those quadrants. Erlang still has try-catch statements. It still has checking error codes. You still have the top left quadrant. But it gives you an answer for some of the cases for the other three.

Many of these ideas are well explained by an author named Fred Hebert. He worked at Heroku for a while. He's written a lot of the materials for Erlang, if you were so curious to look at them. And he used the metaphor that I really like that I'm going to borrow. It is that many software errors are thought of as hygiene problems. We write everything with great skin suits. And we use try-catch statements to really just make sure that everything is handled. And we send it off into production. And the minute it finds a pathogen it was not ready for, it kills you.

So then we worry. We worry. And we add better test cases. And we write better skin. And we launch it. And our entire careers are spent just catching these things and throwing them back out. Erlang is more like an immune system. It can survive things you didn't expect by structure and by definition.

So if we look at this example, V2-- and I write Erlang code more to the idealized version, where I say, I'm going to get the Facebook data. And if it doesn't work, I'm going to restart this process. And it's going to try again. And I'm going to get the user data. And if the database doesn't work, I'm just going to restart that again. And it's going to keep trying.

And then I combine them. And I run into an application-level exception I was not expecting. We will restart and try again. And I will upsert it. This is not responsible production-level code, either. But I guarantee you this is a whole lot more resilient. And I don't have to think about how to retry these things, mostly with declarative code.

Last one is observability and patchability. When we look at a system like this, oftentimes, things do not just die. They just get sick. They start acting slow. They start doing weird things. And this hurts our systems. But because these are computers underneath it, we can SSH into them and start running profiling tools, like strace or lsof or any of the rest. And we can apply patches to them. We can even restart the software on them. And we can fix these systems in place.

When I said that the bottom one of the original three was something I'd get back to, these are the libraries that ship with Erlang. One of the processes in every Erlang application you run is something that will listen on the network so that you can log into your program and, like a development REPL, you can start asking it questions about the system. You can start running Erlang profiling tools on the system as it is running. So I can say, what is happening in these three green circles, and why?

Erlang also lets you update the code in a running application. So we can patch the fix. To use the medical metaphor again of Fred Hebert, this is more like doing surgery. Similarly, it is like fixing one subway stop whenever there-- it is in trouble rather than stopping the whole subway system the whole time.

When you have these tools-- restarts, supervisors, and being able to update code in place that you can log into a running application and profile-- it reminds me a bit of a tweet that I saw, where someone had to decommission a server that had been up for 6,500 days. This is 17 years. This was an always-alive computer, to use the language of a previous talk, that had been up for 17 years. And someone had to finally turn it off.

So when you have these tools in place, you don't have to ever turn off an Erlang program. And I'm going to ask you, when was the last time you started a program you wrote, thinking it will never go down, thinking it will never turn off? It's a pretty fun thought. And I encourage us all to try it at some point.

So I threw a lot at you. So let's recap it. To build these big systems, let's have languages that enable us to make lots of processes on the order of 100,000s and use value-based message passing so that they can work safely together. This gives us nice things like pauseless GCs. This is available in Go. And I highly encourage everyone to give it a try.

I've written some Go. It's super fun. And you'll get to play with these tools. If you want to go deeper in the rabbit hole, Erlang, through immutability and the linked processes, lets you have supervisors and observability and patchability to a level I have not seen in any other language or runtime that has gotten this popular.

So I want to thank you for having me speak and for listening to me. And I want to congratulate you because you now all have the knowledge and the power to do the responsible thing, which is go into your organizations tomorrow, find the decision-maker, and say, I think it's time we took engineering seriously. I will leave if we don't rewrite everything in Erlang.

[LAUGHTER]

Congratulations.

[APPLAUSE]

That's not the end of the talk.

[LAUGHTER]

But we're close. But yeah, take a video of it. Put it in the Deconstruct Slack. And we'll all find new jobs. I'm not asking you to rewrite everything in Erlang. I could fill the talk with disclaimers. But I do think it's worth taking a look at if you want to change how you think about software and failure.

The second big takeaway-- if you're looking for inspiration, I highly recommend looking back into computing's past. We have a lot of half-eaten desserts, a lot of amazing pastries that we took a bite out of and then put back. And then the industrial PC revolution happened and Web 1.0 and Web 2.0. And we left a lot of stuff behind. So I invite you to look back.

This time, thank you.

[APPLAUSE]