Deconstruct: Seattle, WA - Thu & Fri, Apr 23-24 2020



Aim Small, Miss Small: Writing Correct Programs (Stuart Halloway)


(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

[APPLAUSE] Thanks, Gary. My name's Stuart Halloway, and I work at a company called Cognitect. And we like to talk about stuff and share knowledge. If you've read any of these books, you've read something written by somebody who works with me at Cognitect. And we are also the company that is behind the open-source programming language, Clojure, which is a language which is known for making developers ridiculously effective.

In my day job, I am the lead developer on a database called Datomic. And it has lots of characteristics. But given the other talks that have occurred over the last couple of days, I would highlight here that schema changes in Datomic are lock-free.



Now, having said that, that doesn't mean life is always perfect. So I'm going to tell you a story. Once upon a time, there was a team using Datomic. And Datomic has an abstraction over block storage. So when you design a Datomic system, you can choose your underlying block store. We originally targeted DynamoDB, but you can also use Couchbase, [INAUDIBLE], JDBC to Oracle, SQL Server, Cassandra.

And there's also an embedded H2 database, which is convenient for developers because it runs inside the process that is acting as Datomic, so it's less to get going with operationally. So as happens, this team decided that, for convenience, they would develop against H2 and they would go to production with Cassandra. Also as happens, guess when they decided to test going to production? They tested going to production maybe a day before they actually had to go to production. And things did not work well.

All of a sudden the basic transactions that were working just fine when Datomic was backed by H2 stopped working. And they were failing hard, the system was falling over, transactions were exceptioning out, and people started to get a little bit nervous. And they looked at the problem, and they got scared. And pagers were paged, and chats were chatted, and slack were tightened. And I was pulled in, and I was on vacation.


So this is not a happy beginning to a story, and I don't want you to live through this kind of story. And so I want to talk to you about writing correct programs and about correcting programs. So there are a ton of tools that you can use to help write correct programs. We've got enough time here. Let me explain them all. No. There's too much. Let me sum up. The tools do not work.


The cost of software defects is now $1.1 trillion. That number is in the United States alone and in a single year-- I should mention, by the way, that when I was doing research, I found that number to be staggering. And in fact, I don't even trust it. So there are notes on all the slides, and you can go and fact-check this and decide what number you think this should be. But even if you mark this down 90%, this is terrible.

I think we can do better. And the trick is to aim small, miss small. Let's start with aiming. In the words of the famous mid-twentieth century epistemologist, Yogi Berra, if you don't know where you're going, you might end up somewhere else. This happens with programs all the time. And it sounds like sort of trivial advice. Well, of course you should decide where you're going and go in that direction. But we find ourselves coding in the wrong direction all the time. And there are mechanical things you can do to not get into this situation.

So before you write code, you need a problem statement. What problem are you trying to solve? Doesn't need to be long, but it should be specific, it should be precise, and you should write it down. Get it out of your head, get it into the world. This, and not your code, is the most important artifact you make when you're building software.

As you form your plan to defeat the problem, you should write down a rationale. The rationale tells what you're doing. It tells why you're doing it. It also-- and this is important-- tells what you're not doing and why you're not doing it. Code is no good for this. Code is terrible at this job. It's particularly terrible at telling you what's not in the code, right? The choices that you didn't make, code is not going to tell you anything about. And all of this stuff should be indelible. You don't erase it.

So you're going to change your mind, right? I'm not advocating for some sort of waterfall design process here. 20 minutes into thinking about the problem, you're going to realize that your problem statement needs to be refined or that your rationale needs to be refined. That's great. Refine it, but do not throw away what you started with. You never know when you're going to want to come back to it.

And Nathaniel already said this yesterday. He ended his talk with this slide. When you make an important decision, write it down. Decision-making is what we do. If you think what you do is writing code, you are mistaken. This is actually the what we do of what we do. And the aiming portion of this, the part that you do at the very beginning when you're deciding what you're going to do, is the highest leverage moment. This is where you have the chance to be going in the right direction.

So throw out your developer tools and concentrate on tools that let you manipulate text and let you manipulate pictures. Now, the tools that I'm showing here I'm not endorsing other than by way of example. These are the ones that I use. I use Org-mode and Emacs for text and a bunch of wikis, whatever is required by projects. And I use a Mac tool called OmniGraffle for drawing. But you should find tools that work for you or your team, and this is where the core of your work goes. Your code is not the core of your work. This is the core of your work.

So that's aiming. Now, you want to make things small. The smaller an area of code you have, the fewer bugs you can have. But what does it mean to be small? One-letter variable names? Really short methods? My favorite, font size? I am actually happy to make a principled argument for every one of these, but these are not the most important part. There are much more important ways to reduce code size, and the most important one is to make things that are simple.

The word simple has a lot of definitions in modern usage, but the original definition-- and the one that's the most useful to a software developer-- comes from simplex, which means one fold or braid. And the opposite of that, then, is complected, which means braided together. And here the picture's a little small, but there are four strands labeled A, B, C and D, which are going to be braided together. And this is what happens in our programs.

And the way it plays out is, later, you want to use strand A, and you discover that strand A of whatever your code is doing is complected with something else. I wanted this piece of the code to use again, but it's intertwined with this thing, B. It's intertwined with this other thing, C and D. And this gets worse at scale, not better.

And the things your code is intertwined with, they are a dependency bloat. I've spent a lot of time in the last two decades letting Maven download the internet onto my computer. It's bloat, but it's way worse than bloat because it's things that are actively undermining what you're trying to do. You're trying to use a piece of code, A, that does what you want that is intertwined with B, C and D, which are actually pulling in the wrong direction, usually, by the time you go to reuse code. So you want to write code that is simple and not complected.

Another thing you want to do is to write code that's general. And there's this great quote from Alan Perlis that people in the Clojure community in particular always pull out. And the quote says, we want to have 100 functions on one data structure rather than 10 functions on 10 data structures. And if you're trying to optimize for size, it's not exactly clear why this matters. If anyone wants to check the math, 100 times 1 is 100, and 10 times 10 is also 100. So it's not clear from the quote where this starts to matter for the size of the eventual program that you're going to write.

But let's take a particular example that I grabbed off of Stack Exchange. Lewis is writing a program to manage a very trivial little domain, which is lists of books. And Lewis discovers that there are already 22 classes. And so on Stack Exchange, he asks, hey, is this too many? And it is. 22 is too many. But the good question is, what's the right number?

Well, I'm excited to say that today, in this room for the first time, I am going to release an open-source program that can take Agile story cards or GitHub issues and tell you how many new data structures you need to make.


So this is going to be analyzing the text and analyzing the-- and this is specifically for information-managing applications, things that are managing information. And here it is. Right. Your programs should not make new data types to represent information. We already have perfectly good data types for representing information. You learned them in your data structures class in college, and we don't need anything besides that.
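(A minimal sketch of that idea, in Python rather than Clojure. The record contents and the helper names `select_keys` and `group_by` are made up for illustration and are not from the talk. The point is only that two general functions work on any kind of record once information is held in plain maps and lists.)

```python
from collections import defaultdict


def select_keys(record, keys):
    """Return a new dict holding only the requested keys that exist."""
    return {k: record[k] for k in keys if k in record}


def group_by(key, records):
    """Group a list of dict records by the value stored under `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


# Hypothetical data: no Book class and no Member class, just maps in lists.
books = [
    {"title": "Book A", "author": "Ann", "year": 2001},
    {"title": "Book B", "author": "Bob", "year": 1999},
    {"title": "Book C", "author": "Ann", "year": 2010},
]
members = [{"name": "Lewis", "city": "Seattle"}]

# The same general functions serve both kinds of information.
by_author = group_by("author", books)
titles = [select_keys(book, ["title"]) for book in books]
by_city = group_by("city", members)
```

Every new kind of record added to such a program reuses the existing functions instead of bringing its own methods with it, which is one reading of why the Perlis quote matters for program size.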

And this is a catastrophe at scale. If you compare a typical Java program-- and I'm talking here about your entire application. So not just your own code, but your dependencies-- you're going to have hundreds to thousands of data structures, where every new class is a data structure. And you're going to have thousands to I don't even want to talk about it number of functions or methods.

Contrast this with an idiomatic Clojure program, which follows Perlis' advice. A typical Clojure program will not introduce a single new data structure. So those tens of data structures are the ones that are built into Clojure. And of course, the program does something, so it will introduce something. And that's where those few hundred functions come from.

So this kind of size transformation is a way bigger deal than switching to a smaller font size. These are things that can cause programs to be an order of magnitude smaller. They can cause the size of the team that's needed to build the program to be an order of magnitude smaller. So that's pretty great.

Now, let's imagine a world where you aim and you make things very small. I've been trying to live in this world in the Clojure world for the last 10 years. And I can tell you that this has eliminated a lot of problems, that I can get more done with less code, but my rate of defect introduction is still high. I still make mistakes. My code is still filled with all the same kinds of bugs that everybody else's code is filled with.

And so we're back to the tools. I guess we have to go back to these tools and have something help us out. It's kind of exhausting. I don't want to talk about the tools. Let's skip it again.


Instead, let's consider things a little more broadly. Each of these tools is a source of evidence. And I love, Ellie, that you used this word late in your talk. This is a great word. This is an important word. This is how we reason about systems. Our systems produce evidence, which is appearance from which inferences may be drawn. And the word appearance sort of implies that this is visual, but it doesn't have to be.

In fact, I can troubleshoot a number of different problems on this Mac laptop by listening to the fan. I can tell you the difference in sound between the fan when I'm in GC hell and the fan when one thread has gone crazy in Chrome, right? Those make a distinct sound. I don't even have to look up. I just lean over and kill Chrome or I lean over and kill the JVM. I know what thing is going wrong.


Having said that, some kinds of evidence are better than others. Logs are fantastic evidence. They really are about as good as it gets. And it's not because they have UI affordances, although that's going to be helpful. It's because they have characteristics as data structures. Logs are persistent in the computer science data structure sense, which is, they contain their own history.

So given a log, you can also discover the state of the log at some point in the past. Logs are indelible. Nothing is ever erased. Logs are temporal. They provide some time ordering about what happened in the system. And they're ordered. They provide a causality order of what happened in the system. This happened, then this happened, then this happened, then this happened.
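(A small sketch of those four characteristics, in Python. The class name `AppendOnlyLog`, its methods, and the sample events are assumptions made for illustration, not anything shown in the talk.)

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Entry:
    seq: int    # position in the log: the ordering
    at: float   # timestamp supplied by the caller: the temporal part
    event: Any


class AppendOnlyLog:
    """Entries can be added but never changed or removed (indelible)."""

    def __init__(self):
        self._entries = []

    def append(self, at, event):
        entry = Entry(len(self._entries), at, event)
        self._entries.append(entry)
        return entry

    def entries(self):
        """The whole log, as an immutable snapshot."""
        return tuple(self._entries)

    def as_of(self, seq):
        """The log as it stood just after entry `seq` was written."""
        return tuple(self._entries[: seq + 1])


log = AppendOnlyLog()
log.append(1.0, "tx started")
log.append(2.0, "write failed")
log.append(3.0, "tx aborted")

# Persistent in the data structure sense: the past is still recoverable.
past = log.as_of(1)
```

Because nothing is overwritten, asking what the log looked like at an earlier point is just a matter of reading a prefix.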

Now, I've put an asterisk next to temporal and ordered because logs don't provide you with perfect temporality or perfect ordering. But that is a physics problem and it's out of scope for this talk. So it is a problem, but this is about as good as we can get. And in fact, when you start to look around, there are a few things that we have in our systems that have this characteristic. Logs are one.

Another one is Git. You know what's great about Git? It has exactly the same characteristics logs do. It's persistent. It has temporal characteristics. It has ordering characteristics. Again, little caveats on temporality and ordering, but it provides something there. So we know how to make stuff like this.

I propose that we should model domain data entirely using Git and logs, since they have these terrific characteristics. Actually, I'm teasing a little bit because there are data structures that have these characteristics. They're called persistent data structures. They've become radically more popular in the last 10 years as functional programming has taken off. And I can tell you that using persistent data structures is the sleeper feature that makes Clojure awesome.

I've been working for 10 years now on a stack that uses these data structures, and it is a ridiculous unfair advantage over my peers when I have to figure out problems because the evidence everywhere in the system is of high quality. Everywhere in the system, you know where you are and you can back up and know how you got there.

So find the persistent data structures in your language and start using them. There is one easy takeaway. It's late on the second day, so keep a very small specific takeaway. Go and find the persistent data structures in your favorite language and use them, and use them all the time. If you do not have a documented requirement that forces you not to use them, you should be using these everywhere in your programs.
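(For readers who have not met the term, here is about the smallest persistent data structure there is, sketched in Python: an immutable linked list. The names `Cons`, `push`, and `to_list` are illustrative choices, not from the talk. Adding an element produces a new version that shares its tail with the old one, and the old version stays valid.)

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Cons:
    head: Any
    tail: Optional["Cons"]  # None marks the empty list


def push(lst, value):
    """Return a new list with `value` in front; `lst` is left untouched."""
    return Cons(value, lst)


def to_list(lst):
    """Copy the linked list into an ordinary Python list, front first."""
    out = []
    while lst is not None:
        out.append(lst.head)
        lst = lst.tail
    return out


v1 = push(push(None, "a"), "b")   # first version
v2 = push(v1, "c")                # second version, built from the first

old_contents = to_list(v1)        # the earlier version is still intact
new_contents = to_list(v2)
shares_structure = v2.tail is v1  # no copying: v2 reuses all of v1
```

Keeping `v1` around costs nothing extra, which is what makes it cheap to hold on to the history of a value as evidence.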

So that's cool. We're going to know when we miss. We're going to have good evidence. The second use of the word small here is about the feedback loop. You have a problem, you have some evidence. How do you tighten the feedback loop to get to a fix? And I am going to take a side trip for just a second and talk about one tool here, which is the REPL-- the read eval print loop-- in a Lisp like Clojure.

And I don't have time to give this talk. I'd like to give this talk as well. There's links at the bottom here. But if you like test-driven development, you should come over and try REPL-driven development because it attacks a subset of the same problems and it attacks them better and more effectively. Unfortunately, that's not this talk, but I have something even better for you. And this is probably the single biggest idea in the talk.

For centuries, a secret society has moved behind the scenes of human history to make life better. And people who have been carriers of this secret flame have been persecuted. They have been hounded, but they have persevered. And they have worked to deliver the most powerful debugging tool in the universe. And I'm going to share it with you right now.

It's called the scientific method. Now, the good news is the scientific method is well documented in a lot of places, so I'm not actually going to teach you the scientific method right now. What I'm going to do instead is I'm going to point out the characteristic of the scientific method that matters for you as a programmer, and that is the scientific method allows you to divide and conquer problems.

If you've ever played 20 questions with a small child you know how not to do this, right? You're playing the game with a child and they're like, is it Elvis? And you're like, no, you have to divide the world up and sort of work your way to that. The scientific method is the single most important thing you can do when you're trying to understand a program. And this happens at every level.

Developing is actually debugging. You're constantly debugging, even when you're writing the code for the first time. And the scientific method is supported by a terrific tool set. So here's one particularly good tool. I think you're going to like these. If you haven't seen them before, the battery life is fantastic. And basically, the way these work is you write down the steps of the scientific method one at a time as you go-- what's the hypothesis, what's my narrowing experiment, whatever.

And if that tooling is too primitive, you can also check out these.


I don't think it matters so much. But this really works and it really works because of that algorithmic characteristic of the Scientific Method, which is you're dividing and conquering. If I am dividing and conquering a problem and you are wandering around linearly, you are not going to find shit no matter how good your tools are. And if you're not writing down where you've been, you're not even going to remember where you are at the end of the day. So follow the scientific method. Write things down.
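(A sketch of what dividing and conquering buys you, in Python. It assumes a hypothetical ordered list of changes in which everything before some point is good and everything from that point on is bad. The function `first_bad` and the predicate `is_bad` are illustrative names, and each call to `is_bad` stands for one experiment.)

```python
def first_bad(changes, is_bad):
    """Find the first bad change by halving the search space each time.

    Assumes changes[:k] are all good, changes[k:] are all bad, and at
    least one change is bad. Returns (index k, experiments run).
    """
    lo, hi = 0, len(changes) - 1
    experiments = 0
    while lo < hi:
        mid = (lo + hi) // 2
        experiments += 1
        if is_bad(changes[mid]):
            hi = mid          # the culprit is at mid or earlier
        else:
            lo = mid + 1      # the culprit is after mid
    return lo, experiments


# Hypothetical history of 1000 changes where change 637 introduced the bug.
changes = list(range(1000))
index, experiments = first_bad(changes, lambda change: change >= 637)
```

With 1000 candidates this takes at most 10 experiments, where checking them one at a time from the start would take 638.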

Interestingly, the scientific method has been much criticized in science because it doesn't actually necessarily model the process of scientific discovery. It's actually better-suited for debugging than it is for science. So we might want to just take it over and say, we're just going to call this the debugging method. There's a great book called Why Programs Fail. I recommend this to everybody. Some of the chapters in this book are things that have now become standard Agile practice and some have not yet, but should do.

So back to the story, with the scientific method in hand. Everybody is running around terrified. We can't go to production. We don't know what's wrong when we're talking to Cassandra. And so the team proposes two baseline experiments. This is the very front door of the scientific method. Don't even have a hypothesis yet. We just want to agree on what our problem statement is and have some baseline experiments. We're going to have a baseline experiment that shows the customer code calls Datomic, calls H2: happy. And then another experiment that differs only in replacing H2 with Cassandra: sad.

So we set up this experiment. And lo and behold, what do we find out? Turns out that as soon as we sat down to run that experiment, it actually blows up with both H2 and Cassandra. How could that be possible? Has anyone watched the show House MD? Everybody lies, right? And it's mostly not lies of malice, right? There was a mistake. There was a miscommunication. It turned out that the particular branch of code being tested had never been tested anywhere else.

And so in addition to using Cassandra for the first time, they were running the code for the first time. How much actual genius-level IQ did I have to apply to solve this problem? None. I am way too lazy at this point in my career to attempt to muster a genius-level IQ. Can't do it. But I can say, write down your problem and break it into pieces.
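(The shape of that baseline experiment can be sketched in Python. Everything here is a stand-in: `fake_system` is a hypothetical function that imitates the story's outcome, and the branch and storage names are invented labels, not real Datomic configuration. What matters is that each pair of runs differs in exactly one variable.)

```python
def fake_system(branch, storage):
    """Hypothetical stand-in for the real system: True means it works.

    It imitates the story: the failure comes from the branch of code
    under test, not from the storage underneath it.
    """
    return branch != "new-branch"


# Baseline and variant differ only in storage.
baseline = {"branch": "new-branch", "storage": "h2"}
variant = {**baseline, "storage": "cassandra"}

# A second pair that differs from the baseline only in branch.
known_good = {**baseline, "branch": "tested-branch"}

results = {
    (cfg["branch"], cfg["storage"]): fake_system(**cfg)
    for cfg in (baseline, variant, known_good)
}

# If the baseline itself fails, "Cassandra broke it" is not the problem
# statement, and storage is ruled out before any deeper digging starts.
storage_is_the_variable = (
    results[("new-branch", "h2")] and not results[("new-branch", "cassandra")]
)
```

Running the baseline first is what exposes a wrong problem statement cheaply, before anyone starts guessing.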

So what went wrong on this project? Well, the code was complected. I didn't really sort of talk about that. So it was harder to set up the experiment than it should have been because the code was complected. The problem statement, I mean, there were statements, right? They were writing things down, but they didn't have a good problem statement. The problem statement that they did have turned out to be incorrect-- factually untrue. And from that factually untrue statement, they had then started guessing without a method.

So let's pull this all together. Aim. This is about your thoughts. Your thoughts are the most valuable thing you bring to the table. You should have some thoughts. You should write them down. You should draw them. And you should preserve their history more carefully than you preserve the history of code.

Small comes from making things simple and making things general. And there's a lot more that I can say about this that I won't here for reasons of time, but there is literature on both of these things. When you miss, you need evidence. You need to understand what went wrong. And the best source of evidence going is things that are persistent data structures, so use the persistent data structures in your language.

And finally, use the scientific method. Pick one time in the next week where you're trying to problem-solve, and actually write down what the problem is. Write it out. If you don't do this already, write down what the problem is, write down a hypothesis, write down an experiment that could falsify or prove that hypothesis.

Now, there are a number of talks that go more deeply into some of these topics. These will be in the slides for you to look at afterwards. Hammock Driven Development is really a talk a lot about aiming. Simple Made Easy is a talk about making things simple. Debugging with the Scientific Method is talking about making things small.

And I will post a link to these slides in the conference Slack in a few minutes. I would love to meet you all. I've met some people. This has been a great Deconstruct for me. I hope it has for you. I'd love to hear your stories. I will also drop my dinner plans in the conference Slack. And if anybody wants to come along and talk about these issues or others, that would be terrific. My name is Stuart Halloway. And thank you very much.