Deconstruct - Seattle, WA - Thu & Fri, Apr 23-24, 2020

(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

[APPLAUSE] Hi, oh, my gosh. It is so nice to meet literally every single one of you. My name is Erica Gomez. I have worked in a lot of different domains in this industry. But most recently I'm an engineering manager. So I've been working on a lot of email. I write email for a living. It's great.

But what I actually do is help people who receive government food benefits, or SNAP (yeah, it's an awesome program), use those benefits on Amazon to get their groceries delivered to them at home. But I'm not going to talk about that today. Instead, what I'm going to talk about is an update on Aesop's classic fable: what if the tortoise and the hare wrote software?

So I think it's fair to say that in software we're kind of obsessed with speed. You know, we sprint. We're agile. We move fast. We break stuff. You know, it seems to work, like kind of.

We can legitimately say that software changes the world in some really fundamental way almost every year. And so our industry's become synonymous with speed. You know, we call a 60-hour work week a pretty good example of work/life balance. And sometimes this is what a career feels like.

But you know, like, yeah, we talk about some aspects of this. You know, we debate it, like wellness retreats and, I guess, eating salads alone, which is healthy. It's not like I do this every week. But I don't know how much we question this particular obsession with speed.

And it's not like this came out of nowhere, right? Because if we don't move fast, how are we going to build the next world-changing thing? Because that's the way software is. And that's the way it's always been. And that's the way it should be. Or should it?

Well, I want to sort of pull apart this myth about software and speed and why they have to coexist. And I want to talk about airplanes to do that. So I know that on its face, there's not a product more different from software than airplanes. Except for the fact that airplanes have a lot of software on them. I mean, they're giant physical objects.

They can take a decade to build start to finish. They're safety critical. And yeah, sometimes software is safety critical too. But airplanes have an entire federal governing body dedicated just to their development. Most importantly, the development process includes intentional friction. And it's in every step of the development process, because if something goes wrong, if you deploy a bug, you can't hotfix a catastrophe.

So I'm going to ask a brief question. You don't need to yell this out or anything. But how much software do you think is on the typical commercial airplane? Well, I will give you an idea of how much. There is some data that says the average pilot spends about six minutes of a flight actually flying the plane.

This is a diagram of what is called the flight profile. And that shows all the major phases of a flight leg. And while the pilot's not spending a ton of time, you know, kind of like hands on yoke, they spend that time doing some really, really important stuff-- the takeoff, the climb, and the landing basically.

But for the most part, while you're like sleeping or chewing your packets of kibble for humans, avionics takes care of the rest. There's good airplane food, right? And avionics as a result is super, super regulated. And it's regulated by this Federal Aviation Regulation set, called the FAR. And when I say regulated, I mean regulated.

I spent the first five years of my career working on performance and predictive analytics systems for commercial aircraft. And you know, to help you see kind of where this data regulation and caution all come together, I'll just do a quick overview of how this works. I've got to get a sip of my Marco Rubio. Come on, 2016 debates, anybody.

So OK, let's talk about how this system works. Well, so first, there's sensors. There are upwards of 100,000 sensors on a typical airplane. And these sensors are gathering all kinds of data in real time. Everything from your wind speeds, your tire pressures, water levels. And this sensor data is relayed by a central bus. And it goes to various onboard computers.

These computers basically gather the data and create reports. Some of these reports are things like the airplane condition monitoring system, the engine indicator system, and pilot reports, which are things important enough that they need to go to the flight deck in flight. And these are all relayed by what's called the Aircraft Communications Addressing and Reporting System, or ACARS. And then this is sent either via satellite or via airband radio to ground stations. At this point, we gather all this data and figure out what issues the plane might be currently having in flight.

So for example, if we see that an auxiliary power unit has a high probability of failure in the next 15 flight legs, well, we need to gather every single type of performance manual and repair manual, engineering spec, to then gather, aggregate and send over to the right people.
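To make that flow a little more concrete, here's a hedged, purely illustrative Python sketch of the ground-station side of this pipeline. All the names, metrics, and thresholds here are invented for the example; a real system would run actual predictive models over the relayed telemetry.

```python
from dataclasses import dataclass

@dataclass
class SensorReading:
    component: str  # e.g. "auxiliary_power_unit"
    metric: str     # e.g. "exhaust_gas_temp"
    value: float

def failure_probability(readings):
    """Toy stand-in for a real predictive model: the fraction of
    readings exceeding a nominal bound is treated as the risk."""
    if not readings:
        return 0.0
    exceedances = sum(1 for r in readings if r.value > 100.0)
    return exceedances / len(readings)

def maintenance_alerts(readings, threshold=0.5):
    """Group relayed readings by component and flag components whose
    estimated failure probability crosses the alert threshold, so the
    right manuals and specs can be gathered for the ground crews."""
    by_component = {}
    for r in readings:
        by_component.setdefault(r.component, []).append(r)
    return {c: p for c, rs in by_component.items()
            if (p := failure_probability(rs)) >= threshold}

# A tiny batch of relayed readings: two hot APU readings, one
# perfectly normal tire pressure.
readings = [
    SensorReading("auxiliary_power_unit", "exhaust_gas_temp", 130.0),
    SensorReading("auxiliary_power_unit", "exhaust_gas_temp", 121.0),
    SensorReading("landing_gear", "tire_pressure", 32.0),
]
print(maintenance_alerts(readings))  # flags only the APU
```

The point of the sketch is just the shape of the system: many streams in, aggregation by component, and an alert that triggers the human workflow of gathering manuals and specs.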

And so you can see that for having to do this while the plane is still in the air, you cannot be anything but exact and cautious and with good reason, right?

So this is a graph of aircraft safety statistics. It only goes up to 2012. But if you extrapolate out to 2017, you'll see that 4.1 billion passengers are carried on commercial aircraft every year.

Now, it's hard to get a firm number on this. But the current consensus is, if you are involved in a serious aircraft related incident, you only have about a 1-in-2 chance of survival. Now of course, as you can see here from the primary y-axis, overall flying is incredibly safe.

And it's because of the level of rigor involved in building and maintaining these systems. That's why, for example, it took eight years for the 787 to go from conception to reality. And you know, here you can see it in its delivery livery for ANA. And so this is why the flight regulatory agencies build intentional friction into design and development, because very simple assumptions can have very serious consequences.

So on these giant, heavily regulated machines, there's kind of become a philosophical DMZ over how much reliance on software they should have. And the global aircraft duopoly sits on either side of this debate. This is a diagram of the two major types of flight systems for controlling the aircraft.

Over on one side you have Airbus. Since 2007, Airbus has deprecated all mechanical backups of the elevators, the flaps, and the rudders. So what they have is essentially a pure fly by wire system. It's only in the event that all backup systems fail that pilots can exceed the flight envelope of a plane and take control of the mechanical backups.

On the other side you have Boeing. And Boeing has said that they will always leave power to the pilots ultimately, and maintain backup mechanical controls. Now to be clear, pretty much every commercial aircraft nowadays is a fly by wire system. It's just a question of what kind of backup is available.

Now, Boeing seems to have had a point. Because in the first ever public demonstration of a fly by wire aircraft which was at the Habsheim Air Show in 1988, well, that aircraft crashed. Airbus concluded that it was pilot error. But the pilot said that the computer took control of the plane when the plane was at low speed and caused the crash.

But I think it is way too simple to say that this is just about the volume of software on airplanes. It's really about how the software gets built. So if we go back-- and we're going to do a lot of history here-- so just fair warning. If we go back to the 1930s, we get to what would eventually become the Toyota production system.

Now the Toyota production system is founded on two major principles. The first one is continuous improvement. And this is where we get things like Kaizen and Kanban and what we generally know as Lean manufacturing. The second one is automation with a human touch. And what that means is you bring human expertise in until a process is so elegant and so simple that you know you can roll that process out onto the assembly line.

Now Toyota was so successful with this that by the 1980s they basically became the dominant global manufacturer. And other manufacturers took notice, especially in the aerospace industry. You might say they leaned in to Lean. I'm so sorry.

They did this so much so that they pulled Lean into their organizational models. One aerospace exec at the time said this: we're finding that you really need to look at the enterprise as an integrated system, not the thing that you're building, but the people that are building it.

And so it's through this sea change that aerospace companies become system integrators first and engineering companies second. Agile concepts pop up in the late '90s. And in 2001, you have all these cool cats writing the Agile Manifesto in Utah.

And suddenly what happens is the aerospace industry is kind of molding itself around this Megatron of practices. Companies cherry pick from Lean and from Agile. And they only pull out the things that sound like: software goes fast, cut costs. And they ignore the greater context and principles that undergirded these really important, valuable, and effective systems.

And so this turns Lean and Agile into something that's not really understood but is instead performed. It's kind of like an organizational cargo cult. Then it's not like, hey, we build cars efficiently and with minimal waste. Instead it's, hey, machines need to be designed just in time. And software needs to be implemented just in time, and you need to 5S your file share, if anyone is familiar with Six Sigma.

And this idea that everything can be built super fast and with very few resources, whether it's your bitcoin-powered toaster or your A380, basically becomes common. But the most important part about the adoption of facets of these two practices is that they ignore the most important parts.

The first part of the Agile Manifesto says: individuals and interactions over processes and tools. And in the Toyota production system: respect for humanity. Well, you combine this with the rise of feature-rich IDEs and cheap hardware, easy remote work, and follow-the-sun development. And by the early aughts you have this incredible confluence of efficiencies. And we are so efficient that friction just kind of goes away.

Well, what are the consequences of all this efficiency? We know it's had some really serious consequences. The last year saw two fatal crashes of the Boeing 737 Max. I'll read a couple of these headlines. From the New York Times, Boeing built deadly assumptions into the 737 Max, blind to a late design change.

From NPR, pilots criticized Boeing saying that 737 should never have been approved. Well, at this point news organizations have reported that they think the root cause was a late change to what is called the Maneuvering Characteristics Augmentation System, or MCAS.

So what is MCAS? MCAS is the system that is responsible for preventing a stall, i.e., when the wing loses lift and there's too much drag. And when the 737 Max was first being developed, MCAS was implemented as a way to compensate for some structural balance changes that were made to the original 737 platform, so that this could just be a derivation of that platform.

But in the first version of MCAS, it relied on a variety of sensors to determine when to engage and take over for the pilot. One of these is the angle of attack sensor right here, which measures the angle of the nose. But in flight tests, there were reports that pilots noticed the plane wasn't handling low-speed stalls well. So there are reports that talk about the options that were considered to handle this case.

One of the options considered was aerodynamic, a hardware fix, right? But this would've been incredibly expensive. It requires structural changes to the plane. And because you're going through basically the design and development process of major parts of the aircraft again, you essentially may have to re-certify the aircraft. Reports say that in order not to have to do that, which could have added years to the project, they instead pursued a software workaround in MCAS. And so what happened was, all of the sensors needed to trigger MCAS were removed, except for that one angle of attack sensor.

And this wasn't documented in pilot manuals. And it wasn't part of training. So in essence, what happens? Well, this is a design gap, a hardware gap that's patched with software.
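To see why one sensor matters so much, here's a hedged, purely illustrative Python sketch. None of this is Boeing's actual logic; the function names, the 15-degree limit, and the disagreement threshold are all made up. It just contrasts acting on a single reading with cross-checking redundant sensors.

```python
def should_push_nose_down_single(aoa_sensor: float, limit: float = 15.0) -> bool:
    """One angle-of-attack reading drives the decision: a stuck sensor
    reporting 40 degrees triggers the system even in normal flight."""
    return aoa_sensor > limit

def should_push_nose_down_voted(aoa_left: float, aoa_right: float,
                                limit: float = 15.0,
                                max_disagreement: float = 5.0) -> bool:
    """With two sensors, refuse to act when they disagree badly; a real
    system would also alert the crew to the disagreement."""
    if abs(aoa_left - aoa_right) > max_disagreement:
        return False  # sensor disagreement: disengage, don't guess
    return (aoa_left + aoa_right) / 2 > limit

# A stuck sensor reading 40 degrees while the other reads a normal 3:
print(should_push_nose_down_single(40.0))      # True  -- acts on bad data
print(should_push_nose_down_voted(40.0, 3.0))  # False -- disagreement caught
```

The design point is small but fundamental: redundancy only helps if the software actually compares the redundant inputs before acting.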

Now look, the investigations into these tragedies are still going on. We don't know all the details, and we are all still learning them. I think in general, though, when you see a major catastrophe that has a software cause at its root, that bug is more often than not a manifestation of much bigger systemic issues on whatever program it was a part of.

So now I want to talk about where teams fit into this. Because we all know that building good large scale software is like really, really hard, right? And we rarely have the space to kind of invest in the tools that we need to make it easier. Because that's all expense, right?

And if you want to move quickly, you've got to decompose things and separate out concerns. And you have to move in parallel motion, right? You need a lot of this parallel motion. And so what happens?

Well, this has consequences not just for the software that we're building and how we design it, but it in turn feeds back and changes the teams that build the software, right? It's this kind of feedback loop that changes the structure of how we organize human systems.

So let's take a step back again and talk about the first time the word software was used in published form. It comes from the American Mathematical Monthly. And the statistician John Tukey said: the software comprising the carefully planned interpretive routines, compilers, and other aspects of automated programming are at least as important as the hardware.

Well, what's the most important part of this? In my opinion, it's the "carefully planned" part. And Tukey had a really good reason to say it this way. Because at the time that he was writing software, he was using these: punch cards. Looks may be deceiving, but I have never written software with punch cards. I postdate that.

But I can Google as well as anybody. And so here's my summary of it. Essentially what would happen is when you were going to write software with one of these systems, you'd have to hand write all your routines. And after you hand wrote them, you would work with either a lab assistant or yourself and you would translate these all into punch cards.

You would get into a queue to have them executed. And then you would get a printout of the output of your project. If you were lucky, you were at a place that had a lab with a punch card system. And if you were not lucky, you took those punch cards and you packed them in a box. And you shipped them to the nearest university that had a lab.

And then they would go into the queue there. And then they would get run and printed out. And then your cards and your printout all got packed back into a box. And you had better hope they were in the same order. And that's why people would make these marker notes across the side. And they'd get sent back to you.

So debugging could take weeks for something simple. And so you had to be careful, because mistakes were really expensive. But I feel like this caution, this careful planning, has been kind of sidelined. We don't have to ship punch cards to a lab. We can roll out a prototype in a couple of hours, right?

And so with the 737 Max, when the investigation concludes, I think we may end up seeing an example of how something that's kind of banal, like pretty common, this notion of speed to market, impacts a multitude of things. It has potentially tragic consequences. And also, when we use software as a way to compensate for what are essentially design, hardware, or organizational defects, well then, that software becomes the single point of failure.

But software as a compensatory mechanism isn't always bad. And one example of this is Margaret Hamilton, who was the lead software engineer for the lunar landing module. Yeah, she is awesome. One of the things that she did was she specifically wrote the lunar landing software as an intentional backup to the module hardware itself. And it worked as intended.

But usually we see this anti-pattern of flaws and requirements and design and hardware being patched with software. Because software is the last thing built in that chain. So I think it's pretty easy for all of us to agree that software should be built very slowly and carefully for airplanes. I think that's a very easy case to make.

And we all want our airplanes to work flawlessly. But why should we slow down in other domains? I mean, most of us don't work in safety critical software.

Of course we want to build better things. And we want to be the best engineers we can be. But why should we invest in our systems to the same degree? Well, there's a ton of examples. I think we've all used half-baked systems, right? And part of it is because of the nature of how software launches work, right? Like, we can be cavalier about it to some degree, you know.

Hey, just roll back. Just disable the feature, whatever. And so because we can deliver things so quickly, we never really ask if we should. And I think we should ask this question, because recklessness and speed have a cost.

And these are just a few of them, right? Developer wellness: when you're up at 3:00 AM working an operational issue and you're exhausted, what's going to happen to your happiness there? Customer trust, your PR, attrition: people leave teams because they are so tired of trying to fix broken things. And these are real costs. But the problem is, they're just not as easy to quantify as clicks.

So I think we should ask this question because at the end of the day software is written by humans and for humans. Earlier I mentioned that whole first published use of the word software. I lied. There's actually one that was two years earlier. And it comes from the Institute of Radio Engineers.

Richard Carhart, in his paper on quality control and reliability, said the following: we need a total systems approach to reliability. And he specifically calls out all the facets of a system. He mentions the system operators. But he saves the best part for last. He says: in addition, the interactions between these various elements, hardware and software, people, must be recognized and included as the glue that holds the system together.

So look at this. He says, hey, operators are part of the system. Software and people are used synonymously. It's right there all in the history of our industry, being careful, being deliberate, and humans, the importance of humans.

So when we build software and we ship a-- let's just say, less than perfect feature-- we kind of do this with the unspoken understanding that we would love to fix it later. But we all kind of know that may or may not happen. Because you know, things change.

On April 9th 2014, a 911 call routing center in Colorado went down. And it stopped routing calls to 81 centers in seven states. For a period of six hours, 11 million people, including the entire state of Washington could not reach emergency services. So why did this happen?

Well, the technical explanation is that there was a parameter that stored the maximum simultaneous call volume. And that parameter was set to an arbitrarily low threshold. And no exception for it was ever handled. It was never stress tested. It wasn't alarmed. And it was exceeded.
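A hedged, hypothetical reconstruction of that failure mode in Python (the limit, names, and alarm string are all invented; this is the shape of the bug, not the actual routing code):

```python
MAX_SIMULTANEOUS_CALLS = 40  # arbitrarily low, never stress-tested
active_calls = 0

def route_call_fragile():
    """Silently refuses calls past the threshold -- the caller just
    never connects, and nothing is logged or alarmed."""
    global active_calls
    if active_calls >= MAX_SIMULTANEOUS_CALLS:
        return None  # call dropped: no exception, no alarm
    active_calls += 1
    return "routed"

def route_call_defensive(alarms):
    """Same limit, but capacity exhaustion is handled: it raises,
    and it records an alarm so operators see it before it becomes
    a six-hour outage."""
    global active_calls
    if active_calls >= MAX_SIMULTANEOUS_CALLS:
        alarms.append("ALARM: call capacity exhausted")  # page someone
        raise RuntimeError("call capacity exhausted")
    active_calls += 1
    return "routed"

# Simulate a spike of 45 calls against the fragile router:
results = [route_call_fragile() for _ in range(45)]
print(results.count("routed"), "routed,", results.count(None), "silently dropped")
```

The fragile version is the cheap one to write, and it works fine right up until the day the threshold is exceeded.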

Now of course, I can't speak to whatever was behind that set of decisions that was made there. But I think we can do a quick thought exercise on this. So you're on a team. You have a feature backlog that's huge and perhaps longer than your operational backlog.

And you very much understand the criticality of what you're building. And you want to write the best stuff you can. But you know, you've got stakeholders. And those stakeholders want more stuff faster. And maybe there is a looming contract. And maybe with that contract there are fines. Or hey, you've got a VC. And they're breathing down your neck for an MVP, right?

And so the time comes: you're going to start building all this instrumentation and alarming. And you list it all out, and you create a cut line. And you implement everything that's above the cut line. But everything below the cut line becomes, hey, it's all right, we'll fix it post-launch.

Now, keep in mind. I'm not judging. I'm not blaming them. I do this. You do this. We all do this because we live in the midst of a system that incentivizes this, and we are nothing more than human.

So what do we do? We slow down. In 1975, Fred Brooks published a classic book called "The Mythical Man-Month." And it was based on his experiences with the IBM System/360 mainframe program. And it's from this project that he came up with Brooks's law, which is: adding human resources to a late software project makes it later. We've probably all been in some version of this before.

If you haven't read this book on all the terrible and awful mistakes that we make trying to build software faster, I really, really recommend that you do it. It's amazing. And then go read it and internalize these incredible messages. And when you're finished, just kind of accept that everyone's going to just ignore all these great learnings.

No one will ever follow them. And that means you too. And that means me too. You know, it's not unique to any one of us. Because here we are, almost half a century since he wrote this book. And we still add more engineers to late projects. We still use crunch as a tool of first resort. And we still use software to compensate for upstream defects.

And we're all guilty of this because it is such an easy trap to fall into. Builders are going to build and we are not incentivized in our speed obsessed culture to slow down, really for any reason. I mean like, I'm standing up here. I'm giving this talk, and I still think. Oh, crap, like we're running three weeks late on this program. I wonder if I can ask so and so if they can like jump on this and maybe help finish it out.

And I know better. I should know better, right? But it doesn't help to be this way. And I think that we already see different types of intentional friction that are used widely in our industry. Pair programming, for example. Nobody pair programs for like 16 hours a day. Yeah, I mean, and if you do, please talk to somebody about that. I'm so sorry.

Prototyping, like prototyping, we say, oh, this is something that makes things go fast because we're going to get customer feedback quicker. But really we're building something we intend to throw away, right? It's a form of slowing us down to think about what we're building. And code reviews, what are code reviews but a way of creating intentional friction to drive up quality?

But the thing is, intentional friction doesn't have to be slow. Automated deployment tests, yeah, they can slow things down. But like ideally they don't. What they do take though is time. And most types of intentional friction require this sort of upfront investment.

On the human side of things, companies are doing things like mandating. You know, mandating you take your vacation time. And then you unplug while you take your vacation time. Because there's a lot of data out there that shows that defect injection rates spike during crunch. And they go back down once you've had some time to rest. So what does rest require? Time, this all requires time.

Now, I can't tell you all the techniques to slow down, because I too am on this journey. But I can make a few recommendations. Induce friction where you can, whenever you can. And where you're already doing it, identify it and double down on it.

Revisit your designs in a big way. The world is constantly changing, right? And your design may need to change too. Sometimes we think that, oh, we've got a big high level design. And now we're just going to look at components. We're going to iterate on those components.

And we think, oh, OK, if I make a change right over here, well, it's fine because my system is perfectly encapsulated. So it will never have any other impact. But we know that that's not true. We know that these things can have ripple effects all the way to the end user, right?

And so don't just look at the component level. But look at the system level and do that regularly at multiple points in the development process. Close your laptop at the end of the day, turn off notifications, genuinely get away from what you do professionally and rest.

And lastly, you are going to get a ton of external pressure, always to speed up. And so what you need to do is apply a cost to risk so that you can make good apples-to-apples comparisons and decisions. There's a lot of different techniques out there for risk analysis that I'm not going to go into here.

But then the other thing you need to think about is what risk means in terms of technical debt. And Jessica Kerr on Twitter has a great explanation of this, technical debt versus escalating risk: that you shouldn't talk in terms of technical debt and should instead talk about escalating risk. I think part of the reason for this is the metaphor we use when we talk about technical debt. The problem with that metaphor is that it implies fungibility. Like, OK, we're going to accrue technical debt now. But it's cool, because later we can just get a different credit card and pay it off, right?

But we know it doesn't work that way, because technical debt accrues interest; that is, it lowers your credit score. So do what you can to apply a cost to this risk. And make sure you communicate it that way and think about it that way when you are deciding, hey, is this an acceptable level of risk?

So probably some of us here have seen that video of Keanu Reeves where he's just like, I love movies, right? That's how I feel about airplanes and safety and systems. So please, reach out to me. If you want to talk about this stuff more, I love talking about this, whether on Twitter or on the two meter band for any of you hammies out there.

So I'm just going to leave you all with one last thought, and it comes from the person who basically gave me all of my best childhood memories, Shigeru Miyamoto from Nintendo. He said: a delayed game is eventually good, but a rushed game is forever bad. So I would just say, hey, let's all slow down, take a breath, and build some better software. Thank you.