Modularity by Kamal Marhubi

Transcript

(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

[APPLAUSE] Hello. How is everyone?

[INTERPOSING VOICES].

I'm just-- I think I might be the first speaker to do this, but I'm just going to quickly do this because it's really wild from up here. OK. Cool. So with that out of the way, let's get started.

So, hi. I'm Kamal, and I'm speaking to you. You are all here at Deconstruct. And so about a year ago, I found myself on a plane. I mean, I got on the plane voluntarily, et cetera. But I exhausted the supply of cheesy action films in the live in-flight entertainment system. And so I had to dip into my collection of tech talks that I keep on my laptop for emergencies like this.

And the one I pulled out was this talk by Barbara Liskov. And for anyone who is-- eh, there's a bit of cheering. That's great. So for anyone who is unfamiliar with Barbara Liskov, she is an incredibly impressive person. She kind of invented encapsulation and data hiding.

She also independently came up with a consensus algorithm called Viewstamped Replication slightly before Lamport came up with Paxos, and did all manner of other things in abstract data types and in distributed systems. And she actually won the Turing Award for all this work. And the citation there is "for contributions to practical and theoretical foundations of programming language and system design, especially related to data abstraction."

And the talk I'm going to be giving today is focused around this data abstraction and modularity part of her work. My name is Kamal Marhubi. If you want to find me on the internet, you can just put @s and .coms in various places in here-- so kamal@marhubi.com. Kamalmarhubi.com is a website. @kamalmarhubi is like GitHub, and very sporadically Twitter, and occasionally Instagram.

And there's going to be kind of a few references in this talk. This link should work-- maybe. If it doesn't work during the talk, it will work sometime after the talk after I am off the stage.

And so yeah, getting back to kind of this modularity idea and where this talk is going to be coming from. I care a lot about how I architect code and design it. But I struggle a lot with how I architect code and design it. This stuff is actually really hard.

You typically, if you're like me, you'll make some decision. It seems like a brilliant decision. And then much as like in Dan Abramov's talk yesterday, you come back days, hours, weeks, some period later, and it's like-- ugh. It's not what I wanted to be doing.

And the way that I go about this is my decisions are really based on intuition. And intuition requires experience. And that basically means you have to make a lot of mistakes. And it's just like, not a-- it doesn't feel great to do things that seem good, and then find out in a very short order that they are not good, and that you will now hate yourself for at least six more months because you don't have time to go back and fix the problems you introduced.

So part of what I think goes wrong here, and this is like sort of again, tying back to Dan's talk yesterday, is that the usual advice is easily misapplied and even easily misunderstood I think. So don't repeat yourself-- very easily misapplied.

Inject dependencies is very good as a thing to do in a lot of cases. But occasionally, it will just lead you astray. And now you have this horrible object graph that you have to construct, and everything's just like spaghetti. And you have seven levels of drilling through to get these, whatever, database provider thingies to where they need to be. And it's all really a pain.

And then there's this reduced coupling advice that comes up a lot. And I think, to be honest, I find a lot of these words are kind of like-- I don't know. They go off my brain like slime sometimes. I can't really remember what coupling means most of the time. I definitely don't remember what cohesion means, and they come up a lot.

And it's not just me, right? This is Stack Overflow. What's the difference between cohesion and coupling? And these are completely different concepts, except that they both begin with a C and an O I think.

And I think this is probably someone's homework question or something, because it's very, you know-- please write a 150-word essay about blah. But these words just kind of slide right off my head.

And the main thing I think of when I think of coupling is the BBC show, the kind of slightly edgier, much more British version of Friends from the early 2000s. This comes to my mind much more than whatever the separation-of-concerns stuff is.

So now let's go back to the plane, right? So I'm on the plane and I'm watching this talk. And Barbara Liskov is-- this is a version of her Turing Award talk, her Turing Award lecture.

I'm not actually sure if this is the actual lecture, or if it's slightly later, but this particular version I have a screenshot of, there's question time at the end. And you have Guy Steele and Phillip Butler and all these like fancy people asking questions. And it's like, whoa. That's kind of cool.

But in this talk, she goes over the history and the context, the historical context, of the work that went into effectively coming up with what led to classes or data hiding and all the stuff we have now. And I was on the edge of my seat. Well, I mean, it was economy class. I was on the edge of my seat anyway. But I was on the edge of my seat, and I kept pausing the talk. I must have paused it a dozen times to take notes of all of the paper titles, so that I could go back and read them later.

And then you fast forward some period, and I got invited to speak here. I'm like, oh, I have all these paper titles. I should read them now. And so fast forward a bit further and now we're here.

And so part of the real historical context for all of this work is something that people in this room, maybe they're older than me and may remember this, but the software crisis. And this is a-- "software crisis is a term used in the early days of computing science for the difficulty of writing useful and efficient computer programs in the required time."

I guess we're still in the early days. I don't really know. But this is all referring to '67, '68 is what's being meant here. And at that time, they just didn't really stand a chance of writing software in any reasonable way, because we just didn't know how to break up programs.

So if you go back to '68, there's this Dijkstra-- "Go To Statement Considered Harmful," which is a very, I think, a much more well-known title than the actual text. But I'll just go through this really quickly in a little bit.

But an even lesser-known fact, I think, is that he did not title this-- Dijkstra himself did not title this Go To Statement Considered Harmful. He titled it, "A Case against the GO TO Statement." You can tell this is the legit Dijkstra thing because of the EWD thing in the corner. He numbered all of his writings.

And so he called it, "A Case against the GO TO Statement." And the editor gave it this more catchy title of "Go To Statement Considered Harmful." And that's how we get all of this stuff, right? Six URLs, short URLs, whatever. GitFlow, harmful-- I kind of agree. Dot Dot, harmful-- whatever. ACM, harmful?

This is kind of funny because this letter was published in the ACM. And then immediately under that-- Considered Harmful, Essays Considered Harmful. It's just like, if that editor did not just rewrite the title of this thing, we would have missed out on 735 results-- 735 hopefully unique considered harmful things.

But going back to the paper-- or it's not really a paper. It's a letter to the editor. This is the entirety of it. This is-- it is-- I had never read it. I had read-- knew the title. But I had never read it. And it's very, very short. In fact, the second page is actually just acknowledgment. So it's only that. That's the entire thing.

And the gist of why goto is considered harmful by Dijkstra is that over here he is saying that our brains aren't very good at reasoning about dynamic processes that are evolving in time. And up here he's saying that having a function and stack discipline-- so having the ability to have a stack trace is useful for understanding a complicated process in time, because it kind of makes it a bit more linear, and our brains are better at that.

And then down here he says that goto statements kind of interfere with the ability to have a call stack. Because if you could literally get to anywhere from anywhere, then what can you even unwind? Like, there's no sensible way to take a program state and figure out how to make it into a kind of linear call stack kind of thing. So--

Oh, and the beginning of the letter he says, "I became convinced that the go to statement should be abolished from all 'higher level' programming languages." And we're all very lucky today, because goto has been abolished from all higher level programming languages. Let's all just cheer for ourselves for a second. I didn't do any of the work, but I'll take the applause.

So we're quite lucky. But the motivation that I found really striking in this is that the motivation for goto being considered harmful is to make the code more understandable, but in a very reasoned and concrete way, instead of if you have goto things that are like spaghetti. It's like he makes the argument that it interferes with having a nice stack kind of abstraction to reason about what's going on.

So that was kind of the start, or approximately the start, of this reading spree that I went on. A roughly contemporaneous paper is this one called the "Dataless Programming," which is a very bizarre name by someone called Balzer, with an L not with a W. I just mispronounced that. Yeah, so "Dataless Programming" by Balzer.

And I'm not a Marvel person at all, but it's The Rand Corporation that's kind of maybe funny. I don't know. I googled it. It's like, something. It's hard to know what it actually is more than the Marvel thing.

So I'm not going to show any syntax from his Dataless Programming language, because as with many languages from that period, our modern sensibilities are very harmed by reading code in that syntax. But what he does in this language, which is very interesting, is effectively takes the abstraction of a sequence of things, like say, an array or a linked list, or any other representation of things that can be gone through in order, and puts it into the syntax of the language.

So the programmer can supply procedures that the language will execute when it hits a for loop. It's not that it's calling a function, like in Java or in JavaScript you have this iterator thing, and there's a bit of syntactic sugar. This is like, the language's implementation will directly call some functions in machine code. It has a few built-in representations, like arrays and lists. But any other ordered data representation can also be used by just providing the necessary data handling algorithms, in whatever way you did that in those days.

And the goal for him in this was to make it possible to change the representation of something without needing to change the user code, the code that was calling into it. So here he says, now you can change a collection from an array to a list, and don't have to change the source statements. Just merely change the data declaration.
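(Editor's note: a minimal TypeScript sketch of that goal, not code from the talk or from Balzer's paper. The names `Sequence`, `LinkedList`, and `sum` are made up for illustration. The user code depends only on being able to go through items in order, so the representation can change while the user code stays the same.)

```typescript
// Anything that can be "gone through in order" (hypothetical interface).
interface Sequence<T> {
  forEach(visit: (item: T) => void): void;
}

interface ListCell<T> {
  value: T;
  next: ListCell<T> | null;
}

// One possible representation: a singly linked list.
class LinkedList<T> implements Sequence<T> {
  private head: ListCell<T> | null = null;

  // Add a value at the front of the list.
  prepend(value: T): void {
    this.head = { value: value, next: this.head };
  }

  forEach(visit: (item: T) => void): void {
    for (let cell = this.head; cell !== null; cell = cell.next) {
      visit(cell.value);
    }
  }
}

// "User code": it only relies on being able to visit items in order.
function sum(items: Sequence<number>): number {
  let total = 0;
  items.forEach((item) => {
    total += item;
  });
  return total;
}

// Swapping the representation only changes the declaration, not sum().
const asArray: number[] = [1, 2, 3];
const asList = new LinkedList<number>();
asList.prepend(3);
asList.prepend(2);
asList.prepend(1);
```

Calling `sum(asArray)` and `sum(asList)` goes through the same user code with two different representations underneath.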

We're all like very, very, very, very, very familiar with this these days. But this was incredibly novel at the time, or at least in like-- I think was just incredibly novel at the time. Because yeah, there's this cute, quaint statement on the next page. It's like, some representations, for example, colored pictures are not ordered sequences of things, and we have absolutely no idea on how to abstract that at all. We just like, meh. We can do it for lists. And it took a lot of time, and we can't figure this out for other stuff.

So I thought this was really interesting just to think of, you know, this was 1967 I believe. So we're 52 years later. So there was a lot of time that passed in between. But at the same time, I know people who were alive in 1967. I was only alive in '80-something.

But there are programmers I've spoken to who they come from-- they were programming in this time. And the interfaces and everything we take for granted completely now did not exist at that time. So again, kind of like lucky us, basically.

And so the interesting thing about both of the-- I think the goto statement considered harmful and this dataless programming language is that they're both approaches to solve a problem via a language change, so removing goto, or adding interface or proto interfaces into the language, in the one case, to make the program more understandable, and in the other, to make it more adaptable, more easily modified.

And then at the time, in the software crisis era, a lot of the issues were seen as organizational, to do with how people go about designing programs. And so this paper by Parnas has a fantastic name-- "Information Distribution Aspects of Design Methodology."

And what the paper is actually about is hiding information is good. This is-- we come-- we're all familiar with having private data members and so on. And this is from an era before that. And so it's all about organizational and methodologic-- method-- [MUMBLING] that word-- ways of hiding information.

And a key, I think, really sharp observation in here is he clarifies that when we think of connections between different modules of a system, we frequently think in terms of APIs and data that's crossing over. But the actual connections are more extensive than that. They're the set of assumptions that these modules make about each other.

For example, you may ask for a list of things. And you may-- so the API that you call is to get a list. But you may be assuming that it comes out in a certain order, and that's like an implicit assumption. And so it then becomes a dependency, or some information that has leaked from the thing that is giving the list into the thing that is asking for the list.

And so he says that we may only make those changes which do not violate the assumptions made by other modules about the module being changed. So it's not about, can we change it so that the APIs still match up and it still compiles? There are some latent only in the human ether realm connections between things that we may be breaking without knowing. This is, I guess, why we have integration tests and all of this stuff.
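(Editor's note: a small TypeScript sketch of a change that keeps the API matching and still compiles, yet breaks an assumption another module was making. Everything here, including `listUsersV1`, `listUsersV2`, and the sample data, is hypothetical and not from the talk.)

```typescript
interface User {
  name: string;
  joinedYear: number;
}

const stored: User[] = [
  { name: "carol", joinedYear: 2015 },
  { name: "alice", joinedYear: 2017 },
  { name: "bob", joinedYear: 2019 },
];

// Version 1 returns users in storage order, which happens to be join order.
function listUsersV1(): User[] {
  return stored.slice();
}

// Version 2 keeps the same signature but now sorts by name.
function listUsersV2(): User[] {
  return stored.slice().sort((a, b) => a.name.localeCompare(b.name));
}

// The caller assumes, without saying so, that the first user joined earliest.
function earliestJoiner(listUsers: () => User[]): string | undefined {
  return listUsers()[0]?.name;
}
```

With version 1 the caller gets "carol", who really did join first. With version 2 it gets "alice". The types never changed, so the compiler has nothing to complain about.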

But he has some really good-- this paper is just full of really good quotes. I don't know. Haste makes poor internal structure likely. And it's like a fortune cookie for programmers. It's like, yes. Yes, it does. Yes. Hmm.

Some properties of good programmers-- a good programmer makes use of all of the usable information given to them. And this is, I mean, arguably true, arguably not. I'm not sure.

But a kind of corollary is that we should not expect a programmer to decide not to use information that they have. And so we should instead make it that they can't get that information in the first place. And this is the crux of what Parnas is talking about, is like, can we hide through-- in this paper, it's through documentation systems and communication between subteams and all of this kind of thing.

But in our modern day, if you have some kind of a stack or something, then-- or you have an array, and certain invariants need to hold. You shouldn't just make all of those things public members of your stack thingy, because otherwise someone will just go and screw up all of your invariants, and then things will happen that are not ideal.
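(Editor's note: a hedged TypeScript sketch of what goes wrong when everything is public. `LeakyStack` and its fields are invented for illustration. The intended invariant is that `count` always equals the number of stored items, but nothing stops outside code from breaking it.)

```typescript
class LeakyStack {
  // Intended invariant: count === items.length. Nothing enforces it.
  public items: number[] = [];
  public count = 0;

  push(value: number): void {
    this.items.push(value);
    this.count++;
  }

  pop(): number | undefined {
    if (this.count === 0) {
      return undefined;
    }
    this.count--;
    return this.items.pop();
  }
}

const leaky = new LeakyStack();
leaky.push(1);

// Some other part of the program uses information it should never have had.
leaky.count = 3;

const firstPop = leaky.pop();  // the one real element
const secondPop = leaky.pop(); // nothing left, although count claimed otherwise
```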

So now we get to some of Liskov's own work. And actually, I'm going to-- it's kind of funny, because she did such foundational work here. But I'm going to kind of go very quickly. Because it's just so-- I don't even know how to give it good justice.

So she has one paper which is kind of building on the ideas in this Parnas paper, of doing this in the methodology space. So she kind of refines the idea of what modules are and uses-- has some rubrics on how to decompose the system well, and so on.

But then her-- and then just like a year or two later, her and someone called Zilles published this paper, "Programming with Abstract Data Types." And I'm just going to show you immediately the syntax in this language. This one is actually kind of-- if you squint, this is a class.

And so this is just two or three years after we don't know anything, we have a stack. Cluster was the term used for a grouping of operations. So a stack is a cluster that has elements of type t. And its methods are push, pop, top, erasetop, and empty. The internal representation, which is only visible to stuff inside this stack part-- this is like private members-- is like, it has a top. And it has an array of the type t, and then the definitions of the methods follow.

Create is able to inspect the internal secret representation, but any other part of the program from outside of this block effectively has no access. And this is like, you go from we can't figure out how to do pictures or anything, to this is basically Java's type system minus interfaces in five years or something?
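(Editor's note: a rough modern rendering of the stack cluster as a TypeScript class, to make the "if you squint, this is a class" point concrete. This is a sketch, not a translation of the syntax on the slide. The operation names follow the ones listed in the talk, and the private array is an assumed representation.)

```typescript
class Stack<T> {
  // The representation: private, so only code inside the class can see it.
  private elements: T[] = [];

  // Put a value on top of the stack.
  push(value: T): void {
    this.elements.push(value);
  }

  // Remove the top value and return it.
  pop(): T {
    const value = this.top();
    this.elements.pop();
    return value;
  }

  // Return the top value without removing it.
  top(): T {
    if (this.elements.length === 0) {
      throw new Error("stack is empty");
    }
    return this.elements[this.elements.length - 1] as T;
  }

  // Remove the top value without returning it.
  eraseTop(): void {
    this.pop();
  }

  // Report whether the stack has no elements.
  empty(): boolean {
    return this.elements.length === 0;
  }
}

const stack = new Stack<string>();
stack.push("a");
stack.push("b");
// stack.elements = [];  // would not compile: 'elements' is private
```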

And then you wait 20-something more years before Java. And then you wait another 10 years, I think, until generics happen. And this is like [INAUDIBLE] generics back in 1970 whatever. So it's really amazingly foundational stuff that just kind of sort of happened overnight, but then didn't percolate into the rest of the world for quite a while.

So let's see. Where are we going next? Right. So there are a couple of motivations. So earlier when we were talking about goto considered harmful, a motivation was to make code easier to understand. And in this work with these clusters or classes, the goal is to make code easier to change. Because all of these invariants can't be damaged by outside people, because they can't access the internal representation. So you can change the internals, and providing you continue to uphold the same API external interface, then you are able to change the code without issue.

And so between these two things, what we are aiming for through all of this is to have code that is both understandable and modifiable. Of course, code that is understandable is automatically going to be at least a little bit easier to modify than code that isn't understandable, because, you know, infinite monkeys or something. If you don't understand what you're doing, you're not going to get any good outcomes. But just because it's understandable does not make it easy to modify. You need both of these properties.

So how can we achieve these without reference to DRY and coupling and stuff? So let's look at the historical and kind of like the motivations in these papers themselves.

So reduce connections by hiding implementation details. This is probably-- you know, there's one of these words in SOLID or something that is this. But this is easier to understand for me, and also is directly tied to Liskov's work here in introducing a data abstraction.

Another way is we can allow multiple implementations. This is the weird switch an array out. It's in the dataless programming thing. But today we have interfaces or type classes or traits or protocols, whatever you want to call them, but they're great. They allow you to define kind of a high-level understanding of what a component in your system is, and then switch out implementations. We use this for testing a lot. It's all great.

And another way-- and then I think this is like the thing that I kind of get from CLU and Liskov's work is a lot of effort went into putting these features into languages to help us with this stuff. So if your language has such features, use them. Obviously, don't go overboard because that is like-- then you get nonunderstandable code by kind of crossing over to the other side. But use features like interfaces, like data hiding that are in your language if you can.

So I want to talk about a few ways you can apply this. It's kind of driven by my experience. So in my own work, or in your own work, you can apply this in a few ways. For example, if you're making a change and it feels harder than you think it really should, then take note of that somehow.

And a couple of years ago I introduced-- I replaced all of the TODO with a user name with TODO with a category. Because having a user name like TODO Kamal doesn't mean anything. And Kamal will leave someday.

And then time will pass, and you end up with a situation where it's like, if you look at all the TODOs. You have current engineer one, current engineer two, the person who's now VP of engineering, someone who's not there, the CEO has like, 20 TODOs because they wrote a bunch of code early on, CTO.

And then you have 20 TODOs from like, who is Jeff? And you have the person who switched roles from being an engineer to being someone on a more business side of the operation. And then you have another engineer. And you have another person who left.

And these-- I find that these names are super useless. And the categories-- for example, instead of TODO username, we're going to have TODO arch, for architecture and design problems, beta for before we can release beta, docs for fixing docs, and so on.

And this is much easier to cut across your code base and see, oh, there is-- especially-- I'm mostly speaking here about the architecture issue, where you've noticed that something was harder to change than you thought it had the right to be. You take a note of it.

And then maybe later you're going back, and you come across, for example, TODO arch. If we look at the way we employ this interface, it's trying to do too much-- long comment explaining what amount of too much it's trying to do and how.

And that is just there. We don't make a change right now. But when we come back and we see this, we're like, ah, yes. Now I can see with a bit more experience how to actually split this up.
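(Editor's note: a sketch of how category TODOs can be cut across a code base. The exact comment syntax used at the speaker's workplace is not shown in the talk, so the `TODO(category):` form, the sample comments, and the `countTodosByCategory` helper are all assumptions for illustration.)

```typescript
// Count TODO comments per category in a piece of source text.
// Assumes the hypothetical form "TODO(category): note".
function countTodosByCategory(source: string): Record<string, number> {
  const counts: Record<string, number> = {};
  const pattern = /TODO\((\w+)\)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const category = match[1] ?? "";
    counts[category] = (counts[category] ?? 0) + 1;
  }
  return counts;
}

// Made-up source lines standing in for a real code base.
const sample = [
  "// TODO(arch): this interface is trying to do too much, split it later",
  "// TODO(docs): describe the retry behaviour",
  "// TODO(arch): cache keys are built in three different places",
].join("\n");

const todoCounts = countTodosByCategory(sample);
```

Here `todoCounts` holds two `arch` entries and one `docs` entry, which is the kind of overview a user name in the TODO cannot give you.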

And this is, again, reminiscent of what Dan was talking about yesterday, of you have-- you're noticing some kind of repetition or something, and you're not sure exactly what to do. Maybe the best thing to do is do nothing and just take note. And then look at it later, because you will be smarter hopefully. So that's ways to apply it in your own work.

During code review is another great time to apply this, because the goal here is to have code that is both understandable and modifiable. Well, during code review, you've got someone modifying code. And you've got you who's trying to understand what's changing. It's like these two things are happening at the same time. It's a really good time to take advantage.

And so go beyond style, you know, single quote, double quote-- just throw all of that away. That will not increase the understandability. It may increase the readability. It may be easier for someone to scan the code. But it is not going to increase the ability to understand the code.

So instead, ask questions if you don't understand. Why is this being passed there? Get clarifications. If you get a good clear answer back in the code review, get that comment put into the code. Because nobody really goes back and looks at code review comments to find out what's happening. Putting it in the code is a bit closer to where you'll be when you're actually trying to do something.

But more than this, look for hidden assumptions. This is going back to what Parnas said. It's not just about APIs and data structures, it's about assumptions that we're making. So here's a pattern that you will probably have seen fairly often: getThings, and then you get the zeroth element.

And there's some assumption happening here. We're getting the first element. Is it because the order is special? Like, where getThings returns them in ascending order, and we want the smallest one? Is it because we know that the collection is a singleton? Is it something else?

So do something about that. You know you could add an assert, add a comment, or you could-- this is something I came across in the Gradle API, which is like a build system for Java and Android and stuff. And they have this file collection that has a method called getSingleFile-- returns the content of this collection asserting it contains exactly one file.

So it makes the intent super clear. There's none of this getThings-- getFiles and then get the zeroth element. It's clear that you're expecting there to be a single file. You will now get a failure if somehow two files get created, or zero. And this is really great. So these are ways you can apply this during code review.
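(Editor's note: the same idea as a small TypeScript helper, assuming the reason for taking the zeroth element is that the collection should be a singleton. The name `single` is made up for this sketch. It is not Gradle's API, only the same pattern of stating the assumption and failing loudly when it does not hold.)

```typescript
// Return the only element of a collection, failing if there is not exactly one.
function single<T>(items: readonly T[]): T {
  if (items.length !== 1) {
    throw new Error(`expected exactly one item, got ${items.length}`);
  }
  return items[0] as T;
}

// Hypothetical caller: the intent is now visible at the call site.
const onlyFile = single(["build/output.jar"]);
```

Compared with indexing at zero, a second file or an empty result now fails at the point where the assumption is made, instead of silently picking something.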

You can also apply this during design. This is kind of coming back full circle, right? It's about design and architecture and how I find that hard. And applying these principles a bit can help with breaking down some of the decisions you might come across, or some of the trade-offs.

So let's just go through a very quick example of a Redis-based cache. So Redis is a key-value store for anyone who's not familiar with it. It has fancy values. So you know, the-- it's not just string keys and string values. It's the values can be lists or sets or hashes or ordered sets. And I think there's some other yet fancier things, like HyperLogLog maybe.

But it has very non-fancy keys. The keys are just byte strings. There's no namespacing. There's no nothing. And so if you're using Redis, an issue you might start to think about is well, how do we deal with this detail of the keys, which we have to now have in maybe multiple different parts, like the part writing to the cache, and the part reading from the cache.

And so let's-- to make this slightly more concrete, we're going to have a cache for GitHub user account info. And we're going to assume that user account info is just two things-- the avatar and the starred repos. That's it.

So the way people frequently solve these namespace problems in Redis is to do something like prefix it with some string and a colon or some other delimiter. So like, avatar:KamalMarhubi would be a link to a URL that is probably a square picture of me. And then starredRepos could be, in this case, starredRepos:KamalMarhubi could be a list of starredRepos. I don't know. I have at least one or two. I don't star things in general.

And the consumer, you could imagine-- I hope this is-- oh, this screen's really big. You could imagine like-- so in some part of code that needs to access this cache, you await redis.get, and then you compute the correct cache key by concatenating strings. Or you do the same thing with lrange 0, -1 is like the way in Redis to get an entire list. So you're like, I want to get all the starredRepos.
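(Editor's note: a hedged reconstruction of the kind of consumer code being described, not the code from the slide. `RedisLike` is a made-up stand-in covering only the two commands mentioned, and `fakeRedis` is an in-memory fake with invented data so the sketch runs without a server. Real Redis client libraries have their own method names and types.)

```typescript
// Minimal stand-in for a Redis client (hypothetical, not a real library's API).
interface RedisLike {
  get(key: string): Promise<string | null>;
  lrange(key: string, start: number, stop: number): Promise<string[]>;
}

// Consumer code with the key format baked in by string concatenation.
async function loadAccountInfo(redis: RedisLike, login: string) {
  const avatar = await redis.get("avatar:" + login);
  // LRANGE 0 -1 asks Redis for the entire list.
  const starredRepos = await redis.lrange("starredRepos:" + login, 0, -1);
  return { avatar, starredRepos };
}

// Tiny in-memory fake with invented data.
const fakeRedis: RedisLike = {
  async get(key) {
    return key === "avatar:kamalmarhubi" ? "https://example.com/avatar.png" : null;
  },
  async lrange(key, start, stop) {
    const list = key === "starredRepos:kamalmarhubi" ? ["repo-one", "repo-two"] : [];
    // Redis counts negative indexes from the end and treats stop as inclusive.
    const end = stop < 0 ? list.length + stop : stop;
    return list.slice(start, end + 1);
  },
};
```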

And now you've got this starredRepos colon and avatar colon kind of baked into this code. And they would also have to be baked into the code that writes to the cache. And this is like, you know, you might start to think, hmm-- DRY? This is leaking these key names. And maybe we should dry it up. And so you're like, OK. Well, let's put the key names into some functions, into some Redis common file or something.
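(Editor's note: a sketch of that first DRY step, with the key formats pulled into functions in one shared place. The function names are hypothetical. Readers and writers of the cache would both call these instead of concatenating strings themselves.)

```typescript
// Hypothetical contents of a shared "Redis common" file.

// Key under which a user's avatar URL is cached.
function avatarKey(login: string): string {
  return `avatar:${login}`;
}

// Key under which a user's list of starred repos is cached.
function starredReposKey(login: string): string {
  return `starredRepos:${login}`;
}
```

The key format now lives in one place, but every caller still has to know that the cache is Redis and which Redis command goes with which key.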

This is actually like a real problem I have at work, where our use of Redis is kind of this-- it's hard to know-- it would be very hard to break down the Redis instance that we use into different parts that-- into different Redis instances, because everything is sort of glommed together in this way.

So you've gotten rid of the implicit, shared-- this super knowledge of these keys by putting them into one place. But you still have the fact that the cache is Redis-based is still very baked into all the uses of that. And that's not so great.

So it's like, use a feature of-- well, not the language, because that was JavaScript, but let's say, the language TypeScript. And you can define an interface for a cache that has nothing to do with Redis and nothing to do with cache keys and nothing to do with any of that. It's just getAvatar, getStarredRepos.

And it's very easy to imagine how you would implement the Redis version of this. But you could also implement a version that hits local memory and then Redis, or hits them both in a race, or hits the real database and Redis, or all kinds of other things without needing to keep the details of the implementation visible in the use.

So you could swap out that implementation, and code using it would not need to change. So this is like a very, very whirlwind way of how you might apply some of these trade-offs-- or apply some of these rubrics during a design.
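(Editor's note: a minimal TypeScript sketch of such an interface with two interchangeable implementations. Only the method names getAvatar and getStarredRepos come from the talk. `UserInfoCache`, `RedisLike`, both classes, and `describeUser` are assumptions made for illustration, and `RedisLike` is a stand-in rather than a real client library's API.)

```typescript
// The interface: nothing about Redis, nothing about cache keys.
interface UserInfoCache {
  getAvatar(login: string): Promise<string | null>;
  getStarredRepos(login: string): Promise<string[]>;
}

// Hypothetical stand-in for a Redis client.
interface RedisLike {
  get(key: string): Promise<string | null>;
  lrange(key: string, start: number, stop: number): Promise<string[]>;
}

// Redis-backed implementation: the key format is hidden in here.
class RedisUserInfoCache implements UserInfoCache {
  private readonly redis: RedisLike;

  constructor(redis: RedisLike) {
    this.redis = redis;
  }

  getAvatar(login: string): Promise<string | null> {
    return this.redis.get(`avatar:${login}`);
  }

  getStarredRepos(login: string): Promise<string[]> {
    return this.redis.lrange(`starredRepos:${login}`, 0, -1);
  }
}

// In-memory implementation, for example for tests.
class InMemoryUserInfoCache implements UserInfoCache {
  private avatars = new Map<string, string>();
  private starred = new Map<string, string[]>();

  put(login: string, avatar: string, repos: string[]): void {
    this.avatars.set(login, avatar);
    this.starred.set(login, repos);
  }

  async getAvatar(login: string): Promise<string | null> {
    return this.avatars.get(login) ?? null;
  }

  async getStarredRepos(login: string): Promise<string[]> {
    return this.starred.get(login) ?? [];
  }
}

// Consumer code: unchanged whichever implementation it is handed.
async function describeUser(cache: UserInfoCache, login: string): Promise<string> {
  const avatar = await cache.getAvatar(login);
  const repos = await cache.getStarredRepos(login);
  return `${login}: avatar=${avatar ?? "none"}, starred=${repos.length}`;
}
```

A version that checks local memory first and then falls back to Redis could be a third class wrapping two `UserInfoCache` values, and `describeUser` would not need to change for that either.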

And one last way I think that's very interesting to think more forward-looking is how you evaluate a new technology. I'm just going to be like-- you know, I'm probably going to be old now, because GraphQL is a new thing to me. I have not used it. But it has some interesting properties that I can kind of relate to these-- the wanting to hide internal details, and wanting to have multiple implementations.

So this is like a GraphQL query, or so I'm told, for GitHub. So you specify directly in the query what fields you want. So you're saying, I want the avatar URL and the names of the starredRepositories of the viewer, who is me. And then magic happens. And you get exactly those things in exactly that structure.
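(Editor's note: a sketch of what such a query might look like, embedded in TypeScript. This is not the query from the slide. The field names `viewer`, `avatarUrl`, and `starredRepositories` follow the editor's reading of GitHub's public GraphQL schema and should be treated as assumptions to check against the schema explorer. The response type is simplified, and `starredNames` and the idea of taking the first 10 are invented for illustration.)

```typescript
// The query names exactly the fields wanted (assumed field names, see note).
const viewerQuery = `
  query {
    viewer {
      avatarUrl
      starredRepositories(first: 10) {
        nodes {
          name
        }
      }
    }
  }
`;

// The response comes back in the same shape as the query (simplified typing).
interface ViewerResponse {
  viewer: {
    avatarUrl: string;
    starredRepositories: {
      nodes: { name: string }[];
    };
  };
}

// Client code depends only on the schema's shape, not on how it is resolved.
function starredNames(response: ViewerResponse): string[] {
  return response.viewer.starredRepositories.nodes.map((repo) => repo.name);
}
```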

And so you've kind of got this explicit schema. So this is like the-- the field names are defined in a schema, and that forms a very clear interface between the code that is calling into the GraphQL thing, and the code that is doing the magic. I have no idea how the magic happens.

But I did play with the GitHub. They have a schema explorer thing, and it's kind of cool. And the explicitness of that makes it sort of hard for details to leak in, because the schema is very clear about what details even exist. So it's hard to have details cross over that boundary. Yeah, this is what I was just saying. It forms a pretty clear interface.

And on the magic side, there could be a database. There could be a database and caches. There could be lots of other REST APIs that could be talking to a mainframe. Any of these things could be happening.

But this interface of the GraphQL query kind of hides all of that, and allows the people in the magic side to change things without you on the-- sorry. I keep saying magic. I think there's stuff over there that's actually code. Resolvers is a word that is important, or so I've been-- like, that's as far as I've gotten.

So the implementation on the resolver magic side can change without having to change the clients, because their queries-- like, the schema hasn't changed, and so their queries are still valid. And so maybe GraphQL is promising in this way. Maybe it does hide some implementation details. And maybe it does give us a bit of ability to change out the implementation without breaking consumer code.

So I don't know. Maybe GraphQL is cool. It's five years later since it came out or whatever, so like, you know, I'm getting there.

So just to kind of sum up and recap everything that we went over here. So the-- what we want as a fundamental desire is code that we can both understand and modify. And the ways we can do that are to hide implementation details so that we are able to modify a small amount of code without needing to have universal knowledge of the entire system; allow multiple implementations-- some slide misfire.

Anyway, allow multiple implementations and use features of your language where they're applicable and where they will help you. So interfaces are great. Classes are great. Traits in Rust are great, and type classes and interfaces in Go, which are different from interfaces in Java. But they're all great, because they allow you to take advantage of effectively work that has been done by lots of theory people, and then lots of implementation people, to kind of make our lives a bit easier and enforce these boundaries.

And then in your day-to-day work, this is going to be an ongoing process. We got rid of goto, but getting rid of all these implicit dependencies is not a thing you can just zap away. So it's an ongoing process, and you have to be vigilant.

It's like a slog, because people, yourself included, are always trying to mess it up. And so just do what you can, which may just be writing down that it happened, and getting on with your life. But keep track.

And as an industry or as a community, we've abolished goto, which is great. We've invented encapsulation and interfaces. These are also really great. I've said they're great so many times, I'm sure you're all convinced.

But there's probably still a lot left to do. And I have no idea what those things are. But I have this sense of when you go to, say, take your stuff and put it in the cloud, you suddenly jump out of normal programming into IAM roles and stuff that is really not the same. There's like a pretty big gap there, and there's probably all kinds of other areas in programming where there's more things we can do to help.

And yeah. I promised-- I think I promised that I would get you the URL again at the end. And there's my contact information. And thank you very much.

[APPLAUSE]