
Transcript

(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

[APPLAUSE] Hello. I'm Allison Kaptur. Thanks very much to Deconstruct and to Gary and Brent and the rest of the team for having me. This is where you can find me on the internet. This is my personal site and also my handle most places. I'm currently working at a company in San Francisco called pilot.com which does automated bookkeeping, which I'm not going to talk about at all today, but if you are interested in that, please come find me later.

We're here today to talk about clock skew. And at its most basic level, clock skew is when two machines-- two clocks on two different machines-- disagree about what time it is. And if you'll allow me a moment of speculation about the people in this room, my guess is that most of you have heard of this in at least some capacity. You're aware that clock skew related bugs exist. And you're probably aware that there's also a set of techniques for dealing with clock skew in the context of distributed systems that are largely the province of distributed systems wizards.

So today, I'm here to convince you of three things. The first one is that you probably work on a distributed system or work in a distributed system, even if you don't think you do. The second thing is that clock skew related bugs are widespread enough that you should be aware of them in your day-to-day programming and software design. And the third is that the set of mitigation techniques for clock skew is not just limited to distributed systems wizards. It is a set of tactics that you can apply that will substantially reduce the impact of these kinds of bugs. I separately believe that you have the ability to become a distributed systems wizard if you are not one already, but that's a different talk.

All right, so let's be specific about our terms. A distributed system is more than one computer. So we've heard about some examples this morning that are pretty clearly distributed systems-- multiplayer game frameworks, databases with lots of clients and complex consensus management protocols. Some other things that are also distributed systems: a web app that has a server and also a database. That's two computers, which makes it a distributed system. The game Space Team-- you ever played that?

[APPLAUSE]

Space Team fans in the house. You and your friends connect to each other's phones over Wi-Fi or Bluetooth. There's no server involved, and then you yell space nonsense. Again, that's a distributed system. And basically, anytime you're doing programming with a database, you're probably working on a distributed system.

So let me start by motivating the problem of clock skew with a couple of bugs. I admit that I'm in the following extremely convenient situation right now, which is that my Mac clock, this machine, is running five minutes slow. And I'm trying to reconstruct the timeline here. I'm pretty sure I proposed this talk before this started happening. But this has been super useful for me because it gives me a lot of insight into how different systems behave in the context of a clock skewed system.

I want to be a little more specific about terminology. Technically speaking, if two clocks are just different and have a gap in what time they think it is, that's a clock offset. Skew is the first derivative, so the amount by which a clock is running fast or slow. And then the second derivative is clock drift.

So I'll start with the thing that made me notice this problem in the first place. I was working one day, and I pushed a branch to GitHub. And then I went to the UI to open a PR, and I found this bit of UI that said akaptur-- that's me-- pushed this branch from this hash to this hash two minutes from now. So hello, I'm from the future. Please hurry up and land my PR. I'm sure it's very good because it can time travel also.

So what's going on here? My Mac has some notion of what time it is. GitHub has a notion of what time some event occurred. And then, when I hit this PR open page, presumably there is some JavaScript library that's saying, OK, compare the user system time to the time of an event and translate the resulting time delta into a human-readable English string. And what you get from that is two minutes from now.
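The logic is presumably something like the following sketch-- a hypothetical relative_time helper, not GitHub's actual code-- and you can see how a slow client clock lands an event in the future:

```python
from datetime import datetime, timezone

def relative_time(event_time: datetime, client_now: datetime) -> str:
    """Render an event timestamp relative to the client's own clock."""
    seconds = (client_now - event_time).total_seconds()
    if seconds >= 0:
        return f"{round(seconds / 60)} minutes ago"
    # If the client's clock is behind the server's, the event appears
    # to have happened in the future.
    return f"{round(-seconds / 60)} minutes from now"

# The server recorded the push at 12:07; my clock thinks it's 12:05.
event = datetime(2019, 7, 11, 12, 7, tzinfo=timezone.utc)
client_now = datetime(2019, 7, 11, 12, 5, tzinfo=timezone.utc)
print(relative_time(event, client_now))  # "2 minutes from now"
```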

So this is not wrong from an engineering standpoint. This is what time my clock was saying. At the time, it was running about two minutes slow. But from a product or design angle, it's not a super useful thing to do. And if you were thinking about this use case, you might make a different choice here than just giving the user a timestamp that places their actions in the future.

So the first thing that I wondered when I saw this is, well, is this just me or is it GitHub? I definitely thought it was probably me, because if it was GitHub, we'd have a much bigger problem. So I went and asked on chat, as one does, to see if any of my colleagues were having the same problem.

So here's a little screenshot of me talking to my colleague Glyph on-- this is Zulip chat, an open source product functionally equivalent to Slack. And I'm showing you this not for the content of the chat, but to show the timestamps along the right-hand column. And I realize this is very faint, so let me read them. Glyph responds to me at 12:07, 12:07, 12:08. I reply to him at 12:07, 12:07. And then he replies at 12:09. And as this conversation goes on, the timestamps continue to interweave.

So Zulip is making a series of decisions here, which I think are fairly defensible. The first one is that they are showing me the timestamps of my messages based on my system time. I can see the argument for it. You can likely see an argument against it also. They're showing me the timestamps of my colleague's messages based on probably his system time, possibly the server time. It's not clear to me, since those are likely aligned.

The important thing is that they are not using anybody's timestamps to determine the ordering of these messages. And if they were doing that then the entire chat application would be totally unusable, not just for me, but also for all my colleagues, because all of our messages would be out of order.

The third clock skew bug I want to tell you about is one that I wrote myself, so it's close to my heart. This was when I was working on the desktop client at Dropbox. And we had a validation that we wanted to run once every six months for every user. And this was a useful validation around the integrity of the syncing protocol. But it was a little bit resource intensive. It required some network calls and a little bit of CPU, so we didn't want to hammer our users with it. So we decided, OK, we want this running once every six months per desktop client. And we don't want it running statistically once every six months. We want it running actually once per six months for each client.

So we set up a protocol where the client would call the server and say, this is the timestamp of the last time I ran this check. And should I run again? And then the server had some admissions criteria that it would use to determine whether or not to run. So you've heard the title of this talk, so you can see where this story is going.

I actually considered the possibility of clock skew, because we knew in the Dropbox desktop client that many clients are old machines-- that's the main thing, old machines, and there's lots of them. And so there's clock skew in the pool.

So the choices you can make here are either to say, well, if the client reports that it last ran in the future, don't run, because it probably ran recently. Or, if the client reports that the validation ran in the future, consider that no information, and run with some statistically low probability. Or you can say, well, this client has clock skew, so obviously something is wrong-- let's go ahead and run this validation about sync integrity. And I went with option three.

And then, a couple weeks later, I'm checking on the monitoring and making sure I understand what's going on. And I see that one client is running like hundreds of times. And I was like, oh no, that's not supposed to happen. And so I look at the logs, and the system clock on this user's machine is jumping back by 10 minutes 6 to 10 times an hour, and this decision I had made was causing it to rerun the validation every time. So we fell back to the statistical option. Said, OK, treat that as no information. Run with low probability so you don't swamp this particular user.
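Roughly, the server-side admission check after the fix looks something like this sketch-- the names, interval, and probability are made up for illustration, not Dropbox's actual code:

```python
import random
from datetime import datetime, timedelta

SIX_MONTHS = timedelta(days=182)   # illustrative interval
LOW_PROBABILITY = 0.01             # illustrative, not the real value

def should_run_validation(last_run_reported: datetime, server_now: datetime) -> bool:
    """Server-side admission check: should this client run the validation now?"""
    if last_run_reported > server_now:
        # The client claims it last ran in the future, so its clock is skewed.
        # Treat that as no information and run with low probability, rather
        # than always running (which is what swamped the time-traveling client).
        return random.random() < LOW_PROBABILITY
    return server_now - last_run_reported >= SIX_MONTHS
```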

At this point, you may be wondering why does your system clock go back 10 minutes 6 to 10 times an hour? And the leading theory that I have heard on this one is that the user was setting the clock intentionally in order to cheat at video games.

We heard a preview of a more serious set of clock skew issues in Carla's talk this morning. I'm going to talk about a related one. But you will notice that the system time is an input to a number of 2FA algorithms. And so if the client and the server disagree about what time it is, then you're going to have a bad time. We were speaking with some folks about what you do if your physical RSA ID dongle has clock skew. Do you just have to go and replace it with a new one? Not clear.
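To see why time is an input, here's a minimal sketch of a TOTP-style code (RFC 6238, the scheme behind many authenticator apps-- I'm not claiming this is what any particular hardware token runs). The shared secret is hashed together with the current 30-second window, so a skewed clock yields a different code:

```python
import hashlib, hmac, struct, time

def totp(secret: bytes, unix_time: float, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password: HMAC the current time window with a secret."""
    counter = int(unix_time // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % 10 ** digits
    return str(code).zfill(digits)

secret = b"12345678901234567890"
now = time.time()
print(totp(secret, now))        # the code the server expects
print(totp(secret, now - 300))  # a client five minutes slow: usually a different code
```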

Another area where this is really important is in the context of certificate validation. And in particular, a certificate is valid over a particular date range. Well, how do you know what day it is? Obviously, if you're in the context of doing like an SSL validation process, the client is trying to determine if the server is trustworthy, so we can't take the server's view for what day it is. So the client has to know. And if the client clock is off by far enough, then it will incorrectly fail to validate a certificate because it thinks it's out of range.
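The date-range part of that check is tiny; here's a simplified sketch (made-up field names, not a real TLS implementation) showing how a badly skewed client rejects a perfectly valid cert:

```python
from datetime import datetime, timezone

def cert_dates_valid(not_before: datetime, not_after: datetime,
                     client_now: datetime) -> bool:
    """The date-range portion of certificate validation, from the client's view."""
    return not_before <= client_now <= not_after

not_before = datetime(2019, 1, 1, tzinfo=timezone.utc)
not_after = datetime(2019, 12, 31, tzinfo=timezone.utc)

# A client whose clock is a year behind incorrectly fails the check:
skewed_now = datetime(2018, 4, 25, tzinfo=timezone.utc)
print(cert_dates_valid(not_before, not_after, skewed_now))  # False
```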

And there's this really interesting paper, primarily from Google researchers, that came out in 2017, where they are basically monitoring the SSL certificate validation errors and warnings from Chrome and trying to classify their root causes. And they found that 33% of certificate validation errors on Windows are attributable to incorrect client clocks. This is a pretty widespread problem.

And they had two mitigations. The first one is that they recommend that you, as a designer or a programmer of systems, should not use a freshly-issued cert. If you get a cert issued today, any clock that is at all behind will think that it's not valid, so give it a couple of days before you actually deploy it.

The other thing that they did is they tried to set up an actionable suggestion in Chrome itself. In particular, if you can verify that the problem with the certificate is that the client clock is wrong, then you can tell the user, hey, fix your clock, and then the problem will be solved.

So the next question is, well, how do you know what time it is if you think the client clock is wrong? And they stood up a secure time server in Google infrastructure. This is also a little bit of an interesting security problem, because by definition, you are failing to do a cert validation. And so you can't have a standard HTTPS connection to this time server. And the solution is that Chrome itself ships with a key that it can use in that context.

And they put up this warning that looks like this. So instead of the user getting the standard, scary HTTPS error, they have this red clock-shaped thing. And the call to action is: go fix your date and time. And they found that this unsurprisingly has a much higher rate of users actually being able to fix the underlying problem.

So let's take a step back for a minute to talk about how this ever works and how it ever doesn't work. When things are not totally broken, what's the situation? And how do clocks get so out of sync with each other?

So the first answer for the question of how clocks get out of sync is they're just not that good. There's some math in a 2001 Sun Microsystems paper that points out that if a clock is off by 10 parts per million, then that adds up to about 1 second per day. And we heard Kyle earlier today saying that if the clock skew is severe, then the various properties of distributed systems don't hold. And I asked him, well, what's severe? He said, oh, well, 30 seconds. So 10 parts per million means you get to that in a month.
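The arithmetic is quick to check:

```python
# 10 parts per million of frequency error, accumulated over time:
ppm = 10e-6
seconds_per_day = 24 * 60 * 60            # 86,400
skew_per_day = ppm * seconds_per_day      # ~0.86 seconds of skew per day
days_to_30_seconds = 30 / skew_per_day    # ~35 days -- roughly a month
print(skew_per_day, days_to_30_seconds)
```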

And there's all sorts of things that can affect this also-- things like CPU load, electrical or magnetic interference, the temperature where the clock is running. So your expectations for a hardware clock should just not be that high.

The other thing that you need to be aware of is leap seconds. And I don't have time to do justice to this topic. It's quite interesting, as are the mitigations that you can apply. The basic idea here is that a day is not exactly 24 hours long. Sometimes it's a little bit more or a little bit less. And you need to throw in an extra second or take out a second from the day to account for that gap. Unlike leap years, where we're quite confident of when you should schedule a leap day and can schedule it out hundreds of years in advance, leap seconds are about the rotation of the Earth being a little bit wobbly. And the wobble is impacted by things like earthquakes, which we're quite bad at predicting.

So the relevant standards body, which I think is NIST-- I'll have to check on that-- schedules out leap seconds a couple of months in advance. Like, hey, by the way, leap second coming up at the end of such and such day. And your clock has to know about that and react to it.

OK, so how can that possibly work? Oh, this is what the clock looks like on days when you have a leap second, when you have an extra second. So the answer to clock skew in general is called the Network Time Protocol. And the basic principle here is: how do you find out what time it is? Well, you ask someone who knows. And the Network Time Protocol defines the protocol that the client and the server speak and the ways of establishing yourself as an authority on what time it is.

So the first thing to know about NTP is that it is not about getting a set of clocks to converge to the same time. It's about getting a set of clocks to converge to the right time. So there is an explicit hierarchy in NTP, where a computer that is connected directly to a high-accuracy clock can establish itself as a server that is very reliable. The client doesn't necessarily have to trust it. We'll come back to that in a second.

So the so-called stratum 1 machines in this context are hooked up to a GPS clock, a radio clock, a cesium atom, those sorts of things, which are high reliability. And then the server can advertise that. And the next layer down is clients who are talking to the more reliable servers, and then also broadcasting down to lower strata. In practice, you don't usually see more than four layers in NTP. The protocol supports up to 15, but the network topology is such that you don't usually have to go down that far.

The next interesting thing to think about here is, well, the network part. So if you walk up to someone on the street and you say, hey, what time is it? You are probably pretty confident that the time between when they check their watch and start speaking and the time when you actually hear their response is not long enough for you to worry about. But over a network, that's obviously not true. You want a high-accuracy timestamp, and you need to know how much time it spent in transit. So NTP defines the way to answer this question. And I want to get into the details of this because I think it's surprisingly straightforward once you get into it.

So suppose that we have a client sending a message to a server, and then the server sending a response. And in this case, there's just the one client, so I'm using the x-axis to represent time, primarily. The client sends this message at time T1. The server receives that message at time T2. Then the server does its processing and responds at time T3, and the client gets the response at time T4.

So the question we were just trying to ask was how much time did that packet spend in transit? And with these four numbers, that is very easy to calculate. The round-trip delay is the time when the client received the response minus the time that the server sent it, plus the time when the server received the request in the first place minus the time that the client sent it-- (T4 - T3) + (T2 - T1). Alternatively, you can think of this as the time in between when the client first sent the message and received the answer, subtracting out the time that the server spent processing. So that's round-trip delay.

The more interesting question is to say, well, what is the actual offset between these two clocks? And the way that the spec states this is, OK, it's the time that the server receives the request minus the time that the client sent it, plus the time that the server sent the response minus the time that the client received it, all divided by 2-- ((T2 - T1) + (T3 - T4)) / 2. And so the second term is backwards in time.

I find this not particularly intuitive. And so for me, it's more intuitive to think about this with the order reversed. So say, OK, the first term is the time spent in transit. And then we net it against the time spent in transit on the other side and divide that number by 2.
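Here's that same math as a small sketch-- a hypothetical helper, not taken from any NTP implementation-- using the two examples worked below:

```python
def ntp_delay_and_offset(t1: float, t2: float, t3: float, t4: float):
    """Round-trip delay and clock offset from the four NTP timestamps.

    t1: client sends the request (client clock)
    t2: server receives the request (server clock)
    t3: server sends the response (server clock)
    t4: client receives the response (client clock)
    """
    delay = (t4 - t1) - (t3 - t2)           # total time in transit, both directions
    offset = ((t2 - t1) + (t3 - t4)) / 2    # how far the client clock is behind
    return delay, offset

# The boring example: no skew, 3 seconds each way, 1 second of server processing.
print(ntp_delay_and_offset(0, 3, 4, 7))      # (6, 0.0)

# My MacBook: roughly five minutes slow.
print(ntp_delay_and_offset(0, 303, 304, 8))  # (7, 299.5)
```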

So let me work an example here. Let's say that the client sends its request at time 0. The server receives it at time 3, sends its response at time 4, and the client gets the response at time 7. So the math here is (3 minus 3) divided by 2, which equals 0. Not a particularly interesting example.

Now, let's try this for my MacBook over there. It sends a message at time 0. The server receives that message at the time that it believes to be 303. It spends, say, 1 second calculating its response-- of course, in practice, it's much faster than that-- so it sends the response at time 304, and my machine receives it at the time that my client believes to be 8. And so the offset then is 303 minus this large negative number, negative 296, all divided by 2. And so you get 299.5 seconds, which is, in fact, our intuition for how far off my client is.

And what's interesting here is you can't make any statements at all about the clock offset until the client has actually received the response. So when the server gets the message from the client, the only thing it knows is what time the client thinks it sent the message. So the server might say, OK, between the network time in transit and the clock offset, the total is 303 seconds. But it can't make any statement about the breakdown between those two. Not until the client receives the response can we actually say, OK, we have what the server thinks and we have what the client thinks, and we can work out the offset from those numbers.

The other thing that's interesting to look at here, which I sort of buried in this example, is that this algorithm doesn't make it easy to distinguish between offset and asymmetric transit time. So if you have low latency in one direction in the NTP protocol and high latency in the other direction, you will end up calculating an offset that is not necessarily there.
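For example, with zero true offset, 5 seconds of latency on the way out, and 1 second on the way back, the formula still reports an offset:

```python
# Zero true clock offset, but asymmetric latency: 5 seconds out, 1 second back.
t1, t2, t3, t4 = 0, 5, 6, 7
offset = ((t2 - t1) + (t3 - t4)) / 2
print(offset)  # 2.0 -- asymmetric transit time shows up as apparent clock offset
```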

So I mentioned earlier that anyone can speak this protocol. Anyone can represent themselves as a highly-accurate clock. And so a big chunk of the spec in NTP says, how do we know? How do we get to the right time? How do we know if a particular server which may be representing itself as a high stratum object is actually telling the truth?

And this is the really fun part of the spec to read, because they adopt some terminology where a clock that is accurate and running correctly is called a truechimer, and a clock that is not accurate, running fast or slow, or lying is called a falseticker-- so, high drama. I'm not going to get too far into the details.

The basics of this are that you do some additional math. In particular, you decide your trust based on the stratum of the server and the total time spent in the network, where more time in the network gives you a worse answer probabilistically. And you also look at a number of different sources. So the NTP website recommends that if you're standing up a server for your data center and that sort of thing, you either have one NTP server or at least four, so you can actually do some math.

So the thing that's most exciting to me about this is it is not that involved in terms of the math. This is a piece of infrastructure that underlies basically all of the internet and all of the work that most of us do. It's sort of held together with bubble gum and toothpicks, like the rest of the internet. And when you dig into the details, it is quite understandable.

And so my exhortation to you [INAUDIBLE] I'm going to give you some concrete techniques. But overall, the mindset here is that these pieces of fundamental infrastructure are understandable. And if you take the time to throw yourself at a problem or to dig into the meat of it, you can come to some really interesting mental models.

OK, so the practical part of the talk-- let me see how I'm doing on time. In fact, most of my speaker notes are showing me the time, but that's my system time. So I'm definitely slow by five minutes here. I have 3:59. So here's some things that you can do.

The first one is to think about having a reasonable fallback. So if your client tells you the wrong time, what do you want to do with that information if it gives you an answer that is obviously nonsensical relative to your frame of reference? So here, you can imagine GitHub choosing to show an absolute timestamp when the client's timestamp is not reasonable, or maybe to exclude the relative section, or something like that. Having a reasonable fallback takes you a big chunk of the way.
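Going back to the GitHub example, a fallback version of the earlier sketch might look like this-- again hypothetical, not anyone's real code-- showing the relative string only when the delta makes sense, and an absolute timestamp otherwise:

```python
from datetime import datetime

def display_time(event_time: datetime, client_now: datetime) -> str:
    """Relative time, with a fallback to an absolute timestamp."""
    seconds = (client_now - event_time).total_seconds()
    if seconds < 0:
        # Nonsensical relative to our frame of reference: don't trust the
        # client clock; show the server-recorded timestamp instead.
        return event_time.strftime("on %b %d at %H:%M UTC")
    return f"{round(seconds / 60)} minutes ago"
```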

The other thing that you can do is try not to care what your client thinks at all. Pick a timekeeper that's not the system time of any client machine. A database is a great choice for this because, in a relatively straightforward architecture, you probably have one database, and it's easy to get consensus across one of something. And if you have a multi-node setup for your database, then you start to get into the things that Kyle was talking about, where the database itself has thought quite hard about consensus mechanisms and can probably present to you an interface that is good enough for many use cases. So if you can avoid caring what your client thinks, that's very useful.
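A minimal sketch of what that looks like in practice, using SQLite as a stand-in for whatever database you run: the client never supplies a timestamp, so its clock never matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE messages (
        body TEXT,
        -- Let the database assign the timestamp: one timekeeper, no client clocks.
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
# The client only supplies content, never a time.
conn.execute("INSERT INTO messages (body) VALUES (?)", ("hello",))
for row in conn.execute("SELECT body, created_at FROM messages ORDER BY created_at"):
    print(row)
```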

I have this example of some software doing exactly the right thing. Like many of us, I have Google Calendar configured to give me a 10 minute warning on all my meetings. And so here it is at 3:16, according to my Mac, telling me about my coffee at 3:30. Google Calendar does not care at all about what time my system clock thinks it is. And so it's able to do the right thing.

The other alternative that you can take is to get a second opinion. And this is what you saw the Google Chrome team do. If you do need to care what time your client thinks it is and you have the ability to detect if that answer is somewhat nonsensical, you can think about standing up another service or getting another way to figure out what the correct time is, and then build that into your protocol. So, high-level strategies: having reasonable fallbacks, avoiding trusting the client, picking a consistent timekeeper. And then if you have to, you can fall back to second opinions.

So before I break, I want to ask the question that's probably on everyone's mind, which is what's wrong with the Mac? I don't have a terrifically satisfying answer to this at the moment. If you have more details here, I would love to hear about them. I think the upshot of this has to be that whatever system is supposed to be tracking the time and slewing the clock to adjust it is just not running. Because otherwise, you can't be five minutes off. And it's getting more off at a rate of about 20 or 30 seconds a week, which is a lot.

If I run SNTP on the machine, it says, yeah, you're definitely 223 seconds behind, plus or minus 190 seconds. So Apple is not actually using NTP, I think, because of battery life concerns. This is all from forums. So there's this thing called Pacemaker, and I think maybe that is also not running. But if you have ideas on how to diagnose this-- not to fix it, but to understand it-- I would love to hear them.

All right, so my thanks to the folks who helped with this talk-- Jess Shapiro, Amy Hanlon, Glyph Lefkowitz, Julian Cooper, and Nelson Minar. And we're not doing questions live, but if you have bugs and war stories about the time that you thought the system clock would be totally reliable and then it knocked down your service, I would love to hear them. Thank you.

[APPLAUSE]