DeconstructSeattle, WA - Thu & Fri, Apr 23-24 2020

← Back to 2018 talks


(Editor's note: transcripts don't do talks justice. This transcript is useful for searching and reference, but we recommend watching the video rather than reading the transcript alone! For a reader of typical speed, reading this will take 15% less time than watching the video, but you'll miss out on body language and the speaker's slides!)

[APPLAUSE] Thank you. Aw. OK.

Hello, my name is Jumana Bahrainwala and I'm going to be talking to you about how to scale relational database migrations. That's my Twitter handle. Feel free to tweet at me. I don't actually Tweet a lot, but you're welcome to tweet me.

OK, so what are we going to talk about? So you have the dawn of time and you have a Users table, and your Users table has, like, an ID, Active, First, Last, and Email. And it works for your simple application, and you're like, yeah, make sense.

And then you grow and you add more people to your team. And then your people on your team are like, hmm, what is First? First is not good naming convention. And in the spirit of constantly iterating, you're like, OK, yeah, we can rename First. So they decide to add a migration-- and they do it safely-- where they're changing the table Users and renaming the word First to First Name. Because that's what you're-- like that's what it means, right? First Name, Last Name, et cetera. And you're like, cool.

But at the same time, your application is actually serving to your end users and people are trying to constantly read from the users table. So you have like your select statement, which was like just selecting from Users ID, [? too. ?] And for some reason, the statement never returns. You just keep waiting and waiting and waiting. And you're like, hmm, what's happening?

So you open up htop and you see there's this like, select waiting. And you're like, why is the select waiting? Select should never wait. This is a problem. I should always be able to read from my database. And then your first transaction commits, and then the select finally returns. So you know those two are kind of related. You're like, hmm, what is happening?

So you never going to reach the state when you change a column type and you lock your database for 10 minutes. Because that's never happened to anybody. And, yeah.

So how did I come to this particular topic? I work for a company where we went from being a really small company and we went to becoming a really large company really, really quickly. We onboarded a bunch of new clients.

We needed to start caring about downtimes. We couldn't actually put up maintenance pages and say that we're going to ride on our migrations for like this block of time because people are like, no, you can't have your app down. We're paying you for this.

So we had SLAs, et cetera, et cetera. So one of the biggest things, one of our biggest issues-- when you don't have an architecture which is resilient and you don't have systems which will have backups all over the place-- like, you are the small app, you grew really quickly, you probably have a relational database as your back end-- and you're like, OK, how can this never go down? Let's figure this out.

And one of the best ways to make sure that-- like one of the reasons that your database does go down is when you run migrations. So this is my motivation for this talk. How do you run migrations when you're trying to scale up without-- like in a safe manner?

Most of these examples are Postgres-specific, but the concepts likely apply to most SQL-like DBs. I know there are some differences. And I recognize that, but most of the examples in the task are Postgres-specific.

I'm going to talk about three things. I'm going to talk about scaling with tools, scaling with people, and then the remaining-- there are some remaining issues that still exist trying to get to zero-downtime migrations.

So scaling with tools. Don't lock yourself up. I've used the word locks multiple times, and I haven't actually introduced them. Let me introduce them. So locks, you have this thing on your table that essentially locks it. What does a lock mean? You're like, OK, this is locked, so I can't actually go into it. Or, I can't see what's happening behind that door, right?

So you have some particular locks on your table that don't let you read or write from your table. These are the worst kinds. These are the ones that you don't want. You're like, what is happening? Why can't I see what's happening underneath the hood? It's like a stop, a do-not-enter sign. You can't actually see what's happening inside your table.

Then you have some locks that are open, but they're like a one-way street. They're like-- they let you do some things. They don't let you do other things. They're like, we let you read in this specific situation. Or will let you update in this specific situation. So it's like-- let's you do some things, but not all the things.

And then lastly, you have the locks which are like beautiful highways. They let you go towards the city and have no problem. They're like, OK, you can have a lock on this table. This will not cause any issues. You will always be able to read. You'll always be able to write. It doesn't actually cause any downtime.

So these are like on a broad level, the three types of locks that you have on most-- most of your transactions will hold. The first type, the one which is like the do-not-enter sign, that's your access-exclusive lock. That's like your Alter Table, Drop Table, Truncate, Reindex, or Cluster. When you run any of those commands, for a couple of milliseconds, that table is blocked to reads and writes.

Then you have the lock, which is like a row share, row exclusive. It lets you do somethings, doesn't let you do other things. It's Like your select for updates, select for share. Or like anything that inserts or modifies data, it holds this lock on this particular row for a very short amount of time. But then you'll still be able to read from it and still be able to write to it.

And lastly you have your access share lock, which is like your select statement. This is the lock that you never want blocked. Because if you can read from your table, that is when your application goes dark.

So reads are generally OK with reads. Most of the time, when you are trying to rely on multiple reads and different transactions, you don't have a problem. Sometimes, reads are OK with writes. It depends. Sometimes, writes will cause an issue.

But reads and writes are generally not OK with DDL statements. I know I haven't introduced what DDL changes are-- and I will-- but DDL changes are the ones that you have to actually look out for. So you want to look for locks that are conflicting because conflicting locks are what cause downtime.

OK, so we've talked about the theory behind locks. Now let's talk about what you should do to make sure that you don't actually hit these locks. The way that you do that is you break up your migrations. Break up your migrations. The first thing you have to think about is schema changes.

What is a schema change? So let's say you have the blueprint of the Death Star, and you know that there is this gaping hole that will cause the rebels to essentially get the Death Star to fall down, so you need to repair this hole. And how do you do that? You put up like-- you put up scaffolding and you fix this hole in the Death Star, right? It changes the underlying blueprint or the underlying structure of what the Death Star looks like. This is the exact same thing, what a schema change is.

Your database has a bunch of tables and has a bunch of relations that define what the underlying data, like underlying structure of the DB is. It's your DDL. Like, it's the data definition of the-- Data Definition Language of your database. It tells you what the data looks like underneath the hood.

Any time you want to change a relationship or you want to change what a table looks like in this database, it's a DDL change. A DDL change is a schema change. Schema changes are the ones that are the most-- they're the most affected-- they're the ones that affect your database the most.

They are the ones that cause like an access exclusive lock. They're the ones that you want to make sure-- so like if you have DDLs, the types of transactions you're looking at are like your create statement, your Alter Column Value or Alter Column Type. And all of them are the ones that will cause a complete lock to the table. It might be for a couple of milliseconds, but it still happens. And that is the lock which is an access exclusive one.

So you want to make sure that your schema change is really, really quick and it's in a small transaction. You don't want to tie your schema change to something that is long running. You want the Alter statement to run and just be done. How do you make sure that you do that? You make sure it's its own transaction. Don't tie your schema changes with any other change. Break up your transactions so that your schema change is its own transaction and it's really, really quick.

The next thing you want to do is index changes. An index is essentially a way to access data really, really quickly. What you can think of is like-- you know how we have pointers, like a pointer to like creating low-level disk access to a particular amount of data. You're like, OK, I want to access Store ID, and it just goes to Store IDs because it knows exactly where it is because it creates a pointer to it-- and on the disk level.

Indexes can be really painful to add, depending on the type of database system that you're using. Postgres lets you add them concurrently. Concurrent indexes don't cause downtime, but some databases actually add indexes non-concurrently which lock up the table and actually cause downtime. So it depends on the use case, but you want to make sure your index change is by itself-- again, in its own transaction. So you don't tie your index changes to your schema changes.

Lastly, you have data changes. Data changes are things that are-- things like when you copy data over or, like, you're inserting statements, update statements, et cetera. So like, let's think of when you want to copy over data from one column to another column, what your application has to do, it's got to read all of the data. And then it needs to write it to this other column. That can take really, really long.

Let's say you have a table of like a million rows, right? When you're doing that, that transaction can take a really long time, so it should be its own transaction. You have your schema change, which you want to be really, really quick, so that has to be separate. You have your index change that might lock. That needs to be its own transaction. And then you have your data change that will probably be long, and should be its own transaction.

So going through an example, so let's say you have your Users table. You are just creating your Users table, and that's the definition. And then you decide to add an Age column and a Status column. So those two are their own transaction when you add them in, and they're your schema migration.

Then, you decide that you're constantly searching on Email and on Status, so you decide to add Index on Status and Email so that you can search on those [? three ?] things really, really quickly. You want that to be its own transaction.

And then you decide that, oh, actually, I need to know when the user was last active, and I'm going to add an Active Date column. But when you add your Active Date column, you want to backfill it because you want to know-- you're like, OK, as a before value, I'm going to add the current date/time so I know that they've been active, at least now.

So you want that data migration, which is going to go through every single row in the table, to be run as its own separate transaction so you don't tie it to things which are actually going to lock the table to reads and writes. So if you tie your third schema migration to your data migration, that will make both the schema migration and the data migration really long. But if you separate them out into their own transactions, the schema migration will be really short and fast. And the data migration can take long, but that will be OK because it won't be blocked to reads and writes.

So if you go back to locking, in theory, both of those things will have an access exclusive lock, which is the highest level of locks that-- and causes the table to actually be shut down. It doesn't let you see what's happening underneath the hood. And when you do select [? star ?] from Users, at that point, it won't return for those couple of milliseconds.

A couple of examples for things that you want to do, like when you're changing the type of a column, you don't want to actually change the type of an existing column, even though that's really easy. You just write your Alter statement on the existing column and then change the type.

This is a workaround to make sure you don't lock your table and cause-- and for a long period of time-- what you do is you add a new column with the new type. Then you write to both columns, and then you backfill the new column. At some point, the new version of the code would stop writing to the old column.

This is painful and long, but what this will end up doing is not locking your Users table, or whatever table you are looking for. So if you have to change the type of column, do a three-step process. Think of how you can add a schema migration, data migration, and then process around it, versus just changing the existing column type, which will cause a lock for a long period-- for a longer period of time.

Another example is like adding a column that has to default. What you're going to do is you're going to add a column. So that's a new column. Then you'll add the default as a separate command. So that's your data migration. And then you'll backfill the column with the default value.

Again, there's an easy way to do this that is quick, but it's going to cause downtime. And then there's a harder way of doing this, which is like a three-step process, but will not lock your table. Choose the harder way. It's worth it.

OK, so we've talked about two things. Don't lock yourself out. Break up your migrations. And then [? data-police ?] this co-compatibility.

A lot of times, your migrations may fail. When you are running a migration in production and, for some reason, a table gets locked up and you're like, we have to abort. The best way to deal with that is to make sure that your code is backwardly and forwardly-compatible. Make sure you can read from both of those columns for a certain period of time. Both of these columns are being written to.

And don't drop your columns until you're like way, way, way into the future. You're like, oh, we're actually not using this column for a while now. Now we can drop it, and we're 100% sure the migration has gone through. So make sure your database and code is always backwardly and forwardly-compatible. That will save you issues.

So we've talked about scaling with tools. Now, we're going to talk about getting with your team. The first thing you want to do when your team grows is add a migration user. Should probably do it with-- before your team grows, too. But like anyways, add a migration user.

A migration user is a person or or a roll on your database that can have a locked timeout of five seconds. So what this means is the user demand-- like any command that this user runs has a locked timeout of five seconds. So any transaction needs to finish in five seconds. If it doesn't, it will abort.

So again, you have your Users table. Because, you know, I'm smart and I can only think of one table that you can have, so the Users one. And you have a bunch of select statements. So I know it's a list, but it does finish concurrently. But anyways, so you have a bunch of select statements and they're all reading from the Users table.

And then you have your Alter statement. And you know the Alter statement will actually lock the Users table for some portion of time, once it actually manages to get the lock, but it can't do that because there are a bunch of selects first. But every single select that comes after it will actually just wait. It's going to wait till that alter finishes because the Alter, they don't-- but they don't know what the Alter is going to change. So until that Alter finishes, all those other selects are waiting.

So that Alter statement there waits for all those green selects to finish. And let's say, instead of like a few green selects, you have like thousands and thousands of green selects. That Alter statement is going to block everything that comes after it.

So if you have a migration user, what that migration user will do-- it's like, oh, I've not managed to finish running this alter statement in five seconds? Abort. Let me-- let's just, like we'll figure out how to do it again. And all those selects in the purple will actually finish after that because the Alter statement will just abort. So yeah, and then you won't have that sad-faced.

You want to have migration tooling. So when you grow with your team, you want to make tools as an abstract things out so it's easier for your team to actually grow and write migrations without having to think about all these random things.

First step, if you're opening up your terminal and you're going into Postgres, typing out localhost, and creating a table, Users, just stop. There has to be a better way of doing it. Like, you believe it. I believe it. There has to be tools. Someone has done it. And yeah, someone absolutely has. There are a bunch of automated scripts.

Rails has this thing called Active Record. We use this thing called Alembic. There is like File DDB. Find the tool that works for your use case and for your language, and use that. Don't-- I mean, don't-- use what's out there. Don't try and write your own. And when you're looking for a tool, something that I have found really useful is having a hash of knowing exactly what the state of your database is at any given point of time.

I want every developer to be able to go into production and be like, OK, what are all the migrations that have run? So there's like a table which tells you, these are all the hash values of all the migrations that have run so far. I can take that hash table and easily replicate it on my local machine. And this is important because you want to be able to have a way to debug what your DB looks like on your local host.

And the way to do that is to be able to replicate the schema from what your production database looks like onto your local machine, right? So find a tool that does that for you. There are a bunch of them out there. I'm sure you can do it.

And lastly, you should load test your migrations. You should always be load testing your application. For like-- the minute you're growing-- you're like OK, how many requests can my application actually take? Do I need to scale up my database? Do I need to scale up my web nodes? Et cetera, et cetera. But this is specifically saying, when you're running an Alter statement or you're running the Create statement or you're dropping something, load test that specifically.

So when you're load testing your migrations, I'm going to assume you're running your migrations during low traffic because that's the best time to run migrations in case something bad happens. But what if low traffic still means you have like thousands of users on your website?

Load test your migrations. Find out exactly what part of your application is being affected. It might not be that every part of your 's being affected. It might be this one table that you're running a migration on that has two requests, so it won't matter. But you won't actually know unless you decide to load test your migrations.

So I'm going to run you through an example of what we found out when we decided to load test our migrations. And we recognized this was an issue. And this is the TLDR. Like, we had a cyclical graph and everything with all the transactions were waiting on each other.

So you have time, on one side, and you have your HTTP request cycle on one column and then you have migrations on your other column. So your request cycle starts, and your request cycle starts and opens up a transaction. All it's doing is it's updating the Users table, right, set status to false where id is equal to 2. It's nothing fancy. This should be easy. You can still read and write from the User's stable at any given point of time. No issues.

But then, you decide to spawn an async worker because you know how when you have your HTTP call, you send off stuff to an async cluster to like do some random work and wait for that-- the request cycle, though, will wait till those async workers come back. So you spawn an async worker, and the async worker makes an HTTP call to get information from a CRM service, which is external. So it goes out and it's doing some work.

At the same time, you run a migration which is changing the Users table. So it's adding a column to the Users table. This, we know, locks the Users table for a certain period, like for a couple of milliseconds, depending on how you do it. So that's another transaction. This is separate from the request cycle. It's its own transaction. It's being run separately.

Then you come back to the request cycle, and the worker, the async worker comes back and it runs a database call. So all it's doing is-- it's like selecting from users. So you have three transactions. Two of the transactions are within the request cycle. And you have your migration transaction which is outside of the request cycle.

For the request cycle to complete, it's going to wait for the async worker to come back because that's the only way that the request cycle actually finishes-- because it's supposed to wait for the async worker's information. And for the transaction one to complete, it needs to do a session.commit. So you're still wait-- so you know that the request cycle doesn't finish until async worker comes back.

So what exactly is happening? Your transaction one, you have an HTTP call. It's a request and it's the request cycle one, and all you're doing is you're updating the Users table.

But transaction two, which is adding a migration, is actually blocking transaction one. Because we know all transactions go into-- like it's not exactly a queue, but it's some sort of queue. So transaction one needs to finish before transaction two can complete.

But at the same time, we know our transaction three that comes after transaction two which is-- all it is doing, it's doing a Select statement-- is actually going to wait for transaction two to finish. So transaction three is waiting on transaction two, but transaction one is the request cycle which doesn't finish until you get the information from the async worker to come back. So transaction one is waiting on transaction three.

So you have this orthograph where everyone's just waiting on each other. There is no deadlock scenario. It's not actually locked up the table. All we're doing is like waiting for each other to actually complete, and no one actually ever completes because they're all waiting on each other. So if you go back to this, what has happened, transaction one was blocked by three. Three was actually blocked by two. And then two was blocked by one.

I understand this architecture may not have been the best way to design the system, but when you first start out, you make decisions which don't necessarily make sense.

So maybe the async worker should not have opened up its own transaction. And, like, what happened is, we decided to load test our migration. And when we load tested our migration, that's when we found out that migration was being run between when that async worker left and was still waiting for its request to come back. And we essentially locked up our application in production for a bit because, you know, we didn't actually load test our migration. So you should all do that.

So yeah, load test your migration specifically, not just your application. Load test migrations. Make sure there's a lot of users on your-- using your website so you can see what migrations actually get affected.

Last thing I'm going to talk about, the ORM. A lot of things can be moved off the ORM, if you decide to use your ORM as your source of truth. If people in your company are actually going to the database and doing Select statements, this doesn't work. But if the way to actually access data or like it's always going to be the ORM, you can offload things.

Like, you can add sensible defaults, like null values, et cetera, to your ORM. And that saves you from doing a bunch of data migrations and doing that whole two column, three column additions, and doing data migrations over time. So you can offload things to your ORM, but that's only if you decide that your ORM is going to be your source of truth.

And if you're using your ORM and you have multiple languages, then you want to make sure that every model and every language is actually always constantly up-to-date, et cetera, et cetera.

There are a few issues left with actually getting to zero downtime migrations. Foreign keys, vacuuming, and then primary and replica.

Foreign keys. When you add a foreign key to an Orders table and like you're making a User ID-- foreign key to the Orders table-- when this happens, both of these tables acquire an access exclusive lock. Which means that both the Orders table and the Users table are locked to both the reads and writes while this is happening. This is a problem.

I mean you can't just lock your Users table and your Orders table for a while. I don't know. I don't have a good solution to this. You have to decide whether you want referential integrity or not. Like that's a decision your company has to make.

There is a way to add foreign keys without validation. So you add the foreign keys and don't validate them. But then when you go to validate them, and if there is data which is actually invalid, that will cause problems with the state on your DB. So ups and downs. You have to decide what to do. I don't have a solution.

Vacuuming, Postgres actually lets you run vacuum without running a Vacuum Full. It's important that you run vacuuming for performance of your database, so that you clean all the doubles. But depending on what your DB looks like, running a Vacuum might actually cause your tables to lock up.

So if you are doing that-- and you should, to increase performance to your DB-- you have to decide how to do it. Like, maybe you want to maintenance window or something.

And lastly, when you're growing, you want to move-- like have primary and replica of your DBs, and you want to try and move all your Select statements to your replicas. The reason for that is so that your primary database actually has more processing power to run migrations really quickly. So it's not running a bunch of Select statements. The Select statements are moved off to your replica database, and your primary just has more computing power to run things really, really quickly.

So I've talked about a bunch of things. I've talked about scaling with tools. Break up your migrations. Break up your migrations. Think about schema changes and exchanges and data changes. Think about how, every single time, you're adding a migration that is-- locking can be broken up so that it doesn't lock up your table quickly.

I've talked about scaling with your team, like adding a migration user, doing migration tooling, load testing, how that helps you. And lastly, I've talked about a few issues that happen when you are trying to still get to zero downtime migrations.

I realize I haven't introduced myself. My name's Jumana Bahrainwala. These are the places you can find me on the interwebs. I cannot sing, but I can rap the Alexander Hamilton song with you. And I grew up in this really tiny country called Bahrain, which I share the same last name with. I don't know how many of you share the same last name of the place that you grew up at? Not many, I'm going to assume. And I'm proudly Canadian.

Ooh, Canada!



And lastly, I have a bunch of acknowledgments. I-- like this talk did not happen in a silo. I definitely stood on the shoulders of giants. Gary and Brent went above and beyond to help me out. Nico, Filipe, Alex, Zach, Di, and every other wonderful person of the place that I worked at made sure that I actually said smart things.

Hopefully, I said smart things. And I read a bunch of things, and these are all the things that I read to try and figure out how to actually do zero downtime migrations. And if you-- oh. If you come and find me after, and if I don't know, I will copy their things for you and tell you the answer.


Thank you.