Leveraging Holistic Synergies: Ben Lopatin

Restoring Postgres backups for Django projects

July 05, 2020

Automating the process of restoring database backups while developing Django applications, optionally using Heroku, Invoke, and safe naming strategies.

Using a copy, or a partial copy, of a deployed database (e.g. production or staging, one that is somehow exposed to other people) is a great way to jump-start local development. It can also be a valuable tool for debugging issues where the specific state causing a known error is both critical and difficult to reproduce - provided your database is small enough.

I stumbled on Matt Segal’s post on the same topic and he has solid advice for getting started with automating this process. Go ahead and read through it! While it's good advice, it does come with some hard assumptions which require changes for my preferred workflow and customer requirements.

Note that this post likewise assumes one is using PostgreSQL, but while the specific commands and arguments will differ for MySQL or any other database, the overall strategies remain valid for other databases, and even other application languages/frameworks.

My starting point

For much of the past umpteen years I’ve used a two-part system to restore databases. Acquiring the database dump was part one, and this varied project by project, so there was never any authoritative way of fetching the dump itself. Rather, my system relied on getting this file into my ~/Downloads directory and then running a single script for creating a new database from this file.

#!/usr/bin/env bash

# Restore the newest file in ~/Downloads (or an explicit second argument)
# into a freshly created database named by the first argument.
LATEST_DB="$HOME/Downloads/$(ls -t ~/Downloads | head -1)"
BACKUP_PATH="${2:-$LATEST_DB}"
DB="$1"

echo "Creating database $DB from $BACKUP_PATH"

psql -c "CREATE DATABASE $DB;"

echo "Now restoring database $DB from $BACKUP_PATH"

pg_restore -O -x -d "$DB" "$BACKUP_PATH"

It did work, and it did save me quite a bit of time, but it was still somewhat clunky. It required that I manually name the database each time, it assumed and required that the dump was the most recent file in the ~/Downloads directory, and of course it did nothing to solve for acquiring the backup.

In previous projects years ago I usually bundled this all into a couple of Fabric tasks.

Constraint: no overwriting the database

The first change I’d have to make is ensuring that the restoration process results in a new database, rather than overwriting an existing one.

I’m pretty confident my preference here puts me squarely in the minority, but in most cases I’ve found that the simplicity of overwriting a single database is outweighed by the benefits of (a) being able to track the provenance of the database (both deployment and time) and (b) separating in-progress development data from production data required for investigation.

For a project called or nicknamed “acme”, for example, if I download a production backup from June 30, 2020, I expect that backup to be restored to a database called acme_prod_20200630. Thus I can see, at a glance, what project a DB is associated with, what system it came from (or what it’s being used for), and when it was acquired.

There are a couple of objections:

  • It can result in accumulating databases and eating up storage
  • It requires changing the database name for your project

As for accumulating databases, yes, you could end up with this problem. However in practice it’s not terribly difficult to drop old databases periodically. Most of my customers are B2B SaaS and their databases just aren’t that big (we’re talking gigabytes, sometimes even MEGA-bytes here) such that a single extra copy is unlikely to eat up my free disk space. If I were dealing with databases large enough for that to matter I’d probably consider using remote databases even for [some] local development or debugging.
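For what it’s worth, even that housekeeping is easy to script. Here’s a minimal sketch (not part of the original workflow) that drops date-stamped copies older than thirty days, assuming the acme_prod_YYYYMMDD naming convention described above and that psql and dropdb are on your PATH:

import subprocess
from datetime import date, timedelta

cutoff = date.today() - timedelta(days=30)

# List databases matching the naming convention; -tAc prints bare rows
names = subprocess.run(
    ["psql", "-tAc",
     "SELECT datname FROM pg_database WHERE datname LIKE 'acme_prod_%'"],
    capture_output=True, text=True, check=True,
).stdout.split()

for name in names:
    stamp = name.rsplit("_", 1)[-1]  # e.g. '20200630'
    try:
        created = date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:8]))
    except ValueError:
        continue  # not one of our date-stamped copies
    if created < cutoff:
        subprocess.run(["dropdb", name], check=True)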

And as for changing the database name, this is not set directly in the project itself, it’s pulled from an environment variable following the 12-factor pattern. Using direnv it’s a matter of adding a line to the project’s non-source controlled .envrc file, annotating it as necessary, and restarting the dev server.

export DJANGO_SETTINGS_MODULE=config.settings
export DJANGO_CONFIGURATION=Local
export DJANGO_DEBUG=True


# Feature development
export DATABASE_URL=postgres://localhost/acme_development


# ERROR TESTING
export DATABASE_URL=postgres://localhost/acme_prod_20200701

Now, after debugging and resolving the issue at hand, it’s trivial to comment out or remove the testing database and get back to whatever test environment you had previously.
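For completeness, the Django side of this is just reading DATABASE_URL from the environment. The post doesn’t prescribe a particular library, but a minimal sketch using dj-database-url (an assumption on my part) looks like this:

# settings.py (sketch): DATABASES built from the DATABASE_URL env var,
# falling back to the feature-development database when it's unset
import dj_database_url

DATABASES = {
    "default": dj_database_url.config(
        default="postgres://localhost/acme_development"
    ),
}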

Constraint: Heroku backups

Since my customers overwhelmingly use Heroku (and if you’re running a bootstrapped SaaS business I feel very strongly that you should probably be doing so too) we can skip the steps for identifying and downloading the latest backup from S3. Using the Heroku CLI tool and the pg subcommand we can interrogate the available backups and directly download as well.

The pg:backups:download command will download a specific backup, by ID, and by default will simply download the latest backup. We can use this command, however it cannot write to standard output to pipe the results. By default it will download the backup file to latest.dump in your working directory (or some numbered variant like latest.dump.1 if another copy already exists), though you can provide an alternative path. This path cannot be "-", and trials of using /dev/stdout were met with less than successful results.

So our options are to download the file and leave it, download and clean up, or pipe the file contents to pg_restore. The first option should be a non-starter, so let’s look at the second option.

#!/usr/bin/env sh

PREFIX="acme"
ENV="prod"
APP="acme-production"

DB="$PREFIX_$ENV_$(date +%Y%m%d)"
BACKUP_PATH="/tmp/$DB.dump"

echo "Creating database $DB"
psql -c "CREATE DATABASE $DB;"

echo "Downloading latest backup"
heroku pg:backups:download --app="$APP" -o "$BACKUP_PATH"

echo "Restoring database $DB from $APP via $BACKUP_PATH"
pg_restore --no-owner --no-privileges --dbname "$DB" "$BACKUP_PATH"

rm "$BACKUP_PATH"
echo "Restored database $DB from $APP, enjoy!"

This script would be project specific. It first builds a database name using the hard-coded project name and environment name (enterprising minds will see how this could be better generalized), then creates the empty database, downloads the latest backup to the temporary directory, and restores it into the database before deleting the temporary backup file.

The restoration step adds a couple of flags. First, --no-owner ensures that our local database user becomes the owner of the restored database objects. Second, --no-privileges skips restoring any access privileges (GRANT/REVOKE commands). There are historical reasons why both of these have proven useful which, quite frankly, I forget! Probably because I haven’t run into the same problems…

However, we can skip the file step entirely by fetching the backup URL and piping the download directly into pg_restore. So let’s do that.

#!/usr/bin/env sh

PREFIX="acme"
ENV="prod"
APP="acme-production"

DB="$PREFIX_$ENV_$(date +%Y%m%d)"
BACKUP_PATH="/tmp/$DB.dump"

echo "Creating database $DB"
psql -c "CREATE DATABASE $DB;"

echo "Now restoring database $DB from $APP"

curl --silent "$(heroku pg:backups:url --app="$APP")" | \
    pg_restore --no-owner --no-privileges --dbname "$DB"

echo "Restored database $DB from $APP, enjoy!"

This isn’t significantly different, but it skips writing anything to disk, at least prior to restoring into the database.

Using Invoke & Django for optionality

I mentioned at the beginning that in prior projects I bundled or recommended bundling this kind of functionality into Fabric deployment scripts. An alternative to these shell scripts is to follow that previous path and use Invoke, a Python task runner and partial successor to Fabric 1.x.

There are a couple of benefits to using Invoke over shell scripts. First, the Python tasks tend to be easier to work with and comprehend as you add features to the scripts. For another, since it’s Python you can import other Python modules, including your own configured Django project. This makes it possible to use your Django database settings to guide backup restoration.

Here’s the task complete with docstring. This task, run from the command line like inv restore-db, will let you configure the database connection settings or pull them from your Django project. It will optionally create a new database named after the Heroku app combined with a date stamp, e.g. acme_prod_20200703, or allow you to overwrite your current or specified database after prompting for confirmation.

from datetime import date

from invoke import task
from invoke import Exit
from invoke import UnexpectedExit


@task
def restore_db(
    context,
    host="",
    dbname="",
    username="",
    port="",
    app="acme-prod",
    email="",
    password="",
    new=False,
    overwrite=False,
):
    """
    Downloads and restores locally from the last production backup

    Note that absent setting your DJANGO_SETTINGS_MODULE this will
    not work correctly IF using a base settings module other than
    myproject.settings

    Usage examples
    ---------------

    Default, will prompt to overwrite current configured DB

        inv restore-db

    Create a new, timestamp named DB (bail if it already exists):

        inv restore-db --new

    Create a new DB named 'foo', prompt to overwrite if it exists

        inv restore-db --dbname=foo

    Create a new DB named 'foo' and bail if it exists

        inv restore-db --dbname=foo  --new

    Create a new timestamp named DB, e.g. `acme_stage_20200701` pulled
    from the last staging database backup

        inv restore-db --app=acme-stage --new

    Args:
        context: the invoke task context
        host: the DB host (will use Django defined host by default)
        dbname: can be provided but otherwise will be named after the
                app, environment, and date of restoration.
        username: the DB username (will use Django defined user by default)
        port: the DB port (will use Django defined port by default)
        app: specifies the Heroku application to target. Prefer app name
                over remote as it should not be common practice to have the
                production remote configured and available.
        new: boolean value for whether to make a new DB only. If no dbname
                is specified, this will create a new database named with
                the datestamp for today. This is helpful if you want/need
                to separate a development database from a debugging database
                without overwriting the state in your development database
                which may be helpful in feature dev.
        overwrite: boolean flag for skipping the overwrite confirmation

    """
    import django
    from django.conf import settings

    django.setup()

    app_name = app.replace("-", "_")

    if new:
        dbname = dbname or f"{app_name}_{date.today().strftime('%Y%m%d')}"
    else:
        dbname = dbname or settings.DATABASES["default"]["NAME"]

    username = username or settings.DATABASES["default"]["USER"]
    host = host or settings.DATABASES["default"]["HOST"]
    port = port or settings.DATABASES["default"]["PORT"] or 5432

    try:
        context.run(
            f'psql -t --host={host} --username={username} --port={port} -c "CREATE DATABASE {dbname};"'
        )
    except UnexpectedExit:
        if new and not overwrite:
            raise Exit(
                f"\nEither {dbname} already exists or there was an error connecting to the database.\n"
                "Because you specified `--new` we're going to bail out. Drop this flag to be prompted to overwrite.\n"
            )
        print(
            f"Either {dbname} already exists or there was an error connecting to the database."
        )
		overwrite_confirm = "n"
        if not overwrite:
	        overwrite_confirm = input(f"Overwrite {dbname} if it exists? (y/N) ")
        if not overwrite and overwrite_confirm.lower() != "y":
            raise Exit("Okay, bailing out")

    context.run(
        f'curl --silent "$(heroku pg:backups:url --app={app})" | '
        f"pg_restore --no-owner --no-privileges --clean --verbose --host={host} "
        f'--username={username} --port={port} --dbname="{dbname}"'
    )

The non-specific warning after trying to create the database is there because we’re relying only on the error raised when creating a database that already exists. You could query the cluster to see if the database is there, but given our expectation of how this is most likely to fail, this method is simple and effective.
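If you did want the explicit check, a sketch of it inside the same task might look like this (reusing the same host, username, port, and dbname variables as above):

# Hypothetical pre-check instead of relying on the CREATE DATABASE error
exists = context.run(
    f"psql -tAc \"SELECT 1 FROM pg_database WHERE datname = '{dbname}'\" "
    f"--host={host} --username={username} --port={port}",
    hide=True,
).stdout.strip() == "1"

if exists and not overwrite:
    ...  # prompt or bail out, as in the except branch above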

Now if you want to always overwrite your working database from production all you need to do is run inv restore-db --overwrite; if you want to always create a new date stamped database without overwriting, inv restore-db --new; and you can change the environment from which to source your backup or specify a different target database name.

AWS and security

And if you’re not using Heroku to deploy your Django project you can substitute a connection to your S3 bucket or wherever you store your database backups. One recommendation I would make for pulling from AWS is to use aws-vault to safely manage credentials, especially if you use more than one AWS account. This will store credentials securely (e.g. on macOS in the Keychain) and then inject them into the process environment along with an AWS session.

Provided your script is looking for these named environment variables (e.g. AWS_ACCESS_KEY_ID, AWS_SESSION_TOKEN, etc.), the mechanics of running the script are simple:

$ aws-vault exec some-named-profile -- ./my-restore-script.sh
$ aws-vault exec some-named-profile -- inv restore-db --new
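If your backups live in S3 rather than on Heroku, the download step might look something like this sketch using boto3, which reads the aws-vault-injected credentials straight from the environment (the bucket and prefix names here are made up):

import boto3

# boto3 picks up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and
# AWS_SESSION_TOKEN from the environment, which is what aws-vault injects
s3 = boto3.client("s3")

# Hypothetical bucket/prefix; pick the most recently modified backup
objects = s3.list_objects_v2(Bucket="acme-backups", Prefix="prod/")["Contents"]
latest = max(objects, key=lambda obj: obj["LastModified"])
s3.download_file("acme-backups", latest["Key"], "/tmp/latest.dump")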

Data cleaning

The final step in handling database restorations is cleaning production data. This is critical if the database contains any personally identifiable information (PII), confidential business information, or legally (or ethically!) protected data. It can also add peace of mind: if someone accidentally has an API or email service wired up in development, they won’t send production data or, worse, notifications to actual customers. Cleaning will involve obfuscating data and likely truncating as well. Truncating can aid with database size too, and in a larger team setting the cleaning process should be an automated step between the usable backup and what developers actually download.

You can clean your data using the ORM or a SQL query; I’ve found that using a SQL query and executing the file works just fine, is easy to understand and maintain, and naturally operates table-wide (the equivalent of update() in the Django ORM). Whichever path you take, you’ll likely want to replace passwords (this is usually pretty easy in local dev, just give everyone the same easy-to-guess password), replace names and emails (emails can be <user_id>@example.com), and any other personal information such as addresses or payment information.

Here’s a partial anonymization script pulled from a real project.

-- Anonymizes an application database dump.
--
-- Usage: after importing a production database dump, run this script against
-- the database, e.g.:
--
--      psql -d new_db_name < scripts/anonymize.sql
--
-- It will remove sensitive data *in-place*, including names, email addresses,
-- phone numbers, etc.
--
-- WARNING: these queries are DESTRUCTIVE and should only be run
-- against local copies of the database.


UPDATE newsletter_subscriber SET
    name = translate(name, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', 'xaaaaawiwiqihhasdgkkkkkjhhLLKKJJJJUUUUOOOOPPPPZZZZZZZ')
  , email = 'hello' || id || '@example.com'
;

UPDATE store_purchase SET
    first_name = translate(first_name, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', 'xaaaaawiwiqihhasdgkkkkkjhhLLKKJJJJUUUUOOOOPPPPZZZZZZZ')
  , last_name = translate(last_name, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', 'xaaaaawiwiqihhasdgkkkkkjhhLLKKJJJJUUUUOOOOPPPPZZZZZZZ')
  , email = 'order' || id || '@example.com'
  , phone = translate(phone, '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', '44128492274xaaaaawiwiqihhasdgkkkkkjhhLLKKJJJJUUUUOOOOPPPPZZZZZZZ')
  , address1 = translate(address1, '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', '44128492274xaaaaawiwiqihhasdgkkkkkjhhLLKKJJJJUUUUOOOOPPPPZZZZZZZ')
  , address2 = translate(address2, '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', '44128492274xaaaaawiwiqihhasdgkkkkkjhhLLKKJJJJUUUUOOOOPPPPZZZZZZZ')
;


UPDATE store_transaction SET
    card_number = '***********4242'
    , exp_month = 1
    , exp_year = 2022
;


UPDATE auth_user SET
    first_name = translate(first_name, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', 'xaaaaawiwiqihhasdgkkkkkjhhLLKKJJJJUUUUOOOOPPPPZZZZZZZ')
  , last_name = translate(last_name, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', 'xaaaaawiwiqihhasdgkkkkkjhhLLKKJJJJUUUUOOOOPPPPZZZZZZZ')
  , email = 'user' || id || '@example.com'
  , phone = translate(phone, '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZC', '44128492274xaaaaawiwiqihhasdgkkkkkjhhLLKKJJJJUUUUOOOOPPPPZZZZZZZ')
  , password = 'pbkdf2_sha256$20000$FQSl5AooLtKR$k+YWRUOvymb0SDjlj5PMmNr18TWtvixZsdopYxWMdUE='
    WHERE email not ILIKE '%customer%'
;
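The password hash in that last UPDATE is just a precomputed PBKDF2 hash for a shared, easy-to-guess development password. If you want to generate your own rather than reuse mine, Django will do it for you (this sketch assumes a configured settings module, e.g. run inside manage.py shell):

# Prints a hash suitable for pasting into the UPDATE above
from django.contrib.auth.hashers import make_password

print(make_password("local-dev-password"))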

Truncating data serves several purposes. It makes your local DB smaller when you don’t need everything in the production database, it can remove useless information (like sessions that may still be present), and it can even aid in data obfuscation as the number of records can in some cases convey information in and of itself. If what you want to do is truncate an entire table, the most efficient way to do this is to use TRUNCATE, e.g.

TRUNCATE auth_user CASCADE;

This one-line query will remove every row from the auth_user table and every row in any other table that references auth_user, and every row in any other table that references those rows, and so on. For the auth_user table this is likely not what you want to do, of course, but it’s effective for wiping django_session if that table has any data.

The Django ORM’s foreign key constraints do not include ON DELETE directives notwithstanding the on_delete instructions defined on your ForeignKey fields, which means that selectively deleting rows from tables with cascading effects requires either a stack of DELETE ... USING queries, one per dependent table, or a single delete() queryset call from the ORM. The latter is terribly convenient but can also be fairly slow for large numbers of objects and dependents.
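As a rough illustration of the ORM side of that trade-off (the model name and filter field are hypothetical), a single queryset delete() cascades for you in Python:

from datetime import timedelta

from django.utils import timezone

from store.models import Purchase  # hypothetical model backing store_purchase above

# Deletes matching purchases plus, per each ForeignKey's on_delete rules,
# their dependent rows; convenient, but Python-side and slow at scale
cutoff = timezone.now() - timedelta(days=365)
Purchase.objects.filter(created__lt=cutoff).delete()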

Putting it all together

As alluded to earlier, it’d be far superior to have the entire cleaning process run automatically so that developers don’t need to worry about ever having any kind of identifying customer information under their control - and responsibility. My own napkin sketch for doing this would involve populating an object store (e.g. S3) with cleaned backups after a scheduled job pulls a complete backup into a secondary database, cleans it, and then dumps the newly cleaned database into the target storage for developer use. It would be a nice little project to play with Lambda and RDS together.

Ben Lopatin
