Leveraging Holistic Synergies

If you've written enough Django - or any web application code - you've probably come across a gnarly model query or two. It might just involve a lot of joins across tables and perhaps include various aggregations, too. But whatever it was, the code was probably a little complicated, a little "big", and maybe even ran a little slowly.

At some point some clever person came along and said, "This is gnarly code, we should encapsulate it somehow so it can be reused without being rewritten and with some documentation!" and maybe this was included in a queryset method or a standalone function. This solved the problem of code reuse and encapsulation - in the application. It didn't solve the problem of reuse in the database for other queries, and it certainly didn't solve the performance issue.

Now, maybe you rightly don't care about reuse outside of your application. You don't have any DBAs poking around and building reports. Fair enough! But you probably care about the performance of this query. And for a certain class of queries, your database probably offers an elegant and often overlooked solution: views, more specifically, materialized database views.

A classy view, not a CBV

First, a view in database parlance is just a named query which exists in the table namespace of your database. Instead of writing a CREATE TABLE statement with a table schema, you use a CREATE VIEW statement with the query itself. The result allows for querying against the view and all its fields just like you would from a table, e.g. SELECT * FROM the_new_view_i_just_created;. And now everyone using the database can reuse this query with ease.

Except we've already established that this isn't a big concern of ours, and further this still leaves us with a performance challenge. Every selection from this view executes the underlying query. We've done nothing to mitigate the performance issues of actually running the query.

Fortunately, the materialized view allows us to build a named query and then cache the values for "instant" reuse.

"We are living in a [materialized] world"

A materialized view takes the view concept one step further, and actually stores the queried values. The only time the gnarly backing query is executed is when the materialized view is populated. Bonus: like a regular table (and unlike a regular view) a materialized view supports indexes on columns, which may be valuable for subsequent lookups!

You wouldn't want this data stuck there forever. Presumably the whole point of using a query is to provide some dynamism. Fortunately there is a built-in solution, the REFRESH MATERIALIZED VIEW command which reruns the backing query and repopulates the materialized view. Whenever you run this query you can know you have fresh data in your queries. However this presumes you know when to run this command, and FURTHER, that this is actually a good solution.

So let's look at some of the alternatives and consider one of the most important things for anyone building software.

Tradeoffs Tradeoffs Tradeoffs

Consider the alternatives

If you're dealing with an agglomeration of data that is expensive to produce, here are a few ways of dealing with this in a Django project:

Cache a web response
Cache a template fragment
Cache the calculated data result
Cache individual data components of the entire result
Store aggregated or rollup values on model instances (database column)
Use a full rollup model that represents only denormalized results
Push aggregated, denormalized data to another data store (e.g. search engine, columnar database)

I won't call this an authoritative list.

Each solution comes with tradeoffs. These include cold cache wait time, stale cache, managing cache invalidation, managing data rebuild, and data query-ability (I made that one up).

For example, caching a web response works well for public content. The cache rebuild strategy is simple. The tradeoffs are that you're likely to face some [minor] cold cache wait times, and it really only fits a specific subset of solutions. It works great for those! But that's it. And you can't do much else with the data.

Rollup values, or incremented values, in lieu of individual counts, will result in fast results and probably involve little to no risk of a cold cache. However there's an added complexity to ensuring the data values are consistently updated.

Two facets of tradeoff I've found are very helpful for guiding this strategy are data churn and universality. An example will probably help throw some light on these.

A motivating example solution with materialized views

A while back I investigated a bug in some code that updated an ElasticSearch index. This wasn't properly removing old data from the index when an elected official's term of office expired. Rather than dig in I wondered, "Why exactly are we even using ElasticSearch for this?" The obvious reason was for search, but it was also because the underlying data was pretty normalized. If this had instead been a single table, then PostgreSQL's text search functionality would suffice and we could simplify yet another part of the architecture.

The answer for using a materialized view in this case was based on several keys points: the source has relatively little churn; the results can be further queried; invalidation is both event and time based.

Data churn is important because if this were data about, say, user activity, and needed to be refreshed every single time a user did something it would probably just eat up database time all day updating an increasingly larger data set. And sure, there may be times when the churn spikes because of lots of data additions or modifications. The bonus sub-point here is that in those cases the data doesn't need to be fresh to the very minute.

Subsequent querying is valuable for the simplicity and performance. If this were just cached object, for example a list of dictionaries, it could still be filtered and searched. However you'd need to load it all into memory and wouldn't gain the benefits from indexing. The materialized view, however, can be indexed and then queried like any other table, which means it can be queried like any other Django model.

The case of time based invalidation is important insofar as it means you cannot rely solely on some kind of application event to update the results. The queryable elected officials should be updated when a new one is added, but also when preset elected term dates start or end. This means there is already a benefit and a need for a regular scheduled refresh.

The last thing I'll leave you with in this now much-longer-than-expected email is the important of having a recognizable and authoritative source of truth for your cached data. This is the difference between something like a materialized view and an incremented cached value (not that they're necessarily solving or capable of solving the same problems). For an incremented cached value, you presumably have a way of rebuilding that count from the underlying data. But when you do it's probably distinct logic and it's subject to failure such as events not being processed. The results from a materialized view may be stale, but you can be confident that there is an obvious path to rebuilding the results from authoritative source in a highly replicable way.

Are there times when this is not the right choice, and that the tradeoffs work the other way? Absolutely! But when you can, taking advantage of a simpler architecture with fewer points of failure is a nice win.

In next week's edition we'll look at how to actually implement database views and materialized views in a Django project, including a review of some Python packages. It's almost certainly easier than you think. And after that we'll probably compare deploying Python apps to several platform as a service offerings.

Materially yours,

Ben

Originally published 2022-04-06