Modeling data well

I talk often about how modeling data should be done later in the process. If you start a new project with a models.py or com.mycompany.models.* or models/*.rb, then you're putting the cart before the horse, and you're going to model data badly. But okay, you've accepted that, or at least accepted that it's my opinion. In fact, my steps are:

  • Model commands.
  • Model queries.
  • Model pubs, subs, and schedules.
  • Model the environment / outside connections.
  • ... Then model data.

But when it finally does become time to model data, how do you model it?

Think about expectations

You've gotten a start modeling everything else, so you've probably already spent some time thinking about this. That in fact is the point of putting data modeling this far back in the process. You need to decide on databases, whether you're adopting a streaming or other async strategy for processing mutations, whether the same tables (or databases) can be used for mutation and query, what indexes to use, and so forth. You can't do that without the context in which your data lives. When sitting down to model data, you want to know things like:

  • How many users, agents, and other autonomous processes are going to be hitting your data through the commands and queries you've laid out?
  • How long can a query take before a user notices the lag?
  • How important is transactional consistency?
  • Are there special requirements for queries like text search that suggest alternative databases to support them?
  • Are there special reporting or auditing requirements for various models?
  • Should changes on certain data be reversible?
  • Do any of your commands and queries revolve around collaborative editing?
  • Are there row-level (or document-level or whatever) permissions involved?

You have finite time to spend modeling data. Spend it where there's going to be traffic and where the requirements are tough to meet. When you have to steal time, steal it from infrequently used cases in your app, service, or microservice. Which queries are the most visible to the user? Which mutations touch the most data and are the most sensitive to consistency or timing? Focus on those and branch out.

And yes, you need to model schema, attributes, and document fragments, but all of that is at your fingertips if you've already modeled commands, queries, subs, pubs, and schedules.

💡
Behavior, both machine and human, determines schema.

There's a temptation to over-optimize for scale at the data modeling stage. Sure, you want 25 million daily active users and 100 million queries an hour, but is that where you're really starting? (I'll grant that in some situations it is, but I implore you to ask yourself whether you're the exception.) And yes, you want all your queries to return as quickly as possible, but there's a point of diminishing returns, and there are queries that are only executed 100 times a day.

Instead of optimizing for scale you don't have, optimize for evolving to scale.

Put in a data migration system first thing, always. You can hand-control it or automate it, but you have to have one. Except in extreme cases, it needs to effect migrations while the system is online. And while not every migration needs to be reversible, the migration system ought to have the concept. I like Alembic because of its flexibility, but having one is more important than having any specific one.
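To make "reversible" concrete, here's a minimal sketch of what an Alembic revision with an online-safe upgrade and a matching downgrade can look like; the table, column, and revision identifiers are all hypothetical.

```python
"""A hypothetical Alembic revision: an online-safe upgrade with a matching downgrade."""
from alembic import op
import sqlalchemy as sa

# Revision identifiers (hypothetical).
revision = "a1b2c3d4e5f6"
down_revision = "f6e5d4c3b2a1"
branch_labels = None
depends_on = None


def upgrade():
    # Adding a nullable column is an online-safe change in Postgres:
    # no table rewrite, no long-held lock.
    op.add_column("accounts", sa.Column("display_name", sa.Text(), nullable=True))


def downgrade():
    # The reverse operation, so the migration system can walk back.
    op.drop_column("accounts", "display_name")
```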

This is also a great place to use Architecture Decision Records (ADRs) to talk about what kinds of scale you hope for or expect, what strategies you'd employ as you approach that scale, and what the process of migrating up a level ought to look like. You don't have to get too detailed yet; "concepts of a plan" suffice at this stage. The point is that you've thought about the possibility that, for example, you might someday have to go from a Postgres-based queue to RabbitMQ to Kafka, and you don't do anything now that prevents you from making that transition smoothly when the time comes.

A short list of some of my personal data modeling practices

I'll note that these data modeling strategies are valid for SaaS companies and service architectures. There are other modes of data modeling out there with different concerns (in particular, there are times when my position on foreign keys is not valid). But in more than a decade of SaaS work, these have served me well.

I don't use sequential integer keys. I use uuid4 or uuid7 keys for the most part. The latter is sortable, with the sort key working out to a timestamp equivalent. There's a hard-won security lesson behind this: if your keys are obviously sequential, then technically-minded users interested in automating something involving your API will assume they can index through the list. The CPU cycles or storage you save by fitting your keys in a 32-bit word is basically never worth the tradeoff. Go ahead and fight that holy war with me, and I'll point out the dollar and change per day you're saving on an absurdly high-traffic table vs. the cost of mitigating just one easily avoidable data breach.
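As a sketch of what that looks like in practice, here's a UUID-keyed table in SQLAlchemy's 2.0-style declarative mapping; the Account model is hypothetical, and the uuid7 option assumes a third-party generator such as the `uuid6` package.

```python
import uuid

from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Account(Base):  # hypothetical model
    __tablename__ = "accounts"

    # uuid4 keys aren't guessable, so nobody can index through your API.
    # For a time-sortable key, swap in a uuid7 generator (the third-party
    # `uuid6` package provides one).
    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4
    )
    email: Mapped[str] = mapped_column(unique=True)
```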

I rarely use foreign key constraints. GitHub famously does not use foreign keys ever, anywhere. I don't entirely avoid them, but they've made me bring down a service and introduce downtime we shouldn't have needed just to run a migration. They're fine most of the time, but there are relatively common cases where altering the schema or updating records causes a constraint to lock things you don't want locked. Instead, I typically put resilience into my business logic around cascading data and handling broken relationships.
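What "resilience in the business logic" means in practice is treating a dangling reference as a normal case rather than something the database forbids. A minimal sketch, assuming an ordinary SQLAlchemy session and hypothetical models with no constraint between them:

```python
def resolve_owner(session, account_model, project):
    # `session` is an ordinary SQLAlchemy session; the models are whatever
    # your app defines, with no FK constraint between them (hypothetical).
    owner = session.get(account_model, project.owner_id)
    if owner is None:
        # Without a constraint the referenced account may already be gone.
        # Treat the dangling reference as a normal case: log it, show a
        # placeholder, or enqueue a cleanup job, whatever the product needs.
        return None
    return owner
```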

I use ARRAY and JSONB columns a lot, as well as GIN and GiST indexes. Your code to access and maintain data will never be less complex than the underlying data model. Because of this I avoid over-normalizing database models, period. If something's not going to be accessed as a top-level object, and it doesn't push the constraints of either of those types, I will store it in a column, knowing that if scale changes the calculus I can run a database migration to get where I need to be.
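Here's a sketch of that in-column storage with SQLAlchemy, with GIN indexes on both the JSONB and ARRAY columns; the Document model is hypothetical.

```python
import uuid

from sqlalchemy import Index, Text
from sqlalchemy.dialects.postgresql import ARRAY, JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Document(Base):  # hypothetical model
    __tablename__ = "documents"

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4
    )
    # Nested attributes live in the row instead of a normalized side table.
    attributes: Mapped[dict] = mapped_column(JSONB, default=dict)
    tags: Mapped[list[str]] = mapped_column(ARRAY(Text), default=list)

    __table_args__ = (
        # GIN indexes keep containment queries (@>, ?) on JSONB and ARRAY fast.
        Index("ix_documents_attributes", "attributes", postgresql_using="gin"),
        Index("ix_documents_tags", "tags", postgresql_using="gin"),
    )
```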

I never use a relational database table as a queue. This is one of those "never do anything that prevents you from scaling later" things. Having your queue be part of your database lets you do things you can't do in any other queuing system: you can access potentially all old queue records, and you can join to other tables. That's exactly the trap. In the rush of crunch time and the brain fog of Friday peer reviews, letting a junior coder get away with abusing the queue because it's "just a table" is easy to do and hard to fix. I just use Redis or SQS to start, and expand to something more robust if I get to the point that it's necessary.
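For the Redis starting point, here's a minimal sketch of a list-backed work queue with redis-py; the queue name and job shape are hypothetical.

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def enqueue(job: dict) -> None:
    # LPUSH on the producer side, BRPOP on the consumer side: FIFO semantics
    # without touching Postgres.
    r.lpush("queue:thumbnails", json.dumps(job))


def work_forever() -> None:
    while True:
        # BRPOP blocks until a job arrives and returns (key, value).
        _key, raw = r.brpop("queue:thumbnails")
        job = json.loads(raw)
        print("processing", job)  # stand-in for the real worker logic
```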

I add checkpointing and auditability relatively early. They aren't typically the first things I add, but checkpointing and event-level (e.g., mutation- or API-call-level) auditing let me replay mutations on top of a checkpoint for forensic purposes and let me fix inconsistencies.
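A sketch of the shape this usually takes: an append-only audit table of mutations plus a replay helper that applies them on top of a restored checkpoint. The model, fields, and `apply` dispatcher are hypothetical.

```python
import uuid
from datetime import datetime, timezone

from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class AuditEvent(Base):  # hypothetical model
    __tablename__ = "audit_events"

    id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), primary_key=True, default=uuid.uuid4
    )
    occurred_at: Mapped[datetime] = mapped_column(
        default=lambda: datetime.now(timezone.utc)
    )
    actor: Mapped[str]   # who issued the command or API call
    action: Mapped[str]  # e.g. "project.rename"
    payload: Mapped[dict] = mapped_column(JSONB)  # the full mutation payload


def replay(session, events, apply):
    # Re-apply audited mutations, oldest first, on top of a restored checkpoint;
    # `apply` is your own dispatcher from action names to business logic.
    for event in sorted(events, key=lambda e: e.occurred_at):
        apply(session, event.action, event.payload)
```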

I'm not concerned with the redundancy of writing the same data to two or more places. That is to say, if I'm hitting OpenSearch to serve a query, I don't have a problem writing the whole record as a document in OpenSearch and returning that without first checking Postgres to make sure it's still there. Obviously there are exceptions where consistency matters, but it's worth thinking about whether you actually have one of those situations. It's often worth creating tables or data stores that serve up data in a shape more compatible with the query at hand than the "base" structure the truth is stored in. The benefits you get from maintaining redundant data (there's a sketch after this list) are:

  • Avoiding stovepipes made up of constraints and single-use indexes that slow down your transactions in the ground truth tables.
  • The ability to populate read-only tables in an async or lazy fashion.
  • The ability to use strategy-specific data stores like OpenSearch to accelerate queries while relying on battle-tested transactional consistency from Postgres or similar.
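Here's that redundant-write sketch, with Postgres as the ground truth and OpenSearch as the query-shaped copy; the client setup, index name, and Project model are hypothetical, and the second write could just as easily happen asynchronously.

```python
from opensearchpy import OpenSearch

search = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])


def save_project(session, project) -> None:
    # Postgres is the truth; write it first.
    session.add(project)
    session.commit()

    # Then write the query-shaped copy. This could just as easily happen
    # asynchronously off a queue if a little staleness is acceptable.
    search.index(
        index="projects",
        id=str(project.id),
        body={"name": project.name, "tags": list(project.tags or [])},
    )
```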

My personal data modeling toolkit

I do my best to abide by the principle "Choose Boring Technology".

For plain old data, I use Postgres with SQLAlchemy. They both give me tons of scaling options; once you move to RDS or Aurora you can scale up pretty much as far as you'd ever need to. And while SQLAlchemy contains an ORM, it is not an ORM. If I want composability or absolutely crazy queries, I get that without having to drop to SQL strings or templates.
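Here's a sketch of the composability I mean, reusing the hypothetical Document model from earlier: the query is built up conditionally as Python objects rather than string concatenation.

```python
from sqlalchemy import select


def document_query(tag=None, status=None, limit=50):
    # Build the statement up conditionally; nothing becomes SQL text until
    # the dialect compiles it at execution time.
    stmt = select(Document)
    if tag is not None:
        stmt = stmt.where(Document.tags.contains([tag]))  # ARRAY containment
    if status is not None:
        stmt = stmt.where(Document.attributes["status"].astext == status)  # JSONB ->>
    return stmt.limit(limit)


# e.g. session.scalars(document_query(tag="alpha", status="published")).all()
```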

For document-structured data, I stay in SQL, but I use a JSONB column and various indexing strategies instead of over-normalized relational hierarchies. I typically validate document-structured data with a JSON Schema, or with something like Pydantic that produces one.
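A sketch of that validation step with Pydantic v2, where the model both validates the payload and can emit the corresponding JSON Schema; the document shape is hypothetical.

```python
from pydantic import BaseModel, Field


class Address(BaseModel):  # hypothetical document shape
    street: str
    city: str
    postal_code: str = Field(pattern=r"^\d{5}$")


class CustomerProfile(BaseModel):
    name: str
    addresses: list[Address] = []


def validated_payload(raw: dict) -> dict:
    # Raises pydantic.ValidationError on a malformed document; model_dump()
    # returns a plain dict ready for a JSONB column.
    return CustomerProfile.model_validate(raw).model_dump()


# CustomerProfile.model_json_schema() emits the corresponding JSON Schema.
```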

For providing full-text indexed search, I use OpenSearch. I should say that I have most of a PhD in text search, and I still choose OpenSearch 99.9% of the time for this. It simply follows so many best practices, and you can do so much to change how indexes are built, that the number of cases where OpenSearch or Elasticsearch isn't the right tool is vanishingly small. I can safely say that unless you already have a ton of tooling around Postgres full-text search, you'll create a better user experience faster and with fewer resources with OS/ES.
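For flavor, a sketch of a basic full-text query with the opensearch-py client; the index name, fields, and boosts are hypothetical.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

response = client.search(
    index="articles",
    body={
        "query": {
            "multi_match": {
                "query": "boring technology",
                "fields": ["title^2", "body"],  # boost matches in the title
            }
        },
        "highlight": {"fields": {"body": {}}},
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```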

For online migrations, I use Alembic. It will autogenerate changes off your SQLAlchemy ORM models, but you can also tell it to create a bare migration, in which you can do anything you can do in Python: moving the contents of a Redis pub/sub to a new RabbitMQ instance, say, or loading all your profiles into AWS Cognito. And then you can keep using the autogenerated migrations after that.
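A sketch of such a bare, hand-written revision doing a plain data backfill; the table, columns, and slug logic are hypothetical, and the same pattern works for moving data between entirely different systems.

```python
"""A hypothetical hand-written (non-autogenerated) Alembic revision."""
from alembic import op
import sqlalchemy as sa

revision = "0c1d2e3f4a5b"
down_revision = "a1b2c3d4e5f6"
branch_labels = None
depends_on = None


def upgrade():
    # A bare migration is just Python: use the connection, call external
    # services, or move data between systems.
    conn = op.get_bind()
    rows = conn.execute(sa.text("SELECT id, name FROM projects WHERE slug IS NULL"))
    for row in rows:
        slug = row.name.lower().replace(" ", "-")
        conn.execute(
            sa.text("UPDATE projects SET slug = :slug WHERE id = :id"),
            {"slug": slug, "id": row.id},
        )


def downgrade():
    # Reversal here just clears the backfilled values.
    op.execute("UPDATE projects SET slug = NULL")
```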

For regular, gridded data-science, ML, or AI training data, I use Parquet. Yes, there are other formats, but Parquet is the most compatible by far. I can use DuckDB, Polars, Pandas, or R. I can move it and all the resultant processes into Snowflake or Databricks when I get to that point of scale. I can hand it to people who work in other languages or systems, and they can pretty universally do something with it. It partitions well. It's ridiculously fast.
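A quick sketch of that interoperability: write a Parquet file with Polars, query it in place with DuckDB, and read it back with Pandas. The file path and columns are hypothetical.

```python
import duckdb
import pandas as pd
import polars as pl

# Write a Parquet file with Polars.
pl.DataFrame(
    {"user_id": [1, 2, 3], "score": [0.91, 0.42, 0.77]}
).write_parquet("scores.parquet")

# Query it in place with DuckDB, no load step required.
top = duckdb.sql("SELECT * FROM 'scores.parquet' WHERE score > 0.5").df()

# Or read it straight into Pandas (pyarrow or fastparquet under the hood).
df = pd.read_parquet("scores.parquet")
```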

For queueing, caching, cross-process data storage, and other set-membership or hash-lookup cases, I use Redis. Again, it's the most flexible and broadly compatible thing out there. I almost suggested RabbitMQ for queueing, but Redis does an admirable job until you get to the scale where you want a streaming system like Kafka.
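And a sketch of the non-queue uses: a TTL cache entry and a set-membership check with redis-py; the key names are hypothetical.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Cache a rendered fragment for five minutes.
r.set("cache:dashboard:42", "<rendered html>", ex=300)

# Track which users have seen the new-feature banner.
r.sadd("seen:banner:v2", "user-123")
if r.sismember("seen:banner:v2", "user-123"):
    print("already shown")
```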

For permissions, I use AuthZed's open-source SpiceDB. This is the most "non-boring technology" choice on this list, but after using it in a couple of projects I can't imagine going back. It's an open implementation of Google Zanzibar, which controls access to files and folders in Google Drive, among other things.

  • It creates clear and complete permission models.
  • It can handle ridiculous query frequencies before you need to scale up.
  • Migrating away from it, or up to a cloud-based AuthZed solution in the event that you need to, is pretty straightforward.
  • Zanzibar is as battle-tested as it gets in the commercial space (classified government systems are a separate class).

Wrapping up

While the examples in this post are sketches rather than recipes, I still think there's a lot of general advice worth giving around data modeling, and I'd love to answer specific questions about my advice or about what technology to use for a certain use case (use the contact form on my website).

TL;DR, though, I leave you with three things:

  1. Your code to access and maintain data will never be less complex than the underlying data model.
  2. Behavior (both machine and human) begets schema.
  3. Optimize for the ability to evolve to scale, rather than to scale you don't have.