How I design backend services
CQRS is often associated with microservices and event-sourcing, but it stands alone as an excellent alternative to MVC design.
First off, this is gonna be a long article. Strap in...
I'm old, I get it. For the majority of my career, MVC was the way to design web applications. But I read a book about Evolutionary System Design and went to an O'Reilly conference on system architecture in 2018 and saw Martin Fowler talk on the "Microservice Design Canvas," and the two things together completely changed my thinking on how to design systems.
I've really fused these ideas to form my own theory about backend systems design. My goal is to create systems that are:
- Cost effective to develop.
- Easy for new developers to drop into.
- Performant enough to be cost effective to scale.
- Easy to break up into microservices or other components to support horizontal scaling as required.
- Easily "composable," i.e. big things don't require new code so much as aggregates of small things.
- Easy to monitor, react to problems, and to debug.
- Easy to add nontrivial things to on a time budget without creating new technical debt, or at least keeping that debt local to the thing added.
The microservice design canvas has you design commands, queries, publications, and subscriptions, then data. I go a little further, because I try to take deployment, scaling aspects, and security into account at design time.
Also, I don't take a "microservices first" approach. In my experience, the amount of work and money it takes to instrument and monitor microservices just doesn't pay off for young companies with few or zero customers (where practically every system starts). I make it easy to refactor pieces into microservices, but there's no requirement for the system to start there. There's also no requirement that you start with streaming or event-sourcing solutions. This model works for traditional web APIs and for all the more advanced stuff you end up needing when you finally do scale to hundreds of thousands or tens of millions of daily active users.
All code examples in this post will be in Python, but truthfully this could work in anything from Java to Go to TypeScript to LISP. My personal preferred core stack is:
- Python
- Postgres
- Redis
- FastAPI
- SQLAlchemy
- CDK or Terraform/Helm
- Datadog
- AuthZed for permissions
Design Elements
This is the order you want to do the design in. MVC runs deep in development culture, and most developers will try to start any backend project in the ORM. This is a mistake, as I've pointed out before. If you do that, then you will end up adding elements that serve no purpose in the final version of the service, and you'll end up creating data structures that aren't conducive to the efficient working of your interface.
Start with the interface layer. Then hit the data layer. The data layer gives you your options for auth implementation, so then do that. All those tell you enough about your environment and infrastructure requirements that you can design those. And it's not that you won't or can't skip around a bit, but in general these things will resolve themselves fully in this order.
Typical examples of service design start with a TODO list, but I'm going to start with a simple group calendar, because there are more moving parts that illustrate more than the trivial TODO list, and you will be able to see how much this model simplifies a service that's naturally a little more complex.
Our example requirements
This will be short, but I want to set the stage for the decisions I'm making and how I'm breaking things down. I'm treating this like a small service. User management is expected to happen elsewhere. This service will be called (indirectly) by a UI and directly by other services that serve up the UI and broker scheduling information with other parts of the company's SaaS offering.
It is an MVP, so it's not expected to handle calendar slots or anything "Calendly" like. It's just a barebones store of appointment data. Since user management is separate, the appointments can be owned by users, groups, rooms, or anything else, but we don't care, because that's all handled (presumably) by the user, group, and resource service.
What do my customers care about for V1?
- Creating, editing, and deleting appointments.
- Recurrence.
- Being able to see whether a new appointment will cause a conflict in their own or other attendees' calendars, and what those conflicts are.
- When scheduling multiple people, they want to see the conflict-free openings in the calendar before having to click through weeks of people's calendars to find a time.
- Being able to view their calendar in various ways, up to a year at a time.
- Having all their calendar items together. All apps in the company should dump appointment data into this service if they're creating it, and I should be able to load Outlook and Google calendars as well.
- Average calendar load time for a single person's calendar should be 2 seconds or less on good internet. Average load time for a month of appointments across up to 50 people should be 30 seconds or less.
- Being able to set zero or more configurable reminders by email, SMS, or push.
What does my customer success team care about for V1?
- If an appointment disappears or moves in the calendar, they want to know how that happened - which app moved it and who did it - so that if a customer thinks it's a bug they can prove it one way or the other and take the appropriate action. We should also be able to restore the old appointment.
What do my product people care about for V1?
- Engagement: how many calendar views per day and across how many people, plus any appointment metadata that would work as classifiers - what types of appointments people are adding, how many recurring vs. individual, etc. Also how many appointments are added and removed per day by people (as opposed to external services like Google and Outlook).
What are the (non-universal) engineering requirements to make this system flow well?
- There are other apps in my company with calendars. I need to make sure that appointments managed by those apps stay managed by those apps if there's metadata that my service cannot account for directly.
So basically my requirements are:
- timezone-aware timespans, including recurrence rules
- links and attachments
- appointment metadata such as originator, attendees, title and description
- some way to refer to an external owner of the appointment, for appointments coming from services like Outlook and Google
- and for internal owners, a record of which app made the appointment and whether it has to be managed by that app (such as when the app stores other metadata on the record that would be corrupted if a different app edited it)
- An audit trail of some sort to serve both the CS and Product use-cases. Restoring appointments can be a manual process for CS or engineering, but only if the audit trail exists with sufficient data on it.
There are also a few affordances that make my job easier:
- On average, a person will not have more than 4-5 appointments a day, 5 days a week. That totals out to 1,300 appts per person per year that we have to save, load, or destroy. It's likely I can do this much bulk writing in a single transaction and stay within the time limits we've imposed on ourselves.
- Across all our services, there are really no more than 150,000 daily active users, and of those, we can estimate that they're only changing their calendars a couple of times a day at most. That means that the write traffic, at least at first, will be fairly low. I can likely get to our MVP without pushing the writes to an async worker, although it's likely something we're going to want as our app grows.
- Traffic will be bursty, but within limits. When new users are added or when they sign up to sync their calendar from Outlook or Google, we're going to see a spike in traffic from baseline. This can likely be handled by combining FastAPI BackgroundTasks (less clunky than a queueing system and async workers for infrequent bursty traffic) with the Kubernetes autoscaler, at least at first. A sketch follows this list.
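Here's roughly what I mean by that last point - a minimal sketch, with a hypothetical sync endpoint and task name, of deferring a bursty import to a FastAPI BackgroundTask:

```python
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def sync_external_calendar(user_id: str) -> None:
    # Hypothetical task: pull the user's Outlook/Google events from the
    # external broker service and bulk-save them via SaveAppointments.
    ...

@app.post("/calendar-sync/{user_id}", status_code=202)
def request_sync(user_id: str, background_tasks: BackgroundTasks) -> dict:
    # Respond 202 immediately; the import runs after the response is sent,
    # in-process, and the Kubernetes autoscaler absorbs the burst.
    background_tasks.add_task(sync_external_calendar, user_id)
    return {"status": "scheduled"}
```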
Commands (Mutations)
First a link to the post where I go into detail about Commands:

Just to sum it up, your commands and queries should be governed by base classes that provide the "guts" of their mechanics and serve as suitable points to enhance those mechanics. The main difference between Commands and Queries is that the Query classes should only have access to a read-only connection into the data.
If, for example, you want an audit trail of every command or query issued against the system, or you want a total ordering or the ability to switch to event sourcing, the base classes are where you do it.
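To make that concrete, here's a minimal sketch of the shape I have in mind (the names and the audit hook are mine, not a prescription):

```python
from abc import ABC, abstractmethod
from typing import Any

from pydantic import BaseModel
from sqlalchemy.orm import Session

class Command(BaseModel, ABC):
    """Base for all mutations. Subclasses declare their payload as
    Pydantic fields, so every command is a serializable event."""

    def execute(self, session: Session) -> Any:
        # Single choke point: audit logging, total ordering, or a switch
        # to event sourcing all happen here, not in subclasses.
        self._audit()
        result = self._run(session)
        session.commit()
        return result

    @abstractmethod
    def _run(self, session: Session) -> Any: ...

    def _audit(self) -> None:
        # e.g. dump self.model_dump_json() to your event store / warehouse
        ...

class Query(BaseModel, ABC):
    """Base for all reads. Only ever handed a read-only connection."""

    def execute(self, session: Session) -> Any:
        # `session` should come from a read replica / read-only engine;
        # queries never get a writable handle.
        return self._run(session)

    @abstractmethod
    def _run(self, session: Session) -> Any: ...
```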
I typically write bulk mutations first, because there are just so many cases where people want to import data and bulk mutations typically have more efficient implementation options than fanning out atomic ones.
For our calendar, here is the list of commands (a sketch of the bulk save follows the list):
- SetOpenHours(rrule) - Allows someone to set the times of day when others are allowed to schedule meetings that include them. Like work hours.
- SaveAppointments - Bulk creation and update of appointment data.
- DeleteAppointments - Bulk deletion of appointment data. Cannot delete from external services or app-management-required appointments.
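As a taste of what the bulk save might look like on top of that base class (the field names are illustrative, and Appointment is the SQLAlchemy model sketched in the data section below):

```python
import uuid
from datetime import datetime

from pydantic import BaseModel
from sqlalchemy import func
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session

class AppointmentIn(BaseModel):
    id: uuid.UUID
    title: str
    starts_at: datetime        # UTC
    ends_at: datetime          # UTC
    timezone: str              # IANA name, e.g. "America/Chicago"
    attendees: list[str]

class SaveAppointments(Command):
    appointments: list[AppointmentIn]

    def _run(self, session: Session) -> list[uuid.UUID]:
        # One upsert statement for the whole batch, rather than fanning
        # out one INSERT per appointment.
        rows = [
            {
                "id": a.id,
                "title": a.title,
                "timespan": func.tstzrange(a.starts_at, a.ends_at),
                "timezone": a.timezone,
                "attendees": a.attendees,
            }
            for a in self.appointments
        ]
        stmt = insert(Appointment).values(rows)
        stmt = stmt.on_conflict_do_update(
            index_elements=["id"],
            set_={k: stmt.excluded[k] for k in ("title", "timespan", "timezone", "attendees")},
        )
        session.execute(stmt)
        return [a.id for a in self.appointments]
```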
Queries
We want a small set of queries we can optimize our data structures for, but that will suffice to construct any kind of calendar for the UI and to aid users in creating conflict-free appointments. (A sketch of the conflict check follows the list.)
- CheckForConflicts(timespan, attendees, limit, cursor) - Return a list of people that have conflicts for a given timespan.
- FindOpenings(timespan, attendees, limit, cursor) - Return a list of timespans within the input timespan where the attendees have no conflicts.
- GetAppointments(timespan, attendees, limit, cursor) - Return a list of appointments, sorted by attendee, and then by timespan.
- SearchAppointments(search_criteria, sort_criteria, timespan, limit, cursor) - Return a list of appointments that meet a certain set of search criteria, which can be one or more of:
  - Title
  - Description
  - Attendees
  - Is External
  - Is Owned By (app)
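Here's what the conflict check might look like as a Query subclass - a sketch that leans on the Postgres indexes described in the data section below:

```python
from datetime import datetime

from sqlalchemy import func, select
from sqlalchemy.orm import Session

class CheckForConflicts(Query):
    starts_at: datetime        # UTC
    ends_at: datetime          # UTC
    attendees: list[str]
    limit: int = 50

    def _run(self, session: Session):
        # "timespan && tstzrange(...)" hits the GiST index;
        # "attendees && ARRAY[...]" hits the GIN index.
        span = func.tstzrange(self.starts_at, self.ends_at)
        stmt = (
            select(Appointment)
            .where(Appointment.timespan.op("&&")(span))
            .where(Appointment.attendees.overlap(self.attendees))
            .limit(self.limit)
        )
        # The people with conflicts fall out of the attendee lists on
        # the returned appointments.
        return session.scalars(stmt).all()
```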
Publications, Subscriptions & Schedules
Our publications (a payload sketch follows the list):
- AppointmentCreated(id, owner, attendees, timespan) - lets subscribed services know when someone's calendar appointment was created.
- AppointmentModified(id, owner, attendees, timespan) - lets subscribed services know when someone's calendar appointment was moved.
- AppointmentDeleted(id, owner, attendees, timespan) - lets subscribed services know when someone's calendar appointment was deleted.
- AppointmentReminderSentReceived(id, owner, attendees, scheduled_time, actual_time) - lets subscribed services know when a reminder was sent out and received by the given attendees.
- AppointmentReminderSentFailed(id, owner, attendees, scheduled_time, code, reason) - lets subscribed services know when a reminder was sent but failed to be received.
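Since commands and publications are both events in this design, I like to define the payloads once as plain Pydantic models. JSON has no range type, so the timespan gets flattened to start/end; the field list here is a guess at what subscribers need:

```python
from datetime import datetime

from pydantic import BaseModel

class AppointmentEvent(BaseModel):
    id: str
    owner: str
    attendees: list[str]
    starts_at: datetime        # UTC
    ends_at: datetime          # UTC

class AppointmentCreated(AppointmentEvent): ...
class AppointmentModified(AppointmentEvent): ...
class AppointmentDeleted(AppointmentEvent): ...
```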
Our subscriptions:
- ExternalCalendarsBroker - Trust me, use a 3rd-party service for this; I'd suggest either Cronofy's or Nylas's APIs. But be aware of external-system failures and hangups, and put in extra design time for them. Your users will judge discrepancies between their Google and Outlook calendars and yours harshly. You want to be able to explain those differences when they happen, and to have something to push support on at whichever service you choose when there are issues.
Our schedules:
- SendReminders - Sends reminders out to attendees. Will run once a minute to check if there are reminders to send, then send them. This may actually be broken up into a couple of scheduled tasks, depending on how many reminders you're sending out per minute and how long they take to process. There is a fair amount of subtlety in sending reminders when people are allowed to change appointments up to the last minute. You're going to want to define a "fitness function" for how often reminders can fail, and how often reminders for deleted appointments can be sent out, and use that to determine how fiddly you want to be here. (A sketch follows.)
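A sketch of the core loop, with the fetch/send helpers left entirely hypothetical - the important part is re-checking the appointment at send time, since people can move it up to the last minute:

```python
from datetime import datetime, timedelta, timezone

def send_due_reminders(session) -> None:
    # Runs once a minute via cron / a Kubernetes CronJob / whatever you
    # already operate. All the helper functions here are hypothetical.
    now = datetime.now(timezone.utc)
    window_end = now + timedelta(minutes=1)
    for reminder in fetch_unsent_reminders(session, now, window_end):
        appointment = fetch_appointment(session, reminder.appointment_id)
        if appointment is None or appointment.starts_at != reminder.expected_start:
            # Appointment was deleted or moved after the reminder was queued;
            # this is the case your fitness function should be counting.
            mark_stale(session, reminder)
            continue
        deliver(reminder)            # email / SMS / push fan-out
        mark_sent(session, reminder)
```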
Data & Infrastructure
Now this gets interesting. Because we've already designed our commands, there's a lot more to our data model than we first thought - and a lot fewer data attributes in the core model. We want as few attributes as possible. Online database migrations are pretty rock-solid these days, so getting to production without something you'll want for a future release won't be a problem.
Also, it's clear from the above that we don't want just Postgres for our data model. We're searching appointments, so we want OpenSearch, and for simplicity's sake I'm assuming we're using Redis for pub-sub along with FastStream, a FastAPI-like framework for streaming applications. You could use Kafka, SQS/SNS, or RabbitMQ instead, depending on your scale or your dev and SRE teams' proficiency.
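Wiring that up is pleasantly small. A sketch, assuming the Pydantic event models from the publications list above and a local Redis:

```python
from faststream import FastStream
from faststream.redis import RedisBroker

broker = RedisBroker("redis://localhost:6379")
app = FastStream(broker)

@broker.subscriber("appointment-events")
async def on_appointment_created(event: AppointmentCreated) -> None:
    # FastStream validates the JSON payload against the annotation.
    # e.g. hand off to the external calendar broker, or queue reminders.
    ...

# On the publish side, somewhere after a command commits:
#     await broker.publish(event, channel="appointment-events")
```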
Now that we know what people want from our service, we can define our data and infrastructure, and the points-of-scale in our code around them.
We're going to use Postgres to store our appointment information. I'm not going to go into deep detail here about the fields, since those can be derived from the Command classes and the UI requirements but as someone who has designed more than one calendar in my lifetime, I have some notes:
- Indexing Timespans - Postgres has range types (tstzrange is the one we want), and a range column can be indexed with a GiST index. This gives us an optimal way to search for appointments with overlapping timespans, or ones covering a single point in time. (See the model sketch after these notes.)
- Indexing Attendees - Likewise an array type with a GIN index will give us the ability to search for all appointments that include a given set of attendees. We may need a CTE to deal with set math, but it will still be highly optimized and relatively straightforward.
- Timezones - Store the timezone as an IANA zone name (like "America/Chicago"), not an offset, in a separate field, and store the actual timespan in UTC. If you don't store all your time info in the same timezone then you can't effectively index the timespan of an appointment. Okay, you can, but your indexing trigger starts to look complicated and you're saving nothing in terms of complexity in the timespan itself. Why a zone name and not an offset? Because if someone moves the appointment and it crosses a DST boundary when they do so, you won't know to change the offset without the UI's help, which makes the interaction between your service and others need more documentation and makes it more error-prone.
- Recurrence Info - God save you and your users: read the RRULE standard closely, and in particular read the implementation notes for the most popular JavaScript library and whatever your backend implementation language offers. Better yet, this is one of those rare places where I'd advise you to roll your own if you have time, rather than use the widely accepted standard, because the standard is SO loose, and because the different implementations out there often skirt different sets of details in it. But if you use RRULE, one big non-obvious detail you need to trust me on: store the RRULE itself in the local time and timezone that the user used to create the recurring appointment. If you don't, day-of-week calculations will be thrown off depending on how far away from UTC someone is and how close to midnight their appointment starts. It's not that you can't correct for it, but one way lies 2,400 lines of madness and bugs, and the other way lies a different but far simpler type of madness.
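Putting those notes together, the core model might look like this in SQLAlchemy (a sketch, with the columns trimmed to the ones the notes above talk about):

```python
import uuid

from sqlalchemy import Index, String, Text
from sqlalchemy.dialects.postgresql import ARRAY, TSTZRANGE, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Appointment(Base):
    __tablename__ = "appointments"

    id: Mapped[uuid.UUID] = mapped_column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    title: Mapped[str] = mapped_column(String(255))
    description: Mapped[str] = mapped_column(Text, default="")
    timespan = mapped_column(TSTZRANGE, nullable=False)   # always stored in UTC
    timezone: Mapped[str] = mapped_column(String(64))     # IANA name, e.g. "Europe/Paris"
    attendees = mapped_column(ARRAY(String), nullable=False)

    __table_args__ = (
        # Overlap (&&) and containment (@>) searches on the timespan
        Index("ix_appointments_timespan", "timespan", postgresql_using="gist"),
        # Set searches (&&, @>) on attendees
        Index("ix_appointments_attendees", "attendees", postgresql_using="gin"),
    )
```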
Now, let's talk data models. Recurrences will be stored separately from appointments, and each appointment within the recurrence will be an actual appointment record within the database. We'll add a new scheduled task, UpdateOccurrences, which will run once a month and calculate occurrences out for 2 years (an implementation note to tell Product and CS about). The same code should be used when saving an appointment so that occurrences are saved that far out on initial appointment creation. We'll want to set a field on our Recurrence model to say how far occurrences have been calculated. That way if someone modifies or deletes an occurrence from the set, we won't accidentally re-create it when we call UpdateOccurrences.
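Here's roughly what the expansion inside UpdateOccurrences might look like with dateutil, keeping the rule in local wall-clock time and only converting to UTC at the end (a sketch; the horizon default and the function name are mine):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

from dateutil.rrule import rrulestr

UTC = ZoneInfo("UTC")

def expand_occurrences(rrule_text: str, local_start: datetime, tz_name: str,
                       horizon_days: int = 730) -> list[datetime]:
    # local_start is the *naive* local wall-clock start, matching how we
    # store the rule; DST and day-of-week math then behave as users expect.
    rule = rrulestr(rrule_text, dtstart=local_start)
    window_end = local_start + timedelta(days=horizon_days)
    tz = ZoneInfo(tz_name)
    return [
        occurrence.replace(tzinfo=tz).astimezone(UTC)
        for occurrence in rule.between(local_start, window_end, inc=True)
    ]
```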
Along with the Postgres record, we're going to want to index title, description, attendees, is-external, and owning-application within OpenSearch. I won't bore you with the details of indexing these correctly, because the requirements change a lot depending on the locales you operate in and the tokenizers you choose. Also, you'll most likely need to query the service that expands attendee IDs into names, or code the search function to call it to resolve full-text searches to possible attendee IDs. The latter may be a little better for MVP, since it won't require you to set up an async task and broker to save appointments.
How about that audit trail? Well, you'll notice that, conveniently, if you're using FastAPI and a pattern like my Command class, all commands are events, and publications are also events. We can dump those events into Redshift or BigQuery and boom, our audit trail is now real. We can use the event history to cover the CS case of reconstructing changes in the event that a bug or a person screws up someone's calendar. We can use the same audit trail to figure out how many appointments were created by whom for engagement metrics. And we can use the audit trail, along with any logging we do in Datadog, to measure our service against our performance and reliability metrics. The other great thing about everything being an event is that we can adapt our commands to a streaming modality easily once we get to the point where we have to scale out: dump the command event into Kafka and consume it with FastStream.
Authentication, Authorization, Permissions & Security
Environment
We're using an external service for all our user info, and we're proxying this service through our app-facing and user APIs, so presumably authentication is handled for us. Same with the security environment. All that is probably handled upstream. The only extra requirement we really have here is authorization.
We want to allow our customers to make private appointments. That's easy; we can add it as a simple flag. But we probably also need to keep track of who can see whose calendar. I personally love AuthZed's (I swear they don't sponsor me, I just think it's a great tool) open-source SpiceDB permission service. It's an implementation of Google's Zanzibar system, which handles permissions storage and checks throughout Google, so you know it can handle your use case.
So I'm going to suggest these new permissions in authzed without going into further implementation details:
- CanViewCalendarDetails - Whether or not the caller can view a given attendee's calendar (group, person, whatever - the perm service can handle this)
- CanModifyCalendar - Whether or not the caller can modify a given attendee's calendar.
- CanViewCalendarFreeBusy - Whether or not the caller can view free/busy info for a given attendee, even if they can't view the full calendar.
In each command I will make sure the caller can modify the calendar, and in each query I will make sure the response is viewable to the requesting resource or person - and, where it isn't, whether it should be returned as empty appointments with busy labels.
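The check itself is nearly a one-liner against SpiceDB. A sketch with the official Python client, assuming a schema where the permissions above are spelled in snake_case on a calendar object:

```python
from authzed.api.v1 import (
    CheckPermissionRequest,
    CheckPermissionResponse,
    Client,
    ObjectReference,
    SubjectReference,
)
from grpcutil import bearer_token_credentials

client = Client("grpc.authzed.com:443", bearer_token_credentials("t_your_token"))

def can_view_calendar_details(caller_id: str, calendar_id: str) -> bool:
    # CanViewCalendarDetails from the list above, as a SpiceDB check.
    resp = client.CheckPermission(
        CheckPermissionRequest(
            resource=ObjectReference(object_type="calendar", object_id=calendar_id),
            permission="view_calendar_details",
            subject=SubjectReference(
                object=ObjectReference(object_type="user", object_id=caller_id)
            ),
        )
    )
    return resp.permissionship == CheckPermissionResponse.PERMISSIONSHIP_HAS_PERMISSION
```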
Last step: Measuring our performance and reliability
Maybe this belongs above in Data & Infrastructure, but I want to treat this section separately. I prefer Datadog personally, but it can be quite expensive. However you create your dashboard and wherever you log your data, you want to measure, at the very least:
- P50 and P95 time to execute a save or delete appointment command. Alert if you're above 75-90% of the threshold determined by the users, and schedule time to get the latency down if the alerts become more frequent.
- P50 and P95 times for each query. There are lots of ways to increase performance: sharding OpenSearch indexes, Postgres partitioning, tuning queries, storing the appointment data in multiple redundant tables with different indexing schemes tuned to each specific query, and caching the results of calls to other services or whole calendar responses.
- Failure rates - you want to alert on spikes in these. 404s, 401s, and 403s are generally problems downstream from you; they could indicate a broken UI or a misconfigured service. 400s could be a failure of coordination between you and your consumers, where you've released a breaking change without telling someone. A 500 is definitely on you. 502s and 503s mean your services aren't scaling to meet demand. Track spikes and hourly failure rates over time. If the failure rates are increasing, schedule maintenance time before they spill over your threshold. The key to good stewardship in engineering is proactivity: if you catch a failure before your customers do, you're doing well. (A sketch of the metric plumbing follows.)
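For the plumbing, the DogStatsD client makes the latency side nearly free - a sketch, assuming a dogstatsd agent sidecar; Datadog computes the P50/P95 series from the reported timings, and the metric names and tags here are mine:

```python
from datadog import initialize, statsd

initialize(statsd_host="localhost", statsd_port=8125)

@statsd.timed("calendar.command.save_appointments.duration", tags=["service:calendar"])
def handle_save_appointments(cmd):
    # Execute the command; the decorator reports wall-clock duration,
    # from which Datadog derives the P50/P95 series.
    ...

# Failure tracking is just tagged counters you can alert on:
#     statsd.increment("calendar.http.response", tags=["status_class:5xx"])
```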