Modern applications typically run in the cloud. As good citizens, distributed applications must fulfill many requirements to enable reliable operations and maintenance. This article summarizes the most important points for going live and keeping applications healthy over their lifetime.

Twelve-factor app

All modern cloud applications should comply with the 12-factor app principles:

Codebase: one codebase is tracked in a version control system and the same code is deployed to every environment
Dependencies are shipped with the deployable artifact
Backing Services can be detached and re-attached without code changes (e.g. database, message broker)
Config is stored in environment variables
Separate stages for Build, Release, Run
Processes: an application runs as one or more stateless processes. Persistent data must be stored in a backing service such as a database.
Port Binding: applications export their service via a port binding
Concurrency: services are horizontally scalable
Disposability: applications are disposable, start up fast and support graceful shutdown (they handle OS signals such as SIGTERM)
Dev/Prod parity, environments are as similar as possible
Logs are treated as a continuous stream of events (no log-file rotation)
Admin processes: tasks such as database migrations run as one-off processes in the same environment as the long-running application
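
The Config factor can be sketched as follows. This is a minimal illustration, not a definitive implementation: the variable names `DATABASE_URL`, `PORT` and `LOG_LEVEL` and the `Settings` class are assumptions made for the example. The idea is to read all configuration once at startup and to fail fast when a required value is missing.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Immutable application settings (field names are hypothetical)."""
    database_url: str
    port: int
    log_level: str


def load_settings(env=None):
    """Build Settings from environment variables.

    `env` defaults to os.environ; passing a plain dict makes the function
    easy to test without touching the real environment.
    """
    env = os.environ if env is None else env
    try:
        database_url = env["DATABASE_URL"]  # required: fail fast if missing
    except KeyError:
        raise RuntimeError("DATABASE_URL must be set") from None
    return Settings(
        database_url=database_url,
        port=int(env.get("PORT", "8080")),      # optional, with a default
        log_level=env.get("LOG_LEVEL", "INFO"),  # optional, with a default
    )
```

Because nothing environment-specific lives in the code, the same artifact can be promoted from Dev to Prod by changing only the environment.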

Development

More often than not, developers are busy adding new features, but reliable and robust applications require attention to cross-functional concerns too. In a DevOps culture, developers collaborate closely with the ops team and know the infrastructure as well.

build with one command, complex logic should be abstracted away inside a script
integrate auto-formatting into the CI build (spotless, black, rustfmt, gofmt)
integrate static-code analyzers and linters into the CI build (errorprone, infer, sonarqube, eslint, ruff, clippy, go-staticcheck)
integrate unit tests into the CI build
add integration tests and load tests; keep end-to-end tests to a minimum because they are costly to maintain
don’t forget the fallacies of distributed systems, add retries with backoffs for idempotent requests, circuit breakers, load-shedding
know the application’s resource limits - set them accordingly!
add health and readiness checks to the applications
use feature flags instead of spinning up whole environments (enable/disable features for a selected userbase)
consider multi-tenancy from the start (regions, users)
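
As an illustration of the retry item above, here is a minimal sketch of retries with exponential backoff and jitter. The function name `retry_with_backoff`, its parameters and the choice of retryable exceptions are assumptions made for this example. It should only wrap idempotent operations, because a request that timed out may still have been processed.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached.

    Only exceptions listed in `retry_on` trigger a retry; anything else
    propagates immediately. `sleep` is injectable to keep tests fast.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

A production setup would usually combine this with a circuit breaker and a total deadline, so that retries do not add load to a service that is already struggling.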

Deployment

use a modern CI/CD pipeline with a simple branching strategy like GitHub flow; avoid Gitflow
keep the CI pipeline simple, move complex logic into scripts

Operations

backup strategy for databases
regular firedrills (restore backup)
keep the number of environments low (preferably Dev and Prod)
for a healthy DevOps culture, developers should do operations too. They should feel the pain during an outage. Afterwards they will write more robust code 😄

Infrastructure

the whole infrastructure is defined as code (IaC), e.g. with Terraform, AWS CDK or Azure Bicep
infrastructure is treated as normal code with the same process (kept in version control, pull requests, CI/CD pipelines)
optionally, use GitOps tools (flux, ArgoCD)

Observability

Keeping a comprehensible overview of a distributed system is much harder than with a monolithic architecture. Therefore good observability is critical.

Logging

log to STDOUT/STDERR
structured logging with JSON
add request-ids to log-events
logs are ephemeral - don’t use logs as persistent data store!
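
The logging points above can be sketched with Python's standard `logging` module. This is a minimal example under assumptions: the JSON field names, the `JsonFormatter` class and the `build_logger` helper are illustrative choices, not a standard schema.

```python
import json
import logging
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id is attached by callers via `extra`; None if absent
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(event)


def build_logger(name="app", stream=None):
    """Return a logger that writes JSON events to STDOUT (or `stream`)."""
    handler = logging.StreamHandler(sys.stdout if stream is None else stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]  # replace handlers to avoid duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
# In a real service the request id would come from an incoming header.
logger.info("order created", extra={"request_id": str(uuid.uuid4())})
```

The process only writes to its output stream; collecting, shipping and retaining the events is left to the platform.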

Metrics

track at least the four golden signals (latency, traffic, errors, saturation)
RED pattern (Request Rate, Error Rate, Duration)
USE pattern (Utilization, Saturation, Errors)
define SLIs, SLOs, SLAs and monitor them
alert on reasonable signals (don’t over-alert - otherwise people will start to ignore the alerts)
up-time availability checks
dashboards for the most relevant metrics
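
To make the RED pattern concrete, here is a small sketch that derives the three numbers from the requests observed in one time window. The `Request` record, the `red_summary` function and the choice of the 95th percentile as the duration figure are assumptions for this example; in practice a metrics library and backend would compute these values.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical record for this sketch)."""
    duration_ms: float
    status: int


def red_summary(requests, window_seconds):
    """Return request rate, error ratio and p95 duration for one window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": None}
    total = len(requests)
    # Count server-side failures (HTTP 5xx) as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: ceil(0.95 * total), computed with integers.
    rank = (95 * total + 99) // 100
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": durations[rank - 1],
    }
```

An SLI such as "ratio of successful requests" is then simply `1 - error_ratio`, which can be compared against the SLO target for alerting.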

Tracing

track where distributed requests spend their time (Datadog, OpenTelemetry, Zipkin, Jaeger)

Documentation

Technical documentation is the gateway for a better understanding and helps to build a mental model of the complex, intertwined parts of big distributed systems. Good documentation is crucial for new team members and for your future self.

architecture diagrams visualize the IT landscape and give an overview of all participating applications and their relations
README.md in root directory
- project overview and purpose
- development instructions (build commands, how to setup the project locally)
- references to other helpful documentation and related git-repositories
playbook/runbook with helpful instructions (how to handle incidents, useful logging/metrics queries, dashboard links)
architecture decision records (ADRs)

Costs

The cloud is expensive. Hence it is important to have cost transparency so that costs can be attributed accurately to each organizational unit. Furthermore, current costs should be monitored, and alarms should be triggered when they exceed a defined budget.

tag infrastructure (project, department, team, contact persons, cost center)
quick dashboard for monthly costs
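
Tagging only helps if it is enforced. The following sketch checks a resource inventory for missing tags; the tag keys in `REQUIRED_TAGS`, the `missing_tags` function and the shape of the `resources` mapping are assumptions for this example and would normally come from the cloud provider's API or the IaC state.

```python
# Tag keys every resource must carry (hypothetical naming convention).
REQUIRED_TAGS = {"project", "department", "team", "contact", "cost-center"}


def missing_tags(resources):
    """Map each resource name to the sorted list of required tags it lacks.

    `resources` maps a resource name to its tag dictionary. Tags with an
    empty value count as missing. Fully tagged resources are omitted.
    """
    report = {}
    for name, tags in resources.items():
        present = {key for key, value in tags.items() if value}
        missing = sorted(REQUIRED_TAGS - present)
        if missing:
            report[name] = missing
    return report
```

Running such a check in the CI pipeline keeps untagged resources from reaching production, where they would show up as unattributed costs.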

References

GitHub Production Readiness Checklist
