Modern applications typically run in the cloud. As good citizens, distributed applications must fulfill many requirements to enable reliable operations and maintenance. This article summarizes the most important points for going live and for keeping applications healthy over their lifetime.

Twelve-factor app

All modern cloud applications should comply with the 12-factor app principles:

  1. Codebase: one codebase is tracked in a version control system, and the same code is deployed to every environment
  2. Dependencies: dependencies are explicitly declared and shipped with the deployable artifact
  3. Backing services: databases, message brokers and similar services can be detached and re-attached without code changes
  4. Config: configuration is stored in environment variables
  5. Build, release, run: build, release and run are strictly separate stages
  6. Processes: an application runs as one or more stateless processes; persistent data must be stored in a backing service such as a database
  7. Port binding: applications export their service via a port binding
  8. Concurrency: services are horizontally scalable
  9. Disposability: applications are disposable, start up fast and support graceful shutdown (they handle OS signals)
  10. Dev/prod parity: environments are as similar as possible
  11. Logs: logs are treated as a continuous stream of events (no log-file rotation)
  12. Admin processes: one-off admin tasks, e.g. database migrations, run in the same environment as the long-running application
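
As an illustration of factor 4, the following minimal sketch reads all configuration from environment variables and fails fast when a required value is missing. The variable names (DATABASE_URL, PORT, LOG_LEVEL), the defaults and the Settings class are assumptions chosen for this example, not part of the 12-factor text.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; the field names are illustrative only."""
    database_url: str
    port: int
    log_level: str


def load_settings(env=None):
    """Build Settings from environment variables and fail fast if one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ("DATABASE_URL",) if name not in env]
    if missing:
        raise RuntimeError(f"missing required environment variables: {missing}")
    return Settings(
        database_url=env["DATABASE_URL"],
        port=int(env.get("PORT", "8080")),       # assumed default port
        log_level=env.get("LOG_LEVEL", "INFO"),  # assumed default log level
    )
```

Because the same artifact reads its settings at startup, moving it from Dev to Prod only requires a different set of environment variables, not a rebuild.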

Development

More often than not, developers are busy adding new features, but for reliable and robust applications they must also pay attention to cross-functional requirements. In a DevOps culture, developers should collaborate closely with the ops team and should know the infrastructure as well.

  • build with one command; complex logic should be abstracted away inside a script

  • integrate auto-formatting into the CI build (spotless, black, rustfmt, gofmt)

  • integrate static-code analyzers and linters into the CI build (errorprone, infer, sonarqube, eslint, ruff, clippy, go-staticcheck)

  • integrate unit tests into the CI build

  • add integration tests and load tests; add only a minimal set of end-to-end tests because they are costly to maintain

  • don’t forget the fallacies of distributed computing: add retries with backoff for idempotent requests, circuit breakers and load shedding

  • know the application’s resource limits - set them accordingly!

  • add health and readiness checks to the applications

  • use feature flags instead of spinning up whole environments (enable/disable features for a selected userbase)

  • consider multi-tenancy from the start (regions, users)
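
The retry advice in the list above could be sketched as follows. This is a minimal example of exponential backoff with full jitter, not a complete resilience layer: the TransientError class, the parameter names and the default values are assumptions made for illustration, and the wrapped operation must be idempotent.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, HTTP 503, ...)."""


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Call an idempotent operation and retry it on transient failures.

    After failed attempt n (counting from 0) the wait is drawn uniformly from
    [0, min(max_delay, base_delay * 2**n)], so that many clients do not retry
    in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted, surface the error to the caller
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

The sleep function is injectable so that tests do not have to wait. Circuit breakers and load shedding address a different problem (protecting a service that is already overloaded) and are not covered by this sketch.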

Deployment

  • use a modern CI/CD pipeline with a simple branching strategy like GitHub flow; avoid Gitflow
  • keep the CI pipeline simple, move complex logic into scripts

Operations

  • backup strategy for databases
  • regular fire drills (restore a backup)
  • keep the number of environments low (preferably Dev and Prod)
  • for a healthy DevOps culture, developers should do operations too. They should feel the pain during an outage. Afterwards they will write more robust code 😄

Infrastructure

  • the whole infrastructure is defined as code (Infrastructure as Code, IaC) with tools such as Terraform, AWS CDK or Azure Bicep
  • infrastructure is treated as normal code with the same process (kept in version control, pull requests, CI/CD pipelines)
  • optionally, use GitOps tools (Flux, Argo CD)

Observability

Keeping a comprehensible overview of a distributed system is much more complex than with a monolithic architecture. Therefore, good observability is critical.

Logging

  • log to STDOUT/STDERR
  • structured logging with JSON
  • add request-ids to log-events
  • logs are ephemeral - don’t use logs as persistent data store!
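
A minimal sketch of these logging points, using only the Python standard library: every event is written to STDOUT as one JSON object per line and carries a request ID. The field names and the use of a context variable for the request ID are assumptions of this example; dedicated structured-logging libraries offer the same with more features.

```python
import json
import logging
import sys
import uuid
from contextvars import ContextVar

# Holds the ID of the request currently being handled ("-" outside a request).
request_id = ContextVar("request_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": request_id.get(),
        }
        if record.exc_info:
            event["exception"] = self.formatException(record.exc_info)
        return json.dumps(event)


def configure_logging(stream=sys.stdout, level=logging.INFO):
    """Send all log events to the given stream (STDOUT by default) as JSON."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)


if __name__ == "__main__":
    configure_logging()
    request_id.set(str(uuid.uuid4()))  # normally taken from an incoming request header
    logging.getLogger("orders").info("order created")
```

The application itself does no file handling or rotation; collecting and shipping the stream is left to the platform.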

Metrics

  • track at least the four golden signals (latency, traffic, errors, saturation)
  • RED pattern (Request Rate, Error Rate, Duration)
  • USE pattern (Utilization, Saturation, Errors)
  • define SLIs, SLOs, SLAs and monitor them
  • alert on reasonable signals (don’t over-alert - otherwise people will start to ignore the alerts)
  • up-time availability checks
  • dashboards for the most relevant metrics
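
To make the RED pattern concrete, the sketch below computes the three numbers for one time window from a list of request records. The Request class, the dictionary keys and the nearest-rank percentile are assumptions of this example; in practice a metrics library and a monitoring backend would collect and aggregate these values.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_seconds: float
    failed: bool


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending, non-empty list (0 < p <= 100)."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Compute Rate, Errors and Duration for the requests seen in one time window."""
    if not requests:
        return {"rate_per_second": 0.0, "error_rate": 0.0,
                "p50_seconds": None, "p99_seconds": None}
    durations = sorted(r.duration_seconds for r in requests)
    errors = sum(1 for r in requests if r.failed)
    return {
        "rate_per_second": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_seconds": percentile(durations, 50),
        "p99_seconds": percentile(durations, 99),
    }
```

Duration is reported as percentiles rather than as an average because a few very slow requests are easy to miss in a mean value.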

Tracing

Documentation

Technical documentation is the gateway to a better understanding of a system and helps to build a mental model of the complex, intertwined parts of big distributed systems. Good documentation is crucial for new team members and for your future self.

  • architecture diagrams visualize the IT landscape and give an overview of all participating applications and their relations
  • README.md in root directory
    • project overview and purpose
    • development instructions (build commands, how to set up the project locally)
    • references to other helpful documentation and related git-repositories
  • playbook/runbook with helpful instructions (how to handle incidents, useful logging/metrics queries, dashboard links)
  • architecture decision records (ADRs)

Costs

The cloud is expensive. Hence, it is important to have cost transparency in order to bill each organizational unit accurately. Furthermore, current costs should be monitored, and alarms should be triggered if they exceed a defined threshold.

  • tag infrastructure (project, department, team, contact persons, cost center)
  • quick dashboard for monthly costs
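
Why tagging matters can be shown with a small sketch: once every resource carries tags, costs can be grouped per team or cost center and compared against a budget. The record layout (monthly_cost, tags), the UNTAGGED bucket and the budget values are hypothetical; real numbers would come from the billing export of the cloud provider.

```python
from collections import defaultdict


def monthly_cost_by_tag(resources, tag):
    """Sum monthly cost per value of one tag; untagged resources are reported separately."""
    totals = defaultdict(float)
    for resource in resources:
        key = resource.get("tags", {}).get(tag, "UNTAGGED")
        totals[key] += resource["monthly_cost"]
    return dict(totals)


def over_budget(totals, budgets):
    """Return the groups whose cost exceeds their budget (groups without a budget are skipped)."""
    return sorted(group for group, cost in totals.items()
                  if group in budgets and cost > budgets[group])


if __name__ == "__main__":
    # Hypothetical inventory; names and amounts are made up for illustration.
    resources = [
        {"name": "db", "monthly_cost": 300.0, "tags": {"team": "payments"}},
        {"name": "api", "monthly_cost": 250.0, "tags": {"team": "payments"}},
        {"name": "web", "monthly_cost": 100.0, "tags": {"team": "shop"}},
        {"name": "orphan", "monthly_cost": 40.0, "tags": {}},
    ]
    totals = monthly_cost_by_tag(resources, "team")
    print(totals)
    print(over_budget(totals, {"payments": 500.0, "shop": 200.0}))
```

A non-empty UNTAGGED bucket is itself a useful signal, since those costs cannot be attributed to anyone.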

References