Modern applications typically run in the cloud. As good residents, distributed applications must meet many requirements to enable reliable operations and maintenance. This article summarizes the most important points for going live and for keeping applications healthy over their lifetime.
All modern cloud applications should comply with the 12-factor app principles:
- Codebase is tracked in a version control system; the same codebase is deployed to every environment
- Dependencies are shipped with the deployable artifact
- Backing Services can be detached and re-attached without code changes (e.g. database, message broker)
- Config is stored in environment variables
- Separate stages for Build, Release, Run
- Processes: an application runs as one or more stateless processes. Persistent data must be stored in a backing service such as a database.
- Port Binding: applications export their service via a port binding
- Concurrency: services are horizontally scalable
- Disposability: applications are disposable, start up fast, and shut down gracefully (they handle OS signals such as SIGTERM)
- Dev/Prod parity, environments are as similar as possible
- Logs are treated as a continuous stream of events (no log-file rotation)
- Admin processes: tasks such as database migrations run as one-off processes in the same environment as the long-running application
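The config and backing-service factors above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the variable names `DATABASE_URL`, `PORT`, and `LOG_LEVEL` and the defaults are assumptions chosen for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Config:
    database_url: str
    port: int
    log_level: str


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Build the configuration from environment variables only.

    The variable names are hypothetical; use whatever your platform provides.
    """
    try:
        # Required: the backing service is attached purely via configuration.
        database_url = env["DATABASE_URL"]
    except KeyError:
        raise RuntimeError("DATABASE_URL must be set") from None
    return Config(
        database_url=database_url,
        port=int(env.get("PORT", "8080")),      # assumed default port
        log_level=env.get("LOG_LEVEL", "INFO"),  # assumed default level
    )
```

Because the database location comes only from the environment, pointing the same build at a different database is a configuration change rather than a code change.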
More often than not, developers are busy adding new features, but reliable and robust applications require attention to cross-functional requirements too. In a DevOps culture, developers should collaborate closely with the ops team and should know the infrastructure as well.
- build with one command; complex logic should be abstracted away inside a script
- integrate unit tests into the CI build
- add integration tests and load tests; keep end-to-end tests to a minimum because they are costly to maintain
- don’t forget the fallacies of distributed computing: add retries with backoff for idempotent requests, circuit breakers, and load shedding
- know the application’s resource limits and set them accordingly
- add health and readiness checks to the applications
- use feature flags instead of spinning up whole environments (enable/disable features for a selected user base)
- consider multi-tenancy from the start (regions, users)
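The retry advice above can be sketched as follows. This is one possible shape, assuming exponential backoff with full jitter; `TransientError`, `retry_with_backoff`, and all parameter defaults are hypothetical names and values chosen for illustration. Only wrap idempotent operations this way, since the operation may run more than once.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call an idempotent operation, retrying transient failures."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted, surface the error to the caller
            # Exponential backoff capped at max_delay; the random jitter
            # keeps many clients from retrying at the same instant.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that tests do not have to wait for real delays. A circuit breaker or load shedding would sit on top of this and is not shown here.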
- use a modern CI/CD pipeline with a simple branching strategy such as GitHub Flow; avoid Gitflow
- keep the CI pipeline simple, move complex logic into scripts
- define a backup strategy for databases
- run regular fire drills (restore a backup)
- keep the number of environments low (preferably Dev and Prod)
- for a healthy DevOps culture, developers should do operations too. They should feel the pain during an outage. Afterwards they will write more robust code 😄
- the whole infrastructure is defined as code (IaC), e.g. with Terraform, AWS CDK, or Azure Bicep
- infrastructure is treated as normal code with the same process (kept in version control, pull requests, CI/CD pipelines)
- optionally, use GitOps tools (Flux, Argo CD)
Keeping an overview of a distributed system is much harder than with a monolithic architecture. Therefore, good observability is critical.
- log to STDOUT/STDERR
- structured logging with JSON
- add request IDs to log events
- logs are ephemeral - don’t use logs as a persistent data store!
- track at least the four golden signals (latency, traffic, errors, saturation)
- RED pattern (Request Rate, Error Rate, Duration)
- USE pattern (Utilization, Saturation, Errors)
- define SLIs, SLOs, SLAs and monitor them
- alert on reasonable signals (don’t over-alert - otherwise people will start to ignore the alerts)
- up-time availability checks
- dashboards for the most relevant metrics
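The logging points above (STDOUT, JSON, request IDs) can be combined in a small sketch using Python's standard `logging` module. The field names in the JSON event and the helper `build_logger` are assumptions for illustration; real projects often use a dedicated structured-logging library instead.

```python
import json
import logging
import sys
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # request_id arrives via the `extra` argument of the log call;
            # it is None when the caller did not provide one.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(event)


def build_logger(name: str = "app", stream=sys.stdout) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to STDOUT."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger()
    request_id = str(uuid.uuid4())  # normally taken from the incoming request
    log.info("order created", extra={"request_id": request_id})
```

With one JSON object per line, the log collector can index `request_id` and follow a single request across services without parsing free-form text.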
Technical documentation is the gateway to a better understanding and helps to build a mental model of the complex, intertwined parts of big distributed systems. Good documentation is crucial for new team members and for your future self.
- architecture diagrams visualize the IT landscape and give an overview of all participating applications and their relations
- README.md in the root directory, containing:
  - project overview and purpose
  - development instructions (build commands, how to set up the project locally)
  - references to other helpful documentation and related git repositories
- playbook/runbook with helpful instructions (how to handle incidents, useful logging/metrics queries, dashboard links)
- architecture decision records (ADRs)
The cloud is expensive. Hence it is important to have cost transparency so that costs can be attributed accurately to each organizational unit. Furthermore, current costs should be monitored, and alarms should be triggered when they exceed a defined budget.
- tag infrastructure (project, department, team, contact persons, cost center)
- quick dashboard for monthly costs
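The tagging and alarm ideas above can be illustrated with a small, provider-independent sketch. The `Resource` type, the required tag names, and the budget figures are all hypothetical; in practice the data would come from the cloud provider's billing export, and most providers offer built-in budget alerts.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Assumed tagging policy for this example.
REQUIRED_TAGS = {"project", "team", "cost_center"}


@dataclass
class Resource:
    name: str
    monthly_cost: float
    tags: dict[str, str] = field(default_factory=dict)


def missing_tags(resources: list[Resource]) -> list[str]:
    """Names of resources that lack at least one required tag."""
    return [r.name for r in resources if not REQUIRED_TAGS.issubset(r.tags)]


def cost_per_team(resources: list[Resource]) -> dict[str, float]:
    """Sum monthly costs per team; untagged resources get their own bucket."""
    totals: dict[str, float] = defaultdict(float)
    for r in resources:
        totals[r.tags.get("team", "untagged")] += r.monthly_cost
    return dict(totals)


def over_budget(totals: dict[str, float], budgets: dict[str, float]) -> list[str]:
    """Teams whose cost exceeds their budget (no budget counts as 0)."""
    return sorted(t for t, cost in totals.items() if cost > budgets.get(t, 0.0))
```

Treating a missing budget as zero is a deliberate choice in this sketch: any spend by a team without an agreed budget shows up as an alarm instead of going unnoticed.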