Roadmap
Site Reliability Engineer (SRE)
The engineer who applies software engineering principles to operations problems. Builds the automation, observability, and reliability systems that keep production services running while letting development teams ship fast.
OPTIMISTIC 18–24 months · REALISTIC 2–3 years
Stage 00
Computer & IT Fundamentals
SREs debug production at 2 AM. Understanding what hardware, operating systems, and software are actually doing under the hood enables faster diagnosis.
Computer Hardware
- CPU — cores, threads, scheduling, context switching; CPU saturation diagnosis
- RAM — virtual memory, paging, OOM killer; memory pressure investigation
- Storage — IOPS, throughput, latency; disk I/O bottleneck analysis
- NIC — bandwidth, packet loss, interrupt handling; network performance baseline
- Physical vs virtual — cloud VM resource limits, noisy neighbor effects
Number Systems
- Binary, hex — memory addresses, network masks, log analysis
How Operating Systems Work
- Kernel vs user space — system call overhead, kernel panics
- Processes and threads — scheduling, priorities, process states (R, S, D, Z)
- Memory management — virtual memory, paging, swap, memory-mapped files
- File descriptors — ulimits; "too many open files" errors are common in production
- System calls — strace for debugging unknown application behavior
- Boot process — systemd, init, startup dependencies
Software Execution
- How programs compile and execute — JVM, Python GIL, Go goroutines; language-specific performance characteristics
- Dynamic libraries — shared library versions, LD_LIBRARY_PATH issues
- Environment variables — application configuration via environment
Resources
- CS50 (Harvard, free)
- "Systems Performance" by Brendan Gregg (book reference)
- Professor Messer CompTIA A+ (free YouTube)
Stage 01
Linux — Deep Systems Administration
SREs are expected to debug production Linux systems under pressure. Surface knowledge is insufficient.
Linux Fundamentals
- Full filesystem hierarchy with operational significance:
  - /proc — kernel and process information, pseudo-filesystem
  - /sys — kernel parameters via sysfs, hardware configuration
  - /var/log — system and application logs
  - /etc — system configuration, service configs
  - /tmp and /dev/shm — temporary storage, memory-based
- Terminal proficiency — full command fluency including advanced usage
Performance Investigation Commands
- CPU:
  - top / htop — real-time process CPU and memory usage
  - vmstat — virtual memory, CPU activity, I/O stats
  - mpstat — per-CPU statistics
  - uptime — load average (1, 5, 15 minute); what load average means
  - pidstat — per-process CPU and I/O stats
- Memory:
  - free -h — total, used, free, cached, available
  - vmstat -s — memory statistics
  - /proc/meminfo — detailed memory breakdown
  - smem — process memory reporting with PSS/USS
  - oom-kill events in dmesg — understanding the OOM killer
- Disk:
  - iostat -x — device utilization, await, service time
  - iotop — per-process I/O monitoring
  - df -h — filesystem space usage
  - du -sh — directory size
  - lsof — open files per process
- Network:
  - ss -tulpn — listening ports and processes
  - netstat -s — network statistics summary
  - ip -s link — interface statistics (bytes, errors, drops)
  - sar -n DEV — network activity over time
  - tcpdump — packet capture for live debugging
  - iftop / nethogs — per-connection bandwidth usage
- System-wide:
  - dmesg — kernel messages, hardware errors, OOM kills
  - journalctl — systemd journal querying
  - sar — System Activity Reporter; historical performance data
  - strace — tracing system calls of a running process
  - lsof — listing open files and network connections
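The investigation commands above also lend themselves to automation. As a minimal illustration (not a replacement for the tools), here is a Python sketch that parses meminfo-format output the way a `free -h` or `/proc/meminfo` check does; the sample text is inlined so the sketch runs anywhere, though on a real Linux host you would read `/proc/meminfo` directly:

```python
# Parse /proc/meminfo-style output and report memory pressure.
# Sample text is inlined so this runs anywhere; on a real Linux
# host, replace SAMPLE_MEMINFO with open("/proc/meminfo").read().

SAMPLE_MEMINFO = """\
MemTotal:       16303428 kB
MemFree:         1024000 kB
MemAvailable:    2150400 kB
Buffers:          302100 kB
Cached:          4096000 kB
"""

def parse_meminfo(text: str) -> dict[str, int]:
    """Return {field: kibibytes} from meminfo-format text."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        info[key.strip()] = int(rest.split()[0])
    return info

def available_pct(info: dict[str, int]) -> float:
    # MemAvailable is the kernel's estimate of memory usable without
    # swapping -- a better pressure signal than MemFree.
    return 100.0 * info["MemAvailable"] / info["MemTotal"]

info = parse_meminfo(SAMPLE_MEMINFO)
print(f"available: {available_pct(info):.1f}%")  # alert if this drops low
```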
Linux Networking
- ip addr, ip route, ip link — interface management
- iptables / nftables — firewall rules, NAT, port forwarding
- ethtool — network interface settings
- ss — socket statistics, faster replacement for netstat
- /proc/net/ — raw networking statistics
Systemd
- systemctl — managing services (start, stop, enable, disable, status)
- journalctl — reading logs (--since, --until, -u unit, -f follow, -p priority)
- Unit files — Service, Timer, Socket unit types
- Dependencies — After=, Requires=, Wants=; service startup ordering
- Resource controls — CPUQuota=, MemoryMax= (MemoryLimit= on cgroup v1) in unit files
Shell Scripting for SRE
- Bash scripts for operational tasks — health checks, log rotation, alert responses
- Error handling — set -e, set -u, trap for cleanup
- Parallel execution — & and wait, xargs -P
- Cron and systemd timers — scheduling regular tasks
- Python for SRE automation — more complex scripts, API calls, data processing
Resources
- TryHackMe Linux Fundamentals (free)
- "The Linux Command Line" (free online)
- Brendan Gregg's Linux performance tools poster (free)
- OverTheWire Bandit (free)
Stage 02
Networking Fundamentals
Network issues cause a significant percentage of production incidents. SREs must diagnose connectivity, DNS, and load balancer issues under pressure.
Core Protocols
- OSI model — all 7 layers, failure modes at each
- TCP/IP — handshake, teardown, states, performance tuning (TCP_NODELAY, keepalive)
- UDP — DNS, NTP, QUIC (HTTP/3)
- ICMP — ping, traceroute, unreachable messages
- IP addressing — subnetting, CIDR, IPv6 dual-stack
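Subnetting and CIDR arithmetic can be checked quickly with Python's standard library, which is handy during triage. A small example using the stdlib `ipaddress` module (the addresses are made up for illustration):

```python
import ipaddress

# CIDR arithmetic: how many addresses does a /26 contain, and does a
# given address fall inside it?
net = ipaddress.ip_network("10.0.16.0/26")

print(net.netmask)            # 255.255.255.192
print(net.num_addresses)      # 64 addresses in a /26
# Usable hosts = total minus the network and broadcast addresses
print(net.num_addresses - 2)  # 62
print(ipaddress.ip_address("10.0.16.45") in net)  # True
print(ipaddress.ip_address("10.0.17.1") in net)   # False
```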
Protocols Critical for SRE
- DNS — TTL, propagation, NXDOMAIN, SERVFAIL; DNS failure is a top incident cause
  - dig, nslookup, host — DNS debugging tools
  - DNS caching — /etc/resolv.conf, ndots, search domains
  - DNS in Kubernetes — CoreDNS, service discovery via DNS
- HTTP/HTTPS/TLS — request lifecycle, keep-alive, connection pooling, certificate expiry
  - curl with verbose flags — -v, --resolve, --connect-to; bypassing DNS for testing
  - HTTP/2 vs HTTP/1.1 — multiplexing, header compression
  - TLS handshake — certificate chain validation, SNI
  - Let's Encrypt — automated certificate renewal, cert-manager in Kubernetes
- Load balancing — L4 (TCP) vs L7 (HTTP), session persistence, health checks
  - Round robin, least connections, IP hash — algorithms and when each is appropriate
  - HAProxy, nginx, envoy — common load balancers
- gRPC — protocol buffers, HTTP/2 transport, streaming, service mesh context
- WebSocket — long-lived connections, load balancer timeout configuration
Cloud Networking for SRE
- VPC, subnets, security groups — access troubleshooting
- Load balancers — ALB/NLB (AWS), Azure Load Balancer, GCP Load Balancing
- CDN — CloudFront, Fastly, Cloudflare; caching behavior, cache invalidation
- DNS in cloud — Route 53 health checks, Azure Traffic Manager, GCP Cloud DNS
Kubernetes Networking
- Pod networking model — every pod gets an IP
- Services — ClusterIP, NodePort, LoadBalancer, ExternalName; service discovery
- Ingress and Ingress controllers — nginx, Traefik, AWS ALB Ingress Controller
- Network Policies — CNI-enforced pod communication controls
- CoreDNS — Kubernetes internal DNS, configmap tuning
- Service mesh — Istio, Linkerd; mTLS, traffic management, observability
Network Debugging Toolkit
- ping, traceroute/tracepath — basic reachability
- curl — HTTP testing with headers, timing, TLS inspection
- tcpdump — packet capture on production (with care)
- Wireshark — offline PCAP analysis
- nmap — port connectivity testing
- netcat (nc) — raw TCP/UDP testing
- telnet — port connectivity check (legacy but still useful)
- openssl s_client — TLS certificate and connection debugging
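The most basic of these checks, port reachability, is easy to script. A minimal sketch of the probe `nc -z host port` performs, using the stdlib `socket` module; the local listener exists only to make the demo self-contained:

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within
    the timeout -- the same probe `nc -z host port` performs."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure
        return False

# Demo against a throwaway local listener so the sketch is self-contained.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]
reachable = tcp_check("127.0.0.1", port)
srv.close()
print(reachable)  # True: something was listening
```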
Resources
- Professor Messer Network+ (free YouTube)
- Kubernetes networking documentation (free)
- "High Performance Browser Networking" (hpbn.co, free online)
Stage 03
Security Fundamentals
SREs work with security teams on reliability-impacting vulnerabilities and incidents. Understanding security principles enables better architecture and faster incident response.
Security Fundamentals
- CIA Triad — availability is the SRE's primary concern, but all three matter
- Authentication and authorization — IAM, service accounts, mutual TLS
- Cryptography — TLS, certificate management, secrets rotation
- Shared responsibility model — cloud provider vs customer reliability boundaries
- Common attack patterns — DDoS (availability impact), ransomware (availability impact), credential theft (integrity impact)
- Compliance context — SLO targets often embedded in compliance frameworks (HIPAA uptime, PCI availability)
Resources
- Professor Messer Security+ SY0-701 (free YouTube)
Stage 04
Cloud Platforms
SREs manage production infrastructure on cloud platforms. Deep knowledge of at least one provider is required.
AWS — Core Services
- Compute — EC2 (instance types, placement groups, dedicated vs shared), Lambda, ECS, EKS
- Auto Scaling — launch templates, scaling policies (target tracking, step, scheduled)
- Storage — S3, EBS, EFS, Instance Store; performance characteristics of each
- Networking — VPC, Route 53, CloudFront, ALB/NLB/GWLB, API Gateway
- Database — RDS (Multi-AZ, read replicas, failover), DynamoDB (DAX, streams), ElastiCache
- Observability — CloudWatch (metrics, logs, alarms), X-Ray (distributed tracing), CloudTrail
- Reliability features:
  - Multi-AZ deployments — high availability within a region
  - Multi-region architectures — disaster recovery
  - AWS Backup — automated backup policies
  - Route 53 health checks and failover routing
Azure — Core Services
- Compute — Virtual Machines, Azure Functions, AKS, Container Apps, App Service
- Storage — Blob, Files, Queues, Azure Disk
- Networking — VNet, Azure Load Balancer, Application Gateway, Azure Front Door
- Database — Azure SQL (Always On AG, Business Critical), Cosmos DB, Azure Cache for Redis
- Observability — Azure Monitor, Application Insights, Log Analytics workspace
GCP — Awareness Level
- Compute Engine, GKE, Cloud Run, Cloud Functions
- Cloud Storage, Cloud SQL, Spanner, Bigtable
- Cloud Monitoring, Cloud Logging, Cloud Trace
Infrastructure as Code
- Terraform — defining, versioning, and managing cloud infrastructure:
  - HCL syntax, providers, resources, modules, state
  - terraform plan — reviewing changes before apply
  - Remote state — S3 + DynamoDB, Azure Storage
  - Workspace per environment
- Ansible — configuration management, agentless SSH-based:
  - Playbooks, inventory, roles, handlers
  - Idempotency — running multiple times produces the same result
  - Common SRE uses — OS patching at scale, application configuration
Resources
- AWS Solutions Architect Associate course (A Cloud Guru, Udemy)
- HashiCorp Learn Terraform (free)
- Ansible documentation (free)
Stage 05
Containers & Kubernetes
Kubernetes is essentially a given requirement in SRE postings in 2026. SREs manage cluster reliability and help engineering teams deploy safely.
Docker
- Container lifecycle — build, run, stop, remove
- Dockerfile — FROM, RUN, COPY, CMD, ENTRYPOINT, ENV, EXPOSE
- Image layers and caching — optimizing build performance
- Docker Compose — multi-container local environments
- Resource limits — --memory, --cpus; preventing container resource exhaustion
- Logging drivers — json-file, syslog, fluentd; sending container logs to aggregator
- Health checks — HEALTHCHECK instruction, container health status
Kubernetes Architecture
- Control plane — API server, etcd, scheduler, controller manager, cloud controller
- Worker nodes — kubelet, kube-proxy, container runtime
- etcd — distributed key-value store, the source of truth for all cluster state
- API server — the single interface for all cluster operations
Core Kubernetes Objects
- Pods — smallest deployable unit, co-located containers
- Deployments — desired state for stateless applications, rolling updates, rollbacks
- StatefulSets — ordered, stable identity pods for stateful applications
- DaemonSets — running one pod per node (log collectors, monitoring agents)
- Jobs and CronJobs — batch and scheduled workloads
- Services — stable network endpoints for pods
- ConfigMaps and Secrets — configuration and credential injection
- Ingress — HTTP(S) routing from external traffic
- Namespaces — logical cluster partitioning
- ResourceQuotas and LimitRanges — namespace resource limits
Kubernetes Operations
- kubectl — the primary CLI tool:
  - kubectl get, describe, logs, exec, port-forward, apply, delete
  - -o yaml — exporting resource definitions
  - --dry-run=client — validating changes without applying
  - kubectl top — resource usage (requires metrics-server)
  - kubectl rollout status/history/undo — deployment management
- Helm — Kubernetes package manager:
  - Charts — pre-packaged applications
  - Values — customizing deployments
  - helm install, upgrade, rollback, uninstall
  - helm repo add, update, search
- Horizontal Pod Autoscaler (HPA) — scaling based on CPU/memory/custom metrics
- Vertical Pod Autoscaler (VPA) — right-sizing resource requests/limits
- Cluster Autoscaler — adding/removing nodes based on pending pods
Kubernetes Reliability Patterns
- Pod Disruption Budgets (PDB) — maintaining minimum availability during maintenance
- Liveness, Readiness, Startup probes — health checking and traffic control
- Resource requests and limits — preventing noisy neighbor issues
- Affinity and anti-affinity rules — spreading pods across nodes/zones
- Topology spread constraints — even distribution across failure domains
- Priority classes — ensuring critical workloads get resources during pressure
- Drain and cordon — safely removing nodes from service
Managed Kubernetes
- EKS (AWS) — managed control plane, node groups, Fargate, add-ons
- AKS (Azure) — node pools, Azure CNI, AAD integration
- GKE (GCP) — Autopilot, standard, Anthos
Resources
- Kubernetes documentation (kubernetes.io, free)
- Docker documentation (docs.docker.com, free)
Stage 06
Observability — The Core SRE Skill
Observability is what separates SRE from DevOps in technical interviews. Defining SLOs, building actionable dashboards, and reducing alert noise are the skills hiring managers test directly.
Observability Fundamentals
- Monitoring vs observability — monitoring tracks known failure modes; observability enables understanding unknown states
- Three pillars of observability:
  - Metrics — numerical measurements aggregated over time (requests/sec, latency, error rate)
  - Logs — structured or unstructured event records
  - Traces — distributed request tracking across services
- Signal quality — the difference between "the alert fired" and "the alert tells you exactly what to fix"
- The goal — understanding system behavior from external outputs without needing to ask the system directly
SLI, SLO, SLA, Error Budgets
- SLI (Service Level Indicator) — the metric you measure (availability, latency, error rate)
  - Availability SLI — (successful requests / total requests) × 100
  - Latency SLI — percentage of requests completing under threshold (e.g., 99th percentile < 200ms)
  - Error rate SLI — percentage of requests returning 5xx errors
- SLO (Service Level Objective) — the target for the SLI (e.g., 99.9% availability over 28 days)
  - Setting SLOs — too tight creates toil; too loose creates poor user experience
  - Multi-window SLOs — different windows for different user-impact timescales
  - SLO-based alerting — burning budget too fast is a better signal than crossing a threshold
- Error Budget — acceptable unreliability per SLO (99.9% SLO = 0.1% budget = 43.8 min/month)
  - Error budget policy — what happens when budget is exhausted (halt feature work, increase review)
  - Error budget burn rate — how fast budget is being consumed
- SLA (Service Level Agreement) — contractual commitment to customers; typically more lenient than internal SLO
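The budget arithmetic is worth internalizing. A quick sketch (a fixed 30-day window is assumed here for round numbers; calendar months give the 43.8-minute figure):

```python
# Error budget math: an availability SLO's budget is the allowed
# fraction of failure over the window, expressed as downtime.

def error_budget_minutes(slo_pct: float, window_days: float) -> float:
    """Downtime allowed per window for an availability SLO."""
    return (1 - slo_pct / 100) * window_days * 24 * 60

# 99.9% over 30 days -> 43.2 minutes of allowed downtime
print(round(error_budget_minutes(99.9, 30), 1))   # 43.2
# 99.99% over 30 days -> about 4.3 minutes
print(round(error_budget_minutes(99.99, 30), 1))  # 4.3
```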
Prometheus
- Architecture — pull-based scraping model, time-series database
- Metrics types:
  - Counter — monotonically increasing (requests_total, errors_total)
  - Gauge — value that goes up and down (memory_usage_bytes, queue_depth)
  - Histogram — bucket-based distribution (request_duration_seconds_bucket)
  - Summary — client-side pre-calculated percentiles; cannot be aggregated across instances, so histograms are generally preferred
- PromQL (Prometheus Query Language):
  - Instant vectors — current metric values
  - Range vectors — metric values over a time range
  - Functions — rate(), irate(), increase(), avg_over_time(), histogram_quantile()
  - Aggregation — sum(), avg(), max(), min(), count() with by/without labels
  - Common queries:
    - Error rate — rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
    - 99th percentile latency — histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
    - Memory usage by pod — container_memory_working_set_bytes{namespace="production"}
- Alerting rules — recording rules for pre-computation, alerting rules for notification
- Alertmanager — routing, grouping, deduplication, silences, inhibition
- Service discovery — Kubernetes SD, EC2 SD, file-based SD
- Exporters — node_exporter (host metrics), kube-state-metrics (Kubernetes object state), blackbox_exporter (endpoint probing), postgres_exporter, redis_exporter
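To build intuition for histogram_quantile(), here is a simplified reimplementation of the bucket interpolation Prometheus performs; it is illustrative only, and ignores the +Inf bucket and other edge cases the real function handles:

```python
# Simplified sketch of histogram_quantile()'s linear interpolation.
# Buckets are (upper_bound, cumulative_count) pairs, as produced by
# request_duration_seconds_bucket{le="..."}.

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    buckets = sorted(buckets)
    rank = q * buckets[-1][1]          # target position in the distribution
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within the bucket containing the rank.
            frac = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * frac
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# 100 requests: 50 under 100ms, 90 under 500ms, all under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.5, buckets))   # 0.1  (median)
print(histogram_quantile(0.99, buckets))  # 0.95 (p99)
```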
Grafana
- Data sources — Prometheus, Loki, Tempo, CloudWatch, Datadog, InfluxDB, Elasticsearch
- Dashboard design:
  - Panel types — time series, gauge, stat, bar chart, heatmap, table, logs, traces
  - Variables — dynamic dashboards with dropdown selectors (environment, namespace, service)
  - Alerting — Grafana-managed alerts vs Alertmanager integration
  - Annotations — marking deployments, incidents on dashboard timeline
  - Dashboard as code — JSON export, Grafonnet, Jsonnet
- Alert rules — Grafana Alerting with notification channels (Slack, PagerDuty, email)
- Loki integration — log query visualization alongside metrics
Loki — Log Aggregation
- Architecture — horizontally scalable, like Prometheus but for logs
- LogQL — query language similar to PromQL:
  - Stream selectors — {app="nginx", namespace="production"}
  - Log pipeline — |= "error" (filter), | json (parser), | logfmt
  - Metric queries — rate({app="nginx"}[5m])
- Log labels — cardinality concerns (labels with high cardinality kill Loki performance)
- Promtail — log shipping agent, tail log files and send to Loki
- Grafana Agent / Grafana Alloy — all-in-one telemetry collector replacing Promtail and the Prometheus agent; Alloy is the Agent's successor
Distributed Tracing
- Why tracing — understanding request flow across microservices, identifying bottlenecks
- Trace anatomy — trace ID, spans, parent-child relationships, span attributes
- OpenTelemetry (OTel) — vendor-neutral instrumentation standard:
  - SDKs for Python, Go, Java, JavaScript, .NET
  - Auto-instrumentation — no code changes required for common frameworks
  - Collector — receiving, processing, and exporting telemetry
  - Semantic conventions — standardized attribute names
- Jaeger — open-source distributed tracing backend
- Tempo (Grafana) — scalable tracing backend, integrates with Grafana
- Datadog APM — commercial tracing with automatic service discovery
Alerting Best Practices
- Alert fatigue — the primary cause of missed incidents is too many low-quality alerts
- Alert design principles:
  - Alert on symptoms, not causes — user-facing impact vs internal metrics
  - SLO-based burn rate alerts — more reliable than threshold-based
  - Actionable alerts — every alert must have a clear response action in a runbook
  - Appropriate urgency — not everything is pager-worthy
- Alert severity levels — critical (page immediately), warning (ticket), informational (dashboard)
- Runbooks — per-alert documentation: what fired, how to triage, common causes, remediation steps
- Alert suppression during maintenance — silences, inhibition rules
- Dead man's switch — alerting when monitoring itself stops reporting
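Burn-rate alerting rests on simple arithmetic: burn rate is the observed error rate divided by the rate the SLO allows. A sketch (14.4 is a commonly cited fast-burn threshold because it exhausts a 30-day budget in roughly two days):

```python
# Burn rate = observed error rate / error rate the SLO allows.
# Burn rate 1 spends the budget exactly over the window.

def burn_rate(observed_error_rate: float, slo_pct: float) -> float:
    allowed = 1 - slo_pct / 100
    return observed_error_rate / allowed

# 99.9% SLO allows 0.1% errors; observing 1.44% errors:
rate = burn_rate(0.0144, 99.9)
print(round(rate, 1))          # 14.4
# Hours until a 30-day budget is gone at this pace:
print(round(30 * 24 / rate))   # 50
```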
Distributed Systems Observability
- USE Method (Brendan Gregg) — Utilization, Saturation, Errors; for resources
- RED Method (Tom Wilkie) — Rate, Errors, Duration; for services
- The Four Golden Signals (Google SRE book) — latency, traffic, errors, saturation
- Service health scoring — combining multiple SLIs into composite health
Resources
- Prometheus documentation (prometheus.io, free)
- Grafana documentation (free)
- "Site Reliability Engineering" book (Google, free online)
- Brendan Gregg performance resources (free)
- Loki documentation (free)
- OpenTelemetry documentation (free)
Stage 07
Incident Management
Incident management is a primary SRE function. The ability to lead incident response, make fast decisions, and run effective postmortems is tested in interviews.
Incident Response Process
- Detection — alert fires, SLO burn rate exceeded, user reports
- Severity classification — SEV1 (critical, all users impacted), SEV2 (major degradation), SEV3 (minor, subset of users), SEV4 (low impact)
- Incident command — roles: Incident Commander (IC), Communications Lead, Subject Matter Experts
- Communication — internal Slack/Teams war room, external status page updates, stakeholder notifications
- Triage — identifying blast radius, confirming user impact, ruling out false positive
- Mitigation vs root cause — mitigate first (restore service), investigate root cause after
- Common mitigation actions:
  - Rollback — kubectl rollout undo, Argo CD rollback, Helm rollback
  - Feature flag disable — turning off the feature that introduced the regression
  - Traffic shift — routing to healthy canary or blue/green deployment
  - Scale out — adding capacity to handle load
  - Circuit breaker — preventing cascade failures across services
  - DNS failover — Route 53 health check failover to backup region
- Resolution — confirming service restored, SLO burn rate back to normal
- Post-incident handoff — summary of what happened, what was done, open action items
Incident Communication
- Status pages — Atlassian Statuspage, Cachet; external user communication
- Incident templates — structured updates with impact, cause, mitigation status, ETA
- Blameless culture — incidents as learning opportunities, not punishment events
- Stakeholder communication — non-technical updates to executives during major incidents
Postmortem Practice
- Blameless postmortem philosophy — systems fail, not people
- Five Whys — iterative root cause analysis technique
- Timeline reconstruction — chronological sequence from first sign of issue to resolution
- Postmortem structure:
  - Executive summary — what happened, impact, duration
  - Timeline — detailed sequence of events with timestamps
  - Root cause(s) — actual underlying cause(s), not symptoms
  - Contributing factors — conditions that made the incident possible or worse
  - Detection — how it was discovered, and whether it could have been detected sooner
  - Mitigation — what actions restored service
  - Action items — specific, assigned, dated improvements to prevent recurrence
- Action item tracking — linking postmortem actions to project tracking (Jira, Linear)
- Reviewing postmortems for patterns — recurring failures indicate systemic issues
On-Call Management
- On-call rotation design — fair distribution, adequate backup, follow-the-sun for 24/7
- Alert routing — PagerDuty, OpsGenie; escalation policies, schedules, on-call notifications
- Runbooks — per-service documentation: what alerts fire, how to diagnose, how to resolve
- Error budget policy — what changes to deployment cadence when budget is low
- On-call compensation — escalation pay, time-off-in-lieu
Incident Tools
- PagerDuty — on-call scheduling, alert routing, incident management, postmortems
- OpsGenie — similar to PagerDuty
- FireHydrant — incident management platform
- Rootly — Slack-based incident management
- Blameless — SRE platform including SLO management and postmortems
Resources
- Google SRE book (free online, sre.google)
- DORA metrics documentation (free)
- PagerDuty incident response guide (free)
- Atlassian incident management guide (free)
Stage 08
Reliability Engineering Patterns
This is the engineering work that distinguishes SRE from ops: designing systems that stay up even when components fail.
Distributed Systems Fundamentals
- CAP Theorem — Consistency, Availability, Partition Tolerance; during a network partition, a system must sacrifice either consistency or availability
- Eventual consistency — accepting temporary inconsistency for availability
- Failure modes in distributed systems — network partitions, split-brain, cascading failures
- Fallacies of distributed computing — the network is NOT reliable, latency is NOT zero
Reliability Patterns
- Redundancy — eliminating single points of failure
  - Active-active vs active-passive — trade-offs in failover speed vs cost
  - N+1 redundancy — always one more than needed
- Circuit breaker — stopping requests to failing downstream services:
  - Closed (normal), Open (failing, stop requests), Half-open (testing recovery)
  - Hystrix (Java), Resilience4j, Envoy circuit breaking
- Retry with exponential backoff — handling transient failures without overwhelming services
  - Jitter — randomizing retry intervals to prevent thundering herds
- Timeout — preventing indefinite waiting for downstream calls
- Bulkhead — isolating failures to prevent cascade:
  - Thread pool isolation per downstream dependency
  - Resource limits per service type
- Rate limiting — protecting services from overload:
  - Token bucket, leaky bucket algorithms
  - Client-side vs server-side rate limiting
- Load shedding — gracefully degrading under extreme load rather than crashing
- Graceful degradation — returning cached or partial results when dependencies fail
- Idempotency — making operations safe to retry without side effects
- Chaos engineering — deliberately injecting failures to test resilience:
  - Chaos Monkey (Netflix) — randomly terminating instances
  - Chaos Mesh, LitmusChaos — Kubernetes-native chaos tools
  - GameDays — planned exercises testing failure scenarios
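Several of these patterns are only a few lines of code. A hedged sketch of retry with exponential backoff and full jitter; the attempt count, delay cap, and the choice to catch all exceptions are placeholders to tune per service:

```python
import random
import time

def retry(fn, attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponentially
    growing, jittered delays. Broad `except` is for the sketch only;
    real code should catch its specific retryable exceptions."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: uniform in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            sleep(delay)

# Demo: a flaky call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(retry(flaky, sleep=lambda _: None))  # ok (after 2 retries)
```

Passing `sleep` as a parameter keeps the demo (and tests) fast while real callers use `time.sleep`.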
High Availability Architecture
- Multi-AZ deployment — spreading across availability zones for hardware failure resilience
- Multi-region deployment — geographic redundancy for regional outage resilience
  - Active-active multi-region — serving from multiple regions simultaneously
  - Active-passive — primary region with standby failover
- Blue-green deployments — maintaining two identical environments, swapping traffic
- Canary deployments — gradually shifting traffic to new version:
  - Argo Rollouts — Kubernetes-native progressive delivery
  - Feature flags — code-level traffic splitting
- Database high availability:
  - RDS Multi-AZ — synchronous standby replica, automatic failover
  - Aurora Global Database — cross-region replication with < 1 second RPO
  - CockroachDB, Vitess — distributed SQL architectures
Toil Reduction
- Defining toil — manual, repetitive, automatable work that grows with service scale
- Toil budget — Google SRE target: < 50% of time on toil
- Automation targets — runbook automation, self-healing, automated remediation
- Toil reduction examples:
  - Automating certificate renewal — cert-manager eliminates manual cert rotation
  - Automating scaling — HPA eliminates manual scale operations
  - Automating rollback — automated smoke tests triggering rollback on failure
  - Automating log rotation — preventing disk-full incidents
Capacity Planning
- Demand forecasting — analyzing usage trends, projecting future capacity needs
- Load testing — validating capacity before scaling events:
  - k6 — modern load testing tool, JavaScript scripts
  - Locust — Python-based load testing
  - JMeter — enterprise load testing
  - Gatling — Scala-based, good for complex scenarios
- Saturation analysis — identifying which resource will saturate first under load
- Cost optimization — rightsizing instances, reserved capacity vs on-demand
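Little's Law (concurrency = throughput × latency) underlies much of this saturation analysis. A worked example; the per-instance concurrency figure is an assumption for illustration:

```python
import math

# Little's Law: requests in flight = arrival rate x time in system.
# A first estimate of how much concurrency a target load implies.

def required_concurrency(rps: float, latency_s: float) -> float:
    return rps * latency_s

# 2000 req/s at 250 ms mean latency -> 500 requests in flight
print(required_concurrency(2000, 0.25))  # 500.0

# If each instance comfortably handles 25 concurrent requests
# (an assumed figure), instances needed:
print(math.ceil(required_concurrency(2000, 0.25) / 25))  # 20
```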
Resources
- Google SRE book (sre.google, free)
- "Designing Distributed Systems" by Brendan Burns (free online)
- Netflix Tech Blog (free)
- Martin Fowler's patterns documentation (free)
- Chaos Mesh documentation (free)
Stage 09
Scripting & Automation
Software engineering is what separates SRE from traditional operations. Writing code that eliminates toil is the primary differentiator.
Python for SRE
- All fundamentals — data types, functions, classes, error handling, file I/O
- HTTP clients — requests library for API calls to cloud providers, monitoring systems
- boto3 / azure-sdk / google-cloud — cloud SDK automation
- Kubernetes Python client — programmatic cluster management
- Prometheus Python client — instrumenting custom applications
- Datadog, PagerDuty, Slack API integrations — automation and alerting
- Practical SRE scripts:
  - Health checker — polling service endpoints, alerting on failure
  - Auto-remediation — detecting high disk usage, cleaning old logs/artifacts
  - Deployment validator — running smoke tests after deployment, triggering rollback on failure
  - Capacity report — querying cloud APIs for resource utilization, cost trend
  - Incident timeline builder — aggregating logs from multiple sources for postmortem
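A minimal version of the health-checker idea, using only the standard library; the throwaway `http.server` endpoint stands in for a real service so the sketch is self-contained:

```python
import http.server
import threading
import urllib.error
import urllib.request

def check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

class Health(http.server.BaseHTTPRequestHandler):
    """Stand-in service exposing a trivial health endpoint."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # keep the demo quiet
        pass

srv = http.server.HTTPServer(("127.0.0.1", 0), Health)  # port 0: OS picks
threading.Thread(target=srv.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{srv.server_address[1]}/healthz"

healthy = check(url)
print(healthy)  # True
srv.shutdown()
```

A real version would loop over a list of endpoints on a schedule and page (or open a ticket) on repeated failures.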
Go for SRE
- Why Go matters — Kubernetes, Prometheus, Terraform, Docker are all written in Go
- Go fundamentals — types, goroutines, channels, interfaces, error handling
- Reading and contributing to Go-based tools
- Writing simple operators and controllers for Kubernetes
Bash for SRE
- Production-grade shell scripting — error handling, logging, idempotency
- Kubernetes administration scripts — node drain, cluster upgrade automation
- Log parsing and alerting one-liners
- Cronjob scripts for scheduled maintenance
Automation Frameworks
- Ansible — configuration management and automation:
  - Playbooks, inventory, roles, handlers, templates (Jinja2)
  - Modules — command, file, template, service, package, k8s, aws_*
  - Idempotency — ensuring desired state
  - Ansible Tower / AWX — enterprise automation with RBAC and scheduling
- Terraform — infrastructure automation (see Stage 4)
- ArgoCD / Flux — GitOps-based Kubernetes automation (see Stage 5)
Resources
- Python for Everybody (Coursera, free)
- "The Go Programming Language" (free online intro)
- Ansible documentation (free)
- Kubernetes Python client documentation (free)
FAQ
Common questions
How long does it take to become an SRE?
18–24 months optimistic at 20–25 hours/week, 2–3 years realistic. SRE rewards software engineering applied to operations — strong programming, deep systems thinking, and operational maturity all compound. The fastest paths come from SDE backgrounds with infrastructure interest, or DevOps engineers who develop reliability-engineering depth. Pure operations backgrounds without programming depth struggle.
Which certifications matter for SRE?
Certified Kubernetes Administrator (CKA) and Certified Kubernetes Application Developer (CKAD) — Kubernetes is essentially a given requirement in SRE postings in 2026. AWS or GCP cloud certs depending on platform focus. Linux Foundation Certified System Administrator (LFCS). The cert market is fragmented; Kubernetes depth and observability fluency outweigh paper credentials.
Do I need a CS degree?
Helpful but not strictly required. SRE demands programming fluency (Go, Python, or Bash), distributed systems intuition, and operational maturity. Self-taught paths through bootcamps + production work compete effectively. The defining skill in SRE is observability depth — defining SLOs, instrumenting services, debugging from telemetry — which separates SRE candidates from DevOps generalists. Gartner projects 75% of enterprises will use SRE practices organization-wide by 2027.
What separates a hired SRE?
SLO definition and observability work in your portfolio. Show service-level objectives you've defined, error budgets you've operated, and incident analyses you've written. Other differentiators: Prometheus + Grafana fluency at depth, chaos engineering experience, and clear postmortem writing. Generic DevOps candidates lose to candidates who can articulate the production reliability mindset that distinguishes SRE.