Roadmap
Site Reliability Engineer (SRE)
The engineer who applies software engineering principles to operations problems. Builds the automation, observability, and reliability systems that keep production services running while letting development teams ship fast.
OPTIMISTIC 18–24 months · REALISTIC 2–3 years
Stage 00
Computer & IT Fundamentals
SREs debug production at 2 AM. Understanding what hardware, operating systems, and software are actually doing under the hood enables faster diagnosis.
Computer Hardware
- CPU — cores, threads, scheduling, context switching; CPU saturation diagnosis
- RAM — virtual memory, paging, OOM killer; memory pressure investigation
- Storage — IOPS, throughput, latency; disk I/O bottleneck analysis
- NIC — bandwidth, packet loss, interrupt handling; network performance baseline
- Physical vs virtual — cloud VM resource limits, noisy neighbor effects
Number Systems
- Binary, hex — memory addresses, network masks, log analysis
How Operating Systems Work
- Kernel vs user space — system call overhead, kernel panics
- Processes and threads — scheduling, priorities, process states (R, S, D, Z)
- Memory management — virtual memory, paging, swap, memory-mapped files
- File descriptors — ulimits; "too many open files" errors are common in production
- System calls — strace for debugging unknown application behavior
- Boot process — systemd, init, startup dependencies
Software Execution
- How programs compile and execute — JVM, Python GIL, Go goroutines; language-specific performance characteristics
- Dynamic libraries — shared library versions, LD_LIBRARY_PATH issues
- Environment variables — application configuration via environment
Resources
- CS50 (Harvard, free)
- "Systems Performance" by Brendan Gregg (book reference)
- Professor Messer CompTIA A+ (free YouTube)
Stage 01
Linux — Deep Systems Administration
SREs are expected to debug production Linux systems under pressure. Surface knowledge is insufficient.
Linux Fundamentals
- Full filesystem hierarchy with operational significance:
  - /proc — kernel and process information, pseudo-filesystem
  - /sys — kernel parameters via sysfs, hardware configuration
  - /var/log — system and application logs
  - /etc — system configuration, service configs
  - /tmp and /dev/shm — temporary storage, memory-based
- Terminal proficiency — full command fluency including advanced usage
Performance Investigation Commands
- CPU:
  - top / htop — real-time process CPU and memory usage
  - vmstat — virtual memory, CPU activity, I/O stats
  - mpstat — per-CPU statistics
  - uptime — load average (1, 5, 15 minute); what load average means
  - pidstat — per-process CPU and I/O stats
- Memory:
  - free -h — total, used, free, cached, available
  - vmstat -s — memory statistics
  - /proc/meminfo — detailed memory breakdown
  - smem — process memory reporting with PSS/USS
  - oom-kill events in dmesg — understanding the OOM killer
- Disk:
  - iostat -x — device utilization, await, service time
  - iotop — per-process I/O monitoring
  - df -h — filesystem space usage
  - du -sh — directory size
  - lsof — open files per process
- Network:
  - ss -tulpn — listening ports and processes
  - netstat -s — network statistics summary
  - ip -s link — interface statistics (bytes, errors, drops)
  - sar -n DEV — network activity over time
  - tcpdump — packet capture for live debugging
  - iftop / nethogs — per-connection bandwidth usage
- System-wide:
  - dmesg — kernel messages, hardware errors, OOM kills
  - journalctl — systemd journal querying
  - sar — System Activity Reporter; historical performance data
  - strace — tracing system calls of a running process
  - lsof — listing open files and network connections
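The investigation commands above also lend themselves to automation. As a minimal illustration (not a replacement for the tools), here is a Python sketch that parses meminfo-format output the way a `free -h` or `/proc/meminfo` check does; the sample text is inlined so the sketch runs anywhere, though on a real Linux host you would read `/proc/meminfo` directly:

```python
# Parse /proc/meminfo-style output and report memory pressure.
# Sample text is inlined so this runs anywhere; on a real Linux
# host, replace SAMPLE_MEMINFO with open("/proc/meminfo").read().

SAMPLE_MEMINFO = """\
MemTotal:       16303428 kB
MemFree:         1024000 kB
MemAvailable:    2150400 kB
Buffers:          302100 kB
Cached:          4096000 kB
"""

def parse_meminfo(text: str) -> dict[str, int]:
    """Return {field: kibibytes} from meminfo-format text."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        info[key.strip()] = int(rest.split()[0])
    return info

def available_pct(info: dict[str, int]) -> float:
    # MemAvailable is the kernel's estimate of memory usable without
    # swapping -- a better pressure signal than MemFree.
    return 100.0 * info["MemAvailable"] / info["MemTotal"]

info = parse_meminfo(SAMPLE_MEMINFO)
print(f"available: {available_pct(info):.1f}%")  # alert if this drops low
```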
Linux Networking
- ip addr, ip route, ip link — interface management
- iptables / nftables — firewall rules, NAT, port forwarding
- ethtool — network interface settings
- ss — socket statistics, faster replacement for netstat
- /proc/net/ — raw networking statistics
Systemd
- systemctl — managing services (start, stop, enable, disable, status)
- journalctl — reading logs (--since, --until, -u unit, -f follow, -p priority)
- Unit files — Service, Timer, Socket unit types
- Dependencies — After=, Requires=, Wants=; service startup ordering
- Resource controls — CPUQuota=, MemoryMax= (MemoryLimit= on cgroup v1) in unit files
Shell Scripting for SRE
- Bash scripts for operational tasks — health checks, log rotation, alert responses
- Error handling — set -e, set -u, trap for cleanup
- Parallel execution — & and wait, xargs -P
- Cron and systemd timers — scheduling regular tasks
- Python for SRE automation — more complex scripts, API calls, data processing
Resources
- TryHackMe Linux Fundamentals (free)
- "The Linux Command Line" (free online)
- Brendan Gregg's Linux performance tools poster (free)
- OverTheWire Bandit (free)
Stage 02
Networking Fundamentals
Network issues cause a significant percentage of production incidents. SREs must diagnose connectivity, DNS, and load balancer issues under pressure.
Core Protocols
- OSI model — all 7 layers, failure modes at each
- TCP/IP — handshake, teardown, states, performance tuning (TCP_NODELAY, keepalive)
- UDP — DNS, NTP, QUIC (HTTP/3)
- ICMP — ping, traceroute, unreachable messages
- IP addressing — subnetting, CIDR, IPv6 dual-stack
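Subnetting and CIDR arithmetic can be checked quickly with Python's standard library, which is handy during triage. A small example using the stdlib `ipaddress` module (the addresses are made up for illustration):

```python
import ipaddress

# CIDR arithmetic: how many addresses does a /26 contain, and does a
# given address fall inside it?
net = ipaddress.ip_network("10.0.16.0/26")

print(net.netmask)            # 255.255.255.192
print(net.num_addresses)      # 64 addresses in a /26
# Usable hosts = total minus the network and broadcast addresses
print(net.num_addresses - 2)  # 62
print(ipaddress.ip_address("10.0.16.45") in net)  # True
print(ipaddress.ip_address("10.0.17.1") in net)   # False
```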
Protocols Critical for SRE
- DNS — TTL, propagation, NXDOMAIN, SERVFAIL; DNS failure is a top incident cause
  - dig, nslookup, host — DNS debugging tools
  - DNS caching — /etc/resolv.conf, ndots, search domains
  - DNS in Kubernetes — CoreDNS, service discovery via DNS
- HTTP/HTTPS/TLS — request lifecycle, keep-alive, connection pooling, certificate expiry
  - curl with verbose flags — -v, --resolve, --connect-to; bypassing DNS for testing
  - HTTP/2 vs HTTP/1.1 — multiplexing, header compression
  - TLS handshake — certificate chain validation, SNI
  - Let's Encrypt — automated certificate renewal, cert-manager in Kubernetes
- Load balancing — L4 (TCP) vs L7 (HTTP), session persistence, health checks
  - Round robin, least connections, IP hash — algorithms and when each is appropriate
  - HAProxy, nginx, envoy — common load balancers
- gRPC — protocol buffers, HTTP/2 transport, streaming, service mesh context
- WebSocket — long-lived connections, load balancer timeout configuration
Cloud Networking for SRE
- VPC, subnets, security groups — access troubleshooting
- Load balancers — ALB/NLB (AWS), Azure Load Balancer, GCP Load Balancing
- CDN — CloudFront, Fastly, Cloudflare; caching behavior, cache invalidation
- DNS in cloud — Route 53 health checks, Azure Traffic Manager, GCP Cloud DNS
Kubernetes Networking
- Pod networking model — every pod gets an IP
- Services — ClusterIP, NodePort, LoadBalancer, ExternalName; service discovery
- Ingress and Ingress controllers — nginx, Traefik, AWS ALB Ingress Controller
- Network Policies — CNI-enforced pod communication controls
- CoreDNS — Kubernetes internal DNS, configmap tuning
- Service mesh — Istio, Linkerd; mTLS, traffic management, observability
Network Debugging Toolkit
- ping, traceroute/tracepath — basic reachability
- curl — HTTP testing with headers, timing, TLS inspection
- tcpdump — packet capture on production (with care)
- Wireshark — offline PCAP analysis
- nmap — port connectivity testing
- netcat (nc) — raw TCP/UDP testing
- telnet — port connectivity check (legacy but still useful)
- openssl s_client — TLS certificate and connection debugging
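The most basic of these checks, port reachability, is easy to script. A minimal sketch of the probe `nc -z host port` performs, using the stdlib `socket` module; the local listener exists only to make the demo self-contained:

```python
import socket

def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within
    the timeout -- the same probe `nc -z host port` performs."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, DNS failure
        return False

# Demo against a throwaway local listener so the sketch is self-contained.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
srv.listen(1)
port = srv.getsockname()[1]
reachable = tcp_check("127.0.0.1", port)
srv.close()
print(reachable)  # True: something was listening
```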
Resources
- Professor Messer Network+ (free YouTube)
- Kubernetes networking documentation (free)
- "High Performance Browser Networking" (hpbn.co, free online)
Stage 03
Security Fundamentals
SREs work with security teams on reliability-impacting vulnerabilities and incidents. Understanding security principles enables better architecture and faster incident response.
Security Fundamentals
- CIA Triad — availability is the SRE's primary concern, but all three matter
- Authentication and authorization — IAM, service accounts, mutual TLS
- Cryptography — TLS, certificate management, secrets rotation
- Shared responsibility model — cloud provider vs customer reliability boundaries
- Common attack patterns — DDoS (availability impact), ransomware (availability impact), credential theft (integrity impact)
- Compliance context — SLO targets often embedded in compliance frameworks (HIPAA uptime, PCI availability)
Resources
- Professor Messer Security+ SY0-701 (free YouTube)
Stage 04
Cloud Platforms
SREs manage production infrastructure on cloud platforms. Deep knowledge of at least one provider is required.
AWS — Core Services
- Compute — EC2 (instance types, placement groups, dedicated vs shared), Lambda, ECS, EKS
- Auto Scaling — launch templates, scaling policies (target tracking, step, scheduled)
- Storage — S3, EBS, EFS, Instance Store; performance characteristics of each
- Networking — VPC, Route 53, CloudFront, ALB/NLB/GWLB, API Gateway
- Database — RDS (Multi-AZ, read replicas, failover), DynamoDB (DAX, streams), ElastiCache
- Observability — CloudWatch (metrics, logs, alarms), X-Ray (distributed tracing), CloudTrail
- Reliability features:
  - Multi-AZ deployments — high availability within a region
  - Multi-region architectures — disaster recovery
  - AWS Backup — automated backup policies
  - Route 53 health checks and failover routing
Azure — Core Services
- Compute — Virtual Machines, Azure Functions, AKS, Container Apps, App Service
- Storage — Blob, Files, Queues, Azure Disk
- Networking — VNet, Azure Load Balancer, Application Gateway, Azure Front Door
- Database — Azure SQL (Always On AG, Business Critical), Cosmos DB, Azure Cache for Redis
- Observability — Azure Monitor, Application Insights, Log Analytics workspace
GCP — Awareness Level
- Compute Engine, GKE, Cloud Run, Cloud Functions
- Cloud Storage, Cloud SQL, Spanner, Bigtable
- Cloud Monitoring, Cloud Logging, Cloud Trace
Infrastructure as Code
- Terraform — defining, versioning, and managing cloud infrastructure:
  - HCL syntax, providers, resources, modules, state
  - terraform plan — reviewing changes before apply
  - Remote state — S3 + DynamoDB, Azure Storage
  - Workspace per environment
- Ansible — configuration management, agentless SSH-based:
  - Playbooks, inventory, roles, handlers
  - Idempotency — running multiple times produces the same result
  - Common SRE uses — OS patching at scale, application configuration
Resources
- AWS Solutions Architect Associate course (A Cloud Guru, Udemy)
- HashiCorp Learn Terraform (free)
- Ansible documentation (free)
Stage 05
Containers & Kubernetes
Kubernetes is essentially a given requirement in SRE postings in 2026. SREs manage cluster reliability and help engineering teams deploy safely.
Docker
- Container lifecycle — build, run, stop, remove
- Dockerfile — FROM, RUN, COPY, CMD, ENTRYPOINT, ENV, EXPOSE
- Image layers and caching — optimizing build performance
- Docker Compose — multi-container local environments
- Resource limits — --memory, --cpus; preventing container resource exhaustion
- Logging drivers — json-file, syslog, fluentd; sending container logs to aggregator
- Health checks — HEALTHCHECK instruction, container health status
Kubernetes Architecture
- Control plane — API server, etcd, scheduler, controller manager, cloud controller
- Worker nodes — kubelet, kube-proxy, container runtime
- etcd — distributed key-value store, the source of truth for all cluster state
- API server — the single interface for all cluster operations
Core Kubernetes Objects
- Pods — smallest deployable unit, co-located containers
- Deployments — desired state for stateless applications, rolling updates, rollbacks
- StatefulSets — ordered, stable identity pods for stateful applications
- DaemonSets — running one pod per node (log collectors, monitoring agents)
- Jobs and CronJobs — batch and scheduled workloads
- Services — stable network endpoints for pods
- ConfigMaps and Secrets — configuration and credential injection
- Ingress — HTTP(S) routing from external traffic
- Namespaces — logical cluster partitioning
- ResourceQuotas and LimitRanges — namespace resource limits
Kubernetes Operations
- kubectl — the primary CLI tool:
  - kubectl get, describe, logs, exec, port-forward, apply, delete
  - -o yaml — exporting resource definitions
  - --dry-run=client — validating changes without applying
  - kubectl top — resource usage (requires metrics-server)
  - kubectl rollout status/history/undo — deployment management
- Helm — Kubernetes package manager:
  - Charts — pre-packaged applications
  - Values — customizing deployments
  - helm install, upgrade, rollback, uninstall
  - helm repo add, update, search
- Horizontal Pod Autoscaler (HPA) — scaling based on CPU/memory/custom metrics
- Vertical Pod Autoscaler (VPA) — right-sizing resource requests/limits
- Cluster Autoscaler — adding/removing nodes based on pending pods
Kubernetes Reliability Patterns
- Pod Disruption Budgets (PDB) — maintaining minimum availability during maintenance
- Liveness, Readiness, Startup probes — health checking and traffic control
- Resource requests and limits — preventing noisy neighbor issues
- Affinity and anti-affinity rules — spreading pods across nodes/zones
- Topology spread constraints — even distribution across failure domains
- Priority classes — ensuring critical workloads get resources during pressure
- Drain and cordon — safely removing nodes from service
Managed Kubernetes
- EKS (AWS) — managed control plane, node groups, Fargate, add-ons
- AKS (Azure) — node pools, Azure CNI, AAD integration
- GKE (GCP) — Autopilot, standard, Anthos
Resources
- Kubernetes documentation (kubernetes.io, free)
- Docker documentation (docs.docker.com, free)
Stage 06
Observability — The Core SRE Skill
Observability is what separates SRE from DevOps in technical interviews. Defining SLOs, building actionable dashboards, and reducing alert noise are the skills hiring managers test directly.
Observability Fundamentals
- Monitoring vs observability — monitoring tracks known failure modes; observability enables understanding unknown states
- Three pillars of observability:
  - Metrics — numerical measurements aggregated over time (requests/sec, latency, error rate)
  - Logs — structured or unstructured event records
  - Traces — distributed request tracking across services
- Signal quality — the difference between "the alert fired" and "the alert tells you exactly what to fix"
- The goal — understanding system behavior from external outputs without needing to ask the system directly
SLI, SLO, SLA, Error Budgets
- SLI (Service Level Indicator) — the metric you measure (availability, latency, error rate)
  - Availability SLI — (successful requests / total requests) × 100
  - Latency SLI — percentage of requests completing under threshold (e.g., 99th percentile < 200ms)
  - Error rate SLI — percentage of requests returning 5xx errors
- SLO (Service Level Objective) — the target for the SLI (e.g., 99.9% availability over 28 days)
  - Setting SLOs — too tight creates toil; too loose creates poor user experience
  - Multi-window SLOs — different windows for different user-impact timescales
  - SLO-based alerting — burning budget too fast is a better signal than crossing a threshold
- Error Budget — acceptable unreliability per SLO (99.9% SLO = 0.1% budget = 43.8 min/month)
  - Error budget policy — what happens when budget is exhausted (halt feature work, increase review)
  - Error budget burn rate — how fast budget is being consumed
- SLA (Service Level Agreement) — contractual commitment to customers; typically more lenient than internal SLO
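The budget arithmetic is worth internalizing. A quick sketch (a fixed 30-day window is assumed here for round numbers; calendar months give the 43.8-minute figure):

```python
# Error budget math: an availability SLO's budget is the allowed
# fraction of failure over the window, expressed as downtime.

def error_budget_minutes(slo_pct: float, window_days: float) -> float:
    """Downtime allowed per window for an availability SLO."""
    return (1 - slo_pct / 100) * window_days * 24 * 60

# 99.9% over 30 days -> 43.2 minutes of allowed downtime
print(round(error_budget_minutes(99.9, 30), 1))   # 43.2
# 99.99% over 30 days -> about 4.3 minutes
print(round(error_budget_minutes(99.99, 30), 1))  # 4.3
```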
Prometheus
- Architecture — pull-based scraping model, time-series database
- Metrics types:
  - Counter — monotonically increasing (requests_total, errors_total)
  - Gauge — value that goes up and down (memory_usage_bytes, queue_depth)
  - Histogram — bucket-based distribution (request_duration_seconds_bucket)
  - Summary — client-side pre-calculated percentiles; cannot be aggregated across instances, so histograms are generally preferred
- PromQL (Prometheus Query Language):
  - Instant vectors — current metric values
  - Range vectors — metric values over a time range
  - Functions — rate(), irate(), increase(), avg_over_time(), histogram_quantile()
  - Aggregation — sum(), avg(), max(), min(), count() with by/without labels
  - Common queries:
    - Error rate — rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])
    - 99th percentile latency — histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
    - Memory usage by pod — container_memory_working_set_bytes{namespace="production"}
- Alerting rules — recording rules for pre-computation, alerting rules for notification
- Alertmanager — routing, grouping, deduplication, silences, inhibition
- Service discovery — Kubernetes SD, EC2 SD, file-based SD
- Exporters — node_exporter (host metrics), kube-state-metrics (Kubernetes object state), blackbox_exporter (endpoint probing), postgres_exporter, redis_exporter
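To build intuition for histogram_quantile(), here is a simplified reimplementation of the bucket interpolation Prometheus performs; it is illustrative only, and ignores the +Inf bucket and other edge cases the real function handles:

```python
# Simplified sketch of histogram_quantile()'s linear interpolation.
# Buckets are (upper_bound, cumulative_count) pairs, as produced by
# request_duration_seconds_bucket{le="..."}.

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    buckets = sorted(buckets)
    rank = q * buckets[-1][1]          # target position in the distribution
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within the bucket containing the rank.
            frac = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * frac
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# 100 requests: 50 under 100ms, 90 under 500ms, all under 1s
buckets = [(0.1, 50), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.5, buckets))   # 0.1  (median)
print(histogram_quantile(0.99, buckets))  # 0.95 (p99)
```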
Grafana
- Data sources — Prometheus, Loki, Tempo, CloudWatch, Datadog, InfluxDB, Elasticsearch
- Dashboard design:
  - Panel types — time series, gauge, stat, bar chart, heatmap, table, logs, traces
  - Variables — dynamic dashboards with dropdown selectors (environment, namespace, service)
  - Alerting — Grafana-managed alerts vs Alertmanager integration
  - Annotations — marking deployments, incidents on dashboard timeline
  - Dashboard as code — JSON export, Grafonnet, Jsonnet
- Alert rules — Grafana Alerting with notification channels (Slack, PagerDuty, email)
- Loki integration — log query visualization alongside metrics
Loki — Log Aggregation
- Architecture — horizontally scalable, like Prometheus but for logs
- LogQL — query language similar to PromQL:
  - Stream selectors — {app="nginx", namespace="production"}
  - Log pipeline — |= "error" (filter), | json (parser), | logfmt
  - Metric queries — rate({app="nginx"}[5m])
- Log labels — cardinality concerns (labels with high cardinality kill Loki performance)
- Promtail — log shipping agent, tail log files and send to Loki
- Grafana Agent / Grafana Alloy — all-in-one telemetry collector replacing Promtail and the Prometheus agent; Alloy is the Agent's successor
Distributed Tracing
- Why tracing — understanding request flow across microservices, identifying bottlenecks
- Trace anatomy — trace ID, spans, parent-child relationships, span attributes
- OpenTelemetry (OTel) — vendor-neutral instrumentation standard:
  - SDKs for Python, Go, Java, JavaScript, .NET
  - Auto-instrumentation — no code changes required for common frameworks
  - Collector — receiving, processing, and exporting telemetry
  - Semantic conventions — standardized attribute names
- Jaeger — open-source distributed tracing backend
- Tempo (Grafana) — scalable tracing backend, integrates with Grafana
- Datadog APM — commercial tracing with automatic service discovery
Alerting Best Practices
- Alert fatigue — the primary cause of missed incidents is too many low-quality alerts
- Alert design principles:
  - Alert on symptoms, not causes — user-facing impact vs internal metrics
  - SLO-based burn rate alerts — more reliable than threshold-based
  - Actionable alerts — every alert must have a clear response action in a runbook
  - Appropriate urgency — not everything is pager-worthy
- Alert severity levels — critical (page immediately), warning (ticket), informational (dashboard)
- Runbooks — per-alert documentation: what fired, how to triage, common causes, remediation steps
- Alert suppression during maintenance — silences, inhibition rules
- Dead man's switch — alerting when monitoring itself stops reporting
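Burn-rate alerting rests on simple arithmetic: burn rate is the observed error rate divided by the rate the SLO allows. A sketch (14.4 is a commonly cited fast-burn threshold because it exhausts a 30-day budget in roughly two days):

```python
# Burn rate = observed error rate / error rate the SLO allows.
# Burn rate 1 spends the budget exactly over the window.

def burn_rate(observed_error_rate: float, slo_pct: float) -> float:
    allowed = 1 - slo_pct / 100
    return observed_error_rate / allowed

# 99.9% SLO allows 0.1% errors; observing 1.44% errors:
rate = burn_rate(0.0144, 99.9)
print(round(rate, 1))          # 14.4
# Hours until a 30-day budget is gone at this pace:
print(round(30 * 24 / rate))   # 50
```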
Distributed Systems Observability
- USE Method (Brendan Gregg) — Utilization, Saturation, Errors; for resources
- RED Method (Tom Wilkie) — Rate, Errors, Duration; for services
- The Four Golden Signals (Google SRE book) — latency, traffic, errors, saturation
- Service health scoring — combining multiple SLIs into composite health
Resources
- Prometheus documentation (prometheus.io, free)
- Grafana documentation (free)
- "Site Reliability Engineering" book (Google, free online)
- Brendan Gregg performance resources (free)
- Loki documentation (free)
- OpenTelemetry documentation (free)
Stage 07
Incident Management
Incident management is a primary SRE function. The ability to lead incident response, make fast decisions, and run effective postmortems is tested in interviews.
Incident Response Process
- Detection — alert fires, SLO burn rate exceeded, user reports
- Severity classification — SEV1 (critical, all users impacted), SEV2 (major degradation), SEV3 (minor, subset of users), SEV4 (low impact)
- Incident command — roles: Incident Commander (IC), Communications Lead, Subject Matter Experts
- Communication — internal Slack/Teams war room, external status page updates, stakeholder notifications
- Triage — identifying blast radius, confirming user impact, ruling out false positive
- Mitigation vs root cause — mitigate first (restore service), investigate root cause after
- Common mitigation actions:
  - Rollback — kubectl rollout undo, Argo CD rollback, Helm rollback
  - Feature flag disable — turning off the feature that introduced the regression
  - Traffic shift — routing to healthy canary or blue/green deployment
  - Scale out — adding capacity to handle load
  - Circuit breaker — preventing cascade failures across services
  - DNS failover — Route 53 health check failover to backup region
- Resolution — confirming service restored, SLO burn rate back to normal
- Post-incident handoff — summary of what happened, what was done, open action items
Incident Communication
- Status pages — Atlassian Statuspage, Cachet; external user communication
- Incident templates — structured updates with impact, cause, mitigation status, ETA
- Blameless culture — incidents as learning opportunities, not punishment events
- Stakeholder communication — non-technical updates to executives during major incidents
Postmortem Practice
- Blameless postmortem philosophy — systems fail, not people
- Five Whys — iterative root cause analysis technique
- Timeline reconstruction — chronological sequence from first sign of issue to resolution
- Postmortem structure:
  - Executive summary — what happened, impact, duration
  - Timeline — detailed sequence of events with timestamps
  - Root cause(s) — actual underlying cause(s), not symptoms
  - Contributing factors — conditions that made the incident possible or worse
  - Detection — how it was discovered, and whether it could have been detected sooner
  - Mitigation — what actions restored service
  - Action items — specific, assigned, dated improvements to prevent recurrence
- Action item tracking — linking postmortem actions to project tracking (Jira, Linear)
- Reviewing postmortems for patterns — recurring failures indicate systemic issues
On-Call Management
- On-call rotation design — fair distribution, adequate backup, follow-the-sun for 24/7
- Alert routing — PagerDuty, OpsGenie; escalation policies, schedules, on-call notifications
- Runbooks — per-service documentation: what alerts fire, how to diagnose, how to resolve
- Error budget policy — what changes to deployment cadence when budget is low
- On-call compensation — escalation pay, time-off-in-lieu
Incident Tools
- PagerDuty — on-call scheduling, alert routing, incident management, postmortems
- OpsGenie — similar to PagerDuty
- FireHydrant — incident management platform
- Rootly — Slack-based incident management
- Blameless — SRE platform including SLO management and postmortems
Resources
- Google SRE book (free online, sre.google)
- DORA metrics documentation (free)
- PagerDuty incident response guide (free)
- Atlassian incident management guide (free)
Stage 08
Reliability Engineering Patterns
This is the engineering work that distinguishes SRE from ops: designing systems that stay up even when components fail.
Distributed Systems Fundamentals
- CAP Theorem — Consistency, Availability, Partition Tolerance; during a network partition, a system must sacrifice either consistency or availability
- Eventual consistency — accepting temporary inconsistency for availability
- Failure modes in distributed systems — network partitions, split-brain, cascading failures
- Fallacies of distributed computing — the network is NOT reliable, latency is NOT zero
Reliability Patterns
- Redundancy — eliminating single points of failure
  - Active-active vs active-passive — trade-offs in failover speed vs cost
  - N+1 redundancy — always one more than needed
- Circuit breaker — stopping requests to failing downstream services:
  - Closed (normal), Open (failing, stop requests), Half-open (testing recovery)
  - Hystrix (Java), Resilience4j, Envoy circuit breaking
- Retry with exponential backoff — handling transient failures without overwhelming services
  - Jitter — randomizing retry intervals to prevent thundering herds
- Timeout — preventing indefinite waiting for downstream calls
- Bulkhead — isolating failures to prevent cascade:
  - Thread pool isolation per downstream dependency
  - Resource limits per service type
- Rate limiting — protecting services from overload:
  - Token bucket, leaky bucket algorithms
  - Client-side vs server-side rate limiting
- Load shedding — gracefully degrading under extreme load rather than crashing
- Graceful degradation — returning cached or partial results when dependencies fail
- Idempotency — making operations safe to retry without side effects
- Chaos engineering — deliberately injecting failures to test resilience:
  - Chaos Monkey (Netflix) — randomly terminating instances
  - Chaos Mesh, LitmusChaos — Kubernetes-native chaos tools
  - GameDays — planned exercises testing failure scenarios
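Several of these patterns are only a few lines of code. A hedged sketch of retry with exponential backoff and full jitter; the attempt count, delay cap, and the choice to catch all exceptions are placeholders to tune per service:

```python
import random
import time

def retry(fn, attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponentially
    growing, jittered delays. Broad `except` is for the sketch only;
    real code should catch its specific retryable exceptions."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: uniform in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            sleep(delay)

# Demo: a flaky call that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(retry(flaky, sleep=lambda _: None))  # ok (after 2 retries)
```

Passing `sleep` as a parameter keeps the demo (and tests) fast while real callers use `time.sleep`.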
High Availability Architecture
- Multi-AZ deployment — spreading across availability zones for hardware failure resilience
- Multi-region deployment — geographic redundancy for regional outage resilience
  - Active-active multi-region — serving from multiple regions simultaneously
  - Active-passive — primary region with standby failover
- Blue-green deployments — maintaining two identical environments, swapping traffic
- Canary deployments — gradually shifting traffic to new version:
  - Argo Rollouts — Kubernetes-native progressive delivery
  - Feature flags — code-level traffic splitting
- Database high availability:
  - RDS Multi-AZ — synchronous standby replica, automatic failover
  - Aurora Global Database — cross-region replication with < 1 second RPO
  - CockroachDB, Vitess — distributed SQL architectures
Toil Reduction
- Defining toil — manual, repetitive, automatable work that grows with service scale
- Toil budget — Google SRE target: < 50% of time on toil
- Automation targets — runbook automation, self-healing, automated remediation
- Toil reduction examples:
  - Automating certificate renewal — cert-manager eliminates manual cert rotation
  - Automating scaling — HPA eliminates manual scale operations
  - Automating rollback — automated smoke tests triggering rollback on failure
  - Automating log rotation — preventing disk-full incidents
Capacity Planning
- Demand forecasting — analyzing usage trends, projecting future capacity needs
- Load testing — validating capacity before scaling events:
  - k6 — modern load testing tool, JavaScript scripts
  - Locust — Python-based load testing
  - JMeter — enterprise load testing
  - Gatling — Scala-based, good for complex scenarios
- Saturation analysis — identifying which resource will saturate first under load
- Cost optimization — rightsizing instances, reserved capacity vs on-demand
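Little's Law (concurrency = throughput × latency) underlies much of this saturation analysis. A worked example; the per-instance concurrency figure is an assumption for illustration:

```python
import math

# Little's Law: requests in flight = arrival rate x time in system.
# A first estimate of how much concurrency a target load implies.

def required_concurrency(rps: float, latency_s: float) -> float:
    return rps * latency_s

# 2000 req/s at 250 ms mean latency -> 500 requests in flight
print(required_concurrency(2000, 0.25))  # 500.0

# If each instance comfortably handles 25 concurrent requests
# (an assumed figure), instances needed:
print(math.ceil(required_concurrency(2000, 0.25) / 25))  # 20
```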
Resources
- Google SRE book (sre.google, free)
- "Designing Distributed Systems" by Brendan Burns (free online)
- Netflix Tech Blog (free)
- Martin Fowler's patterns documentation (free)
- Chaos Mesh documentation (free)
Stage 09
Scripting & Automation
Software engineering is what separates SRE from traditional operations. Writing code that eliminates toil is the primary differentiator.
Python for SRE
- All fundamentals — data types, functions, classes, error handling, file I/O
- HTTP clients — requests library for API calls to cloud providers, monitoring systems
- boto3 / azure-sdk / google-cloud — cloud SDK automation
- Kubernetes Python client — programmatic cluster management
- Prometheus Python client — instrumenting custom applications
- Datadog, PagerDuty, Slack API integrations — automation and alerting
- Practical SRE scripts:
  - Health checker — polling service endpoints, alerting on failure
  - Auto-remediation — detecting high disk usage, cleaning old logs/artifacts
  - Deployment validator — running smoke tests after deployment, triggering rollback on failure
  - Capacity report — querying cloud APIs for resource utilization, cost trend
  - Incident timeline builder — aggregating logs from multiple sources for postmortem
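A minimal version of the health-checker idea, using only the standard library; the throwaway `http.server` endpoint stands in for a real service so the sketch is self-contained:

```python
import http.server
import threading
import urllib.error
import urllib.request

def check(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False

class Health(http.server.BaseHTTPRequestHandler):
    """Stand-in service exposing a trivial health endpoint."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # keep the demo quiet
        pass

srv = http.server.HTTPServer(("127.0.0.1", 0), Health)  # port 0: OS picks
threading.Thread(target=srv.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{srv.server_address[1]}/healthz"

healthy = check(url)
print(healthy)  # True
srv.shutdown()
```

A real version would loop over a list of endpoints on a schedule and page (or open a ticket) on repeated failures.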
Go for SRE
- Why Go matters — Kubernetes, Prometheus, Terraform, Docker are all written in Go
- Go fundamentals — types, goroutines, channels, interfaces, error handling
- Reading and contributing to Go-based tools
- Writing simple operators and controllers for Kubernetes
Bash for SRE
- Production-grade shell scripting — error handling, logging, idempotency
- Kubernetes administration scripts — node drain, cluster upgrade automation
- Log parsing and alerting one-liners
- Cronjob scripts for scheduled maintenance
Automation Frameworks
- Ansible — configuration management and automation:
  - Playbooks, inventory, roles, handlers, templates (Jinja2)
  - Modules — command, file, template, service, package, k8s, aws_*
  - Idempotency — ensuring desired state
  - Ansible Tower / AWX — enterprise automation with RBAC and scheduling
- Terraform — infrastructure automation (see Stage 4)
- ArgoCD / Flux — GitOps-based Kubernetes automation (see Stage 5)
Resources
- Python for Everybody (Coursera, free)
- "The Go Programming Language" (free online intro)
- Ansible documentation (free)
- Kubernetes Python client documentation (free)
FAQ
Common questions
How long does it take to become an SRE?
18–24 months optimistic at 20–25 hours/week, 2–3 years realistic. SRE rewards software engineering applied to operations — strong programming, deep systems thinking, and operational maturity all compound. The fastest paths come from SDE backgrounds with infrastructure interest, or DevOps engineers who develop reliability-engineering depth. Pure operations backgrounds without programming depth struggle.
Which certifications matter for SRE?
Certified Kubernetes Administrator (CKA) and Certified Kubernetes Application Developer (CKAD) — Kubernetes is essentially a given requirement in SRE postings in 2026. AWS or GCP cloud certs depending on platform focus. Linux Foundation Certified System Administrator (LFCS). The cert market is fragmented; Kubernetes depth and observability fluency outweigh paper credentials.
Do I need a CS degree?
Helpful but not strictly required. SRE demands programming fluency (Go, Python, or Bash), distributed systems intuition, and operational maturity. Self-taught paths through bootcamps + production work compete effectively. The defining skill in SRE is observability depth — defining SLOs, instrumenting services, debugging from telemetry — which separates SRE candidates from DevOps generalists. Gartner projects 75% of enterprises will use SRE practices organization-wide by 2027.
What separates a hired SRE?
SLO definition and observability work in your portfolio. Show service-level objectives you've defined, error budgets you've operated, and incident analyses you've written. Other differentiators: Prometheus + Grafana fluency at depth, chaos engineering experience, and clear postmortem writing. Generic DevOps candidates lose to candidates who can articulate the production reliability mindset that distinguishes SRE.