Roadmap
Data Engineer
The engineer who designs, builds, and maintains the infrastructure that moves data from sources to destinations. Builds pipelines, data warehouses, and data platforms that make data available, reliable, and accessible to analysts, data scientists, and business applications.
OPTIMISTIC 18–24 months · REALISTIC 2–3 years
Stage 00
Programming Fundamentals
Data engineers are software engineers who specialize in data. Python and SQL are the foundation. Unlike data analysts, engineers write production code that others depend on.
Python — Production-Grade
- All fundamentals from Data Analyst Stage 3 plus:
- Object-Oriented Programming: - Classes and objects — `__init__`, instance variables, methods - Inheritance — extending classes; `super()` - Magic methods — `__repr__`, `__str__`, `__len__`, `__iter__` - Encapsulation — private/protected attributes; properties
- Functional programming patterns: - map(), filter(), reduce() — functional iteration - Lambda functions — `lambda x: x*2` - List/dict/set comprehensions — concise iteration - Generator functions — `yield`; memory-efficient iteration
- Error handling — try/except/finally; custom exceptions; retry logic for network failures
- Logging — `logging` module; log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL); structured logging
- Configuration management — environment variables with `os.environ`; `.env` files; `python-decouple`
- Testing — `pytest`; unit tests; fixtures; mocking external dependencies
- Type hints — `def func(x: int) -> str:`; increasingly required in production code; mypy for static checking
- Context managers — `with` statement; `__enter__` and `__exit__`; database connections
- Decorators — function decorators; `@staticmethod`, `@classmethod`, `@property`; retry decorators (see the sketch after this list)
- Packaging — modules, packages, `__init__.py`; virtual environments (venv, conda); requirements.txt; pyproject.toml
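To tie several of these items together (error handling, logging, type hints, and decorators), here is a minimal sketch of a retry decorator guarding a flaky network call; the function and parameter names are illustrative, not a standard library API.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)

def retry(attempts: int = 3, delay_seconds: float = 2.0):
    """Retry a flaky call with a fixed delay, logging each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    logger.warning("%s failed (attempt %d/%d): %s",
                                   func.__name__, attempt, attempts, exc)
                    if attempt == attempts:
                        raise
                    time.sleep(delay_seconds)
        return wrapper
    return decorator

@retry(attempts=3, delay_seconds=5.0)
def fetch_orders(api_url: str) -> list[dict]:
    """Hypothetical flaky network call guarded by the retry decorator."""
    ...
```

In real pipelines you would more often reach for a library such as `tenacity` (see the next list) rather than hand-rolling this.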
Python Data Engineering Libraries
- Pandas — see Data Analyst Stage 3; data engineering extends this with performance considerations: - Chunking — `pd.read_csv(chunksize=10000)`; processing large files in chunks - Dtype optimization — specifying dtypes at read time to reduce memory - Vectorization — avoiding loops; using apply() only when necessary
- PySpark — see Stage 4
- requests — HTTP client for API ingestion
- httpx — async HTTP client for concurrent API calls
- boto3 — AWS SDK for Python; S3, Glue, EMR, Redshift, Lambda interactions
- google-cloud-storage, google-cloud-bigquery — GCP Python clients
- azure-storage-blob, azure-synapse-analytics — Azure Python clients
- sqlalchemy — database abstraction; ORM; connection pooling; used in Airflow operators
- psycopg2, pyodbc — database drivers for PostgreSQL, SQL Server
- pydantic — data validation; schema enforcement; used in pipeline data contracts (see the ingestion sketch after this list)
- tenacity — retry logic with exponential backoff
- great_expectations — data quality testing library
- dbt-core — running dbt from Python; programmatic dbt execution
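A minimal sketch showing how several of these libraries combine in one ingestion step: `requests` pulls a page, `tenacity` retries with exponential backoff, and `pydantic` (v2 assumed) validates each record before it enters the pipeline. The endpoint URL and the `Order` fields are hypothetical.

```python
import requests
from pydantic import BaseModel
from tenacity import retry, stop_after_attempt, wait_exponential

class Order(BaseModel):
    """Data contract for one incoming record (hypothetical fields)."""
    order_id: int
    amount: float
    currency: str = "USD"

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, max=60))
def fetch_page(url: str, page: int) -> list[dict]:
    """Pull one page from the API, retrying with exponential backoff on failure."""
    response = requests.get(url, params={"page": page}, timeout=30)
    response.raise_for_status()
    return response.json()["results"]

def ingest(url: str) -> list[Order]:
    """Validate every record against the contract before it enters the pipeline."""
    raw = fetch_page(url, page=1)
    return [Order.model_validate(record) for record in raw]

orders = ingest("https://api.example.com/v1/orders")  # hypothetical endpoint
```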
SQL — Production-Grade
- All SQL from Data Analyst Stage 2 plus engineering patterns:
- Transaction management — BEGIN, COMMIT, ROLLBACK; ISOLATION LEVELS; advisory locks
- Performance optimization — EXPLAIN ANALYZE; index design; query rewrites
- Schema management — ALTER TABLE; migration scripts; backward-compatible changes
- Window functions at scale — understanding execution plans for window function queries
- Bulk operations — INSERT INTO SELECT; CREATE TABLE AS SELECT (CTAS); MERGE statements
- SQL in pipelines — generating SQL dynamically; parameterized queries (preventing injection)
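As a concrete illustration of the last item, a minimal parameterized-query sketch using `psycopg2`; the connection string and the `daily_sales` table are hypothetical. The driver substitutes the values safely, so input never gets string-formatted into SQL.

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics user=etl password=secret host=localhost")  # hypothetical DSN

region = "EU"          # values arrive from config or upstream tasks
min_amount = 100.0

with conn, conn.cursor() as cur:
    # %s placeholders let the driver handle quoting and escaping, preventing injection
    cur.execute(
        "SELECT order_id, amount FROM daily_sales WHERE region = %s AND amount >= %s",
        (region, min_amount),
    )
    rows = cur.fetchall()
conn.close()
```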
Resources
- "Python Crash Course" by Matthes (book)
- "Fluent Python" by Ramalho (advanced Python book)
- Real Python (free)
- pytest documentation (free)
Stage 01
Data Modeling
Data models are the architectural blueprint of the warehouse. Well-designed models enable fast, reliable analytics. Poor models create technical debt that compounds.
Dimensional Modeling (Kimball Methodology)
- Why dimensional modeling — optimized for query performance and business user comprehension
- Fact tables — the quantitative measurements; one row per business event: - Transactional fact table — one row per transaction (most common) - Periodic snapshot — one row per period per entity - Accumulating snapshot — one row per business process instance; updated as milestones are reached
- Dimension tables — the descriptive context for facts: - Conformed dimensions — shared across multiple fact tables; enables joining across subject areas - Junk dimensions — consolidating low-cardinality flags into one table - Role-playing dimensions — same dimension used for different contexts (ship date, order date, delivery date all using the date dimension) - Degenerate dimension — dimension attribute stored in fact table (order number, invoice number)
- Star schema — fact table surrounded by dimension tables; best for BI performance
- Snowflake schema — normalized dimension tables; saves space, slower for queries; rarely preferred in modern cloud warehouses
- Slowly Changing Dimensions (SCDs): - Type 1 — overwrite old value; no history - Type 2 — add new row with effective dates; preserves full history; `valid_from`, `valid_to`, `is_current` columns - Type 3 — add new column for current and previous value; limited history - Type 4 — use history table; separate current and historical storage
- Surrogate keys — warehouse-generated integer keys replacing source system natural keys; stability, consistent data type, integration of multiple sources
- Bridge tables — many-to-many relationships between facts and dimensions
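A minimal pandas sketch of the SCD Type 2 logic described above, assuming the dimension table already carries `valid_from`, `valid_to`, and `is_current` columns; in a warehouse this is usually expressed as a MERGE statement or a dbt snapshot instead.

```python
import pandas as pd

def apply_scd2(dim: pd.DataFrame, incoming: pd.DataFrame,
               key: str, run_ts: pd.Timestamp) -> pd.DataFrame:
    """Expire changed rows and insert new current versions (SCD Type 2, sketch)."""
    attrs = [c for c in incoming.columns if c != key]
    current = dim.loc[dim["is_current"], [key] + attrs]
    # Rows that are new, or whose tracked attributes differ from the current version
    diff = incoming.merge(current, on=[key] + attrs, how="left", indicator=True)
    changed = diff.loc[diff["_merge"] == "left_only", list(incoming.columns)]
    # Close out the current version of every changed key
    expire = dim[key].isin(changed[key]) & dim["is_current"]
    dim.loc[expire, "valid_to"] = run_ts
    dim.loc[expire, "is_current"] = False
    # Append the new current versions
    new_rows = changed.assign(valid_from=run_ts, valid_to=pd.NaT, is_current=True)
    return pd.concat([dim, new_rows], ignore_index=True)
```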
Data Vault Modeling
- Alternative to Kimball for enterprise data warehousing; better for audit and compliance requirements
- Components: - Hubs — unique business keys with no context - Links — relationships between hubs (association) - Satellites — attributes and descriptive context for hubs and links; timestamped for history
- Raw vault vs business vault — raw data vs business-rule applied data
- Data Vault 2.0 — adds hash keys and parallel load patterns
Modern Approaches
- Wide tables / denormalized analytics tables — flatten everything into one big table; common with modern columnar warehouses
- Analytics engineering patterns (dbt): - Staging layer — clean and standardize raw data - Intermediate layer — join and restructure - Mart layer — business-ready tables organized by domain
- OBT (One Big Table) — extreme denormalization for simplicity
- Activity schema — event-based alternative to dimensional modeling
Data Modeling Tools
- dbt — most common implementation tool for warehouse data models
- Lucidchart / draw.io / dbdiagram.io — ERD (Entity-Relationship Diagram) tools
- dbt-erdiagram / dbt-docs — auto-generated documentation from dbt models
Resources
- "The Data Warehouse Toolkit" by Kimball and Ross (the canonical reference book)
- dbt Learn data modeling course (free)
- dbdiagram.io (free for diagrams)
Stage 02
ETL/ELT Pipelines and Orchestration
Building reliable, maintainable data pipelines is the core engineering discipline. Understanding ETL/ELT patterns and orchestration is the day-to-day job.
ETL vs ELT
- ETL (Extract, Transform, Load) — traditional pattern; transform before loading; required when target lacks processing power; Informatica, Talend, SSIS
- ELT (Extract, Load, Transform) — modern pattern; load raw data first, transform in warehouse; cloud warehouses have enough compute; dbt + Snowflake/BigQuery/Redshift
Extraction Patterns
- Full load — extract all data every time; simple; expensive for large tables
- Incremental load: - Timestamp-based — `WHERE updated_at > last_run_timestamp`; requires reliable updated_at column - CDC (Change Data Capture) — capture inserts/updates/deletes at database log level; Debezium, AWS DMS, Fivetran - Watermark — tracking last successfully processed record (see the watermark sketch after this list)
- API extraction patterns: - REST APIs — pagination (cursor-based, offset, page-based); rate limiting; authentication (OAuth 2.0, API key, JWT) - GraphQL — query only needed fields; avoid over-fetching - Webhooks — event-driven; receive data when it changes; reliability challenges - Bulk export APIs — large data extracts; S3/GCS destination
- Database replication: - Logical replication — row-level changes; used for cross-database replication - Binlog replication — MySQL binary log; Debezium connector - WAL streaming — PostgreSQL write-ahead log; logical decoding
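A minimal sketch of the timestamp-plus-watermark pattern from the incremental-load item above, using SQLAlchemy and pandas; the `orders` table, connection URL, and file-based watermark store are hypothetical (production pipelines usually keep the watermark in a metadata table).

```python
import json
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

STATE_FILE = Path("state/orders_watermark.json")  # hypothetical watermark store

def extract_incremental(source_url: str) -> pd.DataFrame:
    """Timestamp-based incremental extract driven by a persisted watermark (sketch)."""
    engine = create_engine(source_url)
    last_ts = (json.loads(STATE_FILE.read_text())["last_run"]
               if STATE_FILE.exists() else "1970-01-01 00:00:00")
    query = text("SELECT * FROM orders WHERE updated_at > :last_ts ORDER BY updated_at")
    df = pd.read_sql(query, engine, params={"last_ts": last_ts})
    if not df.empty:
        # Advance the watermark only after a successful extract
        STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
        STATE_FILE.write_text(json.dumps({"last_run": str(df["updated_at"].max())}))
    return df
```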
ELT Tools (Managed Connectors)
- Fivetran — managed SaaS connectors; 500+ sources; automated schema migration; hands-off ELT
- Airbyte — open-source alternative to Fivetran; self-hosted or cloud; custom connector SDK
- Stitch — simple SaaS ELT; Singer protocol
- AWS Glue — managed ETL service; Spark-based; schema inference; Glue Crawlers for data catalog
- Azure Data Factory — managed ETL; drag-and-drop; connects Azure services and external sources
- Google Cloud Dataflow — Apache Beam-based; batch and streaming; serverless
Apache Airflow — The Dominant Orchestrator
- Architecture: - Webserver — UI for monitoring and triggering - Scheduler — parsing and scheduling DAGs - Executor — running tasks; LocalExecutor (single machine), CeleryExecutor (distributed), KubernetesExecutor (k8s) - Workers — executing task instances - Metadata database — PostgreSQL; storing DAG state, task state, XCom - DAG storage — S3, Git sync, or local filesystem
- DAGs (Directed Acyclic Graphs) — Python files defining workflow structure:

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
}

with DAG(
    dag_id='daily_sales_pipeline',
    default_args=default_args,
    schedule_interval='0 6 * * *',  # 6am daily
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['sales', 'daily'],
) as dag:
    extract = PythonOperator(task_id='extract_sales', python_callable=extract_func)
    transform = PythonOperator(task_id='transform_sales', python_callable=transform_func)
    load = PythonOperator(task_id='load_to_warehouse', python_callable=load_func)

    extract >> transform >> load  # dependency definition
```
- Key Airflow concepts: - Operators — BashOperator, PythonOperator, SQLExecuteQueryOperator, KubernetesPodOperator, BigQueryOperator - Sensors — waiting for external events; FileSensor, HttpSensor, S3KeySensor, ExternalTaskSensor - Hooks — connecting to external systems; database hooks, cloud hooks; abstraction over credentials - XCom — cross-task communication; passing small values between tasks - Variables — storing configuration values; accessible across DAGs - Connections — storing credentials; referenced by conn_id in operators - Task instances and DAG runs — one DAG run per schedule interval; task instance = task in a DAG run - Catchup — running missed intervals when DAG is enabled; usually disabled - Backfill — manually running historical periods: `airflow dags backfill -s 2024-01-01 -e 2024-12-31 dag_id` - SLA misses — alerting when tasks take longer than expected - Pools — limiting concurrent task execution; preventing resource exhaustion - Priority weights — prioritizing critical tasks when resources are constrained
- Airflow best practices: - Idempotent tasks — running twice produces same result as running once; no duplicate data - Atomic tasks — each task succeeds or fails completely; no partial state - Small DAGs — avoid monolithic DAGs; split into logical components - Avoid dynamic task generation — hard to debug; use TaskGroups instead - External task sensor — dependency between DAGs; avoid tight coupling
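To make the idempotency and atomicity points concrete, here is a sketch of a task body that deletes and reloads exactly one date partition inside a single transaction, so reruns and backfills give the same result; the table names and connection URL are hypothetical.

```python
import sqlalchemy

WAREHOUSE_URL = "postgresql://etl:secret@warehouse:5432/analytics"  # hypothetical

def load_partition(ds: str) -> None:
    """Delete-then-insert one date partition in a single transaction (idempotent, atomic)."""
    engine = sqlalchemy.create_engine(WAREHOUSE_URL)
    with engine.begin() as conn:
        conn.execute(sqlalchemy.text(
            "DELETE FROM sales_daily WHERE sale_date = :ds"), {"ds": ds})
        conn.execute(sqlalchemy.text(
            "INSERT INTO sales_daily SELECT * FROM staging_sales WHERE sale_date = :ds"), {"ds": ds})
```

Wired into a DAG, this would typically receive the templated execution date, e.g. `PythonOperator(task_id='load', python_callable=load_partition, op_kwargs={'ds': '{{ ds }}'})`.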
Alternative Orchestrators
- Prefect — Python-first; easier local development; Prefect Cloud vs Prefect Server
- Dagster — asset-centric; software-defined assets; strong lineage
- dbt Cloud — dbt job scheduling; CI/CD for dbt models
- AWS Step Functions — serverless; visual workflows; integrates with all AWS services
- Google Cloud Composer — managed Airflow on GCP
dbt — Analytics Engineering
- What dbt does: - Brings software engineering best practices to SQL transformations - Version control — all transformation logic in Git - Modularity — reusable models using `ref()` and `source()` - Testing — built-in and custom tests for data quality - Documentation — auto-generated from model descriptions - Lineage — automatic data lineage graph
- dbt project structure: - `models/` — SQL files; each file becomes a table or view in the warehouse - `seeds/` — CSV lookup files committed to version control - `snapshots/` — SCD Type 2 implementation - `tests/` — custom test SQL files - `macros/` — Jinja2 macros for reusable SQL logic - `dbt_project.yml` — project configuration; model materialization defaults - `sources.yml` — declaring raw data sources; source freshness checks - `schema.yml` — model documentation and tests
- Model materialization: - View — default; recreated each query; no storage cost; slow for complex logic - Table — materialized as a physical table; fast queries; must be refreshed - Incremental — appends or overwrites only new/changed data; efficient for large tables - `{{ config(materialized='incremental', unique_key='id') }}` - `{% if is_incremental() %} WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }}) {% endif %}` - Ephemeral — CTE; not materialized; used as building block
- dbt tests: - Built-in: `not_null`, `unique`, `accepted_values`, `relationships` - dbt-utils: `expression_is_true`, `at_least_one`, `recency`, `equality` - Custom: `tests/` directory; any SQL returning rows = test failure - Source freshness: `loaded_at_field`; alerting on stale sources
- Jinja2 in dbt: - `{{ ref('model_name') }}` — referencing other models; creates dependency - `{{ source('schema', 'table') }}` — referencing raw tables - `{{ config(materialized='table') }}` — configuring model behavior - `{% set variable = 'value' %}` — setting variables - `{{ var('variable_name') }}` — accessing project variables - `{{ env_var('ENV_VAR') }}` — accessing environment variables
- dbt incremental strategies: - append — adds new rows; no updates - merge (UPSERT) — inserts new, updates existing based on unique_key - delete+insert — deletes then re-inserts changed partitions - insert_overwrite — replaces partitions
- dbt Cloud vs dbt Core — Cloud: managed scheduling, IDE, CI/CD; Core: open-source, runs anywhere
Resources
- dbt Fundamentals course (free at courses.getdbt.com)
- Apache Airflow documentation (free)
- "The Fundamentals of Data Engineering" by Reis and Housley (book)
- dbt Community Slack (free)
Stage 03
Cloud Data Platforms
Data engineering in 2026 is predominantly cloud-based. Deep fluency in at least one cloud provider is required; awareness of all three is expected.
Cloud Storage and Object Storage
- AWS S3: - Buckets, prefixes, objects — S3 has no real directories; prefixes simulate them - Storage classes — Standard, Intelligent-Tiering, Standard-IA, Glacier; cost vs access tradeoff - Lifecycle policies — automatically transition or expire objects - Versioning — keeping multiple versions of objects - Event notifications — triggering Lambda on PUT/DELETE - Presigned URLs — time-limited access without credentials - S3 Select — querying CSV/JSON/Parquet files with SQL without downloading - boto3 patterns:

```python
import boto3

s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket', Key='prefix/file.csv', Body=data)
s3.get_object(Bucket='my-bucket', Key='prefix/file.csv')
paginator = s3.get_paginator('list_objects_v2')
```
- Google Cloud Storage (GCS) — similar to S3; Buckets and Objects; `google-cloud-storage`
- Azure Blob Storage — Containers and Blobs; Data Lake Storage Gen2 adds HDFS-compatible filesystem
Cloud Data Warehouses — Engineering Deep Dive
- Snowflake: - Architecture — separate storage (S3/Blob/GCS) and compute (virtual warehouses) - Virtual warehouses — size determines query speed and cost; auto-suspend, auto-resume - Stages — internal and external stages for loading data - COPY INTO — bulk loading from stage to table - Snowpipe — continuous auto-ingest from cloud storage - Tasks — Snowflake's built-in scheduler (alternative to Airflow for Snowflake-only workflows) - Streams — CDC on Snowflake tables; captures inserts/updates/deletes - Dynamic tables — automatically maintained derived tables; declarative pipelines - Zero-copy cloning — instantly clone databases/schemas/tables for testing - Time travel — historical queries; UNDROP - Data sharing — sharing data across Snowflake accounts without copying - Clustering keys — improving scan performance on large tables - Search optimization — for point lookup queries - MERGE statement for SCD Type 2 in Snowflake
- Google BigQuery: - Serverless — no infrastructure management; pay per TB scanned - Columnar storage — excellent for aggregations on large tables; poor for single-row lookups - Partitioning — by date, timestamp, or integer range; partition pruning reduces scanned data - Clustering — ordering within partitions; like an index; applies after partitioning - Materialized views — precomputed query results; auto-refreshed - BigQuery ML — training ML models with SQL - External tables — querying GCS files directly (Parquet, CSV, Avro, ORC) - Data Transfer Service — managed connectors from Google products, SaaS - Reservations and slot commitments — predictable pricing for large workloads
- Amazon Redshift: - MPP (Massively Parallel Processing) — distributes data across nodes - Distribution styles — EVEN, KEY (same key on same node for joins), ALL (small tables replicated) - Sort keys — compound vs interleaved; data ordered on disk; dramatically improves range queries - Redshift Spectrum — querying S3 data directly from Redshift - COPY command — bulk loading from S3 - Vacuum and analyze — maintaining table health; required for performance - Leader node vs compute nodes — queries compiled and distributed from leader - Redshift Serverless — no cluster management; scales automatically
- Databricks: - Unified analytics platform — notebooks, jobs, SQL, ML, Delta Lake in one platform - Delta Lake — open-source storage layer on object storage; ACID transactions on data lakes - Time travel — `SELECT * FROM table VERSION AS OF 10` - Schema enforcement — prevents incompatible writes - Schema evolution — `mergeSchema = true` for adding columns - Optimize and Z-Ordering — compaction + co-location for query speed - Vacuum — removing old versions - Unity Catalog — centralized data governance; fine-grained access control; lineage - Delta Live Tables (DLT) — declarative pipeline framework; automatic error recovery - Databricks SQL — serverless SQL warehouse; BI tool connections - Notebooks — Python, SQL, Scala, R; git integration; widgets for parameterization
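As one concrete example from the BigQuery items above, a hedged sketch of loading Parquet files into a partitioned, clustered table with the `google-cloud-bigquery` client; the GCS path and table ID are hypothetical, and credentials are assumed to be configured in the environment.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials are configured

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(field="order_date"),  # partition by date column
    clustering_fields=["region", "product_id"],                       # cluster within partitions
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/orders/*.parquet",   # hypothetical GCS path
    "my_project.analytics.orders",       # hypothetical table id
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```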
Cloud Infrastructure for Data Engineering
- AWS: - AWS Glue — managed ETL; Glue Crawlers for data catalog; DynamicFrames for schema flexibility - AWS Lambda — serverless Python functions; event-driven; 15-minute limit - AWS Step Functions — visual workflow orchestration; integrates with all AWS services - Amazon EMR — managed Hadoop/Spark; cluster-based big data processing - Amazon Kinesis — real-time data streaming; Kinesis Data Streams, Firehose, Analytics - Amazon RDS — managed relational databases; source for data pipelines - AWS DMS — Database Migration Service; CDC for ongoing replication
- GCP: - Google Cloud Composer — managed Airflow on GCP; deep GCP integration - Google Cloud Dataflow — Apache Beam; batch and streaming; auto-scaling - Google Cloud Pub/Sub — managed message queue; event streaming - Google Cloud Dataproc — managed Spark/Hadoop - Cloud Functions — serverless; event-driven; 60-minute timeout
- Azure: - Azure Data Factory (ADF) — managed ETL with visual designer; 90+ connectors - Azure Synapse Analytics — unified analytics; Synapse SQL + Spark + Pipelines - Azure Event Hubs — managed Kafka-compatible event streaming - Azure Databricks — Databricks on Azure infrastructure - Microsoft Fabric — new unified platform; OneLake; direct lake mode for Power BI
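A minimal sketch of the event-driven Lambda pattern from the AWS list above: a function triggered by an S3 object-created notification reads the new file and parses it; the bucket contents and downstream handling are hypothetical.

```python
import json
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """S3-triggered Lambda: read each newly landed object and parse its JSON lines (sketch)."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # object keys arrive URL-encoded
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = [json.loads(line) for line in body.splitlines() if line]
        # hand the parsed rows to the next step, e.g. write to a raw zone or a queue
        print(f"parsed {len(rows)} rows from s3://{bucket}/{key}")
```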
Infrastructure as Code for Data Engineers
- Terraform — provisioning cloud infrastructure; data engineering resources as code - `aws_s3_bucket`, `aws_glue_job`, `aws_redshift_cluster`, `snowflake_warehouse`
- Docker — containerizing data pipelines; consistent environments; Airflow runs in Docker
- Kubernetes — orchestrating containerized pipelines; Airflow on K8s; Spark on K8s
Resources
- Snowflake documentation (free)
- BigQuery documentation (free)
- Databricks documentation (free)
- dbt-labs GitHub (free)
- Cloud provider free tiers for practice
Stage 04
Distributed Processing and Streaming
Large-scale data processing beyond what a single machine handles. Apache Spark for batch; Kafka + Flink/Spark Streaming for real-time.
Apache Spark
- Architecture: - Driver — program that runs Spark application; SparkContext/SparkSession - Cluster Manager — Standalone, YARN (Hadoop), Mesos, Kubernetes - Workers — nodes running Executors - Executors — JVM processes on workers; run tasks; store data in memory/disk - Jobs, Stages, Tasks — hierarchy of work; job = action; stage = shuffle boundary; task = partition processing
- Core concepts: - RDD (Resilient Distributed Dataset) — low-level; immutable; fault-tolerant; avoid in modern Spark - DataFrame — distributed tabular data; Spark SQL API; JVM optimization - Dataset — strongly-typed DataFrame; Scala/Java only - Transformations vs Actions: - Transformations are lazy — `filter()`, `map()`, `groupBy()`, `join()`, `select()`, `withColumn()`; build execution plan - Actions trigger execution — `count()`, `collect()`, `show()`, `write()`, `take()` - Lazy evaluation — transformations build DAG; optimized before execution - Catalyst optimizer — Spark's query optimizer; generates optimal physical plan - Tungsten engine — memory management and code generation for performance
- PySpark — Python API for Spark:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, window, to_timestamp
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("pipeline").getOrCreate()

df = spark.read.parquet("s3://bucket/data/")
df_clean = df.filter(col("amount") > 0).dropDuplicates(["id"])
df_agg = df_clean.groupBy("region", "product").agg(
    sum("amount").alias("total_sales"),
    count("id").alias("order_count")
)
df_agg.write.mode("overwrite").partitionBy("region").parquet("s3://bucket/output/")
```
- Performance tuning: - Partitioning — controlling data distribution across executors - Number of partitions — `spark.default.parallelism`; `repartition()` vs `coalesce()`; 2–3x number of CPU cores is typical - Shuffle operations — joins, groupBy, distinct; expensive; minimize; broadcast join for small tables - Broadcast joins — `broadcast(small_df)`; replicate small table to all executors; avoids shuffle - Caching — `df.cache()` or `df.persist(StorageLevel.MEMORY_AND_DISK)`; caching reused DataFrames - Skew handling — salting; bucketing; AQE (Adaptive Query Execution) - File format — Parquet preferred over CSV for analytics; columnar; predicate pushdown; compression - Partition pruning — querying only relevant partitions via filter on partition column - AQE (Adaptive Query Execution) — Spark 3.0+; dynamically optimizes at runtime
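A short PySpark sketch of two of the tuning items above, broadcast joins and caching; the paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("tuning_sketch").getOrCreate()

orders = spark.read.parquet("s3://bucket/orders/")        # large fact data
regions = spark.read.parquet("s3://bucket/dim_regions/")  # small dimension table

# Broadcast the small table to every executor so the join needs no shuffle
enriched = orders.join(broadcast(regions), on="region_id", how="left")

# Cache a DataFrame that several downstream aggregations will reuse
enriched = enriched.filter(col("amount") > 0).cache()

daily = enriched.groupBy("order_date").count()
by_region = enriched.groupBy("region_name").count()
```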
Apache Kafka — Event Streaming
- Architecture: - Producers — publish messages to topics - Brokers — servers that store and serve messages; replicated across cluster - Topics — logical categories for messages; partitioned for parallelism - Partitions — ordered, immutable sequence of records; enabling parallelism - Consumer Groups — one consumer per partition per group; parallel consumption - Offsets — position of consumer in partition; committed to track progress - Zookeeper (legacy) / KRaft (modern) — metadata management
- Core concepts: - Retention — configurable time-based or size-based retention - Replication factor — copies of each partition across brokers; durability - Producer acknowledgments — `acks=all` for durability; `acks=0` for speed - Exactly-once semantics — idempotent producers + transactional producers - Consumer lag — how far behind consumers are; key operational metric - Schema Registry — Confluent schema registry; enforcing Avro/Protobuf/JSON schema - Kafka Connect — managed connectors for sources and sinks; JDBC connector, S3 Sink, Debezium - Kafka Streams — client library for stream processing on Kafka data
- Producer in Python:

```python
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
producer.send('orders', value={'order_id': 123, 'amount': 99.99})
producer.flush()
```
- Consumer in Python:

```python
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'orders',
    group_id='order-processor',
    bootstrap_servers=['kafka:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='earliest'
)
for message in consumer:
    process(message.value)
    consumer.commit()
```
Stream Processing
- Spark Structured Streaming: - Micro-batch processing — processes data in small batches; near real-time - Trigger intervals — 0 seconds (continuous), 1 minute, once - Watermarking — handling late-arriving data: `withWatermark("timestamp", "10 minutes")` - Output modes — append (new rows only), update (changed rows), complete (full results) - Checkpointing — fault tolerance; recoverable from failures - Windowing — tumbling, sliding, session windows for aggregations
- Apache Flink: - True stream processing — event-by-event, not micro-batch - Lower latency than Spark Streaming - Time semantics — event time vs processing time; watermarks - Stateful processing — fault-tolerant state with RocksDB backend - Flink SQL — SQL-based stream processing; unified batch and stream API - Growing adoption for low-latency use cases (fraud detection, real-time recommendations)
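A minimal Structured Streaming sketch combining the Kafka source with watermarking, tumbling windows, and the update output mode; it assumes the `spark-sql-kafka` package is on the classpath, and the topic, schema, and checkpoint path are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window, sum as spark_sum
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("orders_stream").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "orders")
       .load())

orders = raw.select(from_json(col("value").cast("string"), schema).alias("o")).select("o.*")

agg = (orders
       .withWatermark("event_time", "10 minutes")       # tolerate 10 minutes of late data
       .groupBy(window(col("event_time"), "5 minutes"))  # tumbling 5-minute windows
       .agg(spark_sum("amount").alias("total_amount")))

query = (agg.writeStream
         .outputMode("update")
         .format("console")  # sketch sink; a real pipeline would write to a table or topic
         .option("checkpointLocation", "/tmp/checkpoints/orders")  # hypothetical path
         .start())
query.awaitTermination()
```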
File Formats
- Parquet — columnar; efficient for analytics; predicate pushdown; compression; Pandas/Spark/BigQuery all support
- Avro — row-based; schema evolution; good for streaming/Kafka; not great for analytics
- ORC — columnar; originated in the Hadoop ecosystem; strong Hive support; losing ground to Parquet
- Delta Lake — Parquet + transaction log; ACID on object storage; time travel
- Apache Iceberg — open table format; table evolution; partition evolution; multi-engine support (Spark, Flink, Trino)
- Apache Hudi — CDC patterns; upsert on data lakes; incremental processing
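A small pandas sketch (pyarrow engine assumed) showing why columnar Parquet pays off: partitioned writes plus column pruning and predicate pushdown on read; the local path and toy data are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [10.0, 25.5, 7.9],
})

# Write a partitioned Parquet dataset
df.to_parquet("orders_parquet/", partition_cols=["region"], engine="pyarrow")

# Read back only the columns and partitions needed: column pruning plus predicate pushdown
eu_orders = pd.read_parquet(
    "orders_parquet/",
    columns=["order_id", "amount"],
    filters=[("region", "=", "EU")],
    engine="pyarrow",
)
print(eu_orders)
```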
Resources
- Apache Spark documentation (free)
- Databricks Spark learning (free)
- Confluent Kafka documentation (free)
- "Designing Data-Intensive Applications" by Kleppmann (essential book)
Stage 05
Data Quality, Observability, and Governance
Production data pipelines fail. Data changes unexpectedly. These engineering disciplines prevent silent data failures.
Data Quality
- Dimensions of data quality — completeness, accuracy, consistency, timeliness, validity, uniqueness
- Data quality testing categories: - Schema tests — column exists, data type correct, not null - Content tests — expected values only, referential integrity, range checks - Statistical tests — outlier detection, distribution shift, unexpected zeros - Freshness tests — data arrived within expected window
- Testing frameworks: - dbt tests — declarative YAML-based tests close to the model; schema.yml - Great Expectations — Python library; expectations suite; integration with Airflow/dbt - `expect_column_to_exist`, `expect_column_values_to_not_be_null` - `expect_column_values_to_be_between`, `expect_table_row_count_to_be_between` - Data docs — auto-generated HTML documentation of expectations and results - Soda Core — open-source; YAML-based checks; pushes results to Soda Cloud
- Data contracts: - Formal agreement between data producers and consumers - Schema, semantics, SLAs, ownership - Tools: Data Contract CLI, dbt model contracts
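Before reaching for a framework, the check categories above can be hand-rolled in a few lines; this sketch assumes a batch with hypothetical `order_id`, `amount`, and `updated_at` columns and returns the list of failures for the pipeline to act on.

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame, max_staleness_hours: int = 24) -> list[str]:
    """Return a list of failed checks; an empty list means the batch passed (sketch)."""
    failures = []
    # Schema tests: required columns exist
    for column in ("order_id", "amount", "updated_at"):
        if column not in df.columns:
            failures.append(f"missing column: {column}")
    if failures:
        return failures  # content checks below assume the schema is present
    # Content tests: uniqueness, nulls, value ranges
    if not df["order_id"].is_unique:
        failures.append("order_id is not unique")
    if df["order_id"].isna().any():
        failures.append("order_id contains nulls")
    if (df["amount"] < 0).any():
        failures.append("negative amounts found")
    # Freshness test: newest record within the expected window
    newest = pd.to_datetime(df["updated_at"], utc=True).max()
    staleness = pd.Timestamp.now(tz="UTC") - newest
    if staleness > pd.Timedelta(hours=max_staleness_hours):
        failures.append(f"data is stale by {staleness}")
    return failures
```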
Data Observability
- What data observability is — understanding the health of data as it flows through pipelines
- Five pillars of data observability (Monte Carlo): - Freshness — is data up to date? - Volume — is the expected amount of data present? - Distribution — have value ranges or patterns changed? - Schema — have column names or types changed? - Lineage — which upstream tables and pipelines feed this data?
- Tools: - Monte Carlo — leading commercial observability platform; ML-based anomaly detection - Bigeye — data monitoring; anomaly detection; SLA tracking - Datafold — diff-based data quality; column-level impact analysis - Lightdash — open-source; integrates with dbt - dbt Cloud — model run monitoring; test failures; source freshness
- Data lineage: - Column-level lineage — tracing transformations field by field - OpenLineage — open standard for lineage metadata - Marquez — open-source lineage server; implements OpenLineage - Atlas (Apache) — data governance and metadata framework; open-source
Data Governance
- Data catalog — inventory of all datasets; business description, owner, schema, lineage - Datahub — LinkedIn open-source; widely adopted; integrates with Airflow, dbt, Spark - Collibra — enterprise data governance; data dictionary; stewardship workflows - Alation — enterprise data catalog; SQL query intelligence - Unity Catalog (Databricks) — governance for Databricks; column-level security; auditing - BigQuery Data Catalog — native GCP catalog
- Data classification — PII, PHI, PCI data; tagging sensitive columns; driving access control
- Access control: - Role-based access control — database roles; column masking; row-level security - Attribute-based access control — Unity Catalog; policy expressions - Data masking — masking sensitive data in non-production environments
- Data retention — policies for how long different data types are kept; GDPR deletion rights
- GDPR/CCPA engineering implications: - Right to erasure — deleting user data across all systems; compiling the data lineage to find all copies - Data minimization — collecting only what is needed - Consent management — tracking which users consented to which processing
Resources
- Great Expectations documentation (free)
- dbt testing guide (free)
- Monte Carlo blog (free)
- "Fundamentals of Data Engineering" (book)
Stage 06
DataOps and Software Engineering Practices
Production data systems require the same rigor as application software: version control, CI/CD, testing, monitoring.
Version Control for Data
- Git fundamentals — branching (feature branches, main/prod), PRs, code review
- dbt + Git — every transformation change committed; reviewed; tested in CI before production
- Configuration as code — Airflow DAGs in Git; Terraform for infrastructure
- Data pipeline versioning — semantic versioning for pipeline releases
CI/CD for Data Pipelines
- GitHub Actions / GitLab CI for dbt: - Run dbt build on every PR against a dev environment - Run dbt test suite; block merge on test failure - Deploy to production on merge to main - Example:

```yaml
name: dbt CI
on: [pull_request]
jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: pip install dbt-snowflake
      - run: dbt build --select state:modified --defer --state prod/
      - run: dbt test --select state:modified
```
- Slim CI — run only changed models; `--select state:modified`; `--defer` to use prod artifacts for unchanged upstream
- Blue/green deployments — maintaining two environments; switching when new version is validated
Containerization
- Docker for data pipelines: - Airflow runs in Docker Compose for local development - Pipeline tasks as Docker containers — consistent environment; dependency isolation - ECR/GCR/ACR — container registries for storing pipeline images
- Kubernetes for data pipelines: - KubernetesPodOperator in Airflow — running tasks as K8s pods - Spark on Kubernetes — replacing YARN for Spark workloads - Helm charts — package manager for K8s applications; Airflow, Kafka
Monitoring and Alerting
- Pipeline monitoring: - Airflow — task duration alerts; SLA misses; email/Slack/PagerDuty integration - Datadog, Grafana, CloudWatch — infrastructure metrics; query performance - Custom metrics — pipeline row counts, processing times, error rates in time-series database
- Data freshness monitoring: - dbt source freshness checks - Great Expectations freshness expectations - Custom SQL to check MAX(updated_at) against SLA
- On-call for data — runbooks for common pipeline failures; escalation procedures
- Post-mortems — blameless analysis of data incidents; root cause + prevention
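A minimal sketch of pipeline alerting from the monitoring items above: an Airflow failure callback posts the failing task and its log URL to a Slack webhook, and an SLA is attached via `default_args`; the webhook URL and DAG are hypothetical.

```python
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # hypothetical webhook

def notify_slack(context):
    """Failure callback: post the failing task, run date, and log URL to Slack."""
    ti = context["task_instance"]
    requests.post(SLACK_WEBHOOK_URL, timeout=10, json={
        "text": f":red_circle: {ti.dag_id}.{ti.task_id} failed for {context['ds']} - {ti.log_url}"
    })

with DAG(
    dag_id="monitored_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "on_failure_callback": notify_slack,
        "sla": timedelta(hours=1),  # scheduler records an SLA miss if the task runs past this
    },
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: None)
```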
Resources
- "Practical DataOps" by Beragh and Jones
- GitLab CI documentation (free)
- Docker documentation (free)
FAQ
Common questions
How long does it take to become a Data Engineer?
18–24 months optimistic, 2–3 years realistic. The modern data stack (Snowflake + dbt + Airflow + Spark) takes time to learn properly, and cloud platform depth (AWS, Azure, or GCP) is no longer optional. The fastest paths in 2026 are software engineer to data engineer, data analyst with strong Python to data engineer, and analytics engineering as an intermediate step. The field shows roughly 23% growth, with salaries reaching $170K+ for experienced engineers.
Which certifications matter for data engineering?
AWS Data Engineer Associate or AWS Solutions Architect Associate for AWS. Azure Data Engineer Associate (DP-203) for Azure. Google Professional Data Engineer for GCP. Snowflake SnowPro Core for Snowflake-heavy environments. dbt Analytics Engineering Certification is increasingly listed. The cert market is fragmented because the data stack is fragmented; pick the cloud + warehouse your target employers use.
Do I need a CS degree?
Helpful but not required. Strong Python + SQL + distributed systems intuition can be self-taught, and bootcamps with rigorous data engineering tracks (DataExpert.io, DataTalks.Club) produce competitive candidates. What you absolutely need: comfort with batch and streaming concepts, schema design, and operational fluency (when does the pipeline fail at 3 AM, and how do you know?). The bar at FAANG-tier data engineering is genuinely high; mid-tier roles are accessible.
What separates a hired Data Engineer?
End-to-end pipeline ownership in your portfolio. Show one realistic data pipeline — from API/event source through ingestion, transformation in dbt or Spark, to a queryable warehouse — with documentation on schema decisions, partition strategy, and failure handling. Dashboard-only candidates lose to candidates who've built infrastructure. 94% of enterprises have embraced cloud platforms; cloud platform fluency is no longer optional. Bonus signals: open-source contributions to dbt, Airflow, or Dagster.