Roadmap
Data Engineer
The engineer who designs, builds, and maintains the infrastructure that moves data from sources to destinations. Builds pipelines, data warehouses, and data platforms that make data available, reliable, and accessible to analysts, data scientists, and business applications.
OPTIMISTIC 18–24 months · REALISTIC 2–3 years
Stage 00
Programming Fundamentals
Data engineers are software engineers who specialize in data. Python and SQL are the foundation. Unlike data analysts, engineers write production code that others depend on.
Python — Production-Grade
- All fundamentals from Data Analyst Stage 3 plus:
- Object-Oriented Programming: - Classes and objects — `__init__`, instance variables, methods - Inheritance — extending classes; `super()` - Magic methods — `__repr__`, `__str__`, `__len__`, `__iter__` - Encapsulation — private/protected attributes; properties
- Functional programming patterns: - map(), filter(), reduce() — functional iteration - Lambda functions — `lambda x: x*2` - List/dict/set comprehensions — concise iteration - Generator functions — `yield`; memory-efficient iteration
- Error handling — try/except/finally; custom exceptions; retry logic for network failures
- Logging — `logging` module; log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL); structured logging
- Configuration management — environment variables with `os.environ`; `.env` files; `python-decouple`
- Testing — `pytest`; unit tests; fixtures; mocking external dependencies
- Type hints — `def func(x: int) -> str:`; increasingly required in production code; mypy for static checking
- Context managers — `with` statement; `__enter__` and `__exit__`; database connections
- Decorators — function decorators; `@staticmethod`, `@classmethod`, `@property`; retry decorators (see the sketch after this list)
- Packaging — modules, packages, `__init__.py`; virtual environments (venv, conda); requirements.txt; pyproject.toml
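To tie several of these items together (error handling, logging, type hints, and decorators), here is a minimal sketch of a retry decorator guarding a flaky network call; the function and parameter names are illustrative, not a standard library API.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)

def retry(attempts: int = 3, delay_seconds: float = 2.0):
    """Retry a flaky call with a fixed delay, logging each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    logger.warning("%s failed (attempt %d/%d): %s",
                                   func.__name__, attempt, attempts, exc)
                    if attempt == attempts:
                        raise
                    time.sleep(delay_seconds)
        return wrapper
    return decorator

@retry(attempts=3, delay_seconds=5.0)
def fetch_orders(api_url: str) -> list[dict]:
    """Hypothetical flaky network call guarded by the retry decorator."""
    ...
```

In real pipelines you would more often reach for a library such as `tenacity` (see the next list) rather than hand-rolling this.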
Python Data Engineering Libraries
- Pandas — see Data Analyst Stage 3; data engineering extends this with performance considerations: - Chunking — `pd.read_csv(chunksize=10000)`; processing large files in chunks - Dtype optimization — specifying dtypes at read time to reduce memory - Vectorization — avoiding loops; using apply() only when necessary
- PySpark — see Stage 4
- requests — HTTP client for API ingestion
- httpx — async HTTP client for concurrent API calls
- boto3 — AWS SDK for Python; S3, Glue, EMR, Redshift, Lambda interactions
- google-cloud-storage, google-cloud-bigquery — GCP Python clients
- azure-storage-blob, azure-synapse-analytics — Azure Python clients
- sqlalchemy — database abstraction; ORM; connection pooling; used in Airflow operators
- psycopg2, pyodbc — database drivers for PostgreSQL, SQL Server
- pydantic — data validation; schema enforcement; used in pipeline data contracts (see the ingestion sketch after this list)
- tenacity — retry logic with exponential backoff
- great_expectations — data quality testing library
- dbt-core — running dbt from Python; programmatic dbt execution
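A minimal sketch showing how several of these libraries combine in one ingestion step: `requests` pulls a page, `tenacity` retries with exponential backoff, and `pydantic` (v2 assumed) validates each record before it enters the pipeline. The endpoint URL and the `Order` fields are hypothetical.

```python
import requests
from pydantic import BaseModel
from tenacity import retry, stop_after_attempt, wait_exponential

class Order(BaseModel):
    """Data contract for one incoming record (hypothetical fields)."""
    order_id: int
    amount: float
    currency: str = "USD"

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, max=60))
def fetch_page(url: str, page: int) -> list[dict]:
    """Pull one page from the API, retrying with exponential backoff on failure."""
    response = requests.get(url, params={"page": page}, timeout=30)
    response.raise_for_status()
    return response.json()["results"]

def ingest(url: str) -> list[Order]:
    """Validate every record against the contract before it enters the pipeline."""
    raw = fetch_page(url, page=1)
    return [Order.model_validate(record) for record in raw]

orders = ingest("https://api.example.com/v1/orders")  # hypothetical endpoint
```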
SQL — Production-Grade
- All SQL from Data Analyst Stage 2 plus engineering patterns:
- Transaction management — BEGIN, COMMIT, ROLLBACK; ISOLATION LEVELS; advisory locks
- Performance optimization — EXPLAIN ANALYZE; index design; query rewrites
- Schema management — ALTER TABLE; migration scripts; backward-compatible changes
- Window functions at scale — understanding execution plans for window function queries
- Bulk operations — INSERT INTO SELECT; CREATE TABLE AS SELECT (CTAS); MERGE statements
- SQL in pipelines — generating SQL dynamically; parameterized queries (preventing injection)
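As a concrete illustration of the last item, a minimal parameterized-query sketch using `psycopg2`; the connection string and the `daily_sales` table are hypothetical. The driver substitutes the values safely, so input never gets string-formatted into SQL.

```python
import psycopg2

conn = psycopg2.connect("dbname=analytics user=etl password=secret host=localhost")  # hypothetical DSN

region = "EU"          # values arrive from config or upstream tasks
min_amount = 100.0

with conn, conn.cursor() as cur:
    # %s placeholders let the driver handle quoting and escaping, preventing injection
    cur.execute(
        "SELECT order_id, amount FROM daily_sales WHERE region = %s AND amount >= %s",
        (region, min_amount),
    )
    rows = cur.fetchall()
conn.close()
```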
Resources
- "Python Crash Course" by Matthes (book)
- "Fluent Python" by Ramalho (advanced Python book)
- Real Python (free)
- pytest documentation (free)
Stage 01
Data Modeling
Data models are the architectural blueprint of the warehouse. Well-designed models enable fast, reliable analytics. Poor models create technical debt that compounds.
Dimensional Modeling (Kimball Methodology)
- Why dimensional modeling — optimized for query performance and business user comprehension
- Fact tables — the quantitative measurements; one row per business event: - Transactional fact table — one row per transaction (most common) - Periodic snapshot — one row per period per entity - Accumulating snapshot — one row per business process instance; updated as milestones are reached
- Dimension tables — the descriptive context for facts: - Conformed dimensions — shared across multiple fact tables; enables joining across subject areas - Junk dimensions — consolidating low-cardinality flags into one table - Role-playing dimensions — same dimension used for different contexts (ship date, order date, delivery date all using the date dimension) - Degenerate dimension — dimension attribute stored in fact table (order number, invoice number)
- Star schema — fact table surrounded by dimension tables; best for BI performance
- Snowflake schema — normalized dimension tables; saves space, slower for queries; rarely preferred in modern cloud warehouses
- Slowly Changing Dimensions (SCDs): - Type 1 — overwrite old value; no history - Type 2 — add new row with effective dates; preserves full history; `valid_from`, `valid_to`, `is_current` columns - Type 3 — add new column for current and previous value; limited history - Type 4 — use history table; separate current and historical storage
- Surrogate keys — warehouse-generated integer keys replacing source system natural keys; stability, consistent data type, integration of multiple sources
- Bridge tables — many-to-many relationships between facts and dimensions
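A minimal pandas sketch of the SCD Type 2 logic described above, assuming the dimension table already carries `valid_from`, `valid_to`, and `is_current` columns; in a warehouse this is usually expressed as a MERGE statement or a dbt snapshot instead.

```python
import pandas as pd

def apply_scd2(dim: pd.DataFrame, incoming: pd.DataFrame,
               key: str, run_ts: pd.Timestamp) -> pd.DataFrame:
    """Expire changed rows and insert new current versions (SCD Type 2, sketch)."""
    attrs = [c for c in incoming.columns if c != key]
    current = dim.loc[dim["is_current"], [key] + attrs]
    # Rows that are new, or whose tracked attributes differ from the current version
    diff = incoming.merge(current, on=[key] + attrs, how="left", indicator=True)
    changed = diff.loc[diff["_merge"] == "left_only", list(incoming.columns)]
    # Close out the current version of every changed key
    expire = dim[key].isin(changed[key]) & dim["is_current"]
    dim.loc[expire, "valid_to"] = run_ts
    dim.loc[expire, "is_current"] = False
    # Append the new current versions
    new_rows = changed.assign(valid_from=run_ts, valid_to=pd.NaT, is_current=True)
    return pd.concat([dim, new_rows], ignore_index=True)
```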
Data Vault Modeling
- Alternative to Kimball for enterprise data warehousing; better for audit and compliance requirements
- Components: - Hubs — unique business keys with no context - Links — relationships between hubs (association) - Satellites — attributes and descriptive context for hubs and links; timestamped for history
- Raw vault vs business vault — raw data vs business-rule applied data
- Data Vault 2.0 — adds hash keys and parallel load patterns
Modern Approaches
- Wide tables / denormalized analytics tables — flatten everything into one big table; common with modern columnar warehouses
- Analytics engineering patterns (dbt): - Staging layer — clean and standardize raw data - Intermediate layer — join and restructure - Mart layer — business-ready tables organized by domain
- OBT (One Big Table) — extreme denormalization for simplicity
- Activity schema — event-based alternative to dimensional modeling
Data Modeling Tools
- dbt — most common implementation tool for warehouse data models
- Lucidchart / draw.io / dbdiagram.io — ERD (Entity-Relationship Diagram) tools
- dbt-erdiagram / dbt-docs — auto-generated documentation from dbt models
Resources
- "The Data Warehouse Toolkit" by Kimball and Ross (the canonical reference book)
- dbt Learn data modeling course (free)
- dbdiagram.io (free for diagrams)
Stage 02
ETL/ELT Pipelines and Orchestration
Building reliable, maintainable data pipelines is the core engineering discipline. Understanding ETL/ELT patterns and orchestration is the day-to-day job.
ETL vs ELT
- ETL (Extract, Transform, Load) — traditional pattern; transform before loading; required when target lacks processing power; Informatica, Talend, SSIS
- ELT (Extract, Load, Transform) — modern pattern; load raw data first, transform in warehouse; cloud warehouses have enough compute; dbt + Snowflake/BigQuery/Redshift
Extraction Patterns
- Full load — extract all data every time; simple; expensive for large tables
- Incremental load: - Timestamp-based — `WHERE updated_at > last_run_timestamp`; requires reliable updated_at column - CDC (Change Data Capture) — capture inserts/updates/deletes at database log level; Debezium, AWS DMS, Fivetran - Watermark — tracking last successfully processed record (see the watermark sketch after this list)
- API extraction patterns: - REST APIs — pagination (cursor-based, offset, page-based); rate limiting; authentication (OAuth 2.0, API key, JWT) - GraphQL — query only needed fields; avoid over-fetching - Webhooks — event-driven; receive data when it changes; reliability challenges - Bulk export APIs — large data extracts; S3/GCS destination
- Database replication: - Logical replication — row-level changes; used for cross-database replication - Binlog replication — MySQL binary log; Debezium connector - WAL streaming — PostgreSQL write-ahead log; logical decoding
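A minimal sketch of the timestamp-plus-watermark pattern from the incremental-load item above, using SQLAlchemy and pandas; the `orders` table, connection URL, and file-based watermark store are hypothetical (production pipelines usually keep the watermark in a metadata table).

```python
import json
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

STATE_FILE = Path("state/orders_watermark.json")  # hypothetical watermark store

def extract_incremental(source_url: str) -> pd.DataFrame:
    """Timestamp-based incremental extract driven by a persisted watermark (sketch)."""
    engine = create_engine(source_url)
    last_ts = (json.loads(STATE_FILE.read_text())["last_run"]
               if STATE_FILE.exists() else "1970-01-01 00:00:00")
    query = text("SELECT * FROM orders WHERE updated_at > :last_ts ORDER BY updated_at")
    df = pd.read_sql(query, engine, params={"last_ts": last_ts})
    if not df.empty:
        # Advance the watermark only after a successful extract
        STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
        STATE_FILE.write_text(json.dumps({"last_run": str(df["updated_at"].max())}))
    return df
```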
ELT Tools (Managed Connectors)
- Fivetran — managed SaaS connectors; 500+ sources; automated schema migration; hands-off ELT
- Airbyte — open-source alternative to Fivetran; self-hosted or cloud; custom connector SDK
- Stitch — simple SaaS ELT; Singer protocol
- AWS Glue — managed ETL service; Spark-based; schema inference; Glue Crawlers for data catalog
- Azure Data Factory — managed ETL; drag-and-drop; connects Azure services and external sources
- Google Cloud Dataflow — Apache Beam-based; batch and streaming; serverless
Apache Airflow — The Dominant Orchestrator
- Architecture: - Webserver — UI for monitoring and triggering - Scheduler — parsing and scheduling DAGs - Executor — running tasks; LocalExecutor (single machine), CeleryExecutor (distributed), KubernetesExecutor (k8s) - Workers — executing task instances - Metadata database — PostgreSQL; storing DAG state, task state, XCom - DAG storage — S3, Git sync, or local filesystem
- DAGs (Directed Acyclic Graphs) — Python files defining workflow structure:

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta

default_args = {
    'owner': 'data-team',
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'email_on_failure': True,
}

with DAG(
    dag_id='daily_sales_pipeline',
    default_args=default_args,
    schedule_interval='0 6 * * *',  # 6am daily
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['sales', 'daily'],
) as dag:
    extract = PythonOperator(task_id='extract_sales', python_callable=extract_func)
    transform = PythonOperator(task_id='transform_sales', python_callable=transform_func)
    load = PythonOperator(task_id='load_to_warehouse', python_callable=load_func)

    extract >> transform >> load  # dependency definition
```
- Key Airflow concepts: - Operators — BashOperator, PythonOperator, SQLExecuteQueryOperator, KubernetesPodOperator, BigQueryOperator - Sensors — waiting for external events; FileSensor, HttpSensor, S3KeySensor, ExternalTaskSensor - Hooks — connecting to external systems; database hooks, cloud hooks; abstraction over credentials - XCom — cross-task communication; passing small values between tasks - Variables — storing configuration values; accessible across DAGs - Connections — storing credentials; referenced by conn_id in operators - Task instances and DAG runs — one DAG run per schedule interval; task instance = task in a DAG run - Catchup — running missed intervals when DAG is enabled; usually disabled - Backfill — manually running historical periods: `airflow dags backfill -s 2024-01-01 -e 2024-12-31 dag_id` - SLA misses — alerting when tasks take longer than expected - Pools — limiting concurrent task execution; preventing resource exhaustion - Priority weights — prioritizing critical tasks when resources are constrained
- Airflow best practices: - Idempotent tasks — running twice produces same result as running once; no duplicate data - Atomic tasks — each task succeeds or fails completely; no partial state - Small DAGs — avoid monolithic DAGs; split into logical components - Avoid dynamic task generation — hard to debug; use TaskGroups instead - External task sensor — dependency between DAGs; avoid tight coupling
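To make the idempotency and atomicity points concrete, here is a sketch of a task body that deletes and reloads exactly one date partition inside a single transaction, so reruns and backfills give the same result; the table names and connection URL are hypothetical.

```python
import sqlalchemy

WAREHOUSE_URL = "postgresql://etl:secret@warehouse:5432/analytics"  # hypothetical

def load_partition(ds: str) -> None:
    """Delete-then-insert one date partition in a single transaction (idempotent, atomic)."""
    engine = sqlalchemy.create_engine(WAREHOUSE_URL)
    with engine.begin() as conn:
        conn.execute(sqlalchemy.text(
            "DELETE FROM sales_daily WHERE sale_date = :ds"), {"ds": ds})
        conn.execute(sqlalchemy.text(
            "INSERT INTO sales_daily SELECT * FROM staging_sales WHERE sale_date = :ds"), {"ds": ds})
```

Wired into a DAG, this would typically receive the templated execution date, e.g. `PythonOperator(task_id='load', python_callable=load_partition, op_kwargs={'ds': '{{ ds }}'})`.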
Alternative Orchestrators
- Prefect — Python-first; easier local development; Prefect Cloud vs Prefect Server
- Dagster — asset-centric; software-defined assets; strong lineage
- dbt Cloud — dbt job scheduling; CI/CD for dbt models
- AWS Step Functions — serverless; visual workflows; integrates with all AWS services
- Google Cloud Composer — managed Airflow on GCP
dbt — Analytics Engineering
- What dbt does: - Brings software engineering best practices to SQL transformations - Version control — all transformation logic in Git - Modularity — reusable models using `ref()` and `source()` - Testing — built-in and custom tests for data quality - Documentation — auto-generated from model descriptions - Lineage — automatic data lineage graph
- dbt project structure: - `models/` — SQL files; each file becomes a table or view in the warehouse - `seeds/` — CSV lookup files committed to version control - `snapshots/` — SCD Type 2 implementation - `tests/` — custom test SQL files - `macros/` — Jinja2 macros for reusable SQL logic - `dbt_project.yml` — project configuration; model materialization defaults - `sources.yml` — declaring raw data sources; source freshness checks - `schema.yml` — model documentation and tests
- Model materialization: - View — default; recreated each query; no storage cost; slow for complex logic - Table — materialized as a physical table; fast queries; must be refreshed - Incremental — appends or overwrites only new/changed data; efficient for large tables - `{{ config(materialized='incremental', unique_key='id') }}` - `{% if is_incremental() %} WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }}) {% endif %}` - Ephemeral — CTE; not materialized; used as building block
- dbt tests: - Built-in: `not_null`, `unique`, `accepted_values`, `relationships` - dbt-utils: `expression_is_true`, `at_least_one`, `recency`, `equality` - Custom: `tests/` directory; any SQL returning rows = test failure - Source freshness: `loaded_at_field`; alerting on stale sources
- Jinja2 in dbt: - `{{ ref('model_name') }}` — referencing other models; creates dependency - `{{ source('schema', 'table') }}` — referencing raw tables - `{{ config(materialized='table') }}` — configuring model behavior - `{% set variable = 'value' %}` — setting variables - `{{ var('variable_name') }}` — accessing project variables - `{{ env_var('ENV_VAR') }}` — accessing environment variables
- dbt incremental strategies: - append — adds new rows; no updates - merge (UPSERT) — inserts new, updates existing based on unique_key - delete+insert — deletes then re-inserts changed partitions - insert_overwrite — replaces partitions
- dbt Cloud vs dbt Core — Cloud: managed scheduling, IDE, CI/CD; Core: open-source, runs anywhere
Resources
- dbt Fundamentals course (free at courses.getdbt.com)
- Apache Airflow documentation (free)
- "The Fundamentals of Data Engineering" by Reis and Housley (book)
- dbt Community Slack (free)
Stage 03
Cloud Data Platforms
Data engineering in 2026 is predominantly cloud-based. Deep fluency in at least one cloud provider is required; awareness of all three is expected.
Cloud Storage and Object Storage
- AWS S3: - Buckets, prefixes, objects — S3 has no real directories; prefixes simulate them - Storage classes — Standard, Intelligent-Tiering, Standard-IA, Glacier; cost vs access tradeoff - Lifecycle policies — automatically transition or expire objects - Versioning — keeping multiple versions of objects - Event notifications — triggering Lambda on PUT/DELETE - Presigned URLs — time-limited access without credentials - S3 Select — querying CSV/JSON/Parquet files with SQL without downloading - boto3 patterns:

```python
import boto3

s3 = boto3.client('s3')
s3.put_object(Bucket='my-bucket', Key='prefix/file.csv', Body=data)
s3.get_object(Bucket='my-bucket', Key='prefix/file.csv')
paginator = s3.get_paginator('list_objects_v2')
```
- Google Cloud Storage (GCS) — similar to S3; Buckets and Objects; `google-cloud-storage`
- Azure Blob Storage — Containers and Blobs; Data Lake Storage Gen2 adds HDFS-compatible filesystem
Cloud Data Warehouses — Engineering Deep Dive
- Snowflake: - Architecture — separate storage (S3/Blob/GCS) and compute (virtual warehouses) - Virtual warehouses — size determines query speed and cost; auto-suspend, auto-resume - Stages — internal and external stages for loading data - COPY INTO — bulk loading from stage to table - Snowpipe — continuous auto-ingest from cloud storage - Tasks — Snowflake's built-in scheduler (alternative to Airflow for Snowflake-only workflows) - Streams — CDC on Snowflake tables; captures inserts/updates/deletes - Dynamic tables — automatically maintained derived tables; declarative pipelines - Zero-copy cloning — instantly clone databases/schemas/tables for testing - Time travel — historical queries; UNDROP - Data sharing — sharing data across Snowflake accounts without copying - Clustering keys — improving scan performance on large tables - Search optimization — for point lookup queries - MERGE statement for SCD Type 2 in Snowflake
- Google BigQuery: - Serverless — no infrastructure management; pay per TB scanned - Columnar storage — excellent for aggregations on large tables; poor for single-row lookups - Partitioning — by date, timestamp, or integer range; partition pruning reduces scanned data - Clustering — ordering within partitions; like an index; applies after partitioning - Materialized views — precomputed query results; auto-refreshed - BigQuery ML — training ML models with SQL - External tables — querying GCS files directly (Parquet, CSV, Avro, ORC) - Data Transfer Service — managed connectors from Google products, SaaS - Reservations and slot commitments — predictable pricing for large workloads
- Amazon Redshift: - MPP (Massively Parallel Processing) — distributes data across nodes - Distribution styles — EVEN, KEY (same key on same node for joins), ALL (small tables replicated) - Sort keys — compound vs interleaved; data ordered on disk; dramatically improves range queries - Redshift Spectrum — querying S3 data directly from Redshift - COPY command — bulk loading from S3 - Vacuum and analyze — maintaining table health; required for performance - Leader node vs compute nodes — queries compiled and distributed from leader - Redshift Serverless — no cluster management; scales automatically
- Databricks: - Unified analytics platform — notebooks, jobs, SQL, ML, Delta Lake in one platform - Delta Lake — open-source storage layer on object storage; ACID transactions on data lakes - Time travel — `SELECT * FROM table VERSION AS OF 10` - Schema enforcement — prevents incompatible writes - Schema evolution — `mergeSchema = true` for adding columns - Optimize and Z-Ordering — compaction + co-location for query speed - Vacuum — removing old versions - Unity Catalog — centralized data governance; fine-grained access control; lineage - Delta Live Tables (DLT) — declarative pipeline framework; automatic error recovery - Databricks SQL — serverless SQL warehouse; BI tool connections - Notebooks — Python, SQL, Scala, R; git integration; widgets for parameterization
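As one concrete example from the BigQuery items above, a hedged sketch of loading Parquet files into a partitioned, clustered table with the `google-cloud-bigquery` client; the GCS path and table ID are hypothetical, and credentials are assumed to be configured in the environment.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials are configured

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    time_partitioning=bigquery.TimePartitioning(field="order_date"),  # partition by date column
    clustering_fields=["region", "product_id"],                       # cluster within partitions
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/orders/*.parquet",   # hypothetical GCS path
    "my_project.analytics.orders",       # hypothetical table id
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```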
Cloud Infrastructure for Data Engineering
- AWS: - AWS Glue — managed ETL; Glue Crawlers for data catalog; DynamicFrames for schema flexibility - AWS Lambda — serverless Python functions; event-driven; 15-minute limit - AWS Step Functions — visual workflow orchestration; integrates with all AWS services - Amazon EMR — managed Hadoop/Spark; cluster-based big data processing - Amazon Kinesis — real-time data streaming; Kinesis Data Streams, Firehose, Analytics - Amazon RDS — managed relational databases; source for data pipelines - AWS DMS — Database Migration Service; CDC for ongoing replication
- GCP: - Google Cloud Composer — managed Airflow on GCP; deep GCP integration - Google Cloud Dataflow — Apache Beam; batch and streaming; auto-scaling - Google Cloud Pub/Sub — managed message queue; event streaming - Google Cloud Dataproc — managed Spark/Hadoop - Cloud Functions — serverless; event-driven; 60-minute timeout
- Azure: - Azure Data Factory (ADF) — managed ETL with visual designer; 90+ connectors - Azure Synapse Analytics — unified analytics; Synapse SQL + Spark + Pipelines - Azure Event Hubs — managed Kafka-compatible event streaming - Azure Databricks — Databricks on Azure infrastructure - Microsoft Fabric — new unified platform; OneLake; direct lake mode for Power BI
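A minimal sketch of the event-driven Lambda pattern from the AWS list above: a function triggered by an S3 object-created notification reads the new file and parses it; the bucket contents and downstream handling are hypothetical.

```python
import json
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """S3-triggered Lambda: read each newly landed object and parse its JSON lines (sketch)."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # object keys arrive URL-encoded
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = [json.loads(line) for line in body.splitlines() if line]
        # hand the parsed rows to the next step, e.g. write to a raw zone or a queue
        print(f"parsed {len(rows)} rows from s3://{bucket}/{key}")
```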
Infrastructure as Code for Data Engineers
- Terraform — provisioning cloud infrastructure; data engineering resources as code - `aws_s3_bucket`, `aws_glue_job`, `aws_redshift_cluster`, `snowflake_warehouse`
- Docker — containerizing data pipelines; consistent environments; Airflow runs in Docker
- Kubernetes — orchestrating containerized pipelines; Airflow on K8s; Spark on K8s
Resources
- Snowflake documentation (free)
- BigQuery documentation (free)
- Databricks documentation (free)
- dbt-labs GitHub (free)
- Cloud provider free tiers for practice
Stage 04
Distributed Processing and Streaming
Large-scale data processing beyond what a single machine handles. Apache Spark for batch; Kafka + Flink/Spark Streaming for real-time.
Apache Spark
- Architecture: - Driver — program that runs Spark application; SparkContext/SparkSession - Cluster Manager — Standalone, YARN (Hadoop), Mesos, Kubernetes - Workers — nodes running Executors - Executors — JVM processes on workers; run tasks; store data in memory/disk - Jobs, Stages, Tasks — hierarchy of work; job = action; stage = shuffle boundary; task = partition processing
- Core concepts: - RDD (Resilient Distributed Dataset) — low-level; immutable; fault-tolerant; avoid in modern Spark - DataFrame — distributed tabular data; Spark SQL API; JVM optimization - Dataset — strongly-typed DataFrame; Scala/Java only - Transformations vs Actions: - Transformations are lazy — `filter()`, `map()`, `groupBy()`, `join()`, `select()`, `withColumn()`; build execution plan - Actions trigger execution — `count()`, `collect()`, `show()`, `write()`, `take()` - Lazy evaluation — transformations build DAG; optimized before execution - Catalyst optimizer — Spark's query optimizer; generates optimal physical plan - Tungsten engine — memory management and code generation for performance
- PySpark — Python API for Spark:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, window, to_timestamp
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("pipeline").getOrCreate()

df = spark.read.parquet("s3://bucket/data/")
df_clean = df.filter(col("amount") > 0).dropDuplicates(["id"])
df_agg = df_clean.groupBy("region", "product").agg(
    sum("amount").alias("total_sales"),
    count("id").alias("order_count")
)
df_agg.write.mode("overwrite").partitionBy("region").parquet("s3://bucket/output/")
```
- Performance tuning: - Partitioning — controlling data distribution across executors - Number of partitions — `spark.default.parallelism`; `repartition()` vs `coalesce()`; 2–3x number of CPU cores is typical - Shuffle operations — joins, groupBy, distinct; expensive; minimize; broadcast join for small tables - Broadcast joins — `broadcast(small_df)`; replicate small table to all executors; avoids shuffle - Caching — `df.cache()` or `df.persist(StorageLevel.MEMORY_AND_DISK)`; caching reused DataFrames - Skew handling — salting; bucketing; AQE (Adaptive Query Execution) - File format — Parquet preferred over CSV for analytics; columnar; predicate pushdown; compression - Partition pruning — querying only relevant partitions via filter on partition column - AQE (Adaptive Query Execution) — Spark 3.0+; dynamically optimizes at runtime
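A short PySpark sketch of two of the tuning items above, broadcast joins and caching; the paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.appName("tuning_sketch").getOrCreate()

orders = spark.read.parquet("s3://bucket/orders/")        # large fact data
regions = spark.read.parquet("s3://bucket/dim_regions/")  # small dimension table

# Broadcast the small table to every executor so the join needs no shuffle
enriched = orders.join(broadcast(regions), on="region_id", how="left")

# Cache a DataFrame that several downstream aggregations will reuse
enriched = enriched.filter(col("amount") > 0).cache()

daily = enriched.groupBy("order_date").count()
by_region = enriched.groupBy("region_name").count()
```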
Apache Kafka — Event Streaming
- Architecture: - Producers — publish messages to topics - Brokers — servers that store and serve messages; replicated across cluster - Topics — logical categories for messages; partitioned for parallelism - Partitions — ordered, immutable sequence of records; enabling parallelism - Consumer Groups — one consumer per partition per group; parallel consumption - Offsets — position of consumer in partition; committed to track progress - Zookeeper (legacy) / KRaft (modern) — metadata management
- Core concepts: - Retention — configurable time-based or size-based retention - Replication factor — copies of each partition across brokers; durability - Producer acknowledgments — `acks=all` for durability; `acks=0` for speed - Exactly-once semantics — idempotent producers + transactional producers - Consumer lag — how far behind consumers are; key operational metric - Schema Registry — Confluent schema registry; enforcing Avro/Protobuf/JSON schema - Kafka Connect — managed connectors for sources and sinks; JDBC connector, S3 Sink, Debezium - Kafka Streams — client library for stream processing on Kafka data
- Producer in Python:

```python
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
producer.send('orders', value={'order_id': 123, 'amount': 99.99})
producer.flush()
```
- Consumer in Python:

```python
from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'orders',
    group_id='order-processor',
    bootstrap_servers=['kafka:9092'],
    value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='earliest'
)
for message in consumer:
    process(message.value)
    consumer.commit()
```
Stream Processing
- Spark Structured Streaming: - Micro-batch processing — processes data in small batches; near real-time - Trigger intervals — 0 seconds (continuous), 1 minute, once - Watermarking — handling late-arriving data: `withWatermark("timestamp", "10 minutes")` - Output modes — append (new rows only), update (changed rows), complete (full results) - Checkpointing — fault tolerance; recoverable from failures - Windowing — tumbling, sliding, session windows for aggregations
- Apache Flink: - True stream processing — event-by-event, not micro-batch - Lower latency than Spark Streaming - Time semantics — event time vs processing time; watermarks - Stateful processing — fault-tolerant state with RocksDB backend - Flink SQL — SQL-based stream processing; unified batch and stream API - Growing adoption for low-latency use cases (fraud detection, real-time recommendations)
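A minimal Structured Streaming sketch combining the Kafka source with watermarking, tumbling windows, and the update output mode; it assumes the `spark-sql-kafka` package is on the classpath, and the topic, schema, and checkpoint path are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window, sum as spark_sum
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("orders_stream").getOrCreate()

schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")
       .option("subscribe", "orders")
       .load())

orders = raw.select(from_json(col("value").cast("string"), schema).alias("o")).select("o.*")

agg = (orders
       .withWatermark("event_time", "10 minutes")       # tolerate 10 minutes of late data
       .groupBy(window(col("event_time"), "5 minutes"))  # tumbling 5-minute windows
       .agg(spark_sum("amount").alias("total_amount")))

query = (agg.writeStream
         .outputMode("update")
         .format("console")  # sketch sink; a real pipeline would write to a table or topic
         .option("checkpointLocation", "/tmp/checkpoints/orders")  # hypothetical path
         .start())
query.awaitTermination()
```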
File Formats
- Parquet — columnar; efficient for analytics; predicate pushdown; compression; Pandas/Spark/BigQuery all support
- Avro — row-based; schema evolution; good for streaming/Kafka; not great for analytics
- ORC — columnar; originated in the Hadoop ecosystem; strong Hive support; losing ground to Parquet
- Delta Lake — Parquet + transaction log; ACID on object storage; time travel
- Apache Iceberg — open table format; table evolution; partition evolution; multi-engine support (Spark, Flink, Trino)
- Apache Hudi — CDC patterns; upsert on data lakes; incremental processing
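A small pandas sketch (pyarrow engine assumed) showing why columnar Parquet pays off: partitioned writes plus column pruning and predicate pushdown on read; the local path and toy data are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [10.0, 25.5, 7.9],
})

# Write a partitioned Parquet dataset
df.to_parquet("orders_parquet/", partition_cols=["region"], engine="pyarrow")

# Read back only the columns and partitions needed: column pruning plus predicate pushdown
eu_orders = pd.read_parquet(
    "orders_parquet/",
    columns=["order_id", "amount"],
    filters=[("region", "=", "EU")],
    engine="pyarrow",
)
print(eu_orders)
```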
Resources
- Apache Spark documentation (free)
- Databricks Spark learning (free)
- Confluent Kafka documentation (free)
- "Designing Data-Intensive Applications" by Kleppmann (essential book)
Stage 05
Data Quality, Observability, and Governance
Production data pipelines fail. Data changes unexpectedly. These engineering disciplines prevent silent data failures.
Data Quality
- Dimensions of data quality — completeness, accuracy, consistency, timeliness, validity, uniqueness
- Data quality testing categories: - Schema tests — column exists, data type correct, not null - Content tests — expected values only, referential integrity, range checks - Statistical tests — outlier detection, distribution shift, unexpected zeros - Freshness tests — data arrived within expected window
- Testing frameworks: - dbt tests — declarative YAML-based tests close to the model; schema.yml - Great Expectations — Python library; expectations suite; integration with Airflow/dbt - `expect_column_to_exist`, `expect_column_values_to_not_be_null` - `expect_column_values_to_be_between`, `expect_table_row_count_to_be_between` - Data docs — auto-generated HTML documentation of expectations and results - Soda Core — open-source; YAML-based checks; pushes results to Soda Cloud
- Data contracts: - Formal agreement between data producers and consumers - Schema, semantics, SLAs, ownership - Tools: Data Contract CLI, dbt model contracts
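Before reaching for a framework, the check categories above can be hand-rolled in a few lines; this sketch assumes a batch with hypothetical `order_id`, `amount`, and `updated_at` columns and returns the list of failures for the pipeline to act on.

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame, max_staleness_hours: int = 24) -> list[str]:
    """Return a list of failed checks; an empty list means the batch passed (sketch)."""
    failures = []
    # Schema tests: required columns exist
    for column in ("order_id", "amount", "updated_at"):
        if column not in df.columns:
            failures.append(f"missing column: {column}")
    if failures:
        return failures  # content checks below assume the schema is present
    # Content tests: uniqueness, nulls, value ranges
    if not df["order_id"].is_unique:
        failures.append("order_id is not unique")
    if df["order_id"].isna().any():
        failures.append("order_id contains nulls")
    if (df["amount"] < 0).any():
        failures.append("negative amounts found")
    # Freshness test: newest record within the expected window
    newest = pd.to_datetime(df["updated_at"], utc=True).max()
    staleness = pd.Timestamp.now(tz="UTC") - newest
    if staleness > pd.Timedelta(hours=max_staleness_hours):
        failures.append(f"data is stale by {staleness}")
    return failures
```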
Data Observability
- What data observability is — understanding the health of data as it flows through pipelines
- Five pillars of data observability (Monte Carlo): - Freshness — is data up to date? - Volume — is the expected amount of data present? - Distribution — have value ranges or patterns changed? - Schema — have column names or types changed? - Lineage — which upstream tables and pipelines feed this data?
- Tools: - Monte Carlo — leading commercial observability platform; ML-based anomaly detection - Bigeye — data monitoring; anomaly detection; SLA tracking - Datafold — diff-based data quality; column-level impact analysis - Lightdash — open-source; integrates with dbt - dbt Cloud — model run monitoring; test failures; source freshness
- Data lineage: - Column-level lineage — tracing transformations field by field - OpenLineage — open standard for lineage metadata - Marquez — open-source lineage server; implements OpenLineage - Atlas (Apache) — data governance and metadata framework; open-source
Data Governance
- Data catalog — inventory of all datasets; business description, owner, schema, lineage - Datahub — LinkedIn open-source; widely adopted; integrates with Airflow, dbt, Spark - Collibra — enterprise data governance; data dictionary; stewardship workflows - Alation — enterprise data catalog; SQL query intelligence - Unity Catalog (Databricks) — governance for Databricks; column-level security; auditing - BigQuery Data Catalog — native GCP catalog
- Data classification — PII, PHI, PCI data; tagging sensitive columns; driving access control
- Access control: - Role-based access control — database roles; column masking; row-level security - Attribute-based access control — Unity Catalog; policy expressions - Data masking — masking sensitive data in non-production environments
- Data retention — policies for how long different data types are kept; GDPR deletion rights
- GDPR/CCPA engineering implications: - Right to erasure — deleting user data across all systems; compiling the data lineage to find all copies - Data minimization — collecting only what is needed - Consent management — tracking which users consented to which processing
Resources
- Great Expectations documentation (free)
- dbt testing guide (free)
- Monte Carlo blog (free)
- "Fundamentals of Data Engineering" (book)
Stage 06
DataOps and Software Engineering Practices
Production data systems require the same rigor as application software: version control, CI/CD, testing, monitoring.
Version Control for Data
- Git fundamentals — branching (feature branches, main/prod), PRs, code review
- dbt + Git — every transformation change committed; reviewed; tested in CI before production
- Configuration as code — Airflow DAGs in Git; Terraform for infrastructure
- Data pipeline versioning — semantic versioning for pipeline releases
CI/CD for Data Pipelines
- GitHub Actions / GitLab CI for dbt: - Run dbt build on every PR against a dev environment - Run dbt test suite; block merge on test failure - Deploy to production on merge to main - Example:

```yaml
name: dbt CI
on: [pull_request]
jobs:
  dbt-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: pip install dbt-snowflake
      - run: dbt build --select state:modified --defer --state prod/
      - run: dbt test --select state:modified
```
- Slim CI — run only changed models; `--select state:modified`; `--defer` to use prod artifacts for unchanged upstream
- Blue/green deployments — maintaining two environments; switching when new version is validated
Containerization
- Docker for data pipelines: - Airflow runs in Docker Compose for local development - Pipeline tasks as Docker containers — consistent environment; dependency isolation - ECR/GCR/ACR — container registries for storing pipeline images
- Kubernetes for data pipelines: - KubernetesPodOperator in Airflow — running tasks as K8s pods - Spark on Kubernetes — replacing YARN for Spark workloads - Helm charts — package manager for K8s applications; Airflow, Kafka
Monitoring and Alerting
- Pipeline monitoring: - Airflow — task duration alerts; SLA misses; email/Slack/PagerDuty integration - Datadog, Grafana, CloudWatch — infrastructure metrics; query performance - Custom metrics — pipeline row counts, processing times, error rates in time-series database
- Data freshness monitoring: - dbt source freshness checks - Great Expectations freshness expectations - Custom SQL to check MAX(updated_at) against SLA
- On-call for data — runbooks for common pipeline failures; escalation procedures
- Post-mortems — blameless analysis of data incidents; root cause + prevention
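A minimal sketch of pipeline alerting from the monitoring items above: an Airflow failure callback posts the failing task and its log URL to a Slack webhook, and an SLA is attached via `default_args`; the webhook URL and DAG are hypothetical.

```python
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # hypothetical webhook

def notify_slack(context):
    """Failure callback: post the failing task, run date, and log URL to Slack."""
    ti = context["task_instance"]
    requests.post(SLACK_WEBHOOK_URL, timeout=10, json={
        "text": f":red_circle: {ti.dag_id}.{ti.task_id} failed for {context['ds']} - {ti.log_url}"
    })

with DAG(
    dag_id="monitored_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        "on_failure_callback": notify_slack,
        "sla": timedelta(hours=1),  # scheduler records an SLA miss if the task runs past this
    },
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: None)
```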
Resources
- "Practical DataOps" by Beragh and Jones
- GitLab CI documentation (free)
- Docker documentation (free)
FAQ
Common questions
How long does it take to become a Data Engineer?
18–24 months optimistic, 2–3 years realistic. The modern data stack (Snowflake + dbt + Airflow + Spark) takes time to learn properly, and cloud platform depth (AWS, Azure, or GCP) is no longer optional. The fastest paths in 2026 are software engineer to data engineer, data analyst with strong Python to data engineer, and analytics engineering as an intermediate step. The field shows roughly 23% growth, with salaries reaching $170K+ for experienced engineers.
Which certifications matter for data engineering?
AWS Data Engineer Associate or AWS Solutions Architect Associate for AWS. Azure Data Engineer Associate (DP-203) for Azure. Google Professional Data Engineer for GCP. Snowflake SnowPro Core for Snowflake-heavy environments. dbt Analytics Engineering Certification is increasingly listed. The cert market is fragmented because the data stack is fragmented; pick the cloud + warehouse your target employers use.
Do I need a CS degree?
Helpful but not required. Strong Python + SQL + distributed systems intuition can be self-taught, and bootcamps with rigorous data engineering tracks (DataExpert.io, DataTalks.Club) produce competitive candidates. What you absolutely need: comfort with batch and streaming concepts, schema design, and operational fluency (when does the pipeline fail at 3 AM, and how do you know?). The bar at FAANG-tier data engineering is genuinely high; mid-tier roles are accessible.
What separates a hired Data Engineer?
End-to-end pipeline ownership in your portfolio. Show one realistic data pipeline — from API/event source through ingestion, transformation in dbt or Spark, to a queryable warehouse — with documentation on schema decisions, partition strategy, and failure handling. Dashboard-only candidates lose to candidates who've built infrastructure. 94% of enterprises have embraced cloud platforms; cloud platform fluency is no longer optional. Bonus signals: open-source contributions to dbt, Airflow, or Dagster.