Roadmap
Data Scientist
The professional who frames business questions as data problems, builds statistical models and machine learning systems to answer them, and communicates findings that drive decisions. Combines statistical rigor, programming depth, and business context to produce predictions, recommendations, and insights from structured and unstructured data.
OPTIMISTIC 12–18 months · REALISTIC 18–30 months
Stage 00
Mathematics and Statistics Foundations
Data science is applied mathematics. You can get started with surface-level statistics, but senior roles require genuine depth in probability, linear algebra, and statistical inference.
Statistics — Applied Foundation
- All content from Data Analyst Stage 0 (descriptive statistics, probability, inferential statistics) plus:
Linear Algebra — The Language of Machine Learning
- Vectors — ordered collections of numbers; direction and magnitude in n-dimensional space
- Matrices — rectangular arrays of numbers; representing linear transformations
- Matrix operations:
  - Addition and scalar multiplication — element-wise
  - Matrix multiplication — row × column dot products; not commutative
  - Transpose — flipping rows and columns
  - Inverse — A⁻¹ such that AA⁻¹ = I (identity matrix); solving systems of equations
  - Determinant — scalar characterizing a matrix; det = 0 means non-invertible (singular)
- Dot product — `a · b = |a||b|cos(θ)`; measures similarity between vectors; fundamental to neural networks
- Eigenvalues and eigenvectors — `Av = λv`; eigenvectors are directions unchanged by a transformation; eigenvalues are the scaling factors
  - Used in PCA (Principal Component Analysis) for dimensionality reduction
- Norms — measuring vector magnitude; L1 norm (sum of absolute values), L2 norm (Euclidean distance)
- Matrix factorizations — SVD (Singular Value Decomposition); used in recommendation systems and dimensionality reduction
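These operations map directly onto NumPy. A minimal sketch (the matrices here are arbitrary illustrative values):

```python
import numpy as np

# Dot product as a similarity measure: a · b = |a||b|cos(θ)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])          # same direction as a
cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_theta, 4))             # parallel vectors have cosine similarity 1.0

# Eigendecomposition: Av = λv for a symmetric (covariance-like) matrix
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eigh(A)   # ascending eigenvalue order
v = eigenvectors[:, -1]                # eigenvector of the largest eigenvalue
# A @ v points in the same direction as v, scaled by λ
print(np.allclose(A @ v, eigenvalues[-1] * v))

# SVD: any matrix factors as U Σ Vᵀ; reconstructing confirms the factorization
M = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])
U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(np.allclose(U @ np.diag(s) @ Vt, M))
```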
Calculus — For Understanding Learning Algorithms
- Derivatives — rate of change; `f'(x) = lim(h→0) [f(x+h) - f(x)] / h`
- Chain rule — derivative of composite function; critical for backpropagation in neural networks
- Partial derivatives — derivative with respect to one variable holding others constant
- Gradient — vector of partial derivatives; points in the direction of steepest increase
- Gradient descent — moving in the direction of the negative gradient to minimize a loss function
  - Batch GD — compute the gradient on the entire dataset
  - Stochastic GD (SGD) — compute the gradient on one sample; noisy but fast
  - Mini-batch GD — compromise; standard practice in deep learning
- Optimization concepts:
  - Local vs global minima — gradient descent finds local minima; random restarts help
  - Learning rate — how large a step to take; too large = overshoot; too small = slow
  - Learning rate schedules — reducing the learning rate over training
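The update rule above can be sketched in a few lines of plain Python, minimizing the toy loss f(x) = (x - 3)^2, whose gradient is 2(x - 3) and whose minimum sits at x = 3:

```python
# Minimal gradient descent on f(x) = (x - 3)^2.
def gradient_descent(lr=0.1, steps=100, x0=0.0):
    x = x0
    for _ in range(steps):
        grad = 2 * (x - 3)   # derivative of the loss at the current point
        x -= lr * grad       # step in the direction of the negative gradient
    return x

x_good = gradient_descent(lr=0.1)    # converges very close to 3
x_slow = gradient_descent(lr=0.001)  # too small: still far from 3 after 100 steps
print(x_good, x_slow)
```

Trying a learning rate above 1.0 on the same function shows the overshoot failure mode from the list above.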
Advanced Statistics
- Probability distributions:
  - Discrete: Bernoulli, Binomial, Poisson, Geometric
  - Continuous: Uniform, Normal, Exponential, Beta, Gamma
  - Understanding when each applies; parameter interpretation
- Maximum Likelihood Estimation (MLE) — finding parameters that maximize the likelihood of the observed data
- Bayesian inference — updating prior beliefs with observed data:
  - Prior — belief before seeing data: P(θ)
  - Likelihood — probability of the data given the parameter: P(X|θ)
  - Posterior — updated belief after seeing data: P(θ|X) ∝ P(X|θ) × P(θ)
  - Bayes' Theorem: P(θ|X) = P(X|θ)P(θ) / P(X)
- Hypothesis testing beyond basics:
  - Multiple testing correction — Bonferroni correction; Benjamini-Hochberg (FDR)
  - Power analysis — determining the required sample size for a given effect size
  - Effect size — Cohen's d; practically significant vs statistically significant
- Multivariate statistics:
  - Covariance and correlation matrices
  - Multicollinearity — correlated predictors causing instability in regression coefficients
  - Principal Component Analysis (PCA) — dimensionality reduction; finding directions of maximum variance
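PCA's recipe (center, covariance matrix, eigendecomposition, sort by eigenvalue) can be verified by hand with NumPy on simulated correlated data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: the second feature is a noisy copy of the first
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# PCA by hand: center -> covariance matrix -> eigendecomposition -> sort
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)  # the first component captures almost all the variance
```

Because the two features are nearly collinear, one principal component explains almost everything; this is exactly the multicollinearity-removal use case mentioned above.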
Causal Inference — High Demand in 2026
- Correlation vs causation — a confounding variable C can cause both X and Y to appear correlated without X causing Y
- Potential outcomes framework — Y(1) = outcome if treated; Y(0) = outcome if untreated; ATE = E[Y(1) - Y(0)]
- Randomized controlled experiments (A/B tests) — the gold standard; random assignment eliminates confounding
- Observational study methods when an RCT is not possible:
  - Difference-in-Differences (DiD) — comparing the pre/post change between treated and untreated groups
  - Regression Discontinuity (RD) — exploiting sharp cutoffs; people just above and below the threshold as comparison
  - Propensity Score Matching (PSM) — matching treated and control units on their probability of treatment
  - Instrumental Variables (IV) — using a variable that affects treatment but not the outcome directly
- Directed Acyclic Graphs (DAGs) — representing causal assumptions; identifying confounders to control for
Resources
- 3Blue1Brown "Essence of Linear Algebra" (YouTube, free, best visual introduction)
- Khan Academy Statistics (free)
- "Naked Statistics" by Wheelan (book)
- StatQuest YouTube (free, excellent)
- "The Book of Why" by Pearl and Mackenzie (causal inference, book)
Stage 01
Python for Data Science — Deep
Python is the primary language for data science. The libraries in this stage are what every data scientist uses daily.
Python Foundations
- All content from Data Analyst Stage 3 applies — data types, pandas, NumPy, matplotlib, seaborn, EDA.
Scikit-learn — The ML Library
- Design philosophy — consistent API: `fit()` on every estimator, `transform()` on transformers, `predict()` on predictors
- Preprocessing:
  - `StandardScaler` — standardize features to mean 0, std 1
  - `MinMaxScaler` — scale to the [0, 1] range
  - `LabelEncoder` / `OrdinalEncoder` — encoding categorical labels
  - `OneHotEncoder` — creating dummy variables for nominal categories
  - `SimpleImputer` — handling missing values (mean, median, most frequent, constant)
  - `PolynomialFeatures` — creating polynomial and interaction features
- Model training pattern using scikit-learn: split with `train_test_split`, build a `Pipeline` with `StandardScaler` and `RandomForestClassifier`, call `pipeline.fit` on the training data, then `pipeline.predict` on the test data and print `classification_report(y_test, y_pred)`.
- Cross-validation:
  - `cross_val_score` — k-fold CV; more robust than a single train/test split
  - `StratifiedKFold` — preserving class distribution across folds
  - `cross_validate` — returning multiple metrics
- Hyperparameter tuning:
  - `GridSearchCV` — exhaustive search over a parameter grid
  - `RandomizedSearchCV` — random sampling from parameter distributions; more efficient
  - `BayesSearchCV` (scikit-optimize) — Bayesian optimization for hyperparameters
- Pipelines — chaining preprocessing and modeling; prevents data leakage
- Model persistence — `joblib.dump` and `joblib.load` for saving and loading models; the stdlib `pickle` module is an alternative, but joblib is preferred for models containing large NumPy arrays
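The training pattern described above, sketched end to end; `make_classification` stands in for a real dataset so the snippet is self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The Pipeline keeps scaling inside the train boundary, preventing leakage:
# the scaler is fit on training data only, then reused at predict time.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
```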
Statsmodels — Statistical Modeling
- OLS regression — `sm.OLS(y, X).fit()`; coefficient estimates, p-values, R², confidence intervals, residual plots
- Logistic regression — `sm.Logit(y, X).fit()`; log-odds interpretation; marginal effects
- Time series models — `statsmodels.tsa`; ARIMA, SARIMAX, Holt-Winters
- Hypothesis testing tools — t-tests, chi-square tests, Durbin-Watson (autocorrelation)
- Difference from sklearn — statsmodels provides inference (p-values, confidence intervals); sklearn provides prediction
SciPy — Scientific Computing
- `scipy.stats` — statistical functions; t-tests, chi-square, Mann-Whitney, Kolmogorov-Smirnov, ANOVA
- `scipy.optimize` — optimization algorithms; minimizing functions
- `scipy.spatial` — distance metrics; k-nearest neighbors computation
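A quick `scipy.stats` sketch: a two-sample t-test on simulated groups whose true means differ by 5 (an assumption of this toy example, not real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(loc=100, scale=10, size=200)
group_b = rng.normal(loc=105, scale=10, size=200)  # true difference of 5

# Null hypothesis: equal means; a small p-value rejects it
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```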
Advanced Python Patterns for Data Scientists
- Decorators for function timing and logging — useful for pipeline profiling
- Context managers — `with` statement; database connections; file handling
- Generator functions — `yield`; memory-efficient iteration over large datasets
- Multiprocessing — `concurrent.futures`; parallelizing CPU-bound data processing
- Type hints — increasingly expected in production data science code
- Testing data science code — `pytest`; testing data transformations and model behavior
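A timing decorator of the kind mentioned above, using only the standard library; `transform` is a hypothetical pipeline step added for illustration:

```python
import functools
import time

def timed(func):
    """Record how long each call takes, for cheap pipeline profiling."""
    @functools.wraps(func)                 # preserves the wrapped name/docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    wrapper.last_elapsed = None
    return wrapper

@timed
def transform(rows):
    # stand-in for an expensive transformation step
    return [r * 2 for r in rows]

out = transform(range(1000))
print(f"transform took {transform.last_elapsed:.6f}s")
```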
Jupyter and Interactive Development
- Jupyter notebooks — exploratory work; visualization; documentation alongside code
- JupyterLab — enhanced IDE-like interface; multiple tabs; text editor
- nbconvert — converting notebooks to HTML, PDF, scripts
- Notebook best practices — clear markdown documentation; reproducible execution from top to bottom; no side effects between cells
- VS Code with Python extension — increasingly preferred for production-quality code alongside notebooks
Resources
- "Python Machine Learning" by Raschka and Mirjalili (book)
- scikit-learn documentation (free)
- "Hands-On Machine Learning" by Géron (the canonical ML textbook)
- Kaggle Learn ML courses (free)
Stage 02
Machine Learning Algorithms — Deep
Understanding why algorithms work, when to use them, and how to debug them distinguishes senior from junior data scientists.
Supervised Learning — Regression
- Linear Regression:
  - Model: ŷ = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
  - Cost function: MSE = (1/n)Σ(yᵢ - ŷᵢ)²
  - OLS (Ordinary Least Squares) — closed-form solution: β = (XᵀX)⁻¹Xᵀy
  - Assumptions: linearity, independence, homoscedasticity, normality of residuals
  - Diagnosing violations: residual plots, Q-Q plots, VIF for multicollinearity
  - Regularization:
    - Ridge (L2) — adds λΣβᵢ² to the cost; shrinks coefficients toward zero; handles multicollinearity
    - Lasso (L1) — adds λΣ|βᵢ| to the cost; produces sparse models; forces some coefficients to exactly zero (feature selection)
    - Elastic Net — combines L1 and L2
    - λ (regularization strength) — higher = more regularization = simpler model; tune via cross-validation
  - Polynomial regression — adding polynomial terms to capture nonlinearity, at the risk of overfitting
- Decision Trees:
  - Splitting criterion — Gini impurity (classification), variance reduction (regression)
  - Recursive partitioning — greedily splitting on the feature that best separates classes
  - Depth control — max_depth, min_samples_split, min_samples_leaf; preventing overfitting
  - Interpretable but high variance — single trees overfit easily
- Random Forest:
  - Ensemble of decision trees; each tree trained on a bootstrap sample (bagging)
  - Each split considers only `sqrt(n_features)` features (random feature selection); decorrelates the trees
  - Prediction = majority vote (classification) or average (regression)
  - Feature importance — mean decrease in impurity across all trees
  - Robust, low tuning needed — good baseline; rarely disastrous
- Gradient Boosting — the most important algorithm family for tabular data:
  - Sequential ensemble — each tree corrects the errors of the previous trees
  - XGBoost:
    - Regularization terms in the objective function; handles missing values
    - `n_estimators`, `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `reg_alpha`, `reg_lambda`
    - Early stopping — stop when the validation metric stops improving; prevents overfitting
  - LightGBM:
    - Histogram-based splitting; much faster than XGBoost on large datasets
    - Leaf-wise growth (vs depth-wise) — more accurate but needs careful depth control
    - `num_leaves` is the key parameter (not max_depth)
  - CatBoost:
    - Native handling of categorical features — no label encoding needed
    - Ordered boosting — reduces target leakage during training
  - When to use gradient boosting: tabular data, mixed types, production ML — the dominant approach
  - Hyperparameter tuning priority: n_estimators → learning_rate → max_depth → subsample → regularization
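The Lasso behavior described under Linear Regression can be seen directly: on simulated data where only 3 of 10 features matter (the coefficients and `alpha` below are illustrative choices), the L1 penalty zeroes out the rest while OLS keeps everything:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
true_beta = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])  # 7 noise features
y = X @ true_beta + rng.normal(scale=0.5, size=300)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

n_zero_ols = int(np.sum(np.abs(ols.coef_) < 1e-8))
n_zero_lasso = int(np.sum(np.abs(lasso.coef_) < 1e-8))
print(n_zero_ols, n_zero_lasso)  # OLS keeps all 10; Lasso drops noise features
```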
Supervised Learning — Classification
- Logistic Regression:
  - Model: P(y=1|x) = σ(z) = 1 / (1 + e⁻ᶻ), where z = βᵀx
  - Linear decision boundary in feature space
  - Interpretable: e^β = multiplicative change in the odds per unit change in x
  - Fast, good baseline; often competitive with more complex models on linearly separable problems
  - Multinomial logistic regression — extending to K classes (softmax function)
- Support Vector Machines (SVM):
  - Find the hyperplane maximizing the margin between classes
  - Support vectors — training points closest to the decision boundary; only these affect the model
  - Kernel trick — mapping to a higher-dimensional space via kernels (RBF, polynomial, linear) without explicit computation
  - C parameter — regularization: small C = wide margin (more misclassification allowed), large C = narrow margin
  - Best for: high-dimensional data; text classification (before deep learning); moderately sized datasets
  - Slow on large datasets — O(n²) to O(n³) training time
- K-Nearest Neighbors (KNN):
  - Non-parametric — memorizes the training data; predicts from the k nearest neighbors
  - k selection — small k = noisy decision boundary; large k = smooth but may lose local patterns
  - Distance metric — Euclidean (default), Manhattan, cosine; requires scaled features
  - Curse of dimensionality — performance degrades in high dimensions; all points become equidistant
  - Intuitive but slow at prediction time — O(n) per prediction
- Naive Bayes:
  - Applies Bayes' theorem with a naive (strong) independence assumption among features
  - Gaussian Naive Bayes — continuous features assumed normally distributed
  - Multinomial Naive Bayes — count features; text classification
  - Fast; works well with small data; surprisingly effective for text
  - The independence assumption is usually violated in practice, yet the method still works well
Unsupervised Learning
- K-Means:
  - Algorithm: assign each point to the nearest centroid → recompute centroids → repeat until convergence
  - k selection: elbow method (inertia vs k), silhouette score
  - Inertia — sum of squared distances to the nearest centroid; a decreasing function of k
  - Limitations: assumes spherical clusters; sensitive to outliers; must specify k
  - K-Means++ — smart initialization; reduces sensitivity to random initialization
- DBSCAN (Density-Based Spatial Clustering):
  - Defines clusters as dense regions separated by low-density regions
  - No need to specify k; handles arbitrary cluster shapes; identifies outliers
  - Parameters: eps (neighborhood radius), min_samples (minimum points to form a cluster)
  - Struggles with varying-density clusters; sensitive to parameters
- Hierarchical Clustering:
  - Agglomerative — bottom-up; each point starts as its own cluster; merge the closest clusters
  - Dendrogram — tree showing the merge history; cut at any level to get n clusters
  - Linkage: single (minimum distance), complete (maximum distance), average, Ward (minimizes intra-cluster variance)
  - Computationally expensive (O(n² log n)); not suitable for large datasets
- PCA (Principal Component Analysis):
  - Finds directions of maximum variance; projects data onto orthogonal axes
  - Steps: standardize → covariance matrix → eigendecomposition → sort by eigenvalue → select top k
  - Explained variance ratio — how much variance each component captures
  - Use cases: visualization (2D/3D), removing multicollinearity, compressing features before modeling
- t-SNE (t-Distributed Stochastic Neighbor Embedding):
  - Non-linear; preserves local structure; excellent for visualization
  - Slow (O(n²)); non-deterministic; not suitable for preprocessing pipelines
  - Perplexity parameter — balances local vs global structure
- UMAP (Uniform Manifold Approximation and Projection):
  - Faster than t-SNE; better at preserving global structure; can be used in pipelines
  - Increasingly preferred over t-SNE for dimensionality reduction
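A short sketch tying K-Means to silhouette-based k selection, on synthetic blobs with a known number of clusters (the centers are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = np.array([[0, 0], [6, 0], [0, 6]])   # three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.7,
                  random_state=42)

# Fit K-Means for several k and score each clustering by silhouette
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)   # matches the true number of blobs
```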
Model Evaluation Metrics — Complete
- Regression Metrics:
  - MAE (Mean Absolute Error) — average absolute difference; robust to outliers; interpretable in original units
  - MSE (Mean Squared Error) — average squared difference; penalizes large errors more; differentiable
  - RMSE (Root MSE) — sqrt(MSE); same units as the target; combines MSE's properties with interpretability
  - R² (Coefficient of Determination) — proportion of variance explained by the model; 1 = perfect; 0 = baseline mean
  - Adjusted R² — penalizes additional predictors; use when comparing models with different feature counts
  - MAPE (Mean Absolute Percentage Error) — percentage error; scale-independent; undefined when y = 0
- Classification Metrics:
  - Confusion matrix — TP, FP, TN, FN; the foundation of all other metrics
  - Accuracy — (TP + TN) / total; misleading with class imbalance
  - Precision — TP / (TP + FP); when the cost of false positives is high (spam detection)
  - Recall / Sensitivity — TP / (TP + FN); when the cost of false negatives is high (cancer detection)
  - F1 Score — harmonic mean of precision and recall; balanced metric
  - F-beta — F-score weighted toward recall (β > 1) or precision (β < 1)
  - AUC-ROC:
    - ROC curve — TPR vs FPR at different classification thresholds
    - AUC = 1.0 perfect; AUC = 0.5 random; AUC < 0.5 worse than random
  - AUC-PR — better than ROC for heavily imbalanced datasets where the positive class is rare
  - Log-loss (cross-entropy loss) — penalizes confident wrong predictions; useful when probabilities matter
  - Calibration — does P(y=1 | prediction = 0.7) actually equal 0.7? Platt scaling, isotonic regression
  - Cohen's Kappa — accuracy adjusted for chance agreement; useful for multi-class problems
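The classification metrics above, computed from one small hand-made example (the labels and scores are invented) so the confusion-matrix arithmetic stays visible:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3, 0.95, 0.05])
y_pred = (y_prob >= 0.5).astype(int)   # threshold the scores at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
auc = roc_auc_score(y_true, y_prob)          # uses scores, not the threshold
print(tp, fp, tn, fn, precision, recall, f1, auc)
```

Note that AUC-ROC consumes the raw probabilities: changing the 0.5 threshold changes precision and recall but leaves AUC untouched.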
Resources
- "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Géron (essential book)
- "The Elements of Statistical Learning" by Hastie et al. (free PDF)
- Kaggle Learn (free)
- "Introduction to Statistical Learning" (ISLR, free PDF, accessible version of ESL)
Stage 03
Deep Learning and NLP
Mid-level and senior data science roles increasingly expect deep learning fluency. NLP is the fastest-growing skill demand in the field.
Deep Learning Foundations
- Neural Network Fundamentals:
  - Architecture: input layer → hidden layers → output layer
  - Neuron computation: output = activation(w·x + b)
- Activation functions:
  - ReLU (Rectified Linear Unit): `max(0, x)`; most common; avoids vanishing gradients; can die (always output 0)
  - Sigmoid: `1/(1+e⁻ˣ)`; outputs [0,1]; binary classification output; saturates (vanishing gradient)
  - Tanh: `(eˣ - e⁻ˣ)/(eˣ + e⁻ˣ)`; outputs [-1,1]; zero-centered; still saturates
  - Softmax: `e^xᵢ / Σe^xⱼ`; multi-class output; probabilities sum to 1
  - GELU, Swish — modern activations; used in transformers
- Forward pass — computing the output from the input through all layers
- Backpropagation — computing gradients via the chain rule; updating weights via gradient descent
- Batch normalization — normalizing layer inputs; stabilizes training; reduces internal covariate shift
- Dropout — randomly zeroing neurons during training; regularization; `p` is the dropout rate
- Weight initialization — Xavier/Glorot for tanh; He/Kaiming for ReLU; bad initialization causes vanishing/exploding gradients
- PyTorch — Primary Framework:
  - Model definition — subclass `nn.Module`; register layers (Linear, ReLU, Dropout, Linear) in `__init__`; compose them in a `forward` method
  - Training loop — per epoch: `model.train()`, `optimizer.zero_grad()`, forward pass, `loss.backward()`, `optimizer.step()`
  - Losses and optimizers — `BCEWithLogitsLoss` for binary classification; `torch.optim` (Adam, SGD, AdamW, RMSprop) configured with learning rate and weight decay
  - `model.train()` vs evaluation mode — affects dropout and batch normalization
  - `torch.no_grad()` for inference; `DataLoader` for batching and shuffling; `nn.Sequential` for simple models
  - Learning rate schedulers — StepLR, CosineAnnealingLR, OneCycleLR
- TensorFlow / Keras — Alternative:
  - Higher-level API; quicker for prototyping; widely used in production deployment
  - Keras `Sequential` vs the functional API vs subclassing
  - `model.compile()`, `model.fit()`, `model.predict()`
  - TensorBoard — visualization of training metrics and computation graphs
- CNNs (Convolutional Neural Networks) — Computer Vision:
  - Convolution operation — sliding a filter across the image; detecting local patterns
  - Feature maps — the output of a convolution; one per filter
  - Pooling — MaxPool, AvgPool; reducing spatial dimensions; translation invariance
  - CNN architecture: Conv → ReLU → Pool (repeat) → Flatten → Dense → Output
  - Transfer learning with CNNs:
    - Pretrained models: ResNet, VGG, EfficientNet, ViT — trained on ImageNet
    - Feature extraction — freeze the pretrained weights; train only the new classification head
    - Fine-tuning — unfreeze some layers; train with a low learning rate
    - Hugging Face `timm` library — hundreds of pretrained image models
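A minimal PyTorch sketch of the patterns listed above, on an invented linearly separable dataset; it uses full-batch updates for brevity, where a real project would use a `DataLoader` and mini-batches:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 4)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)  # synthetic binary target

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 16), nn.ReLU(), nn.Dropout(0.1), nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.net(x)  # raw logits; BCEWithLogitsLoss applies the sigmoid

model = MLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

model.train()                        # enables dropout during training
losses = []
for epoch in range(50):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(X), y)      # forward pass
    loss.backward()                  # backpropagation
    optimizer.step()                 # weight update
    losses.append(loss.item())

model.eval()                         # disables dropout for inference
with torch.no_grad():                # no gradient tracking at inference time
    accuracy = ((model(X) > 0).float() == y).float().mean().item()
print(losses[0], losses[-1], accuracy)
```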
Natural Language Processing — High Demand in 2026
- Text Preprocessing Fundamentals:
  - Tokenization — splitting text into tokens (words, subwords, characters)
  - Lowercasing, punctuation removal, stemming, lemmatization
  - Stop word removal — removing common words with low informational content
  - TF-IDF (Term Frequency–Inverse Document Frequency) — weighting words by importance
  - Bag of Words — representing documents as word-count vectors
- Transformer Models — The Modern NLP Stack:
  - Attention mechanism:
    - Query, Key, Value — computing attention weights between tokens
    - Scaled dot-product attention: `Attention(Q, K, V) = softmax(QKᵀ/√d)V`
    - Multi-head attention — multiple attention heads capturing different relationships
  - BERT (Bidirectional Encoder Representations from Transformers):
    - Bidirectional context understanding; pretrained on masked language modeling
    - Use cases: classification, named entity recognition, question answering
  - GPT-style models — autoregressive; left-to-right; generation tasks
  - Sentence transformers — generating semantic embeddings; similarity search
- Hugging Face Transformers library: load `AutoTokenizer` and `AutoModel` from a pretrained checkpoint such as `bert-base-uncased`, tokenize text with padding and truncation returning PyTorch tensors, run the model under `torch.no_grad()` for inference, and extract the [CLS] token embedding from `outputs.last_hidden_state[:, 0, :]`.
- Fine-tuning pretrained models — adding task-specific heads; using the `datasets` library
- LLM APIs — OpenAI, Anthropic, Google; calling LLMs from data science workflows; building evaluation pipelines
- NLP Tasks:
  - Text classification — sentiment analysis, spam detection, topic classification
  - Named Entity Recognition (NER) — extracting entities (person, organization, location) from text
  - Text summarization — extractive vs abstractive
  - Semantic similarity — cosine similarity of embeddings; duplicate detection; recommendation
  - Information extraction — parsing structured data from unstructured text
  - Retrieval-Augmented Generation (RAG) — combining retrieval with generation for Q&A over custom data
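Before reaching for transformers, TF-IDF plus cosine similarity is the classical baseline for the semantic-similarity task above; the three documents here are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the model predicts customer churn",
    "churn prediction model for customers",
    "gradient boosting on tabular data",
]
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)   # sparse (n_docs, n_terms) TF-IDF matrix

sim = cosine_similarity(X)      # pairwise document similarity, diagonal = 1
print(sim.round(2))
```

Documents 0 and 1 share vocabulary ("model", "churn") and score much higher against each other than against document 2, which shares no terms with them. The known limitation: without shared surface tokens ("predicts" vs "prediction"), TF-IDF sees nothing, which is exactly what embedding-based similarity fixes.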
Stage 04
Feature Engineering and Experiment Design
The most undervalued skill in data science. Feature engineering often produces more improvement than algorithm selection.
Feature Engineering
- Numerical Features:
  - Binning — converting continuous variables to categorical bins
    - Equal-width: dividing the range into equal intervals
    - Equal-frequency (quantile): dividing so each bin has an equal number of observations
  - Log transformation — reducing right skew; `np.log1p(x)` for zero-safe log
  - Power transformations — Box-Cox, Yeo-Johnson; stabilizing variance; making the distribution more normal
  - Interaction features — multiplying two features to capture multiplicative effects; `age × income`
  - Polynomial features — adding higher-order terms; `sklearn.preprocessing.PolynomialFeatures`
  - Rank transformation — replacing values with their rank; robust to outliers; monotonic transformation
- Categorical Features:
  - Label encoding — mapping categories to integers; appropriate only for ordinal categories
  - One-hot encoding — creating binary indicator columns; appropriate for nominal categories
  - Target encoding (mean encoding) — replacing a category with the mean target value; powerful but prone to leakage (must encode within CV folds to prevent target leakage)
  - Count encoding — replacing a category with its frequency in the dataset
  - Frequency encoding — same as count but normalized to a proportion
  - Leave-one-out encoding — target encoding variant that reduces leakage
  - Hashing — converting high-cardinality features to fixed-size vectors
- Date/Time Features:
  - Year, month, day, day-of-week, hour, minute
  - Is weekend, is holiday, is month start/end
  - Cyclical encoding for periodic features: `sin(2π × day_of_year / 365)`, `cos(2π × day_of_year / 365)`
  - Time-since-event features — days since last purchase, days since account creation
  - Business day calculations — excluding weekends/holidays from time differences
- Text Features:
  - Character count, word count, average word length
  - Sentiment score — TextBlob, VADER
  - TF-IDF vectorization
  - Embedding features from pretrained models
- Feature Selection:
  - Filter methods:
    - Correlation filter — remove highly correlated features (|r| > 0.9 between features)
    - Variance filter — remove near-zero-variance features
    - Mutual information — measuring statistical dependence between feature and target
  - Wrapper methods:
    - RFE (Recursive Feature Elimination) — iteratively removing the least important features
    - Sequential feature selection — forward or backward selection
  - Embedded methods:
    - Lasso regularization — coefficients of unimportant features driven to zero
    - Random Forest / XGBoost feature importance — selecting the top N features by importance
  - Feature importance interpretation:
    - Permutation importance — performance drop when a feature is shuffled; more reliable than tree-based importance
    - SHAP values — game-theoretic feature importance; shows each feature's contribution per prediction
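The cyclical date encoding above is worth seeing numerically: after the sin/cos transform, day 1 and day 365 sit next to each other, while a raw numeric encoding would put them 364 apart:

```python
import numpy as np

day_of_year = np.array([1, 91, 182, 274, 365])
angle = 2 * np.pi * day_of_year / 365
day_sin, day_cos = np.sin(angle), np.cos(angle)

# Euclidean distance in the 2-D encoded space
enc = np.column_stack([day_sin, day_cos])
dist_1_365 = np.linalg.norm(enc[0] - enc[4])   # calendar-adjacent days
dist_1_182 = np.linalg.norm(enc[0] - enc[2])   # half a year apart
print(dist_1_365, dist_1_182)
```

Tree models can sometimes learn the wrap-around from the raw integer, but distance-based models (KNN, K-Means) and linear models genuinely need the cyclical pair.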
A/B Testing and Experiment Design
- Statistical Foundation:
  - Null hypothesis — no difference between control and treatment
  - Type I error (α) — false positive; rejecting H₀ when it's true; typically α = 0.05
  - Type II error (β) — false negative; failing to reject H₀ when it's false; power = 1 - β
  - Power analysis — determining the required sample size using effect size d = expected difference / standard deviation, typical power (1 - β) = 0.8, and the two-sample t-test formula n = (z_α/2 + z_β)² × 2σ² / δ² per group
  - Multiple testing — running many tests inflates the Type I error rate; apply Bonferroni or FDR correction
  - p-values vs effect sizes — statistical significance ≠ practical significance; a small effect can be highly significant with large n
- A/B Test Design:
  - Randomization — assignment must be random and stable per user (not random per page view)
  - Unit of randomization — user-level vs session-level vs page-level; affects the analysis
  - Sample size calculation — run the calculator before starting; do not stop early based on results
  - Novelty effect — initial spike in treatment metrics due to newness, not genuine improvement
  - SUTVA (Stable Unit Treatment Value Assumption) — treatment of one unit doesn't affect others; network effects violate this
  - Holdout groups — maintaining a control group post-deployment to measure long-term effects
- Experimentation Platforms:
  - Statsig, Optimizely, LaunchDarkly — commercial experimentation platforms
  - Key features: feature flag management, variant assignment, metric tracking, statistical engine
  - Internal experiment trackers — Airflow + metrics pipelines; custom analysis in Python/SQL
- Causal Inference in Practice:
  - Difference-in-Differences — statsmodels OLS on the formula `outcome ~ post + treatment + post:treatment`; the DiD effect is the coefficient on `post:treatment`
  - Propensity Score Matching with scikit-learn — train a classifier to predict treatment assignment from covariates, match each treated unit to a similar control unit on propensity score, compare outcomes between matched pairs
  - Regression Discontinuity — the `rdrobust` package in Python
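The Difference-in-Differences regression described above, sketched on simulated data where the true treatment effect is 5.0 by construction (all coefficients below are assumptions of the simulation):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
treatment = rng.integers(0, 2, n)   # treated vs control group
post = rng.integers(0, 2, n)        # before vs after the intervention

# Outcome: baseline 10, group gap 2, common time trend 3, true DiD effect 5
outcome = (10 + 2 * treatment + 3 * post
           + 5 * treatment * post + rng.normal(scale=2, size=n))
df = pd.DataFrame({"outcome": outcome, "treatment": treatment, "post": post})

model = smf.ols("outcome ~ post + treatment + post:treatment", data=df).fit()
did_effect = model.params["post:treatment"]   # the DiD estimate
print(model.params)
```

The group gap and the common time trend are absorbed by the `treatment` and `post` terms, so the interaction coefficient recovers the causal effect, which is exactly the DiD identification logic.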
Resources
- "Feature Engineering for Machine Learning" by Zheng and Casari (book)
- "Trustworthy Online Controlled Experiments" by Kohavi et al. (A/B testing bible book)
- Kaggle Feature Engineering course (free)
Stage 05
MLOps and Production Deployment
A model that never leaves a notebook has zero business value. MLOps is increasingly a required skill for data scientists.
MLOps Fundamentals
- MLOps = DevOps principles applied to ML systems
- The production ML lifecycle: data → training → evaluation → deployment → monitoring → retraining
- Why ML systems fail in production:
  - Data drift — the input distribution changes over time
  - Concept drift — the relationship between inputs and outputs changes
  - Training-serving skew — preprocessing differs between training and serving
  - Pipeline failures — upstream data changes breaking downstream models
  - Silent failures — the model keeps returning predictions, but they are increasingly wrong
Experiment Tracking
- MLflow — open-source; tracks runs, parameters, metrics, and artifacts; includes a model registry. Typical usage: wrap training inside `mlflow.start_run(run_name='random_forest_v1')`, call `mlflow.log_params` with a params dict, `mlflow.log_metrics` with metrics such as accuracy and AUC-ROC, and `mlflow.sklearn.log_model(model, 'model')` to persist the trained estimator.
- Weights & Biases (W&B) — more features; teams; interactive visualizations; popular in deep learning
- Neptune.ai — alternative experiment tracker
- DVC (Data Version Control) — versioning datasets and models in Git; tracking data lineage
Model Serving
- REST API deployment — FastAPI + uvicorn for model serving. Pattern: create a `FastAPI()` app, load the trained estimator via `joblib.load('model.pkl')`, define a POST `/predict` handler that takes a features list, reshapes it with NumPy into `(1, -1)`, calls `model.predict` and `model.predict_proba`, and returns the prediction and probability.
- Containerization — Docker; encapsulating model + dependencies; consistent environment
- Cloud model serving:
  - AWS SageMaker — managed model training and deployment
  - Azure ML — managed ML platform; endpoints
  - GCP Vertex AI — model deployment; prediction endpoints
  - Databricks Model Serving — deploying MLflow models
Model Monitoring
- Data drift detection:
  - Population Stability Index (PSI) — detecting shifts in feature distributions
  - KS test — comparing distributions between training and serving data
  - Feature drift reports — Evidently AI (free and paid tiers); automated drift reports
- Model performance monitoring:
  - Ground truth labels are often delayed — monitor proxies until labels arrive
  - Prediction drift — distributional shift in model outputs
  - Business metric correlation — tracking whether the model's business impact holds
- Tools: Evidently AI (open-source), Arize AI, WhyLabs, Fiddler
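PSI is short enough to hand-roll (production systems would lean on a library like Evidently); the 0.1 / 0.25 thresholds used to interpret it are a common rule of thumb, not a formal standard:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference and a serving sample."""
    # Bin edges from the reference (training) distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # catch values outside the range
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)         # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
serving_same = rng.normal(0, 1, 10_000)        # no drift
serving_shifted = rng.normal(0.5, 1, 10_000)   # mean shifted by 0.5 std

psi_same = psi(train, serving_same)
psi_shifted = psi(train, serving_shifted)
print(psi_same, psi_shifted)   # near zero vs well above the 0.1 warning level
```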
Feature Stores
- Centralizing feature computation and serving for both training and serving
- Preventing training-serving skew — same code for both
- Feast — open-source feature store
- Tecton, Hopsworks — commercial feature stores
- When to use: multiple models using the same features; real-time feature serving required
CI/CD for Machine Learning
- Automated testing for ML pipelines:
  - Data validation tests — schema checks, range checks, null checks
  - Model evaluation gates — deploy only if metrics are above threshold
  - Integration tests — testing the full prediction pipeline end to end
- GitHub Actions for ML pipelines:
  - Triggering retraining when new data arrives
  - Running the evaluation suite on each model version
  - Deploying to staging on PR merge; to production on release
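A minimal data-validation gate of the kind listed above, written as a plain function a pytest suite could call (column names and ranges are hypothetical; pandera and Great Expectations are common heavier alternatives):

```python
import pandas as pd

def validate_features(df: pd.DataFrame) -> list[str]:
    """Return a list of validation errors; an empty list means the gate passes."""
    errors = []
    # Schema check: required columns present
    for col in ("user_id", "age", "spend"):
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            return errors
    # Null check
    if df[["user_id", "age", "spend"]].isna().any().any():
        errors.append("null values present")
    # Range checks
    if not df["age"].between(0, 120).all():
        errors.append("age out of range")
    if (df["spend"] < 0).any():
        errors.append("negative spend")
    return errors

good = pd.DataFrame({"user_id": [1, 2], "age": [34, 58], "spend": [10.0, 0.0]})
bad = pd.DataFrame({"user_id": [1, 2], "age": [34, 250], "spend": [10.0, -5.0]})

good_errors = validate_features(good)
bad_errors = validate_features(bad)
print(good_errors, bad_errors)
```

In CI, a pipeline step would fail the build whenever the returned list is non-empty, blocking a bad batch before it reaches training or serving.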
Resources
- "Designing Machine Learning Systems" by Huyen (essential book for production ML)
- Evidently AI documentation (free)
- MLflow documentation (free)
- "Machine Learning Engineering" by Burkov (book)
Stage 06
Cloud and Big Data Tools
Modern data science is cloud-based. Familiarity with cloud ML platforms, and with the big data tools where the data actually lives, is required.
Cloud ML Platforms
- AWS SageMaker:
  - SageMaker Studio — managed JupyterLab environment
  - SageMaker Training — managed training jobs; distributed training
  - SageMaker Endpoints — model deployment; real-time and batch inference
  - SageMaker Pipelines — ML workflow orchestration
  - SageMaker Feature Store — centralized feature management
- Azure Machine Learning:
  - Azure ML Studio — managed workspace; experiments, models, deployments
  - AutoML — automated model selection and hyperparameter tuning
  - Prompt Flow — building and evaluating LLM applications
- GCP Vertex AI:
  - Vertex Workbench — managed JupyterLab
  - Vertex AI Training — managed training; TPU support
  - Vertex AI Prediction — model deployment; explanations
  - BigQuery ML — training models directly in BigQuery SQL
Databricks for Data Science
- Databricks notebooks — collaborative; Python, SQL, Scala, R
- MLflow integration — experiment tracking native to Databricks
- Feature Store — centralized feature management
- Unity Catalog — data governance including ML models
- Delta Live Tables — for ML feature pipelines
- Databricks AutoML — automated ML with explainability
SQL for Data Science
- All content from Data Analyst Stage 2 plus:
- Window functions for feature engineering — lag features, rolling averages, row numbering
- Data warehouse SQL — Snowflake, BigQuery, Redshift specifics
- dbt for feature engineering — building ML-ready features in the warehouse
- BigQuery ML — training models in SQL
Distributed Computing
- Apache Spark:
  - When Spark matters — datasets that don't fit in memory; distributed model training
  - PySpark for data scientists — `spark.read.parquet()`, DataFrames, `.groupBy()`, `.agg()`, ML pipelines
  - Spark MLlib — distributed ML; random forest, gradient boosting, ALS
  - Databricks Spark — managed clusters; better UX than raw Spark
- Dask — parallel computing in Python; extending pandas to out-of-memory datasets; familiar API
Resources
- AWS SageMaker documentation (free)
- Databricks learning portal (free)
- "Spark: The Definitive Guide" by Chambers and Zaharia (book)
FAQ
Common questions
How long does it take to become a Data Scientist?
12–18 months optimistic for someone with strong quantitative background and 20–25 hours/week. 18–30 months realistic, longer if you're starting without statistics or programming foundation. The role demands three competencies — statistics, programming, business context — and weakness in any one ends interviews. Most successful self-taught paths run: stats foundation → Python + pandas → ML projects → portfolio + Kaggle → applications.
Which certifications matter for data science?
Almost none. Coursera/DeepLearning.AI specializations from Andrew Ng signal foundation. Cloud ML certs (AWS Certified Machine Learning – Specialty, Azure DP-100, GCP Professional ML Engineer) help for production-ML roles and are moderately weighted there. The strongest portfolio signal is a Kaggle profile with completed competitions and 3–5 GitHub projects showing end-to-end work — problem framing, EDA, modeling, evaluation, deployment narrative. Certs without portfolio work don't move the needle.
Do I need a master's or PhD?
Depends heavily on the employer. Big Tech (Google, Meta, Apple) and quantitative finance often filter for a master's or PhD. Mid-market companies, startups, and internal data science teams routinely hire bachelor's holders with strong portfolios. ML engineering roles favor SDE backgrounds; research roles favor PhDs. Career-changers from analyst, statistician, or quant roles transition into data science without additional formal education when their portfolios demonstrate ML fluency. Either way, statistics/ML skills appear in 92% of postings.
What separates a Data Scientist who gets hired from one who doesn't?
Business context, not algorithm depth. Junior candidates who can describe XGBoost in detail but can't articulate when not to use it lose to candidates who frame problems well. Senior interviews are dominated by case studies — given an ambiguous business question, what's your plan? Other differentiators: production-aware modeling (not just notebook code), MLOps fluency (versioning, monitoring, drift detection), and clear written communication. BLS projects 36% growth in data scientist employment from 2023 to 2033.