Roadmap
Data Scientist
The professional who frames business questions as data problems, builds statistical models and machine learning systems to answer them, and communicates findings that drive decisions. Combines statistical rigor, programming depth, and business context to produce predictions, recommendations, and insights from structured and unstructured data.
OPTIMISTIC 12–18 months · REALISTIC 18–30 months
Stage 00
Mathematics and Statistics Foundations
Data science is applied mathematics. You can get started with surface-level statistics, but senior roles require genuine depth in probability, linear algebra, and statistical inference.
Statistics — Applied Foundation
- All content from Data Analyst Stage 0 (descriptive statistics, probability, inferential statistics) plus:
Linear Algebra — The Language of Machine Learning
- Vectors — ordered collections of numbers; direction and magnitude in n-dimensional space
- Matrices — rectangular arrays of numbers; representing linear transformations
- Matrix operations:
  - Addition and scalar multiplication — element-wise
  - Matrix multiplication — row × column dot products; not commutative
  - Transpose — flipping rows and columns
  - Inverse — A⁻¹ such that AA⁻¹ = I (identity matrix); solving systems of equations
  - Determinant — scalar characterizing a matrix; det = 0 means non-invertible (singular)
- Dot product — `a · b = |a||b|cos(θ)`; measures similarity between vectors; fundamental to neural networks
- Eigenvalues and eigenvectors — `Av = λv`; eigenvectors are directions unchanged by a transformation; eigenvalues are the scaling factors
  - Used in PCA (Principal Component Analysis) for dimensionality reduction
- Norms — measuring vector magnitude; L1 norm (sum of absolute values), L2 norm (Euclidean distance)
- Matrix factorizations — SVD (Singular Value Decomposition); used in recommendation systems and dimensionality reduction
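These operations map directly onto NumPy. A minimal sketch (the matrices here are arbitrary illustrative values):

```python
import numpy as np

# Dot product as a similarity measure: a · b = |a||b|cos(θ)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])          # same direction as a
cos_theta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_theta, 4))             # parallel vectors have cosine similarity 1.0

# Eigendecomposition: Av = λv for a symmetric (covariance-like) matrix
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eigh(A)   # ascending eigenvalue order
v = eigenvectors[:, -1]                # eigenvector of the largest eigenvalue
# A @ v points in the same direction as v, scaled by λ
print(np.allclose(A @ v, eigenvalues[-1] * v))

# SVD: any matrix factors as U Σ Vᵀ; reconstructing confirms the factorization
M = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])
U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(np.allclose(U @ np.diag(s) @ Vt, M))
```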
Calculus — For Understanding Learning Algorithms
- Derivatives — rate of change; `f'(x) = lim(h→0) [f(x+h) - f(x)] / h`
- Chain rule — derivative of composite function; critical for backpropagation in neural networks
- Partial derivatives — derivative with respect to one variable holding others constant
- Gradient — vector of partial derivatives; points in the direction of steepest increase
- Gradient descent — moving in the direction of the negative gradient to minimize a loss function
  - Batch GD — compute the gradient on the entire dataset
  - Stochastic GD (SGD) — compute the gradient on one sample; noisy but fast
  - Mini-batch GD — compromise; standard practice in deep learning
- Optimization concepts:
  - Local vs global minima — gradient descent finds local minima; random restarts help
  - Learning rate — how large a step to take; too large = overshoot; too small = slow
  - Learning rate schedules — reducing the learning rate over training
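The update rule above can be sketched in a few lines of plain Python, minimizing the toy loss f(x) = (x - 3)^2, whose gradient is 2(x - 3) and whose minimum sits at x = 3:

```python
# Minimal gradient descent on f(x) = (x - 3)^2.
def gradient_descent(lr=0.1, steps=100, x0=0.0):
    x = x0
    for _ in range(steps):
        grad = 2 * (x - 3)   # derivative of the loss at the current point
        x -= lr * grad       # step in the direction of the negative gradient
    return x

x_good = gradient_descent(lr=0.1)    # converges very close to 3
x_slow = gradient_descent(lr=0.001)  # too small: still far from 3 after 100 steps
print(x_good, x_slow)
```

Trying a learning rate above 1.0 on the same function shows the overshoot failure mode from the list above.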
Advanced Statistics
- Probability distributions:
  - Discrete: Bernoulli, Binomial, Poisson, Geometric
  - Continuous: Uniform, Normal, Exponential, Beta, Gamma
  - Understanding when each applies; parameter interpretation
- Maximum Likelihood Estimation (MLE) — finding parameters that maximize the likelihood of the observed data
- Bayesian inference — updating prior beliefs with observed data:
  - Prior — belief before seeing data: P(θ)
  - Likelihood — probability of the data given the parameter: P(X|θ)
  - Posterior — updated belief after seeing data: P(θ|X) ∝ P(X|θ) × P(θ)
  - Bayes' Theorem: P(θ|X) = P(X|θ)P(θ) / P(X)
- Hypothesis testing beyond basics:
  - Multiple testing correction — Bonferroni correction; Benjamini-Hochberg (FDR)
  - Power analysis — determining the required sample size for a given effect size
  - Effect size — Cohen's d; practically significant vs statistically significant
- Multivariate statistics:
  - Covariance and correlation matrices
  - Multicollinearity — correlated predictors causing instability in regression coefficients
  - Principal Component Analysis (PCA) — dimensionality reduction; finding directions of maximum variance
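PCA's recipe (center, covariance matrix, eigendecomposition, sort by eigenvalue) can be verified by hand with NumPy on simulated correlated data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: the second feature is a noisy copy of the first
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# PCA by hand: center -> covariance matrix -> eigendecomposition -> sort
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio)  # the first component captures almost all the variance
```

Because the two features are nearly collinear, one principal component explains almost everything; this is exactly the multicollinearity-removal use case mentioned above.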
Causal Inference — High Demand in 2026
- Correlation vs causation — a confounding variable C can cause both X and Y to appear correlated without X causing Y
- Potential outcomes framework — Y(1) = outcome if treated; Y(0) = outcome if untreated; ATE = E[Y(1) - Y(0)]
- Randomized controlled experiments (A/B tests) — the gold standard; random assignment eliminates confounding
- Observational study methods when an RCT is not possible:
  - Difference-in-Differences (DiD) — comparing the pre/post change between treated and untreated groups
  - Regression Discontinuity (RD) — exploiting sharp cutoffs; people just above and below the threshold as comparison
  - Propensity Score Matching (PSM) — matching treated and control units on their probability of treatment
  - Instrumental Variables (IV) — using a variable that affects treatment but not the outcome directly
- Directed Acyclic Graphs (DAGs) — representing causal assumptions; identifying confounders to control for
Resources
- 3Blue1Brown "Essence of Linear Algebra" (YouTube, free, best visual introduction)
- Khan Academy Statistics (free)
- "Naked Statistics" by Wheelan (book)
- StatQuest YouTube (free, excellent)
- "The Book of Why" by Pearl and Mackenzie (causal inference, book)
Stage 01
Python for Data Science — Deep
Python is the primary language for data science. The libraries in this stage are what every data scientist uses daily.
Python Foundations
- All content from Data Analyst Stage 3 applies — data types, pandas, NumPy, matplotlib, seaborn, EDA.
Scikit-learn — The ML Library
- Design philosophy — consistent API: `fit()` on every estimator, `transform()` on transformers, `predict()` on predictors
- Preprocessing:
  - `StandardScaler` — standardize features to mean 0, std 1
  - `MinMaxScaler` — scale to the [0, 1] range
  - `LabelEncoder` / `OrdinalEncoder` — encoding categorical labels
  - `OneHotEncoder` — creating dummy variables for nominal categories
  - `SimpleImputer` — handling missing values (mean, median, most frequent, constant)
  - `PolynomialFeatures` — creating polynomial and interaction features
- Model training pattern using scikit-learn: split with `train_test_split`, build a `Pipeline` with `StandardScaler` and `RandomForestClassifier`, call `pipeline.fit` on the training data, then `pipeline.predict` on the test data and print `classification_report(y_test, y_pred)`.
- Cross-validation:
  - `cross_val_score` — k-fold CV; more robust than a single train/test split
  - `StratifiedKFold` — preserving class distribution across folds
  - `cross_validate` — returning multiple metrics
- Hyperparameter tuning:
  - `GridSearchCV` — exhaustive search over a parameter grid
  - `RandomizedSearchCV` — random sampling from parameter distributions; more efficient
  - `BayesSearchCV` (scikit-optimize) — Bayesian optimization for hyperparameters
- Pipelines — chaining preprocessing and modeling; prevents data leakage
- Model persistence — `joblib.dump` and `joblib.load` for saving and loading models; the stdlib `pickle` module is an alternative, but joblib is preferred for models containing large NumPy arrays
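The training pattern described above, sketched end to end; `make_classification` stands in for a real dataset so the snippet is self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The Pipeline keeps scaling inside the train boundary, preventing leakage:
# the scaler is fit on training data only, then reused at predict time.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(classification_report(y_test, y_pred))
```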
Statsmodels — Statistical Modeling
- OLS regression — `sm.OLS(y, X).fit()`; coefficient estimates, p-values, R², confidence intervals, residual plots
- Logistic regression — `sm.Logit(y, X).fit()`; log-odds interpretation; marginal effects
- Time series models — `statsmodels.tsa`; ARIMA, SARIMAX, Holt-Winters
- Hypothesis testing tools — t-tests, chi-square tests, Durbin-Watson (autocorrelation)
- Difference from sklearn — statsmodels provides inference (p-values, confidence intervals); sklearn provides prediction
SciPy — Scientific Computing
- `scipy.stats` — statistical functions; t-tests, chi-square, Mann-Whitney, Kolmogorov-Smirnov, ANOVA
- `scipy.optimize` — optimization algorithms; minimizing functions
- `scipy.spatial` — distance metrics; k-nearest neighbors computation
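A quick `scipy.stats` sketch: a two-sample t-test on simulated groups whose true means differ by 5 (an assumption of this toy example, not real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(loc=100, scale=10, size=200)
group_b = rng.normal(loc=105, scale=10, size=200)  # true difference of 5

# Null hypothesis: equal means; a small p-value rejects it
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```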
Advanced Python Patterns for Data Scientists
- Decorators for function timing and logging — useful for pipeline profiling
- Context managers — `with` statement; database connections; file handling
- Generator functions — `yield`; memory-efficient iteration over large datasets
- Multiprocessing — `concurrent.futures`; parallelizing CPU-bound data processing
- Type hints — increasingly expected in production data science code
- Testing data science code — `pytest`; testing data transformations and model behavior
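A timing decorator of the kind mentioned above, using only the standard library; `transform` is a hypothetical pipeline step added for illustration:

```python
import functools
import time

def timed(func):
    """Record how long each call takes, for cheap pipeline profiling."""
    @functools.wraps(func)                 # preserves the wrapped name/docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.last_elapsed = time.perf_counter() - start
        return result
    wrapper.last_elapsed = None
    return wrapper

@timed
def transform(rows):
    # stand-in for an expensive transformation step
    return [r * 2 for r in rows]

out = transform(range(1000))
print(f"transform took {transform.last_elapsed:.6f}s")
```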
Jupyter and Interactive Development
- Jupyter notebooks — exploratory work; visualization; documentation alongside code
- JupyterLab — enhanced IDE-like interface; multiple tabs; text editor
- nbconvert — converting notebooks to HTML, PDF, scripts
- Notebook best practices — clear markdown documentation; reproducible execution from top to bottom; no side effects between cells
- VS Code with Python extension — increasingly preferred for production-quality code alongside notebooks
Resources
- "Python Machine Learning" by Raschka and Mirjalili (book)
- scikit-learn documentation (free)
- "Hands-On Machine Learning" by Géron (the canonical ML textbook)
- Kaggle Learn ML courses (free)
Stage 02
Machine Learning Algorithms — Deep
Understanding why algorithms work, when to use them, and how to debug them distinguishes senior from junior data scientists.
Supervised Learning — Regression
- Linear Regression:
  - Model: ŷ = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
  - Cost function: MSE = (1/n)Σ(yᵢ - ŷᵢ)²
  - OLS (Ordinary Least Squares) — closed-form solution: β = (XᵀX)⁻¹Xᵀy
  - Assumptions: linearity, independence, homoscedasticity, normality of residuals
  - Diagnosing violations: residual plots, Q-Q plots, VIF for multicollinearity
  - Regularization:
    - Ridge (L2) — adds λΣβᵢ² to the cost; shrinks coefficients toward zero; handles multicollinearity
    - Lasso (L1) — adds λΣ|βᵢ| to the cost; produces sparse models; forces some coefficients to exactly zero (feature selection)
    - Elastic Net — combines L1 and L2
    - λ (regularization strength) — higher = more regularization = simpler model; tune via cross-validation
  - Polynomial regression — adding polynomial terms to capture nonlinearity, at the risk of overfitting
- Decision Trees:
  - Splitting criterion — Gini impurity (classification), variance reduction (regression)
  - Recursive partitioning — greedily splitting on the feature that best separates classes
  - Depth control — max_depth, min_samples_split, min_samples_leaf; preventing overfitting
  - Interpretable but high variance — single trees overfit easily
- Random Forest:
  - Ensemble of decision trees; each tree trained on a bootstrap sample (bagging)
  - Each split considers only `sqrt(n_features)` features (random feature selection); decorrelates the trees
  - Prediction = majority vote (classification) or average (regression)
  - Feature importance — mean decrease in impurity across all trees
  - Robust, low tuning needed — good baseline; rarely disastrous
- Gradient Boosting — the most important algorithm family for tabular data:
  - Sequential ensemble — each tree corrects the errors of the previous trees
  - XGBoost:
    - Regularization terms in the objective function; handles missing values
    - `n_estimators`, `learning_rate`, `max_depth`, `subsample`, `colsample_bytree`, `reg_alpha`, `reg_lambda`
    - Early stopping — stop when the validation metric stops improving; prevents overfitting
  - LightGBM:
    - Histogram-based splitting; much faster than XGBoost on large datasets
    - Leaf-wise growth (vs depth-wise) — more accurate but needs careful depth control
    - `num_leaves` is the key parameter (not max_depth)
  - CatBoost:
    - Native handling of categorical features — no label encoding needed
    - Ordered boosting — reduces target leakage during training
  - When to use gradient boosting: tabular data, mixed types, production ML — the dominant approach
  - Hyperparameter tuning priority: n_estimators → learning_rate → max_depth → subsample → regularization
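The Lasso behavior described under Linear Regression can be seen directly: on simulated data where only 3 of 10 features matter (the coefficients and `alpha` below are illustrative choices), the L1 penalty zeroes out the rest while OLS keeps everything:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
true_beta = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])  # 7 noise features
y = X @ true_beta + rng.normal(scale=0.5, size=300)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

n_zero_ols = int(np.sum(np.abs(ols.coef_) < 1e-8))
n_zero_lasso = int(np.sum(np.abs(lasso.coef_) < 1e-8))
print(n_zero_ols, n_zero_lasso)  # OLS keeps all 10; Lasso drops noise features
```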
Supervised Learning — Classification
- Logistic Regression:
  - Model: P(y=1|x) = σ(z) = 1 / (1 + e⁻ᶻ), where z = βᵀx
  - Linear decision boundary in feature space
  - Interpretable: e^β = multiplicative change in the odds per unit change in x
  - Fast, good baseline; often competitive with more complex models on linearly separable problems
  - Multinomial logistic regression — extending to K classes (softmax function)
- Support Vector Machines (SVM):
  - Find the hyperplane maximizing the margin between classes
  - Support vectors — training points closest to the decision boundary; only these affect the model
  - Kernel trick — mapping to a higher-dimensional space via kernels (RBF, polynomial, linear) without explicit computation
  - C parameter — regularization: small C = wide margin (more misclassification allowed), large C = narrow margin
  - Best for: high-dimensional data; text classification (before deep learning); moderately sized datasets
  - Slow on large datasets — O(n²) to O(n³) training time
- K-Nearest Neighbors (KNN):
  - Non-parametric — memorizes the training data; predicts from the k nearest neighbors
  - k selection — small k = noisy decision boundary; large k = smooth but may lose local patterns
  - Distance metric — Euclidean (default), Manhattan, cosine; requires scaled features
  - Curse of dimensionality — performance degrades in high dimensions; all points become equidistant
  - Intuitive but slow at prediction time — O(n) per prediction
- Naive Bayes:
  - Applies Bayes' theorem with a naive (strong) independence assumption among features
  - Gaussian Naive Bayes — continuous features assumed normally distributed
  - Multinomial Naive Bayes — count features; text classification
  - Fast; works well with small data; surprisingly effective for text
  - The independence assumption is usually violated in practice, yet the method still works well
Unsupervised Learning
- K-Means:
  - Algorithm: assign each point to the nearest centroid → recompute centroids → repeat until convergence
  - k selection: elbow method (inertia vs k), silhouette score
  - Inertia — sum of squared distances to the nearest centroid; a decreasing function of k
  - Limitations: assumes spherical clusters; sensitive to outliers; must specify k
  - K-Means++ — smart initialization; reduces sensitivity to random initialization
- DBSCAN (Density-Based Spatial Clustering):
  - Defines clusters as dense regions separated by low-density regions
  - No need to specify k; handles arbitrary cluster shapes; identifies outliers
  - Parameters: eps (neighborhood radius), min_samples (minimum points to form a cluster)
  - Struggles with varying-density clusters; sensitive to parameters
- Hierarchical Clustering:
  - Agglomerative — bottom-up; each point starts as its own cluster; merge the closest clusters
  - Dendrogram — tree showing the merge history; cut at any level to get n clusters
  - Linkage: single (minimum distance), complete (maximum distance), average, Ward (minimizes intra-cluster variance)
  - Computationally expensive (O(n² log n)); not suitable for large datasets
- PCA (Principal Component Analysis):
  - Finds directions of maximum variance; projects data onto orthogonal axes
  - Steps: standardize → covariance matrix → eigendecomposition → sort by eigenvalue → select top k
  - Explained variance ratio — how much variance each component captures
  - Use cases: visualization (2D/3D), removing multicollinearity, compressing features before modeling
- t-SNE (t-Distributed Stochastic Neighbor Embedding):
  - Non-linear; preserves local structure; excellent for visualization
  - Slow (O(n²)); non-deterministic; not suitable for preprocessing pipelines
  - Perplexity parameter — balances local vs global structure
- UMAP (Uniform Manifold Approximation and Projection):
  - Faster than t-SNE; better at preserving global structure; can be used in pipelines
  - Increasingly preferred over t-SNE for dimensionality reduction
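A short sketch tying K-Means to silhouette-based k selection, on synthetic blobs with a known number of clusters (the centers are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = np.array([[0, 0], [6, 0], [0, 6]])   # three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.7,
                  random_state=42)

# Fit K-Means for several k and score each clustering by silhouette
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)   # matches the true number of blobs
```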
Model Evaluation Metrics — Complete
- Regression Metrics:
  - MAE (Mean Absolute Error) — average absolute difference; robust to outliers; interpretable in original units
  - MSE (Mean Squared Error) — average squared difference; penalizes large errors more; differentiable
  - RMSE (Root MSE) — sqrt(MSE); same units as the target; combines MSE's properties with interpretability
  - R² (Coefficient of Determination) — proportion of variance explained by the model; 1 = perfect; 0 = baseline mean
  - Adjusted R² — penalizes additional predictors; use when comparing models with different feature counts
  - MAPE (Mean Absolute Percentage Error) — percentage error; scale-independent; undefined when y = 0
- Classification Metrics:
  - Confusion matrix — TP, FP, TN, FN; the foundation of all other metrics
  - Accuracy — (TP + TN) / total; misleading with class imbalance
  - Precision — TP / (TP + FP); when the cost of false positives is high (spam detection)
  - Recall / Sensitivity — TP / (TP + FN); when the cost of false negatives is high (cancer detection)
  - F1 Score — harmonic mean of precision and recall; balanced metric
  - F-beta — F-score weighted toward recall (β > 1) or precision (β < 1)
  - AUC-ROC:
    - ROC curve — TPR vs FPR at different classification thresholds
    - AUC = 1.0 perfect; AUC = 0.5 random; AUC < 0.5 worse than random
  - AUC-PR — better than ROC for heavily imbalanced datasets where the positive class is rare
  - Log-loss (cross-entropy loss) — penalizes confident wrong predictions; useful when probabilities matter
  - Calibration — does P(y=1 | prediction = 0.7) actually equal 0.7? Platt scaling, isotonic regression
  - Cohen's Kappa — accuracy adjusted for chance agreement; useful for multi-class problems
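The classification metrics above, computed from one small hand-made example (the labels and scores are invented) so the confusion-matrix arithmetic stays visible:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3, 0.95, 0.05])
y_pred = (y_prob >= 0.5).astype(int)   # threshold the scores at 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
auc = roc_auc_score(y_true, y_prob)          # uses scores, not the threshold
print(tp, fp, tn, fn, precision, recall, f1, auc)
```

Note that AUC-ROC consumes the raw probabilities: changing the 0.5 threshold changes precision and recall but leaves AUC untouched.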
Resources
- "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" by Géron (essential book)
- "The Elements of Statistical Learning" by Hastie et al. (free PDF)
- Kaggle Learn (free)
- "Introduction to Statistical Learning" (ISLR, free PDF, accessible version of ESL)
Stage 03
Deep Learning and NLP
Mid-level and senior data science roles increasingly expect deep learning fluency. NLP is the fastest-growing skill demand in the field.
Deep Learning Foundations
- Neural Network Fundamentals:
  - Architecture: input layer → hidden layers → output layer
  - Neuron computation: output = activation(w·x + b)
- Activation functions:
  - ReLU (Rectified Linear Unit): `max(0, x)`; most common; avoids vanishing gradients; can die (always output 0)
  - Sigmoid: `1/(1+e⁻ˣ)`; outputs [0,1]; binary classification output; saturates (vanishing gradient)
  - Tanh: `(eˣ - e⁻ˣ)/(eˣ + e⁻ˣ)`; outputs [-1,1]; zero-centered; still saturates
  - Softmax: `e^xᵢ / Σe^xⱼ`; multi-class output; probabilities sum to 1
  - GELU, Swish — modern activations; used in transformers
- Forward pass — computing the output from the input through all layers
- Backpropagation — computing gradients via the chain rule; updating weights via gradient descent
- Batch normalization — normalizing layer inputs; stabilizes training; reduces internal covariate shift
- Dropout — randomly zeroing neurons during training; regularization; `p` is the dropout rate
- Weight initialization — Xavier/Glorot for tanh; He/Kaiming for ReLU; bad initialization causes vanishing/exploding gradients
- PyTorch — Primary Framework:
  - Model definition — subclass `nn.Module`; register layers (Linear, ReLU, Dropout, Linear) in `__init__`; compose them in a `forward` method
  - Training loop — per epoch: `model.train()`, `optimizer.zero_grad()`, forward pass, `loss.backward()`, `optimizer.step()`
  - Losses and optimizers — `BCEWithLogitsLoss` for binary classification; `torch.optim` (Adam, SGD, AdamW, RMSprop) configured with learning rate and weight decay
  - `model.train()` vs evaluation mode — affects dropout and batch normalization
  - `torch.no_grad()` for inference; `DataLoader` for batching and shuffling; `nn.Sequential` for simple models
  - Learning rate schedulers — StepLR, CosineAnnealingLR, OneCycleLR
- TensorFlow / Keras — Alternative:
  - Higher-level API; quicker for prototyping; widely used in production deployment
  - Keras `Sequential` vs the functional API vs subclassing
  - `model.compile()`, `model.fit()`, `model.predict()`
  - TensorBoard — visualization of training metrics and computation graphs
- CNNs (Convolutional Neural Networks) — Computer Vision:
  - Convolution operation — sliding a filter across the image; detecting local patterns
  - Feature maps — the output of a convolution; one per filter
  - Pooling — MaxPool, AvgPool; reducing spatial dimensions; translation invariance
  - CNN architecture: Conv → ReLU → Pool (repeat) → Flatten → Dense → Output
  - Transfer learning with CNNs:
    - Pretrained models: ResNet, VGG, EfficientNet, ViT — trained on ImageNet
    - Feature extraction — freeze the pretrained weights; train only the new classification head
    - Fine-tuning — unfreeze some layers; train with a low learning rate
    - Hugging Face `timm` library — hundreds of pretrained image models
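A minimal PyTorch sketch of the patterns listed above, on an invented linearly separable dataset; it uses full-batch updates for brevity, where a real project would use a `DataLoader` and mini-batches:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 4)
y = (X[:, 0] + X[:, 1] > 0).float().unsqueeze(1)  # synthetic binary target

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, 16), nn.ReLU(), nn.Dropout(0.1), nn.Linear(16, 1)
        )

    def forward(self, x):
        return self.net(x)  # raw logits; BCEWithLogitsLoss applies the sigmoid

model = MLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

model.train()                        # enables dropout during training
losses = []
for epoch in range(50):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(X), y)      # forward pass
    loss.backward()                  # backpropagation
    optimizer.step()                 # weight update
    losses.append(loss.item())

model.eval()                         # disables dropout for inference
with torch.no_grad():                # no gradient tracking at inference time
    accuracy = ((model(X) > 0).float() == y).float().mean().item()
print(losses[0], losses[-1], accuracy)
```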
Natural Language Processing — High Demand in 2026
- Text Preprocessing Fundamentals:
  - Tokenization — splitting text into tokens (words, subwords, characters)
  - Lowercasing, punctuation removal, stemming, lemmatization
  - Stop word removal — removing common words with low informational content
  - TF-IDF (Term Frequency–Inverse Document Frequency) — weighting words by importance
  - Bag of Words — representing documents as word-count vectors
- Transformer Models — The Modern NLP Stack:
  - Attention mechanism:
    - Query, Key, Value — computing attention weights between tokens
    - Scaled dot-product attention: `Attention(Q, K, V) = softmax(QKᵀ/√d)V`
    - Multi-head attention — multiple attention heads capturing different relationships
  - BERT (Bidirectional Encoder Representations from Transformers):
    - Bidirectional context understanding; pretrained on masked language modeling
    - Use cases: classification, named entity recognition, question answering
  - GPT-style models — autoregressive; left-to-right; generation tasks
  - Sentence transformers — generating semantic embeddings; similarity search
- Hugging Face Transformers library: load `AutoTokenizer` and `AutoModel` from a pretrained checkpoint such as `bert-base-uncased`, tokenize text with padding and truncation returning PyTorch tensors, run the model under `torch.no_grad()` for inference, and extract the [CLS] token embedding from `outputs.last_hidden_state[:, 0, :]`.
- Fine-tuning pretrained models — adding task-specific heads; using the `datasets` library
- LLM APIs — OpenAI, Anthropic, Google; calling LLMs from data science workflows; building evaluation pipelines
- NLP Tasks:
  - Text classification — sentiment analysis, spam detection, topic classification
  - Named Entity Recognition (NER) — extracting entities (person, organization, location) from text
  - Text summarization — extractive vs abstractive
  - Semantic similarity — cosine similarity of embeddings; duplicate detection; recommendation
  - Information extraction — parsing structured data from unstructured text
  - Retrieval-Augmented Generation (RAG) — combining retrieval with generation for Q&A over custom data
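Before reaching for transformers, TF-IDF plus cosine similarity is the classical baseline for the semantic-similarity task above; the three documents here are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the model predicts customer churn",
    "churn prediction model for customers",
    "gradient boosting on tabular data",
]
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)   # sparse (n_docs, n_terms) TF-IDF matrix

sim = cosine_similarity(X)      # pairwise document similarity, diagonal = 1
print(sim.round(2))
```

Documents 0 and 1 share vocabulary ("model", "churn") and score much higher against each other than against document 2, which shares no terms with them. The known limitation: without shared surface tokens ("predicts" vs "prediction"), TF-IDF sees nothing, which is exactly what embedding-based similarity fixes.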
Stage 04
Feature Engineering and Experiment Design
The most undervalued skill in data science. Feature engineering often produces more improvement than algorithm selection.
Feature Engineering
- Numerical Features:
  - Binning — converting continuous variables to categorical bins
    - Equal-width: dividing the range into equal intervals
    - Equal-frequency (quantile): dividing so each bin has an equal number of observations
  - Log transformation — reducing right skew; `np.log1p(x)` for zero-safe log
  - Power transformations — Box-Cox, Yeo-Johnson; stabilizing variance; making the distribution more normal
  - Interaction features — multiplying two features to capture multiplicative effects; `age × income`
  - Polynomial features — adding higher-order terms; `sklearn.preprocessing.PolynomialFeatures`
  - Rank transformation — replacing values with their rank; robust to outliers; monotonic transformation
- Categorical Features:
  - Label encoding — mapping categories to integers; appropriate only for ordinal categories
  - One-hot encoding — creating binary indicator columns; appropriate for nominal categories
  - Target encoding (mean encoding) — replacing a category with the mean target value; powerful but prone to leakage (must encode within CV folds to prevent target leakage)
  - Count encoding — replacing a category with its frequency in the dataset
  - Frequency encoding — same as count but normalized to a proportion
  - Leave-one-out encoding — target encoding variant that reduces leakage
  - Hashing — converting high-cardinality features to fixed-size vectors
- Date/Time Features:
  - Year, month, day, day-of-week, hour, minute
  - Is weekend, is holiday, is month start/end
  - Cyclical encoding for periodic features: `sin(2π × day_of_year / 365)`, `cos(2π × day_of_year / 365)`
  - Time-since-event features — days since last purchase, days since account creation
  - Business day calculations — excluding weekends/holidays from time differences
- Text Features:
  - Character count, word count, average word length
  - Sentiment score — TextBlob, VADER
  - TF-IDF vectorization
  - Embedding features from pretrained models
- Feature Selection:
  - Filter methods:
    - Correlation filter — remove highly correlated features (|r| > 0.9 between features)
    - Variance filter — remove near-zero-variance features
    - Mutual information — measuring statistical dependence between feature and target
  - Wrapper methods:
    - RFE (Recursive Feature Elimination) — iteratively removing the least important features
    - Sequential feature selection — forward or backward selection
  - Embedded methods:
    - Lasso regularization — coefficients of unimportant features driven to zero
    - Random Forest / XGBoost feature importance — selecting the top N features by importance
  - Feature importance interpretation:
    - Permutation importance — performance drop when a feature is shuffled; more reliable than tree-based importance
    - SHAP values — game-theoretic feature importance; shows each feature's contribution per prediction
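The cyclical date encoding above is worth seeing numerically: after the sin/cos transform, day 1 and day 365 sit next to each other, while a raw numeric encoding would put them 364 apart:

```python
import numpy as np

day_of_year = np.array([1, 91, 182, 274, 365])
angle = 2 * np.pi * day_of_year / 365
day_sin, day_cos = np.sin(angle), np.cos(angle)

# Euclidean distance in the 2-D encoded space
enc = np.column_stack([day_sin, day_cos])
dist_1_365 = np.linalg.norm(enc[0] - enc[4])   # calendar-adjacent days
dist_1_182 = np.linalg.norm(enc[0] - enc[2])   # half a year apart
print(dist_1_365, dist_1_182)
```

Tree models can sometimes learn the wrap-around from the raw integer, but distance-based models (KNN, K-Means) and linear models genuinely need the cyclical pair.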
A/B Testing and Experiment Design
- Statistical Foundation:
  - Null hypothesis — no difference between control and treatment
  - Type I error (α) — false positive; rejecting H₀ when it's true; typically α = 0.05
  - Type II error (β) — false negative; failing to reject H₀ when it's false; power = 1 - β
  - Power analysis — determining the required sample size using effect size d = expected difference / standard deviation, typical power (1 - β) = 0.8, and the two-sample t-test formula n = (z_α/2 + z_β)² × 2σ² / δ² per group
  - Multiple testing — running many tests inflates the Type I error rate; apply Bonferroni or FDR correction
  - p-values vs effect sizes — statistical significance ≠ practical significance; a small effect can be highly significant with large n
- A/B Test Design:
  - Randomization — assignment must be random and stable per user (not random per page view)
  - Unit of randomization — user-level vs session-level vs page-level; affects the analysis
  - Sample size calculation — run the calculator before starting; do not stop early based on results
  - Novelty effect — initial spike in treatment metrics due to newness, not genuine improvement
  - SUTVA (Stable Unit Treatment Value Assumption) — treatment of one unit doesn't affect others; network effects violate this
  - Holdout groups — maintaining a control group post-deployment to measure long-term effects
- Experimentation Platforms:
  - Statsig, Optimizely, LaunchDarkly — commercial experimentation platforms
  - Key features: feature flag management, variant assignment, metric tracking, statistical engine
  - Internal experiment trackers — Airflow + metrics pipelines; custom analysis in Python/SQL
- Causal Inference in Practice:
  - Difference-in-Differences — statsmodels OLS on the formula `outcome ~ post + treatment + post:treatment`; the DiD effect is the coefficient on `post:treatment`
  - Propensity Score Matching with scikit-learn — train a classifier to predict treatment assignment from covariates, match each treated unit to a similar control unit on propensity score, compare outcomes between matched pairs
  - Regression Discontinuity — the `rdrobust` package in Python
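The Difference-in-Differences regression described above, sketched on simulated data where the true treatment effect is 5.0 by construction (all coefficients below are assumptions of the simulation):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
treatment = rng.integers(0, 2, n)   # treated vs control group
post = rng.integers(0, 2, n)        # before vs after the intervention

# Outcome: baseline 10, group gap 2, common time trend 3, true DiD effect 5
outcome = (10 + 2 * treatment + 3 * post
           + 5 * treatment * post + rng.normal(scale=2, size=n))
df = pd.DataFrame({"outcome": outcome, "treatment": treatment, "post": post})

model = smf.ols("outcome ~ post + treatment + post:treatment", data=df).fit()
did_effect = model.params["post:treatment"]   # the DiD estimate
print(model.params)
```

The group gap and the common time trend are absorbed by the `treatment` and `post` terms, so the interaction coefficient recovers the causal effect, which is exactly the DiD identification logic.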
Resources
- "Feature Engineering for Machine Learning" by Zheng and Casari (book)
- "Trustworthy Online Controlled Experiments" by Kohavi et al. (A/B testing bible book)
- Kaggle Feature Engineering course (free)
Stage 05
MLOps and Production Deployment
A model that never leaves a notebook has zero business value. MLOps is increasingly a required skill for data scientists.
MLOps Fundamentals
- MLOps = DevOps principles applied to ML systems
- The production ML lifecycle: data → training → evaluation → deployment → monitoring → retraining
- Why ML systems fail in production:
  - Data drift — the input distribution changes over time
  - Concept drift — the relationship between inputs and outputs changes
  - Training-serving skew — preprocessing differs between training and serving
  - Pipeline failures — upstream data changes breaking downstream models
  - Silent failures — the model keeps returning predictions, but they are increasingly wrong
Experiment Tracking
- MLflow — open-source; tracks runs, parameters, metrics, and artifacts; includes a model registry. Typical usage: wrap training inside `mlflow.start_run(run_name='random_forest_v1')`, call `mlflow.log_params` with a params dict, `mlflow.log_metrics` with metrics such as accuracy and AUC-ROC, and `mlflow.sklearn.log_model(model, 'model')` to persist the trained estimator.
- Weights & Biases (W&B) — more features; teams; interactive visualizations; popular in deep learning
- Neptune.ai — alternative experiment tracker
- DVC (Data Version Control) — versioning datasets and models in Git; tracking data lineage
Model Serving
- REST API deployment — FastAPI + uvicorn for model serving. Pattern: create a `FastAPI()` app, load the trained estimator via `joblib.load('model.pkl')`, define a POST `/predict` handler that takes a features list, reshapes it with NumPy into `(1, -1)`, calls `model.predict` and `model.predict_proba`, and returns the prediction and probability.
- Containerization — Docker; encapsulating model + dependencies; consistent environment
- Cloud model serving:
  - AWS SageMaker — managed model training and deployment
  - Azure ML — managed ML platform; endpoints
  - GCP Vertex AI — model deployment; prediction endpoints
  - Databricks Model Serving — deploying MLflow models
Model Monitoring
- Data drift detection:
  - Population Stability Index (PSI) — detecting shifts in feature distributions
  - KS test — comparing distributions between training and serving data
  - Feature drift reports — Evidently AI (free and paid tiers); automated drift reports
- Model performance monitoring:
  - Ground truth labels are often delayed — monitor proxies until labels arrive
  - Prediction drift — distributional shift in model outputs
  - Business metric correlation — tracking whether the model's business impact holds
- Tools: Evidently AI (open-source), Arize AI, WhyLabs, Fiddler
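PSI is short enough to hand-roll (production systems would lean on a library like Evidently); the 0.1 / 0.25 thresholds used to interpret it are a common rule of thumb, not a formal standard:

```python
import numpy as np

def psi(expected, actual, n_bins=10):
    """Population Stability Index between a reference and a serving sample."""
    # Bin edges from the reference (training) distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf      # catch values outside the range
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)         # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
serving_same = rng.normal(0, 1, 10_000)        # no drift
serving_shifted = rng.normal(0.5, 1, 10_000)   # mean shifted by 0.5 std

psi_same = psi(train, serving_same)
psi_shifted = psi(train, serving_shifted)
print(psi_same, psi_shifted)   # near zero vs well above the 0.1 warning level
```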
Feature Stores
- Centralizing feature computation and serving for both training and serving
- Preventing training-serving skew — same code for both
- Feast — open-source feature store
- Tecton, Hopsworks — commercial feature stores
- When to use: multiple models using the same features; real-time feature serving required
CI/CD for Machine Learning
- Automated testing for ML pipelines:
  - Data validation tests — schema checks, range checks, null checks
  - Model evaluation gates — deploy only if metrics are above threshold
  - Integration tests — testing the full prediction pipeline end to end
- GitHub Actions for ML pipelines:
  - Triggering retraining when new data arrives
  - Running the evaluation suite on each model version
  - Deploying to staging on PR merge; to production on release
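A minimal data-validation gate of the kind listed above, written as a plain function a pytest suite could call (column names and ranges are hypothetical; pandera and Great Expectations are common heavier alternatives):

```python
import pandas as pd

def validate_features(df: pd.DataFrame) -> list[str]:
    """Return a list of validation errors; an empty list means the gate passes."""
    errors = []
    # Schema check: required columns present
    for col in ("user_id", "age", "spend"):
        if col not in df.columns:
            errors.append(f"missing column: {col}")
            return errors
    # Null check
    if df[["user_id", "age", "spend"]].isna().any().any():
        errors.append("null values present")
    # Range checks
    if not df["age"].between(0, 120).all():
        errors.append("age out of range")
    if (df["spend"] < 0).any():
        errors.append("negative spend")
    return errors

good = pd.DataFrame({"user_id": [1, 2], "age": [34, 58], "spend": [10.0, 0.0]})
bad = pd.DataFrame({"user_id": [1, 2], "age": [34, 250], "spend": [10.0, -5.0]})

good_errors = validate_features(good)
bad_errors = validate_features(bad)
print(good_errors, bad_errors)
```

In CI, a pipeline step would fail the build whenever the returned list is non-empty, blocking a bad batch before it reaches training or serving.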
Resources
- "Designing Machine Learning Systems" by Huyen (essential book for production ML)
- Evidently AI documentation (free)
- MLflow documentation (free)
- "Machine Learning Engineering" by Burkov (book)
Stage 06
Cloud and Big Data Tools
Modern data science is cloud-based. Familiarity with cloud ML platforms, and with the big data tools where the data actually lives, is required.
Cloud ML Platforms
- AWS SageMaker:
  - SageMaker Studio — managed JupyterLab environment
  - SageMaker Training — managed training jobs; distributed training
  - SageMaker Endpoints — model deployment; real-time and batch inference
  - SageMaker Pipelines — ML workflow orchestration
  - SageMaker Feature Store — centralized feature management
- Azure Machine Learning:
  - Azure ML Studio — managed workspace; experiments, models, deployments
  - AutoML — automated model selection and hyperparameter tuning
  - Prompt Flow — building and evaluating LLM applications
- GCP Vertex AI:
  - Vertex Workbench — managed JupyterLab
  - Vertex AI Training — managed training; TPU support
  - Vertex AI Prediction — model deployment; explanations
  - BigQuery ML — training models directly in BigQuery SQL
Databricks for Data Science
- Databricks notebooks — collaborative; Python, SQL, Scala, R
- MLflow integration — experiment tracking native to Databricks
- Feature Store — centralized feature management
- Unity Catalog — data governance including ML models
- Delta Live Tables — for ML feature pipelines
- Databricks AutoML — automated ML with explainability
SQL for Data Science
- All content from Data Analyst Stage 2 plus:
- Window functions for feature engineering — lag features, rolling averages, row numbering
- Data warehouse SQL — Snowflake, BigQuery, Redshift specifics
- dbt for feature engineering — building ML-ready features in the warehouse
- BigQuery ML — training models in SQL
Distributed Computing
- Apache Spark:
  - When Spark matters — datasets that don't fit in memory; distributed model training
  - PySpark for data scientists — `spark.read.parquet()`, DataFrames, `.groupBy()`, `.agg()`, ML pipelines
  - Spark MLlib — distributed ML; random forest, gradient boosting, ALS
  - Databricks Spark — managed clusters; better UX than raw Spark
- Dask — parallel computing in Python; extending pandas to out-of-memory datasets; familiar API
Resources
- AWS SageMaker documentation (free)
- Databricks learning portal (free)
- "Spark: The Definitive Guide" by Chambers and Zaharia (book)
FAQ
Common questions
How long does it take to become a Data Scientist?
12–18 months optimistic for someone with strong quantitative background and 20–25 hours/week. 18–30 months realistic, longer if you're starting without statistics or programming foundation. The role demands three competencies — statistics, programming, business context — and weakness in any one ends interviews. Most successful self-taught paths run: stats foundation → Python + pandas → ML projects → portfolio + Kaggle → applications.
Which certifications matter for data science?
Almost none. Coursera/DeepLearning.AI specializations from Andrew Ng signal foundation. Cloud ML certs (AWS Certified Machine Learning – Specialty, Azure DP-100, GCP Professional ML Engineer) help for production-ML roles and are moderately weighted there. The strongest portfolio signal is a Kaggle profile with completed competitions and 3–5 GitHub projects showing end-to-end work — problem framing, EDA, modeling, evaluation, deployment narrative. Certs without portfolio work don't move the needle.
Do I need a master's or PhD?
Depends heavily on the employer. Big Tech (Google, Meta, Apple) and quantitative finance often filter for a master's or PhD. Mid-market companies, startups, and internal data science teams routinely hire bachelor's holders with strong portfolios. ML engineering roles favor SDE backgrounds; research roles favor PhDs. Career-changers from analyst, statistician, or quant roles transition into data science without additional formal education when their portfolios demonstrate ML fluency. Either way, statistics/ML skills appear in 92% of postings.
What separates a Data Scientist who gets hired from one who doesn't?
Business context, not algorithm depth. Junior candidates who can describe XGBoost in detail but can't articulate when not to use it lose to candidates who frame problems well. Senior interviews are dominated by case studies — given an ambiguous business question, what's your plan? Other differentiators: production-aware modeling (not just notebook code), MLOps fluency (versioning, monitoring, drift detection), and clear written communication. BLS projects 36% growth in data scientist employment from 2023 to 2033.