Roadmap
Data Analyst
The professional who collects, cleans, queries, and analyzes data to answer business questions. Turns raw numbers into charts, summaries, and recommendations that help organizations understand what is happening and why.
OPTIMISTIC 6-12 months · REALISTIC 10-18 months
Stage 00
Mathematics and Statistics Fundamentals
Data analysis without statistical grounding produces misleading conclusions. You do not need calculus. You need the applied statistics that every analyst uses daily.
Descriptive Statistics
- Measures of central tendency: mean (sensitive to outliers), median (robust for skewed data), mode (categorical data)
- Measures of spread: range, variance (σ²/s²), standard deviation (σ/s), IQR (Q3-Q1), percentiles and quartiles
- Distribution shapes: normal (68-95-99.7 rule), right/positive skew, left/negative skew, bimodal
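The mean/median/spread distinctions above can be seen in a few lines of Python using only the standard library. The salary numbers are made up to exaggerate the effect of one outlier:

```python
import statistics

# Made-up salaries with one outlier: the mean gets pulled up, the median does not
salaries = [42_000, 45_000, 47_000, 50_000, 52_000, 250_000]

mean = statistics.mean(salaries)      # sensitive to the outlier: 81,000
median = statistics.median(salaries)  # robust for skewed data: 48,500
stdev = statistics.stdev(salaries)    # sample standard deviation (s)

# IQR = Q3 - Q1 from the quartile cut points
q1, q2, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1

print(f"mean={mean:,.0f} median={median:,.0f} stdev={stdev:,.0f} IQR={iqr:,.0f}")
```

The gap between mean and median is itself a quick skewness check: when the mean sits well above the median, suspect right skew or outliers.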
Inferential Statistics
- Populations vs samples — inference from sample to population
- Sampling bias — how samples can misrepresent populations; random sampling importance
- Confidence intervals — range likely containing the true population parameter; 95% CI most common
- p-value — probability of observing results at least as extreme if null hypothesis is true; p < 0.05 conventional threshold
- Hypothesis testing — null hypothesis (H₀), alternative hypothesis (H₁), Type I error (false positive), Type II error (false negative)
- A/B testing — comparing two versions; randomized controlled experiment; interpreting significance and practical significance
- Correlation vs causation — correlation does not imply causation; confounding variables; Simpson's paradox
- Correlation coefficient (r) — Pearson's r: ranges -1 to +1; common rules of thumb treat |r| > 0.7 as strong and |r| < 0.3 as weak (context-dependent, not hard cutoffs)
- Regression basics — linear regression; slope interpretation; R² (coefficient of determination)
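A confidence interval is the inferential-statistics concept analysts compute most often. This is a minimal sketch using only the standard library, with made-up measurements and the large-sample normal approximation (small samples would call for the t-distribution instead):

```python
import statistics
from statistics import NormalDist

# Made-up sample of 20 measurements
sample = [12.1, 11.8, 12.5, 12.0, 12.3, 11.9, 12.4, 12.2, 12.0, 12.1,
          11.7, 12.6, 12.2, 11.9, 12.3, 12.1, 12.0, 12.4, 11.8, 12.2]

mean = statistics.mean(sample)
se = statistics.stdev(sample) / len(sample) ** 0.5   # standard error of the mean
z = NormalDist().inv_cdf(0.975)                      # ~1.96 for a 95% CI

ci_low, ci_high = mean - z * se, mean + z * se
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The interpretation to practice: if you repeated this sampling procedure many times, about 95% of the intervals built this way would contain the true population mean.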
Probability Basics
- Basic probability rules — P(A), P(A and B), P(A or B), P(A|B) conditional probability
- Bayes' theorem — updating probabilities with new evidence; P(A|B) = P(B|A)×P(A) / P(B)
- Expected value — E(X) = Σ(x × P(x)); foundation for business decision analysis
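Bayes' theorem and expected value both reduce to a few multiplications. A sketch with deliberately made-up illustrative numbers (the classic rare-condition test):

```python
# Bayes' theorem with illustrative numbers: a 99%-sensitive, 95%-specific
# test for a condition affecting 1% of the population.
p_condition = 0.01                 # P(A): prior
p_pos_given_condition = 0.99       # P(B|A): sensitivity
p_pos_given_healthy = 0.05         # false positive rate (1 - specificity)

# P(B): total probability of a positive test
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_healthy * (1 - p_condition))

# P(A|B) = P(B|A) * P(A) / P(B)
p_condition_given_pos = p_pos_given_condition * p_condition / p_pos
print(f"P(condition | positive test) = {p_condition_given_pos:.1%}")  # ~16.7%

# Expected value: E(X) = sum(x * P(x)) — e.g. a bet with a 30% chance of $10k
ev = sum(x * p for x, p in [(10_000, 0.30), (0, 0.70)])  # 3000
```

The surprising part — a positive result still means only ~17% probability — is exactly the intuition Bayes' theorem builds: a strong test cannot overcome a very low prior on its own.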
Resources
- Khan Academy Statistics (free)
- StatQuest YouTube by Josh Starmer (free, best intuitive statistics)
- "Naked Statistics" by Wheelan (book, accessible intro)
Stage 01
Excel and Google Sheets
Excel remains the most universally deployed data tool. Even analysts who primarily use SQL and Python will use Excel daily for ad hoc work, stakeholder deliverables, and quick analysis.
Excel Fundamentals
- Navigation — keyboard shortcuts (Ctrl+Arrow, Ctrl+Shift+Arrow, F2, F4, Ctrl+Z/Y)
- Data entry — structured tables vs unstructured ranges; best practices for analysis-ready data
- Formatting — number formats, date formats, conditional formatting, data bars
- Named ranges — making formulas readable; dynamic named ranges
Core Formulas
- Lookup functions: VLOOKUP, XLOOKUP, INDEX/MATCH, HLOOKUP
- Logical functions: IF, nested IF (prefer IFS), AND, OR, NOT, IFERROR, ISBLANK
- Aggregation functions: SUMIF/SUMIFS, COUNTIF/COUNTIFS, AVERAGEIF/AVERAGEIFS, MAXIFS/MINIFS
- Text functions: LEFT, RIGHT, MID, LEN, FIND, SEARCH, TRIM, UPPER/LOWER/PROPER, CONCATENATE, TEXT
- Date and time functions: TODAY, NOW, YEAR/MONTH/DAY, DATEDIF, EOMONTH, WEEKDAY
- Statistical functions: COUNT/COUNTA/COUNTBLANK, AVERAGE/MEDIAN/MODE, STDEV.S/STDEV.P, PERCENTILE, RANK
Pivot Tables
- Creating pivot tables — source data requirements (structured table with headers)
- Rows, Columns, Values, Filters — building the analysis structure
- Value field settings — Sum, Count, Average, % of Total, % of Row Total, % of Column Total
- Grouping — date grouping (year/quarter/month); number grouping
- Sorting and filtering within pivot tables
- Slicers — interactive filters; connecting multiple pivot tables to one slicer
- Calculated fields — adding custom metrics to pivot tables
- Pivot charts — visual representation of pivot table data
Power Query (Get & Transform)
- What Power Query does — connect, transform, and load data from various sources
- Data source connections — Excel files, CSV, databases, web URLs, SharePoint
- Applied Steps — every transformation recorded; auditable; replayable
- Common transformations: remove columns/rows/duplicates, change type, split column, merge queries, append queries, pivot/unpivot, group by, add conditional column, fill down
- Refresh — updating data when source changes; parameterized queries
Data Visualization in Excel
- Chart types — bar/column, line, scatter, pie/donut, waterfall, histogram, box plot
- Chart formatting — meaningful titles, axis labels, data labels; removing chart junk
- Sparklines — mini-charts inside cells for trend visualization
Resources
- ExcelJet.net (free formula reference)
- Chandoo.org (free Excel tutorials)
- Microsoft Excel documentation (free)
Stage 02
SQL — Deep Mastery
SQL is listed in 90%+ of data analyst postings. It is the primary tool for data extraction and the skill most consistently tested in analyst interviews.
SQL Foundations (Reference DBA Path Stage 1)
- Complete SQL content from DBA path — SELECT, JOIN, GROUP BY, HAVING, window functions, CTEs, subqueries, set operations, transactions
Analytical Query Patterns
- Year-over-year comparison with LAG() over ORDER BY YEAR(sale_date)
- Rolling 7-day average using AVG() OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
- Cohort analysis — tracking user behavior by signup month with DATE_TRUNC and MIN OVER PARTITION
- Running total with window functions using SUM OVER (ORDER BY date)
- Percentile ranking using NTILE(4) and PERCENT_RANK()
- Deduplication — keeping only the latest record per entity using ROW_NUMBER() OVER PARTITION BY
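The window-function patterns above have direct pandas equivalents, which is a useful way to check your understanding of both tools. A sketch against a tiny made-up daily-sales frame (column names are invented for illustration):

```python
import pandas as pd

# Made-up daily sales to mirror the SQL window patterns above
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "sales": [100, 120, 90, 110, 130, 80, 150, 140, 95, 160],
})

# Running total — SQL: SUM(sales) OVER (ORDER BY date)
df["running_total"] = df["sales"].cumsum()

# Rolling 7-day average — SQL: AVG(sales) OVER (ORDER BY date
#   ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
df["rolling_7d"] = df["sales"].rolling(window=7, min_periods=1).mean()

# Lag for period-over-period comparison — SQL: LAG(sales) OVER (ORDER BY date)
df["prev_day"] = df["sales"].shift(1)

# Quartile bucket — roughly NTILE(4); note qcut splits by value quantiles,
# while NTILE splits by rank, so ties can land differently
df["quartile"] = pd.qcut(df["sales"], 4, labels=[1, 2, 3, 4])
```

Being able to translate a pattern between SQL and pandas in both directions is a common interview exercise.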
Data Quality Checks in SQL
- Null check: SELECT COUNT(*) - COUNT(column_name) AS null_count FROM table
- Duplicate check: SELECT id, COUNT(*) FROM table GROUP BY id HAVING COUNT(*) > 1
- Range validation: SELECT MIN(value), MAX(value), AVG(value) FROM table
- Referential integrity: LEFT JOIN WHERE foreign_id IS NULL
- Distribution check: SELECT value, COUNT(*) FROM table GROUP BY value ORDER BY COUNT(*) DESC
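The same quality checks are worth knowing in pandas, since you often re-validate an extract after loading it. A sketch on a toy frame with deliberately planted problems:

```python
import pandas as pd

# Toy frame with planted quality issues: one null, one duplicated id
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "value": [10.0, None, 25.0, -5.0, 40.0],
})

# Null check — mirrors COUNT(*) - COUNT(column_name)
null_count = df["value"].isna().sum()

# Duplicate check — mirrors GROUP BY id HAVING COUNT(*) > 1
dup_ids = df[df.duplicated("id", keep=False)]["id"].unique()

# Range validation — mirrors SELECT MIN(value), MAX(value)
value_range = (df["value"].min(), df["value"].max())
```

Running these before any analysis is the habit that prevents most embarrassing dashboard corrections later.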
Performance Awareness
- Avoid SELECT * — always specify columns needed
- Filter early — WHERE before aggregations; push filters into subqueries
- Index awareness — WHERE on indexed columns is faster; know what indexes exist
- EXPLAIN / EXPLAIN ANALYZE — reading query execution plans (see DBA path Stage 1)
- Avoid functions on indexed columns in WHERE; use BETWEEN instead of YEAR() on indexed date column
- CTEs vs subqueries — CTEs improve readability; most modern optimizers treat them like subqueries, though some engines (e.g. PostgreSQL before v12) materialize CTEs, which can change performance
Cloud Data Warehouse SQL Dialects
- BigQuery (Google) — Standard SQL; DATE_TRUNC, TIMESTAMP_TRUNC, UNNEST for arrays, STRUCT types
- Snowflake — ANSI SQL; QUALIFY clause for window function filtering; FLATTEN for semi-structured data
- Redshift (AWS) — PostgreSQL-based; distribution keys and sort keys affect performance; LISTAGG for string aggregation
- Databricks / Spark SQL — DATEDIFF, DATE_ADD, ARRAY_ functions; Delta Lake time travel
- Key differences: date functions, string functions, NULL handling, and semi-structured data support vary by platform
Resources
- Mode Analytics SQL Tutorial (free)
- SQLZoo (free)
- StrataScratch (paid SQL interview problems)
- LeetCode SQL problems (free)
- W3Schools SQL (free reference)
Stage 03
Python for Data Analysis
Python is listed in ~40% of entry-level postings and in the majority of mid-senior postings. It enables analysis at scale and automation that Excel and SQL cannot match.
Python Fundamentals
- Data types — int, float, str, bool, list, dict, tuple, set
- Control flow — if/elif/else, for loops, while loops, list comprehensions
- Functions — def, args, kwargs, return, scope
- File I/O — reading CSV, JSON, text files
- Modules and imports — import, from x import y, aliasing (import pandas as pd)
- Error handling — try/except/finally
NumPy — Numerical Computing
- Arrays vs Python lists — arrays are homogeneous, faster, support vectorized operations
- Array creation — np.array(), np.zeros(), np.ones(), np.arange(), np.linspace()
- Array operations — element-wise arithmetic, broadcasting
- Statistical functions — np.mean(), np.median(), np.std(), np.percentile()
- Boolean indexing — arr[arr > 5] — filtering arrays
- reshape, transpose, concatenate — array manipulation
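The NumPy bullets above fit into one small sketch — the array values are arbitrary:

```python
import numpy as np

arr = np.array([3, 7, 1, 9, 4, 12, 5])

# Vectorized arithmetic: applies element-wise, no Python loop
doubled = arr * 2

# Boolean indexing: keep only elements matching a condition
big = arr[arr > 5]            # -> array([ 7,  9, 12])

# Statistical functions
mean = np.mean(arr)
p90 = np.percentile(arr, 90)

# reshape: 1-D -> 2-D (total size must match: 6 = 2 x 3)
grid = np.arange(6).reshape(2, 3)
```

The speed difference versus pure-Python loops comes from these operations executing in compiled C over homogeneous memory, which is the practical reason pandas is built on NumPy.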
Pandas — Data Analysis Workhorse
- Series — one-dimensional labeled array; single column
- DataFrame — two-dimensional labeled table; multiple columns
- Creating DataFrames: pd.read_csv/excel/sql, or pd.DataFrame from dict
- Inspection: head/tail, shape, dtypes, info, describe, isnull, nunique, value_counts
- Selection and filtering: df['col'], df.loc, df.iloc, boolean filtering, df.query()
- Data cleaning: dropna, fillna, duplicated/drop_duplicates, str methods, astype, to_datetime, rename, replace
- Aggregation: groupby.agg, multi-level groupby, pivot_table, resample time-series
- Combining DataFrames: pd.merge, pd.concat, join
- Transformation: derived columns, apply, assign, melt (wide→long), pivot (long→wide), sort_values
- Time series: set_index date, resample, rolling, shift for lag values
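A short sketch tying the core pandas operations together — groupby/agg, a SQL-style left join via merge, and a pivot_table — on two tiny made-up frames (all column names invented for illustration):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["ann", "bob", "ann", "cara"],
    "amount": [120.0, 80.0, 200.0, 50.0],
})
regions = pd.DataFrame({
    "customer": ["ann", "bob", "cara"],
    "region": ["east", "west", "east"],
})

# groupby + named aggregation: revenue and order count per customer
summary = (orders.groupby("customer")["amount"]
                 .agg(total="sum", orders="count")
                 .reset_index())

# merge: attach region, like a SQL LEFT JOIN on customer
enriched = summary.merge(regions, on="customer", how="left")

# pivot_table: roll revenue up to region level
by_region = enriched.pivot_table(index="region", values="total", aggfunc="sum")
```

This groupby → merge → pivot sequence is the backbone of most day-to-day pandas analysis, so it is worth drilling until it is automatic.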
Matplotlib and Seaborn — Visualization
- Matplotlib basics: fig/ax via subplots, chart types (plot, bar, scatter, hist), labels, savefig
- Seaborn — statistical visualization on matplotlib: histplot, boxplot, scatterplot with hue, heatmap, lineplot, pairplot
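A minimal matplotlib sketch covering the fig/ax pattern, labels, and savefig (seaborn is omitted since it layers on top of exactly this API; the data is made up):

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [10, 14, 9, 17]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(months, revenue, color="steelblue")
ax.set_title("Revenue by month ($k)")   # a title that states what the chart shows
ax.set_ylabel("$k")
fig.savefig("revenue.png", dpi=120)     # export for a report or slide
plt.close(fig)
```

Preferring the explicit fig/ax interface over the implicit `plt.plot(...)` style pays off as soon as you need subplots or want to reuse plotting code.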
Exploratory Data Analysis (EDA) — Systematic Approach
- Load and inspect — shape, dtypes, head, describe
- Null analysis — null counts, null patterns, strategies for handling
- Distribution analysis — histogram for each numeric column; skewness check
- Categorical distribution — value_counts for each categorical column
- Outlier detection — box plots, IQR method, z-score method
- Correlation analysis — heatmap; Pearson for numeric, Cramer's V for categorical
- Bivariate analysis — scatter plots for continuous-continuous; box plots for continuous-categorical
- Time series patterns — if temporal data: trend, seasonality, anomalies
- Business-context questions — guided by actual analysis objectives
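The IQR outlier rule from the checklist above is simple enough to sketch directly — the series is made up, with one planted outlier:

```python
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 120, 14, 15, 13])  # one planted outlier

# IQR fence: anything beyond 1.5 * IQR outside the quartiles is flagged
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())   # -> [120]
```

Flagging is only half the job: whether to drop, cap, or investigate an outlier is a business decision, not a statistical one.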
Resources
- Kaggle Learn Python (free)
- Kaggle Learn Pandas (free)
- "Python for Data Analysis" by Wes McKinney (book by the pandas author)
- Real Python tutorials (free)
Stage 04
Data Visualization and BI Tools
Building dashboards and reports is a core deliverable of data analyst work. Stakeholders experience your analysis primarily through the visualizations you produce.
Data Visualization Principles
- Chart selection by relationship type: comparison (bar/line), distribution (histogram/box), composition (stacked/treemap), relationship (scatter/heatmap), flow (Sankey/funnel)
- Design principles: data-ink ratio, preattentive attributes, color usage (colorblind-safe palettes), titles as insights, annotations, gridlines, consistency
- Storytelling with data — the "So What?" test: does the chart answer a business question?
- Dashboard design: information hierarchy; always provide context (vs a target, prior period, or benchmark); interactivity; mobile vs desktop layouts
Power BI
- Desktop vs Service — Desktop for development, Service for publishing and sharing
- Data sources — Excel, CSV, SQL databases, SharePoint, web, REST APIs, cloud platforms
- Power Query Editor — M language transformations within Power BI (same as Excel Power Query)
- Data Model: tables and relationships, star schema (fact + dimension), cardinality, calculated columns vs measures
- DAX (Data Analysis Expressions): basic/filtered measures, CALCULATE, SAMEPERIODLASTYEAR, TOTALYTD, iterator functions (AVERAGEX), filter functions (ALL, FILTER, VALUES, RELATED), context transition, variables, row vs filter context
- Report building — visuals (bar, line, card, table, matrix, map, gauge, decomposition tree, Q&A), slicers, bookmarks, drill-through, conditional formatting, tooltips
- Power BI Service: workspaces, datasets, scheduled refresh, row-level security (RLS), apps
Tableau
- Connect to data — Tableau's data source connections; data extract vs live connection
- Basic chart types — Show Me panel; dragging dimensions and measures to rows/columns/marks
- Filters — context filters, data source filters, extract filters; filter order matters
- Calculated fields — IF [Sales] > 1000 THEN "High" ELSE "Low" END
- Table calculations — running total, percent of total, rank, moving average, computed during render
- Level of Detail (LOD) calculations: FIXED, INCLUDE, EXCLUDE — Tableau's most powerful feature
- Parameters — dynamic inputs; allowing users to switch metrics or set thresholds
- Dashboard actions — filter actions, highlight actions, URL actions, parameter actions
- Tableau Server/Cloud — publishing, sharing, scheduling
Google Looker Studio (Data Studio)
- Free, cloud-native BI tool; connects to Google Analytics, BigQuery, Sheets, databases
- Good for reporting on digital marketing, Google products, website analytics
- Blending data sources — combining data from multiple sources in one report
- Community connectors — third-party data source connectors
Resources
- Microsoft PL-300 study materials (Power BI certification)
- Tableau Public (free)
- "Storytelling with Data" by Cole Nussbaumer Knaflic (essential book)
- Power BI documentation (free)
- Tableau training videos (free on Tableau website)
Stage 05
Business Acumen and Communication
The analyst's value is not the chart. It is the insight the chart communicates. Domain knowledge and communication skills determine whether analysis drives decisions.
Translating Business Questions to Data Questions
- Common business questions and analytical equivalents: sales drops (time series + segmentation + driver decomposition); best customers (RFM analysis); feature effectiveness (A/B test); promotion priority (margin + inventory + elasticity); churn (cohort + correlation + survival)
- Metrics and KPIs: revenue (MRR, ARR, GMV, ARPU); customer (CAC, LTV, churn, retention, NPS); operational (conversion, funnel drop-off, AHT); web/product (DAU/MAU, engagement, bounce, session); financial (gross margin, EBITDA, burn rate, runway)
Data Storytelling
- Structure — Context → Complication → Resolution
- Leading with the insight, not the process — start with the answer, then show the evidence
- One chart, one message — each visual should communicate exactly one thing
- Avoiding common mistakes: data without interpretation, burying key insight, too many metrics, 3D charts, dual-axis charts, truncated y-axis
Stakeholder Communication
- Knowing your audience — technical vs non-technical; what decisions do they need to make?
- Asking good questions — clarifying ambiguous requests; understanding the decision context
- Managing expectations — setting realistic timelines; communicating data limitations
- Presenting findings — structured walkthrough; "here is what we found, here is why it matters, here is what we recommend"
- Written communication — analysis documentation; assumptions; methodology; data sources
Domain Knowledge Acceleration
- Finance — income statement (revenue, COGS, gross profit, OpEx, EBITDA, net income); balance sheet basics; cash flow; financial ratios
- Marketing — funnel metrics; attribution models (last click, first click, multi-touch); cohort analysis; customer segmentation
- E-commerce — GMV, take rate, cart abandonment, AOV, return rate, inventory metrics
- SaaS — ARR, MRR, churn, LTV, CAC, expansion revenue, NRR (Net Revenue Retention)
- Healthcare — patient metrics, utilization rates, outcomes data, compliance metrics
- Operations/supply chain — fill rate, lead time, inventory turnover, on-time delivery
Resources
- "Storytelling with Data" by Cole Nussbaumer Knaflic (book)
- "The Data Warehouse Toolkit" by Kimball (for data modeling context)
- Industry-specific analyst communities on LinkedIn
Stage 06
Cloud Data Tools and Advanced Analytics
Modern data teams work in cloud data warehouses. Understanding how to navigate these environments and apply intermediate analytics is the path to mid-senior analyst roles.
Cloud Data Warehouses — Analyst Perspective
- Snowflake — cloud-native, multi-cloud, separate storage and compute; worksheets, time travel, data sharing
- BigQuery (Google Cloud) — serverless, columnar, pay-per-query; project.dataset.table; partitioned and clustered tables; BigQuery ML
- Redshift (AWS) — MPP warehouse; distribution styles and sort keys
- Databricks — data lakehouse; notebooks; SQL Warehouse for BI; Delta Lake
- Microsoft Fabric — Microsoft's unified data platform; OneLake; Direct Lake mode for Power BI
dbt (Data Build Tool) — Analyst Exposure
- What dbt does — transforms data in the warehouse using version-controlled SQL
- Analysts increasingly use dbt to build and maintain data models that feed dashboards
- dbt concepts for analysts: models, sources, tests (not null, unique, accepted values, referential integrity), documentation, seeds
- dbt Cloud — managed dbt environment; job scheduling; IDE
- Exposure to dbt distinguishes analysts who can collaborate with data engineers
Python Advanced Analytics
- Cohort analysis using pandas transform, to_period, and groupby
- Churn prediction basics — logistic regression with scikit-learn
- Time series decomposition — trend, seasonality, residual with statsmodels
- Customer segmentation — RFM analysis; K-means clustering basics
- Statistical A/B test analysis — scipy.stats.ttest_ind for t-tests; chi-square for proportions
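An A/B test analysis in Python reduces to a two-sample t-test plus a practical-significance check. A sketch on simulated data (assumes scipy is installed; the parameters are invented, and the variant is given a genuinely higher mean so the test has something to find):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated experiment: variant truly lifts the mean from 100 to 110
control = rng.normal(loc=100, scale=15, size=500)
variant = rng.normal(loc=110, scale=15, size=500)

t_stat, p_value = stats.ttest_ind(control, variant)
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")

# Practical significance: always report the size of the lift,
# not just whether p crossed 0.05
lift = variant.mean() - control.mean()
print(f"estimated lift = {lift:.1f}")
```

With real experiment data the randomization, sample-size planning, and checking of test assumptions matter as much as the final t-test call.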
Resources
- Snowflake Quickstart (free)
- BigQuery free tier (free)
- dbt Fundamentals course (free at courses.getdbt.com)
- Kaggle datasets for practice
Stage 07
Hands-On Practice & Portfolio
Portfolio Projects (minimum 3-5)
- Sales analysis — public retail dataset (Kaggle); analyze by region/product/time; build a Power BI/Tableau dashboard
- Customer segmentation — RFM analysis on transaction data; cluster customers; recommend marketing actions
- A/B test analysis — real or synthetic experiment data; statistical significance testing; recommendation with confidence intervals
- Time series analysis — economic, weather, or business time series data; trend identification; seasonality decomposition
- SQL portfolio — 10–20 well-documented SQL queries solving realistic business questions; published on GitHub
Practice Platforms
- Kaggle — datasets and competitions; Kaggle Learn courses (free)
- Mode Analytics Public — SQL problems with real company datasets (free)
- StrataScratch — SQL and Python interview questions (paid)
- LeetCode SQL — database problems by difficulty (free)
- DataLemur — SQL interview prep focused on real tech company questions (free)
- Tableau Public — publishing and browsing published visualizations (free)
- Maven Analytics — guided analytics projects (paid)
What to Document on LabList
- Portfolio project write-ups — linked to GitHub repos with code and dashboards
- Power BI / Tableau public dashboards — links to published interactive dashboards
- SQL query portfolio — annotated queries showing analysis patterns
- Domain expertise signals — projects in your target industry domain
- Cert progression — Google Data Analytics or PL-300 documented with context
FAQ
Common questions
How long does it take to become a Data Analyst?
6–12 months optimistic at 20–25 hours/week, 10–18 months realistic part-time. Data Analyst is one of the most accessible data career paths. SQL fluency takes 2–3 months of consistent practice, Power BI/Tableau another 2–3 months, then 4–6 months building a portfolio of analyses on real (or realistic) datasets. 65% of entry-level postings don't require a specific degree when you bring a strong portfolio.
Which certifications matter for data analyst roles?
Google Data Analytics Professional Certificate is the most common entry-level signal. Microsoft PL-300 for Power BI shops. Tableau Desktop Specialist for Tableau organizations. SQL certifications matter less than demonstrated query work in your portfolio. The pattern: one credible cert + 3–5 portfolio projects beats a wall of certs and no demonstrated work.
Do I need a stats or CS degree?
No. Data analyst is one of the most credential-flexible data roles. Self-taught analysts with strong SQL and a public dashboard portfolio compete effectively. What you do need: comfort writing SQL beyond simple SELECT queries (joins, window functions, CTEs are table stakes), data cleaning intuition, and a sense for what business questions are worth answering. The bar moves up at senior levels — but at entry level, demonstrated work is the gating signal.
What separates a hired Data Analyst from one who doesn't make it?
A portfolio of analyses that answer business questions, not just visualize datasets. Hiring managers look for: SQL queries that handle messy real-world data, dashboards driven by clear analytical questions, written summaries that draw conclusions instead of describing charts, and evidence of stakeholder collaboration. SQL is in 90%+ of postings; if your SQL portfolio is empty, you're filtered out. BLS projects 34% job growth for data professionals through 2034.