Roadmap
Data Analyst
The professional who collects, cleans, queries, and analyzes data to answer business questions. Turns raw numbers into charts, summaries, and recommendations that help organizations understand what is happening and why.
OPTIMISTIC 6-12 months · REALISTIC 10-18 months
Stage 00
Mathematics and Statistics Fundamentals
Data analysis without statistical grounding produces misleading conclusions. You do not need calculus. You need the applied statistics that every analyst uses daily.
Descriptive Statistics
- Measures of central tendency: mean (sensitive to outliers), median (robust for skewed data), mode (categorical data)
- Measures of spread: range, variance (σ²/s²), standard deviation (σ/s), IQR (Q3-Q1), percentiles and quartiles
- Distribution shapes: normal (68-95-99.7 rule), right/positive skew, left/negative skew, bimodal
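The mean/median/spread distinctions above can be seen in a few lines of Python using only the standard library. The salary numbers are made up to exaggerate the effect of one outlier:

```python
import statistics

# Made-up salaries with one outlier: the mean gets pulled up, the median does not
salaries = [42_000, 45_000, 47_000, 50_000, 52_000, 250_000]

mean = statistics.mean(salaries)      # sensitive to the outlier: 81,000
median = statistics.median(salaries)  # robust for skewed data: 48,500
stdev = statistics.stdev(salaries)    # sample standard deviation (s)

# IQR = Q3 - Q1 from the quartile cut points
q1, q2, q3 = statistics.quantiles(salaries, n=4)
iqr = q3 - q1

print(f"mean={mean:,.0f} median={median:,.0f} stdev={stdev:,.0f} IQR={iqr:,.0f}")
```

The gap between mean and median is itself a quick skewness check: when the mean sits well above the median, suspect right skew or outliers.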
Inferential Statistics
- Populations vs samples — inference from sample to population
- Sampling bias — how samples can misrepresent populations; random sampling importance
- Confidence intervals — range likely containing the true population parameter; 95% CI most common
- p-value — probability of observing results at least as extreme if null hypothesis is true; p < 0.05 conventional threshold
- Hypothesis testing — null hypothesis (H₀), alternative hypothesis (H₁), Type I error (false positive), Type II error (false negative)
- A/B testing — comparing two versions; randomized controlled experiment; interpreting significance and practical significance
- Correlation vs causation — correlation does not imply causation; confounding variables; Simpson's paradox
- Correlation coefficient (r) — Pearson's r: ranges -1 to +1; common rules of thumb treat |r| > 0.7 as strong and |r| < 0.3 as weak (context-dependent, not hard cutoffs)
- Regression basics — linear regression; slope interpretation; R² (coefficient of determination)
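A confidence interval is the inferential-statistics concept analysts compute most often. This is a minimal sketch using only the standard library, with made-up measurements and the large-sample normal approximation (small samples would call for the t-distribution instead):

```python
import statistics
from statistics import NormalDist

# Made-up sample of 20 measurements
sample = [12.1, 11.8, 12.5, 12.0, 12.3, 11.9, 12.4, 12.2, 12.0, 12.1,
          11.7, 12.6, 12.2, 11.9, 12.3, 12.1, 12.0, 12.4, 11.8, 12.2]

mean = statistics.mean(sample)
se = statistics.stdev(sample) / len(sample) ** 0.5   # standard error of the mean
z = NormalDist().inv_cdf(0.975)                      # ~1.96 for a 95% CI

ci_low, ci_high = mean - z * se, mean + z * se
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The interpretation to practice: if you repeated this sampling procedure many times, about 95% of the intervals built this way would contain the true population mean.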
Probability Basics
- Basic probability rules — P(A), P(A and B), P(A or B), P(A|B) conditional probability
- Bayes' theorem — updating probabilities with new evidence; P(A|B) = P(B|A)×P(A) / P(B)
- Expected value — E(X) = Σ(x × P(x)); foundation for business decision analysis
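Bayes' theorem and expected value both reduce to a few multiplications. A sketch with deliberately made-up illustrative numbers (the classic rare-condition test):

```python
# Bayes' theorem with illustrative numbers: a 99%-sensitive, 95%-specific
# test for a condition affecting 1% of the population.
p_condition = 0.01                 # P(A): prior
p_pos_given_condition = 0.99       # P(B|A): sensitivity
p_pos_given_healthy = 0.05         # false positive rate (1 - specificity)

# P(B): total probability of a positive test
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_healthy * (1 - p_condition))

# P(A|B) = P(B|A) * P(A) / P(B)
p_condition_given_pos = p_pos_given_condition * p_condition / p_pos
print(f"P(condition | positive test) = {p_condition_given_pos:.1%}")  # ~16.7%

# Expected value: E(X) = sum(x * P(x)) — e.g. a bet with a 30% chance of $10k
ev = sum(x * p for x, p in [(10_000, 0.30), (0, 0.70)])  # 3000
```

The surprising part — a positive result still means only ~17% probability — is exactly the intuition Bayes' theorem builds: a strong test cannot overcome a very low prior on its own.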
Resources
- Khan Academy Statistics (free)
- StatQuest YouTube by Josh Starmer (free, best intuitive statistics)
- "Naked Statistics" by Wheelan (book, accessible intro)
Stage 01
Excel and Google Sheets
Excel remains the most universally deployed data tool. Even analysts who primarily use SQL and Python will use Excel daily for ad hoc work, stakeholder deliverables, and quick analysis.
Excel Fundamentals
- Navigation — keyboard shortcuts (Ctrl+Arrow, Ctrl+Shift+Arrow, F2, F4, Ctrl+Z/Y)
- Data entry — structured tables vs unstructured ranges; best practices for analysis-ready data
- Formatting — number formats, date formats, conditional formatting, data bars
- Named ranges — making formulas readable; dynamic named ranges
Core Formulas
- Lookup functions: VLOOKUP, XLOOKUP, INDEX/MATCH, HLOOKUP
- Logical functions: IF, nested IF (prefer IFS), AND, OR, NOT, IFERROR, ISBLANK
- Aggregation functions: SUMIF/SUMIFS, COUNTIF/COUNTIFS, AVERAGEIF/AVERAGEIFS, MAXIFS/MINIFS
- Text functions: LEFT, RIGHT, MID, LEN, FIND, SEARCH, TRIM, UPPER/LOWER/PROPER, CONCATENATE, TEXT
- Date and time functions: TODAY, NOW, YEAR/MONTH/DAY, DATEDIF, EOMONTH, WEEKDAY
- Statistical functions: COUNT/COUNTA/COUNTBLANK, AVERAGE/MEDIAN/MODE, STDEV.S/STDEV.P, PERCENTILE, RANK
Pivot Tables
- Creating pivot tables — source data requirements (structured table with headers)
- Rows, Columns, Values, Filters — building the analysis structure
- Value field settings — Sum, Count, Average, % of Total, % of Row Total, % of Column Total
- Grouping — date grouping (year/quarter/month); number grouping
- Sorting and filtering within pivot tables
- Slicers — interactive filters; connecting multiple pivot tables to one slicer
- Calculated fields — adding custom metrics to pivot tables
- Pivot charts — visual representation of pivot table data
Power Query (Get & Transform)
- What Power Query does — connect, transform, and load data from various sources
- Data source connections — Excel files, CSV, databases, web URLs, SharePoint
- Applied Steps — every transformation recorded; auditable; replayable
- Common transformations: remove columns/rows/duplicates, change type, split column, merge queries, append queries, pivot/unpivot, group by, add conditional column, fill down
- Refresh — updating data when source changes; parameterized queries
Data Visualization in Excel
- Chart types — bar/column, line, scatter, pie/donut, waterfall, histogram, box plot
- Chart formatting — meaningful titles, axis labels, data labels; removing chart junk
- Sparklines — mini-charts inside cells for trend visualization
Resources
- ExcelJet.net (free formula reference)
- Chandoo.org (free Excel tutorials)
- Microsoft Excel documentation (free)
Stage 02
SQL — Deep Mastery
SQL is listed in 90%+ of data analyst postings. It is the primary tool for data extraction and the skill most consistently tested in analyst interviews.
SQL Foundations (Reference DBA Path Stage 1)
- Complete SQL content from DBA path — SELECT, JOIN, GROUP BY, HAVING, window functions, CTEs, subqueries, set operations, transactions
Analytical Query Patterns
- Year-over-year comparison with LAG() over ORDER BY YEAR(sale_date)
- Rolling 7-day average using AVG() OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
- Cohort analysis — tracking user behavior by signup month with DATE_TRUNC and MIN OVER PARTITION
- Running total with window functions using SUM OVER (ORDER BY date)
- Percentile ranking using NTILE(4) and PERCENT_RANK()
- Deduplication — keeping only the latest record per entity using ROW_NUMBER() OVER PARTITION BY
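The window-function patterns above have direct pandas equivalents, which is a useful way to check your understanding of both tools. A sketch against a tiny made-up daily-sales frame (column names are invented for illustration):

```python
import pandas as pd

# Made-up daily sales to mirror the SQL window patterns above
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "sales": [100, 120, 90, 110, 130, 80, 150, 140, 95, 160],
})

# Running total — SQL: SUM(sales) OVER (ORDER BY date)
df["running_total"] = df["sales"].cumsum()

# Rolling 7-day average — SQL: AVG(sales) OVER (ORDER BY date
#   ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
df["rolling_7d"] = df["sales"].rolling(window=7, min_periods=1).mean()

# Lag for period-over-period comparison — SQL: LAG(sales) OVER (ORDER BY date)
df["prev_day"] = df["sales"].shift(1)

# Quartile bucket — roughly NTILE(4); note qcut splits by value quantiles,
# while NTILE splits by rank, so ties can land differently
df["quartile"] = pd.qcut(df["sales"], 4, labels=[1, 2, 3, 4])
```

Being able to translate a pattern between SQL and pandas in both directions is a common interview exercise.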
Data Quality Checks in SQL
- Null check: SELECT COUNT(*) - COUNT(column_name) AS null_count FROM table
- Duplicate check: SELECT id, COUNT(*) FROM table GROUP BY id HAVING COUNT(*) > 1
- Range validation: SELECT MIN(value), MAX(value), AVG(value) FROM table
- Referential integrity: LEFT JOIN WHERE foreign_id IS NULL
- Distribution check: SELECT value, COUNT(*) FROM table GROUP BY value ORDER BY COUNT(*) DESC
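The same quality checks are worth knowing in pandas, since you often re-validate an extract after loading it. A sketch on a toy frame with deliberately planted problems:

```python
import pandas as pd

# Toy frame with planted quality issues: one null, one duplicated id
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "value": [10.0, None, 25.0, -5.0, 40.0],
})

# Null check — mirrors COUNT(*) - COUNT(column_name)
null_count = df["value"].isna().sum()

# Duplicate check — mirrors GROUP BY id HAVING COUNT(*) > 1
dup_ids = df[df.duplicated("id", keep=False)]["id"].unique()

# Range validation — mirrors SELECT MIN(value), MAX(value)
value_range = (df["value"].min(), df["value"].max())
```

Running these before any analysis is the habit that prevents most embarrassing dashboard corrections later.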
Performance Awareness
- Avoid SELECT * — always specify columns needed
- Filter early — WHERE before aggregations; push filters into subqueries
- Index awareness — WHERE on indexed columns is faster; know what indexes exist
- EXPLAIN / EXPLAIN ANALYZE — reading query execution plans (see DBA path Stage 1)
- Avoid functions on indexed columns in WHERE; use BETWEEN instead of YEAR() on indexed date column
- CTEs vs subqueries — CTEs improve readability; most modern optimizers treat them like subqueries, though some engines (e.g. PostgreSQL before v12) materialize CTEs, which can change performance
Cloud Data Warehouse SQL Dialects
- BigQuery (Google) — Standard SQL; DATE_TRUNC, TIMESTAMP_TRUNC, UNNEST for arrays, STRUCT types
- Snowflake — ANSI SQL; QUALIFY clause for window function filtering; FLATTEN for semi-structured data
- Redshift (AWS) — PostgreSQL-based; distribution keys and sort keys affect performance; LISTAGG for string aggregation
- Databricks / Spark SQL — DATEDIFF, DATE_ADD, ARRAY_ functions; Delta Lake time travel
- Key differences: date functions, string functions, NULL handling, and semi-structured data support vary by platform
Resources
- Mode Analytics SQL Tutorial (free)
- SQLZoo (free)
- StrataScratch (paid SQL interview problems)
- LeetCode SQL problems (free)
- W3Schools SQL (free reference)
Stage 03
Python for Data Analysis
Python is listed in ~40% of entry-level postings and in the majority of mid-senior postings. It enables analysis at scale and automation that Excel and SQL cannot match.
Python Fundamentals
- Data types — int, float, str, bool, list, dict, tuple, set
- Control flow — if/elif/else, for loops, while loops, list comprehensions
- Functions — def, args, kwargs, return, scope
- File I/O — reading CSV, JSON, text files
- Modules and imports — import, from x import y, aliasing (import pandas as pd)
- Error handling — try/except/finally
NumPy — Numerical Computing
- Arrays vs Python lists — arrays are homogeneous, faster, support vectorized operations
- Array creation — np.array(), np.zeros(), np.ones(), np.arange(), np.linspace()
- Array operations — element-wise arithmetic, broadcasting
- Statistical functions — np.mean(), np.median(), np.std(), np.percentile()
- Boolean indexing — arr[arr > 5] — filtering arrays
- reshape, transpose, concatenate — array manipulation
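The NumPy bullets above fit into one small sketch — the array values are arbitrary:

```python
import numpy as np

arr = np.array([3, 7, 1, 9, 4, 12, 5])

# Vectorized arithmetic: applies element-wise, no Python loop
doubled = arr * 2

# Boolean indexing: keep only elements matching a condition
big = arr[arr > 5]            # -> array([ 7,  9, 12])

# Statistical functions
mean = np.mean(arr)
p90 = np.percentile(arr, 90)

# reshape: 1-D -> 2-D (total size must match: 6 = 2 x 3)
grid = np.arange(6).reshape(2, 3)
```

The speed difference versus pure-Python loops comes from these operations executing in compiled C over homogeneous memory, which is the practical reason pandas is built on NumPy.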
Pandas — Data Analysis Workhorse
- Series — one-dimensional labeled array; single column
- DataFrame — two-dimensional labeled table; multiple columns
- Creating DataFrames: pd.read_csv/excel/sql, or pd.DataFrame from dict
- Inspection: head/tail, shape, dtypes, info, describe, isnull, nunique, value_counts
- Selection and filtering: df['col'], df.loc, df.iloc, boolean filtering, df.query()
- Data cleaning: dropna, fillna, duplicated/drop_duplicates, str methods, astype, to_datetime, rename, replace
- Aggregation: groupby.agg, multi-level groupby, pivot_table, resample time-series
- Combining DataFrames: pd.merge, pd.concat, join
- Transformation: derived columns, apply, assign, melt (wide→long), pivot (long→wide), sort_values
- Time series: set_index date, resample, rolling, shift for lag values
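A short sketch tying the core pandas operations together — groupby/agg, a SQL-style left join via merge, and a pivot_table — on two tiny made-up frames (all column names invented for illustration):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer": ["ann", "bob", "ann", "cara"],
    "amount": [120.0, 80.0, 200.0, 50.0],
})
regions = pd.DataFrame({
    "customer": ["ann", "bob", "cara"],
    "region": ["east", "west", "east"],
})

# groupby + named aggregation: revenue and order count per customer
summary = (orders.groupby("customer")["amount"]
                 .agg(total="sum", orders="count")
                 .reset_index())

# merge: attach region, like a SQL LEFT JOIN on customer
enriched = summary.merge(regions, on="customer", how="left")

# pivot_table: roll revenue up to region level
by_region = enriched.pivot_table(index="region", values="total", aggfunc="sum")
```

This groupby → merge → pivot sequence is the backbone of most day-to-day pandas analysis, so it is worth drilling until it is automatic.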
Matplotlib and Seaborn — Visualization
- Matplotlib basics: fig/ax via subplots, chart types (plot, bar, scatter, hist), labels, savefig
- Seaborn — statistical visualization on matplotlib: histplot, boxplot, scatterplot with hue, heatmap, lineplot, pairplot
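A minimal matplotlib sketch covering the fig/ax pattern, labels, and savefig (seaborn is omitted since it layers on top of exactly this API; the data is made up):

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [10, 14, 9, 17]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(months, revenue, color="steelblue")
ax.set_title("Revenue by month ($k)")   # a title that states what the chart shows
ax.set_ylabel("$k")
fig.savefig("revenue.png", dpi=120)     # export for a report or slide
plt.close(fig)
```

Preferring the explicit fig/ax interface over the implicit `plt.plot(...)` style pays off as soon as you need subplots or want to reuse plotting code.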
Exploratory Data Analysis (EDA) — Systematic Approach
- Load and inspect — shape, dtypes, head, describe
- Null analysis — null counts, null patterns, strategies for handling
- Distribution analysis — histogram for each numeric column; skewness check
- Categorical distribution — value_counts for each categorical column
- Outlier detection — box plots, IQR method, z-score method
- Correlation analysis — heatmap; Pearson for numeric, Cramer's V for categorical
- Bivariate analysis — scatter plots for continuous-continuous; box plots for continuous-categorical
- Time series patterns — if temporal data: trend, seasonality, anomalies
- Business-context questions — guided by actual analysis objectives
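The IQR outlier rule from the checklist above is simple enough to sketch directly — the series is made up, with one planted outlier:

```python
import pandas as pd

s = pd.Series([12, 14, 13, 15, 14, 13, 120, 14, 15, 13])  # one planted outlier

# IQR fence: anything beyond 1.5 * IQR outside the quartiles is flagged
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())   # -> [120]
```

Flagging is only half the job: whether to drop, cap, or investigate an outlier is a business decision, not a statistical one.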
Resources
- Kaggle Learn Python (free)
- Kaggle Learn Pandas (free)
- "Python for Data Analysis" by Wes McKinney (book by the pandas author)
- Real Python tutorials (free)
Stage 04
Data Visualization and BI Tools
Building dashboards and reports is a core deliverable of data analyst work. Stakeholders experience your analysis primarily through the visualizations you produce.
Data Visualization Principles
- Chart selection by relationship type: comparison (bar/line), distribution (histogram/box), composition (stacked/treemap), relationship (scatter/heatmap), flow (Sankey/funnel)
- Design principles: data-ink ratio, preattentive attributes, color usage (colorblind-safe palettes), titles as insights, annotations, gridlines, consistency
- Storytelling with data — the "So What?" test: does the chart answer a business question?
- Dashboard design: information hierarchy; always provide context (vs a target, prior period, or benchmark); interactivity; mobile vs desktop layouts
Power BI
- Desktop vs Service — Desktop for development, Service for publishing and sharing
- Data sources — Excel, CSV, SQL databases, SharePoint, web, REST APIs, cloud platforms
- Power Query Editor — M language transformations within Power BI (same as Excel Power Query)
- Data Model: tables and relationships, star schema (fact + dimension), cardinality, calculated columns vs measures
- DAX (Data Analysis Expressions): basic/filtered measures, CALCULATE, SAMEPERIODLASTYEAR, TOTALYTD, iterator functions (AVERAGEX), filter functions (ALL, FILTER, VALUES, RELATED), context transition, variables, row vs filter context
- Report building — visuals (bar, line, card, table, matrix, map, gauge, decomposition tree, Q&A), slicers, bookmarks, drill-through, conditional formatting, tooltips
- Power BI Service: workspaces, datasets, scheduled refresh, row-level security (RLS), apps
Tableau
- Connect to data — Tableau's data source connections; data extract vs live connection
- Basic chart types — Show Me panel; dragging dimensions and measures to rows/columns/marks
- Filters — context filters, data source filters, extract filters; filter order matters
- Calculated fields — IF [Sales] > 1000 THEN "High" ELSE "Low" END
- Table calculations — running total, percent of total, rank, moving average, computed during render
- Level of Detail (LOD) calculations: FIXED, INCLUDE, EXCLUDE — Tableau's most powerful feature
- Parameters — dynamic inputs; allowing users to switch metrics or set thresholds
- Dashboard actions — filter actions, highlight actions, URL actions, parameter actions
- Tableau Server/Cloud — publishing, sharing, scheduling
Google Looker Studio (Data Studio)
- Free, cloud-native BI tool; connects to Google Analytics, BigQuery, Sheets, databases
- Good for reporting on digital marketing, Google products, website analytics
- Blending data sources — combining data from multiple sources in one report
- Community connectors — third-party data source connectors
Resources
- Microsoft PL-300 study materials (Power BI certification)
- Tableau Public (free)
- "Storytelling with Data" by Cole Nussbaumer Knaflic (essential book)
- Power BI documentation (free)
- Tableau training videos (free on Tableau website)
Stage 05
Business Acumen and Communication
The analyst's value is not the chart. It is the insight the chart communicates. Domain knowledge and communication skills determine whether analysis drives decisions.
Translating Business Questions to Data Questions
- Common business questions and analytical equivalents: sales drops (time series + segmentation + driver decomposition); best customers (RFM analysis); feature effectiveness (A/B test); promotion priority (margin + inventory + elasticity); churn (cohort + correlation + survival)
- Metrics and KPIs: revenue (MRR, ARR, GMV, ARPU); customer (CAC, LTV, churn, retention, NPS); operational (conversion, funnel drop-off, AHT); web/product (DAU/MAU, engagement, bounce, session); financial (gross margin, EBITDA, burn rate, runway)
Data Storytelling
- Structure — Context → Complication → Resolution
- Leading with the insight, not the process — start with the answer, then show the evidence
- One chart, one message — each visual should communicate exactly one thing
- Avoiding common mistakes: data without interpretation, burying key insight, too many metrics, 3D charts, dual-axis charts, truncated y-axis
Stakeholder Communication
- Knowing your audience — technical vs non-technical; what decisions do they need to make?
- Asking good questions — clarifying ambiguous requests; understanding the decision context
- Managing expectations — setting realistic timelines; communicating data limitations
- Presenting findings — structured walkthrough; "here is what we found, here is why it matters, here is what we recommend"
- Written communication — analysis documentation; assumptions; methodology; data sources
Domain Knowledge Acceleration
- Finance — income statement (revenue, COGS, gross profit, OpEx, EBITDA, net income); balance sheet basics; cash flow; financial ratios
- Marketing — funnel metrics; attribution models (last click, first click, multi-touch); cohort analysis; customer segmentation
- E-commerce — GMV, take rate, cart abandonment, AOV, return rate, inventory metrics
- SaaS — ARR, MRR, churn, LTV, CAC, expansion revenue, NRR (Net Revenue Retention)
- Healthcare — patient metrics, utilization rates, outcomes data, compliance metrics
- Operations/supply chain — fill rate, lead time, inventory turnover, on-time delivery
Resources
- "Storytelling with Data" by Cole Nussbaumer Knaflic (book)
- "The Data Warehouse Toolkit" by Kimball (for data modeling context)
- Industry-specific analyst communities on LinkedIn
Stage 06
Cloud Data Tools and Advanced Analytics
Modern data teams work in cloud data warehouses. Understanding how to navigate these environments and apply intermediate analytics is the path to mid-senior analyst roles.
Cloud Data Warehouses — Analyst Perspective
- Snowflake — cloud-native, multi-cloud, separate storage and compute; worksheets, time travel, data sharing
- BigQuery (Google Cloud) — serverless, columnar, pay-per-query; project.dataset.table; partitioned and clustered tables; BigQuery ML
- Redshift (AWS) — MPP warehouse; distribution styles and sort keys
- Databricks — data lakehouse; notebooks; SQL Warehouse for BI; Delta Lake
- Microsoft Fabric — Microsoft's unified data platform; OneLake; Direct Lake mode for Power BI
dbt (Data Build Tool) — Analyst Exposure
- What dbt does — transforms data in the warehouse using version-controlled SQL
- Analysts increasingly use dbt to build and maintain data models that feed dashboards
- dbt concepts for analysts: models, sources, tests (not null, unique, accepted values, referential integrity), documentation, seeds
- dbt Cloud — managed dbt environment; job scheduling; IDE
- Exposure to dbt distinguishes analysts who can collaborate with data engineers
Python Advanced Analytics
- Cohort analysis using pandas transform, to_period, and groupby
- Churn prediction basics — logistic regression with scikit-learn
- Time series decomposition — trend, seasonality, residual with statsmodels
- Customer segmentation — RFM analysis; K-means clustering basics
- Statistical A/B test analysis — scipy.stats.ttest_ind for t-tests; chi-square for proportions
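An A/B test analysis in Python reduces to a two-sample t-test plus a practical-significance check. A sketch on simulated data (assumes scipy is installed; the parameters are invented, and the variant is given a genuinely higher mean so the test has something to find):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated experiment: variant truly lifts the mean from 100 to 110
control = rng.normal(loc=100, scale=15, size=500)
variant = rng.normal(loc=110, scale=15, size=500)

t_stat, p_value = stats.ttest_ind(control, variant)
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")

# Practical significance: always report the size of the lift,
# not just whether p crossed 0.05
lift = variant.mean() - control.mean()
print(f"estimated lift = {lift:.1f}")
```

With real experiment data the randomization, sample-size planning, and checking of test assumptions matter as much as the final t-test call.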
Resources
- Snowflake Quickstart (free)
- BigQuery free tier (free)
- dbt Fundamentals course (free at courses.getdbt.com)
- Kaggle datasets for practice
Stage 07
Hands-On Practice & Portfolio
Portfolio Projects (minimum 3-5)
- Sales analysis — public retail dataset (Kaggle); analyze by region/product/time; build a Power BI/Tableau dashboard
- Customer segmentation — RFM analysis on transaction data; cluster customers; recommend marketing actions
- A/B test analysis — real or synthetic experiment data; statistical significance testing; recommendation with confidence intervals
- Time series analysis — economic, weather, or business time series data; trend identification; seasonality decomposition
- SQL portfolio — 10–20 well-documented SQL queries solving realistic business questions; published on GitHub
Practice Platforms
- Kaggle — datasets and competitions; Kaggle Learn courses (free)
- Mode Analytics Public — SQL problems with real company datasets (free)
- StrataScratch — SQL and Python interview questions (paid)
- LeetCode SQL — database problems by difficulty (free)
- DataLemur — SQL interview prep focused on real tech company questions (free)
- Tableau Public — publishing and browsing published visualizations (free)
- Maven Analytics — guided analytics projects (paid)
What to Document on LabList
- Portfolio project write-ups — linked to GitHub repos with code and dashboards
- Power BI / Tableau public dashboards — links to published interactive dashboards
- SQL query portfolio — annotated queries showing analysis patterns
- Domain expertise signals — projects in your target industry domain
- Cert progression — Google Data Analytics or PL-300 documented with context
FAQ
Common questions
How long does it take to become a Data Analyst?
6–12 months optimistic at 20–25 hours/week, 10–18 months realistic part-time. Data Analyst is one of the most accessible data career paths. SQL fluency takes 2–3 months of consistent practice, Power BI/Tableau another 2–3 months, then 4–6 months building a portfolio of analyses on real (or realistic) datasets. 65% of entry-level postings don't require a specific degree when you bring a strong portfolio.
Which certifications matter for data analyst roles?
Google Data Analytics Professional Certificate is the most common entry-level signal. Microsoft PL-300 for Power BI shops. Tableau Desktop Specialist for Tableau organizations. SQL certifications matter less than demonstrated query work in your portfolio. The pattern: one credible cert + 3–5 portfolio projects beats a wall of certs and no demonstrated work.
Do I need a stats or CS degree?
No. Data analyst is one of the most credential-flexible data roles. Self-taught analysts with strong SQL and a public dashboard portfolio compete effectively. What you do need: comfort writing SQL beyond simple SELECT queries (joins, window functions, CTEs are table stakes), data cleaning intuition, and a sense for what business questions are worth answering. The bar moves up at senior levels — but at entry level, demonstrated work is the gating signal.
What separates a hired Data Analyst from one who doesn't make it?
A portfolio of analyses that answer business questions, not just visualize datasets. Hiring managers look for: SQL queries that handle messy real-world data, dashboards driven by clear analytical questions, written summaries that draw conclusions instead of describing charts, and evidence of stakeholder collaboration. SQL is in 90%+ of postings; if your SQL portfolio is empty, you're filtered out. BLS projects 34% job growth for data professionals through 2034.