Career Guide

Data Engineer Portfolio

A hiring manager at a Series B fintech we know spends 47 seconds on each portfolio link before making a keep-or-trash decision. Forty-seven seconds. In that window she's looking for three things: a README with a diagram, a CI pipeline that runs, and a data quality check that actually caught a bug. Everything else is noise. This guide tells you what to build so those 47 seconds work in your favor, not against you.

47s: median portfolio review
3: projects worth building
275: companies hiring DEs
1,418: challenges you can draw from

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

What Hiring Managers Actually Look For

A staff DE we talked to said she rejects 80% of portfolios at the README. No diagram, no pass. No docker-compose, no pass. You get 47 seconds. Here's the order she scans in and what she's looking for at each step.

1. The README (30 seconds)

Does the README explain what the project does, what problem it solves, and how to run it? Is there an architecture diagram? If the README is empty or says “TODO,” they close the tab. The README is the single most important file in your portfolio project.

2. Project Scope (15 seconds)

Is this a real pipeline or just a script that reads a CSV? Hiring managers want to see end-to-end thinking: data ingestion, transformation, loading, and ideally some form of monitoring or quality checks. A project that covers the full pipeline lifecycle beats five projects that only cover one step each.

3. Code Quality (60 seconds)

They open one or two files and scan for: modular functions (not one giant script), docstrings, error handling, configuration separate from logic, and reasonable naming. They do not read every line. They look for signals that you write maintainable code.
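Those signals are easier to show than to describe. Here is a minimal sketch of what a reviewer hopes to find when they open a transform file; the function name, config values, and record schema are all illustrative:

```python
import logging

logger = logging.getLogger(__name__)

# Configuration separated from logic: thresholds live in one place
# instead of being scattered through the code. (Values are illustrative.)
CONFIG = {"min_temp_c": -90.0, "max_temp_c": 60.0}


def clean_readings(raw: list[dict], config: dict = CONFIG) -> list[dict]:
    """Drop readings with missing or physically impossible temperatures.

    Logs how many rows were rejected so a bad upstream feed shows up in
    the pipeline logs instead of silently shrinking the table.
    """
    cleaned = []
    for row in raw:
        temp = row.get("temp_c")
        if temp is None or not (config["min_temp_c"] <= temp <= config["max_temp_c"]):
            continue
        cleaned.append(row)
    if len(cleaned) < len(raw):
        logger.warning("Rejected %d of %d readings", len(raw) - len(cleaned), len(raw))
    return cleaned
```

Twenty lines like these tell a reviewer more than any resume bullet: docstring, error handling, config separated from logic, and a function small enough to test.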

4. Technology Choices (15 seconds)

Are the tools relevant to the job? If the job description says Airflow and Snowflake, a portfolio using Airflow and Snowflake gets attention. If you used obscure tools, the reviewer may not recognize them and will give you no credit for the work.

5. Tests and CI/CD (bonus)

The presence of a test folder and a CI configuration file (GitHub Actions, etc.) immediately puts you ahead of 90% of portfolios. Most portfolio projects have zero tests. Having even basic data validation tests shows production mindset.
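Clearing this bar takes very little. A couple of pytest-style tests like the sketch below (the function under test is a toy stand-in; swap in your own transforms), plus a GitHub Actions workflow that runs pytest on every push, is already more than most portfolios have:

```python
# tests/test_transform.py -- minimal data validation tests, pytest style.
# daily_mean is a toy stand-in for one of your own transform functions.

def daily_mean(temps: list[float]) -> float:
    """Mean of one day's readings; fails loudly on an empty partition."""
    if not temps:
        raise ValueError("no readings for the day")
    return sum(temps) / len(temps)


def test_daily_mean_happy_path():
    assert daily_mean([10.0, 20.0]) == 15.0


def test_daily_mean_rejects_empty_day():
    # An empty input should raise, not load a NaN into the warehouse.
    try:
        daily_mean([])
        assert False, "expected ValueError"
    except ValueError:
        pass
```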

3 Portfolio Project Ideas

Each project targets a different skill set. Building all three would give you a portfolio that covers batch processing, real-time processing, and data quality. Pick at least one.

Project 1: ETL Pipeline with Orchestration

Build an end-to-end batch pipeline that extracts data from a public API, transforms it, and loads it into a warehouse. Use Airflow or Dagster for orchestration. This is the most foundational project and the one most hiring managers expect to see.

Data source: A public API with daily updates (weather API, government open data, financial data). Avoid static CSV downloads.

Stack: Python, Airflow (or Dagster), dbt for transformations, PostgreSQL or DuckDB as the warehouse, Docker for local development.

Key features: Incremental loading (not full refresh every time), idempotent tasks, error handling with retries, data validation checks after each load.

What it shows: Orchestration, incremental processing, transformation logic, error handling, containerization.

Project 2: Real-Time Data Dashboard

Build a streaming pipeline that processes events in near real-time and feeds a live dashboard. This shows you can work with streaming concepts, which many batch-only portfolios lack.

Data source: A websocket API (cryptocurrency prices, public transit real-time feeds) or a self-generated event stream using a producer script.

Stack: Kafka (or Redpanda for lighter setup), Python consumer, PostgreSQL or ClickHouse for fast queries, a simple dashboard (Streamlit, Grafana, or Metabase).

Key features: At-least-once delivery, deduplication logic, windowed aggregations (5-minute rolling averages), backpressure handling.

What it shows: Streaming architecture, message queues, windowed processing, end-to-end data flow from producer to dashboard.

Project 3: Data Quality Monitoring System

Build a system that monitors data quality across multiple tables and alerts when something goes wrong. This project is underrepresented in portfolios, which makes it stand out.

Data source: Any database with tables that have known quality expectations (nullability, uniqueness, freshness, value ranges).

Stack: Python, Great Expectations or custom validation framework, PostgreSQL, Slack or email for alerting, a dashboard showing quality trends over time.

Key features: Configurable quality rules (YAML or JSON), historical tracking of quality scores, alerting on threshold violations, a dashboard that shows quality trends by table.

What it shows: Production mindset, data quality awareness, monitoring and alerting, configuration-driven design.
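Configuration-driven checks do not require a framework to demonstrate. In the real project the rules below would live in a YAML file; here they are inline dicts, and every table, column, and threshold is illustrative:

```python
# Declarative quality rules: adding a check means editing config, not code.
RULES = [
    {"table": "daily_weather", "column": "temp_c", "check": "not_null"},
    {"table": "daily_weather", "column": "temp_c", "check": "range", "min": -90, "max": 60},
]


def run_checks(rows: list[dict], rules: list[dict]) -> list[str]:
    """Return human-readable violations; an empty list means all checks passed."""
    violations = []
    for rule in rules:
        col = rule["column"]
        values = [r.get(col) for r in rows]
        if rule["check"] == "not_null" and any(v is None for v in values):
            violations.append(f"{rule['table']}.{col}: null values found")
        elif rule["check"] == "range":
            bad = [v for v in values if v is not None and not (rule["min"] <= v <= rule["max"])]
            if bad:
                violations.append(f"{rule['table']}.{col}: {len(bad)} out-of-range values")
    return violations
```

The returned violation strings are what you would push to Slack or email, and a timestamped log of them is the raw material for the quality-trend dashboard.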

GitHub Repository Structure

A clean repo structure signals professionalism. Here is the layout that hiring managers expect.

weather-pipeline/
  README.md              # project overview, architecture, setup
  docker-compose.yml     # local development environment
  .github/
    workflows/
      ci.yml             # lint + test on every push
  dags/
    weather_etl.py       # Airflow DAG definition
  src/
    extract/
      weather_api.py     # API client with retry logic
    transform/
      clean_weather.py   # data cleaning and validation
      aggregate.py       # daily/weekly aggregations
    load/
      warehouse.py       # database loading functions
    utils/
      config.py          # configuration management
      logging_setup.py   # structured logging (avoid naming it logging.py, which shadows the stdlib module)
  models/                # dbt models (if using dbt)
    staging/
    marts/
  tests/
    test_extract.py      # unit tests for extraction
    test_transform.py    # unit tests for transformations
    test_integration.py  # end-to-end pipeline test
  config/
    settings.yaml        # environment-specific config
  .env.example           # required env vars (no secrets)

README Template

Your README should answer five questions in this order:

1. What does this project do? One paragraph. Example: “An ETL pipeline that pulls daily weather data from the OpenWeather API, cleans and validates it, and loads it into PostgreSQL for analysis.”

2. Architecture diagram. A simple diagram (even ASCII art) showing data flow: source -> extract -> transform -> load -> serve.

3. How to run it. Step-by-step instructions. Ideally: clone, copy .env.example to .env, fill in API key, run docker-compose up.

4. Design decisions. Why you chose these tools. Why incremental vs full refresh. What tradeoffs you made. This section shows you think critically about architecture.

5. What you would do with more time. Shows self-awareness. Example: “Add monitoring with Prometheus, implement SCD Type 2 for dimension tables, add integration tests with testcontainers.”
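The diagram in step 2 does not need a drawing tool; plain ASCII covering the same source -> extract -> transform -> load -> serve flow is enough:

```
          +---------+     +-----------+     +------+     +------------+
API ----> | extract | --> | transform | --> | load | --> | PostgreSQL |
          +---------+     +-----------+     +------+     +------------+
                                |
                                v
                     quality checks / alerts
```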

Portfolio Mistakes That Hurt You

These patterns make hiring managers lose interest fast.

No README or Empty README

The number one portfolio killer. A repo without a README is a repo that nobody will look at. Hiring managers will not clone your code and read it to figure out what it does. If the README is missing, the project does not exist in their evaluation.

Tutorial Copy-Paste

If your project looks identical to a YouTube tutorial (same data source, same structure, same variable names), it signals that you followed instructions without understanding the concepts. Start from a tutorial if you need to, but modify it: change the data source, add error handling, implement incremental loading, add tests. Make it yours.

Secrets in the Code

Hardcoded API keys, database passwords, or AWS credentials in your code are a disqualifying signal. They show you do not understand basic security practices. Always use environment variables and include a .env.example file that lists the required variables without values.
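The fail-fast pattern below is one way to wire this up (the variable name is illustrative): every secret comes from the environment, and a missing one crashes with a message that points at .env.example instead of failing mysteriously later.

```python
import os


def require_env(name: str) -> str:
    """Fetch a required secret from the environment, failing fast if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; copy .env.example to .env and fill it in"
        )
    return value


# Usage at startup -- the key itself never appears in the repo:
# API_KEY = require_env("WEATHER_API_KEY")
```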

Too Many Incomplete Projects

Ten repos with “WIP” status look worse than two finished repos. If you have incomplete projects, either finish them or make them private. Your public GitHub should only show work that represents your best effort.

Portfolio Alone Is Not Enough

A portfolio gets you past the resume screen. But the interview still tests SQL, Python, and system design skills separately. The best preparation combines portfolio projects (to show you can build) with focused practice (to show you can perform under pressure).

| Interview Stage | What It Tests | How Portfolio Helps |
| --- | --- | --- |
| Resume Screen | Experience signals | Replaces missing work experience |
| SQL Round | Query writing under pressure | Minimal. Practice problems help more. |
| Python Round | Function implementation | Shows code quality, but practice is still needed. |
| System Design | Architecture thinking | High. You can reference your project as evidence. |

Data Engineer Portfolio FAQ

Do data engineers need a portfolio?

A portfolio is not required for experienced data engineers with strong work history. If your resume shows 3+ years at recognizable companies with concrete pipeline metrics (processed X events/day, reduced latency by Y%), that speaks for itself. Portfolios matter most for career switchers, bootcamp grads, and junior engineers who lack professional data engineering experience. A well-built portfolio project can replace the "previous experience" section on a resume by showing you can build real systems.

How many portfolio projects do I need?

Two to three. Quality matters far more than quantity. One deep, well-documented project that shows end-to-end pipeline thinking is worth more than ten half-finished repos. Hiring managers spend 2 to 5 minutes reviewing a portfolio. They click one project, skim the README, and glance at the code. If that first project impresses them, they move you forward. If it looks incomplete or the README is missing, they close the tab.

Should I use real data or fake data for portfolio projects?

Use real public data whenever possible. Government datasets, public APIs (weather, transit, stock prices), and open datasets (Kaggle, UCI ML Repository) all work. Real data is messy, which forces you to handle edge cases that fake data does not have. It also makes the project more interesting to reviewers because they can see you solved real-world problems. Avoid using data from your employer without permission.

What technologies should I use in my portfolio projects?

Match the technologies to your target job descriptions. If most jobs you want list Airflow, Spark, and AWS, use those. If they list dbt, Snowflake, and Fivetran, use those. Do not build a Hadoop project in 2024 unless you are targeting a specific legacy shop. A safe modern stack is: Python, SQL, Airflow or Dagster for orchestration, dbt for transformation, and either Snowflake or PostgreSQL as the warehouse. Docker for local development. GitHub Actions for CI/CD.

47 Seconds. Three Projects. One Job Offer.

A real pipeline with real tests and a real README. Build one this month and your portfolio will land differently.