A hiring manager at a Series B fintech we know spends 47 seconds on each portfolio link before making a keep-or-trash decision. Forty-seven seconds. In that window she's looking for three things: a README with a diagram, a CI pipeline that runs, and a data quality check that actually caught a bug. Everything else is noise. This guide tells you what to build so those 47 seconds work in your favor, not against you.
[Stats panel: median portfolio review time · projects worth building · companies hiring DEs · challenges you can draw from. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
A staff DE we talked to said she rejects 80% of portfolios at the README. No diagram, no pass. No docker-compose, no pass. You get 47 seconds. Here's the order she scans in and what she's looking for at each step.
1. **README first.** Does the README explain what the project does, what problem it solves, and how to run it? Is there an architecture diagram? If the README is empty or says "TODO," they close the tab. The README is the single most important file in your portfolio project.
2. **Pipeline scope.** Is this a real pipeline or just a script that reads a CSV? Hiring managers want to see end-to-end thinking: ingestion, transformation, loading, and ideally some form of monitoring or quality checks. A project that covers the full pipeline lifecycle beats five projects that each cover only one step.
3. **Code quality.** They open one or two files and scan for modular functions (not one giant script), docstrings, error handling, configuration separated from logic, and reasonable naming. They do not read every line; they look for signals that you write maintainable code.
4. **Stack relevance.** Are the tools relevant to the job? If the job description says Airflow and Snowflake, a portfolio using Airflow and Snowflake gets attention. If you used obscure tools, the reviewer may not recognize them.
5. **Tests and CI.** A test folder and a CI configuration file (GitHub Actions, etc.) immediately put you ahead of 90% of portfolios. Most portfolio projects have zero tests; even basic data validation tests show a production mindset.
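To make this concrete, here is a minimal sketch of what a basic data validation test could look like. The function name, fields, and thresholds are invented for illustration; in a real repo the validator would live in `src/` with the tests in `tests/`, run by CI on every push.

```python
def validate_weather_row(row: dict) -> list[str]:
    """Return a list of validation errors for one record (empty list = valid)."""
    errors = []
    # Required field check
    if row.get("city") in (None, ""):
        errors.append("city is required")
    # Plausible-range check catches unit bugs (e.g. Kelvin loaded as Celsius)
    temp = row.get("temp_c")
    if temp is None or not -90 <= temp <= 60:
        errors.append("temp_c outside plausible range")
    return errors


def test_valid_row_passes():
    assert validate_weather_row({"city": "Oslo", "temp_c": 3.5}) == []


def test_bad_row_fails():
    errors = validate_weather_row({"city": "", "temp_c": 200})
    assert "city is required" in errors
    assert "temp_c outside plausible range" in errors
```

Even a handful of checks like this, wired into CI, is the "data quality check that actually caught a bug" reviewers are scanning for.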
Each project targets a different skill set. Building all three would give you a portfolio that covers batch processing, real-time processing, and data quality. Pick at least one.
**Project 1: Batch ETL pipeline with orchestration**

Build an end-to-end batch pipeline that extracts data from a public API, transforms it, and loads it into a warehouse. Use Airflow or Dagster for orchestration. This is the most foundational project and the one most hiring managers expect to see.

- **Data source:** A public API with daily updates (weather API, government open data, financial data). Avoid static CSV downloads.
- **Stack:** Python, Airflow (or Dagster), dbt for transformations, PostgreSQL or DuckDB as the warehouse, Docker for local development.
- **Key features:** Incremental loading (not full refresh every time), idempotent tasks, error handling with retries, data validation checks after each load.
- **What it shows:** Orchestration, incremental processing, transformation logic, error handling, containerization.
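The incremental-plus-idempotent combination is the part most portfolios get wrong, so here is a minimal sketch of the pattern, using the stdlib `sqlite3` as a stand-in for the warehouse. The table names and watermark scheme are illustrative assumptions, not from any particular project.

```python
import sqlite3

def load_incrementally(conn, rows):
    """Incremental, idempotent load: only rows newer than the stored watermark
    are loaded, and re-running with the same input changes nothing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather (day TEXT PRIMARY KEY, temp_c REAL)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS watermark "
        "(id INTEGER PRIMARY KEY CHECK (id = 1), last_day TEXT)")

    # Incremental: skip everything at or before the last loaded day.
    row = conn.execute("SELECT last_day FROM watermark WHERE id = 1").fetchone()
    last_day = row[0] if row else ""
    new_rows = [r for r in rows if r["day"] > last_day]

    # Idempotent: an upsert keyed on the natural key makes retries safe.
    conn.executemany(
        "INSERT INTO weather (day, temp_c) VALUES (:day, :temp_c) "
        "ON CONFLICT(day) DO UPDATE SET temp_c = excluded.temp_c",
        new_rows,
    )
    if new_rows:
        conn.execute(
            "INSERT INTO watermark (id, last_day) VALUES (1, ?) "
            "ON CONFLICT(id) DO UPDATE SET last_day = excluded.last_day",
            (max(r["day"] for r in new_rows),),
        )
    conn.commit()
    return len(new_rows)
```

In the real project the same watermark-plus-upsert logic would sit inside an Airflow task; a second run over the same extract loads zero rows, which is exactly the behavior a reviewer will test for.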
**Project 2: Real-time streaming pipeline**

Build a streaming pipeline that processes events in near real-time and feeds a live dashboard. This shows you can work with streaming concepts, which many batch-only portfolios lack.

- **Data source:** A websocket API (cryptocurrency prices, public transit real-time feeds) or a self-generated event stream using a producer script.
- **Stack:** Kafka (or Redpanda for a lighter setup), Python consumer, PostgreSQL or ClickHouse for fast queries, a simple dashboard (Streamlit, Grafana, or Metabase).
- **Key features:** At-least-once delivery, deduplication logic, windowed aggregations (5-minute rolling averages), backpressure handling.
- **What it shows:** Streaming architecture, message queues, windowed processing, end-to-end data flow from producer to dashboard.
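Two of these features, deduplication and windowed aggregation, can be sketched in plain Python without a running Kafka cluster. The event shape, ID field, and tumbling-window size below are assumptions for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling windows

def windowed_averages(events):
    """Deduplicate events by id, then average prices per 5-minute window.

    Each event is a dict: {"id": ..., "ts": unix_seconds, "price": float}.
    """
    seen = set()
    windows = defaultdict(list)
    for event in events:
        # At-least-once delivery means duplicates happen; drop repeats by id.
        if event["id"] in seen:
            continue
        seen.add(event["id"])
        # Align each event to the start of its 5-minute window.
        window_start = event["ts"] - event["ts"] % WINDOW_SECONDS
        windows[window_start].append(event["price"])
    return {w: sum(prices) / len(prices) for w, prices in sorted(windows.items())}
```

In the real pipeline the same logic runs inside the consumer loop, with the `seen` set replaced by something bounded (a TTL cache or a unique key in the sink table), since an in-memory set grows without limit.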
**Project 3: Data quality monitoring system**

Build a system that monitors data quality across multiple tables and alerts when something goes wrong. This project is underrepresented in portfolios, which makes it stand out.

- **Data source:** Any database with tables that have known quality expectations (nullability, uniqueness, freshness, value ranges).
- **Stack:** Python, Great Expectations or a custom validation framework, PostgreSQL, Slack or email for alerting, a dashboard showing quality trends over time.
- **Key features:** Configurable quality rules (YAML or JSON), historical tracking of quality scores, alerting on threshold violations, a dashboard that shows quality trends by table.
- **What it shows:** Production mindset, data quality awareness, monitoring and alerting, configuration-driven design.
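The configuration-driven part might look like the following sketch, using JSON for the rule file (the rule names, table, and columns are invented for illustration; a real system would load the config from disk and log results historically).

```python
import json

# Illustrative rule config; in the real project this lives in a JSON/YAML file.
RULES_JSON = """
{
  "users": {
    "not_null": ["id", "email"],
    "unique": ["id"],
    "range": {"age": [0, 120]}
  }
}
"""

def check_table(rows, rules):
    """Evaluate configurable quality rules against a list of row dicts.

    Returns a list of human-readable failures (empty list = table is healthy).
    """
    failures = []
    for col in rules.get("not_null", []):
        if any(r.get(col) is None for r in rows):
            failures.append(f"{col}: null values found")
    for col in rules.get("unique", []):
        values = [r.get(col) for r in rows]
        if len(values) != len(set(values)):
            failures.append(f"{col}: duplicate values found")
    for col, (lo, hi) in rules.get("range", {}).items():
        if any(r.get(col) is not None and not lo <= r[col] <= hi for r in rows):
            failures.append(f"{col}: value out of range [{lo}, {hi}]")
    return failures
```

Because the rules live in config rather than code, adding a check for a new table is a config change, not a deploy; that is the "configuration-driven design" signal this project exists to demonstrate.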
A clean repo structure signals professionalism. Here is the layout that hiring managers expect.
```
weather-pipeline/
  README.md               # project overview, architecture, setup
  docker-compose.yml      # local development environment
  .github/
    workflows/
      ci.yml              # lint + test on every push
  dags/
    weather_etl.py        # Airflow DAG definition
  src/
    extract/
      weather_api.py      # API client with retry logic
    transform/
      clean_weather.py    # data cleaning and validation
      aggregate.py        # daily/weekly aggregations
    load/
      warehouse.py        # database loading functions
    utils/
      config.py           # configuration management
      logging.py          # structured logging setup
  models/                 # dbt models (if using dbt)
    staging/
    marts/
  tests/
    test_extract.py       # unit tests for extraction
    test_transform.py     # unit tests for transformations
    test_integration.py   # end-to-end pipeline test
  config/
    settings.yaml         # environment-specific config
  .env.example            # required env vars (no secrets)
```

Your README should answer five questions, in this order:
1. What does this project do? One paragraph. Example: “An ETL pipeline that pulls daily weather data from the OpenWeather API, cleans and validates it, and loads it into PostgreSQL for analysis.”
2. Architecture diagram. A simple diagram (even ASCII art) showing data flow: source -> extract -> transform -> load -> serve.
3. How to run it. Step-by-step instructions. Ideally: clone, copy .env.example to .env, fill in API key, run docker-compose up.
4. Design decisions. Why you chose these tools. Why incremental vs full refresh. What tradeoffs you made. This section shows you think critically about architecture.
5. What you would do with more time. Shows self-awareness. Example: “Add monitoring with Prometheus, implement SCD Type 2 for dimension tables, add integration tests with testcontainers.”
These patterns make hiring managers lose interest fast.
**No README.** The number one portfolio killer. A repo without a README is a repo that nobody will look at. Hiring managers will not clone your code and read it to figure out what it does. If the README is missing, the project does not exist in their evaluation.
**Tutorial clones.** If your project looks identical to a YouTube tutorial (same data source, same structure, same variable names), it signals that you followed instructions without understanding the concepts. Start from a tutorial if you need to, but modify it: change the data source, add error handling, implement incremental loading, add tests. Make it yours.
**Hardcoded secrets.** Hardcoding API keys, database passwords, or AWS credentials in your code is a disqualifying signal. It shows you do not understand basic security practices. Always use environment variables and include a .env.example file that lists the required variables without values.
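A minimal fail-fast pattern for reading secrets from the environment looks like this sketch; the variable names are illustrative and should match whatever your .env.example lists.

```python
import os

def load_config():
    """Read required settings from the environment; fail fast if any are missing.

    Crashing at startup with a clear message beats a cryptic auth error
    halfway through a pipeline run.
    """
    required = ["WEATHER_API_KEY", "DATABASE_URL"]
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    return {name: os.environ[name] for name in required}
```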
**A graveyard of unfinished repos.** Ten repos with "WIP" status look worse than two finished repos. If you have incomplete projects, either finish them or make them private. Your public GitHub should only show work that represents your best effort.
A portfolio gets you past the resume screen. But the interview still tests SQL, Python, and system design skills separately. The best preparation combines portfolio projects (to show you can build) with focused practice (to show you can perform under pressure).
| Interview Stage | What It Tests | How Portfolio Helps |
|---|---|---|
| Resume Screen | Experience signals | Replaces missing work experience |
| SQL Round | Query writing under pressure | Minimal. Practice problems help more. |
| Python Round | Function implementation | Shows code quality, but practice is still needed. |
| System Design | Architecture thinking | High. You can reference your project as evidence. |
A real pipeline with real tests and a real README. Build one this month and your portfolio will land differently.