Data Engineering Interview Prep
SQL is the most-tested skill in data engineering interviews, appearing in nearly seven out of ten interview loops. The most-tested topics are aggregation, JOINs, and window functions. The questions below are organized by topic, with example prompts under each.
Every question below can be practiced with real SQL execution on DataDriven. Write the query, run it against a real database, and see if your output matches.
ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE, and frame clauses. PARTITION BY is the single most common window function keyword in real DE interviews.
Write a query to find the second-highest salary in each department using window functions.
Calculate a 7-day rolling average of daily revenue using a window frame.
Use LAG to find customers whose order value decreased compared to their previous order.
Rank products by total sales within each category, handling ties with DENSE_RANK.
Use NTILE to divide employees into salary quartiles within each department.
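As a sketch of the first prompt above, assuming a hypothetical employees(id, name, department, salary) table, DENSE_RANK partitioned by department lets you pick out the second-highest salary while handling ties:

```sql
-- Hypothetical schema: employees(id, name, department, salary).
SELECT department, name, salary
FROM (
    SELECT
        department,
        name,
        salary,
        DENSE_RANK() OVER (
            PARTITION BY department
            ORDER BY salary DESC
        ) AS salary_rank
    FROM employees
) ranked
WHERE salary_rank = 2;
```

DENSE_RANK (rather than RANK or ROW_NUMBER) ensures that if two people tie for the top salary, the next-highest salary still ranks as 2.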
Common Table Expressions for readable multi-step queries. Recursive CTEs for hierarchical data like org charts and category trees.
Write a CTE to find all employees in a management chain starting from the CEO.
Use a recursive CTE to generate a date series for the last 90 days.
Chain multiple CTEs to calculate month-over-month revenue growth rates.
Write a recursive query to find all subcategories of a given parent category.
Use a CTE to deduplicate records, keeping only the most recent entry per user.
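The date-series prompt above can be sketched with a recursive CTE; this is Postgres-style syntax (date arithmetic and the RECURSIVE keyword vary by dialect):

```sql
-- Generates one row per day covering the last 90 days (Postgres-style).
WITH RECURSIVE dates AS (
    SELECT CURRENT_DATE - 89 AS day   -- anchor: 89 days ago
    UNION ALL
    SELECT day + 1                    -- step: add one day
    FROM dates
    WHERE day < CURRENT_DATE          -- stop at today
)
SELECT day
FROM dates
ORDER BY day;
```

A generated date series is often LEFT JOINed against event data to surface days with no activity, which a plain GROUP BY would silently drop.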
INNER JOIN is the second most-tested SQL keyword after GROUP BY. LEFT, RIGHT, FULL OUTER, CROSS, self-joins, and inequality joins round out the category.
Write a self-join to find pairs of employees who share the same manager.
Use a LEFT JOIN to find customers who have never placed an order.
Write an inequality join to match events that occurred within 30 minutes of each other.
Join three tables to find the top-selling product in each region.
Use a CROSS JOIN to generate all possible date-product combinations, then LEFT JOIN to find missing data.
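The "customers who never ordered" prompt is the classic anti-join pattern; a sketch, assuming hypothetical customers(customer_id, name) and orders(order_id, customer_id) tables:

```sql
-- Hypothetical schema: customers(customer_id, name), orders(order_id, customer_id).
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o
    ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;  -- no matching order row means the customer never ordered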
GROUP BY is the most frequently tested SQL keyword in DE interviews. COUNT, SUM, AVG, MIN, MAX with HAVING and conditional aggregation using CASE WHEN.
Find departments where the average salary exceeds $100K using HAVING.
Calculate the percentage of orders that were returned, grouped by product category.
Use conditional aggregation (CASE WHEN inside SUM) to pivot monthly sales into columns.
Find users who made purchases in at least 3 different months.
Write a query to find the most common pair of products bought together.
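The "at least 3 different months" prompt above combines GROUP BY with HAVING over a distinct count; a sketch, assuming a hypothetical orders(user_id, order_date) table and Postgres-style DATE_TRUNC:

```sql
-- Hypothetical schema: orders(user_id, order_date).
SELECT user_id
FROM orders
GROUP BY user_id
HAVING COUNT(DISTINCT DATE_TRUNC('month', order_date)) >= 3;
```

The key distinction interviewers probe here: WHERE filters rows before aggregation, HAVING filters groups after it.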
Scalar, correlated, and EXISTS subqueries. Critical for complex filtering and comparison logic.
Use a correlated subquery to find each employee's salary as a percentage of their department average.
Write an EXISTS subquery to find products that have never been ordered.
Use a scalar subquery to find orders placed on the day with the highest total revenue.
Write a subquery in the FROM clause to first aggregate, then filter on the aggregated result.
Use NOT EXISTS to find customers who bought product A but never bought product B.
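The last prompt above is a NOT EXISTS anti-join; a sketch, assuming a hypothetical orders(customer_id, product_id) table and placeholder product IDs 'A' and 'B':

```sql
-- Hypothetical schema: orders(customer_id, product_id).
SELECT DISTINCT a.customer_id
FROM orders a
WHERE a.product_id = 'A'
  AND NOT EXISTS (
      SELECT 1
      FROM orders b
      WHERE b.customer_id = a.customer_id  -- correlated on the outer row
        AND b.product_id = 'B'
  );
```

NOT EXISTS is usually preferred over NOT IN here, because NOT IN returns no rows at all if the subquery produces even one NULL.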
COALESCE, NULLIF, IS NULL, three-valued logic. Subtle but critical for correctness.
Write a query that correctly handles NULL in a LEFT JOIN where the right side has no match.
Use COALESCE to replace NULL values with a running total from the previous row.
Explain why WHERE column != 'value' does not return rows where column IS NULL.
Use NULLIF to avoid division-by-zero errors in a revenue-per-user calculation.
Write a query to find records where any of five columns contains a NULL value.
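The division-by-zero prompt above is a one-liner with NULLIF; a sketch, assuming a hypothetical daily_stats(day, revenue, active_users) table:

```sql
-- Hypothetical schema: daily_stats(day, revenue, active_users).
-- NULLIF(active_users, 0) returns NULL when active_users = 0,
-- so the division yields NULL instead of raising an error;
-- COALESCE then maps that NULL to 0 for reporting.
SELECT
    day,
    COALESCE(revenue / NULLIF(active_users, 0), 0) AS revenue_per_user
FROM daily_stats;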
DATE_TRUNC, DATE_DIFF, EXTRACT, interval arithmetic. Essential for time-series analysis in data engineering.
Calculate user retention rates at 1, 7, and 30 days after signup.
Use DATE_TRUNC to aggregate daily event data into weekly cohorts.
Find the gap in days between each consecutive login for every user.
Write a query to identify users who were active for at least 3 consecutive days.
Calculate the percentage of monthly active users who were also active the previous month.
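The login-gap prompt above pairs date arithmetic with LAG; a sketch, assuming a hypothetical logins(user_id, login_date) table and Postgres-style date subtraction (which returns an integer number of days):

```sql
-- Hypothetical schema: logins(user_id, login_date).
SELECT
    user_id,
    login_date,
    login_date - LAG(login_date) OVER (
        PARTITION BY user_id      -- gaps are computed per user
        ORDER BY login_date
    ) AS days_since_last_login    -- NULL for each user's first login
FROM logins;
```

This gaps query is also the first step of the "3 consecutive days" prompt: filter for gaps of exactly 1 and count runs.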
LIKE, regex, CONCAT, SUBSTRING, SPLIT_PART. Common in data cleaning and ETL pipeline questions.
Extract the domain name from a list of email addresses using string functions.
Use LIKE with wildcards to find products whose names contain a specific pattern.
Split a comma-separated tags column into individual rows.
Write a query to standardize phone numbers into a consistent format.
Use REGEXP_REPLACE to clean HTML tags from a text column.
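The email-domain prompt above is a one-liner in dialects that support SPLIT_PART (Postgres, Redshift, Snowflake); a sketch, assuming a hypothetical users(email) table:

```sql
-- Hypothetical schema: users(email).
-- SPLIT_PART splits on '@' and returns the second piece, i.e. the domain.
SELECT
    email,
    SPLIT_PART(email, '@', 2) AS domain
FROM users;
```

In dialects without SPLIT_PART, the same result typically takes a SUBSTRING combined with a position function such as POSITION or INSTR.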
Given a daily_revenue table with day and revenue, compute a 7-day trailing average for each day. Only include days that have a full 7 days of history.
SELECT
    day,
    revenue,
    ROUND(
        AVG(revenue) OVER (
            ORDER BY day
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ), 2
    ) AS rolling_7d_avg
FROM daily_revenue
ORDER BY day;

The window frame ROWS BETWEEN 6 PRECEDING AND CURRENT ROW captures exactly 7 rows (the current day plus the 6 before it), and AVG computes the mean across that frame. For the first 6 days, the frame holds fewer than 7 rows, so the average is based on partial history. To exclude those partial windows, as the problem requires, wrap the query in a CTE that also computes ROW_NUMBER() OVER (ORDER BY day) and filter WHERE row_number > 6.
Expected output
day | revenue | rolling_7d_avg
------------+----------+---------------
2024-01-07 | 12400.00 | 11457.14
2024-01-08 | 13100.00 | 11871.43
2024-01-09 | 11800.00 | 12128.57

SQL is the most common skill tested in data engineering interviews, appearing in the majority of loops and a significant share of phone screens. The format is predictable: you get a schema description, a business question, and 20-30 minutes to write a query. The interviewer watches you think through the problem, not just the final answer.
Start with JOINs and aggregations. These appear in every interview and are the foundation for harder topics. If GROUP BY and HAVING are second nature, you can focus your limited interview time on the harder parts of the problem.
Master window functions. Window functions are the highest-difficulty topic that appears consistently in SQL rounds. Practice until you can write ROW_NUMBER, RANK, LAG, and LEAD without looking up the syntax.
Practice under time pressure. Knowing the concept is not enough. You need to write correct SQL in under 15 minutes per question. DataDriven's timed drill mode simulates real interview pressure.
Run your code. Reading solutions is not the same as writing them. Every question on DataDriven executes against a real database so you catch syntax errors, NULL edge cases, and logic bugs before the interview.
Write real SQL. Get real results. Know exactly where you stand before the interview.