What SQL dialect does Meta use in data engineer interviews?

Presto and Trino predominantly, with some Hive on the data-platform team. The Postgres catalog here is portable for the patterns (window functions, CTEs, aggregation, JOIN). The Presto-specific syntax (UNNEST, MAP_AGG, APPROX_DISTINCT, LATERAL VIEW EXPLODE in Hive) is tagged on the problems where it diverges. Practice in Postgres, mention Presto syntax during the interview.

How does Meta weight communication versus technical correctness in interviews?

Higher than most companies. Meta's data engineer rubric explicitly scores 'thinks out loud', 'asks clarifying questions', and 'articulates trade-offs' as separate dimensions from technical correctness. A data engineer candidate who lands the right answer silently scores lower than one who narrates the wrong approach, corrects mid-round when probed, and articulates why. Practice narrating through every problem.

What is the typical system design scenario at Meta?

Most often the feed-ranking signals pipeline (collect user interaction events at 10B+ per day, compute signals like dwell time and engagement velocity, serve to the ML ranking model with single-digit-millisecond latency) or the ads delivery analytics pipeline (impression-click-conversion attribution with 28-day windows, multi-region replay, idempotent reconciliation). Both 45-60 minutes; both expect 3+ failure modes per component.

How important is gap-and-island at Meta data engineer interviews?

Roughly 30 percent of reported Meta SQL questions involve gap-and-island for engagement streak detection (consecutive days active, longest watch session, days since last login). Master the date minus ROW_NUMBER times INTERVAL 1 day pattern. Multiple variants: HAVING COUNT >= N for minimum streak length, MAX(MAX_island_length) per user for longest streak, COUNT of distinct streaks per user.

Does Meta test PySpark in data engineer interviews?

Less than the Spark-first companies. Meta's internal stack favors Presto and Hive for batch SQL, Spark for ML feature pipelines and some heavy ETL. PySpark questions appear in data engineer loops for data-infra-adjacent teams (the team that builds the Spark platform) and ML data engineering teams. For general data engineer loops at Meta, SQL fluency matters more than Spark.

What is the bar for the Python round at Meta?

Pipeline-shaped Python, more vanilla than at most companies because Meta's stack pushes most transformation into Presto, Hive, and Spark. Python is for orchestration, validation, custom transforms. Common prompts: implement an SCD Type 2 merge in pandas, sessionize events with itertools.groupby, dedup with composite tiebreaker. The bar is correctness plus error handling. Silent failures are the failure mode the Meta rubric calls out.

How long is a Meta data engineer onsite?

4-5 rounds over 4-5 hours: SQL, Python, data modeling, system design, behavioral. Senior+ loops add a second design round (frequently a 'design the data platform' meta-question). Each round is 45-60 minutes. Lunch interview is usually informal but observed by the recruiter.

Meta Data Engineer Interview Questions (Presto SQL)

Meta Data Engineer Interview Questions

Meta-tagged data engineer interview questions with live grading.

Meta data engineer interview questions tagged based on reported interview shape. Window-function-heavy SQL on Presto and Trino. Gap-and-island patterns for engagement streak detection. Ads attribution data modeling. Feed-ranking signals pipeline design. Communication and trade-off articulation weighted explicitly in the Meta rubric.

Meta's data engineer interview loop in 2026 is 5 rounds: SQL (heavy on window functions, gap-and-island for engagement streaks, and time-series aggregation across the ads and feed-ranking warehouses), Python (pipeline-shaped: parsing, dedup, sessionization, with vanilla Python preferred because Meta's internal stack pushes most transformation into Presto, Hive, and Spark), data modeling (frequently the ads attribution model with impressions, clicks, conversions, last-touch versus multi-touch attribution, and the SCD2 advertiser dimension question), system design (typically the feed-ranking signals pipeline or the ads delivery analytics pipeline at 10B+ events per day with multi-region replay and 28-day late-arriving conversion windows), and behavioral (Meta weights communication on every round; the rubric explicitly scores "thinks out loud" and "asks clarifying questions" as separate dimensions from technical correctness).

The Meta data engineer SQL bar specifically. Window functions appear in 80 percent of reported Meta SQL questions, frequently composed (a window over a window via two CTEs). Gap-and-island for engagement streaks appears in roughly 30 percent of reported rounds: consecutive-days-active, longest-watch-session, streak-length distribution. The Presto and Trino dialect is what you write against at Meta. UNNEST for array column expansion. MAP_AGG for key-value rollups. APPROX_DISTINCT for cardinality at scale. Practice in Postgres here is portable for the patterns; the Meta-specific syntactic differences are tagged on each problem.

The Meta data engineer design round is calibrated to senior signal. A working high-level architecture is not enough. The L5 rubric weights: defend the choice against two alternatives, name 3 failure modes per component, address the late-arriving data and replay story, articulate cost reasoning (slot consumption, Spark workers, S3 storage). Conversion windows for ads attribution extend 28 days post-click, which means the design must handle reprocessing yesterday's totals when today brings new attributed conversions. The MERGE-ADD-not-REPLACE pattern is the standard answer. Multi-region replication for ads accounting is the L6 follow-up.

The Meta data engineer Python bar is pipeline-shaped and more vanilla than at most companies. Meta's internal stack favors Presto and Hive for batch SQL, Spark for ML feature pipelines and some heavy ETL. Python is for orchestration, validation, and custom transforms. Common Meta Python interview prompts: implement an SCD Type 2 merge in pandas, sessionize events with itertools.groupby, dedup with composite tiebreaker. The bar is correctness plus error handling. Silent failures in pipeline code are the failure mode the Meta rubric calls out.

Meta calls its levels E4 (mid), E5 (senior), E6 (staff), E7 (senior staff), E8 (principal). E5 is the typical senior data engineer floor. E6+ rubrics emphasize trade-off articulation, failure-mode naming, and the ability to adapt cleanly when the interviewer changes a requirement mid-round. The behavioral round at Meta probes "thinks out loud" and "asks clarifying questions" as separate dimensions from technical correctness; the data engineer who narrates through every step scores above the one who solves silently.

Full data engineer interview catalog - 1,400+ problems across all 5 rounds.
Window functions (Meta-tested heavily) - ROW_NUMBER, RANK, DENSE_RANK, LAG, SUM OVER for Meta's SQL bar.
Advanced SQL: gap-and-island and sessionization - Engagement streak patterns Meta is known for.
Data modeling: ads attribution model - SCD2 advertiser dimension, multi-touch attribution, conversion windows.
System design at FAANG scale - Feed-ranking and ads delivery pipeline patterns for Meta data engineer prep.
FAANG data engineer interview questions - Meta + Amazon + Google + Netflix + Apple with per-company rubrics.
Senior data engineer problems (Meta E5+) - Trade-off articulation, failure-mode naming, mid-round pivots for Meta data engineer L5+.
Meta data engineer mock interview - AI mock with Meta-calibrated rubric and prompt pool.
Streaming system design (Meta ads pipeline) - Streaming patterns for the ads delivery analytics pipeline.

What SQL dialect does Meta use in data engineer interviews?: Presto and Trino predominantly, with some Hive on the data-platform team. The Postgres catalog here is portable for the patterns (window functions, CTEs, aggregation, JOIN). The Presto-specific syntax (UNNEST, MAP_AGG, APPROX_DISTINCT, LATERAL VIEW EXPLODE in Hive) is tagged on the problems where it diverges. Practice in Postgres, mention Presto syntax during the interview.
How does Meta weight communication versus technical correctness in interviews?: Higher than most companies. Meta's data engineer rubric explicitly scores 'thinks out loud', 'asks clarifying questions', and 'articulates trade-offs' as separate dimensions from technical correctness. A data engineer candidate who lands the right answer silently scores lower than one who narrates the wrong approach, corrects mid-round when probed, and articulates why. Practice narrating through every problem.
What is the typical system design scenario at Meta?: Most often the feed-ranking signals pipeline (collect user interaction events at 10B+ per day, compute signals like dwell time and engagement velocity, serve to the ML ranking model with single-digit-millisecond latency) or the ads delivery analytics pipeline (impression-click-conversion attribution with 28-day windows, multi-region replay, idempotent reconciliation). Both 45-60 minutes; both expect 3+ failure modes per component.
How important is gap-and-island at Meta data engineer interviews?: Roughly 30 percent of reported Meta SQL questions involve gap-and-island for engagement streak detection (consecutive days active, longest watch session, days since last login). Master the date minus ROW_NUMBER times INTERVAL 1 day pattern. Multiple variants: HAVING COUNT >= N for minimum streak length, MAX(MAX_island_length) per user for longest streak, COUNT of distinct streaks per user.
Does Meta test PySpark in data engineer interviews?: Less than the Spark-first companies. Meta's internal stack favors Presto and Hive for batch SQL, Spark for ML feature pipelines and some heavy ETL. PySpark questions appear in data engineer loops for data-infra-adjacent teams (the team that builds the Spark platform) and ML data engineering teams. For general data engineer loops at Meta, SQL fluency matters more than Spark.
What is the bar for the Python round at Meta?: Pipeline-shaped Python, more vanilla than at most companies because Meta's stack pushes most transformation into Presto, Hive, and Spark. Python is for orchestration, validation, custom transforms. Common prompts: implement an SCD Type 2 merge in pandas, sessionize events with itertools.groupby, dedup with composite tiebreaker. The bar is correctness plus error handling. Silent failures are the failure mode the Meta rubric calls out.
How long is a Meta data engineer onsite?: 4-5 rounds over 4-5 hours: SQL, Python, data modeling, system design, behavioral. Senior+ loops add a second design round (frequently a 'design the data platform' meta-question). Each round is 45-60 minutes. Lunch interview is usually informal but observed by the recruiter.
What level is the equivalent of L4, L5, L6 at Meta?: Meta calls them E4 (mid), E5 (senior), E6 (staff), E7 (senior staff), E8 (principal). E5 is the typical senior data engineer floor. E6+ rubrics emphasize trade-off articulation, failure-mode naming, and the ability to adapt cleanly when the interviewer changes a requirement mid-round.

173 practice problems matching this filter. Domains: SQL (120), Python (51), Data Modeling (2). Difficulty: medium (68), easy (75), hard (30).

SQL (120)

Proof of Presence - medium - Two-factor sent. How many confirmed?
30-Day Page View Counts - easy - Thirty days of engagement. Quick snapshot.
7-Day Token Retention - medium - Premium tokens, day by day.
Presence vs. Participation - medium - Being in the region and being active are two very different things.
All Infra Regions - easy - The infrastructure spans the globe. Map it.
API Call Distribution Fraction - hard - Not all endpoints are created equal.
Average Event Progression Time - hard - How fast do users move through the funnel?
Average Session Duration - medium - How long do users actually stay?
Average Session Duration by Device - easy - Session length, device by device.
Build Success Rate by Trigger - medium - Which triggers produce green builds?
Busy Authors - medium - Some developers spread their commits everywhere.
Losing Altitude - hard - Two months apart, the same campaigns. Find the ones that slipped.
Cheapest CDN Route - easy - The cheapest path across regions.
Cheapest Cost Per Region - easy - Lowest spend per region.
The Quiet Outlier - hard - Ignore what the traffic does all day. Find the spike that barely showed up.
Clicked Ad Impressions - easy - They saw the ad. They clicked.
Loyalty's Double Tap - medium - When a nudge and a banner team up.
Diminishing Returns - medium - When search comes back thin, does the click still follow?
Clean Exit - easy - Priority one. Only some of them made it to the end.
The Blind Spot - medium - Pages they haven't discovered yet.
Content Session Counts - medium - Session metrics, content item by item.
Cost Share Within Category - medium - Each entry's slice of the category total.
Service Roll Call - easy - The mesh is sprawling. Find out exactly how many services are actually running.
Parallel Traces - medium - Same experiment. Different variants. Who overlaps?
Daily Cross-Platform Users - easy - Mobile and web. Same day, same users?
Daily Error Resolution Ratio - medium - Reported versus removed. The daily ratio.
Daily Net Revenue - hard - Net revenue, day by day. Refunds included.
Daily Session and User Counts - medium - Sessions and users, day by day.
Campaign Click Rate - medium - Among engaged users, which campaigns landed.
Deployments per Environment - medium - Dev, staging, prod. Where do most deploys land?
Disabled-Flag Share by Owner - medium - Which teams ship everything off by default.
Two-Way Street - medium - A conversation reads the same from either side.
Distinct Product Categories - easy - A quick category inventory.
Double Take - medium - Passed QA twice. That's the problem.
Duplicate Training Runs - medium - Same model, trained twice.
Verbose by Design - hard - Some paths say in six segments what others say in two.
Engagement Gap - medium - Zero transactions is still a data point. Count everyone.
Fault Lines - medium - Errors by day and region. Some areas are worse than they appear.
Errors With Service Health - easy - Error data, enriched with health context.
Event Count on Key Days - easy - Key days. Key event volumes.
Events by Month Across Years - easy - Month by month, year by year. The pattern emerges.
Event Types Spanning Multiple Months - easy - Some events span seasons.
Fastest Page View to Click - hard - How fast from view to click?
Feature Flag Adoption - medium - How widely adopted are the flags?
Feature Flag Fan vs Detractor Pairs - hard - Some users love the flag. Others want it gone.
The Ninety-Day Comeback - hard - Everyone shows up once. Who comes back before the quarter ends?
First Contact - easy - Every pipeline has a first run. This is what it brought back.
Full Funnel - hard - Search. Browse. Buy. Only a few do all three.
Health Checks per Service - easy - Some services get checked constantly.
Heavy Hitters - medium - Some repos never sleep.
Heavy Namespaces - medium - Kubernetes has favorites. Some namespaces carry more weight.
High Engagement Pages - hard - Some pages hold attention longer than others.
High-Rated In-Stock Percentage - easy - Highly rated and in stock. A rare combo.
Impressions by Search Keyword - hard - Campaign performance, keyword by keyword.
Inactive Unverified Users - easy - Signed up. Never verified. Never came back.
Largest Group - easy - One group towers above the rest.
Latest Session Per User - easy - Everyone has a most recent session.
Latest Version Per Service - easy - The latest version deployed. Each service.
Low-Volume Stream Topics - medium - Quiet topics in the stream.
Max Value Per Location - easy - Every location has a peak.
Mentorship User Pairs - medium - Pair them up. Mentor and mentee.
Messages From Specific Users - easy - Specific users. What did they say?
Metric Range by Department - medium - Where each team's numbers sit, low to high.
The Floor Price - medium - Before the negotiation, find what each provider really charges at its cheapest.
Most Common Monday Outcome - medium - Mondays have a pattern.
Most Efficient Region by Token Usage - hard - Some regions squeeze more out of every token.
Multi-Variant Experiments - easy - One user, multiple experiments.
Seen or Ignored - medium - A send is only half the story. Find where the taps actually land.
Did Anyone Actually Read It? - easy - A push isn't a win until a thumb taps it.
The Vanishing Rows - easy - Some records disappear when the tables meet. Figure out why.
When They Opened - medium - Push after push goes out. Some months, people actually read them.
The Loudest Neighbor - hard - In every namespace, one pod does most of the eating.
All at Once - hard - How many tokens were alive at the same time?
Pipeline Completion Rate - medium - How far do users get through the flow?
Power Users - medium - Engagement separates tourists from regulars.
The Regulars - medium - Past a certain threshold, casual becomes committed.
The Open Question - medium - Push sent. How many opened?
Q2 Search Volume - easy - Q2 search volume. The numbers.
Once and Only Once - hard - Repeated readings are noise. The value seen a single time is the signal.
Recurring Error Types - easy - The same errors, recurring.
Resolved vs Unresolved Alerts - hard - Resolved versus open. By severity.
Retargeting Campaign Impressions - easy - Retargeting impressions. All of them.
Reviews Per Reviewer - easy - The workload split across reviewers.
Search Algorithm Rating - hard - How good are the search results?
Service Scorecard - hard - Deploys vs. alerts. One row per service tells the whole story.
Services With Multi-Quarter Uptime - hard - Multi-quarter uptime streaks.
Session Count Distribution - hard - How are sessions distributed among the newest users?
Session Overview - medium - Full engagement picture, even for the ones who never showed up.
Shared Category Purchasers - medium - They bought different things from the same aisle.
Shared Endpoints - medium - Shared credentials across endpoints.
Double Vision - easy - Before the records move, the ones wearing the same name twice have to surface.
The Podium Finish - medium - Top two products per category.
Third Largest Batch Job - easy - Bronze medal in the batch job rankings.
Threads Excluding User - easy - Every thread they're not part of.
Top Active Senders per Channel - medium - Top three messages per channel by replies.
Top Category by User Segment - medium - Each segment has a favorite category.
Top Chat Contributors - medium - The ten most active chat users.
Top Identified Event Types - medium - The top users by events, but only the identifiable ones.
Transaction Overview - easy - The executive snapshot. Users, products, revenue.
Deep Pockets - medium - One month, every customer, every dollar accounted for.
The Named Transaction - easy - Transaction IDs are useless without context. Bring in the product names.
Unique Searchers - easy - How many users actually searched?
Who's Looking - easy - Every search is a question someone needed answered. Count the people asking.
Unique Stream Topics - easy - A clean inventory of streaming topics.
User 360 - hard - One row per user. Everything they did, or didn't do.
User Campaign Overlap Percentage - hard - How much ad overlap between users?
Six Degrees - hard - Every reply ties two names together. Find whose web reaches the furthest.
User Devices - medium - Desktop, mobile, tablet. What does each user actually use?
User Engagement Summary - medium - Sessions plus searches. The full engagement picture.
Behavioral Range - easy - Power users don't just visit more. They do more things.
User Sessions on Specific Days - easy - One user. Specific days. What happened?
Users Per Device Type - easy - Users per device. The split.
Users Who Clicked Ads - easy - Ad clickers and their account details.
Users With Purchase Events - easy - At least one purchase. That changes everything.
Verify Commit ID Uniqueness - easy - Duplicate commit IDs. Are there any?
View Count Per Page - easy - Every page has visitors. Some just have more.
Point of Entry - hard - Everyone starts by looking. Count who came back to buy.
Views by Content Type - medium - Count content views broken down by content type
Bookends of the Week - hard - Every week has two edges. Weigh what lands on each.
Word Count Per Message - medium - How wordy are the messages?

Python (51)

Batch Records - medium - Too many at once. Break them into groups.
Batch With Metadata - easy - The list gets chopped.
Column Max - easy - One value rules the column.
Column Range - easy - From minimum to maximum. What is the spread?
All Told - easy - Every shift leaves a number behind. Total the fleet.
Cumulative Sum - medium - The total grows with every row.
Corner to Corner - medium - Not every value sits in a row or column.
Dictionary Key Intersection - medium - Two dictionaries. What do they share?
The Carousel - medium - Every value takes its turn; the wheel keeps spinning.
Even Filter - easy - Only the even ones survive.
Explode List - easy - One row holds many values. Unpack it.
Find Indices - medium - It is in there somewhere. Where exactly?
All the Way Down - easy - Sections nested inside sections. One stream, in order.
Full Outer Zip - medium - Two sides. No value left behind.
Null Counter - easy - How many holes in the data?
Quality Gate - easy - Not everything passes inspection.
Full Circle - medium - Load has to keep moving. Pass it down the line.
Sort Descending - easy - Biggest first. No exceptions.
The Spike - hard - Spot the outliers before they page someone.
Best in Class - medium - Every category keeps its leaders. The rest fall away.
The Squeeze - easy - Long stretches of sameness collapse into almost nothing.
Downstream - hard - Nothing runs until what it needs has run.
The Deep Dive - easy - A specific position in the unsorted pile.
The Dependency Resolver - medium - Everything depends on everything.
The Mirror Index - easy - Every value remembers who pointed to it.
The Event Bucketer - easy - Logs slotted into buckets.
The Forward Fill - easy - Patch the gaps in a noisy sensor stream.
The Gap Filler - easy - Fill the Nones with the last real value.
The Horizon Scanner - medium - For each position, what is coming up ahead?
The IP Validator - easy - Real and fake, mixed together.
The Middle Ground - hard - The middle value keeps moving.
The Numbered Chair - easy - A standing list. Position n holds one entry.
The First of Their Kind - easy - When the same record arrives twice, only the first one survives.
The One-Way Street - easy - Monotonic time-series. Direction only.
The Original Keeper - easy - Clean up duplicate events without losing the timeline.
All the Way Down - medium - The payload nests without end. The warehouse needs one flat row.
The Pipeline Filter - easy - In the door as one thing, out the door as another.
The Repeat Review - medium - The echo came back.
The Resume Sifter - medium - Pull what's useful. Skip what you know.
The Running Total - easy - Each position holds the sum of everything before it.
The Sequel Spotter - easy - Spot the sequels hiding in the catalog.
The Shifting Standard - medium - A benchmark in motion.
The Social Graph - easy - Everyone knows someone.
The Spin Doctor - medium - Ninety degrees, but which way?
The Streak Breaker - easy - It has a problem with repetition.
The Stream Averager - easy - The answer moves with the data.
The Throttle Ceiling - medium - Too many requests in too short a timeframe. Throttle it.
The Word Mismatch - easy - Some text does not match.
Transform Column - easy - Same data, new shape.
Against the Grain - medium - Turn the grid on its side and let every column stand up as a row.
Value Count - easy - How many of each? Count them.

Data Modeling (2)

Content Engagement Data Model - hard - Post published. Now measure everything that happens next.
The Shape of a Run - medium - Two log lines bracket every process. Pair them and the fleet's rhythm appears.