Questions reported by candidates in 2024-2026 loops, paraphrased and de-identified. Each answer covers the approach, the gotcha, and the typical follow-up.
SQL · L4
Find products with inventory mismatch across retailers
JOIN product_master to retailer_inventory tables on the canonical_product_id. Group by canonical_product_id, compute mismatch as (MAX(qty) - MIN(qty)) / AVG(qty). Filter where mismatch > threshold. The interviewer's follow-up: why is mismatch expected? Answer: each retailer's inventory is independently managed; some mismatch is always present. The query is for outlier detection, not enforcement.
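A minimal sketch of the query, runnable against stdlib sqlite3. The `retailer_inventory` table and its columns follow the answer above; the threshold of 0.5 and the sample rows are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE retailer_inventory (retailer_id INT, canonical_product_id INT, qty INT);
INSERT INTO retailer_inventory VALUES
  (1, 100, 50), (2, 100, 48), (3, 100, 52),   -- small spread: expected noise
  (1, 200, 10), (2, 200, 90), (3, 200, 20);   -- large spread: outlier
""")

rows = conn.execute("""
SELECT canonical_product_id,
       (MAX(qty) - MIN(qty)) * 1.0 / AVG(qty) AS mismatch
FROM retailer_inventory
GROUP BY canonical_product_id
HAVING (MAX(qty) - MIN(qty)) * 1.0 / AVG(qty) > 0.5
""").fetchall()
print(rows)  # only product 200 survives the filter
```

The `* 1.0` forces float division; without it an integer-typed engine truncates the ratio to 0.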
SQL · L4
Compute shopper acceptance rate per market per hour-of-day
GROUP BY market_id, EXTRACT(hour FROM offered_ts). Acceptance rate = orders_accepted / orders_offered. Volunteer that low-volume hour buckets (e.g., 3am) produce noisy rates; filter to buckets with at least N offers. Discuss the time zone issue: hour-of-day must be in the market's local timezone, not UTC.
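A sqlite3 sketch of the bucketing (SQLite has no EXTRACT, so it uses `strftime('%H', ...)` instead). The `offers` table is hypothetical, and `offered_ts` is assumed already converted to market-local time per the note above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offers (market_id INT, offered_ts TEXT, accepted INT);
INSERT INTO offers VALUES
  (1, '2024-05-01 09:15:00', 1),
  (1, '2024-05-01 09:40:00', 0),
  (1, '2024-05-01 03:05:00', 1);  -- lone 3am offer: noisy bucket
""")

rows = conn.execute("""
SELECT market_id,
       CAST(strftime('%H', offered_ts) AS INT) AS hour_of_day,
       AVG(accepted) AS acceptance_rate,
       COUNT(*) AS n_offers
FROM offers
GROUP BY market_id, hour_of_day
HAVING COUNT(*) >= 2          -- drop low-volume buckets
ORDER BY hour_of_day
""").fetchall()
```

The 3am bucket is dropped by the `HAVING` filter; only the 9am bucket (rate 0.5 over 2 offers) remains.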
SQL · L4
Top 10 retailers by 7-day rolling order volume
Daily aggregate orders per retailer. AVG(orders) OVER (PARTITION BY retailer_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). Rank with DENSE_RANK() by rolling average descending. Volunteer the partial-window edge case: a retailer's first six days average over fewer than seven days of data, which inflates or deflates the early rolling values.
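The window frame can be demonstrated in sqlite3 (window functions require SQLite 3.25+). The `daily_orders` table is hypothetical; three days of data are enough to show the partial-window behavior:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_orders (retailer_id INT, day TEXT, orders INT);
INSERT INTO daily_orders VALUES
  (1, '2024-05-01', 10), (1, '2024-05-02', 20), (1, '2024-05-03', 30);
""")

rows = conn.execute("""
SELECT retailer_id, day,
       AVG(orders) OVER (
         PARTITION BY retailer_id ORDER BY day
         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS rolling_avg
FROM daily_orders
ORDER BY day
""").fetchall()
# rolling_avg: 10.0, 15.0, 20.0 -- early rows average over fewer than 7 days
```

Wrapping this in a CTE and applying `DENSE_RANK() OVER (ORDER BY rolling_avg DESC)` on the latest day gives the top-10 ranking.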
SQL · L5
Identify products with cross-retailer price arbitrage above $5
GROUP BY canonical_product_id, compute MAX(price) - MIN(price) across retailers carrying the same product. Filter > 5 USD. Discuss why this matters: arbitrage opportunities for consumers, signal to merchant ops that pricing is stale. Edge case: pack-size differences make raw price comparison wrong; normalize to per-unit price first.
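A sqlite3 sketch of the pack-size edge case. The `listings` table and its `units` column are hypothetical; product 200 shows why raw price spread misleads (a $6 gap that disappears per-unit):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE listings (retailer_id INT, canonical_product_id INT, price REAL, units REAL);
INSERT INTO listings VALUES
  (1, 100, 3.99, 12), (2, 100, 10.99, 12),  -- same pack size, real spread
  (1, 200, 2.00, 6),  (2, 200, 8.00, 24);   -- $6 raw spread, identical per-unit price
""")

rows = conn.execute("""
SELECT canonical_product_id,
       MAX(price / units) - MIN(price / units) AS unit_spread
FROM listings
GROUP BY canonical_product_id
HAVING MAX(price / units) - MIN(price / units) > 0.4
""").fetchall()
# only product 100 is a genuine per-unit arbitrage
```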
SQL · L5
Find shoppers whose pick accuracy dropped this week vs trailing 4-week avg
Compute weekly pick_accuracy per shopper (correct picks / total picks). Use a window frame (AVG(...) OVER ... ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING) for the trailing 4-week average; LAG alone only fetches a single prior week. Filter where the current week is more than 2 std-dev below the trailing avg. Discuss the bias: new shoppers have noisy histories; restrict to shoppers with at least 100 picks in the trailing window.
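A sqlite3 sketch of the trailing-window comparison. SQLite has no STDDEV aggregate, so this uses a fixed 0.10 drop instead of the 2-sigma rule; the `weekly_accuracy` table and the sample values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE weekly_accuracy (shopper_id INT, week INT, pick_accuracy REAL);
INSERT INTO weekly_accuracy VALUES
  (1, 1, 0.96), (1, 2, 0.95), (1, 3, 0.97), (1, 4, 0.96), (1, 5, 0.80);
""")

rows = conn.execute("""
WITH windowed AS (
  SELECT shopper_id, week, pick_accuracy,
         AVG(pick_accuracy) OVER (
           PARTITION BY shopper_id ORDER BY week
           ROWS BETWEEN 4 PRECEDING AND 1 PRECEDING
         ) AS trailing_avg
  FROM weekly_accuracy
)
SELECT shopper_id, week, pick_accuracy, trailing_avg
FROM windowed
WHERE trailing_avg - pick_accuracy > 0.10
""").fetchall()
# week 5 is flagged: 0.80 vs a 0.96 trailing average
```

Week 1's frame is empty, so `trailing_avg` is NULL and the WHERE clause silently skips it, which is the behavior you want for brand-new shoppers.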
Python · L4
Fuzzy match product names across retailers (entity resolution)
Tokenize, normalize (lowercase, strip punctuation, expand brand abbreviations). Use rapidfuzz token_set_ratio for fuzzy matching. Threshold-based match (typically 85+). Discuss false positive vs negative trade-off: a higher threshold misses real matches (Coca-Cola vs Coke), a lower threshold creates false matches (Pepsi vs Pepto). Senior signal: combine fuzzy match with brand and category constraints to reduce false positives.
from rapidfuzz import fuzz
import re
ABBREVIATIONS = {
"oz": "ounce",
"lb": "pound",
"ct": "count",
"pk": "pack",
}
def normalize(name: str) -> str:
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)
    tokens = name.split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(tokens)

def match_score(a: str, b: str) -> int:
    return fuzz.token_set_ratio(normalize(a), normalize(b))

# Usage
score = match_score("Coca-Cola 12 oz can", "Coca Cola Soda 12oz Can")
# 95+ -> match
Python · L4
Parse retailer XML feed with variable schema
Use lxml or xml.etree.ElementTree. Wrap each record in try/except, route malformed records to a dead-letter list with the parse error. Schema-on-read approach is right because retailer schemas drift quarterly. Senior signal: emit a quality metric (% of records parsed successfully per retailer per day), alert on drops.
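A minimal sketch of the dead-letter pattern using stdlib `xml.etree.ElementTree`. The record fields (`sku`, `name`) and the feed shape are hypothetical; the point is the try/except routing and the per-feed parse-rate metric:

```python
import xml.etree.ElementTree as ET

def parse_feed(records: list[str]) -> tuple[list[dict], list[dict]]:
    """Parse each XML record; route failures to a dead-letter list with the error."""
    parsed, dead_letter = [], []
    for raw in records:
        try:
            root = ET.fromstring(raw)
            sku = root.findtext("sku")
            if sku is None:
                raise ValueError("missing <sku>")
            parsed.append({"sku": sku, "name": root.findtext("name")})
        except (ET.ParseError, ValueError) as exc:
            dead_letter.append({"raw": raw, "error": str(exc)})
    return parsed, dead_letter

feed = [
    "<item><sku>A1</sku><name>Milk</name></item>",
    "<item><sku>A2</sku>",               # truncated: malformed XML
]
ok, dlq = parse_feed(feed)
parse_rate = len(ok) / len(feed)        # the per-retailer quality metric to alert on
```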
Python · L5
Implement pick-batch optimizer for shopper route planning
Given an order with N items in a store, compute the optimal pick order to minimize travel time. Greedy nearest-neighbor is the right baseline. Mention that the full TSP is NP-hard; greedy gets within 25% of optimal in practice. Discuss the data inputs: store layout (aisle coordinates per product), shopper start position, item list. Senior signal: blend greedy with hard constraints (refrigerated items picked last to preserve cold chain).
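A sketch of the greedy nearest-neighbor baseline. The coordinates and item names are hypothetical; hard constraints like refrigerated-last can be layered on by partitioning items into cold/ambient groups and running this per group:

```python
import math

def greedy_route(start, items):
    """Nearest-neighbor pick order: repeatedly walk to the closest unpicked item."""
    route, pos, remaining = [], start, dict(items)
    while remaining:
        name = min(remaining, key=lambda n: math.dist(pos, remaining[n]))
        pos = remaining.pop(name)
        route.append(name)
    return route

# Hypothetical aisle coordinates (x = aisle, y = position within aisle).
items = {"milk": (10, 2), "bread": (1, 1), "eggs": (10, 3)}
order = greedy_route((0, 0), items)   # bread first, then the two aisle-10 items
```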
Python · L5
Detect duplicate products in the catalog using ML embeddings
Compute embeddings (sentence-transformer or fastText) of product name + brand + size. Use approximate nearest neighbor (FAISS or HNSW) to find pairs with cosine similarity > 0.95. Manual review or automated merge based on threshold. Discuss why pure fuzzy matching fails at scale: paraphrases (Coke vs Coca-Cola) need semantic similarity, not character similarity.
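A toy sketch of the similarity-threshold step with hand-made 3-d vectors standing in for real embeddings. Production would use FAISS or HNSW for the neighbor search; this brute-force pair scan only illustrates the thresholding:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def candidate_duplicates(embeddings, threshold=0.95):
    """Brute-force O(n^2) pair scan; swap in an ANN index at catalog scale."""
    names = list(embeddings)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) > threshold:
                pairs.append((a, b))
    return pairs

# Hypothetical 3-d embeddings standing in for sentence-transformer vectors.
emb = {
    "Coca-Cola 12oz": (0.9, 0.1, 0.0),
    "Coke Can 12 oz": (0.89, 0.12, 0.01),
    "Pepto Bismol":   (0.1, 0.2, 0.9),
}
dups = candidate_duplicates(emb)   # only the two cola listings pair up
```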
System Design · L5
Design the catalog ingestion pipeline for 1,400 retailers
Per-retailer scheduled pull (cron or webhook) -> Pub/Sub raw_catalog topic -> Dataflow normalization (UPC, name, attributes, prices) -> entity resolution against product_master via embedding similarity + UPC exact match -> BigQuery catalog_fact. Cover: schema variability per retailer (custom parsers per vendor), partial-update handling (delta or snapshot), cold-start for new retailers (initial backfill of 100K+ products before going live), alerting on per-retailer parse error rate spikes.
System Design · L5
Design the real-time inventory probability service
Per-store-per-product probability the item is in stock when shopper arrives. Features: order rate, time since last successful pick, restock schedule, day-of-week, hour, weather. Real-time inference via Redis-backed ML model (LightGBM serialized to ONNX, served via Triton or in-process). Hourly batch retrain on historical pick outcomes. Discuss feature staleness budget: order rate updates every 5 minutes, restock schedule daily. Senior signal: A/B test new model versions on 1% traffic, monitor pick success rate as business metric.
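The staleness budget can be enforced with a simple check at inference time. The budgets here come from the answer above (order rate every 5 minutes, restock schedule daily); the function and feature names are illustrative:

```python
# Hypothetical per-feature staleness budgets, in seconds.
STALENESS_BUDGET = {"order_rate": 5 * 60, "restock_schedule": 24 * 3600}

def stale_features(feature_ts: dict, now: float) -> list[str]:
    """Return features whose last update exceeds their staleness budget."""
    return [f for f, ts in feature_ts.items()
            if now - ts > STALENESS_BUDGET[f]]

now = 1_000_000.0
feature_ts = {"order_rate": now - 600, "restock_schedule": now - 3600}
flagged = stale_features(feature_ts, now)  # order_rate: 10 min old vs a 5 min budget
```

On a stale feature you can fall back to a default value or skip serving the model, but the decision should be explicit rather than silently using old data.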
System Design · L5
Design the search ranking feature pipeline
User query + product catalog + click history + order history -> features -> learning-to-rank model -> ranked results. Online features (current cart, recent clicks) computed at request time. Offline features (lifetime product affinity, historical CTR) precomputed daily and cached in Redis. Cover: training data leakage prevention via point-in-time features (use feature_ts <= query_ts), A/B test instrumentation (assign variant at request time, log exposure, daily aggregation).
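The point-in-time lookup that prevents leakage can be sketched as a bisect over a timestamp-sorted history. The CTR snapshot data is hypothetical; the invariant is the one from the answer, `feature_ts <= query_ts`:

```python
from bisect import bisect_right

def point_in_time_value(history, query_ts):
    """Return the latest feature value with feature_ts <= query_ts (as-of lookup)."""
    timestamps = [ts for ts, _ in history]
    i = bisect_right(timestamps, query_ts)
    return history[i - 1][1] if i else None

# Hypothetical daily CTR snapshots for one product, sorted by feature_ts (day number).
ctr_history = [(1, 0.02), (2, 0.05), (3, 0.40)]  # day 3's spike is "the future"
point_in_time_value(ctr_history, 2)   # returns day 2's value, never leaks day 3
```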
System Design · L5
Design the shopper-recommended-substitution pipeline
When a product is out of stock, what should the shopper recommend? Features: product similarity (embedding-based), customer purchase history, substitute-acceptance rate. Real-time inference via Redis-backed model. Cover the feedback loop: shopper sends substitute photo, customer accepts or rejects, decision logged for model retraining. Senior signal: discuss the cold-start problem for new customers and new products.
Modeling · L5
Design schema for multi-retailer product catalog
Three core tables. product_master: canonical product (canonical_product_id PK, canonical_name, brand, category_id, upc). product_retailer_link: per-retailer SKU mapping (retailer_id, retailer_sku, canonical_product_id, price, in_stock, image_url, last_synced_ts). dim_category: conformed across retailers. Discuss: when do you create a new canonical product vs link to existing? Answer: UPC exact match always links; embedding similarity above 0.95 links with manual review; below 0.95 creates new canonical with a low-confidence flag for human review.
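The three tables can be sketched as DDL (shown here in sqlite3; column types are illustrative and BigQuery/Postgres types would differ):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);

CREATE TABLE product_master (
  canonical_product_id INTEGER PRIMARY KEY,
  canonical_name TEXT NOT NULL,
  brand TEXT,
  category_id INTEGER REFERENCES dim_category(category_id),
  upc TEXT UNIQUE                 -- exact-match key for entity resolution
);

CREATE TABLE product_retailer_link (
  retailer_id INTEGER,
  retailer_sku TEXT,
  canonical_product_id INTEGER REFERENCES product_master(canonical_product_id),
  price REAL,
  in_stock INTEGER,
  image_url TEXT,
  last_synced_ts TEXT,
  PRIMARY KEY (retailer_id, retailer_sku)   -- one row per retailer SKU
);
""")

tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
```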
Behavioral · L5
Tell me about a project where you cut scope to hit a deadline
Instacart culture rewards shipping pragmatically. Story should cover: the original scope, why the deadline mattered, the specific cuts you made, why each was acceptable (cost vs benefit reasoning), what the post-launch plan was for the cut features. Decision postmortem is the L5 signal. Stories about ship vs. don't-ship trade-offs land better than stories about engineering excellence in a vacuum.