Pandas groupby for Data Engineers
pandas.DataFrame.groupby splits rows into groups, applies a function, and combines results. It is SQL GROUP BY translated to Python, with more flexibility and more ways to get it subtly wrong.
Overview
What pandas groupby is and why it matters
pandas groupby is the Python equivalent of SQL GROUP BY, but with three capabilities SQL lacks: named aggregations in a single call, per-group transforms that preserve the original shape, and arbitrary user-defined functions via apply.
Data engineers use groupby for aggregation, deduplication, rolling metrics, and group-level filtering. Interview problems frequently mix groupby with merge, pivot, and time-series resampling.
Syntax
pandas groupby reference
| Token | Meaning |
|---|---|
| df.groupby(col) | Split rows by the distinct values of col. Returns a DataFrameGroupBy object. |
| .agg(name=(col, fn)) | Named aggregation: produces one output column per entry. The modern preferred syntax. |
| .transform(fn) | Applies fn per group but returns an index-aligned Series with the original row count. |
| .apply(fn) | Flexible escape hatch. Slower than agg/transform; use only when neither fits. |
| as_index=False | Skip promoting the groupby key to the index. Equivalent to calling reset_index() afterward. |
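A minimal sketch exercising the tokens above, using a toy DataFrame with hypothetical column names:

```python
import pandas as pd

# Toy dataset (column names are illustrative assumptions).
df = pd.DataFrame({
    "team": ["a", "a", "b", "b", "b"],
    "score": [1, 2, 3, 4, 5],
})

# .agg with named aggregation: one flat output column per entry.
summary = df.groupby("team", as_index=False).agg(
    total=("score", "sum"),
    avg=("score", "mean"),
)

# .transform: index-aligned result with the original row count.
df["team_total"] = df.groupby("team")["score"].transform("sum")
```

With `as_index=False`, `team` stays a regular column instead of becoming the index.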
Patterns
How pandas groupby shows up in practice
Single-column groupby with multiple aggregations
The bread-and-butter pattern: one key, multiple metrics. Use named aggregation so the output columns are readable.
Interview note: Named aggregation (added in pandas 0.25) produces flat column names. Avoid the older dict-of-lists syntax; it creates a MultiIndex that trips up downstream code.
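A sketch of the pattern on an assumed orders table (column names are illustrative):

```python
import pandas as pd

# Hypothetical orders data.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [10.0, 20.0, 5.0, 5.0, 40.0],
})

# One key, multiple metrics, readable flat column names.
stats = orders.groupby("customer_id").agg(
    total_revenue=("amount", "sum"),
    avg_order=("amount", "mean"),
    n_orders=("amount", "size"),
)
```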
Multi-key groupby
Pass a list of columns to group by multiple keys. Each unique combination becomes a group.
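For example, on an assumed sales table with two keys:

```python
import pandas as pd

# Hypothetical sales data.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "product": ["x", "y", "x", "x"],
    "units": [1, 2, 3, 4],
})

# Each unique (region, product) pair becomes one group.
by_pair = sales.groupby(["region", "product"], as_index=False)["units"].sum()
```

Without `as_index=False`, the result would carry a two-level MultiIndex of (region, product).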
transform for per-group normalization
transform returns a Series aligned with the original DataFrame. Use it to compute group-level metrics and compare each row against its group.
Interview note: transform is what senior candidates reach for. Using groupby + merge to achieve the same result signals less fluency.
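A minimal sketch of per-group normalization (toy data, assumed column names):

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 30.0],
})

# transform returns a Series aligned to df's index, so no merge is needed.
group_mean = df.groupby("group")["value"].transform("mean")
df["pct_of_group_mean"] = df["value"] / group_mean
```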
Conditional aggregation
pandas equivalent of SUM(CASE WHEN ... THEN 1 ELSE 0 END). Use a lambda or boolean mask.
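One way to sketch it, counting failures per service on hypothetical event data. A boolean mask sums as 0/1, so no lambda is needed:

```python
import pandas as pd

# Hypothetical events table.
events = pd.DataFrame({
    "service": ["api", "api", "db", "db", "db"],
    "status": ["ok", "failed", "ok", "failed", "failed"],
})

# SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) per service.
failures = (
    events.assign(is_failed=events["status"].eq("failed"))
    .groupby("service")["is_failed"]
    .sum()
)
```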
Filtering groups (HAVING equivalent)
groupby().filter(fn) keeps only groups where fn returns True. Equivalent to SQL HAVING on a per-group condition.
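For example, keeping only groups with at least two rows (the SQL equivalent of HAVING COUNT(*) >= 2), on toy data:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "v": [1, 2, 3]})

# filter returns the original rows of every group where the predicate is True.
frequent = df.groupby("key").filter(lambda g: len(g) >= 2)
```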
Interview questions
pandas groupby interview questions
Q1. Given a DataFrame of orders, compute each customer's total revenue and the share of revenue they contributed overall.
What they test: Whether you reach for transform (correct) or a second groupby + merge (works but clunky). Senior candidates use transform.
Approach: total = orders.groupby('customer_id').amount.sum(), then share = total / total.sum() for a per-customer result. For a row-aligned result, use transform: orders.groupby('customer_id').amount.transform('sum') / orders.amount.sum().
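A sketch of both variants on assumed orders data:

```python
import pandas as pd

# Hypothetical orders data.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [30.0, 20.0, 50.0],
})

# Per-customer result: one row per customer.
total = orders.groupby("customer_id")["amount"].sum()
share = total / total.sum()

# Row-aligned variant with transform: every order row gets its
# customer's total and share, no merge required.
orders["customer_total"] = orders.groupby("customer_id")["amount"].transform("sum")
orders["customer_share"] = orders["customer_total"] / orders["amount"].sum()
```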
Q2. What's the difference between groupby().agg, groupby().transform, and groupby().apply?
What they test: Core API fluency. agg reduces each group to scalars. transform returns a per-row Series with the original index. apply runs an arbitrary function and is the slowest.
Approach: agg: one row per group, one scalar per named aggregation. transform: same shape as input, each row gets its group's metric. apply: arbitrary return shape, use only when the other two do not fit.
Q3. Your groupby().agg({'amount': ['sum', 'mean']}) produced a MultiIndex in the columns. How do you flatten it?
What they test: Knowledge that the dict-of-lists syntax produces a MultiIndex. Named aggregation avoids the problem entirely.
Approach: Prefer named aggregation: .agg(total=('amount', 'sum'), avg=('amount', 'mean')). If stuck with the MultiIndex, flatten with df.columns = ['_'.join(c).rstrip('_') for c in df.columns].
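Both paths, sketched on toy data:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "amount": [1, 2, 3]})

# dict-of-lists: produces MultiIndex columns like ('amount', 'sum')...
mi = df.groupby("key").agg({"amount": ["sum", "mean"]})
# ...which can be flattened by joining the levels.
mi.columns = ["_".join(c).rstrip("_") for c in mi.columns]

# Named aggregation: flat column names from the start.
flat = df.groupby("key").agg(total=("amount", "sum"), avg=("amount", "mean"))
```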
Q4. groupby is slow on a 50-million-row DataFrame. How do you make it faster?
What they test: Production awareness. Categoricals, sort=False, and switching to Polars or DuckDB are the standard answers.
Approach: 1) Cast groupby keys to pd.Categorical. 2) Pass sort=False if output order does not matter. 3) Use built-in aggregations (sum, mean) instead of apply with a lambda. 4) For very large data, move to Polars or DuckDB, which are typically 5-20x faster on groupby.
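A sketch of levers 1 and 2 on a toy frame (the gains only show on large data):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["b", "a", "b", "a"],
    "v": [1, 2, 3, 4],
})

# 1) Categorical keys group faster than arbitrary strings.
df["key"] = df["key"].astype("category")

# 2) sort=False skips sorting the group keys in the output;
#    observed=True skips empty categories for categorical keys.
out = df.groupby("key", sort=False, observed=True)["v"].sum()
```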
Q5. Explain what groupby(...).size() returns versus groupby(...).count().
What they test: NULL handling. size counts all rows per group. count counts non-NULL values per column.
Approach: size returns a Series with one row per group. count returns a DataFrame with one column per non-grouping column, each showing non-NULL counts. If the column has no NULLs, count and size agree.
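The difference in two lines, on toy data with one missing value:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "v": [1.0, None, 2.0],
})

sizes = df.groupby("key").size()          # all rows per group
counts = df.groupby("key")["v"].count()   # non-null values only
```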
Q6. How do you group by a derived value, like the month of a timestamp column?
What they test: Knowing groupby accepts a Series, not just a column name. Also pd.Grouper for resampling.
Approach: Option 1: create the derived column first (df['month'] = df.ts.dt.to_period('M')) then groupby it. Option 2: pass a Series directly: df.groupby(df.ts.dt.to_period('M')). Option 3: use pd.Grouper(key='ts', freq='M') for time-series resampling.
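Option 2, sketched on assumed timestamped data:

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "amount": [10, 20, 30],
})

# Pass the derived Series directly; no intermediate column needed.
monthly = df.groupby(df["ts"].dt.to_period("M"))["amount"].sum()
```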
Common mistakes
What breaks in practice
Using apply when agg or transform would work
apply runs Python-level code per group and is 10-100x slower than built-in aggregations. Reach for agg first, transform second, apply last.
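A hedged sketch of the wrong and right versions, on toy data:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "v": [1, 2, 3]})

# Wrong: Python-level apply for a plain sum (slow on real data).
slow = df.groupby("key")["v"].apply(lambda s: s.sum())

# Right: the built-in aggregation, which runs in optimized internals.
fast = df.groupby("key")["v"].sum()
```

Both return the same values; the difference is speed, not correctness.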
Returning a MultiIndex from agg
The dict-of-lists syntax produces a MultiIndex in the columns. Named aggregation (introduced in pandas 0.25) produces flat names and is the modern standard.
Forgetting as_index=False or reset_index()
By default, groupby keys become the index. If you pass the result to a function expecting the key as a column (like merge), you get a KeyError.
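A sketch of the failure mode and the fix, on assumed orders and customers tables:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 20, 30]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["ann", "bo"]})

# Wrong: customer_id lands in the index, not the columns, so a later
# merge(on='customer_id') on this result fails.
totals_indexed = orders.groupby("customer_id")["amount"].sum()

# Right: keep the key as a regular column.
totals = orders.groupby("customer_id", as_index=False)["amount"].sum()
joined = totals.merge(customers, on="customer_id")
```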
FAQ
Common questions
- What does pandas groupby do?
- groupby splits a DataFrame into groups based on one or more columns, applies a function to each group, and combines the results. It is the pandas equivalent of SQL GROUP BY.
- What is the difference between agg and apply in pandas?
- agg reduces each group to one or more scalar values. apply accepts any function and returns whatever that function returns. agg is faster because it uses optimized internals; apply is a flexible escape hatch.
- How do I groupby multiple columns in pandas?
- Pass a list: df.groupby(['col_a', 'col_b']). Each unique combination of values becomes a group. The result is indexed by a MultiIndex unless you pass as_index=False.
- Why is my groupby slow?
- The usual causes: using apply with a Python lambda (slow), string groupby keys (cast to Categorical), sort=True (the default, unnecessary if order does not matter), or a DataFrame too large for memory (move to Polars or DuckDB).
- Does pandas groupby preserve NULLs?
- By default, groupby drops rows with NULL in the groupby column. Pass dropna=False (pandas 1.1+) to include them as a group.
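The NULL-handling difference from the last FAQ item, sketched on toy data:

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", None, "a"], "v": [1, 2, 3]})

dropped = df.groupby("key")["v"].sum()                # NULL-key rows vanish
kept = df.groupby("key", dropna=False)["v"].sum()     # NULL becomes its own group
```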
Continue your prep
Data Engineer Interview Prep: explore the full guide
50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
Interview Rounds
By Company
- Stripe Data Engineer Interview
- Airbnb Data Engineer Interview
- Uber Data Engineer Interview
- Netflix Data Engineer Interview
- Databricks Data Engineer Interview
- Snowflake Data Engineer Interview
- Lyft Data Engineer Interview
- DoorDash Data Engineer Interview
- Instacart Data Engineer Interview
- Robinhood Data Engineer Interview
- Pinterest Data Engineer Interview
- Twitter/X Data Engineer Interview
By Role
- Senior Data Engineer Interview
- Staff Data Engineer Interview
- Principal Data Engineer Interview
- Junior Data Engineer Interview
- Entry-Level Data Engineer Interview
- Analytics Engineer Interview
- ML Data Engineer Interview
- Streaming Data Engineer Interview
- GCP Data Engineer Interview
- AWS Data Engineer Interview
- Azure Data Engineer Interview