
Pandas groupby for Data Engineers

pandas.DataFrame.groupby splits rows into groups, applies a function, and combines results. It is SQL GROUP BY translated to Python, with more flexibility and more ways to get it subtly wrong.

Try it: groupby with multiple aggregations

import pandas as pd

orders = pd.read_csv("orders.csv")

summary = (
    orders
    .groupby("customer_id")
    .agg(
        total_orders=("amount", "count"),
        revenue=("amount", "sum"),
        avg_order=("amount", "mean"),
        refunds=("status",
                 lambda s: (s == "refunded").sum()),
    )
    .reset_index()
)

Overview

What pandas groupby is and why it matters

pandas groupby is the Python equivalent of SQL GROUP BY, but with three capabilities SQL lacks: named aggregations in a single call, per-group transforms that preserve the original shape, and arbitrary user-defined functions via apply.

Data engineers use groupby for aggregation, deduplication, rolling metrics, and group-level filtering. Interview problems frequently mix groupby with merge, pivot, and time-series resampling.
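The deduplication use case is worth seeing concretely. A minimal sketch, assuming a hypothetical events table where we keep each user's most recent row:

```python
import pandas as pd

# Hypothetical events table: dedupe to each user's latest row
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-02"]),
    "plan": ["free", "pro", "free"],
})

# idxmax returns the row label of each group's maximum timestamp,
# so .loc pulls exactly one row per user_id
latest = events.loc[events.groupby("user_id").ts.idxmax()]
```

The same shape (sort or rank inside groups, then select) covers "latest record per key" questions that otherwise tempt candidates into window-function workarounds.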

Syntax

pandas groupby reference

df.groupby(col)
    Split rows by the distinct values of col. Returns a DataFrameGroupBy object.
.agg(name=(col, fn))
    Named aggregation: produces one output column per entry. The modern preferred syntax.
.transform(fn)
    Applies fn per group but returns an index-aligned Series with the original row count.
.apply(fn)
    Flexible escape hatch. Slower than agg/transform; use only when neither fits.
as_index=False
    Skip promoting the groupby key to the index. Roughly equivalent to calling reset_index() afterward.
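The shape contrast between agg and transform is the part candidates most often mix up. A toy illustration (column names made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 10]})

agg_out = df.groupby("key").x.sum()            # one row per group: a -> 3, b -> 10
tf_out = df.groupby("key").x.transform("sum")  # one row per input row: [3, 3, 10]
```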

Patterns

How pandas groupby shows up in practice

Single-column groupby with multiple aggregations

The bread-and-butter pattern: one key, multiple metrics. Use named aggregation so the output columns are readable.

(orders
    .groupby("customer_id")
    .agg(
        orders=("order_id", "count"),
        revenue=("amount", "sum"),
        avg=("amount", "mean"),
    )
    .reset_index())

Interview note: Named aggregation (added in pandas 0.25) produces flat column names. Avoid the older dict-of-lists syntax; it creates a MultiIndex that trips up downstream code.

Multi-key groupby

Pass a list of columns to group by multiple keys. Each unique combination becomes a group.

revenue_by_region_tier = (
    orders
    .groupby(["region", "customer_tier"])
    .amount.sum()
    .reset_index(name="revenue")
)

transform for per-group normalization

transform returns a Series aligned with the original DataFrame. Use it to compute group-level metrics and compare each row against its group.

# Percent of customer's total spend for each order
orders["pct_of_customer_total"] = (
    orders.amount
    / orders.groupby("customer_id").amount.transform("sum")
)

Interview note: transform is what senior candidates reach for. Using groupby + merge to achieve the same result signals less fluency.

Conditional aggregation

pandas equivalent of SUM(CASE WHEN ... THEN 1 ELSE 0 END). Use a lambda or boolean mask.

(orders
    .groupby("customer_id")
    .agg(
        total=("amount", "sum"),
        refund_count=("status", lambda s: (s == "refunded").sum()),
        big_order_revenue=("amount", lambda a: a[a > 500].sum()),
    ))

Filtering groups (HAVING equivalent)

groupby().filter(fn) keeps only groups where fn returns True. Equivalent to SQL HAVING on a per-group condition.

# Keep customers with more than 2 completed orders
repeat_buyers = (
    orders[orders.status == "completed"]
    .groupby("customer_id")
    .filter(lambda g: len(g) > 2)
)
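Note that filter calls a Python function once per group, which gets slow when there are many groups. A sketch of an equivalent vectorized version using transform (toy data, hypothetical column names):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "status": ["completed"] * 5,
})
completed = orders[orders.status == "completed"]

# filter-based: one Python call per group
via_filter = completed.groupby("customer_id").filter(lambda g: len(g) > 2)

# transform-based: a vectorized count plus a plain boolean mask
counts = completed.groupby("customer_id").customer_id.transform("count")
via_mask = completed[counts > 2]
```

Both return the same rows; the mask version avoids per-group Python overhead.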

Interview questions

pandas groupby interview questions

Q1. Given a DataFrame of orders, compute each customer's total revenue and the share of revenue they contributed overall.

What they test: Whether you reach for transform (correct) or a second groupby + merge (works but clunky). Senior candidates use transform.

Approach: total = orders.groupby('customer_id').amount.sum(); share = total / total.sum(). Or in one pass: divide orders.amount by the grand total orders.amount.sum() first, then groupby and sum the per-row shares.
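A sketch of that approach on toy numbers:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [100.0, 50.0, 50.0],
})

total = orders.groupby("customer_id").amount.sum()  # 1 -> 150, 2 -> 50
share = total / total.sum()                         # 1 -> 0.75, 2 -> 0.25
```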

Q2. What's the difference between groupby().agg, groupby().transform, and groupby().apply?

What they test: Core API fluency. agg reduces each group to scalars. transform returns a per-row Series with the original index. apply runs an arbitrary function and is the slowest.

Approach: agg: one row per group, one scalar per named aggregation. transform: same shape as the input, each row gets its group's metric. apply: arbitrary return shape; use it only when the other two do not fit.

Q3. Your groupby().agg({'amount': ['sum', 'mean']}) produced a MultiIndex in the columns. How do you flatten it?

What they test: Knowledge that the dict-of-lists syntax produces a MultiIndex. Named aggregation avoids the problem entirely.

Approach: Prefer named aggregation: .agg(total=('amount', 'sum'), avg=('amount', 'mean')). If stuck with the MultiIndex, flatten with df.columns = ['_'.join(c).rstrip('_') for c in df.columns].
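Both halves of that answer can be checked on a toy frame:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 20, 30]})

# dict-of-lists: columns come back as a MultiIndex
wide = orders.groupby("customer_id").agg({"amount": ["sum", "mean"]})

# flatten ('amount', 'sum') -> 'amount_sum'
wide.columns = ["_".join(c).rstrip("_") for c in wide.columns]

# named aggregation produces the flat names directly
flat = orders.groupby("customer_id").agg(
    amount_sum=("amount", "sum"),
    amount_mean=("amount", "mean"),
)
```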

Q4. groupby is slow on a 50-million-row DataFrame. How do you make it faster?

What they test: Production awareness. Categoricals, sort=False, and switching to Polars or DuckDB are the standard answers.

Approach: 1) Cast groupby keys to pd.Categorical. 2) Pass sort=False if output order does not matter. 3) Use built-in aggregations (sum, mean) instead of apply with a lambda. 4) For very large data, move to Polars or DuckDB, which are often 5-20x faster on groupby.
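Steps 1 and 2 look like this in code (toy data; the speedup only shows up at scale):

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["us", "eu", "us", "eu"],
    "amount": [1, 2, 3, 4],
})

# Categorical keys group on integer codes instead of hashing strings
orders["region"] = orders["region"].astype("category")

# sort=False skips the final ordering pass; observed=True keeps only
# category values that actually appear in the data
rev = orders.groupby("region", sort=False, observed=True).amount.sum()
```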

Q5. Explain what groupby(...).size() returns versus groupby(...).count().

What they test: NULL handling. size counts all rows per group. count counts non-NULL values per column.

Approach: size returns a Series with one row per group. count returns a DataFrame with one column per non-grouping column, each showing non-NULL counts. If a column has no NULLs, count and size agree.
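The distinction in one toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "val": [1.0, np.nan, 3.0],
})

sizes = df.groupby("key").size()        # every row counts, NaN or not
counts = df.groupby("key").val.count()  # only non-null values count
```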

Q6. How do you group by a derived value, like the month of a timestamp column?

What they test: Knowing groupby accepts a Series, not just a column name. Also pd.Grouper for resampling.

Approach: Option 1: create the derived column first (df['month'] = df.ts.dt.to_period('M')), then groupby it. Option 2: pass a Series directly: df.groupby(df.ts.dt.to_period('M')). Option 3: use pd.Grouper(key='ts', freq='M') for time-series resampling (pandas 2.2+ spells the monthly frequency 'ME').
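Option 2 in runnable form (toy timestamps):

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "amount": [10, 20, 30],
})

# Group by a derived Series; no helper column is materialized
monthly = df.groupby(df.ts.dt.to_period("M")).amount.sum()
```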

Common mistakes

What breaks in practice

Using apply when agg or transform would work

apply runs Python-level code per group and is 10-100x slower than built-in aggregations. Reach for agg first, transform second, apply last.

Wrong

# 100x slower than necessary
orders.groupby("customer_id").apply(
    lambda g: g.amount.sum()
)

Right

orders.groupby("customer_id").amount.sum()

Returning a MultiIndex from agg

The dict-of-lists syntax produces a MultiIndex in the columns. Named aggregation (introduced in pandas 0.25) produces flat names and is the modern standard.

Wrong

orders.groupby("customer_id").agg({
    "amount": ["sum", "mean", "max"]
})
# columns are ('amount','sum'), ('amount','mean'), ...

Right

orders.groupby("customer_id").agg(
    total=("amount", "sum"),
    avg=("amount", "mean"),
    biggest=("amount", "max"),
)

Forgetting as_index=False or reset_index()

By default, groupby keys become the index. If you pass the result to a function expecting the key as a column (like merge), you get a KeyError.

Wrong

summary = orders.groupby("customer_id")[["amount"]].sum()
summary.merge(customers, on="customer_id")  # KeyError: customer_id is in the index

Right

summary = (
    orders.groupby("customer_id", as_index=False).amount.sum()
)
summary.merge(customers, on="customer_id")

FAQ

Common questions

What does pandas groupby do?
groupby splits a DataFrame into groups based on one or more columns, applies a function to each group, and combines the results. It is the pandas equivalent of SQL GROUP BY.
What is the difference between agg and apply in pandas?
agg reduces each group to one or more scalar values. apply accepts any function and returns whatever that function returns. agg is faster because it uses optimized internals; apply is a flexible escape hatch.
How do I groupby multiple columns in pandas?
Pass a list: df.groupby(['col_a', 'col_b']). Each unique combination of values becomes a group. The result is indexed by a MultiIndex unless you pass as_index=False.
Why is my groupby slow?
The usual causes: using apply with a Python lambda (slow), string groupby keys (cast to Categorical), sort=True (the default, unnecessary if order does not matter), or a DataFrame too large for memory (move to Polars or DuckDB).
Does pandas groupby preserve NULLs?
By default, groupby drops rows with NULL in the groupby column. Pass dropna=False (pandas 1.1+) to include them as a group.

Continue your prep

Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.