
Pandas groupby for Data Engineers

pandas.DataFrame.groupby splits rows into groups, applies a function, and combines results. It is SQL GROUP BY translated to Python, with more flexibility and more ways to get it subtly wrong.

Try it: groupby with multiple aggregations

import pandas as pd

orders = pd.read_csv("orders.csv")

summary = (
    orders
    .groupby("customer_id")
    .agg(
        total_orders=("amount", "count"),
        revenue=("amount", "sum"),
        avg_order=("amount", "mean"),
        refunds=("status",
                 lambda s: (s == "refunded").sum()),
    )
    .reset_index()
)

Overview

What pandas groupby is and why it matters

pandas groupby is the Python equivalent of SQL GROUP BY, but with three capabilities SQL lacks: named aggregations in a single call, per-group transforms that preserve the original shape, and arbitrary user-defined functions via apply.

Data engineers use groupby for aggregation, deduplication, rolling metrics, and group-level filtering. Interview problems frequently mix groupby with merge, pivot, and time-series resampling.
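The deduplication use case is worth seeing concretely. A minimal sketch, assuming a hypothetical events table where we keep each user's most recent row:

```python
import pandas as pd

# Hypothetical events table: dedupe to each user's latest row
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-02"]),
    "plan": ["free", "pro", "free"],
})

# idxmax returns the row label of each group's maximum timestamp,
# so .loc pulls exactly one row per user_id
latest = events.loc[events.groupby("user_id").ts.idxmax()]
```

The same shape (sort or rank inside groups, then select) covers "latest record per key" questions that otherwise tempt candidates into window-function workarounds.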

Syntax

pandas groupby reference

df.groupby(col)
    Split rows by the distinct values of col. Returns a DataFrameGroupBy object.
.agg(name=(col, fn))
    Named aggregation: produces one output column per entry. The modern preferred syntax.
.transform(fn)
    Applies fn per group but returns an index-aligned Series with the original row count.
.apply(fn)
    Flexible escape hatch. Slower than agg/transform; use only when neither fits.
as_index=False
    Skip promoting the groupby key to the index. Roughly equivalent to calling reset_index() afterward.
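The shape contrast between agg and transform is the part candidates most often mix up. A toy illustration (column names made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "x": [1, 2, 10]})

agg_out = df.groupby("key").x.sum()            # one row per group: a -> 3, b -> 10
tf_out = df.groupby("key").x.transform("sum")  # one row per input row: [3, 3, 10]
```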

Patterns

How pandas groupby shows up in practice

Single-column groupby with multiple aggregations

The bread-and-butter pattern: one key, multiple metrics. Use named aggregation so the output columns are readable.

(orders
    .groupby("customer_id")
    .agg(
        orders=("order_id", "count"),
        revenue=("amount", "sum"),
        avg=("amount", "mean"),
    )
    .reset_index())

Interview note: Named aggregation (added in pandas 0.25) produces flat column names. Avoid the older dict-of-lists syntax; it creates a MultiIndex that trips up downstream code.

Multi-key groupby

Pass a list of columns to group by multiple keys. Each unique combination becomes a group.

revenue_by_region_tier = (
    orders
    .groupby(["region", "customer_tier"])
    .amount.sum()
    .reset_index(name="revenue")
)

transform for per-group normalization

transform returns a Series aligned with the original DataFrame. Use it to compute group-level metrics and compare each row against its group.

# Percent of customer's total spend for each order
orders["pct_of_customer_total"] = (
    orders.amount
    / orders.groupby("customer_id").amount.transform("sum")
)

Interview note: transform is what senior candidates reach for. Using groupby + merge to achieve the same result signals less fluency.

Conditional aggregation

pandas equivalent of SUM(CASE WHEN ... THEN 1 ELSE 0 END). Use a lambda or boolean mask.

(orders
    .groupby("customer_id")
    .agg(
        total=("amount", "sum"),
        refund_count=("status", lambda s: (s == "refunded").sum()),
        big_order_revenue=("amount", lambda a: a[a > 500].sum()),
    ))

Filtering groups (HAVING equivalent)

groupby().filter(fn) keeps only groups where fn returns True. Equivalent to SQL HAVING on a per-group condition.

# Keep customers with more than 2 completed orders
repeat_buyers = (
    orders[orders.status == "completed"]
    .groupby("customer_id")
    .filter(lambda g: len(g) > 2)
)
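Note that filter calls a Python function once per group, which gets slow when there are many groups. A sketch of an equivalent vectorized version using transform (toy data, hypothetical column names):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "status": ["completed"] * 5,
})
completed = orders[orders.status == "completed"]

# filter-based: one Python call per group
via_filter = completed.groupby("customer_id").filter(lambda g: len(g) > 2)

# transform-based: a vectorized count plus a plain boolean mask
counts = completed.groupby("customer_id").customer_id.transform("count")
via_mask = completed[counts > 2]
```

Both return the same rows; the mask version avoids per-group Python overhead.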

Interview questions

pandas groupby interview questions

Q1. Given a DataFrame of orders, compute each customer's total revenue and the share of revenue they contributed overall.

What they test: Whether you reach for transform (correct) or a second groupby + merge (works but clunky). Senior candidates use transform.

Approach: total = orders.groupby('customer_id').amount.sum(); share = total / total.sum(). Or in one pass: divide orders.amount by the grand total orders.amount.sum() first, then groupby and sum the per-row shares.
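A sketch of that approach on toy numbers:

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [100.0, 50.0, 50.0],
})

total = orders.groupby("customer_id").amount.sum()  # 1 -> 150, 2 -> 50
share = total / total.sum()                         # 1 -> 0.75, 2 -> 0.25
```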

Q2. What's the difference between groupby().agg, groupby().transform, and groupby().apply?

What they test: Core API fluency. agg reduces each group to scalars. transform returns a per-row Series with the original index. apply runs an arbitrary function and is the slowest.

Approach: agg: one row per group, one scalar per named aggregation. transform: same shape as the input, each row gets its group's metric. apply: arbitrary return shape; use it only when the other two do not fit.

Q3. Your groupby().agg({'amount': ['sum', 'mean']}) produced a MultiIndex in the columns. How do you flatten it?

What they test: Knowledge that the dict-of-lists syntax produces a MultiIndex. Named aggregation avoids the problem entirely.

Approach: Prefer named aggregation: .agg(total=('amount', 'sum'), avg=('amount', 'mean')). If stuck with the MultiIndex, flatten with df.columns = ['_'.join(c).rstrip('_') for c in df.columns].
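Both halves of that answer can be checked on a toy frame:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 20, 30]})

# dict-of-lists: columns come back as a MultiIndex
wide = orders.groupby("customer_id").agg({"amount": ["sum", "mean"]})

# flatten ('amount', 'sum') -> 'amount_sum'
wide.columns = ["_".join(c).rstrip("_") for c in wide.columns]

# named aggregation produces the flat names directly
flat = orders.groupby("customer_id").agg(
    amount_sum=("amount", "sum"),
    amount_mean=("amount", "mean"),
)
```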

Q4. groupby is slow on a 50-million-row DataFrame. How do you make it faster?

What they test: Production awareness. Categoricals, sort=False, and switching to Polars or DuckDB are the standard answers.

Approach: 1) Cast groupby keys to pd.Categorical. 2) Pass sort=False if output order does not matter. 3) Use built-in aggregations (sum, mean) instead of apply with a lambda. 4) For very large data, move to Polars or DuckDB, which are often 5-20x faster on groupby.
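Steps 1 and 2 look like this in code (toy data; the speedup only shows up at scale):

```python
import pandas as pd

orders = pd.DataFrame({
    "region": ["us", "eu", "us", "eu"],
    "amount": [1, 2, 3, 4],
})

# Categorical keys group on integer codes instead of hashing strings
orders["region"] = orders["region"].astype("category")

# sort=False skips the final ordering pass; observed=True keeps only
# category values that actually appear in the data
rev = orders.groupby("region", sort=False, observed=True).amount.sum()
```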

Q5. Explain what groupby(...).size() returns versus groupby(...).count().

What they test: NULL handling. size counts all rows per group. count counts non-NULL values per column.

Approach: size returns a Series with one row per group. count returns a DataFrame with one column per non-grouping column, each showing non-NULL counts. If a column has no NULLs, count and size agree.
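The distinction in one toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "val": [1.0, np.nan, 3.0],
})

sizes = df.groupby("key").size()        # every row counts, NaN or not
counts = df.groupby("key").val.count()  # only non-null values count
```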

Q6. How do you group by a derived value, like the month of a timestamp column?

What they test: Knowing groupby accepts a Series, not just a column name. Also pd.Grouper for resampling.

Approach: Option 1: create the derived column first (df['month'] = df.ts.dt.to_period('M')), then groupby it. Option 2: pass a Series directly: df.groupby(df.ts.dt.to_period('M')). Option 3: use pd.Grouper(key='ts', freq='M') for time-series resampling (pandas 2.2+ spells the monthly frequency 'ME').
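Option 2 in runnable form (toy timestamps):

```python
import pandas as pd

df = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "amount": [10, 20, 30],
})

# Group by a derived Series; no helper column is materialized
monthly = df.groupby(df.ts.dt.to_period("M")).amount.sum()
```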

Common mistakes

What breaks in practice

Using apply when agg or transform would work

apply runs Python-level code per group and is 10-100x slower than built-in aggregations. Reach for agg first, transform second, apply last.

Wrong

# 100x slower than necessary
orders.groupby("customer_id").apply(
    lambda g: g.amount.sum()
)

Right

orders.groupby("customer_id").amount.sum()

Returning a MultiIndex from agg

The dict-of-lists syntax produces a MultiIndex in the columns. Named aggregation (introduced in pandas 0.25) produces flat names and is the modern standard.

Wrong

orders.groupby("customer_id").agg({
    "amount": ["sum", "mean", "max"]
})
# columns are ('amount','sum'), ('amount','mean'), ...

Right

orders.groupby("customer_id").agg(
    total=("amount", "sum"),
    avg=("amount", "mean"),
    biggest=("amount", "max"),
)

Forgetting as_index=False or reset_index()

By default, groupby keys become the index. If you pass the result to a function expecting the key as a column (like merge), you get a KeyError.

Wrong

summary = orders.groupby("customer_id")[["amount"]].sum()
summary.merge(customers, on="customer_id")  # KeyError: customer_id is in the index

Right

summary = (
    orders.groupby("customer_id", as_index=False).amount.sum()
)
summary.merge(customers, on="customer_id")

FAQ

Common questions

What does pandas groupby do?
groupby splits a DataFrame into groups based on one or more columns, applies a function to each group, and combines the results. It is the pandas equivalent of SQL GROUP BY.
What is the difference between agg and apply in pandas?
agg reduces each group to one or more scalar values. apply accepts any function and returns whatever that function returns. agg is faster because it uses optimized internals; apply is a flexible escape hatch.
How do I groupby multiple columns in pandas?
Pass a list: df.groupby(['col_a', 'col_b']). Each unique combination of values becomes a group. The result is indexed by a MultiIndex unless you pass as_index=False.
Why is my groupby slow?
The usual causes: using apply with a Python lambda (slow), string groupby keys (cast to Categorical), sort=True (the default, unnecessary if order does not matter), or a DataFrame too large for memory (move to Polars or DuckDB).
Does pandas groupby preserve NULLs?
By default, groupby drops rows with NULL in the groupby column. Pass dropna=False (pandas 1.1+) to include them as a group.

Continue your prep

Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.