Pandas sits at the prototyping layer of every Python data pipeline we build. Raw ingest lands in object storage, a Pandas notebook shapes the transform, and once the logic stabilizes it gets promoted to Spark or dbt. Knowing where Pandas fits in that stack matters as much as knowing the API. This sheet is organized around the four stages where DataFrames earn their keep: read, shape, merge, materialize.
This sheet is task-indexed, not API-indexed: each section maps to its SQL equivalent so you can port logic between the two layers of the pipeline.
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
The read/write layer is where a Pandas pipeline meets object storage. Parquet is the format you want in production: columnar layout enables predicate pushdown, Snappy compression cuts S3 egress by 60 to 80%, and schema is embedded so downstream jobs don't guess types. CSV is for humans and quick diffs, never for a DAG that runs nightly.
import pandas as pd
df = pd.read_csv("users.csv")
# Common options
df = pd.read_csv(
    "users.csv",
    usecols=["id", "name", "signup_date"],  # only load these columns
    dtype={"id": int, "name": str},         # explicit types
    parse_dates=["signup_date"],            # parse as datetime
    na_values=["", "NULL", "N/A"],          # treat these as NaN
)
Always specify dtypes and usecols for large files. Without them, Pandas infers types by scanning the entire file, which is slow and sometimes wrong.
df = pd.read_parquet("events.parquet")
# Read specific columns (Parquet supports column pruning)
df = pd.read_parquet("events.parquet", columns=["user_id", "event_type", "ts"])
# Read from S3 (requires s3fs)
df = pd.read_parquet("s3://bucket/path/events.parquet")
Parquet preserves column types, so you do not need dtype arguments. Column pruning is especially fast with Parquet because the format is columnar.
# Standard JSON array
df = pd.read_json("data.json")
# Newline-delimited JSON (common in streaming)
df = pd.read_json("events.jsonl", lines=True)
# Nested JSON: normalize first
import json
with open("nested.json") as f:
    raw = json.load(f)
df = pd.json_normalize(raw, record_path="items", meta=["order_id"])
Line-delimited JSON (one object per line) is the standard format for event streams and log files. Use lines=True for these.
df.to_parquet("output.parquet", index=False)
# With compression (snappy is default, gzip for smaller files)
df.to_parquet("output.parquet", compression="gzip", index=False)
# Partitioned output (useful for Hive-style layouts)
df.to_parquet("output/", partition_cols=["year", "month"], index=False)
Set index=False unless you specifically need the index in the output file. Partitioned writes create directory structures like output/year=2026/month=01/.
df.to_csv("output.csv", index=False)
# Tab-separated, no header
df.to_csv("output.tsv", sep="\t", header=False, index=False)
CSV is human-readable but loses type information. Prefer Parquet for pipeline outputs. Use CSV only for interchange with systems that require it.
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@host:5432/db")
df = pd.read_sql("SELECT * FROM users WHERE active = true", engine)
# With chunked reading for large tables
chunks = pd.read_sql("SELECT * FROM events", engine, chunksize=50000)
for chunk in chunks:
    process(chunk)
For large tables, always use chunksize to avoid loading everything into memory. Process each chunk independently.
Selecting rows and columns is the most basic Pandas operation, but there are several ways to do it. Use loc for label-based indexing and boolean masks for conditional filtering. Avoid chained indexing (df[col][row]) because it can produce unexpected results.
# Single condition
active = df[df["status"] == "active"]
# Multiple conditions (use & for AND, | for OR, ~ for NOT)
result = df[(df["age"] >= 25) & (df["country"] == "US")]
# String contains
gmail = df[df["email"].str.contains("@gmail.com", na=False)]
# isin for multiple values
subset = df[df["department"].isin(["engineering", "data", "product"])]
Always wrap conditions in parentheses when combining with & or |. Without parentheses, Python operator precedence causes errors.
# loc: label-based (rows and columns by name)
df.loc[df["salary"] > 100000, ["name", "salary", "department"]]
# iloc: position-based (rows and columns by index)
df.iloc[0:10, 0:3]  # first 10 rows, first 3 columns
# Set values with loc
df.loc[df["status"] == "inactive", "active_flag"] = False
loc includes both endpoints of a slice. iloc excludes the end. This is a common source of off-by-one bugs.
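A minimal demonstration of the slicing difference, using a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# loc slices include both endpoints: rows a, b, AND c
three_rows = df.loc["a":"c"]

# iloc slices exclude the end, like normal Python slicing: rows 0 and 1 only
two_rows = df.iloc[0:2]
```

The same `0:2` intuition that gives you two rows with iloc gives you three with `loc["a":"c"]`, which is exactly where the off-by-one bugs come from.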
# String-based filtering (readable for complex conditions)
result = df.query("age >= 25 and country == 'US'")
# Reference variables with @
min_salary = 80000
high_earners = df.query("salary >= @min_salary")
query() is often more readable than boolean masks for complex conditions. It also avoids the parentheses issue.
# Drop columns
df = df.drop(columns=["temp_col", "debug_col"])
# Drop rows by condition
df = df[df["value"].notna()]  # drop rows where value is NaN
# Drop duplicates
df = df.drop_duplicates(subset=["user_id", "date"], keep="last")
drop_duplicates with keep='last' is the Pandas equivalent of SQL deduplication with ROW_NUMBER. Specify subset to control which columns define uniqueness.
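The ROW_NUMBER-style "latest record per key" pattern can be sketched end to end; the column names here are hypothetical:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "date":    ["2024-01-01", "2024-01-01", "2024-01-02"],
    "ts":      ["08:00", "09:30", "10:00"],   # later ts = more recent record
    "status":  ["pending", "shipped", "pending"],
})

# SQL: ROW_NUMBER() OVER (PARTITION BY user_id, date ORDER BY ts DESC) = 1
latest = (
    events.sort_values("ts")                                       # oldest first
          .drop_duplicates(subset=["user_id", "date"], keep="last")  # keep newest
)
```

Sorting ascending and keeping 'last' (or sorting descending and keeping 'first') is what supplies the ORDER BY half of the window function.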
GroupBy in Pandas works like GROUP BY in SQL: split the data into groups, apply a function, and combine the results. The syntax is different, but the mental model is the same. If you can write a SQL GROUP BY, you can write the Pandas equivalent.
# Single aggregation
revenue_by_region = df.groupby("region")["revenue"].sum()
# Multiple aggregations
summary = df.groupby("department").agg(
    headcount=("employee_id", "count"),
    avg_salary=("salary", "mean"),
    max_salary=("salary", "max"),
    total_bonus=("bonus", "sum"),
)
Named aggregation (the keyword-argument style) is cleaner and produces properly named columns without renaming afterward.
monthly_region = df.groupby(["year", "month", "region"]).agg(
    total_revenue=("revenue", "sum"),
    order_count=("order_id", "nunique"),
)
# Reset index to get flat columns
monthly_region = monthly_region.reset_index()
groupby produces a MultiIndex by default. Call reset_index() to convert it back to regular columns. This is almost always what you want for downstream processing.
# Add group-level stats without collapsing rows
df["dept_avg_salary"] = df.groupby("department")["salary"].transform("mean")
df["salary_vs_avg"] = df["salary"] - df["dept_avg_salary"]
# Percent of group total
df["pct_of_dept"] = df.groupby("department")["revenue"].transform(
    lambda x: x / x.sum()
)
transform returns a Series with the same index as the input. It is the Pandas equivalent of a SQL window function without ORDER BY.
# Apply a custom function to each group
def top_n_revenue(group, n=3):
    return group.nlargest(n, "revenue")
top3_per_region = df.groupby("region").apply(top_n_revenue, n=3)
apply is flexible but slow. Prefer built-in aggregation functions (sum, mean, count, nunique) whenever possible. Reserve apply for logic that cannot be expressed otherwise.
Merge is the Pandas equivalent of SQL JOIN. The syntax maps directly: how='inner' is INNER JOIN, how='left' is LEFT JOIN, and so on. If you can write SQL JOINs, the translation is mechanical.
# Inner merge (only matching rows)
result = pd.merge(orders, customers, on="customer_id", how="inner")
# Left merge (keep all orders, NaN for unmatched customers)
result = pd.merge(orders, customers, on="customer_id", how="left")
# Different column names
result = pd.merge(
    orders, customers,
    left_on="cust_id", right_on="customer_id",
    how="left",
)
Always specify how= explicitly. The default is inner, which silently drops non-matching rows. Left join is the safer default for most data engineering work.
# Customers with no orders (anti-join)
merged = pd.merge(customers, orders, on="customer_id", how="left", indicator=True)
no_orders = merged[merged["_merge"] == "left_only"]
indicator=True adds a _merge column with values 'left_only', 'right_only', or 'both'. This is the Pandas equivalent of LEFT JOIN WHERE right.id IS NULL.
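The complementary semi-join (keep customers that have at least one order, without duplicating rows per order) can be sketched with isin; the data is hypothetical:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Bo", "Cy"]})
orders = pd.DataFrame({"order_id": [10, 11],
                       "customer_id": [1, 1]})

# Semi-join: a plain inner merge would repeat Ann once per order;
# isin keeps each matching customer exactly once
has_orders = customers[customers["customer_id"].isin(orders["customer_id"])]
```

This is the Pandas equivalent of WHERE customer_id IN (SELECT customer_id FROM orders).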
# Join on composite key
result = pd.merge(
    daily_metrics, targets,
    on=["date", "region", "product"],
    how="left",
)
Pass a list of column names to on= for composite-key joins. All columns must match for a row to join.
# Stack DataFrames vertically (UNION ALL)
combined = pd.concat([df_jan, df_feb, df_mar], ignore_index=True)
# Drop duplicates after concat (UNION)
combined = pd.concat([df_jan, df_feb]).drop_duplicates()
concat with ignore_index=True resets the index. Without it, you get duplicate index values, which causes bugs in downstream operations.
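The duplicate-index bug is easy to reproduce with two tiny hypothetical frames:

```python
import pandas as pd

df_a = pd.DataFrame({"x": [1, 2]})
df_b = pd.DataFrame({"x": [3, 4]})

# Without ignore_index, the original indexes are kept: 0, 1, 0, 1
dup_idx = pd.concat([df_a, df_b])
# dup_idx.loc[0] now returns TWO rows, which breaks label-based lookups

# ignore_index=True produces a clean 0..n-1 index
clean = pd.concat([df_a, df_b], ignore_index=True)
```

Any downstream loc lookup, join, or assignment that assumes unique labels will silently misbehave on the first frame.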
Pivoting transforms long data into wide format, and melting does the reverse. Data engineers use pivot tables for reporting and melt for normalizing denormalized inputs.
# Rows = month, Columns = region, Values = revenue sum
pivot = df.pivot_table(
    index="month",
    columns="region",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)
pivot_table handles duplicate entries by aggregating. Plain pivot() raises an error on duplicates. Always use pivot_table unless you are certain there are no duplicates.
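The difference is easy to see on a frame with a duplicated (month, region) pair; the data here is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb"],
    "region":  ["US", "US", "US"],    # duplicate (Jan, US) pair
    "revenue": [100, 50, 80],
})

# pivot() raises ValueError because (Jan, US) appears twice
try:
    df.pivot(index="month", columns="region", values="revenue")
except ValueError:
    pass  # "Index contains duplicate entries, cannot reshape"

# pivot_table() aggregates the duplicates instead: Jan/US becomes 100 + 50
pivot = df.pivot_table(index="month", columns="region",
                       values="revenue", aggfunc="sum")
```

If you expected exactly one row per cell, the ValueError from pivot() is actually useful: it catches upstream duplication instead of silently summing it away.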
# Wide to long
# Columns: user_id, jan_revenue, feb_revenue, mar_revenue
# Becomes: user_id, month, revenue
long = df.melt(
    id_vars=["user_id"],
    value_vars=["jan_revenue", "feb_revenue", "mar_revenue"],
    var_name="month",
    value_name="revenue",
)
melt is the inverse of pivot. Use it when you receive wide-format data from a spreadsheet or API and need to normalize it for your pipeline.
# unstack: move an index level to columns
unstacked = df.set_index(["region", "product"])["revenue"].unstack("product")
# stack: move columns back to index
stacked = unstacked.stack()
stack/unstack operate on the index. They are lower-level than pivot/melt but useful when working with MultiIndex DataFrames.
Vectorized operations are fast. apply() is flexible but slow. Understanding when to use each is a practical interview skill and a daily engineering skill.
# Arithmetic
df["total"] = df["price"] * df["quantity"]
# String operations
df["domain"] = df["email"].str.split("@").str[1]
# Conditional with np.where
import numpy as np
df["tier"] = np.where(df["revenue"] > 10000, "enterprise", "standard")
# Multiple conditions with np.select
conditions = [
    df["revenue"] > 100000,
    df["revenue"] > 10000,
]
choices = ["enterprise", "mid-market"]
df["tier"] = np.select(conditions, choices, default="smb")
Vectorized operations run in C under the hood. They are 10-100x faster than apply. Always try vectorized first.
# Row-wise apply
df["full_address"] = df.apply(
    lambda row: f"{row['street']}, {row['city']}, {row['state']} {row['zip']}",
    axis=1,
)
# Column-wise apply
df[["salary", "bonus"]] = df[["salary", "bonus"]].apply(pd.to_numeric, errors="coerce")
apply with axis=1 iterates over rows in Python. It is inherently slow for large DataFrames. If your apply takes more than a few seconds, look for a vectorized alternative.
# map: apply a function or dict to a Series
df["status_label"] = df["status_code"].map({1: "active", 2: "inactive", 3: "banned"})
# replace: replace specific values
df["country"] = df["country"].replace({"US": "United States", "UK": "United Kingdom"})
map is faster than apply for simple lookups. Use a dict for value mapping and a function for transformations.
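One gotcha worth knowing when choosing between the two: map and replace treat unmatched values differently. A sketch with a hypothetical status-code Series:

```python
import pandas as pd

codes = pd.Series([1, 2, 99])

# map: any value missing from the dict becomes NaN
mapped = codes.map({1: "active", 2: "inactive"})        # 99 -> NaN

# replace: unmatched values pass through unchanged
replaced = codes.replace({1: "active", 2: "inactive"})  # 99 stays 99
```

Use map when you want missing keys to surface as NaN (and get caught by validation), replace when unknown values should survive untouched.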
Window operations in Pandas are the equivalent of SQL window functions. They compute values over a sliding window or expanding window without collapsing rows.
# 7-day rolling average
df = df.sort_values("date")
df["rolling_avg_7d"] = df["revenue"].rolling(window=7).mean()
# Rolling with min_periods (handle early rows)
df["rolling_avg_7d"] = df["revenue"].rolling(window=7, min_periods=1).mean()
rolling() requires sorted data. Always sort by date first. min_periods controls how many values are needed before the window produces a result.
# Cumulative sum
df["cumulative_revenue"] = df["revenue"].expanding().sum()
# Shorthand
df["cumulative_revenue"] = df["revenue"].cumsum()
# Cumulative max (running maximum)
df["running_max"] = df["revenue"].expanding().max()
expanding() is equivalent to cumsum/cummax but works with any aggregation function. cumsum() is a convenient shorthand for the most common case.
# Previous row value (LAG)
df["prev_revenue"] = df["revenue"].shift(1)
# Next row value (LEAD)
df["next_revenue"] = df["revenue"].shift(-1)
# Period-over-period change
df["revenue_change"] = df["revenue"] - df["revenue"].shift(1)
df["revenue_pct_change"] = df["revenue"].pct_change()
shift(1) is LAG, shift(-1) is LEAD. pct_change() computes (current - previous) / previous in one call.
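To get LAG with a PARTITION BY, combine shift with groupby. A sketch on hypothetical per-region monthly data:

```python
import pandas as pd

df = pd.DataFrame({
    "region":  ["US", "US", "EU", "EU"],
    "month":   [1, 2, 1, 2],
    "revenue": [100, 120, 80, 90],
})

# SQL: LAG(revenue) OVER (PARTITION BY region ORDER BY month)
df = df.sort_values(["region", "month"])
df["prev_revenue"] = df.groupby("region")["revenue"].shift(1)
# The first row of each region gets NaN, matching LAG's NULL
```

A plain shift(1) here would leak the last US value into the first EU row; the groupby keeps the lag inside each partition.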
# Rank within each department
df["salary_rank"] = df.groupby("department")["salary"].rank(
    method="dense", ascending=False
)
# Top 3 per group
top3 = df[df["salary_rank"] <= 3]
method='dense' is DENSE_RANK, method='min' is RANK, method='first' is ROW_NUMBER. The default is 'average', which assigns the mean rank to ties.
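The three methods are easiest to compare side by side on a Series with a tie (hypothetical values):

```python
import pandas as pd

s = pd.Series([300, 200, 200, 100])

dense = s.rank(method="dense", ascending=False)  # [1, 2, 2, 3] - no gap after tie
rank_ = s.rank(method="min", ascending=False)    # [1, 2, 2, 4] - gap after tie
rownum = s.rank(method="first", ascending=False) # [1, 2, 3, 4] - ties broken by order
```

This matches DENSE_RANK, RANK, and ROW_NUMBER respectively: dense leaves no gap after the tied pair, min skips rank 3, and first arbitrarily orders the tied values.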
Pandas runs on a single core. For large datasets, these optimizations can be the difference between a pipeline that finishes in seconds and one that takes hours.
# Check memory usage
df.info(memory_usage="deep")
# Downcast numeric types
df["quantity"] = df["quantity"].astype("int32") # from int64
df["price"] = df["price"].astype("float32") # from float64
# Use category for low-cardinality strings
df["status"] = df["status"].astype("category")
Casting a string column with 10 unique values to category can reduce memory by 90%. Always do this for columns like status, country, and department.
# CSV: slow, large, loses types
# Parquet: fast, compressed, preserves types
import time
start = time.time()
df = pd.read_csv("large_file.csv") # slow
print(f"CSV: {time.time() - start:.1f}s")
start = time.time()
df = pd.read_parquet("large_file.parquet") # fast
print(f"Parquet: {time.time() - start:.1f}s")
Parquet reads are typically 5-10x faster than CSV for the same data because of columnar storage, compression, and metadata-driven type inference.
# BAD: iterrows (extremely slow)
for idx, row in df.iterrows():
    df.loc[idx, "total"] = row["price"] * row["quantity"]
# GOOD: vectorized (instant)
df["total"] = df["price"] * df["quantity"]
iterrows converts each row to a Series, which is expensive. It is 100-1000x slower than vectorized operations. There is almost never a reason to use it.
Practice the exact DataFrame patterns that show up in real ETL code, then promote them to Spark when the data grows.