
Pandas Cheat Sheet for Data Engineers

Pandas sits at the prototyping layer of every Python data pipeline we build. Raw ingest lands in object storage, a Pandas notebook shapes the transform, and once the logic stabilizes it gets promoted to Spark or dbt. Knowing where Pandas fits in that stack matters as much as knowing the API. This sheet is organized around the four stages where DataFrames earn their keep: read, shape, merge, materialize.

Task-indexed, not API-indexed. Each section maps to the SQL equivalent so you can port logic between the two layers of the pipeline.

- 35%: Python share of DE rounds
- 4: pipeline stages covered
- 369: Python questions mapped
- 275: companies in dataset

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Reading and Writing Data

The read/write layer is where a Pandas pipeline meets object storage. Parquet is the format you want in production: columnar layout enables predicate pushdown, Snappy compression cuts S3 egress by 60 to 80%, and schema is embedded so downstream jobs don't guess types. CSV is for humans and quick diffs, never for a DAG that runs nightly.

Read CSV

import pandas as pd

df = pd.read_csv("users.csv")

# Common options
df = pd.read_csv(
    "users.csv",
    usecols=["id", "name", "signup_date"],  # only load these columns
    dtype={"id": int, "name": str},          # explicit types
    parse_dates=["signup_date"],             # parse as datetime
    na_values=["", "NULL", "N/A"],           # treat these as NaN
)

Always specify dtypes and usecols for large files. Without them, Pandas infers types by scanning the entire file, which is slow and sometimes wrong.
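One caveat with explicit dtypes: a plain int column cannot hold NaN, so a missing value will make `dtype={"id": int}` fail. Pandas's nullable extension dtypes ("Int64", "string") avoid this. A minimal sketch, using a made-up inline CSV for illustration:

```python
import io
import pandas as pd

# Hypothetical CSV with a missing id: a plain int dtype would fail here,
# because NaN cannot be stored in a regular int64 column.
data = "id,name,signup_date\n1,ana,2026-01-02\n,bo,2026-01-03\n"

df = pd.read_csv(
    io.StringIO(data),
    dtype={"id": "Int64", "name": "string"},  # nullable extension dtypes
    parse_dates=["signup_date"],
)

print(df["id"].dtype)  # Int64 -- the missing value is stored as <NA>
```

The capital-I "Int64" is the nullable integer dtype; lowercase "int64" is the NumPy dtype that rejects missing values.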

Read Parquet

df = pd.read_parquet("events.parquet")

# Read specific columns (Parquet supports column pruning)
df = pd.read_parquet("events.parquet", columns=["user_id", "event_type", "ts"])

# Read from S3 (requires s3fs)
df = pd.read_parquet("s3://bucket/path/events.parquet")

Parquet preserves column types, so you do not need dtype arguments. Column pruning is especially fast with Parquet because the format is columnar.

Read JSON

# Standard JSON array
df = pd.read_json("data.json")

# Newline-delimited JSON (common in streaming)
df = pd.read_json("events.jsonl", lines=True)

# Nested JSON: normalize first
import json
with open("nested.json") as f:
    raw = json.load(f)
df = pd.json_normalize(raw, record_path="items", meta=["order_id"])

Line-delimited JSON (one object per line) is the standard format for event streams and log files. Use lines=True for these.

Write Parquet

df.to_parquet("output.parquet", index=False)

# With compression (snappy is default, gzip for smaller files)
df.to_parquet("output.parquet", compression="gzip", index=False)

# Partitioned output (useful for Hive-style layouts)
df.to_parquet("output/", partition_cols=["year", "month"], index=False)

Set index=False unless you specifically need the index in the output file. Partitioned writes create directory structures like output/year=2026/month=01/.

Write CSV

df.to_csv("output.csv", index=False)

# Tab-separated, no header
df.to_csv("output.tsv", sep="\t", header=False, index=False)

CSV is human-readable but loses type information. Prefer Parquet for pipeline outputs. Use CSV only for interchange with systems that require it.

Read from Database

from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@host:5432/db")
df = pd.read_sql("SELECT * FROM users WHERE active = true", engine)

# With chunked reading for large tables
chunks = pd.read_sql("SELECT * FROM events", engine, chunksize=50000)
for chunk in chunks:
    process(chunk)

For large tables, always use chunksize to avoid loading everything into memory. Process each chunk independently.
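A common chunked pattern is aggregating partial results per chunk and combining them at the end, so no chunk's raw rows need to stay in memory. The sketch below uses `read_csv` with `chunksize` as a runnable stand-in; the same loop works unchanged with `pd.read_sql(..., chunksize=...)` against a real engine:

```python
import io
import pandas as pd

# Stand-in for a large table: synthetic CSV with 10 rows, 3 users
data = "user_id,amount\n" + "\n".join(f"{i % 3},{i}" for i in range(10))

# Aggregate each chunk independently, keeping only the partial sums
partials = []
for chunk in pd.read_csv(io.StringIO(data), chunksize=4):
    partials.append(chunk.groupby("user_id")["amount"].sum())

# Combine partial sums: same answer as a single full-table groupby
totals = pd.concat(partials).groupby(level=0).sum()
print(int(totals.sum()))  # 45 -- matches the full-table sum of 0..9
```

This works because sum (like count, min, max) is decomposable: a sum of per-chunk sums equals the total sum. Mean and median need more care.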

Filtering and Selection

Selecting rows and columns is the most basic Pandas operation, but there are several ways to do it. Use loc for label-based indexing and boolean masks for conditional filtering. Avoid chained indexing (df[col][row]) because it can produce unexpected results.

Boolean mask filtering

# Single condition
active = df[df["status"] == "active"]

# Multiple conditions (use & for AND, | for OR, ~ for NOT)
result = df[(df["age"] >= 25) & (df["country"] == "US")]

# String contains
gmail = df[df["email"].str.contains("@gmail.com", na=False)]

# isin for multiple values
subset = df[df["department"].isin(["engineering", "data", "product"])]

Always wrap conditions in parentheses when combining with & or |. Without parentheses, Python operator precedence causes errors.

loc and iloc

# loc: label-based (rows and columns by name)
df.loc[df["salary"] > 100000, ["name", "salary", "department"]]

# iloc: position-based (rows and columns by index)
df.iloc[0:10, 0:3]  # first 10 rows, first 3 columns

# Set values with loc
df.loc[df["status"] == "inactive", "active_flag"] = False

loc includes both endpoints of a slice. iloc excludes the end. This is a common source of off-by-one bugs.
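The endpoint difference is easiest to see side by side on a tiny labeled Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

print(list(s.loc["a":"c"]))  # [10, 20, 30] -- label slice keeps both ends
print(list(s.iloc[0:2]))     # [10, 20]     -- position slice stops before 2
```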

query() method

# String-based filtering (readable for complex conditions)
result = df.query("age >= 25 and country == 'US'")

# Reference variables with @
min_salary = 80000
high_earners = df.query("salary >= @min_salary")

query() is often more readable than boolean masks for complex conditions. It also avoids the parentheses issue.

Dropping rows and columns

# Drop columns
df = df.drop(columns=["temp_col", "debug_col"])

# Drop rows by condition
df = df[df["value"].notna()]  # drop rows where value is NaN

# Drop duplicates
df = df.drop_duplicates(subset=["user_id", "date"], keep="last")

drop_duplicates with keep='last' is the Pandas equivalent of SQL deduplication with ROW_NUMBER. Specify subset to control which columns define uniqueness.
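To make the ROW_NUMBER analogy concrete: sort by the ordering column first, then drop duplicates. A sketch with a hypothetical event log (column names are made up for illustration):

```python
import pandas as pd

# Multiple events per (user_id, date); keep the latest by timestamp
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "date": ["2026-01-01", "2026-01-01", "2026-01-01"],
    "ts": ["09:00", "17:30", "12:00"],
    "value": [5, 9, 7],
})

# SQL: ROW_NUMBER() OVER (PARTITION BY user_id, date ORDER BY ts DESC) = 1
latest = (
    df.sort_values("ts")
      .drop_duplicates(subset=["user_id", "date"], keep="last")
)

print(latest.loc[latest["user_id"] == 1, "value"].item())  # 9
```

The sort defines which row "last" means; without it, keep="last" just keeps whichever duplicate happened to appear later in the file.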

GroupBy and Aggregation

GroupBy in Pandas works like GROUP BY in SQL: split the data into groups, apply a function, and combine the results. The syntax is different, but the mental model is the same. If you can write a SQL GROUP BY, you can write the Pandas equivalent.

Basic groupby

# Single aggregation
revenue_by_region = df.groupby("region")["revenue"].sum()

# Multiple aggregations
summary = df.groupby("department").agg(
    headcount=("employee_id", "count"),
    avg_salary=("salary", "mean"),
    max_salary=("salary", "max"),
    total_bonus=("bonus", "sum"),
)

Named aggregation (the keyword argument style) is cleaner and produces properly named columns without renaming afterward.

Multiple groupby keys

monthly_region = df.groupby(["year", "month", "region"]).agg(
    total_revenue=("revenue", "sum"),
    order_count=("order_id", "nunique"),
)

# Reset index to get flat columns
monthly_region = monthly_region.reset_index()

groupby produces a MultiIndex by default. Call reset_index() to convert it back to regular columns. This is almost always what you want for downstream processing.

Transform (broadcast back to original shape)

# Add group-level stats without collapsing rows
df["dept_avg_salary"] = df.groupby("department")["salary"].transform("mean")
df["salary_vs_avg"] = df["salary"] - df["dept_avg_salary"]

# Percent of group total
df["pct_of_dept"] = df.groupby("department")["revenue"].transform(
    lambda x: x / x.sum()
)

transform returns a Series with the same index as the input. It is the Pandas equivalent of a SQL window function without ORDER BY.

Custom aggregation with apply

# Apply a custom function to each group
def top_n_revenue(group, n=3):
    return group.nlargest(n, "revenue")

top3_per_region = df.groupby("region").apply(top_n_revenue, n=3)

apply is flexible but slow. Prefer built-in aggregation functions (sum, mean, count, nunique) whenever possible. Reserve apply for logic that cannot be expressed otherwise.

Merge and Join

Merge is the Pandas equivalent of SQL JOIN. The syntax maps directly: how='inner' is INNER JOIN, how='left' is LEFT JOIN, and so on. If you can write SQL JOINs, the translation is mechanical.

Inner and left merge

# Inner merge (only matching rows)
result = pd.merge(orders, customers, on="customer_id", how="inner")

# Left merge (keep all orders, NaN for unmatched customers)
result = pd.merge(orders, customers, on="customer_id", how="left")

# Different column names
result = pd.merge(
    orders, customers,
    left_on="cust_id", right_on="customer_id",
    how="left"
)

Always specify how= explicitly. The default is inner, which silently drops non-matching rows. Left join is the safer default for most data engineering work.
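The other silent merge failure is fan-out: duplicate keys on the right side multiply rows on the left. The validate argument turns that into a loud error. A small sketch with made-up tables:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 10]})
customers = pd.DataFrame({"customer_id": [10, 20], "name": ["ana", "bo"]})

# validate raises if the right side has duplicate keys (row explosion)
result = pd.merge(
    orders, customers,
    on="customer_id", how="left",
    validate="many_to_one",  # each order matches at most one customer
)
print(len(result))  # 2 -- no fan-out

# With a duplicated customer row, the same call raises MergeError
dup = pd.concat([customers, customers.iloc[[0]]], ignore_index=True)
try:
    pd.merge(orders, dup, on="customer_id", how="left",
             validate="many_to_one")
except pd.errors.MergeError:
    print("duplicate keys on the right side")
```

validate="one_to_one" and "one_to_many" are also available; pick the one that states your assumption about the keys.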

Anti-join (find non-matches)

# Customers with no orders (anti-join)
merged = pd.merge(customers, orders, on="customer_id", how="left", indicator=True)
no_orders = merged[merged["_merge"] == "left_only"]

indicator=True adds a _merge column with values 'left_only', 'right_only', or 'both'. This is the Pandas equivalent of LEFT JOIN WHERE right.id IS NULL.

Merge on multiple keys

# Join on composite key
result = pd.merge(
    daily_metrics, targets,
    on=["date", "region", "product"],
    how="left"
)

Pass a list of column names to on= for composite key joins. All columns must match for a row to join.

concat (UNION equivalent)

# Stack DataFrames vertically (UNION ALL)
combined = pd.concat([df_jan, df_feb, df_mar], ignore_index=True)

# Drop duplicates after concat (UNION)
combined = pd.concat([df_jan, df_feb]).drop_duplicates()

concat with ignore_index=True resets the index. Without it, you get duplicate index values, which causes bugs in downstream operations.

Pivot and Reshape

Pivoting transforms long data into wide format, and melting does the reverse. Data engineers use pivot tables for reporting and melt for normalizing denormalized inputs.

pivot_table

# Rows = month, Columns = region, Values = revenue sum
pivot = df.pivot_table(
    index="month",
    columns="region",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)

pivot_table handles duplicate entries by aggregating. Plain pivot() raises an error on duplicates. Always use pivot_table unless you are certain there are no duplicates.

melt (unpivot)

# Wide to long
# Columns: user_id, jan_revenue, feb_revenue, mar_revenue
# Becomes: user_id, month, revenue
long = df.melt(
    id_vars=["user_id"],
    value_vars=["jan_revenue", "feb_revenue", "mar_revenue"],
    var_name="month",
    value_name="revenue",
)

melt is the inverse of pivot. Use it when you receive wide-format data from a spreadsheet or API and need to normalize it for your pipeline.

stack and unstack

# unstack: move an index level to columns
unstacked = df.set_index(["region", "product"])["revenue"].unstack("product")

# stack: move columns back to index
stacked = unstacked.stack()

stack/unstack operate on the index. They are lower-level than pivot/melt but useful when working with MultiIndex DataFrames.

Apply and Vectorized Operations

Vectorized operations are fast. apply() is flexible but slow. Understanding when to use each is a practical interview skill and a daily engineering skill.

Vectorized operations (prefer these)

# Arithmetic
df["total"] = df["price"] * df["quantity"]

# String operations
df["domain"] = df["email"].str.split("@").str[1]

# Conditional with np.where
import numpy as np
df["tier"] = np.where(df["revenue"] > 10000, "enterprise", "standard")

# Multiple conditions with np.select
conditions = [
    df["revenue"] > 100000,
    df["revenue"] > 10000,
]
choices = ["enterprise", "mid-market"]
df["tier"] = np.select(conditions, choices, default="smb")

Vectorized operations run in C under the hood. They are 10-100x faster than apply. Always try vectorized first.

apply (use when vectorized is not possible)

# Row-wise apply
df["full_address"] = df.apply(
    lambda row: f"{row['street']}, {row['city']}, {row['state']} {row['zip']}",
    axis=1,
)

# Column-wise apply
df[["salary", "bonus"]] = df[["salary", "bonus"]].apply(pd.to_numeric, errors="coerce")

apply with axis=1 iterates over rows in Python. It is inherently slow for large DataFrames. If your apply takes more than a few seconds, look for a vectorized alternative.
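For the address example above, the row-wise apply is avoidable: string concatenation is vectorized across whole columns. A sketch with made-up address data:

```python
import pandas as pd

df = pd.DataFrame({
    "street": ["1 Main St", "9 Oak Ave"],
    "city": ["Austin", "Denver"],
    "state": ["TX", "CO"],
    "zip": ["73301", "80201"],
})

# Same result as the axis=1 apply, built from column-level operations
# instead of a per-row Python loop
df["full_address"] = (
    df["street"] + ", " + df["city"] + ", " + df["state"] + " " + df["zip"]
)

print(df["full_address"].iloc[0])  # 1 Main St, Austin, TX 73301
```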

map and replace

# map: apply a function or dict to a Series
df["status_label"] = df["status_code"].map({1: "active", 2: "inactive", 3: "banned"})

# replace: replace specific values
df["country"] = df["country"].replace({"US": "United States", "UK": "United Kingdom"})

map is faster than apply for simple lookups. Use a dict for value mapping and a function for transformations.

Window Operations

Window operations in Pandas are the equivalent of SQL window functions. They compute values over a sliding window or expanding window without collapsing rows.

Rolling window

# 7-day rolling average
df = df.sort_values("date")
df["rolling_avg_7d"] = df["revenue"].rolling(window=7).mean()

# Rolling with min_periods (handle early rows)
df["rolling_avg_7d"] = df["revenue"].rolling(window=7, min_periods=1).mean()

rolling() requires sorted data. Always sort by date first. min_periods controls how many values are needed before the window produces a result.
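In pipelines, the rolling window usually needs to restart per entity, like a SQL window with PARTITION BY. Combining groupby with rolling does this; the reset_index call drops the group level so the result aligns back to the original rows. A sketch with synthetic per-user data:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2026-01-01", "2026-01-02", "2026-01-03",
         "2026-01-01", "2026-01-02"]
    ),
    "revenue": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Rolling average within each user: sort, then group before rolling
df = df.sort_values(["user_id", "date"])
df["rolling_avg"] = (
    df.groupby("user_id")["revenue"]
      .rolling(window=2, min_periods=1)
      .mean()
      .reset_index(level=0, drop=True)  # drop the user_id index level
)

print(df["rolling_avg"].tolist())  # [10.0, 15.0, 25.0, 5.0, 10.0]
```

Note that user 2's window starts fresh: its first value is 5.0, not an average that leaks in user 1's revenue.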

Expanding window (cumulative)

# Cumulative sum
df["cumulative_revenue"] = df["revenue"].expanding().sum()

# Shorthand
df["cumulative_revenue"] = df["revenue"].cumsum()

# Cumulative max (running maximum)
df["running_max"] = df["revenue"].expanding().max()

expanding() is equivalent to cumsum/cummax but works with any aggregation function. cumsum() is a convenient shorthand for the most common case.

shift (LAG/LEAD equivalent)

# Previous row value (LAG)
df["prev_revenue"] = df["revenue"].shift(1)

# Next row value (LEAD)
df["next_revenue"] = df["revenue"].shift(-1)

# Period-over-period change
df["revenue_change"] = df["revenue"] - df["revenue"].shift(1)
df["revenue_pct_change"] = df["revenue"].pct_change()

shift(1) is LAG, shift(-1) is LEAD. pct_change() computes (current - previous) / previous in one call.
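As with rolling, shift in a pipeline usually needs to restart per entity, which is groupby plus shift, the equivalent of LAG with PARTITION BY. A sketch with made-up per-user data:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "date": ["2026-01-01", "2026-01-02", "2026-01-01", "2026-01-02"],
    "revenue": [10, 12, 100, 90],
})

# LAG(revenue) OVER (PARTITION BY user_id ORDER BY date)
df = df.sort_values(["user_id", "date"])
df["prev_revenue"] = df.groupby("user_id")["revenue"].shift(1)
df["revenue_change"] = df["revenue"] - df["prev_revenue"]

print(df["revenue_change"].tolist())  # [nan, 2.0, nan, -10.0]
```

Each user's first row gets NaN rather than inheriting the previous user's last value, which is exactly what a plain shift(1) would wrongly do.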

rank (within groups)

# Rank within each department
df["salary_rank"] = df.groupby("department")["salary"].rank(
    method="dense", ascending=False
)

# Top 3 per group
top3 = df[df["salary_rank"] <= 3]

method='dense' is DENSE_RANK, method='min' is RANK, method='first' is ROW_NUMBER. The default is 'average', which assigns the mean rank to ties.

Performance Tips

Pandas runs on a single core. For large datasets, these optimizations can be the difference between a pipeline that finishes in seconds and one that takes hours.

Reduce memory with dtypes

# Check memory usage
df.info(memory_usage="deep")

# Downcast numeric types
df["quantity"] = df["quantity"].astype("int32")  # from int64
df["price"] = df["price"].astype("float32")      # from float64

# Use category for low-cardinality strings
df["status"] = df["status"].astype("category")

Casting a string column with 10 unique values to category can reduce memory by 90%. Always do this for columns like status, country, and department.
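You can verify the saving directly with memory_usage(deep=True). A quick sketch on a synthetic 100,000-row status column (the exact ratio varies by string length and pandas version, but the reduction is large):

```python
import pandas as pd

# Low-cardinality string column: 100,000 rows, 3 distinct values
statuses = pd.Series(["active", "inactive", "banned"] * 33_334)

before = statuses.memory_usage(deep=True)
after = statuses.astype("category").memory_usage(deep=True)

# category stores small integer codes plus one copy of each label,
# instead of one full string object per row
print(after < before / 5)  # True -- a large reduction in practice
```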

Use Parquet over CSV

# CSV: slow, large, loses types
# Parquet: fast, compressed, preserves types

import time
start = time.time()
df = pd.read_csv("large_file.csv")      # slow
print(f"CSV: {time.time() - start:.1f}s")

start = time.time()
df = pd.read_parquet("large_file.parquet")  # fast
print(f"Parquet: {time.time() - start:.1f}s")

Parquet reads are typically 5-10x faster than CSV for the same data because of columnar storage, compression, and an embedded schema that eliminates type inference entirely.

Avoid iterrows

# BAD: iterrows (extremely slow)
for idx, row in df.iterrows():
    df.loc[idx, "total"] = row["price"] * row["quantity"]

# GOOD: vectorized (instant)
df["total"] = df["price"] * df["quantity"]

iterrows converts each row to a Series, which is expensive. It is 100-1000x slower than vectorized operations. There is almost never a reason to use it.
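If a Python-level loop is genuinely unavoidable (say, calling an external API per row), itertuples is the cheaper fallback: rows come back as lightweight namedtuples instead of Series. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"price": [2.0, 3.0], "quantity": [5, 4]})

# itertuples avoids the per-row Series construction that makes
# iterrows slow; fields are accessed as namedtuple attributes
totals = [row.price * row.quantity for row in df.itertuples(index=False)]
df["total"] = totals

print(df["total"].tolist())  # [10.0, 12.0]
```

This is still a Python loop, so reach for it only after ruling out a vectorized form.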

Pandas Cheat Sheet FAQ

Is Pandas fast enough for production data engineering?
For datasets under 1-2 GB, Pandas is fast enough for most production workloads. Beyond that, you start hitting memory limits and single-threaded performance walls. For larger datasets, consider Polars (same API style, much faster), DuckDB (SQL interface, excellent for analytics), or PySpark (distributed). Many production pipelines use Pandas for small-to-medium transforms and hand off to Spark for heavy lifting.

Should I learn Pandas or Polars for data engineering interviews?
Learn Pandas first. It is still the most commonly used DataFrame library in Python, and interviewers expect you to know it. Polars is gaining traction, but interview questions overwhelmingly use Pandas. Once you know Pandas, Polars takes about a day to learn because the concepts are identical. The syntax differences are minor.

How does Pandas groupby compare to SQL GROUP BY?
They are functionally identical. df.groupby("col")["val"].sum() is SELECT col, SUM(val) FROM table GROUP BY col. The main difference is that Pandas groupby returns a grouped object that you apply functions to, while SQL specifies everything in one statement. If you can write SQL GROUP BY queries, translating to Pandas is mechanical.

What Pandas operations do interviewers test most?
Filtering (boolean masks), groupby with aggregation, merge/join, and handling missing values. These four cover about 80% of interview questions. Window operations (rolling, shift, rank) come up in advanced rounds. Pivot and melt are less common but occasionally tested for data wrangling scenarios.

Pandas in Production, Not in Theory

Practice the exact DataFrame patterns that show up in real ETL code, then promote them to Spark when the data grows.