Pandas sits at the prototyping layer of every Python data pipeline we build. Raw ingest lands in object storage, a Pandas notebook shapes the transform, and once the logic stabilizes it gets promoted to Spark or dbt. Knowing where Pandas fits in that stack matters as much as knowing the API. This sheet is organized around the four stages where DataFrames earn their keep: read, shape, merge, materialize.
This sheet is task-indexed, not API-indexed: each section maps to its SQL equivalent so you can port logic between the two layers of the pipeline.
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
The read/write layer is where a Pandas pipeline meets object storage. Parquet is the format you want in production: columnar layout enables predicate pushdown, Snappy compression cuts S3 egress by 60 to 80%, and schema is embedded so downstream jobs don't guess types. CSV is for humans and quick diffs, never for a DAG that runs nightly.
import pandas as pd
df = pd.read_csv("users.csv")
# Common options
df = pd.read_csv(
    "users.csv",
    usecols=["id", "name", "signup_date"],  # only load these columns
    dtype={"id": int, "name": str},         # explicit types
    parse_dates=["signup_date"],            # parse as datetime
    na_values=["", "NULL", "N/A"],          # treat these as NaN
)
Always specify dtypes and usecols for large files. Without them, Pandas infers types by scanning the entire file, which is slow and sometimes wrong.
df = pd.read_parquet("events.parquet")
# Read specific columns (Parquet supports column pruning)
df = pd.read_parquet("events.parquet", columns=["user_id", "event_type", "ts"])
# Read from S3 (requires s3fs)
df = pd.read_parquet("s3://bucket/path/events.parquet")
Parquet preserves column types, so you do not need dtype arguments. Column pruning is especially fast with Parquet because the format is columnar.
# Standard JSON array
df = pd.read_json("data.json")
# Newline-delimited JSON (common in streaming)
df = pd.read_json("events.jsonl", lines=True)
# Nested JSON: normalize first
import json
with open("nested.json") as f:
    raw = json.load(f)
df = pd.json_normalize(raw, record_path="items", meta=["order_id"])
Line-delimited JSON (one object per line) is the standard format for event streams and log files. Use lines=True for these.
df.to_parquet("output.parquet", index=False)
# With compression (snappy is default, gzip for smaller files)
df.to_parquet("output.parquet", compression="gzip", index=False)
# Partitioned output (useful for Hive-style layouts)
df.to_parquet("output/", partition_cols=["year", "month"], index=False)
Set index=False unless you specifically need the index in the output file. Partitioned writes create directory structures like output/year=2026/month=01/.
df.to_csv("output.csv", index=False)
# Tab-separated, no header
df.to_csv("output.tsv", sep="\t", header=False, index=False)
CSV is human-readable but loses type information. Prefer Parquet for pipeline outputs. Use CSV only for interchange with systems that require it.
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@host:5432/db")
df = pd.read_sql("SELECT * FROM users WHERE active = true", engine)
# With chunked reading for large tables
chunks = pd.read_sql("SELECT * FROM events", engine, chunksize=50000)
for chunk in chunks:
    process(chunk)
For large tables, always use chunksize to avoid loading everything into memory. Process each chunk independently.
Selecting rows and columns is the most basic Pandas operation, but there are several ways to do it. Use loc for label-based indexing and boolean masks for conditional filtering. Avoid chained indexing (df[col][row]) because it can produce unexpected results.
# Single condition
active = df[df["status"] == "active"]
# Multiple conditions (use & for AND, | for OR, ~ for NOT)
result = df[(df["age"] >= 25) & (df["country"] == "US")]
# String contains
gmail = df[df["email"].str.contains("@gmail.com", na=False)]
# isin for multiple values
subset = df[df["department"].isin(["engineering", "data", "product"])]
Always wrap conditions in parentheses when combining with & or |. Without parentheses, Python operator precedence causes errors.
# loc: label-based (rows and columns by name)
df.loc[df["salary"] > 100000, ["name", "salary", "department"]]
# iloc: position-based (rows and columns by index)
df.iloc[0:10, 0:3]  # first 10 rows, first 3 columns
# Set values with loc
df.loc[df["status"] == "inactive", "active_flag"] = False
loc includes both endpoints of a slice. iloc excludes the end. This is a common source of off-by-one bugs.
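A minimal demonstration of the slicing difference, using a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

# loc slices include both endpoints: rows a, b, AND c
three_rows = df.loc["a":"c"]

# iloc slices exclude the end, like normal Python slicing: rows 0 and 1 only
two_rows = df.iloc[0:2]
```

The same `0:2` intuition that gives you two rows with iloc gives you three with `loc["a":"c"]`, which is exactly where the off-by-one bugs come from.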
# String-based filtering (readable for complex conditions)
result = df.query("age >= 25 and country == 'US'")
# Reference variables with @
min_salary = 80000
high_earners = df.query("salary >= @min_salary")
query() is often more readable than boolean masks for complex conditions. It also avoids the parentheses issue.
# Drop columns
df = df.drop(columns=["temp_col", "debug_col"])
# Drop rows by condition
df = df[df["value"].notna()]  # drop rows where value is NaN
# Drop duplicates
df = df.drop_duplicates(subset=["user_id", "date"], keep="last")
drop_duplicates with keep='last' is the Pandas equivalent of SQL deduplication with ROW_NUMBER. Specify subset to control which columns define uniqueness.
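The ROW_NUMBER-style "latest record per key" pattern can be sketched end to end; the column names here are hypothetical:

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "date":    ["2024-01-01", "2024-01-01", "2024-01-02"],
    "ts":      ["08:00", "09:30", "10:00"],   # later ts = more recent record
    "status":  ["pending", "shipped", "pending"],
})

# SQL: ROW_NUMBER() OVER (PARTITION BY user_id, date ORDER BY ts DESC) = 1
latest = (
    events.sort_values("ts")                                       # oldest first
          .drop_duplicates(subset=["user_id", "date"], keep="last")  # keep newest
)
```

Sorting ascending and keeping 'last' (or sorting descending and keeping 'first') is what supplies the ORDER BY half of the window function.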
GroupBy in Pandas works like GROUP BY in SQL: split the data into groups, apply a function, and combine the results. The syntax is different, but the mental model is the same. If you can write a SQL GROUP BY, you can write the Pandas equivalent.
# Single aggregation
revenue_by_region = df.groupby("region")["revenue"].sum()
# Multiple aggregations
summary = df.groupby("department").agg(
    headcount=("employee_id", "count"),
    avg_salary=("salary", "mean"),
    max_salary=("salary", "max"),
    total_bonus=("bonus", "sum"),
)
Named aggregation (the keyword-argument style) is cleaner and produces properly named columns without renaming afterward.
monthly_region = df.groupby(["year", "month", "region"]).agg(
    total_revenue=("revenue", "sum"),
    order_count=("order_id", "nunique"),
)
# Reset index to get flat columns
monthly_region = monthly_region.reset_index()
groupby produces a MultiIndex by default. Call reset_index() to convert it back to regular columns. This is almost always what you want for downstream processing.
# Add group-level stats without collapsing rows
df["dept_avg_salary"] = df.groupby("department")["salary"].transform("mean")
df["salary_vs_avg"] = df["salary"] - df["dept_avg_salary"]
# Percent of group total
df["pct_of_dept"] = df.groupby("department")["revenue"].transform(
    lambda x: x / x.sum()
)
transform returns a Series with the same index as the input. It is the Pandas equivalent of a SQL window function without ORDER BY.
# Apply a custom function to each group
def top_n_revenue(group, n=3):
    return group.nlargest(n, "revenue")
top3_per_region = df.groupby("region").apply(top_n_revenue, n=3)
apply is flexible but slow. Prefer built-in aggregation functions (sum, mean, count, nunique) whenever possible. Reserve apply for logic that cannot be expressed otherwise.
Merge is the Pandas equivalent of SQL JOIN. The syntax maps directly: how='inner' is INNER JOIN, how='left' is LEFT JOIN, and so on. If you can write SQL JOINs, the translation is mechanical.
# Inner merge (only matching rows)
result = pd.merge(orders, customers, on="customer_id", how="inner")
# Left merge (keep all orders, NaN for unmatched customers)
result = pd.merge(orders, customers, on="customer_id", how="left")
# Different column names
result = pd.merge(
    orders, customers,
    left_on="cust_id", right_on="customer_id",
    how="left",
)
Always specify how= explicitly. The default is inner, which silently drops non-matching rows. Left join is the safer default for most data engineering work.
# Customers with no orders (anti-join)
merged = pd.merge(customers, orders, on="customer_id", how="left", indicator=True)
no_orders = merged[merged["_merge"] == "left_only"]
indicator=True adds a _merge column with values 'left_only', 'right_only', or 'both'. This is the Pandas equivalent of LEFT JOIN WHERE right.id IS NULL.
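The complementary semi-join (keep customers that have at least one order, without duplicating rows per order) can be sketched with isin; the data is hypothetical:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ann", "Bo", "Cy"]})
orders = pd.DataFrame({"order_id": [10, 11],
                       "customer_id": [1, 1]})

# Semi-join: a plain inner merge would repeat Ann once per order;
# isin keeps each matching customer exactly once
has_orders = customers[customers["customer_id"].isin(orders["customer_id"])]
```

This is the Pandas equivalent of WHERE customer_id IN (SELECT customer_id FROM orders).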
# Join on composite key
result = pd.merge(
    daily_metrics, targets,
    on=["date", "region", "product"],
    how="left",
)
Pass a list of column names to on= for composite-key joins. All columns must match for a row to join.
# Stack DataFrames vertically (UNION ALL)
combined = pd.concat([df_jan, df_feb, df_mar], ignore_index=True)
# Drop duplicates after concat (UNION)
combined = pd.concat([df_jan, df_feb]).drop_duplicates()
concat with ignore_index=True resets the index. Without it, you get duplicate index values, which causes bugs in downstream operations.
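The duplicate-index bug is easy to reproduce with two tiny hypothetical frames:

```python
import pandas as pd

df_a = pd.DataFrame({"x": [1, 2]})
df_b = pd.DataFrame({"x": [3, 4]})

# Without ignore_index, the original indexes are kept: 0, 1, 0, 1
dup_idx = pd.concat([df_a, df_b])
# dup_idx.loc[0] now returns TWO rows, which breaks label-based lookups

# ignore_index=True produces a clean 0..n-1 index
clean = pd.concat([df_a, df_b], ignore_index=True)
```

Any downstream loc lookup, join, or assignment that assumes unique labels will silently misbehave on the first frame.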
Pivoting transforms long data into wide format, and melting does the reverse. Data engineers use pivot tables for reporting and melt for normalizing denormalized inputs.
# Rows = month, Columns = region, Values = revenue sum
pivot = df.pivot_table(
    index="month",
    columns="region",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)
pivot_table handles duplicate entries by aggregating. Plain pivot() raises an error on duplicates. Always use pivot_table unless you are certain there are no duplicates.
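The difference is easy to see on a frame with a duplicated (month, region) pair; the data here is made up:

```python
import pandas as pd

df = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb"],
    "region":  ["US", "US", "US"],    # duplicate (Jan, US) pair
    "revenue": [100, 50, 80],
})

# pivot() raises ValueError because (Jan, US) appears twice
try:
    df.pivot(index="month", columns="region", values="revenue")
except ValueError:
    pass  # "Index contains duplicate entries, cannot reshape"

# pivot_table() aggregates the duplicates instead: Jan/US becomes 100 + 50
pivot = df.pivot_table(index="month", columns="region",
                       values="revenue", aggfunc="sum")
```

If you expected exactly one row per cell, the ValueError from pivot() is actually useful: it catches upstream duplication instead of silently summing it away.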
# Wide to long
# Columns: user_id, jan_revenue, feb_revenue, mar_revenue
# Becomes: user_id, month, revenue
long = df.melt(
    id_vars=["user_id"],
    value_vars=["jan_revenue", "feb_revenue", "mar_revenue"],
    var_name="month",
    value_name="revenue",
)
melt is the inverse of pivot. Use it when you receive wide-format data from a spreadsheet or API and need to normalize it for your pipeline.
# unstack: move an index level to columns
unstacked = df.set_index(["region", "product"])["revenue"].unstack("product")
# stack: move columns back to index
stacked = unstacked.stack()
stack/unstack operate on the index. They are lower-level than pivot/melt but useful when working with MultiIndex DataFrames.
Vectorized operations are fast. apply() is flexible but slow. Understanding when to use each is a practical interview skill and a daily engineering skill.
# Arithmetic
df["total"] = df["price"] * df["quantity"]
# String operations
df["domain"] = df["email"].str.split("@").str[1]
# Conditional with np.where
import numpy as np
df["tier"] = np.where(df["revenue"] > 10000, "enterprise", "standard")
# Multiple conditions with np.select
conditions = [
    df["revenue"] > 100000,
    df["revenue"] > 10000,
]
choices = ["enterprise", "mid-market"]
df["tier"] = np.select(conditions, choices, default="smb")
Vectorized operations run in C under the hood. They are 10-100x faster than apply. Always try vectorized first.
# Row-wise apply
df["full_address"] = df.apply(
    lambda row: f"{row['street']}, {row['city']}, {row['state']} {row['zip']}",
    axis=1,
)
# Column-wise apply
df[["salary", "bonus"]] = df[["salary", "bonus"]].apply(pd.to_numeric, errors="coerce")
apply with axis=1 iterates over rows in Python. It is inherently slow for large DataFrames. If your apply takes more than a few seconds, look for a vectorized alternative.
# map: apply a function or dict to a Series
df["status_label"] = df["status_code"].map({1: "active", 2: "inactive", 3: "banned"})
# replace: replace specific values
df["country"] = df["country"].replace({"US": "United States", "UK": "United Kingdom"})
map is faster than apply for simple lookups. Use a dict for value mapping and a function for transformations.
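One gotcha worth knowing when choosing between the two: map and replace treat unmatched values differently. A sketch with a hypothetical status-code Series:

```python
import pandas as pd

codes = pd.Series([1, 2, 99])

# map: any value missing from the dict becomes NaN
mapped = codes.map({1: "active", 2: "inactive"})        # 99 -> NaN

# replace: unmatched values pass through unchanged
replaced = codes.replace({1: "active", 2: "inactive"})  # 99 stays 99
```

Use map when you want missing keys to surface as NaN (and get caught by validation), replace when unknown values should survive untouched.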
Window operations in Pandas are the equivalent of SQL window functions. They compute values over a sliding window or expanding window without collapsing rows.
# 7-day rolling average
df = df.sort_values("date")
df["rolling_avg_7d"] = df["revenue"].rolling(window=7).mean()
# Rolling with min_periods (handle early rows)
df["rolling_avg_7d"] = df["revenue"].rolling(window=7, min_periods=1).mean()
rolling() requires sorted data. Always sort by date first. min_periods controls how many values are needed before the window produces a result.
# Cumulative sum
df["cumulative_revenue"] = df["revenue"].expanding().sum()
# Shorthand
df["cumulative_revenue"] = df["revenue"].cumsum()
# Cumulative max (running maximum)
df["running_max"] = df["revenue"].expanding().max()
expanding() is equivalent to cumsum/cummax but works with any aggregation function. cumsum() is a convenient shorthand for the most common case.
# Previous row value (LAG)
df["prev_revenue"] = df["revenue"].shift(1)
# Next row value (LEAD)
df["next_revenue"] = df["revenue"].shift(-1)
# Period-over-period change
df["revenue_change"] = df["revenue"] - df["revenue"].shift(1)
df["revenue_pct_change"] = df["revenue"].pct_change()
shift(1) is LAG, shift(-1) is LEAD. pct_change() computes (current - previous) / previous in one call.
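To get LAG with a PARTITION BY, combine shift with groupby. A sketch on hypothetical per-region monthly data:

```python
import pandas as pd

df = pd.DataFrame({
    "region":  ["US", "US", "EU", "EU"],
    "month":   [1, 2, 1, 2],
    "revenue": [100, 120, 80, 90],
})

# SQL: LAG(revenue) OVER (PARTITION BY region ORDER BY month)
df = df.sort_values(["region", "month"])
df["prev_revenue"] = df.groupby("region")["revenue"].shift(1)
# The first row of each region gets NaN, matching LAG's NULL
```

A plain shift(1) here would leak the last US value into the first EU row; the groupby keeps the lag inside each partition.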
# Rank within each department
df["salary_rank"] = df.groupby("department")["salary"].rank(
    method="dense", ascending=False
)
# Top 3 per group
top3 = df[df["salary_rank"] <= 3]
method='dense' is DENSE_RANK, method='min' is RANK, method='first' is ROW_NUMBER. The default is 'average', which assigns the mean rank to ties.
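The three methods are easiest to compare side by side on a Series with a tie (hypothetical values):

```python
import pandas as pd

s = pd.Series([300, 200, 200, 100])

dense = s.rank(method="dense", ascending=False)  # [1, 2, 2, 3] - no gap after tie
rank_ = s.rank(method="min", ascending=False)    # [1, 2, 2, 4] - gap after tie
rownum = s.rank(method="first", ascending=False) # [1, 2, 3, 4] - ties broken by order
```

This matches DENSE_RANK, RANK, and ROW_NUMBER respectively: dense leaves no gap after the tied pair, min skips rank 3, and first arbitrarily orders the tied values.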
Pandas runs on a single core. For large datasets, these optimizations can be the difference between a pipeline that finishes in seconds and one that takes hours.
# Check memory usage
df.info(memory_usage="deep")
# Downcast numeric types
df["quantity"] = df["quantity"].astype("int32") # from int64
df["price"] = df["price"].astype("float32") # from float64
# Use category for low-cardinality strings
df["status"] = df["status"].astype("category")
Casting a string column with 10 unique values to category can reduce memory by 90%. Always do this for columns like status, country, and department.
# CSV: slow, large, loses types
# Parquet: fast, compressed, preserves types
import time
start = time.time()
df = pd.read_csv("large_file.csv") # slow
print(f"CSV: {time.time() - start:.1f}s")
start = time.time()
df = pd.read_parquet("large_file.parquet") # fast
print(f"Parquet: {time.time() - start:.1f}s")
Parquet reads are typically 5-10x faster than CSV for the same data because of columnar storage, compression, and metadata-driven type inference.
# BAD: iterrows (extremely slow)
for idx, row in df.iterrows():
    df.loc[idx, "total"] = row["price"] * row["quantity"]
# GOOD: vectorized (instant)
df["total"] = df["price"] * df["quantity"]
iterrows converts each row to a Series, which is expensive. It is 100-1000x slower than vectorized operations. There is almost never a reason to use it.
Practice the exact DataFrame patterns that show up in real ETL code, then promote them to Spark when the data grows.