SQL Injection Cheat Sheet for Data Engineers (2026)
A travel startup shipped an Airflow DAG that ingested CSV filenames from an S3 bucket and interpolated them into a DROP PARTITION statement. A file named '2024-01-01'; DROP TABLE bookings; --.csv landed in the bucket via a misconfigured upload form and took down their warehouse for six hours. String concatenation in SQL is that easy to weaponize. This sheet covers the four injection types and the one fix that stops all of them.
The one rule that prevents SQL injection
Use parameterized queries for all values. Validate identifiers against allowlists. That's it. Every injection type on this page exploits the same root cause: treating user input as executable SQL instead of data.
ORM and driver protection
ORMs protect you when you use their query builders. They do not protect you when you bypass them with raw SQL. Below: safe and unsafe forms for the most common Python tools.
Know SQL injection the way the interviewer who asks it knows it.
Pulled from debriefs where SQL was the gate.
# ALWAYS this (parameterized)
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
# NEVER this (string concatenation)
cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")If you remember nothing else from this page, remember this contrast. The top form is safe in every database driver and every framework. The bottom form is exploitable in all of them.
Four types of SQL injection
Four flavors, one root cause. A SOC analyst at a logistics company pulled a postmortem on a breach where the attacker used a blind time-based injection through a reporting endpoint that took weeks to detect because the logs looked normal. All four go on your threat model.
Classic (in-band) injection · High risk
The attacker modifies a SQL query through user input, and results are returned directly in the application response. The textbook example: a login form where entering ' OR 1=1 -- bypasses authentication. The injected SQL runs in the same channel as the original query. Fix: parameterized queries. Never concatenate user input into SQL strings. The database engine treats parameters as data, not executable code.
Blind injection · Medium-high risk
The attacker cannot see query results directly, but can infer information from application behavior. Boolean-based blind injection checks whether the page changes. Time-based uses SLEEP or pg_sleep to confirm a condition is true by measuring response time. Slower to exploit but still dangerous. Fix: same as classic — parameterized queries. Blind injection exploits the same root cause. If inputs are parameterized, there is nothing to exploit.
Union-based injection · High risk
The attacker appends a UNION SELECT to the original query to extract data from other tables. Requires knowing (or guessing) the column count. Once the count matches, the attacker can pull data from any table the database user has access to. Fix: parameterized queries block this entirely. Pair them with least-privilege database users so that even if injection occurred, the attacker could not reach sensitive tables.
Second-order injection · Medium risk
Malicious input is stored in the database first and executed later when a different query reads it. Particularly relevant for data engineers: a value enters through one pipeline, gets stored, and then a downstream query uses it unsafely. The original insertion may be safe; the later consumption is not. Fix: parameterize every query, not just user-facing ones. Downstream queries reading from the database must also be parameterized. DEs miss this because the data 'came from our own database.'
Classic injection: vulnerable vs safe
# Vulnerable query (string concatenation)
query = "SELECT * FROM users WHERE name = '" + user_input + "'"
# If user_input = "' OR 1=1 --"
# Resulting query:
SELECT * FROM users WHERE name = '' OR 1=1 --'
# Returns ALL rows. Authentication bypassed.
# ────────────────────────
# Safe: parameterized query
cursor.execute(
"SELECT * FROM users WHERE name = %s",
(user_input,)
)String concatenation lets the attacker close the string and inject arbitrary SQL. Parameterized queries keep the user input as data — the database never parses it as SQL.
Blind injection: vulnerable vs safe
-- Boolean-based: does the page load differently?
' AND (SELECT COUNT(*) FROM users) > 100 --
-- Time-based: does the response take 5 seconds?
' AND pg_sleep(5) --
-- If the response is delayed, the attacker knows
-- the injected SQL executed successfully.
# ────────────────────────
# Also safe: using an ORM
User.objects.filter(name=user_input)
# Django ORM parameterizes automaticallyBlind injection works without visible output. The attacker probes by timing or by checking how the page renders. ORM query builders parameterize automatically and shut the door.
Union-based injection: vulnerable vs safe (defense in depth)
-- Original query returns 3 columns
SELECT id, name, email FROM products WHERE id = 1
-- Attacker input: 1 UNION SELECT username, password, null FROM admin_users
-- Resulting query:
SELECT id, name, email FROM products WHERE id = 1
UNION SELECT username, password, null FROM admin_users
-- Now the admin credentials appear in the product listing.
-- ────────────────────────
-- Defense in depth: restrict DB user permissions
GRANT SELECT ON products TO app_readonly;
-- app_readonly cannot access admin_users table
-- Even if injection worked, the query would failUNION-based attacks let the attacker pull data from tables your application never queries. Parameterized queries block the injection. Least-privilege database users limit blast radius if anything slips through.
Second-order injection: vulnerable vs safe
# Step 1: Safe insertion (parameterized)
INSERT INTO users (name) VALUES (%s) # stores: Robert'; DROP TABLE users;--
# Step 2: Unsafe downstream query (string concat)
query = f"SELECT * FROM logs WHERE user_name = '{row['name']}'"
# The stored value is now injected into a new query.
# ────────────────────────
# Safe downstream query
cursor.execute(
"SELECT * FROM logs WHERE user_name = %s",
(row['name'],)
)
# Parameterized even though the input came from our DBThe original INSERT was safe, but the downstream SELECT trusted the stored value. Parameterize every query — even ones reading 'your own' data. The DB doesn't remember which strings were once user input.
Why data engineers should care
Most SQL injection training focuses on web applications. DEs build systems that ingest data from APIs, partner feeds, file uploads, and user-generated content. These are all attack surfaces. The four most common DE-specific risks.
Dynamic SQL in ETL scripts
ETL scripts that build SQL strings dynamically — especially when table names, column names, or filter values come from configuration files, API responses, or upstream systems. If upstream is compromised, the injected value flows into your pipeline. Fix: parameterize values. For identifiers (table/column names), validate against an allowlist. Most drivers cannot parameterize identifiers.
API-driven query builders
REST or GraphQL APIs that accept filter parameters and build SQL from them. A caller passes sort_by=name; DROP TABLE users and your query builder injects it directly. Fix: parameterize values. Validate and allowlist column names for ORDER BY and GROUP BY. Never pass raw API input into SQL without validation.
Templated Airflow queries
Airflow DAGs that use Jinja templating to inject values into SQL. The {{ params.date }} syntax does not parameterize — it performs string substitution before the query reaches the database. Fix: use operator parameters (parameters argument in PostgresOperator). For dynamic identifiers, validate in a Python task before passing to SQL.
Bulk COPY or INSERT from files
Loading CSV or JSON files using COPY or bulk INSERT. If file contents are adversarial (a CSV field contains a SQL injection payload), the COPY itself is safe — it treats data as literal values. The risk is downstream queries that read those values without parameterization. Fix: ensure all downstream SQL reading loaded data is parameterized.
Dynamic SQL in ETL scripts
# BAD: dynamic table name from config
table = config['source_table']
query = f"SELECT * FROM {table}"
# GOOD: validate against allowlist
ALLOWED_TABLES = {'users', 'orders', 'products'}
if table not in ALLOWED_TABLES:
raise ValueError(f"Unknown table: {table}")
query = f"SELECT * FROM {table}" # safe after validationYou cannot parameterize identifiers in most drivers. Validate them against an allowlist before interpolating.
API-driven query builders
# BAD: API parameter directly in ORDER BY
sort_col = request.args.get('sort_by', 'id')
query = f"SELECT * FROM users ORDER BY {sort_col}"
# GOOD: allowlist the column
SORTABLE = {'id', 'name', 'created_at'}
if sort_col not in SORTABLE:
sort_col = 'id'
query = f"SELECT * FROM users ORDER BY {sort_col}"Allowlist ORDER BY columns. A caller who controls the sort column controls the query plan and can probe for additional vulnerabilities.
Templated Airflow queries
# BAD: Jinja template (string substitution)
sql = "SELECT * FROM events WHERE dt = '{{ params.date }}'"
# GOOD: use parameters argument
PostgresOperator(
task_id='query_events',
sql="SELECT * FROM events WHERE dt = %(date)s",
parameters={'date': '2026-01-15'},
)Jinja is string substitution, not parameterization. Operators expose a parameters argument for a reason — use it for values, and validate any dynamic identifiers in a Python task upstream.
Bulk COPY or INSERT from files
# COPY itself is safe (treats all data as literal values)
COPY staging_table FROM '/data/upload.csv' WITH (FORMAT csv);
# But downstream processing MUST be parameterized
cursor.execute(
"INSERT INTO clean_table SELECT * FROM staging_table WHERE status = %s",
('active',)
)COPY treats every byte as literal data. The risk is the next query that reads the staging table. Parameterize it the same as any other query.
SQLAlchemy (Python)
# Safe: ORM query
session.query(User).filter(User.name == user_input).all()
# Safe: text() with bind parameters
from sqlalchemy import text
session.execute(text("SELECT * FROM users WHERE name = :name"), {"name": user_input})
# UNSAFE: raw string formatting
session.execute(f"SELECT * FROM users WHERE name = '{user_input}'")session.query and text() with bind parameters are both safe. f-strings against execute() are not.
Django ORM (Python)
# Safe: ORM query
User.objects.filter(name=user_input)
# Safe: raw query with parameters
User.objects.raw("SELECT * FROM users WHERE name = %s", [user_input])
# UNSAFE: string formatting in raw query
User.objects.raw(f"SELECT * FROM users WHERE name = '{user_input}'")ORM filter() is always safe. raw() is safe only if you pass parameters as the second argument — never with f-strings.
psycopg2 / asyncpg (Python)
# Safe: psycopg2 parameterized
cursor.execute("SELECT * FROM users WHERE name = %s", (user_input,))
# Safe: asyncpg parameterized
await conn.fetch("SELECT * FROM users WHERE name = $1", user_input)
# UNSAFE: string concatenation
cursor.execute("SELECT * FROM users WHERE name = '" + user_input + "'")
# UNSAFE: f-string
cursor.execute(f"SELECT * FROM users WHERE name = '{user_input}'")Both drivers parameterize natively. psycopg2 uses %s placeholders, asyncpg uses $1, $2, etc. Never close them with string concatenation.
Prevention checklist
- Every query uses parameterized values (no f-strings, no string concatenation).
- Identifiers (table names, column names) are validated against allowlists.
- Database users have least-privilege permissions (read-only where possible).
- Airflow DAGs use operator parameters instead of Jinja substitution for values.
- ORM usage avoids raw SQL string formatting.
- Input from upstream systems is treated as untrusted, even from internal APIs.
- Stored procedures validate inputs before building dynamic SQL.
- CI/CD pipelines include static analysis for SQL injection patterns (e.g., bandit for Python).
SQL injection FAQ
Do data engineers need to worry about SQL injection?+
Does using an ORM prevent SQL injection?+
What about NoSQL injection?+
Is SQL injection still a real threat in 2026?+
Write SQL that won't get you paged
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition