SQL Injection Cheat Sheet for Data Engineers (2026)

A travel startup shipped an Airflow DAG that ingested CSV filenames from an S3 bucket and interpolated them into a DROP PARTITION statement. A file named '2024-01-01'; DROP TABLE bookings; --.csv landed in the bucket via a misconfigured upload form and took down their warehouse for six hours. String concatenation in SQL is that easy to weaponize. This sheet covers the four injection types and the one fix that stops all of them.

The one rule that prevents SQL injection

Use parameterized queries for all values. Validate identifiers against allowlists. That's it. Every injection type on this page exploits the same root cause: treating user input as executable SQL instead of data.

#3
OWASP Top 10 rank
4
Injection types covered
1
Fix: parameterize
1990s
When the fix was published

ORM and driver protection

ORMs protect you when you use their query builders. They do not protect you when you bypass them with raw SQL. Below: safe and unsafe forms for the most common Python tools.

Prepare for the interview
01 / Open invite
02min.

Know SQL injection the way the interviewer who asks it knows it.

a SQL injection query, the same shape a screen would give you.
The diff against expected. Where ties broke. What you missed.
sandbox
1SELECT user_id,
2 COUNT(*) AS sessions
3FROM events
4WHERE ts >= NOW() - INTERVAL '7 day'
5
Execute your solution0.4s avg.
StripeInterview question
Solve a SQL injection problem
Prepare for the interview
03 / From the bank03 of many
03hand-picked.

Service User Growth Rate

Easy15 min

User growth, service by service.

# ALWAYS this (parameterized)
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))

# NEVER this (string concatenation)
cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")

If you remember nothing else from this page, remember this contrast. The top form is safe in every database driver and every framework. The bottom form is exploitable in all of them.

Four types of SQL injection

Four flavors, one root cause. A SOC analyst at a logistics company pulled a postmortem on a breach where the attacker used a blind time-based injection through a reporting endpoint that took weeks to detect because the logs looked normal. All four go on your threat model.

Classic (in-band) injection · High risk

The attacker modifies a SQL query through user input, and results are returned directly in the application response. The textbook example: a login form where entering ' OR 1=1 -- bypasses authentication. The injected SQL runs in the same channel as the original query. Fix: parameterized queries. Never concatenate user input into SQL strings. The database engine treats parameters as data, not executable code.

Blind injection · Medium-high risk

The attacker cannot see query results directly, but can infer information from application behavior. Boolean-based blind injection checks whether the page changes. Time-based uses SLEEP or pg_sleep to confirm a condition is true by measuring response time. Slower to exploit but still dangerous. Fix: same as classic — parameterized queries. Blind injection exploits the same root cause. If inputs are parameterized, there is nothing to exploit.

Union-based injection · High risk

The attacker appends a UNION SELECT to the original query to extract data from other tables. Requires knowing (or guessing) the column count. Once the count matches, the attacker can pull data from any table the database user has access to. Fix: parameterized queries block this entirely. Pair them with least-privilege database users so that even if injection occurred, the attacker could not reach sensitive tables.

Second-order injection · Medium risk

Malicious input is stored in the database first and executed later when a different query reads it. Particularly relevant for data engineers: a value enters through one pipeline, gets stored, and then a downstream query uses it unsafely. The original insertion may be safe; the later consumption is not. Fix: parameterize every query, not just user-facing ones. Downstream queries reading from the database must also be parameterized. DEs miss this because the data 'came from our own database.'

Classic injection: vulnerable vs safe

# Vulnerable query (string concatenation)
query = "SELECT * FROM users WHERE name = '" + user_input + "'"

# If user_input = "' OR 1=1 --"
# Resulting query:
SELECT * FROM users WHERE name = '' OR 1=1 --'
# Returns ALL rows. Authentication bypassed.

# ────────────────────────

# Safe: parameterized query
cursor.execute(
    "SELECT * FROM users WHERE name = %s",
    (user_input,)
)

String concatenation lets the attacker close the string and inject arbitrary SQL. Parameterized queries keep the user input as data — the database never parses it as SQL.

Blind injection: vulnerable vs safe

-- Boolean-based: does the page load differently?
' AND (SELECT COUNT(*) FROM users) > 100 --

-- Time-based: does the response take 5 seconds?
' AND pg_sleep(5) --

-- If the response is delayed, the attacker knows
-- the injected SQL executed successfully.

# ────────────────────────

# Also safe: using an ORM
User.objects.filter(name=user_input)
# Django ORM parameterizes automatically

Blind injection works without visible output. The attacker probes by timing or by checking how the page renders. ORM query builders parameterize automatically and shut the door.

Union-based injection: vulnerable vs safe (defense in depth)

-- Original query returns 3 columns
SELECT id, name, email FROM products WHERE id = 1

-- Attacker input: 1 UNION SELECT username, password, null FROM admin_users
-- Resulting query:
SELECT id, name, email FROM products WHERE id = 1
UNION SELECT username, password, null FROM admin_users
-- Now the admin credentials appear in the product listing.

-- ────────────────────────

-- Defense in depth: restrict DB user permissions
GRANT SELECT ON products TO app_readonly;
-- app_readonly cannot access admin_users table
-- Even if injection worked, the query would fail

UNION-based attacks let the attacker pull data from tables your application never queries. Parameterized queries block the injection. Least-privilege database users limit blast radius if anything slips through.

Second-order injection: vulnerable vs safe

# Step 1: Safe insertion (parameterized)
INSERT INTO users (name) VALUES (%s)  # stores: Robert'; DROP TABLE users;--

# Step 2: Unsafe downstream query (string concat)
query = f"SELECT * FROM logs WHERE user_name = '{row['name']}'"
# The stored value is now injected into a new query.

# ────────────────────────

# Safe downstream query
cursor.execute(
    "SELECT * FROM logs WHERE user_name = %s",
    (row['name'],)
)
# Parameterized even though the input came from our DB

The original INSERT was safe, but the downstream SELECT trusted the stored value. Parameterize every query — even ones reading 'your own' data. The DB doesn't remember which strings were once user input.

Why data engineers should care

Most SQL injection training focuses on web applications. DEs build systems that ingest data from APIs, partner feeds, file uploads, and user-generated content. These are all attack surfaces. The four most common DE-specific risks.

Dynamic SQL in ETL scripts

ETL scripts that build SQL strings dynamically — especially when table names, column names, or filter values come from configuration files, API responses, or upstream systems. If upstream is compromised, the injected value flows into your pipeline. Fix: parameterize values. For identifiers (table/column names), validate against an allowlist. Most drivers cannot parameterize identifiers.

API-driven query builders

REST or GraphQL APIs that accept filter parameters and build SQL from them. A caller passes sort_by=name; DROP TABLE users and your query builder injects it directly. Fix: parameterize values. Validate and allowlist column names for ORDER BY and GROUP BY. Never pass raw API input into SQL without validation.

Templated Airflow queries

Airflow DAGs that use Jinja templating to inject values into SQL. The {{ params.date }} syntax does not parameterize — it performs string substitution before the query reaches the database. Fix: use operator parameters (parameters argument in PostgresOperator). For dynamic identifiers, validate in a Python task before passing to SQL.

Bulk COPY or INSERT from files

Loading CSV or JSON files using COPY or bulk INSERT. If file contents are adversarial (a CSV field contains a SQL injection payload), the COPY itself is safe — it treats data as literal values. The risk is downstream queries that read those values without parameterization. Fix: ensure all downstream SQL reading loaded data is parameterized.

Dynamic SQL in ETL scripts

# BAD: dynamic table name from config
table = config['source_table']
query = f"SELECT * FROM {table}"

# GOOD: validate against allowlist
ALLOWED_TABLES = {'users', 'orders', 'products'}
if table not in ALLOWED_TABLES:
    raise ValueError(f"Unknown table: {table}")
query = f"SELECT * FROM {table}"  # safe after validation

You cannot parameterize identifiers in most drivers. Validate them against an allowlist before interpolating.

API-driven query builders

# BAD: API parameter directly in ORDER BY
sort_col = request.args.get('sort_by', 'id')
query = f"SELECT * FROM users ORDER BY {sort_col}"

# GOOD: allowlist the column
SORTABLE = {'id', 'name', 'created_at'}
if sort_col not in SORTABLE:
    sort_col = 'id'
query = f"SELECT * FROM users ORDER BY {sort_col}"

Allowlist ORDER BY columns. A caller who controls the sort column controls the query plan and can probe for additional vulnerabilities.

Templated Airflow queries

# BAD: Jinja template (string substitution)
sql = "SELECT * FROM events WHERE dt = '{{ params.date }}'"

# GOOD: use parameters argument
PostgresOperator(
    task_id='query_events',
    sql="SELECT * FROM events WHERE dt = %(date)s",
    parameters={'date': '2026-01-15'},
)

Jinja is string substitution, not parameterization. Operators expose a parameters argument for a reason — use it for values, and validate any dynamic identifiers in a Python task upstream.

Bulk COPY or INSERT from files

# COPY itself is safe (treats all data as literal values)
COPY staging_table FROM '/data/upload.csv' WITH (FORMAT csv);

# But downstream processing MUST be parameterized
cursor.execute(
    "INSERT INTO clean_table SELECT * FROM staging_table WHERE status = %s",
    ('active',)
)

COPY treats every byte as literal data. The risk is the next query that reads the staging table. Parameterize it the same as any other query.

SQLAlchemy (Python)

# Safe: ORM query
session.query(User).filter(User.name == user_input).all()

# Safe: text() with bind parameters
from sqlalchemy import text
session.execute(text("SELECT * FROM users WHERE name = :name"), {"name": user_input})

# UNSAFE: raw string formatting
session.execute(f"SELECT * FROM users WHERE name = '{user_input}'")

session.query and text() with bind parameters are both safe. f-strings against execute() are not.

Django ORM (Python)

# Safe: ORM query
User.objects.filter(name=user_input)

# Safe: raw query with parameters
User.objects.raw("SELECT * FROM users WHERE name = %s", [user_input])

# UNSAFE: string formatting in raw query
User.objects.raw(f"SELECT * FROM users WHERE name = '{user_input}'")

ORM filter() is always safe. raw() is safe only if you pass parameters as the second argument — never with f-strings.

psycopg2 / asyncpg (Python)

# Safe: psycopg2 parameterized
cursor.execute("SELECT * FROM users WHERE name = %s", (user_input,))

# Safe: asyncpg parameterized
await conn.fetch("SELECT * FROM users WHERE name = $1", user_input)

# UNSAFE: string concatenation
cursor.execute("SELECT * FROM users WHERE name = '" + user_input + "'")

# UNSAFE: f-string
cursor.execute(f"SELECT * FROM users WHERE name = '{user_input}'")

Both drivers parameterize natively. psycopg2 uses %s placeholders, asyncpg uses $1, $2, etc. Never close them with string concatenation.

Prevention checklist

  • Every query uses parameterized values (no f-strings, no string concatenation).
  • Identifiers (table names, column names) are validated against allowlists.
  • Database users have least-privilege permissions (read-only where possible).
  • Airflow DAGs use operator parameters instead of Jinja substitution for values.
  • ORM usage avoids raw SQL string formatting.
  • Input from upstream systems is treated as untrusted, even from internal APIs.
  • Stored procedures validate inputs before building dynamic SQL.
  • CI/CD pipelines include static analysis for SQL injection patterns (e.g., bandit for Python).

SQL injection FAQ

Do data engineers need to worry about SQL injection?+
Yes. DEs write SQL that processes data from external sources: APIs, file uploads, partner feeds, user-generated content. If those inputs contain malicious SQL and your queries use string concatenation, you have the same vulnerability as a web app. The attack surface differs (ETL pipelines, Airflow DAGs, API-driven queries) but the root cause is identical.
Does using an ORM prevent SQL injection?+
ORMs prevent injection when you use their query builder methods (filter, where, etc.). They don't prevent injection when you bypass the ORM with raw SQL and string formatting. The ORM is a protection layer, not a guarantee. Call raw() or execute() with f-strings and you're back to square one.
What about NoSQL injection?+
NoSQL databases (MongoDB, DynamoDB, etc.) have their own injection vulnerabilities. MongoDB queries built from unsanitized JSON input can be manipulated with operators like $gt, $ne, $regex. The principle is the same: never trust external input, always use the driver's built-in parameterization.
Is SQL injection still a real threat in 2026?+
SQL injection has been in the OWASP Top 10 for over two decades. It persists because developers keep building queries with string concatenation. Automated scanners make these vulnerabilities trivial to find. The fix has been known since the 1990s — parameterize your queries. The problem isn't a lack of solutions; it's a lack of discipline.
02 / Why practice

Write SQL that won't get you paged

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Related guides