Security

SQL Injection Cheat Sheet

A travel startup we know shipped an Airflow DAG that ingested CSV filenames from an S3 bucket and interpolated them into a DROP PARTITION statement. A file named "2024-01-01'; DROP TABLE bookings; --.csv" landed in the bucket via a misconfigured upload form and took down their warehouse for six hours. String concatenation in SQL is that easy to weaponize. This sheet shows you exactly which patterns to rip out of your code before the next incident review.

Every injection type. Vulnerable and safe side-by-side. Pipeline-specific risks DEs miss.

#3

OWASP Top 10 rank

4

Injection types covered

1

Fix: parameterize

275

Companies in our dataset

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

The one rule that prevents SQL injection

Use parameterized queries for all values. Validate identifiers against allowlists. That is it. Every injection type on this page exploits the same root cause: treating user input as executable SQL instead of data.

# ALWAYS this (parameterized)
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))

# NEVER this (string concatenation)
cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")

Types of SQL Injection

Four flavors, one root cause. A SOC analyst we worked with pulled the postmortem on a breach at a logistics company: the attacker used a blind time-based injection through a reporting endpoint that took weeks to detect because the logs looked normal. All four types below need to go on your threat model.

High Risk

Classic (In-Band) Injection

The attacker modifies a SQL query through user input, and the results are returned directly in the application response. This is the textbook example: a login form where entering ' OR 1=1 -- bypasses authentication. The injected SQL runs in the same channel as the original query.

Vulnerable example:

-- Vulnerable query (string concatenation)
query = "SELECT * FROM users WHERE name = '" + user_input + "'"

-- If user_input = "' OR 1=1 --"
-- Resulting query:
SELECT * FROM users WHERE name = '' OR 1=1 --'
-- Returns ALL rows. Authentication bypassed.

Use parameterized queries. Never concatenate user input into SQL strings. The database engine treats parameters as data, not executable code.

-- Safe: parameterized query
cursor.execute(
    "SELECT * FROM users WHERE name = %s",
    (user_input,)
)
Medium-High Risk

Blind Injection

The attacker cannot see query results directly, but can infer information based on the application's behavior. Boolean-based blind injection checks whether the page changes. Time-based blind injection uses SLEEP or pg_sleep to confirm a condition is true by measuring response time. Slower to exploit, but still dangerous.

Vulnerable example:

-- Boolean-based: does the page load differently?
' AND (SELECT COUNT(*) FROM users) > 100 --

-- Time-based: does the response take 5 seconds?
' AND pg_sleep(5) --

-- If the response is delayed, the attacker knows
-- the injected SQL executed successfully.

Same fix: parameterized queries. Blind injection exploits the same root cause as classic injection. If inputs are parameterized, there is nothing to exploit.

-- Also safe: using an ORM
User.objects.filter(name=user_input)
# Django ORM parameterizes automatically
High Risk

Union-Based Injection

The attacker appends a UNION SELECT to the original query to extract data from other tables. This requires knowing (or guessing) the number of columns in the original query. Once the column count matches, the attacker can pull data from any table the database user has access to.

Vulnerable example:

-- Original query returns 3 columns
SELECT id, name, email FROM products WHERE id = 1

-- Attacker input: 1 UNION SELECT username, password, null FROM admin_users
-- Resulting query:
SELECT id, name, email FROM products WHERE id = 1
UNION SELECT username, password, null FROM admin_users
-- Now the admin credentials appear in the product listing.

Parameterized queries block this entirely. Pair them with least-privilege database users so that even if injection occurred, the attacker could not access sensitive tables.

-- Defense in depth: restrict DB user permissions
GRANT SELECT ON products TO app_readonly;
-- app_readonly cannot access admin_users table
-- Even if injection worked, the query would fail
Medium Risk

Second-Order Injection

The malicious input is stored in the database first and executed later when a different query reads it. This is particularly relevant for data engineers: a value enters through one pipeline, gets stored, and then a downstream query uses it unsafely. The original insertion may be safe, but the later consumption is not.

Vulnerable example:

-- Step 1: Safe insertion (parameterized)
INSERT INTO users (name) VALUES (%s)  -- stores: Robert'; DROP TABLE users;--

-- Step 2: Unsafe downstream query (string concat)
query = f"SELECT * FROM logs WHERE user_name = '{row['name']}'"
-- The stored value is now injected into a new query.

Parameterize every query, not just the ones that touch user-facing input. Downstream queries that read from the database must also be parameterized. Data engineers often miss this because the data 'came from our own database.'

-- Safe downstream query
cursor.execute(
    "SELECT * FROM logs WHERE user_name = %s",
    (row['name'],)
)
# Parameterized even though the input came from our DB

Why Data Engineers Should Care

Most SQL injection training focuses on web applications. But data engineers build systems that ingest data from APIs, partner feeds, file uploads, and user-generated content. These are all attack surfaces. Here are the four most common DE-specific risks.

Dynamic SQL in ETL scripts

ETL scripts that build SQL strings dynamically, especially when table names, column names, or filter values come from configuration files, API responses, or upstream systems. If an upstream system is compromised, the injected value flows into your pipeline.

Use parameterized queries for values. For identifiers (table/column names), validate against an allowlist. Never pass identifiers through query parameters because most drivers do not support parameterized identifiers.

# BAD: dynamic table name from config
table = config['source_table']
query = f"SELECT * FROM {table}"

# GOOD: validate against allowlist
ALLOWED_TABLES = {'users', 'orders', 'products'}
if table not in ALLOWED_TABLES:
    raise ValueError(f"Unknown table: {table}")
query = f"SELECT * FROM {table}"  # safe after validation

API-driven query builders

REST or GraphQL APIs that accept filter parameters and build SQL queries from them. A caller passes sort_by=name; DROP TABLE users and your query builder injects it directly.

Parameterize values. Validate and allowlist column names for ORDER BY and GROUP BY clauses. Never pass raw API input into SQL without validation.

# BAD: API parameter directly in ORDER BY
sort_col = request.args.get('sort_by', 'id')
query = f"SELECT * FROM users ORDER BY {sort_col}"

# GOOD: allowlist the column
SORTABLE = {'id', 'name', 'created_at'}
if sort_col not in SORTABLE:
    sort_col = 'id'
query = f"SELECT * FROM users ORDER BY {sort_col}"

Templated Airflow queries

Airflow DAGs that use Jinja templating to inject values into SQL. The {{ params.date }} syntax does not parameterize the query. It performs string substitution before the query reaches the database.

Use Airflow operators that support parameterized queries (parameters argument in PostgresOperator). For dynamic identifiers, validate them in a Python task before passing to SQL.

# BAD: Jinja template (string substitution)
sql = "SELECT * FROM events WHERE dt = '{{ params.date }}'"

# GOOD: use parameters argument
PostgresOperator(
    task_id='query_events',
    sql="SELECT * FROM events WHERE dt = %(date)s",
    parameters={'date': '2026-01-15'},
)

Bulk COPY or INSERT from files

Loading CSV or JSON files into the database using COPY or bulk INSERT. If the file contents are adversarial (e.g., a CSV field contains SQL injection payload), the COPY command itself is safe, but any downstream query that reads those values without parameterization is vulnerable.

COPY and bulk INSERT commands are generally safe because they treat all data as literal values. The risk comes from downstream queries. Ensure all downstream SQL that reads the loaded data is parameterized.

# COPY itself is safe
COPY staging_table FROM '/data/upload.csv' WITH (FORMAT csv);

# But downstream processing MUST be parameterized
cursor.execute(
    "INSERT INTO clean_table SELECT * FROM staging_table WHERE status = %s",
    ('active',)
)

ORM and Driver Protection

ORMs protect you when you use their query builders. They do not protect you when you bypass them with raw SQL. Here is the safe and unsafe version for each common Python tool.

SQLAlchemy (Python)

Safe
# Safe: ORM query
session.query(User).filter(User.name == user_input).all()

# Safe: text() with bind parameters
from sqlalchemy import text
session.execute(text("SELECT * FROM users WHERE name = :name"), {"name": user_input})
Unsafe
# UNSAFE: raw string formatting
session.execute(f"SELECT * FROM users WHERE name = '{user_input}'")

Django ORM (Python)

Safe
# Safe: ORM query
User.objects.filter(name=user_input)

# Safe: raw query with parameters
User.objects.raw("SELECT * FROM users WHERE name = %s", [user_input])
Unsafe
# UNSAFE: string formatting in raw query
User.objects.raw(f"SELECT * FROM users WHERE name = '{user_input}'")

psycopg2 / asyncpg (Python)

Safe
# Safe: parameterized query
cursor.execute("SELECT * FROM users WHERE name = %s", (user_input,))

# Safe: asyncpg
await conn.fetch("SELECT * FROM users WHERE name = $1", user_input)
Unsafe
# UNSAFE: string concatenation
cursor.execute("SELECT * FROM users WHERE name = '" + user_input + "'")

# UNSAFE: f-string
cursor.execute(f"SELECT * FROM users WHERE name = '{user_input}'")

Prevention Checklist

Every query uses parameterized values (no f-strings, no string concatenation)
Identifiers (table names, column names) are validated against allowlists
Database users have least-privilege permissions (read-only where possible)
Airflow DAGs use operator parameters instead of Jinja template substitution for values
ORM usage avoids raw SQL string formatting
Input from upstream systems is treated as untrusted (even internal APIs)
Stored procedures validate inputs before building dynamic SQL
CI/CD pipelines include static analysis for SQL injection patterns (e.g., bandit for Python)

SQL Injection FAQ

Do data engineers need to worry about SQL injection?+
Yes. Data engineers write SQL that processes data from external sources: APIs, file uploads, partner feeds, user-generated content. If any of those inputs contain malicious SQL and your queries use string concatenation, you have the same vulnerability as a web application. The attack surface is different (ETL pipelines, Airflow DAGs, API-driven queries) but the root cause is identical.
Does using an ORM prevent SQL injection?+
ORMs prevent injection when you use their query builder methods (filter, where, etc.). They do not prevent injection when you bypass the ORM and write raw SQL with string formatting. The ORM is a protection layer, not a guarantee. If you call raw() or execute() with f-strings, you are back to square one.
What about NoSQL injection?+
NoSQL databases (MongoDB, DynamoDB, etc.) have their own injection vulnerabilities. MongoDB queries built from unsanitized JSON input can be manipulated with operators like $gt, $ne, and $regex. The principle is the same: never trust external input, always use the driver's built-in parameterization.
Is SQL injection still a real threat in 2026?+
SQL injection has been in the OWASP Top 10 for over two decades. It persists because developers (and data engineers) keep building queries with string concatenation. Automated scanning tools make it trivial to find these vulnerabilities. The fix has been known since the 1990s: parameterize your queries. The problem is not a lack of solutions. It is a lack of discipline.

Write SQL That Won't Get You Paged

The best defense is knowing the grammar cold. Grind real queries until injection patterns look wrong on sight.