A travel startup we know shipped an Airflow DAG that ingested CSV filenames from an S3 bucket and interpolated them into a DROP PARTITION statement. A file named "2024-01-01'; DROP TABLE bookings; --.csv" landed in the bucket via a misconfigured upload form and took down their warehouse for six hours. String concatenation in SQL is that easy to weaponize. This sheet shows you exactly which patterns to rip out of your code before the next incident review.
Every injection type. Vulnerable and safe side-by-side. Pipeline-specific risks DEs miss.
OWASP Top 10 rank
Injection types covered
Fix: parameterize
Companies in our dataset
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Use parameterized queries for all values. Validate identifiers against allowlists. That is it. Every injection type on this page exploits the same root cause: treating user input as executable SQL instead of data.
# ALWAYS this (parameterized)
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
# NEVER this (string concatenation)
cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")Four flavors, one root cause. A SOC analyst we worked with pulled the postmortem on a breach at a logistics company: the attacker used a blind time-based injection through a reporting endpoint that took weeks to detect because the logs looked normal. All four types below need to go on your threat model.
The attacker modifies a SQL query through user input, and the results are returned directly in the application response. This is the textbook example: a login form where entering ' OR 1=1 -- bypasses authentication. The injected SQL runs in the same channel as the original query.
Vulnerable example:
-- Vulnerable query (string concatenation) query = "SELECT * FROM users WHERE name = '" + user_input + "'" -- If user_input = "' OR 1=1 --" -- Resulting query: SELECT * FROM users WHERE name = '' OR 1=1 --' -- Returns ALL rows. Authentication bypassed.
Use parameterized queries. Never concatenate user input into SQL strings. The database engine treats parameters as data, not executable code.
-- Safe: parameterized query
cursor.execute(
"SELECT * FROM users WHERE name = %s",
(user_input,)
)The attacker cannot see query results directly, but can infer information based on the application's behavior. Boolean-based blind injection checks whether the page changes. Time-based blind injection uses SLEEP or pg_sleep to confirm a condition is true by measuring response time. Slower to exploit, but still dangerous.
Vulnerable example:
-- Boolean-based: does the page load differently? ' AND (SELECT COUNT(*) FROM users) > 100 -- -- Time-based: does the response take 5 seconds? ' AND pg_sleep(5) -- -- If the response is delayed, the attacker knows -- the injected SQL executed successfully.
Same fix: parameterized queries. Blind injection exploits the same root cause as classic injection. If inputs are parameterized, there is nothing to exploit.
-- Also safe: using an ORM User.objects.filter(name=user_input) # Django ORM parameterizes automatically
The attacker appends a UNION SELECT to the original query to extract data from other tables. This requires knowing (or guessing) the number of columns in the original query. Once the column count matches, the attacker can pull data from any table the database user has access to.
Vulnerable example:
-- Original query returns 3 columns SELECT id, name, email FROM products WHERE id = 1 -- Attacker input: 1 UNION SELECT username, password, null FROM admin_users -- Resulting query: SELECT id, name, email FROM products WHERE id = 1 UNION SELECT username, password, null FROM admin_users -- Now the admin credentials appear in the product listing.
Parameterized queries block this entirely. Pair them with least-privilege database users so that even if injection occurred, the attacker could not access sensitive tables.
-- Defense in depth: restrict DB user permissions GRANT SELECT ON products TO app_readonly; -- app_readonly cannot access admin_users table -- Even if injection worked, the query would fail
The malicious input is stored in the database first and executed later when a different query reads it. This is particularly relevant for data engineers: a value enters through one pipeline, gets stored, and then a downstream query uses it unsafely. The original insertion may be safe, but the later consumption is not.
Vulnerable example:
-- Step 1: Safe insertion (parameterized)
INSERT INTO users (name) VALUES (%s) -- stores: Robert'; DROP TABLE users;--
-- Step 2: Unsafe downstream query (string concat)
query = f"SELECT * FROM logs WHERE user_name = '{row['name']}'"
-- The stored value is now injected into a new query.Parameterize every query, not just the ones that touch user-facing input. Downstream queries that read from the database must also be parameterized. Data engineers often miss this because the data 'came from our own database.'
-- Safe downstream query
cursor.execute(
"SELECT * FROM logs WHERE user_name = %s",
(row['name'],)
)
# Parameterized even though the input came from our DBMost SQL injection training focuses on web applications. But data engineers build systems that ingest data from APIs, partner feeds, file uploads, and user-generated content. These are all attack surfaces. Here are the four most common DE-specific risks.
ETL scripts that build SQL strings dynamically, especially when table names, column names, or filter values come from configuration files, API responses, or upstream systems. If an upstream system is compromised, the injected value flows into your pipeline.
Use parameterized queries for values. For identifiers (table/column names), validate against an allowlist. Never pass identifiers through query parameters because most drivers do not support parameterized identifiers.
# BAD: dynamic table name from config
table = config['source_table']
query = f"SELECT * FROM {table}"
# GOOD: validate against allowlist
ALLOWED_TABLES = {'users', 'orders', 'products'}
if table not in ALLOWED_TABLES:
raise ValueError(f"Unknown table: {table}")
query = f"SELECT * FROM {table}" # safe after validationREST or GraphQL APIs that accept filter parameters and build SQL queries from them. A caller passes sort_by=name; DROP TABLE users and your query builder injects it directly.
Parameterize values. Validate and allowlist column names for ORDER BY and GROUP BY clauses. Never pass raw API input into SQL without validation.
# BAD: API parameter directly in ORDER BY
sort_col = request.args.get('sort_by', 'id')
query = f"SELECT * FROM users ORDER BY {sort_col}"
# GOOD: allowlist the column
SORTABLE = {'id', 'name', 'created_at'}
if sort_col not in SORTABLE:
sort_col = 'id'
query = f"SELECT * FROM users ORDER BY {sort_col}"Airflow DAGs that use Jinja templating to inject values into SQL. The {{ params.date }} syntax does not parameterize the query. It performs string substitution before the query reaches the database.
Use Airflow operators that support parameterized queries (parameters argument in PostgresOperator). For dynamic identifiers, validate them in a Python task before passing to SQL.
# BAD: Jinja template (string substitution)
sql = "SELECT * FROM events WHERE dt = '{{ params.date }}'"
# GOOD: use parameters argument
PostgresOperator(
task_id='query_events',
sql="SELECT * FROM events WHERE dt = %(date)s",
parameters={'date': '2026-01-15'},
)Loading CSV or JSON files into the database using COPY or bulk INSERT. If the file contents are adversarial (e.g., a CSV field contains SQL injection payload), the COPY command itself is safe, but any downstream query that reads those values without parameterization is vulnerable.
COPY and bulk INSERT commands are generally safe because they treat all data as literal values. The risk comes from downstream queries. Ensure all downstream SQL that reads the loaded data is parameterized.
# COPY itself is safe
COPY staging_table FROM '/data/upload.csv' WITH (FORMAT csv);
# But downstream processing MUST be parameterized
cursor.execute(
"INSERT INTO clean_table SELECT * FROM staging_table WHERE status = %s",
('active',)
)ORMs protect you when you use their query builders. They do not protect you when you bypass them with raw SQL. Here is the safe and unsafe version for each common Python tool.
# Safe: ORM query
session.query(User).filter(User.name == user_input).all()
# Safe: text() with bind parameters
from sqlalchemy import text
session.execute(text("SELECT * FROM users WHERE name = :name"), {"name": user_input})# UNSAFE: raw string formatting
session.execute(f"SELECT * FROM users WHERE name = '{user_input}'")# Safe: ORM query
User.objects.filter(name=user_input)
# Safe: raw query with parameters
User.objects.raw("SELECT * FROM users WHERE name = %s", [user_input])# UNSAFE: string formatting in raw query
User.objects.raw(f"SELECT * FROM users WHERE name = '{user_input}'")# Safe: parameterized query
cursor.execute("SELECT * FROM users WHERE name = %s", (user_input,))
# Safe: asyncpg
await conn.fetch("SELECT * FROM users WHERE name = $1", user_input)# UNSAFE: string concatenation
cursor.execute("SELECT * FROM users WHERE name = '" + user_input + "'")
# UNSAFE: f-string
cursor.execute(f"SELECT * FROM users WHERE name = '{user_input}'")The best defense is knowing the grammar cold. Grind real queries until injection patterns look wrong on sight.