SQL Interview Deep Dive
String functions are tested directly in fewer than 1% of interviews, but they show up in roughly 30% of take-home assignments where you need to parse, clean, or transform raw text fields.
Practice real data cleaning problems. Write the query, run it, and verify your output matches.
Data engineering interviews test string manipulation in the context of data cleaning and ETL. These are the functions that matter.
WHERE email LIKE '%@gmail.com'Pattern matching with wildcards. % matches any sequence of characters, _ matches exactly one character. ILIKE is the case-insensitive variant in Postgres.
Interview tip: LIKE is straightforward, but interviewers test edge cases: what if the pattern itself contains % or _? You need to escape them with a backslash or use the ESCAPE clause.
CONCAT(first_name, ' ', last_name) or first_name || ' ' || last_nameJoins strings together. CONCAT handles NULLs gracefully (treats them as empty strings). The || operator propagates NULLs in most dialects, which catches people off guard.
Interview tip: The NULL behavior difference matters. CONCAT('hello', NULL) returns 'hello'. But 'hello' || NULL returns NULL in Postgres. This is a common interview gotcha.
SUBSTRING(column FROM 1 FOR 3) or SUBSTR(column, 1, 3)Extracts a portion of a string by position. The FROM/FOR syntax is ANSI SQL. Most dialects also support the comma-separated form. Positions are 1-indexed.
Interview tip: SQL strings are 1-indexed, not 0-indexed like most programming languages. SUBSTRING('hello' FROM 1 FOR 2) returns 'he'. Getting this wrong wastes time in interviews.
SPLIT_PART(email, '@', 2)Splits a string by a delimiter and returns the Nth part. This is the SQL equivalent of Python's split(). Available in Postgres, Snowflake, and Trino. BigQuery uses SPLIT() which returns an array.
Interview tip: SPLIT_PART returns an empty string (not NULL) if the requested part index is out of range. This matters when you are using COALESCE or NULL checks downstream.
REPLACE(phone, '-', '') or TRANSLATE(phone, '()-. ', '')REPLACE swaps one substring for another. TRANSLATE maps characters one-to-one (like Python's str.translate). TRANSLATE is faster for removing multiple single characters at once.
Interview tip: Data cleaning questions often require chaining multiple REPLACE calls or using TRANSLATE for character-level cleanup. Know both and pick the right one.
REGEXP_REPLACE(text, '[^0-9]', '', 'g')Full regular expression support for matching and replacing. Postgres uses POSIX regex. BigQuery uses RE2. The 'g' flag means global replacement. Without it, only the first match is replaced.
Interview tip: Regex in SQL is a power tool, but most interviewers will accept LIKE or SPLIT_PART if they solve the problem. Reach for regex only when simpler functions cannot do the job.
String questions in interviews are almost always about data quality. You get a messy table and need to clean it before you can join, aggregate, or report on it. The interviewer is testing whether you can handle real-world data, not textbook data.
The most common scenario: two tables that should join on a shared field, but the formatting is inconsistent. Phone numbers with and without dashes. Names with mixed casing. Addresses with abbreviations. Your job is to normalize both sides so the join works.
A less obvious but equally important use case: parsing semi-structured data stored in string columns. Log messages, URL parameters, and CSV fields packed into a single column all require SPLIT_PART or regex to extract the pieces you need. This pattern is extremely common in data engineering pipelines.
The key skill is knowing which function to reach for. REPLACE handles simple substitutions. SPLIT_PART handles delimited data. Regex handles everything else but is harder to read. Interviewers want to see you pick the right tool, not the fanciest one.
Five data cleaning problems that test string functions in realistic scenarios.
Table A has phone numbers formatted as '(555) 123-4567'. Table B has them as '5551234567'. Write a query that joins the tables on phone number after normalizing both formats to digits only.
Use REGEXP_REPLACE(phone, '[^0-9]', '', 'g') to strip all non-digit characters, or chain REPLACE calls to remove parentheses, spaces, and dashes.
Given a table of URLs (e.g., 'https://subdomain.example.co.uk/path'), extract the domain name ('example') and the top-level domain ('co.uk') as separate columns.
Use SPLIT_PART to break on '/' and get the hostname. Then work backwards from the right to separate the domain from the TLD. Multi-part TLDs like .co.uk make this harder than it looks.
A logs table has a message column with entries like 'user_id=42 action=click page=/home duration=350ms'. Parse this into separate columns: user_id, action, page, duration_ms (as integer).
Use SPLIT_PART with space delimiter to isolate each pair, then SPLIT_PART again with '=' to get the value. Cast duration after stripping 'ms' with REPLACE.
Two tables have customer names that may differ in casing, spacing, and punctuation ('John O\'Brien' vs 'john obrien'). Write a query that matches customers across tables using normalized names.
Chain LOWER(), REPLACE (to strip apostrophes and hyphens), and REGEXP_REPLACE to collapse multiple spaces. Then join on the normalized result.
Given a users table, flag emails that are invalid. Invalid means: no @ symbol, more than one @ symbol, domain has no dot, or local part is empty. Return user_id, email, and a reason column explaining why each email is invalid.
Use SPLIT_PART on '@' and check lengths. POSITION('@' IN email) finds the first @. LENGTH(email) - LENGTH(REPLACE(email, '@', '')) counts @ symbols.
Given a users table with an email column, extract the domain from each email address, then count how many users belong to each domain. Exclude rows where email is NULL.
SELECT
SPLIT_PART(email, '@', 2) AS domain,
COUNT(*) AS user_count
FROM users
WHERE email IS NOT NULL
AND POSITION('@' IN email) > 0
GROUP BY SPLIT_PART(email, '@', 2)
ORDER BY user_count DESC
LIMIT 10;SPLIT_PART splits the email on the @ character and returns the second piece (the domain). The POSITION check guards against malformed emails that have no @ symbol. This is a common data cleaning step before joining user data to a domain-level reference table.
Expected output
domain | user_count
-----------------+-----------
gmail.com | 1842
yahoo.com | 567
outlook.com | 334
company.io | 201Data cleaning is a core skill, not a nice-to-have. Practice string manipulation with real SQL execution and immediate feedback.