SQL Interview Deep Dive

SQL String Functions Practice

String functions are tested directly in fewer than 1% of interviews, but they show up in roughly 30% of take-home assignments where you need to parse, clean, or transform raw text fields.

Practice real data cleaning problems. Write the query, run it, and verify your output matches.

6
Function Families
30%
Of Take-Homes Need Them
5
Data Cleaning Problems

String Functions You Need to Know

Data engineering interviews test string manipulation in the context of data cleaning and ETL. These are the functions that matter.

LIKE / ILIKE

Pattern MatchingHigh usage
WHERE email LIKE '%@gmail.com'

Pattern matching with wildcards. % matches any sequence of characters, _ matches exactly one character. ILIKE is the case-insensitive variant in Postgres.

Common use cases

Filter emails by domainFind product names containing a keywordMatch phone number patterns with _ wildcardsCase-insensitive search with ILIKE

Interview tip: LIKE is straightforward, but interviewers test edge cases: what if the pattern itself contains % or _? You need to escape them with a backslash or use the ESCAPE clause.

CONCAT / ||

BuildingHigh usage
CONCAT(first_name, ' ', last_name) or first_name || ' ' || last_name

Joins strings together. CONCAT handles NULLs gracefully (treats them as empty strings). The || operator propagates NULLs in most dialects, which catches people off guard.

Common use cases

Build full names from first and last name columnsConstruct URLs or file paths from componentsCreate composite keys for joiningFormat output strings for reporting

Interview tip: The NULL behavior difference matters. CONCAT('hello', NULL) returns 'hello'. But 'hello' || NULL returns NULL in Postgres. This is a common interview gotcha.

SUBSTRING / SUBSTR

ExtractionMedium usage
SUBSTRING(column FROM 1 FOR 3) or SUBSTR(column, 1, 3)

Extracts a portion of a string by position. The FROM/FOR syntax is ANSI SQL. Most dialects also support the comma-separated form. Positions are 1-indexed.

Common use cases

Extract area codes from phone numbersParse fixed-width file fieldsGet the first N characters of a stringExtract year from a date stored as text

Interview tip: SQL strings are 1-indexed, not 0-indexed like most programming languages. SUBSTRING('hello' FROM 1 FOR 2) returns 'he'. Getting this wrong wastes time in interviews.

SPLIT_PART

ParsingHigh usage
SPLIT_PART(email, '@', 2)

Splits a string by a delimiter and returns the Nth part. This is the SQL equivalent of Python's split(). Available in Postgres, Snowflake, and Trino. BigQuery uses SPLIT() which returns an array.

Common use cases

Extract domain from email addressesParse CSV values stored in a single columnGet path components from URLsSplit full names into first and last

Interview tip: SPLIT_PART returns an empty string (not NULL) if the requested part index is out of range. This matters when you are using COALESCE or NULL checks downstream.

REPLACE / TRANSLATE

CleaningMedium usage
REPLACE(phone, '-', '') or TRANSLATE(phone, '()-. ', '')

REPLACE swaps one substring for another. TRANSLATE maps characters one-to-one (like Python's str.translate). TRANSLATE is faster for removing multiple single characters at once.

Common use cases

Strip formatting from phone numbersNormalize whitespace in user inputRemove currency symbols from amount stringsClean up delimiter inconsistencies in raw data

Interview tip: Data cleaning questions often require chaining multiple REPLACE calls or using TRANSLATE for character-level cleanup. Know both and pick the right one.

Regex (REGEXP_REPLACE, REGEXP_MATCHES)

AdvancedLow usage
REGEXP_REPLACE(text, '[^0-9]', '', 'g')

Full regular expression support for matching and replacing. Postgres uses POSIX regex. BigQuery uses RE2. The 'g' flag means global replacement. Without it, only the first match is replaced.

Common use cases

Extract numbers from mixed alphanumeric stringsValidate email or phone number formatsParse structured text like log linesRemove all non-alphanumeric characters

Interview tip: Regex in SQL is a power tool, but most interviewers will accept LIKE or SPLIT_PART if they solve the problem. Reach for regex only when simpler functions cannot do the job.

String Manipulation in Data Engineering Interviews

String questions in interviews are almost always about data quality. You get a messy table and need to clean it before you can join, aggregate, or report on it. The interviewer is testing whether you can handle real-world data, not textbook data.

The most common scenario: two tables that should join on a shared field, but the formatting is inconsistent. Phone numbers with and without dashes. Names with mixed casing. Addresses with abbreviations. Your job is to normalize both sides so the join works.

A less obvious but equally important use case: parsing semi-structured data stored in string columns. Log messages, URL parameters, and CSV fields packed into a single column all require SPLIT_PART or regex to extract the pieces you need. This pattern is extremely common in data engineering pipelines.

The key skill is knowing which function to reach for. REPLACE handles simple substitutions. SPLIT_PART handles delimited data. Regex handles everything else but is harder to read. Interviewers want to see you pick the right tool, not the fanciest one.

Practice Problems

Five data cleaning problems that test string functions in realistic scenarios.

1

Clean Phone Numbers for Joining

Medium

Table A has phone numbers formatted as '(555) 123-4567'. Table B has them as '5551234567'. Write a query that joins the tables on phone number after normalizing both formats to digits only.

Show hint

Use REGEXP_REPLACE(phone, '[^0-9]', '', 'g') to strip all non-digit characters, or chain REPLACE calls to remove parentheses, spaces, and dashes.

2

Extract Domain and TLD from URLs

Hard

Given a table of URLs (e.g., 'https://subdomain.example.co.uk/path'), extract the domain name ('example') and the top-level domain ('co.uk') as separate columns.

Show hint

Use SPLIT_PART to break on '/' and get the hostname. Then work backwards from the right to separate the domain from the TLD. Multi-part TLDs like .co.uk make this harder than it looks.

3

Parse Key-Value Log Strings

Hard

A logs table has a message column with entries like 'user_id=42 action=click page=/home duration=350ms'. Parse this into separate columns: user_id, action, page, duration_ms (as integer).

Show hint

Use SPLIT_PART with space delimiter to isolate each pair, then SPLIT_PART again with '=' to get the value. Cast duration after stripping 'ms' with REPLACE.

4

Fuzzy Name Matching

Medium

Two tables have customer names that may differ in casing, spacing, and punctuation ('John O\'Brien' vs 'john obrien'). Write a query that matches customers across tables using normalized names.

Show hint

Chain LOWER(), REPLACE (to strip apostrophes and hyphens), and REGEXP_REPLACE to collapse multiple spaces. Then join on the normalized result.

5

Email Validation Report

Medium

Given a users table, flag emails that are invalid. Invalid means: no @ symbol, more than one @ symbol, domain has no dot, or local part is empty. Return user_id, email, and a reason column explaining why each email is invalid.

Show hint

Use SPLIT_PART on '@' and check lengths. POSITION('@' IN email) finds the first @. LENGTH(email) - LENGTH(REPLACE(email, '@', '')) counts @ symbols.

Worked Example: Extract Email Domain and Count Users Per Domain

Given a users table with an email column, extract the domain from each email address, then count how many users belong to each domain. Exclude rows where email is NULL.

SELECT
  SPLIT_PART(email, '@', 2)        AS domain,
  COUNT(*)                         AS user_count
FROM users
WHERE email IS NOT NULL
  AND POSITION('@' IN email) > 0
GROUP BY SPLIT_PART(email, '@', 2)
ORDER BY user_count DESC
LIMIT 10;

SPLIT_PART splits the email on the @ character and returns the second piece (the domain). The POSITION check guards against malformed emails that have no @ symbol. This is a common data cleaning step before joining user data to a domain-level reference table.

Expected output

 domain          | user_count
-----------------+-----------
 gmail.com       |       1842
 yahoo.com       |        567
 outlook.com     |        334
 company.io      |        201

String Functions FAQ

How often do SQL string functions come up in data engineering interviews?+
String functions appear in roughly 20-30% of SQL interviews, usually as part of a data cleaning or ETL question rather than a standalone topic. The most common pattern is: raw data has inconsistent formatting and you need to normalize it before joining or aggregating. SPLIT_PART, REPLACE, and LIKE cover most of these cases.
Should I use regex or simpler string functions in an interview?+
Start with the simplest function that solves the problem. LIKE, SPLIT_PART, and REPLACE are easier to read, less error-prone, and show that you pick the right tool for the job. Use regex only when the pattern is too complex for basic functions. Interviewers value clarity over cleverness.
What are the biggest differences in string functions across SQL dialects?+
The main differences: Postgres uses || for concatenation and has SPLIT_PART. MySQL uses CONCAT() and SUBSTRING_INDEX instead of SPLIT_PART. BigQuery uses SPLIT() which returns an array, and REGEXP_EXTRACT instead of REGEXP_MATCHES. Snowflake is closest to Postgres. Always state your dialect at the start of the interview.
How do NULL values interact with string functions?+
CONCAT treats NULLs as empty strings. The || operator returns NULL if any operand is NULL. LENGTH(NULL) returns NULL. COALESCE(column, '') is the standard way to handle NULLs before string operations. Forgetting this causes silent data loss in joins and filters.

Clean Messy Data with Real SQL

Data cleaning is a core skill, not a nice-to-have. Practice string manipulation with real SQL execution and immediate feedback.