SQL String Functions Practice
String functions are tested directly in fewer than 1% of interviews, but they show up in roughly 30% of take-home assignments where you need to parse, clean, or transform raw text fields.
What this guide covers
String functions are tested directly in <1% of interviews but show up in ~30% of take-home assignments where you parse, clean, or transform raw text. Six function families cover the corpus: pattern matching (LIKE), building (CONCAT), extraction (SUBSTRING), parsing (SPLIT_PART), cleaning (REPLACE/TRANSLATE), regex.
String functions you need to know
DE interviews test string manipulation in the context of data cleaning and ETL. These are the functions that matter.
LIKE / ILIKE
Syntax: WHERE email LIKE '%@gmail.com'. Pattern matching with wildcards. % matches any sequence; _ matches exactly one character. ILIKE is the case-insensitive variant in Postgres. Use cases: filter emails by domain; find product names containing a keyword; match phone number patterns with _ wildcards; case-insensitive search with ILIKE. Interview tip: edge cases test what happens if the pattern itself contains % or _. You need to escape them with a backslash or use the ESCAPE clause.
CONCAT / ||
Syntax: CONCAT(first_name, ' ', last_name) or first_name || ' ' || last_name. Joins strings together. CONCAT handles NULLs gracefully (treats them as empty). The || operator propagates NULLs in most dialects. Use cases: build full names; construct URLs or file paths from components; create composite keys for joining; format output for reporting. Interview tip: NULL behavior difference matters. CONCAT('hello', NULL) returns 'hello'. But 'hello' || NULL returns NULL in Postgres. Common interview gotcha.
SUBSTRING / SUBSTR
Syntax: SUBSTRING(column FROM 1 FOR 3) or SUBSTR(column, 1, 3). Extracts a portion of a string by position. FROM/FOR is ANSI SQL; most dialects also support the comma-separated form. Positions are 1-indexed. Use cases: extract area codes; parse fixed-width file fields; first N characters; extract year from a date stored as text. Interview tip: SQL strings are 1-indexed, not 0-indexed. SUBSTRING('hello' FROM 1 FOR 2) returns 'he'. Getting this wrong wastes time.
SPLIT_PART
Syntax: SPLIT_PART(email, '@', 2). Splits a string by a delimiter, returns the Nth part. SQL equivalent of Python's split(). Available in Postgres, Snowflake, Trino. BigQuery uses SPLIT() returning an array. Use cases: extract domain from emails; parse CSV values; get path components from URLs; split full names. Interview tip: SPLIT_PART returns empty string (not NULL) if the requested index is out of range. Matters when using COALESCE or NULL checks downstream.
REPLACE / TRANSLATE
Syntax: REPLACE(phone, '-', '') or TRANSLATE(phone, '()-. ', ''). REPLACE swaps one substring for another. TRANSLATE maps characters one-to-one (like Python's str.translate). TRANSLATE is faster for removing multiple single characters at once. Use cases: strip formatting from phone numbers; normalize whitespace; remove currency symbols; clean up delimiters. Interview tip: data cleaning questions often require chaining REPLACE calls or using TRANSLATE for character-level cleanup. Know both.
Regex (REGEXP_REPLACE, REGEXP_MATCHES)
Syntax: REGEXP_REPLACE(text, '[^0-9]', '', 'g'). Full regex for matching and replacing. Postgres uses POSIX regex; BigQuery uses RE2. The 'g' flag means global replacement; without it, only the first match is replaced. Use cases: extract numbers from mixed alphanumeric strings; validate email/phone formats; parse log lines; remove all non-alphanumeric. Interview tip: regex is a power tool, but most interviewers accept LIKE or SPLIT_PART if they solve the problem. Reach for regex only when simpler functions can't.
String manipulation in DE interviews
String questions are almost always about data quality. You get a messy table and need to clean it before you can join, aggregate, or report. The interviewer is testing whether you can handle real-world data, not textbook data.
The most common scenario: two tables that should join on a shared field, but formatting is inconsistent. Phone numbers with and without dashes. Names with mixed casing. Addresses with abbreviations. Your job is to normalize both sides so the join works.
Less obvious but equally important: parsing semi-structured data stored in string columns. Log messages, URL parameters, and CSV fields packed into a single column all require SPLIT_PART or regex to extract the pieces. Extremely common in DE pipelines.
The key skill is knowing which function to reach for. REPLACE handles simple substitutions. SPLIT_PART handles delimited data. Regex handles everything else but is harder to read. Interviewers want to see you pick the right tool, not the fanciest one.
Five practice problems
Data cleaning problems that test string functions in realistic scenarios.
Clean phone numbers for joining
Table A has phones as '(555) 123-4567'. Table B as '5551234567'. Write a query that joins them after normalizing both to digits only. Hint: REGEXP_REPLACE(phone, '[^0-9]', '', 'g') strips all non-digits, or chain REPLACE calls.
Extract domain and TLD from URLs
Given a table of URLs (e.g., 'https://subdomain.example.co.uk/path'), extract the domain ('example') and TLD ('co.uk') as separate columns. Hint: SPLIT_PART on '/' for hostname, then work backwards from the right. Multi-part TLDs like .co.uk make this harder than it looks.
Parse key-value log strings
Logs have entries like 'user_id=42 action=click page=/home duration=350ms'. Parse into separate columns: user_id, action, page, duration_ms (integer). Hint: SPLIT_PART with space delimiter to isolate each pair, then SPLIT_PART again with '=' to get the value. Cast duration after stripping 'ms' with REPLACE.
Fuzzy name matching
Two tables have customer names that may differ in casing, spacing, punctuation ('John O'Brien' vs 'john obrien'). Match across tables using normalized names. Hint: chain LOWER(), REPLACE (strip apostrophes/hyphens), and REGEXP_REPLACE to collapse multiple spaces.
Email validation report
Flag invalid emails: no @, more than one @, domain has no dot, or local part empty. Return user_id, email, and a reason. Hint: SPLIT_PART on '@' and check lengths. POSITION('@' IN email) finds the first @. LENGTH(email) - LENGTH(REPLACE(email, '@', '')) counts @ symbols.
Worked example: extract email domain, count users per domain
SELECT
SPLIT_PART(email, '@', 2) AS domain,
COUNT(*) AS user_count
FROM users
WHERE email IS NOT NULL
AND POSITION('@' IN email) > 0
GROUP BY SPLIT_PART(email, '@', 2)
ORDER BY user_count DESC
LIMIT 10;SPLIT_PART returns the second piece of the email (the domain). The POSITION check guards against malformed emails without @. Common data-cleaning step before joining to a domain-level reference table.
String functions FAQ
How often do SQL string functions come up in DE interviews?+
Regex or simpler string functions in an interview?+
Biggest differences across SQL dialects?+
How do NULL values interact with string functions?+
Clean messy data with real SQL
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition