SQL String Functions Practice

String functions are tested directly in fewer than 1% of interviews, but they show up in roughly 30% of take-home assignments where you need to parse, clean, or transform raw text fields.

What this guide covers

String functions are tested directly in <1% of interviews but show up in ~30% of take-home assignments where you parse, clean, or transform raw text. Six function families cover the corpus: pattern matching (LIKE), building (CONCAT), extraction (SUBSTRING), parsing (SPLIT_PART), cleaning (REPLACE/TRANSLATE), regex.

6
Function families
30%
Of take-homes need them
5
Data cleaning problems

String functions you need to know

DE interviews test string manipulation in the context of data cleaning and ETL. These are the functions that matter.

Pattern Matching · High

LIKE / ILIKE

Syntax: WHERE email LIKE '%@gmail.com'. Pattern matching with wildcards. % matches any sequence; _ matches exactly one character. ILIKE is the case-insensitive variant in Postgres. Use cases: filter emails by domain; find product names containing a keyword; match phone number patterns with _ wildcards; case-insensitive search with ILIKE. Interview tip: edge cases test what happens if the pattern itself contains % or _. You need to escape them with a backslash or use the ESCAPE clause.

Building · High

CONCAT / ||

Syntax: CONCAT(first_name, ' ', last_name) or first_name || ' ' || last_name. Joins strings together. CONCAT handles NULLs gracefully (treats them as empty). The || operator propagates NULLs in most dialects. Use cases: build full names; construct URLs or file paths from components; create composite keys for joining; format output for reporting. Interview tip: NULL behavior difference matters. CONCAT('hello', NULL) returns 'hello'. But 'hello' || NULL returns NULL in Postgres. Common interview gotcha.

Extraction · Medium

SUBSTRING / SUBSTR

Syntax: SUBSTRING(column FROM 1 FOR 3) or SUBSTR(column, 1, 3). Extracts a portion of a string by position. FROM/FOR is ANSI SQL; most dialects also support the comma-separated form. Positions are 1-indexed. Use cases: extract area codes; parse fixed-width file fields; first N characters; extract year from a date stored as text. Interview tip: SQL strings are 1-indexed, not 0-indexed. SUBSTRING('hello' FROM 1 FOR 2) returns 'he'. Getting this wrong wastes time.

Parsing · High

SPLIT_PART

Syntax: SPLIT_PART(email, '@', 2). Splits a string by a delimiter, returns the Nth part. SQL equivalent of Python's split(). Available in Postgres, Snowflake, Trino. BigQuery uses SPLIT() returning an array. Use cases: extract domain from emails; parse CSV values; get path components from URLs; split full names. Interview tip: SPLIT_PART returns empty string (not NULL) if the requested index is out of range. Matters when using COALESCE or NULL checks downstream.

Cleaning · Medium

REPLACE / TRANSLATE

Syntax: REPLACE(phone, '-', '') or TRANSLATE(phone, '()-. ', ''). REPLACE swaps one substring for another. TRANSLATE maps characters one-to-one (like Python's str.translate). TRANSLATE is faster for removing multiple single characters at once. Use cases: strip formatting from phone numbers; normalize whitespace; remove currency symbols; clean up delimiters. Interview tip: data cleaning questions often require chaining REPLACE calls or using TRANSLATE for character-level cleanup. Know both.

Advanced · Low

Regex (REGEXP_REPLACE, REGEXP_MATCHES)

Syntax: REGEXP_REPLACE(text, '[^0-9]', '', 'g'). Full regex for matching and replacing. Postgres uses POSIX regex; BigQuery uses RE2. The 'g' flag means global replacement; without it, only the first match is replaced. Use cases: extract numbers from mixed alphanumeric strings; validate email/phone formats; parse log lines; remove all non-alphanumeric. Interview tip: regex is a power tool, but most interviewers accept LIKE or SPLIT_PART if they solve the problem. Reach for regex only when simpler functions can't.

String manipulation in DE interviews

String questions are almost always about data quality. You get a messy table and need to clean it before you can join, aggregate, or report. The interviewer is testing whether you can handle real-world data, not textbook data.

The most common scenario: two tables that should join on a shared field, but formatting is inconsistent. Phone numbers with and without dashes. Names with mixed casing. Addresses with abbreviations. Your job is to normalize both sides so the join works.

Less obvious but equally important: parsing semi-structured data stored in string columns. Log messages, URL parameters, and CSV fields packed into a single column all require SPLIT_PART or regex to extract the pieces. Extremely common in DE pipelines.

The key skill is knowing which function to reach for. REPLACE handles simple substitutions. SPLIT_PART handles delimited data. Regex handles everything else but is harder to read. Interviewers want to see you pick the right tool, not the fanciest one.

Five practice problems

Data cleaning problems that test string functions in realistic scenarios.

Medium

Clean phone numbers for joining

Table A has phones as '(555) 123-4567'. Table B as '5551234567'. Write a query that joins them after normalizing both to digits only. Hint: REGEXP_REPLACE(phone, '[^0-9]', '', 'g') strips all non-digits, or chain REPLACE calls.

Hard

Extract domain and TLD from URLs

Given a table of URLs (e.g., 'https://subdomain.example.co.uk/path'), extract the domain ('example') and TLD ('co.uk') as separate columns. Hint: SPLIT_PART on '/' for hostname, then work backwards from the right. Multi-part TLDs like .co.uk make this harder than it looks.

Hard

Parse key-value log strings

Logs have entries like 'user_id=42 action=click page=/home duration=350ms'. Parse into separate columns: user_id, action, page, duration_ms (integer). Hint: SPLIT_PART with space delimiter to isolate each pair, then SPLIT_PART again with '=' to get the value. Cast duration after stripping 'ms' with REPLACE.

Medium

Fuzzy name matching

Two tables have customer names that may differ in casing, spacing, punctuation ('John O'Brien' vs 'john obrien'). Match across tables using normalized names. Hint: chain LOWER(), REPLACE (strip apostrophes/hyphens), and REGEXP_REPLACE to collapse multiple spaces.

Medium

Email validation report

Flag invalid emails: no @, more than one @, domain has no dot, or local part empty. Return user_id, email, and a reason. Hint: SPLIT_PART on '@' and check lengths. POSITION('@' IN email) finds the first @. LENGTH(email) - LENGTH(REPLACE(email, '@', '')) counts @ symbols.

Worked example: extract email domain, count users per domain

SELECT
  SPLIT_PART(email, '@', 2)        AS domain,
  COUNT(*)                         AS user_count
FROM users
WHERE email IS NOT NULL
  AND POSITION('@' IN email) > 0
GROUP BY SPLIT_PART(email, '@', 2)
ORDER BY user_count DESC
LIMIT 10;

SPLIT_PART returns the second piece of the email (the domain). The POSITION check guards against malformed emails without @. Common data-cleaning step before joining to a domain-level reference table.

String functions FAQ

How often do SQL string functions come up in DE interviews?+
In ~20-30% of SQL interviews, usually as part of a data cleaning or ETL question rather than standalone. Most common pattern: raw data has inconsistent formatting and you need to normalize before joining or aggregating. SPLIT_PART, REPLACE, LIKE cover most cases.
Regex or simpler string functions in an interview?+
Start with the simplest function that solves the problem. LIKE, SPLIT_PART, REPLACE are easier to read, less error-prone, and show that you pick the right tool. Use regex only when the pattern is too complex for basic functions. Interviewers value clarity over cleverness.
Biggest differences across SQL dialects?+
Postgres uses || for concatenation and has SPLIT_PART. MySQL uses CONCAT() and SUBSTRING_INDEX instead. BigQuery uses SPLIT() returning an array, REGEXP_EXTRACT instead of REGEXP_MATCHES. Snowflake is closest to Postgres. State your dialect at the start.
How do NULL values interact with string functions?+
CONCAT treats NULLs as empty strings. The || operator returns NULL if any operand is NULL. LENGTH(NULL) returns NULL. COALESCE(column, '') is the standard way to handle NULLs before string operations. Forgetting causes silent data loss in joins and filters.
02 / Why practice

Clean messy data with real SQL

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Related guides