String manipulation shows up in roughly 8% of verified Python DE interview questions and a noticeable slice of SQL rounds too, usually folded into cleaning and parsing problems. Across 854 SQL challenges in the DataDriven library, the top five string functions by frequency are CONCAT, SUBSTRING, TRIM, LOWER/UPPER, and REPLACE. Every engine spells them a little differently.
This page covers the standard set with cross-engine syntax for PostgreSQL, Snowflake, BigQuery, SQL Server, and MySQL. The 22% of SQL rounds that use WHERE and the 29% that use INNER JOIN often involve a string predicate somewhere in the chain, so fluency here pays compound interest across every other topic in the corpus.
Python rounds use strings
SQL rounds use WHERE
SQL challenges analyzed
Functions cover 80%
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
CONCAT joins two or more strings into one. Every database supports it, though the syntax varies slightly.
-- CONCAT function (works everywhere)
SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM employees;
-- || operator (PostgreSQL, Snowflake, SQLite)
SELECT first_name || ' ' || last_name AS full_name
FROM employees;
-- + operator (SQL Server only)
SELECT first_name + ' ' + last_name AS full_name
FROM employees;
-- CONCAT_WS: concat with separator (avoids repeated separators)
SELECT CONCAT_WS(', ', city, state, country) AS location
FROM addresses;
-- Result: 'Austin, TX, US'NULL trap: CONCAT('hello', NULL) returns NULL in standard SQL. Use COALESCE to guard against this: CONCAT(first_name, ' ', COALESCE(last_name, '')). CONCAT_WS handles NULLs better: it skips NULL arguments.
SUBSTRING is the second most common string function in the corpus, trailing only CONCAT. It extracts a portion of a string by position, and the first character lives at position 1, not 0. That off-by-one catches Python-native engineers about 15% of the time in mock interviews, and it's worth naming the convention out loud before you write the query.
-- SUBSTRING(string, start, length)
SELECT SUBSTRING('DataDriven', 1, 4); -- 'Data'
SELECT SUBSTRING('DataDriven', 5, 6); -- 'Driven'
SELECT SUBSTRING('2024-01-15', 1, 7); -- '2024-01'
-- Extract area code from phone number
SELECT SUBSTRING(phone, 2, 3) AS area_code
FROM contacts
WHERE phone LIKE '(%';
-- '(512) 555-1234' -> '512'
-- Alternative syntax (PostgreSQL, standard SQL)
SELECT SUBSTRING('DataDriven' FROM 1 FOR 4); -- 'Data'TRIM removes leading and trailing characters from a string. By default it removes whitespace. In data engineering, TRIM is one of the most-used functions because raw data almost always has unwanted spaces.
-- Remove whitespace
SELECT TRIM(' hello '); -- 'hello'
SELECT LTRIM(' hello '); -- 'hello '
SELECT RTRIM(' hello '); -- ' hello'
-- Remove specific characters
SELECT TRIM(BOTH '-' FROM '--hello--'); -- 'hello'
SELECT TRIM(LEADING '0' FROM '00042'); -- '42'
-- Common pipeline pattern: clean and normalize
SELECT
TRIM(UPPER(country_code)) AS country_code,
TRIM(LOWER(email)) AS email,
TRIM(name) AS name
FROM raw_users;Pipeline tip: Always TRIM string columns from raw data before joining or comparing. A common data quality bug is 'US' failing to match 'US ' (with trailing space) in a JOIN condition. TRIM both sides to prevent this.
REPLACE swaps all occurrences of a substring with another string. It is case-sensitive in most databases.
-- Basic replacement
SELECT REPLACE('Hello World', 'World', 'SQL'); -- 'Hello SQL'
-- Remove characters (replace with empty string)
SELECT REPLACE('(512) 555-1234', '-', ''); -- '(512) 5551234'
SELECT REPLACE(REPLACE('(512) 555-1234', '(', ''), ')', '');
-- '512 555-1234'
-- Clean currency strings for numeric conversion
SELECT
CAST(
REPLACE(REPLACE(price_text, '$', ''), ',', '')
AS DECIMAL(10, 2)
) AS price_numeric
FROM raw_products;
-- '$1,299.99' -> 1299.99LENGTH returns the number of characters. UPPER and LOWER convert case. These three functions are used constantly for validation and normalization.
-- LENGTH
SELECT LENGTH('DataDriven'); -- 10
SELECT LENGTH(''); -- 0
SELECT LENGTH(NULL); -- NULL
-- Note: SQL Server uses LEN() and trims trailing spaces
-- BigQuery uses LENGTH() but also has CHAR_LENGTH()
-- UPPER and LOWER
SELECT UPPER('hello'); -- 'HELLO'
SELECT LOWER('HELLO'); -- 'hello'
-- Case-insensitive comparison
SELECT *
FROM users
WHERE LOWER(email) = LOWER('User@Example.COM');
-- Data validation: find suspiciously short/long values
SELECT *
FROM products
WHERE LENGTH(product_name) < 2
OR LENGTH(product_name) > 200;SPLIT_PART splits a string by a delimiter and returns the Nth element. Available in PostgreSQL, Snowflake, and Redshift. BigQuery uses SPLIT() which returns an array.
-- SPLIT_PART(string, delimiter, position)
SELECT SPLIT_PART('2024-01-15', '-', 1); -- '2024'
SELECT SPLIT_PART('2024-01-15', '-', 2); -- '01'
SELECT SPLIT_PART('2024-01-15', '-', 3); -- '15'
-- Extract domain from email
SELECT SPLIT_PART(email, '@', 2) AS domain
FROM users;
-- 'user@gmail.com' -> 'gmail.com'
-- Extract path segments from URL
SELECT
SPLIT_PART(url_path, '/', 2) AS section,
SPLIT_PART(url_path, '/', 3) AS page
FROM page_views;
-- '/blog/sql-joins' -> section='blog', page='sql-joins'
-- MySQL alternative: SUBSTRING_INDEX
-- SELECT SUBSTRING_INDEX('user@gmail.com', '@', -1); -- 'gmail.com'Regex functions handle patterns that simple string functions cannot. Syntax varies across databases, but the concepts are the same.
-- REGEXP_REPLACE: replace pattern matches
-- Remove all non-numeric characters from phone numbers
SELECT REGEXP_REPLACE('(512) 555-1234', '[^0-9]', '', 'g');
-- '5125551234'
-- REGEXP_SUBSTR / REGEXP_EXTRACT: extract matching text
-- PostgreSQL: SUBSTRING with regex
SELECT SUBSTRING('Order #12345 confirmed' FROM '#([0-9]+)');
-- '12345'
-- Snowflake: REGEXP_SUBSTR
SELECT REGEXP_SUBSTR('Order #12345 confirmed', '#([0-9]+)', 1, 1, 'e');
-- '12345'
-- BigQuery: REGEXP_EXTRACT
SELECT REGEXP_EXTRACT('Order #12345 confirmed', r'#([0-9]+)');
-- '12345'
-- REGEXP_LIKE / ~ : test if string matches pattern
-- PostgreSQL
SELECT * FROM logs WHERE message ~ 'ERROR|FATAL';
-- Snowflake
SELECT * FROM logs WHERE REGEXP_LIKE(message, 'ERROR|FATAL');
-- BigQuery
SELECT * FROM logs WHERE REGEXP_CONTAINS(message, r'ERROR|FATAL');Performance warning: Regex functions are significantly slower than simple string functions. On large tables, REGEXP_REPLACE on every row can add minutes to query time. Use simple functions (REPLACE, SPLIT_PART) when possible. Save regex for patterns that cannot be expressed otherwise.
Here are the string function patterns that show up in production ETL jobs and dbt models.
-- Create a surrogate key from multiple columns
SELECT
MD5(CONCAT(
COALESCE(CAST(user_id AS TEXT), ''),
'|',
COALESCE(CAST(event_date AS TEXT), ''),
'|',
COALESCE(event_type, '')
)) AS event_key,
*
FROM events;-- Parse key=value pairs from log entries
-- Input: 'user=alice action=login ip=10.0.0.1'
SELECT
SPLIT_PART(
SPLIT_PART(log_line, 'user=', 2), ' ', 1
) AS username,
SPLIT_PART(
SPLIT_PART(log_line, 'action=', 2), ' ', 1
) AS action,
SPLIT_PART(
SPLIT_PART(log_line, 'ip=', 2), ' ', 1
) AS ip_address
FROM access_logs;-- Normalize company names for deduplication
SELECT
TRIM(
REGEXP_REPLACE(
LOWER(company_name),
'\s+(inc\.?|llc|corp\.?|ltd\.?)$',
'',
'i'
)
) AS normalized_name,
company_name AS original_name
FROM companies;
-- 'Acme Corp.' -> 'acme'
-- 'ACME Inc' -> 'acme'
-- 'acme LLC' -> 'acme'String manipulation questions test attention to detail and familiarity with data cleaning patterns.
What they test:
SPLIT_PART or SUBSTRING with string position functions, plus GROUP BY. Can you parse a string and aggregate in one query?
Approach:
SELECT SPLIT_PART(email, '@', 2) AS domain, COUNT(*) FROM users GROUP BY 1 ORDER BY 2 DESC. Mention that you would LOWER(email) first to avoid case-sensitive duplicates.
What they test:
REPLACE for removing characters, or REGEXP_REPLACE for a single-pass cleanup. Data cleaning mindset.
Approach:
REGEXP_REPLACE(phone, '[^0-9]', '', 'g') strips all non-numeric characters in one pass. Alternatively, chain REPLACE calls for each character type. Validate that the result has exactly 10 digits.
What they test:
SPLIT_PART or SUBSTRING with POSITION. Handling edge cases like middle names and single names.
Approach:
SPLIT_PART(name, ' ', 1) for first_name. For last_name, use SUBSTRING(name FROM POSITION(' ' IN name) + 1) to handle multi-word last names. Mention edge cases: single names (no space), names with three or more parts, and NULL values.
CONCAT, SUBSTRING, TRIM, LOWER/UPPER, and REPLACE cover the vast majority of string questions in our 854-challenge library. Drill them once and you'll stop stalling on parsing tasks.