SQL String Functions

String manipulation shows up in roughly 8% of Python DE interview questions and a noticeable slice of SQL rounds too, usually folded into cleaning and parsing problems. The top five string functions by frequency: CONCAT, SUBSTRING, TRIM, LOWER/UPPER, REPLACE. Every engine spells them differently.

What this guide covers

String manipulation shows up in ~8% of Python DE interview questions and a noticeable slice of SQL rounds (cleaning, parsing). Across 854 SQL challenges, the top five string functions by frequency: CONCAT, SUBSTRING, TRIM, LOWER/UPPER, REPLACE. Every engine spells them a little differently.

8%
Python rounds use strings
22%
SQL rounds use WHERE
854
SQL challenges analyzed
5
Functions cover 80%

CONCAT and string concatenation

-- CONCAT function (works everywhere)
SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM employees;

-- || operator (PostgreSQL, Snowflake, SQLite)
SELECT first_name || ' ' || last_name AS full_name
FROM employees;

-- + operator (SQL Server only)
SELECT first_name + ' ' + last_name AS full_name
FROM employees;

-- CONCAT_WS: concat with separator (avoids repeated separators)
SELECT CONCAT_WS(', ', city, state, country) AS location
FROM addresses;
-- Result: 'Austin, TX, US'

NULL trap: CONCAT('hello', NULL) returns NULL in standard SQL. Use COALESCE to guard. CONCAT_WS handles NULLs better — it skips NULL arguments.

SUBSTRING

-- SUBSTRING(string, start, length)
SELECT SUBSTRING('DataDriven', 1, 4);   -- 'Data'
SELECT SUBSTRING('DataDriven', 5, 6);   -- 'Driven'
SELECT SUBSTRING('2024-01-15', 1, 7);   -- '2024-01'

-- Extract area code from phone number
SELECT SUBSTRING(phone, 2, 3) AS area_code
FROM contacts
WHERE phone LIKE '(%';
-- '(512) 555-1234' -> '512'

-- Alternative syntax (PostgreSQL, standard SQL)
SELECT SUBSTRING('DataDriven' FROM 1 FOR 4);  -- 'Data'

Second most common string function in the corpus. The first character lives at position 1, not 0 — that off-by-one catches Python-native engineers about 15% of the time in mock interviews. Name the convention out loud before you write the query.

TRIM, LTRIM, RTRIM

-- Remove whitespace
SELECT TRIM('  hello  ');          -- 'hello'
SELECT LTRIM('  hello  ');         -- 'hello  '
SELECT RTRIM('  hello  ');         -- '  hello'

-- Remove specific characters
SELECT TRIM(BOTH '-' FROM '--hello--');  -- 'hello'
SELECT TRIM(LEADING '0' FROM '00042');   -- '42'

-- Common pipeline pattern: clean and normalize
SELECT
  TRIM(UPPER(country_code)) AS country_code,
  TRIM(LOWER(email)) AS email,
  TRIM(name) AS name
FROM raw_users;

Pipeline tip: always TRIM string columns from raw data before joining or comparing. A common data quality bug is 'US' failing to match 'US ' (with trailing space) in a JOIN. TRIM both sides to prevent.

REPLACE

-- Basic replacement
SELECT REPLACE('Hello World', 'World', 'SQL');  -- 'Hello SQL'

-- Remove characters (replace with empty string)
SELECT REPLACE('(512) 555-1234', '-', '');  -- '(512) 5551234'
SELECT REPLACE(REPLACE('(512) 555-1234', '(', ''), ')', '');
-- '512 555-1234'

-- Clean currency strings for numeric conversion
SELECT
  CAST(
    REPLACE(REPLACE(price_text, '$', ''), ',', '')
    AS DECIMAL(10, 2)
  ) AS price_numeric
FROM raw_products;
-- '$1,299.99' -> 1299.99

LENGTH, UPPER, LOWER

-- LENGTH
SELECT LENGTH('DataDriven');  -- 10
SELECT LENGTH('');            -- 0
SELECT LENGTH(NULL);          -- NULL

-- Note: SQL Server uses LEN() and trims trailing spaces
-- BigQuery uses LENGTH() but also has CHAR_LENGTH()

-- UPPER and LOWER
SELECT UPPER('hello');  -- 'HELLO'
SELECT LOWER('HELLO');  -- 'hello'

-- Case-insensitive comparison
SELECT *
FROM users
WHERE LOWER(email) = LOWER('User@Example.COM');

-- Data validation: find suspiciously short/long values
SELECT *
FROM products
WHERE LENGTH(product_name) < 2
   OR LENGTH(product_name) > 200;

SPLIT_PART

-- SPLIT_PART(string, delimiter, position)
SELECT SPLIT_PART('2024-01-15', '-', 1);  -- '2024'
SELECT SPLIT_PART('2024-01-15', '-', 2);  -- '01'
SELECT SPLIT_PART('2024-01-15', '-', 3);  -- '15'

-- Extract domain from email
SELECT SPLIT_PART(email, '@', 2) AS domain
FROM users;
-- 'user@gmail.com' -> 'gmail.com'

-- Extract path segments from URL
SELECT
  SPLIT_PART(url_path, '/', 2) AS section,
  SPLIT_PART(url_path, '/', 3) AS page
FROM page_views;
-- '/blog/sql-joins' -> section='blog', page='sql-joins'

-- MySQL alternative: SUBSTRING_INDEX
-- SELECT SUBSTRING_INDEX('user@gmail.com', '@', -1);  -- 'gmail.com'

Available in PostgreSQL, Snowflake, Redshift. BigQuery uses SPLIT() returning an array. MySQL: SUBSTRING_INDEX.

Regular expression functions

-- REGEXP_REPLACE: replace pattern matches
-- Remove all non-numeric characters from phone numbers
SELECT REGEXP_REPLACE('(512) 555-1234', '[^0-9]', '', 'g');
-- '5125551234'

-- REGEXP_SUBSTR / REGEXP_EXTRACT: extract matching text
-- PostgreSQL: SUBSTRING with regex
SELECT SUBSTRING('Order #12345 confirmed' FROM '#([0-9]+)');
-- '12345'

-- Snowflake: REGEXP_SUBSTR
SELECT REGEXP_SUBSTR('Order #12345 confirmed', '#([0-9]+)', 1, 1, 'e');
-- '12345'

-- BigQuery: REGEXP_EXTRACT
SELECT REGEXP_EXTRACT('Order #12345 confirmed', r'#([0-9]+)');
-- '12345'

-- REGEXP_LIKE / ~ : test if string matches pattern
-- PostgreSQL
SELECT * FROM logs WHERE message ~ 'ERROR|FATAL';

-- Snowflake
SELECT * FROM logs WHERE REGEXP_LIKE(message, 'ERROR|FATAL');

-- BigQuery
SELECT * FROM logs WHERE REGEXP_CONTAINS(message, r'ERROR|FATAL');

Performance warning: regex is significantly slower than simple string functions. On large tables, REGEXP_REPLACE per row can add minutes. Use simple functions (REPLACE, SPLIT_PART) when possible. Save regex for patterns that can't be expressed otherwise.

Pipeline pattern: building composite keys

-- Create a surrogate key from multiple columns
SELECT
  MD5(CONCAT(
    COALESCE(CAST(user_id AS TEXT), ''),
    '|',
    COALESCE(CAST(event_date AS TEXT), ''),
    '|',
    COALESCE(event_type, '')
  )) AS event_key,
  *
FROM events;

Pipeline pattern: parsing semi-structured strings

-- Parse key=value pairs from log entries
-- Input: 'user=alice action=login ip=10.0.0.1'
SELECT
  SPLIT_PART(
    SPLIT_PART(log_line, 'user=', 2), ' ', 1
  ) AS username,
  SPLIT_PART(
    SPLIT_PART(log_line, 'action=', 2), ' ', 1
  ) AS action,
  SPLIT_PART(
    SPLIT_PART(log_line, 'ip=', 2), ' ', 1
  ) AS ip_address
FROM access_logs;

Pipeline pattern: normalizing for deduplication

-- Normalize company names for deduplication
SELECT
  TRIM(
    REGEXP_REPLACE(
      LOWER(company_name),
      '\s+(inc\.?|llc|corp\.?|ltd\.?)$',
      '',
      'i'
    )
  ) AS normalized_name,
  company_name AS original_name
FROM companies;
-- 'Acme Corp.' -> 'acme'
-- 'ACME Inc'   -> 'acme'
-- 'acme LLC'   -> 'acme'

String function interview questions

Q01

Extract the domain from an email column and count users per domain.

Tests SPLIT_PART or SUBSTRING with string position functions, plus GROUP BY. SELECT SPLIT_PART(email, '@', 2) AS domain, COUNT(*) FROM users GROUP BY 1 ORDER BY 2 DESC. Mention LOWER(email) first to avoid case-sensitive duplicates.

Q02

Clean a phone number column with mixed formats into a standard 10-digit format.

Tests REPLACE for removing characters, or REGEXP_REPLACE for a single-pass cleanup. REGEXP_REPLACE(phone, '[^0-9]', '', 'g') strips all non-numeric in one pass. Alternatively, chain REPLACE calls. Validate the result has exactly 10 digits.

Q03

Split full names ('John Smith') into first_name and last_name.

Tests SPLIT_PART or SUBSTRING with POSITION. SPLIT_PART(name, ' ', 1) for first_name. For last_name, SUBSTRING(name FROM POSITION(' ' IN name) + 1) to handle multi-word last names. Edge cases: single names, names with 3+ parts, NULLs.

SQL string functions FAQ

What are the most commonly used SQL string functions?+
CONCAT (combine strings), TRIM (remove whitespace), UPPER/LOWER (normalize case), REPLACE (swap substrings), SUBSTRING (extract part of a string), LENGTH (count characters). These six cover 90% of string manipulation needs. For complex patterns, REGEXP_REPLACE and SPLIT_PART are the next tier.
Do SQL string functions differ between databases?+
Yes. Core functions (CONCAT, TRIM, UPPER, LOWER, REPLACE, SUBSTRING) work the same in PostgreSQL, MySQL, Snowflake, BigQuery, most others. Advanced functions vary. SPLIT_PART exists in PostgreSQL and Snowflake but not MySQL (use SUBSTRING_INDEX). REGEXP syntax differs. STRING_AGG is GROUP_CONCAT in MySQL.
How do SQL string functions handle NULL?+
In most databases, any string function applied to NULL returns NULL. CONCAT('hello', NULL) returns NULL in standard SQL, though PostgreSQL's || operator treats NULL as empty string. TRIM(NULL), LENGTH(NULL) all return NULL. A single NULL in a string concatenation can blank out an entire derived column. Use COALESCE to provide defaults before applying string functions.
02 / Why practice

5 functions, 80% of the corpus

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Related guides