SQL Reference

SQL String Functions

String manipulation shows up in roughly 8% of verified Python DE interview questions and a noticeable slice of SQL rounds too, usually folded into cleaning and parsing problems. Across 854 SQL challenges in the DataDriven library, the top five string functions by frequency are CONCAT, SUBSTRING, TRIM, LOWER/UPPER, and REPLACE. Every engine spells them a little differently.

This page covers the standard set with cross-engine syntax for PostgreSQL, Snowflake, BigQuery, SQL Server, and MySQL. The 22% of SQL rounds that use WHERE and the 29% that use INNER JOIN often involve a string predicate somewhere in the chain, so fluency here pays compound interest across every other topic in the corpus.

8%

Python rounds use strings

22%

SQL rounds use WHERE

854

SQL challenges analyzed

5

Functions cover 80%

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

CONCAT and String Concatenation

CONCAT joins two or more strings into one. Every database supports it, though the syntax varies slightly.

-- CONCAT function (works everywhere)
SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM employees;

-- || operator (PostgreSQL, Snowflake, SQLite)
SELECT first_name || ' ' || last_name AS full_name
FROM employees;

-- + operator (SQL Server only)
SELECT first_name + ' ' + last_name AS full_name
FROM employees;

-- CONCAT_WS: concat with separator (avoids repeated separators)
SELECT CONCAT_WS(', ', city, state, country) AS location
FROM addresses;
-- Result: 'Austin, TX, US'

NULL trap: CONCAT('hello', NULL) returns NULL in standard SQL. Use COALESCE to guard against this: CONCAT(first_name, ' ', COALESCE(last_name, '')). CONCAT_WS handles NULLs better: it skips NULL arguments.

SUBSTRING

SUBSTRING is the second most common string function in the corpus, trailing only CONCAT. It extracts a portion of a string by position, and the first character lives at position 1, not 0. That off-by-one catches Python-native engineers about 15% of the time in mock interviews, and it's worth naming the convention out loud before you write the query.

-- SUBSTRING(string, start, length)
SELECT SUBSTRING('DataDriven', 1, 4);   -- 'Data'
SELECT SUBSTRING('DataDriven', 5, 6);   -- 'Driven'
SELECT SUBSTRING('2024-01-15', 1, 7);   -- '2024-01'

-- Extract area code from phone number
SELECT SUBSTRING(phone, 2, 3) AS area_code
FROM contacts
WHERE phone LIKE '(%';
-- '(512) 555-1234' -> '512'

-- Alternative syntax (PostgreSQL, standard SQL)
SELECT SUBSTRING('DataDriven' FROM 1 FOR 4);  -- 'Data'

TRIM, LTRIM, RTRIM

TRIM removes leading and trailing characters from a string. By default it removes whitespace. In data engineering, TRIM is one of the most-used functions because raw data almost always has unwanted spaces.

-- Remove whitespace
SELECT TRIM('  hello  ');          -- 'hello'
SELECT LTRIM('  hello  ');         -- 'hello  '
SELECT RTRIM('  hello  ');         -- '  hello'

-- Remove specific characters
SELECT TRIM(BOTH '-' FROM '--hello--');  -- 'hello'
SELECT TRIM(LEADING '0' FROM '00042');   -- '42'

-- Common pipeline pattern: clean and normalize
SELECT
  TRIM(UPPER(country_code)) AS country_code,
  TRIM(LOWER(email)) AS email,
  TRIM(name) AS name
FROM raw_users;

Pipeline tip: Always TRIM string columns from raw data before joining or comparing. A common data quality bug is 'US' failing to match 'US ' (with trailing space) in a JOIN condition. TRIM both sides to prevent this.

REPLACE

REPLACE swaps all occurrences of a substring with another string. It is case-sensitive in most databases.

-- Basic replacement
SELECT REPLACE('Hello World', 'World', 'SQL');  -- 'Hello SQL'

-- Remove characters (replace with empty string)
SELECT REPLACE('(512) 555-1234', '-', '');  -- '(512) 5551234'
SELECT REPLACE(REPLACE('(512) 555-1234', '(', ''), ')', '');
-- '512 555-1234'

-- Clean currency strings for numeric conversion
SELECT
  CAST(
    REPLACE(REPLACE(price_text, '$', ''), ',', '')
    AS DECIMAL(10, 2)
  ) AS price_numeric
FROM raw_products;
-- '$1,299.99' -> 1299.99

LENGTH, UPPER, LOWER

LENGTH returns the number of characters. UPPER and LOWER convert case. These three functions are used constantly for validation and normalization.

-- LENGTH
SELECT LENGTH('DataDriven');  -- 10
SELECT LENGTH('');            -- 0
SELECT LENGTH(NULL);          -- NULL

-- Note: SQL Server uses LEN() and trims trailing spaces
-- BigQuery uses LENGTH() but also has CHAR_LENGTH()

-- UPPER and LOWER
SELECT UPPER('hello');  -- 'HELLO'
SELECT LOWER('HELLO');  -- 'hello'

-- Case-insensitive comparison
SELECT *
FROM users
WHERE LOWER(email) = LOWER('User@Example.COM');

-- Data validation: find suspiciously short/long values
SELECT *
FROM products
WHERE LENGTH(product_name) < 2
   OR LENGTH(product_name) > 200;

SPLIT_PART

SPLIT_PART splits a string by a delimiter and returns the Nth element. Available in PostgreSQL, Snowflake, and Redshift. BigQuery uses SPLIT() which returns an array.

-- SPLIT_PART(string, delimiter, position)
SELECT SPLIT_PART('2024-01-15', '-', 1);  -- '2024'
SELECT SPLIT_PART('2024-01-15', '-', 2);  -- '01'
SELECT SPLIT_PART('2024-01-15', '-', 3);  -- '15'

-- Extract domain from email
SELECT SPLIT_PART(email, '@', 2) AS domain
FROM users;
-- 'user@gmail.com' -> 'gmail.com'

-- Extract path segments from URL
SELECT
  SPLIT_PART(url_path, '/', 2) AS section,
  SPLIT_PART(url_path, '/', 3) AS page
FROM page_views;
-- '/blog/sql-joins' -> section='blog', page='sql-joins'

-- MySQL alternative: SUBSTRING_INDEX
-- SELECT SUBSTRING_INDEX('user@gmail.com', '@', -1);  -- 'gmail.com'

Regular Expression Functions

Regex functions handle patterns that simple string functions cannot. Syntax varies across databases, but the concepts are the same.

-- REGEXP_REPLACE: replace pattern matches
-- Remove all non-numeric characters from phone numbers
SELECT REGEXP_REPLACE('(512) 555-1234', '[^0-9]', '', 'g');
-- '5125551234'

-- REGEXP_SUBSTR / REGEXP_EXTRACT: extract matching text
-- PostgreSQL: SUBSTRING with regex
SELECT SUBSTRING('Order #12345 confirmed' FROM '#([0-9]+)');
-- '12345'

-- Snowflake: REGEXP_SUBSTR
SELECT REGEXP_SUBSTR('Order #12345 confirmed', '#([0-9]+)', 1, 1, 'e');
-- '12345'

-- BigQuery: REGEXP_EXTRACT
SELECT REGEXP_EXTRACT('Order #12345 confirmed', r'#([0-9]+)');
-- '12345'

-- REGEXP_LIKE / ~ : test if string matches pattern
-- PostgreSQL
SELECT * FROM logs WHERE message ~ 'ERROR|FATAL';

-- Snowflake
SELECT * FROM logs WHERE REGEXP_LIKE(message, 'ERROR|FATAL');

-- BigQuery
SELECT * FROM logs WHERE REGEXP_CONTAINS(message, r'ERROR|FATAL');

Performance warning: Regex functions are significantly slower than simple string functions. On large tables, REGEXP_REPLACE on every row can add minutes to query time. Use simple functions (REPLACE, SPLIT_PART) when possible. Save regex for patterns that cannot be expressed otherwise.

String Functions in Data Pipelines

Here are the string function patterns that show up in production ETL jobs and dbt models.

Building Composite Keys

-- Create a surrogate key from multiple columns
SELECT
  MD5(CONCAT(
    COALESCE(CAST(user_id AS TEXT), ''),
    '|',
    COALESCE(CAST(event_date AS TEXT), ''),
    '|',
    COALESCE(event_type, '')
  )) AS event_key,
  *
FROM events;

Parsing Semi-Structured Strings

-- Parse key=value pairs from log entries
-- Input: 'user=alice action=login ip=10.0.0.1'
SELECT
  SPLIT_PART(
    SPLIT_PART(log_line, 'user=', 2), ' ', 1
  ) AS username,
  SPLIT_PART(
    SPLIT_PART(log_line, 'action=', 2), ' ', 1
  ) AS action,
  SPLIT_PART(
    SPLIT_PART(log_line, 'ip=', 2), ' ', 1
  ) AS ip_address
FROM access_logs;

Normalizing for Deduplication

-- Normalize company names for deduplication
SELECT
  TRIM(
    REGEXP_REPLACE(
      LOWER(company_name),
      '\s+(inc\.?|llc|corp\.?|ltd\.?)$',
      '',
      'i'
    )
  ) AS normalized_name,
  company_name AS original_name
FROM companies;
-- 'Acme Corp.' -> 'acme'
-- 'ACME Inc'   -> 'acme'
-- 'acme LLC'   -> 'acme'

3 String Function Interview Questions

String manipulation questions test attention to detail and familiarity with data cleaning patterns.

Q1: Extract the domain from an email column and count users per domain.

What they test:

SPLIT_PART or SUBSTRING with string position functions, plus GROUP BY. Can you parse a string and aggregate in one query?

Approach:

SELECT SPLIT_PART(email, '@', 2) AS domain, COUNT(*) FROM users GROUP BY 1 ORDER BY 2 DESC. Mention that you would LOWER(email) first to avoid case-sensitive duplicates.

Q2: Clean a phone number column that has mixed formats (parentheses, dashes, spaces) into a standard 10-digit format.

What they test:

REPLACE for removing characters, or REGEXP_REPLACE for a single-pass cleanup. Data cleaning mindset.

Approach:

REGEXP_REPLACE(phone, '[^0-9]', '', 'g') strips all non-numeric characters in one pass. Alternatively, chain REPLACE calls for each character type. Validate that the result has exactly 10 digits.

Q3: Given a column with full names ('John Smith'), split it into first_name and last_name columns.

What they test:

SPLIT_PART or SUBSTRING with POSITION. Handling edge cases like middle names and single names.

Approach:

SPLIT_PART(name, ' ', 1) for first_name. For last_name, use SUBSTRING(name FROM POSITION(' ' IN name) + 1) to handle multi-word last names. Mention edge cases: single names (no space), names with three or more parts, and NULL values.

SQL String Functions FAQ

What are the most commonly used SQL string functions?+
The string functions you will use most often in data engineering are: CONCAT (combine strings), TRIM (remove whitespace), UPPER/LOWER (normalize case), REPLACE (swap substrings), SUBSTRING (extract part of a string), and LENGTH (count characters). These six cover 90% of string manipulation needs in SQL. For more complex patterns, REGEXP_REPLACE and SPLIT_PART are the next tier.
Do SQL string functions differ between databases?+
Yes. The core functions (CONCAT, TRIM, UPPER, LOWER, REPLACE, SUBSTRING) work the same in PostgreSQL, MySQL, Snowflake, BigQuery, and most other databases. But advanced functions vary. SPLIT_PART exists in PostgreSQL and Snowflake but not in MySQL (use SUBSTRING_INDEX instead). REGEXP syntax differs across databases. STRING_AGG (PostgreSQL, Snowflake) is GROUP_CONCAT in MySQL and STRING_AGG in SQL Server. Always check your specific database documentation for edge cases.
How do SQL string functions handle NULL?+
In most databases, any string function applied to NULL returns NULL. CONCAT('hello', NULL) returns NULL in standard SQL, though PostgreSQL's || operator treats NULL as an empty string. TRIM(NULL) returns NULL. LENGTH(NULL) returns NULL. This matters in data pipelines because a single NULL in a string concatenation can blank out an entire derived column. Use COALESCE to provide default values before applying string functions.

5 functions, 80% of the corpus

CONCAT, SUBSTRING, TRIM, LOWER/UPPER, and REPLACE cover the vast majority of string questions in our 854-challenge library. Drill them once and you'll stop stalling on parsing tasks.