String Functions: Intermediate
Reddit's Trust & Safety data team identifies coordinated inauthentic behavior and spam rings by running SQL pattern matching queries across billions of posts stored in their data warehouse. Functions like REPLACE, LIKE with wildcards, and REGEXP_LIKE let analysts surface policy-violating phrases, off-platform links, and bot-like posting patterns across hundreds of communities in a single query. Reddit engineers have discussed how batch SQL analysis lets a small team audit content at scale and dramatically reduce the case volume that reaches human reviewers. The intermediate string techniques you are about to learn are the same tools that make this kind of large-scale content analysis possible.
SUBSTRING()
Extract portions of text by position
SUBSTRING extracts a portion of a string starting at a specified position for a specified length. It allows you to pull out exactly the characters you need from any position within the text. This function is fundamental for parsing structured data embedded in strings.
The syntax is SUBSTRING(string, start_position, length). The start_position indicates where extraction begins (1-indexed, meaning the first character is position 1). The length specifies how many characters to extract. If length is omitted, extraction continues to the end of the string.
Understanding how positions and lengths work in SUBSTRING is key to extracting exactly the characters you need.
Basic Extraction
| product_code | Format |
|---|---|
| ELEC-2024-001 | CATEGORY-YEAR-SEQUENCE |
| FURN-2024-042 | CATEGORY-YEAR-SEQUENCE |
| TOYS-2023-199 | CATEGORY-YEAR-SEQUENCE |
SUBSTRING(product_code, 1, 4) extracts 4 characters starting at position 1, returning the category code. SUBSTRING(product_code, 6, 4) extracts 4 characters starting at position 6, returning the year. Position counting starts at 1, not 0.
Extract to String End
When you omit the length parameter, SUBSTRING extracts from the start position to the end of the string:
Practical Applications
SUBSTRING becomes powerful when combined with other functions like STRPOS to handle variable-length data.
Parsing Email Domains
SUBSTRING combined with STRPOS can extract variable-length portions of strings:
SUBSTRING('hello', 1, 1) returns 'h', not 'e'. This is a frequent source of off-by-one errors.Fixed-Width Extraction
| record | Layout |
|---|---|
| SMITH JOHN 19850315NYC | LastName(10) FirstName(10) DOB(8) City(3) |
Each field is extracted by its position and length. TRIM removes the padding spaces from the name fields. This pattern is common when migrating data from mainframe systems.
Best Practices
SUBSTRING excels at structured parsing tasks but requires careful attention to position counting.
When to Use SUBSTRING()
SUBSTRING is ideal for data with predictable structure where you know exactly where values are located.
Understanding SUBSTRING execution helps you predict behavior with edge cases like NULL inputs or out-of-range positions.
SUBSTRING()scans to the start position, then copies the specified length- Returns
NULLif the input string isNULL - Returns empty string if start position exceeds string length
- Negative start positions work differently across databases
> Complete this query to extract the category code from a product code.
SELECT (product_code, , ) AS category FROM products
SUBSTRING with a fixed start position and length is the right tool when your data follows a consistent structure where each field always occupies the same character positions.
Combining SUBSTRING with STRPOS lets you handle variable-length strings by first finding a delimiter and then extracting relative to that location rather than relying on fixed positions.
Edge Extractions
Pull text from the start or end of values
Extracting characters from the beginning or end of a string is a common operation. SUBSTR handles both cases. To get the first N characters, use SUBSTR(string, 1, N). To get the last N characters, use a negative start position: SUBSTR(string, -N).
Dedicated prefix/suffix functions are found in some databases, but Presto provides unified handling through SUBSTR with position arguments. This single function replaces the need for multiple specialized functions, giving you consistent syntax across all extraction patterns.
Using SUBSTR from position 1 with a specified length captures the prefix portion of any string with clarity and consistency.
Extract from Start
SUBSTR(string, 1, N) extracts the first N characters, ideal for fixed-length prefixes like country codes, category identifiers, or version numbers:
| phone_number |
|---|
| +1-555-123-4567 |
| +44-20-7946-0958 |
| +81-3-1234-5678 |
SUBSTR(phone_number, 1, 3) extracts the first 3 characters, capturing the country code prefix. The 1 indicates starting at the first character, and 3 is the length to extract.
RIGHT()
Extracting from the End
SUBSTR with a negative start position extracts from the end of a string, useful for file extensions, trailing codes, or check digits:
| filename |
|---|
| report_2024.pdf |
| data_export.csv |
| backup.sql.gz |
SUBSTR(filename, -3) extracts the last 3 characters. The negative position -3 means "start 3 characters from the end." For fixed-length extensions like .pdf or .csv, this works perfectly. For variable-length extensions, you would need STRPOS to find the dot.
Comparison & Best Practices
Extraction Patterns
Compare how different SUBSTR calls extract from the beginning, end, or middle of a string.
Practical Examples
To get the first N characters, use SUBSTR(str, 1, N). To get the last N characters, use SUBSTR(str, -N). These patterns handle all the use cases needed for extracting fixed-width prefixes and suffixes from standardized identifiers.
Extracting Prefixes and Suffixes
Prefixes (the leading portion of a string) and suffixes (the trailing portion) are common in structured data. Country codes, file extensions, year prefixes, check digits, and category codes are all prefixes or suffixes that can be extracted using SUBSTR.
SUBSTR execution differs slightly from SUBSTRING in some databases, particularly with negative positions.
- SUBSTR() with positive positions scans from the start
- SUBSTR() with negative positions counts backward from the end
- Returns NULL if the input is NULL
- If requested length exceeds string length, the available portion is returned
> Complete this query to extract the first 3 characters of each product code as the category.
SELECT (product_code, , ) AS category FROM products
When the position of a delimiter is unknown in advance, combining STRPOS to locate it and SUBSTR to extract around it produces a robust parser that works correctly regardless of how long the prefix or suffix is.
REPLACE()
Swap characters or words inside text
REPLACE finds all occurrences of a substring and replaces them with a different substring. Syntax: REPLACE(string, search_string, replacement_string). The comparison is case-sensitive, and every match is replaced globally throughout the string.
REPLACE is essential for data cleaning, standardization, and text transformation workflows.
Basic Replacement
| company_name |
|---|
| Acme Corp. |
| Tech Inc. |
| Data Corp. Systems |
Every occurrence of 'Corp.' is replaced with 'Corporation'. The third row shows that REPLACE works even when the search string appears in the middle. Rows without matches are unchanged.
Removing Characters
Chaining REPLACE calls removes multiple different characters. Each call removes one type of character. The result contains only digits and spaces (spaces would require another REPLACE to remove).
Advanced Usage
REPLACE becomes more powerful when chained together or combined with other string functions.
Data Standardization
REPLACE is commonly used to convert between data formats or fix inconsistent input:
Single vs Chained
A single REPLACE handles one substitution. For multiple characters, you chain calls. Compare the approaches:
REPLACE() is case-sensitive in most databases. REPLACE('Hello', 'HELLO', 'Hi') returns 'Hello' unchanged because 'HELLO' does not match 'Hello'. Use LOWER() or UPPER() if you need case-insensitive replacement.Multiple Replacements
For multiple different replacements, nest REPLACE calls:
Each REPLACE() transforms one code. The innermost replacement runs first, then the next, and so on. Order matters if replacements might overlap.
Best Practices
REPLACE is a workhorse function for data cleaning, but knowing when to use it versus other approaches matters.
When to Use REPLACE()
Compare your data before and after applying REPLACE to understand its impact on consistency.
- Inconsistent abbreviations
- Unwanted characters in data
- Mixed date/number formats
- Legacy single-char codes
- Standardized terminology
- Clean, uniform strings
- Consistent date formats
- Human-readable values
REPLACE performs a global search-and-replace in a single pass, making it efficient for bulk text transformations.
REPLACE()scans the string once, replacing all matches- Returns the original string if no matches found
- Returns
NULLif any argument isNULL - Case-sensitive; can increase or decrease string length
Data standardization is one of the most common real-world applications of REPLACE.
> Complete this query to replace underscores with spaces in product names.
SELECT (product_name, , ) AS clean_name FROM products
- Performs global substitution: every occurrence is replaced in a single pass
- Chain multiple REPLACE calls to strip multiple unwanted characters (standard ETL pattern)
- Case-sensitive: normalize with LOWER() or UPPER() first if source data has inconsistent casing
REPLACE substitutes every occurrence in one pass, not just the first. This global behavior is what makes it effective for bulk data cleaning across entire columns.
Nesting REPLACE() calls is the standard approach for removing or substituting multiple different characters in a single query.
STRPOS()
Locate where a pattern appears in text
STRPOS finds the starting position of a substring within a larger string. Syntax: STRPOS(string, substring). Returns the 1-indexed position of the first match, or 0 if not found.
STRPOS is crucial for locating delimiters, parsing variable-format strings, and building conditional logic based on string contents. The position it returns can then be used directly in other string functions.
Delimiter Positions
STRPOS is essential for parsing strings with delimiters.
| alice@gmail.com |
| bob@company.org |
| support@help.desk.com |
STRPOS() returns where the @ symbol appears. The position varies by email length. In 'alice@gmail.com', @ is at position 6. In 'bob@company.org', @ is at position 4.
Extraction by Position
Combine STRPOS with SUBSTRING to extract variable-length portions:
The username is extracted from position 1 to just before the @. The domain is extracted from just after the @ to the end. STRPOS provides the exact location needed for these calculations.
Advanced Usage
Beyond simple lookups, STRPOS enables pattern-based logic and complex parsing strategies.
Substring Existence
STRPOS returns 0 when the substring is not found, making it useful for conditional logic:
If STRPOS returns a value greater than 0, the substring exists. This pattern enables classification based on string contents without exact matching.
Multiple Delimiters
When parsing complex strings with multiple delimiters, STRPOS helps locate each one:
STRPOS finds the first occurrence only. For paths with multiple slashes, you would extract portions sequentially or use database-specific functions for splitting.
Best Practices
STRPOS shines in data parsing tasks where you need to locate dynamic boundaries within strings.
When to Use STRPOS()
These are the most common scenarios where STRPOS proves invaluable.
STRPOS scans from the start of the string, so its performance depends on where matches occur.
- STRPOS() scans the string sequentially until a match is found
- Returns 0 if not found; returns 1 if match at first character
- Case-sensitive in most databases; finds first occurrence only
> Complete this query to find the position of the @ symbol in email addresses.
SELECT (email, ) AS at_pos FROM users
- Returns 0 (not NULL) when substring not found; compare result to 0 rather than using IS NULL
- Most powerful when used as input to SUBSTRING() for dynamic extraction based on found position
- Only finds first occurrence; parsing multiple delimiters requires nesting calls or using REPLACE() first
STRPOS returns 0 when the substring is not found. This means STRPOS(col, 'x') > 0 reliably tests for presence, while comparing the result to NULL would never match.
The most powerful use of STRPOS() is feeding its result to SUBSTRING(): by finding a delimiter position dynamically, you can extract variable-length substrings that depend on the actual content of each row.
TRIM()
Remove unwanted whitespace from values
TRIM removes unwanted characters from the beginning, end, or both sides of a string. By default it removes whitespace. Syntax: TRIM(string) or TRIM(LEADING/TRAILING/BOTH characters FROM string) for custom characters. LTRIM and RTRIM are shortcuts.
TRIM is essential for cleaning user input and ensuring accurate string comparisons, especially when dealing with legacy data that has inconsistent whitespace.
Removing Whitespace
| username | Display |
|---|---|
| alice | " alice " (spaces on both sides) |
| bob | "bob " (trailing spaces) |
| carol | " carol" (leading spaces) |
TRIM removes spaces from both ends. All usernames are now clean and comparable. Without trimming, ' alice ' would not match 'alice' in a WHERE clause or JOIN.
LTRIM() and RTRIM()
LTRIM removes leading characters only. RTRIM removes trailing characters only. These are useful when you need asymmetric cleaning:
Custom Character Trim
TRIM(LEADING '$' FROM ...) removes dollar signs from the beginning. Note that the second row has two dollar signs, and only the leading ones are removed. TRIM removes all consecutive matching characters from the specified side.
Practical Applications
TRIM is a staple in data pipelines and ETL workflows. Here are common patterns you will encounter.
Cleaning Import Data
When joins mysteriously return no results, whitespace is often the culprit. Applying TRIM() to both sides of the join ensures spaces do not prevent matches.
Import vs Query TRIM
You can apply TRIM when data is loaded or when it is queried. Each approach has trade-offs for performance and data integrity.
TRIM in JOIN conditions prevents index usage. For production systems, clean data at import time and store trimmed values. Use TRIM in joins only for debugging or one-time data fixes.Best Practices
TRIM is one of the most frequently used functions in data cleaning pipelines. Here is when to apply it.
When to Use TRIM()
TRIM is highly efficient because it only examines characters at the string edges, not the entire content.
TRIM()scans from the edges inward until non-matching characters- Returns
NULLif input isNULL - Very efficient as it only examines edge characters; does not affect middle
Invisible whitespace bugs are notoriously hard to debug, making TRIM an essential defensive practice.
> Complete this query to remove whitespace from usernames.
SELECT (username) AS clean_name FROM users
- Use SUBSTR(str, 1, N) for first N chars, SUBSTR(str, -N) for last N
- Combine STRPOS() with SUBSTR() for dynamic parsing
- Apply TRIM() to user input before storage
- Chain REPLACE() calls for multiple substitutions
- Test extraction logic with edge cases (empty strings, NULLs)
- Assume fixed positions without validating data format
- Forget that SQL positions start at 1, not 0
- Use TRIM() in JOIN conditions without understanding the performance impact
- Ignore case sensitivity in REPLACE() and STRPOS()
- Apply SUBSTR() with hardcoded positions to variable-format data
> You are a data engineer at FedEx parsing and normalizing raw address and tracking strings scraped from carrier API responses before loading them into the central warehouse for shipment analytics.
SUBSTRING() with fixed positions extracts the carrier code and service tier embedded at known character offsets within each standardized tracking number format.STRPOS() locates the delimiter between street address and city in variable-length address strings, giving SUBSTRING() a dynamic start point for the extraction.REPLACE() strips formatting characters such as parentheses and hyphens from raw phone strings and converts slash-delimited date fields to ISO hyphen format.TRIM() removes invisible leading and trailing whitespace from carrier response fields that would otherwise break joins against the reference address table.SUBSTR() extracts text; positions are 1-indexed, not 0-indexedSUBSTR(str, 1, N) for first N chars, SUBSTR(str, -N) for last NSTRPOS() returns 0 if not found, not NULL (unlike some languages)REPLACE() is case-sensitive; replaces all occurrences, not just the firstTRIM() removes only leading/trailing whitespace, not internal spacesLTRIM() and RTRIM() when you need directional controlTRIM() both sides before joiningSUBSTR + STRPOS is powerfulSlice, search, and reshape text at will
SUBSTR, TRIM, and string surgery 101
- Category
- SQL
- Difficulty
- intermediate
- Duration
- 30 minutes
- Challenges
- 0 hands-on challenges
Topics covered: SUBSTRING(), Edge Extractions, REPLACE(), STRPOS(), TRIM()
Lesson Sections
- SUBSTRING() (concepts: sqlSubstring)
Basic Extraction Consider product codes that embed category information in specific positions. Extract to String End Practical Applications Parsing Email Domains This query finds the position of the @ symbol, adds 1 to skip past it, then extracts everything from that point to the end. The domain portion has variable length, so omitting the length parameter is essential. Fixed-Width Extraction Many legacy systems use fixed-width formats where each field occupies specific character positions: Best
- Edge Extractions (concepts: sqlStringBuilding)
Extract from Start RIGHT() Extracting from the end requires calculating the start position or using negative indices, depending on your database. Extracting from the End Comparison & Best Practices Understanding the differences between extraction methods helps you choose the right approach for each situation. Extraction Patterns Practical Examples Extracting order year and sequence number from order IDs: Extracting Prefixes and Suffixes Extracting from the end of a string with a negative start p
- REPLACE() (concepts: sqlReplace)
Basic Replacement Standardizing company names by expanding abbreviations. Removing Characters To remove characters, replace them with an empty string: Advanced Usage Data Standardization Date formats with slashes are converted to ISO format with hyphens. This standardization is critical when importing data from systems with different conventions. Single vs Chained Multiple Replacements Best Practices When to Use REPLACE()
- STRPOS() (concepts: sqlStrpos)
Delimiter Positions Extraction by Position Advanced Usage Substring Existence STRPOS returns 0 when not found, but returns 1 for a match at the beginning. Remember that SQL positions start at 1, so a match at the first character returns 1, not 0. Multiple Delimiters Best Practices When to Use STRPOS()
- TRIM() (concepts: sqlTrim)
Removing Whitespace Whitespace problems are invisible but cause real bugs. Extra spaces prevent joins from matching and corrupt comparisons. LTRIM() and RTRIM() Custom Character Trim You can specify which characters to remove instead of whitespace: Practical Applications Cleaning Import Data Data imported from spreadsheets or external systems often has invisible whitespace that breaks joins: Import vs Query TRIM Best Practices When to Use TRIM() These are the most common scenarios where whitespa