String Functions: Advanced
Yelp hosts over 250 million reviews, and its data team uses advanced SQL string functions to extract structured insights from that ocean of unstructured text. By applying string aggregation and pattern matching across review corpora, they identify recurring complaint phrases, surface the dishes mentioned most often in five-star reviews, and detect sentiment trends that feed directly into their rating algorithm. What appears to users as a simple star score is actually the product of large-scale text analytics running over millions of free-form paragraphs. The advanced string techniques you are about to learn are the foundation for that kind of analysis.
SPLIT_TO_MAP()
Parse key-value pairs from raw text
SPLIT_TO_MAP parses a string containing key-value pairs into a map data structure. The function takes three arguments: the string to parse, the delimiter between pairs, and the delimiter between keys and values. Once parsed, you can access individual values by their keys using bracket notation.
The syntax is SPLIT_TO_MAP(string, entry_delimiter, key_value_delimiter). The entry_delimiter separates different key-value pairs (commonly "&" or ","). The key_value_delimiter separates each key from its value (commonly "=" or ":"). The result is a MAP type that supports key-based lookups.
Basic Syntax
SPLIT_TO_MAP works with any consistently delimited key-value string. The most common patterns are URL query strings and configuration entries.
Parsing Query Strings
| request_id | query_params |
|---|---|
| 1 | user_id=42&action=view&page=home |
| 2 | user_id=108&action=edit&page=settings |
| 3 | user_id=7&action=delete&page=profile |
SPLIT_TO_MAP parses the query string into a map. The bracket notation ['user_id'] extracts the value for that specific key. This converts unstructured text into structured, queryable data without complex string manipulation.
Parsing Config Strings
Configuration data often arrives as delimited strings from external systems. SPLIT_TO_MAP handles various delimiter formats:
Here the entry delimiter is a comma and the key-value delimiter is a colon. The function adapts to whatever format your source data uses. You define the parsing rules, and SPLIT_TO_MAP applies them consistently.
Delimiter Format Variants
NULL, not an error. Always validate that expected keys exist before building logic on their values. Use COALESCE() to provide default values for missing keys.Advanced Usage
Beyond simple extraction, SPLIT_TO_MAP results can be used directly in queries for filtering, joining, and aggregation.
Filtering by Map Values
Once parsed, map values can be used in WHERE clauses for filtering:
Best Practices
Choosing the right tool matters. SPLIT_TO_MAP shines with structured key-value data but is not a replacement for a proper JSON parser.
When to Use SPLIT_TO_MAP()
SPLIT_TO_MAP is particularly useful when working with serialized key-value data stored as strings.
SPLIT_TO_MAPscans the string once, splitting on delimiters- Returns empty map if input is empty; returns
NULLif input isNULL - Duplicate keys result in the last value winning; key lookup is case-sensitive
> Complete this query to parse a URL query string into a map and extract the action parameter.
SELECT call_id, (endpoint, , ) 'action' AS action FROM api_calls
SPLIT_TO_MAP converts flat key-value strings into queryable maps, turning text parsing into structured data access.
Accessing a non-existent key in the resulting map returns NULL, not an error. Use COALESCE to provide defaults for missing keys.
String Building with CONCAT and LPAD()
Format strings with padding and alignment
CONCAT combines multiple strings into a single result. Combined with LPAD for zero-padding and CAST for number formatting, you can build formatted strings with precise control over their appearance. This provides the flexibility to create display strings, report labels, and standardized identifiers directly within SQL queries.
The syntax is CONCAT(value1, value2, value3, ...). The LPAD(string, length, pad_char) function pads a string to a fixed width by prepending pad characters. ROUND(number, decimals) rounds to a specific number of decimal places.
Building Dynamic Strings
CONCAT is the foundation for building formatted output. It assembles literal text and column values in any order.
CONCAT combines a dollar sign literal with the rounded price. The ROUND function ensures prices display with consistent precision.
Currency Formatting
ROUND ensures monetary values display with exactly two decimal places. Combine it with CONCAT and a dollar sign prefix:
The ROUND function with 2 decimal places ensures consistent formatting. This pattern works reliably across different numeric precisions in your source data.
Zero-Padded Identifiers
LPAD pads a value with leading zeros to create fixed-width identifiers that sort correctly and display consistently:
LPAD pads the order ID to 6 characters using zeros. Combined with CONCAT, this creates uniform identifiers like ORD-000042.
Building Complex Strings
CONCAT assembles multiple values in order, creating descriptive labels:
Multiple CONCAT arguments combine text literals with column values. CAST converts integers to strings so CONCAT can join them.
Best Practices
Building formatted strings in SQL is powerful, but readability matters. Use clear variable names and keep string operations simple.
When to Use String Building
CONCAT with LPAD and ROUND excels at producing human-readable output with consistent styling.
Always CAST numeric values to VARCHAR before concatenating. Mixing types in CONCAT requires explicit conversion.
CONCAT with NULL values produces NULL output. If any input is NULL, the result is NULL. Use COALESCE to provide default values for potentially NULL columns.
> Complete this query to format a price as a dollar amount with two decimal places.
SELECT product_name, ('$', ROUND(, 2)) AS display_price FROM products
CONCAT combines values in the order provided. For numeric formatting, use ROUND to control decimal places.
Always CAST non-string values to VARCHAR when concatenating. This ensures consistent behavior across different data types.
LPAD pads strings to fixed widths using a specified character, ideal for creating sortable identifiers.
TO_BASE64() / FROM_BASE64()
Encode and decode binary data as text
Base64 encoding converts binary data into a text representation using 64 printable ASCII characters. TO_BASE64 encodes binary data to Base64 text. FROM_BASE64 decodes Base64 text back to binary. These functions enable safe storage and transmission of binary data through text-only channels.
The syntax is TO_BASE64(varbinary) for encoding and FROM_BASE64(varchar) for decoding. Base64 uses characters A-Z, a-z, 0-9, plus (+), and slash (/). The encoded output is approximately 33% larger than the original binary because every 3 bytes become 4 characters.
Basic Concepts
| Original Text | Base64 Encoded |
|---|---|
| Hello | SGVsbG8= |
| SQL | U1FM |
| Data! | RGF0YSE= |
Encoding & Decoding
To encode a text string, first convert it to binary using TO_UTF8, then encode to Base64:
TO_UTF8 converts the string to bytes, then TO_BASE64 converts those bytes to Base64 text. This two-step process is required because TO_BASE64 expects binary input, not string input.
Decoding Base64 to Text
Decoding reverses the process: FROM_BASE64 produces binary, then FROM_UTF8 converts to text:
Example: FROM_UTF8(FROM_BASE64('SGVsbG8gV29ybGQ=')) returns 'Hello World'. The decoding chain must reverse the encoding chain exactly.
Best Practices
Base64 Function Uses
TO_BASE64 and FROM_BASE64 serve opposite purposes in data pipelines, often used at system boundaries.
- Store binary in text columns
- Transmit via text protocols
- Prepare data for APIs
- Encode file contents
- Decode API responses
- Read stored binary data
- Process encoded payloads
- Restore original content
TO_BASE64processes 3 bytes at a time, outputting 4 charactersFROM_BASE64reverses: 4 characters to 3 bytes; invalid input causes errorsNULLinput producesNULLoutput; 33% size overhead
> Complete this query to decode a Base64-encoded API response back to readable text.
SELECT call_id, ((err_msg)) AS decoded FROM api_calls
The decoding chain must reverse the encoding chain exactly: FROM_BASE64 produces bytes, FROM_UTF8 interprets those bytes as text.
Base64-encoded data is approximately 33% larger than the original because every 3 bytes become 4 Base64 characters.
TO_HEX() / FROM_HEX()
Convert between hex and readable formats
Hexadecimal (hex) encoding represents binary data using characters 0-9 and A-F. TO_HEX converts binary data to a hexadecimal string. FROM_HEX converts a hexadecimal string back to binary. Each byte becomes exactly two hex characters, making hex useful for debugging and byte-level inspection.
The syntax is TO_HEX(varbinary) for encoding and FROM_HEX(varchar) for decoding. Hex encoding doubles the size of the data (each byte becomes two characters) but provides a direct byte-by-byte representation that is easy to read and manipulate.
Basic Concepts
| Character | ASCII Value | Hex Value |
|---|---|---|
| A | 65 | 41 |
| B | 66 | 42 |
| Z | 90 | 5A |
| a | 97 | 61 |
| 0 | 48 | 30 |
Encoding & Decoding
To see the hex representation of text, first convert to bytes with TO_UTF8:
Decoding Hex to Text
Decoding reverses the hex encoding. FROM_UTF8(FROM_HEX('44617461')) returns 'Data'. The hex string is converted to bytes, then those bytes are interpreted as UTF-8 text.
Best Practices
When to Use Hex Functions
TO_HEX processes each byte, outputting two characters. FROM_HEX expects pairs of valid hex characters (0-9, A-F). Hex input is case-insensitive, and the output is exactly double the input size.
> Complete this query to convert text to its hexadecimal byte representation.
SELECT kv_value, ((kv_value)) AS hex_bytes FROM kv_store
The encoding chain for hex is always text -> TO_UTF8 -> TO_HEX. Each byte becomes exactly two hex characters.
TO_HEX encoding is useful for debugging encoding issues because it reveals the exact bytes stored in a string.
TO_UTF8() / FROM_UTF8()
Handle character encoding conversions
UTF-8 is the dominant character encoding on the internet. TO_UTF8 converts a string to its UTF-8 byte sequence (varbinary). FROM_UTF8 converts a UTF-8 byte sequence back to a string. These functions bridge between text and binary representations, enabling precise encoding control.
The syntax is TO_UTF8(varchar) to get bytes and FROM_UTF8(varbinary) to get text. UTF-8 uses 1 to 4 bytes per character: ASCII characters use 1 byte, common international characters use 2-3 bytes, and emojis use 4 bytes. This variable-length encoding is efficient and universally supported.
Basic Concepts
- Hello: 5 characters, 5 bytes (ASCII, 1 byte each)
- é (accented e): 1 character, 2 bytes
- 中 (Chinese): 1 character, 3 bytes
- 😀 (emoji): 1 character, 4 bytes
Encoding & Decoding
TO_UTF8 is required before any byte-level operation like Base64 or hex encoding:
Example: LENGTH(TO_UTF8(text_col)) returns the byte count for a text column, which may differ from character count for non-ASCII text. FROM_UTF8(binary_col) interprets a byte sequence as UTF-8 text.
Chaining Encodings
The pattern is always: text -> TO_UTF8 -> bytes -> encoding function. Decoding reverses: encoded -> decoding function -> bytes -> FROM_UTF8 -> text. This chain preserves the original data exactly.
Best Practices
FROM_UTF8 fails on invalid byte sequences. Some databases offer an optional replacement parameter:
FROM_UTF8(binary_data, '') with a second parameter specifies what to substitute for invalid bytes. An empty string removes them. A question mark or similar character makes corruption visible. This is useful when processing data from systems with inconsistent encodings.
UTF-8 Function Uses
TO_UTF8encodes characters to 1-4 bytes eachFROM_UTF8decodes UTF-8 bytes; invalid sequences cause errorsNULLinput producesNULLoutput; byte length depends on characters
- Always chain TO_UTF8 before TO_BASE64 or TO_HEX for text
- Validate encoding integrity with round-trip tests
- Use CONCAT and LPAD for consistent number and currency display
- Handle missing map keys with COALESCE defaults
- Document the encoding chain in complex transformations
- Confuse Base64 encoding with encryption (it provides no security)
- Assume character count equals byte count for international text
- Mix encoding functions without proper conversion steps
- Use hardcoded format specifiers that do not match data types
- Skip error handling for FROM_UTF8 on untrusted input
> Complete this query to measure the actual byte size of text data.
SELECT kv_value, ((kv_value)) AS byte_count FROM kv_store
Character count and byte count differ for non-ASCII text. The word Café has 4 characters but 5 bytes because the accented é uses 2 bytes.
LENGTH(TO_UTF8(col)) is the standard way to measure actual storage size, essential for capacity planning with international text.
When importing data from external sources, round-tripping through TO_UTF8 and FROM_UTF8 reveals encoding corruption.
> You are a data engineer at Zendesk extracting structured product codes and issue categories from free-text customer complaint fields so that tickets can be automatically routed to the correct support queue.
SPLIT_TO_MAP() parses the URL-encoded metadata appended to each complaint record by the intake API, converting key-value pairs into queryable map fields.CONCAT() with LPAD() builds zero-padded ticket identifiers and standardized display strings for the routing dashboard using string concatenation.TO_BASE64() and FROM_BASE64() encode and decode binary attachment payloads stored alongside complaints so they survive transmission through the text-only logging pipeline.TO_UTF8() converts complaint text to bytes before TO_HEX() inspection, revealing hidden non-printable characters that cause downstream parsing failures in the routing engine.SPLIT_TO_MAP parses key-value strings into queryable map structures using customizable delimitersSPLIT_PART() extracts specific parts from delimited strings using 1-based indexingCONCAT() with LPAD() builds formatted strings with controlled padding and decimal placesTO_UTF8 / FROM_UTF8 bridge between text strings and binary byte sequencesTO_UTF8 before TO_BASE64 or TO_HEX when encoding textFROM_BASE64 then FROM_UTF8Production-grade text transformations
Encoding, formatting, and production-grade string handling
- Category
- SQL
- Difficulty
- advanced
- Duration
- 20 minutes
- Challenges
- 0 hands-on challenges
Topics covered: SPLIT_TO_MAP(), String Building with CONCAT and LPAD(), TO_BASE64() / FROM_BASE64(), TO_HEX() / FROM_HEX(), TO_UTF8() / FROM_UTF8()
Lesson Sections
- SPLIT_TO_MAP() (concepts: sqlSplitToMap)
Basic Syntax Parsing Query Strings URL query strings are the most common use case. They contain parameters in key=value format separated by ampersands: Parsing Config Strings Delimiter Format Variants The same function handles different delimiter conventions. The key is matching the delimiters to your data format. Common patterns include URL-style (&/=), config-style (,/:), and pipe-style (|/=). Advanced Usage Filtering by Map Values This query finds all requests where the action parameter equal
- String Building with CONCAT and LPAD()
Building Dynamic Strings Currency Formatting Zero-Padded Identifiers Building Complex Strings Best Practices When to Use String Building
- TO_BASE64() / FROM_BASE64()
Basic Concepts Base64 solves a fundamental problem: binary data contains bytes that are not valid text characters. Attempting to store raw binary in a text field corrupts the data. Base64 transforms binary into safe text that survives any text processing. The = characters at the end are padding. Base64 works in groups of 4 characters, and padding fills incomplete groups. The encoded result contains only letters, numbers, plus, slash, and equals. Encoding & Decoding Decoding Base64 to Text Best P
- TO_HEX() / FROM_HEX()
Basic Concepts Hexadecimal uses 16 symbols: 0-9 represent values 0-9, and A-F represent values 10-15. Two hex digits represent one byte (values 0-255). The letter "A" has ASCII value 65, which is 41 in hexadecimal (4*16 + 1 = 65). This direct byte-to-hex mapping makes hex ideal for inspecting exactly what bytes a string contains. Encoding & Decoding Each character becomes two hex digits. "Hello" is 5 characters, producing 10 hex digits. Reading the hex, 48 is "H", 65 is "e", and so on. This visi
- TO_UTF8() / FROM_UTF8() (concepts: sqlComplexPatterns)
Basic Concepts Different characters require different numbers of bytes in UTF-8: ASCII letters use 1 byte each. Accented European characters use 2 bytes. East Asian characters typically use 3 bytes. Emojis use 4 bytes. Understanding this helps predict storage requirements and debug length mismatches. Encoding & Decoding Chaining Encodings UTF-8 functions are the bridge between text and binary encodings: Best Practices UTF-8 Function Uses UTF-8 functions bridge text and binary representations, es