String Functions: Beginner
Booking.com processes over 1.5 million room reservations per day across listings in 43 languages, and behind the scenes, SQL string functions ensure that "New York", "new york", and "NEW YORK " all resolve to the same property listings. Functions like UPPER, LOWER, and TRIM normalize raw user input before it ever touches a search index, while CONCAT assembles structured display strings from fragmented property address pieces. Without reliable string manipulation at that scale, a trailing space or an unexpected capital letter would silently fracture search quality for millions of travelers. The functions you are about to learn are the same primitives powering that kind of precision.
String Data Type
Store and query text values correctly
String Literals in SQL
Strings vs Other Types
- 5 (Number): Can add, 5 + 3 = 8
- '5' (String): Cannot add, text not math
- Alice (String): Cannot add, text only
- 2024-01-15 (Date): Cannot add, specialized type
Storage Types
VARCHAR vs CHAR
Databases offer different string types for different use cases. VARCHAR stores variable-length text, using only the space needed. CHAR stores fixed-length text, padding shorter values with spaces. The choice affects storage and comparison behavior.
Character Encoding
Edge Cases
Empty Strings & NULL
An empty string '' is a string with zero characters. It is not the same as missing data. Missing data is represented as NULL. An empty string is a value. NULL means no value exists.
| user_id | middle_name | Meaning |
|---|---|---|
| 1 | 'Marie' | Has middle name |
| 2 | '' | No middle name, but field provided |
| 3 | NULL | Middle name not provided |
User 2 provided an empty string, meaning "I have no middle name." User 3 provided NULL, meaning "I did not answer this question." The distinction changes how you filter and count these rows.
> Complete this query to find orders with a specific status text.
SELECT order_id, status FROM orders WHERE status
String literals must be enclosed in single quotes. Double quotes have a different meaning in SQL, typically referring to column or table names rather than text values.
Understanding how strings are stored and compared is the foundation for all the string functions you will learn in the following sections.
CONCAT()
Combine text from multiple columns
CONCAT is short for concatenate, which means to join or link together. CONCAT combines multiple strings into a single string by placing them end-to-end in the order you specify. The function accepts two or more arguments and returns one continuous string.
The syntax is CONCAT(string1, string2, string3, ...). Each argument is a string, either a column or a literal value. The function joins them sequentially without adding spaces unless you explicitly include them.
The CONCAT function is straightforward to use but has important behaviors to understand when working with separators and multiple values.
Basic Concatenation
| first_name | last_name |
|---|---|
| Alice | Chen |
| Bob | Martinez |
| Carol | Kim |
Building Complex Strings
CONCAT can join many values to build formatted output. Consider generating display text for a product listing:
Here, CONCAT combines five pieces: product name, a separator, category, an opening parenthesis with dollar sign, and the price. The result is a formatted string ready for display.
CONCAT() to build identifiers, URLs, file paths, or formatted messages in your queries instead of doing it in application code. This reduces data transfer and keeps formatting logic centralized.Advanced Usage
Beyond basic two-string joins, CONCAT handles complex scenarios involving many values, mixed types, and potentially missing data.
Multiple Arguments
CONCAT accepts as many arguments as you need. Each argument is evaluated and converted to a string, then joined in order. You can mix column names and literal strings freely:
Handling Empty Values
If any argument to CONCAT is NULL, most databases will return NULL for the entire result. This behavior can cause unexpected blank output when fields contain missing data.
| first_name | middle_name | last_name |
|---|---|---|
| Alice | Marie | Chen |
| Bob | NULL | Martinez |
Bob's middle_name is NULL, so the entire concatenation becomes NULL. To handle this, you can use COALESCE to replace NULL with an empty string, or use conditional logic to skip NULL values.
Best Practices
Knowing when and where to use CONCAT helps you write cleaner queries and maintain consistency across your data pipelines.
When to Use CONCAT()
CONCAT is ideal for building composite strings from multiple data sources in your database.
Use CONCAT when you need to combine strings in the database rather than in application code. This approach centralizes logic, reduces data transfer, and makes transformations visible in your queries.
> Complete this query to build a full name from first and last name columns.
SELECT (, ' ', ) AS full_name FROM users
Understanding how CONCAT processes its arguments helps you predict behavior with NULL values and mixed types.
CONCAT()evaluates arguments left to right- Each argument is converted to a string before joining
NULLin any argument typically producesNULLoutput- Concatenating very long strings can be memory-intensive
LENGTH()
Measure the size of text values
LENGTH returns the number of characters in a string. It counts every character, including letters, numbers, spaces, and punctuation. The function accepts a single string argument and returns an integer.
The syntax is LENGTH(string). The result is the count of characters, not bytes. For most English text, character count equals byte count, but for international characters or emojis, one character may use multiple bytes. LENGTH always returns character count.
LENGTH is one of the simplest string functions, but its utility in data validation and analysis makes it indispensable.
Basic Usage
The function counts all characters. 'Ergonomic Keyboard' has 19 characters including the space. 'Mouse' has 5. LENGTH() is exact.
Filtering by Length
You can use LENGTH in WHERE clauses to filter rows based on string size:
This query finds usernames shorter than 5 characters. LENGTH makes it possible to enforce constraints, identify outliers, or find data that violates length requirements.
Data Quality Checks
LENGTH plays a key role in identifying data quality issues. Empty strings, NULL values, and unexpectedly short or long text all reveal problems worth investigating.
Finding Empty or NULL
LENGTH helps distinguish between empty strings and NULL values:
User 1 has a bio with 17 characters. User 2 has an empty bio with length 0. User 3 has NULL, so LENGTH returns NULL. This distinction is useful for data quality checks.
LENGTH vs Empty Check
There are two common ways to check if a string has content: using LENGTH or comparing directly to an empty string. Each approach handles NULL differently.
LENGTH() to validate input before processing. Find phone numbers that are too short, descriptions that exceed limits, or codes that don't match expected formats.Analysis & Best Practices
Beyond individual row validation, LENGTH helps analyze text patterns across entire datasets and inform schema design decisions.
Aggregate Length Analysis
Combining LENGTH with aggregate functions reveals patterns in text data:
When to Use LENGTH()
LENGTH appears throughout data validation, filtering, and analysis workflows.
LENGTH is commonly used in data quality pipelines to catch malformed input. If you expect product codes to be exactly 8 characters, you can filter WHERE LENGTH(product_code) != 8 to find invalid entries.
> Complete this query to find usernames shorter than 5 characters.
SELECT username FROM users WHERE () 5
LENGTH is one of the fastest string functions because it only needs to scan the string once and count characters without performing any transformation or copying.
LENGTH has straightforward execution characteristics but important edge case behavior.
- LENGTH() scans the string once to count characters
- Returns an integer (never a string); counts characters, not bytes
- NULL input produces NULL output
LOWER()
Standardize text to lowercase
LOWER converts all letters in a string to lowercase. Non-letter characters remain unchanged. The function accepts a single string argument and returns a new string with all uppercase letters replaced by their lowercase equivalents.
The syntax is LOWER(string). The input string is not modified; LOWER returns a new string. This function is essential for case-insensitive comparisons and data normalization.
LOWER is simple to use but essential for achieving consistent data handling across your queries.
Basic Normalization
Email addresses are often stored inconsistently due to user input. LOWER standardizes them.
| Alice@Email.com |
| BOB@email.COM |
| carol@email.com |
Case-Insensitive Compare
This query finds all variations of 'alice', regardless of how it was entered. LOWER(username) converts each stored value to lowercase, then compares it to the lowercase search term.
Data Handling
Understanding how LOWER interacts with your data helps avoid surprises when working with special characters and NULL values.
Preserving Original Data
LOWER does not change the original data in the table. It only transforms the output:
Use LOWER() when deduplicating data. Two records with email addresses 'user@site.com' and 'User@Site.COM' are duplicates, but string comparison treats them as different unless you normalize.
Non-Letter Characters
LOWER only affects letters. Numbers, spaces, punctuation, and special characters remain unchanged:
The letters became lowercase. The hyphen, digits, exclamation mark, and spaces are unchanged. LOWER() transforms only alphabetic characters.
Best Practices
When to Use LOWER()
The comparison below shows common problems caused by inconsistent casing and how LOWER solves them.
- Alice@Email.com != alice@email.com
- Duplicates slip through
- Joins silently fail on case
- Searches miss valid matches
- All emails match consistently
- Deduplication works correctly
- Joins find all matching rows
- Searches are case-insensitive
> Complete this query to normalize email addresses to lowercase for consistent comparison.
SELECT () AS normalized_email FROM users
- Store emails and identifiers as lowercase in the database to avoid having to normalize in every query
- LOWER() handles accented and international characters correctly with Unicode-aware database collations
- When deduplicating records from multiple sources, normalize text with LOWER() on both sides before comparing
LOWER is idempotent: applying it twice produces the same result as applying it once. A single call is sufficient to normalize any string regardless of its original casing.
When comparing strings from different sources, apply LOWER() to both sides so the comparison is case-insensitive without modifying the stored data.
UPPER()
Convert text to uppercase for matching
UPPER converts all letters in a string to uppercase. It is the inverse of LOWER. Non-letter characters remain unchanged. The function accepts a single string argument and returns a new string with all lowercase letters replaced by their uppercase equivalents.
The syntax is UPPER(string). Like LOWER, the input is not modified; UPPER returns a transformed copy. This function is used for formatting, normalization, and case-insensitive operations.
UPPER works identically to LOWER but transforms text in the opposite direction. The same principles apply.
Display Formatting
UPPER is commonly used to format output for display, especially for codes, labels, or emphasis.
Normalizing Identifiers
| product_code |
|---|
| abc-123 |
| XYZ-456 |
| Def-789 |
Advanced Usage
UPPER can be combined with other functions and used in conditional logic to build powerful transformations.
Case-Insensitive Match
Just like LOWER, you can use UPPER for case-insensitive comparisons by converting both sides to uppercase:
This query finds all orders with a status of 'pending', regardless of case. UPPER(status) converts each value to uppercase before comparing it to the uppercase search term.
Both functions serve the same purpose: case normalization. Use LOWER() for user-facing text like emails and usernames. Use UPPER() for system identifiers like codes and statuses. The choice is often a matter of convention.
UPPER() with Functions
UPPER can be combined with other string functions to build complex transformations:
Best Practices
Consistent use of UPPER for identifiers and codes helps maintain data quality across your systems.
When to Use UPPER()
UPPER is particularly useful for system identifiers and status codes that benefit from consistent formatting.
Use UPPER when output needs to be in uppercase for consistency, readability, or business requirements. Many systems require uppercase codes for integration with legacy platforms.
> Complete this query to build a region code by combining string functions.
SELECT ( (), '-', order_id ) AS region_order FROM orders
- Normalize case with LOWER() or UPPER() before comparisons
- Use CONCAT() to build formatted strings in the database
- Validate string lengths with LENGTH() before processing
- Store strings in consistent case when possible
- Assume case-insensitive comparisons without normalization
- Perform string operations in application code when SQL can do it
- Store text without validating length constraints
> You are a data analyst at a healthcare company standardizing patient name and contact fields pulled from three different legacy EHR systems before loading them into a unified patient database for reporting.
UPPER() and LOWER() normalize inconsistent text case so records entered as "Alice" in one system and "ALICE" in another merge into a single canonical patient record.LENGTH() validates that phone numbers and postal codes meet required character counts, flagging malformed entries before they enter the unified database.CONCAT() combines separate first name, last name, and suffix fields into a single display column used across dashboards and printed correspondence.CONCAT() joins multiple strings; || operator does the same in most databasesCONCAT() with NULL returns NULL unless your database handles it speciallyLENGTH() returns character count; OCTET_LENGTH returns byte count (matters for Unicode)LOWER() and UPPER() standardize case for consistent comparisonsText is the language of databases
Making text behave since day one
- Category
- SQL
- Difficulty
- beginner
- Duration
- 29 minutes
- Challenges
- 0 hands-on challenges
Topics covered: String Data Type, CONCAT(), LENGTH(), LOWER(), UPPER()
Lesson Sections
- String Data Type
A string is a data type that holds text. Strings are enclosed in single quotes in SQL. The text inside can contain letters, numbers, spaces, punctuation, or special characters. A string could be as short as one character or as long as several megabytes of text. When you see values like 'Alice', 'alice@email.com', or 'San Francisco' in a database, those are strings. The single quotes tell the database "this is text, not a keyword or number." Before manipulating strings, it helps to understand how
- CONCAT() (concepts: sqlConcat)
Basic Concatenation The most common use case is combining first and last names into a full name: The query takes first_name, adds a space ' ', then adds last_name. The space is a literal string argument. Without it, you would get 'AliceChen' instead of 'Alice Chen'. Building Complex Strings Advanced Usage Multiple Arguments Handling Empty Values Best Practices When to Use CONCAT() At scale, even simple string operations can become performance bottlenecks.
- LENGTH() (concepts: sqlLength)
Basic Usage Measuring string length is useful for validation, filtering, and analysis. Filtering by Length Data Quality Checks Finding Empty or NULL LENGTH vs Empty Check Analysis & Best Practices Aggregate Length Analysis This query calculates the average length of comments and finds the longest one. Such analysis helps you understand data distribution, set appropriate column sizes, or identify unusually long entries. When to Use LENGTH()
- LOWER() (concepts: sqlLowerUpper)
Basic Normalization All email addresses are now lowercase. This ensures consistency and makes comparisons reliable. Without normalization, 'Alice@Email.com' and 'alice@email.com' would be treated as different values. Case-Insensitive Compare By converting both sides of a comparison to lowercase, you can match strings regardless of case: Data Handling Preserving Original Data Non-Letter Characters Best Practices Case normalization is one of the most common data cleaning steps. Applying it consist
- UPPER()
Display Formatting Status values are converted to uppercase for consistent display. This formatting makes labels stand out and ensures uniformity regardless of how the data was entered. Normalizing Identifiers Product codes, order numbers, and other identifiers are often standardized as uppercase: All codes are now uppercase, ensuring consistency for comparisons, joins, and reporting. This normalization prevents mismatches caused by case differences. Advanced Usage Case-Insensitive Match UPPER()