String Functions: Beginner

Booking.com processes over 1.5 million room reservations per day across listings in 43 languages, and behind the scenes, SQL string functions ensure that "New York", "new york", and "NEW YORK " all resolve to the same property listings. Functions like UPPER, LOWER, and TRIM normalize raw user input before it ever touches a search index, while CONCAT assembles structured display strings from fragmented property address pieces. Without reliable string manipulation at that scale, a trailing space or an unexpected capital letter would silently fracture search quality for millions of travelers. The functions you are about to learn are the same primitives powering that kind of precision.

String Data Type

Daily Life

Interviews

Store and query text values correctly

A string is a data type that holds text. Strings are enclosed in single quotes in SQL. The text inside can contain letters, numbers, spaces, punctuation, or special characters. A string could be as short as one character or as long as several megabytes of text.

When you see values like 'Alice', 'alice@email.com', or 'San Francisco' in a database, those are strings. The single quotes tell the database "this is text, not a keyword or number."

Before manipulating strings, it helps to understand how SQL represents and stores them. The following sections cover the basics of string syntax and type distinctions.

String Literals in SQL

You can use string literals directly in your queries by wrapping them in single quotes:

1	SELECT
2	'Hello World' AS greeting
3	FROM users

Result

greeting
Hello World

The text 'Hello World' is a string literal. SQL treats it as a fixed text value, not as a column name or command. This query returns that exact text for every row.

Strings vs Other Types

Strings are text. Numbers are numeric values. The character '5' as a string is different from the number 5. You cannot perform arithmetic on strings, and you cannot use text operations on numbers without conversion.

Strings vs Other Types

5 (Number): Can add, 5 + 3 = 8
'5' (String): Cannot add, text not math
Alice (String): Cannot add, text only
2024-01-15 (Date): Cannot add, specialized type

Understanding the distinction prevents errors. If you try to add '5' + '3' as strings, the database will either fail or concatenate them into '53', not return 8.

Storage Types

How you store strings affects storage efficiency, performance, and comparison behavior. SQL offers different string types to handle varying requirements.

VARCHAR vs CHAR

Databases offer different string types for different use cases. VARCHAR stores variable-length text, using only the space needed. CHAR stores fixed-length text, padding shorter values with spaces. The choice affects storage and comparison behavior.

VARCHAR stores the exact text with no padding. Each value uses only the space it needs.

1	SELECT
2	username
3	FROM users

TIP

Strings consume more storage than numbers. A 10-digit number stored as an integer takes 4-8 bytes. The same number stored as a string takes 10 bytes. At billions of rows, this difference becomes significant.

Character Encoding

Databases store strings as sequences of bytes representing characters. Character encoding defines how those bytes map to letters, symbols, and special characters. UTF-8 is the dominant encoding and supports all languages and emojis. When you store text, the database encodes it. When you retrieve it, the database decodes it.

Encoding matters when you work with international text or special characters. If your database uses one encoding and your application expects another, text will appear corrupted. Always use UTF-8 unless you have a specific reason not to.

Edge Cases

Real-world data contains missing values, empty fields, and unexpected inputs. Understanding these edge cases prevents bugs in your queries.

Empty Strings & NULL

An empty string '' is a string with zero characters. It is not the same as missing data. Missing data is represented as NULL. An empty string is a value. NULL means no value exists.

user_id	middle_name	Meaning
1	'Marie'	Has middle name
2	''	No middle name, but field provided
3	NULL	Middle name not provided

User 2 provided an empty string, meaning "I have no middle name." User 3 provided NULL, meaning "I did not answer this question." The distinction changes how you filter and count these rows.

> Complete this query to find orders with a specific status text.

SELECT
  order_id,
  status
FROM orders
WHERE status ___ ___

'pending'

String comparisons in SQL are case-sensitive by default. The value 'pending' will not match 'Pending' or 'PENDING' unless you normalize the case first.

String literals must be enclosed in single quotes. Double quotes have a different meaning in SQL, typically referring to column or table names rather than text values.

Understanding how strings are stored and compared is the foundation for all the string functions you will learn in the following sections.

CONCAT()

Daily Life

Interviews

Combine text from multiple columns

CONCAT is short for concatenate, which means to join or link together. CONCAT combines multiple strings into a single string by placing them end-to-end in the order you specify. The function accepts two or more arguments and returns one continuous string.

The syntax is CONCAT(string1, string2, string3, ...). Each argument is a string, either a column or a literal value. The function joins them sequentially without adding spaces unless you explicitly include them.

The CONCAT function is straightforward to use but has important behaviors to understand when working with separators and multiple values.

Basic Concatenation

The most common use case is combining first and last names into a full name:

first_name	last_name
Alice	Chen
Bob	Martinez
Carol	Kim

1	SELECT
2	CONCAT(first_name, ' ', last_name) AS full_name
3	FROM users

Result

full_name
Alice Chen
Bob Martinez
Carol Kim

The query takes first_name, adds a space ' ', then adds last_name. The space is a literal string argument. Without it, you would get 'AliceChen' instead of 'Alice Chen'.

Building Complex Strings

CONCAT can join many values to build formatted output. Consider generating display text for a product listing:

1	SELECT
2	CONCAT(
3	product_name,
4	' - ',
5	category,
6	' ($',
7	price,
8	')'
9	) AS listing
10	FROM products

Result

listing
Ergonomic Keyboard - Electronics ($89.99)
Standing Desk - Furniture ($299.00)
Wireless Mouse - Electronics ($34.99)

Here, CONCAT combines five pieces: product name, a separator, category, an opening parenthesis with dollar sign, and the price. The result is a formatted string ready for display.

TIP

Use CONCAT() to build identifiers, URLs, file paths, or formatted messages in your queries instead of doing it in application code. This reduces data transfer and keeps formatting logic centralized.

Advanced Usage

Beyond basic two-string joins, CONCAT handles complex scenarios involving many values, mixed types, and potentially missing data.

Multiple Arguments

CONCAT accepts as many arguments as you need. Each argument is evaluated and converted to a string, then joined in order. You can mix column names and literal strings freely:

Two arguments: joins city directly to country with no separator, producing 'San FranciscoUSA'.

1	SELECT
2	CONCAT(city, country) AS location
3	FROM addresses

Handling Empty Values

If any argument to CONCAT is NULL, most databases will return NULL for the entire result. This behavior can cause unexpected blank output when fields contain missing data.

first_name	middle_name	last_name
Alice	Marie	Chen
Bob	NULL	Martinez

1	SELECT
2	CONCAT(
3	first_name,
4	' ',
5	middle_name,
6	' ',
7	last_name
8	) AS full_name
9	FROM users

Result

full_name
Alice Marie Chen
NULL

Bob's middle_name is NULL, so the entire concatenation becomes NULL. To handle this, you can use COALESCE to replace NULL with an empty string, or use conditional logic to skip NULL values.

Best Practices

Knowing when and where to use CONCAT helps you write cleaner queries and maintain consistency across your data pipelines.

When to Use CONCAT()

CONCAT is ideal for building composite strings from multiple data sources in your database.

Display Names

Build full names from separate first and last name fields

Unique Identifiers

Join keys to create composite identifiers

Formatted Strings

Combine addresses, URLs, file paths, or email addresses

Dynamic Messages

Construct labels and messages from data values

Use CONCAT when you need to combine strings in the database rather than in application code. This approach centralizes logic, reduces data transfer, and makes transformations visible in your queries.

> Complete this query to build a full name from first and last name columns.

SELECT
  ___(___, ' ', ___) AS full_name
FROM users

last_name

CONCAT

UPPER

first_name

Understanding how CONCAT processes its arguments helps you predict behavior with NULL values and mixed types.

How CONCAT() Executes

CONCAT() evaluates arguments left to right
Each argument is converted to a string before joining
NULL in any argument typically produces NULL output
Concatenating very long strings can be memory-intensive

At scale, even simple string operations can become performance bottlenecks.

LENGTH()

Daily Life

Interviews

Measure the size of text values

LENGTH returns the number of characters in a string. It counts every character, including letters, numbers, spaces, and punctuation. The function accepts a single string argument and returns an integer.

The syntax is LENGTH(string). The result is the count of characters, not bytes. For most English text, character count equals byte count, but for international characters or emojis, one character may use multiple bytes. LENGTH always returns character count.

LENGTH is one of the simplest string functions, but its utility in data validation and analysis makes it indispensable.

Basic Usage

Measuring string length is useful for validation, filtering, and analysis.

1	SELECT
2	product_name,
3	LENGTH(product_name) AS name_length
4	FROM products

Result

product_name	name_length
Ergonomic Keyboard	19
Mouse	5
4K Monitor	10

The function counts all characters. 'Ergonomic Keyboard' has 19 characters including the space. 'Mouse' has 5. LENGTH() is exact.

Filtering by Length

You can use LENGTH in WHERE clauses to filter rows based on string size:

1	SELECT
2	username
3	FROM users
4	WHERE LENGTH(username) < 5

Result

username
bob
ali
dan

This query finds usernames shorter than 5 characters. LENGTH makes it possible to enforce constraints, identify outliers, or find data that violates length requirements.

Data Quality Checks

LENGTH plays a key role in identifying data quality issues. Empty strings, NULL values, and unexpectedly short or long text all reveal problems worth investigating.

Finding Empty or NULL

LENGTH helps distinguish between empty strings and NULL values:

1	SELECT
2	user_id,
3	bio,
4	LENGTH(bio) AS bio_length
5	FROM users

Result

user_id	bio	bio_length
1	Software engineer	17
2		0
3	NULL	NULL

User 1 has a bio with 17 characters. User 2 has an empty bio with length 0. User 3 has NULL, so LENGTH returns NULL. This distinction is useful for data quality checks.

LENGTH vs Empty Check

There are two common ways to check if a string has content: using LENGTH or comparing directly to an empty string. Each approach handles NULL differently.

LENGTH(bio) > 0 finds rows with non-empty bios. NULL bios are excluded because LENGTH(NULL) returns NULL, which is not greater than 0.

1	SELECT
2	username
3	FROM users
4	WHERE LENGTH(bio) > 0

TIP

Use LENGTH() to validate input before processing. Find phone numbers that are too short, descriptions that exceed limits, or codes that don't match expected formats.

Analysis & Best Practices

Beyond individual row validation, LENGTH helps analyze text patterns across entire datasets and inform schema design decisions.

Aggregate Length Analysis

Combining LENGTH with aggregate functions reveals patterns in text data:

1	SELECT
2	AVG(LENGTH(comment)) AS avg_comment_length,
3	MAX(LENGTH(comment)) AS longest_comment
4	FROM reviews

Result

avg_comment_length	longest_comment
127.45	2048

This query calculates the average length of comments and finds the longest one. Such analysis helps you understand data distribution, set appropriate column sizes, or identify unusually long entries.

When to Use LENGTH()

LENGTH appears throughout data validation, filtering, and analysis workflows.

ValidateFilterEmptyStorageAnalysis

Validate

Check Length

Enforce size constraints

Filter

Size Filtering

Find short or long values

Empty

Detect Blanks

Spot missing text values

Storage

Column Sizing

Plan storage allocation

Analysis

Text Patterns

Analyze length breakdown

LENGTH is commonly used in data quality pipelines to catch malformed input. If you expect product codes to be exactly 8 characters, you can filter WHERE LENGTH(product_code) != 8 to find invalid entries.

> Complete this query to find usernames shorter than 5 characters.

SELECT
  username
FROM users
WHERE ___(___) ___ 5

LENGTH

username

LENGTH is one of the fastest string functions because it only needs to scan the string once and count characters without performing any transformation or copying.

LENGTH has straightforward execution characteristics but important edge case behavior.

How LENGTH() Executes

LENGTH() scans the string once to count characters
Returns an integer (never a string); counts characters, not bytes
NULL input produces NULL output

LOWER()

Daily Life

Interviews

Standardize text to lowercase

LOWER converts all letters in a string to lowercase. Non-letter characters remain unchanged. The function accepts a single string argument and returns a new string with all uppercase letters replaced by their lowercase equivalents.

The syntax is LOWER(string). The input string is not modified; LOWER returns a new string. This function is essential for case-insensitive comparisons and data normalization.

LOWER is simple to use but essential for achieving consistent data handling across your queries.

Basic Normalization

Email addresses are often stored inconsistently due to user input. LOWER standardizes them.

email
Alice@Email.com
BOB@email.COM
carol@email.com

1	SELECT
2	LOWER(email) AS normalized_email
3	FROM users

Result

normalized_email
alice@email.com
bob@email.com
carol@email.com

All email addresses are now lowercase. This ensures consistency and makes comparisons reliable. Without normalization, 'Alice@Email.com' and 'alice@email.com' would be treated as different values.

Case-Insensitive Compare

By converting both sides of a comparison to lowercase, you can match strings regardless of case:

1	SELECT
2	user_id,
3	username
4	FROM users
5	WHERE LOWER(username) = 'alice'

Result

user_id	username
1	Alice
2	ALICE
3	alice

This query finds all variations of 'alice', regardless of how it was entered. LOWER(username) converts each stored value to lowercase, then compares it to the lowercase search term.

Data Handling

Understanding how LOWER interacts with your data helps avoid surprises when working with special characters and NULL values.

Preserving Original Data

LOWER does not change the original data in the table. It only transforms the output:

The first query shows both the original value and the lowercase version. The original data is unchanged.

1	SELECT
2	username,
3	LOWER(username) AS lowercase_username
4	FROM users

Use LOWER() when deduplicating data. Two records with email addresses 'user@site.com' and 'User@Site.COM' are duplicates, but string comparison treats them as different unless you normalize.

Non-Letter Characters

LOWER only affects letters. Numbers, spaces, punctuation, and special characters remain unchanged:

1	SELECT
2	LOWER('Product-123! Available NOW') AS result
3	FROM users

Result

result
product-123! available now

The letters became lowercase. The hyphen, digits, exclamation mark, and spaces are unchanged. LOWER() transforms only alphabetic characters.

Best Practices

Case normalization is one of the most common data cleaning steps. Applying it consistently prevents subtle bugs that are hard to debug.

When to Use LOWER()

The comparison below shows common problems caused by inconsistent casing and how LOWER solves them.

•Before LOWER()

Alice@Email.com != alice@email.com
Duplicates slip through
Joins silently fail on case
Searches miss valid matches

•After LOWER()

All emails match consistently
Deduplication works correctly
Joins find all matching rows
Searches are case-insensitive

Always normalize case when uniqueness or comparison depends on text values. Many production bugs stem from failing to handle case inconsistencies.

> Complete this query to normalize email addresses to lowercase for consistent comparison.

SELECT
  ___(___) AS normalized_email
FROM users

LOWER

UPPER

Advanced LOWER() Practices

Store emails and identifiers as lowercase in the database to avoid having to normalize in every query
LOWER() handles accented and international characters correctly with Unicode-aware database collations
When deduplicating records from multiple sources, normalize text with LOWER() on both sides before comparing

LOWER is idempotent: applying it twice produces the same result as applying it once. A single call is sufficient to normalize any string regardless of its original casing.

When comparing strings from different sources, apply LOWER() to both sides so the comparison is case-insensitive without modifying the stored data.

UPPER()

Daily Life

Interviews

Convert text to uppercase for matching

UPPER converts all letters in a string to uppercase. It is the inverse of LOWER. Non-letter characters remain unchanged. The function accepts a single string argument and returns a new string with all lowercase letters replaced by their uppercase equivalents.

The syntax is UPPER(string). Like LOWER, the input is not modified; UPPER returns a transformed copy. This function is used for formatting, normalization, and case-insensitive operations.

UPPER works identically to LOWER but transforms text in the opposite direction. The same principles apply.

Display Formatting

UPPER is commonly used to format output for display, especially for codes, labels, or emphasis.

1	SELECT
2	UPPER(status) AS status_code
3	FROM orders

Result

status_code
PENDING
COMPLETED
CANCELLED

Status values are converted to uppercase for consistent display. This formatting makes labels stand out and ensures uniformity regardless of how the data was entered.

Normalizing Identifiers

Product codes, order numbers, and other identifiers are often standardized as uppercase:

product_code
abc-123
XYZ-456
Def-789

1	SELECT
2	UPPER(product_code) AS normalized_code
3	FROM products

Result

normalized_code
ABC-123
XYZ-456
DEF-789

All codes are now uppercase, ensuring consistency for comparisons, joins, and reporting. This normalization prevents mismatches caused by case differences.

Advanced Usage

UPPER can be combined with other functions and used in conditional logic to build powerful transformations.

Case-Insensitive Match

Just like LOWER, you can use UPPER for case-insensitive comparisons by converting both sides to uppercase:

1	SELECT
2	order_id,
3	status
4	FROM orders
5	WHERE UPPER(status) = 'PENDING'

Result

order_id	status
1001	pending
1002	Pending
1003	PENDING

This query finds all orders with a status of 'pending', regardless of case. UPPER(status) converts each value to uppercase before comparing it to the uppercase search term.

Both functions serve the same purpose: case normalization. Use LOWER() for user-facing text like emails and usernames. Use UPPER() for system identifiers like codes and statuses. The choice is often a matter of convention.

UPPER() with Functions

UPPER can be combined with other string functions to build complex transformations:

1	SELECT
2	CONCAT(
3	UPPER(region),
4	'-',
5	order_id
6	) AS region_order_id
7	FROM orders

Result

region_order_id
NORTH-1001
SOUTH-1002
EAST-1003

This query creates a formatted identifier by converting the region to uppercase, adding a hyphen, then appending the order ID. Combining functions allows you to perform multi-step transformations in a single query.

Best Practices

Consistent use of UPPER for identifiers and codes helps maintain data quality across your systems.

When to Use UPPER()

UPPER is particularly useful for system identifiers and status codes that benefit from consistent formatting.

Format Codes

Standardize status codes, labels, and identifiers for display

Normalize for Matching

Ensure case-insensitive joins and comparisons

Deduplicate Records

Find and merge records with case variations

Use UPPER when output needs to be in uppercase for consistency, readability, or business requirements. Many systems require uppercase codes for integration with legacy platforms.

> Complete this query to build a region code by combining string functions.

SELECT
  ___(
    ___(___),
    '-',
    order_id
    ) AS region_order
FROM orders

UPPER

CONCAT

region

LOWER

Follow these guidelines to write cleaner, more maintainable string operations in your SQL queries.

✓Do

Normalize case with LOWER() or UPPER() before comparisons
Use CONCAT() to build formatted strings in the database
Validate string lengths with LENGTH() before processing
Store strings in consistent case when possible

✗Don't

Assume case-insensitive comparisons without normalization
Perform string operations in application code when SQL can do it
Store text without validating length constraints

String handling is one of the most common tasks in real-world data work. Put these fundamentals to the test with hands-on challenges.

❯❯❯PUTTING IT ALL TOGETHER

> You are a data analyst at a healthcare company standardizing patient name and contact fields pulled from three different legacy EHR systems before loading them into a unified patient database for reporting.

UPPER() and LOWER() normalize inconsistent text case so records entered as "Alice" in one system and "ALICE" in another merge into a single canonical patient record.

LENGTH() validates that phone numbers and postal codes meet required character counts, flagging malformed entries before they enter the unified database.

CONCAT() combines separate first name, last name, and suffix fields into a single display column used across dashboards and printed correspondence.

KEY TAKEAWAYS

Strings are text values enclosed in single quotes; use '' to escape apostrophes

CONCAT() joins multiple strings; || operator does the same in most databases

CONCAT() with NULL returns NULL unless your database handles it specially

LENGTH() returns character count; OCTET_LENGTH returns byte count (matters for Unicode)

LOWER() and UPPER() standardize case for consistent comparisons

String comparisons are case-sensitive by default in most databases

Always normalize case before comparing or joining on string columns

String functions form the foundation of data cleaning and data transformation workflows

Text is the language of databases

Making text behave since day one

Category: SQL
Difficulty: beginner
Duration: 29 minutes
Challenges: 0 hands-on challenges

Topics covered: String Data Type, CONCAT(), LENGTH(), LOWER(), UPPER()

Lesson Sections

String Data Type
A string is a data type that holds text. Strings are enclosed in single quotes in SQL. The text inside can contain letters, numbers, spaces, punctuation, or special characters. A string could be as short as one character or as long as several megabytes of text. When you see values like 'Alice', 'alice@email.com', or 'San Francisco' in a database, those are strings. The single quotes tell the database "this is text, not a keyword or number." Before manipulating strings, it helps to understand how
CONCAT() (concepts: sqlConcat)
Basic Concatenation The most common use case is combining first and last names into a full name: The query takes first_name, adds a space ' ', then adds last_name. The space is a literal string argument. Without it, you would get 'AliceChen' instead of 'Alice Chen'. Building Complex Strings Advanced Usage Multiple Arguments Handling Empty Values Best Practices When to Use CONCAT() At scale, even simple string operations can become performance bottlenecks.
LENGTH() (concepts: sqlLength)
Basic Usage Measuring string length is useful for validation, filtering, and analysis. Filtering by Length Data Quality Checks Finding Empty or NULL LENGTH vs Empty Check Analysis & Best Practices Aggregate Length Analysis This query calculates the average length of comments and finds the longest one. Such analysis helps you understand data distribution, set appropriate column sizes, or identify unusually long entries. When to Use LENGTH()
LOWER() (concepts: sqlLowerUpper)
Basic Normalization All email addresses are now lowercase. This ensures consistency and makes comparisons reliable. Without normalization, 'Alice@Email.com' and 'alice@email.com' would be treated as different values. Case-Insensitive Compare By converting both sides of a comparison to lowercase, you can match strings regardless of case: Data Handling Preserving Original Data Non-Letter Characters Best Practices Case normalization is one of the most common data cleaning steps. Applying it consist
UPPER()
Display Formatting Status values are converted to uppercase for consistent display. This formatting makes labels stand out and ensures uniformity regardless of how the data was entered. Normalizing Identifiers Product codes, order numbers, and other identifiers are often standardized as uppercase: All codes are now uppercase, ensuring consistency for comparisons, joins, and reporting. This normalization prevents mismatches caused by case differences. Advanced Usage Case-Insensitive Match UPPER()