String Functions: Beginner

Booking.com processes over 1.5 million room reservations per day across listings in 43 languages, and behind the scenes, SQL string functions ensure that "New York", "new york", and "NEW YORK " all resolve to the same property listings. Functions like UPPER, LOWER, and TRIM normalize raw user input before it ever touches a search index, while CONCAT assembles structured display strings from fragmented property address pieces. Without reliable string manipulation at that scale, a trailing space or an unexpected capital letter would silently fracture search quality for millions of travelers. The functions you are about to learn are the same primitives powering that kind of precision.

String Data Type

Daily Life
Interviews

Store and query text values correctly

A string is a data type that holds text. Strings are enclosed in single quotes in SQL. The text inside can contain letters, numbers, spaces, punctuation, or special characters. A string could be as short as one character or as long as several megabytes of text.
When you see values like 'Alice', 'alice@email.com', or 'San Francisco' in a database, those are strings. The single quotes tell the database "this is text, not a keyword or number."
Before manipulating strings, it helps to understand how SQL represents and stores them. The following sections cover the basics of string syntax and type distinctions.
//

String Literals in SQL

You can use string literals directly in your queries by wrapping them in single quotes:
1SELECT
2 'Hello World' AS greeting
3FROM users
Result
greeting
Hello World
The text 'Hello World' is a string literal. SQL treats it as a fixed text value, not as a column name or command. This query returns that exact text for every row.
//

Strings vs Other Types

Strings are text. Numbers are numeric values. The character '5' as a string is different from the number 5. You cannot perform arithmetic on strings, and you cannot use text operations on numbers without conversion.
Strings vs Other Types
  • 5 (Number): Can add, 5 + 3 = 8
  • '5' (String): Cannot add, text not math
  • Alice (String): Cannot add, text only
  • 2024-01-15 (Date): Cannot add, specialized type
Understanding the distinction prevents errors. If you try to add '5' + '3' as strings, the database will either fail or concatenate them into '53', not return 8.

Storage Types

How you store strings affects storage efficiency, performance, and comparison behavior. SQL offers different string types to handle varying requirements.
//

VARCHAR vs CHAR

Databases offer different string types for different use cases. VARCHAR stores variable-length text, using only the space needed. CHAR stores fixed-length text, padding shorter values with spaces. The choice affects storage and comparison behavior.

VARCHAR stores the exact text with no padding. Each value uses only the space it needs.
1SELECT
2 username
3FROM users
TIP
Strings consume more storage than numbers. A 10-digit number stored as an integer takes 4-8 bytes. The same number stored as a string takes 10 bytes. At billions of rows, this difference becomes significant.
//

Character Encoding

Databases store strings as sequences of bytes representing characters. Character encoding defines how those bytes map to letters, symbols, and special characters. UTF-8 is the dominant encoding and supports all languages and emojis. When you store text, the database encodes it. When you retrieve it, the database decodes it.
Encoding matters when you work with international text or special characters. If your database uses one encoding and your application expects another, text will appear corrupted. Always use UTF-8 unless you have a specific reason not to.

Edge Cases

Real-world data contains missing values, empty fields, and unexpected inputs. Understanding these edge cases prevents bugs in your queries.
//

Empty Strings & NULL

An empty string '' is a string with zero characters. It is not the same as missing data. Missing data is represented as NULL. An empty string is a value. NULL means no value exists.

user_idmiddle_nameMeaning
1'Marie'Has middle name
2''No middle name, but field provided
3NULLMiddle name not provided

User 2 provided an empty string, meaning "I have no middle name." User 3 provided NULL, meaning "I did not answer this question." The distinction changes how you filter and count these rows.

> Complete this query to find orders with a specific status text.

SELECT
  order_id,
  status
FROM orders
WHERE status ___ ___
0
=
>
'pending'
String comparisons in SQL are case-sensitive by default. The value 'pending' will not match 'Pending' or 'PENDING' unless you normalize the case first.

String literals must be enclosed in single quotes. Double quotes have a different meaning in SQL, typically referring to column or table names rather than text values.

Understanding how strings are stored and compared is the foundation for all the string functions you will learn in the following sections.

CONCAT()

Daily Life
Interviews

Combine text from multiple columns

CONCAT is short for concatenate, which means to join or link together. CONCAT combines multiple strings into a single string by placing them end-to-end in the order you specify. The function accepts two or more arguments and returns one continuous string.

The syntax is CONCAT(string1, string2, string3, ...). Each argument is a string, either a column or a literal value. The function joins them sequentially without adding spaces unless you explicitly include them.

The CONCAT function is straightforward to use but has important behaviors to understand when working with separators and multiple values.

//

Basic Concatenation

The most common use case is combining first and last names into a full name:
first_namelast_name
AliceChen
BobMartinez
CarolKim
1SELECT
2 CONCAT(first_name, ' ', last_name) AS full_name
3FROM users
Result
full_name
Alice Chen
Bob Martinez
Carol Kim
The query takes first_name, adds a space ' ', then adds last_name. The space is a literal string argument. Without it, you would get 'AliceChen' instead of 'Alice Chen'.
//

Building Complex Strings

CONCAT can join many values to build formatted output. Consider generating display text for a product listing:

1SELECT
2 CONCAT(
3 product_name,
4 ' - ',
5 category,
6 ' ($',
7 price,
8 ')'
9 ) AS listing
10FROM products
Result
listing
Ergonomic Keyboard - Electronics ($89.99)
Standing Desk - Furniture ($299.00)
Wireless Mouse - Electronics ($34.99)

Here, CONCAT combines five pieces: product name, a separator, category, an opening parenthesis with dollar sign, and the price. The result is a formatted string ready for display.

TIP
Use CONCAT() to build identifiers, URLs, file paths, or formatted messages in your queries instead of doing it in application code. This reduces data transfer and keeps formatting logic centralized.

Advanced Usage

Beyond basic two-string joins, CONCAT handles complex scenarios involving many values, mixed types, and potentially missing data.

//

Multiple Arguments

CONCAT accepts as many arguments as you need. Each argument is evaluated and converted to a string, then joined in order. You can mix column names and literal strings freely:

Two arguments: joins city directly to country with no separator, producing 'San FranciscoUSA'.
1SELECT
2 CONCAT(city, country) AS location
3FROM addresses
//

Handling Empty Values

If any argument to CONCAT is NULL, most databases will return NULL for the entire result. This behavior can cause unexpected blank output when fields contain missing data.

first_namemiddle_namelast_name
AliceMarieChen
BobNULLMartinez
1SELECT
2 CONCAT(
3 first_name,
4 ' ',
5 middle_name,
6 ' ',
7 last_name
8 ) AS full_name
9FROM users
Result
full_name
Alice Marie Chen
NULL

Bob's middle_name is NULL, so the entire concatenation becomes NULL. To handle this, you can use COALESCE to replace NULL with an empty string, or use conditional logic to skip NULL values.

Best Practices

Knowing when and where to use CONCAT helps you write cleaner queries and maintain consistency across your data pipelines.

//

When to Use CONCAT()

CONCAT is ideal for building composite strings from multiple data sources in your database.

Display Names
Display Names
Build full names from separate first and last name fields
Unique Identifiers
Unique Identifiers
Join keys to create composite identifiers
Formatted Strings
Formatted Strings
Combine addresses, URLs, file paths, or email addresses
Dynamic Messages
Dynamic Messages
Construct labels and messages from data values

Use CONCAT when you need to combine strings in the database rather than in application code. This approach centralizes logic, reduces data transfer, and makes transformations visible in your queries.

> Complete this query to build a full name from first and last name columns.

SELECT
  ___(___, ' ', ___) AS full_name
FROM users
last_name
CONCAT
UPPER
first_name

Understanding how CONCAT processes its arguments helps you predict behavior with NULL values and mixed types.

How CONCAT() Executes
  • CONCAT() evaluates arguments left to right
  • Each argument is converted to a string before joining
  • NULL in any argument typically produces NULL output
  • Concatenating very long strings can be memory-intensive
At scale, even simple string operations can become performance bottlenecks.

LENGTH()

Daily Life
Interviews

Measure the size of text values

LENGTH returns the number of characters in a string. It counts every character, including letters, numbers, spaces, and punctuation. The function accepts a single string argument and returns an integer.

The syntax is LENGTH(string). The result is the count of characters, not bytes. For most English text, character count equals byte count, but for international characters or emojis, one character may use multiple bytes. LENGTH always returns character count.

LENGTH is one of the simplest string functions, but its utility in data validation and analysis makes it indispensable.

//

Basic Usage

Measuring string length is useful for validation, filtering, and analysis.
1SELECT
2 product_name,
3 LENGTH(product_name) AS name_length
4FROM products
Result
product_namename_length
Ergonomic Keyboard19
Mouse5
4K Monitor10

The function counts all characters. 'Ergonomic Keyboard' has 19 characters including the space. 'Mouse' has 5. LENGTH() is exact.

//

Filtering by Length

You can use LENGTH in WHERE clauses to filter rows based on string size:

1SELECT
2 username
3FROM users
4WHERE LENGTH(username) < 5
Result
username
bob
ali
dan

This query finds usernames shorter than 5 characters. LENGTH makes it possible to enforce constraints, identify outliers, or find data that violates length requirements.

Data Quality Checks

LENGTH plays a key role in identifying data quality issues. Empty strings, NULL values, and unexpectedly short or long text all reveal problems worth investigating.

//

Finding Empty or NULL

LENGTH helps distinguish between empty strings and NULL values:

1SELECT
2 user_id,
3 bio,
4 LENGTH(bio) AS bio_length
5FROM users
Result
user_idbiobio_length
1Software engineer17
20
3NULLNULL

User 1 has a bio with 17 characters. User 2 has an empty bio with length 0. User 3 has NULL, so LENGTH returns NULL. This distinction is useful for data quality checks.

//

LENGTH vs Empty Check

There are two common ways to check if a string has content: using LENGTH or comparing directly to an empty string. Each approach handles NULL differently.

LENGTH(bio) > 0 finds rows with non-empty bios. NULL bios are excluded because LENGTH(NULL) returns NULL, which is not greater than 0.
1SELECT
2 username
3FROM users
4WHERE LENGTH(bio) > 0
TIP
Use LENGTH() to validate input before processing. Find phone numbers that are too short, descriptions that exceed limits, or codes that don't match expected formats.

Analysis & Best Practices

Beyond individual row validation, LENGTH helps analyze text patterns across entire datasets and inform schema design decisions.

//

Aggregate Length Analysis

Combining LENGTH with aggregate functions reveals patterns in text data:

1SELECT
2 AVG(LENGTH(comment)) AS avg_comment_length,
3 MAX(LENGTH(comment)) AS longest_comment
4FROM reviews
Result
avg_comment_lengthlongest_comment
127.452048
This query calculates the average length of comments and finds the longest one. Such analysis helps you understand data distribution, set appropriate column sizes, or identify unusually long entries.
//

When to Use LENGTH()

LENGTH appears throughout data validation, filtering, and analysis workflows.

ValidateFilterEmptyStorageAnalysis
Validate
Check Length
Enforce size constraints
Filter
Size Filtering
Find short or long values
Empty
Detect Blanks
Spot missing text values
Storage
Column Sizing
Plan storage allocation
Analysis
Text Patterns
Analyze length breakdown

LENGTH is commonly used in data quality pipelines to catch malformed input. If you expect product codes to be exactly 8 characters, you can filter WHERE LENGTH(product_code) != 8 to find invalid entries.

> Complete this query to find usernames shorter than 5 characters.

SELECT
  username
FROM users
WHERE ___(___) ___ 5
<
LENGTH
username
>

LENGTH is one of the fastest string functions because it only needs to scan the string once and count characters without performing any transformation or copying.

LENGTH has straightforward execution characteristics but important edge case behavior.

How LENGTH() Executes
  • LENGTH() scans the string once to count characters
  • Returns an integer (never a string); counts characters, not bytes
  • NULL input produces NULL output

LOWER()

Daily Life
Interviews

Standardize text to lowercase

LOWER converts all letters in a string to lowercase. Non-letter characters remain unchanged. The function accepts a single string argument and returns a new string with all uppercase letters replaced by their lowercase equivalents.

The syntax is LOWER(string). The input string is not modified; LOWER returns a new string. This function is essential for case-insensitive comparisons and data normalization.

LOWER is simple to use but essential for achieving consistent data handling across your queries.

//

Basic Normalization

Email addresses are often stored inconsistently due to user input. LOWER standardizes them.

email
Alice@Email.com
BOB@email.COM
carol@email.com
1SELECT
2 LOWER(email) AS normalized_email
3FROM users
Result
normalized_email
alice@email.com
bob@email.com
carol@email.com
All email addresses are now lowercase. This ensures consistency and makes comparisons reliable. Without normalization, 'Alice@Email.com' and 'alice@email.com' would be treated as different values.
//

Case-Insensitive Compare

By converting both sides of a comparison to lowercase, you can match strings regardless of case:
1SELECT
2 user_id,
3 username
4FROM users
5WHERE LOWER(username) = 'alice'
Result
user_idusername
1Alice
2ALICE
3alice

This query finds all variations of 'alice', regardless of how it was entered. LOWER(username) converts each stored value to lowercase, then compares it to the lowercase search term.

Data Handling

Understanding how LOWER interacts with your data helps avoid surprises when working with special characters and NULL values.

//

Preserving Original Data

LOWER does not change the original data in the table. It only transforms the output:

The first query shows both the original value and the lowercase version. The original data is unchanged.
1SELECT
2 username,
3 LOWER(username) AS lowercase_username
4FROM users

Use LOWER() when deduplicating data. Two records with email addresses 'user@site.com' and 'User@Site.COM' are duplicates, but string comparison treats them as different unless you normalize.

//

Non-Letter Characters

LOWER only affects letters. Numbers, spaces, punctuation, and special characters remain unchanged:

1SELECT
2 LOWER('Product-123! Available NOW') AS result
3FROM users
Result
result
product-123! available now

The letters became lowercase. The hyphen, digits, exclamation mark, and spaces are unchanged. LOWER() transforms only alphabetic characters.

Best Practices

Case normalization is one of the most common data cleaning steps. Applying it consistently prevents subtle bugs that are hard to debug.
//

When to Use LOWER()

The comparison below shows common problems caused by inconsistent casing and how LOWER solves them.

Before LOWER()
  • Alice@Email.com != alice@email.com
  • Duplicates slip through
  • Joins silently fail on case
  • Searches miss valid matches
After LOWER()
  • All emails match consistently
  • Deduplication works correctly
  • Joins find all matching rows
  • Searches are case-insensitive
Always normalize case when uniqueness or comparison depends on text values. Many production bugs stem from failing to handle case inconsistencies.

> Complete this query to normalize email addresses to lowercase for consistent comparison.

SELECT
  ___(___) AS normalized_email
FROM users
LOWER
UPPER
email
Advanced LOWER() Practices
  • Store emails and identifiers as lowercase in the database to avoid having to normalize in every query
  • LOWER() handles accented and international characters correctly with Unicode-aware database collations
  • When deduplicating records from multiple sources, normalize text with LOWER() on both sides before comparing

LOWER is idempotent: applying it twice produces the same result as applying it once. A single call is sufficient to normalize any string regardless of its original casing.

When comparing strings from different sources, apply LOWER() to both sides so the comparison is case-insensitive without modifying the stored data.

UPPER()

Daily Life
Interviews

Convert text to uppercase for matching

UPPER converts all letters in a string to uppercase. It is the inverse of LOWER. Non-letter characters remain unchanged. The function accepts a single string argument and returns a new string with all lowercase letters replaced by their uppercase equivalents.

The syntax is UPPER(string). Like LOWER, the input is not modified; UPPER returns a transformed copy. This function is used for formatting, normalization, and case-insensitive operations.

UPPER works identically to LOWER but transforms text in the opposite direction. The same principles apply.

//

Display Formatting

UPPER is commonly used to format output for display, especially for codes, labels, or emphasis.

1SELECT
2 UPPER(status) AS status_code
3FROM orders
Result
status_code
PENDING
COMPLETED
CANCELLED
Status values are converted to uppercase for consistent display. This formatting makes labels stand out and ensures uniformity regardless of how the data was entered.
//

Normalizing Identifiers

Product codes, order numbers, and other identifiers are often standardized as uppercase:
product_code
abc-123
XYZ-456
Def-789
1SELECT
2 UPPER(product_code) AS normalized_code
3FROM products
Result
normalized_code
ABC-123
XYZ-456
DEF-789
All codes are now uppercase, ensuring consistency for comparisons, joins, and reporting. This normalization prevents mismatches caused by case differences.

Advanced Usage

UPPER can be combined with other functions and used in conditional logic to build powerful transformations.

//

Case-Insensitive Match

Just like LOWER, you can use UPPER for case-insensitive comparisons by converting both sides to uppercase:

1SELECT
2 order_id,
3 status
4FROM orders
5WHERE UPPER(status) = 'PENDING'
Result
order_idstatus
1001pending
1002Pending
1003PENDING

This query finds all orders with a status of 'pending', regardless of case. UPPER(status) converts each value to uppercase before comparing it to the uppercase search term.

Both functions serve the same purpose: case normalization. Use LOWER() for user-facing text like emails and usernames. Use UPPER() for system identifiers like codes and statuses. The choice is often a matter of convention.

//

UPPER() with Functions

UPPER can be combined with other string functions to build complex transformations:

1SELECT
2 CONCAT(
3 UPPER(region),
4 '-',
5 order_id
6 ) AS region_order_id
7FROM orders
Result
region_order_id
NORTH-1001
SOUTH-1002
EAST-1003
This query creates a formatted identifier by converting the region to uppercase, adding a hyphen, then appending the order ID. Combining functions allows you to perform multi-step transformations in a single query.

Best Practices

Consistent use of UPPER for identifiers and codes helps maintain data quality across your systems.

//

When to Use UPPER()

UPPER is particularly useful for system identifiers and status codes that benefit from consistent formatting.

Format Codes
Format Codes
Standardize status codes, labels, and identifiers for display
Normalize for Matching
Normalize for Matching
Ensure case-insensitive joins and comparisons
Deduplicate Records
Deduplicate Records
Find and merge records with case variations

Use UPPER when output needs to be in uppercase for consistency, readability, or business requirements. Many systems require uppercase codes for integration with legacy platforms.

> Complete this query to build a region code by combining string functions.

SELECT
  ___(
    ___(___),
    '-',
    order_id
    ) AS region_order
FROM orders
UPPER
CONCAT
region
LOWER
Follow these guidelines to write cleaner, more maintainable string operations in your SQL queries.
Do
  • Normalize case with LOWER() or UPPER() before comparisons
  • Use CONCAT() to build formatted strings in the database
  • Validate string lengths with LENGTH() before processing
  • Store strings in consistent case when possible
Don't
  • Assume case-insensitive comparisons without normalization
  • Perform string operations in application code when SQL can do it
  • Store text without validating length constraints
String handling is one of the most common tasks in real-world data work. Put these fundamentals to the test with hands-on challenges.
PUTTING IT ALL TOGETHER

> You are a data analyst at a healthcare company standardizing patient name and contact fields pulled from three different legacy EHR systems before loading them into a unified patient database for reporting.

UPPER() and LOWER() normalize inconsistent text case so records entered as "Alice" in one system and "ALICE" in another merge into a single canonical patient record.
LENGTH() validates that phone numbers and postal codes meet required character counts, flagging malformed entries before they enter the unified database.
CONCAT() combines separate first name, last name, and suffix fields into a single display column used across dashboards and printed correspondence.
KEY TAKEAWAYS
Strings are text values enclosed in single quotes; use '' to escape apostrophes
CONCAT() joins multiple strings; || operator does the same in most databases
CONCAT() with NULL returns NULL unless your database handles it specially
LENGTH() returns character count; OCTET_LENGTH returns byte count (matters for Unicode)
LOWER() and UPPER() standardize case for consistent comparisons
String comparisons are case-sensitive by default in most databases
Always normalize case before comparing or joining on string columns
String functions form the foundation of data cleaning and data transformation workflows

Text is the language of databases

Making text behave since day one

Category
SQL
Difficulty
beginner
Duration
29 minutes
Challenges
0 hands-on challenges

Topics covered: String Data Type, CONCAT(), LENGTH(), LOWER(), UPPER()

Lesson Sections

  1. String Data Type

    A string is a data type that holds text. Strings are enclosed in single quotes in SQL. The text inside can contain letters, numbers, spaces, punctuation, or special characters. A string could be as short as one character or as long as several megabytes of text. When you see values like 'Alice', 'alice@email.com', or 'San Francisco' in a database, those are strings. The single quotes tell the database "this is text, not a keyword or number." Before manipulating strings, it helps to understand how

  2. CONCAT() (concepts: sqlConcat)

    Basic Concatenation The most common use case is combining first and last names into a full name: The query takes first_name, adds a space ' ', then adds last_name. The space is a literal string argument. Without it, you would get 'AliceChen' instead of 'Alice Chen'. Building Complex Strings Advanced Usage Multiple Arguments Handling Empty Values Best Practices When to Use CONCAT() At scale, even simple string operations can become performance bottlenecks.

  3. LENGTH() (concepts: sqlLength)

    Basic Usage Measuring string length is useful for validation, filtering, and analysis. Filtering by Length Data Quality Checks Finding Empty or NULL LENGTH vs Empty Check Analysis & Best Practices Aggregate Length Analysis This query calculates the average length of comments and finds the longest one. Such analysis helps you understand data distribution, set appropriate column sizes, or identify unusually long entries. When to Use LENGTH()

  4. LOWER() (concepts: sqlLowerUpper)

    Basic Normalization All email addresses are now lowercase. This ensures consistency and makes comparisons reliable. Without normalization, 'Alice@Email.com' and 'alice@email.com' would be treated as different values. Case-Insensitive Compare By converting both sides of a comparison to lowercase, you can match strings regardless of case: Data Handling Preserving Original Data Non-Letter Characters Best Practices Case normalization is one of the most common data cleaning steps. Applying it consist

  5. UPPER()

    Display Formatting Status values are converted to uppercase for consistent display. This formatting makes labels stand out and ensures uniformity regardless of how the data was entered. Normalizing Identifiers Product codes, order numbers, and other identifiers are often standardized as uppercase: All codes are now uppercase, ensuring consistency for comparisons, joins, and reporting. This normalization prevents mismatches caused by case differences. Advanced Usage Case-Insensitive Match UPPER()