String Manipulation


String manipulation is vital for "data cleaning"—fixing inconsistent casing, removing whitespace, or merging multiple columns (like first_name and last_name) into a single field.

In PostgreSQL, strings are typically stored as TEXT or VARCHAR. Here is how you handle them.


String Concatenation

There are three main ways to join strings together in Postgres.

A. The Pipe Operator (||)

This is the standard SQL way to join strings.

SELECT first_name || ' ' || last_name AS full_name FROM users;

Warning: If any column in the chain is NULL, the entire result becomes NULL.

B. The CONCAT() Function

This function is safer because it ignores NULLs (treating them as empty strings).

SELECT CONCAT(first_name, ' ', last_name) AS full_name FROM users;

C. CONCAT_WS() (With Separator)

This is the "pro" choice for Data Engineers. You provide a separator first, and it automatically places it between every following argument. It also skips NULL values.

-- Joins address parts with a comma and space
SELECT CONCAT_WS(', ', street, city, state, zip) AS shipping_address FROM orders;

Essential Transformation Functions

Function

Action

Example

UPPER() / LOWER()

Changes casing.

UPPER('Data') \to 'DATA'

TRIM()

Removes leading/trailing spaces.

TRIM(' sql ') \to 'sql'

LEFT() / RIGHT()

Grabs NN characters from either side.

LEFT('Postgres', 4) \to 'Post'

LENGTH()

Returns number of characters.

LENGTH('AI') \to 2

REPLACE()

Swaps a substring for another.

REPLACE('01-01', '-', '/') \to '01/01'


Substrings and Positioning

When you need to extract specific parts of a string (like the domain from an email address), you use SUBSTRING and POSITION.

  • SUBSTRING(string FROM start FOR length): Extracts a specific slice.

  • SPLIT_PART(string, delimiter, part): Very useful for DEs. It splits a string by a character and grabs the NN-th piece.


Pattern Matching: LIKE vs. ILIKE

Postgres has a special operator for case-insensitive matching that isn't found in standard SQL or MySQL.

  • LIKE: Case-sensitive. 'Apple' LIKE 'a%' is False.

  • ILIKE: Case-insensitive. 'Apple' ILIKE 'a%' is True.

Wildcards:

  • %: Matches any number of characters.

  • _: Matches exactly one character.


Regular Expressions (~)

For complex cleaning (like validating a phone number or finding specific patterns), Postgres supports POSIX Regular Expressions.

  • ~: Matches (case-sensitive).

  • ~*: Matches (case-insensitive).

  • !~: Does not match.


Summary

  • CONCAT_WS is the safest way to merge columns to avoid NULL issues.

  • SPLIT_PART is a lifesaver for parsing CSV-style strings within a column.

  • ILIKE is a Postgres-specific superpower for easy searching.

  • TRIM(BOTH ' ' FROM column) is your best friend when cleaning data imported from messy CSV files.


Last updated