Hi, my name is Vitalii Rudnykh 👋

May 27, 2022

📔 Best practices for PHP regex

This guide offers some tips that will hopefully help you write better, cleaner code whenever regex is involved.

Avoid using regex whenever possible. ✋

Let’s face it, regex is hard, so avoid it whenever possible. Here are a few cases where you should absolutely avoid regex, each with a suitable alternative:

| Task | Solution |
| --- | --- |
| Validating email addresses | Use filter_var() with FILTER_VALIDATE_EMAIL * |
| Validating IP addresses | Use filter_var() with FILTER_VALIDATE_IP |
| Validating URLs | Use filter_var() with FILTER_VALIDATE_URL * |
| Validating dates | See this Stack Overflow thread. |
| Parsing JSON | Use json_decode(). |
| Parsing HTML / XML | Use a dedicated parser. See: How do you parse and process HTML/XML in PHP? |
| Parsing CSV | Use str_getcsv() or fgetcsv(). |
| Check if string contains substring | Use strpos() or stripos(). |
| Validating number ranges | Use PHP comparison operators. |

* It might not work with internationalized domains and email addresses.
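
To make a few of these alternatives concrete, here is a minimal sketch that uses only built-in functions. The sample values are invented for illustration.

```php
<?php
// Sample inputs are made up for this example.
$email = filter_var('user@example.com', FILTER_VALIDATE_EMAIL);        // the address, or false
$ip    = filter_var('192.168.0.1', FILTER_VALIDATE_IP);                // the address, or false
$url   = filter_var('https://example.com/path', FILTER_VALIDATE_URL);  // the URL, or false
$bad   = filter_var('not an email', FILTER_VALIDATE_EMAIL);            // false

// Substring check: compare against false, because position 0 is a valid result.
$contains = strpos('regex is hard', 'regex') !== false;

// Parse JSON instead of matching it.
$json = json_decode('{"id": 42}', true);

var_dump($email, $ip, $url, $bad, $contains, $json);
```

Note that filter_var() returns the filtered value on success and false on failure, so a strict comparison with false is the safest check.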

Get to know the available PHP regex functions. 🧠

I’ve seen a lot of cases where people don’t seem to be aware of functions other than the classic preg_match() and preg_replace(). Below we’ll cover some use cases:

preg_split()

Sometimes you want splitting instead of matching. Instead of writing the following code:

$input = 'preg__split_for___fun';
if(preg_match_all('/[^_]+/', $input, $m)) {
    print_r($m[0]);
} else {
    echo 'no match';
}

Try to write it like this:

$input = 'preg__split_for___fun';
$output = preg_split('/_+/', $input);
print_r($output);
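
One thing to watch out for: if the input starts or ends with the separator, preg_split() returns empty strings at the edges. A small sketch of how the PREG_SPLIT_NO_EMPTY flag deals with that (the input string is a made-up variation of the one above):

```php
<?php
$input = '_preg__split_for___fun_';

// Leading and trailing separators produce empty elements.
$withEmpty = preg_split('/_+/', $input);
// ['', 'preg', 'split', 'for', 'fun', '']

// -1 means "no limit"; the flag drops the empty pieces.
$withoutEmpty = preg_split('/_+/', $input, -1, PREG_SPLIT_NO_EMPTY);
// ['preg', 'split', 'for', 'fun']

print_r($withEmpty);
print_r($withoutEmpty);
```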

preg_grep()

Say you want to loop through an array and try to match values against a specific regex. I’ve seen people trying:

$input = ['data1', 'data2', 'exclude', 'data3'];
$result = [];
foreach ($input as $v) {
    if(preg_match('/data\d+/', $v)) {
        $result[] = $v;
    }
}
print_r($result); // Array ( [0] => data1 [1] => data2 [2] => data3 )

While you could achieve the same effect using preg_grep():

$input = ['data1', 'data2', 'exclude', 'data3'];
$result = preg_grep('/data\d+/', $input);
print_r($result); // Array ( [0] => data1 [1] => data2 [3] => data3 )

The only thing you need to take into account is that it returns an array indexed using the keys from the input array.
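
If the preserved keys get in the way, array_values() reindexes the result. As a small sketch, the same call can also be inverted with the PREG_GREP_INVERT flag to keep only the non-matching values:

```php
<?php
$input = ['data1', 'data2', 'exclude', 'data3'];

// Reindex so the keys run 0, 1, 2 again.
$reindexed = array_values(preg_grep('/data\d+/', $input));
// ['data1', 'data2', 'data3']

// Keep only the values that do NOT match; original keys are preserved.
$excluded = preg_grep('/data\d+/', $input, PREG_GREP_INVERT);
// [2 => 'exclude']

print_r($reindexed);
print_r($excluded);
```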

preg_filter()

Per the documentation:

preg_filter() is identical to preg_replace() except it only returns the (possibly transformed) subjects where there was a match.

In essence, this is the same as preg_grep() but with a replace option.
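
A short sketch of the difference, using a made-up input array and replacement: preg_replace() hands back every subject, while preg_filter() keeps only the ones that matched.

```php
<?php
$input = ['data1', 'exclude', 'data3'];

// preg_replace() returns all subjects, changed or not.
$replaced = preg_replace('/^data(\d+)$/', 'item-$1', $input);
// ['item-1', 'exclude', 'item-3']

// preg_filter() drops the subjects that did not match (keys are preserved).
$filtered = preg_filter('/^data(\d+)$/', 'item-$1', $input);
// [0 => 'item-1', 2 => 'item-3']

print_r($replaced);
print_r($filtered);
```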

preg_replace_callback()

Sometimes you don’t want a simple replace. This function lets you use a callback to compute each replacement. Say we want to match some words and convert them to upper case:

$input = 'words are important';
$output = preg_replace_callback('/\w+/', function($m) {
    return strtoupper($m[0]);
}, $input);
echo $output;

preg_quote()

Sometimes we want to include dynamic user input into our regex. For example: search for a sequence chosen by the user followed by digits. The code would look roughly like this:

$user_input = isset($_GET['input']) ? (string) $_GET['input'] : '';
$haystack = 'List: pid1000, pid2000, pid3000...';
$regex = '/' . preg_quote($user_input, '/') . '\d+/';
if(preg_match_all($regex, $haystack, $m)) {
    print_r($m[0]);
} else {
    echo 'no match';
}
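
To see why the preg_quote() call matters, here is a minimal sketch with a hypothetical user input that contains a regex metacharacter. Without quoting, the dot is treated as "any character" and the pattern matches more than the user asked for.

```php
<?php
// Hypothetical user input containing a metacharacter.
$user_input = 'p.d';

// Unquoted: the dot matches any character, so "pid1000" matches.
$unsafe = '/' . $user_input . '\d+/';
$unsafeHit = preg_match($unsafe, 'pid1000');    // 1

// Quoted: the dot is escaped and only matches a literal dot.
$quoted = preg_quote($user_input, '/');         // p\.d
$safe = '/' . $quoted . '\d+/';
$safeMiss = preg_match($safe, 'pid1000');       // 0
$safeHit  = preg_match($safe, 'p.d1000');       // 1

var_dump($quoted, $unsafeHit, $safeMiss, $safeHit);
```

The second argument to preg_quote() should be the delimiter you use, so that it gets escaped as well.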

preg_last_error()

This function might come in handy for debugging purposes. It returns the error code of the last PCRE regex execution. Here is a function that converts the error code to the name of the matching constant:

function preg_errtxt($errcode)
{
    // Cache the "code => constant name" map between calls.
    static $errtext;
    if (!isset($errtext)) {
        $errtext = array();
        $constants = get_defined_constants(true);
        foreach ($constants['pcre'] as $c => $n) {
            // Keep only the PREG_*_ERROR constants.
            if (preg_match('/_ERROR$/', $c)) {
                $errtext[$n] = $c;
            }
        }
    }
    return array_key_exists($errcode, $errtext) ? $errtext[$errcode] : null;
}
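
As a rough sketch of when this is useful: a failing preg_match() returns false rather than 0, and preg_last_error() tells you why. The example below provokes an error on purpose with an invalid UTF-8 byte. The preg_last_error_msg() call assumes PHP 8.0 or later, so it is guarded.

```php
<?php
// "\xff" is not valid UTF-8, so matching with the u modifier fails.
$result = preg_match('/./u', "\xff");
$code = preg_last_error();

if ($result === false) {
    echo 'Regex failed with error code ', $code, "\n";
    // PHP 8.0+ ships a built-in human-readable message.
    if (function_exists('preg_last_error_msg')) {
        echo preg_last_error_msg(), "\n";
    }
}
```

Always compare the result with === false, since 0 only means "no match".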

A note on security.

The «e» modifier.

The e stands for evil eval. When used with preg_replace(), it performs a regex substitution and then evaluates the replacement as PHP code:

$input = 'up this case!';
$output = preg_replace('/\w+/e', 'strtoupper($0)', $input);
echo $output; // UP THIS CASE!

No wonder it was deprecated as of PHP 5.5 and removed in PHP 7. The solution is to use preg_replace_callback() instead:

$input = 'up this case!';

$output = preg_replace_callback('/\w+/', function($m) {
    return strtoupper($m[0]); // $m[0] is the full match; the pattern has no capture groups
}, $input);
echo $output; // UP THIS CASE!

Comments. 📝

We always advise writing documented code, so why ignore that advice when writing regexes?

eXtended mode

The preferred and most common way to add comments is the x modifier:

// Regex for password validation
$regex = '/
^                 # start-of-string
(?=.*[0-9])       # a digit must occur at least once
(?=.*[a-z])       # a lower case letter must occur at least once
(?=.*[A-Z])       # an upper case letter must occur at least once
(?=.*[@#$%^&+=])  # a special character must occur at least once
(?=\S+$)          # no whitespace allowed in the entire string
.{8,}             # anything, at least eight places though
$                 # end-of-string
/x';
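
Here is a minimal sketch of that pattern in use. The regex is repeated so the snippet runs on its own, and the sample passwords are made up. One caveat worth keeping in mind: comments inside the pattern must not contain the delimiter (here /) or a single quote, or the pattern or the PHP string would end early.

```php
<?php
// Same lookahead-based pattern as above, with shortened comments.
$regex = '/
^
(?=.*[0-9])       # at least one digit
(?=.*[a-z])       # at least one lower case letter
(?=.*[A-Z])       # at least one upper case letter
(?=.*[@#$%^&+=])  # at least one special character
(?=\S+$)          # no whitespace anywhere
.{8,}             # at least eight characters
$
/x';

// Invented sample passwords.
$strong   = preg_match($regex, 'Passw0rd#') === 1;   // true
$noDigit  = preg_match($regex, 'Password#') === 1;   // false
$hasSpace = preg_match($regex, 'Pass w0rd#') === 1;  // false

var_dump($strong, $noDigit, $hasSpace);
```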

Basically, whitespace in the pattern is ignored, and so is everything from a hash sign (#) to the end of the line. Here comes the pitfall: since spaces are ignored, what if I want to match a space? There are several ways; I will show two of them. Say I want to match one or more spaces:

  • Escaping the space $regex = '/\ +/x';
  • Using a character class $regex = '/[ ]+/x'; or, if any whitespace (tabs, newlines) may match too, the shorthand $regex = '/\s+/x';
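
A quick sketch of the pitfall and both workarounds, using a made-up subject string. Keep in mind that \s matches tabs and newlines too, while an escaped space or [ ] matches only the space character.

```php
<?php
// In x mode the space in the pattern is dropped, so this is really /ab/.
$ignored = preg_match('/a b/x', 'a b');   // 0: the subject has no "ab"
$asAb    = preg_match('/a b/x', 'ab');    // 1

// Escaped space: matched literally.
$escaped = preg_match('/a\ b/x', 'a b');  // 1

// Space inside a character class: also matched literally.
$inClass = preg_match('/a[ ]b/x', 'a b'); // 1

var_dump($ignored, $asAb, $escaped, $inClass);
```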

Pattern modifiers.

PHP regex has a lot of modifiers. It even has modifiers that don’t exist in PCRE! Therefore I highly recommend reading about them if you aren’t familiar with them. See the documentation page.

«i» modifier.

If this modifier is set, it makes the pattern case insensitive. So when you have a regex like /[a-zA-Z0-9]+/, you can simplify it to /[a-z0-9]+/i. Whether to do so is a personal choice: you might prefer to avoid the i modifier and be as explicit as possible, at the cost of a longer pattern.

«s» modifier.

Also known as DOTALL modifier/mode. If this modifier is set, it will make dots . match everything including newlines. Example: /a.*b/ will match:

a test b

but won’t match:

a test
test b

while /a.*b/s matches both inputs. Note that a dot in a character class [.] loses its meaning: it will match a literal dot.
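
The same comparison as a small runnable sketch, with the two inputs from above:

```php
<?php
$oneLine  = 'a test b';
$twoLines = "a test\ntest b";

// Without s, the dot stops at the newline.
$plainOne = preg_match('/a.*b/', $oneLine);    // 1
$plainTwo = preg_match('/a.*b/', $twoLines);   // 0

// With s, the dot crosses the newline.
$dotallOne = preg_match('/a.*b/s', $oneLine);  // 1
$dotallTwo = preg_match('/a.*b/s', $twoLines); // 1

var_dump($plainOne, $plainTwo, $dotallOne, $dotallTwo);
```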

«m» modifier.

Also known as MULTILINE modifier/mode. Quoting from the documentation:

By default, PCRE treats the subject string as consisting of a single “line” of characters (even if it actually contains several newlines). The “start of line” metacharacter ^ matches only at the start of the string, while the “end of line” metacharacter $ matches only at the end of the string, or before a terminating newline (unless D modifier is set). This is the same as Perl. When this modifier is set, the “start of line” and “end of line” constructs match immediately following or immediately before any newline in the subject string, respectively, as well as at the very start and end. This is equivalent to Perl’s /m modifier. If there are no \n characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.

Let’s see what this means in practice. Say we want to match one or more digits [0-9]+ at the beginning of each line. The regex would look like this: /^[0-9]+/m. Without the m modifier, it would only match the digits on the first line:

123     # Matched by /^[0-9]+/ and /^[0-9]+/m
1234    # Matched by /^[0-9]+/m
12345   # Matched by /^[0-9]+/m
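
As a runnable sketch of the same three lines, preg_match_all() makes the difference easy to see:

```php
<?php
$input = "123\n1234\n12345";

// Without m, ^ only matches at the very start of the string.
$countPlain = preg_match_all('/^[0-9]+/', $input, $plain);
// $plain[0] is ['123']

// With m, ^ also matches right after every newline.
$countMulti = preg_match_all('/^[0-9]+/m', $input, $multi);
// $multi[0] is ['123', '1234', '12345']

print_r($plain[0]);
print_r($multi[0]);
```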

«u» modifier.

Lower case letter u. If this modifier is set, both the pattern and the input string are treated as UTF-8, so you should enable it whenever you’re working with UTF-8 strings. Both must then be valid UTF-8, which means the modifier can’t be used to extract text from an arbitrary binary string that merely contains UTF-8 sequences.

I would recommend enabling this modifier unless you’re absolutely sure that you will be working with ASCII (or single-byte character sets). Note that shorthand character classes like \w, \d, \s, \b, … will become Unicode aware when this modifier is set.
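
A minimal sketch of what changes, using the letter é (written as a "\u{00E9}" escape, which assumes PHP 7.0 or later, so the snippet doesn’t depend on the file’s encoding). In UTF-8 that letter takes two bytes, and without the u modifier the dot matches a single byte.

```php
<?php
$e = "\u{00E9}";                 // "é", two bytes in UTF-8
$bytes = strlen($e);             // 2

// Without u: the dot matches one byte, so "exactly one character" fails.
$plain = preg_match('/^.$/', $e);    // 0

// With u: the dot matches one code point.
$utf8 = preg_match('/^.$/u', $e);    // 1

// With u, \w also accepts non-ASCII letters.
$word = preg_match('/^\w+$/u', "caf\u{00E9}"); // 1

var_dump($bytes, $plain, $utf8, $word);
```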

Escaping backslash hell.

The forward slash / is commonly used as a delimiter in the regex world. Sometimes it might be better to use different delimiters. Especially when you have some forward slashes in your regex:

// Backslash
$regex = '/^\/user\/(\d+)\/?/i';

// Clean
$regex = '~^/user/(\d+)/?~i';

Other common delimiters are #, !, %, _, ;. Pick one that’s unlikely to appear in your regex. For example, # might already be used for comments in x mode, or your regex might simply contain one of those characters.

One not so well known but interesting way is to use an asymmetric pair of delimiters such as ():

$regex = '(^/user/(\d+)/?)i';

Notice that you don’t need to escape the parentheses inside the regex. You could see the outer parentheses as “group 0” and the inner ones as group 1. However, opinions are divided about its usage: some endorse it, while others avoid it because it might seem confusing.
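
To check that the three spellings really are equivalent, here is a small sketch that runs each of them against a made-up path and collects the captured user ID:

```php
<?php
// Invented sample path.
$path = '/user/42/';

$patterns = [
    '/^\/user\/(\d+)\/?/i',  // slash delimiters, escaped slashes
    '~^/user/(\d+)/?~i',     // tilde delimiters
    '(^/user/(\d+)/?)i',     // parentheses as delimiters
];

$ids = [];
foreach ($patterns as $regex) {
    if (preg_match($regex, $path, $m)) {
        $ids[] = $m[1];      // group 1 holds the digits in all three cases
    }
}

print_r($ids); // three times "42"
```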

Know what should and shouldn’t be escaped.

I often see characters being escaped that don’t need to be, resulting in a mess:

// Mess
$regex = '~\>\>user\d+\,\ \"\d+\-\d+\"~';

// Clean
$regex = '~>>user\d+, "\d+-\d+"~';

Check out the following cases where you do not need to escape:

  • The following characters <>@!#~=_,'" if they aren’t used as delimiters
    • /<>@!#~=_,'"/ will match <>@!#~=_,'".
  • Spaces if you’re not using the x modifier.
  • Hyphens outside of character classes.
    • /-+/ will match one or more hyphens.
  • Hyphens inside a character class at the beginning or at the end.
    • /[a-z-]+/ will match a range of letters from “a” to “z” including a hyphen. Example: abcde-fgh.
    • /[-a-z]+/ same as above.
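
A short sketch to back this up: the messy and the clean pattern from above behave the same on a made-up sample string, and a hyphen at either edge of a character class is taken literally.

```php
<?php
// Invented sample input.
$subject = '>>user12, "3-4"';

$mess  = preg_match('~\>\>user\d+\,\ \"\d+\-\d+\"~', $subject); // 1
$clean = preg_match('~>>user\d+, "\d+-\d+"~', $subject);        // 1

// Hyphen at the end or at the start of a character class is a literal hyphen.
$hyphenEnd   = preg_match('/^[a-z-]+$/', 'abcde-fgh');  // 1
$hyphenStart = preg_match('/^[-a-z]+$/', 'abcde-fgh');  // 1

var_dump($mess, $clean, $hyphenEnd, $hyphenStart);
```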