Match pattern

The match pattern is the main component of a regular expression, and is therefore rather complex. If you are new to regular expressions this can be very overwhelming – but fear not, you don’t need to know everything about regexes before you can be productive with them. In fact, most people who use regular expressions regularly use only a small part of the available features. Furthermore, you don’t have to remember each and every special character or operator – that’s what the Quick Reference & Context menus are for – you just need to understand the concept of how each one operates.

Special Characters

The most basic rule is that every character in a match pattern other than the “special characters” matches itself. That is, the pattern foo will match foobar and bigfoot. The special characters are:

. ? + * ^ $ \ ( ) [ ] { } |

These characters don’t match themselves, as they have special meanings. If you want to match an actual $ character, for example, you need to “escape” it by prepending a backslash (\). So \$ will match $ and likewise, \\ will match \.
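As a minimal sketch of escaping, the snippet below uses Python’s re module, which is an assumption made purely for illustration; the engine behind this guide may differ in small details. The sample strings are made up, and the r"..." raw-string prefix is a Python convenience that keeps the backslashes intact.

```python
import re

# A backslash in front of a special character makes it match literally.
dollar = re.compile(r"\$")
backslash = re.compile(r"\\")

print(bool(dollar.search("cost $5")))        # True
print(bool(backslash.search("C:\\temp")))    # True

# Unescaped, the dot is an operator, so v1.0 also matches "v1x0".
print(bool(re.search(r"v1.0", "v1x0")))      # True
print(bool(re.search(r"v1\.0", "v1x0")))     # False

# re.escape() inserts the backslashes for text that should match as-is.
literal = re.escape("notes (v1.0).txt")
print(bool(re.search(literal, "my notes (v1.0).txt")))  # True
```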

Backslash (\)

You’ll note that the backslash is a special character itself and this is, in fact, what’s special about it – it escapes things. The backslash can be used to prevent any character from having a special meaning, but also, it can give an ordinary character a special meaning. For example, \d will match a digit, not an actual d (we’ll cover \d shortly). So you need to be careful not to escape things that don’t have special meanings.

Dot (.)

The dot is probably the most common operator – it matches one of any character. So if the match pattern re..me was applied to rename, the first dot would match the n and the second dot the a. Of course this pattern would also match readme, regime, resume, etc. Be careful when using the dot – you should generally use something more restrictive if possible.
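The re..me example can be tried out with a short script. This is only a sketch using Python’s re module (an assumption, not the tool this guide describes); the two shorter names are added here to show that each dot must consume exactly one character.

```python
import re

pattern = re.compile(r"re..me")

# The last two names are too short: each dot needs a character to match.
names = ["rename", "readme", "regime", "resume", "reme", "resme"]
matched = [name for name in names if pattern.search(name)]

print(matched)  # ['rename', 'readme', 'regime', 'resume']
```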

Pipe (|)

Just as in programming languages, the pipe means “or” – you can use it in a match pattern to specify alternatives. For example, hibit|press|tract will match exhibition or expression or extraction.

Round brackets ((...))

Round brackets have two functions in a match pattern: they capture text, which we’ll discuss later, and they group elements together. Say in the previous example we also wanted to match the ex before and ion after. We can’t use exhibit|press|traction, because the alternatives would then be exhibit, press and traction, so it would match only those parts of exhibition/expression/extraction. By grouping the original expression we get ex(hibit|press|tract)ion, which will achieve what we want.

An “element” is not restricted to just text – it can be any valid regex pattern – so you can nest groups: (one|t(wo|hree)|f(our|ive)|s(ix|even)|eight|nine|zero) will match any spelt-out digit from zero to nine. Grouping several elements becomes very useful when combined with repetition quantifiers.
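The difference that grouping makes is easiest to see by printing what each pattern actually matches. The following is a rough sketch in Python’s re module, chosen here only for illustration; the variable names are hypothetical.

```python
import re

words = ["exhibition", "expression", "extraction"]

# Without brackets the pipe splits the whole pattern into three alternatives.
ungrouped = re.compile(r"exhibit|press|traction")
ungrouped_hits = [ungrouped.search(w).group(0) for w in words]
print(ungrouped_hits)  # ['exhibit', 'press', 'traction']

# With brackets the alternation applies only to the middle of the word.
grouped = re.compile(r"ex(hibit|press|tract)ion")
grouped_hits = [grouped.search(w).group(0) for w in words]
print(grouped_hits)  # ['exhibition', 'expression', 'extraction']

# Nested groups: every spelt-out digit matches in full.
digits = re.compile(r"(one|t(wo|hree)|f(our|ive)|s(ix|even)|eight|nine|zero)")
spelt = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]
all_digits_match = all(digits.fullmatch(w) for w in spelt)
print(all_digits_match)  # True
```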

Quantifiers – the basics

Quantifiers are operators that, when appended to an element, determine how many times that element can match – or if it needs to match at all. These are more complex than you would expect, so we’ll cover the basics now and come back to them later.

Question mark (?)

The first quantifier, the question mark, means the previous element is optional. Or more specifically, it can match zero or one times. So \.jpe?g would match either version of the file extension. By grouping elements together we can make larger sections optional: reg(ular )?ex(pression)? will match both regular expression and regex (but keep in mind this will also match regexpression & regular ex).

Plus (+)

The plus quantifier means the previous element can match one or more times. This can be used for repeating characters, like foo +bar (one or more spaces), or operators, such as .+ (which will match everything). More importantly, it can be used with groups of elements: ((the|(c|s|m)at|on) ?)+ will match the text the cat sat on the mat (as well as on the mat sat the cat).

Star (*)

Lastly, the star quantifier is a combination of the previous two – it means match the previous element zero or more times. This is often used to match any remaining characters: .*setup.* will match the entire filename if it contains “setup” anywhere. (If you matched using just setup the same files would still match, but only the word “setup” would be replaced by the replace pattern).
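The three basic quantifiers can be compared side by side. This sketch assumes Python’s re module rather than the engine the guide is written for, and the filenames are invented for the example.

```python
import re

# ? : the "e" may appear zero or one times, but not twice.
jpeg = re.compile(r"\.jpe?g")
jpeg_hits = [bool(jpeg.search(n)) for n in ["a.jpg", "a.jpeg", "a.jpeeg"]]
print(jpeg_hits)  # [True, True, False]

# ? on groups makes whole sections optional, including the mixed forms.
optional = re.compile(r"reg(ular )?ex(pression)?")
variants = ["regex", "regular expression", "regexpression", "regular ex"]
all_variants = all(optional.fullmatch(v) for v in variants)
print(all_variants)  # True

# + : one or more spaces are required between the words.
spaces = re.compile(r"foo +bar")
print(bool(spaces.search("foo   bar")), bool(spaces.search("foobar")))  # True False

# + on a group repeats the whole group, in any order.
sentence = re.compile(r"((the|(c|s|m)at|on) ?)+")
print(bool(sentence.fullmatch("the cat sat on the mat")))  # True

# * : .*setup.* covers the entire name, plain setup covers only the word.
whole = re.search(r".*setup.*", "my_setup_v2.exe").group(0)
part = re.search(r"setup", "my_setup_v2.exe").group(0)
print(whole)  # my_setup_v2.exe
print(part)   # setup
```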

Character Classes ([...])

Character classes allow you to list a set of characters to match. The pattern [abc] will match any of the three characters, the same as (a|b|c). This is useful for basic single-character alternations, such as practi[cs]e to match either spelling. Of course classes are not limited to letters: [ _.] could be used to match a separator character between words. Note that when repeating a character class, you are repeating the class, not the character that was matched, so [abc]+ will match cab as well as aaa.
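A few quick checks illustrate these points. The code is a sketch in Python’s re module (an assumption for demonstration purposes), and the sample filename is made up.

```python
import re

# One class position accepts either spelling, but nothing else.
spelling = re.compile(r"practi[cs]e")
spelling_hits = [bool(spelling.fullmatch(w))
                 for w in ["practice", "practise", "practire"]]
print(spelling_hits)  # [True, True, False]

# [ _.] matches a single separator character: space, underscore or dot.
parts = re.split(r"[ _.]", "my_holiday photo.jpg")
print(parts)  # ['my', 'holiday', 'photo', 'jpg']

# Repeating a class repeats the choice, not the character first matched.
mixed = bool(re.fullmatch(r"[abc]+", "cab"))
print(mixed)  # True
```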

Ranges (-)

You can also specify simple ranges of letters and numbers: [a-z]+ will match a basic lowercase word, and [0-9A-F]+ will match a hexadecimal string. Ranges and non-ranges can be combined, so \$[0-9,.]+ could be used to match a money value like $1,234.00.
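The range examples behave as follows in Python’s re module, which is used here only as an assumed stand-in for the guide’s own engine. Note that both ranges are case-sensitive as written.

```python
import re

# [a-z] covers lowercase letters only.
print(bool(re.fullmatch(r"[a-z]+", "holiday")))   # True
print(bool(re.fullmatch(r"[a-z]+", "Holiday")))   # False

# [0-9A-F] covers digits and uppercase A to F.
print(bool(re.fullmatch(r"[0-9A-F]+", "1A2F")))   # True
print(bool(re.fullmatch(r"[0-9A-F]+", "1a2f")))   # False

# Ranges and single characters can share one class.
money = re.search(r"\$[0-9,.]+", "invoice $1,234.00 paid").group(0)
print(money)  # $1,234.00
```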

Negative classes (^)

In addition to ranges, it’s also possible to invert a character class so that it matches everything else. So '[^']+' would match an opening quote ('), followed by one or more (+) non-quotes ([^']), followed by a closing quote ('). Be careful not to get confused when doing things like [0-9][^ ] – this doesn’t mean “match a number not followed by a space”, it actually means “match a number followed by a non-space” (so this will not match a number at the end of a filename).
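The caveat about [0-9][^ ] is worth checking by hand. Below is a small sketch, assuming Python’s re module and some invented sample text, showing that a negative class still has to consume a character.

```python
import re

# Opening quote, one or more non-quotes, closing quote.
quoted = re.search(r"'[^']+'", "track 'hello world' live").group(0)
print(quoted)  # 'hello world'

# [^ ] must match an actual character, so a trailing digit is not enough.
digit_then_nonspace = re.compile(r"[0-9][^ ]")
print(digit_then_nonspace.search("file7b").group(0))  # 7b
print(digit_then_nonspace.search("file7"))            # None
print(digit_then_nonspace.search("file7 b"))          # None
```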

Escaping inside character classes

If you’ve been keeping track you’ll have noticed that in a couple of the examples there have been special characters inside the character class, such as the . inside [ _.]. This doesn’t actually mean match a space or an underscore or any character – the rules inside a character class are slightly different. There are only four special characters that need to be escaped:

^ - \ ]

All of the other special characters mentioned at the top of the page lose their special meaning inside a character class, as mostly they wouldn’t make any sense. To escape these characters you can use the backslash as normal ([\^\-\]\\]) or, for ^, - & ], simply place them in a position where they can’t have their usual meanings: ^ somewhere other than at the start, - at the start or the end, and ] at the beginning. So the pattern []^-] would be valid.
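Both ways of writing such a class can be compared directly. This sketch assumes Python’s re module, which happens to accept the positional form []^-] as well; other engines may be stricter, so treat the result as illustrative rather than universal.

```python
import re

# The same three characters, first escaped (plus backslash), then positioned.
escaped = re.compile(r"[\^\-\]\\]")
positioned = re.compile(r"[]^-]")

results = {ch: (bool(escaped.fullmatch(ch)), bool(positioned.fullmatch(ch)))
           for ch in "^-]"}
print(results)

# Only the escaped class also lists the backslash itself.
print(bool(escaped.fullmatch("\\")), bool(positioned.fullmatch("\\")))  # True False

# Inside a class the dot is just a dot, not "any character".
separators = re.compile(r"[ _.]")
print(bool(separators.fullmatch(".")), bool(separators.fullmatch("a")))  # True False
```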

Shorthand character classes (\d, \w, \s)

There are several commonly-used character classes available in short form: \d is the same as [0-9] (a digit), \w the same as [a-zA-Z0-9_] (a word character), and \s the same as a space character (it actually matches “whitespace”, that is, spaces/tabs/newlines, but these aren’t used in filenames so we’ll ignore them). In addition, if you capitalise the letter it will match the respective negative character class: \D (non-digit), \W (non-word), \S (non-space). You can even use these inside your own character class, so [\d,] would match 1,580,000.
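Here is a brief sketch of the shorthand classes in Python’s re module (an assumption for illustration). In Python, \d and \w also match non-English digits and letters by default, so the re.ASCII flag is passed to line up with the simplified definitions given above.

```python
import re

A = re.ASCII  # restrict \d and \w to [0-9] and [a-zA-Z0-9_]

print(bool(re.fullmatch(r"\d+", "2024", A)))      # True
print(bool(re.fullmatch(r"\w+", "file_01", A)))   # True
print(bool(re.fullmatch(r"\w+", "file-01", A)))   # False (hyphen is not a word character)
print(bool(re.fullmatch(r"\s+", " \t")))          # True (space and tab)

# Capitalised forms match the opposite set.
print(bool(re.fullmatch(r"\D+", "abc-def")))      # True
first_word = re.search(r"\S+", "my photo.jpg").group(0)
print(first_word)  # my

# Shorthands can be mixed into your own class.
number = re.search(r"[\d,]+", "about 1,580,000 files").group(0)
print(number)  # 1,580,000
```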

Anchors

So far we haven’t been able to specify where in the filename the pattern should match. Normally the regex engine will attempt the match wherever it can, and if there’s more than one possibility, the one closest to the start. Anchors are a bit different to what we’ve seen so far as they don’t match actual characters, but the positions between characters.

Start/end (^, $)

The ^ operator matches the position at the beginning of the filename, and likewise $ the position at the end. These are used when you only want to match at the start or end of a filename: \.\w+$ would match a file extension for example, or ^\S+ everything up until the first space (or the end of the filename). If you use both then the pattern will only match if it matches the entire filename (eg, ^........$ to match any 8-character filename).

Because anchors match positions, when you replace them nothing is actually removed. You can use this to insert text at the beginning or end of a filename by having a match pattern of just ^ or $ by itself.
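The anchor behaviour, including inserting text by replacing a bare ^ or $, can be sketched with Python’s re.sub. Python is an assumption here, standing in for the guide’s own match/replace fields, and the filenames are hypothetical.

```python
import re

# $ pins the match to the end, so only the last extension is found.
extension = re.search(r"\.\w+$", "archive.tar.gz").group(0)
print(extension)  # .gz

# ^ pins the match to the start.
first_part = re.search(r"^\S+", "my photo.jpg").group(0)
print(first_part)  # my

# Both anchors together force the whole name to fit the pattern.
eight = re.compile(r"^........$")
print(bool(eight.search("12345678")), bool(eight.search("123456789")))  # True False

# Replacing a position removes nothing, so the text is simply inserted.
prefixed = re.sub(r"^", "new_", "photo.jpg")
suffixed = re.sub(r"$", "_old", "photo")
print(prefixed)  # new_photo.jpg
print(suffixed)  # photo_old
```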

Word boundaries (\b)

In addition, the \b operator can be used to match a word-boundary (ie, the position between a \w and a \W character, or at the beginning and end of the filename). So \bphoto\b would match my photo.jpg, but not photography.jpg. Likewise, \B matches a non-word boundary (ie, the position between \w\w or \W\W).
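Word boundaries can be checked against the two filenames mentioned above. This is a minimal sketch in Python’s re module (assumed for illustration); the photo\B pattern is an extra example added here to show the inverse case.

```python
import re

whole_word = re.compile(r"\bphoto\b")
print(bool(whole_word.search("my photo.jpg")))     # True
print(bool(whole_word.search("photography.jpg")))  # False

# \B demands that there is NO boundary at that position.
inside_word = re.compile(r"photo\B")
print(bool(inside_word.search("photography.jpg")))  # True
print(bool(inside_word.search("my photo.jpg")))     # False
```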

Captures

Probably the most powerful feature of regular expressions is the ability to “capture” text that matches all or part of the pattern and use it later on. Captures can be automatically numbered based on the order they appear (unnamed) or given a specific name (named). There are two ways to use captures: as backreferences within the match pattern itself, or by inserting their contents into the replace pattern.

First we’ll look at creating unnamed & named captures, then how to use backreferences. Inserting the contents of a capture is covered in the Replace Pattern section.

Unnamed ((...))

As we mentioned previously, in addition to grouping, round brackets are used to create unnamed captures. These are assigned numbers based on the order of their opening parentheses. For example, if you were to match the regex (...)..((..)(....).) against the text regexrenamer there would be four unnamed captures, 1-4:

(reg)ex((re)(name)r)

    #1 = reg
    #2 = renamer
    #3 = re
    #4 = name

If you want to group elements together but don’t want the contents captured, you can use the non-capturing version: (?:...).
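The numbering in the regexrenamer example can be confirmed with a short script. It is a sketch using Python’s re module, which is an assumption but numbers unnamed captures the same way, by the position of the opening bracket.

```python
import re

match = re.fullmatch(r"(...)..((..)(....).)", "regexrenamer")
captures = match.groups()
print(captures)  # ('reg', 'renamer', 're', 'name')

# Making the first group non-capturing shifts the remaining numbers down.
match_nc = re.fullmatch(r"(?:...)..((..)(....).)", "regexrenamer")
captures_nc = match_nc.groups()
print(captures_nc)  # ('renamer', 're', 'name')
```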

Named ((?<foo>...))

Named captures work in exactly the same way, except you give them a name rather than having them assigned a number. While unnamed captures are generally simpler and used more often, named captures are self-documenting and in a complex pattern can be easier to use when trying to figure out which capture is which.

The syntax is: (?<foo>...) where foo is the name you want to use to refer to the capture. So for example (?<day>\d\d)/(?<month>\d\d)/(?<year>\d\d\d\d) could be used to capture each element of a date to their respective named captures.
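The date example can be sketched in Python, with one important caveat: Python’s re module spells named captures (?P<foo>...) rather than (?<foo>...), so the pattern below is adapted accordingly. The choice of Python and the sample date are assumptions for illustration only.

```python
import re

# Python spelling of the named-capture syntax: (?P<name>...)
date = re.compile(r"(?P<day>\d\d)/(?P<month>\d\d)/(?P<year>\d\d\d\d)")

match = date.fullmatch("25/12/2024")
fields = match.groupdict()
print(fields)  # {'day': '25', 'month': '12', 'year': '2024'}

# Named captures are still numbered, so both lookups give the same text.
print(match.group("year"), match.group(3))  # 2024 2024
```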

Backreferences (\n, \<foo>)

Backreferences are used when you want to match something you’ve already captured. To match an unnamed capture use \n, where n is the number of the capture (n > 9 can only be used if you have that many captures). Likewise, \<foo> will match the contents of the named capture foo.

A simple example for using a backreference is to match a repeating character: ([abc])\1+ will match a or b or c ([abc]), capture that character to unnamed capture #1 (()), then match one or more (+) of capture #1 (\1). So this would match aa, bbbb or cccccc, but not abc. You could use this feature, for example, to remove duplicate file extensions by replacing (\.\w+)\1$ with the contents of capture #1.
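Both backreference examples can be tried as follows. This sketch assumes Python’s re module: numbered backreferences are written \1 as in this guide, but a named backreference is spelt (?P=foo) in Python rather than \<foo>. The filename is hypothetical.

```python
import re

# \1 must repeat whatever character capture #1 actually matched.
repeat = re.compile(r"([abc])\1+")
repeat_hits = [bool(repeat.fullmatch(s)) for s in ["aa", "bbbb", "cccccc"]]
print(repeat_hits)                    # [True, True, True]
print(repeat.search("abc"))           # None

# The same idea with a named capture (Python spelling).
named_repeat = re.compile(r"(?P<ch>[abc])(?P=ch)+")
print(bool(named_repeat.fullmatch("bbbb")))  # True

# Remove a duplicated extension by replacing the match with capture #1.
fixed = re.sub(r"(\.\w+)\1$", r"\1", "photo.jpg.jpg")
print(fixed)  # photo.jpg
```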

Quantifiers – advanced

We’ve already seen how to use the ?, +, and * quantifiers to make an element optional, repeatable, or both. But there is more to quantifiers, including being able to repeat an element a specific number of times (or between a range of times), and the difference between repeating as much and as little as possible.

Curly bracket quantifier ({...})

The curly bracket quantifier can be used instead of ?/+/* in several ways: {n} will match the previous element exactly n times (eg, \b\w{5}\b would match any 5-letter word) and {n,m} will match between n and m times (\b\w{3,5}\b to match 3-5 letter words). You can also repeat “up to” n times ({0,n} or {1,n}), and “at least” n times ({n,}). So ? is really just shorthand for {0,1}, + for {1,} and * for {0,}.
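The curly bracket forms can be exercised on a couple of made-up sentences. This is a sketch in Python’s re module, an assumption for demonstration; the engine described in this guide should give the same results for these simple cases.

```python
import re

# {5}: words of exactly five word characters.
five = re.findall(r"\b\w{5}\b", "three short words here")
print(five)  # ['three', 'short', 'words']

# {3,5}: words of three to five word characters.
three_to_five = re.findall(r"\b\w{3,5}\b", "a to the four fives sixers")
print(three_to_five)  # ['the', 'four', 'fives']

# ?, + and * written out in their long forms.
print(bool(re.fullmatch(r"ab{0,1}", "a")))    # True, same as ab?
print(bool(re.fullmatch(r"ab{1,}", "abbb")))  # True, same as ab+
print(bool(re.fullmatch(r"ab{0,}", "a")))     # True, same as ab*
```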

Lazy vs Greedy operators (...?)

If you apply the pattern ^(\w+)(\d+)$ to the text abc123 you’ll note that, because \w also includes digits, the contents of captures #1 and #2 can have several possible value combinations: abc/123, abc1/23, or abc12/3 (it can’t match abc123/nothing because \d+ means “one or more”). If you try this out you’ll find that it turns out to be abc12/3. This is because all quantifiers by default are “greedy”. That is, when given the option they will match as many times as possible. In this example the \w+ actually means “match one or more word characters, as much as possible”.

To change this behaviour and match as little as possible, append a ? to the quantifier to make it “lazy”. So \w+? means “match one or more word characters, as little as possible”, and if used in the example would result in abc/123. The curly bracket quantifiers can be made lazy as well: {5,10}? will still repeat the previous element 5 to 10 times, but will try to match as little as possible within that range. For a complete list, see Quantifiers in the quick reference.

Keep in mind the greedy/lazy issue only comes into effect when the number of times a quantifier can match is ambiguous – adding a ? to a quantifier will never cause a match to fail: ^\w+?$ will still match abc123 because it’s anchored at both ends and + doesn’t have any other option.

In the original example, you may be wondering: if the + quantifier is greedy, why did the \w+ match as much as possible but not the \d+? This is simply because the \w+ appears earlier in the pattern, so it gets first pick and the \d+ can only take what is left over. In effect, the pattern behaves as if it had been written ^(\w+)(\d+?)$.
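The abc123 example can be reproduced to see where each capture ends up. The sketch below assumes Python’s re module, whose greedy and lazy quantifiers follow the same rules described in this section.

```python
import re

text = "abc123"

# Greedy \w+ takes as much as it can, leaving \d+ a single digit.
greedy = re.search(r"^(\w+)(\d+)$", text).groups()
print(greedy)  # ('abc12', '3')

# Lazy \w+? takes as little as it can, leaving \d+ all the digits.
lazy = re.search(r"^(\w+?)(\d+)$", text).groups()
print(lazy)  # ('abc', '123')

# Laziness never makes a match fail: the anchors force a full match.
anchored = re.search(r"^\w+?$", text).group(0)
print(anchored)  # abc123

# Making \d+ lazy changes nothing, because \w+ already left it one digit.
lazy_digits = re.search(r"^(\w+)(\d+?)$", text).groups()
print(lazy_digits)  # ('abc12', '3')
```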

Other advanced operators

This covers the majority of the match pattern syntax you need to know to effectively match filenames. There are still several advanced operators we haven’t mentioned, including lookaround, alternation & inline modifiers. While these are outside the scope of this guide, they are listed in the Quick Reference should you need to refer to them.