Here is the video version for this topic. You can read the written version which comes after the video section.
In the previous lessons, we have looked at Character Classes and Quantifiers . In this lesson, we'll learn about meta characters.
What are Meta Characters?
In Regex, we have normal characters, which as we have seen so far are letters, numbers, and symbols. But we also have meta characters.
Meta characters are characters in regex which represent a one or more characters. We have a couple of them across different regex engines, but we would look at some of the common ones:
- Whitespace Meta Character \s
- Non-Whitespace Meta Character \s
- Digit Meta Character \d
- Non-Digit Meta Character \D
- Word Meta Character \w
- Non-Word Meta Character \W
- Newline Meta Character \W
Whitespace Meta Character \s
This meta character, written as backward slash lowercase s \s, matches a whitespace character.
Remember that a whitespace is also a character as we saw in the terms used in Regex lesson .
A whitespace here can be a single space or tab space or a new line. Let's see an example where we use this meta character in a regular expression:
/\s[a-z]{3}/gi
This pattern means, a whitespace, followed by a character class of a range from a to z, which is repeated three times.
Remember that this quantifier would be applied to the preceding character which is the character class in this pattern. It will not be applied to \s[a-z], instead it would be applied to [a-z]
Let's apply this to a string:
As you can see in the matches:
- even though we have the case insensitive flag i, it does not match βHisβ because the string it is not preceded by a space
- " ide" is matched because it has space, then a character between a and z, three times, which is "i", "d" and "e"
- " was" is matched
- the vertical space, followed "tha" is also matched; the whitespace character applies to horizontal space as well as vertical space like the new line
- " bad" is also matched
Non-Whitespace Meta Character \S
This meta character, written as backward slash uppercase S \S, matches a non-whitespace character. This means, matching any character that is NOT a whitespace character.
The whitespace meta character written with lowercase s \s matches whitespace characters, while this meta character, the non-whitespace, written with uppercase s \S matches characters that are not whitespace. Non-whitespace characters include digits, letters and symbols.
This meta character can also be written as a negated character class containing a whitespace character: [^\s] which means "match anything except a whitespace character".
If you haven't seen the lesson yet on negated character classes , you should definitely check it out.
So let's see how we can use this meta character in a regular expression:
/\S{4}/
This pattern means four characters followed by each other which is not a space.
Using our same string example:
- "$500" is matched, which is four characters consisting of a 1 symbol and 3 numbers - does not include a space
- "idea" is matched, which is four characters consisting of 4 letters
- same thing with "that", the third match
Digit Meta Character \d
This meta character, written as backward slash lowercase d \d, matches any digit character which is your digits from 0 to 9. So instead of using a character class of a range of 0 to 9 [0-9], you can use this meta character.
For example, let's say we have this pattern:
/\d{3}/
This means, any digit (1), followed by another digit (2), followed by another digit (3).
Let's apply this pattern to a string:
You see that the pattern matches "345" and "700". But it does not match "22", as that is two digits.
Non-Digit Meta Character \D
Just as we have an opposite for the whitespace meta character, we also have the opposite of the digit meta character. This meta character is written as backward slash uppercase D \D.
This character matches any non-digit character. So this means it matches letters, spaces and symbols.
Instead of having to use a negated character class like [^0-9], which means every other character except digits between 0 to 9, you can simply use this meta character.
Let's see how to use this in a regular expression:
/\d\D{4,5}/
This pattern means, a digit, followed by a non-digit which is repeated for minimum 4 times, and maximum 5 times.
Let's apply this pattern on a string example:
You see the matches we have here are:
- "5cars", which is a digit followed by a non-digit repeated 4 times
- "6bikes", which is a digit followed by a non-digit repeated 5times
- "7bicyc", which is a digit followed by a non-digit repeated 5 times
Word Meta Character \w
This meta character, written as backward slash lowercase w \w, matches any word character and an underscore. Word characters here include letters and numbers. The letters can either be in upper or lowercase.
This meta character character does not match symbols. The only exception is the underscore symbol _.
This meta character is an alternative to a character class containing letters in lower and uppercases, and numbers, and the underscore: [a-zA-Z0-9_]
Let's see how to use this meta character in a regular expression:
/\w{3,}/
This pattern means any word character, repeated 3 or more times.
Let's apply it on a string example:
As you can see in the matches:
- "She" is matched because it contains three repetitions of word characters: S, h and e
- the space is not matched because it is not a word character
- "the" is matched because it also contains three repetitions of word characters
- same for "2nd", "fan", and "the"
- "lo_fi" is also matched because it contains more than three repetitions of a word character, which also includes the underscore
- "musical" also matches the pattern
Non-Word Meta Character \W
As you may guess, we also have an opposite for the word meta character, which is the Non-Word Meta Character.
This meta character, written as backward slash uppercase W \W, matches any character that is not a word character and an underscore. That means, this matches spaces, and other symbols.
Let's say we have this simple pattern:
/\W/
And we apply it to a string:
As you can see in the matches, we have all the spaces (the single, tab, and newline spaces) matched, also with the ampersand symbol. The underscore symbol is not matched.
Newline Meta Character \n
The newline meta character, is written as backward slash lowercase n. As the name implies, this matches new line characters.
For example, let's say we have this pattern:
/\n{2}/
This will match newlines that occur twice.
If we apply this pattern on a string like:
You see that it doesn't match the first line break, but the second one because it occurs twice.
If you were expecting a backward slash uppercase N \N as an opposite to the newline meta character, I'm sorry to tell you there's nothing like that π . At least, at the moment.
Summary
These are the most common meta characters in regular expressions at the moment. Some programming languages or regex engines may have more, or there may be new ones added as time goes on.
Here's a summary of the meta characters we covered in this lesson:
- Whitespace Meta Character \s: which matches whitespaces
- Non-Whitespace Meta Character /S: which matches non-whitespaces
- Digit Meta Character \d: which matches digits
- The opposite, Non-Digit Meta Character \D: which matches non-digits
- Word Meta Character \w: which matches word characters include letters, numbers and the underscore
- Non-Word Meta Character \W: which matches non-word characters like spaces and symbols (excluding the underscore)
- Newline Meta Character \n:
- Word-Boundary Meta Character \b:
You can use these characters in your patterns in diverse ways. You can use them with:
- character classes: /[^\s\w]/ β a negated character class which matches characters except a space and a word character
- quantifiers: /\w{5}\d/ β a pattern which matches word characters repeated 5 times followed by a number
And as you would see in the rest of this course, you can use them in combination with other regex features.