Regular Expressions (Regexp)

Theory: Character classes

In this lesson, we will discuss character classes.

A character class is a notation that matches any single character from a specified set.

Let us look at a simple example of how character classes work. Suppose we only need to find the letters of the alphabet. To do this, we describe the set of allowed characters in square brackets. For example, the range [a-z] covers the lowercase letters of the English alphabet.

This pattern matches every lowercase letter in the string:


/[a-z]/

java 11_34-1938 tab

new line


You can search for the digits from zero to nine in the same way:


/[0-9]/

java 11_34-1938 tab

new line
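
The two range searches above can be reproduced in code. The lesson does not name a programming language, so this is only a minimal sketch using Python's standard re module. The exact whitespace in the sample string is an assumption (single spaces and one newline).

```python
import re

# Assumed sample text: the lesson's string with single spaces and one newline.
text = "java 11_34-1938 tab\nnew line"

# [a-z] matches one lowercase ASCII letter at a time.
letters = re.findall(r"[a-z]", text)

# [0-9] matches one ASCII digit at a time.
digits = re.findall(r"[0-9]", text)

print("".join(letters))  # javatabnewline
print("".join(digits))   # 11341938
```

Each match is a single character, which is why findall returns a list of one-character strings rather than whole words.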


And in this example, we specify just two characters, each of which will be found:


/[aj]/

java 11_34-1938 tab

new line
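
As a sketch in Python (an assumed language, with an assumed sample string), a class that lists individual characters behaves like this:

```python
import re

# Assumed sample text: the lesson's string with single spaces and one newline.
text = "java 11_34-1938 tab\nnew line"

# [aj] matches a single character that is either "a" or "j".
# The order inside the brackets does not matter.
found = re.findall(r"[aj]", text)
same = re.findall(r"[ja]", text)

print(found)          # ['j', 'a', 'a', 'a']
print(found == same)  # True
```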


With character classes, you can use a mechanism called negation, which inverts the search.

To negate a class, put the character ^ immediately after the opening square bracket. The class then matches every character except those listed after ^:


/[^aj]/

java 11_34-1938 tab

new line
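
Negation can be checked with a short Python sketch. Python and the sample string are assumptions here, not something the lesson prescribes. Note that a negated class also matches spaces and the newline, because they are not in the excluded set.

```python
import re

# Assumed sample text: the lesson's string with single spaces and one newline.
text = "java 11_34-1938 tab\nnew line"

# [^aj] matches any single character that is NOT "a" or "j",
# including spaces and the newline.
kept = re.findall(r"[^aj]", text)
print(repr("".join(kept)))  # 'v 11_34-1938 tb\nnew line'

# ^ negates only in the first position; elsewhere it is a literal caret.
literal_caret = re.findall(r"[a^j]", "a^j")
print(literal_caret)  # ['a', '^', 'j']
```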


If we need to find a hyphen along with other characters, we put the hyphen at the beginning or end of the character list. That way, it is treated as a literal character rather than as a range operator:


/[aj-]/

java 11_34-1938 tab

new line
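
The difference between a literal hyphen and a range is easy to see side by side. The following is a sketch in Python with an assumed sample string. The escaped form [a\-j] is an additional option not covered in the lesson text above.

```python
import re

# Assumed sample text: the lesson's string with single spaces and one newline.
text = "java 11_34-1938 tab\nnew line"

# Hyphen at the end of the class: a literal "-".
literal = re.findall(r"[aj-]", text)

# Hyphen between two characters: the range from "a" to "j".
as_range = re.findall(r"[a-j]", text)

# A backslash-escaped hyphen is also literal, wherever it stands.
escaped = re.findall(r"[a\-j]", text)

print("".join(literal))   # jaa-a
print("".join(as_range))  # jaaabeie
print(literal == escaped) # True
```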


Regular expressions often use predefined character classes. Each one is written as a backslash (\) followed by a letter that the regular expression language reserves for that class.

In the previous lesson, we used \ as an escape character. Here it is also part of the class notation.

Let us find all the digits in the text using \d:


/\d/

java 11_34-1938 tab

new line


If we specify an uppercase D, the search will match every character that is not a digit, including spaces and tabs:


/\D/

java 11_34-1938 tab

new line
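
Here is a sketch of \d and \D in Python, which is an assumed language with an assumed sample string. In Python, \d on ordinary strings can also match non-ASCII digits, but for ASCII text like this sample it gives the same result as [0-9].

```python
import re

# Assumed sample text: the lesson's string with single spaces and one newline.
text = "java 11_34-1938 tab\nnew line"

# \d matches a single digit.
digits = re.findall(r"\d", text)

# \D matches a single character that is not a digit,
# including spaces, the underscore, the hyphen, and the newline.
non_digits = re.findall(r"\D", text)

print("".join(digits))            # 11341938
print(repr("".join(non_digits)))  # 'java _- tab\nnew line'
```

Every character of the string falls into exactly one of the two lists, so their lengths add up to the length of the text.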


There are also:

  • The class \s, which matches whitespace characters
  • The class \S, representing all non-whitespace characters

As we can see, the principle is simple. A lowercase letter denotes a class, and the corresponding uppercase letter denotes everything that does not belong to that class.
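
The whitespace classes can be tried out the same way. This is a sketch in Python, assuming a sample string with single spaces and one newline.

```python
import re

# Assumed sample text: the lesson's string with single spaces and one newline.
text = "java 11_34-1938 tab\nnew line"

# \s matches a single whitespace character (space, tab, newline, ...).
spaces = re.findall(r"\s", text)

# \S matches a single character that is not whitespace.
visible = re.findall(r"\S", text)

# Splitting on \s cuts the text at every whitespace character.
parts = re.split(r"\s", text)

print(spaces)            # [' ', ' ', '\n', ' ']
print("".join(visible))  # java11_34-1938tabnewline
print(parts)             # ['java', '11_34-1938', 'tab', 'new', 'line']
```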

There is another popular class, \w. It includes all letters of the alphabet, all digits, and the underscore. Whitespace characters do not belong to this class, and neither does the hyphen (-), so the example below does not match them:


/\w/

java 11_34-1938 tab

new line


For ASCII text, the class \w is equivalent to this notation: [0-9a-zA-Z_]. Note that character ranges are case-sensitive, so both a-z and A-Z must be listed.
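
The equivalence can be checked with a small Python sketch (an assumed language and sample string). One caveat applies to Python specifically: on ordinary strings, \w is Unicode-aware by default and also matches letters outside the English alphabet, and the re.ASCII flag restricts it to the bracket notation shown above.

```python
import re

# Assumed sample text: the lesson's string with single spaces and one newline.
text = "java 11_34-1938 tab\nnew line"

word_chars = re.findall(r"\w", text)
explicit = re.findall(r"[0-9a-zA-Z_]", text)

print("".join(word_chars))     # java11_341938tabnewline
print(word_chars == explicit)  # True

# "\u00e9" is the letter e with an acute accent.
unicode_match = re.findall(r"\w", "\u00e9")
ascii_match = re.findall(r"\w", "\u00e9", re.ASCII)

print(len(unicode_match))  # 1
print(len(ascii_match))    # 0
```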

Accordingly, \W matches everything that \w does not, so we can use it to find hyphens and whitespace characters:


/\W/

java 11_34-1938 tab

new line
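
As a final sketch in Python (assumed language, assumed sample string), \W picks out exactly the characters that \w skips:

```python
import re

# Assumed sample text: the lesson's string with single spaces and one newline.
text = "java 11_34-1938 tab\nnew line"

# \W matches a single character that is not a letter, digit, or underscore.
non_word = re.findall(r"\W", text)
word = re.findall(r"\w", text)

print(non_word)  # [' ', '-', ' ', '\n', ' ']

# \w and \W are complements: together they account for every character.
print(len(word) + len(non_word) == len(text))  # True
```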

