Grouping. Backreferences — Regular Expressions (Regexp)

Backreferences
Named groups
Disabling backreferencing
Atomic grouping

In this lesson, we will look at additional features and different grouping types.

Backreferences

Suppose we have a group that matches either ta or tu:

/(ta|tu)/

ta-tu ta-ta tu-tu

Suppose we want to find only those substrings in which the left and right parts match: ta - ta and tu - tu.

A first attempt is to repeat the alternation on the right side. This does not give us what we want: the two groups match independently of each other, so ta-tu is found as well:

/(ta|tu)-(ta|tu)/

ta-tu ta-ta tu-tu

This is where backreferences help. The special notation \1 matches the same text that the first group has just captured.

Now only the substrings with identical left and right parts, ta-ta and tu-tu, are found:

/(ta|tu)-\1/

ta-tu ta-ta tu-tu
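
The difference between the two patterns can be checked in code. Below is a minimal JavaScript sketch; the variable names are chosen here for illustration only.

```javascript
// Test string from the lesson.
const text = "ta-tu ta-ta tu-tu";

// Two independent groups: the right part may differ from the left part.
const independent = text.match(/(ta|tu)-(ta|tu)/g);

// Backreference: \1 must repeat exactly what the first group captured.
const sameParts = text.match(/(ta|tu)-\1/g);

console.log(independent); // [ 'ta-tu', 'ta-ta', 'tu-tu' ]
console.log(sameParts);   // [ 'ta-ta', 'tu-tu' ]
```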

By default, every group is a capturing group: the engine stores the text the group matched and numbers the groups from left to right. The first nine groups are available as \1 through \9.

A quantifier on the group is not repeated by the backreference. The backreference matches only the text captured by the last iteration of the group. In the test string below, the group matches once in every case, so the result is the same as before:

/(ta|tu)+-\1/

ta-tu ta-ta tu-tu
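
The test string above cannot show which iteration a backreference sees, because the group never repeats in it. The following JavaScript sketch uses two extra strings, tatu-tu and tatu-ta, which are not part of the lesson and are assumed here only to make the behavior visible.

```javascript
// The group repeats twice ("ta", then "tu") and keeps only the last capture.
const repeated = /(ta|tu)+-\1/;

const lastIteration = "tatu-tu".match(repeated);
console.log(lastIteration[0]); // 'tatu-tu'
console.log(lastIteration[1]); // 'tu'

// "ta" was captured first but then overwritten, so \1 cannot match "ta" here.
const firstIterationMatches = repeated.test("tatu-ta");
console.log(firstIterationMatches); // false
```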

Named groups

When a pattern has several groups, remembering them by number is inconvenient. It is much easier to use names. To do this, add ?<name> right after the opening parenthesis and refer to the group with \k<name>:

/(?<group1>ta|tu)-\k<group1>/

ta-tu ta-ta tu-tu

The group can now be referenced by the name group1, both inside the pattern and in the code that processes the match.
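
As a minimal sketch of using the name from code, the JavaScript example below reads the group through the groups property of the match and reuses it in a replacement string. The input strings and the bracketed replacement are assumptions made for illustration.

```javascript
const named = /(?<group1>ta|tu)-\k<group1>/;

// The captured text is available under its name on match.groups.
const result = "say ta-ta now".match(named);
console.log(result[0]);            // 'ta-ta'
console.log(result.groups.group1); // 'ta'

// Named groups can also be used in replacement strings as $<name>.
const replaced = "tu-tu".replace(named, "[$<group1>]");
console.log(replaced); // '[tu]'
```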

Disabling backreferencing

We can turn off capturing for a group by putting ?: right after its opening parenthesis:

/(?:ta|tu)-\1/

ta-tu ta-ta tu-tu

Such a group is not saved to the memory area and does not get a number. The pattern above shows the consequence: \1 refers to a group that does not exist, which many engines report as an error.

Non-capturing groups make the regular expression somewhat harder to read, but they can make it faster, because nothing has to be stored. They are a good fit when:

You have a lot of groups and do not need their contents
You want to save memory and keep the numbering of the remaining capturing groups unaffected
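
The effect on numbering can be seen in a short JavaScript sketch. The first pattern, with one non-capturing and one capturing group, is an assumption added for illustration. The second part relies on JavaScript's u flag, under which a backreference to a missing group is rejected when the regex is created.

```javascript
// Only the second pair of parentheses captures, so it becomes group 1.
const mixed = "ta-tu".match(/(?:ta|tu)-(ta|tu)/);
console.log(mixed.length); // 2: the full match plus one captured group
console.log(mixed[1]);     // 'tu'

// With the u flag, JavaScript rejects a backreference to a missing group.
let rejected = false;
try {
  new RegExp("(?:ta|tu)-\\1", "u");
} catch (error) {
  rejected = error instanceof SyntaxError;
}
console.log(rejected); // true
```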

Atomic grouping

Another interesting kind of grouping without backreferencing is atomic grouping.

Not every language supports atomic grouping. JavaScript does not have it, and Python's re module supports it only since version 3.11. Where it is missing, it can be emulated with existing constructions.

For atomic grouping, we use ?> instead of ?: after the opening parenthesis:

/a(?>bc|b|x)cc/

abccaxcc

With atomic grouping, this regex finds only axcc. If we remove ?>, it finds two substrings, abcc and axcc:

/a(bc|b|x)cc/

abccaxcc

Here is what happens with the atomic group (?>bc|b|x). Starting at the first a, the engine matches a, then bc, and then needs cc, but only a single c is left before the next a. An ordinary group would backtrack at this point: the engine would return to the group, try the next alternative b, and then match cc, which gives abcc.

With atomic grouping, backtracking into the group is disabled. Once bc has matched, the alternatives b and x are no longer considered at this position, so the attempt that starts at the first a fails.

The engine then moves on through the string and starts the regular expression again from its first character. At the second a, the alternatives bc and b fail, x matches, and cc follows, so axcc is found.

For the atomic version to match at the first a as well, the string needs one more c after bc. Then both abccc and axcc are found:

/a(?>bc|b|x)cc/

abcccaxcc
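
Since JavaScript has no ?> syntax, the behavior described above can only be approximated there. One commonly described workaround, sketched below, wraps the alternatives in a capturing lookahead and then consumes the captured text with \1. It relies on the fact that JavaScript does not backtrack into a lookahead once it has succeeded. This is an emulation under that assumption, not the atomic-group syntax itself, and it adds an extra capturing group to the pattern.

```javascript
// (?=(bc|b|x)) picks an alternative without consuming text and captures it;
// \1 then consumes exactly that text. The lookahead is not re-entered on failure.
const atomicLike = /a(?=(bc|b|x))\1cc/g;
const ordinary = /a(bc|b|x)cc/g;

const ordinaryFound = "abccaxcc".match(ordinary);
const atomicFound = "abccaxcc".match(atomicLike);
const atomicWithExtraC = "abcccaxcc".match(atomicLike);

console.log(ordinaryFound);    // [ 'abcc', 'axcc' ]
console.log(atomicFound);      // [ 'axcc' ]
console.log(atomicWithExtraC); // [ 'abccc', 'axcc' ]
```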
