Regular Expression capturing group and its usage


Scenario

I’ve recently been working with Grafana and needed to extract one segment from a "-"-separated Prometheus label. The label has the format NAMESPACE-NODE-POD, and I would like to extract the POD part.

The task would be trivial if I could write code: just do s.split('-')[-1]. But this is a non-programming environment, so only regular expressions are supported.

What is a Regular Expression?

I got mad at the garbled-looking text when reading the formal definition of regex on Wikipedia. It is accurate, but really unfriendly to non-experts. Let’s introduce regex in a more intuitive way:

Think of regex as an enhanced version of string search. How does string search work? If we set aside the efficient but complicated algorithms, string search just compares the pattern with the text character by character, from left to right. If there is a mismatch, it moves the starting position to the next character and restarts.
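
The char-by-char search described above can be sketched in Python roughly like this. It is only an illustration of the idea: the function name naive_search is made up for this post, and real search and regex implementations are far more sophisticated.

```python
def naive_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    # Try every possible starting position, from left to right.
    for start in range(len(text) - len(pattern) + 1):
        # Compare pattern and text one character at a time.
        for offset, ch in enumerate(pattern):
            if text[start + offset] != ch:
                break  # mismatch: give up on this start, try the next one
        else:
            return start  # the inner loop never broke, so every char matched
    return -1


print(naive_search("POD", "NAMESPACE-NODE-POD"))  # 15
```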

The overall process for regex is nearly the same. Ordinary characters are matched in exactly the same way; confusion arises only when special characters (such as ?, * and +) come into play.

What is a group?

For this case, the main concept we need to understand is the group. In short, a group is a sequence of pattern elements wrapped in parentheses (). For example, (ab)c puts the elements a and b into one group, ab. After matching completes, the text matched by each group is stored in the variable $x, where x is the index of the group, counting from 1.
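
To make the (ab)c example concrete, here is a small sketch using Python’s re module. This is my own choice for illustration, not what Grafana runs internally. Where Grafana writes $1, Python reads the same capture with group(1), and group(0) is the whole match.

```python
import re

# (ab)c: the parentheses make "ab" group 1; the "c" is outside any group.
match = re.search(r"(ab)c", "xxabcxx")

whole = match.group(0)  # the entire matched text
first = match.group(1)  # only what group 1 captured
print(whole, first)  # abc ab
```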

Why do we need groups?

  1. We can use a group to apply an operation to the whole group rather than to a single character. E.g. (ab)+ matches ab one or more times.
  2. We can use a group to extract a specific part of the matched string. In our scenario, the group is used for this purpose.
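
The first point is easy to check by comparing ab+ with (ab)+. Below is a minimal sketch in Python; the helper name fully_matches is hypothetical and only exists to keep the example short.

```python
import re


def fully_matches(pattern: str, text: str) -> bool:
    """True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


# Without a group, + applies only to b: one a followed by one or more b.
print(fully_matches(r"ab+", "abbb"))      # True
print(fully_matches(r"ab+", "ababab"))    # False

# With a group, + applies to the whole "ab" unit.
print(fully_matches(r"(ab)+", "ababab"))  # True
print(fully_matches(r"(ab)+", "abbb"))    # False
```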

Solution

If we would like to extract the day from the date string 1970-01-02, the pattern .*-.*-(.*) matches the whole date string, and $1 stores the matched day. The same pattern applied to a NAMESPACE-NODE-POD label leaves the POD part in $1.

Explanation

.* matches any character (except a newline, by default) zero or more times. - matches a literal dash and has no special meaning here. There are three dash-separated segments, but only the third one is surrounded by parentheses, so group 1 contains the matched day string. If you are using Python, the group(x) method of the match object returns the x-th group.
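
For readers who want to try the pattern outside Grafana, here is a minimal Python sketch. The function name last_segment is made up for this post, and using fullmatch is my assumption about how "matches the whole string" should be checked.

```python
import re
from typing import Optional

# Two uncaptured segments, then a captured third one.
PATTERN = re.compile(r".*-.*-(.*)")


def last_segment(label: str) -> Optional[str]:
    """Return what group 1 captured, or None if the label does not match."""
    match = PATTERN.fullmatch(label)
    return match.group(1) if match else None


print(last_segment("1970-01-02"))          # 02
print(last_segment("NAMESPACE-NODE-POD"))  # POD
print(last_segment("nodashes"))            # None
```

Because .* is greedy, a string with more than two dashes still ends up with only the text after the last dash in group 1, which is the same result s.split('-')[-1] gives.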

Further Reading

There are tons of concepts embedded in regex pattern strings (this is why they look so much like garbled text LOL). For simplicity, I will not extend this article further. Interested readers can search for the following keywords online to learn more:

  1. Non-Capturing Groups
  2. Named Capturing Groups
  3. Backreferences
