Regular expressions - why does every analyst need them?

If you work with Google Analytics, Tag Manager or Data Studio, you have probably come across the word ‘regex’. If you are curious about what it is and how to work with regular expressions, keep reading. 🙂

Regular expressions, or regex for short, are special sequences of characters that help us work more efficiently with text values. They are used in filtering, searching, and creating goals and segments.

A regex defines a search pattern that is then matched against text. In web analytics, we can use regex to find text patterns in URLs, event names, keywords, traffic sources and so on.

To better understand regular expressions, we will go through each special character, explain its meaning and show some use cases.

Pipe (|)

The pipe is the regex feature you’ll use most often in Analytics. Many regular expressions can be replaced by the default matching options that Analytics offers, but the pipe is still needed.

The meaning of this expression is simple: it means “or”.

If we want to filter for data from Facebook or Google, we simply write facebook|google, with no spaces around the pipe, because a space would be treated as part of the pattern. Analytics then selects values that contain one, the other, or both.
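The pipe can be tried out with a minimal sketch in Python's re module. Analytics uses its own regex engine, so re.search is only a stand-in for its partial matching here, and the list of traffic sources is invented for the example.

```python
import re

# Hypothetical traffic sources, made up for illustration.
sources = ["facebook", "google", "m.facebook.com", "bing", "newsletter"]

# The pipe means "or": keep a source if it contains facebook OR google.
pattern = re.compile(r"facebook|google")
matched = [s for s in sources if pattern.search(s)]
print(matched)  # ['facebook', 'google', 'm.facebook.com']

# Spaces are literal characters in a regex, so padding the pipe changes the pattern.
padded = re.search(r"facebook | google", "facebook")
print(padded)  # None
```

The second check shows why the pattern should be written without spaces around the pipe.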

Dot (.)

The dot matches any single character. It’s also called a wildcard.

If we define the search pattern as .ay, it will match the words day, say, pay, may, etc. However, it will not match the word ay on its own, because there must be a character in place of the dot.

The dot shows its full potential when it is combined with other regular expression characters.
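As a rough illustration of the .ay example, here is a short Python sketch. It uses re.fullmatch so that each whole word has to fit the pattern; that choice is an assumption made to keep the example tidy, not how Analytics itself applies a filter.

```python
import re

words = ["day", "say", "pay", "may", "ay"]

# The dot stands for exactly one character, so "ay" alone is too short.
pattern = re.compile(r".ay")
dot_matches = [w for w in words if pattern.fullmatch(w)]
print(dot_matches)  # ['day', 'say', 'pay', 'may']
```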

Asterisk (*)

The asterisk matches the preceding character zero or more times.

As an example, take my name, Marianna. Sometimes it’s written with a single n, sometimes with a double n. The regular expression can therefore be written as mariann*a, with no spaces. It matches all the variations of the word, such as mariana, marianna, mariannna, etc.
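The name example can be checked with a small Python sketch. The list of spellings is made up for illustration, and re.fullmatch is used here so that each whole word is tested.

```python
import re

# Hypothetical spellings; "maria" is included as a word that should not match.
names = ["mariana", "marianna", "mariannna", "maria"]

# "n*" allows the second n to appear zero, one or many times.
pattern = re.compile(r"mariann*a")
name_matches = [n for n in names if pattern.fullmatch(n)]
print(name_matches)  # ['mariana', 'marianna', 'mariannna']
```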

Dot-asterisk (.*)

If we want to match any character, repeated zero or more times, we use .*. Dot-asterisk is the most powerful combination in regex, so you need to consider carefully where to use it.

If we need to create a filter in Analytics that prepends the hostname to the request URI, dot-asterisk makes the work much easier: each field is captured with (.*).

The parentheses, in this case, define the matching group. With this filter, Analytics’ reports display the entire URL.

Imagine that we have multiple categories on our website. URLs are as follows:

/products/women/socks/

/products/men/socks/

/products/children/socks/

If we want to see data for all socks, regardless of whether they are women’s, men’s or children’s socks, we’ll create the following filter:

/products/.*/socks/

This way we will see all 3 categories in the report. However, if we also have an address /products/sales/socks/, the regex will match it as well. That is why we should be careful with wildcard expressions.

Since .* can match anything, it may also slow down processing.
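To see how broadly dot-asterisk matches, here is a hedged Python sketch of the socks filter. The extra /products/women/shoes/ path is invented for the example, and re.search only approximates how Analytics applies a filter.

```python
import re

paths = [
    "/products/women/socks/",
    "/products/men/socks/",
    "/products/children/socks/",
    "/products/sales/socks/",   # also matched, wanted or not
    "/products/women/shoes/",   # hypothetical non-sock page
]

# ".*" accepts any text between /products/ and /socks/.
pattern = re.compile(r"/products/.*/socks/")
sock_paths = [p for p in paths if pattern.search(p)]
print(sock_paths)
```

All four socks URLs are selected, including the sales one, which is exactly the side effect described above.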

Plus sign (+)

The plus sign matches the preceding character one or more times.

For example, hello+ matches hello, helloo, hellooo, etc.
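The difference between the plus sign and the asterisk is easiest to see side by side. The following Python sketch is only an illustration, with a made-up word list and re.fullmatch testing each whole word.

```python
import re

words = ["hell", "hello", "helloo", "hellooo"]

# "+" needs at least one "o" at the end, "*" is also happy with none.
plus_matches = [w for w in words if re.fullmatch(r"hello+", w)]
star_matches = [w for w in words if re.fullmatch(r"hello*", w)]

print(plus_matches)  # ['hello', 'helloo', 'hellooo']
print(star_matches)  # ['hell', 'hello', 'helloo', 'hellooo']
```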

Question mark (?)

A question mark means that the preceding character is optional, so it may or may not appear in the word.

As an example, say we would like to match both John and Jon. In this case, the regex looks like this: joh?n.

This expression can be very useful for matching words with potential spelling mistakes.
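A quick Python sketch of the optional character follows. The names johhn and joan are invented counter-examples, and re.fullmatch is an assumption used to test each whole word.

```python
import re

names = ["john", "jon", "johhn", "joan"]

# "h?" makes the h optional: zero or one h, never two.
pattern = re.compile(r"joh?n")
optional_matches = [n for n in names if pattern.fullmatch(n)]
print(optional_matches)  # ['john', 'jon']
```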

Parentheses (())

You certainly remember how parentheses change the order of operations in math, e.g.:

3 × 5 + 10 = 25

3 × (5 + 10) = 45

A similar principle applies to regex. We use parentheses to group the different parts of the expression together.

Let’s take the socks filtering example from the dot-asterisk section:

/products/women/socks/

/products/men/socks/

/products/children/socks/

To match exactly these three URLs and nothing else, we need to write the regex as follows:

^/products/(women|men|children)/socks/$

It means: find all URIs that start with /products/ and end with /socks/. At the same time, the middle folder must be the word women, men or children.
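The grouped pattern can be sketched in Python as follows. The sales and /red/ paths are hypothetical additions used to show what is rejected, and m.group(1) is simply Python's way of reading back what the parentheses captured.

```python
import re

pattern = re.compile(r"^/products/(women|men|children)/socks/$")

paths = [
    "/products/women/socks/",
    "/products/men/socks/",
    "/products/children/socks/",
    "/products/sales/socks/",      # not in the group, so rejected
    "/products/women/socks/red/",  # extra folder after socks/, so rejected
]

# Collect the middle folder of every path that matches.
categories = []
for path in paths:
    m = pattern.search(path)
    if m:
        categories.append(m.group(1))

print(categories)  # ['women', 'men', 'children']
```

Compared with the dot-asterisk version, the group lists the allowed folders explicitly, so nothing unexpected slips into the report.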

Square brackets ([])

We use square brackets when we need to create a simple list of characters, any one of which may appear at that position.

For example, the regex so[ua]p will match the words soup and soap.

Square brackets show their full potential when they are combined with the dash.
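Here is a small Python sketch of the bracket list. The words sop and souap are made-up counter-examples that show the brackets stand for exactly one character.

```python
import re

words = ["soup", "soap", "sop", "souap"]

# [ua] means: one character, either u or a.
pattern = re.compile(r"so[ua]p")
bracket_matches = [w for w in words if pattern.fullmatch(w)]
print(bracket_matches)  # ['soup', 'soap']
```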

Dash (-)

We use a dash to create a more advanced list. In combination with square brackets, it defines a range of characters. The most common ranges are:

[a-z] matches any lowercase letter,
[A-Z] matches any uppercase letter,
[0-9] matches any digit,
[a-zA-Z0-9] matches any lowercase letter, uppercase letter or digit.

Example:

Let’s say there are catalogues on the website that users can download. We want to filter for the following values in the Event Label:

year2017.pdf
year2018.pdf
year2019.pdf

There are several ways to do it:

^year201[7-9]\.pdf$
^year201(7|8|9)\.pdf$
(year2017\.pdf)|(year2018\.pdf)|(year2019\.pdf)

As you can see, the most compact way is to use a range, combining square brackets and a dash.
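The following Python sketch compares a range, a grouped alternation and the fully spelled-out alternation on the catalogue labels. The year2016.pdf label is invented to have something that should be excluded, and the select helper is a hypothetical name used only in this example.

```python
import re

labels = ["year2016.pdf", "year2017.pdf", "year2018.pdf", "year2019.pdf"]

range_pattern = re.compile(r"^year201[7-9]\.pdf$")
group_pattern = re.compile(r"^year201(7|8|9)\.pdf$")
long_pattern = re.compile(r"(year2017\.pdf)|(year2018\.pdf)|(year2019\.pdf)")

def select(pattern):
    """Return the labels that the given compiled pattern finds."""
    return [label for label in labels if pattern.search(label)]

by_range = select(range_pattern)
by_group = select(group_pattern)
by_long = select(long_pattern)
print(by_range == by_group == by_long)  # True

# Pitfall: inside square brackets the pipe is a literal character,
# so a class written as [7|8|9] also accepts "|".
pipe_in_class = re.search(r"^year201[7|8|9]\.pdf$", "year201|.pdf")
print(pipe_in_class is not None)  # True
```

All three patterns select the same three files here; the range is simply the shortest to write and the easiest to extend.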

Caret (^)

We already used the caret in the previous examples. It simply means “begins with”. In Analytics, we can often replace it by choosing the ‘Begins with’ matching option.

As an example, take ^bicycle. This expression will match bicycle, bicycles, bicycles for roads, bicycles for terrain, etc. However, it will not match road bicycles, mtb bicycle or downhill bicycles.

Dollar sign ($)

The dollar sign is the exact opposite of the caret: it means “ends with”. We can also replace it by choosing the ‘Ends with’ string matching option.

Let’s continue with the bicycle example. The regex bicycle$ will match bicycle, mtb bicycle and downhill bicycle, but not bicycles for road, bicycles, bicycles enduro, etc.
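Both anchors can be compared on the same list of phrases. This is a minimal Python sketch with made-up search terms, where re.search stands in for the matching that Analytics does.

```python
import re

phrases = [
    "bicycle",
    "bicycles for roads",
    "road bicycles",
    "mtb bicycle",
    "downhill bicycle",
]

# "^" pins the pattern to the start of the text, "$" to the end.
starts_with = [p for p in phrases if re.search(r"^bicycle", p)]
ends_with = [p for p in phrases if re.search(r"bicycle$", p)]

print(starts_with)  # ['bicycle', 'bicycles for roads']
print(ends_with)    # ['bicycle', 'mtb bicycle', 'downhill bicycle']
```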

Backslash (\)

The backslash is used to escape a character that otherwise has a special meaning in regex.

For example, if we want to filter for a single IP address (67.172.171.105), we must escape the dots that separate the individual numbers: 67\.172\.171\.105. In this case, the regex matches the IP 67.172.171.105 and no longer treats the dots as wildcards.

As we know, an unescaped dot means any single character. When filtering an IP address, however, we need a literal dot.

Another example is the question mark that introduces the parameters in a URL.

If we want to filter for the URI /globaldeals?_trkparms=%26clkid, the regex must look like this: /globaldeals\?_trkparms=%26clkid.
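What escaping changes can be shown with a short Python sketch. The lookalike string is invented for the example, and re.escape is a Python convenience for adding the backslashes automatically; Analytics has no such helper, so there the backslashes are typed by hand.

```python
import re

# Every dot is escaped, so it only matches a literal dot.
escaped = re.compile(r"67\.172\.171\.105")
# Without backslashes every dot is a wildcard again.
unescaped = re.compile(r"67.172.171.105")

real_ip = "67.172.171.105"
lookalike = "67x172y171z105"  # made-up string, not an IP address

escaped_hits = [bool(escaped.search(s)) for s in (real_ip, lookalike)]
unescaped_hits = [bool(unescaped.search(s)) for s in (real_ip, lookalike)]
print(escaped_hits)    # [True, False]
print(unescaped_hits)  # [True, True]

# The same idea for the URI: an unescaped "?" makes the "s" optional
# instead of matching a literal question mark.
uri = "/globaldeals?_trkparms=%26clkid"
uri_escaped_ok = bool(re.search(re.escape(uri), uri))
uri_raw_ok = bool(re.search(uri, uri))
print(uri_escaped_ok, uri_raw_ok)  # True False
```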

Curly brackets ({})

Let’s move on to the last expression: curly brackets. Two examples explain them best:

{1,2} means that the preceding item is repeated at least once, but not more than twice,
{2} means that the preceding item is repeated exactly twice.

The first form can be used to create IP filters. To match the IP address range from 77.120.120.0 to 77.120.120.99, we write the regex as follows: ^77\.120\.120\.[0-9]{1,2}$.
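The IP range filter can be tested with a minimal Python sketch. The addresses in the list are chosen only for illustration, and the range inside the brackets is written with a plain hyphen.

```python
import re

# One or two digits in the last block covers 0 to 99.
pattern = re.compile(r"^77\.120\.120\.[0-9]{1,2}$")

ips = [
    "77.120.120.0",
    "77.120.120.7",
    "77.120.120.99",
    "77.120.120.100",  # three digits, outside the range
    "77.120.121.5",    # different third block
]

in_range = [ip for ip in ips if pattern.search(ip)]
print(in_range)  # ['77.120.120.0', '77.120.120.7', '77.120.120.99']
```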

Lazy Matching and Greedy Matching

Finally, we’ll mention two more terms you might encounter with regex:

Lazy Matching – returns the shortest match possible (it stops as soon as the pattern is satisfied),
Greedy Matching – returns the longest match possible (it keeps going for as long as the pattern can still match).

For example, take the text Hello World wrapped in an opening and a closing HTML tag:

<.+?> – Lazy Matching – matches only the opening tag, because it stops at the first >,
<.+> – Greedy Matching – matches the whole string, from the first < to the last >.
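The two behaviours can be compared in a short Python sketch. The sample string below, with Hello World wrapped in an em tag, is an assumption made for illustration; any pair of tags would behave the same way.

```python
import re

# Assumed sample text: Hello World inside a pair of HTML tags.
text = "<em>Hello World</em>"

# Lazy: "+?" stops at the first ">" it can reach.
lazy = re.search(r"<.+?>", text).group(0)
# Greedy: "+" runs on to the last ">" in the text.
greedy = re.search(r"<.+>", text).group(0)

print(lazy)    # <em>
print(greedy)  # <em>Hello World</em>
```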

Summary:

Character	Meaning
|	Matches one expression or another
.	Matches any single character (letter, number or symbol)
*	Matches the preceding character 0 or more times
+	Matches the preceding character 1 or more times
?	Matches the preceding character 0 or 1 time
()	Groups the enclosed content into a single item
[]	Matches any one of the enclosed characters
{}	Repeats the preceding item from X to Y times
-	Creates a range in a list
^	Begins with…
$	Ends with…
\	Escapes a special regex character

Character	Example	Matches
|	cpc|ppc|cpm	cpc, ppc, cpm
.	.ay	day, say, pay, may, …
*	foo*d	fod, food, foood, …
+	hello+	hello, helloo, hellooo, …
?	joh?n	john, jon
()	^/products/(men|women|children)/socks/$	/products/men/socks/, /products/women/socks/, /products/children/socks/
[]	so[ua]p	soup, soap
{}	^77\.120\.120\.[0-9]{1,2}$	77.120.120.0 to 77.120.120.99
-	year 201[7-9]	year 2017, year 2018, year 2019
^	^bicycle	bicycle, bicycles, bicycles for road, but not enduro bicycle
$	bicycle$	bicycle, enduro bicycle, dh bicycle, but not bicycles or bicycles for road
\	67\.172\.171\.105	67.172.171.105

To verify that your regex is written correctly and matches the strings you want, I recommend this regex validator tool. If you plan to use regex on a regular basis, here you can find the course that will walk you through the basics as well as more advanced topics.

I hope this article makes it easier for you to work with regular expressions. I am looking forward to your questions and comments ;).
