«Personally Identifiable Information» (PII) is a term that has gained prominence in the measurement industry over the years, and it’s taking centre stage in marketing and analytics conversations, mainly because of the evolving privacy landscape marked by stricter regulations governing the collection, storage, and use of users’ data.
In the previous version of Google Analytics, known as Universal Analytics, Google prohibited sending PII to its analytics platform. This stance remains unchanged in the latest version, Google Analytics 4 (GA4).
Google’s definition of PII may differ from the one used by the privacy regulation that covers your business. Google defines PII as information that, on its own, can directly identify, contact, or precisely locate an individual. This includes:
- Email addresses
- Mailing addresses
- Phone numbers
- Exact geographical coordinates (such as GPS coordinates – with exceptions, as noted)
- Full names or usernames
It is vital to recognize that the definition of PII varies across different privacy regulations, including GDPR, DPA, HIPAA, LGPD, and others. Therefore, seeking guidance from your legal team to determine what PII encompasses for your specific business is advisable.
Given our focus on PII in this article, I will not delve into the legal intricacies of «personal data» or «personally identifiable data».
It’s also important to state that I am not a legal practitioner, but I will share a methodology for redacting PII using a Google Tag Manager template. Additionally, I will provide a regular expression (regex) that you can proactively use to prevent the collection of PII in your Google Analytics (GA4) property.
How to Redact PII in Google Analytics (GA4) Using Google Tag Manager:
The method I will share supports redacting PII only within the page URL. It’s a less technical, easily understandable, and swiftly implementable solution. However, in rare cases, your redaction needs may outgrow the method. I will address some of the limitations of this solution and introduce you to the GA4 «Redact Data» feature later in this blog post.
Let’s begin by navigating to the «Variables» section of your Google Tag Manager container and clicking the «New» button within the «User-Defined Variables» section.
Select the variable type and explore the «variable» community template gallery by choosing the option highlighted in the image below.
Within the template gallery that appears, search for «PII Redactor» and choose the variable template created by «Mikeulrich75.» Then, click the blue «Add to workspace» button.
Continue by clicking «Add».
In the variable configuration, set the «URL Variable» field to your page URL variable and check the «Redact Emails» option. Besides emails, the variable also redacts any other PII held in the query parameters you specify in the param list field.
If you check the «Redact Emails» checkbox, it will automatically redact both encoded and decoded email variations in the page URL to prevent you from collecting emails. However, please note that there are some edge cases where this feature does not work. I will explain the limitations later in the article.
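To make the idea of «encoded and decoded» email variations concrete, here is a minimal sketch of that kind of redaction. It is not the template’s actual code: the `redactEmails` function, the pattern, and the placeholder text are all my own assumptions, and the pattern is deliberately simple (for example, it does not handle a percent-encoded «+» in the local part).

```typescript
// Hypothetical helper, not the template's actual code. It replaces anything
// that looks like an email address, whether the "@" is literal or
// percent-encoded as "%40".
const EMAIL_PATTERN = /[A-Za-z0-9._+-]+(?:@|%40)[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/g;

function redactEmails(url: string, placeholder: string = "[REDACTED_EMAIL]"): string {
  return url.replace(EMAIL_PATTERN, placeholder);
}

// Example with a made-up URL; the "%40" form is what browsers usually send.
console.log(redactEmails("https://shop.example/thanks?email=jane.doe%40example.com&step=2"));
// https://shop.example/thanks?email=[REDACTED_EMAIL]&step=2
```

Because the pattern stops at characters such as «&», «?» and «/», the rest of the URL is left untouched.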
The next step is to enter the query parameter keys that contain PII, using the pipe symbol «|» to separate them when there is more than one.
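If it helps to picture what a pipe-separated key list does, the sketch below redacts the values of the listed query parameters and leaves everything else alone. This is an illustration under my own assumptions, not the template’s implementation: the `redactParams` name, the case-insensitive key matching, and the «[REDACTED]» placeholder are all hypothetical.

```typescript
// Hypothetical sketch: redact the values of query parameters whose keys appear
// in a pipe-separated list such as "email|tel|firstname". Keys are compared
// case-insensitively; that is an assumption, not confirmed template behaviour.
function redactParams(url: string, paramList: string, placeholder: string = "[REDACTED]"): string {
  const keys = paramList
    .split("|")
    .map((k) => k.trim().toLowerCase())
    .filter((k) => k.length > 0);

  // Set the fragment aside so a "#" never ends up inside a parameter value.
  const hashIndex = url.indexOf("#");
  const fragment = hashIndex >= 0 ? url.slice(hashIndex) : "";
  const withoutFragment = hashIndex >= 0 ? url.slice(0, hashIndex) : url;

  const queryIndex = withoutFragment.indexOf("?");
  if (queryIndex < 0) return url; // no query string, nothing to redact

  const base = withoutFragment.slice(0, queryIndex);
  const pairs = withoutFragment.slice(queryIndex + 1).split("&");

  const redacted = pairs.map((pair) => {
    const eq = pair.indexOf("=");
    if (eq < 0) return pair; // key without a value
    const rawKey = pair.slice(0, eq);
    let key = rawKey;
    try {
      key = decodeURIComponent(rawKey.replace(/\+/g, " "));
    } catch (e) {
      // Malformed percent-encoding: fall back to the raw key.
    }
    return keys.indexOf(key.toLowerCase()) >= 0 ? rawKey + "=" + placeholder : pair;
  });

  return base + "?" + redacted.join("&") + fragment;
}

console.log(redactParams("https://example.com/signup?email=a%40b.co&TEL=123-456-7890&plan=pro#top", "email|tel"));
// https://example.com/signup?email=[REDACTED]&TEL=[REDACTED]&plan=pro#top
```

Keeping the key and replacing only the value means you can still see in your reports that the parameter was present, without storing what it contained.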
Assign a name to your variable and click the «Save» button.
Stay tuned, as I’ll soon share a powerful and helpful regular expression that you can use to proactively redact identified PII before it gets collected in your GA4 property.
Now, navigate to the GA4 config tag, expand «Fields to Set», and add a new field. Use «page_location» as the field name and the PII redactor variable we created as the field value. Then, save the tag.
If you have already migrated to the Google Tag, you can either add «page_location» as a new configuration parameter in the tag or add it to the configuration settings variable associated with your Google Tag.
You can test your Google Tag Manager setup by adding the URL to the GTM preview page and utilizing the GTM preview mode and GA4 debug view to witness real-time results.
Debug mode also shows whether the PII is being redacted before the data is sent to Google Analytics.
The screenshot below shows how other PII data types get redacted in Google Tag Manager before they are sent to GA4.
This is the URL used: https://email@example.com&TEL=123-456-7890&PAssword=secretpassword&mob=234d45er&firstname=John&lname=Doe&address=123+Main+St&postcode=12345&po%20box=livinglife&fn=mary&state=AZ&drive=lambogini&lat=ideyforyou&lon=makeiask
Here is how the data gets reported in the Google Analytics debug view.
Regex for Proactive PII Redaction in GA4 When Using the GTM Template:
When utilizing the Google Tag Manager template provided, you can proactively use the regular expression (regex) pattern I will share to redact personally identifiable information (PII) in GA4.
This regex pattern helps to detect specific query parameter types known to contain user personal data as values of the query keys.
Here’s the regex pattern you should use:
The regex pattern searches for more than 40 variations of query parameter keys that may contain data that could identify a person. If any of them exist in the page URL, their values are redacted before they are collected in GA4.
However, it is advisable to cross-reference your GA4 page report or collaborate with your engineering and marketing teams to pinpoint the potential ways and query parameter keys through which PII may get exposed in the page URL.
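One way to start that cross-referencing is to run a list of page URLs (for example, exported from your GA4 page report) through a small audit script. The sketch below is only a rough heuristic under my own assumptions: `findSuspectKeys` and both patterns are hypothetical, and values such as dates can look like phone numbers, so treat the output as a list to review by hand.

```typescript
// Hypothetical audit helper: list the query keys whose values look like an
// email address or a phone number. Heuristics only; expect false positives.
const LOOKS_LIKE_EMAIL = /^[^@\s]+@[^@\s]+\.[^@\s]+$/;
const LOOKS_LIKE_PHONE = /^\+?[\d\s().-]{7,}$/;

function safeDecode(value: string): string {
  try {
    return decodeURIComponent(value.replace(/\+/g, " "));
  } catch (e) {
    return value; // malformed percent-encoding: inspect the raw value instead
  }
}

function findSuspectKeys(urls: string[]): string[] {
  const suspects: string[] = [];
  urls.forEach((url) => {
    const hashStart = url.indexOf("#");
    const withoutHash = hashStart >= 0 ? url.slice(0, hashStart) : url;
    const queryStart = withoutHash.indexOf("?");
    if (queryStart < 0) return;

    withoutHash.slice(queryStart + 1).split("&").forEach((pair) => {
      const eq = pair.indexOf("=");
      if (eq < 0) return;
      const key = safeDecode(pair.slice(0, eq)).toLowerCase();
      const value = safeDecode(pair.slice(eq + 1));
      const suspicious = LOOKS_LIKE_EMAIL.test(value) || LOOKS_LIKE_PHONE.test(value);
      if (suspicious && suspects.indexOf(key) < 0) suspects.push(key);
    });
  });
  return suspects.sort();
}

// Made-up sample URLs standing in for a page report export.
console.log(findSuspectKeys([
  "https://example.com/thanks?email=jane%40example.com&plan=pro",
  "https://example.com/contact?TEL=123-456-7890&postcode=12345",
  "https://example.com/pricing",
]));
```

The keys it reports are candidates for the param list in the GTM variable; names, addresses, and other free-text PII will not be caught by these two patterns and still need a manual review.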
Limitations of This Approach:
When going with this redaction approach, along with the regex shared in this blog, the primary limitation you’ll encounter is the necessity to identify the query parameter keys containing personally identifiable data (PII) as their values before you can redact them.
It’s important to note that when using search forms, any extra words after the user’s email address will not be redacted as PII, even if the «Redact Email» setting is enabled in the template. If the search term consists only of an email address, it will be redacted, though not in every case. A mechanism capable of recognizing PII within search terms must be in place to ensure proper redaction.
A less common scenario arises when the user’s email address is found within the page path and is immediately followed by a question mark «?» indicating the beginning of the query string. In such instances, the GTM variable template removes the question mark «?» after redacting the email address. While this is infrequent, it’s worth being aware of this quirk.
The Google Tag Manager PII redaction solution isn’t overly technical. However, its limitations are worth noting. It can only redact the page location dimension and any parameter dependent on data derived from the page location. This solution cannot be applied to enhanced measurement events, Measurement Protocol events, or data imports.
One noteworthy advantage of this method is that you can utilize the redacted URL output with other vendors that permit URL field customization. This is particularly feasible when implementing server-side Google Tag Manager.
However, very rare scenarios may arise where your redaction requirements surpass the capabilities of GTM. In such cases, you should explore more advanced technical setups.
Furthermore, this solution only works with Google Tag Manager instrumentation. For GA4 instrumentation done without GTM, you can look at the GA4 «Redact Data» feature solution or some technical redaction setup, which I will explain shortly.
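For a site instrumented with gtag.js directly, one possible setup is to redact the URL yourself and pass the result as «page_location» in the config call. The sketch below is a hedged illustration, not an official recipe: `redactQueryValues`, `configureWithRedactedUrl`, the key list, and the «[REDACTED]» placeholder are hypothetical names of my own, and the `Gtag` type is a simplified stand-in for the real global function.

```typescript
// Simplified stand-in for the global gtag function defined by the Google tag
// snippet on the page.
type Gtag = (command: string, target: string, params: { [key: string]: string }) => void;

// Escape characters that have a special meaning inside a regular expression.
function escapeForRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: blank out the values of the listed query keys using a
// single case-insensitive regex built from the key list.
function redactQueryValues(url: string, keys: string[]): string {
  if (keys.length === 0) return url;
  const alternation = keys.map(escapeForRegex).join("|");
  const pattern = new RegExp("([?&])(" + alternation + ")=[^&#]*", "gi");
  return url.replace(pattern, "$1$2=[REDACTED]");
}

// Send the redacted URL as page_location instead of the browser's raw URL.
function configureWithRedactedUrl(gtag: Gtag, measurementId: string, href: string, keys: string[]): void {
  gtag("config", measurementId, { page_location: redactQueryValues(href, keys) });
}
```

On a real page you would call `configureWithRedactedUrl` with the page’s own gtag function, your measurement ID, and the current page URL, in place of the plain config call in your snippet. Keep in mind that this only covers what you pass in yourself; other URL-based parameters that GA4 collects automatically are not touched by it.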
Exploring the Google Analytics «Redact Data» Feature:
Brais Calvo, a valued member of the Datola community and a close friend of mine, first shared Fred Pike’s discovery of the GA4 «Redact Data» feature in a LinkedIn post. This feature hasn’t been universally rolled out to all GA4 properties as of the time of writing this article.
You can locate it within your GA4 data stream settings, and it provides two significant functionalities: automatic email redaction and query parameter specification. Enabling the first option in this feature ensures that GA4 will automatically redact all email addresses within your data stream.
Clicking the «Redact Data» option opens the following view.
This feature’s redaction process works within the user’s browser as GA4 loads. This means that any sensitive PII is redacted right at the source, so it never reaches Google servers. You no longer have to worry about PII initially passing through servers before undergoing redaction in GA. With this sophisticated approach, Google emphasizes its commitment to user privacy.
Thoughts to Keep in Mind About this Feature:
This topic sparked a discussion with my friend Brais. I asked for his thoughts and concerns about this feature, and he shared some brilliant insights.
- It runs on the client side and doesn’t work with Measurement Protocol or Data Import data.
- It always hides the field specified in the configuration, whether it contains sensitive information or not.
- It only affects specific GA4 parameters such as page_location, page_referrer, page_path, link_url, video_url, and form_destination.
- It’s geared towards information related to URL parameters.
- If the user marks any parameter used by GA4 for attribution (utms, gclid, etc.), it could potentially cause a small problem…😄
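To guard against the last point, you could check your list of keys before entering it anywhere. This is a small sketch under my own assumptions: `findAttributionConflicts` is a hypothetical helper, and it only knows about the parameters mentioned above (anything starting with «utm_», plus «gclid»), so extend it with whatever else your attribution relies on.

```typescript
// Hypothetical pre-flight check: return the keys from a redaction list that
// GA4 relies on for attribution. Only utm_* and gclid are covered here.
function findAttributionConflicts(redactedKeys: string[]): string[] {
  return redactedKeys.filter((key) => {
    const normalized = key.trim().toLowerCase();
    return normalized.indexOf("utm_") === 0 || normalized === "gclid";
  });
}

const conflicts = findAttributionConflicts(["email", "utm_source", "GCLID", "tel"]);
if (conflicts.length > 0) {
  console.log("These keys would break attribution if redacted: " + conflicts.join(", "));
}
```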
On the other hand, modifying events from the GA4 UI might also be necessary alongside the GTM solution because we, unfortunately, can’t interact with enhanced measurement events from GTM.
In my opinion, it would be ideal if there were a native feature in GA4 that ran on the server side and, before processing the request, checked for any fields that might contain PII using some filtering (regex) and decided whether to discard or redact them. But in the meantime, as analysts, we’ll have to adapt and get creative with the available options!
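To illustrate the kind of check described above, here is a rough sketch of a filter that inspects every parameter of an event and either redacts or discards the ones containing an email address. It is purely hypothetical and not a GA4 feature: the `filterEventParams` name, the parameter shape, and the email-only pattern are assumptions for illustration.

```typescript
// Hypothetical filter: "redact" keeps the parameter with the email masked,
// "discard" drops the whole parameter.
type PiiMode = "redact" | "discard";
type EventParams = { [key: string]: string };

const EMAIL_ANYWHERE = /[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/g;

function filterEventParams(params: EventParams, mode: PiiMode): EventParams {
  const result: EventParams = {};
  Object.keys(params).forEach((key) => {
    const value = params[key];
    if (value === undefined) return;
    const cleaned = value.replace(EMAIL_ANYWHERE, "[REDACTED_EMAIL]");
    if (cleaned === value) {
      result[key] = value; // nothing that looks like an email: keep as is
    } else if (mode === "redact") {
      result[key] = cleaned; // keep the parameter, mask the email
    }
    // In "discard" mode a parameter containing an email is simply left out.
  });
  return result;
}

console.log(filterEventParams({ page_title: "Thanks", search_term: "jane@example.com order" }, "redact"));
console.log(filterEventParams({ page_title: "Thanks", search_term: "jane@example.com order" }, "discard"));
```

Unlike the page URL approach, a check like this looks at the value of every field, so it would also catch an email typed into a search box alongside other words.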
Closing Remarks – Wrapping It Up:
In summary, we’ve gained insights into Google’s interpretation of PII and why it may vary depending on your business entity’s privacy guidelines or regulatory compliance.
We’ve discussed utilising a custom GTM variable template and a potent regex to proactively redact personally identifiable data collected through the page URL before it’s stored in your GA4 property.
This article has also highlighted some of the limitations of this approach and underscored the importance of exploring alternative methods that may better suit your specific use case.
Furthermore, you’ve been introduced to the GA4 «Redact Data» feature, even if it’s not currently implemented in your GA4 property.
Regardless of the approach you choose, it’s crucial to acknowledge that Google discourages the collection of PII in GA4 properties. Thus, it falls upon us to ensure the redaction of this data before collection, emphasizing our responsibility for safeguarding the privacy of our website users.