Google Analytics is a really good tool for marketing-focused digital analytics, and by far the most popular one in this segment. With some custom setup, you can also use Google Analytics for tracking SaaS and other web apps & products.
Two shortcomings that most advanced users run into, though, are sampling and the lack of hit-level granularity. In this article, we take a look at some of the ways you can overcome these shortcomings without spending a fortune on Google Analytics 360.
PS! If your company already has 360, these techniques can give you an even more robust and complete dataset.
What is sampling and who’s affected
First, let’s take a look at how Google Analytics describes sampling in their official documentation.
In data analysis, sampling is the practice of analyzing a subset of all data in order to uncover meaningful information in the larger data set. For example, if you wanted to estimate the number of trees in a 100-acre area where the distribution of trees was fairly uniform, you could count the number of trees in 1 acre and multiply by 100, or count the trees in a half acre and multiply by 200 to get an accurate representation of the entire 100 acres.
So, it simply means that some of the reports you see in Google Analytics (or any other tool that pulls data from it via the API) may not represent 100% of the relevant hits.
This is especially true when you build more advanced custom reports with detailed custom segments. Google Analytics samples complex ad-hoc queries even below 500k sessions. Unfortunately, this is exactly when you normally care about the data accuracy the most.
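To make the idea concrete, here is a minimal sketch of how a sampled report arrives at its numbers: take a random subset, add it up, and scale the result to the full size. The data is made up, and `estimate_total` is a hypothetical helper for illustration, not the method Google Analytics uses internally.

```python
import random

def estimate_total(values, sample_rate, seed=0):
    """Estimate sum(values) from a random sample scaled up to the full size."""
    rng = random.Random(seed)
    sample_size = max(1, round(len(values) * sample_rate))
    sample = rng.sample(values, sample_size)
    # Scale the sampled sum by (population size / sample size).
    return sum(sample) * len(values) / sample_size

# Hypothetical data: pageviews per session for 10,000 sessions.
data_rng = random.Random(42)
pageviews_per_session = [data_rng.randint(1, 10) for _ in range(10_000)]

actual = sum(pageviews_per_session)
estimate = estimate_total(pageviews_per_session, sample_rate=0.10)
print(f"actual: {actual}, estimated from a 10% sample: {estimate:.0f}")
```

The estimate is usually close but rarely exact, and the gap tends to grow as the segment you are looking at gets smaller relative to the sample.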
You can check if your report is affected by sampling by hovering over the shield icon at the top left of your report.
Normally, the reporting-level sampling starts when your selected date range has more than 500k sessions in total. Be wary, though, because it sometimes happens with less traffic as well.
In some circumstances, you may see fewer than 500k sessions sampled. This can result from the complexity of your Analytics implementation, the use of view filters, query complexity for segmentation, or some combination of those factors. Although we make a best effort to sample up to 500k sessions, it’s normal to sometimes see slightly fewer than 500k sessions returned for an ad-hoc query.
And that is not the only kind of sampling that can haunt you in the free version of Google Analytics. The second type of sampling takes place when data is being collected, and the limits you need to know are as follows:
- 500 hits per session 
- 200,000 hits per user per day 
- 10 million hits per month per account 
It is important to understand that hits aren’t users, sessions or page views – hits are all data sent to Google Analytics including events, timing and data coming from the Measurement Protocol.
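If you have your own log of hits (for example from a server-side or parallel tracker), a rough check against these limits could look like the sketch below. The hit schema (`client_id`, `session_id`, `date`) is an assumption for illustration, and the function expects the hits of a single calendar month.

```python
from collections import Counter

# Collection limits quoted above for the free version of Google Analytics.
MAX_HITS_PER_SESSION = 500
MAX_HITS_PER_USER_PER_DAY = 200_000
MAX_HITS_PER_MONTH = 10_000_000

def find_limit_breaches(hits):
    """Report which limits are exceeded by one month of hits.

    `hits` is an iterable of dicts with 'client_id', 'session_id' and
    'date' keys (a hypothetical schema, not a Google Analytics export).
    """
    per_session = Counter()
    per_user_day = Counter()
    total = 0
    for hit in hits:
        per_session[hit["session_id"]] += 1
        per_user_day[(hit["client_id"], hit["date"])] += 1
        total += 1
    return {
        "sessions_over_limit": sorted(
            s for s, n in per_session.items() if n > MAX_HITS_PER_SESSION
        ),
        "user_days_over_limit": sorted(
            k for k, n in per_user_day.items() if n > MAX_HITS_PER_USER_PER_DAY
        ),
        "month_over_limit": total > MAX_HITS_PER_MONTH,
    }
```

Remember that every event, timing hit and Measurement Protocol request counts here, not just page views.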
One way to check if you’re getting close to those limits is to go to Admin –> Property Settings in Google Analytics.
It is worth mentioning, though, that Google won’t automatically ignore all hits past the 10M mark. It will notify you in the UI and will likely contact you via email and suggest considering the GA 360 version. In the terms, though, it says “there is no assurance that the excess hits will be processed” and also that the warning message is not guaranteed to appear.
You should take action if…
- your site receives >10M hits a month
- you see the yellow shield next to your reports regularly
- your visitors generate more than 500 hits per session
How to avoid sampling in Google Analytics
If you hit any of the sampling limits but need reliable data in your work, you need to find a solution sooner rather than later.
Let’s take a look at the options you have.
Google Analytics 360 – This is the solution Google itself recommends. And no wonder: it costs around $150k a year. GA 360 is a great tool and we recommend it to all companies with huge traffic and money to spend. Keep in mind, though, that if you only need the more generous data limits or BigQuery access, there are cheaper solutions (described in this article).
Collect less data – Well, who would want less data, right? We can’t recommend skipping the tracking of an important user action like a page view or a file download, but there may be some automatic events that you don’t really care about. This could be a timing event that your system sends every 10 seconds, a scroll depth event every 5% or something similar. Take a look at your events and you may find something. Just don’t remove something useful!
Unsample your Google Analytics data – If you don’t hit the data collection limits and your only worry is the sampling that happens at the reporting level, you may be interested in solutions that let you unsample your existing data. How it works is that your query (run via the API) is divided into many sub-queries, each small enough that no sampling is applied. You can do so by writing your own small program in your favorite language or by using one that others have built. For example, this one written in Go.
There are also some paid tools available but we haven’t used them. Some of them run the API requests periodically to build an unsampled database based on your Google Analytics data. Keep in mind, though, that this doesn’t save you from data collection limits (i.e. 10M hits/month) or some data aggregation that is inevitable in GA.
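The query-splitting idea can be sketched in a few lines. This is a minimal illustration, not a finished tool: `fetch` is a hypothetical placeholder for whatever function calls the reporting API for one date range, and the chunk size of one day is an assumption you would tune to your traffic.

```python
from datetime import date, timedelta

def split_date_range(start, end, days_per_chunk=1):
    """Yield (chunk_start, chunk_end) pairs covering start..end inclusive."""
    if days_per_chunk < 1:
        raise ValueError("days_per_chunk must be at least 1")
    current = start
    while current <= end:
        chunk_end = min(current + timedelta(days=days_per_chunk - 1), end)
        yield current, chunk_end
        current = chunk_end + timedelta(days=1)

def run_in_chunks(start, end, fetch, days_per_chunk=1):
    """Call fetch(chunk_start, chunk_end) for each chunk and combine the rows.

    `fetch` is a placeholder for your own API call; it should return a
    list of rows for the given date range.
    """
    rows = []
    for chunk_start, chunk_end in split_date_range(start, end, days_per_chunk):
        rows.extend(fetch(chunk_start, chunk_end))
    return rows

# Usage with a fake fetch function that returns one row per chunk.
fake_fetch = lambda s, e: [(s.isoformat(), e.isoformat())]
print(run_in_chunks(date(2020, 1, 1), date(2020, 1, 10), fake_fetch, days_per_chunk=4))
```

Two caveats: a real implementation should check whether each response is still sampled and shrink the chunk if so, and only additive metrics (sessions, page views, events) can simply be summed across chunks. Unique counts such as users cannot.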
Use a parallel tracker – This is the most reliable solution against all data collection, processing and reporting limits you come across in Google Analytics. How it functions is that it duplicates all hits going from your site to Google Analytics, processes them separately and stores everything in your favorite data warehouse – BigQuery, for example.
With a parallel tracking setup, you are always free from any sampling and data collection limits. Furthermore, data processing incidents are rare but do occasionally happen in Google Analytics – having your own raw dataset lets you reprocess your data whenever needed. As a bonus, since you own the data, you may include PII, mix it with any other data or delete the records you don’t want. More on this solution later in the article.
What is hit-level data and why do I need it
Most data you see in Google Analytics is aggregated, and without custom configuration, you can’t get much of the raw hit-level (also known as event-level) data that you may need in more detailed analysis.
Hit-level data means that you can access the underlying hits that were sent to Google Analytics, allowing you to do your own aggregation as you wish, based on any criteria or dimension.
A good example of using hit-level data is analyzing the journey of a single user. Which page did they land on, what was the traffic source, which pages did they visit, how many interactions did it take before they converted, and so on. Having access to hit-level data lets you analyze the journey of each visitor in near-real-time. Furthermore, this data is perfect for machine learning algorithms that could, for example, detect and analyze users that are most likely to convert – you could then target them with ads and other campaigns.
Here’s the simplest query in BigQuery that shows you each hit from a single user in chronological order.
With hit-level data, one can access every single hit that was collected from the site along with all data-points that each hit included.
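As a rough sketch of what such a single-user query looks like, the example below runs it against an in-memory SQLite database so that it works anywhere. The table name `hits` and its columns are assumptions for illustration; in BigQuery you would point the same kind of SELECT at your own hits table and use its parameter syntax.

```python
import sqlite3

# In-memory stand-in for a hit-level table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hits ("
    "client_id TEXT, hit_timestamp INTEGER, hit_type TEXT, page_path TEXT)"
)
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?)",
    [
        ("1234.5678", 1580000120, "event", "/pricing"),
        ("1234.5678", 1580000000, "pageview", "/"),
        ("9999.0000", 1580000050, "pageview", "/blog"),
        ("1234.5678", 1580000060, "pageview", "/pricing"),
    ],
)

def user_journey(connection, client_id):
    """Return every hit of one user, oldest first."""
    query = """
        SELECT hit_timestamp, hit_type, page_path
        FROM hits
        WHERE client_id = ?
        ORDER BY hit_timestamp
    """
    return connection.execute(query, (client_id,)).fetchall()

journey = user_journey(conn, "1234.5678")
for row in journey:
    print(row)
```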
Unfortunately, there is no way you can get the true raw hit-level data out of Google Analytics. Not even using the API. Using custom dimensions for the Client ID, Hit Type, Timestamp etc. can get you closer but it’s still far from perfect.
How to get access to raw, unsampled hit-level data
As mentioned before, you can’t get the raw hits out of Google Analytics. The premium version of Google Analytics (360) and its BigQuery export feature will get you closer (for $150k a year) but even that is not ideal.
The only way to gain access to the real underlying hits, with zero sampling and aggregation, is to leverage technology known as parallel tracking.
Parallel tracking means that all hits sent to Google Analytics are duplicated and sent to another endpoint. Depending on the solution, the data may be stored in Amazon S3, Google BigQuery or some other data warehouse.
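The duplication step itself is simple, as the sketch below shows: the same payload is sent to every endpoint, and a failure at one endpoint does not block the others. In practice this usually happens in the browser or tag manager rather than in Python; the second endpoint URL and the injected `send` function are assumptions for illustration.

```python
from urllib.parse import urlencode

def duplicate_hit(params, endpoints, send):
    """Send one hit to every endpoint and return (endpoint, ok) pairs.

    `send(endpoint, payload)` is supplied by the caller, for example a
    thin wrapper around an HTTP POST.
    """
    payload = urlencode(params)
    results = []
    for endpoint in endpoints:
        try:
            send(endpoint, payload)
            results.append((endpoint, True))
        except Exception:
            # One failing endpoint must not stop the remaining copies.
            results.append((endpoint, False))
    return results

# Usage with a fake sender that only records what it was given.
sent = []
hit = {"v": "1", "tid": "UA-XXXXX-Y", "cid": "1234.5678", "t": "pageview", "dp": "/pricing"}
endpoints = [
    "https://www.google-analytics.com/collect",
    "https://collector.example.com/collect",  # hypothetical parallel endpoint
]
results = duplicate_hit(hit, endpoints, lambda url, body: sent.append((url, body)))
print(results)
```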
Tools like Snowplow offer a parallel tracking solution and give you access to the raw data in your data warehouse without any further processing (by default). This means there are no sessions, channel attribution or other really useful features that you do get in Google Analytics.
With Reflective Data’s Parallel Tracking (RDPT) solution, not only will you get all the raw hits but also a data processing engine that works very similarly to the one in Google Analytics itself. This means you will get sessions, attribution and features like referral exclusion out of the box. More advanced users can build (or request) their own rules for defining sessions, attribution and other features.
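To give a feel for what a custom session rule can look like, here is a minimal sketch that groups raw hits into sessions using a 30-minute inactivity timeout, the same default Google Analytics uses. It is an illustration only, not RDPT’s actual processing engine, and it ignores other rules such as starting a new session at midnight or on a campaign change.

```python
SESSION_TIMEOUT_SECONDS = 30 * 60  # assumption: 30 minutes of inactivity

def assign_sessions(hits, timeout=SESSION_TIMEOUT_SECONDS):
    """Number the sessions of each user.

    `hits` is a list of (client_id, timestamp_in_seconds) tuples.
    Returns (client_id, timestamp, session_number) tuples sorted by
    user and time.
    """
    last_seen = {}
    session_number = {}
    result = []
    for client_id, timestamp in sorted(hits):
        previous = last_seen.get(client_id)
        # First hit of a user, or a gap longer than the timeout: new session.
        if previous is None or timestamp - previous > timeout:
            session_number[client_id] = session_number.get(client_id, 0) + 1
        last_seen[client_id] = timestamp
        result.append((client_id, timestamp, session_number[client_id]))
    return result

raw_hits = [("a", 600), ("b", 100), ("a", 0), ("a", 2401)]
for row in assign_sessions(raw_hits):
    print(row)
```

Changing the rule, for example to a 10-minute timeout for a fast-paced web app, is a one-line change once you own the raw hits.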
In order to break data silos, RDPT can integrate with any tool that has an API, including your CRM and ad platforms.
RDPT’s default data warehouse is BigQuery but storing data elsewhere (Amazon S3 etc.) is also possible. BigQuery’s native integration to Google Data Studio makes it easy and cost-effective to build all sorts of reports and dashboards. Integrations with most other BI and data visualization tools are widely available.
How much does it cost
Google Analytics 360 is a really good tool for enterprises that want to gain more detailed access to their marketing data. The hefty price tag, limited access to raw data and occasional sampling should make you think twice before upgrading, though.
For companies that aren’t fully sold on Google Analytics 360, or companies that already have 360 but need access to an even more complete dataset, Reflective Data’s Parallel Tracker (RDPT) may be the perfect solution.
Pricing for RDPT depends on the amount of traffic, the complexity of the setup and the number of integrations. Compared to $150k a year, though, it will always be a bargain. The initial setup usually costs somewhere between $1k and $5k, and the monthly plans start at around $350. This includes a generous quota for BigQuery usage.
So, whatever your current analytics stack looks like, you should consider adding RDPT for the most robust, unsampled raw hit-level digital analytics data you can get. Learn more here.