Our goal is to track as little as possible, only collecting what will help us improve our product. To do that, we want to understand the following metrics:
- How many visitors have been on our website.
- Where those visitors are referred from.
- Which pages they visit.
- Which browser they are using, and whether it is mobile or desktop.
- How all of that is trending over time.
Using these simple metrics, we can analyze how many people sign up to our newsletter, which pages are the most popular (so we can invest more in those types of page), whether our marketing is working (paid vs organic), and how we're doing over time. Knowing browser and device type helps us ensure we test on the right platforms.
We don't need anything more detailed, and certainly don't need or want to track visitors on an individual level. We especially want to avoid services which conduct mass tracking across the internet e.g. Google Analytics, Facebook Pixel, etc.
After spending some time trying Cloudflare Web Analytics, then reviewing all the other privacy-first options, we ended up using Plausible Analytics. This post explains the reasons behind that decision.
Cloudflare Web Analytics
Console is starting as an email newsletter. The web product is a simple static page that allows visitors to sign up. That's it. Our v1 tech stack is therefore quite unsophisticated - we host the website on Cloudflare Workers because it is the best way to build simple logic and deploy to a fast, globally distributed network.
Since we were already using Cloudflare, it made sense to try their analytics products first. Like us, they care about privacy, and evaluating their tools would avoid the extra system complexity of adding new vendors.
We first tried Cloudflare Web Traffic Analytics which is part of the Pro plan. All of our content is already served by Cloudflare so running analytics on those existing logs means no additional telemetry or beacons in the website.
Confusingly, this product is different from Cloudflare Web Analytics, which uses a JS beacon embedded into the web page to measure "real" users, i.e. those where the JS can be executed. I enabled both so that we could compare the stats.
Web Traffic Analytics showed 10x more traffic than Web Analytics was reporting. I had seen a review last year (admittedly from a competitor) which reported the same problem, but wanted to see it for myself.
Unfortunately, this meant that Web Traffic Analytics was useless for us. I disabled it, falling back to just Web Analytics. Even if we had to use a beacon to ensure the traffic being measured was "real", at least it would be kept inside Cloudflare. Indeed, their announcement blog post was promising:
Being privacy-first means we don’t track individual users for the purposes of serving analytics. We don’t use any client-side state (like cookies or localStorage) for analytics purposes. Cloudflare also doesn’t track users over time via their IP address, User Agent string, or any other immutable attributes for the purposes of displaying analytics — we consider “fingerprinting” even more intrusive than cookies, because users have no way to opt out.
However, after using it for several weeks, it turned out the data retention period for the online dashboard was only 1 week (not documented anywhere). The data seemed accurate, and allowed us to see the metrics we were after, but we couldn't get any trend data. Trends are important for understanding how we are doing overall and for comparing sources of traffic over time. I asked Cloudflare Support if there was any way to get longer retention and they said to query the GraphQL API. At this stage I don't really want to build a custom analytics dashboard.
Disappointed, I decided to go on the hunt for an alternative.
Now that I had to evaluate several products, I needed to think about my requirements:
- Privacy. Google Analytics is not an option because it is part of Google's mass data mining efforts and tracks you all over the internet. We only want to track high level metrics and trends, not individuals. This means making design decisions to preserve privacy, even where that might harm data accuracy. We want "good enough" data, not "perfect". Data should be minimized and anonymized.
- SaaS with an open source option. Console v1 deliberately has no servers so there is nowhere to run any software. Dealing with data storage and traffic spikes is something I want to avoid until absolutely necessary, so the first choice is a SaaS product. However, this may change in the future so there should be a way to run the product on our infrastructure. Ideally this means the product is open source. Bonus if we can export the SaaS data for import into the self-hosted version.
- Data ownership. Paying for a product means we should own all the data that is generated in case we want to export and migrate it on-premise in the future.
With these in mind, I found several options:
Fathom is privacy-first and explains how it implements that (although not in as much detail as Plausible). It collects a minimal amount of data and they say it's all owned by the customer, but their data policy page says:
We are currently re-writing our new data policy as we've received new, cutting-edge legal advice that we need to put together.
Unfortunately it is SaaS-only with no open source option, which ruled it out.
Formerly Piwik and no longer related to PiwikPRO, Matomo is an open source alternative to Google Analytics. This means its goal is to offer much of the same functionality, from user behavior tracking and heatmaps down to A/B testing and funnels.
Simple Analytics looks nice, but is SaaS-only and not open source. The T&C don't mention anything about who owns the data, so I assume Simple Analytics do.
Plausible is a privacy-first analytics product implemented by taking a strict approach to privacy principles: aggregate only, no cross-device tracking, daily rollups. They explain the data collected with reasoning for each item, and how the unique-visitor tracking works without cookies:
Plausible attempts to strike a reasonable balance between de-duplicating pageviews and staying respectful of visitor privacy. We do not attempt to generate a device-persistent identifier because they are considered personal data under GDPR. Instead, we generate a daily changing identifier using the visitor’s IP address and User Agent. To anonymize these datapoints, we run them through a hash function with a rotating salt. This generates a random string of letters and numbers that is used to calculate unique visitor numbers for the day. Old salts are deleted to avoid the possibility of linking visitor information from one day to the next. Forgetting used salts also removes the possibility of the original IP addresses being revealed in a brute-force attack.
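To make the idea above concrete, here is a minimal sketch of counting daily unique visitors with a salted hash. It is not Plausible's actual code: the class name `DailyVisitorCounter`, the use of SHA-256, the 16-byte salt, and the in-memory set are all assumptions made for illustration. Only the overall shape (hash the IP address and User Agent with a salt, then discard the salt at the end of the day) comes from the description quoted above.

```python
import hashlib
import secrets


class DailyVisitorCounter:
    """Counts unique visitors for one day without storing IPs or User Agents.

    Hypothetical illustration only; names and hash choice are assumptions.
    """

    def __init__(self):
        # Random salt that lives only as long as the current day.
        self._salt = secrets.token_bytes(16)
        # Only the salted hashes are kept, never the raw inputs.
        self._seen = set()

    def visitor_id(self, ip, user_agent):
        """Derive an identifier that is stable only while the salt is unchanged."""
        digest = hashlib.sha256()
        digest.update(self._salt)
        digest.update(ip.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
        digest.update(user_agent.encode("utf-8"))
        return digest.hexdigest()

    def record_pageview(self, ip, user_agent):
        self._seen.add(self.visitor_id(ip, user_agent))

    def unique_visitors(self):
        return len(self._seen)

    def rotate(self):
        """End the day: return the final count, then forget the salt and hashes."""
        total = len(self._seen)
        self._salt = secrets.token_bytes(16)
        self._seen.clear()
        return total
```

With this sketch, two pageviews from the same IP address and User Agent on the same day count as one visitor. After `rotate()` the same visitor produces a different identifier, so yesterday's visits cannot be linked to today's, and with the old salt gone the stored hashes cannot be brute-forced back to IP addresses.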
Available as a SaaS product but with data owned by the customer, the product is also open source and available self-hosted. The only thing missing is export/import, but it is planned for the future.
Privacy-first is important because it informs the development philosophy. We want to adhere to the data protection principle of data minimization. This is why we chose Plausible over Matomo, even though both fit our requirements.
Plausible and Matomo tick all the boxes from our requirements. Both have SaaS versions we can pay for, collect the data we want, have privacy functionality, and are open source if we ever want to self-host. However, Plausible wins because of how minimalist it is.
Google Analytics used to be the only option for quick and easy analytics. It is still the industry standard and if you can figure out the UI, you can learn a lot about site visitors. But do you really need to know that much detail?
Over the last few years, the true cost of "free" has become clearer - privacy. A few massive tech companies tracking every click across the web is probably not a good idea because of the Panopticon Effect.
I'm pleased to see several options available to site owners. "Privacy-first" is a real principle that can be implemented in code. Small, independent businesses can thrive with a SaaS version of an open source product. SaaS is great when you don't want to manage the infrastructure of data storage and traffic spikes, but there is an exit route if necessary.
We will keep an eye on Cloudflare Web Analytics because the product does what we want all within our existing infrastructure, but for now Plausible is the best choice for privacy-first analytics.