Over the past few months I have noticed an increased amount of Google Analytics spam traffic (particularly referrals) on the websites I manage.
Most recently, I launched a site and checked a week later to find it had already had 500 hits, even though I hadn’t promoted it yet, so I decided to investigate further.
I discovered that almost all of this traffic was spam, rendering the real traffic data practically useless. I’ve done some research into how to clean it up to restore the accurate data and thought it would be useful to share my findings.
- In this first post I’ll explain how to identify the spam traffic.
- In part 2 I’ll explain how to set up a segment to filter out the spam data from the traffic that’s already been collected.
- Then finally in part 3 I’ll explain how to set up a new filter to block spam data from being logged in future tracked data.
Part 1: Spotting the Spam Traffic
First off I had to find out where all the traffic was coming from, so I checked the Acquisition tab.
Clicking into the referral data reveals the source of the referrals: a whole host of spammy websites!
So what are they and why are they linking to my website? Well, I discovered that most aren’t even visiting the site at all and the ones that do are robots.
Delving a little deeper by adding a secondary dimension of Hostname, I found the spam referrals tended to split into two main categories: ghost referrals and bot crawlers.
Ghost referrals are ‘visitors’ that haven’t actually visited the site at all. The Hostname column shows where the visit took place. In my case, real traffic should have a hostname of PaulJardine.co.uk because it’s the only place I have my Analytics code (you could also have it somewhere else, such as a PayPal checkout if you have a shop).
The Hostname recorded for these hits isn’t my website address, so they are spam: they didn’t even take place on my website! Seemingly, these hits show up here because a website in a dark corner of the internet is using a matching tracking code.
The vast majority of ghost referrals can easily be spotted as they have a hostname of ‘(not set)’. Others may be set to Apple or Amazon to try and disguise themselves. Either way, unless the hostname shows a valid source it’s false and needs removing.
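If you export the referral report with the Hostname column, this check is easy to script. Here is a minimal Python sketch of the idea, not a definitive tool: the list of valid hostnames and the sample rows are my own assumptions for illustration, so swap in wherever your own tracking code genuinely lives.

```python
# Hostnames where the tracking code is really installed.
# Assumed values for illustration; replace them with your own.
VALID_HOSTNAMES = {"pauljardine.co.uk", "www.pauljardine.co.uk"}


def is_ghost_hit(hostname: str) -> bool:
    """Return True when a hit was NOT recorded on one of our own hostnames."""
    return hostname.strip().lower() not in VALID_HOSTNAMES


# Hypothetical rows, shaped like a referral report with Hostname
# added as a secondary dimension. The source names are made up.
rows = [
    {"source": "spam-referrer.example", "hostname": "(not set)"},
    {"source": "google", "hostname": "PaulJardine.co.uk"},
    {"source": "another-spammer.example", "hostname": "amazon.com"},
]

# Collect every source whose hits took place somewhere other than our site.
ghost_sources = [row["source"] for row in rows if is_ghost_hit(row["hostname"])]
print(ghost_sources)
```

The comparison is case-insensitive because the report may show the hostname with different capitalisation from the way you typed it.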
Bot crawlers such as Semalt scour the web for purposes unknown, visiting your website and instantly leaving again. These are easily identifiable as the results with a session duration of 0.0 seconds and a 100% bounce rate.
Since the bot crawlers do actually visit my site, the Hostname does show my website. However, the name of the source should give away whether it’s shady or not.
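The two tell-tale signs of a bot crawler (a zero session duration and a 100% bounce rate on a hit that did land on my own hostname) can be turned into a rough filter as well. This is only a sketch under assumptions: the `ReferralRow` class, the field names and the sample data are hypothetical, and a genuine referrer with a single bounced visit would match too, so treat the output as a list of sources to look up rather than a verdict.

```python
from dataclasses import dataclass

# Assumed hostnames for illustration; replace with your own site.
OWN_HOSTNAMES = {"pauljardine.co.uk", "www.pauljardine.co.uk"}


@dataclass
class ReferralRow:
    """One row of an exported referral report (hypothetical layout)."""
    source: str
    hostname: str
    avg_session_duration: float  # seconds
    bounce_rate: float           # percent, 0 to 100


def looks_like_bot(row: ReferralRow) -> bool:
    """Flag rows that hit our own site but left instantly, every time."""
    on_our_site = row.hostname.strip().lower() in OWN_HOSTNAMES
    return (
        on_our_site
        and row.avg_session_duration == 0.0
        and row.bounce_rate == 100.0
    )


# Made-up sample data.
rows = [
    ReferralRow("crawler.example", "pauljardine.co.uk", 0.0, 100.0),
    ReferralRow("friendly-blog.example", "pauljardine.co.uk", 74.5, 40.0),
    ReferralRow("ghost.example", "(not set)", 0.0, 100.0),
]

suspects = [row.source for row in rows if looks_like_bot(row)]
print(suspects)
```

Note that the ghost row is deliberately not flagged here: it never reached the site, so it belongs to the hostname check rather than this one.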
IMPORTANT: If you want to investigate a referral address, DO NOT VISIT THE SITE as it could have a virus or something nasty waiting for you. Do a Google search instead as this will tell you all you need to know about the site without having to actually visit it.
When Life Sends You Spam, Make Spam Fritters
Looking at the direct and search traffic revealed similar results. A massive chunk of the recorded traffic appeared to be ghost hits, clearly identifiable by the dodgy hostnames.
OK, so we’ve identified that there’s clearly a problem here! But fear not as next I’ll explain how I’ve restored the stats to their former glory!
In part 2, I’ll explain how to clean up your Google Analytics data using a filtered segment and in part 3 I’ll show you how to create a new filter to block the spam data from your future statistics.