Is Google Really 99% Spam-Free?

According to Google’s annual Webspam Report, released in June 2020, 99% of Google’s search results are spam-free. Is the web really that clean? And what does it all mean for black-hat SEO?

Background: Google’s Webspam Report

Every June, Google releases a Webspam Report, which details the company’s efforts and successes in its endless battle against search result spam. In 2019, almost a decade after the release of Panda, Penguin and other core algorithm updates designed to clean the web of black-hat SEO techniques, Google reported discovering 25 billion pages of spam a day.

Yet Google’s spam fighters managed to clean 99% of the spam from the search engine’s results, and to gain control over user-generated spam (such as links in blog comments and forum threads). This means that, in theory, 99% of the time internet users can browse safely without running into Viagra ads (for example). And in practice?

What is Spam?

First, some etymology (or: how canned meat arrived in your inbox’s junk folder with the help of Monty Python).

Google’s search engine pulls webpages from its index and ranks them using an algorithm. The algorithm consists of several components, some unknown and some well known: for example, the content’s relevance to the search query, the number of incoming links, user behavior, and technical aspects such as page speed and mobile-friendliness. Spam, then, is the use of techniques that imitate those components without actually delivering the goods (such as buying links for a minor site to make it look authoritative), or techniques that can harm the user (for example, by exposing them to malware).

Google also defines the following techniques as spam:

Automatically generated content
Participating in link schemes
Creating pages with little or no original content
Cloaking
Sneaky redirects
Hidden text or links
Doorway pages
Scraped content
Participating in affiliate programs without adding sufficient value
Loading pages with irrelevant keywords
Creating pages with malicious behavior, such as phishing or installing viruses, trojans, or other badware
Abusing structured data markup
Sending automated queries to Google

(Taken from Google’s Webmaster Guidelines)

All these techniques were quite common among SEOs in the early days of the industry, which were mostly spent on trial and error and on forum discussions.

Some of these techniques (for example, concealing text or links with invisible fonts) disappeared from the legitimate SEO world years ago.

Other techniques were eliminated by Google’s algorithm updates:

The Panda update of 2011 eliminated, or at least diminished, pages with thin or scraped content, pages full of ads, and pages that simply don’t deliver on their promises.

The Penguin update of 2012 forced SEOs to rethink their link building strategies, stop buying forum links in bulk, and even ask Google to ignore certain links using the Disavow Links tool.

Even before the Penguin update, Google introduced the nofollow tag, which was designed to release websites (especially those with user-generated content) from responsibility for their outbound links. The tag tells Google not to follow the tagged links and not to interpret them as an upvote from the linking site. It was also meant to decrease the appeal of “participating in link schemes”, but succeeded only partially. Recently, Google added new link tags to mark links in user-generated content (rel="ugc") and sponsored content (rel="sponsored").
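
To make the link tags described above concrete, here is a minimal Python sketch of how a site owner or auditing script might read them from a page. It uses only the standard library's html.parser module. The names RelLinkCollector and classify_links, and the example.com URLs, are hypothetical and chosen for illustration. This is not a description of how Google itself processes these tags.

```python
from html.parser import HTMLParser

# Link tags discussed in the article; any other rel tokens are ignored here.
HINT_TOKENS = {"nofollow", "ugc", "sponsored"}


class RelLinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, hints) for every <a href=...>."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if href is None:
            return  # anchors without href are not outbound links
        # rel holds space-separated tokens; compare them case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        hints = sorted(HINT_TOKENS.intersection(rel_tokens))
        self.links.append((href, hints))


def classify_links(html):
    """Return a list of (href, [hint tokens]) pairs found in the HTML."""
    parser = RelLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<a href="https://example.com/a">editorial</a>'
    '<a href="https://example.com/b" rel="nofollow ugc">comment</a>'
    '<a href="https://example.com/c" rel="sponsored">ad</a>'
)
for href, hints in classify_links(sample):
    print(href, hints or ["(no hint)"])
```

A link with an empty hint list carries no tag at all, which is the case where a link can be read as an ordinary endorsement of the target page.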

Yet buying, exchanging or otherwise acquiring links is still very common, and there’s no proof that it has a negative effect on any of the parties involved (at least as long as it is done sensibly and in less common languages, such as Hebrew).

How does Google discover Spam?

Google uses several methods to discover webspam, including users’ reports, manual checks and artificial intelligence techniques. Google doesn’t reveal all of its spam discovery techniques. “We can’t share the specific techniques we use for spam fighting because that would weaken our protections and ultimately make Search much less useful”, wrote Danny Sullivan, previously an SEO expert and now a Googler, in a post titled “Why keeping spam out of Search is so important”.

According to the 2019 Webspam Report, Google prioritized fighting spam using machine learning techniques. The big guns were aimed at especially harmful types of spam: for example, websites that impersonate official or well-known websites and deceive users into sharing their personal and financial information or even downloading malware. Last year, Google succeeded in reducing this type of spam by 60% compared with 2018.

But spammers (or SEOs who like shortcuts) are also using artificial intelligence to manipulate Google’s search results.

An article in Search Engine Journal reviewed the AI techniques spammers use to convert text into video and audio and vice versa, and to push the results into Google’s Rich Results.

Although Google’s algorithm is good at identifying and ignoring copied content, identifying content copied from a different medium is not yet its strongest quality. Spammers can take advantage of this gap and, for example, turn texts into videos and dominate the video section at the top of the SERP, especially in trending searches. The irony is that Google’s own AI tool, Text-to-Speech, has been used for this purpose.

And if you want to take it a step further, you can run these texts through automated translation tools, such as the one owned by Google, and duplicate the same text in several languages in a matter of minutes. But these methods are too risky, both from the point of view of Google’s spam algorithm and from the copyright infringement perspective.

To summarize:

There’s no doubt that Google’s ability to discover and eliminate search result spam is improving (especially compared to 2010).
Spam detection is not perfect; users can still run into spammy content, probably in more than 1% of internet visits.
SEOs continue to be asked to participate in “link schemes”, and almost any backlink analysis will reveal unnatural links.
The spammers’ abilities are also improving, but it’s safe to assume that the crime won’t continue to pay for long.
At this rate, and with the BERT search algorithm, it’s probable that many current SEO methods, such as including keywords in the content, will be considered spammy and old-fashioned in the near future.

Gilad Sasson
