The Classification of Search Engine Spam
With standards come definitions. Often in the search industry, the same
terms are used by different people to mean different things. These
different meanings can cause confusion and give spammers refuge. An
objective of this paper, then, is to place absolute definitions on some
important terminology.
The following terms are defined:
• Search engine
• Relevancy
• Search Engine Spam
• Not Search Engine Spam
• Content spam
• Meta spam
• Link farm
• Link content spam
• Link meta spam
• Agent-Based Spam
• IP Cloaking
The first term we will define is "Search Engine". Generally, a search
engine is any program that searches a database and produces a list of
results. Working at such an abstract level would limit this document to
a very theoretical, general discussion. Therefore, for the purposes of
this document, we will apply a narrower definition of "search engine",
as follows:
Search engine
a system that uses automated techniques,
such as robots (a.k.a. spiders) and indexers, to create indexes of the
Web, allows those indexes to be searched according to certain search
criteria, and delivers a set of results ordered by relevancy to those
search criteria. Examples of such search engines are AltaVista, Fast,
Google and Inktomi. (Fast and Inktomi deliver their results solely
through partners such as Lycos and MSN).
The next term that needs defining is relevancy. Because this document
is attempting to classify spam, and spam and relevancy are intertwined,
it is essential that we define relevancy in an objective way. That is
not to say that relevancy is objective. Far from it. Relevancy is
extremely subjective. Every search engine uses its own algorithm to
calculate relevancy. Therefore, we define relevancy as follows:
Relevancy
The search engine's measure of how well
a particular resource matches the input search criteria
Results are ordered by relevancy, and each search engine calculates
relevancy with its own algorithm. Given the same set of resources and
the same input search criteria, different search engines will therefore
generally return differently ordered results.
It should be clear that the algorithms that calculate relevancy are the
lifeblood of search engines. The search engines that deliver the most
relevant results to the markets they have chosen to focus upon should
be the most successful search engines in those markets.
Search Engine Spam
So, what is search engine spam? We define it as follows:
Search Engine Spam
Any attempt to deceive a search engine's
relevancy algorithm
And what isn't spam?
Not Search Engine Spam
Anything that would still be done if
search engines did not exist, or anything that a search engine has
given written permission to do.
The remainder of this document assumes that the search engine has not
given written permission. It elaborates upon the meaning of the
previous two definitions and places them in a context that should be
acceptable to all industry professionals.
In attempting to classify spam, we considered many different instances
of spam and architectures for delivering spam. We gradually came to
realise that there are only two types of search engine spam:
Content spam
Data within a part of a Web resource
designed for humans (e.g. the <body> of a HTML document) where that data is
designed only for search engines to see
Meta spam
Data within a Web resource that
describes that resource or another Web resource inaccurately or (when
the data should be readable by humans) incoherently
The fact that there are only two types of search engine spam derives
from the fact that search engine algorithms use only two basic factors
to calculate relevancy: on-the-page factors and off-the-page factors.
An example of an on-the-page factor is keyword density, i.e. how early
and often the keywords (words searched for) appear in the body copy of
a page. An example of an off-the-page factor is link popularity, i.e.
how many other pages on the Web link to a particular page. In fact,
depending on the link popularity algorithm, it can be spammed with
either content spam or meta spam. This will be described in more detail
later.
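As a rough illustration of the on-the-page factor just described, the following minimal sketch measures how often and how early a keyword appears in body copy. The tokenizer and the function names (`tokenize`, `keyword_density`, `first_occurrence`) are assumptions made for this example; real search engines calculate relevancy with their own, unpublished algorithms.

```python
import re
from typing import Optional


def tokenize(body_copy: str) -> list[str]:
    """Split body copy into lower-case words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", body_copy.lower())


def keyword_density(body_copy: str, keyword: str) -> float:
    """How often: the fraction of words in the body copy equal to the keyword."""
    words = tokenize(body_copy)
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)


def first_occurrence(body_copy: str, keyword: str) -> Optional[int]:
    """How early: zero-based word position of the first match, or None."""
    words = tokenize(body_copy)
    target = keyword.lower()
    return words.index(target) if target in words else None


text = "Fresh spam recipes: spam fritters and fried spam"
print(keyword_density(text, "spam"))   # 3 of 8 words, so 0.375
print(first_occurrence(text, "spam"))  # second word, so position 1
```

A measure this simple is easy to inflate with text that no human is meant to read, which is why content spam targets it.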
Content Spam
First of all, we should consider why content spam is possible. It is
possible because the same URL can deliver different content (or the
same content displayed in different ways) to different visitors to that
URL. Even the simplest versions of HTTP and HTML support this, and
therefore offer the opportunity to deliver spam. For example, IMG
support and ALT text within HTML means that image-enabled visitors to a
URL will see different content to those visitors that, for various
reasons, cannot view images. Whether the ability to deliver spam
results in the delivery of spam is largely a matter of knowledge and
ethics.
This document is not designed to provide exhaustive examples of spam.
To do so would be counterproductive, as it could become a reference
source for those who wish to spam. Suffice it to say that the
following techniques are among those that may be subverted to deliver
content spam: tiny text, invisible text, noframes text, noscript text,
alt text, longdesc text.
It is extremely important to note that none of the above techniques
were designed to deliver spam. Therefore, the use of the technique does
not imply that spamming is taking place. So, how can we determine
whether the use of the technique constitutes spam? It is relatively
simple - apply this test:
Suppose search engines did not exist. Would the technique still be used
in the same way?
If the answer to the above question is no, then clearly the content is
designed only for search engines to see. Therefore it is spam. If you
are a search engine marketer or search engine optimization (SEO)
specialist, don't panic at this statement. Consider what it really
means.
Take, as an example, ALT text. Why was the ALT attribute invented? Not
to deliver spam, but to provide a readable version of the page to
browsers without graphical capabilities. These include phones, PDAs and
screen readers
for the visually impaired. This last example is especially important as
disability legislation in many countries (e.g. USA, UK, Australia)
requires that content is accessible to all. Stuffing the ALT text of
clear pixels with lists of keywords is a common SEO technique. Consider
this sample piece of HTML, where clear.gif is a 1x1 transparent pixel
and an attempt is being made to rank higher for the word "spam":
<img src="../../images/clear.gif" alt="SPAM, Spam, spam, ugly
spam, obvious spam" />
This turns a page into meaningless garbage when it is read out loud or
displayed on a non-graphical browser.
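To see the effect from the point of view of a visitor who cannot view images, the sketch below approximates what a text-only browser or screen reader would be given: the visible text plus the ALT text of each image. The class and function names are hypothetical, and real non-graphical browsers do considerably more than this.

```python
from html.parser import HTMLParser


class TextOnlyRenderer(HTMLParser):
    """Collects page text, substituting each image with its ALT text."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags also arrive here via handle_startendtag.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data)


def render_text_only(html: str) -> str:
    renderer = TextOnlyRenderer()
    renderer.feed(html)
    renderer.close()
    # Collapse runs of whitespace so the output reads as one line.
    return " ".join(" ".join(part.split()) for part in renderer.parts)


page = (
    "<p>Welcome to our shop.</p>"
    '<img src="clear.gif" alt="SPAM, Spam, spam" />'
    "<p>Opening hours: 9 to 5.</p>"
)
print(render_text_only(page))
# Welcome to our shop. SPAM, Spam, spam Opening hours: 9 to 5.
```

The keyword list lands in the middle of the sentence flow, which is exactly what the visitor hears or reads.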
Tags that have been designed to improve access for the disabled, or
less capable platforms, are often subverted to deliver spam. Yet it is
possible - and professionally essential - to use these tags in the
manner for which they were invented. Consider the impact of doing so.
The site is usable by more visitors, from more platforms. If marketing
is your goal, then you are reaching a wider market. This is an
ethically sound policy. It improves access for all and improves your
overall marketing capability. At the same time it does not deliver spam
which spoils a search engine's ability to calculate relevancy or makes
a page meaningless to visitors with lower capabilities.
Meta Spam
Now consider meta spam. Meta data is data that describes a resource.
Meta spam is data that mis-describes a resource or describes a resource
incoherently in order to manipulate a search engine's relevancy
calculations.
Think again about the humble ALT attribute. Not only does it provide
content for a HTML resource, it also provides a description of an image
resource. In this descriptive capacity, to mis-describe an image or to
describe it incoherently (using, say, a stream of keywords instead of a
descriptive sentence or phrase) is meta-spam. Perhaps the best examples
of meta spam at present can be found in the <head>
section of HTML pages. Remember, though, it’s only spam if it
is done purely for search engine relevancy gain.
Meta spam is more abstract than content spam. Rather than discuss it in
abstract terms, we will take some examples from HTML and XML/RDF in
order to illustrate meta spam and where it differs from and crosses
with content spam.
Generally, anything within the <head> section of an HTML document, or
anything within the <body> section that describes another resource, can
be subverted to deliver meta spam.
Examples of meta spam
The TITLE Tag
Location: <head> section of a HTML document
Example: <title>White Paper : The Classification of
Search Engine Spam</title>
Search engines tend to place a lot of emphasis on the title tag in
determining relevancy. Basically, if keywords occur in a page's title
tag, the page is more likely to be seen as relevant to those keywords.
The title of this document is "White Paper : The Classification of
Search Engine Spam", which accurately describes (using terminology
appropriate to the target audience) the contents of this document. If
we had made the title of this document "Spammer's delight - click here
to find out how to spam the search engines, SPAM, Spam, spam, ugly
spam, obvious spam" then we would have a couple of problems. One, the
title would mis-describe this page. Two, the title would be incoherent,
yet it is designed for search engine users to see.
Caveat: the <title> tag has several functions beyond
search engine listings. If an alternative use can justify using a
particular title, then the title is not spam.
The META DESCRIPTION tag
Location: <head> section of a HTML document
Example: <meta name="Description" id="Description" content="The
definitive guide to search engine spam." />
Everything said above about the title tag equally applies to the meta
description tag. However, the caveat regarding alternative uses is not
as strong. The title tag has many uses – the meta description
tag is almost exclusively used by search engines.
The META KEYWORDS tag
Location: <head> section of a HTML document
Example: <meta name="Keywords" id="Keywords" content="spam
classification search engine optimization optimisation ethical
marketing marketer professional" />
Unlike the title and meta description tags, the meta keywords tag is
not generally displayed to searchers. Therefore, it does not need to
meet the "coherency" condition. In addition, the keywords tag was
designed by search engines to assist search engines in determining
relevancy. Therefore, it is our opinion that nothing in the keywords
tag should be considered to be spam. Instead, the search engine should
use the keywords tag either not at all or to guide keyword selection,
but not to influence the relevancy calculations of those keywords.
Dublin Core Tags
Location: <head> section of a HTML document
Example: <meta name="DC.title" id="DC.title" content="White
Paper : The Classification of Search Engine Spam" />
The Dublin Core tags can be considered similarly to the meta tags
already described.
XML/RDF Tags and Metadata
Location: XML/RDF files or streams or embedded in other Web resources
Example: <dc:title>White Paper : The Classification of
Search Engine Spam</dc:title>
XML/RDF tags and metadata can be considered similarly to the meta tags
already described. It is important to note that the use of XML/RDF
will, in itself, not bring an end to search engine spam. It will simply
provide an alternative spam channel. In fact, it could provide a
greater opportunity for spam unless careful checks and balances, or
contracts and conditions, are applied.
That concludes the discussion of types of search engine spam. The
remainder of this document will consider issues such as links,
redirection, agent delivery, IP delivery, cloaking, the role of the
search engine and the role of the marketer.
Links
With link popularity taking on a greater importance in the calculation
of relevancy, the spammer’s attention has turned to how to
manipulate this factor. Link popularity has two components: the
authority component (number of links from other resources to this
resource) and the hub component (number of links from this resource to
other resources).
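The two components can be made concrete with a small link graph. This is a minimal sketch that simply counts links as defined above; the graph representation (a dictionary mapping each page to the set of pages it links to) and the function names are assumptions for illustration, and real link popularity algorithms weigh links in ways this sketch does not attempt.

```python
def authority_count(links: dict[str, set[str]], page: str) -> int:
    """Number of other pages that link to `page` (self-links are ignored)."""
    return sum(
        1
        for source, targets in links.items()
        if source != page and page in targets
    )


def hub_count(links: dict[str, set[str]], page: str) -> int:
    """Number of other pages that `page` links to (self-links are ignored)."""
    return len(links.get(page, set()) - {page})


# Hypothetical three-page Web: each key links to the pages in its set.
links = {
    "a": {"b", "c"},
    "b": {"c"},
    "c": {"c", "a"},
}
print(authority_count(links, "c"))  # linked from a and b, so 2
print(hub_count(links, "c"))        # links out to a only, so 1
```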
Techniques such as link farms have been developed to subvert both the
authority and hub components. What is a link farm?
Link farm
A network of pages on one or more Web
sites, heavily cross-linked with each other, with the sole intention of
improving the search engine ranking of those pages and sites.
How can link farm pages be distinguished from other pages? The means of
the determination is beyond the scope of this document. Suffice it to
say that it can be done (hint: draw Web graphs of some small link farms
and look at the patterns that emerge).
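One pattern that tends to show up when such graphs are drawn is unusually dense cross-linking within a group of pages. The sketch below measures that density for a given group. It is only one possible signal, chosen for illustration; it is not a method prescribed by this document or known to be used by any search engine, and the function name and graph representation are assumptions.

```python
from itertools import permutations


def cross_link_density(links: dict[str, set[str]], group: list[str]) -> float:
    """Fraction of all possible directed links within `group` that exist."""
    pages = sorted(set(group))
    if len(pages) < 2:
        return 0.0
    pairs = list(permutations(pages, 2))  # every ordered (source, target) pair
    present = sum(1 for source, target in pairs if target in links.get(source, set()))
    return present / len(pairs)


# Hypothetical graph: three heavily cross-linked pages and two ordinary ones.
web = {
    "f1": {"f2", "f3"},
    "f2": {"f1", "f3"},
    "f3": {"f1", "f2"},
    "blog": {"news"},
    "news": set(),
}
print(cross_link_density(web, ["f1", "f2", "f3"]))  # every pair linked: 1.0
print(cross_link_density(web, ["blog", "news"]))    # one of two links: 0.5
```

High density alone does not prove intent; by the definition above, a link farm also requires that the links exist solely to improve search engine ranking.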
Links can be used to deliver both types of search engine spam, i.e.
both content spam and meta spam.
Link content spam
When a link exists on a page A to page B
only to affect the hub component of page A or the authority component
of page B, that is an example of content spam on page A. Page B is not
spamming at all. Page A should receive a spam penalty. Without further
evidence, page B should not receive a penalty.
Link meta spam
When the anchor text or title text of a
link either mis-describes the link target, or describes the link target
using incoherent language, that is an example of link meta spam.
Here are some practical examples of link spam:
1. An SEO house stuffs the noframes content of a
client's framed home page with spam, including a link to the SEO's web
site to attempt to influence the authority factor of their site.
Result: The client receives a spam penalty for all the spam, including
the link. The SEO's web site receives no penalty for the link, in the
absence of any further evidence.
2. A guerrilla web marketer places a competitor's
Web site in a link farm, hoping it receives a spam penalty. Result: The
competitor Web site is not an active participant in the link farm (it
does not link to other sites in the farm), so there is no evidence of
spam abuse and it receives no penalty. It also receives no credit for
the links it receives, because they come from a link farm.
Here is a general rule of thumb to determine whether link spam has
taken place - if the link is not designed to be followed by humans, or
the page it is on is not designed to be read by humans, then it is spam.
Redirects
There are many means of redirecting from one Web page to another, none
of which were invented for spamming, but all of which can be subverted
for spamming. Examples of redirection methods are HTTP 300 series
redirect response codes, HTTP 400 series error vectors, META REFRESH
tags and JavaScript redirects. Where a redirect is used to move a human
quickly from a page that has been designed for a search engine to see
to a page designed for a human to see, then the whole page designed for
the search engine to see is spam. Everything on it is an example of
either content spam or meta spam.
However, since redirection wasn't invented to facilitate spamming, the
existence of a redirect should not of itself indicate spam. A search
engine robot seeing a HTTP 300 series response or short META refresh
should follow the redirect to the target, without indexing the source.
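The robot behaviour recommended above can be sketched as follows. This is a minimal illustration covering only HTTP 300 series responses (META refresh handling is omitted), run against an in-memory stand-in for the Web. The `Response` type, the `resolve_for_indexing` function and the hop limit are all assumptions made for the example.

```python
from typing import Callable, NamedTuple, Optional


class Response(NamedTuple):
    status: int
    location: Optional[str] = None  # redirect target, if any
    body: str = ""


def resolve_for_indexing(
    url: str, fetch: Callable[[str], Response], max_hops: int = 5
) -> Optional[tuple[str, str]]:
    """Follow 300 series redirects and return (final_url, body) to index.

    Redirect sources are never indexed. Returns None for redirect loops,
    over-long chains, or a final response that is not 200.
    """
    seen: set[str] = set()
    for _ in range(max_hops + 1):
        if url in seen:
            return None  # redirect loop
        seen.add(url)
        response = fetch(url)
        if 300 <= response.status < 400 and response.location:
            url = response.location  # move on to the target, skip the source
            continue
        return (url, response.body) if response.status == 200 else None
    return None  # too many hops


# Hypothetical restructured site: the old URL redirects to the new one.
site = {
    "/old-page": Response(301, location="/new-page"),
    "/new-page": Response(200, body="Current content"),
}
print(resolve_for_indexing("/old-page", site.__getitem__))
# ('/new-page', 'Current content')
```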
Here are some practical examples of redirection:
1. An SEO produces a page of gobbledegook designed
only for a search engine to see, tuned for a particular keyword to
match the keyword density the SEO believes the search engine's
algorithms currently use. The SEO places this content after a screenful
of paragraph breaks so human visitors to the page will not see it
without scrolling. The SEO then uses a JavaScript include file to
redirect to another page designed for humans to see. Result: This is
spam. Humans that visit the page with JavaScript disabled get
gobbledegook. The site's brand is damaged and the human does not become
a customer. If the human chooses to report the spam to the search
engine, the site on which the page sits receives a spam penalty.
2. A Webmaster restructures her Web site and
inserts Redirect lines into the server configuration file to ensure
visitors that follow links to the old pages automatically end up at the
correct new pages. The search engine robot therefore receives a HTTP
300 series response when it requests a particular page. Result: the
search engine robot should follow the redirect and treat the target
page as any other page on the Web. This is not spam.
Agent-Based Delivery and Agent-Based Spam
Agent-Based Delivery was invented at almost the same time as the Web
itself. It uses fields of the HTTP request header, in particular the
User-Agent field, in order to deliver the content according to features
such as the platform and language of the visitor. In other words,
different content is delivered from the same URL according to the HTTP
request.
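A minimal sketch of the idea, assuming a site that offers a text-only and a graphical layout in two languages. The variant names, the browser check and the supported language list are hypothetical. The point is that the choice depends only on the platform and language of the visitor, never on whether the visitor is a search engine robot.

```python
SUPPORTED_LANGUAGES = ("en", "de")  # hypothetical: languages this site offers


def select_variant(headers: dict[str, str]) -> str:
    """Pick a page variant from HTTP request header fields."""
    user_agent = headers.get("User-Agent", "").lower()
    # Take the first language listed, e.g. "de-DE,de;q=0.9" gives "de".
    first_language = headers.get("Accept-Language", "").split(",")[0].strip()
    language = first_language.lower()[:2]
    if language not in SUPPORTED_LANGUAGES:
        language = "en"
    # Text-mode browsers get the text-only layout.
    layout = "text" if "lynx" in user_agent else "graphical"
    return f"{layout}-{language}"


print(select_variant({"User-Agent": "Lynx/2.8.9", "Accept-Language": "de-DE,de;q=0.9"}))
# text-de
print(select_variant({"User-Agent": "Mozilla/5.0"}))
# graphical-en
```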
Agent-Based Delivery can be subverted to deliver spam to search
engines. However, Agent-Based Delivery also has a purpose that does not
depend on the existence of search engines. Therefore, the use of
Agent-Based Delivery does not necessarily indicate an intention to spam
a search engine.
We will now briefly discuss the use of Agent-Based Delivery and some of
the implications.
Let's suppose that a webmaster uses Agent-Based Delivery to deliver one
version of a Web site to Mozilla browsers, and another version of the
same web site to non-Mozilla browsers. A search engine, as a
non-Mozilla browser, sees a different version of a Web site than a
human visitor that uses a Mozilla browser. Is this spam? The answer is
either Yes or No:
• Yes, if the non-Mozilla version is designed predominantly for search
engine robots to read.
• No, if the non-Mozilla version is designed predominantly for humans
to read.
In other words, if the non-Mozilla version of the site is designed for
humans using a text to speech converter, a Lynx or Mosaic browser, a
PDA or WAP phone, an interactive TV set or any other non-Mozilla
browser, then the use of Agent-Based Delivery is not spam. It passes
this basic test:
Suppose search engines did not exist. Would the technique still be used
in the same way?
Note that just because Agent-Based Delivery does not imply spam, this
does not prevent Content Spam or Meta Spam being placed on pages served
by Agent-Based Delivery. This is analogous to spam being placed in
noframes or noscript tags.
We will now define and briefly discuss Agent-Based Spam.
Agent-Based Spam
The use of Agent-Based Delivery to
identify search engine robots by user agent and deliver unique content
to those robots.
This is always spam because the unique content is designed only for the
search engine robot to see, not for humans. This cannot be justified if
search engines did not exist, so it must have been done only to
influence search engine relevancy. Therefore it constitutes spam. Every
instance of unique content on the page will be content spam, meta spam
or both.
Note: it seems reasonable to permit individual HTML <head> section tags
such as title and meta description to be delivered to individual search
engines. Reason: Since meta data is not seen on-the-page by humans
visiting the page it cannot be content spam (and on its own site the
search engine may publish the meta data as it wishes). As long as the
meta data describes the page accurately and coherently, it is not
meta-spam either. Therefore, it is not spam. To classify this activity
as spam would result in webmasters having to conform to the lowest
common denominator in delivering meta data, which does not encourage
search engines to improve.
Rule of thumb: it is OK to target search engine robots by their agent
name and deliver unique content in the <head> section of a HTML
document, but not in the <body> section.
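A webmaster could check a site against this rule of thumb by comparing the page served to a human's browser with the page served to a robot: the head sections may differ, the body sections should not. The sketch below compares body text only (it ignores attributes such as ALT text, which a fuller check would also cover), and the class and function names are assumptions for illustration.

```python
from html.parser import HTMLParser


class BodyTextCollector(HTMLParser):
    """Collects only the text that appears inside the <body> element."""

    def __init__(self) -> None:
        super().__init__()
        self.in_body = False
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)


def body_text(html: str) -> str:
    collector = BodyTextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(" ".join(collector.chunks).split())  # normalise whitespace


def same_body(human_html: str, robot_html: str) -> bool:
    """True when both versions present the same body text."""
    return body_text(human_html) == body_text(robot_html)


human = "<html><head><title>Widgets</title></head><body><p>Buy widgets here.</p></body></html>"
robot = "<html><head><title>Widgets for sale</title></head><body><p>Buy widgets here.</p></body></html>"
print(same_body(human, robot))  # only the title differs: True
```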
IP Delivery and IP Cloaking
IP Delivery is the delivery of content according to the IP name or IP
address of the requester. These properties of the incoming connection
(they are not fields of the HTTP request header) can indicate the ISP
and likely location of the visitor. The two most common
reasons to use IP Delivery are to deliver secure content (e.g. within
an intranet or across a Virtual Private Network) and to deliver content
according to the likely location of the visitor. Both these activities
are perfectly valid in the absence of search engines and therefore do
not necessarily constitute search engine spam. For example, it would
not in itself be spam to use IP Delivery to determine that a search
engine was based in Germany, and deliver the same content to that
search engine's robot as to other German visitors. The content could
contain both content spam and meta spam, though.
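The location case can be sketched as follows. The network-to-region table uses reserved documentation address ranges and is entirely hypothetical, as are the names and content strings. Note what the code does not do: it never asks whether the requester is a search engine robot, so a robot in a given region receives exactly what other visitors from that region receive.

```python
import ipaddress

# Hypothetical mapping from networks to regions (documentation ranges only).
REGION_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "de",
    ipaddress.ip_network("198.51.100.0/24"): "uk",
}

# Hypothetical per-region content.
CONTENT_BY_REGION = {
    "de": "Willkommen",
    "uk": "Welcome (prices in GBP)",
    "default": "Welcome",
}


def region_for(ip: str) -> str:
    """Return the region whose network contains `ip`, or 'default'."""
    address = ipaddress.ip_address(ip)
    for network, region in REGION_BY_NETWORK.items():
        if address in network:
            return region
    return "default"


def content_for(ip: str) -> str:
    """Select content by likely location only."""
    return CONTENT_BY_REGION[region_for(ip)]


print(content_for("192.0.2.10"))   # Willkommen
print(content_for("203.0.113.5"))  # Welcome
```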
We will now define and discuss IP Cloaking.
IP Cloaking
The identification of search engine
robots by IP name or address and the delivery of unique content to
those robots.
Using this definition, all uses of IP Cloaking are spam. This is
because the unique content is designed only for the search engine robot
to see, not for humans. This cannot be justified if search engines did
not exist, so it must have been done only to influence search engine
relevancy. Therefore it constitutes spam. Every instance of unique
content will be content spam, meta spam or both.
IP Cloaking usually involves the building and maintenance, or rental or
purchase from a third party, of a database of IP names and addresses
used by search engine robots; the identification of search engine
robots using this database; and the delivery of unique content to those
robots. It therefore requires a lot of effort and/or expense. In return
for this effort and expense, the only feature that IP Cloaking offers
that other technologies do not offer is preventing humans reading the
cloaked page. This very feature means that the content on cloaked pages
is spam - designed purely to influence search engine relevancy
calculations. There is no non-spam use of IP Cloaking that could not be
fulfilled more simply, cheaply and reliably by alternative technologies.
IP Cloaking is excellent for hiding various illegal and immoral
practices such as copyright infringement, trademark stealing and
bait-and-switch. This is because IP Cloaking is designed to prohibit
review of the methods used.
For these reasons, we do not consider IP Cloaking to be an acceptable
technique for professionals to associate themselves with. IP Delivery
(i.e. delivering content according to the visitor's IP, but not
specifically targeting search engine robots) is acceptable. If IP
Delivery is deployed, a search engine robot should receive the same
content as a human typical of that search engine's users.
Clarification: If it is an IP-based technology that is not delivering
what we define as Search Engine Spam, then it is not IP Cloaking but IP
Delivery. In short, it's only cloaking if it is spam and it's only spam
if it cannot be justified in the absence of search engines. If a search
engine has given you written permission to deliver unique content to
its robots, and supplied its robots' IP addresses to you for this
purpose (e.g. to enable a secure transaction) then, using our
definition, this is Not Search Engine Spam. Therefore the delivery of
unique content to robots with those IP addresses would be classed as IP
Delivery rather than IP Cloaking.
Conclusion
"It was a hard path and a dangerous
path, a crooked way and a lonely and a long."
JRR Tolkien, The Hobbit
This document has attempted to set out guidelines and principles for
classifying search engine spam. It has been written to allow search
engine marketers and other industry professionals to objectively
evaluate actions to see whether those actions equate to spamming a
search engine. It is hoped that quality search engines, ethical
marketers and search industry professionals will agree that this
document lays out standards which the industry should strive for.
Within this document, we identified two types of search engine spam
(content spam and meta spam) and discussed several examples of those
types of spam. We would like to conclude this document with a few
comments and guidelines to search engines and Web marketers.
To search engines
• It isn't spam if it's valid in the
absence of search engines - especially if it makes a site more
accessible - so don't penalise it.
• It isn't my spam if somebody else did
it outside my control - so penalise them, if anyone, not me.
To Web marketers
• Use Web technologies for the purposes
for which they were designed.
• Make your sites more marketable by
making them more accessible.
• Don't cloak.
by: Alan Perkins