What is search engine cloaking?
Cloaking is a search engine optimization technique where one version of the page is delivered to the search engine robot and a different version is delivered from the same address to the human visitor. This is done by tracking whether a browser or a spider has made the request.
Google’s definition of cloaking
http://www.google.com/webmasters/faq.html#cloaking
Sites use search engine cloaking technology for several reasons:
- hiding the aggressively optimized (not so pretty) web page from the human visitor
- hiding the html code of your web page so that your competitors cannot try and duplicate any success you might have with the search engines
- custom language delivery
- IP-based delivery for broadband users
- targeted advertising based on visitors’ geographical location
Page cloaking may sound like a tempting idea, but there are some risk factors involved. The biggest risk is getting banned from the search engines for the offence of spamming.
The search engines do not approve of page cloaking. Aggressive search engine optimization techniques, such as over-repetition of keywords, usually go hand in hand with page cloaking, and search engines penalize this kind of excess. Hence, there is always a risk of getting banned from the search engine index.
The legitimacy of cloaking is an ongoing debate in the search engine industry.
How does cloaking work?
Cloaking is carried out by a script that runs on the web server and inspects each request made to the web site. It identifies search engine crawlers by their IP address (which looks something like 209.185.253.176) or by their user-agent string. Matching on the IP address is harder to fool and is the better solution, but it requires an up-to-date database of known spider IPs.
Creating this database and keeping it current takes a lot of time, because one has to follow every change to the agent names and IP addresses used by the search engines. Lists of these IP addresses can be bought, which is often the easier solution.
There is a lot of technology involved here, but the principle is simple. Every computer connected to the Internet has an IP (Internet Protocol) address that distinguishes it from the others, and every domain name resolves to an IP address by which it can be reached. When a request comes in for a page, the cloaking script reads the requester’s IP address and looks it up in its database of spider IP addresses. If it finds a match, the script treats the request as coming from a search engine. If not, it assumes the request came from a browser. A search engine receives a page adjusted to the keyword density expected to rank well in that search engine. A human visitor using a browser receives an attractive-looking web page.
Advantages
+ Aggressive search engine optimization techniques often sacrifice the pleasant appearance of a web page, which can leave a page that does not appeal to human visitors. With cloaking, different sets of web pages can be delivered depending on whether the visitor is a spider or a human. The site keeps its pleasant appearance for human visitors while still showing highly optimized pages to the search engines.
+ The HTML code of a web page that ranks well in the search engines can be hidden from competitors, which prevents them from copying it.
+ Targeted advertising depending on the country from which the request has been made.
+ Doorway pages. It is very difficult for a single page to attain top ranks in all the search engines. Each search engine has its own algorithm for ranking web pages, so one web page cannot easily fulfill the ranking criteria of all of them. Instead, several versions of the same web page are created, each adjusted to the ranking criteria of one particular search engine. Suppose a page on soft toys has to rank well in 5 search engines. In that case, 5 versions of the page have to be created, one for each search engine.
If Google requests that particular page, the page named soft-toys-google.html is returned. If the request comes from Alltheweb, soft-toys-alltheweb.html is returned.
Disadvantages
+ The search engines do not approve of cloaking and treat it as spam, so there is always a risk of being penalized. The penalty can be a ban from the search engine index itself, or a ranking buried so deep in the search results that hardly any traffic reaches the site.
+ Cloaking demands a lot of time and effort. One has to keep track of changes to the IP addresses of the search engines and to the ranking algorithm of each search engine individually, and tweak the HTML code of each page version accordingly.
+ These days, search engines use very sophisticated technology to rank web pages. They no longer depend solely on on-the-page factors like keyword density and appropriate meta tags. Instead, they rely heavily on the popularity of the page. Because human visitors never see the cloaked pages, those pages cannot satisfy the popularity factor and are unlikely to attain good ranks. So, cloaking is not needed in order to attain high ranks in the search engines. A web site with good content and links from quality content sites is a more reliable way to attain top ranks.
User agent name delivery
All search engine spiders, as well as browsers, identify themselves with an agent name. Software can be installed on the web server so that it delivers web pages depending on the value of the HTTP_USER_AGENT variable. The agent names of some of the spiders and browsers are listed below:
Netscape Browser -> Mozilla
Altavista -> Scooter
Google -> Googlebot
Inktomi -> Slurp
Lycos -> Lycos_Spider
Northern Light -> Gulliver
Infoseek -> InfoSeek Sidewinder
Excite -> ArchitextSpider
Once the agent name of a spider or browser has been captured, delivering the appropriate web page is not a difficult task. This can be accomplished with Server Side Includes (SSI). SSI can be configured to deliver different web pages from the same web address, depending on the value of the HTTP_USER_AGENT variable.
You can write a script that tells the web server to deliver the page soft-toys.htm if the value of the HTTP_USER_AGENT variable contains Mozilla, soft-toys_inktomi.htm if it contains Slurp, soft-toys_altavista.htm if it contains Scooter, and soft-toys_google.htm if it contains Googlebot.
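The selection rule described above can be sketched in a few lines of Python. This is only an illustration, not how any particular cloaking product works: the names AGENT_PAGES, DEFAULT_PAGE and select_page are hypothetical, and the sketch assumes a substring match on the agent name. Spider names are checked first and the browser page is the fallback, because a crawler's agent string may itself contain the token Mozilla.

```python
# Hypothetical table: agent-name marker -> page variant named in the text.
AGENT_PAGES = [
    ("Googlebot", "soft-toys_google.htm"),
    ("Slurp", "soft-toys_inktomi.htm"),
    ("Scooter", "soft-toys_altavista.htm"),
]

# Page served to browsers and to any agent not listed above.
DEFAULT_PAGE = "soft-toys.htm"


def select_page(user_agent: str) -> str:
    """Return the page file name for a given HTTP_USER_AGENT value."""
    agent = user_agent.lower()
    for marker, page in AGENT_PAGES:
        # Case-insensitive substring match on the spider's agent name.
        if marker.lower() in agent:
            return page
    return DEFAULT_PAGE


if __name__ == "__main__":
    for sample in ("Googlebot/2.1", "Scooter/3.2", "Mozilla/4.0 (compatible)"):
        print(sample, "->", select_page(sample))
```

Because the decision rests entirely on a string the client supplies, anyone who sends the same string receives the same page.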
Agent-based cloaking is simple to configure. However, it has some disadvantages as well. The biggest is the lack of security: it is not a reliable way of hiding your HTML code. Anybody can write a program that imitates an agent name and easily access the page that was designed for the search engine spiders, simply by sending the same agent name as the spider.
Agent-based cloaking is also easy to detect. One can request the same address several times, each time setting the HTTP_USER_AGENT value to the agent name of a different search engine spider. If the pages that are returned differ from one another, the site is cloaking.
IP-based delivery – IP-address cloaking
IP-based cloaking is more complicated than agent-based cloaking. It requires some custom programming to configure your web server to deliver pages based on an IP address. The advantage of IP cloaking is that it protects the HTML code of your web pages far better. This is because a visitor cannot easily imitate the IP address of a search engine spider in the way the spider’s agent name can be imitated.
An Internet Protocol (IP) address, in the IPv4 form used here, is a combination of four numbers that identifies a computer on the Internet. It is a unique 32-bit number specified as four 8-bit numbers (represented as integers) called octets. The four octets are separated by periods, and each must be in the range 0-255. A sample IP address is 255.32.3.10.
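To make the "four octets, one 32-bit number" description concrete, here is a minimal Python sketch that packs a dotted address into its 32-bit value and rejects octets outside 0-255. The function name octets_to_int is hypothetical. The standard library's ipaddress module does the same conversion and is used here only as a cross-check.

```python
import ipaddress


def octets_to_int(address: str) -> int:
    """Pack a dotted IPv4 address into its 32-bit integer value."""
    parts = address.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four octets, got {len(parts)}: {address!r}")
    value = 0
    for part in parts:
        octet = int(part)  # raises ValueError for non-numeric text
        if not 0 <= octet <= 255:
            raise ValueError(f"octet out of range 0-255: {octet}")
        # Shift the previous octets left by 8 bits and append this one.
        value = (value << 8) | octet
    return value


if __name__ == "__main__":
    sample = "255.32.3.10"
    print(octets_to_int(sample))                  # 4280288010
    print(int(ipaddress.IPv4Address(sample)))     # same value from the stdlib
```

An address such as 256.1.1.1 raises ValueError, since 256 does not fit in 8 bits.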
Each spider has an IP address. For example, 204.123.9.19 is the IP address of an Altavista spider, 209.185.253.176 is that of a Google spider, and 209.202.148.23 is the IP address of one of the Fast/Alltheweb spiders.
Since the IP address of the computer that requests your web page can be tracked, this data can be used to deliver specific pages to the search engine spiders.
IP-based delivery is difficult and complicated to implement, as one has to maintain an exhaustive database of the IP addresses of the search engine spiders. This database also has to be updated frequently, because the IP addresses of the spiders can change and new ones are always being added.
The REMOTE_ADDR variable on the web server holds the IP address of the computer that requests the web page. A cloaking script reads the value of the REMOTE_ADDR variable and checks for that value in the database of IP addresses.
If the address is present, the source of the request is known, and a page that has been explicitly designed for that source can be delivered.
If the address is not present in the database, the script assumes that the request was made by a human visitor using a browser.
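The lookup described above might be sketched as follows. This is an illustrative assumption, not a real cloaking script: SPIDER_IPS is a hypothetical in-memory stand-in for the spider database, seeded only with the three sample addresses quoted earlier in this section, and environ stands for a CGI-style dictionary of server variables.

```python
import ipaddress

# Hypothetical spider database, seeded with the sample addresses from the text.
SPIDER_IPS = {
    "204.123.9.19": "altavista",
    "209.185.253.176": "google",
    "209.202.148.23": "alltheweb",
}


def classify_request(environ: dict) -> str:
    """Return the search engine name for a known spider IP, else 'human'."""
    remote = environ.get("REMOTE_ADDR", "")
    try:
        # Normalize the address; malformed values fall through to 'human'.
        key = str(ipaddress.ip_address(remote))
    except ValueError:
        return "human"
    return SPIDER_IPS.get(key, "human")


if __name__ == "__main__":
    print(classify_request({"REMOTE_ADDR": "209.185.253.176"}))  # google
    print(classify_request({"REMOTE_ADDR": "198.51.100.7"}))     # human
```

Any address missing from the table is classified as human, which is exactly the gap a spider arriving from a new IP address exposes.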
The temptation to exploit and misuse this technology is great, and the search engines do not approve of cloaking. If you use cloaking, be aware that if you get caught, there is a good chance the search engines will ban your web site from their index for this offence.
How do you detect that someone is cloaking?
The search engines can detect IP cloaking by sending a spider from an IP address that they have never used before. Since this is a new IP address, the database of IP addresses used for cloaking will not contain it. If the search engine finds that the page delivered to the spider with the new IP address differs from the page delivered to a spider with a known IP address, it concludes that the site uses cloaking techniques.
Search engines can also send a spider whose HTTP_USER_AGENT value does not contain the name of the search engine. If the page delivered to this spider differs from the page delivered to a spider that does carry the search engine’s name in the HTTP_USER_AGENT variable, the search engine concludes that the site uses user-agent cloaking.
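The comparison behind both checks can be sketched in Python. How search engines actually implement detection is not public, so this is only an illustration under stated assumptions: looks_cloaked is a hypothetical name, fetch stands for any function that requests the same address with a given agent name, and the two fake sites replace real HTTP requests. Exact string comparison is a simplification, since real pages can differ between requests for legitimate reasons such as rotating advertisements.

```python
from typing import Callable, Iterable


def looks_cloaked(fetch: Callable[[str], str], user_agents: Iterable[str]) -> bool:
    """Return True if the same address yields different pages per agent name."""
    distinct_pages = {fetch(agent) for agent in user_agents}
    return len(distinct_pages) > 1


# Hypothetical stand-ins for real HTTP requests to one address.
def fake_cloaking_site(user_agent: str) -> str:
    if "Googlebot" in user_agent:
        return "<html>soft toys soft toys soft toys</html>"
    return "<html>Welcome to our soft toy shop</html>"


def fake_honest_site(user_agent: str) -> str:
    return "<html>Welcome to our soft toy shop</html>"


if __name__ == "__main__":
    agents = ["Googlebot/2.1", "Mozilla/4.0 (compatible)"]
    print(looks_cloaked(fake_cloaking_site, agents))  # True
    print(looks_cloaked(fake_honest_site, agents))    # False
```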
Is it still effective?
Some search engine optimization firms defend cloaked pages as a way to “protect” highly optimized web pages. In our view, however, cloaking is intrinsically deceptive, because it is used more often to manipulate search engine rankings than to protect HTML code. Instead of developing quality web sites, it is simply a way to “trick” the search engines. Hence, the search engines regard cloaking as unethical, dishonest and unfair.
In addition, search engines these days take site popularity into account when ranking web pages. Their ranking algorithms do not depend on the on-the-page factors of a web page alone. So, keyword-rich meta tags or repetition of keywords throughout the content no longer make much difference on their own.