Stop words are common words that search engines ignore when processing a search query. Ignoring them saves space on the search engine's servers and speeds up the search process.
Space saving
When a search is conducted, the search engine excludes the stop words from the search query and replaces each of them with a marker. A marker is a symbol that is substituted for a stop word. The intention is to save space: the search engine can store more web pages in the space saved, while retaining the relevancy of the search query.
Example: “jobs in USA” is a search query. The search engines will mark the stop word “in” with “*” and will conduct the search for “jobs * USA”.
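The marker substitution in the example above can be sketched in a few lines of Python. This is only an illustration: the tiny stop-word set and the function name `mark_stop_words` are assumptions, not any search engine's actual list or interface.

```python
# Hypothetical, deliberately tiny stop-word list; real engines maintain their own.
STOP_WORDS = {"in", "the", "of", "and", "a", "an", "to"}


def mark_stop_words(query: str, marker: str = "*") -> str:
    """Replace every stop word in the query with a marker symbol."""
    words = query.split()
    return " ".join(marker if w.lower() in STOP_WORDS else w for w in words)


print(mark_stop_words("jobs in USA"))  # jobs * USA
```

Words that are not in the assumed list pass through unchanged, so the query keeps its shape and word positions.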
Speeding up the search process
Consider the search query “the steel industry”. Some search engines store all words in their index but exclude certain commonly used words from the search query in order to save time. Otherwise, they would have to make three separate runs to find matches: one for “the”, one for “steel” and one for “industry”. The relevant pages can most likely be found by looking only for the last two words, so one run, along with the searching time it would take, is saved.
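The saving in runs can be made concrete with a toy inverted index. The index contents, the page ids and the function name `search` below are invented for illustration; the sketch only counts how many index lookups a query needs with and without stop-word removal.

```python
# Toy inverted index: word -> set of page ids (all data here is made up).
INDEX = {
    "the": {1, 2, 3, 4},
    "steel": {2, 3},
    "industry": {3, 4},
}
STOP_WORDS = {"the"}  # assumed stop-word list


def search(query: str, skip_stop_words: bool = True):
    """Return (ids of pages containing every term, number of index lookups)."""
    terms = query.lower().split()
    if skip_stop_words:
        terms = [t for t in terms if t not in STOP_WORDS]
    lookups = 0
    result = None
    for term in terms:
        pages = set(INDEX.get(term, set()))  # one "run" per term
        lookups += 1
        result = pages if result is None else result & pages
    return (result or set()), lookups


print(search("the steel industry"))                         # two lookups
print(search("the steel industry", skip_stop_words=False))  # three lookups
```

Both calls find the same page in this toy data, but the first one needs one lookup fewer.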
Some commonly excluded “stop words” are:
| after | also | an | and |
| as | at | be | because |
| before | between | but | for |
| however | from | if | in |
| into | of | or | other |
| out | since | such | than |
| that | the | these | there |
| this | those | to | under |
| upon | when | where | whether |
| which | with | within | without |
...
Normally, search engines exclude “stop words”, i.e. common words that only modify other words and carry no inherent meaning themselves, such as adverbs, conjunctions, prepositions, or forms of “be”. If a common word is essential to getting the desired result, however, one can force a search engine (for example, Google) to include it in a query by using the inclusion operator (the “+” sign). The operator must be placed in front of every stop word that is to be included.
Another method is to conduct a phrase search. According to search engine expert Greg Notess, Google, along with other search engines, automatically searches for stop words when they appear in phrases (i.e. when two or more words are enclosed in double quotation marks). A phrase search does not require the + sign in front of the common words. For example, one can simply type the phrase “how play golf” and find Tiger Woods’ best-selling golf digest, rather than having to type “+how +play golf”.
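The two ways of keeping a stop word in a query can be sketched as a small query parser. The stop-word set and the function name `effective_terms` are assumptions made for this example, and straight double quotes stand in for the quotation marks typed into a search box; no real search engine is claimed to work exactly this way.

```python
import re

STOP_WORDS = {"how", "to", "the", "a", "in", "of"}  # assumed list


def effective_terms(query: str) -> list:
    """Return the terms that would actually be searched.

    Stop words are dropped unless they carry a leading '+' or sit
    inside a double-quoted phrase.
    """
    terms = []
    # Each match is either a "quoted phrase" or a run of non-space characters.
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        if phrase:
            terms.append(phrase)      # phrase kept whole, stop words included
        elif not word:
            continue                  # empty pair of quotes
        elif word.startswith("+"):
            if len(word) > 1:
                terms.append(word[1:])  # forced inclusion
        elif word.lower() not in STOP_WORDS:
            terms.append(word)
    return terms


print(effective_terms("how play golf"))
print(effective_terms("+how +play golf"))
print(effective_terms('"how play golf"'))
```

The first call loses the stop word, while the second and third keep it, as a separate term and as part of one phrase respectively.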
Boolean Search
The term “Boolean” is derived from the name of the British mathematician George Boole (1815-64). Boolean searching refers to search operations that combine multiple words. With the “AND” operator, results must contain all words of the search query; with the “OR” operator, results may contain any one of the words; and with the “NOT” or “AND NOT” operator, results containing a specified term are excluded.
There are three main Boolean operators: “AND”, “OR” and “NOT”. Although most search engines support Boolean operators, they differ in how these operators are applied. The differences to keep in mind are as follows:
- Case sensitivity – Whether the site requires the operators to be typed in capital letters.
- Shorthand Boolean operators – Some sites support shorthand operators while others do not. The “+” sign stands for the “AND” operator, the “-” sign stands for the “NOT” operator, and the absence of any sign denotes the “OR” operator.
- NOT operator – Some sites accept it as “NOT”, while others accept it as “AND NOT”.
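The shorthand convention in the list above can be sketched as a small classifier that sorts query tokens by their leading sign. The function name `classify_terms` and the returned dictionary layout are assumptions for illustration only.

```python
def classify_terms(query: str) -> dict:
    """Sort shorthand query tokens into AND (+), NOT (-) and OR (no sign)."""
    groups = {"AND": [], "NOT": [], "OR": []}
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            groups["AND"].append(token[1:])   # term must be present
        elif token.startswith("-") and len(token) > 1:
            groups["NOT"].append(token[1:])   # term must be absent
        else:
            groups["OR"].append(token)        # term is optional
    return groups


print(classify_terms("+gold -coal mines"))
```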
AND
Inserting AND between two search terms tells the search engine to find only those pages which contain both of the search terms. For example, if one searches for ‘musical instruments AND parts’, the results will include only those web pages which contain both terms, but not necessarily in that order.
OR
The OR operator asks the search engine to look for pages containing the search term on either side of the operator. Using OR increases the number of results. With the same sample search terms, ‘musical instruments’ and ‘parts’, placing ‘OR’ between them returns every page that contains ‘musical instruments’, ‘parts’, or both.
NOT
The NOT operator tells the search engine to exclude pages containing the search term that follows it. For example, if one is searching for sites that deal with gold mines but is not interested in sites that deal with coal mines, the search query may be: gold NOT coal mines.
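The three operators map directly onto set operations over the pages that contain each term. The sketch below uses a made-up collection of four pages and a hypothetical helper `pages_with`; it shows the logic only, not how any particular search engine stores its data.

```python
# Toy data: page id -> page text (entirely made up for illustration).
PAGES = {
    1: "gold mines in nevada",
    2: "coal mines in wales",
    3: "gold and coal prices",
    4: "musical instruments and parts",
}


def pages_with(term: str) -> set:
    """Ids of pages whose text contains the term as a whole word."""
    return {pid for pid, text in PAGES.items() if term in text.split()}


gold = pages_with("gold")
coal = pages_with("coal")

print("gold AND coal:", sorted(gold & coal))  # pages containing both terms
print("gold OR coal: ", sorted(gold | coal))  # pages containing either term
print("gold NOT coal:", sorted(gold - coal))  # gold pages without coal
```

AND narrows the result to the intersection, OR widens it to the union, and NOT removes the pages that match the excluded term.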
Netscape Search supports Boolean operators, but it does not accept their shorthand forms. It is not case sensitive either. If Boolean operators are omitted, it applies the “AND” operator by default.
Google supports Boolean operators. It automatically returns pages that include all of the search terms, so there is no need to type “AND” between them. However, the order in which the terms are typed affects the search results.
Google also supports the logical “OR” operator. To retrieve pages that include either word A or word B, searchers can use an uppercase OR between the terms. For example, to search for a vacation in Hawaii or Switzerland, the search query will be ‘vacation hawaii OR switzerland’.
Google FAQ: Automatic Exclusion of Common Words
http://www.google.com/help/basics.html#stopwords
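The combination of an implicit AND with an uppercase OR, as described for Google above, can be sketched as follows. The function names `parse_query` and `matches` are hypothetical, stop-word handling is left out, and the sketch is not a description of Google's actual query parser.

```python
def parse_query(query: str) -> list:
    """Group terms: implicit AND between groups, uppercase OR within a group."""
    groups = []
    join_next = False
    for token in query.split():
        if token == "OR":
            join_next = bool(groups)      # a leading OR has nothing to join
        elif join_next:
            groups[-1].append(token)      # alternative to the previous term
            join_next = False
        else:
            groups.append([token])        # new required group
    return groups


def matches(text: str, groups: list) -> bool:
    """True if the text contains at least one term from every group."""
    words = set(text.lower().split())
    return all(any(term.lower() in words for term in group) for group in groups)


groups = parse_query("vacation hawaii OR switzerland")
print(groups)
print(matches("cheap vacation in switzerland", groups))
print(matches("cheap vacation in spain", groups))
```

A lowercase “or” is treated here as an ordinary term, which mirrors the requirement that the operator be typed in capitals.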
Although AllTheWeb recognizes the need for complex search queries, it does not support advanced Boolean searching.
Stemming (Word Variations)
Stemming is defined as “a form of automatic right truncation of each word in the index to its root”. PLWeb Turbo performs it automatically in order to accommodate the variety and ambiguity of the English language. If the word “search” is used as a search query, stemming causes variants such as “searcher”, “searches”, “searched” and “searching” to be matched as well. The searcher thus receives results that would otherwise be ignored under a strict interpretation of the query.
There are two main types of stemming: plural stemming and Porter stemming. Plural stemming tries to determine the singular form of a word, whereas Porter stemming attempts to find the root, or stem, of a word and derive other possible variations. The database administrator chooses a stemming algorithm before the database is indexed.
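The difference between the two approaches can be illustrated with two deliberately crude functions. Neither is a real implementation: `plural_stem` covers only a few English plural patterns, and `naive_stem` strips a single suffix as a stand-in for root finding. It is not the Porter algorithm, which applies a much longer series of rules. Both function names and the suffix list are assumptions for this sketch.

```python
def plural_stem(word: str) -> str:
    """Very rough plural stemming: reduce a plural to a singular form."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"            # queries -> query
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]                  # searches -> search
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                  # engines -> engine
    return word


SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")  # checked in this order


def naive_stem(word: str) -> str:
    """Strip one common suffix; a crude stand-in, NOT the Porter algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ("searcher", "searches", "searched", "searching"):
    print(w, "->", naive_stem(w))
```

All four variants of “search” reduce to the same stem here, which is what lets an index match them against a single query term.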
To provide the most accurate results, some search engines support “stemming”, while others do not. Netscape Search, for example, allows stemming, whereas Google does not use “stemming” or support “wildcard” searches. Google searches for exactly the words that one enters in the search box.
Search engines that allow stemming can leave it either on or off by default, and they provide a way to switch to the other mode. Netscape Search leaves word stemming on by default. To turn off stemming for a word in Netscape Search, one has to insert a single quotation mark (‘) at the end of the word. Netscape will then search only for exact matches of that word.
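The on/off switch described for Netscape Search can be sketched as follows. The sketch approximates stemming with simple right truncation (prefix matching), uses an ASCII apostrophe as the trailing quotation mark, and the function name `term_matches` is hypothetical; it does not reproduce Netscape's actual matching rules.

```python
def term_matches(term: str, word: str) -> bool:
    """Check whether an indexed word matches a query term.

    A trailing single quote on the term switches stemming off, so only
    the exact word matches. Otherwise any word that begins with the term
    matches, which is a crude approximation of stemming.
    """
    if term.endswith("'"):
        return word == term[:-1]       # stemming off: exact match only
    return word.startswith(term)       # stemming on: right truncation


print(term_matches("search", "searching"))   # stemming on
print(term_matches("search'", "searching"))  # stemming off
print(term_matches("search'", "search"))     # exact word still matches
```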
Query Rewriting
Normally, search engines do not interpret search queries in a meaningful way. But AllTheWeb tries to rewrite an optimal search query, taking the search engine query language into consideration.
Let us consider the famous quotation from Shakespeare’s Hamlet, “to be or not to be”. This query contains only stop words, i.e. small everyday words that many search engines will skip when they index pages.
Google usually returns no results for this query unless one forces it to take the stop words into account, either by entering a + sign in front of every word (“+to +be +or +not +to +be”) or by performing a phrase search. Entered without these operators, the query (to be or not to be) does not bring up information about Shakespeare’s famous drama Hamlet on the first few pages. The word “or” is ignored in the query; to get results that include one term or another, a capitalized “OR” must be placed between the words. And because the following words are very common, they were not included in the search: to be or not to be.
Further details are available here:
http://www.google.com/help/basics.html#and
The FAST search engine, which powers AllTheWeb, instead consults a set of dictionaries when presented with a query like this and eliminates unnecessary words. This particular search query is interpreted as “to be or not to be hamlet”, and thanks to linguistic analysis the search engine produces listings relevant to this topic.
http://alltheweb.com/search?q=to+be+or+not+to+be+hamlet&c=web&l=any&cn=4&cs=utf-8
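The effect of such a rewrite can be imitated with a simple dictionary lookup. This is only a toy model of the idea: the phrase dictionary `KNOWN_PHRASES`, its single entry and the function name `rewrite_query` are invented for illustration, and the real engine's dictionaries and linguistic analysis are not documented here.

```python
# Hypothetical phrase dictionary: known phrase -> context term to append.
KNOWN_PHRASES = {"to be or not to be": "hamlet"}


def rewrite_query(query: str) -> str:
    """Append a context term when the whole query is a known phrase."""
    normalized = " ".join(query.lower().split())
    extra = KNOWN_PHRASES.get(normalized)
    return f"{normalized} {extra}" if extra else normalized


print(rewrite_query("To be or not to be"))  # to be or not to be hamlet
```

Queries that are not in the dictionary pass through unchanged apart from case and whitespace normalization.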
Filter Words
Quite often, the terms “Stop Words” and “Filter Words” are used interchangeably, but they are not the same.
The common words that search engines remove from web pages before adding them to their databases are known as filter words. Some sample filter words are: a, the, is, an, of, for, do, to, or. In order to accelerate the search process and save disk space, search engines routinely filter these words out at the time of indexing the page content.
To make the indexing of a huge number of web pages feasible, saving disk space is important to all search engines. According to Google’s home page, it has indexed over 1.6 billion web pages. Now consider how often the word “the” appears on an average web page. In this sub-topic alone, the word “the” appears 16 times. All search engines, Google included, can therefore save a substantial amount of disk space by removing filter words before indexing sites.
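A rough feel for the saving can be obtained by measuring what share of a page's words are filter words. The sketch below uses the sample filter words listed earlier in this section; the function name `filter_word_share` and the punctuation handling are assumptions, and the result depends entirely on the text supplied.

```python
# Sample filter words taken from the list given in this section.
FILTER_WORDS = {"a", "the", "is", "an", "of", "for", "do", "to", "or"}


def filter_word_share(text: str) -> float:
    """Fraction of the words in the text that are filter words."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    words = [w for w in words if w]       # drop tokens that were only punctuation
    if not words:
        return 0.0
    dropped = sum(1 for w in words if w in FILTER_WORDS)
    return dropped / len(words)


print(filter_word_share("the cat is on the mat"))  # 0.5
```

In this six-word sample, three words would not need to be stored, so half of the word entries for that sentence are saved.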
As each search engine sets its own rules, there is no complete list of filter words. Most search engines inform the searcher on their search results page which words in the query are filter words.
Many search engines use link popularity as the main factor for ranking web pages in the search results. Many still consider the “Title” and “Meta” tags to be the main on-the-page criteria that influence ranking. Webmasters therefore have to make sure that their list of keywords does not match a search engine’s list of filter words. Filter words consume a large amount of space in the “Meta” tag but produce no benefit.
Most search engines place more importance on key phrases that occur early in the “Meta” keyword list or at the top of the page content. If those phrases contain several filter words, however, the effect of later keywords is diluted.
In addition, many search engines recommend that webmasters limit the “Meta” and “Title” tags to a certain number of characters. If filter words are overused, webmasters will soon run out of room for important words.
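The space argument can be illustrated with a small helper that removes filter words from keyword phrases and then keeps as many whole phrases as fit into a character limit. The filter-word set is the sample list from this section; the function name `compact_keywords`, the comma separator and the limits used below are assumptions, since each search engine sets its own rules.

```python
# Sample filter words taken from the list given in this section.
FILTER_WORDS = {"a", "the", "is", "an", "of", "for", "do", "to", "or"}


def compact_keywords(phrases, max_chars: int) -> str:
    """Drop filter words from each phrase, then keep whole phrases that fit."""
    kept = []
    used = 0
    for phrase in phrases:
        words = [w for w in phrase.split() if w.lower() not in FILTER_WORDS]
        if not words:
            continue                       # phrase consisted only of filter words
        cleaned = " ".join(words)
        cost = len(cleaned) + (2 if kept else 0)   # 2 characters for ", "
        if used + cost > max_chars:
            break                          # later phrases no longer fit
        kept.append(cleaned)
        used += cost
    return ", ".join(kept)


phrases = ["history of the steel industry", "jobs for engineers"]
print(compact_keywords(phrases, 40))
```

With the filter words removed, both phrases fit into 40 characters; left in place, the two original phrases alone would need 49 characters.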