Preface
Nowadays the Internet consists of billions of Web pages all over the world. There
are no road signs that direct visitors to your site. So if you do not have enough
money to start a huge advertising campaign, you are stuck with search engines.
Search engines are used to find specific information on the Internet. They
constantly crawl (look over) the Internet and index millions
of pages per day. You can easily add your page to a search engine's work list,
and within a couple of weeks the search engine will crawl your page.
But the search engine uses a robot or spider, which is actually a program
that looks at your page and tries to get information about your site, so the chance
that your site is indexed correctly is very slim. There are, however, ways to help
and direct a search spider so that the chance that Internet users can find your site
in search engines increases dramatically.
There are two methods of directing a search engine on your site:
- Robots.txt
- META tags
The robots.txt file tells the search engine spider where to look for information.
The META tags help the spider to get the correct information about a specific
web page.
Robots.txt
The robots.txt file is used by search engine spiders to see what they may or
may not include in their search. The robots.txt file must always be located at
the root of the website. You cannot make a robots.txt for a specific part of
the website.
An example of a robots.txt would be my own at:
http://www.schaake.nu/robots.txt
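Because the file always lives at the root, its address can be derived from any
page address on the same site. A minimal sketch, assuming Python and its standard
urllib.parse module; the function name robots_url and the page path are
hypothetical, chosen only for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is made up; only the host comes from the text.
print(robots_url("http://www.schaake.nu/some/dir/page.html"))
```

Whatever directory the page is in, the result points at the single robots.txt
in the site root.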
The robots.txt file consists of two commands. The first command, User-Agent, names
the user agent to which the directives that follow apply, so only that user agent
will look at them.
The second command, Disallow, restricts access to a specific directory or file on
the website.
Let's explain this with a sample:
# Some stuff we don't want google to see
User-Agent: Googlebot
Disallow: /googlesecrets.html
Disallow: /cgi-bin
# All the other agents may also not index the cgi-bin
User-Agent: *
Disallow: /cgi-bin
With this example, all search engines will index the whole site except /cgi-bin.
Only the Googlebot is additionally kept away from the /googlesecrets.html page.
Now what if we have no secrets at all and want the complete site to be
indexed?
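One way to check how a spider could read these rules is Python's standard
urllib.robotparser module. This is only a sketch of how one well-behaved client
interprets the file, not a statement about how any particular search engine
works. The agent name OtherBot and the tested paths are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# The sample rules from the text, fed in as a string instead of fetched.
RULES = """\
# Some stuff we don't want google to see
User-Agent: Googlebot
Disallow: /googlesecrets.html
Disallow: /cgi-bin
# All the other agents may also not index the cgi-bin
User-Agent: *
Disallow: /cgi-bin
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "OtherBot" stands in for any agent that is not named in the file.
for agent in ("Googlebot", "OtherBot"):
    for path in ("/index.html", "/googlesecrets.html", "/cgi-bin/form.cgi"):
        verdict = "allowed" if parser.can_fetch(agent, path) else "refused"
        print(agent, path, verdict)
```

Googlebot is refused both /googlesecrets.html and anything under /cgi-bin, while
the other agent is only refused /cgi-bin.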
# Allow complete access
User-Agent: *
Disallow:
Or we could disallow one agent from indexing our site completely.
# Disallow the googlebot completely
User-Agent: googlebot
Disallow: /
Note that not all search engines look at the robots.txt file. Most
of the big commercial search engines will respect your robots.txt file, but the
spiders of spammers (who are looking for email addresses) will not be stopped
by it.
META tags
The META tags contain information about a specific web page. Most spiders will
look at the META tags and use this information instead of trying to collect information
about the page themselves.
The drawback of this is that when you have incorrect or outdated information
in your META tags, the spider will use this information instead of looking at
the page itself.
The following META tags can be used to help a spider.
<META NAME="description" CONTENT="Description of the webpage"/>
The description tag holds a short description of the web page. Keep it consistent
with the <TITLE> tag in the page header. Some search engines will still look at
the title tag instead of the description META tag!
<META NAME="keywords" CONTENT="keyword1, keyword2, keyword3"/>
To help a spider collect keywords for your site, you can include the keywords
META tag. This tag contains some useful keywords Internet users can use to find
your Web page. Keywords are separated with a comma.
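To see what a spider can pick up from these tags, the sketch below reads the
title, description and keywords from a page with Python's standard html.parser
module. The class MetaCollector, the helper split_keywords and the sample page
are hypothetical and written only for this illustration. Real spiders are
certainly more elaborate. The helper accepts comma-separated as well as
space-separated keyword lists, as an assumption about what a tolerant reader
might do:

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the TITLE text and the NAME/CONTENT pairs of META tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names in lowercase.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def split_keywords(content):
    """Split on commas if there are any, otherwise on whitespace."""
    parts = content.split(",") if "," in content else content.split()
    return [part.strip() for part in parts if part.strip()]


# Hypothetical page used only for this illustration.
PAGE = """<HTML><HEAD>
<TITLE>Example page</TITLE>
<META NAME="description" CONTENT="Example page"/>
<META NAME="keywords" CONTENT="keyword1, keyword2, keyword3"/>
</HEAD><BODY><P>Hello</P></BODY></HTML>"""

collector = MetaCollector()
collector.feed(PAGE)
keywords = split_keywords(collector.meta.get("keywords", ""))
print(collector.title)
print(collector.meta["description"])
print(keywords)
```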
<META NAME="robots" CONTENT="index,follow"/>
The robots.txt file can forbid spiders to look at specific pages or complete
directories. But sometimes you want some more control over the spider. The robots
META tag will give you that control for each individual page.
The first part of the tag will tell the spider if the current page may be indexed,
the second part will tell the spider if it may follow hyperlinks in the current
page. Possible options are:
- index,follow - Spiders may index the page and follow all links on the page.
- noindex,nofollow - Spiders may not index the page and may not follow any
links.
- index,nofollow - The page may be indexed, but no links may be followed.
This is very useful for pages that link to forms.
- noindex,follow - The page may not be indexed, but the spider may follow
all links. A good example would be a dynamic weblog.
The ALL option stands for index,follow, and the NONE option stands for
noindex,nofollow (e.g. <META NAME="robots" CONTENT="all"/>).
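The options above can be sketched as a small function that turns the CONTENT
value of this tag into two yes/no answers. The function name parse_robots_meta
is hypothetical. Treating a missing option as permission is an assumption made
for this sketch and is not stated in the text:

```python
from typing import Tuple


def parse_robots_meta(content: str) -> Tuple[bool, bool]:
    """Return (may_index, may_follow) for a robots META CONTENT value."""
    # Assumption: anything that is not forbidden is allowed.
    may_index, may_follow = True, True
    for option in content.lower().split(","):
        option = option.strip()
        if option == "all":
            may_index, may_follow = True, True
        elif option == "none":
            may_index, may_follow = False, False
        elif option == "index":
            may_index = True
        elif option == "noindex":
            may_index = False
        elif option == "follow":
            may_follow = True
        elif option == "nofollow":
            may_follow = False
    return may_index, may_follow


for value in ("index,follow", "noindex,nofollow", "index,nofollow",
              "noindex,follow", "all", "none"):
    print(value, parse_robots_meta(value))
```

ALL gives the same answer as index,follow and NONE the same answer as
noindex,nofollow, which matches the description above.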
<META HTTP-EQUIV="refresh" CONTENT="3600"/>
The refresh meta tag tells the spider to refresh the page every given number of
seconds. This directive could be used for internal search engines, but I do
not see a reason why a public search engine would refresh its indexed content
for your specific page. It will take weeks before a search engine visits
your site again.
<META NAME="revisit-after" CONTENT="30"/>
This directive makes more sense, but I'm not sure whether there are search engines
that look at it. The above example tells the search engine to revisit
the site after 30 days. So if a search engine would normally plan a revisit after
14 days, it can wait another 16 days before revisiting your site. This really
saves bandwidth.
<META NAME="generator" CONTENT="Microsoft Frontpage"/>
This META tag tells the spider which web design tool was used to generate or design
this Web page. A search engine could use this to build statistics on the usage
of design tools.
<META NAME="language" CONTENT="nl, en"/>
This META tag defines the language used on the Web page. Normally a spider will
try to detect the language itself, but with this tag you can force a specific
language.
<META NAME="copyright" CONTENT="Copyright 2003 Christiaan Schaake."/>
<META NAME="author" CONTENT="Christiaan Schaake"/>
These two META tags tell the spider who wrote the page and who holds the copyright
on it. A search engine could include this information in the search results.
Not all search engines will look at the META tags, so always use plain text for
the important parts of your site. Do not make a first welcome page that only includes
a big image or a Shockwave animation. And make use of the title and alt attributes!
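A page can be checked for images that give a spider nothing to read. The sketch
below uses Python's standard html.parser module. The class AltChecker and the
welcome page are hypothetical examples. Images with an empty alt text are
reported as well, which is a choice made for this sketch:

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Remember the SRC of every IMG tag that has no usable ALT text."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            alt_text = (attributes.get("alt") or "").strip()
            if not alt_text:
                self.missing_alt.append(attributes.get("src"))


# Hypothetical welcome page with one described and one undescribed image.
WELCOME = """<BODY>
<IMG SRC="logo.gif">
<IMG SRC="photo.jpg" ALT="Our office">
</BODY>"""

checker = AltChecker()
checker.feed(WELCOME)
print(checker.missing_alt)
```

Here only logo.gif is reported, because photo.jpg carries an alt text that a
spider can index.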