Jim Verard
October 15th, 2007, 03:13 AM
Greetings.
Is there a way to prevent search engines like Google, Yahoo and others from indexing our site, and to keep their robots from crawling and recording all our documents?
I just found these links:
http://pageresource.com/zine/robotstxt.htm
There's also this extensive FAQ:
http://www.robotstxt.org
The first one explains a little about how to do that:
{QUOTE-> This is a useful file that keeps search engines from indexing pages you do not want spidered. Why would you not want a page indexed by a search engine? Perhaps you want to display a page that shows an example of spamming the search engines. This type of page might include an example of repeated keywords, hidden tags with keywords, and other things that could get a page or an entire site banned from a search engine.
The robots.txt file is a good way to prevent this page from getting indexed. However, not every site can use it. The only robots.txt file that the spiders will read is the one at the top html directory of your server. This means you can only use it if you run your own domain. The spiders will look for the file in a location similar to these below:
http://www.pageresource.com/robots.txt
http://www.javascriptcity.com/robots.txt
http://www.mysite.com/robots.txt
Any other location of the robots.txt file will not be read by a search engine spider, so the file locations below will not be worthwhile:
http://www.pageresource.com/html/robots.txt
http://members.someplace.com/you/robots.txt
http://someisp.net/~you/robots.txt
Now, if you have your own domain- you can see where to place the file. So let's take a look at exactly what needs to go into the robots.txt file to make the spider see what you want done.
If you want to exclude all the search engine spiders from your entire domain, you would write just the following into the robots.txt file:
User-agent: *
Disallow: / <-QUOTE}
I made a single robots.txt file which has these lines:
User-agent: *
Disallow: /
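To check how a well-behaved robot would read those two lines, here is a minimal sketch using Python's standard `urllib.robotparser` module. The helper name `is_allowed` and the page `index.php` are only my assumptions for illustration, and this only shows what a compliant crawler would conclude; it says nothing about whether a given search engine actually fetches the file from a free host subdomain.

```python
from urllib.robotparser import RobotFileParser

# The two lines from my robots.txt file.
RULES = ["User-agent: *", "Disallow: /"]


def is_allowed(robots_lines, user_agent, url):
    """Return True if a compliant robot may fetch url under the given rules.

    is_allowed is a hypothetical helper, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse the rules directly, no network access
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # index.php is an assumed page name on the address given in this post.
    page = "http://myusername.myfreehostserver.com/index.php"
    print(is_allowed(RULES, "Googlebot", page))
```

With `Disallow: /` every path on the host matches the rule, so the check reports that the page may not be fetched by any user agent.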
However, it's placed on a free hosting server. The address will be the following:
http://myusername.myfreehostserver.com/robots.txt
Is there no way to prevent my website from being indexed by these search engines? And what about an IP Deny Manager, which is sometimes available in the cPanel of a hosting service?
If that option is not available, and we have a board on our website, how do we know which IPs to add to the banned IPs list in the board's admin panel to block the robots from Google and others?
The FAQ also explains:
{QUOTE-> What if I can't make a /robots.txt file?
Sometimes you cannot make a /robots.txt file, because you don't administer the entire server. All is not lost: there is a new standard for using HTML META tags to keep robots out of your documents.
The basic idea is that if you include a tag like:
<META NAME="ROBOTS" CONTENT="NOINDEX">
in your HTML document, that document won't be indexed.
If you do:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
the links in that document will not be parsed by the robot. <-QUOTE}
My English is not so good, so I take it that "parsed" means "verified".
Do I have to place that line in every single file on my website? Like I said, I have a board like vBulletin, so which lines should be included in my files (most of them are not written in HTML, they are always PHP)?
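Since the PHP files still send HTML to the browser, I suppose the tag has to end up in that generated output. Here is a small sketch, under that assumption, of how one could check whether a page's HTML carries the tag, using Python's standard `html.parser` module. The names `RobotsMetaParser` and `robots_directives` are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for part in attr.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    sample = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head></html>'
    print(robots_directives(sample))
```

Running this over the HTML saved from each page of the board would show which pages are still missing the tag.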
My last question: is there a way to place some JavaScript code on all of our pages to prevent our hosting service from using Google Analytics code?
I just signed up for a host and they are using this service on every single one of my pages. There's no option to remove it manually, and it is not related to the statistics function that is also available in my cPanel.
Any help will be appreciated. :)