Looking for ways to remove your site and its pages from search engines?

Are you a webmaster looking for ways to remove old and nonexistent content of your site from search engines? If so, this is the right place to find some effective ways of getting your site, pages, blog and images out of a search engine’s index.

First, let’s look at the most common reasons why webmasters want to remove some pages, or an entire site, from a search engine’s index.

» Very old pages whose content is no longer valid or applicable.
» Pages with wrong information that were indexed before you could act on them.
» Pages with confidential information that were indexed because they were linked from a public page, or for some other reason.
» Migration from an old domain or URL to a new domain or URL.
» A subdomain or a folder under a parent domain is no longer in use or applicable.
» The site was still under testing when it got indexed.

Different ways of removing your content from search engines

Whatever the reason for removing the content, the ways to do it are limited and depend on which search engine you want your content removed from. Let’s look first at a method that is search engine specific and then at common methods that apply to all major search engines.

» Removing your content from Google

Google provides an online URL removal tool that webmasters can use to request removal of indexed URLs, domains, subdomains, groups of pages, subfolders and so on. For a removal request to succeed, your existing URL must return a 404 HTTP status to tell Google that the content is no longer available (or) the page must carry a META tag that prevents Google from indexing it again (or) the URL must be excluded in robots.txt. As long as the META tag is present in a page, site crawlers will not index that page again.

If you rely on the 404 route, make sure the error page itself is served with a 404 HTTP status. Many websites serve their error page with a different status, such as 200 OK, and that’s why search engines fail to detect that the pages are gone from your site.
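If you are not sure what your server sends for a missing page, a quick check can help. The following is a minimal Python sketch, not something Google provides: the function names `http_status` and `is_really_gone` and the example URL are made up for illustration, and counting 410 (Gone) alongside 404 is my own assumption.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for `url`."""
    # HEAD asks for the headers only, which is all we need here.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still what we want.
        return err.code


def is_really_gone(status: int) -> bool:
    """True when the status tells a crawler the page no longer exists.

    Assumption: 410 (Gone) is treated the same as 404 (Not Found).
    """
    return status in (404, 410)


# Example (hypothetical URL, needs network access):
# status = http_status("https://www.example.com/old-page.html")
# print(status, is_really_gone(status))
```

If a page you deleted reports 200 here, your error page is most likely being served without a 404 status.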

» Other common ways of removing content from search engines

[Option #1]: Using a robots.txt file to remove a whole site, or part of its content, from all or specific search engines.

We can make use of robots.txt exclusion to prevent the search engines from indexing the pages, folders, sub domains etc. Let’s see how it works.

[Note: In the examples below, replace Googlebot with MSNBot for the MSN search engine, or with the user-agent name of any other known robot.]

» To remove your entire website from all search engines and prevent further indexing by all search engine crawlers, place the following lines in robots.txt file in the root folder of your site.

User-agent: *
Disallow: /

» To remove your site from Google only and prevent only Google bot from crawling your site in future, place the following lines in the robots.txt in web server’s root folder.

User-agent: Googlebot
Disallow: /

» To remove only the https version of indexed pages from search engines, place the following in robots.txt file in the folder which serves the secured page of your site.

User-agent: *
Disallow: /

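» To remove all the pages under a particular directory [say scripts], place the following in the robots.txt file. Rules are matched by prefix, so this also covers any other path that begins with /scripts; write /scripts/ with a trailing slash to limit the rule to the directory itself.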

User-agent: *
Disallow: /scripts

» To remove all pages under a specific directory from just Google, use the following lines in robots.txt

User-agent: Googlebot
Disallow: /scripts

» To remove a specific file type from search engines, use the following [example gif files]

User-agent: *
Disallow: /*.gif$

» And to remove gifs only from Google, use the following.

User-agent: Googlebot
Disallow: /*.gif$

» To remove dynamically generated pages [that is, any URL containing a ? query string], use the following in robots.txt

User-agent: *
Disallow: /*?

» If it’s Google specific, use the following.

User-agent: Googlebot
Disallow: /*?
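The * and $ characters in the rules above are wildcards: * stands for any run of characters and a trailing $ anchors the rule to the end of the URL. Python's built-in robots.txt parser does not interpret them, so here is a rough sketch of the matching logic as I understand it. The function names are my own, and real crawlers may differ in details.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a wildcard Disallow rule into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    # Rules always match from the start of the path; "$" pins the end too.
    return re.compile("^" + body + ("$" if anchored else ""))


def rule_matches(rule: str, path: str) -> bool:
    """True if `path` (path plus query string) is covered by `rule`."""
    return rule_to_regex(rule).search(path) is not None


print(rule_matches("/*.gif$", "/images/logo.gif"))
print(rule_matches("/*.gif$", "/images/logo.gif?size=2"))
print(rule_matches("/*?", "/page.php?id=7"))
print(rule_matches("/*?", "/page.php"))
```

The gif rule covers /images/logo.gif but not the same URL with a query string appended, and the /*? rule covers only URLs that actually contain a question mark.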

» To remove a specific image [say imagename.jpg] from Google’s image index, place the following in the robots.txt file. I am not sure whether replacing Googlebot-Image below with a * will work with other search engines. You may try it and let me know.

User-agent: Googlebot-Image
Disallow: /imagename.jpg

» To remove all gif images under a directory of your site [say images] from Google’s image index, use the following.

User-agent: Googlebot-Image
Disallow: /images/*.gif$

[Option #2]: Using META tags in HTML pages to remove content from search engines.

» To prevent all search engine robots from indexing a page on your site, place the following meta tag into the <HEAD> section of your page.

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

» To prevent only Google robot from indexing a page, you’d use the following tag in your page.

<META NAME="GOOGLEBOT" CONTENT="NOINDEX, NOFOLLOW">

» To allow all search engine robots to index the current page but NOT follow any of the links on it, place the following META tag on your page.

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

» To remove the page snippet that is displayed under the page title of search results, use the following:

<META NAME="GOOGLEBOT" CONTENT="NOSNIPPET">

Note: This will also remove the cached version of your pages from the search index.

» To remove a blog from a search engine’s index, place the following in the <head> section of the blog’s pages.

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
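Once the tags are in place, it is worth confirming that a page really carries the directives you intended. The following is a small sketch using Python's standard `html.parser` module; the class name `RobotsMetaParser`, the function `robots_directives` and the sample HTML are my own, and it only looks at META tags named ROBOTS or GOOGLEBOT.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in robots-related META tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        name = attributes.get("name", "").lower()
        if name in ("robots", "googlebot"):
            content = attributes.get("content", "")
            tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
            self.directives.setdefault(name, set()).update(tokens)


def robots_directives(html: str) -> dict:
    """Return a mapping such as {"robots": {"noindex", "nofollow"}}."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


SAMPLE = """<html><head>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<META NAME="GOOGLEBOT" CONTENT="NOSNIPPET">
</head><body>Hello</body></html>"""

print(robots_directives(SAMPLE))
```

An empty result for a page you expected to be excluded usually means the tag is missing or sits somewhere the parser cannot read it.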

