    Google Ignores Robots.txt

    September 26th, 2006 by Philip Nicosia

    Logging into my Google Sitemaps account, I notice the following:

    http://www.philipnicosia.com/gallery/index.php URL restricted by robots.txt Sep 8, 2006

    But a site: query shows that Google has indexed and cached this page, even though robots.txt restricts it.

    This is Google’s cache of http://www.philipnicosia.com/gallery/index.php as retrieved on 18 Sep 2006 06:18:24 GMT.
    Google’s cache is the snapshot that we took of the page as we crawled the web.

    So I check with their robots.txt checker to see if there is a problem, and it does indeed say the page is allowed, even though my robots.txt says:

    User-agent: *
    Disallow: /gallery/

    When I add the following to my robots.txt:

    User-agent: Googlebot
    Disallow: /gallery/

    Google now recognizes that the directory is blocked.

    So somewhere between the 8th and the 18th of September, Google has decided that it is not like other search engines, and that any page on your site is fair game unless you specifically tell Googlebot not to go there.

    1 Comment

    • 1. Philip Nicosia  |  October 1st, 2006 at 6:58 pm

      Someone has very kindly explained the error of my ways. Below is what they said.

      “I see you have a section for * and a section for Googlebot. When you have it set up like that, the Googlebot will only look at the specific section, not at the generic one.”

      So it seems that when a robots.txt has a section specifically for Googlebot, Google follows only that section and ignores the one meant for all robots. I never knew that; I assumed it would read both.
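
The behaviour described in this comment can be reproduced locally. Below is a minimal sketch using Python's standard `urllib.robotparser`, which is not Google's parser but also lets a matching agent-specific section take precedence over the `*` section. The "before" file is an assumption: the post does not show what the original Googlebot section contained, so `/some-other-path/` is a hypothetical placeholder, and `SomeOtherBot` is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

URL = "http://www.philipnicosia.com/gallery/index.php"

# Hypothetical "before" file: a generic section that blocks /gallery/,
# plus a Googlebot section that does not mention it. The Googlebot rule
# here is a placeholder, not taken from the original site.
BEFORE = """\
User-agent: *
Disallow: /gallery/

User-agent: Googlebot
Disallow: /some-other-path/
"""

# "After" file: the /gallery/ rule is repeated in the Googlebot section,
# as described in the post.
AFTER = """\
User-agent: *
Disallow: /gallery/

User-agent: Googlebot
Disallow: /gallery/
"""


def is_allowed(robots_txt: str, agent: str, url: str = URL) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for label, text in (("before", BEFORE), ("after", AFTER)):
        for agent in ("Googlebot", "SomeOtherBot"):
            print(label, agent, is_allowed(text, agent))
```

With the "before" file, the Googlebot section matches first and contains no rule for `/gallery/`, so the page counts as allowed for Googlebot while other crawlers fall back to the `*` section and are blocked. Repeating the rule in the Googlebot section closes the gap.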
