Referencing your sitemap - Before we turn to submission (i.e., actively notifying the search engines of your sitemap), I would like to briefly explore passive notification, which I call sitemap referencing. Sitemaps.org (to which all the major engines now subscribe) sets a standard for referencing that uses the very same robots.txt file I explained above (page 57). When a spider visits your site and reads your robots.txt file, you can now tell it where to find your sitemap. For example (where your sitemap file is called sitemap.xml and is located in the root of your website):

    User-agent: *
    Sitemap: http://www.yourdomain.com/sitemap.xml
    Disallow: /cgi-bin/
    Disallow: /assets/images/

The example robots.txt file tells the crawler where to find your sitemap and not to crawl either your cgi-bin directory (containing Perl scripts not intended for the human reader) or your images directory (to save bandwidth). The Sitemap line is independent of the User-agent line, so it can appear anywhere in the file, and it must give the full URL of the sitemap. For more information on the robots.txt standard, you can refer to the authoritative website www.robotstxt.org.
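
If you would like to check how a crawler is likely to read such a file before you upload it, the following is a minimal sketch in Python (version 3.8 or later), using the standard library's urllib.robotparser module. It parses the example robots.txt from memory, so nothing is fetched from the web. The domain is the placeholder used above, and the three test paths (order.pl, logo.gif, index.html) are hypothetical names chosen only for illustration. Real search engine spiders use their own parsers, so treat the result as a sanity check and not as a guarantee.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the text, held in memory so nothing is fetched.
ROBOTS_TXT = """\
User-agent: *
Sitemap: http://www.yourdomain.com/sitemap.xml
Disallow: /cgi-bin/
Disallow: /assets/images/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the list of Sitemap URLs, or None if the file has none.
sitemaps = parser.site_maps()

# Placeholder domain from the example; the file names below are hypothetical.
BASE = "http://www.yourdomain.com"
TEST_PATHS = ("/cgi-bin/order.pl", "/assets/images/logo.gif", "/index.html")

# Ask whether a crawler matching "User-agent: *" may fetch each path.
may_crawl = {path: parser.can_fetch("*", BASE + path) for path in TEST_PATHS}

print("Sitemaps referenced:", sitemaps)
for path, allowed in may_crawl.items():
    print(path, "->", "allowed" if allowed else "disallowed")
```

The file is supplied as a string deliberately, so you can test a draft of your robots.txt before it goes live. To check the copy already on your server, the same module offers set_url() and read() in place of parse().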