Tuesday, February 19, 2008

Sitemap

What are Sitemaps?
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.
Sitemap 0.90 is offered under the terms of the Attribution-ShareAlike Creative Commons License and has wide adoption, including support from Google, Yahoo!, and Microsoft.
A Sitemap does not affect the actual ranking of your pages. However, if it helps get more of your site crawled (by notifying us of URLs we didn't previously know about, and/or by helping us prioritize the URLs on your site), that can lead to increased presence and visibility of your site in our index.


Sitemap File Format
The most popular Sitemap format is XML. It is supported by most search engines, and Google also encourages webmasters to use the XML format.
The XML Sitemap must:


  1. Begin with an opening <urlset> tag and end with a closing </urlset> tag.
  2. Specify the namespace (protocol standard) within the <urlset> tag.
  3. Include a <url> entry for each URL, as a parent XML tag.
  4. Include a <loc> child entry for each <url> parent tag.

All other tags are optional. Support for these optional tags may vary among search engines. Refer to each search engine's documentation for details.
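
The four requirements above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function name build_sitemap and the example URLs are assumptions made for the sketch, and the optional tags are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap 0.9 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML Sitemap (as a string) for an iterable of URLs."""
    # Root <urlset> element carrying the protocol namespace.
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for address in urls:
        # One <url> parent per page, each with its required <loc> child.
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = address
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical URLs, used only to show the call.
sitemap_xml = build_sitemap([
    "http://www.example.com/",
    "http://www.example.com/catalog?item=1&desc=a",
])
print(sitemap_xml)
```

ElementTree escapes special characters such as & inside <loc>, which hand-written Sitemaps often get wrong.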


Other Sitemap formats
The Sitemap protocol enables you to provide details about your pages to search engines, and we encourage its use since you can provide additional information about site pages beyond just the URLs. However, in addition to the XML protocol, we support RSS feeds and text files, which provide more limited information.
Syndication feed
You can provide an RSS (Really Simple Syndication) 2.0 or Atom 0.3 or 1.0 feed. Generally, you would use this format only if your site already has a syndication feed. Note that this method may not let search engines know about all the URLs in your site, since the feed may only provide information on recent URLs, although search engines can still use that information to find out about other pages on your site during their normal crawling processes by following links inside pages in the feed. Make sure that the feed is located in the highest-level directory you want search engines to crawl. Search engines extract the information from the feed as follows:

  1. <link> field - indicates the URL
  2. Modified date field (the <pubDate> field for RSS feeds and the <updated> date for Atom feeds) - indicates when each URL was last modified. Use of the modified date field is optional.

Text file
You can provide a simple text file that contains one URL per line. The text file must follow these guidelines:

  1. The text file must have one URL per line. The URLs cannot contain embedded new lines.
  2. You must fully specify URLs, including the http.
  3. Each text file can contain a maximum of 50,000 URLs. If your site includes more than 50,000 URLs, you can separate the list into multiple text files and add each one separately.
  4. The text file must use UTF-8 encoding. You can specify this when you save the file (for instance, in Notepad, this is listed in the Encoding menu of the Save As dialog box).
  5. The text file should contain no information other than the list of URLs.
  6. The text file should contain no header or footer information.
  7. You can name the text file anything you wish.

You should upload the text file to the highest-level directory you want search engines to crawl and make sure that you don't list URLs in the text file that are located in a higher-level directory.
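
As a rough sketch of the text file guidelines above, the following Python splits a list of URLs into files of at most 50,000 entries and writes them as UTF-8. The function names, the file name prefix, and the decision to accept https as well as http are assumptions made for illustration.

```python
MAX_URLS_PER_FILE = 50_000  # limit stated in the guidelines above


def split_into_text_sitemaps(urls, limit=MAX_URLS_PER_FILE):
    """Return a list of file bodies, each holding at most `limit` URLs, one per line."""
    cleaned = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # skip blank entries
        if "\n" in url or "\r" in url:
            raise ValueError(f"URL contains an embedded new line: {url!r}")
        # Assumption: https is accepted in addition to http.
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"URL is not fully specified: {url!r}")
        cleaned.append(url)
    return [
        "\n".join(cleaned[start:start + limit]) + "\n"
        for start in range(0, len(cleaned), limit)
    ]


def write_text_sitemaps(urls, prefix="sitemap"):
    """Write sitemap1.txt, sitemap2.txt, ... in UTF-8 and return the file names."""
    names = []
    for number, body in enumerate(split_into_text_sitemaps(urls), start=1):
        name = f"{prefix}{number}.txt"
        with open(name, "w", encoding="utf-8", newline="\n") as handle:
            handle.write(body)
        names.append(name)
    return names
```

Each file contains nothing but URLs, so there is no header or footer to strip before uploading.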

Sitemap Location
The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://example.com/catalog/sitemap.xml can include any URLs starting with http://example.com/catalog/ but cannot include URLs starting with http://example.com/images/.
If you have the permission to change http://example.org/path/sitemap.xml, it is assumed that you also have permission to provide information for URLs with the prefix http://example.org/path/.

Examples of URLs considered valid in http://example.com/catalog/sitemap.xml include:

http://example.com/catalog/show?item=23
http://example.com/catalog/show?item=233&user=3453

URLs not considered valid in http://example.com/catalog/sitemap.xml include:

http://example.com/image/show?item=23
http://example.com/image/show?item=233&user=3453
https://example.com/catalog/page1.php

Note that this means that all URLs listed in the Sitemap must use the same protocol (http, in this example) and reside on the same host as the Sitemap. For instance, if the Sitemap is located at http://www.example.com/sitemap.xml, it can't include URLs from http://subdomain.example.com.
URLs that are not considered valid are dropped from further consideration. It is strongly recommended that you place your Sitemap at the root directory of your web server. For example, if your web server is at example.com, then your Sitemap index file would be at http://example.com/sitemap.xml. In certain cases, you may need to produce different Sitemaps for different paths (e.g., if security permissions in your organization compartmentalize write access to different directories).
If you submit a Sitemap using a path with a port number, you must include that port number as part of the path in each URL listed in the Sitemap file. For instance, if your Sitemap is located at http://www.example.com:100/sitemap.xml, then each URL listed in the Sitemap must begin with http://www.example.com:100.
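
The location rules above (same protocol, same host and port, and a path under the Sitemap's directory) can be approximated with a small check. This is a sketch of the rule as described here, not a search engine's actual logic, and the function name is_in_sitemap_scope is made up for the example.

```python
from urllib.parse import urlsplit


def is_in_sitemap_scope(sitemap_url, page_url):
    """Return True if page_url may be listed in the Sitemap at sitemap_url."""
    sitemap = urlsplit(sitemap_url)
    page = urlsplit(page_url)
    # Directory that holds the Sitemap, e.g. "/catalog/" for "/catalog/sitemap.xml".
    base_path = sitemap.path.rsplit("/", 1)[0] + "/"
    return (
        page.scheme == sitemap.scheme                        # same protocol
        and page.netloc.lower() == sitemap.netloc.lower()    # same host and port
        and page.path.startswith(base_path)                  # under the Sitemap's directory
    )


catalog_sitemap = "http://example.com/catalog/sitemap.xml"
print(is_in_sitemap_scope(catalog_sitemap, "http://example.com/catalog/show?item=23"))
print(is_in_sitemap_scope(catalog_sitemap, "https://example.com/catalog/page1.php"))
```

The first call prints True and the second prints False, because the second URL uses https while the Sitemap is served over http.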


Validating Sitemap
The following XML schemas define the elements and attributes that can appear in your Sitemap file. You can download these schemas from the links below:
For Sitemaps: http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd
For Sitemap index files: http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd
There are a number of tools available to help you validate the structure of your Sitemap based on this schema. You can find a list of XML-related tools at each of the following locations:

http://www.w3.org/XML/Schema#Tools
http://www.xml.com/pub/a/2000/12/13/schematools.html
Another useful site for checking your Sitemap is:
http://www.xml-sitemaps.com/
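
Before running a full schema validator, a quick structural check can catch the most common mistakes. The sketch below is not XSD validation; it only tests that the file is well-formed XML, that the root is <urlset> in the Sitemap 0.9 namespace, and that every <url> has a <loc>. The function name check_sitemap_structure is an assumption for this example.

```python
import xml.etree.ElementTree as ET

# ElementTree writes namespaced tags as "{namespace}tag".
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def check_sitemap_structure(xml_text):
    """Return a list of problems found; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    if root.tag != NS + "urlset":
        return ["root element is not <urlset> in the Sitemap 0.9 namespace"]
    problems = []
    entries = root.findall(NS + "url")
    if not entries:
        problems.append("no <url> entries found")
    for position, entry in enumerate(entries, start=1):
        loc = entry.find(NS + "loc")
        if loc is None or not (loc.text or "").strip():
            problems.append(f"<url> entry {position} has no <loc> value")
    return problems
```

An empty result only means these basic checks passed; the schemas listed above remain the reference for full validation.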

Submitting Sitemap
Once you have created the Sitemap file and placed it on your webserver, you need to inform the search engines that support this protocol of its location. You can do this by:
· submitting it to them via the search engine's submission interface
· specifying the location in your site's robots.txt file
· sending an HTTP request
The search engines can then retrieve your Sitemap and make the URLs available to their crawlers.
Submitting your Sitemap via the search engine's submission interface
To submit your Sitemap directly to a search engine, which will enable you to receive status information and any processing errors, refer to each search engine's documentation.
Specifying the Sitemap location in your robots.txt file
You can specify the location of the Sitemap using a robots.txt file. To do this, simply add the following line:
Sitemap: <sitemap_location>
The <sitemap_location> should be the complete URL to the Sitemap, such as: http://www.example.com/sitemap.xml
This directive is independent of the user-agent line, so it doesn't matter where you place it in your file. If you have a Sitemap index file, you can include the location of just that file. You don't need to list each individual Sitemap listed in the index file.
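
To illustrate that the Sitemap line stands apart from the user-agent groups, here is a small sketch that pulls Sitemap locations out of the text of a robots.txt file. The function name and the sample file contents are hypothetical, and matching the directive name case-insensitively is an assumption.

```python
def sitemap_locations(robots_txt):
    """Return the URLs named by Sitemap: lines in the text of a robots.txt file."""
    locations = []
    for line in robots_txt.splitlines():
        # Split the directive name from its value at the first colon.
        name, separator, value = line.partition(":")
        if separator and name.strip().lower() == "sitemap" and value.strip():
            locations.append(value.strip())
    return locations


# Hypothetical robots.txt content used only for the demonstration.
robots_txt = """\
User-agent: *
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml
"""

print(sitemap_locations(robots_txt))
```

The position of the Sitemap line in the file makes no difference to the result.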
Submitting your Sitemap via an HTTP request
To submit your Sitemap using an HTTP request (replace <searchengine_URL> with the URL provided by the search engine), issue your request to the following URL:
<searchengine_URL>/ping?sitemap=sitemap_url
For example, if your Sitemap is located at http://www.example.com/sitemap.gz, your URL will become:
<searchengine_URL>/ping?sitemap=http://www.example.com/sitemap.gz
URL encode everything after the /ping?sitemap=:
<searchengine_URL>/ping?sitemap=http%3A%2F%2Fwww.yoursite.com%2Fsitemap.gz
You can issue the HTTP request using wget, curl, or another mechanism of your choosing. A successful request will return an HTTP 200 response code; if you receive a different response, you should resubmit your request. The HTTP 200 response code only indicates that the search engine has received your Sitemap, not that the Sitemap itself or the URLs contained in it were valid. An easy way to keep search engines up to date is to set up an automated job that generates and submits Sitemaps on a regular basis.
Note: If you are providing a Sitemap index file, you only need to issue one HTTP request that includes the location of the Sitemap index file; you do not need to issue individual requests for each Sitemap listed in the index.
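
The URL encoding step is easy to get wrong by hand, so here is a minimal sketch that builds the ping URL in Python. The base address http://searchengine.example is a placeholder rather than a real endpoint, and build_ping_url is a name chosen for this example; use the URL your search engine documents.

```python
from urllib.parse import quote


def build_ping_url(search_engine_url, sitemap_url):
    """Return the ping URL with everything after /ping?sitemap= URL encoded."""
    # safe="" makes quote() encode ":" and "/" as well as other reserved characters.
    encoded = quote(sitemap_url, safe="")
    return search_engine_url.rstrip("/") + "/ping?sitemap=" + encoded


# Placeholder search engine address; replace it with the documented one.
ping_url = build_ping_url(
    "http://searchengine.example",
    "http://www.yoursite.com/sitemap.gz",
)
print(ping_url)
```

The resulting URL can then be requested with wget, curl, or Python's urllib.request.urlopen, checking that the response code is 200.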

Excluding Content
The Sitemaps protocol enables you to let search engines know what content you would like indexed. To tell search engines the content you don't want indexed, use a robots.txt file or robots meta tag.
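
As a small illustration of the robots.txt approach, Python's standard urllib.robotparser module can show how a crawler that honors robots.txt would treat a given URL. The rules and URLs below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: keep every crawler out of /private/.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch() reports whether the named user agent may crawl the URL.
allowed_public = parser.can_fetch("*", "http://www.example.com/catalog/page1.php")
allowed_private = parser.can_fetch("*", "http://www.example.com/private/report.html")

print(allowed_public, allowed_private)
```

A URL that is blocked this way should normally be left out of the Sitemap as well, so the two files do not send conflicting signals.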
