Google uses 302 redirects, meta refreshes & has a handful of 404s in their sitemaps?
Below, you'll find out about the 6 sitemaps listed in Google's robots.txt file. They actually have way more than 6, but let's dive in.
If you scroll to the bottom of http://www.google.com/robots.txt you'll find the above image.
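If you'd rather not scroll, the sitemap lines can be pulled out of any robots.txt with a few lines of Python. This is a minimal sketch: the function name and the sample text are made up for illustration, and the sample is only a stand-in for the real file, not a copy of it.

```python
def extract_sitemaps(robots_txt: str) -> list[str]:
    """Return the URL from every 'Sitemap:' line, in file order."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Simplification: treat everything after '#' as a comment.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            sitemaps.append(value.strip())
    return sitemaps


# Illustrative stand-in for a robots.txt file (not Google's actual file).
SAMPLE = """\
User-agent: *
Disallow: /search
Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
sitemap: http://www.google.com/hostednews/sitemap_index.xml
"""

for url in extract_sitemaps(SAMPLE):
    print(url)
```

To run it against the live file, fetch http://www.google.com/robots.txt with something like urllib.request.urlopen and pass the decoded text to the function.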
Sitemap 1 Takeaway: Google Has Your Google Profile in a Sitemap (#2 is Way More Interesting) http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
As you might assume from the title of this sitemap, it contains all Google Profile pages. These are just the Profile pages and not Google+ profiles.
Sitemap 2 Takeaway: Google Keeps 3 News Sites in Sitemaps, Some Longer Than Others http://www.google.com/hostednews/sitemap_index.xml
This contains news articles from the Associated Press, the Agence France-Presse & the European Press Photo Agency. Makes sense that Google would index the articles it uses in Google News. Here's the interesting part that I would love to hear feedback on:
1. Google keeps articles from the Associated Press in sitemaps for 20 days. After that they are dropped. They will still show up in search results, but Google no longer lists them in a sitemap. Google keeps fewer than 500 news articles in the sitemap, and the count goes as low as 207 on some days. It could just be that there aren't more than 207 stories on those days, OR it could be that Google keeps only the top X articles in its sitemaps. That's speculation, but it's noteworthy when looking at the other two agencies.
2. For Agence France-Presse, news stories are kept for 20 days too. The only difference is that Google keeps over 800 news articles in this sitemap. It may be obvious that ALL of Europe would have more news than all of the United States, but it's another piece to keep in mind when looking at the third agency.
3. For the European Press Photo Agency, Google lists about the same number of URLs as it does for the Associated Press, BUT it keeps them in the sitemap for 29 days. Why 9 more days than the other two?
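If you want to check counts and retention windows like the ones above yourself, here is a rough sketch. It assumes each entry carries a standard <lastmod> date (I can't promise the hosted news sitemaps use that tag), and the sample XML and example.com URLs are invented for the demo. The span between the oldest and newest entry is only a proxy for how long articles are kept.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def summarize_sitemap(xml_text: str) -> tuple[int, int]:
    """Return (number of URLs, days between oldest and newest lastmod)."""
    root = ET.fromstring(xml_text)
    urls = root.findall("sm:url", NS)
    dates = []
    for url in urls:
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS)
        if lastmod:
            # Keep only the YYYY-MM-DD part of a W3C datetime.
            dates.append(date.fromisoformat(lastmod.strip()[:10]))
    span_days = (max(dates) - min(dates)).days if dates else 0
    return len(urls), span_days


# Invented sample data, assuming <lastmod> is present on every entry.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/a</loc><lastmod>2012-03-01</lastmod></url>
  <url><loc>http://example.com/b</loc><lastmod>2012-03-21T08:00:00+00:00</lastmod></url>
</urlset>
"""

count, span = summarize_sitemap(SAMPLE)
print(count, "URLs spanning", span, "days")
```

Running this once a day against each agency's sitemap and logging the two numbers would show whether the 20-day and 29-day windows hold up over time.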
Sitemap 3 Takeaway: Google Uses Meta Refreshes http://www.google.com/ventures/sitemap_ventures.xml
Not much to see here, just all of the pages from http://www.googleventures.com/portfolio.html. I missed it at first, but realized the URLs in the sitemap were subfolders like http://www.google.com/ventures/portfolio.html instead of http://www.googleventures.com/portfolio.html.
It must use a 301.
I was wrong.
All of the URLs in this sitemap use a 0-second meta refresh. Google's own Webmaster Tools Help says to use a 301, and other sources say a meta refresh is just not good practice (with good reason too), yet Google uses one on its own pages.
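A meta refresh is easy to miss because the page itself returns a 200. Here is a small sketch for spotting one in a page's HTML using only the standard library. The class and function names are my own, and the sample markup is a guess at what such a tag looks like, not a copy of Google's page.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Record the first <meta http-equiv="refresh"> tag in a document."""

    def __init__(self):
        super().__init__()
        self.refresh = None  # (delay_seconds, target_url) once found

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.refresh is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # content looks like "0; url=http://example.com/"
        delay, _, rest = attr.get("content", "").partition(";")
        target = rest.strip()
        if target.lower().startswith("url="):
            target = target[4:].strip().strip("'\"")
        try:
            self.refresh = (float(delay.strip()), target)
        except ValueError:
            pass  # malformed delay: treat as no refresh


def find_meta_refresh(html: str):
    """Return (delay_seconds, target_url), or None if no refresh tag."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.refresh


# Hypothetical markup for illustration only.
SAMPLE = (
    '<html><head><meta http-equiv="refresh" '
    'content="0; url=http://www.googleventures.com/portfolio.html">'
    "</head><body></body></html>"
)

print(find_meta_refresh(SAMPLE))
```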
Sitemap 4 Takeaway: Google Uses 302 Redirects and Has 404s in Sitemaps http://www.google.com/sitemaps_webmasters.xml
This sitemap contains one URL, https://www.google.co.uk/intl/en/adtoolkit/pdfs/pdf_sitemap.txt. Up until last month it was still working, but now it leads to a 404.
What was in that pdf_sitemap.txt file? URLs that also lead to 404s now, but they were all of the Google guides & case studies like the one that's still indexed below. Nobody is perfect, but it's still interesting to see Google needing to clean up or delete a few items.
We still have the URL, so let's see where it goes. http://www.google.co.uk/intl/en/adtoolkit/pdfs/insights/q109_car_insurance.pdf 302 redirects to http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.co.uk/en/uk/intl/en/adtoolkit/pdfs/insights/q109_car_insurance.pdf which brings back a 404 page.
Why would Google use a 302 redirect? Why would that 302 then lead to a 404?
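To see a chain like this one hop at a time (instead of letting a browser silently follow it), something along these lines works. It's a sketch, not a full crawler: the function names are mine, it only sends HEAD requests, and the demo uses a fake lookup table with made-up example hosts so it runs without touching the network.

```python
import http.client
from urllib.parse import urljoin, urlsplit

REDIRECT_CODES = (301, 302, 303, 307, 308)


def head(url):
    """Issue one HEAD request; return (status, Location header or None)."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def trace_redirects(url, fetch=head, max_hops=10):
    """Follow redirects one hop at a time; return [(url, status), ...]."""
    chain = []
    for _ in range(max_hops):
        status, location = fetch(url)
        chain.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    return chain


# Fake responses (hypothetical hosts) standing in for the network.
FAKE = {
    "http://a.example/guide.pdf": (302, "http://static.example/guide.pdf"),
    "http://static.example/guide.pdf": (404, None),
}

chain = trace_redirects("http://a.example/guide.pdf", fetch=lambda u: FAKE[u])
for hop_url, status in chain:
    print(status, hop_url)
```

Drop the fetch argument to use real HEAD requests. A chain that ends in 404 after a 302, like the one above, shows up as two rows.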
Sitemap 5 Takeaway 1: Google Adheres to 50,000 URLs in a Sitemap
Sitemap 5 Takeaway 2: Google has Over 1 Million Trend Searches in Sitemaps http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml
Sitemap 5 is an index of 24 sitemaps that contain Google Trends searches. In total, Google has 1,167,125 Google Trends URLs across these sitemaps. The trends are all for websites and do not include any keyword trends.
Why would Google want to keep over a million website trends, e.g. http://trends.google.com/websites?q=mountwashington.ca , in their sitemaps?
Some breakdown of sitemap URL totals:
Sitemap 1: 50,004
Sitemap 2: 50,004
Sitemap 3: 50,009
Sitemap 4: 50,001
Sitemap 23: 50,005
The odd thing is that none of these are cached and the URLs are not in any apparent order. All of these sitemaps were also last modified 7/24/2009, so I don't see why they are still relevant enough to be in a sitemap. The sites range from Mitsubishi to high schools to porn, so they aren't limited to one type of website or to any particular ccTLD. Google Trends launched in May of 2006, so possibly this was just the first million-plus sites that were searched for? That's the best I have for the moment, and other guesses are very welcome.
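The 24-file count lines up with the sitemaps.org protocol limit of 50,000 URLs per sitemap file. A quick sanity check of that arithmetic, plus a sketch of how a URL list could be chunked to stay under the limit (the function names are mine, and this is not a claim about how Google generates its files):

```python
import math

MAX_URLS_PER_SITEMAP = 50_000  # limit from the sitemaps.org protocol


def min_sitemaps_needed(total_urls: int, limit: int = MAX_URLS_PER_SITEMAP) -> int:
    """Smallest number of sitemap files that can hold total_urls."""
    return math.ceil(total_urls / limit)


def split_into_sitemaps(urls: list, limit: int = MAX_URLS_PER_SITEMAP) -> list:
    """Chunk a URL list into consecutive groups of at most `limit` items."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


# 1,167,125 is the total quoted above for the Trends sitemaps.
print(min_sitemaps_needed(1_167_125))
```

In other words, 23 full files would only hold 1,150,000 URLs, so a 24th file is needed for the rest.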
Sitemap 6 Takeaway: It Contains Google Dictionary Pages. In Korean. http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml
Google Dictionary shut down in early August of 2011. I'm guessing this is an oversight they need to remove from their sitemaps.
http://www.google.co.kr/dictionary?hl=ko&sl=en&tl=ko&q=censorship
Google doesn't just have sitemaps for their dictionary terms (and not their entire dictionary either, odd), but they are for pages on google.co.kr. Why Google Korea? Yes, search results are different in Korea (really good read if you have the time at SEW) but these pages still aren't being used in those results.
So there's the recap of the sitemaps that appear in Google's robots.txt file. If your inner geek wants to roam around in sitemaps even further, check out the sitemaps for gstatic.com. Lastly, looking at Google's robots.txt started all this. What about YouTube's?
Follow Adam on Twitter.