5 Scrapers for the price of 1 blog post! Here's a collection of 64 new ImportXML formulas and 5 spreadsheets to jump-start your scrape-fest.
I wanted to update my original ImportXML Cookbook post, so I recently asked Twitter for requests (DJ Chris Le... get it?! Hahaha... ok, nevermind). The majority of requests were for social. No surprise, right? Well, here they all are! Let's get cooking!
Preparations
1. You'll need a new Google Docs spreadsheet.
2. In cell A1, type in the URL you want to scrape.
3. In any other cell copy-and-paste the recipe. Press enter.
Noob tip: Use the example sheets as your starting point.
Pro tip: Use these recipes in any existing Google Docs spreadsheet! No code. Made with 100% pure formulas.
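Before diving into the recipes, it helps to confirm ImportXML is wired up at all. Here's a minimal sanity check (not one of the 64 recipes, just a generic formula): with any URL in A1, this pulls the page's title tag.
Page title =ImportXML(A1, "//title")
If that returns something sensible, the fancier XPath below should work the same way.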
Quora: Followers of a Topic
Try the Quora starter spreadsheet
First up, by request (@dan_shure): Quora! The best way to get this URL is simply to find the topic on Quora, click through to its followers, and then copy and paste that URL into A1.
Example URL for cell A1: http://www.quora.com/Search-Engine-Optimization-SEO/followers
User's name =ImportXML(A1, "//a[@class='user']")
URL to user's profile =ImportXML(A1, "//a[@class='user']/@href")
Number of answers =ImportXML(A1, "//a[contains(@href, 'answers')]")
(Pro tip): To get around Google Docs' 50 ImportXML limit, convert cells into values. First, select a bunch of cells, then hit copy, and then "paste as values."
This will permanently record the results into the cells and free up some ImportXMLs for you. I usually leave one row of formulas untouched so that I can use it later to pull more data before converting those results into values too.
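One more note on the profile URL recipe above: if the hrefs come back as relative paths (something like /Ian-Lurie rather than a full address), a quick CONCATENATE turns them into clickable links. This assumes the scraped hrefs landed in column B starting at B2; adjust the reference to wherever yours ended up.
Full profile URL =CONCATENATE("http://www.quora.com", B2)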
Quora: User Profiles
Example URL for cell A1 http://www.quora.com/Ian-Lurie
Photo =ImportXML(A1, "//meta[@property='og:image']/@content")
Name =ImportXML(A1, "//meta[@property='og:title']/@content")
Twitter profile =ImportXML(A1, "//a[@class='sn_icon' and contains(@href, 'twitter.com')]/@href")
Facebook profile =ImportXML(A1, "//a[@class='sn_icon' and contains(@href, 'facebook.com')]/@href")
Self description =ImportXML(A1, "//meta[@property='og:description']/@content")
Canonical URL on Quora =ImportXML(A1, "//link[@rel='canonical']/@href")
RSS feed on Quora =ImportXML(A1, "//link[@rel='alternate']/@href")
Topics =ImportXML(A1, "//a[@class='topic_name']//span[@class='name_text']/span")
Number of followers =ImportXML(A1, "//a[@class='link_label' and contains(@href, 'followers')]/span")
Number of topics =ImportXML(A1, "//a[@class='link_label' and contains(@href, 'topics')]/span")
Number of questions =ImportXML(A1, "//a[@class='link_label' and contains(@href, 'questions')]/span")
Number of answers =ImportXML(A1, "//a[@class='link_label' and contains(@href, 'answers')]/span")
Number of edits =ImportXML(A1, "//a[@class='link_label' and contains(@href, 'log')]/span")
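Want to run these profile recipes over a whole list of people instead of a single A1 cell? One way (a sketch, not the only way): put one profile URL per row in column A starting at A2, swap A1 for $A2 in the formulas, and drag them down. For example:
Name =ImportXML($A2, "//meta[@property='og:title']/@content")
Number of followers =ImportXML($A2, "//a[@class='link_label' and contains(@href, 'followers')]/span")
Just remember every row eats into the 50 ImportXML limit, so paste-as-values early and often.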
Quora: User's Followers
Example URL for cell A1 http://www.quora.com/Ian-Lurie/followers
User's name =ImportXML(A1, "//a[@class='user']")
URL to user's profile =ImportXML(A1, "//a[@class='user']/@href")
Number of answers =ImportXML(A1, "//a[contains(@href, 'answers')]")
Reddit
Reddit scraper starter spreadsheet
The URL for Reddit is best typed in manually instead of copying what you see in your browser. Replace "importxml" in the URL with your own search term.
Example URL for cell A1 http://www.reddit.com/search?q=importxml&sort=new
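If you'd rather keep the search term in its own cell instead of hand-editing the URL, you can assemble it with CONCATENATE. This is just one way to wire it up, assuming your keyword sits in B1 (multi-word terms may need + or %20 in place of spaces):
Search URL =CONCATENATE("http://www.reddit.com/search?q=", B1, "&sort=new")
Drop that formula straight into A1 and the recipes below work unchanged.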
Title =ImportXML(A1, "//p[@class='title']/a")
URL =ImportXML(A1, "//p[@class='title']/a/@href")
Submitted on (as UTC) =ImportXML(A1, "//p[@class='title']/../p[@class='tagline']/time/@title")
Author URL =ImportXML(A1, "//p[@class='title']/../p[@class='tagline']/a[@href][1]/@href")
Posted to =ImportXML(A1, "//p[@class='title']/../p[@class='tagline']/a[@href][2]/@href")
Topsy
Topsy scraper starter spreadsheet
Example URL for cell A1 http://topsy.com/s?q=importxml
Title =ImportXML(A1, "//h3[@class='title']")
URL =ImportXML(A1, "//h3[@class='title']/a/@href")
Tweeted by name =ImportXML(A1, "//div[@class='author-bar']/a")
Tweet =ImportXML(A1, "//span[contains(@class, 'twitter-post-text')]")
Original tweet URL =ImportXML(A1, "//div[@class='actions']/a[contains(@class, 'date-link')]/@href[1]")
Trackback count =ImportXML(A1, "//div[@class='actions']/a[contains(@class, 'trackback-link')]")
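A quick way to see how many results a recipe actually pulled back is a plain COUNTA over the output column. For instance, if the titles spilled into B2:B51 (adjust the range to match your sheet):
Result count =COUNTA(B2:B51)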
Topsy: Twitter Trackback
Stats on Influential people who RT'ed you spreadsheet
This is where the fun really is. Combine lots of these ImportXMLs to make some really great spreadsheets. First, find the page you want to get trackbacks on, copy that URL into A1, and get going!
Example URL for cell A1 http://topsy.com/www.seerinteractive.com/blog/importxml-cookbook
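If you keep a list of your own posts in a column, you can generate the Topsy trackback URLs in bulk instead of pasting them one at a time. A sketch, assuming your post URLs sit in column A starting at A2 and that Topsy keeps accepting the scheme-less format shown above:
Trackback URL =CONCATENATE("http://topsy.com/", SUBSTITUTE(A2, "http://", ""))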
Posts count =ImportXML(A1, "//div[@class='retweet-button-box']/span[@class='count']")
Tags list =ImportXML(A1, "//ul[@class='tags-list']/li[@class='tag']")
Related (influential): twitter name =ImportXML(A1, "//span[@class='author-bar']/*[contains(text(), 'Influential')]/../a")
Related (influential): tweet =ImportXML(A1, "//span[@class='author-bar']/*[contains(text(), 'Influential')]/../../span[@class='twitter-post-text']")
Related (influential): tweet date =ImportXML(A1, "//span[@class='author-bar']/*[contains(text(), 'Influential')]/../../../div[@class='actions']/a[contains(@class, 'date-link')]")
Related (influential): tweet URL =ImportXML(A1, "//span[@class='author-bar']/*[contains(text(), 'Influential')]/../../../div[@class='actions']/a[contains(@class, 'date-link')]/@href")
Related posts: twitter name =ImportXML(A1, "//div[@class='twitter-post-big']//a[contains(@class, 'author-link')]")
Related posts: tweet =ImportXML(A1, "//div[@class='twitter-post-big']//span[contains(@class, 'twitter-post-text')]")
Related posts: tweet date =ImportXML(A1, "//div[@class='twitter-post-big']//div[contains(@class, 'actions')]/a[contains(@class, 'date-link')]")
Related posts: tweet URL =ImportXML(A1, "//div[@class='twitter-post-big']//div[contains(@class, 'actions')]/a[contains(@class, 'date-link')]/@href")
(Pro tip): Interested in stalking people for links? (Links to slide deck by @wilreynolds)
Combine these with the Twitter ImportXMLs below to devastating effect. Use Google Docs to automate your hit list. You can be creepy and efficient!
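For example, if the influential Twitter names from the trackback recipes land in column B, something like this pulls each person's follower count straight from their profile using one of the Twitter recipes further down (this assumes the scraped names are plain usernames without the @; if not, clean them up first):
Follower count =ImportXML(CONCATENATE("http://twitter.com/", B2), "//span[@id='follower_count']")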
LinkedIn: User's profile
Here are some updates to the original LinkedIn ImportXML formulas from the first ImportXML Cookbook.
Example URL for cell A1 http://www.linkedin.com/in/wilreynolds
Current =ImportXML(A1, "//dd[@class='summary-current']//ul[@class='current']/*")
Number of connections =ImportXML(A1, "//dd[@class='overview-connections']//strong")
Profile summary =ImportXML(A1, "//p[@class=' description summary']")
Twitter: User's Profile
Here are some more updates and additions to the original Twitter formulas from the first ImportXML Cookbook.
Example URL for cell A1 http://www.twitter.com/djchrisle
Bio =ImportXML(A1, "//span[@class='bio']")
Location =ImportXML(A1, "//span[@class='adr']")
Website =ImportXML(A1, "//a[@class='url' and @rel='me nofollow']/@href")
Following =ImportXML(A1, "//span[@id='following_count']")
Following =ImportXML(A1, "//span[@id='follower_count']")
Lists count =ImportXML(A1, "//span[@id='lists_count']")
Number of Tweets =ImportXML(A1, "//span[@id='update_count']")
Randomly selected followed names =ImportXML(A1, "//div[@id='following_list']/span/a/@title")
Randomly selected followed URLs (add http://twitter.com; see the note after this list) =ImportXML(A1, "//div[@id='following_list']/span/a/@href")
RSS Feed of tweets =ImportXML(A1, "//div[@id='rssfeed']/*[contains(@href, 'statuses')]/@href")
RSS Feed of favorites =ImportXML(A1, "//div[@id='rssfeed']/*[contains(@href, 'favorites')]/@href")
Tweets (recent) =ImportXML(A1, "//span[@class='status-body']//span[@class='status-content']")
Tweets (timestamp) =ImportXML(A1, "//span[@class='status-body']//span[@class='published timestamp']/@data")
Users recently mentioned in tweets =ImportXML(A1, "//span[@class='entry-content']/a[@class='tweet-url username']")
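As noted in the followed-URLs recipe above, those hrefs come back as bare paths, so a CONCATENATE turns them into full profile links. This assumes the scraped paths landed in column B starting at B2; point it wherever yours are.
Full profile URL =CONCATENATE("http://twitter.com", B2)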
Google Insights
Google Insights scraper starter spreadsheet (say that 3 times fast)
And the last song for tonight, this one goes out to @dsottimano and @dohertyjf! Google Insights!
Example URL for cell A1 http://www.google.com/insights/search/overviewReport?q=seo%20company&content=1
Headline Trends =ImportXML(A1, "//td[@class='trends-headline-headline']/a")
Headline Trends URLs =ImportXML(A1, "//td[@class='trends-headline-headline']/a/@href")
Rising search terms =ImportXML(A1, "//div[@id='overviewRisingSearchesDiv']/table/tr[2]//table[@class='trends-table-data']/tr/td[2]/a[1]")
Rising search percent =ImportXML(A1, "//div[@id='overviewRisingSearchesDiv']/table/tr[2]//table[@class='trends-table-data']/tr/td[3]")
Top search terms =ImportXML(A1, "//div[@id='overviewTopSearchesDiv']/table/tr[2]//table[@class='trends-table-data']/tr/td[2]/a[1]")
Trivia: The trick to finding this one is that the content only loads when you add the query string "content=1". By turning on "Log XMLHttpRequests" in the console of Chrome's developer tools, you can watch the JavaScript make its AJAX calls and start figuring out the right URLs to hit.
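If you want to swap keywords without hand-editing that URL, you can build it in the sheet too. A rough sketch, assuming your keyword sits in B1 (spaces have to become %20 in the query string):
Report URL =CONCATENATE("http://www.google.com/insights/search/overviewReport?q=", SUBSTITUTE(B1, " ", "%20"), "&content=1")
Paste that formula into A1 and the recipes above pick it up.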
OMG! Epic! Here's more Google Docs insanity from SEER:
Visualize Your Backlinks with Google Fusion Tables (May 2012)
Google Search Scraper (February 2012)
SEOmoz Data for Google Docs (February 2012)
(Not Provided) Report in Google Docs (November 2011)
ImportXML Cookbook 1 (September 2011)
Want more?
(Shameless plug) Talk to me on Twitter @djchrisle. I take requests and birthday shout-outs :-D