Scraping data with the browser

Chrome's developer tools make it easy to scrape data from web pages. I'll demonstrate this by grabbing a list of ISO country codes and country names from Wikipedia.

Before we begin, a few general tips for working with the console (on OS X):

- Enter runs a statement; Shift-Enter inserts a newline, so you can write multi-line functions.
- The up and down arrows cycle through your command history.
- Cmd-K clears the console.

Finding the element to scrape

First off, we want to find the element that contains our data. Using the elements panel and right click -> 'Inspect element', we highlight the element containing our data.

Building a pipeline

Next we move to the console. Chrome places the most recently inspected element in the $0 variable, the next most recent in $1, and so on. It also aliases querySelectorAll as $$(css [, startNode]), so we can use these together to check out the rows in our table:

> $$("tr",$0)[0].innerHTML
"<td><a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan</a></td>
<td><a href="/wiki/ISO_3166-1_alpha-2#AF" title="ISO 3166-1 alpha-2"><tt>AF</tt></a></td>
<td><tt>AFG</tt></td>
<td><tt>004</tt></td>
<td><a href="/wiki/ISO_3166-2:AF" title="ISO 3166-2:AF">ISO 3166-2:AF</a></td>"

Great! Looks like some useful data in that HTML. I normally slap [0] on the end of the pipeline like this to get a quick look at what's being produced.

To grab the data from the HTML in another one-liner we can use one of the Array extras: map. Unfortunately $$ returns a NodeList, which is array-like but not an array. To work around this we grab [].map and .call it on the NodeList.
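The same trick works on any array-like object, not just NodeLists; here's a minimal sketch you can run anywhere (the object below just stands in for a NodeList):

```javascript
// An array-like object -- the same shape as the NodeList that $$ returns:
// numeric indices plus a length, but no Array methods of its own
var nodeListLike = { 0: "af", 1: "al", 2: "zw", length: 3 };

// Borrow Array.prototype.map via an empty array literal and .call it
// on the array-like; the result is a real Array
var codes = [].map.call(nodeListLike, function (s) {
  return s.toUpperCase();
});
// codes is ["AF", "AL", "ZW"]
```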

> [].map.call($$("tr",$0),mapper)[0]

We'll need to define mapper() - a function that takes each element in turn and returns the data we want as a structured object. In this case we want the contents of the 1st and 2nd nodes, so:

> function mapper(el) {
    return {
      country: $$("td", el)[0].innerText,
      code: $$("td", el)[1].innerText
    }
  }
> [].map.call($$("tr",$0),mapper)[0]
Object {country: "Afghanistan", code: "AF"}

Excellent - just what we want. Now we have an array full of tasty data, but how do we get it out?

Chrome saves the day again with copy(), giving us the data-to-clipboard bridge we've always wanted. Since we want the data in a portable format, we'll JSON.stringify it first, and our completed pipeline looks like:

> var data = [].map.call($$("tr",$0),mapper);
> copy(JSON.stringify(data))
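Note that copy() is a DevTools-only helper, not part of the language; the JSON.stringify half of the pipeline, though, is standard and works anywhere. A quick sketch with some stand-in data:

```javascript
// Stand-in for the scraped rows
var data = [
  { country: "Afghanistan", code: "AF" },
  { country: "Albania", code: "AL" }
];

// Serialise to a JSON string -- this is what copy() puts on the clipboard
var json = JSON.stringify(data);
// json === '[{"country":"Afghanistan","code":"AF"},{"country":"Albania","code":"AL"}]'
```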

What does our final data look like?

[{"country":"Afghanistan","code":"AF"},{"country":"Åland Islands","code":"AX"},/* ... */,{"country":"Zimbabwe","code":"ZW"}]
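The round trip back is just JSON.parse. For example, once the JSON is pasted into a file or script, you can turn the array into a code-to-name lookup (a short excerpt of the data is used here for illustration):

```javascript
// A short excerpt of the scraped output
var countries = JSON.parse(
  '[{"country":"Afghanistan","code":"AF"},{"country":"Zimbabwe","code":"ZW"}]'
);

// Build a code -> name lookup table from the array
var byCode = {};
countries.forEach(function (c) {
  byCode[c.code] = c.country;
});
// byCode.ZW is "Zimbabwe"
```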

Here's the whole pipeline, built up bit by bit:

> var nodes = $$("tr",$0);
> function mapper(el) {
    return {
      country: $$("td", el)[0].innerText,
      code: $$("td", el)[1].innerText
    }
  }
> var data = [].map.call(nodes,mapper)
> copy(JSON.stringify(data))

Happy scraping!
