GeoCoding, R, and The Rolling Stones – Part 1

Posted: March 20, 2013 in GeoCoding XML processing

In this article I discuss a general approach for geocoding a location from within R, processing XML reports, and using R packages to create interactive maps. There are various ways to accomplish this, though Google’s Geocoding service is a good place to start. We’ll also talk a bit about the XML package, which is a very useful tool for parsing the reports returned from Google. XML is a markup language with wide support across Internet data services, so knowing how to process it is helpful. This post sets us up for Part 2, wherein we’ll use our knowledge to create a map of the tour dates on the Rolling Stones 1975 Tour of the Americas. Also, when I use the word “GeoCoding” I mean the process of taking a geographic location and turning it into a latitude / longitude pair.

What does Google Offer ?

Check out the main Geocoding page, which presents implementation details of the API as well as use policies and limitations.

https://developers.google.com/maps/documentation/geocoding/

As an example let’s find out what the latitude and longitude are for “Atlanta, GA”. Actually we can get more specific than this by specifying an address or a zip code, but let’s keep it simple at first. The Google API (Application Programming Interface) is very forgiving and will work with almost any valid geographic information that we provide. We could enter just a zip code, a complete address, or an international location and Google will happily process it. Back to our example: according to the Google specs we would create a URL that looks like:

http://maps.googleapis.com/maps/api/geocode/xml?address=Atlanta,GA&sensor=false

Notice that I have placed Atlanta,GA into the URL as the value of the “address” parameter. According to Google specs we must also append the string “sensor=false”, joined to the rest of the query with an “&”. This tells Google that the query is not coming from a device with a location sensor (a GPS unit, for example) but from a real person (us) or perhaps a program. If you paste the above URL into a web browser you will get something back that looks very much like the following.

<GeocodeResponse>
   <status>OK</status>
   <result>
      <type>locality</type>
      <type>political</type>
      <formatted_address>Atlanta, GA, USA</formatted_address>
      ..
      ..
      <geometry>
         <location>
            <lat>33.7489954</lat>
            <lng>-84.3879824</lng>
         </location>
      ..
      ..
      </geometry>
    </result>
</GeocodeResponse>

This is what is known as an XML document (eXtensible Markup Language). It might look scary at first but in reality this report contains a lot of helpful information. Note that there are “tags” that describe the content as it is being returned. That is, what we get back is a document that basically describes itself. This is a very useful attribute of XML. We can pick out only that information we want while ignoring the rest.

In this case all we care about are the “lat” and “lng” tags. We can scan through the document and see that the latitude and longitude for Atlanta, GA are 33.7489954, -84.3879824. It is this self-documenting feature of XML that makes it ideal for processing by programming languages such as R (of course), Perl, Python, Java, and others.

But would you want to visually accomplish this for 10, 100, or 1,000 locations ? I wouldn’t even want to do it for 2 ! So this is when we put the power of R to work. Let’s start up R and install some packages that will help us “talk” to the Google API.

install.packages("XML", dependencies=TRUE)
library(XML)

install.packages("RCurl", dependencies=TRUE)
library(RCurl)

Creating a URL string and Passing it to Google

We have a destination of “Atlanta, GA” that we wish to GeoCode into a latitude longitude pair. We’ll need to build a URL for eventual passing to the Google API. We can easily do this using the paste function.

google.url = "http://maps.googleapis.com/maps/api/geocode/xml?address="
query.url = paste(google.url, "Atlanta,GA", "&sensor=false", sep="")

query.url
[1] "http://maps.googleapis.com/maps/api/geocode/xml?address=Atlanta,GA&sensor=false"
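Pasting works well for a simple string like “Atlanta,GA”, but addresses with spaces or other special characters are safer when percent-encoded first. Here is a minimal sketch of that idea using URLencode from the utils package that ships with R. The helper name build_geocode_url is made up for illustration and is not part of any package.

```r
# build_geocode_url is a hypothetical helper name used only for this sketch.
# It percent-encodes the address and then assembles the request URL.
build_geocode_url <- function(address,
                              base = "http://maps.googleapis.com/maps/api/geocode/xml") {
  # reserved = TRUE also encodes characters such as "," and "&",
  # so they cannot be mistaken for part of the query syntax
  encoded <- utils::URLencode(address, reserved = TRUE)
  paste0(base, "?address=", encoded, "&sensor=false")
}

build_geocode_url("Palo Alto, CA")
```

The space becomes %20 and the comma becomes %2C, so the address travels as a single parameter value.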

Okay great. We are now ready to send this over to Google. This is easy using the getURL function that is part of the RCurl package.

txt = getURL(query.url)
xml.report = xmlTreeParse(txt,useInternalNodes=TRUE)

What we have here is a variable called “xml.report” that now contains the XML report that we saw in the previous section. If you don’t believe me then check out its contents:

xml.report

<?xml version="1.0" encoding="UTF-8"?>
<GeocodeResponse>
  <status>OK</status>
  <result>
    <type>locality</type>
    <type>political</type>
    <formatted_address>Atlanta, GA, USA</formatted_address>
    ..
    <geometry>
      <location>
        <lat>33.7489954</lat>
        <lng>-84.3879824</lng>
      </location>
      ..
      ..
      </GeocodeResponse>

Parsing the Returned XML Result

Don’t get too worried about how the xmlTreeParse function works at this point. Just use it as a “black box” that one implements to get the XML report. The next step uses the function getNodeSet to locate the latitude and longitude strings in the report. This is something that we did visually in the previous section, though to do it from within R we need to use an “XPath expression” to search through the document for the lat/lon.

place = getNodeSet(xml.report,"//GeocodeResponse/result[1]/geometry/location[1]/*")

XPath is the XML Path Language, which is a query language for selecting “nodes” from an XML document. A full discussion of it is beyond the scope of this post. At this point just think of it as a way to search an XML document for specific nodes. If you look at the string we pass to getNodeSet you might wonder what it means.

//GeocodeResponse/result[1]/geometry/location[1]/*

This is a string that we use to match specific tags in the XML report. We can “read” it as follows. Starting anywhere in xml.report (the leading “//”), we match the “GeocodeResponse” node, then the first “result” node inside it, then the “geometry” node, and then the first “location” node. The trailing “/*” selects every node directly inside “location”, which here means the “lat” and “lng” nodes. This gives us the section of the report that relates to the latitude and longitude. To verify manually, you can visually scan through the document.

<GeocodeResponse>
  <result>
        <geometry>
             <location>
                <lat>33.7489954</lat>
                <lng>-84.3879824</lng>

Notice that in the xml.report the lines are indented, which shows that some nodes are contained within others, so the document has a hierarchy. When you “match” the GeocodeResponse line, you are matching a “node” in the document that contains other nodes. So in our example the “result” node is contained within the “GeocodeResponse” node. The “geometry” node is contained within the “result” node. The “location” node is contained within the “geometry” node.
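To make the hierarchy concrete, here is a small sketch that parses a trimmed copy of the report from a string, so no request to Google is needed. The variable names (txt, doc, loc) are just for illustration. It uses xmlParse, getNodeSet, xmlParent, xmlName, and xpathSApply from the XML package.

```r
library(XML)

# A trimmed copy of the report shown earlier, kept inline so the
# example does not depend on a network request.
txt <- "<GeocodeResponse>
  <status>OK</status>
  <result>
    <geometry>
      <location>
        <lat>33.7489954</lat>
        <lng>-84.3879824</lng>
      </location>
    </geometry>
  </result>
</GeocodeResponse>"

doc <- xmlParse(txt, asText = TRUE)

# Find the location node, then climb back up the hierarchy
loc <- getNodeSet(doc, "//GeocodeResponse/result[1]/geometry/location[1]")[[1]]
xmlName(xmlParent(loc))             # "geometry"
xmlName(xmlParent(xmlParent(loc)))  # "result"

# The trailing /* picks out the nodes directly inside location
xpathSApply(doc, "//location/*", xmlName)   # "lat" "lng"
coords <- as.numeric(xpathSApply(doc, "//location/*", xmlValue))
```

Climbing from “location” to its parents gives the same containment order that the indentation suggests.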

We can then extract the values contained in the “location” node. Note that in real world applications you would first use a tool such as XMLFinder to develop the correct XPath expression after which you would implement it in R as we have done here. The Firefox browser also has a plugin that allows one to parse XML reports. Finally, we use an R statement to turn the variable place (from up above) into numeric values.

lat.lon = as.numeric(sapply(place,xmlValue))
lat.lon
[1]  33.7 -84.4

Write a Function to Do This Work !

So you might think that was a lot of work. Well, maybe it was, though consider that we can now parse any XML document returned by the Google Geocoding service and find the lat / lon pair. Not only that, we can parse the returned XML file and extract any information contained therein. Since we wouldn’t want to do this manually for every GeoCoding query we might have, let’s write a function in R to do this for us. Here is one way to do it:

myGeo <- function(address="Atlanta,GA") {

# Make sure we have the required libraries to do the work

  stopifnot(require(RCurl))
  stopifnot(require(XML))
  
# Remove any spaces in the address field and replace with "+"

   address = gsub(" ","\\+",address)
   
# Create the request URL according to Google specs

  google.url = "http://maps.googleapis.com/maps/api/geocode/xml?address="
  my.url = paste(google.url,address,"&sensor=false",sep="")

  Sys.sleep(0.5) # Just so we don't beat up the Google servers with rapid fire requests
  
# Send the request to Google and turn it into an internal XML document
  
  txt = getURL(my.url)
  xml.report = xmlTreeParse(txt, useInternalNodes=TRUE)

# Pull out the lat/lon pair and return it  

  place = getNodeSet(xml.report,  "//GeocodeResponse/result[1]/geometry/location[1]/*")
  lat.lon = as.numeric(sapply(place,xmlValue))
  names(lat.lon) = c("lat","lon")
  
  return(lat.lon)
}

myGeo()
 lat   lon 
 33.7 -84.4 

myGeo("Palo Alto,CA")
   lat    lon 
  37.4 -122.1 
  
myGeo("Paris,FR")
  lat   lon 
48.86  2.35 

Now. We’ve actually got more capability than we think we do. Remember that the Google API will accept almost any reasonable geographic information. So check out the following. With just a few lines of code we have a pretty powerful way to GeoCode generally specified addresses without having to do lots of preparation and string processing within our R function.

myGeo("1600 Pennsylvania Avenue, Washington, DC")
  lat   lon 
 38.9 -77.0 

myGeo("Champs-de-Mars, Paris 75007")  # Address of the Eiffel Tower
lat  lon 
48.9  2.3 
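One thing myGeo does not do is look at the “status” tag in the report, so a query that matches nothing can cause trouble when the function tries to name an empty result. Below is a hedged sketch of how the parsing step might guard against that. The helper name parse_geocode is hypothetical. It works on the XML text rather than on a live request, and the two sample reports are hand-written stand-ins. ZERO_RESULTS is the status value Google documents for a query with no match.

```r
library(XML)

# parse_geocode is a hypothetical helper for this sketch. It takes the
# XML text of a geocoding report and returns c(lat, lon), or a pair of
# NAs when the status is not "OK" or no coordinates are present.
parse_geocode <- function(txt) {
  na.pair <- c(lat = NA_real_, lon = NA_real_)
  doc <- xmlParse(txt, asText = TRUE)

  # Check the status tag before looking for coordinates
  status <- xpathSApply(doc, "/GeocodeResponse/status", xmlValue)
  if (length(status) != 1 || status != "OK") return(na.pair)

  place <- getNodeSet(doc, "/GeocodeResponse/result[1]/geometry/location[1]/*")
  lat.lon <- as.numeric(sapply(place, xmlValue))
  if (length(lat.lon) != 2) return(na.pair)

  names(lat.lon) <- c("lat", "lon")
  lat.lon
}

# Hand-written sample reports standing in for real responses
ok  <- "<GeocodeResponse><status>OK</status><result><geometry><location><lat>33.7489954</lat><lng>-84.3879824</lng></location></geometry></result></GeocodeResponse>"
bad <- "<GeocodeResponse><status>ZERO_RESULTS</status></GeocodeResponse>"

parse_geocode(ok)
parse_geocode(bad)
```

Returning a named pair of NAs keeps the result the same shape for every query, which matters once we start processing many addresses at a time.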

Using the myGeo function in a Real Example

Let’s see how we might use this in a real situation. Here is a data frame called geog, which contains columns named city and state.

geog
            city state
1       Glendale    CA
2      DesMoines    IA
3    Albuquerque    NM
4           Waco    TX
5       Honolulu    HI
6  Indianaopolis    IN
7     Pittsburgh    PA
8     Clearwater    FL
9      Sunnyvale    CA
10    Bridgeport    CT

Now. Look how easy it is to process all of these rows:

t(sapply(paste(geog$city,geog$state),myGeo))

                  lat    lon
Glendale CA      34.1 -118.3
DesMoines IA     41.6  -93.6
Albuquerque NM   35.1 -106.6
Waco TX          31.5  -97.1
Honolulu HI      21.3 -157.9
Indianaopolis IN 39.8  -86.2
Pittsburgh PA    40.4  -80.0
Clearwater FL    28.0  -82.8
Sunnyvale CA     37.4 -122.0
Bridgeport CT    41.2  -73.2
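The result above is a matrix whose row names are the addresses. If we want the coordinates alongside the original columns, one option is to copy them back into the data frame. The sketch below shows the idea on three rows. The function fake.geo is a hypothetical offline stand-in for myGeo that simply looks up the values printed above, so the example runs without calling Google.

```r
# A three-row stand-in for the geog data frame
geog <- data.frame(city  = c("Glendale", "Waco", "Honolulu"),
                   state = c("CA", "TX", "HI"),
                   stringsAsFactors = FALSE)

# fake.geo is a hypothetical offline replacement for myGeo. It returns
# coordinates copied from the output above instead of querying Google.
fake.geo <- function(address) {
  known <- list("Glendale CA" = c(lat = 34.1, lon = -118.3),
                "Waco TX"     = c(lat = 31.5, lon = -97.1),
                "Honolulu HI" = c(lat = 21.3, lon = -157.9))
  known[[address]]
}

# vapply insists that every call returns exactly two numbers
coords <- t(vapply(paste(geog$city, geog$state), fake.geo,
                   FUN.VALUE = c(lat = 0, lon = 0)))

# Copy the coordinates back onto the matching rows
geog$lat <- unname(coords[, "lat"])
geog$lon <- unname(coords[, "lon"])
geog
```

Swapping fake.geo for myGeo gives the same structure with live results. vapply is used here instead of sapply because it stops with an error if any lookup returns something other than a lat / lon pair.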

Our function should also work with a vector of zip codes.

myzipcodes
 [1] "90039" "50301" "87101" "76701" "96801" "46201" "15122" "33755" "94085" "06601"
 
t(sapply(myzipcodes,myGeo))

       lat    lon
90039 34.1 -118.3
50301 41.6  -93.6
87101 35.1 -106.6
76701 31.6  -97.1
96801 21.3 -157.9
46201 39.8  -86.1
15122 40.4  -79.9
33755 28.0  -82.8
94085 37.4 -122.0
06601 41.2  -73.2
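Note that myzipcodes is a character vector. That matters for zip codes such as “06601”, because a numeric 6601 has already lost its leading zero. If your zip codes arrive as numbers, a small sketch like the following (the variable names are made up for illustration) pads them back to five characters before geocoding.

```r
# Zip codes stored as numbers lose any leading zero
zips.numeric <- c(90039, 6601)

# Pad back to five digits and convert to character strings
zips.char <- sprintf("%05d", as.integer(zips.numeric))
zips.char
```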

Okay please check out Part 2 of this post to see how we process the tour data from the
1975 Rolling Stones “Tour of the Americas”.

Comments
  1. Steve says:

    Reblogged this on Rolling Your Rs.

  2. eliavseliav says:

    two things 1. there is a geocode command in the ggmap package might make thing easier 2. can you get by the google api restriction of 2,500 requests per 24 hour?

    • Steve says:

      Thanks for reading and taking the time to comment. Yes as time goes by there are packages which will embed such capability although what I’m proposing here is a general approach to parsing data from any service that offers information in XML such as Wunderground Weather, stock services, consumer rating, crime reports, restaurant health reports, etc. Thus, the approach outlined in the post can easily be extended to other types of data and functions not just GeoCoding. Relative to your question about working around the Google limit – there are paid solutions for unlimited geocoding approaches. Another free approach is to download GeoName’s “allcounties” file and import it into MySQL and Geocode that way using RMySQL. This is what I’ve done for another project. See this link for more information. http://forum.geonames.org/gforum/posts/list/80.page Hope this helps

  3. jazz says:

    Hi, thank you very much for your helpful tutoring here. I would like to ask a basic question. When I run your code, I receive an error message. The message says, “The ‘sensor’ parameter specified in the request must be set to either ‘true’ or ‘false’.” It seems to me that R may not like sensor=false. Do you have any ideas of what is going on here? Thank you very much for your help.

    • Steve says:

      Sorry about that. I had a typo in the text. I’ve fixed it now. The “sensor=false” phrase must be prefixed with a “&”. If you copy and paste the statements now they will work for you. Thanks
