Vectors are at the heart of R and are a true convenience. Moreover, vectors are essential for good performance, especially when you are working with lots of data. We’ll explore these concepts in this post. As a motivating example, let’s generate a sequence of values from -3 to 3 and evaluate the Normal density function (dnorm) at each point. Our immediate goal is to draw the classic bell curve that everyone knows. So here we start out with the x values:


x = seq(-3, 3, by=0.05) # The "by" argument lets us specify a "step" or "interval" value

x
  [1] -3.00 -2.95 -2.90 -2.85 -2.80 -2.75 -2.70 -2.65 -2.60 -2.55 -2.50 -2.45
 [13] -2.40 -2.35 -2.30 -2.25 -2.20 -2.15 -2.10 -2.05 -2.00 -1.95 -1.90 -1.85
 [25] -1.80 -1.75 -1.70 -1.65 -1.60 -1.55 -1.50 -1.45 -1.40 -1.35 -1.30 -1.25
 [37] -1.20 -1.15 -1.10 -1.05 -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65
 [49] -0.60 -0.55 -0.50 -0.45 -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05
 [61]  0.00  0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  0.45  0.50  0.55
 [73]  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00  1.05  1.10  1.15
 [85]  1.20  1.25  1.30  1.35  1.40  1.45  1.50  1.55  1.60  1.65  1.70  1.75
 [97]  1.80  1.85  1.90  1.95  2.00  2.05  2.10  2.15  2.20  2.25  2.30  2.35
[109]  2.40  2.45  2.50  2.55  2.60  2.65  2.70  2.75  2.80  2.85  2.90  2.95
[121]  3.00

# Next we generate the corresponding y values

y = dnorm(x)

# Now we plot the data 

par(mfrow=c(1,2))   # Set the plot region to be one row and two columns

plot(x, y, xlim=c(-3,3), ylim=c(0,0.4), main="Normal Curve", col="blue") # Plot 1

plot(x,y,xlim=c(-3,3),ylim=c(0,0.4),main="Normal Curve",col="blue",type="l") # Plot 2

[Figure: Normal curve plotted as points (Plot 1, left) and as a line (Plot 2, right)]

Note that we could have generated the x values by using the rnorm function and then sorting the results. If we call rnorm without specifying a mean and a standard deviation, we get back values drawn from a Normal distribution with mean 0 and standard deviation 1. So we could have done the following. This approach results in less “coverage” in the tails, but we still get a comparable result.

xvals = rnorm(120)
par(mfrow=c(1,1))
plot(sort(xvals), y= dnorm(sort(xvals)), main="Normal Curve",col="blue")

Next let’s pull out all the x values that are less than -2 or greater than 2. We are in effect looking at the “tails” of the distribution here. Using the bracket notation with vectors is a powerful way to interrogate the data and extract only what you need. Look how easy this is. In other languages you would have to write a for-loop that uses if statements.

tmp = x[(x < -2) | (x > 2)]

par(mfrow=c(1,2))

plot(tmp,dnorm(tmp),xlim=c(-3,3),ylim=c(0,0.4),
         main="Normal Curve - The Tails",col="blue") # Plot 1

As mentioned previously, newcomers to R often want to write a loop structure to pull out this data. You could certainly do this but it would be overlooking the power of vectors. Still, let’s see what would be involved to extract the information above using the for-loop approach.

hold = vector()                        # Start with an empty vector
for (ii in 1:length(x)) {
    if ( (x[ii] < -2) | (x[ii] > 2) ) {
       hold[ii] = x[ii]                # Positions where the condition is FALSE are left as NA
    }
}

Okay, well, this will work but it doesn’t generalize very well, (if at all), since we basically have to customize this loop based on what comparisons we want to do. It might be better to put this into a function, although it is far better, (as we will soon see), to use the bracket notation available with vectors. Now let’s also pull out the x values that complement the above set. That is, let’s get all the x values that are greater than -2 and less than 2. We’ll plot these alongside the previous plot.

tmp2 = x[ (x > -2) & (x < 2) ]

plot(tmp2,dnorm(tmp2),xlim=c(-3,3),ylim=c(0,0.4),
     main="Normal Curve - Without Tails",col="blue") # Plot 2

[Figure: Normal curve showing only the tails (Plot 1) and without the tails (Plot 2)]

So let’s reconsider the example where we used the for-loop. Aside from being more complex and less general, it also performs much worse than the bracket notation, especially as the size of the vector grows. Let’s regenerate our x values from -3 to 3, but this time we will pick a very small step size, which results in a vector with many more values than before:

x = seq(-3,3,by=0.0001)

length(x)     # We have 60,001 elements. 
[1] 60001

par(mfrow=c(1,1))
plot(x,dnorm(x))   # Convince yourself that it still plots the Normal curve.

Now let’s extract the tail values from this much larger x vector, again using the for-loop approach. To make it easier to time, we will wrap the loop in a function:

my.func <- function(x) {
   hold = vector()
   for (ii in 1:length(x)) {
     if ( (x[ii] < -2) | (x[ii] > 2) ) {
        hold[ii] = x[ii]
     }
   }
   return(hold)
}

# Let's use this function to pull out the tail values and time how long it takes.

system.time( my.func(x) )   # The system.time function does what its name suggests
   user  system elapsed 
  2.120   0.287   2.406 
 
 # So is this good / bad ? Let's compare it to the bracket notation
 
system.time( x[ (x < -2) | (x > 2) ])
   user  system elapsed 
  0.003   0.001   0.003 

Wow. So the bracket notation is much faster. For vectors of smaller length the difference in time isn’t so large. But as the size of x grows so does the time it takes to process it. To drive this point home let’s look at a larger x vector and see how it performs as compared to the bracket notation.

x = seq(-3,3,by=0.00005)

length(x)
[1] 120001

system.time( my.func(x) )
   user  system elapsed 
  6.363   2.640   8.998 

system.time( x[ (x < -2) | (x > 2) ])
   user  system elapsed 
  0.006   0.000   0.006 
  
# Wow - the bracket notation really does provide much better performance. 

Let’s do some benchmarking and plot the results so we can get a visual representation of the performance of the two approaches. Check out the rbenchmark package. You will have to install it since it is an add-on package. Use RStudio or the Windows GUI to do this, or you can use the install.packages command. This package contains a function called benchmark, which allows you to time the execution of any R function including those that you have written.

It replicates the function a specified number of times. You can also compare the performance of more than one function at a time. Let’s start out simple. Here we will time how long it takes to square the x vector. As mentioned, this will be replicated some number of times, with the default being 100. We will get back a variety of information.

library(rbenchmark)   # Load the add-on package (install it first with install.packages("rbenchmark"))

x = seq(-3,3,by=0.0001)

benchmark( square = x^2)
    test replications elapsed relative user.self sys.self user.child sys.child
1 square          100   0.018        1     0.015    0.002          0         0
 
# We can compare how long it takes to square it vs cubing it. We can also specify the
# information to be returned

benchmark( square = x^2, cube = x^3, replications=5, columns=c("test","elapsed"))
    test elapsed
2   cube   0.006
1 square   0.001

Now let’s benchmark the for-loop function against the bracket notation across a range of vector sizes. First we’ll generate four different versions of the x vector corresponding to different step sizes.

hold = lapply(c(0.01, 0.001, 0.0001, 0.00005), function(x) {
   return(seq(-3,3,by=x))
})

length(hold)  # Hold is a list with 4 elements each of which contains a vector
[1] 4

sapply(hold,length)
[1]    601   6001  60001 120001

Okay, second we’ll use this new hold list as input into the benchmarking function. To do this efficiently we’ll use the lapply command, and we’ll write a small function to wrap the timing step.

timing.func <- function(x) {
  benchmark( replications=5, func = my.func(x),
             bracket = x[ (x < -2) | (x > 2)], columns=c("test","elapsed") )
}

( timings = lapply(hold, timing.func)  )
[[1]]
     test elapsed
2 bracket   0.001
1    func   0.008

[[2]]
     test elapsed
2 bracket   0.002
1    func   0.168

[[3]]
     test elapsed
2 bracket   0.018
1    func  10.757

[[4]]
     test elapsed
2 bracket   0.036
1    func  60.141

# Let's pull out the timings for each approach (the function approach or bracket approach)

( mtimings = sapply(timings, function(x) x$elapsed) )

      [,1]  [,2]   [,3]   [,4]
[1,] 0.001 0.002  0.018  0.036
[2,] 0.008 0.168 10.757 60.141

# Let's plot the timings

plot(1:4,mtimings[2,],pch=18,col="red",type="l",xaxt="n",
         main="Timings of bracket notation vs for loop",
         xlab="Vector size in elements",ylab="Timings in Seconds")

axis(1,1:4,labels=as.character(sapply(hold,length)))

points(1:4,mtimings[1,],type="l",col="blue")

legend(1,40,c("for-loop","bracket"),col=c("red","blue"),lty=c(1,1))

[Figure: timings of bracket notation vs. for-loop as the vector size grows]

Well this shows that the bracket notation works well even as the x vector gets large. The for-loop, however, starts to perform very poorly and will soon be impractical for larger vectors.

Writing loops in C++, Java, C, and FORTRAN is pretty fast, which is why people coming to R from those languages tend to want to write loops to process data. Users of MATLAB, however, are used to the idea of vectors, (although C++ supports vectors), and make the transition to R quite nicely. Also keep in mind that matrices, (which are closely related to vectors), can also be processed using for-loops but, as with vectors, it is not recommended. There are plenty of functions that are “vectorized” and customized for use with vectors and matrices. This is why we can use commands like apply to help process data in R.
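
As a quick illustration of what “vectorized” means for matrices, here is a minimal example of my own (not part of the benchmarks above) using a small matrix, an element-wise operation, and apply:

m = matrix(1:6, nrow=2)   # A 2 x 3 matrix filled column-wise

m * 10                    # Vectorized: every element is multiplied, no loop needed

apply(m, 1, sum)          # Apply the sum function across each row
[1]  9 12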


Welcome to Part 2 of the GeoCoding, R, and the Rolling Stones blog. Let’s apply some of the things we learned in Part 1 to a practical real world example.

Mapping the Stones – A Real Example

The Rolling Stones have toured for many years. You can go to Wikipedia and see information on the various tours. Here we focus only on the dates and concerts for the 1975 “Tour of the Americas”. I’ve scraped the information from the Wikipedia page and put it into a data frame. The idea here is that we will GeoCode each city to obtain a latitude and longitude and then use that to create an interactive map of the tour using the Google Charting Tools.

If you want your own copy of this data frame then do the following:

url = "http://steviep42.bitbucket.org/data/stones75.csv"
stones75 = read.csv(url)

Here are the first 10 rows of the data frame. The format is really simple:

head(stones75,10)
           Date        City         State                Venue
1   1 June 1975 Baton Rouge     Louisiana  LSU Assembly Center
2   3 June 1975 San Antonio         Texas    Convention Center
3   4 June 1975 San Antonio         Texas    Convention Center
4   6 June 1975 Kansas City      Missouri    Arrowhead Stadium
5   8 June 1975   Milwaukee     Wisconsin       County Stadium
6   9 June 1975  Saint Paul     Minnesota         Civic Center
7  11 June 1975      Boston Massachusetts        Boston Garden
8  14 June 1975   Cleveland          Ohio    Municipal Stadium
9  15 June 1975     Buffalo      New York  Memorial Auditorium
10 17 June 1975     Toronto       Ontario   Maple Leaf Gardens

Okay let’s process the cities. Like before we’ll use the sapply command to get back the data after which we’ll use cbind to attach the results to the data frame. We might get some warnings about row names when we do this but don’t worry about it. After all “you can’t always get what you want”.

hold = data.frame(t(sapply(paste(stones75$City,stones75$State,sep=","),myGeo)))
stones75 = cbind(stones75,hold)
 
head(stones75,10)

           Date        City         State                Venue  lat   lon
1   1 June 1975 Baton Rouge     Louisiana  LSU Assembly Center 30.5 -91.1
2   3 June 1975 San Antonio         Texas    Convention Center 29.4 -98.5
3   4 June 1975 San Antonio         Texas    Convention Center 29.4 -98.5
4   6 June 1975 Kansas City      Missouri    Arrowhead Stadium 39.1 -94.6
5   8 June 1975   Milwaukee     Wisconsin       County Stadium 43.0 -87.9
6   9 June 1975  Saint Paul     Minnesota         Civic Center 45.0 -93.1
7  11 June 1975      Boston Massachusetts        Boston Garden 42.4 -71.1
8  14 June 1975   Cleveland          Ohio    Municipal Stadium 41.5 -81.7
9  15 June 1975     Buffalo      New York  Memorial Auditorium 42.9 -78.9
10 17 June 1975     Toronto       Ontario   Maple Leaf Gardens 43.7 -79.4

Great ! So now we have the lat and lon for each city. As you might notice in the data frame the Stones played several nights in the same city so we should probably keep track of this.


stones75[9:18,]

           Date          City        State                 Venue
9  15 June 1975       Buffalo     New York   Memorial Auditorium
10 17 June 1975       Toronto      Ontario    Maple Leaf Gardens
11 22 June 1975 New York City     New York Madison Square Garden
12 23 June 1975 New York City     New York Madison Square Garden
13 24 June 1975 New York City     New York Madison Square Garden
14 25 June 1975 New York City     New York Madison Square Garden
15 26 June 1975 New York City     New York Madison Square Garden
16 27 June 1975 New York City     New York Madison Square Garden
17 29 June 1975  Philadelphia Pennsylvania          The Spectrum
18  1 July 1975         Largo     Maryland        Capital Center

As you can see above, they did a six-night stand at the famous Madison Square Garden arena in New York City. Our programming should really check for duplicate city names before we bug Google for information we already have. I’ll leave a full solution as an assignment for you, but one possible starting point is sketched below.
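
This is just a rough, untested sketch of my own (it assumes the myGeo function from Part 1): GeoCode each unique city/state combination once, then match the coordinates back onto every row of the data frame instead of querying Google once per row as we did above.

places = paste(stones75$City, stones75$State, sep=",")
uniq.places = unique(places)

uniq.coords = t(sapply(uniq.places, myGeo))   # One Google query per unique city

idx = match(places, uniq.places)              # Which unique city does each row refer to ?
stones75$lat = uniq.coords[idx, "lat"]
stones75$lon = uniq.coords[idx, "lon"]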

Creating a Map of the Tour Using googleVis

Anyway let’s now build a map of the tour dates. For this example we will use a package called “googleVis”. You might not know that Google has a number of mapping services for which R APIs exist. Look at the table at the end of this section, which lists existing packages for interfacing programmatically with the various Google mapping and chart services. You can find these packages on CRAN. In our case we’ll need to install googleVis. After that we can create a map.

install.packages("googleVis",dependencies=TRUE)
library(googleVis)

The cool thing about the googleVis package is that we get back a map in a web browser that has scroll bars and zoom tools. Additionally we can use information from the data frame to annotate the chart we plan to create. So, for example, for each tour stop that the band made we can put in meta info like the name of the venue they played as well as the date.

We have to do this in a way that accommodates the requirements of googleVis. This means we have to read through the googleVis manual pages and play around with the examples. However, hopefully I’m presenting a pretty good example here so you don’t have to immerse yourself in the manual (at least not yet).

The first thing we need to do is to create a single column holding both the latitude and longitude because googleVis wants this. This is easy to do. Let’s take the existing stones75 data frame and change it:

head(stones75)

         Date        City     State                Venue  lat   lon
1 1 June 1975 Baton Rouge Louisiana  LSU Assembly Center 30.5 -91.1
2 3 June 1975 San Antonio     Texas    Convention Center 29.4 -98.5
3 4 June 1975 San Antonio     Texas    Convention Center 29.4 -98.5
4 6 June 1975 Kansas City  Missouri    Arrowhead Stadium 39.1 -94.6
5 8 June 1975   Milwaukee Wisconsin       County Stadium 43.0 -87.9
6 9 June 1975  Saint Paul Minnesota         Civic Center 45.0 -93.1

stones75$LatLon = paste(round(stones75$lat,1),round(stones75$lon,1),sep=":")
stones75 = stones75[,-5:-6]  # Remove the old lat and lon columns

head(stones75)
         Date        City     State                Venue     LatLon
1 1 June 1975 Baton Rouge Louisiana  LSU Assembly Center 30.5:-91.1
2 3 June 1975 San Antonio     Texas    Convention Center 29.4:-98.5
3 4 June 1975 San Antonio     Texas    Convention Center 29.4:-98.5
4 6 June 1975 Kansas City  Missouri    Arrowhead Stadium 39.1:-94.6
5 8 June 1975   Milwaukee Wisconsin       County Stadium   43:-87.9
6 9 June 1975  Saint Paul Minnesota         Civic Center   45:-93.1

Next up we can create a column in our data frame that contains all the information we want to use to annotate each concert date. This can include HTML tags to better format the output. As an example the statement below creates a new column in the data frame called “Tip”, that has the following info: the Stop number on the tour, the Venue where it was held, and the Date of the concert. Once we have a map we can click on the “pin” for each location and see the annotation info.

stones75$Tip = paste(rownames(stones75),stones75$Venue,stones75$Date,"<BR>",sep=" ")

# Now we can create a chart !  

# Click on the Atlanta locator and you'll see that it was the 37th stop of the tour. 
# The show took place at The Omni on July 30th, 1975

stones.plot = gvisMap(stones75,"LatLon","Tip")
plot(stones.plot)

[Figure: googleVis map of the 1975 tour stops]

Refining the Plot Annotations

Pretty cool huh ? We can also zoom in on different parts of the map. The gvisMap function has a number of options that would allow us to draw a line between the cities, select a different type of map, and adopt certain zoom levels by default. So what else could / should we do ?
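
For example, something like the following should connect the stops with a line and change the map style. The option names here are from my reading of the googleVis documentation, so treat this as a sketch and check ?gvisMap for the exact names and allowed values.

stones.plot2 = gvisMap(stones75, "LatLon", "Tip",
                       options=list(showTip=TRUE, showLine=TRUE,
                                    mapType="terrain", zoomLevel=3,
                                    enableScrollWheel=TRUE))
plot(stones.plot2)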

Well, we have a problem here in that the Stones played more than one show in several cities, but we don’t take that into account when building the annotation data. What we might want to do is process the data frame and, for those cities that had multiple shows, (e.g. New York), capture all of the meta data in one go. We saw this before with the New York dates.

stones75[9:18,]

           Date          City        State                 Venue
9  15 June 1975       Buffalo     New York   Memorial Auditorium
10 17 June 1975       Toronto      Ontario    Maple Leaf Gardens
11 22 June 1975 New York City     New York Madison Square Garden
12 23 June 1975 New York City     New York Madison Square Garden
13 24 June 1975 New York City     New York Madison Square Garden
14 25 June 1975 New York City     New York Madison Square Garden
15 26 June 1975 New York City     New York Madison Square Garden
16 27 June 1975 New York City     New York Madison Square Garden
17 29 June 1975  Philadelphia Pennsylvania          The Spectrum
18  1 July 1975         Largo     Maryland        Capital Center

Currently our plot has only the last New York show information. But we want to have the info for all NYC shows. Here is one way to approach this problem. Note that there are probably more elegant ways to clean up the data but this will do the job for now.

test = stones75     # Create some temporary work variables
str = ""
tmpdf = list()
ii = 1

repeat {            # Loop through the copy of the stones75 data frame

   hold = test[test$Venue == test[ii,4], "Tip"]

   # Do we have a multi-night stand at this venue ?

   if (length(hold) > 1) {
      str = paste(hold, collapse="")
      test[ii,6] = str
      tmpdf[[ii]] = test[ii,]
      str = ""

      # We "jump over" the rows we've already processed

      ii = ii + length(hold)

   } else {

      # Here we process the "one night stands"

      tmpdf[[ii]] = test[ii,]
      ii = ii + 1
   }
   if (ii > 42) break   # Hard-coded row count; nrow(test) would be more general
}

tmpdf = tmpdf[!sapply(tmpdf,is.null)]    # Remove NULL list elements
stones = do.call(rbind,tmpdf)                # Bind the list back into a data frame

stones.plot = gvisMap(stones,"LatLon","Tip")
plot(stones.plot)

[Figure: googleVis map with consolidated annotations for multi-night stands]

Okay. Now depending on your background in R you might think that was a lot of work, (or maybe not). In either case this is fairly typical of what we have to do to clean up and/or consolidate data to get it into a format that is suitable for use with the package we are using. Don’t think that this type of effort is peculiar to googleVis because other packages would require a comparable level of processing also. Welcome to the real world of data manipulation.

Anyway, let’s take a look at the new plot. At first glance it seems just like the old one, but click on the New York locator and you will now see that the info for all of the Madison Square Garden shows is present. Shows 11 through 16 took place in NYC.

R packages for interfacing with Google

Here is a table that lists other R packages for interfacing with various Google and related web charting services. Each one of these is worth investigating. Keep in mind that similar interfaces exist for other languages, so if you prefer to do your coding in Perl or Python you can work with the Google APIs from there as well.

PACKAGE          DESCRIPTION
googleVis        Create web pages with interactive charts based on R data frames
plotGoogleMaps   Plot HTML output with Google Maps API and your own data
RgoogleMaps      Overlays on Google map tiles in R
animation        A Gallery of Animations in Statistics and Utilities
gridSVG          Export grid graphics as SVG
SVGAnnotation    Tools for Post-Processing SVG Plots Created in R
RSVGTipsDevice   An R SVG graphics device with dynamic tips and hyperlink
iWebPlots        Interactive web-based plots

In this article I discuss a general approach for GeoCoding a location from within R, processing XML reports, and using R packages to create interactive maps. There are various ways to accomplish this, though using Google’s GeoCoding service is a good place to start. We’ll also talk a bit about the XML package, which is a very useful tool for parsing the reports returned from Google. XML is a markup language with wide support across Internet services, so knowing how to handle it is generally useful. This post sets us up for Part 2, wherein we’ll use this knowledge to create a map of the tour dates on the Rolling Stones’ 1975 Tour of the Americas. Also, when I use the word “GeoCoding” I mean the process of taking a geographic location and turning it into a latitude / longitude pair.

What does Google Offer ?

Check out the main Geocoding page, which presents implementation details of the API as well as use policies and limitations.

https://developers.google.com/maps/documentation/geocoding/

As an example let’s find out what the latitude and longitude are for “Atlanta, GA”. Actually we can get more specific than this by specifying an address or a zip code but let’s keep it simple at first. The Google API, (Application Programming Interface), is very forgiving and will work with almost any valid geographic information that we provide. We could enter just a zip code, a complete address, or an international location and Google will happily process it. Back to our example, according to the Google specs we would create a URL that looks like:

http://maps.googleapis.com/maps/api/geocode/xml?address=Atlanta,GA&sensor=false

Notice that I have placed Atlanta,GA into the URL as the value of the “address” parameter. According to the Google specs we must also append the string “sensor=false”, which tells Google that the request is not coming from a device with a location sensor (such as a GPS unit). If you paste the above URL into a web browser you will get something back that looks very much like the following.

<GeocodeResponse>
   <status>OK</status>
   <result>
      <type>locality</type>
      <type>political</type>
      <formatted_address>Atlanta, GA, USA</formatted_address>
      ..
      ..
      <geometry>
         <location>
            <lat>33.7489954</lat>
            <lng>-84.3879824</lng>
         </location>
      ..
      ..
      </geometry>
    </result>
</GeocodeResponse>

This is what is known as an XML document (eXtensible Markup Language). It might look scary at first but in reality this report contains a lot of helpful information. Note that there are “tags” that describe the content as it is being returned. That is, what we get back is a document that basically describes itself. This is a very useful attribute of XML. We can pick out only that information we want while ignoring the rest.

In this case all we care about are the “lat” and “lng” tags. We can scan through the document and see that the latitude and longitude for Atlanta, GA is 33.7489954, -84.3879824. It is this self-documenting feature of XML that makes it ideal for processing by programming languages such as R, (of course), Perl, Python, Java, and others.

But would you want to visually accomplish this for 10, 100, or 1,000 locations ? I wouldn’t even want to do it for 2 ! So this is when we put the power of R to work. Let’s start up R and install some packages that will help us “talk” to the Google API.

install.packages("XML",dependencies=T)
library(XML)

install.packages("RCurl",dependencies=T)
library(RCurl)

Creating a URL string and Passing it to Google

We have a destination of “Atlanta, GA” that we wish to GeoCode into a latitude longitude pair. We’ll need to build a URL for eventual passing to the Google API. We can easily do this using the paste function.

google.url = "http://maps.googleapis.com/maps/api/geocode/xml?address="
query.url = paste(google.url, "Atlanta,GA", "&sensor=false", sep="")

query.url
[1] "http://maps.googleapis.com/maps/api/geocode/xml?address=Atlanta,GA&sensor=false"

Okay great. We are now ready to send this over to Google. This is easy using the getURL function that is part of the RCurl package.

txt = getURL(query.url)
xml.report = xmlTreeParse(txt,useInternalNodes=TRUE)

What we have here is a variable called “xml.report” that now contains the XML report that we saw in the previous section. If you don’t believe me then check out its contents:

xml.report

<?xml version="1.0" encoding="UTF-8"?>
<GeocodeResponse>
  <status>OK</status>
  <result>
    <type>locality</type>
    <type>political</type>
    <formatted_address>Atlanta, GA, USA</formatted_address>
    ..
    <geometry>
      <location>
        <lat>33.7489954</lat>
        <lng>-84.3879824</lng>
      </location>
      ..
      ..
      </GeocodeResponse>

Parsing the Returned XML Result

Don’t get too worried about how the xmlTreeParse function works at this point. Just treat it as a “black box” that turns the text returned by getURL into a parsed XML document. The next step uses the function getNodeSet to locate the latitude and longitude strings in the report. This is something that we did visually in the previous section, though to do it from within R we need to use an “XPath expression” to search through the document for the lat/lon.

place = getNodeSet(xml.report,"//GeocodeResponse/result[1]/geometry/location[1]/*")

XPath is the XML Path Language, which is a query language for selecting “nodes” from an XML document. A full discussion of it is beyond the scope of the document. At this point just think of it as a way to search an XML document to match specific lines in the document. If you look at the string we pass to getNodeSet you might wonder what it means.

//GeocodeResponse/result[1]/geometry/location[1]/*

This is a string that we use to match specific tags in the XML report. We can “read” it as follows. Starting at the top of xml.report we match the “GeocodeResponse” row, then the first “result” row, then the “geometry” row, and then the first “location” row. This gives us the section of the report that contains the latitude and longitude. To verify this manually, you can visually scan through the document.

<GeocodeResponse>
  <result>
        <geometry>
             <location>
                <lat>33.7489954</lat>
                <lng>-84.3879824</lng>

Notice that in xml.report the lines are indented, which indicates that some rows are contained within sections; in other words, the document has a hierarchy. So when you “match” the GeocodeResponse line, this is a “node” in the document that contains other nodes. In our example the “result” node is contained within the “GeocodeResponse” node, the “geometry” node is contained within the “result” node, and the “location” node is contained within the “geometry” node.

We can then extract the values contained in the “location” node. Note that in real world applications you would first use a tool such as XMLFinder to develop the correct XPath expression after which you would implement it in R as we have done here. The Firefox browser also has a plugin that allows one to parse XML reports. Finally, we use an R statement to turn the variable place (from up above) into numeric values.

lat.lon = as.numeric(sapply(place,xmlValue))
lat.lon
[1]  33.7 -84.4
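
The same pattern works for any other tag in the report. As a small illustration of my own, here is how you could pull out the formatted_address tag that appears near the top of the report:

addr = getNodeSet(xml.report, "//GeocodeResponse/result[1]/formatted_address")
sapply(addr, xmlValue)
[1] "Atlanta, GA, USA"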

Write a Function to Do This Work !

So you might think that was a lot of work. Well, maybe it was, but consider that we can now parse any XML document returned by the Google GeoCoding service and find the lat / lon pair. Not only that, we can extract any other information contained in the returned XML file. Since we wouldn’t want to do this manually for every GeoCoding query we might have, let’s write a function in R to do it for us. Here is one way to do it:

myGeo <- function(address="Atlanta,GA") {

# Make sure we have the required libraries to do the work

  stopifnot(require(RCurl))
  stopifnot(require(XML))
  
# Remove any spaces in the address field and replace with "+"

   address = gsub(" ","\\+",address)
   
# Create the request URL according to Google specs

  google.url = "http://maps.googleapis.com/maps/api/geocode/xml?address="
  my.url = paste(google.url,address,"&sensor=false",sep="")

  Sys.sleep(0.5) # Just so we don't beat up the Google servers with rapid fire requests
  
# Send the request to Google and turn it into an internal XML document
  
  txt = getURL(my.url)
  xml.report = xmlTreeParse(txt, useInternalNodes=TRUE)

# Pull out the lat/lon pair and return it  

  place = getNodeSet(xml.report,  "//GeocodeResponse/result[1]/geometry/location[1]/*")
  lat.lon = as.numeric(sapply(place,xmlValue))
  names(lat.lon) = c("lat","lon")
  
  return(lat.lon)
}

myGeo()
 lat   lon 
 33.7 -84.4 

myGeo("Palo Alto,CA")
   lat    lon 
  37.4 -122.1 
  
myGeo("Paris,FR")
  lat   lon 
48.86  2.35 

Now. We’ve actually got more capability than we think we do. Remember that the Google API will accept almost any reasonable geographic information. So check out the following. With just a few lines of code we have a pretty powerful way to GeoCode generally specified addresses without having to do lots of preparation and string processing within our R function.

myGeo("1600 Pennsylvania Avenue, Washington, DC")
  lat   lon 
 38.9 -77.0 

myGeo("Champs-de-Mars, Paris 75007")  # Address of the Eiffel Tower
lat  lon 
48.9  2.3 

Using the myGeo function in a Real Example

Let’s see how we might use this in a real situation. Here is a data frame called geog, which contains columns of city names and state abbreviations.

geog
            city state
1       Glendale    CA
2      DesMoines    IA
3    Albuquerque    NM
4           Waco    TX
5       Honolulu    HI
6  Indianaopolis    IN
7     Pittsburgh    PA
8     Clearwater    FL
9      Sunnyvale    CA
10    Bridgeport    CT
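
(If you want to follow along, you can build this data frame yourself. The values below are simply copied from the listing above.)

geog = data.frame(city = c("Glendale","DesMoines","Albuquerque","Waco","Honolulu",
                           "Indianaopolis","Pittsburgh","Clearwater","Sunnyvale","Bridgeport"),
                  state = c("CA","IA","NM","TX","HI","IN","PA","FL","CA","CT"),
                  stringsAsFactors = FALSE)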

Now. Look how easy it is to process all of these rows:

t(sapply(paste(geog$city,geog$state),myGeo))

                  lat    lon
Glendale CA      34.1 -118.3
DesMoines IA     41.6  -93.6
Albuquerque NM   35.1 -106.6
Waco TX          31.5  -97.1
Honolulu HI      21.3 -157.9
Indianaopolis IN 39.8  -86.2
Pittsburgh PA    40.4  -80.0
Clearwater FL    28.0  -82.8
Sunnyvale CA     37.4 -122.0
Bridgeport CT    41.2  -73.2

Our function should also work with a vector of zip codes.
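
(The vector below just contains the zip codes from the listing that follows, entered as character strings so that the leading zero in "06601" is preserved.)

myzipcodes = c("90039","50301","87101","76701","96801",
               "46201","15122","33755","94085","06601")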

myzipcodes
 [1] "90039" "50301" "87101" "76701" "96801" "46201" "15122" "33755" "94085" "06601"
 
t(sapply(myzipcodes,myGeo))

       lat    lon
90039 34.1 -118.3
50301 41.6  -93.6
87101 35.1 -106.6
76701 31.6  -97.1
96801 21.3 -157.9
46201 39.8  -86.1
15122 40.4  -79.9
33755 28.0  -82.8
94085 37.4 -122.0
06601 41.2  -73.2

Okay please check out Part 2 of this post to see how we process the tour data from the
1975 Rolling Stones “Tour of the Americas”.

Hello. Welcome to my debut post ! Check the About link to see what this blog intends to accomplish. In this article I discuss a general approach for splitting a data frame based on a grouping variable and then performing further operations on each group. A secondary goal is to provide some exposure to the “apply” family of functions and demonstrate how they can help. For purposes of example, R has a built-in data frame called “ChickWeight” that contains 578 rows and 4 columns from an experiment on the effect of diet on the early growth of chicks. To get more details simply type:

data(ChickWeight)
?ChickWeight

In general when looking at data frames we should be looking for the continuous variables that we want to summarize in terms of some factor or grouping variable. Grouping variables can usually be identified easily because they typically take on a small, fixed number of distinct values. (Note that R has commands that let us “cut” continuous variables into discrete intervals, but that is for another blog post). So let’s use the sapply function to determine what the grouping variables might be:

sapply(ChickWeight, function(x) length(unique(x)))
weight   Time  Chick   Diet 
   212     12     50      4 

If you don’t know about sapply then don’t freak out. It takes the function we define in the second argument and “applies” it to the columns of the data frame, (ChickWeight), that we provided as the first argument. The “x” in the function is a placeholder argument that stands in for each column of the data frame in turn. The unique function returns only the unique values that the column assumes. The length function then “counts” the number of unique values.
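
To see what the anonymous function computes for a single column, you can run it by hand. (This one-liner is my own illustration, not part of the original walk-through.)

length(unique(ChickWeight$Diet))   # The same logic applied to just the Diet column
[1] 4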

This can be confusing I know. However, the important thing is that we can easily observe that Diet and Time look like potential grouping variables since they take on only 4 and 12 unique values, respectively. If we just wanted to know the mean weight of the chickens for each Diet type then we could use the tapply command to get that answer.

tapply(ChickWeight$weight,ChickWeight$Diet,mean)
  1   2   3   4 
103 123 143 135 

Note that we could also write our own function in place of “mean” in case we wanted to do something more in depth in terms of summary. But maybe we don’t yet know what it is we want to do with the data. So having it in a grouped format might help us better understand the data. Let’s use the split command, which can give us access to the individual groups.

my.splits = split(ChickWeight, ChickWeight$Diet)
length(my.splits)
[1] 4

names(my.splits)
[1] "1" "2" "3" "4"

This operation creates a list where each element of my.splits corresponds to an individual Diet value (1, 2, 3, or 4). We can now start thinking about how to further investigate each Diet type. To convince yourself that the split command actually worked, let’s take a peek at the first element. All records in the first element relate to Diet #1.

head(my.splits[[1]])
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

Just for fun let’s use the lapply command to subset each element of my.splits and obtain all the chicks from each group that weigh 40 grams or less. This approach doesn’t take much typing and will conveniently return a list, which we store in a variable called my.results.lapply.

my.results.lapply = lapply(my.splits, subset, weight <= 40)

In the example above we pass additional arguments to the subset command by listing them after the function name in the lapply call. This is the most “R like” way of doing things. Alternatively, we could have defined an anonymous function to do this.

my.results.lapply = lapply(my.splits, function(x) subset(x, weight <= 40) )

Note that we can also define our function in advance. This isn’t substantially different from the example above but it might improve the readability of the code. This is a personal choice and either approach will yield the same result.

my.func <- function(x) {
    subset(x, weight <= 40)
}
my.results.lapply = lapply(my.splits, my.func)

In any case, check out what is happening here. lapply silently passes each element of my.splits, one at a time, as the argument to the function we supply. This can be confusing to newcomers but is actually a convenience, since it accomplishes a for-loop structure “under the hood”. To make this more obvious we could also have done the following, which is closer to a for-loop approach in that you work through the my.splits list using subscripts.

my.results.lapply = lapply(1:length(my.splits), function(x) subset(my.splits[[x]], weight <= 40))

If we are done with our sub-setting and interrogation then we can repackage the results list back into a data frame by using this construct:

my.df = do.call(rbind, my.results.lapply)
my.df
    weight Time Chick Diet
13      40    0     2    1
26      39    2     3    1
195     39    0    18    1
196     35    2    18    1
221     40    0    21    2
269     40    0    25    2
293     39    0    27    2
305     39    0    28    2
317     39    0    29    2
365     39    0    33    3
401     39    0    36    3
519     40    0    46    4
543     39    0    48    4
555     40    0    49    4

So now we have a single data frame with only the data that we want. If you are coming from another programming language then you might be tempted to write a for-loop to do this. You could, though you would have to do a little more work to keep track of the results. You have to create your own empty list to stash them:

my.results.for = list() 
for (ii in 1:length(my.splits)) {
     my.results.for[[ii]] = subset(my.splits[[ii]], weight <= 40)
}

names(my.results.for) = names(my.splits)

all.equal(my.results.lapply, my.results.for) # Should return TRUE - both approaches give the same result

So which approach do you use ? It depends. The for-loop approach is the more traditional angle. However, the lapply approach takes much less typing and can sometimes perform better, but you have to define your function to work with it and understand that lapply is “silently” passing each element to your function. Once people become accustomed to using lapply they usually stick with it.

Now not to blow your mind but we could knock out the problem in one go:

lapply(split(ChickWeight,ChickWeight$Diet), subset, weight <= 40)

# OR

(do.call(rbind, lapply(split(ChickWeight,ChickWeight$Diet), subset, weight <= 40)))

      weight Time Chick Diet
1.13      40    0     2    1
1.26      39    2     3    1
1.195     39    0    18    1
1.196     35    2    18    1
2.221     40    0    21    2
2.269     40    0    25    2
2.293     39    0    27    2
2.305     39    0    28    2
2.317     39    0    29    2
3.365     39    0    33    3
3.401     39    0    36    3
4.519     40    0    46    4
4.543     39    0    48    4
4.555     40    0    49    4

Lastly, it occurred to me after writing much of this post that, in this example, which is admittedly somewhat contrived, there is another way to do this. This kind of realization is fairly common in R, especially when you read code written by others. This is the blessing, (and curse), of R. There are many ways to do the same thing. In fact there are frequently several functions that appear to do the same thing (or could be made to). Anyway, relative to our problem we could first sort the ChickWeight data frame by Diet and then do a subset. The sort will arrange the data by group, which means that the resulting subset operation will preserve that order.

sorted.chickens = ChickWeight[order(ChickWeight$Diet),]
(sorted.chickens = subset(sorted.chickens, weight <= 40))

    weight Time Chick Diet
13      40    0     2    1
26      39    2     3    1
195     39    0    18    1
196     35    2    18    1
221     40    0    21    2
269     40    0    25    2
293     39    0    27    2
305     39    0    28    2
317     39    0    29    2
365     39    0    33    3
401     39    0    36    3
519     40    0    46    4
543     39    0    48    4
555     40    0    49    4