lapply 101
Next up in our review of the family of apply commands we’ll look at the lapply function, which can be used to loop over the elements of a list (or a vector). This is a true convenience, although to those with experience in other programming languages it can seem unnecessary since they are accustomed to writing their own loops. Rest assured you can take that approach in R, but once you get an understanding of lists and lapply you will appreciate what it can do for you. This leads me to what I feel is an important observation: most misunderstandings of the lapply command result primarily from a limited or incomplete knowledge of the list structure in R. As long as you know what lists are and how to manipulate them effectively, lapply becomes much easier to implement. So we’ll begin with a brief yet important tutorial on lists and how their elements can be referenced.

The list structure represents R’s default data structure for containing heterogeneous information. Recall that vectors and matrices can accommodate only one data type at a time. That is you cannot have a vector or matrix with mixed data types. You can try to put numeric and character data into the same vector but R will convert it all to a single type. Don’t believe me ? Try it. R doesn’t complain but it will make everything a character.

somevec <- c(1,4,5,"4","5")

somevec
[1] "1" "4" "5" "4" "5"

So from a purely practical point of view some data structure must exist in R to accommodate mixed data. That is what the list structure is for. Where do lists show up in R ? All over the place it turns out. Many of the interesting statistical functions in R, as well as the many add on packages available on CRAN, return information in the form of lists.


# Let's do some regression using the mtcars data frame

mylm <- lm(mpg~wt, data = mtcars)

# What type of structure do we get back ? A list with 12 sub elements

str(mylm,max.level=0)
List of 12
 - attr(*, "class")= chr "lm"
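
Because mylm is just a list, its components can be pulled out like the elements of any other list. For example (output omitted), the following are all standard ways to poke at an lm fit:

names(mylm)          # the names of the 12 components
mylm$coefficients    # the intercept and slope as a named numeric vector
coef(mylm)           # the standard extractor function returns the same thing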

Now isn’t that interesting ? As you might know, judging the quality of the regression process can be quite involved. R knows this so it returns lots of information encapsulated within a list to help you assess the model. The 12 elements include data types such as vectors, factors, lists, objects, and data frames. So if you are writing your own function and need to return diverse types of data then the list structure is for you ! But that’s all a bit too complex for the moment so let’s return to some basics. To motivate things I’ll present some variables here that look like they relate to characteristics of a family. We have a surname, the number of children in the family, their respective ages, and whether or not the children have had measles.

surname <- "Jones"
numofchild <- 2
ages <- c(5,7)
measles <- c("Y","N")

We could work with this information as individual variables though it might be useful to assemble it all into a single structure. So I will create a “named” list to contain this information. It is called a “named” list because each element receives a name as the list is being created. Named lists are easier for humans to manipulate and interrogate because we can refer to elements by name using the “$” notation which I will introduce momentarily.

family1 <- list(name="Jones",numofchild=2,ages=c(5,7),measles=c("Y","N"))

str(family1)
List of 4
 $ name      : chr "Jones"
 $ numofchild: num 2
 $ ages      : num [1:2] 5 7
 $ measles   : chr [1:2] "Y" "N"

Ah so we see that the elements “name” and “measles” are character vectors, element “numofchild” is a single valued vector, and the “ages” element is a multi valued vector. This proves that we can host data of differing types within a single data structure. Now how do we address these elements and retrieve their values ? Can we use numeric indexing as with a vector ? Can we use the names we created for each element ? The answer in both cases is “yes”. If we have a list with named elements then we can use the “$” notation.

family1$ages
[1] 5 7

family1$measles
[1] "Y" "N"

# We can also reference by numeric position which is more useful if you are 
# writing your own loop structures but it is less intuitive 

family1[2]
$numofchild
[1] 2

family1[[2]]
[1] 2

Hmm. What’s up with the double bracket vs the single bracket ? Well the way I think about it is that if you use the single bracket (as you would if this were a vector) you get back a smaller list holding the element’s name as well as its value. While this is useful it is usually more interesting to get the actual value(s) of the element which, if you don’t use the element name, requires use of the double brackets. Think of the double brackets as being more specific than the single brackets. Now even if you use the $ notation you can still address individual values of a given list element. So here I’ll start with pulling out the age of the first child only.

family1$ages[1]
[1] 5

# We could pull out both ages using this approach

family1$ages[1:2]
[1] 5 7

# But this is the same as this:

family1$ages
[1] 5 7

# Which is the same as this:

family1[[3]]
[1] 5 7
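
One more quick check helps make the single vs. double bracket distinction concrete; the class function tells us what each form of indexing actually returns:

class(family1[2])     # single brackets give back a (one element) list
[1] "list"

class(family1[[2]])   # double brackets give back the element itself
[1] "numeric"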

The way I would summarize the above information is that if you have a named list then you can use the “$” notation for the most part though if you want to address specific values within a multivalued element then you will also have to use the bracket notation in addition to the “$” notation. If you have an unnamed list then you must use the bracket notation exclusively since there are no names available. Unnamed lists result when no effort is made to name the elements such as in the following example. I can always apply names later if I wish.

family1 <- list("Jones",2,c(5,7),c("Y","N"))

# So when we print the list results we see only brackets - no names.

family1
[[1]]
[1] "Jones"

[[2]]
[1] 2

[[3]]
[1] 5 7

[[4]]
[1] "Y" "N"

# The cool thing is that we can make names for our elements even after we have
# created the list

names(family1) <- c("name","numofchild","ages","measles")
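
With the names in place the “$” notation works just as it did before:

family1$ages
[1] 5 7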

Introducing lapply

Admittedly this family1 list is a little basic but the above examples prove that there is flexibility in how you can address elements and values. So let’s present an example of the lapply function. I’ll use it to apply the built in “typeof” function to each element in the list.

lapply(family1, typeof)
$name
[1] "character"

$numofchild
[1] "double"

$ages
[1] "double"

$measles
[1] "character"

# Using lapply will return a list

str(lapply(family1, typeof))
List of 4
 $ name      : chr "character"
 $ numofchild: chr "double"
 $ ages      : chr "double"
 $ measles   : chr "character"

So each element of family1 is “silently” passed to the typeof function and the results for all elements are returned in a list. In fact the “l” in lapply comes from the fact that it will return to you a list. To make sure you understand what is happening here let me introduce a small variation to the example.


lapply(family1, typeof)

# is the same as

lapply(family1, function(x) typeof(x))

The second version does exactly the same thing as the first but illustrates two important facts: 1) I can pass an “anonymous” function to lapply. (If you read my blog on apply I discussed the importance and use of “anonymous” functions). 2) The variable “x” referenced in the anonymous function definition represents a placeholder for each element of the family1 list. If you need more help understanding this then look at this example where I use a function that simply prints back each element of the list and its associated value.

lapply(family1,function(x) x)
$name
[1] "Jones"

$numofchild
[1] 2

$ages
[1] 5 7

$measles
[1] "Y" "N"

# But this is the same as simply typing the name of the list at the prompt

family1

Now you could have also defined a function in advance instead of using an anonymous function. That is, you are not obligated to use anonymous functions. Some people find them too error prone, confusing, and less readable. If you do then simply define your function in advance. It won’t change the result or impact performance, at least in this case.

simplefunc <- function(x) {
   mytype <- typeof(x)
   return(mytype)
}

lapply(family1, simplefunc)
$name
[1] "character"

$numofchild
[1] "double"

$ages
[1] "double"

$measles
[1] "character"

Alright let’s get a little more advanced here. I’ll write a function that returns the mean value of each element. But I know what you are thinking. Not all of our list elements are numeric so to avoid an error I’ll have to insert some basic logic to test if the element is numeric. If it is numeric then I take the mean of the element value. If not then I ignore the element. I could implement this example two ways: 1) create a named function in advance which I then use in a call to lapply. 2) create an anonymous function when I call lapply. Here is the first scenario:


# Create a function in advance.

myfunc <- function(x) {
  if (is.numeric(x)) {
    mean(x)
  }
} 

# For numeric elements we get a meaningful result. For other elements we
# don't get back anything

lapply(family1, myfunc)
$name
NULL

$numofchild
[1] 2

$ages
[1] 6

$measles
NULL

What about approach #2 ? This is where I would define the function as I make the call to lapply. This works just as well but might be less readable to a newcomer.

lapply(family1, function(x) {if (is.numeric(x)) mean(x)})
$name
NULL

$numofchild
[1] 2

$ages
[1] 6

$measles
NULL
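
If you would rather not see the NULL entries at all, the base R Filter function can strip them out of the result; here is a small example using the myfunc we defined above:

Filter(Negate(is.null), lapply(family1, myfunc))
$numofchild
[1] 2

$ages
[1] 6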

A list of lists !

Okay, let’s create some more lists that correspond to different families. If we want we can even create a master list whose elements are individual family lists. So in effect we are creating a list of lists ! In this case our master list has named elements so we can easily address the contents of the sub elements.

family2 <- list(name="Espinoza",numofchild=4,ages=c(5,7,9,11),measles=c("Y","N","Y","Y"))
family3 <- list(name="Ginsberg",numofchild=3,ages=c(9,13,18),measles=c("Y","N","Y"))
family4 <- list(name="Souza",numofchild=5,ages=c(3,5,7,9,11),measles=c("N","Y","Y","Y","N"))

allfams <- list(f1=family1,f2=family2,f3=family3,f4=family4)

str(allfams,max.level=1)
List of 4
 $ f1:List of 4
 $ f2:List of 4
 $ f3:List of 4
 $ f4:List of 4

allfams$f3$ages   # Get the ages of Family 3
[1]  9 13 18

# Same as

allfams[[3]]$ages
[1]  9 13 18

Okay so now what if we wanted to get the mean ages of each family’s children ? How could we do this using lapply ? It’s easy.

lapply(allfams, function(x) mean(x$ages))
$f1
[1] 6

$f2
[1] 8

$f3
[1] 13.33333

$f4
[1] 7

It might also be interesting to get an overall average across the families, i.e. the mean of these family means. How might we do that ? It takes a little bit more work but not much. First, recognize that what we are getting back are all numeric values, so we don’t really need a list to store that information. The only reason we use a list in the first place is to “host” data of differing types, but here our result is all numeric so let’s convert it to a vector.

unlist(lapply(allfams, function(x) mean(x$ages)))
      f1       f2       f3       f4 
 6.00000  8.00000 13.33333  7.00000 

# So check out the following. It gives us exactly what we want.

mean(unlist(lapply(allfams, function(x) mean(x$ages))))
[1] 8.583333

# An "expanded" version of this might have looked like:

mymeanages <- function(x) {
  return(mean(x$ages))
}

hold <- lapply(allfams, mymeanages)  # Get back a list with mean ages for each family

hold2 <- unlist(hold)    # Turn the result into a vector since everything is a numeric value

mean(hold2)   # Get the mean of all ages

Let’s ask another question that we could use lapply and a companion function to answer. Which families have 2 or 3 children ? Well since we only have 4 families in allfams we could just look at the lists and answer this question via visual inspection. But this might get really hard to do if our allfams list had 10, 100, or 1,000 families. So here is one way we could do this.

hold <- lapply(allfams,function(x) {x$numofchild >= 2 & x$numofchild <= 3}) 
which(hold == T)
f1 f3 
 1  3 

# Or we could do it all in one go

which(lapply(allfams,function(x) {x$numofchild >= 2 & x$numofchild <= 3}) == T)
f1 f3 
 1  3 
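
As a side note, the sapply function (another member of the apply family) simplifies the result to a named vector for you, which avoids both the unlist step and the slightly odd comparison of a list against TRUE:

sapply(allfams, function(x) {x$numofchild >= 2 & x$numofchild <= 3})
   f1    f2    f3    f4 
 TRUE FALSE  TRUE FALSE 

which(sapply(allfams, function(x) {x$numofchild >= 2 & x$numofchild <= 3}))
f1 f3 
 1  3 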

Using the split command

Okay, how might we use this knowledge in another example ? Lists also show up in conjunction with the “split” command which, given a data frame and a factor, will split the data frame based on that factor into a list. This is best understood with an example. We’ll use the built-in data frame called mtcars.

unique(mtcars$cyl)  # Cylinder takes on three distinct values
[1] 6 4 8

# We could split the data frame based on cylinder group.

mydfs <- split(mtcars,mtcars$cyl)

str(mydfs,max.level=1)
List of 3
 $ 4:'data.frame':	11 obs. of  11 variables:
 $ 6:'data.frame':	7 obs. of  11 variables:
 $ 8:'data.frame':	14 obs. of  11 variables:

So what we get back is a list called mydfs whose elements are data frames, each containing the observations for cars with a certain number of cylinders – 4, 6, or 8. This is a quick and efficient way to split up a data frame. It is worth pointing out that lots of people don’t take advantage of the split function, usually because they aren’t aware of it. If you don’t use the split function then you have to do it by hand using an approach like the following. While this will work it doesn’t scale very well, especially if you have a factor with many “levels”.


fourcyl  <- mtcars[mtcars$cyl==4,]  
sixcyl   <- mtcars[mtcars$cyl==6,]
eightcyl <- mtcars[mtcars$cyl==8,]

But let’s get back to the split function and our list of data frames. We have a list called mydfs whose elements are data frames with observations corresponding to cars of 4, 6, and 8 cylinders. We can use our knowledge of lists to look around some:

names(mydfs)
[1] "4" "6" "8"

mydfs$"4"
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

How could we rapidly determine the mean MPG for each cylinder group using lapply ?

mydfs <- split(mtcars,mtcars$cyl)     
lapply(mydfs,function(x) mean(x$mpg))
$`4`
[1] 26.66364

$`6`
[1] 19.74286

$`8`
[1] 15.1

# Okay cool but we could bundle this up in one statement

lapply(split(mtcars,mtcars$cyl),function(x) mean(x$mpg))
$`4`
[1] 26.66364

$`6`
[1] 19.74286

$`8`
[1] 15.1

# Or more economically (though potentially confusing to a newcomer) 

unlist(lapply(split(mtcars,mtcars$cyl),function(x) mean(x$mpg)))

      4        6        8 
26.66364 19.74286 15.10000 

# Which is identical to the tapply function. 

tapply(mtcars$mpg,mtcars$cyl,mean)
       4        6        8 
26.66364 19.74286 15.10000 

I slipped that last one in on you to make a point that there are always multiple ways to solve problems using R. Some say that this flexibility is a great strength of R whereas others say it is a great source of confusion since newcomers don’t know which approach is best. When I was new to R I simply used whatever worked until I needed a faster or more flexible approach. My advice to you is don’t worry about which way is “right” because this will slow you down. Find an approach that solves your problems and change that approach when it becomes necessary. Okay that will wrap it up for the lapply intro. As always there are many other examples I could present but hopefully this blog will help in your mastery of lists and looping over them.

apply

The goal of this blog entry is to introduce basic and essential information about the apply function. Even established R users get confused when considering this family of functions, especially when observing how many of them there are: apply, tapply, lapply, sapply, rapply, eapply, mapply. When I was new to R I was rarely satisfied with the all-too-basic explanations of these commands so I thought I would create a series on some of these functions to address the more common questions that newcomers might have. There is an add-on package called plyr that attempts to present a unified philosophy and approach to the process of “applying” functions to various R data structures though, for now, I’ll be sticking to the native R commands since they show up a lot in example code you are likely to encounter.

I’ll start with the apply command which expects a matrix as input. Depending on what function you specify when using the apply command, you will get back either a vector or a matrix. Let’s get started with a motivating example, which I hope will convince you on how useful the apply approach can be. Here we create a matrix of 16 elements from a normal distribution of mean 10. We use the set.seed command to enable reproducibility. Imagine that each column represents numerical measurements and the rows represent individual samples. Scientific data, such as microarray and metabolomic information, can often follow this pattern. For those of you concerned with authenticity please don’t worry – I’ll introduce some “real” data later on in the posting.

set.seed(1)  # Makes the call to rnorm generate the same numbers every time
( mymat <- matrix(round(rnorm(16,10),2),4,4) )

      [,1]  [,2]  [,3]  [,4]
[1,]  9.37 10.33 10.58  9.38
[2,] 10.18  9.18  9.69  7.79
[3,]  9.16 10.49 11.51 11.12
[4,] 11.60 10.74 10.39  9.96

Doing it the hard way

Let’s say that we want to take the mean of each column. Another way to say this is that we want to apply the mean function to each column in the matrix. This is the way to start thinking about things like this in R. But first, how might we do this if we don’t know anything about the apply command ?

mean(mymat[,1])
[1] 10.0775

mean(mymat[,2])
[1] 10.185

mean(mymat[,3])
[1] 10.5425

mean(mymat[,4])
[1] 9.5625

# Or we could stash the means into a vector if we need to use that information elsewhere

( mymatcolmean <- c(mean(mymat[,1]),mean(mymat[,2]),mean(mymat[,3]),mean(mymat[,4])) )

[1] 10.0775 10.1850 10.5425  9.5625

See what I did there ? I took the mean of each column by invoking the mean function four times and substituting columns 1-4 on each invocation. This is a bit crude but entirely effective and we do get the correct answer. In the second example I created a vector called mymatcolmean to hold the information. The two approaches are equivalent though neither scales very well. Imagine if there were 10, 100, or 1,000 columns. Sure you could do cut and paste but it is still a tedious process. Surely there must be a better way. Well, if you have experience with programming languages like C, FORTRAN, Java, or Perl then you probably know how to use a for-loop structure to solve this problem. R supports for-loops too but it also offers some ways to avoid using them.

retvec <- vector()
for (ii in 1:ncol(mymat)) {
  retvec[ii] = mean(mymat[,ii])
}
retvec
[1] 10.0775 10.1850 10.5425  9.5625

# We could even put this into a function for later use in case we need it.

myloop <- function(somemat) {
 retvec <- vector()
 length(retvec) <- ncol(somemat)
 for (ii in 1:ncol(somemat)) {
   retvec[ii] <- mean(somemat[,ii])
 }
 return(retvec)
}

myloop(mymat)
[1] 10.0775 10.1850 10.5425  9.5625

# This will now work for any matrix but of course it is specific to the columns of the matrix.

set.seed(1)

newmat <- matrix(rnorm(100),10,10)

myloop(newmat)
 [1]  0.1322028  0.2488450 -0.1336732  0.1207302  0.1341367  0.1434569  0.4512100 -0.2477361
 [9]  0.1273603  0.1123413

So this would work but what if we wanted to take the mean of the rows instead of the columns ? We could do that but we then have to rewrite our function. And, while we are at it, in order to make our function more general what about providing an argument that indicates whether we want to operate on the rows or the columns ? I’ll use "1" to indicate rows and "2" to indicate columns.

myloop <- function(somemat,rc=2) {
  retvec <- vector()
  length(retvec) <- ncol(somemat)
  
  # Check to see if we are taking the mean of the columns or rows
  # 1 indicates rows, 2 indicates columns
  
  if (rc == 2) {
    for (ii in 1:ncol(somemat)) {
      retvec[ii] <- mean(somemat[,ii])
    }
  } else {
    for (ii in 1:nrow(somemat)) {
      retvec[ii] <- mean(somemat[ii,])
    }
  }
  return(retvec)
}

# Okay let's make sure it works. 

myloop(mymat,2)   # This is correct
[1] 10.0775 10.1850 10.5425  9.5625

myloop(mymat,1)   # This is correct
[1]  9.9150  9.2100 10.5700 10.6725

mean(mymat[1,])  # A spot check
[1] 9.915

Doing it the easy way

Well okay that’s good but what if we then wanted our myloop function to accommodate a function other than mean ? Like quantile, sum, fivenum, or even a function of our own design ? We could add another argument to our function to let the user specify the name of a function to be applied to the rows or columns. But before we do more work please consider that R already has something that will do the job for us – the apply command.

apply(mymat, 2, mean)
[1] 10.0775 10.1850 10.5425  9.5625

apply(mymat, 1, mean)
[1]  9.9150  9.2100 10.5700 10.6725

See how much easier that is than writing our own looping function ? It has been my observation that those well versed in traditional programming languages have a bigger problem getting used to the apply function than newcomers simply because experienced programmers are more accustomed to writing their own summary functions. They just dive in and start coding. But R short circuits this approach by providing the apply family of commands. Note also that we can substitute in any function we want to.


apply(mymat,2,class)  # What class do the columns belong to ?
[1] "numeric" "numeric" "numeric" "numeric"

apply(mymat,2,sum)    # Get the sum of all the columns
[1] 40.31 40.74 42.17 38.25

apply(mymat,1,range)  # Get the range of all rows
      [,1]  [,2]  [,3]  [,4]
[1,]  9.37  7.79  9.16  9.96
[2,] 10.58 10.18 11.51 11.60

apply(mymat,2,fivenum) # Get the fivenum summary for each column
       [,1]   [,2]   [,3]   [,4]
[1,]  9.160  9.180  9.690  7.790
[2,]  9.265  9.755 10.040  8.585
[3,]  9.775 10.410 10.485  9.670
[4,] 10.890 10.615 11.045 10.540
[5,] 11.600 10.740 11.510 11.120

Some activities on matrices are so common that R actually has dedicated functions for them that are very efficient and fast:

rowMeans(mymat)     # Equivalent to apply(mymat,1,mean)
[1]  9.9150  9.2100 10.5700 10.6725

colMeans(mymat)     # Equivalent to apply(mymat,2,mean)
[1] 10.0775 10.1850 10.5425  9.5625

rowSums(mymat)      # Equivalent to apply(mymat,1,sum)
[1] 39.66 36.84 42.28 42.69

colSums(mymat)      # Equivalent to apply(mymat,2,sum)
[1] 40.31 40.74 42.17 38.25

Passing additional arguments

Many explanations I’ve seen for the apply family of functions omit any discussion on how to provide arguments to the function you pass to apply. I’m not sure why as this is an extremely important consideration. To show you what I mean let’s go back to the apply example that uses the mean function.

apply(mymat, 2, mean)
[1] 10.0775 10.1850 10.5425  9.5625

This is easy to understand though what happens if we wish to pass arguments to the mean function such as “trim” or “na.rm” ? For example maybe we wish to first trim off some portion of the data before we apply the mean function. If we did this using the mean function directly it would look like this:

mean(mymat[,1],trim=0.5)
[1] 9.775

# You might then be tempted to do something like this when using the apply command

apply(mymat, 2, mean(trim=0.5))
Error in mean.default(trim = 0.5) : 
  argument "x" is missing, with no default

# Or maybe this ?

apply(mymat, 2, mean(x,trim=0.5))
Error in mean(x, trim = 0.5) : object 'x' not found

To get some idea on how to do this correctly look at the help information for apply. It’s not exactly jumping out at you but the answer is there. Do you see it ?

     apply(X, MARGIN, FUN, ...)
     
Arguments:

       X: an array, including a matrix.

  MARGIN: a vector giving the subscripts which the function will be
          applied over.  E.g., for a matrix ‘1’ indicates rows, ‘2’
          indicates columns, ‘c(1, 2)’ indicates rows and columns.
          Where ‘X’ has named dimnames, it can be a character vector
          selecting dimension names.

     FUN: the function to be applied: see ‘Details’.  In the case of
          functions like ‘+’, ‘%*%’, etc., the function name must be
          backquoted or quoted.

     ...: optional arguments to ‘FUN’.

Oh, so the ellipsis (...) implies that we can simply append arguments after the reference to the function. Of course it is reasonable to assume that the arguments you pass must make sense to the function. That is, you cannot pass an argument that isn’t valid for that function !

apply(mymat,2,mean,trim=0.5)
[1]  9.775 10.410 10.485  9.670

# We can pass additional arguments

apply(mymat,2,mean,trim=0.5, na.rm=T)
[1]  9.775 10.410 10.485  9.670
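
An equivalent way to pass those arguments is to wrap the call in an anonymous function, which some people find more explicit:

apply(mymat, 2, function(x) mean(x, trim=0.5))
[1]  9.775 10.410 10.485  9.670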

Using our own functions with apply

We can also use a function that we have written and pass it to the apply command. As an example let’s say that we want to express each element in a column as a proportion of the sum of the column it occupies. Here is how we would do this without using apply. As in the opening example this approach doesn’t scale very well although we do get the correct answers.

mymat[,1]/sum(mymat[,1])
[1] 0.2324485 0.2525428 0.2272389 0.2877698

mymat[,2]/sum(mymat[,2])
[1] 0.2535592 0.2253314 0.2574865 0.2636230

mymat[,3]/sum(mymat[,3])
[1] 0.2508893 0.2297842 0.2729429 0.2463837

mymat[,4]/sum(mymat[,4])
[1] 0.2452288 0.2036601 0.2907190 0.2603922

But using our new found knowledge of the apply command we can make this easier and more general. First, let’s write a function that given a vector, (which is what each column or row of a matrix is), will return a vector where each element is a proportion of the sum of the input vector.

myfunc <- function(x) {
  return(x/sum(x))
}

# Check it out to make sure it works

myfunc(mymat[,1])
[1] 0.2324485 0.2525428 0.2272389 0.2877698

all.equal(myfunc(mymat[,1]), mymat[,1]/sum(mymat[,1]))
[1] TRUE

So now we can pass this to the apply command directly


apply(mymat, 2, myfunc)
          [,1]      [,2]      [,3]      [,4]
[1,] 0.2324485 0.2535592 0.2508893 0.2452288
[2,] 0.2525428 0.2253314 0.2297842 0.2036601
[3,] 0.2272389 0.2574865 0.2729429 0.2907190
[4,] 0.2877698 0.2636230 0.2463837 0.2603922

# If this is correct then the sum of each column should be 1.

colSums(apply(mymat,2,myfunc))
[1] 1 1 1 1

# which is equivalent to the following nested apply call

apply( apply(mymat,2,myfunc), 2, sum)
[1] 1 1 1 1

The second example above, the one that has a call to apply within a call to apply, is known as a “composite call”, which can be confusing to newcomers. However, in the world of R, which aggressively implements functional programming concepts, the use of composite function calls is quite common and encouraged. But, there is no requirement to do things this way. Also, I should point out that R has a dedicated function for computing proportions in a matrix. Check out the prop.table command.
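
To follow up on that last suggestion, prop.table with margin=2 (columns) reproduces the matrix we computed above in a single call; output omitted since it matches the earlier result:

prop.table(mymat, margin=2)   # each element divided by its column sum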

Working anonymously

In reality we don’t have to predefine the function. We simply pass the function to apply in an “anonymous” fashion, which means that we don’t even give it a name. It lives only for the duration of the call to apply.

apply(mymat, 2, function(x) x/sum(x))
          [,1]      [,2]      [,3]      [,4]
[1,] 0.2324485 0.2535592 0.2508893 0.2452288
[2,] 0.2525428 0.2253314 0.2297842 0.2036601
[3,] 0.2272389 0.2574865 0.2729429 0.2907190
[4,] 0.2877698 0.2636230 0.2463837 0.2603922

This can be confusing to newcomers and, if it is, then you can simply predefine the function using a name and then refer to it by name when using the apply command. Sometimes you see calls to apply that look like the following. This is done in an attempt to improve readability though that is subjective. In terms of results or efficiency there are no differences.

apply(mymat, 2, function(x) {
                      return(x/sum(x))
                })

Performance vs the for loop

If you’ve read any of my other postings you will know that I have an interest in performance. Is there an efficiency gain when using the apply approach vs. the for loop approach ? To test this let’s read in some “real” Microarray data, which comes from Su et al, 2002.

url <- "http://steviep42.bitbucket.org/data/su_et_al_2002.txt"
mydata <- read.csv(url,header=T)

# B = brain, BF = fetal brain, L = liver, LF = fetal liver

str(mydata)
'data.frame':	12626 obs. of  8 variables:
 $ B1 : num  9.37 11.57 7.79 6.56 7.26 ...
 $ B2 : num  9.5 11.51 6.69 7.45 7.39 ...
 $ FB1: num  7.89 10.56 7.91 5.93 8.75 ...
 $ FB2: num  7.96 10.34 7.57 7.53 8.19 ...
 $ FL1: num  7.7 10.26 8.25 7.84 9.45 ...
 $ FL2: num  8.1 10.13 8.53 8.16 8.19 ...
 $ L1 : num  9.22 10.23 8.39 11.02 9.19 ...
 $ L2 : num  8.9 10.49 8.47 11.09 9.42 ...

# Let's check the mean and standard deviation for each column

apply(mydata,2,function(x) c(mu=mean(x),sd=sd(x)))
         B1       B2      FB1      FB2      FL1      FL2       L1       L2
mu 7.615090 7.431602 7.879343 7.770545 7.730293 7.601395 7.415937 7.518259
sd 2.257218 2.452364 2.150741 2.254736 2.207076 2.355612 2.405326 2.365394

Okay well let’s do a t-test across all rows for the brain tissue types. This becomes really easy to do with the apply command.


mypvals <- apply(mydata, 1, function(x) { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value } )

round(mypvals[1:5],4)
 100_g_at   1000_at   1001_at 1002_f_at 1003_s_at 
   0.0090    0.0494    0.5211    0.8001    0.1349 
 

But is this faster than writing our own loop to do the t-test ? We’ll first need to write our own code to do the t-test. It might look something like this:


myttest <- function(somemat) {
  retvec <- vector()
  length(retvec) <- nrow(somemat)
  for (ii in 1:nrow(somemat)) {
    retvec[ii] <- t.test(somemat[ii,c("B1","B2")], somemat[ii,c("FB1","FB2")])$p.value
  }
  names(retvec) <- row.names(somemat)
  return(retvec)
}

Next we’ll make sure that this function returns the same results as the apply version after which we will time each approach.

v1 <- myttest(mydata)

v2 <- apply(mydata, 1, function(x) { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value } )

all.equal(v1,v2)
[1] TRUE

# Let's time the two different approaches. The benchmark function used below
# comes from the rbenchmark package.

library(rbenchmark)

benchmark(forloop = myttest(mydata), apply=apply(mydata, 1, function(x) 
                         { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value }),
                           replications=5,columns=c("test","elapsed"))
     test elapsed
2   apply  19.616
1 forloop  61.800

In this case it appears that the apply approach is indeed quicker than writing our own function though in my experience, with more general cases, the difference in execution time isn’t as drastic. In the end I suggest you use whichever approach makes the most sense to you to get the job done. However, I think that as you become more confident with R you will be attracted to the apply command since it basically replaces the for-loop. So whenever you find yourself tempted to write a for-loop, even though you are used to it in other languages, you should remind yourself that the “R-like” way to do it is to use apply.

I’m frequently asked by newcomers to R to provide an easy to follow generic set of instructions on how to download data, transform it, aggregate it, make graphs, and write it all up for publication in a high impact journal – all by the end of the day ! While such a request is somewhat understandable coming from a student it’s somewhat awkward when coming from a seasoned research investigator, (i.e. someone who should know better). Obviously, no such set of instructions exists given the astonishing variety of available data stored in differing formats with varying degrees of adherence to a standard(s). Simply put, “real world” data wrangling and analysis can be a complex process independent of the language you choose to employ. Nonetheless, what I’ll attempt to do here is provide a “tour” involving some data I picked at random. I’ll be sticking to the problems of downloading, filtering, and graphing some of the data. And rather than dictate to you what is the “best way” to do something I’ll simply show you what is on offer and you can make your own decisions. R is amazingly flexible, which means that there are many different ways to do the same thing.

I have chosen information relating to San Francisco restaurant inspections which can be found at https://data.sfgov.org/ along with an impressive variety of data sets involving crime reports, parking availability, bus routes, campaign funding, and many other types of civic info. For reference we can map the zipcodes mentioned in the health reports using methods outlined in my blog post on Geocoding. This is basically the area we are considering.

San Francisco restaurants: Mmm, let's eat !

Without even knowing how the restaurant information is organized one can easily imagine a number of interesting questions such as, “What is the average health score for all restaurants ? What restaurants have the highest (or lowest) scores ? What is the distribution of scores across zip codes ? Are there seasonal scoring trends ?”. While you can download the files from the above website I have saved a copy on my web site in case the original location changes at some point. The data is available in a zipped file that contains three CSV files that we can easily unpack into data frames.

url <- "http://steviep42.bitbucket.org/data/SFFoodProgram_Complete_Data.zip"
download.file(url,"SFFood.zip")
system("unzip -l SFFood.zip")

Archive:  SFFood.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
  4613344  04-23-14 09:51   violations_plus.csv
  1175067  04-23-14 09:51   businesses_plus.csv
  1754811  04-23-14 09:51   inspections_plus.csv
 --------                   -------
  7543222                   3 files

There are three files each of which has some key information relating to the restaurant inspection process. I should point out that a description file comes with the dataset, which you can read to get a better idea about the column names and formats. But we can also use R to help us understand what is going on. Let’s read these files into data frames:

businesses  <- read.csv(unz("SFFood.zip","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("SFFood.zip","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("SFFood.zip","inspections_plus.csv"),header=T,stringsAsFactors=F)
str(violations)

'data.frame':	55800 obs. of  5 variables:
 $ business_id    : int  10 10 10 10 10 10 17 17 17 17 ...
 $ date           : int  20140114 20140114 20140114 20120403 20110428 20121114 20130605 20130605 20130605 20140312 ...
 $ ViolationTypeID: int  103119 103145 103154 103154 103119 103154 103154 103144 103142 103142 ...
 $ risk_category  : Factor w/ 4 levels "High Risk","Low Risk",..: 3 2 2 2 3 2 2 2 2 2 ...
 $ description    : Factor w/ 70 levels "","Consumer advisory not provided for raw or undercooked foods",..: 22 17 62 62 22 62 62 58 61 61 ...

str(inspections)

'data.frame':	40935 obs. of  4 variables:
 $ business_id: int  10 10 10 10 10 10 10 10 17 17 ...
 $ Score      : int  NA 92 98 NA 98 100 NA 96 94 NA ...
 $ date       : int  20140124 20140114 20121114 20120920 20120403 20110928 20110601 20110428 20140312 20130711 ...
 $ type       : Factor w/ 14 levels "Administrative or Document Review",..: 11 13 13 11 13 13 11 13 13 11 ...

str(businesses)

'data.frame':	6109 obs. of  17 variables:
 $ business_id         : int  10 17 19 24 29 31 37 45 48 50 ...
 $ name                : Factor w/ 5562 levels " HOL N JAM LEMONADE STAND  #2",..: 5110 1901 3525 3562 1067 3509 793 1027 265 4697 ...
 $ address             : Factor w/ 5444 levels "","   ","  VARIOUS LOCATIONS  ",..: 119 1868 548 3892 713 2478 3506 2870 4814 1446 ...
 $ city                : Factor w/ 1 level "San Francisco": 1 1 1 1 1 1 1 1 1 1 ...
 $ state               : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
 $ postal_code         : Factor w/ 47 levels "","00000","04102",..: 12 29 16 12 10 35 24 17 27 10 ...
 $ latitude            : num  37.8 37.7 37.8 37.8 37.8 ...
 $ longitude           : num  -122 -122 -122 -122 -122 ...
 $ phone_number        : num  NA 1.42e+10 NA NA 1.42e+10 ...
 $ TaxCode             : Factor w/ 32 levels "AA","H07","H08",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ business_certificate: int  779059 78443 NA 352312 939325 346882 315736 340024 318022 NA ...
 $ application_date    : Factor w/ 2327 levels "","01/01/2005",..: 1 620 1 1 2137 1 1473 1836 1 1 ...
 $ owner_name          : Factor w/ 5294 levels "","\tCARLIN'S CORNER COFFEESHOP LLC",..: 4746 2704 61 3387 186 3335 4297 1809 5198 333 ...
 $ owner_address       : Factor w/ 5366 levels "","\t1 INDEPENDENCE POINTE #305",..: 2898 4397 484 3785 1479 2492 3413 414 1103 2206 ...
 $ owner_city          : Factor w/ 335 levels "","\tBRISBANE",..: 239 88 239 239 88 239 241 226 241 280 ...
 $ owner_state         : Factor w/ 51 levels "","ARIZONA","AZ",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ owner_zip           : Factor w/ 376 levels "","\t28273","\t29615",..: 165 139 170 165 139 199 183 178 188 177 ...

It looks like all three data frames have a column in common called “business_id”, which we can use as a “key” to link/merge them. Even without merging we can already answer some basic questions: there were 40,935 inspections, which led to 55,800 violations over a time range of 04/25/2011 to 04/22/2014. I use the strptime function here to turn the character string dates into POSIX dates.

inspections$date <- strptime(inspections$date,format="%Y%m%d")
violations$date  <- strptime(violations$date,format="%Y%m%d")
range(inspections$date)
[1] "2011-04-25 EDT" "2014-04-22 EDT"

Okay let’s extract information just for the year 2013 and use that as our base data set. Note that I could merge the data frames into a single data frame and we will eventually look at how to do that but for now let’s see what we can get out of each data frame individually.

start_time <- strptime("20130101","%Y%m%d")
stop_time  <- strptime("20131231","%Y%m%d")
inspections <- subset(inspections, date > start_time & date < stop_time)
violations <- subset(violations, date > start_time & date < stop_time)

range(inspections$date)
[1] "2013-01-02 EST" "2013-12-30 EST"

nrow(inspections) 
[1] 13110

nrow(violations)
[1] 19091

Relative to the restaurant inspections process, the way I understand it is that an inspector will visit a restaurant at some frequency or in response to a diner’s complaint. The inspector will evaluate the establishment, document any code violations at a severity level of low, medium, or high, and ultimately assign a numeric score from 0 to 100 with the following possible meanings: 0-70: Poor, 71-85: Needs Improvement, 86-89: Adequate, 90-100: Good. If violations were noted then one or more followup inspections will usually occur to ensure that the violations are adequately addressed. Such visits usually do not involve a “re-scoring”, just a verification that the establishment dealt with the previously observed violations. Evidently this could take multiple visits.

We might then expect that in the inspections file there are multiple rows for a business starting with the original visit and any associated followups. In the case of the followups it looks like the Score is set to NA. Let’s see if that is true. According to the inspections file a “Routine-Unscheduled” inspection was performed on 06/05/2013 for business_id 17 that resulted in a score of 94, which is good, although in consulting the violations file it seems that there were three “Low Risk” violations noted. A reinspection happened on 07/11/2013.


head(inspections[order(inspections$business_id),])

   business_id Score       date                  type
10          17    NA 2013-07-11 Reinspection/Followup
11          17    94 2013-06-05 Routine - Unscheduled
18          19    96 2013-09-04 Routine - Unscheduled
23          24   100 2013-11-18 Routine - Unscheduled
27          29    87 2013-10-01 Routine - Unscheduled
34          37    96 2013-08-08 Routine - Unscheduled


head(violations[order(violations$business_id),])

   business_id       date ViolationTypeID risk_category                                      description
7           17 2013-06-05          103154      Low Risk     Unclean or degraded floors walls or ceilings
8           17 2013-06-05          103144      Low Risk Unapproved or unmaintained equipment or utensils
9           17 2013-06-05          103142      Low Risk                 Unclean nonfood contact surfaces
18          19 2013-09-04          103133 Moderate Risk           Foods not protected from contamination
23          29 2013-10-01          103142      Low Risk                 Unclean nonfood contact surfaces
24          29 2013-10-01          103120 Moderate Risk          Moderate risk food holding temperature 

Grading

By referring to the inspections file we can answer one of our questions easily. What is the average health score across all restaurants ?

mean(inspections$Score,na.rm=T)
[1] 91.95652 

But let’s dig a little deeper. I’ll create a factor called rating that implements the scoring interpretation given above. This is easy to do using the cut command. We’ll then summarize the mean score for each category and follow that up with a boxplot of the scores per category. This should give us some initial ideas about the data.


inspections$rating <- cut(inspections$Score,breaks=c(0,70,85,89,100),right=T,
    labels=c("P","NI","A","G")) 

# P = Poor, NI = Needs Improvement, A = Adequate, G = Good

tapply(inspections$Score,inspections$rating,mean)
       P       NI        A        G 
64.34337 79.97907 87.55324 95.94264 

ylim <- c(min(inspections$Score,na.rm=T)-5,max(inspections$Score,na.rm=T)+5)

# Capture the boxplot output to get the number of observations per category

myb <- boxplot(Score~rating,data=inspections,main="Scores per Rating Category",
               col=rainbow(4),ylim=ylim)

leg.txt <- paste(levels(inspections$rating),myb$n,sep=" : ")

legend(3,70,leg.txt,title="Obs. per Category",cex=0.8,pch=19,col=rainbow(4))

Boxplots of score per category

Hmm, this information is a little shocking since no one wants to think that there are restaurants anywhere with a health score in the “Poor” category, let alone 166 of them as we see here. Who does the lowest score of 2013 belong to ? Business id 74522. Note that it would be really easy to consult the “businesses” data frame at this point to determine the full name associated with this id, though since this is a blog on R and not on restaurant reviews I’ll leave it to you to determine what business name that is so you can make your dining plans accordingly.

# What business id had the lowest score in 2013 ?

inspections[which(inspections$Score==min(inspections$Score,na.rm=T)),]

      business_id Score       date                  type rating
39498       74522    42 2013-10-03 Routine - Unscheduled      P

# Let's see what the violations were

violations[violations$business_id==74522,][,-3]

 business_id       date  risk_category                                        description
       74522 2013-10-03                Low Risk                   Unclean nonfood contact surfaces
       74522 2013-10-03               High Risk            Unclean hands or improper use of gloves
       74522 2013-10-03                Low Risk       Unclean or degraded floors walls or ceilings
       74522 2013-10-03                Low Risk   Improper storage of equipment utensils or linens
       74522 2013-10-03                Low Risk                              Improper food storage
       74522 2013-10-03           Moderate Risk                          Improper thawing methods 
       74522 2013-10-03           Moderate Risk Inadequate and inaccessible handwashing facilities
       74522 2013-10-03               High Risk                       High risk vermin infestation
       74522 2013-10-03               High Risk        Unclean or unsanitary food contact surfaces
       74522 2013-10-03               High Risk                High risk food holding temperature 
       74522 2013-10-03               High Risk                   Contaminated or adulterated food
       74522 2013-10-03               High Risk              Improper cooking time or temperatures
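
For the record, the lookup alluded to earlier is a one-liner against the businesses data frame; I'll show the command but not the output so you can satisfy your own curiosity:

# Which establishment does business_id 74522 belong to ?
businesses[businesses$business_id == 74522, c("business_id","name")]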

So let’s just go ahead and find the 25 restaurants that have the highest number of “high risk” violations. How would we find those ?

# Pull out the high risk violations

violations_high <- violations[violations$risk_category == "High Risk",]

# Split the data frame by business_id

my_high_viols <- split(violations_high,violations_high$business_id)

# Count all violations for a given business id and then sort all
# business ids from highest to lowest

( violation_high_count <- sort(sapply(my_high_viols,nrow),decreasing=TRUE)[1:25] )

33446  2427   286  3480 70068  6695 16806 67330 70281 73926 74522   853  1000  1343  1358  1458 
   11     9     7     7     7     6     6     6     6     6     6     5     5     5     5     5 
 2189  2302  2401  3217  3948  4590 18800 21524 36201 
    5     5     5     5     5     5     5     5     5 

inspections[inspections$business_id==33446,]
      business_id Score       date                  type rating 
22145       33446    73 2013-12-11 Routine - Unscheduled     NI      

Notice that while business_id 74522 has the worst overall score, 42, it didn’t actually receive the highest number of “high risk” violations in 2013. That honor goes to business_id 33446, which received 11 violations and a score of 73 on 12/11/2013. But let’s go back to a general view here. I want to slice up the inspections into quarters of the year (four groups of three months) to see if perhaps scores follow some seasonal pattern, at least from a visual point of view. To do this we need to use some R date functions.

inspections$quarters <- quarters(inspections$date)
inspections$quarters <- factor(inspections$quarters,levels=c("Q1","Q2","Q3","Q4"),ordered=TRUE)

# Now let's build the barchart out of a table of quarters vs ratings

library(lattice)
barchart(table(inspections$quarters,inspections$rating),auto.key=list(columns=4),
         main="Violation Category Counts by Quarter")

Barchart of quarters vs. ratings

Well it doesn’t look like there are drastic differences between the scoring categories on a quarterly basis, though there appear to be slightly fewer scores in the “NI” category in Quarter 1 than in all other quarters. We could drop down to a month level of granularity to check things out from month to month. All we would need to do is use the months() function instead of quarters() and then apply level labels accordingly. I’ll leave that to you as an exercise.
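
As a head start on that exercise, a sketch of the month-level version might look like the following (this assumes an English locale so that months() returns names matching the built-in month.name constant):

inspections$month <- months(inspections$date)
inspections$month <- factor(inspections$month, levels=month.name, ordered=TRUE)
barchart(table(inspections$month, inspections$rating), auto.key=list(columns=4),
         main="Violation Category Counts by Month")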

Grades by zipcode

Well what’s next ? Let’s take a look at the distribution of scores across zip codes, which requires pairing the scores from the inspections file with the zip codes from the businesses file. We’ll do some merging of the data frames to make our process a bit easier. First, how many unique zip codes do we have ?

unique(businesses$postal_code)
 [1] "94104"     "94124"     "94109"     "94103"     "94133"     "94118"     "94110"     "94122"     "94115"    
[10] "94131"     "94111"     "94117"     "94107"     "94108"     "94102"     "94132"     "94105"     "94134"    
[19] "94116"     "94121"     "94112"     "94127"     "94123"     "94114"     "94513"     "94545"     "94066"    
[28] "941033148" ""          "94158"     "95105"     "94140"     "94013"     "94130"     "CA"        "92672"    
[37] "94120"     "94143"     "94609"     "94101"     "00000"     "CA  94122" "CA  94523" "94188"     "94014"    
[46] "04102"     "94129"

Most of them look legitimate whereas others don’t appear to be legal and are perhaps the result of data entry errors. For example, while "CA  94122" does contain a legal zip code, the expected format is just the 5 digits. Also, the "00000" might be used to signal a missing value but we don’t know. For simplicity we’ll simply strip out values that are NOT five digits. We also have some oddities like "04102" (Portland) and "92672" (San Clemente), although if we look closely at the businesses data frame for these records we see street names of, respectively, "366 GOLDEN GATE AVE" and "1530 Market Street", which are San Francisco addresses. Looks like somebody messed up when entering data. So in a “real” analysis we would need to deal with these issues more definitively. But we’ll pass on these for now and merge the inspections data frame with a new version of the businesses data frame that has the normalized zip codes. Finally, we’ll plot the barchart of the average score for each zip code.


myb <- businesses[nchar(businesses$postal_code)==5 & businesses$postal_code != "00000",]

unique(myb$postal_code)

[1] "94104" "94124" "94109" "94103" "94133" "94118" "94110" "94122" "94115" "94131" "94111" "94117" "94107"
[14] "94108" "94102" "94132" "94105" "94134" "94116" "94121" "94112" "94127" "94123" "94114" "94513" "94545"
[27] "94066" "94158" "95105" "94140" "94013" "94130" "92672" "94120" "94143" "94609" "94101" "94188" "94014"
[40] "04102" "94129"

# Let's merge the two data frames myb and inspections using business_id as a key.

mym = merge(myb[,c(1:2,6)],inspections,by="business_id")

head(mym)

business_id                               name postal_code Score       date
1          17               GEORGE'S COFFEE SHOP       94124    94 2013-06-05
2          17               GEORGE'S COFFEE SHOP       94124    NA 2013-07-11
3          19              NRGIZE LIFESTYLE CAFE       94109    96 2013-09-04
4          24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY       94104   100 2013-11-18
5          29                      CHICO'S PIZZA       94103    87 2013-10-01
6          37                        CAFE BISTRO       94118    NA 2013-01-30
                   type rating quarters
1 Routine - Unscheduled      G       Q2
2 Reinspection/Followup   <NA>       Q3
3 Routine - Unscheduled      G       Q3
4 Routine - Unscheduled      G       Q4
5 Routine - Unscheduled      A       Q4
6 Reinspection/Followup   <NA>       Q1

# Let's create an aggregation

hold <- aggregate(Score~postal_code,data=mym,mean)

hvec <- hold$Score

names(hvec) <- hold$postal_code

myCols <- colorRampPalette(c("blue", "red"))( length(hvec) )
mp <- barplot(rev(sort(hvec)),axisnames=F,col=myCols,main="Average Score per Zipcode")
axis(1,at=mp,labels=names(hvec),las=2,cex.axis=0.8)

Avg Score per Zip Code

Let’s see if the 100 lowest-scored restaurants are in the same zip codes as the 100 highest-scored restaurants. From the following it looks like there is significant overlap, so it’s not as if there is a big difference between the zipcodes.

hi <- unique(mym[order(mym$Score),][1:100,]$postal_code)

hi
 [1] "94133" "94103" "94102" "94111" "94108" "94116" "94127" "94110" "94122"
[10] "94121" "94114" "94123" "94107" "94132" "94118" "94109"

lo <- unique(mym[order(-mym$Score),][1:100,]$postal_code)
lo
 [1] "94104" "94103" "94111" "94122" "94117" "94108" "94115" "94132" "94105"
[10] "94110" "94123" "94118" "94107" "94112" "94133" "94131" "94121" "94102"
[19] "94109"

sum(lo %in% hi)
[1] 13

What’s for Dessert ?

What next ? This is really just the beginning. One approach that I didn’t take was to use the very cool sqldf package, which would allow us to treat the data frames as tables within a relational database. However, this assumes some familiarity with SQL which, at least in my experience, newcomers to R don’t usually possess, although it is worth learning if you plan on a career in data mining. As a teaser here is how we could have done some things using sqldf. Check this out and see you next time.

library(sqldf)
businesses  <- read.csv(unz("SFFood.zip","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("SFFood.zip","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("SFFood.zip","inspections_plus.csv"),header=T,stringsAsFactors=F)

# How many restaurants are there ?

sqldf("select count(*) from businesses")
count(*)
1     6109

# What is the average score for all restaurants ?

sqldf("select avg(Score) from inspections")
  avg(Score)
1   91.95652

# What are the average scores per zip code ?
# Note that since this is the raw data we have some "bad" zipcodes

sqldf("select avg(Score),businesses.postal_code from inspections join businesses using(business_id) group by postal_code")

   avg(Score) postal_code
1    91.85986            
2    90.00000       00000
3    96.00000       04102
4    96.50000       92672
5    98.33333       94013
6    85.66667       94014
7   100.00000       94066
8    95.75000       94101
9    91.73356       94102
10   91.34171       94103
11   98.00000   941033148
12   95.65000       94104
13   94.11030       94105
14   94.72059       94107
15   90.39931       94108
16   90.48051       94109
17   91.54456       94110

Originally posted on Rolling Your Rs:

In this article I discuss a general approach for Geocoding a location from within R, processing XML reports, and using R packages to create interactive maps. There are various ways to accomplish this, though using Google’s GeoCoding service is a good place to start. We’ll also talk a bit about the XML package that is a very useful tool for parsing reports returned from Google. XML is a powerful markup language that has wide support in many Internet databases so it is helpful. Lastly, we’ll use our knowledge to create a map of the tour dates on the Rolling Stones 1975 Tour of the Americas. Also, when I use the word “GeoCoding” this basically implies the process of taking a geographic location and turning it into a latitude / longitude pair.

What does Google Offer ?

Check out the main Geocoding page, which presents implementation details of the API as…


If you are a newcomer to R then you are probably quite busy learning the semantics of the language as you experiment with the apply family of commands or come up to speed on the grouping and conditioning capabilities offered by lattice graphics. And, along the way, you might have heard that R has the ability to “link in” code written in other languages such as C, C++, Fortran, and Java. This is true but until you are presented with a compelling use case you will most likely ignore such capability since you’ve already got plenty to do. But that’s where this blog can help. I’ll talk about how to integrate Fortran code with R and, in a later post, discuss doing the same with C and C++.

DISCLAIMER: Accomplishing this work requires a working Fortran compiler such as the GNU suite of compilers (g77, gfortran, gcc, etc.). Relative to operating systems I use Linux and OS X. I rarely use Windows, though I do know that the GNU suite is available for that OS so you should be able to link Fortran code into R there. However, I’ve never tried it. Additional disclaimer: the timings I present in this post are based on an Apple MacBook at OS level 10.9.2 with a 2.5 GHz i5 processor and 4GB of RAM. Your timings may vary.

Why Bother ?

Good question. Most people wind up wanting to access Fortran from R for a few reasons: they have some really fast and efficient Fortran code that they want to exploit within R; they have written some code in R that winds up being incredibly slow, so they write a much faster version in Fortran and then want to call it from R; or they need to access subroutines from external Fortran libraries. Lastly, it might simply be because your boss or faculty advisor is making you do it ! Whatever your reason(s) we’ll break the process of linking in Fortran code down into three general steps: 1) prepare the subroutine for compilation and generate a “shared object”, 2) load the shared object into an active R session, and 3) provide R variables to the shared object via the “.Fortran” call, which is part of the “Foreign Function Interface” within R.

Foreign                  package:base                  R Documentation

Foreign Function Interface

Description:

     Functions to make calls to compiled code that has been loaded into
     R.

Usage:

            .C(.NAME, ..., NAOK = FALSE, DUP = TRUE, PACKAGE, ENCODING)
      .Fortran(.NAME, ..., NAOK = FALSE, DUP = TRUE, PACKAGE, ENCODING)

Prepare the Fortran Subroutine

Next, I present a very simple Fortran 77 subroutine that computes the factorial of a number “n” and stashes the result into a variable called “answer”. And speaking of subroutines it is important to know that to use the .Fortran interface one must make reference to Fortran subroutines only – not Fortran functions or full on programs. So if you have some code that you want to bring in then you will need to embed that code within a subroutine definition. We also need to pay attention to the variable types as declared within the subroutine so we can match those types accordingly when calling the subroutine from R. This winds up being one of the more critical steps in the process.

        subroutine facto(n,answer)
c
c simple subroutine to compute factorial
c
        integer n, answer, i

        answer = 1
        do 100 i = 2,n
           answer = answer * i
  100   continue 
        
        end

You should make sure, of course, that this code does compile correctly. Our goal is to generate a shared object file (a “.so” file) that can be linked into R. Note also that our routine doesn’t print/write things to output. It simply uses its input variables and ultimately sets an output variable. Make your code lean.

$ ls
facto.f

$ gfortran -c facto.f

$ ls
facto.f	facto.o

$ gfortran -shared -o facto.so facto.o
$ ls
facto.f		facto.o		facto.so

So it looks like we are good to go here. However, instead of doing this compilation ourselves we could have allowed R to help us. In fact this is the preferred way to do it since it ensures the compilation is done “under the supervision” of the R tools. Let’s remove the .o and .so files and start over.

$ rm *.*o

$ ls
facto.f

$ R CMD SHLIB facto.f
<you will see various compilation output messages>

$ ls
facto.f		facto.o		facto.so

Load it

So now what ? Let’s fire up R. We’ll use the dyn.load command which is part of the foreign interface capability. Its purpose is to load/unload shared objects, which are also known as DLLs (dynamically loadable libraries).

system("ls")
facto.f		facto.o		facto.so
dyn.load("facto.so")

Okay not much happened there. What’s going on ? Well all we did was simply load the shared object. We have yet to use it. To do that we rely upon the “.Fortran” function. Keep in mind that the subroutine “facto” has two arguments both of which are integers. We’ll supply a value of 5 for “n” and we’ll pass a single integer as a value for the “answer” variable though that will be overwritten once the subroutine computes the “answer”.

system("ls")
facto.f		facto.o		facto.so

dyn.load("facto.so")

.Fortran("facto",n=as.integer(5),answer=as.integer(1))

$n
[1] 5

$answer
[1] 120

# Or more directly

.Fortran("facto",n=as.integer(5),answer=as.integer(1))$answer
[1] 120

If you are wondering what types are supported or shared between R and Fortran here is a list from the .Fortran man page. It is worth some time to peruse the man page as it provides some caveats and deeper explanations on how to interact with Fortran.

       R          Fortran          
       integer    integer          
       numeric    double precision 
       - or -     real             
       complex    double complex   
       logical    integer          

Wrap it Up!

Okay that was cool but this is sort of an awkward way to call the Fortran subroutine. We should probably write our own wrapper function in R to do the loading of the shared object and the call to .Fortran. Take a look at this approach which winds up being very “R like”.

myfacto <- function(num) {
  dyn.load("facto.so")
  retvals <- .Fortran("facto",n = as.integer(num), answer = as.integer(1))
  return(retvals$answer)
}
         
myfacto(5)
[1] 120

sapply(1:10,myfacto)
 [1]       1       2       6      24     120     720    5040   40320  362880 3628800

So with the wrapper approach we can use the Fortran subroutine just as we would any other R function since the call to .Fortran is “buried” in the wrapper function. We could make this a bit more robust by putting in some logic to see if the shared object is already loaded.

myfacto <- function(num) {
  if (!is.loaded('facto')) {
     dyn.load("facto.so")
  }
  retvals <- .Fortran("facto",n = as.integer(num), answer = as.integer(1))
  return(retvals$answer)
}
         
myfacto(5)
[1] 120

It’s all Convoluted

Well that was okay but let’s look at a more involved example. Let’s consider the idea of doing discrete convolution between two vectors. (Note: This is discussed in the “Writing R Extensions” manual). Why did I pick such an example ? Well, first, it’s commonly referenced in R literature and, second, it is a good motivating case for using an external language to speed up the processing. The algorithm itself isn’t hard to code up either in R or Fortran. However, the performance in R isn’t so good once the vectors get larger. Check it out:

conr <- function(x, y) {
    lx <- length(x)
    ly <- length(y)
    cxy <- numeric(lx + ly - 1)
    for(i in 1:lx) {
        xi <- x[i]
        for(j in 1:ly) {
            ij <- i+j-1
            cxy[ij] <- cxy[ij] + xi * y[j]
        }
    }
    return(cxy)
}

# Let's check the timings for vectors of different sizes

v1 = rnorm(100); v2 = rnorm(100)

system.time(conr(v1,v2))
   user  system elapsed 
  0.034   0.000   0.035 

v1 = rnorm(2000); v2 = rnorm(2000)

system.time(conr(v1,v2))
   user  system elapsed 
 13.195   0.020  13.215 

v1 = rnorm(4000); v2 = rnorm(4000)

system.time(conr(v1,v2))
   user  system elapsed 
 57.757   0.130  58.008 

The timings grow significantly longer as the sizes of the vectors grow. So passing vectors of size 10,000 could take a very long time. While this blog isn’t specifically on performance let’s do a little bit more coding to get an idea of just how poorly the convolution written in R performs. We’ll use this for later comparison with the performance numbers resulting from the Fortran subroutine. Let’s write a wrapper function to the conr function. This will call conr with a variable x that represents the size of the vectors we wish to convolute. If you don’t understand exactly what is going on here don’t worry – just think of it as more exposure to the apply family of commands.

timewrapconr <- function(x) {
    times <- system.time(conr(rnorm(x),rnorm(x)))[3]
    return(c(size=x,times))
}

# time the convolution for vectors of size 100,1000,2000, and 4000

(convtimes <- sapply(c(100,1000,2000,4000),timewrapconr))
           [,1]     [,2]     [,3]    [,4]
size    100.000 1000.000 2000.000 4000.00
elapsed   0.033    3.552   13.811   62.15

# Let's plot this
library(lattice)

xyplot(convtimes[2,]~convtimes[1,],type=c("p","l","g"),
       xlab = "vector size", ylab = "elapsed time in seconds",
       main = "Execution times for Convolution in R", pch = 19)
Figure 1. Plot of execution times

How do we address this problem ? Well there are opportunities for improvement within the R code by using vectorization techniques. A good start would be to somehow avoid the second for loop and there are ways to do that. In fact there is a way to avoid both loops altogether and maybe we’ll explore such an approach in another post. To give you a flavor of the idea, though, a quick sketch follows.
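Here is a rough sketch of that idea (just a sketch, not a fully tuned solution): since the inner loop only adds x[i] * y into a contiguous slice of the result vector, it can be replaced by a single vectorized assignment.

conr2 <- function(x, y) {
    lx <- length(x)
    ly <- length(y)
    cxy <- numeric(lx + ly - 1)
    for(i in 1:lx) {
        # add x[i] * y into the slice of cxy starting at position i
        cxy[i:(i + ly - 1)] <- cxy[i:(i + ly - 1)] + x[i] * y
    }
    return(cxy)
}

A change like this typically speeds things up considerably, but let’s see if writing the code in Fortran and then linking it in could help improve things even more. So here is the rewritten convolution algorithm, which we will save into a file called convolvef77.f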

      subroutine convolvef77 (x, lx, y, ly, xy)
c
c A basic implementation of convolution algorithm for two vectors
c I use zero-based arrays here.
c
      integer lx, ly, i, j
      double precision x(0:lx-1), y(0:ly-1), xy(0:lx+ly-2)
      do 20 i = 0, (lx-1) 
         do 15 j = 0, (ly-1) 
            xy(i+j) = xy(i+j) + x(i) * y(j) 
  15     continue  
  20  continue 
      end

# Save the above to a file called convolvef77.f
# Now compile it to a shared library

$ R CMD SHLIB convolvef77.f 

Next we’ll write a function in R to call the convolvef77 function. Start up R.

convolvef77 <- function(x,y) {
  dyn.load('convolvef77.so')
  lx = length(x)
  ly = length(y)
  retdata <- .Fortran("convolvef77",
                       x = as.double(x),
                       lx = as.integer(lx), 
                       y = as.double(y), 
                       ly = as.integer(ly), 
                       xy = double(lx+ly-1))$xy
  return(retdata)
}

# Now let's throw some large vectors at it. Look at how much better the times are

v1 = rnorm(4000); v2 = rnorm(4000)

system.time(convolvef77(v1,v2))
   user  system elapsed 
  0.012   0.000   0.012 

v1 = rnorm(8000); v2 = rnorm(8000)

system.time(convolvef77(v1,v2))
   user  system elapsed 
  0.047   0.001   0.083 

So the speed looks really good. Now let’s repeat the timing exercise we applied to the convolution done in R.

timewrapconf77 <- function(x) {
    times <- system.time(convolvef77(rnorm(x),rnorm(x)))[3]
    return(c(size=x,times))
}

(convtimes <- sapply(c(100,1000,2000,4000),timewrapconf77))
           [,1]  [,2]  [,3]    [,4]
size    100.000 1e+03 2e+03 4.0e+03
elapsed   0.045 2e-03 4e-03 1.3e-02

# Wow. This is FAST !!!!! Let's throw some bigger vectors at it.

(convtimes <- sapply(c(100,1000,2000,4000,10000,20000,50000),timewrapconf77))
         [,1]  [,2]  [,3]    [,4]    [,5]     [,6]      [,7]
size    1e+02 1e+03 2e+03 4.0e+03 1.0e+04 2.00e+04 50000.000
elapsed 1e-03 2e-03 4e-03 1.2e-02 7.2e-02 3.22e-01     2.074

# Plot the times

xyplot(convtimes[2,]~convtimes[1,],type=c("p","l","g"),
       xlab = "vector size", ylab = "elapsed time in seconds",
       main = "Execution times for Convolution in Fortran", pch = 19)
Execution times using Fortran

So using the Fortran subroutine took 2.0 seconds to convolute vectors of size 50,000 whereas using native R code to convolute a vector of size 1,000 took 3.5 seconds (these timings might vary depending on your architecture and OS). To get a better visual comparison let’s repeat the timings for both approaches, R and Fortran, and plot the results on the same graph so you can get some sense of proportion between the execution times. This isn’t hard to do. We’ll just rerun our timing functions:

(convtimesr <- sapply(c(100,1000,2000,4000,10000,20000),timewrapconr))

           [,1]     [,2]     [,3]     [,4]      [,5]      [,6]
size    100.000 1000.000 2000.000 4000.000 10000.000 20000.000
elapsed   0.034    3.374   14.118   64.894   355.409  1504.517
  
(convtimesf77 <- sapply(c(100,1000,2000,4000,10000,20000),timewrapconf77))
           [,1]    [,2]  [,3]    [,4]    [,5]     [,6]
size    100.000 1.0e+03 2e+03 4.0e+03 1.0e+04 2.00e+04
elapsed   0.071 2.3e-02 4e-03 1.2e-02 6.9e-02 2.99e-01

# Now plot them on the same graph

plot(convtimesr[1,],convtimesr[2,],xlab="Vector size",
     ylab="Elapsed time in seconds", 
     main="Convolution in R vs Fortran",
     type="l",col="red",lwd=2.5)

points(convtimesf77[1,],convtimesf77[2,],type="l",col="blue",lwd=2.5)

legend("topleft",c("R","Fortran"),col=c("red","blue"),
        lty=c(1,1),lwd=c(2.5,2.5))
grid()
Execution times for R and Fortran

Okay, I think you get the point here. Using the Fortran code definitely helped speed things up. However, speed might not be the only reason you choose to link in Fortran code. For example I know of people who have written the bulk of their thesis analysis work using Fortran and now seek to leverage that effort within R. Sure, they could recode their stuff into R but that would probably result in lower performance. Any time you have a significant body of work in one language you would like to avoid having to recode it in another. Lastly, there are other ways to bring in Fortran that I haven’t discussed here. The “inline” package allows one to compile Fortran code inline within a given R program, which might be more appealing to some. Hope this has been helpful.

Conditioning and grouping are two important concepts in graphing that allow us to rapidly refine our understanding of data under consideration. Conditioning, in particular, allows us to view relationships across “panels” with common scales. Each panel contains a plot whose data is “conditional” upon records drawn from the category that supports that particular panel (an example will help make this clear). Grouping allows us to view interactions within a given panel by making distinctions using different plot characters, size, or color. These ideas are discussed in Cleveland’s book “Visualizing Data”, and the Lattice (aka “Trellis”) graphics package attempts to implement the concepts and values presented there. These ideas can help us find interesting patterns in our data.

However, it is important to point out that all graphics packages available within R, (Base, Lattice, ggplot2, Grid), have their respective strengths and good work can be accomplished using any of them, although for the newcomer Base graphics is a reasonable starting point. There is an abundance of information on Base graphics both in printed literature and via Google results so ample support is available. As a newcomer to Base graphics it doesn’t take long before you are creating graphs like that in Figure 1, which is an approximation of a conditioned plot. Strictly speaking it does not adhere to the principles of conditioned plots but it’s useful to see how close we can get. We have a three “panel” plot of MPG vs Weight from the mtcars data frame which is built into R. Each panel contains a scatter plot of MPG vs Wt for one of the cylinder groups, (4, 6, or 8). The paneling allows us to conveniently compare the MPG vs Wt relationship across groups.

par(mfrow=c(1,3))

cyls <- split(mtcars, mtcars$cyl)
for (ii in 1:length(cyls)) {
     tmpdf <- cyls[[ii]]
     sname <- names(cyls)[ii]
     plot(tmpdf$wt, tmpdf$mpg, 
          main = paste("MPG vs Wt",sname,"Cyl"),
          ylim = c(0,40), xlab = "Wt / 1,000",
          ylab = "MPG", pch=19, col="blue")
     grid()
}
Figure 1

We see that the higher weights correspond to records with lower MPG figures. This idea is confirmed by the separate panel plots wherein we see lower MPG for increasing cylinder count, which is quite likely correlated with the weight of the automobile. (It is in fact correlated, with a correlation coefficient of 0.78). In any case the data in mtcars isn’t terribly difficult to understand, (though that’s not really the point here – we are simply trying to come to terms with how to plot data). We split up the data frame and looped through the resulting list and plotted MPG vs Wt for each subgroup. It is not a bad approach and is one that lends itself well to programming. On the other hand let’s check out how this kind of thing might be done using lattice graphics, which was designed in large part to make conditioned plots really easy. Let’s put it to the test.

library(lattice)
xyplot(mpg~wt | factor(cyl), data=mtcars, pch=19,
                main="MPG vs Wt", xlab="Wt/1,000",  ylab="MPG",layout=c(3,1),type=c("p","g"))
Figure 2

What’s Your Condition ?

This isn’t bad for a one line statement (two if you include the library call). Let’s break down what is happening in Figure 2. The “mpg~wt” part of the call to xyplot should be familiar as a formula where mpg represents the y-axis and wt represents the x-axis. The formula interface is commonly used in other R procedures such as lm as well as aggregation functions, so if this is new to you then it is important to learn about it.
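If the formula idiom is new to you, here is a quick illustration (using the same mtcars data) of the identical “y ~ x” notation showing up in a regression and in an aggregation:

# the same formula notation used outside of lattice
lm(mpg ~ wt, data = mtcars)                        # regress mpg on wt
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)    # mean mpg for each cylinder group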

The part of the call to xyplot that newcomers might not understand is the vertical bar character. The presence of this character indicates that what follows is a “conditioning” variable which is typically a category/factor. It also lets us know that the resulting plot will contain a number of panels, one for each unique value assumed by the conditioning variable(s). In the case of mtcars$cyl this will be three since the unique values assumed by cylinder are 4, 6, and 8. However, we really don’t have to know the number of values in advance. We can just plot it and view the defaults after which we can go in and add the “layout” argument to refine the visual presentation.

unique(mtcars$cyl)
[1] 6 4 8

But it gets better in that we can use more than one conditioning character at a time. We can look at a combination of factors in effect looking at the MPG vs Wt relationship as “cells” of a table. Sound complicated ? Not really. Check this out. We are asking xyplot to present MPG vs wt for each combination of Transmission and Cylinder type. This results in a 2 x 3 grid structure wherein the panels contain a scatterplot.

xyplot(mpg~wt | factor(cyl) + factor(am,labels=c("A","M")),
                data=mtcars, main="MPG vs Wt", 
                xlab="Wt/1,000", ylab="MPG",pch=19,type=c("p","g"))
Figure 3

Notice how lattice takes care of the “paneling” for us and supplies labels based on the given factor combinations. So we have a row for each Transmission type, (automatic or manual) each of which contains three columns/panels corresponding to a cylinder group of 4,6, or 8. This plot makes it trivial to determine what combination gets the best gas mileage assuming that is what we are looking for. The 4 Cylinder cars with Manual transmissions get the best gas mileage. Note that we could also further influence the layout and look of the graph and make adjustments to the “strip” labels in each panel but since this is an introductory article we’ll focus on general capabilities for now.
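Just to hint at what is possible, here is one way, via lattice’s strip.custom function, to put the variable name into the strip labels (consider this a sketch rather than the only way to do it):

xyplot(mpg ~ wt | factor(cyl), data = mtcars, pch = 19,
       layout = c(3,1), type = c("p","g"),
       main = "MPG vs Wt", xlab = "Wt/1,000", ylab = "MPG",
       strip = strip.custom(strip.names = TRUE, var.name = "Cylinders"))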

There is Safety in Groups !

Are there other ways that we can enhance the graph to reveal more meaning ? In fact there are. In particular we could employ the idea of “grouping”. However, before applying grouping to the above example, (Figure 3), let’s look at a stand-alone case to make sure you understand the concept. Suppose we want to look at the MPG vs wt graph within a single panel though we wish to see points colored according to the Transmission group they occupy. This is easy to accomplish with lattice by using the “groups” argument.

xyplot(mpg~wt,data=mtcars,groups=factor(am,labels=c("A","M")),
       pch=20,auto.key=list(columns=2),type=c("p","g"))
Figure 4

So that was pretty easy wasn’t it ? Instead of viewing the interaction across panels we can see it within a single panel by using the different colors to partition the data into groups. Now let’s say that we wanted the points to reflect the cylinder group they occupy except instead of using color we’ll use point size to indicate the relationship. Here we use a vector from 3 down to 1 in conjunction with the cex argument to reflect the size of the points coming from, respectively, cylinder groups 4, 6, and 8. Points coming from cylinder group 4 are going to be larger than points coming from groups 6 and 8. The MPG is better with 4 cylinder autos and we want that reflected in terms of larger size.
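The exact code for Figure 5 isn’t shown here, but a call along the following lines (treat it as a sketch) would generate such a plot, with the cex vector matched to the cylinder groups in sorted order (4, 6, 8):

xyplot(mpg ~ wt, data = mtcars, groups = factor(cyl),
       cex = c(3,2,1), pch = 20, type = c("p","g"),
       auto.key = list(columns = 3))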

Figure 5

Now if we turn our attention back to the situation wherein we used two factors to condition the MPG vs Wt plots we should be able to combine grouping with conditioning to see if there is anything lurking in the data that we might have missed previously. So what I’ll do here is plot MPG vs Wt as conditioned by Cylinders while grouping by Transmission.

xyplot(mpg ~ wt | factor(cyl), data=mtcars, pch=16,layout=c(3,1),
                  groups = factor(am,labels=c("Auto","Manual")),
                  type=c("p","g"),auto.key = TRUE)
Figure 6

Do we see anything interesting here ? Perhaps. If you have a 4 cylinder car and a manual transmission then you get the best gas mileage. This isn’t anything new. However, if we look at the cars in the 8 cylinder panel we see that cars with a manual transmission, (there are only two of them), don’t seem to perform as well relative to MPG as they did in the 4 cylinder group. In the 6 cylinder panel we see that auto and manual transmissions are closer in MPG than in the 4 cylinder group. So we might form the hypothesis that having a manual transmission is related to significantly better gas mileage only if you have a 4 cylinder car. Or maybe we just need to find more 8 and 6 cylinder cars with manual transmissions to see if the pattern holds. It is important to note that these cars were surveyed in 1974 so our findings would probably change were we to use updated information. Well this is all cool stuff but for now we’ll stick with the graphics details instead of further analysis.

Out of curiosity what would this example look like in Base graphics ? Here is one possibility. I’ll let you make the decision if this is easier than using lattice and if the result is as aesthetically pleasing. Base graphics are amazingly flexible but require more of a programming approach in this case than does lattice although lattice panel functions, (something I will explore in another Blog post), can also require a “programmer’s mentality”.

par(mfrow=c(1,3)) 
mysplits = split(mtcars,mtcars$cyl)
maxmpg = max(mtcars$mpg)
for (ii in 1:length(mysplits)) {
      tmpdf  <- mysplits[[ii]]
      auto <- tmpdf[tmpdf$am == 0,]
      man <- tmpdf[tmpdf$am == 1,]
      plot(tmpdf$wt, tmpdf$mpg,type="n",
           main=paste(names(mysplits[ii])," Cylinders"),
           ylim=c(0,maxmpg), xlab="wt",ylab="MPG")
      points(auto$wt,auto$mpg,col="blue",pch=19)
      points(man$wt,man$mpg,col="pink",pch=19)
      grid()
      legend("topright", inset=0.05, c("manual","auto"),
             pch = 19, col=c("pink","blue"))
}
Figure 7

Well I haven’t been completely forthright here. Base graphics does in fact have a function that is dedicated to conditioned plotting although it doesn’t generalize to other plot types. It does, however, handle X/Y plots quite well. I include it here because it does conform to the principles of conditioned plotting but if you are going to embrace the formal approach for conditioning then you might as well use lattice since it does generalize across various plot types.

coplot(mpg~wt|factor(cyl)+factor(am),data=mtcars)
Figure 8: coplot

Other Lattice Functions

Okay, you might be wondering if the arguments to xyplot will “carry over” to other lattice plotting functions. The answer is “yes” although they must make sense for the given chart type. For example with histograms or boxplots we are looking at the plot of a single continuous quantity but we can still “condition” that continuous quantity using factors from the dataset. In this next example we look at another dataset built into R called “ChickWeight”. We’ll present a histogram of the weights of 50 chickens, measured over time, as conditioned by the type of diet they were provided. There are 4 diet types hence we will see 4 panels. All of the considerations important to lattice graphics are intact – common axes, optimal use of space, shared margins, and panels to indicate conditions.

histogram(~weight|factor(Diet),data=ChickWeight,
           main="Weight by Diet Type",xlab="Weight in grams")
Figure 9: lattice histogram

So check out this example. We create a boxplot of chicken weights conditioned on the passage of time since birth, which is measured in days (0,2,…,21).

bwplot(~weight|factor(Time),data=ChickWeight,col="blue",
        main="Weight by Days Since Birth",xlab="Weight in grams")
Figure 10: lattice boxplot

Conclusions

So we’ve really only scratched the surface here of what can be done with conditioning and grouping although I feel that this is a reasonable introduction. As I’ve mentioned previously, Base graphics has lots of flexibility and if you become fluent with the high and low level functions offered as part of that environment you can accomplish a lot. Admittedly, if you do not have a programming background then writing looping structures to process data can be a bit intimidating but it’s worth it to learn. The examples I presented with lattice show how convenient it is to do conditioning and grouping, but then again if you want to do more customization then that will involve learning a bit more about lattice. So at some point you find yourself in the same position that you would be in with Base graphics. It all depends on how detailed and custom you wish your plots to be. The ggplot2 package, which I will discuss in another Blog post, can also handle these activities quite nicely though the jargon is a bit different.

Independently of what graphics package you choose to employ, it is important to realize that flexibility does not always imply ease of use. Like any software package there is absolutely no substitute for experience and getting your hands dirty. R is no exception.

Vectors are at the heart of R and represent a true convenience. Moreover, vectors are essential for good performance especially when you are working with lots of data. We’ll explore these concepts in this posting. As a motivational example let’s generate a sequence of data from -3 to 3. We’ll also use each point as input into the Normal density function (dnorm). Our immediate goal is to draw the classic bell curve that everyone knows. So here we start out with the x values:


x = seq(-3, 3, by=0.05) # The "by" argument lets us specify a "step" or "interval" value

x
  [1] -3.00 -2.95 -2.90 -2.85 -2.80 -2.75 -2.70 -2.65 -2.60 -2.55 -2.50 -2.45
 [13] -2.40 -2.35 -2.30 -2.25 -2.20 -2.15 -2.10 -2.05 -2.00 -1.95 -1.90 -1.85
 [25] -1.80 -1.75 -1.70 -1.65 -1.60 -1.55 -1.50 -1.45 -1.40 -1.35 -1.30 -1.25
 [37] -1.20 -1.15 -1.10 -1.05 -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65
 [49] -0.60 -0.55 -0.50 -0.45 -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05
 [61]  0.00  0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  0.45  0.50  0.55
 [73]  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00  1.05  1.10  1.15
 [85]  1.20  1.25  1.30  1.35  1.40  1.45  1.50  1.55  1.60  1.65  1.70  1.75
 [97]  1.80  1.85  1.90  1.95  2.00  2.05  2.10  2.15  2.20  2.25  2.30  2.35
[109]  2.40  2.45  2.50  2.55  2.60  2.65  2.70  2.75  2.80  2.85  2.90  2.95
[121]  3.00

# Next we generate the corresponding y values

y = dnorm(x)

# Now we plot the data 

par(mfrow=c(1,2))   # Set the plot region to be one row and two columns

plot(x, y, xlim=c(-3,3), ylim=c(0,0.4), main="Normal Curve", col="blue") # Plot 1

plot(x,y,xlim=c(-3,3),ylim=c(0,0.4),main="Normal Curve",col="blue",type="l") # Plot 2

Figure: Normal Curve plotted as points (left) and as a line (right)

Note that we could have generated the X values by using the rnorm function and then sorting the results. If we call the rnorm function without specifying a mean and a standard deviation we get back values from a Normal distribution of mean 0 and standard deviation 1. So we could have done the following. Note that this approach results in less “coverage” in the tails but we still get a comparable result.

xvals = rnorm(120)
par(mfrow=c(1,1))
plot(sort(xvals), y= dnorm(sort(xvals)), main="Normal Curve",col="blue")

Next let’s pull out all the x values that are less than -2 or greater than 2. We are in effect looking at the “tails” of the distribution here. Using the bracket notation with vectors is a powerful way to interrogate the data and extract only that which you need. Look how easy this is. In other languages you would have to write a for-loop that uses if statements.

tmp = x[(x < -2) | (x > 2)]

par(mfrow=c(1,2))

plot(tmp,dnorm(tmp),xlim=c(-3,3),ylim=c(0,0.4),
         main="Normal Curve - The Tails",col="blue") # Plot 1

As mentioned previously, newcomers to R often want to write a loop structure to pull out this data. You could certainly do this but it would be overlooking the power of vectors. Still, let’s see what would be involved to extract the information above using the for-loop approach.

hold = vector()
for (ii in 1:length(x)) {
    if ( (x[ii] < -2) | (x[ii] > 2) ) {
       hold[ii] = x[ii]
    }
}
# Note: unlike the bracket version, positions where the test fails are left as NA

Okay well this will work but it doesn’t generalize very well, (if at all), since we basically have to customize this loop based on what comparisons we want to do. It might be better to put this into a function although it is far better, (as we will soon see), to use the bracket notation available with vectors. Now let’s also pull out the x values that complement the above set. That is, let’s get all the x values that are greater than -2 and less than 2. We’ll plot these alongside the previous plot.

tmp2 = x[ (x > -2) & (x < 2) ]

plot(tmp2,dnorm(tmp2),xlim=c(-3,3),ylim=c(0,0.4),
     main="Normal Curve - Without Tails",col="blue") # Plot 2

Figure: Normal Curve - The Tails (left) and Normal Curve - Without Tails (right)

So let’s reconsider the example where we used the for-loop. Aside from being more complex and less general it also performs much worse than the bracket notation – especially as the size of the vector grows. Let’s regenerate our x values from -3 to 3 but this time we will pick a very small step size which will result in a vector with many more values. So we are still generating values from -3 to 3 but many more than before:

x = seq(-3,3,by=0.0001)

length(x)     # We have 60,001 elements. 
[1] 60001

par(mfrow=c(1,1))
plot(x,dnorm(x))   # Convince yourself that it still plots the Normal curve.

Now we will pull out the tail values again, except this time we will use the for-loop approach. To make it easier to time we will put it into a function:

my.func <- function(x) {
   hold = vector()
   for (ii in 1:length(x)) {
     if ( (x[ii] < -2) | (x[ii] > 2) ) {
        hold[ii] = x[ii]
     }
   }
   return(hold)
}

# Let's extract the tail values with this function and time it.

system.time( my.func(x) )   # the system.time function does what its name suggests
   user  system elapsed 
  2.120   0.287   2.406 
 
 # So is this good / bad ? Let's compare it to the bracket notation
 
system.time( x[ (x < -2) | (x > 2) ])
   user  system elapsed 
  0.003   0.001   0.003 

Wow. So the bracket notation is much faster. For vectors of smaller length the difference in time isn’t so large. But as the size of x grows so does the time it takes to process it. To drive this point home let’s look at a larger x vector and see how it performs as compared to the bracket notation.

x = seq(-3,3,by=0.00005)

length(x)
[1] 120001

system.time( my.func(x) )
   user  system elapsed 
  6.363   2.640   8.998 

system.time( x[ (x < -2) | (x > 2) ])
   user  system elapsed 
  0.006   0.000   0.006 
  
# Wow - the bracket notation really does provide much better performance. 

Let’s do some benchmarking and plot the results so we can get a visual representation of the performance of the two approaches. Check out the rbenchmark package. You will have to install it since it is an add-on package. Use RStudio or the Windows GUI to do this or you can use the install.packages command. This package contains a function called benchmark which allows you to time the execution of any R function including those that you have written.

It replicates the function a specified number of times. You can also compare the performance of more than one function at a time. Let’s start out simple. Here we will time how long it takes to square the x vector. As mentioned this will be replicated some number of times with the default being 100. We will get back a variety of information.

x = seq(-3,3,by=0.0001)

library(rbenchmark)   # load the package so that the benchmark function is available

benchmark( square = x^2)
    test replications elapsed relative user.self sys.self user.child sys.child
1 square          100   0.018        1     0.015    0.002          0         0
 
# We can compare how long it takes to square it vs cubing it. We can also specify the
# information to be returned

benchmark( square = x^2, cube = x^3, replications=5, columns=c("test","elapsed"))
    test elapsed
2   cube   0.006
1 square   0.001

First we’ll generate four different versions of the x vector corresponding to different step sizes.

hold = lapply(c(0.01, 0.001, 0.0001, 0.00005), function(x) {
   return(seq(-3,3,by=x))
})

length(hold)  # Hold is a list with 4 elements each of which contains a vector
[1] 4

sapply(hold,length)
[1]    601   6001  60001 120001

Okay, second, we’ll use this new hold list as input into the benchmarking function. To make this convenient we’ll use the lapply command, and we’ll also create a helper function to streamline the timing process.

timing.func <- function(x) {
  benchmark( replications=5, func = my.func(x), 
             bracket = x[ (x < -2) | (x > 2)], columns=c("test","elapsed") ) 
}

( timings = lapply(hold, timing.func)  )
[[1]]
     test elapsed
2 bracket   0.001
1    func   0.008

[[2]]
     test elapsed
2 bracket   0.002
1    func   0.168

[[3]]
     test elapsed
2 bracket   0.018
1    func  10.757

[[4]]
     test elapsed
2 bracket   0.036
1    func  60.141

# Let's pull out the timings for each approach (the function approach or bracket approach)

( mtimings = sapply(timings, function(x) x$elapsed) )

      [,1]  [,2]   [,3]   [,4]
[1,] 0.001 0.002  0.018  0.036
[2,] 0.008 0.168 10.757 60.141

# Let's plot the timings

plot(1:4,mtimings[2,],pch=18,col="red",type="l",xaxt="n",
         main="Timings of bracket notation vs for loop",
         xlab="Vector size in elements", ylab="Timings in Seconds")

axis(1,1:4,labels=as.character(sapply(hold,length)))

points(1:4,mtimings[1,],type="l",col="blue")

legend(1,40,c("for-loop","bracket"),col=c("red","blue"),lty=c(1,1))

Figure: Timings of bracket notation vs for loop

Well this shows that the bracket notation works well even as the x vector gets large. The for-loop, however, starts to perform very poorly and will soon be impractical for larger vectors.

Writing loops in C++, Java, C, and FORTRAN is pretty fast which is why people coming to R from those languages tend to want to write loops to process data. Users of MATLAB, however, are used to the idea of vectors, (although C++ supports vectors), and make the transition to R quite nicely. Also keep in mind that matrices, (which are closely related to vectors), can also be processed using for-loops but, as with vectors, it is not recommended. There are plenty of functions that are “vectorized” and customized for use with vectors and matrices. This is why we can use commands like apply to help process data in R.
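As a tiny illustration of this last point, compare summing the rows of a matrix with apply against the dedicated vectorized function rowSums:

m <- matrix(rnorm(1000000), ncol = 100)   # a 10,000 x 100 matrix

system.time( apply(m, 1, sum) )   # loop-like row sums via apply
system.time( rowSums(m) )         # vectorized row sums - typically much faster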

Welcome to Part 2 of the GeoCoding, R, and the Rolling Stones blog. Let’s apply some of the things we learned in Part 1 to a practical real world example.

Mapping the Stones – A Real Example

The Rolling Stones have toured for many years. You can go to Wikipedia and see information on the various tours. Here we focus only on the dates and concerts for the 1975 “Tour of the Americas”. I’ve scraped off the information from the Wikipedia page and put it into a data frame. The idea here is that we will GeoCode each city and obtain a latitude and longitude and then use it to create an interactive map of the tour using the Google Charting Tools.

If you want your own copy of this data frame then do the following:

url = "http://steviep42.bitbucket.org/data/stones75.csv"
stones75 = read.csv(url)

Here are the first 10 rows of the data frame. The format is really simple:

head(stones75,10)
           Date        City         State                Venue
1   1 June 1975 Baton Rouge     Louisiana  LSU Assembly Center
2   3 June 1975 San Antonio         Texas    Convention Center
3   4 June 1975 San Antonio         Texas    Convention Center
4   6 June 1975 Kansas City      Missouri    Arrowhead Stadium
5   8 June 1975   Milwaukee     Wisconsin       County Stadium
6   9 June 1975  Saint Paul     Minnesota         Civic Center
7  11 June 1975      Boston Massachusetts        Boston Garden
8  14 June 1975   Cleveland          Ohio    Municipal Stadium
9  15 June 1975     Buffalo      New York  Memorial Auditorium
10 17 June 1975     Toronto       Ontario   Maple Leaf Gardens

Okay let’s process the cities. Like before we’ll use the sapply command to get back the data after which we’ll use cbind to attach the results to the data frame. We might get some warnings about row names when we do this but don’t worry about it. After all “you can’t always get what you want”.

hold = data.frame(t(sapply(paste(stones75$City,stones75$State,sep=","),myGeo)))
stones75 = cbind(stones75,hold)
 
head(stones75,10)

           Date        City         State                Venue  lat   lon
1   1 June 1975 Baton Rouge     Louisiana  LSU Assembly Center 30.5 -91.1
2   3 June 1975 San Antonio         Texas    Convention Center 29.4 -98.5
3   4 June 1975 San Antonio         Texas    Convention Center 29.4 -98.5
4   6 June 1975 Kansas City      Missouri    Arrowhead Stadium 39.1 -94.6
5   8 June 1975   Milwaukee     Wisconsin       County Stadium 43.0 -87.9
6   9 June 1975  Saint Paul     Minnesota         Civic Center 45.0 -93.1
7  11 June 1975      Boston Massachusetts        Boston Garden 42.4 -71.1
8  14 June 1975   Cleveland          Ohio    Municipal Stadium 41.5 -81.7
9  15 June 1975     Buffalo      New York  Memorial Auditorium 42.9 -78.9
10 17 June 1975     Toronto       Ontario   Maple Leaf Gardens 43.7 -79.4

Great ! So now we have the lat and lon for each city. As you might notice in the data frame the Stones played several nights in the same city so we should probably keep track of this.


stones75[9:18,]

           Date          City        State                 Venue
9  15 June 1975       Buffalo     New York   Memorial Auditorium
10 17 June 1975       Toronto      Ontario    Maple Leaf Gardens
11 22 June 1975 New York City     New York Madison Square Garden
12 23 June 1975 New York City     New York Madison Square Garden
13 24 June 1975 New York City     New York Madison Square Garden
14 25 June 1975 New York City     New York Madison Square Garden
15 26 June 1975 New York City     New York Madison Square Garden
16 27 June 1975 New York City     New York Madison Square Garden
17 29 June 1975  Philadelphia Pennsylvania          The Spectrum
18  1 July 1975         Largo     Maryland        Capital Center

As you can see above, they made a six night stand at the famous Madison Square Garden arena in New York City. Our programming should check for duplicate city names before we bug Google to get information that we already have. But that is left as an assignment for you.
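If you want a head start on that assignment, here is one possible approach (just a sketch): geocode each unique City/State pair once and then look up each tour date’s coordinates from that smaller table.

# geocode each unique City/State pair only once
locs <- unique(paste(stones75$City, stones75$State, sep=","))
geo  <- t(sapply(locs, myGeo))        # one Google request per unique city

# then match every tour date back to its geocoded city - no repeat requests
idx    <- match(paste(stones75$City, stones75$State, sep=","), locs)
latlon <- geo[idx, ]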

Creating a Map of the Tour Using googleVis

Anyway let’s now build a map of the tour dates. For this example we will use a package called “googleVis”. You might not know that Google has a number of mapping services for which R APIs exist. Look at the table at the end of this section, which lists existing packages for interfacing programmatically with the various Google mapping and chart services. You can find these packages on CRAN. In our case we’ll need to install googleVis. After that we can create a map.

install.packages("googleVis",dependencies=TRUE)
library(googleVis)

The cool thing about the googleVis package is that we get back a map in a web browser that has scroll bars and zoom tools. Additionally we can use information from the data frame to annotate the chart we plan to create. So, for example, for each tour stop that the band made we can put in meta info like the name of the venue they played as well as the date.

We have to do this in a way that accommodates the requirements of googleVis. This means we have to read through the googleVis manual pages and play around with the examples. However, hopefully I’m presenting a pretty good example here so you don’t have to immerse yourself in the manual (at least not yet).

The first thing we need to do is to create a single column for the Latitude and Longitude because googleVis wants this. This is easy to do. Let’s take the existing stones75 data frame and change it:

head(stones75)

         Date        City     State                Venue  lat   lon
1 1 June 1975 Baton Rouge Louisiana  LSU Assembly Center 30.5 -91.1
2 3 June 1975 San Antonio     Texas    Convention Center 29.4 -98.5
3 4 June 1975 San Antonio     Texas    Convention Center 29.4 -98.5
4 6 June 1975 Kansas City  Missouri    Arrowhead Stadium 39.1 -94.6
5 8 June 1975   Milwaukee Wisconsin       County Stadium 43.0 -87.9
6 9 June 1975  Saint Paul Minnesota         Civic Center 45.0 -93.1

stones75$LatLon = paste(round(stones75$lat,1),round(stones75$lon,1),sep=":")
stones75 = stones75[,-5:-6]  # Remove the old lat and lon columns

head(stones75)
         Date        City     State                Venue     LatLon
1 1 June 1975 Baton Rouge Louisiana  LSU Assembly Center 30.5:-91.1
2 3 June 1975 San Antonio     Texas    Convention Center 29.4:-98.5
3 4 June 1975 San Antonio     Texas    Convention Center 29.4:-98.5
4 6 June 1975 Kansas City  Missouri    Arrowhead Stadium 39.1:-94.6
5 8 June 1975   Milwaukee Wisconsin       County Stadium   43:-87.9
6 9 June 1975  Saint Paul Minnesota         Civic Center   45:-93.1

Next up we can create a column in our data frame that contains all the information we want to use to annotate each concert date. This can include HTML tags to better format the output. As an example the statement below creates a new column in the data frame called “Tip”, that has the following info: the Stop number on the tour, the Venue where it was held, and the Date of the concert. Once we have a map we can click on the “pin” for each location and see the annotation info.

stones75$Tip = paste(rownames(stones75),stones75$Venue,stones75$Date,"<BR>",sep=" ")

# Now we can create a chart !  

# Click on the Atlanta locator and you'll see that it was the 37th stop of the tour. 
# The show took place at The Omni on July 30th, 1975

stones.plot = gvisMap(stones75,"LatLon","Tip")
plot(stones.plot)

Figure: Google map of the 1975 tour stops

Refining the Plot Annotations

Pretty cool huh ? We can also zoom in on different parts of the map. The gvisMap function has a number of options that would allow us to draw a line between the cities, select a different type of map, and adopt certain zoom levels by default.
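For example, something along these lines will connect the stops with a line and switch to a terrain map (treat the option names as a sketch and double check them against the gvisMap help page):

stones.plot2 = gvisMap(stones75, "LatLon", "Tip",
                       options = list(showTip = TRUE, showLine = TRUE,
                                      mapType = "terrain",
                                      enableScrollWheel = TRUE,
                                      useMapTypeControl = TRUE))
plot(stones.plot2)

So what else could / should we do ?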

Well we have a problem here in that the Stones played more than one show in several cities but we don’t take that into account when we are building the annotation data. What we might want to do is to process the data frame and, for those cities that had multiple shows, (e.g. New York), we can capture all the meta data in one go. We saw this before with the New York dates.

stones75[9:18,]

           Date          City        State                 Venue
9  15 June 1975       Buffalo     New York   Memorial Auditorium
10 17 June 1975       Toronto      Ontario    Maple Leaf Gardens
11 22 June 1975 New York City     New York Madison Square Garden
12 23 June 1975 New York City     New York Madison Square Garden
13 24 June 1975 New York City     New York Madison Square Garden
14 25 June 1975 New York City     New York Madison Square Garden
15 26 June 1975 New York City     New York Madison Square Garden
16 27 June 1975 New York City     New York Madison Square Garden
17 29 June 1975  Philadelphia Pennsylvania          The Spectrum
18  1 July 1975         Largo     Maryland        Capital Center

Currently our plot has only the last New York show information. But we want to have the info for all NYC shows. Here is one way to approach this problem. Note that there are probably more elegant ways to clean up the data but this will do the job for now.

test = stones75     # Create some temporary work variables
str=""
tmpdf = list()
ii = 1

repeat {                  # Loop through the copy of the stones75 data frame 

   hold = test[test$Venue == test[ii,4],"Tip"]
   
   # Do we have a multi-city stand ?
   
      if (length(hold) > 1) {   
        str = paste(hold,collapse="")
        test[ii,6] = str
        tmpdf[[ii]] = test[ii,]
        str=""
        
    # We "jump over" cities that we've already processed 
   
        ii = ii + length(hold)  
     
    # Here we process the "one night stands"        
    
   } else {                       
        tmpdf[[ii]] = test[ii,]
        ii = ii + 1
   }
if (ii > nrow(test)) break      # stop once we pass the last of the 42 tour dates
}

tmpdf = tmpdf[!sapply(tmpdf,is.null)]    # Remove NULL list elements
stones = do.call(rbind,tmpdf)                # Bind the list back into a data frame

stones.plot = gvisMap(stones,"LatLon","Tip")
plot(stones.plot)

Figure: Google map with consolidated annotations for the multi-night stands

Okay. Now depending on your background in R you might think that was a lot of work, (or maybe not). In either case this is fairly typical of what we have to do to clean up and/or consolidate data to get it into a format that is suitable for use with the package we are using. Don’t think that this type of effort is peculiar to googleVis because other packages would require a comparable level of processing also. Welcome to the real world of data manipulation.

Anyway let’s take a look at the new plot. At first cut it seems just like the old one but click on the New York locator and you will now see that the info for all of the Madison Square Garden shows is present. Shows 11 through 16 took place in NYC.

R packages for interfacing with Google

Here is a table that lists the other R packages that exist to interface with various Google services. Each one of these is worth investigation. Keep in mind that similar accommodations exist for other languages so if you prefer to do your coding in Perl or Python then you could work with the Google APIs also.

PACKAGE          DESCRIPTION
googleVis        Create web pages with interactive charts based on R data frames
plotGoogleMaps   Plot HTML output with Google Maps API and your own data
RgoogleMaps      Overlays on Google map tiles in R
animation        A Gallery of Animations in Statistics and Utilities
gridSVG          Export grid graphics as SVG
SVGAnnotation    Tools for Post-Processing SVG Plots Created in R
RSVGTipsDevice   An R SVG graphics device with dynamic tips and hyperlink
iWebPlots        Interactive web-based plots

In this article I discuss a general approach for Geocoding a location from within R, processing XML reports, and using R packages to create interactive maps. There are various ways to accomplish this, though using Google’s GeoCoding service is a good place to start. We’ll also talk a bit about the XML package that is a very useful tool for parsing reports returned from Google. XML is a powerful markup language that has wide support in many Internet databases so it is helpful. This post sets us up for Part II wherein we’ll use our knowledge to create a map of the tour dates on the Rolling Stones 1975 Tour of the Americas. Also, when I use the word “GeoCoding” this basically implies the process of taking a geographic location and turning it into a latitude / longitude pair.

What does Google Offer ?

Check out the main Geocoding page, which presents implementation details of the API as well as use policies and limitations.


https://developers.google.com/maps/documentation/geocoding/

As an example let’s find out what the latitude and longitude are for “Atlanta, GA”. Actually we can get more specific than this by specifying an address or a zip code but let’s keep it simple at first. The Google API, (Application Programming Interface), is very forgiving and will work with almost any valid geographic information that we provide. We could enter just a zip code, a complete address, or an international location and Google will happily process it. Back to our example, according to the Google specs we would create a URL that looks like:


http://maps.googleapis.com/maps/api/geocode/xml?address=Atlanta,GA&sensor=false

Notice that I have placed Atlanta, GA into the URL after the “xml?address” tag. According to Google specs we must also append the string “sensor=false”. This tells Google that the query is not coming from a device but a real person (us) or perhaps a program. If you paste the above URL into a web browser you will get something back that looks very much like the following.

<GeocodeResponse>
   <status>OK</status>
   <result>
      <type>locality</type>
      <type>political</type>
      <formatted_address>Atlanta, GA, USA</formatted_address>
      ..
      ..
      <geometry>
         <location>
            <lat>33.7489954</lat>
            <lng>-84.3879824</lng>
         </location>
      ..
      ..
      </geometry>
    </result>
</GeocodeResponse>

This is what is known as an XML document (eXtensible Markup Language). It might look scary at first but in reality this report contains a lot of helpful information. Note that there are “tags” that describe the content as it is being returned. That is, what we get back is a document that basically describes itself. This is a very useful attribute of XML. We can pick out only that information we want while ignoring the rest.

In this case all we care about are the ‘lat’ and ‘lng’ tags. We can scan through the document and see that the latitude and longitude for Atlanta, GA is 33.7489954, -84.3879824. It is this self-documenting feature of XML that makes it ideal for processing by programming languages such as R, (of course), Perl, Python, Java, and others.

But would you want to visually accomplish this for 10, 100, or 1,000 locations ? I wouldn’t even want to do it for 2 ! So this is when we put the power of R to work. Let’s start up R and install some packages that will help us “talk” to the Google API.

install.packages("XML",dependencies=T)
library(XML)

install.packages("RCurl",dependencies=T)
library(RCurl)

Creating a URL string and Passing it to Google

We have a destination of “Atlanta, GA” that we wish to GeoCode into a latitude longitude pair. We’ll need to build a URL for eventual passing to the Google API. We can easily do this using the paste function.

google.url = "http://maps.googleapis.com/maps/api/geocode/xml?address="
query.url = paste(google.url, "Atlanta,GA", "&sensor=false", sep="")

query.url
[1] "http://maps.googleapis.com/maps/api/geocode/xml?address=Atlanta,GA&sensor=false"

Okay great. We are now ready to send this over to Google. This is easy using the getURL function that is part of the RCurl package.

txt = getURL(query.url)
xml.report = xmlTreeParse(txt,useInternalNodes=TRUE)

What we have here is a variable called “xml.report” that now contains the XML report that we saw in the previous section. If you don’t believe me then check out its contents:

xml.report

<?xml version="1.0" encoding="UTF-8"?>
<GeocodeResponse>
  <status>OK</status>
  <result>
    <type>locality</type>
    <type>political</type>
    <formatted_address>Atlanta, GA, USA</formatted_address>
    ..
    <geometry>
      <location>
        <lat>33.7489954</lat>
        <lng>-84.3879824</lng>
      </location>
      ..
      ..
      </GeocodeResponse>

Parsing the Returned XML Result

Don’t get too worried about how the xmlTreeParse function works at this point. Just use it as a “black box” that turns the returned text into an internal XML document. The next step uses the function getNodeSet to locate the latitude and longitude strings in the report. This is something that we did visually in the previous section, though to do it from within R we need to use an “XPath expression” to search through the document for the lat/lon.

place = getNodeSet(xml.report,"//GeocodeResponse/result[1]/geometry/location[1]/*")

XPath is the XML Path Language, which is a query language for selecting “nodes” from an XML document. A full discussion of it is beyond the scope of this post. At this point just think of it as a way to search an XML document and match specific lines within it. If you look at the string we pass to getNodeSet you might wonder what it means.

//GeocodeResponse/result[1]/geometry/location[1]/*

This is a string that we use to match specific tags in the XML report. We can “read” it as follows. Using xml.report we match the “GeocodeResponse” row, then the first “result” row, then the “geometry” row, and then the first “location” row. This will give us the section of the report that relates to the latitude and longitude. To verify manually, you can visually scan through the document.

<GeocodeResponse>
  <result>
        <geometry>
             <location>
                <lat>33.7489954</lat>
                <lng>-84.3879824</lng>

Notice that in the xml.report the lines are indented, which tells us that some rows are contained within sections, which in turn implies a hierarchy in the document. So when you “match” the GeocodeResponse line, this is a “node” in the document that contains other nodes. In our example the “result” node is contained within the “GeocodeResponse” node, the “geometry” node is contained within the “result” node, and the “location” node is contained within the “geometry” node.

We can then extract the values contained in the “location” node. Note that in real world applications you would first use a tool such as XMLFinder to develop the correct XPath expression after which you would implement it in R as we have done here. The Firefox browser also has a plugin that allows one to parse XML reports. Finally, we use an R statement to turn the variable place (from up above) into numeric values.

lat.lon = as.numeric(sapply(place,xmlValue))
lat.lon
[1]  33.7 -84.4
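The same two-step pattern works for any other tag you might care about. For example, here is how you could pull out the formatted address that appears in the report:

addr = getNodeSet(xml.report, "//GeocodeResponse/result[1]/formatted_address")
sapply(addr, xmlValue)
[1] "Atlanta, GA, USA"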

Write a Function to Do This Work !

So you might think that was a lot of work. Well maybe it was, though consider that we can now parse any XML document returned by the Google Geocoding service and find the lat / lon pair. Not only that, we can parse the returned XML file and extract any other information contained therein. Since we wouldn’t want to do this manually for every GeoCoding query we might have, let’s write a function in R to do this for us. Here is one way to do it:

myGeo <- function(address="Atlanta,GA") {

# Make sure we have the required libraries to do the work

  stopifnot(require(RCurl))
  stopifnot(require(XML))
  
# Remove any spaces in the address field and replace with "+"

   address = gsub(" ","\\+",address)
   
# Create the request URL according to Google specs

  google.url = "http://maps.googleapis.com/maps/api/geocode/xml?address="
  my.url = paste(google.url,address,"&sensor=false",sep="")

  Sys.sleep(0.5) # Just so we don't beat up the Google servers with rapid fire requests
  
# Send the request to Google and turn it into an internal XML document
  
  txt = getURL(my.url)
  xml.report = xmlTreeParse(txt, useInternalNodes=TRUE)

# Pull out the lat/lon pair and return it  

  place = getNodeSet(xml.report,  "//GeocodeResponse/result[1]/geometry/location[1]/*")
  lat.lon = as.numeric(sapply(place,xmlValue))
  names(lat.lon) = c("lat","lon")
  
  return(lat.lon)
}

myGeo()
 lat   lon 
 33.7 -84.4 

myGeo("Palo Alto,CA")
   lat    lon 
  37.4 -122.1 
  
myGeo("Paris,FR")
  lat   lon 
48.86  2.35 

Now. We’ve actually got more capability than we think we do. Remember that the Google API will accept almost any reasonable geographic information. So check out the following. With just a few lines of code we have a pretty powerful way to GeoCode generally specified addresses without having to do lots of preparation and string processing within our R function.

myGeo("1600 Pennsylvania Avenue, Washington, DC")
  lat   lon 
 38.9 -77.0 

myGeo("Champs-de-Mars, Paris 75007")  # Address of the Eiffel Tower
lat  lon 
48.9  2.3 

Using the myGeo function in a Real Example

Let’s see how we might use this in a real situation. Here is a data frame called geog, which contains columns of city and state names.

geog
            city state
1       Glendale    CA
2      DesMoines    IA
3    Albuquerque    NM
4           Waco    TX
5       Honolulu    HI
6  Indianaopolis    IN
7     Pittsburgh    PA
8     Clearwater    FL
9      Sunnyvale    CA
10    Bridgeport    CT

Now. Look how easy it is to process all of these rows:

t(sapply(paste(geog$city,geog$state),myGeo))

                  lat    lon
Glendale CA      34.1 -118.3
DesMoines IA     41.6  -93.6
Albuquerque NM   35.1 -106.6
Waco TX          31.5  -97.1
Honolulu HI      21.3 -157.9
Indianaopolis IN 39.8  -86.2
Pittsburgh PA    40.4  -80.0
Clearwater FL    28.0  -82.8
Sunnyvale CA     37.4 -122.0
Bridgeport CT    41.2  -73.2

Our function should also work with a vector of zip codes.

myzipcodes
 [1] "90039" "50301" "87101" "76701" "96801" "46201" "15122" "33755" "94085" "06601"
 
t(sapply(myzipcodes,myGeo))

       lat    lon
90039 34.1 -118.3
50301 41.6  -93.6
87101 35.1 -106.6
76701 31.6  -97.1
96801 21.3 -157.9
46201 39.8  -86.1
15122 40.4  -79.9
33755 28.0  -82.8
94085 37.4 -122.0
06601 41.2  -73.2

Okay please check out Part 2 of this post to see how we process the tour data from the 1975 Rolling Stones “Tour of the Americas”.

Hello. Welcome to my debut post ! Check the About link to see what this Blog intends to accomplish. In this article I discuss a general approach for dealing with the problem of splitting a data frame based on a grouping variable and then doing some more operations per group. A secondary goal is to provide some exposure to the “apply” family of functions to demonstrate how they can help. For purposes of example R has a built-in data frame called “ChickWeight” that contains 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks. To get more details simply type:

data(ChickWeight)
?ChickWeight
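
Before going further, a quick peek at the structure shows which columns are measurements and which are factors:

str(ChickWeight)    # 578 obs. of 4 variables: weight and Time are numeric,
                    # Chick and Diet are factors
head(ChickWeight)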

In general when looking at data frames we should be looking for the continuous variables that we want to summarize in terms of some factor or grouping variable. Grouping variables can usually be easily identified because they typically take on a fixed number of values. (Note that R has commands that let us “cut” continuous variables into discrete intervals but that is for another blog post). So let’s use the sapply function to determine what the grouping variables might be:

sapply(ChickWeight, function(x) length(unique(x)))
weight   Time  Chick   Diet 
   212     12     50      4 

If you don’t know about sapply then don’t freak out. It takes the function we define in the second argument and “applies” it to the columns of the data frame (ChickWeight) whose name we provided as the first argument. The “x” in the function is a placeholder for each column of the data frame. The unique function returns only the distinct values that a column takes on, and the length function then “counts” how many there are.
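
In other words, the one-liner above does the equivalent of running our little function on each column in turn and collecting the answers:

length(unique(ChickWeight$weight))   # 212
length(unique(ChickWeight$Time))     # 12
length(unique(ChickWeight$Chick))    # 50
length(unique(ChickWeight$Diet))     # 4

sapply then simplifies those four results into the named vector you saw above.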

This can be confusing, I know. However, the important thing is that we can easily observe that Time and Diet look like potential grouping variables since they take on 12 and 4 unique values, respectively. If we just wanted to know the mean weight of the chickens for each Diet type then we could use the tapply command to get that answer.

tapply(ChickWeight$weight,ChickWeight$Diet,mean)
  1   2   3   4 
103 123 143 135 

Note that we could also write our own function in place of “mean” in case we wanted to do something more in-depth in terms of summary (there is a quick sketch of that just below). But maybe we don’t yet know what it is we want to do with the data, so having it in a grouped format might help us better understand it. Let’s use the split command, which can give us access to the individual groups.
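
Here is that sketch. The choice of mean and standard deviation is just mine for illustration; run it to see the printout.

tapply(ChickWeight$weight, ChickWeight$Diet, function(x) c(mean = mean(x), sd = sd(x)))

Because the function returns more than one number per group, tapply hands the results back as a list with one element per Diet level, which is yet another place where lists show up.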

my.splits = split(ChickWeight, ChickWeight$Diet)
length(my.splits)
[1] 4

names(my.splits)
[1] "1" "2" "3" "4"

This operation creates a list in which each element of my.splits corresponds to one of the individual Diet values (1, 2, 3, or 4). We can now start thinking about how to further investigate each Diet type. To convince yourself that the split command actually worked, let’s take a peek at the first element. All records in the first element relate to Diet #1.

head(my.splits[[1]])
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

Just for fun let’s use the lapply command to subset each element of my.splits and obtain, for each group, all the chicks that weigh 40 grams or less. This approach doesn’t take much typing and will conveniently return a list, which we store in a variable called my.results.lapply.

my.results.lapply = lapply(my.splits, subset, weight <= 40)

In the example above we pass additional arguments to the subset command simply by listing them after the function name in the lapply call. This is the most “R-like” way of doing things. Alternatively, we could have defined an anonymous function to do the same thing.

my.results.lapply = lapply(my.splits, function(x) subset(x, weight <= 40) )

Note that we can also define our function in advance. This isn’t substantially different from the example above but it might improve the readability of the code. This is a personal choice and either approach will yield the same result.

my.func <- function(x) {
    subset(x, weight <= 40)
}
my.results.lapply = lapply(my.splits, my.func)

In any case, check out what is happening here. lapply silently passes each element of my.splits, one at a time, to the function given as its second argument. This can be confusing to newcomers but is actually a convenience, since it accomplishes a for-loop “under the hood”. To make the looping more obvious we could also have written the following, which works through the my.splits list by subscript, much like a for-loop would.

my.results.lapply = lapply(1:length(my.splits), function(x) subset(my.splits[[x]], weight <= 40))

If we are done with our sub-setting and interrogation then we can repackage the results list back into a data frame by using this construct:

my.df = do.call(rbind, my.results.lapply)
my.df
    weight Time Chick Diet
13      40    0     2    1
26      39    2     3    1
195     39    0    18    1
196     35    2    18    1
221     40    0    21    2
269     40    0    25    2
293     39    0    27    2
305     39    0    28    2
317     39    0    29    2
365     39    0    33    3
401     39    0    36    3
519     40    0    46    4
543     39    0    48    4
555     40    0    49    4
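
If the do.call construct looks mysterious, it simply calls rbind with the elements of the list supplied as its arguments, so (row-name details aside) the line above is essentially equivalent to writing out:

my.df = rbind(my.results.lapply[[1]], my.results.lapply[[2]],
              my.results.lapply[[3]], my.results.lapply[[4]])

which obviously would not scale well if we had many more groups.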

So now we have a single data frame with only the data that we want. If you are coming from another programming language then you might be tempted to write a for-loop to do this. You could, though you would have to do a little more work to keep track of the results, starting with creating your own blank list to stash them in:

my.results.for = list() 
for (ii in 1:length(my.splits)) {
     my.results.for[[ii]] = subset(my.splits[[ii]], weight <= 40)
}

names(my.results.for) = names(my.splits)   # carry over the Diet names so the two lists match

all.equal(my.results.lapply, my.results.for) # Should return TRUE

So which approach do you use ? It depends. The for-loop is the more traditional route. The lapply version takes much less typing and can sometimes perform better, but you have to package your logic as a function and understand that lapply is “silently” passing each element of the list to that function. Once people become accustomed to using lapply they usually stick with it.
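
If you are curious about the performance claim you can time the two approaches yourself. Here is a rough, unscientific comparison (repeating the work a couple of thousand times just to get measurable numbers):

system.time(for (i in 1:2000) lapply(my.splits, subset, weight <= 40))

system.time(for (i in 1:2000) {
    tmp <- list()
    for (ii in 1:length(my.splits)) {
        tmp[[ii]] <- subset(my.splits[[ii]], weight <= 40)
    }
})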

Now not to blow your mind but we could knock out the problem in one go:

lapply(split(ChickWeight,ChickWeight$Diet), subset, weight <= 40)

# OR

(do.call(rbind, lapply(split(ChickWeight,ChickWeight$Diet), subset, weight <= 40)))

      weight Time Chick Diet
1.13      40    0     2    1
1.26      39    2     3    1
1.195     39    0    18    1
1.196     35    2    18    1
2.221     40    0    21    2
2.269     40    0    25    2
2.293     39    0    27    2
2.305     39    0    28    2
2.317     39    0    29    2
3.365     39    0    33    3
3.401     39    0    36    3
4.519     40    0    46    4
4.543     39    0    48    4
4.555     40    0    49    4

Lastly, it occurred to me after writing much of this post that, in this admittedly somewhat contrived example, there is yet another way to get the same result. That kind of realization is fairly common in R, especially when you read code written by others. It is the blessing (and curse) of R: there are many ways to do the same thing, and there are frequently several functions that appear to do very much the same thing (or could be made to). Anyway, for our problem we could first sort the ChickWeight data frame by Diet and then do the subset. The sort arranges the data by group, which means the subsequent subset operation preserves that grouping.

sorted.chickens = ChickWeight[order(ChickWeight$Diet),]
(sorted.chickens = subset(sorted.chickens, weight <= 40))

    weight Time Chick Diet
13      40    0     2    1
26      39    2     3    1
195     39    0    18    1
196     35    2    18    1
221     40    0    21    2
269     40    0    25    2
293     39    0    27    2
305     39    0    28    2
317     39    0    29    2
365     39    0    33    3
401     39    0    36    3
519     40    0    46    4
543     39    0    48    4
555     40    0    49    4