Next up in our review of the family of apply commands, we’ll look at the lapply function, which can be used to loop over the elements of a list (or a vector). This is a true convenience, although if you have experience in other programming languages it can seem unnecessary since you are accustomed to writing your own loops. Rest assured you can take that approach in R, but once you understand lists and lapply you will appreciate what it can do for you. This leads me to what I feel is an important observation: most misunderstandings of the lapply command result from a limited or incomplete knowledge of the list structure in R. As long as you know what lists are and how to manipulate them effectively, lapply becomes much easier to use. So we’ll begin with a brief yet important tutorial on lists and how their elements can be referenced.

The list structure represents R’s default data structure for containing heterogeneous information. Recall that vectors and matrices can accommodate only one data type at a time. That is, you cannot have a vector or matrix with mixed data types. You can try to put numeric and character data into the same vector but R will convert it all to a single type. Don’t believe me? Try it. R doesn’t complain but it will make everything a character.

    somevec <- c(1,4,5,"4","5")
    somevec
    [1] "1" "4" "5" "4" "5"
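For a slightly fuller picture of that coercion, here is a small sketch. The variable names are mine and purely illustrative. The idea is that R promotes everything in a vector to the most flexible type present (logical, then numeric, then character), whereas a list leaves each element alone.

```r
# Hypothetical vectors, only here to illustrate coercion
lgl_num <- c(TRUE, FALSE, 3)      # logical mixed with numeric
num_chr <- c(1, 4, 5, "4", "5")   # numeric mixed with character

typeof(lgl_num)   # "double": TRUE and FALSE became 1 and 0
typeof(num_chr)   # "character": every number became a string

# A list keeps each element's own type
mixed <- list(1, "4", TRUE)
sapply(mixed, typeof)   # "double" "character" "logical"
```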

So from a purely practical point of view some data structure must exist in R to accommodate mixed data. That is what the list structure is for. Where do lists show up in R? All over the place, it turns out. Many of the interesting statistical functions in R, as well as many of the add-on packages available on CRAN, return information in the form of lists.

    # Let's do some regression using the mtcars data frame
    mylm <- lm(mpg~wt, data = mtcars)

    # What type of structure do we get back? A list with 12 sub elements
    str(mylm,max.level=0)
    List of 12
     - attr(*, "class")= chr "lm"

Now isn’t that interesting? As you might know, judging the quality of the regression process can be quite involved. R knows this so it returns lots of information encapsulated within a list to help you assess the model. The 12 elements include numeric vectors, nested lists, a call object, and a data frame. So if you are writing your own function and need to return diverse types of data then the list structure is for you! But that’s all a bit too complex for the moment so let’s return to some basics. To motivate things I’ll present some variables here that look like they relate to characteristics of a family. We have a surname, the number of children in the family, their respective ages, and whether or not the children have had measles.

    surname <- "Jones"
    numofchild <- 2
    ages <- c(5,7)
    measles <- c("Y","N")

We could work with this information as individual variables though it might be useful to assemble it all into a single structure. So I will create a “named” list to contain this information. It is called a “named” list because each element receives a name as the list is being created. Named lists are easier for humans to manipulate and interrogate because we can refer to elements by name using the “$” notation which I will introduce momentarily.

    family1 <- list(name="Jones",numofchild=2,ages=c(5,7),measles=c("Y","N"))
    str(family1)
    List of 4
     $ name      : chr "Jones"
     $ numofchild: num 2
     $ ages      : num [1:2] 5 7
     $ measles   : chr [1:2] "Y" "N"

Ah, so we see that the elements “name” and “measles” are character vectors, element “numofchild” is a single-valued vector, and the “ages” element is a multi-valued vector. This shows that we can host data of differing types within a single data structure. Now how do we address these elements and retrieve their values? Can we use numeric indexing as with a vector? Can we use the names we created for each element? The answer in both cases is “yes”. If we have a list with named elements then we can use the “$” notation.

    family1$ages
    [1] 5 7

    family1$measles
    [1] "Y" "N"

    # We can also reference by numeric position which is more useful if you are
    # writing your own loop structures but it is less intuitive
    family1[2]
    $numofchild
    [1] 2

    family1[[2]]
    [1] 2

Hmm. What’s up with the double bracket vs the single bracket? Well, the way I think about it is that if you use the single bracket (as you would if this were a vector), you get back the name of the element as well as its value. Strictly speaking, what comes back is a smaller list that contains just that element. While this is useful, it is usually more interesting to get the actual value(s) of the element which (if you don’t use the element name) requires use of the double brackets. Think of the double brackets as being more specific than the single brackets. Now even if you use the $ notation you can still address individual values of a given list element. So here I’ll start with pulling out the age of the first child only.

    family1$ages[1]
    [1] 5

    # We could pull out both ages using this approach
    family1$ages[1:2]
    [1] 5 7

    # But this is the same as this:
    family1$ages
    [1] 5 7

    # Which is the same as this:
    family1[[3]]
    [1] 5 7
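If the single versus double bracket distinction still feels fuzzy, the following sketch may help. It rebuilds the same family1 list and checks what each form returns; the names one_bracket and two_bracket are just illustrative.

```r
family1 <- list(name = "Jones", numofchild = 2, ages = c(5, 7), measles = c("Y", "N"))

one_bracket <- family1[2]     # still a list, of length one
two_bracket <- family1[[2]]   # the element itself

class(one_bracket)   # "list"
class(two_bracket)   # "numeric"

# Arithmetic works on the extracted element but not on the one-element list
two_bracket + 1      # 3
# one_bracket + 1    # would stop with an error (non-numeric argument)

# Double brackets also accept a name, which matches the $ notation
identical(family1[["ages"]], family1$ages)   # TRUE
```

In practice this is why double brackets (or $) are what you want when you intend to compute on the contents, while single brackets are handy for carving a list down to a smaller list.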

The way I would summarize the above information is that if you have a named list then you can use the “$” notation for the most part. If you want to address specific values within a multi-valued element then you will have to use the bracket notation in addition to the “$” notation. If you have an unnamed list then you must use the bracket notation exclusively since there are no names available. Unnamed lists result when no effort is made to name the elements, as in the following example. I can always apply names later if I wish.

    family1 <- list("Jones",2,c(5,7),c("Y","N"))

    # So when we print the list results we see only brackets - no names.
    family1
    [[1]]
    [1] "Jones"

    [[2]]
    [1] 2

    [[3]]
    [1] 5 7

    [[4]]
    [1] "Y" "N"

    # The cool thing is that we can make names for our elements even after we have
    # created the list
    names(family1) <- c("name","numofchild","ages","measles")
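As a quick sanity check on naming after the fact, here is a minimal sketch. I use a throwaway list called fam (a hypothetical name) so that family1 is left untouched.

```r
fam <- list("Jones", 2, c(5, 7), c("Y", "N"))   # unnamed list

is.null(names(fam))   # TRUE: there are no names yet
fam[[3]]              # positional access is the only option for now

names(fam) <- c("name", "numofchild", "ages", "measles")

fam$ages              # the $ notation works once names exist
fam[[3]]              # and positions keep working as before
```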

### Introducing lapply

Admittedly this family1 list is a little basic but the above examples prove that there is flexibility in how you can address elements and values. So let’s present an example of the lapply function. I’ll use it to apply the built in “**typeof**” function to each element in the list.

    lapply(family1, typeof)
    $name
    [1] "character"

    $numofchild
    [1] "double"

    $ages
    [1] "double"

    $measles
    [1] "character"

    # Using lapply will return a list
    str(lapply(family1, typeof))
    List of 4
     $ name      : chr "character"
     $ numofchild: chr "double"
     $ ages      : chr "double"
     $ measles   : chr "character"

So each element of family1 is “silently” passed to the typeof function and the results for all elements are returned in a list. In fact the “l” in lapply comes from the fact that it will return to you a list. To make sure you understand what is happening here let me introduce a small variation to the example.

    lapply(family1, typeof)

    # is the same as

    lapply(family1, function(x) typeof(x))

The second version does exactly the same thing as the first but illustrates two important facts: 1) I can pass an “anonymous” function to lapply. (If you read my blog on apply I discussed the importance and use of “anonymous” functions). 2) The variable “x” referenced in the anonymous function definition represents a placeholder for each element of the family1 list. If you need more help understanding this then look at this example where I use a function that simply returns each element of the list and its associated value.

    lapply(family1,function(x) x)
    $name
    [1] "Jones"

    $numofchild
    [1] 2

    $ages
    [1] 5 7

    $measles
    [1] "Y" "N"

    # But this is the same as simply typing the name of the list at the prompt
    family1
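To make the placeholder idea concrete, here is roughly what lapply does for us, written out as an explicit loop. This is only a sketch of the behavior, not how lapply is implemented internally, and the name result is my own.

```r
family1 <- list(name = "Jones", numofchild = 2, ages = c(5, 7), measles = c("Y", "N"))

# Pre-allocate a list of the right length and carry the names over
result <- vector("list", length(family1))
names(result) <- names(family1)

for (i in seq_along(family1)) {
  x <- family1[[i]]          # x plays the role of the placeholder
  result[[i]] <- typeof(x)   # apply the function and store the answer
}

identical(result, lapply(family1, typeof))   # TRUE
```

The loop needs bookkeeping for the index, the storage, and the names. lapply handles all three, which is most of its appeal.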

Now you could have also defined a function in advance instead of using an anonymous function. That is, you are not obligated to use anonymous functions. Some people find them too error prone, confusing, and less readable. If you do then simply define your function in advance. It won’t change the result or impact performance, at least in this case.

    simplefunc <- function(x) {
      mytype <- typeof(x)
      return(mytype)
    }

    lapply(family1, simplefunc)
    $name
    [1] "character"

    $numofchild
    [1] "double"

    $ages
    [1] "double"

    $measles
    [1] "character"

Alright let’s get a little more advanced here. I’ll write a function that returns the mean value of each element. But I know what you are thinking. Not all of our list elements are numeric so to avoid an error I’ll have to insert some basic logic to test if the element is numeric. If it is numeric then I take the mean of the element value. If not then I ignore the element. I could implement this example two ways: 1) create a named function in advance which I then use in a call to lapply. 2) create an anonymous function when I call lapply. Here is the first scenario:

    # Create a function in advance.
    myfunc <- function(x) {
      if (is.numeric(x)) {
        mean(x)
      }
    }

    # For numeric elements we get a meaningful result. For other elements we
    # don't get back anything
    lapply(family1, myfunc)
    $name
    NULL

    $numofchild
    [1] 2

    $ages
    [1] 6

    $measles
    NULL

What about approach #2? This is where I define the function as I make the call to lapply. This works just as well but might be less readable to a newcomer.

    lapply(family1, function(x) {if (is.numeric(x)) mean(x)})
    $name
    NULL

    $numofchild
    [1] 2

    $ages
    [1] 6

    $measles
    NULL
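Those NULL entries are harmless, but if you would rather not carry them around, one option is to drop them afterwards. The sketch below uses base R's Filter and Negate for that; the names means and numeric_means are illustrative.

```r
family1 <- list(name = "Jones", numofchild = 2, ages = c(5, 7), measles = c("Y", "N"))

means <- lapply(family1, function(x) if (is.numeric(x)) mean(x))

# Keep only the elements that are not NULL
numeric_means <- Filter(Negate(is.null), means)

names(numeric_means)    # "numofchild" "ages"
unlist(numeric_means)   # named numeric vector: numofchild = 2, ages = 6
```

Filtering the input first, for example with Filter(is.numeric, family1), would work too and avoids needing the if test inside the function.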

### A list of lists!

Okay, let’s create some more lists that correspond to different families. If we want we can even create a master list whose elements are individual family lists. So in effect we are creating a list of lists! In this case our master list has named elements so we can easily address the contents of the sub elements.

    family2 <- list(name="Espinoza",numofchild=4,ages=c(5,7,9,11),measles=c("Y","N","Y","Y"))
    family3 <- list(name="Ginsberg",numofchild=3,ages=c(9,13,18),measles=c("Y","N","Y"))
    family4 <- list(name="Souza",numofchild=5,ages=c(3,5,7,9,11),measles=c("N","Y","Y","Y","N"))

    allfams <- list(f1=family1,f2=family2,f3=family3,f4=family4)

    str(allfams,max.level=1)
    List of 4
     $ f1:List of 4
     $ f2:List of 4
     $ f3:List of 4
     $ f4:List of 4

    allfams$f3$ages   # Get the ages of Family 3
    [1] 9 13 18

    # Same as

    allfams[[3]]$ages
    [1] 9 13 18
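Since the list is nested, the indexing rules from earlier simply chain together. The following sketch rebuilds the four families so it can run on its own and shows three equivalent ways to reach the second child's age in family 3.

```r
family1 <- list(name = "Jones",    numofchild = 2, ages = c(5, 7),           measles = c("Y", "N"))
family2 <- list(name = "Espinoza", numofchild = 4, ages = c(5, 7, 9, 11),    measles = c("Y", "N", "Y", "Y"))
family3 <- list(name = "Ginsberg", numofchild = 3, ages = c(9, 13, 18),      measles = c("Y", "N", "Y"))
family4 <- list(name = "Souza",    numofchild = 5, ages = c(3, 5, 7, 9, 11), measles = c("N", "Y", "Y", "Y", "N"))
allfams <- list(f1 = family1, f2 = family2, f3 = family3, f4 = family4)

# Three equivalent routes to the same value
allfams$f3$ages[2]             # 13, by name with $
allfams[["f3"]][["ages"]][2]   # 13, by name with double brackets
allfams[[3]][[3]][2]           # 13, purely by position

names(allfams)                 # "f1" "f2" "f3" "f4"
```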

Okay, so now what if we wanted to get the mean ages of each family’s children? How could we do this using lapply? It’s easy.

    lapply(allfams, function(x) mean(x$ages))
    $f1
    [1] 6

    $f2
    [1] 8

    $f3
    [1] 13.33333

    $f4
    [1] 7

It might be a better idea to boil these down to a single average across the families. How might we do that? It takes a little bit more work but not much. First, recognize that what we are getting back are all numeric values so we don’t really need a list to store that information. What I mean is that the only reason we use a list in the first place is to “host” data of differing types, but here our result is all numeric so let’s convert it to a vector. Keep in mind that averaging the per-family means weights every family equally, so the result matches the mean age of all the children only when the families are the same size.

    unlist(lapply(allfams, function(x) mean(x$ages)))
          f1       f2       f3       f4
     6.00000  8.00000 13.33333  7.00000

    # So check out the following. It gives us exactly what we want.
    mean(unlist(lapply(allfams, function(x) mean(x$ages))))
    [1] 8.583333

    # An "expanded" version of this might have looked like:
    mymeanages <- function(x) {
      return(mean(x$ages))
    }

    hold <- lapply(allfams, mymeanages)  # Get back a list with mean ages for each family
    hold2 <- unlist(hold)                # Turn the result into a vector since everything is a numeric value
    mean(hold2)                          # Get the mean of all ages
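If what you are after is the mean age over every child rather than the mean of the family means, one way to get it is to pull out the raw ages with lapply and pool them before averaging. This is a sketch that rebuilds the families so it runs on its own; all_ages and mean_of_means are names I made up for it.

```r
family1 <- list(name = "Jones",    numofchild = 2, ages = c(5, 7))
family2 <- list(name = "Espinoza", numofchild = 4, ages = c(5, 7, 9, 11))
family3 <- list(name = "Ginsberg", numofchild = 3, ages = c(9, 13, 18))
family4 <- list(name = "Souza",    numofchild = 5, ages = c(3, 5, 7, 9, 11))
allfams <- list(f1 = family1, f2 = family2, f3 = family3, f4 = family4)

# Pool the ages of all children into one vector, then average once
all_ages <- unlist(lapply(allfams, function(x) x$ages), use.names = FALSE)
length(all_ages)   # 14 children in total
mean(all_ages)     # 8.5

# The mean of the four family means weights each family equally instead
mean_of_means <- mean(unlist(lapply(allfams, function(x) mean(x$ages))))
mean_of_means      # 8.583333
```

The two numbers differ because the families are different sizes. Which one is appropriate depends on whether the family or the child is the unit you care about.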

Let’s ask another question that we could use lapply and a companion function to answer. Which families have 2 or 3 children? Well, since we only have 4 families in allfams we could just look at the lists and answer this question via visual inspection. But this might get really hard to do if our allfams list had 10, 100, or 1,000 families. So here is one way we could do this.

    hold <- lapply(allfams,function(x) {x$numofchild >= 2 & x$numofchild <= 3})
    which(hold == TRUE)
    f1 f3
     1  3

    # Or we could do it all in one go
    which(lapply(allfams,function(x) {x$numofchild >= 2 & x$numofchild <= 3}) == TRUE)
    f1 f3
     1  3
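Two variations on the same question, offered as a sketch rather than the one right way. The first uses Filter to return the matching family lists themselves, and the second uses vapply so the test comes back as a plain logical vector instead of a list. The names small_fams and has_2_or_3 are mine.

```r
family1 <- list(name = "Jones",    numofchild = 2, ages = c(5, 7))
family2 <- list(name = "Espinoza", numofchild = 4, ages = c(5, 7, 9, 11))
family3 <- list(name = "Ginsberg", numofchild = 3, ages = c(9, 13, 18))
family4 <- list(name = "Souza",    numofchild = 5, ages = c(3, 5, 7, 9, 11))
allfams <- list(f1 = family1, f2 = family2, f3 = family3, f4 = family4)

# Variation 1: keep the families that pass the test
small_fams <- Filter(function(x) x$numofchild %in% 2:3, allfams)
names(small_fams)    # "f1" "f3"

# Variation 2: a logical vector, one TRUE/FALSE per family
has_2_or_3 <- vapply(allfams,
                     function(x) x$numofchild >= 2 && x$numofchild <= 3,
                     logical(1))
which(has_2_or_3)    # f1 is position 1, f3 is position 3
```

vapply insists that every result is a single logical value (that is what logical(1) declares), so a malformed family would raise an error instead of silently producing something odd.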

### Using the split command

Okay, how might we use this knowledge in another example? Lists also show up in conjunction with the “split” command which, given a data frame and a factor, will split the data frame based on that factor into a list. This is best understood with an example. We’ll use the built in data frame called mtcars.

    unique(mtcars$cyl)   # Cylinder takes on three distinct values
    [1] 6 4 8

    # We could split the data frame based on cylinder group.
    mydfs <- split(mtcars,mtcars$cyl)

    str(mydfs,max.level=1)
    List of 3
     $ 4:'data.frame': 11 obs. of 11 variables:
     $ 6:'data.frame': 7 obs. of 11 variables:
     $ 8:'data.frame': 14 obs. of 11 variables:

So what we get back is a list called mydfs whose elements are data frames, each one holding the observations for cars with a certain number of cylinders: 4, 6, or 8. This is a quick and efficient way to split up a data frame. It is worth pointing out that lots of people don’t take advantage of the split function, usually because they aren’t aware of it. If you don’t use the split function then you have to do it by hand using an approach like the following. While this will work it doesn’t scale very well, especially if you have a factor with many “levels”.

    fourcyl <- mtcars[mtcars$cyl==4,]
    sixcyl <- mtcars[mtcars$cyl==6,]
    eightcyl <- mtcars[mtcars$cyl==8,]
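If you want reassurance that split really is doing the same job as the manual subsetting, a quick comparison like the following sketch can help. It only uses mtcars, which ships with R.

```r
mydfs   <- split(mtcars, mtcars$cyl)
fourcyl <- mtcars[mtcars$cyl == 4, ]

# The "4" piece from split matches the hand-made subset
all.equal(mydfs[["4"]], fourcyl)           # TRUE

# Every row of mtcars ends up in exactly one piece
sum(sapply(mydfs, nrow)) == nrow(mtcars)   # TRUE
```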

But let’s get back to the split function and our list of data frames. We have a list called mydfs whose elements are data frames with observations corresponding to cars of 4, 6, and 8 cylinders. We can use our knowledge of lists to look around some:

    names(mydfs)
    [1] "4" "6" "8"

    mydfs$"4"
                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
    Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
    Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
    Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
    Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
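One small wrinkle worth a sketch: the element names here are the strings "4", "6" and "8", not numbers, so they need quotes or backticks with $ and should not be confused with positions.

```r
mydfs <- split(mtcars, mtcars$cyl)

# Number of cars in each cylinder group
sapply(mydfs, nrow)   # 11, 7 and 14 for groups "4", "6" and "8"

# By name: quotes, backticks, or double brackets all reach the same element
identical(mydfs$"4", mydfs[["4"]])   # TRUE
identical(mydfs$`4`, mydfs[["4"]])   # TRUE

# By position, the four-cylinder group is element 1, not element 4
identical(mydfs[[1]], mydfs[["4"]])  # TRUE
```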

How could we rapidly determine the mean MPG for each cylinder group using lapply?

    mydfs <- split(mtcars,mtcars$cyl)
    lapply(mydfs,function(x) mean(x$mpg))
    $`4`
    [1] 26.66364

    $`6`
    [1] 19.74286

    $`8`
    [1] 15.1

    # Okay cool but we could bundle this up in one statement
    lapply(split(mtcars,mtcars$cyl),function(x) mean(x$mpg))
    $`4`
    [1] 26.66364

    $`6`
    [1] 19.74286

    $`8`
    [1] 15.1

    # Or more economically (though potentially confusing to a newcomer)
    unlist(lapply(split(mtcars,mtcars$cyl),function(x) mean(x$mpg)))
           4        6        8
    26.66364 19.74286 15.10000

    # Which gives the same values as the tapply function.
    tapply(mtcars$mpg,mtcars$cyl,mean)
           4        6        8
    26.66364 19.74286 15.10000

I slipped that last one in on you to make a point that there are always multiple ways to solve problems using R. Some say that this flexibility is a great strength of R whereas others say it is a great source of confusion since newcomers don’t know which approach is best. When I was new to R I simply used whatever worked until I needed a faster or more flexible approach. My advice to you is don’t worry about which way is “right” because this will slow you down. Find an approach that solves your problems and change that approach when it becomes necessary. Okay that will wrap it up for the lapply intro. As always there are many other examples I could present but hopefully this blog will help in your mastery of lists and looping over them.

I’ve often heard that `split` + `lapply` is a good candidate for `by`. Continuing with your example, `with(mtcars, by(mpg, cyl, mean))`. Wrapped in `c()`, you would get the same result as `tapply`.

Of course, `by` is pretty much a wrapper for `tapply` anyway 🙂

one comment

    # So check out the following. It gives us exactly what we want.
    mean(unlist(lapply(allfams, function(x) mean(x$ages))))
    [1] 8.583333

The value calculated is the mean of the means. This is not strictly the same as the mean of all the ages for all the children in all the families, which is

    > mean(c(c(5,7,9,11),c(9,13,18),c(3,5,7,9,11)))
    [1] 8.916667

This does not subtract from an excellent post, but I think people should know what is being calculated as they may intend to get the latter (for example, say it was an epidemiology study and you want to know the mean age for all the children).

I came here to mention this as well. Is there an efficient way to compute the actual mean of all ages using lapply, or would we have to resort to different functions for that computation?

Oh, you are referring to computing the grand mean when the sample sizes are unequal. A more bulletproof approach, off the top of my head, would be as follows. (There are more elegant ways to do it but this satisfies the requirement). Row 1 represents the sum of the ages in each family and row 2 is the number of elements (the number of children for the family).

    (tmp <- sapply(allfams,function(x) c(sum(x$ages),length(x$ages))))
         f1 f2 f3 f4
    [1,] 12 32 40 35
    [2,]  2  4  3  5

    sum(tmp[1,])/sum(tmp[2,])
    [1] 8.5

Sorry for the late response. Please see response below. Also your computation above is computing only for families 2, 3, and 4. You left out family1.

    mean(c(5,7,5,7,9,11,9,13,18,3,5,7,9,11))
    [1] 8.5
