The goal of this blog entry is to introduce basic and essential information about the apply function. Even established R users get confused when considering this family of functions especially when observing how many of the them there are: **apply, tapply, lapply, sapply, rapply, eapply, mapply**. When I was new to R I was rarely satisfied with the all-too-basic explanations of these commands so I thought I would create a series on some of these functions to address the more common questions that newcomers might have. There is an add on package called **plyr** that attempts to present a unified philosophy and approach to the process of “applying” functions to various R data structures though, for now, I’ll be sticking to the native R commands since they show up a lot in example code you are likely to encounter.

I’ll start with the **apply** command which expects a matrix as input. Depending on what function you specify when using the apply command, you will get back either a vector or a matrix. Let’s get started with a motivating example, which I hope will convince you on how useful the apply approach can be. Here we create a matrix of 16 elements from a normal distribution of mean 10. We use the set.seed command to enable reproducibility. Imagine that each column represents numerical measurements and the rows represent individual samples. Scientific data, such as microarray and metabolomic information, can often follow this pattern. For those of you concerned with authenticity please don’t worry – I’ll introduce some “real” data later on in the posting.

set.seed(1) # Makes the call to rnorm generate the same numbers every time ( mymat <- matrix(round(rnorm(16,10),2),4,4) ) [,1] [,2] [,3] [,4] [1,] 9.37 10.33 10.58 9.38 [2,] 10.18 9.18 9.69 7.79 [3,] 9.16 10.49 11.51 11.12 [4,] 11.60 10.74 10.39 9.96

### Doing it the hard way

Let’s say that we want to take the mean of each column. Another way to say this is that we want to apply the mean function to each column in the matrix. This is the way to start thinking about things like this in R. But first, how might we do this if we don’t know anything about the apply command ?

mean(mymat[,1]) [1] 10.0775 mean(mymat[,2]) [1] 10.185 mean(mymat[,3]) [1] 10.5425 mean(mymat[,4]) [1] 9.5625 # Or we could stash the means into a vector if we need to use that information elsewhere ( mymatcolmean <- c(mean(mymat[,1]),mean(mymat[,2]),mean(mymat[,3]),mean(mymat[,4])) ) [1] 10.0775 10.1850 10.5425 9.5625

See what I did there ? I took the mean of each column by invoking the mean function four times and substituting columns 1-4 on each invocation. This is a bit crude but entirely effective and we do get the correct answer. In the second example I created a vector called mymatcolmean to hold the information. The two approaches are equivalent though neither scales very well. Imagine if there were 10, 100, or 1,000 columns. Sure you could do cut and paste but it is still a tedious process. Surely there must be a better way. Well, if you have experience with programming languages like C, FORTRAN, Java, or Perl then you probably know how to use a for-loop structure to solve this problem. R supports for-loops too but it also offers some ways to avoid using them.

retvec <- vector() for (ii in 1:ncol(mymat)) { retvec[ii] = mean(mymat[,ii]) } retvec [1] 10.0775 10.1850 10.5425 9.5625 # We could even put this into a function for later use in case we need it. myloop <- function(somemat) { retvec <- vector() length(retvec) <- ncol(somemat) for (ii in 1:ncol(somemat)) { retvec[ii] <- mean(somemat[,ii]) } return(retvec) } myloop(mymat) [1] 10.0775 10.1850 10.5425 9.5625 # This will now work for any matrix but of course it is specific to the columns of the matrix. set.seed(1) newmat <- matrix(rnorm(100),10,10) myloop(newmat) [1] 0.1322028 0.2488450 -0.1336732 0.1207302 0.1341367 0.1434569 0.4512100 -0.2477361 [9] 0.1273603 0.1123413

So this would work but what if we wanted to take the mean of the rows instead of the columns ? We could do that but we then have to rewrite our function. And, while we are at it, in order to make our function more general what about providing an argument that indicates whether we want to operate on the rows or the columns ? I’ll use “1″ to indicate rows and “2″ to indicate columns.

myloop <- function(somemat,rc=2) { retvec <- vector() length(retvec) <- ncol(somemat) # Check to see if we are taking the mean of the columns or rows # 1 indicates rows, 2 indicates columns if (rc == 2) { for (ii in 1:ncol(somemat)) { retvec[ii] <- mean(somemat[,ii]) } } else { for (ii in 1:nrow(somemat)) { retvec[ii] <- mean(somemat[ii,]) } } return(retvec) } # Okay let's make sure it works. myloop(mymat,2) # This is correct [1] 10.0775 10.1850 10.5425 9.5625 myloop(mymat,1) # This is correct [1] 9.9150 9.2100 10.5700 10.6725 mean(mymat[1,]) # A spot check [1] 9.915

### Doing it the easy way

Well okay that’s good but what if we then wanted our myloop function to accommodate a function other than **mean** ? Like **quantile, sum, fivenum**, or even a function of our own design ? We could add another argument to our function to let the user specify the name of a function to be applied to the rows or columns. But before we do more work please consider that R already has something that will do the job for us – the apply command.

apply(mymat, 2, mean) [1] 10.0775 10.1850 10.5425 9.5625 apply(mymat, 1, mean) [1] 9.9150 9.2100 10.5700 10.6725

See how much easier that is than writing our own looping function ? It has been my observation that those well versed in traditional programming languages have a bigger problem getting used to the apply function than newcomers simply because experienced programmers are more accustomed to writing their own summary functions. They just dive in and start coding. But R short circuits this approach by providing the apply family of commands. Note also that we can substitute in any function we want to.

apply(mymat,2,class) # What class do the columns belong to ? [1] "numeric" "numeric" "numeric" "numeric" apply(mymat,2,sum) # Get the sum of all the columns [1] 40.31 40.74 42.17 38.25 apply(mymat,1,range) # Get the range of all rows [,1] [,2] [,3] [,4] [1,] 9.37 7.79 9.16 9.96 [2,] 10.58 10.18 11.51 11.60 apply(mymat,2,fivenum) # Get the fivenum summary for each column [,1] [,2] [,3] [,4] [1,] 9.160 9.180 9.690 7.790 [2,] 9.265 9.755 10.040 8.585 [3,] 9.775 10.410 10.485 9.670 [4,] 10.890 10.615 11.045 10.540 [5,] 11.600 10.740 11.510 11.120

Some activities on matrices are so common that R actually has dedicated functions for them that are very efficient and fast:

rowMeans(mymat) # Equivalent to apply(mymat,1,mean) [1] 9.9150 9.2100 10.5700 10.6725 colMeans(mymat) # Equivalent to apply(mymat,2,mean) [1] 10.0775 10.1850 10.5425 9.5625 rowSums(mymat) # Equivalent to apply(mymat,1,sum) [1] 39.66 36.84 42.28 42.69 colSums(mymat) # Equivalent to apply(mymat,2,sum) [1] 40.31 40.74 42.17 38.25

### Passing additional arguments

Many explanations I’ve seen for the apply family of functions omit any discussion on how to provide arguments to the function you pass to apply. I’m not sure why as this is an extremely important consideration. To show you what I mean let’s go back to the apply example that uses the mean function.

apply(mymat, 2, mean) [1] 10.0775 10.1850 10.5425 9.5625

This is easy to understand though what happens if we wish to pass arguments to the mean function such as “trim” or “na.rm” ? For example maybe we wish to first trim off some portion of the data before we apply the mean function. If we did this using the mean function directly it would look like this:

mean(mymat[,1],trim=0.5) [1] 9.775 # You might then be tempted to do something like this when using the apply command apply(mymat, 2, mean(trim=0.5)) Error in mean.default(trim = 0.5) : argument "x" is missing, with no default # Or maybe this ? apply(mymat, 2, mean(x,trim=0.5)) Error in mean(x, trim = 0.5) : object 'x' not found

To get some idea on how to do this correctly look at the help information for apply. It’s not exactly jumping out at you but the answer is there. Do you see it ?

apply(X, MARGIN, FUN, ...) Arguments: X: an array, including a matrix. MARGIN: a vector giving the subscripts which the function will be applied over. E.g., for a matrix ‘1’ indicates rows, ‘2’ indicates columns, ‘c(1, 2)’ indicates rows and columns. Where ‘X’ has named dimnames, it can be a character vector selecting dimension names. FUN: the function to be applied: see ‘Details’. In the case of functions like ‘+’, ‘%*%’, etc., the function name must be backquoted or quoted. ...: optional arguments to ‘FUN’.

Oh so the ellipsis imply that we can simply append arguments after the reference to the function. Of course it is reasonable to assume that the arguments you pass must make sense to the function. That is you cannot pass an argument to a function that isn’t valid !

apply(mymat,2,mean,trim=0.5) [1] 9.775 10.410 10.485 9.670 # We can pass additional arguments apply(mymat,2,mean,trim=0.5, na.rm=T) [1] 9.775 10.410 10.485 9.670

### Using our own functions with apply

We can also use a function that we have written and pass it to the apply command. As an example let’s say that we want to express each element in a column as a proportion of the sum of the column it occupies. Here is how we would do this without using apply. As in the opening example this approach doesn’t scale very well although we do get the correct answers.

mymat[,1]/sum(mymat[,1]) [1] 0.2324485 0.2525428 0.2272389 0.2877698 mymat[,2]/sum(mymat[,2]) [1] 0.2535592 0.2253314 0.2574865 0.2636230 mymat[,3]/sum(mymat[,3]) [1] 0.2508893 0.2297842 0.2729429 0.2463837 mymat[,4]/sum(mymat[,4]) [1] 0.2452288 0.2036601 0.2907190 0.2603922

But using our new found knowledge of the apply command we can make this easier and more general. First, let’s write a function that given a vector, (which is what each column or row of a matrix is), will return a vector where each element is a proportion of the sum of the input vector.

myfunc <- function(x) { return(x/sum(x)) } # Check it out to make sure it works myfunc(mymat[,1]) [1] 0.2324485 0.2525428 0.2272389 0.2877698 all.equal(myfunc(mymat[,1]), mymat[,1]/sum(mymat[,1])) [1] TRUE

So now we can pass this to the apply command directly

apply(mymat, 2, myfunc) [,1] [,2] [,3] [,4] [1,] 0.2324485 0.2535592 0.2508893 0.2452288 [2,] 0.2525428 0.2253314 0.2297842 0.2036601 [3,] 0.2272389 0.2574865 0.2729429 0.2907190 [4,] 0.2877698 0.2636230 0.2463837 0.2603922 # If this is correct then the sum of each column should be 1. colSums(apply(mymat,2,myfunc)) [1] 1 1 1 1 # which is equivalent to the following nested apply call apply( apply(mymat,2,myfunc), 2, sum) [1] 1 1 1 1

The second example above, the one that has a call to apply within a call to apply, is known as a “composite call”, which can be confusing to newcomers. However, in the world of R, which aggressively implements functional programming concepts, the use of composite function calls is quite common and encouraged. But, there is no requirement to do things this way. Also, I should point out that R has a dedicated function for computing proportions in a matrix. Check out the **prop.table** command.

### Working anonymously

In reality we don’t have to predefine the function. We simply pass the function to apply in an “anonymous” fashion, which means that we don’t even give it a name. It lives only for the duration of the call to apply.

apply(mymat, 2, function(x) x/sum(x)) [,1] [,2] [,3] [,4] [1,] 0.2324485 0.2535592 0.2508893 0.2452288 [2,] 0.2525428 0.2253314 0.2297842 0.2036601 [3,] 0.2272389 0.2574865 0.2729429 0.2907190 [4,] 0.2877698 0.2636230 0.2463837 0.2603922

This can be confusing to newcomers and, if it is, then you can simply predefine the function using a name and then refer to it by name when using the apply command. Sometimes you see calls to apply that look like the following. This is done in an attempt to improve readability though that is subjective. In terms of results or efficiency there are no differences.

apply(mymat, 2, function(x) { return(x/sum(x)) })

### Performance vs the for loop

If you’ve read any of my other postings you will know that I have an interest in performance. Is there an efficiency gain when using the apply approach vs. the for loop approach ? To test this let’s read in some “real” Microarray data, which comes from Su et al, 2002.

url <- "http://steviep42.bitbucket.org/data/su_et_al_2002.txt" mydata <- read.csv(url,header=T) # B = brain, BF = fetal brain, L = liver, LF = fetal liver str(mydata) 'data.frame': 12626 obs. of 8 variables: $ B1 : num 9.37 11.57 7.79 6.56 7.26 ... $ B2 : num 9.5 11.51 6.69 7.45 7.39 ... $ FB1: num 7.89 10.56 7.91 5.93 8.75 ... $ FB2: num 7.96 10.34 7.57 7.53 8.19 ... $ FL1: num 7.7 10.26 8.25 7.84 9.45 ... $ FL2: num 8.1 10.13 8.53 8.16 8.19 ... $ L1 : num 9.22 10.23 8.39 11.02 9.19 ... $ L2 : num 8.9 10.49 8.47 11.09 9.42 ... # Let's check the mean and standard deviation for each column apply(mydata,2,function(x) c(mu=mean(x),sd=sd(x))) B1 B2 FB1 FB2 FL1 FL2 L1 L2 mu 7.615090 7.431602 7.879343 7.770545 7.730293 7.601395 7.415937 7.518259 sd 2.257218 2.452364 2.150741 2.254736 2.207076 2.355612 2.405326 2.365394

Okay well let’s do a t-test across all rows for the brain tissue types. This becomes really easy to do with the apply command.

mypvals <- apply(mydata, 1, function(x) { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value } ) round(mypvals[1:5],4) 100_g_at 1000_at 1001_at 1002_f_at 1003_s_at 0.0090 0.0494 0.5211 0.8001 0.1349

But is this faster than writing our own loop to do the t-test ? We’ll first need to write our own code to do the t-test. It might look something like this:

myttest <- function(somemat) { retvec <- vector() length(retvec) <- nrow(somemat) for (ii in 1:nrow(somemat)) { retvec[ii] <- t.test(somemat[ii,c("B1","B2")], somemat[ii,c("FB1","FB2")])$p.value } names(retvec) <- row.names(somemat) return(retvec) }

Next we’ll make sure that this function returns the same results as the apply version after which we will time each approach.

v1 <- myttest(mydata) v2 <- apply(mydata, 1, function(x) { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value } ) all.equal(v1,v2) [1] TRUE # Let's time the two different approaches benchmark(forloop = myttest(mydata), apply=apply(mydata, 1, function(x) { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value }), replications=5,columns=c("test","elapsed")) test elapsed 2 apply 19.616 1 forloop 61.800

In this case it appears that the apply approach is indeed quicker than writing our own function though in my experience, with more general cases, the difference in execution time isn’t as drastic. In the end I suggest you use whichever approach makes the most sense to you to get the job done. However, I think that as you become more confident with R that you will be attracted to the apply command since it basically replaces the for-loop. So whenever you find yourself tempted to write a for-loop, even though you are used to it in other languages, you should remind yourself that that the “R-like” way to do it is to use apply.