
Data Frames

R provides a helpful data structure called the “data frame” that gives the user an intuitive way to organize, view, and access data. Many of the functions you would use to read in external files (e.g. read.csv) or connect to databases (e.g. RMySQL) return a data frame by default. While there are other important data structures, such as the vector, list, and matrix, the data frame winds up being at the heart of many operations, not the least of which is aggregation. Before we get into that, let me offer a very brief review of data frame concepts:

  • A data frame is a set of rows and columns.
  • Every column has the same length and holds a single data type.
  • Every row has the same length, but its values can be of differing data types.
  • A data frame has characteristics of both a matrix and a list.
  • Bracket notation is the customary method of indexing into a data frame.
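
To make these points concrete, here is a minimal sketch (the column names and values are invented purely for illustration) of a small data frame whose columns hold different types but share the same length:

# Three columns of equal length, each holding a single type
people <- data.frame(name   = c("Ann", "Bob", "Cate"),   # character (or factor, depending on your R settings)
                     age    = c(34, 28, 41),              # numeric
                     smoker = c(TRUE, FALSE, FALSE))      # logical

str(people)   # one line per column showing its type and values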

 

Subsetting Data The Old School Way

Here are some examples of extracting specific subsets of information from the built-in data frame mtcars. Note that the bracket notation has two dimensions here – one for rows and one for columns – and the comma inside the brackets separates the two.

# select rows 1 and 2

mtcars[1:2,]
              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

# select rows 1 and 2 and columns 3 and 5

mtcars[1:2,c(3,5)]
              disp drat
Mazda RX4      160  3.9
Mazda RX4 Wag  160  3.9

# Find the rows where the MPG column is greater than 30 

mtcars[mtcars$mpg > 30,]
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

# Find the rows where MPG > 30 and the cylinder count equals 4,
# then extract columns 10 and 11

mtcars[mtcars$mpg > 30 & mtcars$cyl == 4, 10:11]
               gear carb
Fiat 128          4    1
Honda Civic       4    2
Toyota Corolla    4    1
Lotus Europa      5    2

So bracket notation winds up being a good way to index and search within a data frame, although aggregation requires other functions such as tapply, aggregate, and table. This isn’t necessarily a bad thing; you just have to learn which function is most appropriate for the task at hand.

# Get the mean MPG by Transmission

tapply(mtcars$mpg, mtcars$am, mean)
0    1
17.1 24.4 

# Get the mean MPG for Transmission grouped by Cylinder

aggregate(mpg~am+cyl,data=mtcars,mean)
   am cyl  mpg
1  0   4 22.9
2  1   4 28.1
3  0   6 19.1
4  1   6 20.6
5  0   8 15.1
6  1   8 15.4

# Cross tabulation based on Transmission and Cylinder

table(transmission=mtcars$am, cylinder=mtcars$cyl)
            cylinder
transmission  4  6  8
           0  3  4 12
           1  8  3  2

 

Enter the data.table package

Okay, this is nice, but wouldn’t it be good to have a way to do the aggregation within the bracket notation? In fact there is. A couple of packages can help simplify aggregation, though we will start with the data.table package for now. In addition to supporting aggregation within the brackets, it is useful for some other reasons:

  • It works well with very large data files
  • Can behave just like a data frame
  • Offers fast subsetting, grouping, updating, and joins
  • Makes it easy to turn an existing data frame into a data table

Since it is an external package you will first need to install it. After that, just load it with the library function.
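
The install itself is a single command:

install.packages("data.table")   # one-time install from CRAN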

library(data.table)

dt <- data.table(mtcars)

class(dt)
[1] "data.table" "data.frame"

dt[,mean(mpg)]   # You can't do this with a normal data frame
[1] 20.09062

mtcars[,mean(mpg)]  # Such a thing will not work with regular data frames
Error in mean(mpg) : object 'mpg' not found

Notice that we can compute the mean MPG directly within the bracket notation – we don’t have to go outside of the brackets to do it. So what about reproducing the previous tapply example?

tapply(mtcars$mpg,mtcars$am,mean)
 0    1
17.1 24.4 

# Here is how we would do this with the data table "dt"

dt[,mean(mpg),by=am]
   am   V1
1:  1 24.4
2:  0 17.1

# We could even extend this to group by am and cyl

dt[,mean(mpg),by=.(am,cyl)]
    am cyl  V1
1:  1   6 20.6
2:  1   4 28.1
3:  0   6 19.1
4:  0   8 15.0
5:  0   4 22.9
6:  1   8 15.4

# If we want to more clearly label the computed average

dt[,.(avg=mean(mpg)),by=.(am,cyl)]
    am cyl  avg
1:  1   6 20.6
2:  1   4 28.1
3:  0   6 19.1
4:  0   8 15.0
5:  0   4 22.9
6:  1   8 15.4

 

Similarities to SQL

It doesn’t take many examples to see that we no longer need the aggregate or tapply functions once we have created a data table. Unlike a default data frame, the bracket notation for a data.table object has three parts, which correspond to the clauses one might see in an SQL statement. Don’t worry – you do not have to be an SQL expert to use data.table. In reality you don’t have to know SQL at all, although if you do, using data.table becomes much easier.


The general form is dt[i, j, by]. In SQL terms we would say: select “j” (columns, or an operation on some columns) from the rows “i” that satisfy some specified condition, and, if the “by” argument is supplied, group the result accordingly. This might sound a little involved, so something more practical is in order to illustrate what I’m talking about.


So let’s revisit the previous examples and see how it relates to the SQL model – This is helpful in understanding the paradigm associated with data table objects:

dt[,mean(mpg),by=am]
   am   V1
1:  1 24.4
2:  0 17.1

# The above is analogous to an SQL statement like

select am,avg(mpg) from mtcars group by am

# The following example

dt[,.(avg=mean(mpg)),by=.(am,cyl)]
    am cyl avg
1:  1   6 20.6
2:  1   4 28.1
3:  0   6 19.1
4:  0   8 15.0
5:  0   4 22.9
6:  1   8 15.4

# is analogous to an SQL statement like:

select am,avg(mpg) as avg from mtcars group by am,cyl

# The following example

dt[mpg > 20,.(avg=mean(mpg)),by=.(am,cyl)]
    am cyl  avg
1:  1   6 21.0
2:  1   4 28.1
3:  0   6 21.4
4:  0   4 22.9

# would be analogous to the following SQL statement

select am,avg(mpg) as avg from mtcars where mpg > 20 group by am,cyl


As previously mentioned one does not need to know SQL to use data.table. However, if you do it can help you understand some of the motivations behind the package.

 

Tabulation

Here are some more examples that illustrate how we can count and tabulate things. Within a data table the special symbol .N represents the number of rows. If a grouping is supplied via by, then .N gives the number of rows within each group.

dt[, .N] # How many rows
[1] 32

dt[, .N, by=cyl]  # How many cars in each cylinder group
   cyl  N
1:   6  7
2:   4 11
3:   8 14

# For rows where wt > 1.5 (mtcars records weight in units of 1,000 lbs),
# count the number of cars by transmission type.

dt[wt > 1.5, .(count=.N), by=am] 
   am count
1:  1    13
2:  0    19

We can also do sorting quite easily and quickly.


# Present the 5 cars with the best MPG

head(dt[order(-mpg)],5) 
 mpg cyl disp  hp drat   wt qsec vs am gear carb
1: 33.9   4 71.1  65 4.22 1.83 19.9  1  1    4    1
2: 32.4   4 78.7  66 4.08 2.20 19.5  1  1    4    1
3: 30.4   4 75.7  52 4.93 1.61 18.5  1  1    4    2
4: 30.4   4 95.1 113 3.77 1.51 16.9  1  1    5    2
5: 27.3   4 79.0  66 4.08 1.94 18.9  1  1    4    1

# Since data table inherits from a data frame we could have also done

dt[order(-mpg)][1:5]
mpg cyl disp  hp drat   wt qsec vs am gear carb
1: 33.9   4 71.1  65 4.22 1.83 19.9  1  1    4    1
2: 32.4   4 78.7  66 4.08 2.20 19.5  1  1    4    1
3: 30.4   4 75.7  52 4.93 1.61 18.5  1  1    4    2
4: 30.4   4 95.1 113 3.77 1.51 16.9  1  1    5    2
5: 27.3   4 79.0  66 4.08 1.94 18.9  1  1    4    1

# We can sort on multiple keys. Here we sort on decreasing gas mileage,
# with ties broken by increasing weight

dt[order(-mpg,wt)][1:5]
   mpg cyl disp  hp drat   wt qsec vs am gear carb
1: 33.9   4 71.1  65 4.22 1.83 19.9  1  1    4    1
2: 32.4   4 78.7  66 4.08 2.20 19.5  1  1    4    1
3: 30.4   4 95.1 113 3.77 1.51 16.9  1  1    5    2
4: 30.4   4 75.7  52 4.93 1.61 18.5  1  1    4    2
5: 27.3   4 79.0  66 4.08 1.94 18.9  1  1    4    1
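
As an aside, data.table also provides setorder, which reorders a table in place (by reference) instead of returning a sorted copy. Here is a minimal sketch that works on a copy so the dt object used in the examples below is left untouched:

# Work on a copy so dt itself stays in its original order
dt2 <- copy(dt)
setorder(dt2, -mpg, wt)   # reorders dt2 by reference: decreasing mpg, ties by increasing wt
head(dt2, 5)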

 

Chicago Crime Statistics

Let’s look at a more realistic example. I have a file of Chicago crime data that you can download if you wish (that is, if you want to work through this example). It is about 81 megabytes, so it isn’t terribly large.

url <- "https://raw.githubusercontent.com/steviep42/youtube/master/YOUTUBE.DIR/chi_crimes.zip"
download.file(url,"chi_crimes.zip")
system("unzip chi_crimes.zip")

You might try reading it in with the typical import functions in R, such as read.csv, which many people use by default when reading CSV files. I’m reading this file on a five-year-old MacBook with about 8 GB of RAM, so your speed may vary. We’ll first read in the file using read.csv and then use the fread function supplied by data.table to see if there is any appreciable difference.

system.time(df.crimes <- read.csv("chi_crimes.csv", header=TRUE,sep=","))

user system elapsed
30.251 0.283 30.569

nrow(df.crimes)
[1] 334141

# Now let's try fread

system.time(dt.crimes <- fread("chi_crimes.csv", header=TRUE,sep=","))

user system elapsed
1.045 0.037 1.362

attributes(dt.crimes)$class # dt.crimes is also a data.frame
[1] "data.table" "data.frame"

nrow(df.crimes)
[1] 334141

dt.crimes[,.N]
[1] 334141

That was a fairly significant difference. If the file were much larger we would see an even bigger gap, which matters to me since I routinely read in large files. Consequently fread has become a default for me even when I don’t wind up using the aggregation capability of the data.table package. Also note that fread returns a data.table object by default. Now let’s investigate some crime using some of the things we learned earlier. This table has information on every call to the Chicago police in the year 2012, so we’ll want to see which columns in the data frame/table could serve as factors for summarizing across groups.

 names(dt.crimes)
 [1] "Case Number"          "ID"                   "Date"
[4] "Block"                "IUCR"                 "Primary Type"
[7] "Description"          "Location Description" "Arrest"
[10] "Domestic"             "Beat"                 "District"
[13] "Ward"                 "FBI Code"             "X Coordinate"
[16] "Community Area"       "Y Coordinate"         "Year"
[19] "Latitude"             "Updated On"           "Longitude"
[22] "Location"  

# Let's see how many unique values there are for each column. Looks 
# like 30 FBI codes so maybe we could see the number of calls per FBI 
# code. What about District ? There are 25 of those.

sapply(dt.crimes,function(x) length(unique(x)))

        Case Number                   ID                 Date 
              334114               334139               121484 
               Block                 IUCR         Primary Type 
               28383                  358                   30 
         Description Location Description               Arrest 
                 296                  120                    2 
            Domestic                 Beat             District 
                   2                  302                   25 
                Ward             FBI Code         X Coordinate 
                  51                   30                60704 
      Community Area         Y Coordinate                 Year 
                  79                89895                    1 
            Latitude           Updated On            Longitude 
              180396                 1311               180393 
            Location 
              178534 


Now – I just used the sapply function to tell me how many unique values each column takes. This is so we can identify potential summary factors. It is a common activity and I used a common R-like approach, although to a newcomer it looks a little verbose (welcome to R, my friend). It turns out that we can accomplish the same thing using data.table itself, but first let’s do some exploration and then come back to this. So how many calls per District were there? Let’s order the result so that we see the Districts with the highest number of calls first, in descending order.

dt.crimes[,.N,by=District][order(-N)]
    District     N
 1:        8 22386
2:       11 21798
3:        7 20150
4:        4 19789
5:       25 19658
6:        6 19232
7:        3 17649
8:        9 16656
9:       19 15608
10:        5 15258
11:       10 15016
12:       15 14385
13:       18 14178
14:        2 13448
15:       14 12537
16:        1 12107
17:       16 10753
18:       22 10745
19:       17  9673
20:       24  9498
21:       12  8774
22:       13  7084
23:       20  5674
24:       NA  2079
25:       31     6
District     N

Next let’s randomly sample 500 rows and count the calls in that sample, grouped by FBI Code (check https://www2.fbi.gov/ucr/nibrs/manuals/v1all.pdf to see what the codes correspond to).

dt.crimes[sample(1:.N,500), .(mean=mean(.N)), by="FBI Code"]
  FBI Code mean
1:       14   60
2:       19    3
3:       24    6
4:       26   47
5:       06  109
6:      08B   83
7:       07   27
8:      08A   22
9:       05   34
10:       18   44
11:      04B   10
12:       03   19
13:       11   15
14:      04A    7
15:       09    1
16:       15    6
17:       16    3
18:       10    1
19:       17    1
20:       02    2

# Here we count the number of calls for each day of the year and order from the highest
# to the lowest. We see that the day with the most calls to the police was
# New Year's Day, whereas Christmas Day had the fewest calls

dt.crimes[,.N,by=substr(Date,1,10)][order(-N)]
         substr    N
  1: 01/01/2012 1297
  2: 06/01/2012 1163
  3: 07/01/2012 1158
  4: 09/01/2012 1124
  5: 07/15/2012 1122
 ---                
362: 12/27/2012  679
363: 01/20/2012  677
364: 11/22/2012  653
365: 12/26/2012  619
366: 12/25/2012  520

# Here is another example that would emulate the HAVING clause in SQL

dt.crimes[,.N,by=substr(Date,1,10)][N > 1122][order(-N)]
       substr    N
1: 01/01/2012 1297
2: 06/01/2012 1163
3: 07/01/2012 1158
4: 09/01/2012 1124

 

Other Things

Keep in mind that data.table isn’t just for aggregation. You can do anything with it that you can do with a normal data frame, including creating new columns, modifying existing ones, and writing your own aggregation functions, among many other activities.
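
For example, columns can be added and removed by reference with the := operator. Here is a minimal sketch – the column name wt_lbs is just something I invented for illustration (mtcars records wt in units of 1,000 lbs):

# Add a derived column by reference (no copy of the table is made)
dt[, wt_lbs := wt * 1000]

# Drop it again so the examples that follow are unaffected
dt[, wt_lbs := NULL]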


# Get the next to the last row from the data table

dt[.N-1]
mpg cyl disp  hp drat   wt qsec vs am gear carb
1:  15   8  301 335 3.54 3.57 14.6  0  1    5    8

dt[cyl %in% c(4,6)]
mpg cyl  disp  hp drat   wt qsec vs am gear carb
1: 21.0   6 160.0 110 3.90 2.62 16.5  0  1    4    4
2: 21.0   6 160.0 110 3.90 2.88 17.0  0  1    4    4
3: 22.8   4 108.0  93 3.85 2.32 18.6  1  1    4    1
4: 21.4   6 258.0 110 3.08 3.21 19.4  1  0    3    1
5: 18.1   6 225.0 105 2.76 3.46 20.2  1  0    3    1
6: 24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
7: 22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
8: 19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
9: 17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
10: 32.4   4  78.7  66 4.08 2.20 19.5  1  1    4    1
11: 30.4   4  75.7  52 4.93 1.61 18.5  1  1    4    2
12: 33.9   4  71.1  65 4.22 1.83 19.9  1  1    4    1
13: 21.5   4 120.1  97 3.70 2.46 20.0  1  0    3    1
14: 27.3   4  79.0  66 4.08 1.94 18.9  1  1    4    1
15: 26.0   4 120.3  91 4.43 2.14 16.7  0  1    5    2
16: 30.4   4  95.1 113 3.77 1.51 16.9  1  1    5    2
17: 19.7   6 145.0 175 3.62 2.77 15.5  0  1    5    6
18: 21.4   4 121.0 109 4.11 2.78 18.6  1  1    4    2

# Summarize different variables at once

dt[,.(avg_mpg=mean(mpg),avg_wt=mean(wt))]
avg_mpg avg_wt
1:    20.1   3.22

dt[,cyl:=NULL]  # Removes the cyl column

head(dt)
mpg disp  hp drat   wt qsec vs am gear carb
1: 21.0  160 110 3.90 2.62 16.5  0  1    4    4
2: 21.0  160 110 3.90 2.88 17.0  0  1    4    4
3: 22.8  108  93 3.85 2.32 18.6  1  1    4    1
4: 21.4  258 110 3.08 3.21 19.4  1  0    3    1
5: 18.7  360 175 3.15 3.44 17.0  0  0    3    2
6: 18.1  225 105 2.76 3.46 20.2  1  0    3    1

 

Identifying Possible Factors in a Data Frame

Okay – I demonstrated above a way to identify the number of unique values in each column as a means of determining what the factors/categories might be. I used the following:


sapply(dt.crimes,function(x) length(unique(x)))

However, the data.table package has a special variable called .SD and a function called uniqueN that can help us. Let’s explore what these do. First let me create a smaller data table, which makes it easier to see what is going on.

set.seed(123)
mydt <- data.table(group=sample(LETTERS[1:3],6,T),gender=sample(c("M","F"),6,T),
                   measure=round(rnorm(6,10),2))

mydt
   group gender measure
1:     A      F   10.46
2:     C      F    8.73
3:     B      F    9.31
4:     C      M    9.55
5:     C      F   11.22
6:     A      M   10.36

# The following groups the data.table by the "gender" column

mydt[,.SD,by=gender]
   gender group measure
1:      F     A   10.46
2:      F     C    8.73
3:      F     B    9.31
4:      F     C   11.22
5:      M     C    9.55
6:      M     A   10.36

In this case the .SD variable represents the subset of the data belonging to each group, so the effect is to group the data by gender, much as if we were creating an index. (Note that the data.table package has a setkey function that formalizes the creation of such an index.)
What about this example?

mydt[,print(.SD),by=gender] 
   group measure
1:     A   10.46
2:     C    8.73
3:     B    9.31
4:     C   11.22
   group measure
1:     C    9.55
2:     A   10.36
Empty data.table (0 rows) of 1 col: gender

This is somewhat similar to R’s native split function, which lets one split a data frame on a given factor and store the results in a list. In this case, however, the “splits” aren’t really stored anywhere because we are simply printing them. A more useful example might be:

mydt[,lapply(.SD,mean),by=gender,.SDcols="measure"] 
   gender measure
1:      F    9.93
2:      M    9.96 

Oh wow – so .SD pulls out all the columns in the data table except the grouping column gender, and the mean function is applied to the columns specified by the .SDcols argument. So this is another way of doing some aggregation, although it is perhaps a bit intimidating for those not familiar with the lapply function (if that is the case I have a cure for you – go read this post). We might be getting a bit off track here, because I wanted to show you how to replace the functionality of the sapply example from above. Here it is: the uniqueN function determines how many unique values there are in a column, and lapply applies uniqueN to each column.


mydt[,lapply(.SD,uniqueN)]
   group gender measure
1:     3      2       6
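
Because this runs inside the data.table brackets, the same idea extends naturally to groups – a quick sketch:

# Count the unique values of the remaining columns within each group
mydt[, lapply(.SD, uniqueN), by=group]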

 

Is this the end?

Before we finish this posting it is worth mentioning that the data.table package also provides support for setting a key, much in the same way one would create an index in a relational database to speed up queries. This is for situations where you have a really large data table and expect to routinely interrogate it using the same column(s) as keys. In the next posting I will look at the dplyr package to show another way to handle large files and accomplish intuitive aggregation. Some R experts present data.table as a competitor to dplyr, although one can mix the two. What I like about data.table is that it allows you to build sophisticated queries, summaries, and aggregations within the bracket notation. It has the added flexibility of letting you employ existing R functions or any that you decide to write.
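
As a quick, minimal sketch of what setting a key looks like (using the District column from the crimes table above):

# Physically sort the table by District and mark that column as the key
setkey(dt.crimes, District)

# Keyed lookups can then use fast binary search rather than a full scan
dt.crimes[.(8), .N]   # number of calls in District 8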

lapply 101
Next up in our review of the family of apply commands we’ll look at the lapply function, which can be used to loop over the elements of a list (or a vector). This is a true convenience, although those with experience in other programming languages may find it unnecessary since they are accustomed to writing their own loops. Rest assured you can take that approach in R, but once you get an understanding of lists and lapply you will appreciate what it can do for you. This leads me to what I feel is an important observation: most misunderstandings of the lapply command result from a limited or incomplete knowledge of the list structure in R. As long as you know what lists are and how to manipulate them effectively, lapply becomes much easier to use. So we’ll begin with a brief yet important tutorial on lists and how their elements can be referenced.

The list structure is R’s default data structure for containing heterogeneous information. Recall that vectors and matrices can accommodate only one data type at a time – you cannot have a vector or matrix with mixed data types. You can try to put numeric and character data into the same vector, but R will convert it all to a single type. Don’t believe me? Try it. R doesn’t complain, but it will turn everything into a character string.

somevec <- c(1,4,5,"4","5")

somevec
[1] "1" "4" "5" "4" "5"

So from a purely practical point of view some data structure must exist in R to accommodate mixed data, and that is what the list structure is for. Where do lists show up in R? All over the place, it turns out. Many of the interesting statistical functions in R, as well as the many add-on packages available on CRAN, return information in the form of lists.


# Let's do some regression using the mtcars data frame

mylm <- lm(mpg~wt, data = mtcars)

# What type of structure do we get back ? A list with 12 sub elements

str(mylm,max.level=0)
List of 12
 - attr(*, "class")= chr "lm"

Now isn’t that interesting? As you might know, judging the quality of a regression can be quite involved. R knows this, so it returns lots of information encapsulated within a list to help you assess the model. The 12 elements include data types such as vectors, factors, lists, objects, and data frames. So if you are writing your own function and need to return diverse types of data, then the list structure is for you! But that’s all a bit too complex for the moment, so let’s return to some basics. To motivate things I’ll present some variables that look like they relate to characteristics of a family. We have a surname, the number of children in the family, their respective ages, and whether or not the children have had measles.
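
To peek inside, we can use the usual list tools. For example (the names in the comment are a couple of the standard components an lm object carries):

names(mylm)        # element names such as "coefficients", "residuals", "fitted.values"
mylm$coefficients  # pull out a single element by name (more on the "$" notation shortly)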

surname <- "Jones"
numofchild <- 2
ages <- c(5,7)
measles <- c("Y","N")

We could work with this information as individual variables though it might be useful to assemble it all into a single structure. So I will create a “named” list to contain this information. It is called a “named” list because each element receives a name as the list is being created. Named lists are easier for humans to manipulate and interrogate because we can refer to elements by name using the “$” notation which I will introduce momentarily.

family1 <- list(name="Jones",numofchild=2,ages=c(5,7),measles=c("Y","N"))

str(family1)
List of 4
 $ name      : chr "Jones"
 $ numofchild: num 2
 $ ages      : num [1:2] 5 7
 $ measles   : chr [1:2] "Y" "N"

Ah, so we see that the elements “name” and “measles” are character vectors, “numofchild” is a single-valued numeric vector, and “ages” is a multi-valued numeric vector. This proves that we can host data of differing types within a single data structure. Now how do we address these elements and retrieve their values? Can we use numeric indexing as with a vector? Can we use the names we created for each element? The answer in both cases is “yes”. If we have a list with named elements then we can use the “$” notation.

family1$ages
[1] 5 7

family1$measles
[1] "Y" "N"

# We can also reference by numeric position which is more useful if you are 
# writing your own loop structures but it is less intuitive 

family1[2]
$numofchild
[1] 2

family1[[2]]
[1] 2

Hmm. What’s up with the double bracket vs. the single bracket? The way I think about it is that the single bracket (as you would use with a vector) gives you back a list containing the element – its name as well as its value – whereas the double brackets give you the actual value(s) of the element itself, which is usually what you want. Think of the double brackets as being more specific than the single brackets. Note that even if you use the $ notation you can still address individual values of a given list element. Here I’ll start by pulling out the age of the first child only.

family1$ages[1]
[1] 5

# We could pull out both ages using this approach

family1$ages[1:2]
[1] 5 7

# But this is the same as this:

family1$ages
[1] 5 7

# Which is the same as this:

family1[[3]]
[1] 5 7

The way I would summarize the above is this: if you have a named list then you can use the “$” notation for the most part, though if you want to address specific values within a multi-valued element you will also need bracket notation in addition to the “$” notation. If you have an unnamed list then you must use bracket notation exclusively, since there are no names available. Unnamed lists result when no effort is made to name the elements, as in the following example. I can always apply names later if I wish.

family1 <- list("Jones",2,c(5,7),c("Y","N"))

# So when we print the list results we see only brackets - no names.

family1
[[1]]
[1] "Jones"

[[2]]
[1] 2

[[3]]
[1] 5 7

[[4]]
[1] "Y" "N"

# The cool thing is that we can assign names to our elements even after we have
# created the list

names(family1) <- c("name","numofchild","ages","measles")

Introducing lapply

Admittedly this family1 list is a little basic, but the above examples show that there is flexibility in how you can address elements and values. So let’s present an example of the lapply function. I’ll use it to apply the built-in typeof function to each element in the list.

lapply(family1, typeof)
$name
[1] "character"

$numofchild
[1] "double"

$ages
[1] "double"

$measles
[1] "character"

# Using lapply will return a list

str(lapply(family1, typeof))
List of 4
 $ name      : chr "character"
 $ numofchild: chr "double"
 $ ages      : chr "double"
 $ measles   : chr "character"

So each element of family1 is “silently” passed to the typeof function, and the results for all elements are returned in a list. In fact the “l” in lapply comes from the fact that it returns a list. To make sure you understand what is happening here, let me introduce a small variation on the example.


lapply(family1, typeof)

# is the same as

lapply(family1, function(x) typeof(x))

The second version does exactly the same thing as the first but illustrates two important facts: 1) I can pass an “anonymous” function to lapply (if you read my blog post on apply I discussed the importance and use of anonymous functions), and 2) the variable “x” referenced in the anonymous function definition is a placeholder for each element of the family1 list. If you need more help understanding this then look at this example, where I use a function that simply prints back each element of the list and its associated value.

lapply(family1,function(x) x)
$name
[1] "Jones"

$numofchild
[1] 2

$ages
[1] 5 7

$measles
[1] "Y" "N"

# But this is the same as simply typing the name of the list at the prompt

family1

Now, you could have also defined a function in advance instead of using an anonymous function – that is, you are not obligated to use anonymous functions. Some people find them error-prone, confusing, and less readable. If you do, simply define your function in advance. It won’t change the result or impact performance, at least in this case.

simplefunc <- function(x) {
   mytype <- typeof(x)
   return(mytype)
}

lapply(family1, simplefunc)
$name
[1] "character"

$numofchild
[1] "double"

$ages
[1] "double"

$measles
[1] "character"

Alright, let’s get a little more advanced here. I’ll write a function that returns the mean value of each element. But I know what you are thinking: not all of our list elements are numeric, so to avoid an error I’ll have to insert some basic logic to test whether the element is numeric. If it is, I take the mean of its value; if not, I ignore the element. I could implement this example two ways: 1) create a named function in advance which I then use in a call to lapply, or 2) create an anonymous function when I call lapply. Here is the first scenario:


# Create a function in advance.

myfunc <- function(x) {
  if (is.numeric(x)) {
    mean(x)
  }
} 

# For numeric elements we get a meaningful result. For other elements we
# don't get back anything

lapply(family1, myfunc)
$name
NULL

$numofchild
[1] 2

$ages
[1] 6

$measles
NULL

What about approach #2 ? This is where I would define the function as I make the call to lapply. This works just as well but might be less readable to a newcomer.

lapply(family1, function(x) {if (is.numeric(x)) mean(x)})
$name
NULL

$numofchild
[1] 2

$ages
[1] 6

$measles
NULL

A list of lists!

Okay, let’s create some more lists that correspond to different families. If we want, we can even create a master list whose elements are the individual family lists – in effect, a list of lists! In this case our master list has named elements, so we can easily address the contents of the sub-elements.

family2 <- list(name="Espinoza",numofchild=4,ages=c(5,7,9,11),measles=c("Y","N","Y","Y"))
family3 <- list(name="Ginsberg",numofchild=3,ages=c(9,13,18),measles=c("Y","N","Y"))
family4 <- list(name="Souza",numofchild=5,ages=c(3,5,7,9,11),measles=c("N","Y","Y","Y","N"))

allfams <- list(f1=family1,f2=family2,f3=family3,f4=family4)

str(allfams,max.level=1)
List of 4
 $ f1:List of 4
 $ f2:List of 4
 $ f3:List of 4
 $ f4:List of 4

allfams$f3$ages   # Get the ages of Family 3
[1]  9 13 18

# Same as

allfams[[3]]$ages
[1]  9 13 18

Okay, so now what if we wanted to get the mean age of each family’s children? How could we do this using lapply? It’s easy.

lapply(allfams, function(x) mean(x$ages))
$f1
[1] 6

$f2
[1] 8

$f3
[1] 13.33333

$f4
[1] 7

It might also be nice to roll these family averages up into a single overall number. How might we do that? It takes a little bit more work, but not much. First, recognize that what we are getting back are all numeric values, so we don’t really need a list to store that information. The only reason we use a list in the first place is to “host” data of differing types, but here our result is all numeric, so let’s convert it to a vector.

unlist(lapply(allfams, function(x) mean(x$ages)))
      f1       f2       f3       f4 
 6.00000  8.00000 13.33333  7.00000 

# So check out the following. It gives us exactly what we want.

mean(unlist(lapply(allfams, function(x) mean(x$ages))))
[1] 8.583333

# An "expanded" version of this might have looked like:

mymeanages <- function(x) {
  return(mean(x$ages))
}

hold <- lapply(allfams, mymeanages)  # Get back a list with mean ages for each family

hold2 <- unlist(hold)    # Turn the result into a vector since everything is a numeric value

mean(hold2)   # Get the mean of all ages

Let’s ask another question that we could use lapply and a companion function to answer: which families have 2 or 3 children? Since we only have 4 families in allfams we could just look at the lists and answer this via visual inspection, but that would get really hard if our allfams list had 10, 100, or 1,000 families. So here is one way we could do it.

hold <- lapply(allfams,function(x) {x$numofchild >= 2 & x$numofchild <= 3}) 
which(hold == T)
f1 f3 
 1  3 

# Or we could do it all in one go

which(lapply(allfams,function(x) {x$numofchild >= 2 & x$numofchild <= 3}) == T)
f1 f3 
 1  3 

Using the split command

Okay, how might we use this knowledge in another example. Lists also show up in conjunction with the “split” command which, given a data frame and a factor, will split the data frame based on that factor into a list. This is best understood with an example. We’ll use the built in data frame called mtcars.

unique(mtcars$cyl)  # Cylinder takes on three distinct values
[1] 6 4 8

# We could split the data frame based on cylinder group.

mydfs <- split(mtcars,mtcars$cyl)

str(mydfs,max.level=1)
List of 3
 $ 4:'data.frame':	11 obs. of  11 variables:
 $ 6:'data.frame':	7 obs. of  11 variables:
 $ 8:'data.frame':	14 obs. of  11 variables:

So what we get back is a list called mydfs whose elements are data frames, each containing the observations for cars with a certain number of cylinders – 4, 6, or 8. This is a quick and efficient way to split up a data frame. It is worth pointing out that lots of people don’t take advantage of the split function, usually because they aren’t aware of it. Without split you have to do the splitting by hand using an approach like the following, which works but doesn’t scale very well, especially if your factor has many “levels”.


fourcyl  <- mtcars[mtcars$cyl==4,]  
sixcyl   <- mtcars[mtcars$cyl==6,]
eightcyl <- mtcars[mtcars$cyl==8,]

But let’s get back to the split function and our list of data frames. We have a list called mydfs whose elements are data frames with observations corresponding to cars of 4, 6, and 8 cylinders. We can use our knowledge of lists to look around some:

names(mydfs)
[1] "4" "6" "8"

mydfs$"4"
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

How could we rapidly determine the mean MPG for each cylinder group using lapply?

mydfs <- split(mtcars,mtcars$cyl)     
lapply(mydfs,function(x) mean(x$mpg))
$`4`
[1] 26.66364

$`6`
[1] 19.74286

$`8`
[1] 15.1

# Okay cool but we could bundle this up in one statement

lapply(split(mtcars,mtcars$cyl),function(x) mean(x$mpg))
$`4`
[1] 26.66364

$`6`
[1] 19.74286

$`8`
[1] 15.1

# Or more economically (though potentially confusing to a newcomer) 

unlist(lapply(split(mtcars,mtcars$cyl),function(x) mean(x$mpg)))

      4        6        8 
26.66364 19.74286 15.10000 

# Which is identical to the tapply function. 

tapply(mtcars$mpg,mtcars$cyl,mean)
       4        6        8 
26.66364 19.74286 15.10000 

I slipped that last one in on you to make a point that there are always multiple ways to solve problems using R. Some say that this flexibility is a great strength of R whereas others say it is a great source of confusion since newcomers don’t know which approach is best. When I was new to R I simply used whatever worked until I needed a faster or more flexible approach. My advice to you is don’t worry about which way is “right” because this will slow you down. Find an approach that solves your problems and change that approach when it becomes necessary. Okay that will wrap it up for the lapply intro. As always there are many other examples I could present but hopefully this blog will help in your mastery of lists and looping over them.

apply

The goal of this blog entry is to introduce basic and essential information about the apply function. Even established R users get confused when considering this family of functions, especially when observing how many of them there are: apply, tapply, lapply, sapply, rapply, eapply, mapply. When I was new to R I was rarely satisfied with the all-too-basic explanations of these commands, so I thought I would create a series on some of these functions to address the more common questions that newcomers might have. There is an add-on package called plyr that attempts to present a unified philosophy and approach to the process of “applying” functions to various R data structures though, for now, I’ll be sticking to the native R commands since they show up a lot in example code you are likely to encounter.

I’ll start with the apply command, which expects a matrix as input. Depending on what function you specify when using the apply command, you will get back either a vector or a matrix. Let’s get started with a motivating example, which I hope will convince you of how useful the apply approach can be. Here we create a matrix of 16 elements drawn from a normal distribution with mean 10. We use the set.seed command to enable reproducibility. Imagine that each column represents numerical measurements and the rows represent individual samples. Scientific data, such as microarray and metabolomic information, often follow this pattern. For those of you concerned with authenticity, please don’t worry – I’ll introduce some “real” data later on in the posting.

set.seed(1)  # Makes the call to rnorm generate the same numbers every time
( mymat <- matrix(round(rnorm(16,10),2),4,4) )

      [,1]  [,2]  [,3]  [,4]
[1,]  9.37 10.33 10.58  9.38
[2,] 10.18  9.18  9.69  7.79
[3,]  9.16 10.49 11.51 11.12
[4,] 11.60 10.74 10.39  9.96

Doing it the hard way

Let’s say that we want to take the mean of each column. Another way to say this is that we want to apply the mean function to each column in the matrix. This is the way to start thinking about things like this in R. But first, how might we do this if we don’t know anything about the apply command ?

mean(mymat[,1])
[1] 10.0775

mean(mymat[,2])
[1] 10.185

mean(mymat[,3])
[1] 10.5425

mean(mymat[,4])
[1] 9.5625

# Or we could stash the means into a vector if we need to use that information elsewhere

( mymatcolmean <- c(mean(mymat[,1]),mean(mymat[,2]),mean(mymat[,3]),mean(mymat[,4])) )

[1] 10.0775 10.1850 10.5425  9.5625

See what I did there ? I took the mean of each column by invoking the mean function four times and substituting columns 1-4 on each invocation. This is a bit crude but entirely effective and we do get the correct answer. In the second example I created a vector called mymatcolmean to hold the information. The two approaches are equivalent though neither scales very well. Imagine if there were 10, 100, or 1,000 columns. Sure you could do cut and paste but it is still a tedious process. Surely there must be a better way. Well, if you have experience with programming languages like C, FORTRAN, Java, or Perl then you probably know how to use a for-loop structure to solve this problem. R supports for-loops too but it also offers some ways to avoid using them.

retvec <- vector()
for (ii in 1:ncol(mymat)) {
  retvec[ii] = mean(mymat[,ii])
}
retvec
[1] 10.0775 10.1850 10.5425  9.5625

# We could even put this into a function for later use in case we need it.

myloop <- function(somemat) {
 retvec <- vector()
 length(retvec) <- ncol(somemat)
 for (ii in 1:ncol(somemat)) {
   retvec[ii] <- mean(somemat[,ii])
 }
 return(retvec)
}

myloop(mymat)
[1] 10.0775 10.1850 10.5425  9.5625

# This will now work for any matrix but of course it is specific to the columns of the matrix.

set.seed(1)

newmat <- matrix(rnorm(100),10,10)

myloop(newmat)
 [1]  0.1322028  0.2488450 -0.1336732  0.1207302  0.1341367  0.1434569  0.4512100 -0.2477361
 [9]  0.1273603  0.1123413

So this would work but what if we wanted to take the mean of the rows instead of the columns ? We could do that but we then have to rewrite our function. And, while we are at it, in order to make our function more general what about providing an argument that indicates whether we want to operate on the rows or the columns ? I’ll use “1” to indicate rows and “2” to indicate columns.

myloop <- function(somemat,rc=2) {
  retvec <- vector()
  length(retvec) <- ncol(somemat)
  
  # Check to see if we are taking the mean of the columns or rows
  # 1 indicates rows, 2 indicates columns
  
  if (rc == 2) {
    for (ii in 1:ncol(somemat)) {
      retvec[ii] <- mean(somemat[,ii])
    }
  } else {
    for (ii in 1:nrow(somemat)) {
      retvec[ii] <- mean(somemat[ii,])
    }
  }
  return(retvec)
}

# Okay let's make sure it works. 

myloop(mymat,2)   # This is correct
[1] 10.0775 10.1850 10.5425  9.5625

myloop(mymat,1)   # This is correct
[1]  9.9150  9.2100 10.5700 10.6725

mean(mymat[1,])  # A spot check
[1] 9.915

Doing it the easy way

Well okay that’s good but what if we then wanted our myloop function to accommodate a function other than mean ? Like quantile, sum, fivenum, or even a function of our own design ? We could add another argument to our function to let the user specify the name of a function to be applied to the rows or columns. But before we do more work please consider that R already has something that will do the job for us – the apply command.

apply(mymat, 2, mean)
[1] 10.0775 10.1850 10.5425  9.5625

apply(mymat, 1, mean)
[1]  9.9150  9.2100 10.5700 10.6725

See how much easier that is than writing our own looping function ? It has been my observation that those well versed in traditional programming languages have a bigger problem getting used to the apply function than newcomers simply because experienced programmers are more accustomed to writing their own summary functions. They just dive in and start coding. But R short circuits this approach by providing the apply family of commands. Note also that we can substitute in any function we want to.


apply(mymat,2,class)  # What class do the columns belong to ?
[1] "numeric" "numeric" "numeric" "numeric"

apply(mymat,2,sum)    # Get the sum of all the columns
[1] 40.31 40.74 42.17 38.25

apply(mymat,1,range)  # Get the range of all rows
      [,1]  [,2]  [,3]  [,4]
[1,]  9.37  7.79  9.16  9.96
[2,] 10.58 10.18 11.51 11.60

apply(mymat,2,fivenum) # Get the fivenum summary for each column
       [,1]   [,2]   [,3]   [,4]
[1,]  9.160  9.180  9.690  7.790
[2,]  9.265  9.755 10.040  8.585
[3,]  9.775 10.410 10.485  9.670
[4,] 10.890 10.615 11.045 10.540
[5,] 11.600 10.740 11.510 11.120

Some activities on matrices are so common that R actually has dedicated functions for them that are very efficient and fast:

rowMeans(mymat)     # Equivalent to apply(mymat,1,mean)
[1]  9.9150  9.2100 10.5700 10.6725

colMeans(mymat)     # Equivalent to apply(mymat,2,mean)
[1] 10.0775 10.1850 10.5425  9.5625

rowSums(mymat)      # Equivalent to apply(mymat,1,sum)
[1] 39.66 36.84 42.28 42.69

colSums(mymat)      # Equivalent to apply(mymat,2,sum)
[1] 40.31 40.74 42.17 38.25

Passing additional arguments

Many explanations I’ve seen for the apply family of functions omit any discussion on how to provide arguments to the function you pass to apply. I’m not sure why as this is an extremely important consideration. To show you what I mean let’s go back to the apply example that uses the mean function.

apply(mymat, 2, mean)
[1] 10.0775 10.1850 10.5425  9.5625

This is easy to understand though what happens if we wish to pass arguments to the mean function such as “trim” or “na.rm” ? For example maybe we wish to first trim off some portion of the data before we apply the mean function. If we did this using the mean function directly it would look like this:

mean(mymat[,1],trim=0.5)
[1] 9.775

# You might then be tempted to do something like this when using the apply command

apply(mymat, 2, mean(trim=0.5))
Error in mean.default(trim = 0.5) : 
  argument "x" is missing, with no default

# Or maybe this ?

apply(mymat, 2, mean(x,trim=0.5))
Error in mean(x, trim = 0.5) : object 'x' not found

To get some idea on how to do this correctly look at the help information for apply. It’s not exactly jumping out at you but the answer is there. Do you see it ?

     apply(X, MARGIN, FUN, ...)
     
Arguments:

       X: an array, including a matrix.

  MARGIN: a vector giving the subscripts which the function will be
          applied over.  E.g., for a matrix ‘1’ indicates rows, ‘2’
          indicates columns, ‘c(1, 2)’ indicates rows and columns.
          Where ‘X’ has named dimnames, it can be a character vector
          selecting dimension names.

     FUN: the function to be applied: see ‘Details’.  In the case of
          functions like ‘+’, ‘%*%’, etc., the function name must be
          backquoted or quoted.

     ...: optional arguments to ‘FUN’.

Ah, so the ellipsis implies that we can simply append arguments after the reference to the function. Of course, the arguments you pass must make sense to the function – you cannot pass an argument that isn’t valid for it!

apply(mymat,2,mean,trim=0.5)
[1]  9.775 10.410 10.485  9.670

# We can pass additional arguments

apply(mymat,2,mean,trim=0.5, na.rm=T)
[1]  9.775 10.410 10.485  9.670

Using our own functions with apply

We can also use a function that we have written and pass it to the apply command. As an example let’s say that we want to express each element in a column as a proportion of the sum of the column it occupies. Here is how we would do this without using apply. As in the opening example this approach doesn’t scale very well although we do get the correct answers.

mymat[,1]/sum(mymat[,1])
[1] 0.2324485 0.2525428 0.2272389 0.2877698

mymat[,2]/sum(mymat[,2])
[1] 0.2535592 0.2253314 0.2574865 0.2636230

mymat[,3]/sum(mymat[,3])
[1] 0.2508893 0.2297842 0.2729429 0.2463837

mymat[,4]/sum(mymat[,4])
[1] 0.2452288 0.2036601 0.2907190 0.2603922

But using our new found knowledge of the apply command we can make this easier and more general. First, let’s write a function that given a vector, (which is what each column or row of a matrix is), will return a vector where each element is a proportion of the sum of the input vector.

myfunc <- function(x) {
  return(x/sum(x))
}

# Check it out to make sure it works

myfunc(mymat[,1])
[1] 0.2324485 0.2525428 0.2272389 0.2877698

all.equal(myfunc(mymat[,1]), mymat[,1]/sum(mymat[,1]))
[1] TRUE

So now we can pass this to the apply command directly


apply(mymat, 2, myfunc)
          [,1]      [,2]      [,3]      [,4]
[1,] 0.2324485 0.2535592 0.2508893 0.2452288
[2,] 0.2525428 0.2253314 0.2297842 0.2036601
[3,] 0.2272389 0.2574865 0.2729429 0.2907190
[4,] 0.2877698 0.2636230 0.2463837 0.2603922

# If this is correct then the sum of each column should be 1.

colSums(apply(mymat,2,myfunc))
[1] 1 1 1 1

# which is equivalent to the following nested apply call

apply( apply(mymat,2,myfunc), 2, sum)
[1] 1 1 1 1

The second example above, the one that has a call to apply within a call to apply, is known as a “composite” (or nested) call, which can be confusing to newcomers. However, in the world of R, which embraces functional programming concepts, the use of composite function calls is quite common and encouraged. But there is no requirement to do things this way. Also, I should point out that R has a dedicated function for computing proportions in a matrix – check out the prop.table command, sketched briefly below.
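
Here is a minimal sketch of prop.table doing the same column-proportion computation (margin = 2 means the proportions are taken within columns):

prop.table(mymat, margin = 2)   # same result as apply(mymat, 2, myfunc)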

Working anonymously

In reality we don’t have to predefine the function. We simply pass the function to apply in an “anonymous” fashion, which means that we don’t even give it a name. It lives only for the duration of the call to apply.

apply(mymat, 2, function(x) x/sum(x))
          [,1]      [,2]      [,3]      [,4]
[1,] 0.2324485 0.2535592 0.2508893 0.2452288
[2,] 0.2525428 0.2253314 0.2297842 0.2036601
[3,] 0.2272389 0.2574865 0.2729429 0.2907190
[4,] 0.2877698 0.2636230 0.2463837 0.2603922

This can be confusing to newcomers and, if it is, then you can simply predefine the function using a name and then refer to it by name when using the apply command. Sometimes you see calls to apply that look like the following. This is done in an attempt to improve readability though that is subjective. In terms of results or efficiency there are no differences.

apply(mymat, 2, function(x) {
                      return(x/sum(x))
                })

Performance vs the for loop

If you’ve read any of my other postings you will know that I have an interest in performance. Is there an efficiency gain when using the apply approach vs. the for-loop approach? To test this let’s read in some “real” microarray data, which comes from Su et al, 2002.

url <- "http://steviep42.bitbucket.org/data/su_et_al_2002.txt"
mydata <- read.csv(url,header=T)

# B = brain, BF = fetal brain, L = liver, LF = fetal liver

str(mydata)
'data.frame':	12626 obs. of  8 variables:
 $ B1 : num  9.37 11.57 7.79 6.56 7.26 ...
 $ B2 : num  9.5 11.51 6.69 7.45 7.39 ...
 $ FB1: num  7.89 10.56 7.91 5.93 8.75 ...
 $ FB2: num  7.96 10.34 7.57 7.53 8.19 ...
 $ FL1: num  7.7 10.26 8.25 7.84 9.45 ...
 $ FL2: num  8.1 10.13 8.53 8.16 8.19 ...
 $ L1 : num  9.22 10.23 8.39 11.02 9.19 ...
 $ L2 : num  8.9 10.49 8.47 11.09 9.42 ...

# Let's check the mean and standard deviation for each column

apply(mydata,2,function(x) c(mu=mean(x),sd=sd(x)))
         B1       B2      FB1      FB2      FL1      FL2       L1       L2
mu 7.615090 7.431602 7.879343 7.770545 7.730293 7.601395 7.415937 7.518259
sd 2.257218 2.452364 2.150741 2.254736 2.207076 2.355612 2.405326 2.365394

Okay well let’s do a t-test across all rows for the brain tissue types. This becomes really easy to do with the apply command.


mypvals <- apply(mydata, 1, function(x) { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value } )

round(mypvals[1:5],4)
 100_g_at   1000_at   1001_at 1002_f_at 1003_s_at 
   0.0090    0.0494    0.5211    0.8001    0.1349 
 

But is this faster than writing our own loop to do the t-test ? We’ll first need to write our own code to do the t-test. It might look something like this:


myttest <- function(somemat) {
  retvec <- vector()
  length(retvec) <- nrow(somemat)
  for (ii in 1:nrow(somemat)) {
    retvec[ii] <- t.test(somemat[ii,c("B1","B2")], somemat[ii,c("FB1","FB2")])$p.value
  }
  names(retvec) <- row.names(somemat)
  return(retvec)
}

Next we’ll make sure that this function returns the same results as the apply version after which we will time each approach.

v1 <- myttest(mydata)

v2 <- apply(mydata, 1, function(x) { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value } )

all.equal(v1,v2)
[1] TRUE

# Let's time the two different approaches

library(rbenchmark)   # provides the benchmark() function used below

benchmark(forloop = myttest(mydata), apply=apply(mydata, 1, function(x) 
                         { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value }),
                           replications=5,columns=c("test","elapsed"))
     test elapsed
2   apply  19.616
1 forloop  61.800

In this case it appears that the apply approach is indeed quicker than writing our own function, though in my experience with more general cases the difference in execution time isn’t as drastic. In the end I suggest you use whichever approach makes the most sense to you to get the job done. However, I think that as you become more confident with R you will be attracted to the apply command since it basically replaces the for-loop. So whenever you find yourself tempted to write a for-loop, even though you are used to it in other languages, remind yourself that the “R-like” way to do it is to use apply.

Rolling Your Rs

In this article I discuss a general approach for Geocoding a location from within R, processing XML reports, and using R packages to create interactive maps. There are various ways to accomplish this, though using Google’s GeoCoding service is a good place to start. We’ll also talk a bit about the XML package that is a very useful tool for parsing reports returned from Google. XML is a powerful markup language that has wide support in many Internet databases so it is helpful. Lastly, we’ll use our knowledge to create a map of the tour dates on the Rolling Stones 1975 Tour of the Americas. Also, when I use the word “GeoCoding” this basically implies the process of taking a geographic location and turning it into a latitude / longitude pair.

What does Google Offer ?

Check out the main Geocoding page, which presents implementation details of the API as…


If you are a newcomer to R then you are probably quite busy learning the semantics of the language as you experiment with the apply family of commands or come up to speed on the grouping and conditioning capabilities offered by lattice graphics. And, along the way, you might have heard that R has the ability to “link in” code written in other languages such as C, C++, Fortran, and Java. This is true but until you are presented with a compelling use case you will most likely ignore such capability since you’ve already got plenty to do. But that’s where this blog can help. I’ll talk about how to integrate Fortran code with R and, in a later post, discuss doing the same with C and C++.

DISCLAIMER: To accomplish this work requires the presence of a working Fortran compiler such as the GNU suite of compilers (g77, gfortran, gcc, etc). Relative to operating systems I use Linux and OS X. I rarely use Windows, though I do know that the GNU suite is available for that OS, so you should be able to link Fortran code into R there – however, I’ve never tried it. Additional disclaimer: the timings I present in this post are based on an Apple MacBook at OS level 10.9.2 with a 2.5 GHz i5 processor and 4 GB of RAM. Your timings may vary.

Why Bother ?

Good question. Most people wind up wanting to access Fortran from R for a few reasons: they have some really fast and efficient Fortran code that they want to exploit within R; or they have written some code in R that winds up being incredibly slow, so they write a much faster version in Fortran and then want to call it from R; perhaps they need to access subroutines from external Fortran libraries; or, lastly, it might simply be because your boss or faculty advisor is making you do it! Whatever your reason(s), we’ll break the process of linking in Fortran code down into three general steps: 1) prepare the subroutine for compilation and generate a “shared object”, 2) load the shared object into an active R session, and 3) provide R variables to the shared object via the “.Fortran” call, which is part of the “Foreign Function Interface” within R.

Foreign                  package:base                  R Documentation

Foreign Function Interface

Description:

     Functions to make calls to compiled code that has been loaded into
     R.

Usage:

            .C(.NAME, ..., NAOK = FALSE, DUP = TRUE, PACKAGE, ENCODING)
      .Fortran(.NAME, ..., NAOK = FALSE, DUP = TRUE, PACKAGE, ENCODING)

Prepare the Fortran Subroutine

Next, I present a very simple Fortran 77 subroutine that computes the factorial of a number “n” and stashes the result into a variable called “answer”. And speaking of subroutines it is important to know that to use the .Fortran interface one must make reference to Fortran subroutines only – not Fortran functions or full on programs. So if you have some code that you want to bring in then you will need to embed that code within a subroutine definition. We also need to pay attention to the variable types as declared within the subroutine so we can match those types accordingly when calling the subroutine from R. This winds up being one of the more critical steps in the process.

        subroutine facto(n,answer)
c
c simple subroutine to compute factorial
c
        integer n, answer, i

        answer = 1
        do 100 i = 2,n
           answer = answer * i
  100   continue 
        
        end

You should make sure, of course, that this code compiles correctly. Our goal is to generate a shared object file (a “.so” file) that can be linked into R. Note also that our routine doesn’t print or write anything to output – it simply uses its input variables and ultimately sets an output variable. Make your code lean.

$ ls
facto.f

$ gfortran -c facto.f

$ ls
facto.f	facto.o

$ gfortran -shared -o facto.so facto.o
$ ls
facto.f		facto.o		facto.so

So it looks like we are good to go here. However, instead of doing this compilation ourselves we could have let R help us. In fact that is the preferred way, since it ensures the compilation is done “under the supervision” of the R tools. Let’s remove the .o and .so files and start over.

$ rm *.*o

$ ls
facto.f

$ R CMD SHLIB facto.f
<you will see various compilation output messages>

$ ls
facto.f		facto.o		facto.so

Load it

So now what ? Let's fire up R. We'll use the dyn.load command, which is part of the foreign interface capability. Its purpose is to load and unload shared objects, which are also known as DLLs (dynamically loadable libraries).

system("ls")
facto.f		facto.o		facto.so
dyn.load("facto.so")

Okay not much happened there. What’s going on ? Well all we did was simply load the shared object. We have yet to use it. To do that we rely upon the “.Fortran” function. Keep in mind that the subroutine “facto” has two arguments both of which are integers. We’ll supply a value of 5 for “n” and we’ll pass a single integer as a value for the “answer” variable though that will be overwritten once the subroutine computes the “answer”.

system("ls")
facto.f		facto.o		facto.so

dyn.load("facto.so")

.Fortran("facto",n=as.integer(5),answer=as.integer(1))

$n
[1] 5

$answer
[1] 120

# Or more directly

.Fortran("facto",n=as.integer(5),answer=as.integer(1))$answer
[1] 120

If you are wondering what types are supported or shared between R and Fortran here is a list from the .Fortran man page. It is worth some time to peruse the man page as it provides some caveats and deeper explanations on how to interact with Fortran.

       R          Fortran          
       integer    integer          
       numeric    double precision 
       - or -     real             
       complex    double complex   
       logical    integer          
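
To underline the importance of matching types, here is a small illustration using the facto routine from above (assuming facto.so is still loaded in the session):

# facto declares "integer n, answer", so both arguments must be passed as
# R integers; integer(1) is simply a length-one integer vector that the
# subroutine overwrites with the result.
n <- 5                                  # an ordinary R numeric (double)
.Fortran("facto", n = as.integer(n), answer = integer(1))$answer
[1] 120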

Wrap it Up!

Okay that was cool but this is sort of an awkward way to call the Fortran subroutine. We should probably write our own wrapper function in R to handle loading the shared object and making the .Fortran call. Take a look at this approach, which winds up being very "R like".

myfacto <- function(num) {
  dyn.load("facto.so")
  retvals <- .Fortran("facto",n = as.integer(num), answer = as.integer(1))
  return(retvals$answer)
}
         
myfacto(5)
[1] 120

sapply(1:10,myfacto)
 [1]       1       2       6      24     120     720    5040   40320  362880 3628800

So with the wrapper approach we can use the Fortran subroutine just as we would any other R function since the call to .Fortran is “buried” in the wrapper function. We could make this a bit more robust by putting in some logic to see if the shared object is already loaded.

myfacto <- function(num) {
  if (!is.loaded('facto')) {
     dyn.load("facto.so")
  }
  retvals <- .Fortran("facto",n = as.integer(num), answer = as.integer(1))
  return(retvals$answer)
}
         
myfacto(5)
[1] 120

It’s all Convoluted

Well that was okay but let's look at a more involved example. Let's consider the idea of doing discrete convolution between two vectors. (Note: this is discussed in the "Writing R Extensions" manual). Why did I pick such an example ? Well, first, it's commonly referenced in R literature and, second, it is a good motivating case for using an external language to speed up the processing. The algorithm itself isn't hard to code up in either R or Fortran. However, the performance in R isn't so good once the vectors get larger. Check it out:

conr <- function(x, y) {
    lx <- length(x)
    ly <- length(y)
    cxy <- numeric(lx + ly - 1)
    for(i in 1:lx) {
        xi <- x[i]
        for(j in 1:ly) {
            ij <- i+j-1
            cxy[ij] <- cxy[ij] + xi * y[j]
        }
    }
    return(cxy)
}

# Let's check the timings for vectors of different sizes

v1 = rnorm(100); v2 = rnorm(100)

system.time(conr(v1,v2))
   user  system elapsed 
  0.034   0.000   0.035 

v1 = rnorm(2000); v2 = rnorm(2000)

system.time(conr(v1,v2))
   user  system elapsed 
 13.195   0.020  13.215 

v1 = rnorm(4000); v2 = rnorm(4000)

system.time(conr(v1,v2))
   user  system elapsed 
 57.757   0.130  58.008 

The timings grow significantly longer as the sizes of the vectors grow. So passing vectors of size 10,000 could take a very long time. While this blog isn't specifically about performance, let's do a little more coding to get an idea of just how poorly the convolution written in R performs. We'll use this later for comparison with the Fortran subroutine. Let's write a wrapper around the conr function that calls it on vectors of size x and returns the elapsed time. If you don't understand exactly what is going on here don't worry – just think of it as more exposure to the apply family of commands.

timewrapconr <- function(x) {
    times <- system.time(conr(rnorm(x),rnorm(x)))[3]
    return(c(size=x,times))
}

# time the convolution for vectors of size 100,1000,2000, and 4000

(convtimes <- sapply(c(100,1000,2000,4000),timewrapconr))
           [,1]     [,2]     [,3]    [,4]
size    100.000 1000.000 2000.000 4000.00
elapsed   0.033    3.552   13.811   62.15

# Let's plot this
library(lattice)

xyplot(convtimes[2,]~convtimes[1,],type=c("p","l","g"),
       xlab = "vector size", ylab = "elapsed time in seconds",
       main = "Execution times for Convolution in R", pch = 19)
Figure 1. Plot of execution times

How do we address this problem ? Well there are opportunities for improvement within the R code itself by using vectorization techniques. A good start would be to avoid the second for loop, and there are ways to do that; a quick sketch of that idea follows. In fact there is a way to avoid both loops altogether, and maybe we'll explore that approach in another post.
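
Here is a minimal single-loop version (my own sketch, not part of the timings in this post) in which the inner j loop is replaced by a vectorized update:

conr_vec <- function(x, y) {
    lx <- length(x)
    ly <- length(y)
    cxy <- numeric(lx + ly - 1)
    for(i in 1:lx) {
        idx <- i:(i + ly - 1)            # the slice of cxy touched by x[i]
        cxy[idx] <- cxy[idx] + x[i] * y
    }
    return(cxy)
}

# Sanity check against the double-loop version
v1 = rnorm(50); v2 = rnorm(50)
all.equal(conr(v1,v2), conr_vec(v1,v2))   # should be TRUE

But for now let's see if writing the code in Fortran and then linking it in could help improve things. So here is the convolution algorithm rewritten in Fortran, which we will save into a file called convolvef77.f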

      subroutine convolvef77 (x, lx, y, ly, xy)
c
c A basic implementation of convolution algorithm for two vectors
c I use zero-based arrays here.
c
      integer lx, ly, i, j
      double precision x(0:lx-1), y(0:ly-1), xy(0:lx+ly-2)
      do 20 i = 0, (lx-1) 
         do 15 j = 0, (ly-1) 
            xy(i+j) = xy(i+j) + x(i) * y(j) 
  15     continue  
  20  continue 
      end

# Save the above to a file called convolvef77.f
# Now compile it to a shared library

$ R CMD SHLIB convolvef77.f 

Next we’ll write a function in R to call the convolvef77 function. Start up R.

convolvef77 <- function(x,y) {
  dyn.load('convolvef77.so')
  lx = length(x)
  ly = length(y)
  retdata <- .Fortran("convolvef77",
                       x = as.double(x),
                       lx = as.integer(lx), 
                       y = as.double(y), 
                       ly = as.integer(ly), 
                       xy = double(lx+ly-1))$xy
  return(retdata)
}

# Now let's throw some large vectors at it. Look at how much better the times are

v1 = rnorm(4000); v2 = rnorm(4000)

system.time(convolvef77(v1,v2))
   user  system elapsed 
  0.012   0.000   0.012 

v1 = rnorm(8000); v2 = rnorm(8000)

system.time(convolvef77(v1,v2))
   user  system elapsed 
  0.047   0.001   0.083 

The speed looks really good. Now let's repeat the timing exercise we applied to the convolution written in R.

timewrapconf77 <- function(x) {
    times <- system.time(convolvef77(rnorm(x),rnorm(x)))[3]
    return(c(size=x,times))
}

(convtimes <- sapply(c(100,1000,2000,4000),timewrapconf77))
           [,1]  [,2]  [,3]    [,4]
size    100.000 1e+03 2e+03 4.0e+03
elapsed   0.045 2e-03 4e-03 1.3e-02

# Wow. This is FAST !!!!! Let's throw some bigger vectors at it.

(convtimes <- sapply(c(100,1000,2000,4000,10000,20000,50000),timewrapconf77))
         [,1]  [,2]  [,3]    [,4]    [,5]     [,6]      [,7]
size    1e+02 1e+03 2e+03 4.0e+03 1.0e+04 2.00e+04 50000.000
elapsed 1e-03 2e-03 4e-03 1.2e-02 7.2e-02 3.22e-01     2.074

# Plot the times

xyplot(convtimes[2,]~convtimes[1,],type=c("p","l","g"),
       xlab = "vector size", ylab = "elapsed time in seconds",
       main = "Execution times for Convolution in Fortran", pch = 19)
Execution times using Fortran

So the Fortran subroutine took about 2 seconds to convolve vectors of size 50,000, whereas the native R code took about 3.5 seconds for vectors of size 1,000 (these timings might vary depending on your architecture and OS). To get a better visual comparison let's repeat the timings for both approaches, R and Fortran, and plot the results on the same graph so you can get some sense of proportion between the execution times. This isn't hard to do. We'll just rerun our timing functions:

(convtimesr <- sapply(c(100,1000,2000,4000,10000,20000),timewrapconr))

           [,1]     [,2]     [,3]     [,4]      [,5]      [,6]
size    100.000 1000.000 2000.000 4000.000 10000.000 20000.000
elapsed   0.034    3.374   14.118   64.894   355.409  1504.517
  
(convtimesf77 <- sapply(c(100,1000,2000,4000,10000,20000),timewrapconf77))
           [,1]    [,2]  [,3]    [,4]    [,5]     [,6]
size    100.000 1.0e+03 2e+03 4.0e+03 1.0e+04 2.00e+04
elapsed   0.071 2.3e-02 4e-03 1.2e-02 6.9e-02 2.99e-01

# Now plot them on the same graph

plot(convtimesr[1,],convtimesr[2,],xlab="Vector size",
     ylab="Elapsed time in seconds", 
     main="Convolution in R vs Fortran",
     type="l",col="red",lwd=2.5)

points(convtimesf77[1,],convtimesf77[2,],type="l",col="blue",lwd=2.5)

legend("topleft",c("R","Fortran"),col=c("red","blue"),
        lty=c(1,1),lwd=c(2.5,2.5))
grid()
Execution times for R and Fortran

Okay, I think you get the point here. Using the Fortran code definitely helped speed things up. However, speed might not be the only reason you choose to link in Fortran code. For example, I know of people who have written the bulk of their thesis analysis in Fortran and now want to leverage that effort within R. Sure, they could recode their work in R, but that would probably cost them performance. Any time you have a significant body of work in one language you would like to avoid having to recode it in another. Lastly, there are other ways to bring in Fortran that I haven't discussed here. The "inline" package, for example, allows one to compile Fortran code inline within a given R program, which might be more appealing to some; a rough sketch follows.
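
This sketch is modeled on the cube example in the inline package's documentation; it assumes the inline package is installed, and you should check ?cfunction for the authoritative form since details may differ from what I show here.

library(inline)

# The body is just the Fortran statements; cfunction wraps them in a
# subroutine whose arguments come from the signature. Scalars arrive as
# length-one arrays, hence n(1).
code <- "
      integer i
      do 1 i = 1, n(1)
    1 x(i) = x(i)**3
"
cubefn <- cfunction(signature(n = "integer", x = "numeric"), code,
                    convention = ".Fortran")

cubefn(n = as.integer(3), x = as.double(1:3))$x   # should give 1 8 27

Hope this has been helpful.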

Vectors are at the heart of R and represent a true convenience. Moreover, vectors are essential for good performance, especially when you are working with lots of data. We'll explore these concepts in this post. As a motivating example let's generate a sequence of data from -3 to 3. We'll also use each point as input into the Normal density function (dnorm). Our immediate goal is to draw the classic bell curve that everyone knows. So here we start out with the x values:


x = seq(-3, 3, by=0.05) # The "by" argument lets us specify a "step" or "interval" value

x
  [1] -3.00 -2.95 -2.90 -2.85 -2.80 -2.75 -2.70 -2.65 -2.60 -2.55 -2.50 -2.45
 [13] -2.40 -2.35 -2.30 -2.25 -2.20 -2.15 -2.10 -2.05 -2.00 -1.95 -1.90 -1.85
 [25] -1.80 -1.75 -1.70 -1.65 -1.60 -1.55 -1.50 -1.45 -1.40 -1.35 -1.30 -1.25
 [37] -1.20 -1.15 -1.10 -1.05 -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65
 [49] -0.60 -0.55 -0.50 -0.45 -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05
 [61]  0.00  0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  0.45  0.50  0.55
 [73]  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00  1.05  1.10  1.15
 [85]  1.20  1.25  1.30  1.35  1.40  1.45  1.50  1.55  1.60  1.65  1.70  1.75
 [97]  1.80  1.85  1.90  1.95  2.00  2.05  2.10  2.15  2.20  2.25  2.30  2.35
[109]  2.40  2.45  2.50  2.55  2.60  2.65  2.70  2.75  2.80  2.85  2.90  2.95
[121]  3.00

# Next we generate the corresponding y values

y = dnorm(x)

# Now we plot the data 

par(mfrow=c(1,2))   # Set the plot region to be one row and two columns

plot(x, y, xlim=c(-3,3), ylim=c(0,0.4), main="Normal Curve", col="blue") # Plot 1

plot(x,y,xlim=c(-3,3),ylim=c(0,0.4),main="Normal Curve",col="blue",type="l") # Plot 2

Figure: the Normal curve plotted as points (left) and as a line (right)

Note that we could have generated the X values by using the rnorm function and then sorting the results. If we call the rnorm function without specifying a mean and a standard deviation we get back values from a Normal distribution of mean 0 and standard deviation 1. So we could have done the following. Note that this approach results in less “coverage” in the tails but we still get a comparable result.

xvals = rnorm(120)
par(mfrow=c(1,1))
plot(sort(xvals), y= dnorm(sort(xvals)), main="Normal Curve",col="blue")

Next let’s pull out all the x values that are less than -2 or greater than 2. We are in effect looking a the “tails” of the distribution here. Using the bracket notation with vectors is a powerful way to interrogate the data and extract only that which you need. Look how easy this is. In other languages you would have to write a for-loop that uses if statements.

tmp = x[(x < -2) | (x > 2)]

par(mfrow=c(1,2))

plot(tmp,dnorm(tmp),xlim=c(-3,3),ylim=c(0,0.4),
         main="Normal Curve - The Tails",col="blue") # Plot 1

As mentioned previously, newcomers to R often want to write a loop structure to pull out this data. You could certainly do this but it would be overlooking the power of vectors. Still, let’s see what would be involved to extract the information above using the for-loop approach.

hold = vector()
for (ii in 1:length(x)) {
    if ( (x[ii] < -2) | (x[ii] > 2) ) {
       hold[ii] = x[ii]
    }
}

Okay, well, this sort of works but it doesn't generalize very well (if at all), since we basically have to customize the loop for whatever comparison we want to do. Note also that it leaves NA entries at the positions that fail the test, which we would still have to strip out. It might be better to put this into a function, although it is far better (as we will soon see) to use the bracket notation available with vectors. Now let's also pull out the x values that complement the above set. That is, let's get all the x values that are greater than -2 and less than 2. We'll plot these alongside the previous plot.

tmp2 = x[ (x > -2) & (x < 2) ]

plot(tmp2,dnorm(tmp2),xlim=c(-3,3),ylim=c(0,0.4),
     main="Normal Curve - Without Tails",col="blue") # Plot 2

Figure: the tails of the Normal curve (left) and the curve without the tails (right)

So let’s reconsider the example where we used the for-loop. Aside from being more complex and less general it also performs much worse than the bracket notation – especially as the size of the vector grows. Let’s regenerate our x values from -3 to 3 but this time we will pick a very small step size which will result in a vector with many more values. So we are still generating values from -3 to 3 but many more than before:

x = seq(-3,3,by=0.0001)

length(x)     # We have 60,001 elements. 
[1] 60001

par(mfrow=c(1,1))
plot(x,dnorm(x))   # Convince yourself that it still plots the Normal curve.

Now we will extract the tail values from x just as before, except this time with the for-loop approach. To make it easier to time we will put the loop into a function:

my.func <- function(x) {
   hold = vector()
   for (ii in 1:length(x)) {
     if ( (x[ii] < -2) | (x[ii] > 2) ) {
        hold[ii] = x[ii]
     }
   }
   return(hold)
}

# Let's use this function to extract the tail values of x and time it.

system.time( my.func(x) )   # the system.time function does what its name suggests
   user  system elapsed 
  2.120   0.287   2.406 
 
 # So is this good / bad ? Let's compare it to the bracket notation
 
system.time( x[ (x < -2) | (x > 2) ])
   user  system elapsed 
  0.003   0.001   0.003 

Wow. So the bracket notation is much faster. For vectors of smaller length the difference in time isn’t so large. But as the size of x grows so does the time it takes to process it. To drive this point home let’s look at a larger x vector and see how it performs as compared to the bracket notation.

x = seq(-3,3,by=0.00005)

length(x)
[1] 120001

system.time( my.func(x) )
   user  system elapsed 
  6.363   2.640   8.998 

system.time( x[ (x < -2) | (x > 2) ])
   user  system elapsed 
  0.006   0.000   0.006 
  
# Wow - the bracket notation really does provide much better performance. 

Let’s do some benchmarking and plot the results so we can get a visual representation of the performance of the two approaches. Check out the rbenchhmark package. You will have to install it since it is an add-on package. Use RStudio or the Windows GUI to do this or you can use the install.packages command. This package contains the function called benchmark which allows you to time the execution of any R function including those that you have written.

It replicates the function a specified number of times. You can also compare the performance of more than one expression at a time. Let's start out simple. Here we will time how long it takes to square the x vector. As mentioned, this will be replicated some number of times, with the default being 100 replications. We will get back a variety of information.

library(rbenchmark)   # load the add-on package that provides benchmark

x = seq(-3,3,by=0.0001)

benchmark( square = x^2)
    test replications elapsed relative user.self sys.self user.child sys.child
1 square          100   0.018        1     0.015    0.002          0         0
 
# We can compare how long it takes to square it vs cubing it. We can also specify the
# information to be returned

benchmark( square = x^2, cube = x^3, replications=5, columns=c("test","elapsed"))
    test elapsed
2   cube   0.006
1 square   0.001

First we’ll generate four different versions of the x vector corresponding to different step sizes.

hold = lapply(c(0.01, 0.001, 0.0001, 0.00005), function(x) {
   return(seq(-3,3,by=x))
})

length(hold)  # Hold is a list with 4 elements each of which contains a vector
[1] 4

sapply(hold,length)
[1]    601   6001  60001 120001

Okay, second we’ll use this new hold vector as input into the benchmarking function. To make this efficient we’ll use the lapply command. We’ll also create a function to make the timing process more efficient.

timing.func <- function(x) {
  benchmark( replications=5, func = my.func(x), 
             bracket = x[ (x < -2) | (x > 2)], columns=c("test","elapsed") ) 
}

( timings = lapply(hold, timing.func)  )
[[1]]
     test elapsed
2 bracket   0.001
1    func   0.008

[[2]]
     test elapsed
2 bracket   0.002
1    func   0.168

[[3]]
     test elapsed
2 bracket   0.018
1    func  10.757

[[4]]
     test elapsed
2 bracket   0.036
1    func  60.141

# Let's pull out the timings for each approach (the function approach or bracket approach)

( mtimings = sapply(timings, function(x) x$elapsed) )

      [,1]  [,2]   [,3]   [,4]
[1,] 0.001 0.002  0.018  0.036
[2,] 0.008 0.168 10.757 60.141

# Let's plot the timings

plot(1:4,mtimings[2,],pch=18,col="red",type="l",xaxt="n",
         main="Timings of bracket notations vs for loop",
         xlab="vector size in elements",ylab="Timings in Seconds")

axis(1,1:4,labels=as.character(sapply(hold,length)))

points(1:4,mtimings[1,],type="l",col="blue")

legend(1,40,c("for-loop","bracket"),col=c("red","blue"),lty=c(1,1))

Figure: timings of the bracket notation vs the for-loop as the vector size grows

Well this shows that the bracket notation works well even as the x vector gets large. The for-loop, however, starts to perform very poorly and will soon be impractical for larger vectors.

Loops in C++, Java, C, and FORTRAN are pretty fast, which is why people coming to R from those languages tend to want to write loops to process data. Users of MATLAB, however, are already accustomed to thinking in terms of vectors and usually make the transition to R quite nicely. Also keep in mind that matrices, which are closely related to vectors, can be processed with for-loops too but, as with vectors, it is not recommended. There are plenty of functions that are "vectorized" and customized for use with vectors and matrices, which is why commands like apply help us process data in R. Here is a quick illustration.
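
This is a small sketch I'm adding for completeness, with made-up data, that sums the rows of a matrix three ways:

m <- matrix(rnorm(1000000), nrow=1000)   # a 1000 x 1000 matrix of random values

system.time( r1 <- rowSums(m) )          # vectorized, implemented in compiled code
system.time( r2 <- apply(m, 1, sum) )    # apply hides the loop for you

r3 <- numeric(nrow(m))                   # explicit for-loop version
system.time( for (i in 1:nrow(m)) r3[i] <- sum(m[i,]) )

all.equal(r1, r2)   # should be TRUE; r3 matches as well

On most systems rowSums will be the clear winner, which is the general pattern: reach for the vectorized, purpose-built function first, then apply, and only then a hand-written loop.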

Hello. Welcome to my debut post ! Check the About link to see what this Blog intends to accomplish. In this article I discuss a general approach for dealing with the problem of splitting a data frame based on a grouping variable and then doing some more operations per group. A secondary goal is to provide some exposure to the “apply” family of functions to demonstrate how they can help. For purposes of example R has a built-in data frame called “ChickWeight” that contains 578 rows and 4 columns from an experiment on the effect of diet on early growth of chicks. To get more details simply type:

data(ChickWeight)
?ChickWeight

In general when looking at data frames we should be looking for the continuous variables that we want to summarize in terms of some factor or grouping variable. Grouping variables can usually be easily identified because they typically take on a fixed number of values. (Note that R has commands that let us “cut” continuous variables into discrete intervals but that is for another blog post). So let’s use the sapply function to determine what the grouping variables might be:

sapply(ChickWeight, function(x) length(unique(x)))
weight   Time  Chick   Diet 
   212     12     50      4 

If you don’t know about sapply then don’t freak out. It takes the function we define in the second argument and “applies” it to the columns of data frame, (ChickWeight), whose name we provided as the first argument. The “x” in the function is a placeholder argument for each of the columns of the data frame. The unique function returns only the unique values that the column assumes. The length function then “counts” the number of unique values.

All of this can be confusing at first, I know. However, the important thing is that Diet and Time look like potential grouping variables since they take on only 4 and 12 unique values, respectively. If we just wanted to know the mean weight of the chickens for each Diet type then we could use the tapply command to get that answer.

tapply(ChickWeight$weight,ChickWeight$Diet,mean)
  1   2   3   4 
103 123 143 135 

Note that we could also write our own function in place of "mean" in case we wanted to do something more in-depth in terms of summary.
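
For instance, here is a quick sketch (mine, not part of the original analysis) that swaps in a custom summary function returning the minimum and maximum weight per Diet:

# tapply happily accepts an anonymous function; here we get back a list
# with one min/max pair per Diet group.
tapply(ChickWeight$weight, ChickWeight$Diet, function(w) c(min = min(w), max = max(w)))

But maybe we don't yet know what it is we want to do with the data, so having it in a grouped format might help us better understand it. Let's use the split command, which can give us access to the individual groups.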

my.splits = split(ChickWeight, ChickWeight$Diet)
length(my.splits)
[1] 4

names(my.splits)
[1] "1" "2" "3" "4"

This operation creates a list where each element of my.splits corresponds to an individual Diet value (1, 2, 3, or 4). We can now start thinking about how to further investigate each Diet type. To convince yourself that the split command actually worked, let's take a peek at the first element. All records in the first element relate to Diet #1.

head(my.splits[[1]])
  weight Time Chick Diet
1     42    0     1    1
2     51    2     1    1
3     59    4     1    1
4     64    6     1    1
5     76    8     1    1
6     93   10     1    1

Just for fun let's use the lapply command to subset each element of my.splits, keeping only the chicks from each group that weigh 40 grams or less. This approach doesn't take much typing and conveniently returns a list, which we store in a variable called my.results.lapply.

my.results.lapply = lapply(my.splits, subset, weight <= 40)

In the example above we pass additional arguments to the subset command by listing them after the function name in the lapply call. This is the most "R like" way of doing things. Alternatively, we could have defined an anonymous function to do this.

my.results.lapply = lapply(my.splits, function(x) subset(x, weight <= 40) )

Note that we can also define our function in advance. This isn’t substantially different from the example above but it might improve the readability of the code. This is a personal choice and either approach will yield the same result.

my.func <- function(x) {
    subset(x, weight <= 40)
}
my.results.lapply = lapply(my.splits, my.func)

In any case, check out what is happening here. lapply silently passes each element of my.splits, one at a time, to the function we supply. This can be confusing to newcomers but is actually a convenience since it accomplishes a for-loop structure "under the hood". To make it more obvious we could have also done the following, which is closer to a for-loop approach in that we work through the my.splits list using subscripts.

my.results.lapply = lapply(1:length(my.splits), function(x) subset(my.splits[[x]], weight <= 40))

If we are done with our sub-setting and interrogation then we can repackage the results list back into a data frame by using this construct:

my.df = do.call(rbind, my.results.lapply)
my.df
    weight Time Chick Diet
13      40    0     2    1
26      39    2     3    1
195     39    0    18    1
196     35    2    18    1
221     40    0    21    2
269     40    0    25    2
293     39    0    27    2
305     39    0    28    2
317     39    0    29    2
365     39    0    33    3
401     39    0    36    3
519     40    0    46    4
543     39    0    48    4
555     40    0    49    4

So now we have a single data frame with only the data that we want. If you are coming from another programming language then you might be tempted to write a for-loop to do this. You could, though you would have to do a little more work to keep up with the results: you have to create your own blank list to stash them in.

my.results.for = list() 
for (ii in 1:length(my.splits)) {
     my.results.for[[ii]] = subset(my.splits[[ii]], weight <= 40)
}

names(my.results.for) = names(my.splits)

all.equal(my.results.lapply, my.results.for) # Should be equal to my.results.lapply

So which approach do you use ? It depends. The for-loop approach is the more traditional angle. However, lapply takes much less typing, and can sometimes perform better, but you have to define your function to work with it and understand that lapply is "silently" passing each element to that function. Once people become accustomed to using lapply they usually stick with it.

Now not to blow your mind but we could knock out the problem in one go:

lapply(split(ChickWeight,ChickWeight$Diet), subset, weight <= 40)

# OR

(do.call(rbind, lapply(split(ChickWeight,ChickWeight$Diet), subset, weight <= 40)))

      weight Time Chick Diet
1.13      40    0     2    1
1.26      39    2     3    1
1.195     39    0    18    1
1.196     35    2    18    1
2.221     40    0    21    2
2.269     40    0    25    2
2.293     39    0    27    2
2.305     39    0    28    2
2.317     39    0    29    2
3.365     39    0    33    3
3.401     39    0    36    3
4.519     40    0    46    4
4.543     39    0    48    4
4.555     40    0    49    4

Lastly, it occurred to me after writing much of this post that, in this example, which is admittedly somewhat contrived, there is another way to do this. This realization is fairly common in R, especially when you view code written by others. This is the blessing (and curse) of R: there are many ways to do the same thing. In fact there are frequently several functions that appear to do the same thing (or could be made to). Anyway, relative to our problem, we could first sort the ChickWeight data frame by Diet and then do the subset. The sort arranges the data by group, which means that the resulting subset operation preserves that order.

sorted.chickens = ChickWeight[order(ChickWeight$Diet),]
(sorted.chickens = subset(sorted.chickens, weight <= 40))

    weight Time Chick Diet
13      40    0     2    1
26      39    2     3    1
195     39    0    18    1
196     35    2    18    1
221     40    0    21    2
269     40    0    25    2
293     39    0    27    2
305     39    0    28    2
317     39    0    29    2
365     39    0    33    3
401     39    0    36    3
519     40    0    46    4
543     39    0    48    4
555     40    0    49    4