Data Frames

R provides a helpful data structure called the “data frame” that gives the user an intuitive way to organize, view, and access data. Many of the functions that you would use to read in external files (e.g. read.csv) or connect to databases (RMySQL) will return a data frame by default. While there are other important data structures, such as the vector, list, and matrix, the data frame winds up being at the heart of many operations, not the least of which is aggregation. Before we get into that, let me offer a very brief review of data frame concepts:

  • A data frame is a set of rows and columns.
  • Each column has the same length, and all values within a column share a single data type.
  • Every row is of the same length, but its values can be of differing data types across columns.
  • A data frame has characteristics of both a matrix and a list.
  • Bracket notation is the customary method of indexing into a data frame (illustrated just below).
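To make these points concrete, here is a small hand-built data frame (the column names are just illustrative):

df <- data.frame(id = 1:3,
                 name = c("a","b","c"),
                 score = c(9.1, 8.7, 9.4))

str(df)        # three columns of equal length, each with a single type
df[2,]         # a single row mixes types, much like a list
df[,"score"]   # a single column is a vector, much like a matrix column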


Subsetting Data The Old School Way

Here are some examples of getting specific subsets of information from the built-in data frame mtcars. Note that the bracket notation has two dimensions here – one for rows and one for columns. The comma within any given bracket notation expression separates the two dimensions.

# select rows 1 and 2

mtcars[1:2,]
              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

# select rows 1 and 2 and columns 3 and 5

mtcars[1:2,c(3,5)]
              disp drat
Mazda RX4      160  3.9
Mazda RX4 Wag  160  3.9

# Find the rows where the MPG column is greater than 30 

mtcars[mtcars$mpg > 30,]
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

# select columns 10 and 11 for all rows where MPG > 30 and the cylinder
# count is equal to 4

mtcars[mtcars$mpg > 30 & mtcars$cyl == 4, 10:11]
               gear carb
Fiat 128          4    1
Honda Civic       4    2
Toyota Corolla    4    1
Lotus Europa      5    2

So the bracket notation winds up being a good way to index and search within a data frame, although doing aggregation requires us to use other functions such as tapply, aggregate, and table. This isn’t necessarily a bad thing – you just have to learn which function is the most appropriate for the task at hand.

# Get the mean MPG by Transmission

tapply(mtcars$mpg, mtcars$am, mean)
   0    1
17.1 24.4 

# Get the mean MPG grouped by Transmission and Cylinder

aggregate(mpg~am+cyl,data=mtcars,mean)
   am cyl  mpg
1  0   4 22.9
2  1   4 28.1
3  0   6 19.1
4  1   6 20.6
5  0   8 15.1
6  1   8 15.4

# Cross tabulation based on Transmission and Cylinder

table(transmission=mtcars$am, cylinder=mtcars$cyl)
            cylinder
transmission  4  6  8
           0  3  4 12
           1  8  3  2


Enter the data.table package

Okay this is nice, though wouldn’t it be good to have a way to do aggregation within the bracket notation? In fact there is. There are a couple of packages that could help us simplify aggregation, though we will start with the data.table package for now. In addition to being able to do aggregation within the brackets, there are some other reasons why it is useful:

  • It works well with very large data files
  • It can behave just like a data frame
  • It offers fast subsetting, grouping, updates, and joins
  • It makes it easy to turn an existing data frame into a data table

Since it is an external package you will first need to install it; after that, just load it with the library function.
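The one-time install, if you have never done it, looks like this:

install.packages("data.table")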

library(data.table)

dt <- data.table(mtcars)

class(dt)
[1] "data.table" "data.frame"

dt[,mean(mpg)]   # You can't do this with a normal data frame
[1] 20.09062

mtcars[,mean(mpg)]  # Such a thing will not work with regular data frames
Error in mean(mpg) : object 'mpg' not found

So notice that we can actually find the mean MPG directly within the bracket notation – we don’t have to go outside of the brackets to do this. So what about reproducing the previous tapply example:

tapply(mtcars$mpg,mtcars$am,mean)
 0    1
17.1 24.4 

# Here is how we would do this with the data table "dt"

dt[,mean(mpg),by=am]
   am   V1
1:  1 24.4
2:  0 17.1

# We could even extend this to group by am and cyl

dt[,mean(mpg),by=.(am,cyl)]
    am cyl  V1
1:  1   6 20.6
2:  1   4 28.1
3:  0   6 19.1
4:  0   8 15.0
5:  0   4 22.9
6:  1   8 15.4

# If we want to more clearly label the computed average

dt[,.(avg=mean(mpg)),by=.(am,cyl)]
    am cyl  avg
1:  1   6 20.6
2:  1   4 28.1
3:  0   6 19.1
4:  0   8 15.0
5:  0   4 22.9
6:  1   8 15.4


Similarities to SQL

It doesn’t require many examples to prove that we don’t have to use the aggregate or tapply functions to do any of the work once we have created a data table. Unlike a default data frame, the bracket notation for a data.table object takes three parts, which correspond to what one might see in an SQL statement. Don’t worry – you do not have to be an SQL expert to use data.table. In reality you don’t have to know SQL at all, although if you do then using data.table becomes much easier.


So in terms of SQL we would say: select “j” (columns, or an operation on some columns) for the rows “i” that satisfy some specified condition, and if the “by” index is supplied it indicates how to group the result. This might sound a little involved, so something more practical is in order to illustrate what I’m talking about here.
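Put another way, the general form of a data.table query is DT[i, j, by], and a small example combining all three parts (the condition on am is just for illustration) looks like this:

# DT[i, j, by]
#   i  : which rows to use           (like SQL WHERE / ORDER BY)
#   j  : what to select or compute   (like SQL SELECT)
#   by : how to group the result     (like SQL GROUP BY)

dt[am == 1, .(avg=mean(mpg)), by=cyl]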


So let’s revisit the previous examples and see how they relate to the SQL model – this is helpful in understanding the paradigm associated with data table objects:

dt[,mean(mpg),by=am]
   am   V1
1:  1 24.4
2:  0 17.1

# The above is analogous to an SQL statement like

select am,avg(mpg) from mtcars group by am

# The following example

dt[,.(avg=mean(mpg)),by=.(am,cyl)]
    am cyl avg
1:  1   6 20.6
2:  1   4 28.1
3:  0   6 19.1
4:  0   8 15.0
5:  0   4 22.9
6:  1   8 15.4

# is analogous to an SQL statement like:

select am,cyl,avg(mpg) as avg from mtcars group by am,cyl

# The following example

dt[mpg > 20,.(avg=mean(mpg)),by=.(am,cyl)]
    am cyl  avg
1:  1   6 21.0
2:  1   4 28.1
3:  0   6 21.4
4:  0   4 22.9

# would be analogous to the following SQL statement

select am,cyl,avg(mpg) as avg from mtcars where mpg > 20 group by am,cyl


As previously mentioned, one does not need to know SQL to use data.table. However, if you do, it can help you understand some of the motivations behind the package.


Tabulation

Here are some more examples that illustrate how we can count and tabulate things. Within a data table the special variable .N represents the count of rows. If there is a grouping index then it gives the number of rows within each group.

dt[, .N] # How many rows
[1] 32

dt[, .N, by=cyl]  # How many cars in each cylinder group
   cyl  N
1:   6  7
2:   4 11
3:   8 14

# For rows where the wt is > 1.5 tons count the number of cars by
# transmission type.

dt[wt > 1.5, .(count=.N), by=am] 
   am count
1:  1    13
2:  0    19

We can also do sorting quite easily and very quickly.


# Present the 5 cars with the best MPG

head(dt[order(-mpg)],5) 
    mpg cyl disp  hp drat   wt qsec vs am gear carb
1: 33.9   4 71.1  65 4.22 1.83 19.9  1  1    4    1
2: 32.4   4 78.7  66 4.08 2.20 19.5  1  1    4    1
3: 30.4   4 75.7  52 4.93 1.61 18.5  1  1    4    2
4: 30.4   4 95.1 113 3.77 1.51 16.9  1  1    5    2
5: 27.3   4 79.0  66 4.08 1.94 18.9  1  1    4    1

# Since data table inherits from a data frame we could have also done

dt[order(-mpg)][1:5]
    mpg cyl disp  hp drat   wt qsec vs am gear carb
1: 33.9   4 71.1  65 4.22 1.83 19.9  1  1    4    1
2: 32.4   4 78.7  66 4.08 2.20 19.5  1  1    4    1
3: 30.4   4 75.7  52 4.93 1.61 18.5  1  1    4    2
4: 30.4   4 95.1 113 3.77 1.51 16.9  1  1    5    2
5: 27.3   4 79.0  66 4.08 1.94 18.9  1  1    4    1

# We can sort on multiple keys. Here we sort the cars on decreasing
# gas mileage, breaking ties by increasing weight

dt[order(-mpg,wt)][1:5]
   mpg cyl disp  hp drat   wt qsec vs am gear carb
1: 33.9   4 71.1  65 4.22 1.83 19.9  1  1    4    1
2: 32.4   4 78.7  66 4.08 2.20 19.5  1  1    4    1
3: 30.4   4 95.1 113 3.77 1.51 16.9  1  1    5    2
4: 30.4   4 75.7  52 4.93 1.61 18.5  1  1    4    2
5: 27.3   4 79.0  66 4.08 1.94 18.9  1  1    4    1


Chicago Crime Statistics

Let’s look at a more realistic example. I have a file of Chicago crime data that you can download if you wish to work this example. It is about 81 megabytes, so it isn’t terribly large.

url <- "https://raw.githubusercontent.com/steviep42/youtube/master/YOUTUBE.DIR/chi_crimes.zip"
download.file(url,"chi_crimes.zip")
system("unzip chi_crimes.zip")

So you might try reading it in with the typical import functions in R, such as read.csv, which many people use as a default when reading CSV files. I’m reading this file in on a five-year-old MacBook with about 8 GB of RAM, so your speed may vary. We’ll first read in the file using read.csv and then use the fread function supplied by data.table to see if there is any appreciable difference.

system.time(df.crimes <- read.csv("chi_crimes.csv", header=TRUE,sep=","))

user system elapsed
30.251 0.283 30.569

nrow(df.crimes)
[1] 334141

# Now let's try fread

system.time(dt.crimes <- fread("chi_crimes.csv", header=TRUE,sep=","))

user system elapsed
1.045 0.037 1.362

attributes(dt.crimes)$class # dt.crimes is also a data.frame
[1] "data.table" "data.frame"

nrow(df.crimes)
[1] 334141

dt.crimes[,.N]
[1] 334141

That was a fairly significant difference. If the file were much larger we would see an even bigger time difference, which for me is a good thing since I routinely read in large files. Consequently fread has become a default for me even if I don’t wind up using the aggregation capability of the data.table package. Also note that the fread function returns a data.table object by default. Now let’s investigate some crime using some of the things we learned earlier. This data table has information on every call to the Chicago police in the year 2012, so we’ll want to see what factors there are in the data frame/table in order to do some summaries across groups.

 names(dt.crimes)
 [1] "Case Number"          "ID"                   "Date"
[4] "Block"                "IUCR"                 "Primary Type"
[7] "Description"          "Location Description" "Arrest"
[10] "Domestic"             "Beat"                 "District"
[13] "Ward"                 "FBI Code"             "X Coordinate"
[16] "Community Area"       "Y Coordinate"         "Year"
[19] "Latitude"             "Updated On"           "Longitude"
[22] "Location"  

# Let's see how many unique values there are for each column. Looks
# like 30 FBI codes, so maybe we could see the number of calls per FBI
# code. What about District? There are 25 of those.

sapply(dt.crimes,function(x) length(unique(x)))

        Case Number                   ID                 Date 
              334114               334139               121484 
               Block                 IUCR         Primary Type 
               28383                  358                   30 
         Description Location Description               Arrest 
                 296                  120                    2 
            Domestic                 Beat             District 
                   2                  302                   25 
                Ward             FBI Code         X Coordinate 
                  51                   30                60704 
      Community Area         Y Coordinate                 Year 
                  79                89895                    1 
            Latitude           Updated On            Longitude 
              180396                 1311               180393 
            Location 
              178534 


Now – I just used the sapply function to tell me how many unique values each column assumes, so we can identify potential summary factors. This is a common activity and I used a common R-like approach, although to the newcomer it looks a little verbose (welcome to R, my friend). It turns out that we can accomplish the same thing using data.table itself, but first let’s do some exploration and then we’ll come back to this. So how many calls per District were there? Let’s order the result so that we see the Districts with the highest number of calls first, in descending order.

dt.crimes[,.N,by=District][order(-N)]
    District     N
 1:        8 22386
 2:       11 21798
 3:        7 20150
 4:        4 19789
 5:       25 19658
 6:        6 19232
 7:        3 17649
 8:        9 16656
 9:       19 15608
10:        5 15258
11:       10 15016
12:       15 14385
13:       18 14178
14:        2 13448
15:       14 12537
16:        1 12107
17:       16 10753
18:       22 10745
19:       17  9673
20:       24  9498
21:       12  8774
22:       13  7084
23:       20  5674
24:       NA  2079
25:       31     6
    District     N

Next let’s randomly sample 500 rows and then tally the number of calls grouped by FBI Code (whatever that corresponds to – check https://www2.fbi.gov/ucr/nibrs/manuals/v1all.pdf to see them all). Note that mean(.N) within a group is simply that group’s row count.

dt.crimes[sample(1:.N,500), .(mean=mean(.N)), by="FBI Code"]
  FBI Code mean
1:       14   60
2:       19    3
3:       24    6
4:       26   47
5:       06  109
6:      08B   83
7:       07   27
8:      08A   22
9:       05   34
10:       18   44
11:      04B   10
12:       03   19
13:       11   15
14:      04A    7
15:       09    1
16:       15    6
17:       16    3
18:       10    1
19:       17    1
20:       02    2

# Here we count the number of calls for each day of the year and order from the
# highest to the lowest. We see that the day with the most calls to the police
# was New Year's Day, whereas Christmas Day had the fewest calls

dt.crimes[,.N,by=substr(Date,1,10)][order(-N)]
         substr    N
  1: 01/01/2012 1297
  2: 06/01/2012 1163
  3: 07/01/2012 1158
  4: 09/01/2012 1124
  5: 07/15/2012 1122
 ---                
362: 12/27/2012  679
363: 01/20/2012  677
364: 11/22/2012  653
365: 12/26/2012  619
366: 12/25/2012  520

# Here is another example that would emulate the HAVING clause in SQL

dt.crimes[,.N,by=substr(Date,1,10)][N > 1122][order(-N)]
       substr    N
1: 01/01/2012 1297
2: 06/01/2012 1163
3: 07/01/2012 1158
4: 09/01/2012 1124


Other Things

Keep in mind that data.table isn’t just for aggregation. You can do anything with it that you can do with a normal data frame, including creating new columns, modifying existing ones, writing your own aggregation functions, and many other activities.
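For instance, new columns are created in place with the := operator. Here is a minimal sketch – the column name kml (kilometers per liter) is made up for illustration:

# Add a derived column in place with := and then remove it again
dt[, kml := mpg * 0.425144]    # 1 mile per gallon is roughly 0.425 km per liter
head(dt[, .(mpg, kml)])
dt[, kml := NULL]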


# Get the next to the last row from the data table

dt[.N-1]
    mpg cyl disp  hp drat   wt qsec vs am gear carb
1:  15   8  301 335 3.54 3.57 14.6  0  1    5    8

dt[cyl %in% c(4,6)]
    mpg cyl  disp  hp drat   wt qsec vs am gear carb
1: 21.0   6 160.0 110 3.90 2.62 16.5  0  1    4    4
2: 21.0   6 160.0 110 3.90 2.88 17.0  0  1    4    4
3: 22.8   4 108.0  93 3.85 2.32 18.6  1  1    4    1
4: 21.4   6 258.0 110 3.08 3.21 19.4  1  0    3    1
5: 18.1   6 225.0 105 2.76 3.46 20.2  1  0    3    1
6: 24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
7: 22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
8: 19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
9: 17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
10: 32.4   4  78.7  66 4.08 2.20 19.5  1  1    4    1
11: 30.4   4  75.7  52 4.93 1.61 18.5  1  1    4    2
12: 33.9   4  71.1  65 4.22 1.83 19.9  1  1    4    1
13: 21.5   4 120.1  97 3.70 2.46 20.0  1  0    3    1
14: 27.3   4  79.0  66 4.08 1.94 18.9  1  1    4    1
15: 26.0   4 120.3  91 4.43 2.14 16.7  0  1    5    2
16: 30.4   4  95.1 113 3.77 1.51 16.9  1  1    5    2
17: 19.7   6 145.0 175 3.62 2.77 15.5  0  1    5    6
18: 21.4   4 121.0 109 4.11 2.78 18.6  1  1    4    2

# Summarize different variables at once

dt[,.(avg_mpg=mean(mpg),avg_wt=mean(wt))]
   avg_mpg avg_wt
1:    20.1   3.22

dt[,cyl:=NULL]  # Removes the cyl column

head(dt)
    mpg disp  hp drat   wt qsec vs am gear carb
1: 21.0  160 110 3.90 2.62 16.5  0  1    4    4
2: 21.0  160 110 3.90 2.88 17.0  0  1    4    4
3: 22.8  108  93 3.85 2.32 18.6  1  1    4    1
4: 21.4  258 110 3.08 3.21 19.4  1  0    3    1
5: 18.7  360 175 3.15 3.44 17.0  0  0    3    2
6: 18.1  225 105 2.76 3.46 20.2  1  0    3    1


Identifying Possible Factors in a Data Frame

Okay – I demonstrated above a way to identify the number of unique values contained in each column, as a means to determine what the factors/categories might be. I used the following:


sapply(dt.crimes,function(x) length(unique(x)))

However, the data.table package has a special variable called .SD and a function called uniqueN that can help us. Let’s explore what these do. First let me create a smaller data table, which makes it easier to see what is going on.

set.seed(123)
mydt <- data.table(group=sample(LETTERS[1:3],6,T),gender=sample(c("M","F"),6,T),
                   measure=round(rnorm(6,10),2))

mydt
   group gender measure
1:     A      F   10.46
2:     C      F    8.73
3:     B      F    9.31
4:     C      M    9.55
5:     C      F   11.22
6:     A      M   10.36

# The following groups the data.table by the "gender" column

mydt[,.SD,by=gender]
   gender group measure
1:      F     A   10.46
2:      F     C    8.73
3:      F     B    9.31
4:      F     C   11.22
5:      M     C    9.55
6:      M     A   10.36

In this case the .SD variable (“Subset of Data”) holds, for each group, the subset of the data table corresponding to that group – here the data are grouped by gender as if we had created an index. (Note that the data.table package has a setkey function to formalize the creation of an index.)
What about this example?

mydt[,print(.SD),by=gender] 
   group measure
1:     A   10.46
2:     C    8.73
3:     B    9.31
4:     C   11.22
   group measure
1:     C    9.55
2:     A   10.36
Empty data.table (0 rows) of 1 col: gender

This is somewhat similar to the native split function in R that lets one split a data frame on a given factor and store the results in a list. In this case, however, the “splits” aren’t really stored anywhere because we are simply printing them. A more useful example might be:

mydt[,lapply(.SD,mean),by=gender,.SDcols="measure"] 
   gender measure
1:      F    9.93
2:      M    9.96 

Oh wow – so .SD pulls out all the columns in the data table except gender, and then the mean function is applied to the columns specified by the .SDcols variable. So this is another way of doing some aggregation (although perhaps a bit intimidating for those not familiar with the lapply function – if that is the case I have a cure for you: go read this post). Well, we might be getting a bit off track here, because I wanted to show you how to replace the functionality of the sapply example from above. Here it is. The uniqueN function determines how many unique values there are in a column, and the lapply function applies uniqueN to each column.


mydt[,lapply(.SD,uniqueN)]
   group gender measure
1:     3      2       6
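The same idiom scales right back up to the crimes table. This sketch should reproduce the earlier sapply result as a one-row data.table of unique counts per column:

dt.crimes[, lapply(.SD, uniqueN)]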


C’est la fin ?

Before we finish this posting, it is worth mentioning that the data.table package also provides support for setting a key, much in the same way one would create an index in a relational database to speed up queries. This is for situations wherein you have a really large data table and expect to routinely interrogate it using the same column(s) as keys. In the next posting I will look at the dplyr package to show another way to handle large files and accomplish intuitive aggregation. Some R experts represent data.table as a competitor of dplyr, although one could mix the two. What I like about data.table is that it allows you to build sophisticated queries, summaries, and aggregations within the bracket notation. It has the added flexibility of allowing you to employ existing R functions or any that you decide to write.
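As a quick taste of keys before we go, here is a minimal sketch using the crimes table from above. Setting the key sorts the table by that column and enables fast keyed lookups:

setkey(dt.crimes, District)   # sort by District and mark it as the key

dt.crimes[.(8), .N]           # keyed lookup: calls in District 8 (22386 as seen earlier)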


apply

The goal of this blog entry is to introduce basic and essential information about the apply function. Even established R users get confused when considering this family of functions, especially when observing how many of them there are: apply, tapply, lapply, sapply, rapply, eapply, mapply. When I was new to R I was rarely satisfied with the all-too-basic explanations of these commands, so I thought I would create a series on some of these functions to address the more common questions that newcomers might have. There is an add-on package called plyr that attempts to present a unified philosophy and approach to the process of “applying” functions to various R data structures though, for now, I’ll be sticking to the native R commands since they show up a lot in example code you are likely to encounter.

I’ll start with the apply command, which expects a matrix as input. Depending on what function you specify when using the apply command, you will get back either a vector or a matrix. Let’s get started with a motivating example, which I hope will convince you of how useful the apply approach can be. Here we create a matrix of 16 elements drawn from a normal distribution with mean 10. We use the set.seed command to enable reproducibility. Imagine that each column represents numerical measurements and the rows represent individual samples. Scientific data, such as microarray and metabolomic information, often follow this pattern. For those of you concerned with authenticity, please don’t worry – I’ll introduce some “real” data later on in the posting.

set.seed(1)  # Makes the call to rnorm generate the same numbers every time
( mymat <- matrix(round(rnorm(16,10),2),4,4) )

      [,1]  [,2]  [,3]  [,4]
[1,]  9.37 10.33 10.58  9.38
[2,] 10.18  9.18  9.69  7.79
[3,]  9.16 10.49 11.51 11.12
[4,] 11.60 10.74 10.39  9.96

Doing it the hard way

Let’s say that we want to take the mean of each column. Another way to say this is that we want to apply the mean function to each column in the matrix. This is the way to start thinking about things like this in R. But first, how might we do this if we don’t know anything about the apply command?

mean(mymat[,1])
[1] 10.0775

mean(mymat[,2])
[1] 10.185

mean(mymat[,3])
[1] 10.5425

mean(mymat[,4])
[1] 9.5625

# Or we could stash the means into a vector if we need to use that information elsewhere

( mymatcolmean <- c(mean(mymat[,1]),mean(mymat[,2]),mean(mymat[,3]),mean(mymat[,4])) )

[1] 10.0775 10.1850 10.5425  9.5625

See what I did there? I took the mean of each column by invoking the mean function four times, substituting columns 1-4 on each invocation. This is a bit crude but entirely effective, and we do get the correct answer. In the second example I created a vector called mymatcolmean to hold the information. The two approaches are equivalent, though neither scales very well. Imagine if there were 10, 100, or 1,000 columns. Sure, you could cut and paste, but it is still a tedious process. Surely there must be a better way. Well, if you have experience with programming languages like C, FORTRAN, Java, or Perl then you probably know how to use a for-loop structure to solve this problem. R supports for-loops too, but it also offers some ways to avoid using them.

retvec <- vector()
for (ii in 1:ncol(mymat)) {
  retvec[ii] = mean(mymat[,ii])
}
retvec
[1] 10.0775 10.1850 10.5425  9.5625

# We could even put this into a function for later use in case we need it.

myloop <- function(somemat) {
 retvec <- vector()
 length(retvec) <- ncol(somemat)
 for (ii in 1:ncol(somemat)) {
   retvec[ii] <- mean(somemat[,ii])
 }
 return(retvec)
}

myloop(mymat)
[1] 10.0775 10.1850 10.5425  9.5625

# This will now work for any matrix but of course it is specific to the columns of the matrix.

set.seed(1)

newmat <- matrix(rnorm(100),10,10)

myloop(newmat)
 [1]  0.1322028  0.2488450 -0.1336732  0.1207302  0.1341367  0.1434569  0.4512100 -0.2477361
 [9]  0.1273603  0.1123413

So this would work, but what if we wanted to take the mean of the rows instead of the columns? We could do that, but we would then have to rewrite our function. And, while we are at it, in order to make our function more general, what about providing an argument that indicates whether we want to operate on the rows or the columns? I’ll use “1” to indicate rows and “2” to indicate columns.

myloop <- function(somemat,rc=2) {
  retvec <- vector()
  
  # Check to see if we are taking the mean of the columns or rows
  # 1 indicates rows, 2 indicates columns
  # Note that we size retvec to match the dimension we loop over
  
  if (rc == 2) {
    length(retvec) <- ncol(somemat)
    for (ii in 1:ncol(somemat)) {
      retvec[ii] <- mean(somemat[,ii])
    }
  } else {
    length(retvec) <- nrow(somemat)
    for (ii in 1:nrow(somemat)) {
      retvec[ii] <- mean(somemat[ii,])
    }
  }
  return(retvec)
}

# Okay let's make sure it works. 

myloop(mymat,2)   # This is correct
[1] 10.0775 10.1850 10.5425  9.5625

myloop(mymat,1)   # This is correct
[1]  9.9150  9.2100 10.5700 10.6725

mean(mymat[1,])  # A spot check
[1] 9.915

Doing it the easy way

Well okay, that’s good, but what if we then wanted our myloop function to accommodate a function other than mean? Like quantile, sum, fivenum, or even a function of our own design? We could add another argument to let the user specify the name of a function to be applied to the rows or columns. But before we do more work, please consider that R already has something that will do the job for us – the apply command.

apply(mymat, 2, mean)
[1] 10.0775 10.1850 10.5425  9.5625

apply(mymat, 1, mean)
[1]  9.9150  9.2100 10.5700 10.6725

See how much easier that is than writing our own looping function? It has been my observation that those well versed in traditional programming languages have a bigger problem getting used to the apply function than newcomers do, simply because experienced programmers are more accustomed to writing their own summary functions. They just dive in and start coding. But R short-circuits this approach by providing the apply family of commands. Note also that we can substitute in any function we want to.


apply(mymat,2,class)  # What class do the columns belong to ?
[1] "numeric" "numeric" "numeric" "numeric"

apply(mymat,2,sum)    # Get the sum of all the columns
[1] 40.31 40.74 42.17 38.25

apply(mymat,1,range)  # Get the range of all rows
      [,1]  [,2]  [,3]  [,4]
[1,]  9.37  7.79  9.16  9.96
[2,] 10.58 10.18 11.51 11.60

apply(mymat,2,fivenum) # Get the fivenum summary for each column
       [,1]   [,2]   [,3]   [,4]
[1,]  9.160  9.180  9.690  7.790
[2,]  9.265  9.755 10.040  8.585
[3,]  9.775 10.410 10.485  9.670
[4,] 10.890 10.615 11.045 10.540
[5,] 11.600 10.740 11.510 11.120

Some activities on matrices are so common that R actually has dedicated functions for them that are very efficient and fast:

rowMeans(mymat)     # Equivalent to apply(mymat,1,mean)
[1]  9.9150  9.2100 10.5700 10.6725

colMeans(mymat)     # Equivalent to apply(mymat,2,mean)
[1] 10.0775 10.1850 10.5425  9.5625

rowSums(mymat)      # Equivalent to apply(mymat,1,sum)
[1] 39.66 36.84 42.28 42.69

colSums(mymat)      # Equivalent to apply(mymat,2,sum)
[1] 40.31 40.74 42.17 38.25

Passing additional arguments

Many explanations I’ve seen for the apply family of functions omit any discussion of how to provide arguments to the function you pass to apply. I’m not sure why, as this is an extremely important consideration. To show you what I mean, let’s go back to the apply example that uses the mean function.

apply(mymat, 2, mean)
[1] 10.0775 10.1850 10.5425  9.5625

This is easy to understand, though what happens if we wish to pass arguments to the mean function, such as “trim” or “na.rm”? For example, maybe we wish to first trim off some portion of the data before we apply the mean function. (With trim=0.5 the maximum trimming occurs and what you get back is in effect the median.) If we did this using the mean function directly it would look like this:

mean(mymat[,1],trim=0.5)
[1] 9.775

# You might then be tempted to do something like this when using the apply command

apply(mymat, 2, mean(trim=0.5))
Error in mean.default(trim = 0.5) : 
  argument "x" is missing, with no default

# Or maybe this ?

apply(mymat, 2, mean(x,trim=0.5))
Error in mean(x, trim = 0.5) : object 'x' not found

To get some idea of how to do this correctly, look at the help information for apply. It’s not exactly jumping out at you, but the answer is there. Do you see it?

     apply(X, MARGIN, FUN, ...)
     
Arguments:

       X: an array, including a matrix.

  MARGIN: a vector giving the subscripts which the function will be
          applied over.  E.g., for a matrix ‘1’ indicates rows, ‘2’
          indicates columns, ‘c(1, 2)’ indicates rows and columns.
          Where ‘X’ has named dimnames, it can be a character vector
          selecting dimension names.

     FUN: the function to be applied: see ‘Details’.  In the case of
          functions like ‘+’, ‘%*%’, etc., the function name must be
          backquoted or quoted.

     ...: optional arguments to ‘FUN’.

Oh, so the ellipsis implies that we can simply append arguments after the reference to the function. Of course, it is reasonable to assume that the arguments you pass must make sense to the function. That is, you cannot pass an argument to a function that isn’t valid!

apply(mymat,2,mean,trim=0.5)
[1]  9.775 10.410 10.485  9.670

# We can pass additional arguments

apply(mymat,2,mean,trim=0.5, na.rm=T)
[1]  9.775 10.410 10.485  9.670

Using our own functions with apply

We can also write a function of our own and pass it to the apply command. As an example, let’s say that we want to express each element in a column as a proportion of the sum of the column it occupies. Here is how we would do this without using apply. As in the opening example, this approach doesn’t scale very well, although we do get the correct answers.

mymat[,1]/sum(mymat[,1])
[1] 0.2324485 0.2525428 0.2272389 0.2877698

mymat[,2]/sum(mymat[,2])
[1] 0.2535592 0.2253314 0.2574865 0.2636230

mymat[,3]/sum(mymat[,3])
[1] 0.2508893 0.2297842 0.2729429 0.2463837

mymat[,4]/sum(mymat[,4])
[1] 0.2452288 0.2036601 0.2907190 0.2603922

But using our new-found knowledge of the apply command we can make this easier and more general. First, let’s write a function that, given a vector (which is what each column or row of a matrix is), returns a vector where each element is a proportion of the sum of the input vector.

myfunc <- function(x) {
  return(x/sum(x))
}

# Check it out to make sure it works

myfunc(mymat[,1])
[1] 0.2324485 0.2525428 0.2272389 0.2877698

all.equal(myfunc(mymat[,1]), mymat[,1]/sum(mymat[,1]))
[1] TRUE

So now we can pass this to the apply command directly


apply(mymat, 2, myfunc)
          [,1]      [,2]      [,3]      [,4]
[1,] 0.2324485 0.2535592 0.2508893 0.2452288
[2,] 0.2525428 0.2253314 0.2297842 0.2036601
[3,] 0.2272389 0.2574865 0.2729429 0.2907190
[4,] 0.2877698 0.2636230 0.2463837 0.2603922

# If this is correct then the sum of each column should be 1.

colSums(apply(mymat,2,myfunc))
[1] 1 1 1 1

# which is equivalent to the following nested apply call

apply( apply(mymat,2,myfunc), 2, sum)
[1] 1 1 1 1

The second example above, the one that has a call to apply within a call to apply, is known as a “composite call”, which can be confusing to newcomers. However, in the world of R, which aggressively implements functional programming concepts, the use of composite function calls is quite common and encouraged. But there is no requirement to do things this way. Also, I should point out that R has a dedicated function for computing proportions in a matrix. Check out the prop.table command.
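As a quick check on that last claim, prop.table with margin=2 computes column-wise proportions directly and should agree with our myfunc approach:

prop.table(mymat, margin=2)    # same proportions as apply(mymat,2,myfunc)

all.equal(prop.table(mymat,2), apply(mymat,2,myfunc))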

Working anonymously

In reality we don’t have to predefine the function. We simply pass the function to apply in an “anonymous” fashion, which means that we don’t even give it a name. It lives only for the duration of the call to apply.

apply(mymat, 2, function(x) x/sum(x))
          [,1]      [,2]      [,3]      [,4]
[1,] 0.2324485 0.2535592 0.2508893 0.2452288
[2,] 0.2525428 0.2253314 0.2297842 0.2036601
[3,] 0.2272389 0.2574865 0.2729429 0.2907190
[4,] 0.2877698 0.2636230 0.2463837 0.2603922

This can be confusing to newcomers and, if it is, you can simply predefine the function with a name and then refer to it by name when using the apply command. Sometimes you see calls to apply that look like the following. This is done in an attempt to improve readability, though that is subjective. In terms of results or efficiency there is no difference.

apply(mymat, 2, function(x) {
                      return(x/sum(x))
                })

Performance vs the for loop

If you’ve read any of my other postings you will know that I have an interest in performance. Is there an efficiency gain when using the apply approach vs. the for-loop approach? To test this, let’s read in some “real” microarray data, which comes from Su et al, 2002.

url <- "http://steviep42.bitbucket.org/data/su_et_al_2002.txt"
mydata <- read.csv(url,header=T)

# B = brain, FB = fetal brain, L = liver, FL = fetal liver

str(mydata)
'data.frame':	12626 obs. of  8 variables:
 $ B1 : num  9.37 11.57 7.79 6.56 7.26 ...
 $ B2 : num  9.5 11.51 6.69 7.45 7.39 ...
 $ FB1: num  7.89 10.56 7.91 5.93 8.75 ...
 $ FB2: num  7.96 10.34 7.57 7.53 8.19 ...
 $ FL1: num  7.7 10.26 8.25 7.84 9.45 ...
 $ FL2: num  8.1 10.13 8.53 8.16 8.19 ...
 $ L1 : num  9.22 10.23 8.39 11.02 9.19 ...
 $ L2 : num  8.9 10.49 8.47 11.09 9.42 ...

# Let's check the mean and standard deviation for each column

apply(mydata,2,function(x) c(mu=mean(x),sd=sd(x)))
         B1       B2      FB1      FB2      FL1      FL2       L1       L2
mu 7.615090 7.431602 7.879343 7.770545 7.730293 7.601395 7.415937 7.518259
sd 2.257218 2.452364 2.150741 2.254736 2.207076 2.355612 2.405326 2.365394

Okay well let’s do a t-test across all rows for the brain tissue types. This becomes really easy to do with the apply command.


mypvals <- apply(mydata, 1, function(x) { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value } )

round(mypvals[1:5],4)
 100_g_at   1000_at   1001_at 1002_f_at 1003_s_at 
   0.0090    0.0494    0.5211    0.8001    0.1349 

But is this faster than writing our own loop to do the t-test? We’ll first need to write our own code to do it, which might look something like this:


myttest <- function(somemat) {
  retvec <- vector()
  length(retvec) <- nrow(somemat)
  for (ii in 1:nrow(somemat)) {
    retvec[ii] <- t.test(somemat[ii,c("B1","B2")], somemat[ii,c("FB1","FB2")])$p.value
  }
  names(retvec) <- row.names(somemat)
  return(retvec)
}

Next we’ll make sure that this function returns the same results as the apply version after which we will time each approach.

v1 <- myttest(mydata)

v2 <- apply(mydata, 1, function(x) { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value } )

all.equal(v1,v2)
[1] TRUE

# Let's time the two different approaches

library(rbenchmark)   # the benchmark function comes from the rbenchmark package

benchmark(forloop = myttest(mydata), apply=apply(mydata, 1, function(x) 
                         { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value }),
                           replications=5,columns=c("test","elapsed"))
     test elapsed
2   apply  19.616
1 forloop  61.800

In this case it appears that the apply approach is indeed quicker than writing our own function, though in my experience, with more general cases, the difference in execution time isn’t as drastic. In the end I suggest you use whichever approach makes the most sense to you to get the job done. However, I think that as you become more confident with R you will be attracted to the apply command, since it basically replaces the for-loop. So whenever you find yourself tempted to write a for-loop, even though you are used to it in other languages, you should remind yourself that the “R-like” way to do it is to use apply.

Linking in Fortran Code

If you are a newcomer to R then you are probably quite busy learning the semantics of the language as you experiment with the apply family of commands or come up to speed on the grouping and conditioning capabilities offered by lattice graphics. And, along the way, you might have heard that R has the ability to “link in” code written in other languages such as C, C++, Fortran, and Java. This is true, but until you are presented with a compelling use case you will most likely ignore such capability since you’ve already got plenty to do. But that’s where this blog can help. I’ll talk about how to integrate Fortran code with R and, in a later post, discuss doing the same with C and C++.

DISCLAIMER: To accomplish this work requires the presence of a working Fortran compiler such as the GNU suite of compilers (g77, gfortran, gcc, etc). Relative to operating systems I use Linux and OSX. I rarely use Windows though do know that the GNU suite is available for that OS so you should be able to link Fortran code into R there. However, I’ve never tried it. Additional Disclaimer: The timings I present in this post are based on an Apple MacBook @ OS level 10.9.2 with a 2.5 Ghz i5 processor and 4GB of RAM. Your timings may vary.

Why Bother?

Good question. Most people wind up wanting to access Fortran from R for a few reasons: they have some really fast and efficient Fortran code that they want to exploit within R; or they have written some code in R that winds up being incredibly slow, so they write a much faster version in Fortran and then want to call it from R; or they need to access subroutines from external Fortran libraries. Lastly, it might simply be because your boss or faculty advisor is making you do it! Whatever your reason(s), we’ll break the process of linking in Fortran code down into three general steps: 1) prepare the subroutine for compilation and generate a “shared object”, 2) load the shared object into an active R session, and 3) provide R variables to the shared object via the “.Fortran” call, which is part of the “Foreign Function Interface” within R.

Foreign                  package:base                  R Documentation

Foreign Function Interface

Description:

     Functions to make calls to compiled code that has been loaded into
     R.

Usage:

            .C(.NAME, ..., NAOK = FALSE, DUP = TRUE, PACKAGE, ENCODING)
      .Fortran(.NAME, ..., NAOK = FALSE, DUP = TRUE, PACKAGE, ENCODING)

Prepare the Fortran Subroutine

Next, I present a very simple Fortran 77 subroutine that computes the factorial of a number “n” and stashes the result into a variable called “answer”. Speaking of subroutines, it is important to know that the .Fortran interface can reference Fortran subroutines only – not Fortran functions or full-on programs. So if you have some code that you want to bring in, you will need to embed that code within a subroutine definition. We also need to pay attention to the variable types as declared within the subroutine so we can match those types accordingly when calling the subroutine from R. This winds up being one of the more critical steps in the process.

        subroutine facto(n,answer)
c
c simple subroutine to compute factorial
c
        integer n, answer, i

        answer = 1
        do 100 i = 2,n
           answer = answer * i
  100   continue 
        
        end

You should make sure, of course, that this code compiles correctly. Our goal is to generate a shared object file (a “.so” file) that can be linked in to R. Note also that our routine doesn’t print/write things to output. It simply uses its input variables and ultimately sets an output variable. Make your code lean.

$ ls
facto.f

$ gfortran -c facto.f

$ ls
facto.f	facto.o

$ gfortran -shared -o facto.so facto.o
$ ls
facto.f		facto.o		facto.so

So it looks like we are good to go here. However, instead of doing this compilation ourselves we could have let R help us. In fact that is the preferred way to do it, since it ensures the compilation is done “under the supervision” of the R tools. Let’s remove the .o and .so files and start over.

$ rm *.*o

$ ls
facto.f

$ R CMD SHLIB facto.f
<you will see various compilation output messages>

$ ls
facto.f		facto.o		facto.so

Load it

So now what? Let’s fire up R. We’ll use the dyn.load command, which is part of the foreign interface capability. Its purpose is to load/unload shared objects, which are also known as DLLs (dynamically loadable libraries).

system("ls")
facto.f		facto.o		facto.so
dyn.load("facto.so")

Okay, not much happened there. What’s going on? Well, all we did was load the shared object. We have yet to use it. To do that we rely upon the “.Fortran” function. Keep in mind that the subroutine “facto” has two arguments, both of which are integers. We’ll supply a value of 5 for “n” and we’ll pass a single integer as a value for the “answer” variable, though that will be overwritten once the subroutine computes the answer.

system("ls")
facto.f		facto.o		facto.so

dyn.load("facto.so")

.Fortran("facto",n=as.integer(5),answer=as.integer(1))

$n
[1] 5

$answer
[1] 120

# Or more directly

.Fortran("facto",n=as.integer(5),answer=as.integer(1))$answer
[1] 120

If you are wondering what types are supported or shared between R and Fortran here is a list from the .Fortran man page. It is worth some time to peruse the man page as it provides some caveats and deeper explanations on how to interact with Fortran.

       R          Fortran          
       integer    integer          
       numeric    double precision 
       - or -     real             
       complex    double complex   
       logical    integer          

Wrap it Up!

Okay, that was cool, but this is sort of an awkward way to call the Fortran subroutine. We should probably write our own wrapper function in R to do the loading of the shared object and the call to the .Fortran function. Take a look at this approach, which winds up being very “R-like”.

myfacto <- function(num) {
  dyn.load("facto.so")
  retvals <- .Fortran("facto",n = as.integer(num), answer = as.integer(1))
  return(retvals$answer)
}
         
myfacto(5)
[1] 120

sapply(1:10,myfacto)
 [1]       1       2       6      24     120     720    5040   40320  362880 3628800

So with the wrapper approach we can use the Fortran subroutine just as we would any other R function since the call to .Fortran is “buried” in the wrapper function. We could make this a bit more robust by putting in some logic to see if the shared object is already loaded.

myfacto <- function(num) {
  if (!is.loaded('facto')) {
     dyn.load("facto.so")
  }
  retvals <- .Fortran("facto",n = as.integer(num), answer = as.integer(1))
  return(retvals$answer)
}
         
myfacto(5)
[1] 120

It’s all Convoluted

Well, that was okay, but let’s look at a more involved example: discrete convolution of two vectors. (Note: this is discussed in the “Writing R Extensions” manual). Why did I pick such an example? Well, first, it’s commonly referenced in R literature and, second, it is a good motivating case for using an external language to speed up the processing. The algorithm itself isn’t hard to code up either in R or Fortran. However, the performance in R isn’t so good once the vectors get larger. Check it out:

conr <- function(x, y) {
    lx <- length(x)
    ly <- length(y)
    cxy <- numeric(lx + ly - 1)
    for(i in 1:lx) {
        xi <- x[i]
        for(j in 1:ly) {
            ij <- i+j-1
            cxy[ij] <- cxy[ij] + xi * y[j]
        }
    }
    return(cxy)
}

# Let's check the timings for vectors of different sizes

v1 = rnorm(100); v2 = rnorm(100)

system.time(conr(v1,v2))
   user  system elapsed 
  0.034   0.000   0.035 

v1 = rnorm(2000); v2 = rnorm(2000)

system.time(conr(v1,v2))
   user  system elapsed 
 13.195   0.020  13.215 

v1 = rnorm(4000); v2 = rnorm(4000)

system.time(conr(v1,v2))
   user  system elapsed 
 57.757   0.130  58.008 

The timings grow significantly longer as the sizes of the vectors grow, so passing vectors of size 10,000 could take a very long time. While this blog isn’t specifically on performance, let’s do a little more coding to get an idea of how poorly the convolution written in R performs. We’ll use this later for comparison with the performance numbers from the Fortran subroutine. Let’s write a wrapper around the conr function that calls conr with a variable x representing the size of the vectors we wish to convolve. If you don’t understand exactly what is going on here don’t worry – just think of it as more exposure to the apply family of commands.

timewrapconr <- function(x) {
    times <- system.time(conr(rnorm(x),rnorm(x)))[3]
    return(c(size=x,times))
}

# time the convolution for vectors of size 100,1000,2000, and 4000

(convtimes <- sapply(c(100,1000,2000,4000),timewrapconr))
           [,1]     [,2]     [,3]    [,4]
size    100.000 1000.000 2000.000 4000.00
elapsed   0.033    3.552   13.811   62.15

# Let's plot this
library(lattice)

xyplot(convtimes[2,]~convtimes[1,],type=c("p","l","g"),
       xlab = "vector size", ylab = "elapsed time in seconds",
       main = "Execution times for Convolution in R", pch = 19)
Figure 1. Plot of execution times for convolution in R

How do we address this problem? Well, there are opportunities for improvement within the R code by using vectorization techniques. A good start would be to somehow avoid the second for-loop, and there are ways to do that. In fact there is a way to avoid both loops altogether, as sketched below, though maybe we’ll explore such approaches more fully in another post.
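Here is one rough sketch of a loop-free version: outer forms every product x[i]*y[j], and tapply sums the products that land in each output slot i+j-1. (conv_outer is just an illustrative name, and the approach trades the loops for a potentially large lx-by-ly matrix in memory.)

conv_outer <- function(x, y) {
  prods <- outer(x, y)                                 # all products x[i]*y[j]
  pos <- outer(seq_along(x), seq_along(y), "+") - 1    # output slot i+j-1 for each product
  as.vector(tapply(prods, pos, sum))                   # sum the products in each slot
}

# all.equal(conv_outer(v1,v2), conr(v1,v2)) should return TRUE

But for now let’s see if writing the code in Fortran and then linking it in could help improve things. So here is the rewritten convolution algorithm, which we will save into a file called convolvef77.f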

      subroutine convolvef77 (x, lx, y, ly, xy)
c
c A basic implementation of convolution algorithm for two vectors
c I use zero-based arrays here.
c
      integer lx, ly, i, j
      double precision x(0:lx-1), y(0:ly-1), xy(0:lx+ly-2)
      do 20 i = 0, (lx-1) 
         do 15 j = 0, (ly-1) 
            xy(i+j) = xy(i+j) + x(i) * y(j) 
  15     continue  
  20  continue 
      end

# Save the above to a file called convolvef77.f
# Now compile it to a shared library

$ R CMD SHLIB convolvef77.f 

Next we’ll write a function in R to call the convolvef77 function. Start up R.

convolvef77 <- function(x,y) {
  dyn.load('convolvef77.so')
  lx = length(x)
  ly = length(y)
  retdata <- .Fortran("convolvef77",
                       x = as.double(x),
                       lx = as.integer(lx), 
                       y = as.double(y), 
                       ly = as.integer(ly), 
                       xy = double(lx+ly-1))$xy
  return(retdata)
}

# Now let's throw some large vectors at it. Look at how much better the times are

v1 = rnorm(4000); v2 = rnorm(4000)

system.time(convolvef77(v1,v2))
   user  system elapsed 
  0.012   0.000   0.012 

v1 = rnorm(8000); v2 = rnorm(8000)

system.time(convolvef77(v1,v2))
   user  system elapsed 
  0.047   0.001   0.083 

So the speed looks really good. Now let’s repeat the timing exercise we applied to the convolution done in R.

timewrapconf77 <- function(x) {
    times <- system.time(convolvef77(rnorm(x),rnorm(x)))[3]
    return(c(size=x,times))
}

(convtimes <- sapply(c(100,1000,2000,4000),timewrapconf77))
           [,1]  [,2]  [,3]    [,4]
size    100.000 1e+03 2e+03 4.0e+03
elapsed   0.045 2e-03 4e-03 1.3e-02

# Wow. This is FAST!!!!! Let's throw some bigger vectors at it.

(convtimes <- sapply(c(100,1000,2000,4000,10000,20000,50000),timewrapconf77))
         [,1]  [,2]  [,3]    [,4]    [,5]     [,6]      [,7]
size    1e+02 1e+03 2e+03 4.0e+03 1.0e+04 2.00e+04 50000.000
elapsed 1e-03 2e-03 4e-03 1.2e-02 7.2e-02 3.22e-01     2.074

# Plot the times

xyplot(convtimes[2,]~convtimes[1,],type=c("p","l","g"),
       xlab = "vector size", ylab = "elapsed time in seconds",
       main = "Execution times for Convolution in Fortran", pch = 19)
Execution times using Fortran

So using the Fortran subroutine took about 2.0 seconds to convolve vectors of size 50,000, whereas using native R code to convolve vectors of size 1,000 took 3.5 seconds (these timings might vary depending on your architecture and OS). To get a better visual comparison, let’s repeat the timings for both approaches, R and Fortran, and plot the results on the same graph so you can get some sense of proportion between the execution times. This isn’t hard to do. We’ll just rerun our timing functions:

(convtimesr <- sapply(c(100,1000,2000,4000,10000,20000),timewrapconr))

           [,1]     [,2]     [,3]     [,4]      [,5]      [,6]
size    100.000 1000.000 2000.000 4000.000 10000.000 20000.000
elapsed   0.034    3.374   14.118   64.894   355.409  1504.517
  
(convtimesf77 <- sapply(c(100,1000,2000,4000,10000,20000),timewrapconf77))
           [,1]    [,2]  [,3]    [,4]    [,5]     [,6]
size    100.000 1.0e+03 2e+03 4.0e+03 1.0e+04 2.00e+04
elapsed   0.071 2.3e-02 4e-03 1.2e-02 6.9e-02 2.99e-01

# Now plot them on the same graph

plot(convtimesr[1,],convtimesr[2,],xlab="Vector size",
     ylab="Elapsed time in seconds", 
     main="Convolution in R vs Fortran",
     type="l",col="red",lwd=2.5)

points(convtimesf77[1,],convtimesf77[2,],type="l",col="blue",lwd=2.5)

legend("topleft",c("R","Fortran"),col=c("red","blue"),
        lty=c(1,1),lwd=c(2.5,2.5))
grid()
Execution times for R and Fortran

Okay, I think you get the point here. Using the Fortran code definitely helped speed things up. However, speed might not be the only reason you choose to link in Fortran code. For example, I know of people who have written the bulk of their thesis analysis work in Fortran and now seek to leverage that effort within R. Sure, they could recode their stuff into R, but that would probably result in lower performance. Any time you have a significant body of work in one language you would like to avoid having to recode it in another. Lastly, there are other ways to bring in Fortran that I haven’t discussed here. The “inline” package allows one to compile Fortran code inline within a given R program, which might be more appealing to some. Hope this has been helpful.