Vectors, Looping, and Performance

Posted: September 7, 2013 in R programming apply lapply tapply
Tags: , , ,

Vectors are at the heart of R and represent a true convenience. Moreover, vectors are essential for good performance especially when your are working with lots of data. We’ll explore these concepts in this posting. As a motivational example let’s generate a sequence of data from -3 to 3. We’ll also use each point as input into the Normal function. Our immediate goal is to draw the classic bell curve that everyone knows. So here we start out with the x values:


x = seq(-3, 3, by=0.05) # The "by" argument let's us specify a "step" or "interval" value

x
  [1] -3.00 -2.95 -2.90 -2.85 -2.80 -2.75 -2.70 -2.65 -2.60 -2.55 -2.50 -2.45
 [13] -2.40 -2.35 -2.30 -2.25 -2.20 -2.15 -2.10 -2.05 -2.00 -1.95 -1.90 -1.85
 [25] -1.80 -1.75 -1.70 -1.65 -1.60 -1.55 -1.50 -1.45 -1.40 -1.35 -1.30 -1.25
 [37] -1.20 -1.15 -1.10 -1.05 -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65
 [49] -0.60 -0.55 -0.50 -0.45 -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05
 [61]  0.00  0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  0.45  0.50  0.55
 [73]  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00  1.05  1.10  1.15
 [85]  1.20  1.25  1.30  1.35  1.40  1.45  1.50  1.55  1.60  1.65  1.70  1.75
 [97]  1.80  1.85  1.90  1.95  2.00  2.05  2.10  2.15  2.20  2.25  2.30  2.35
[109]  2.40  2.45  2.50  2.55  2.60  2.65  2.70  2.75  2.80  2.85  2.90  2.95
[121]  3.00

# Next we generate the corresponding y values

y = dnorm(x)

# Now we plot the data 

par(mfrow=c(1,2))   # Set the plot region to be one row and two columns

plot(x, y, xlim=c(-3,3), ylim=c(0,0.4), main="Normal Curve", col="blue") # Plot 1

plot(x,y,xlim=c(-3,3),ylim=c(0,0.4),main="Normal Curve",col="blue",type="l") # Plot 2

normal_curve1

Note that we could have generated the X values by using the rnorm function and then sorting the results. If we call the rnorm function without specifying a mean and a standard deviation we get back values from a Normal distribution of mean 0 and standard deviation 1. So we could have done the following. Note that this approach results in less “coverage” in the tails but we still get a comparable result.

xvals = rnorm(120)
par(mfrow=c(1,1))
plot(sort(xvals), y= dnorm(sort(xvals)), main="Normal Curve",col="blue")

Next let’s pull out all the x values that are less than -2 or greater than 2. We are in effect looking a the “tails” of the distribution here. Using the bracket notation with vectors is a powerful way to interrogate the data and extract only that which you need. Look how easy this is. In other languages you would have to write a for-loop that uses if statements.

tmp = x[(x < -2) | (x > 2)]

par(mfrow=c(1,2))

plot(tmp,dnorm(tmp),xlim=c(-3,3),ylim=c(0,0.4),
         main="Normal Curve - The Tails",col="blue") # Plot 1

As mentioned previously, newcomers to R often want to write a loop structure to pull out this data. You could certainly do this but it would be overlooking the power of vectors. Still, let’s see what would be involved to extract the information above using the for-loop approach.

hold = vector()
for (ii in 1:length(x)) {
    if ( (x[ii] < -2) | (x[ii] > 2) ) {
       hold[ii] = x[ii]
    }
}

Okay well this will work but it doesn’t generalize very well, (if at all), since we basically have to customize this loop based on what comparisons we want to do. It might be better to put this into a function although it is far better, (as we will soon see), to use the bracket notation available with vectors. Now let’s also pull out the x values that complement the above set. That is let’s get all the x values that are greater than -2 and less than 2. We’ll plot these along side the previous plot.

tmp2 = x[ (x > -2) & (x < 2) ]

plot(tmp2,dnorm(tmp2),xlim=c(-3,3),ylim=c(0,0.4),
     main="Normal Curve - Without Tails",col="blue") # Plot 2

normal_curve2

So let’s reconsider the example where we used the for-loop. Aside from being more complex and less general it also performs much worse than the bracket notation – especially as the size of the vector grows. Let’s regenerate our x values from -3 to 3 but this time we will pick a very small step size which will result in a vector with many more values. So we are still generating values from -3 to 3 but many more than before:

x = seq(-3,3,by=0.0001)

length(x)     # We have 60,001 elements. 
[1] 60001

par(mfrow=c(1,1))
plot(x,dnorm(x))   # Convince yourself that it still plots the Normal curve.

Now we will evaluate each x value by plugging it in to the dnorm function except we will use the for-loop approach. To make it easier to time we will put it into a function:

my.func <- function(x) {
   hold = vector()
   for (ii in 1:length(x)) {
     if ( (x[ii] < -2) | (x[ii] > 2) ) {
        hold[ii] = x[ii]
     }
   }
   return(hold)
}

# So for each x let's get the corresponding y value using this function. Let's time it.

system.time( my.func(x) )   # system.time function does what it's name suggests
   user  system elapsed 
  2.120   0.287   2.406 
 
 # So is this good / bad ? Let's compare it to the bracket notation
 
system.time( x[ (x < -2) | (x > 2) ])
   user  system elapsed 
  0.003   0.001   0.003 

Wow. So the bracket notation is much faster. For vectors of smaller length the difference in time isn’t so large. But as the size of x grows so does the time it takes to process it. To drive this point home let’s look at a larger x vector and see how it performs as compared to the bracket notation.

x = seq(-3,3,by=0.00005)

length(x)
[1] 120001

system.time( my.func(x) )
   user  system elapsed 
  6.363   2.640   8.998 

system.time( x[ (x < -2) | (x > 2) ])
   user  system elapsed 
  0.006   0.000   0.006 
  
# Wow - the bracket notation really does provide much better performance. 

Let’s do some benchmarking and plot the results so we can get a visual representation of the performance of the two approaches. Check out the rbenchhmark package. You will have to install it since it is an add-on package. Use RStudio or the Windows GUI to do this or you can use the install.packages command. This package contains the function called benchmark which allows you to time the execution of any R function including those that you have written.

It replicates the function a specified number of times. You can also compare the performance of more than one function at a time. Let’s start out simple. Here we will time how long it takes to square the x vector. As mentioned this will be replicated some number times with the default being 100. We will get back a variety of information.

x = seq(-3,3,by=0.0001)

benchmark( square = x^2)
    test replications elapsed relative user.self sys.self user.child sys.child
1 square          100   0.018        1     0.015    0.002          0         0
 
# We can compare how long it takes to square it vs cubing it. We can also specify the
# information to be returned

benchmark( square = x^2, cube = x^3, replications=5, columns=c("test","elapsed"))
    test elapsed
2   cube   0.006
1 square   0.001

First we’ll generate four different versions of the x vector corresponding to different step sizes.

hold = lapply(c(0.01, 0.001, 0.0001, 0.00005), function(x) {
   return(seq(-3,3,by=x))
})

length(hold)  # Hold is a list with 4 elements each of which contains a vector
[1] 4

sapply(hold,length)
[1]    601   6001  60001 120001

Okay, second we’ll use this new hold vector as input into the benchmarking function. To make this efficient we’ll use the lapply command. We’ll also create a function to make the timing process more efficient.

timing.func <- function(x) {
  benchmark( replications=5, func = my.func(x), 
             bracket = x[ (x <2) | (x > 2)], columns=c("test","elapsed") ) 
}

( timings = lapply(hold, timing.func)  )
[[1]]
     test elapsed
2 bracket   0.001
1    func   0.008

[[2]]
     test elapsed
2 bracket   0.002
1    func   0.168

[[3]]
     test elapsed
2 bracket   0.018
1    func  10.757

[[4]]
     test elapsed
2 bracket   0.036
1    func  60.141

# Let's pull out the timings for each approach (the function approach or bracket approach)

( mtimings = sapply(timings, function(x) x$elapsed) )

      [,1]  [,2]   [,3]   [,4]
[1,] 0.001 0.002  0.018  0.036
[2,] 0.008 0.168 10.757 60.141

# Let's plot the timings

plot(1:4,mtimings[2,],pch=18,col="red",type="l",xaxt="n",
         main="Timings of bracket notations vs for loop",ylab="Timings in Seconds")

axis(1,1:4,labels=as.character(sapply(hold,length)),xlab=c("vector size in elements"))

points(1:4,mtimings[1,],type="l",col="blue")

legend(1,40,c("for-loop","bracket"),col=c("red","blue"),lty=c(1,1))

bracketfor

Well this shows that the bracket notation works well even as the x vector gets large. The for-loop, however, starts to perform very poorly and will soon be impractical for larger vectors.

Writing loops in C++, Java, C, and FORTRAN is pretty fast which is why people coming to R from those languages tend to want to write loops to process data. Users of MATLAB, however, are used to the idea of vectors, (although C++ supports vectors), and make the transition to R quite nicely. Also keep in mind that matrices, (which are closely related to vectors), can also be processed using for-loops but, as with vectors, it is not recommended. There are plenty of functions that are “vectorized” and customized for use with vectors and matrices. This is why we can use commands like apply to help process data in R.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s