Posts Tagged ‘R Programming’

Conditioning and grouping are two important concepts in graphing that allow us to rapidly refine our understanding of data under consideration. Conditioning, in particular, allows us to view relationships across “panels” with common scales. Each panel contains a plot whose data is “conditional” upon records drawn from the category that supports that particular panel (an example will help make this clear). Grouping allows us to view interactions within a given panel by making distinctions using different plot characters, size, or color. These ideas have been discussed in Cleveland’s book “Visualizing Data”. The Lattice, (aka “Trellis”), graphics package attempts to implement the concepts and values indicated in Cleveland’s book. These ideas can help us find interesting patterns in our data.

However, it is important to point out that all graphics packages available within R, (Base, Lattice, ggplot2, Grid), have their respective strengths and good work can be accomplished using any of them although for the newcomer Base graphics is a reasonable starting point. There is an abundance of information on Base graphics both in printed literature and via Google results so ample support is available. As a newcomer to Base graphics it doesn’t take long before you are creating graphs like that in Figure 1, which is an approximation of a conditioned plot. Strictly speaking it does not adhere to the principles of conditioned plots but its useful to see how close we can get. We have a three “panel” plot of MPG vs Weight from the mtcars data frame which is built-in to R. Each panel contains a scatter plot between MPG vs wt for each cylinder group, (4, 6, or 8). The paneling allows us to conveniently compare the MPG vs Wt relationship across groups.

par(mfrow=c(1,3))

cyls <- split(mtcars, mtcars$cyl)
for (ii in 1:length(cyls)) {
     tmpdf <- cyls[[ii]]
     sname <- names(cyls)[ii]
     plot(tmpdf$wt, tmpdf$mpg, 
          main = paste("MPG vs Wt",sname,"Cyl"),
          ylim = c(0,40), xlab = "Wt / 1,000",
          ylab = "MPG", pch=19, col="blue")
     grid()
}
Figure 1

Figure 1

We see that the higher weights correspond to records with lower a MPG figure. This idea is confirmed by the separate panel plots wherein we see the lower MPG for increasing cylinder count, which is is quite likely correlated with the weight of the automobile. (It is in fact with a correlation coefficient of 0.78). In any case the data in mtcars isn’t terribly difficult to understand, (though that’s not really the point here – we are simply trying to come to terms with how to plot data). We split up the data frame and looped through the resulting list and plotted MPG vs Wt for each subgroup. It is not a bad approach and is one that lends itself well to programming. On the other hand let’s check out how this kind of thing might be done using lattice graphics, which was designed in large part to make conditioned plots really easy. Let’s put it to the test.

library(lattice)
xyplot(mpg~wt | factor(cyl), data=mtcars, pch=19,
                main="MPG vs Wt", xlab="Wt/1,000",  ylab="MPG",layout=c(3,1),type=c("p","g"))
Figure 2

Figure 2

What’s Your Condition ?

This isn’t bad for one line (two if you include the library) statement. Let’s break down what is happening in Figure 2. The “mpg~wt” part of the call to xyplot should be familiar as a formula where mpg represents the y-axis and wt represents the x-axis. The formula interface is commonly used in other R procedures such as lm as well as aggregation functions so if this is new to you then it is important to learn about it.

The part of the call to xyplot that newcomers might not understand is the vertical bar character. The presence of this character indicates that what follows is a “conditioning” variable which is typically a category/factor. It also let’s us know that the resulting plot will contain a number of panels the number of which will correspond to the number of unique values assumed by the conditioning variable(s). In the case of mtcars$cyl this will be three since the unique values assumed by cylinder are 4,6,8. However, we really don’t have to know the number of values in advance. We can just plot it and view the defaults after which we can go in and add the “layout” argument to refine the visual presentation.

unique(mtcars$cyl)
[1] 6 4 8

But it gets better in that we can use more than one conditioning character at a time. We can look at a combination of factors in effect looking at the MPG vs Wt relationship as “cells” of a table. Sound complicated ? Not really. Check this out. We are asking xyplot to present MPG vs wt for each combination of Transmission and Cylinder type. This results in a 2 x 3 grid structure wherein the panels contain a scatterplot.

xyplot(mpg~wt | factor(cyl) + factor(am,labels=c("A","M")),
                data=mtcars, main="MPG vs Wt", 
                xlab="Wt/1,000", ylab="MPG",pch=19,type=c("p","g"))
Figure 3

Figure 3

Notice how lattice takes care of the “paneling” for us and supplies labels based on the given factor combinations. So we have a row for each Transmission type, (automatic or manual) each of which contains three columns/panels corresponding to a cylinder group of 4,6, or 8. This plot makes it trivial to determine what combination gets the best gas mileage assuming that is what we are looking for. The 4 Cylinder cars with Manual transmissions get the best gas mileage. Note that we could also further influence the layout and look of the graph and make adjustments to the “strip” labels in each panel but since this is an introductory article we’ll focus on general capabilities for now.

There is Safety in Groups !

Are there other ways that we enhance the graph to reveal more meaning ? In fact there is. In particular we could employ the idea of “grouping”. However, before apply grouping to the above example, (Figure 3), let’s look at a stand-a-lone case to make sure you understand the concept. Suppose we want to look at the MPG vs wt graph within a single panel though we wish to see points colored according to the Transmission group they occupy. This is easy to accomplish with lattice by using the “groups” argument.

xyplot(mpg~wt,data=mtcars,groups=factor(am,labels=c("A","M")),
       pch=20,auto.key=list(columns=2),type=c("p","g"))
Figure 4

Figure 4

So that was pretty easy wasn’t it ? Instead of viewing the interaction across panels we can see it within a single panel by using the different colors to partition the data into groups. Now let’s say that we wanted the points to reflect the cylinder group they occupy except instead of using color we’ll use point size to indicate the relationship. Here we use a vector from 3 down to 1 in conjunction with the cex argument to reflect the size of the points coming from, respectively, cylinder groups 4, 6, and 8. Points coming from cylinder group 4 are going to be larger than points coming from groups 6 and 8. The MPG is better with 4 cylinder autos and we want that reflected in terms of larger size.

Figure 5

Figure 5

Now if we turn our attention back to the situation wherein we used two factors to condition the MPG vs Wt plots we should be able to combine grouping with conditioning to see if there is anything lurking in the data that we might have missed previously. So what I’ll do here is plot MPG vs Wt as conditioned by Cylinders while grouping by Transmission.

xyplot(mpg ~ wt | factor(cyl), data=mtcars, pch=16,layout=c(3,1),
                  groups = factor(am,labels=c("Auto","Manual")),
                  type=c("p","g"),auto.key = TRUE)
Figure 6

Figure 6

Do we see anything interesting here ? Perhaps. If you have a 4 cylinder car and a manual transmission then you get the best gas mileage. This isn’t anything new. However, if we look at the cars in the 8 cylinder panel we see that cars with a manual transmission, (there are only two of them), don’t seem to perform as well relative to MPG as they did in the 4 cylinder group. In the 6 cylinder panel we see that auto and manual transmissions are closer in MPG than in the 4 cylinder group. So we might form the hypothesis that having a manual transmission is related to significantly better gas mileage only if you have a 4 cylinder car. Or maybe we just need to find more 8 and 6 cylinder cars with manual transmissions to see if the pattern holds. It is important to note that these cars were surveyed in 1974 so our findings would probably change were we to use updated information. Well this is all cool stuff but for now we’ll stick with the graphics details instead of further analysis.

Out of curiosity what would this example look like in Base graphics ? Here is one possibility. I’ll let you make the decision if this is easier than using lattice and if the result is as aesthetically pleasing. Base graphics are amazingly flexible but require more of a programming approach in this case than does lattice although lattice panel functions, (something I will explore in another Blog post), can also require a “programmer’s mentality”.

par(mfrow=c(1,3)) 
mysplits = split(mtcars,mtcars$cyl)
maxmpg = max(mtcars$mpg)
for (ii in 1:length(mysplits)) {
      tmpdf  <- mysplits[[ii]]
      auto <- tmpdf[tmpdf$am == 0,]
      man <- tmpdf[tmpdf$am == 1,]
      plot(tmpdf$wt, tmpdf$mpg,type="n",
           main=paste(names(mysplits[ii])," Cylinders"),
           ylim=c(0,maxmpg), xlab="wt",ylab="MPG")
      points(auto$wt,auto$mpg,col="blue",pch=19)
      points(man$wt,man$mpg,col="pink",pch=19)
      grid()
      legend("topright", inset=0.05, c("manual","auto"),
             pch = 19, col=c("pink","blue"))
}
Figure 7

Figure 7

Well I haven’t been completely forthright here. Base graphics does in fact have a function that is dedicated to conditioned plotting although it doesn’t generalize to other plot types. It does, however, handle X/Y plots quite well. I include it here because it does conform to the principles of conditioned plotting but if you are going to embrace the formal approach for conditioning then you might as well use lattice since it does generalize across various plot types.

coplot(mpg~wt|factor(cyl)+factor(am),data=mtcars)
Figure 8: coplot

Figure 8: coplot

Other Lattice Functions

Okay, you might be wondering if the arguments to xyplot will “carry over” to other lattice plotting functions. The answer is “yes” although they must make sense for the given chart type. For example with histograms or boxplots we are looking at the plot of a single continuous quantity but we can still “condition” that continuous quantity using factors from the dataset. In this next example we look at another internal dataset to R called “ChickWeight”. We’ll present a histogram of the weights of 50 chickens, measured over time, as conditioned by the type of diet they were provided. There are 4 diet types hence we will see 4 panels. All of the considerations important to lattice graphics are intact – common axes, optimal use of space, shared margins, and panels to indicate conditions.

histogram(~weight|factor(Diet),data=ChickWeight,
           main="Weight by Diet Type",xlab="Weight in grams")
Figure 9

Figure 9: lattice histogram

So checkout this example. We create a boxplot of chicken weights conditioned on the passage of time since birth which is measured in days (0,2,…,21).

bwplot(~weight|factor(Time),data=ChickWeight,col="blue",
        main="Weight by Days Since Birth",xlab="Weight in grams")
lattice boxplot

Figure 10: lattice boxplot

Conclusions

So we’ve really only scratched the surface here of what can be done with conditioning and grouping although I feel that this is a reasonable introduction. As I’ve mentioned previously, Base graphics has lots of flexibility and if you become fluent with the high and low level functions offered as part of that environment you can accomplish a lot. Admittedly, if you do not have a programming background then writing looping structures to process data can be a bit intimidating but its worth it to learn. The examples I presented with lattice are to show how convenient it is to do conditioning and grouping but then again if you want to do more customization then that will involve learning a bit more about lattice. So at some point you find yourself in the same position that you would be with Base graphics. It all depends on how detailed and custom you wish your plots to be. The ggplot2 package, which I will discuss in another Blog post, can also handle these activities quite nicely though the jargon is a bit different.

Independently of what graphics package you choose to employ, it is important to realize that flexibility does not always imply ease of use. Like any software package there is absolutely no substitute for experience and getting your hands dirty. R is no exception.

Vectors are at the heart of R and represent a true convenience. Moreover, vectors are essential for good performance especially when your are working with lots of data. We’ll explore these concepts in this posting. As a motivational example let’s generate a sequence of data from -3 to 3. We’ll also use each point as input into the Normal function. Our immediate goal is to draw the classic bell curve that everyone knows. So here we start out with the x values:


x = seq(-3, 3, by=0.05) # The "by" argument let's us specify a "step" or "interval" value

x
  [1] -3.00 -2.95 -2.90 -2.85 -2.80 -2.75 -2.70 -2.65 -2.60 -2.55 -2.50 -2.45
 [13] -2.40 -2.35 -2.30 -2.25 -2.20 -2.15 -2.10 -2.05 -2.00 -1.95 -1.90 -1.85
 [25] -1.80 -1.75 -1.70 -1.65 -1.60 -1.55 -1.50 -1.45 -1.40 -1.35 -1.30 -1.25
 [37] -1.20 -1.15 -1.10 -1.05 -1.00 -0.95 -0.90 -0.85 -0.80 -0.75 -0.70 -0.65
 [49] -0.60 -0.55 -0.50 -0.45 -0.40 -0.35 -0.30 -0.25 -0.20 -0.15 -0.10 -0.05
 [61]  0.00  0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  0.45  0.50  0.55
 [73]  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00  1.05  1.10  1.15
 [85]  1.20  1.25  1.30  1.35  1.40  1.45  1.50  1.55  1.60  1.65  1.70  1.75
 [97]  1.80  1.85  1.90  1.95  2.00  2.05  2.10  2.15  2.20  2.25  2.30  2.35
[109]  2.40  2.45  2.50  2.55  2.60  2.65  2.70  2.75  2.80  2.85  2.90  2.95
[121]  3.00

# Next we generate the corresponding y values

y = dnorm(x)

# Now we plot the data 

par(mfrow=c(1,2))   # Set the plot region to be one row and two columns

plot(x, y, xlim=c(-3,3), ylim=c(0,0.4), main="Normal Curve", col="blue") # Plot 1

plot(x,y,xlim=c(-3,3),ylim=c(0,0.4),main="Normal Curve",col="blue",type="l") # Plot 2

normal_curve1

Note that we could have generated the X values by using the rnorm function and then sorting the results. If we call the rnorm function without specifying a mean and a standard deviation we get back values from a Normal distribution of mean 0 and standard deviation 1. So we could have done the following. Note that this approach results in less “coverage” in the tails but we still get a comparable result.

xvals = rnorm(120)
par(mfrow=c(1,1))
plot(sort(xvals), y= dnorm(sort(xvals)), main="Normal Curve",col="blue")

Next let’s pull out all the x values that are less than -2 or greater than 2. We are in effect looking a the “tails” of the distribution here. Using the bracket notation with vectors is a powerful way to interrogate the data and extract only that which you need. Look how easy this is. In other languages you would have to write a for-loop that uses if statements.

tmp = x[(x < -2) | (x > 2)]

par(mfrow=c(1,2))

plot(tmp,dnorm(tmp),xlim=c(-3,3),ylim=c(0,0.4),
         main="Normal Curve - The Tails",col="blue") # Plot 1

As mentioned previously, newcomers to R often want to write a loop structure to pull out this data. You could certainly do this but it would be overlooking the power of vectors. Still, let’s see what would be involved to extract the information above using the for-loop approach.

hold = vector()
for (ii in 1:length(x)) {
    if ( (x[ii] < -2) | (x[ii] > 2) ) {
       hold[ii] = x[ii]
    }
}

Okay well this will work but it doesn’t generalize very well, (if at all), since we basically have to customize this loop based on what comparisons we want to do. It might be better to put this into a function although it is far better, (as we will soon see), to use the bracket notation available with vectors. Now let’s also pull out the x values that complement the above set. That is let’s get all the x values that are greater than -2 and less than 2. We’ll plot these along side the previous plot.

tmp2 = x[ (x > -2) & (x < 2) ]

plot(tmp2,dnorm(tmp2),xlim=c(-3,3),ylim=c(0,0.4),
     main="Normal Curve - Without Tails",col="blue") # Plot 2

normal_curve2

So let’s reconsider the example where we used the for-loop. Aside from being more complex and less general it also performs much worse than the bracket notation – especially as the size of the vector grows. Let’s regenerate our x values from -3 to 3 but this time we will pick a very small step size which will result in a vector with many more values. So we are still generating values from -3 to 3 but many more than before:

x = seq(-3,3,by=0.0001)

length(x)     # We have 60,001 elements. 
[1] 60001

par(mfrow=c(1,1))
plot(x,dnorm(x))   # Convince yourself that it still plots the Normal curve.

Now we will evaluate each x value by plugging it in to the dnorm function except we will use the for-loop approach. To make it easier to time we will put it into a function:

my.func <- function(x) {
   hold = vector()
   for (ii in 1:length(x)) {
     if ( (x[ii] < -2) | (x[ii] > 2) ) {
        hold[ii] = x[ii]
     }
   }
   return(hold)
}

# So for each x let's get the corresponding y value using this function. Let's time it.

system.time( my.func(x) )   # system.time function does what it's name suggests
   user  system elapsed 
  2.120   0.287   2.406 
 
 # So is this good / bad ? Let's compare it to the bracket notation
 
system.time( x[ (x < -2) | (x > 2) ])
   user  system elapsed 
  0.003   0.001   0.003 

Wow. So the bracket notation is much faster. For vectors of smaller length the difference in time isn’t so large. But as the size of x grows so does the time it takes to process it. To drive this point home let’s look at a larger x vector and see how it performs as compared to the bracket notation.

x = seq(-3,3,by=0.00005)

length(x)
[1] 120001

system.time( my.func(x) )
   user  system elapsed 
  6.363   2.640   8.998 

system.time( x[ (x < -2) | (x > 2) ])
   user  system elapsed 
  0.006   0.000   0.006 
  
# Wow - the bracket notation really does provide much better performance. 

Let’s do some benchmarking and plot the results so we can get a visual representation of the performance of the two approaches. Check out the rbenchhmark package. You will have to install it since it is an add-on package. Use RStudio or the Windows GUI to do this or you can use the install.packages command. This package contains the function called benchmark which allows you to time the execution of any R function including those that you have written.

It replicates the function a specified number of times. You can also compare the performance of more than one function at a time. Let’s start out simple. Here we will time how long it takes to square the x vector. As mentioned this will be replicated some number times with the default being 100. We will get back a variety of information.

x = seq(-3,3,by=0.0001)

benchmark( square = x^2)
    test replications elapsed relative user.self sys.self user.child sys.child
1 square          100   0.018        1     0.015    0.002          0         0
 
# We can compare how long it takes to square it vs cubing it. We can also specify the
# information to be returned

benchmark( square = x^2, cube = x^3, replications=5, columns=c("test","elapsed"))
    test elapsed
2   cube   0.006
1 square   0.001

First we’ll generate four different versions of the x vector corresponding to different step sizes.

hold = lapply(c(0.01, 0.001, 0.0001, 0.00005), function(x) {
   return(seq(-3,3,by=x))
})

length(hold)  # Hold is a list with 4 elements each of which contains a vector
[1] 4

sapply(hold,length)
[1]    601   6001  60001 120001

Okay, second we’ll use this new hold vector as input into the benchmarking function. To make this efficient we’ll use the lapply command. We’ll also create a function to make the timing process more efficient.

timing.func <- function(x) {
  benchmark( replications=5, func = my.func(x), 
             bracket = x[ (x <2) | (x > 2)], columns=c("test","elapsed") ) 
}

( timings = lapply(hold, timing.func)  )
[[1]]
     test elapsed
2 bracket   0.001
1    func   0.008

[[2]]
     test elapsed
2 bracket   0.002
1    func   0.168

[[3]]
     test elapsed
2 bracket   0.018
1    func  10.757

[[4]]
     test elapsed
2 bracket   0.036
1    func  60.141

# Let's pull out the timings for each approach (the function approach or bracket approach)

( mtimings = sapply(timings, function(x) x$elapsed) )

      [,1]  [,2]   [,3]   [,4]
[1,] 0.001 0.002  0.018  0.036
[2,] 0.008 0.168 10.757 60.141

# Let's plot the timings

plot(1:4,mtimings[2,],pch=18,col="red",type="l",xaxt="n",
         main="Timings of bracket notations vs for loop",ylab="Timings in Seconds")

axis(1,1:4,labels=as.character(sapply(hold,length)),xlab=c("vector size in elements"))

points(1:4,mtimings[1,],type="l",col="blue")

legend(1,40,c("for-loop","bracket"),col=c("red","blue"),lty=c(1,1))

bracketfor

Well this shows that the bracket notation works well even as the x vector gets large. The for-loop, however, starts to perform very poorly and will soon be impractical for larger vectors.

Writing loops in C++, Java, C, and FORTRAN is pretty fast which is why people coming to R from those languages tend to want to write loops to process data. Users of MATLAB, however, are used to the idea of vectors, (although C++ supports vectors), and make the transition to R quite nicely. Also keep in mind that matrices, (which are closely related to vectors), can also be processed using for-loops but, as with vectors, it is not recommended. There are plenty of functions that are “vectorized” and customized for use with vectors and matrices. This is why we can use commands like apply to help process data in R.