HexBin Chart of Zika Cases in Brazil


I really thought I was done with the Express dplyr series, but on completion of the second part I received many messages requesting more examples of using dplyr with ggplot, along with other data sets such as the Zika virus data, which can be downloaded from GitHub. These examples are not drastically different from previous ones, although they let me show how each data set presents its own challenges and how transformations to the data are necessary before you can visualize the information. The Zika data spans a number of countries but we will focus solely on cases in Brazil. Download the data using a Git client or straight from the site itself. If you are not using Git then you are missing out (see my YouTube playlist for a basic introduction). It’s a great way to manage your R code and larger projects. Also keep in mind that the RStudio IDE has built-in Git capabilities, so as you create software it is extremely easy to “push” changes to your reference code and equally easy for users of your code to “pull” changes down.

Processing the Data

Brazil Zika Data

Okay, we have seven .CSV files, each of which represents some Zika information for a given date. (Note that more files may have been added since I wrote this post.) The format of the data isn’t particularly complicated, although it isn’t standard in the sense that each row in a file represents the same type of information. In general you never really know what the data contains (or not) until you start working with it. There are a number of ways to approach this scenario. Ideally, if there is a code book that explains the format then start there, although it turns out that this data is simple enough to figure out. We could read in the data for a specific date, e.g. 2016-04-02.csv, but my inclination is to read in all the files and bind them together into a single data frame so I have everything I need in one spot. This isn’t strictly necessary, but since the data itself is not very large it is a reasonable approach.

library(dplyr)

# You will need to specify your own path to reflect your download location

path <- "/Users/fender/Downloads/zika/Brazil/Epidemiological_Bulletin/data"
setwd(path)

temp_files <- list.files(pattern="\\.csv$")   # match only files ending in .csv
temp_files
[1] "Epidemiological_Bulletin-2016-04-02.csv" "Epidemiological_Bulletin-2016-04-23.csv"
[3] "Epidemiological_Bulletin-2016-04-30.csv" "Epidemiological_Bulletin-2016-05-07.csv"
[5] "Epidemiological_Bulletin-2016-05-14.csv" "Epidemiological_Bulletin-2016-05-21.csv"
[7] "Epidemiological_Bulletin-2016-05-28.csv"

# Next we read in all seven files; the contents of each will go into a
# list element

myfiles <- lapply(temp_files,read.csv,stringsAsFactors=FALSE) 

str(myfiles,1)
List of 7
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:

The result of the list.files() function is a character vector containing the names of all the .CSV files in the current working folder. We then read all of these files in using the lapply() function. We could have used a for loop construct to do the same thing, although the former approach is a more “R-like” way to do things. If you don’t yet understand lapply() or need a review of what it does then please see my posting which explains it in considerable detail. It is a very cool function in R that is used in many places, so you definitely want to get to know how it works. Each element of the resulting list contains a data frame that in turn holds the contents of one of the seven .CSV files. Let’s inspect the first two lines of the first two list elements to get an idea about the Zika data.
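
For comparison, here is what the equivalent for loop might look like. This is just a sketch that assumes the temp_files vector created above; it builds the same list that the single lapply() call gives us.

myfiles_loop <- list()
for (ii in seq_along(temp_files)) {
  myfiles_loop[[ii]] <- read.csv(temp_files[ii], stringsAsFactors=FALSE)
}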

# First let's get the column names for the data frames

lapply(myfiles,names)[1]
[[1]]
[1] "report_date"      "location"         "location_type"    "data_field"      
[5] "data_field_code"  "time_period"      "time_period_type" "value"           
[9] "unit"      

# Now look at the first two lines of the first two list elements
lapply(myfiles,head,2)[1:2]
[[1]]
  report_date        location location_type    data_field data_field_code time_period
1  2016-04-02           Norte        region zika_reported          BR0011          NA
2  2016-04-02 Brazil-Rondonia         state zika_reported          BR0011          NA
  time_period_type value  unit
1               NA  6295 cases
2               NA   618 cases

[[2]]
  report_date    location location_type    data_field data_field_code time_period
1  2016-04-23       Norte        region zika_reported          BR0011          NA
2  2016-04-23 Brazil-Acre         state zika_reported          BR0011          NA
  time_period_type value  unit
1               NA  8545 cases
2               NA   716 cases</pre>

So we could work with each file/data frame individually, although I want to create one large data structure. The do.call() function in R is used for a number of things but is frequently used to “bind” together individual list elements to form a single data structure, so it’s perfect for this job.

# The following will take all the individual list elements each of which are
# individual data frames and bind them into a single data frame

brazil <- do.call(rbind,myfiles)

# Let's turn the report_date variable into a "real" date

brazil <- brazil %>% mutate(report_date = as.Date(report_date))

glimpse(brazil)
Observations: 231
Variables: 9
$ report_date      <date> 2016-04-02, 2016-04-02, 2016-04-02, 2016-04-02, 2016-04-02,...
$ location         <chr> "Norte", "Brazil-Rondonia", "Brazil-Acre", "Brazil-Amazonas"...
$ location_type    <chr> "region", "state", "state", "state", "state", "state", "stat...
$ data_field       <chr> "zika_reported", "zika_reported", "zika_reported", "zika_rep...
$ data_field_code  <chr> "BR0011", "BR0011", "BR0011", "BR0011", "BR0011", "BR0011", ...
$ time_period      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ time_period_type <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ value            <int> 6295, 618, 375, 1520, 44, 771, 74, 2893, 30286, 1202, 7, 156...
$ unit             <chr> "cases", "cases", "cases", "cases", "cases", "cases", "cases...

If we use the glimpse() function to examine this data frame we see some basic variables such as report_date, location, and value, which appears to represent the number of cases for a given location. The location_type variable tells us whether the location is a state or a region, with the latter being a container of possibly many states. The way I understand the information to be organized is that on a certain date a .CSV file was uploaded that contained specific locations within Brazil and the number of reported Zika cases at that time. For each subsequent date (file) the same thing happened, with the new case numbers being a “delta”, or adjustment, to the previously reported numbers. We now have all of that in a single data frame. Let’s work with this data. We will remove columns 6 and 7 since all they contain are missing values anyway, as confirmed by the glimpse() results above.

# Turn the data frame into a dplyr data table and get rid of
# columns 6 and 7

brazil <- brazil %>% select(-(6:7)) 

# Let's look at the first twenty rows (columns 6 and 7 have already been
# dropped above since they had all missing values anyway)

brazil %>% slice(1:20) 

# A tibble: 20 x 7
   report_date                   location location_type    data_field data_field_code value  unit
        <date>                      <chr>         <chr>         <chr>           <chr> <int> <chr>
1   2016-04-02                      Norte        region zika_reported          BR0011  6295 cases
2   2016-04-02            Brazil-Rondonia         state zika_reported          BR0011   618 cases
3   2016-04-02                Brazil-Acre         state zika_reported          BR0011   375 cases
4   2016-04-02            Brazil-Amazonas         state zika_reported          BR0011  1520 cases
5   2016-04-02             Brazil-Roraima         state zika_reported          BR0011    44 cases
6   2016-04-02                Brazil-Para         state zika_reported          BR0011   771 cases
7   2016-04-02               Brazil-Amapa         state zika_reported          BR0011    74 cases
8   2016-04-02           Brazil-Tocantins         state zika_reported          BR0011  2893 cases
9   2016-04-02                   Nordeste        region zika_reported          BR0011 30286 cases
10  2016-04-02            Brazil-Maranhao         state zika_reported          BR0011  1202 cases
11  2016-04-02               Brazil-Piaui         state zika_reported          BR0011     7 cases
12  2016-04-02               Brazil-Ceara         state zika_reported          BR0011   156 cases
13  2016-04-02 Brazil-Rio_Grande_do_Norte         state zika_reported          BR0011   640 cases
14  2016-04-02             Brazil-Paraiba         state zika_reported          BR0011  1060 cases
15  2016-04-02          Brazil-Pernambuco         state zika_reported          BR0011   333 cases
16  2016-04-02             Brazil-Alagoas         state zika_reported          BR0011  1479 cases
17  2016-04-02             Brazil-Sergipe         state zika_reported          BR0011   348 cases
18  2016-04-02               Brazil-Bahia         state zika_reported          BR0011 25061 cases
19  2016-04-02                    Sudeste        region zika_reported          BR0011 35505 cases
20  2016-04-02        Brazil-Minas_Gerais         state zika_reported          BR0011  6693 cases 

Understanding the Data

This data frame, though simple, is a little confusing. For each region a cumulative number of cases is presented, which represents the sum of all the constituent states occurring in some number of rows below that record. For example, the first row of the data frame summarizes reported cases for the “Norte” region on 04/02/16, and rows 2-8 are the constituent states and their respective case totals. So the sum of the cases in rows 2-8 is the same as the case number reported on row 1 (the quick check below confirms this). There are then other entries for the “Norte” region for later reporting dates further down the table. If you look at row 19 you see the same thing happening for the “Sudeste” region. For someone wanting summary information this is fine, but it makes the resulting data frame non-standard. As an analyst you would probably want to have the region as a factor, that is, another column in the data frame. Totals can easily be regenerated from that format so we lose nothing. But even with the current format we can still work with the data. For example, let’s pull out the totals for all regions and plot them.
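
Before doing that, here is the quick check promised above; a sketch that relies on the row positions just described, where the sum of the state rows 2 through 8 should match the region value in row 1.

# Sum of the state rows beneath the first "Norte" region row ...
brazil %>% slice(2:8) %>% summarize(states_total=sum(value))

# ... compared with the region row itself
brazil %>% slice(1) %>% select(location,value)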

# For each reporting date we have the case numbers for each of the 5 regions.
# Since there are 7 dates/files, we get 35 rows in this table

brazil %>% filter(location_type=="region")
# A tibble: 35 x 7
   report_date     location location_type    data_field data_field_code value  unit
        <date>        <chr>         <chr>         <chr>           <chr> <int> <chr>
1   2016-04-02        Norte        region zika_reported          BR0011  6295 cases
2   2016-04-02     Nordeste        region zika_reported          BR0011 30286 cases
3   2016-04-02      Sudeste        region zika_reported          BR0011 35505 cases
4   2016-04-02          Sul        region zika_reported          BR0011  1797 cases
5   2016-04-02 Centro-Oeste        region zika_reported          BR0011 17504 cases
6   2016-04-23        Norte        region zika_reported          BR0011  8545 cases
7   2016-04-23     Nordeste        region zika_reported          BR0011 43000 cases
8   2016-04-23      Sudeste        region zika_reported          BR0011 46318 cases
9   2016-04-23          Sul        region zika_reported          BR0011  2197 cases
10  2016-04-23 Centro-Oeste        region zika_reported          BR0011 20101 cases
# ... with 25 more rows

# Let's plot it (load ggplot2 for this and the charts that follow)

library(ggplot2)
brazil %>% filter(location_type=="region") %>% 
           ggplot(aes(x=report_date, y=value,
                        group=location, color=location)) + 
           geom_line() +  
           geom_point() +
           ggtitle("Brazil Zika Cases by Region")

 

Zika Line Chart

It looks like the Southeast and Northeast have the most reported cases by far and the largest increase in reported cases over the available date range. This doesn’t necessarily mean that the actual number of cases in those regions changed that much over the period. It is possible that those cases were there all along and the collecting and reporting mechanisms simply caught up with the actual number of cases. (Note that I have no idea whether this was true or not.) Let’s do a bar chart of this information just for fun (and also to demonstrate an eccentricity of ggplot).

# Let's just pull out the regions into a table of its own

region <- brazil %>% filter(location_type=="region")

region %>% 
  ggplot(aes(x=location,y=value)) + geom_bar(stat="identity") +
  ylab("Number of Reported Cases") + xlab("Region") + 
  ggtitle("Reported Brazil Zika Cases 04/02/16 - 2016-05-28")

Zika Bar Chart

Dealing with Factors

You saw something like this in the previous posting on dplyr. It’s a bar plot and it’s a bit unsatisfying since the bars are not in order according to the number of cases. The data is easy enough to understand as is, but I would prefer the bars ordered from most cases to least. To accomplish this requires creating a factor out of the location variable based on the total number of reported cases. First, here is the solution:

region %>% 
  slice(1:length(unique(region$location))) %>% 
  arrange(desc(value)) %>%
  mutate(location=factor(location,levels=location,ordered=TRUE)) %>%
  ggplot(aes(x=location,y=value)) + geom_bar(stat="identity") +
  ylab("Number of Reported Cases") + xlab("Region") + 
  ggtitle("Reported Brazil Zika Cases 04/02/16 - 2016-05-28")

Sorted Zika Cases

So how does this work ? Well check this out:


# This gets us the unique locations

region %>% 
    slice(1:length(unique(region$location)))
  report_date     location location_type    data_field data_field_code value  unit
1  2016-04-02        Norte        region zika_reported          BR0011  6295 cases
2  2016-04-02     Nordeste        region zika_reported          BR0011 30286 cases
3  2016-04-02      Sudeste        region zika_reported          BR0011 35505 cases
4  2016-04-02          Sul        region zika_reported          BR0011  1797 cases
5  2016-04-02 Centro-Oeste        region zika_reported          BR0011 17504 cases

# This arranges the unique locations by number of reported cases (descending)

region %>% 
     slice(1:length(unique(region$location))) %>% 
     arrange(desc(value))
  report_date     location location_type    data_field data_field_code value  unit
1  2016-04-02      Sudeste        region zika_reported          BR0011 35505 cases
2  2016-04-02     Nordeste        region zika_reported          BR0011 30286 cases
3  2016-04-02 Centro-Oeste        region zika_reported          BR0011 17504 cases
4  2016-04-02        Norte        region zika_reported          BR0011  6295 cases
5  2016-04-02          Sul        region zika_reported          BR0011  1797 cases

# So now we can create a factor out of the location variable using this order
# Notice how location is now an ordered factor

region %>% 
     slice(1:length(unique(region$location))) %>% 
     arrange(desc(value)) %>%
     mutate(location=factor(location,levels=location,ordered=TRUE)) %>% glimpse()
Observations: 5
Variables: 7
$ report_date     <date> 2016-04-02, 2016-04-02, 2016-04-02, 2016-04-02, 2016-04-02
$ location        <ord> Sudeste, Nordeste, Centro-Oeste, Norte, Sul
$ location_type   <chr> "region", "region", "region", "region", "region"
$ data_field      <chr> "zika_reported", "zika_reported", "zika_reported", "zika_re...
$ data_field_code <chr> "BR0011", "BR0011", "BR0011", "BR0011", "BR0011"
$ value           <int> 35505, 30286, 17504, 6295, 1797
$ unit            <chr> "cases", "cases", "cases", "cases", "cases"
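
As an aside, the same ordering can be had more concisely by reordering the factor inside aes() with the reorder() function; a sketch using the sliced region table from above:

region %>% 
     slice(1:length(unique(region$location))) %>% 
     ggplot(aes(x=reorder(location,-value),y=value)) + 
     geom_bar(stat="identity") +
     ylab("Number of Reported Cases") + xlab("Region") + 
     ggtitle("Reported Brazil Zika Cases 2016-04-02 - 2016-05-28")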

Standardizing the Data Format

We could continue to work with the data this way, with summaries for regions and the country sitting alongside entries for individual states. But this mixture “violates” the standards of a “tidy” data frame. What I propose is to remove the summary rows for the regions and the country and instead add a factor column that gives the region for each state. This is great because 1) all rows then represent the same type of information and 2) we can easily recompute the totals for each region. There are a number of ways to do this; here I’m going with a quick method using a loop (a loop-free sketch follows the code below).


# Here we will pull out these totals so we can later check our work
brazil_totals <- brazil %>% filter(location=="Brazil") 
region_totals <- brazil %>% filter(location_type=="region") %>%
  group_by(report_date,location) %>%  
  summarize(tot=sum(value)) 

# Here we standardize the data frame and remove all summary rows because
# we can easily (re)compute any summary totals that we want

regvec <- vector()  # A vector to hold the region for a given state
length(regvec) <- nrow(brazil)

# Walk the rows in order: a region row sets the current label and each state
# row that follows inherits it (this relies on a region row always preceding
# its states, which is how these files are laid out)
for (ii in 1:nrow(brazil)) {
  if (brazil[ii,]$location_type != "region")  {
       regvec[ii] <- newlab
  } else {
       newlab <- brazil[ii,]$location
       regvec[ii] <- newlab
  }
}

# Bind the region vector to the brazil data frame

statedf <- cbind(brazil,regvec)

# Next we eliminate the summary rows for the country and regions

statedf <- statedf %>% filter(location != "Brazil") 
statedf <- statedf %>% filter(location_type != "region") 

# Let's generate the region totals from the newly transformed data to make
# sure it matches what we had before. 
statedf %>% group_by(report_date,regvec) %>% 
            summarize(tot=sum(value)) -> totals

# Do a spot check on the totals - compare this to the original data frame 
# totals. There are differences in the column names but other than that it's
# the same

all.equal(totals,region_totals)
[1] "Cols in y but not x: location. Cols in x but not y: regvec. "

# Let's take a sampling from this new data frame. Note how there is now
# a factor column that shows us what region a given state location belongs to

statedf %>% select(report_date,location,value,regvec) %>% sample_n(10)
    report_date                   location value   regvec
22   2016-04-02      Brazil-Santa_Catarina    62      Sul
174  2016-05-28             Brazil-Paraiba  2865 Nordeste
173  2016-05-28 Brazil-Rio_Grande_do_Norte  2312 Nordeste
140  2016-05-21                Brazil-Para  1583    Norte
48   2016-04-23              Brazil-Parana  1847      Sul
163  2016-05-28            Brazil-Rondonia  1032    Norte
101  2016-05-07           Brazil-Sao_Paulo  3452  Sudeste
136  2016-05-21            Brazil-Rondonia   974    Norte
67   2016-04-30          Brazil-Pernambuco   450 Nordeste
92   2016-05-07 Brazil-Rio_Grande_do_Norte  1757 Nordeste
# The following will give the same line plot as above just to show you that we
# can easily regenerate the totals with no problem.

statedf %>% group_by(report_date,regvec) %>% summarize(cases=sum(value)) %>% 
   ggplot(aes(x=report_date,y=cases,group=regvec,color=regvec)) + 
   geom_line() + 
   geom_point() + 
   ggtitle("Brazil Zika Cases by Region")

Drawing a Cluster Map

Okay, what next? Well, how about generating a Leaflet cluster map where we see the number of cases reflected in a given region? If we zoom in we can then see the sub-clusters that make up the larger cluster. This is a cool feature of Leaflet. But to do this we will need to geocode the locations given in the data frame. This is easy to do since the ggmap package has geocoding capability built into it courtesy of the geocode() function, which uses Google to do the work. See how easy this is:

library(ggmap)
longlat <- geocode(unique(statedf$location)) %>% 
  mutate(loc=unique(statedf$location)) 

# Let's join this geocoded information to the statedf and save it into
# another data frame for mapping

statedf %>% filter(as.character(report_date)=="2016-05-28") %>% 
  group_by(location) %>% summarize(cases=sum(value)) %>% 
  inner_join(longlat,by=c("location"="loc")) %>% 
  mutate(LatLon=paste(lat,lon,sep=":")) -> formapping

head(formapping) 
# A tibble: 6 x 5
         location cases       lon       lat              LatLon
            <chr> <int>     <dbl>     <dbl>               <chr>
1     Brazil-Acre   843 -70.00000  -9.00000              -9:-70
2  Brazil-Alagoas  3577 -36.00000  -9.00000              -9:-36
3    Brazil-Amapa   189 -52.00000   1.00000               1:-52
4 Brazil-Amazonas  2176 -63.00000  -5.00000              -5:-63
5    Brazil-Bahia 45419 -38.51083 -12.97111 -12.97111:-38.51083
6    Brazil-Ceara  2144 -38.54306  -3.71722  -3.71722:-38.54306
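
To draw the cluster map we hand this geocoded table to leaflet; a sketch along the following lines produces a map like the one shown below.

library(leaflet)

leaflet(formapping) %>% addTiles() %>% 
  addMarkers(lng=~lon, lat=~lat, clusterOptions=markerClusterOptions())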


Cluster Map of Brazil Zika Cases

Wait. This doesn’t look right, because we know we have many more cases in many more places. What happened? The way the cluster mapping works is that it looks at each row, with its corresponding latitude and longitude, and clusters together nearby rows. The data we mapped has each state listed once along with its location, so no wonder we get this limited output. The cases column tells us how many cases occurred at each location, so we need to generate a data frame with one row per case for each location so the Leaflet cluster map will look reasonable. There are a number of ways to do this. Here we take the row numbers of the data frame and replicate each row as many times as its case count.

num_of_times_to_repeat <- formapping$cases
long_formapping <- formapping[rep(seq_len(nrow(formapping)),
                                  num_of_times_to_repeat),]

head(long_formapping)
# A tibble: 6 x 5
     location cases   lon   lat LatLon
        <chr> <int> <dbl> <dbl>  <chr>
1 Brazil-Acre   843   -70    -9 -9:-70
2 Brazil-Acre   843   -70    -9 -9:-70
3 Brazil-Acre   843   -70    -9 -9:-70
4 Brazil-Acre   843   -70    -9 -9:-70
5 Brazil-Acre   843   -70    -9 -9:-70
6 Brazil-Acre   843   -70    -9 -9:-70

# ah so this is the format we need

leaflet(long_formapping) %>% addTiles() %>% 
  addMarkers(lng=~lon, lat=~lat, clusterOptions=markerClusterOptions())

Accurate Cluster Map

This still isn’t ideal, because if we click on one of the clusters to see the constituent sub-clusters the performance can be slow, especially in regions where there are lots of cases (like the Southeast and Northeast). Why don’t we make a hexbin chart, which might be more useful for representing the distribution of reported Zika cases in Brazil? Hexbin charts are an extension of the heat map concept: the map is divided into bins whose color captures the density or count in that area, and the hexagonal shape is closer to a circle than a square is. What we will see is that the hex shapes on the map are colored by the count of Zika cases in that area. You will need to install the hexbin package prior to use. It might arguably be better to use a density geometry or a choropleth map (see comments), though in creating the long_formapping data frame you will notice that it repeats the same latitude/longitude pair many times, so in effect any resulting map is basically going to represent counts of the same locales. We could jitter the lat/lon pairs a bit to get a better data set for a density or choropleth approach (a sketch of the density idea follows the hexbin chart below). For now we will stick with the hexbin since it overlays the map nicely.

library(hexbin)
bins <- 25   # number of hexagonal bins; adjust to taste
mymap <- qmap(c(lon=-56.09667,lat=-15.59611),zoom=4)
mymap + 
  geom_hex(bins=bins,data = long_formapping, aes(x = lon, y = lat),alpha=.6,
           inherit.aes=FALSE) + 
  geom_point(data = long_formapping, aes(x = lon, y=lat),
             inherit.aes=FALSE,position = "jitter")+
  scale_fill_gradient(low = "blue", high = "red")

HexBin Chart of Zika Cases in Brazil
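
For completeness, here is the density sketch mentioned above: jitter the repeated coordinates a little so the 2D density estimate has some spread to work with, then overlay filled contours on the same base map. The jitter amount of 0.3 degrees is an arbitrary choice.

jittered <- long_formapping %>% 
  mutate(lon=jitter(lon,amount=0.3),
         lat=jitter(lat,amount=0.3))

mymap + 
  stat_density2d(data=jittered, aes(x=lon,y=lat,fill=..level..),
                 geom="polygon", alpha=0.3, inherit.aes=FALSE) + 
  scale_fill_gradient(low="blue", high="red")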

As usual there are many other ways to kick this data around, but I hope I have given you a realistic introduction to working with data that isn’t custom made for summary. Most of the data you will encounter will require cleaning and manipulation before you arrive at the best format for visualization and analysis. Speaking of analysis, it can also be the case that you will need to juggle data formats around a few times to accommodate your interests; often there is no single ideal format that serves every purpose. This is why it is essential to become experienced with tools like dplyr, the lapply command, and the R programming constructs that glue it all together. Thanks for reading. Steve Pittard © 2016. See my website for information on consulting and R training services.

Express dplyr Part II

Posted: July 7, 2016 in Data Mining, dplyr

This is Part II of the “Express dplyr” posting. If you haven’t already, you might want to review Part I of this topic before proceeding, although if you have some knowledge of dplyr then by all means proceed. Don’t worry, there are no tests! In this second part we will work with some “real” data to help solidify our understanding of the dplyr concepts we studied previously. Note that I put these examples together very quickly, which means there are probably opportunities for improvement and optimization, but I constructed them with newcomers to dplyr in mind.

The Bay Area Bike Share program allows Bay area residents to use bicycles for commuting in the area. Here is a description of the service from the website:

Users of this service can purchase an annual membership online or get a 24-hour or 3-day membership from any station kiosk. 24-hour and 3-day members will receive a ride code from the station kiosk. Enter this code on the keypad next to any available bike. Annual members can bypass the kiosk and insert their membership key at any available dock. Wait for the green light, and pull the bike out while lifting the seat. After your trip, return the bike to any station in your service area. Push the bike firmly into an available dock, and wait for the green light to confirm your trip has ended. Enjoy unlimited trips up to 30 minutes each at no additional cost! Remember, trips longer than 30 minutes will incur overtime fees. 24-hour and 3-day members must request a new ride code for each trip by swiping their credit or debit card at the kiosk

I obtained data relating to this project from the Kaggle Project page. It represents anonymized Bike Data usage from 2013-2015. The data is about 660MB and you can download it though you will need a Kaggle account. The download contains an SQLite Database which is very cool though we’ll ignore that for now and work with the CSV files.

system("unzip -l sf-bay-area-bike-share.zip")
Archive:  sf-bay-area-bike-share.zip
  Length     Date   Time     Name
 --------    ----   ----     ----
2712824832  06-14-16 04:18   database.sqlite
     5647   06-14-16 04:18   station.csv
1989696383  06-14-16 04:18   status.csv
 80208848   06-14-16 04:18   trip.csv
   438063   06-14-16 04:18   weather.csv
 --------                    -------
4783173773                   5 files

# Let's pull out the CSV files of interest 

system("unzip sf-bay-area-bike-share.zip station.csv trip.csv weather.csv")
Archive:  sf-bay-area-bike-share.zip
  inflating: station.csv             
  inflating: trip.csv                
  inflating: weather.csv

library(dplyr)
library(readr)

# Read in the station data
stations <- read_csv("station.csv")  

# Read in the trip date - you might get some messages about missing zipcodes
# but don't worry if you do

trips    <- read_csv("trip.csv")

str(stations)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	70 obs. of  7 variables:
 $ id               : int  2 3 4 5 6 7 8 9 10 11 ...
 $ name             : chr  "San Jose Diridon Caltrain Station" "San Jose Civic Center" "Santa Clara at Almaden" "Adobe on Almaden" ...
 $ lat              : num  37.3 37.3 37.3 37.3 37.3 ...
 $ long             : num  -122 -122 -122 -122 -122 ...
 $ dock_count       : int  27 15 11 19 15 15 15 15 15 19 ...
 $ city             : chr  "San Jose" "San Jose" "San Jose" "San Jose" ...
 $ installation_date: chr  "8/6/2013" "8/5/2013" "8/6/2013" "8/5/2013" ...

str(trips)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	669959 obs. of  11 variables:
 $ id                : int  4576 4607 4130 4251 4299 4927 4500 4563 4760 4258 ...
 $ duration          : int  63 70 71 77 83 103 109 111 113 114 ...
 $ start_date        : chr  "8/29/2013 14:13" "8/29/2013 14:42" "8/29/2013 10:16" "8/29/2013 11:29" ...
 $ start_station_name: chr  "South Van Ness at Market" "San Jose City Hall" "Mountain View City Hall" "San Jose City Hall" ...
 $ start_station_id  : int  66 10 27 10 66 59 4 8 66 10 ...
 $ end_date          : chr  "8/29/2013 14:14" "8/29/2013 14:43" "8/29/2013 10:17" "8/29/2013 11:30" ...
 $ end_station_name  : chr  "South Van Ness at Market" "San Jose City Hall" "Mountain View City Hall" "San Jose City Hall" ...
 $ end_station_id    : int  66 10 27 10 67 59 5 8 66 11 ...
 $ bike_id           : int  520 661 48 26 319 527 679 687 553 107 ...
 $ subscription_type : chr  "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
 $ zip_code          : int  94127 95138 97214 95060 94103 94109 95112 95112 94103 95060 ...

With regard to the two files above, a possible linking key for a join or merge of these two data frames is the id column from stations and the start_station_id and/or end_station_id from trips. However, it might not be necessary to join the data just yet, as some questions can be answered by referencing a single table. In terms of starting to explore the data, just begin thinking of some fundamental questions to get the creative juices flowing. Unless you have been given a specific assignment or set of questions there is no “right way” to do something like this. One of the biggest problems I see with students is that if you present them with an open-ended opportunity for investigation they freeze up. They would much rather be given a question set and work against that, but the “real world” isn’t really like that. Just dive in and you will be fine.

Asking Some Questions

How many bikes are there ?

Bikes have a unique identifier and are used any number of times by subscribers. The trips data table has a record of what bike was used in a given trip so we can select that column and then pipe into the distinct() function which will give us a list of all unique bike ids used. This results in a data table whose output we pipe into the nrow() function to count the number of unique bikes used.

trips %>% select(bike_id) %>% distinct()
Source: local data frame [700 x 1]

   bike_id
     (int)
1      520
2      661
3       48
4       26
5      319
6      527
7      679
8      687
9      553
10     107
..     ...

# But if we wanted a single result we could do this

trips %>% select(bike_id) %>% distinct() %>% nrow()
[1] 700
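
As an aside, dplyr's n_distinct() helper collapses the select/distinct/nrow chain into a single summarize() call; a sketch:

trips %>% summarize(number_of_bikes=n_distinct(bike_id))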

# So how many times was each bike used ? 

trips %>% group_by(bike_id) %>% summarize(times_used=n()) %>% arrange(desc(times_used))
Source: local data frame [700 x 2]

   bike_id times_used
     (int)      (int)
1      392       2061
2      489       1975
3      558       1955
4      267       1951
5      631       1948
6      518       1942
7      532       1933
8      592       1932
9      395       1927
10     368       1926
..     ...        ...

I should point out here that dplyr offers a function called count() that will do tallying/counting of data and also accomplish the group_by() and summarize() steps underneath the hood. So when determining the number of times each bike was used consider this:

trips %>% count(bike_id,sort=TRUE)
Source: local data frame [700 x 2]

   bike_id     n
     (int) (int)
1      392  2061
2      489  1975
3      558  1955
4      267  1951
5      631  1948
6      518  1942
7      532  1933
8      592  1932
9      395  1927
10     368  1926
..     ...   ...

Finding Missing Data ?

Since there were some apparent problems when reading in the trips data, let’s see if we have a problem with complete cases. I suspect there are missing values in some of the zip codes. In native R there are functions like complete.cases(), is.na(), and na.omit() that can help you figure out the “missingness” of your data. The following example also gives me an opportunity to show how dplyr can intersect with native R commands. We will see that the zip_code column has 17,685 missing values. Note that what I do here is pipe the trips data table into a non-dplyr function.

So what columns in the trips data frame contain missing values and if so how many ?

# A pure native R solution
sapply(trips,function(x) sum(is.na(x)))
                id           duration         start_date 
                 0                  0                  0 
start_station_name   start_station_id           end_date 
                 0                  0                  0 
  end_station_name     end_station_id            bike_id 
                 0                  0                  0 
 subscription_type           zip_code 
                 0              17685 


# A dplyr equivalent

trips %>% sapply(function(x) sum(is.na(x)))
                id           duration         start_date start_station_name 
                 0                  0                  0                  0 
  start_station_id           end_date   end_station_name     end_station_id 
                 0                  0                  0                  0 
           bike_id  subscription_type           zip_code 
                 0                  0              17685 
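
If you would rather stay entirely inside dplyr, the summarise_each() helper (current when this was written; more recent dplyr versions use across() instead) computes the same per-column counts. A sketch:

trips %>% summarise_each(funs(sum(is.na(.))))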


So how many cities are covered by the service ? How many stations per city are there ?

stations %>% count(city)
Source: local data frame [5 x 2]

           city     n
          (chr) (int)
1 Mountain View     7
2     Palo Alto     5
3  Redwood City     7
4 San Francisco    35
5      San Jose    16

# We could also sort the result from highest count to lowest

 stations %>% count(city,sort=TRUE)
Source: local data frame [5 x 2]

           city     n
          (chr) (int)
1 San Francisco    35
2      San Jose    16
3 Mountain View     7
4  Redwood City     7
5     Palo Alto     5

Mapping

Here is something cool. We can pipe the locations of the stations to the leaflet package which provides an R interface to the popular Leaflet Javascript library. Once you install the package you can easily map all the stations. The following will produce an interactive map that from a high level will show circles containing numbers that indicate the number of stations in that area. Click on a circle to “drill down” to a more granular level to see where the stations are. If you click on the cluster circle over Palo Alto you will see it split into two clusters of 7 and 5 corresponding to the cities of Redwood City and Palo Alto respectively. The image below is just a screenshot of course but click it to enlarge the map.

library(leaflet)

m <- leaflet() %>% addTiles() %>% 
     addMarkers(lng=stations$long,lat=stations$lat,popup=stations$name,
                clusterOptions = markerClusterOptions())

m

LeafLet Cluster Map of Ride Share Stations

Sorting Out The Dates

Note that there is a start_date and an end_date for each trip. If we want to do any summary on this information it’s not a bad idea to parse them into actual dates recognized by R. The readr package comes with some date and time routines that we can use here, although for more general use the lubridate package is great for manipulating dates and times. This can be confusing since native R provides date functions of its own. For now we will use the parse_datetime() function from the readr package.

trips %>% mutate(start_date=parse_datetime(start_date,
                                           format="%m/%d/%Y %H:%M"),
                 end_date=parse_datetime(end_date,
                                           format="%m/%d/%Y %H:%M")) -> trips

# So it looks like the data set starts on August 29, 2013 and continued to 
# August 31, 2015

range(trips$start_date)
[1] "2013-08-29 09:08:00 UTC" "2015-08-31 23:26:00 UTC"

trips
Source: local data frame [669,959 x 11]

      id duration          start_date       start_station_name start_station_id            end_date         end_station_name
   (int)    (int)              (time)                    (chr)            (int)              (time)                    (chr)
1   4576       63 2013-08-29 14:13:00 South Van Ness at Market               66 2013-08-29 14:14:00 South Van Ness at Market
2   4607       70 2013-08-29 14:42:00       San Jose City Hall               10 2013-08-29 14:43:00       San Jose City Hall
3   4130       71 2013-08-29 10:16:00  Mountain View City Hall               27 2013-08-29 10:17:00  Mountain View City Hall
4   4251       77 2013-08-29 11:29:00       San Jose City Hall               10 2013-08-29 11:30:00       San Jose City Hall
5   4299       83 2013-08-29 12:02:00 South Van Ness at Market               66 2013-08-29 12:04:00           Market at 10th
6   4927      103 2013-08-29 18:54:00      Golden Gate at Polk               59 2013-08-29 18:56:00      Golden Gate at Polk
7   4500      109 2013-08-29 13:25:00   Santa Clara at Almaden                4 2013-08-29 13:27:00         Adobe on Almaden
8   4563      111 2013-08-29 14:02:00      San Salvador at 1st                8 2013-08-29 14:04:00      San Salvador at 1st
9   4760      113 2013-08-29 17:01:00 South Van Ness at Market               66 2013-08-29 17:03:00 South Van Ness at Market
10  4258      114 2013-08-29 11:33:00       San Jose City Hall               10 2013-08-29 11:35:00              MLK Library
..   ...      ...                 ...                      ...              ...                 ...                      ...
Variables not shown: end_station_id (int), bike_id (int), subscription_type (chr), zip_code (int)
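
If you prefer lubridate for this kind of work, its mdy_hm() parser handles the same "month/day/year hour:minute" format; a sketch that re-reads the raw file since the trips columns have already been converted above:

library(lubridate)

trips_lub <- read_csv("trip.csv") %>% 
  mutate(start_date=mdy_hm(start_date),
         end_date=mdy_hm(end_date))

range(trips_lub$start_date)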

Now should we assume that all trips are started and completed on the same day ? My initial guess is that they are although that might not be true and/or there might have been an error in the data capture. Either way it’s probably wise to check. Looks like there were 2,099 trips that were started on one day and finished the next.

How many trips did not finish on the same day they began ?

trips %>% filter(substr(start_date,1,10) 
                        != substr(end_date,1,10)) %>% 
                        summarize(different_days=n())
 different_days
           (int)
1           2099

How many trips were there for each year ?

trips %>% count(substr(start_date,1,4))
Source: local data frame [3 x 2]

  substr(start_date, 1, 4)      n
                     (chr)  (int)
1                     2013 100563
2                     2014 326339
3                     2015 243057

Okay, let’s extend this a little bit. We will look at the number of trips for each day of each year and use the googleVis package to create a calendar heat map to visualize this. We’ll need to modify start_date so that it respects only the date portion (year, month, and day) and ignores the hour and minute. The way we did this above was with the substr() function. However, the googleVis package expects an actual date as opposed to a character string (which is what substr() returns), so we’ll need to work a little harder (but not too much). We use as.Date() to truncate the full date-time down to just the year, month, and day. We then filter out just the trips that began and ended on the same day, group by start date, and count the number of trips for each day. Remember, we are NOT changing the underlying trips data frame at all; we just use the pipe operator to mutate, filter, group, and summarize the data in a single command chain. There is no need to store temporary or intermediate results into a data frame unless you want to.

library(googleVis)

trips %>% mutate(start_date=as.Date(start_date), 
                 end_date=as.Date(end_date)) %>%
                 filter(start_date == end_date) %>% 
                 count(start_date) -> tripdates

# Create a Gvisplot and then plot it

plot( 
  gvisCalendar(data=tripdates, datevar="start_date", numvar="n",
               options=list(
                 title="Calendar Heat Map of Open Bike Trips",
                 calendar="{cellSize:10,
                 yearLabel:{fontSize:20, color:'#444444'},
                 focusedCellColor:{stroke:'red'}}",
                 width=590, height=320),
               chartid="Calendar")
)


Trips per Day of Year

Trips by Day of the Week

One of the advantages of turning the start and end dates into actual date objects is that we can then use built-in R date functions. For example, how many trips were there for each day of the week? We can leverage the weekdays() function to figure this out.

trips %>% mutate(weekdays = weekdays(start_date)) %>% 
         count(weekdays, sort=TRUE)
Source: local data frame [7 x 2]

   weekdays      n
      (chr)  (int)
1   Tuesday 122259
2 Wednesday 120201
3  Thursday 119089
4    Monday 115873
5    Friday 109361
6  Saturday  44785
7    Sunday  38391

# But let's plot this (ggplot2 is needed for the chart)

library(ggplot2)
trips %>% mutate(weekdays = weekdays(start_date)) %>% 
  count(weekdays, sort=TRUE) %>%
   ggplot(aes(x=weekdays,y=n)) + 
   geom_bar(stat="identity") + 
   ggtitle("Trips per Day of The Week") +
   ylab("Total Trips")

Total Trips per Day of Week

Joins

Up until now we have been working with a single table and summaries thereof. Next let’s use our knowledge of joins and merging to learn more about the data. For example, what are the most popular bike stations ? How would we answer this question ? Well we have a list of all the trips in the trips data frame which includes the beginning and ending station id for each trip. So, for each station, is there a way to find the number of trips starting or ending there ? If so then we can find the stations with the highest number of starts and stops which will then tell us the most popular stations.

# For each station we count the number of trips initiated from the 
# station or ended there. 

# Here we get the number of times a trip started at a station

trips %>% count(start_station_id) -> start

Source: local data frame [70 x 2]

   start_station_id     n
              (int) (int)
1                 2  9558
2                 3  1594
3                 4  3861
4                 5  1257
5                 6  2917
6                 7  2233
7                 8  1692
8                 9  1910
9                10  2393
10               11  2034
..              ...   ...

# Here we get the number of times a trip ended at a station

trips %>% count(end_station_id) -> end

Source: local data frame [70 x 2]

   end_station_id     n
            (int) (int)
1               2  9415
2               3  1786
3               4  3705
4               5  1169
5               6  3163
6               7  2498
7               8  1707
8               9  2200
9              10  1658
10             11  2508
..            ...   ...

# Let's join these two tables using the station_ids as keys. While we do
# the join we can add together the start and end counts for each 
# station. This tells us how "popular" a given station was

inner_join(start,end,by=c("start_station_id"="end_station_id")) %>% 
    mutate(total_trips=n.x+n.y) %>% 
    arrange(desc(total_trips))

Source: local data frame [70 x 4]

   start_station_id   n.x   n.y total_trips
              (int) (int) (int)       (int)
1                70 49092 63179      112271
2                69 33742 35117       68859
3                50 32934 33193       66127
4                60 27713 30796       58509
5                61 25837 28529       54366
6                77 24172 28033       52205
7                65 23724 26637       50361
8                74 24838 25025       49863
9                55 26089 23080       49169
10               76 20165 19915       40080
..              ...   ...   ...         ...

# Okay now let's join this with the stations data frame to get some
# more information such as the latitude and longitude

inner_join(start,end,by=c("start_station_id"="end_station_id")) %>% 
     mutate(total=n.x+n.y) %>% 
     arrange(desc(total)) %>%
     inner_join(stations,c("start_station_id"="id")) %>% 
     select(c(1,4:7,9)) 

Source: local data frame [70 x 6]

   start_station_id total                                          name      lat
              (int)       (int)                                         (chr)    (dbl)
1                70      112271      San Francisco Caltrain (Townsend at 4th) 37.77662
2                69       68859       San Francisco Caltrain 2 (330 Townsend) 37.77660
3                50       66127          Harry Bridges Plaza (Ferry Building) 37.79539
4                60       58509                        Embarcadero at Sansome 37.80477
5                61       54366                               2nd at Townsend 37.78053
6                77       52205                             Market at Sansome 37.78963
7                65       50361                               Townsend at 7th 37.77106
8                74       49863                             Steuart at Market 37.79414
9                55       49169 Temporary Transbay Terminal (Howard at Beale) 37.78976
10               76       40080                                 Market at 4th 37.78630
..              ...         ...                                           ...      ...
Variables not shown: long (dbl), city (chr)

More Mapping

Wouldn’t it be nice to plot this information on a map? We could plot points representing the station locations and make the size of each point proportional to the number of trips starting and ending at that station. This would make it really easy to see which stations get a lot of traffic. To do this we need the latitude and longitude information from the stations data frame (which we already have courtesy of the commands above). Also, for purposes of this exercise, let’s focus only on San Francisco bike stations. That is, we won’t worry about Palo Alto, Redwood City, Mountain View, or San Jose. Let’s start by reissuing the last command and doing some more selecting to get the right information.

inner_join(start,end,by=c("start_station_id"="end_station_id")) %>% 
     mutate(total=n.x+n.y) %>% 
     arrange(desc(total)) %>%
     inner_join(stations,c("start_station_id"="id")) %>% 
     select(c(1,4:7,9)) %>% filter(city=="San Francisco") -> sfstats

Source: local data frame [35 x 6]

   start_station_id total                                          name      lat      long          city
              (int)       (int)                                         (chr)    (dbl)     (dbl)         (chr)
1                70      112271      San Francisco Caltrain (Townsend at 4th) 37.77662 -122.3953 San Francisco
2                69       68859       San Francisco Caltrain 2 (330 Townsend) 37.77660 -122.3955 San Francisco
3                50       66127          Harry Bridges Plaza (Ferry Building) 37.79539 -122.3942 San Francisco
4                60       58509                        Embarcadero at Sansome 37.80477 -122.4032 San Francisco
5                61       54366                               2nd at Townsend 37.78053 -122.3903 San Francisco
6                77       52205                             Market at Sansome 37.78963 -122.4008 San Francisco
7                65       50361                               Townsend at 7th 37.77106 -122.4027 San Francisco
8                74       49863                             Steuart at Market 37.79414 -122.3944 San Francisco
9                55       49169 Temporary Transbay Terminal (Howard at Beale) 37.78976 -122.3946 San Francisco
10               76       40080                                 Market at 4th 37.78630 -122.4050 San Francisco
..              ...         ...                                           ...      ...       ...           ...

# Next I scale the total number of trips variable down to a reasonable number
# for use with the geom_point function. You could come up with your own 
# scaling function to do this. The idea is that when we plot each station
# on the map we can adjust the size of the respective point to reflect how
# busy that station is. There are other ways to do this of course

sz <- round((sfstats$total)^(1/3)-20)

library(ggmap)   # qmap() below comes from the ggmap package

qmap(c(lon=-122.4071,lat=37.78836),zoom=14) -> mymap

mymap + 
  geom_point(data=sfstats, aes(x=long,y=lat),
             color="red", alpha=0.4,size=sz) + 
             geom_text(data=sfstats,
                       aes(x=long,y=lat,label=start_station_id),size=4)



Plot of SF Stations by Popularity

Two things to note here: 1) the numbers in the circles are the station ids of the associated stations. We could have put up the actual station names but that would look messy. 2) There is some overplotting in the lower part of the map because there are two Caltrain-related stations very close to each other, both of which experience lots of use, so their ids are stacked on top of each other at the current map resolution. We could fix this by changing the resolution or plotting those points separately (left as an exercise for the reader).

       San Francisco Caltrain (Townsend at 4th)       49092
        San Francisco Caltrain 2 (330 Townsend)       33742

The “trick” with the sz vector is to find a way to scale the total_trips column in the sfstats data table so that it can be used to specify the size of each point. Here I took the cube root of the total count and subtracted an arbitrary number. I experimented with this, but I generally find that root functions can be useful when scaling count data.
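
A more ggplot-native alternative (a sketch) is to map the total straight onto the size aesthetic and let scale_size_area() handle the scaling, which avoids the hand-rolled sz vector:

mymap + 
  geom_point(data=sfstats, aes(x=long,y=lat,size=total),
             color="red", alpha=0.4) + 
  scale_size_area(max_size=14) +     # max_size is an arbitrary choice
  geom_text(data=sfstats,
            aes(x=long,y=lat,label=start_station_id),size=4)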

What Next ?

We’ve just scratched the surface here. What about identifying the most frequently used trip segments? We could identify all trip pairs (starting station and ending station id) and count the number of times each occurs, then perhaps draw segments on the map to more easily visualize the popular routes. In a different direction, there is another CSV file that came with the data called weather.csv that has meteorological information for each day of the year, so we could investigate the impact that weather has on bike usage. Lastly, the data set on Kaggle comes with an SQLite database which we could also use in a couple of ways: dplyr has the ability to connect to SQLite databases, and we could also use the RSQLite package to execute SQL queries against the database. I’ll explore these options in a future posting. Thanks, Steve Pittard

Express Intro to dplyr

Posted: June 29, 2016 in Data Mining

Working The Data Like a Boss !

I recently introduced the data.table package, which provides a nice way to manage and aggregate large data sources using the standard bracket notation commonly employed when manipulating data frames in R. As data sources grow larger one must be prepared with a variety of approaches to handle this information efficiently. Using a database (SQL or NoSQL) and querying for a subset of the information is one possibility, although this assumes that the database already exists or that you are prepared to create it yourself. The dplyr package offers ways to read in large files, interact with databases, and accomplish aggregation and summary. Some feel that dplyr is a competitor to the data.table package, though I do not share that view. I think each offers a well-conceived philosophy and approach and does a good job of delivering on its respective design goals. That there is overlap in their potential applications simply means there is more than one way to do something. They are just great tools in a larger toolbox, so I have no complaints. Let’s dig into dplyr to learn what it can do. Note that this post is part one of two; the second dplyr post will apply the knowledge learned in this one.

Upcoming Class

Before we get too deep into this I wanted to indicate that I will be teaching a 3-day Intro to R BootCamp in the Atlanta, GA area of the US sometime in August or September. I say “sometime” because the logistics are still under development. If interested please feel free to email me and once I get everything lined up I will get back to you with the details. You can also visit my home page. Thanks for indulging my self-promotion. Steve – Now Back to the Action…

Verbs in Action !

dplyr is based on the idea that when working with data there are a number of common activities one will pursue: reading, filtering rows on some condition, selecting or excluding columns, arranging/sorting, grouping, summarizing, merging/joining, and mutating/transforming columns. There are other activities, but these describe the main categories. dplyr presents a number of commands or “verbs” that help you accomplish this work. Note that dplyr does not replace any existing commands; it simply gives you new ones (a small toy example of the join verbs follows the table):

Command Purpose
select() Select columns from a data frame
filter() Filter rows according to some condition(s)
arrange() Sort / re-order rows in a data frame
mutate() Create new columns or transform existing ones
group_by() Group a data frame by some factor(s), usually in conjunction with a summary
summarize() Summarize some values from the data frame or across groups
inner_join(x,y,by="col") Return all rows from 'x' where there are matching values in 'y', and all columns from 'x' and 'y'. If there are multiple matches between 'x' and 'y', all combinations of the matches are returned.
left_join(x,y,by="col") Return all rows from 'x', and all columns from 'x' and 'y'. Rows in 'x' with no match in 'y' will have 'NA' values in the new columns. If there are multiple matches between 'x' and 'y', all combinations of the matches are returned.
right_join(x,y,by="col") Return all rows from 'y', and all columns from 'x' and 'y'. Rows in 'y' with no match in 'x' will have 'NA' values in the new columns. If there are multiple matches between 'x' and 'y', all combinations of the matches are returned.
anti_join(x,y,by="col") Return all rows from 'x' where there are no matching values in 'y', keeping just the columns from 'x'.
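
Here is the small toy example of the join verbs referred to above; df_x and df_y are made-up data frames used purely for illustration.

library(dplyr)

df_x <- data.frame(id=c(1,2,3), val_x=c("a","b","c"))
df_y <- data.frame(id=c(2,3,4), val_y=c("B","C","D"))

inner_join(df_x, df_y, by="id")   # ids present in both: 2 and 3
left_join(df_x, df_y, by="id")    # all of df_x; id 1 gets NA for val_y
anti_join(df_x, df_y, by="id")    # rows of df_x with no match in df_y: id 1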

 

readr

There is also an associated package called readr that is more efficient at ingesting CSV files than base R functions such as read.csv. While it is not part of the dplyr package itself, it produces a dplyr-compatible structure as it reads in files. readr provides the read_csv function to do the work. It is also pretty smart and can figure out things like whether there is a header or not, so you don’t have to provide a lot of additional arguments. Here is an example using a file that contains weather station measurements for the year 2013.

install.packages("readr")  # one time only 
library(readr)

url <- "http://steviep42.bitbucket.org/YOUTUBE.DIR/weather.csv"
download.file(url,"weather.csv")

system("head -5 weather.csv")  # Take a peak at the first 5 lines

"origin","year","month","day","hour","temp","dewp","humid","wind_dir","wind_speed","wind_gust","precip","pressure","visib"
"EWR",2013,1,1,0,37.04,21.92,53.97,230,10.35702,11.9186514756,0,1013.9,10
"EWR",2013,1,1,1,37.04,21.92,53.97,230,13.80936,15.8915353008,0,1013,10
"EWR",2013,1,1,2,37.94,21.92,52.09,230,12.65858,14.5672406924,0,1012.6,10
"EWR",2013,1,1,3,37.94,23,54.51,230,13.80936,15.8915353008,0,1012.7,10

weather <- read_csv("weather.csv")

weather
Source: local data frame [8,719 x 14]

   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
    (chr) (int) (int) (int) (int) (dbl) (dbl) (dbl)    (int)      (dbl)
1     EWR  2013     1     1     0 37.04 21.92 53.97      230   10.35702
2     EWR  2013     1     1     1 37.04 21.92 53.97      230   13.80936
3     EWR  2013     1     1     2 37.94 21.92 52.09      230   12.65858
4     EWR  2013     1     1     3 37.94 23.00 54.51      230   13.80936
5     EWR  2013     1     1     4 37.94 24.08 57.04      240   14.96014
6     EWR  2013     1     1     6 39.02 26.06 59.37      270   10.35702
7     EWR  2013     1     1     7 39.02 26.96 61.63      250    8.05546
8     EWR  2013     1     1     8 39.02 28.04 64.43      240   11.50780
9     EWR  2013     1     1     9 39.92 28.04 62.21      250   12.65858
10    EWR  2013     1     1    10 39.02 28.04 64.43      260   12.65858
..    ...   ...   ...   ...   ...   ...   ...   ...      ...        ...
Variables not shown: wind_gust (dbl), precip (dbl), pressure (dbl), visib (dbl)

tbl_df

It is important to note that dplyr works transparently with existing R data frames though ideally one should explicitly create or transform an existing data frame to a dplyr structure to get the full benefit of the package. Let’s use the dplyr tbl_df command to wrap an existing data frame. We’ll convert the infamous mtcars data frame into a dplyr table since it is a small data frame that is easy to understand. The main advantage in using a ‘tbl_df’ over a regular data frame is the printing: tbl objects only print a few rows and all the columns that fit on one screen, describing the rest of it as text.

dp_mtcars <- tbl_df(mtcars)

# dp_mtcars is a data frame as well as a dplyr object

class(dp_mtcars)
[1] "tbl_df"     "tbl"        "data.frame"

In the example below (as with the readr example above) notice how only a subset of the data gets printed by default. This is actually very nice, especially if you have ever accidentally typed the name of a really, really large native data frame. R will dutifully try to print a large portion of the data even if doing so locks up your R session. Wrapping the data frame in a dplyr table prevents this. Also notice how you get a summary of the number of rows and columns as well as the type of each column.

dp_mtcars
Source: local data frame [32 x 11]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
..

Now we could start to operate on this data frame / dplyr table by using some of the commands on offer from dplyr. They do pretty much what the name implies and you could use them in isolation though the power of dplyr comes through when using the piping operator to chain together commands. We’ll get there soon enough. Here are some basic examples:

filter()


# Find all rows where MPG is >= 30 and the weight is over 1,800 lbs (wt > 1.8)

filter(dp_mtcars, mpg >= 30 & wt > 1.8)
Source: local data frame [2 x 11]

    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
2  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1

select()

The following example illustrates how the select() function works. We will select all columns whose name begins with the letter "m". This is more useful when you have lots of columns that are named according to some pattern. For example some Public Health data sets can have many, many columns (hundreds even) so counting columns becomes impractical, which is why select() supports a form of regular expressions to find columns by name. Other helpful functions in this category include the following (a brief sketch using a few of them appears after the examples below):

Argument Purpose
ends_with(x, ignore.case=TRUE) Finds columns whose name ends with "x"
contains(x, ignore.case=TRUE) Finds columns whose name contains "x"
matches(x, ignore.case=TRUE) Finds columns whose names match the regular expression "x"
num_range("x", 1:5, width=2) Selects all variables (numerically) from x01 to x05
one_of("x", "y", "z") Selects variables provided in a character vector

select(dp_mtcars,starts_with("m"))
Source: local data frame [32 x 1]

     mpg
   (dbl)
1   21.0
2   21.0
3   22.8
4   21.4
5   18.7
6   18.1
7   14.3
8   24.4
9   22.8
10  19.2

# Get all columns except columns 5 through 10 

select(dp_mtcars,-(5:10))
Source: local data frame [32 x 5]

     mpg   cyl  disp    hp  carb
   (dbl) (dbl) (dbl) (dbl) (dbl)
1   21.0     6 160.0   110     4
2   21.0     6 160.0   110     4
3   22.8     4 108.0    93     1
4   21.4     6 258.0   110     1
5   18.7     8 360.0   175     2
6   18.1     6 225.0   105     1
7   14.3     8 360.0   245     4
8   24.4     4 146.7    62     2
9   22.8     4 140.8    95     2
10  19.2     6 167.6   123     4
..   ...   ...   ...   ...   ...
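As promised above, here is a quick sketch showing a few of the other helpers from that table applied to the same data; the comments note which columns each call should return:

select(dp_mtcars, ends_with("t"))          # drat and wt
select(dp_mtcars, contains("ar"))          # gear and carb
select(dp_mtcars, one_of(c("mpg","am")))   # mpg and am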

mutate()

Here we use the mutate() function to transform the wt variable by multiplying it by 1,000 and then we create a new variable called "good_mpg" which takes on a value of "good" or "bad" depending on whether a given row's MPG value is greater than 25.

mutate(dp_mtcars, wt=wt*1000, good_mpg=ifelse(mpg > 25,"good","bad"))
Source: local data frame [32 x 12]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb good_mpg
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)    (chr)
1   21.0     6 160.0   110  3.90  2620 16.46     0     1     4     4      bad
2   21.0     6 160.0   110  3.90  2875 17.02     0     1     4     4      bad
3   22.8     4 108.0    93  3.85  2320 18.61     1     1     4     1      bad
4   21.4     6 258.0   110  3.08  3215 19.44     1     0     3     1      bad
5   18.7     8 360.0   175  3.15  3440 17.02     0     0     3     2      bad
6   18.1     6 225.0   105  2.76  3460 20.22     1     0     3     1      bad
7   14.3     8 360.0   245  3.21  3570 15.84     0     0     3     4      bad
8   24.4     4 146.7    62  3.69  3190 20.00     1     0     4     2      bad
9   22.8     4 140.8    95  3.92  3150 22.90     1     0     4     2      bad
10  19.2     6 167.6   123  3.92  3440 18.30     1     0     4     4      bad
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...      ...

arrange()

Next we could sort or arrange the data according to some column values. This is usually to make visual inspection of the data easier. Let's sort the data frame so that the cars with the worst MPG come first and, within each MPG value, sort by weight from heaviest to lightest.

arrange(dp_mtcars,mpg,desc(wt))
Source: local data frame [32 x 11]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1   10.4     8 460.0   215  3.00 5.424 17.82     0     0     3     4
2   10.4     8 472.0   205  2.93 5.250 17.98     0     0     3     4
3   13.3     8 350.0   245  3.73 3.840 15.41     0     0     3     4
4   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
5   14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4
6   15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8
7   15.2     8 275.8   180  3.07 3.780 18.00     0     0     3     3
8   15.2     8 304.0   150  3.15 3.435 17.30     0     0     3     2
9   15.5     8 318.0   150  2.76 3.520 16.87     0     0     3     2
10  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...

Ceci n’est pas une pipe

While the above examples are instructive they are not, at least in my opinion, the best way to use dplyr. Once you get up to speed with the dplyr functions I think you will soon agree that using "pipes" to create chains of commands is the way to go. If you come from a UNIX background you will no doubt have heard of "pipes", a construct that allows you to take the output of one command and route or "pipe" it to the input of another program. This can be done for several commands, thus creating a chain of piped commands. One can save typing while creating, in effect, a "one line program" that does a lot. Here is an example of a UNIX pipeline. I'm using Apple OSX but this should work on a Linux machine just as well. This example pipes the contents of the /etc/passwd file into the input of the awk command (a program used to parse files) and the output of the awk command goes into the input of the tail command, which lists the last 10 lines of the final result.

 $ cat /etc/passwd | awk -F: '{print $1}' | tail
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_xcsbuildagent
_xcscredserver
_launchservicesd

$ 

This is a great paradigm for working on UNIX that also maps well to what one does in data exploration. When first encountering data you rarely know what it is you want to get from it (unless you are a student and your teacher told you specifically what she or he wants). So you embark on some exploratory work and start to interrogate the data, which might first require some filtering and maybe exclusion of incomplete data or maybe some imputation for missing values. Until you have worked with it for a while you don't want to change the data; you just want to experiment with various transformed and grouped versions of it, which is much easier if you use dplyr. Just pipe various commands together to clean up your data, make some visualizations, and perhaps generate some hypotheses about your data. You find yourself generating some pretty involved ad hoc command chains without having to create a standalone script file. The dplyr package uses the magrittr package to enable this piping capability within R. The pipe operator is "%>%", which is different from the traditional UNIX pipe, the vertical bar "|". But don't let the visual difference confuse you: conceptually, pipes in R work just like they do in UNIX. The magrittr package has a motto, "Ceci n'est pas une pipe", presumably in acknowledgement of the noted difference and also as a tribute to the painter Rene Magritte's work La trahison des images.


# Here we filter rows where MPG is >= 25 and then keep only columns 1-4
# and 10-11 (that is, we drop columns 5 through 9).

dp_mtcars %>% filter(mpg >= 25) %>% select(-c(5:9)) 
Source: local data frame [6 x 6]

    mpg   cyl  disp    hp  gear  carb
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1  32.4     4  78.7    66     4     1
2  30.4     4  75.7    52     4     2
3  33.9     4  71.1    65     4     1
4  27.3     4  79.0    66     4     1
5  26.0     4 120.3    91     5     2
6  30.4     4  95.1   113     5     2

Next we filter rows where MPG is >= 25, keep only columns 1-4 and 10-11, and then sort the result by MPG from highest to lowest. You can keep adding as many pipes as you wish. At first, while you are becoming familiar with the idea, it is best to keep the pipeline relatively short so you can check your work. But it will not be long before you are stringing together lots of different commands. dplyr enables and encourages this type of activity so don't be shy.

dp_mtcars %>% filter(mpg >= 25) %>% select(-c(5:9)) %>% arrange(desc(mpg))
Source: local data frame [6 x 6]

    mpg   cyl  disp    hp  gear  carb
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1  33.9     4  71.1    65     4     1
2  32.4     4  78.7    66     4     1
3  30.4     4  75.7    52     4     2
4  30.4     4  95.1   113     5     2
5  27.3     4  79.0    66     4     1
6  26.0     4 120.3    91     5     2

That was pretty cool wasn't it ? We don't need to alter dp_mtcars at all to explore it. We could change our minds about how and if we want to filter, select, or sort. The way this works is that the output of the dp_mtcars data frame/table gets sent to the input of the filter function, which is aware of the source, which is why we don't need to explicitly reference dp_mtcars by name. The output of the filter step gets sent to the select function, which in turn pipes or chains its output into the input of the arrange function, which sends its output to the screen. We could even pipe the output of these operations to the ggplot2 package. But first let's convert some of the columns into factors so the resulting plot will look better.

# Turn the cyl and am variables into factors. Notice that the resulting
# output reflects the change

dp_mtcars %>%
mutate(cyl=factor(cyl,levels=c(4,6,8)),
am=factor(am,labels=c("Auto","Manual" )))
mpg    cyl  disp    hp  drat    wt  qsec    vs     am  gear  carb
(dbl) (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (fctr) (dbl) (dbl)
1   21.0      6 160.0   110  3.90 2.620 16.46     0 Manual     4     4
2   21.0      6 160.0   110  3.90 2.875 17.02     0 Manual     4     4
3   22.8      4 108.0    93  3.85 2.320 18.61     1 Manual     4     1
4   21.4      6 258.0   110  3.08 3.215 19.44     1   Auto     3     1
5   18.7      8 360.0   175  3.15 3.440 17.02     0   Auto     3     2
6   18.1      6 225.0   105  2.76 3.460 20.22     1   Auto     3     1
7   14.3      8 360.0   245  3.21 3.570 15.84     0   Auto     3     4
8   24.4      4 146.7    62  3.69 3.190 20.00     1   Auto     4     2
9   22.8      4 140.8    95  3.92 3.150 22.90     1   Auto     4     2
10  19.2      6 167.6   123  3.92 3.440 18.30     1   Auto     4     4
..   ...    ...   ...   ...   ...   ...   ...   ...    ...   ...   ...

But that was kind of boring. Let's visualize this using the ggplot2 package whose author, Hadley Wickham, is also the author of dplyr.

dp_mtcars %>% mutate(cyl=factor(cyl,levels=c(4,6,8)),
                     am=factor(am,labels=c("Auto","Manual" ))) %>%
                     ggplot(aes(x=wt,y=mpg,color=cyl)) +
                     geom_point() + facet_wrap(~am)

(Figure: scatter plot of mpg vs. wt, colored by cyl and faceted by transmission type)

Okay well that might have been too much for you and that’s okay if it is. Let’s break this down into two steps. First let’s save the results of the mutate operation into a new data frame.

new_dp_mtcars <- dp_mtcars %>% mutate(cyl=factor(cyl,levels=c(4,6,8)),
                     am=factor(am,labels=c("Auto","Manual" )))

# Now we can call the ggplot command separately

ggplot(new_dp_mtcars,aes(x=wt,y=mpg,color=cyl)) +
                     geom_point() + facet_wrap(~am)

Pick whatever approach you want to break things down to the level you need. However, I suspect that after a while you will wind up writing lots of these one line programs.

Split-Apply-Combine

There are two more commands from the dplyr package that are particularly useful in aggregating data. The group_by() and summarize() functions help us group a data frame according to some factors and then apply some summary functions across those groups. The idea is to first "split" the data into groups, "apply" some functions (e.g. mean()) to some continuous quantity relating to each group, and then combine those group-specific results back into an integrated result. In the next example we will group (or split) the data frame by the cylinder variable, summarize the mean MPG for each group, and then combine that into a final aggregated result.

dp_mtcars %>% group_by(cyl) %>% summarize(avg_mpg=mean(mpg))
Source: local data frame [3 x 2]

    cyl  avg_mpg
  (dbl)    (dbl)
1     4 26.66364
2     6 19.74286
3     8 15.10000

# Let's group by cylinder then by transmission type and then apply the mean
# and sd functions to mpg

dp_mtcars %>% group_by(cyl,am) %>% summarize(avg_mpg=mean(mpg),sd=sd(mpg))
Source: local data frame [6 x 4]
Groups: cyl [?]

    cyl    am  avg_mpg        sd
  (dbl) (dbl)    (dbl)     (dbl)
1     4     0 22.90000 1.4525839
2     4     1 28.07500 4.4838599
3     6     0 19.12500 1.6317169
4     6     1 20.56667 0.7505553
5     8     0 15.05000 2.7743959
6     8     1 15.40000 0.5656854

# Note that just grouping a data frame without summary doesn't appear to do 
# much from a visual point of view. 

dp_mtcars %>% group_by(cyl)
Source: local data frame [32 x 11]
Groups: cyl [3]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...

Merging Data Frames

One of the strengths of dplyr is its ability to do merges via various "joins" like those associated with database joins. There is already a built-in R command called merge that can handle merging duties but dplyr offers flexible and extended capabilities in this regard. Moreover it does so in a way that is consistent (for the most part) with SQL, which you can use for a wide variety of data mining tasks. If you already know SQL then you will understand these commands without much effort. Let's set up two simple example data frames to explain the concept of joining.

df1 <- data.frame(id=c(1,2,3),m1=c(0.98,0.45,0.22))
df2 <- data.frame(id=c(3,4),m1=c(0.17,0.66))

df1
  id   m1
1  1 0.98
2  2 0.45
3  3 0.22

df2
  id   m1
1  3 0.17
2  4 0.66

Left Join

Think about what it means to merge these data frames. It makes sense to want to join the data frames with respect to some common column name. In this case it is clear that the id column is in both data frames. So let’s join the data frames using “id” as a “key”. The question is what to do about the fact that there is no id in df2 corresponding to id number 2. This is why different types of joins exist. Let’s see how they work. We’ll start with the left join:

left_join(df1,df2,by="id")
  id m1.x m1.y
1  1 0.98   NA
2  2 0.45   NA
3  3 0.22 0.17

So the left join looks at the first data frame df1 and then attempts to find corresponding "id" values in df2 that match all id values in df1. Of course there are no ids matching 1 or 2 in df2 so what happens ? The left join will insert NAs in the m1.y column for those rows since there are no corresponding values in df2. Note that there is in fact an id of value 3 in both data frames so both measurement columns are filled in for that row. Also note that both data frames have a column named "m1", so the join has to create unique names to accommodate both columns. The "x" and "y" come from the fact that df1 comes before df2 in the calling sequence to left_join. Thus "x" matches df1 and "y" matches df2.
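As an aside, more recent versions of dplyr also let you choose your own suffixes instead of the default ".x" and ".y" via the suffix argument; here is a small sketch, assuming your installed version supports it:

left_join(df1, df2, by="id", suffix=c(".df1",".df2"))
  id m1.df1 m1.df2
1  1   0.98     NA
2  2   0.45     NA
3  3   0.22   0.17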

Inner Join

Let’s join the two data frames in a way that yields only the intersection of the two data structures based on “id”. Using visual examination we can see that there is only one id in common to both data frames – id 3.

inner_join(df1,df2,by="id")
  id m1.x m1.y
1  3 0.22 0.17

More Involved Join Examples

Now we’ll look at a more advanced example. Let’s create two data frames where the first, (we’ll call it “authors”), presents a list of, well, authors. The second data frame presents a list of books published by various authors. Each data frame has some additional attributes of interest.

# For reference sake - these data frames come from the examples contained in 
# the help pages for the built-in R merge command

authors <- data.frame(
         surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
         nationality = c("US", "Australia", "US", "UK", "Australia"),
         deceased = c("yes", rep("no", 4)))
     
books <- data.frame(
         name = I(c("Tukey", "Venables", "Tierney",
                  "Ripley", "Ripley", "McNeil", "R Core")),
         title = c("Exploratory Data Analysis",
                   "Modern Applied Statistics ...",
                   "LISP-STAT",
                   "Spatial Statistics", "Stochastic Simulation",
                   "Interactive Data Analysis",
                   "An Introduction to R"),
         other.author = c(NA, "Ripley", NA, NA, NA, NA,
                          "Venables & Smith"))

authors
   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no
 
books
      name                         title     other.author
1    Tukey     Exploratory Data Analysis             <NA>
2 Venables Modern Applied Statistics ...           Ripley
3  Tierney                     LISP-STAT             <NA>
4   Ripley            Spatial Statistics             <NA>
5   Ripley         Stochastic Simulation             <NA>
6   McNeil     Interactive Data Analysis             <NA>
7   R Core          An Introduction to R Venables & Smith

At first glance it appears that there is nothing in common between these two data frames in terms of column names. However, it is fairly obvious that the "surname" column in the authors data frame matches the "name" column in books, so we could probably use those as keys to join the two data frames. We also see that there is an author, "R Core" (meaning the R Core Team), who appears in the books table though is not listed as an author in the authors data frame. This kind of thing happens all the time in real life so you'd better get used to it. Let's do some reporting using these two data frames:

Let’s find all authors listed in the authors table who published a book along with their book titles, other authors, nationality, and living status. Let’s try an inner join on this. Because we don’t have any common column names between books and authors we have to tell the join what columns to use for matching. The by argument exists for this purpose. Note also that the author “R Core” listed in books isn’t printed here because that author does not also exist in the authors table. This is because the inner join looks for the intersection of the tables.


inner_join(books,authors,by=c("name"="surname"))
      name                         title other.author nationality deceased
1    Tukey     Exploratory Data Analysis         <NA>          US      yes
2 Venables Modern Applied Statistics ...       Ripley   Australia       no
3  Tierney                     LISP-STAT         <NA>          US       no
4   Ripley            Spatial Statistics         <NA>          UK       no
5   Ripley         Stochastic Simulation         <NA>          UK       no
6   McNeil     Interactive Data Analysis         <NA>   Australia       no

# We could have also done a right join since this will require a result that has
# all rows from the "right" data frame (in the "y" position), which in this case is
# authors

right_join(books,authors,by=c("name"="surname"))
      name                         title other.author nationality deceased
1    Tukey     Exploratory Data Analysis         <NA>          US      yes
2 Venables Modern Applied Statistics ...       Ripley   Australia       no
3  Tierney                     LISP-STAT         <NA>          US       no
4   Ripley            Spatial Statistics         <NA>          UK       no
5   Ripley         Stochastic Simulation         <NA>          UK       no
6   McNeil     Interactive Data Analysis         <NA>   Australia       no

Next, find any and all authors who published a book even if they do not appear in the authors table. The result should show names, titles, other authors, nationality, and living status. Let’s do a left join which will pull in all rows from “x” (books) and where there is no matching key/name in authors then NAs will be inserted for columns existing in the “y” (authors) table.

left_join(books,authors,by=c("name"="surname"))
      name                         title     other.author nationality deceased
1    Tukey     Exploratory Data Analysis             <NA>          US      yes
2 Venables Modern Applied Statistics ...           Ripley   Australia       no
3  Tierney                     LISP-STAT             <NA>          US       no
4   Ripley            Spatial Statistics             <NA>          UK       no
5   Ripley         Stochastic Simulation             <NA>          UK       no
6   McNeil     Interactive Data Analysis             <NA>   Australia       no
7   R Core          An Introduction to R Venables & Smith        <NA>     <NA>

Do the same as above but have the result show only the book title and name columns, in that order. This is simply a matter of doing the previous join and piping the result to a select statement.

left_join(books,authors,by=c("name"="surname")) %>% select(title,name)
                          title     name
1     Exploratory Data Analysis    Tukey
2 Modern Applied Statistics ... Venables
3                     LISP-STAT  Tierney
4            Spatial Statistics   Ripley
5         Stochastic Simulation   Ripley
6     Interactive Data Analysis   McNeil
7          An Introduction to R   R Core

Find the book names of all US authors who are not deceased. Well, first we filter the authors table to keep only the rows that meet the specified conditions. Then we pass the result to an inner_join() with books to get the book titles, and finally we select only the book titles. Note that because we are piping the output from filter() we don't need to specify it in the call to inner_join(). That is, the inner_join function assumes that the filter() results represent the "x" position in the call to inner_join().

authors %>% filter(deceased == "no" & nationality == "US") %>%
            inner_join(books,by=c("surname"="name")) %>% select(title)
title
1 LISP-STAT

Find any book titles for authors who do not appear in the authors data frame. Here we use an anti_join(), which returns all rows from books where there are no matching values in authors, keeping just the columns from books, and then we pass that result to select the title and name columns.

anti_join(books,authors,by=c("name"="surname")) %>% select(title,name)
                 title   name
1 An Introduction to R R Core

Up Next – Biking in San Francisco

That's it for now. We have covered a lot of ground in one go, although once you invest some time in playing with dplyr (especially the pipes) it becomes difficult to go back to the "old ways" of doing things. Next up we will look at some "real" data and apply our new found knowledge to working with it. The data set actually comes from the Kaggle Project page for the San Francisco Bay Area Bike Share Service. The data is about 660MB and you can download it though you will need a Kaggle account. You might want to go ahead and download that data in anticipation of the next posting.

Data Frames

R provides a helpful data structure called the "data frame" that gives the user an intuitive way to organize, view, and access data. Many of the functions that you would use to read in external files (e.g. read.csv) or connect to databases (RMySQL) will return a data frame structure by default. While there are other important data structures, such as the vector, list and matrix, the data frame winds up being at the heart of many operations, not the least of which is aggregation. Before we get into that let me offer a very brief review of data frame concepts (a short sketch follows the list):

  • A data frame is a set of rows and columns.
  • All columns are of the same length and each column holds a single data type
  • Every row is of the same length but its values can be of differing data types (one per column)
  • A data frame has characteristics of both a matrix and a list
  • Bracket notation is the customary method of indexing into a data frame
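Here is the short sketch promised above; it builds a small data frame with columns of differing types and then uses str() and bracket notation to illustrate the points in the list (the column names are just made up for illustration):

df <- data.frame(id=c(1,2,3),
                 name=c("A","B","C"),
                 score=c(9.1,8.7,9.5),
                 stringsAsFactors=FALSE)

str(df)            # three columns of equal length, each with its own type

df[2, "score"]     # bracket notation: row 2, column "score"
[1] 8.7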

Subsetting Data The Old School Way

Here are some examples of getting specific subsets of information from the built in data frame mtcars. Note that the bracket notation has two dimensions here – one for row and one for column. The comma within any given bracket notation expression separates the two dimensions.

# select rows 1 and 2

mtcars[1:2,]
              mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

# select rows 1 and 2 and columns 3 and 5

mtcars[1:2,c(3,5)]
              disp drat
Mazda RX4      160  3.9
Mazda RX4 Wag  160  3.9

# Find the rows where the MPG column is greater than 30 

mtcars[mtcars$mpg > 30,]
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

# Select columns 10 and 11 (gear and carb) for all rows where MPG > 30
# and the cylinder count is equal to 4

mtcars[mtcars$mpg > 30 & mtcars$cyl == 4, 10:11]
               gear carb
Fiat 128          4    1
Honda Civic       4    2
Toyota Corolla    4    1
Lotus Europa      5    2

So the bracket notation winds up being a good way to index and search within a data frame, although doing aggregation requires other functions such as tapply, aggregate, and table. This isn't necessarily a bad thing; it's just that you have to learn which function is the most appropriate for the task at hand.

# Get the mean MPG by Transmission

tapply(mtcars$mpg, mtcars$am, mean)
0    1
17.1 24.4 

# Get the mean MPG for Transmission grouped by Cylinder

aggregate(mpg~am+cyl,data=mtcars,mean)
   am cyl  mpg
1  0   4 22.9
2  1   4 28.1
3  0   6 19.1
4  1   6 20.6
5  0   8 15.1
6  1   8 15.4

# Cross tabulation based on Transmission and Cylinder

table(transmission=mtcars$am, cylinder=mtcars$cyl)
            cylinder
transmission  4  6  8
           0  3  4 12
           1  8  3  2

Enter the data.table package

Okay this is nice, though wouldn't it be good to have a way to do aggregation within the bracket notation ? In fact there is. There are a couple of packages that could help us simplify aggregation though we will start with the data.table package for now. In addition to being able to do aggregation within the brackets there are some other reasons why it is useful:

  • It works well with very large data files
  • Can behave just like a data frame
  • Offers fast subsetting, grouping, updating, and joins
  • Makes it easy to turn an existing data frame into a data table

Since it is an external package you will first need to install it. After that just load it up using the library statement

install.packages("data.table")  # one time only
library(data.table)

dt <- data.table(mtcars)

class(dt)
[1] "data.table" "data.frame"

dt[,mean(mpg)]   # You can't do this with a normal data frame
[1] 20.09062

mtcars[,mean(mpg)]  # Such a thing will not work with regular data frames
Error in mean(mpg) : object 'mpg' not found

So notice that we can actually find the mean MPG directly within the bracket notation. We don’t have to go outside of the brackets to do this. So what about reproducing the previous tapply example:

tapply(mtcars$mpg,mtcars$am,mean)
 0    1
17.1 24.4 

# Here is how we would do this with the data table "dt"

dt[,mean(mpg),by=am]
   am   V1
1:  1 24.4
2:  0 17.1

# We could even extend this to group by am and cyl

dt[,mean(mpg),by=.(am,cyl)]
    am cyl  V1
1:  1   6 20.6
2:  1   4 28.1
3:  0   6 19.1
4:  0   8 15.0
5:  0   4 22.9
6:  1   8 15.4

# If we want to more clearly label the computed average

dt[,.(avg=mean(mpg)),by=.(am,cyl)]
    am cyl  avg
1:  1   6 20.6
2:  1   4 28.1
3:  0   6 19.1
4:  0   8 15.0
5:  0   4 22.9
6:  1   8 15.4

Similarities to SQL

It doesn't require many examples to prove that we don't have to use the aggregate or tapply functions to do any of the work once we have created a data table. Unlike the default data frame, the bracket notation for a data table object has three parts (i, j, and by) which correspond to what one might see in an SQL statement. Don't worry, you do not have to be an SQL expert to use data.table. In reality you don't have to know it at all although if you do then using data.table becomes much easier.


So in terms of SQL we would say something like: select "j" (columns, or an operation on some columns) for the rows "i" that satisfy some specified condition, and if the "by" index is supplied it indicates how to group the result. This might sound a little involved so something more practical is in order to illustrate what I'm talking about here.
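To spell that template out in code, here is a small annotated sketch; the average horsepower is just an arbitrary computation chosen for illustration:

dt[ mpg > 20,           # i : which rows to use (like the SQL WHERE clause)
    .(avg=mean(hp)),    # j : what to select or compute (like the SQL SELECT list)
    by=cyl ]            # by: how to group the result (like SQL GROUP BY)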


So let’s revisit the previous examples and see how it relates to the SQL model – This is helpful in understanding the paradigm associated with data table objects:

dt[,mean(mpg),by=am]
   am   V1
1:  1 24.4
2:  0 17.1

# The above is analogous to an SQL statement like

select am,avg(mpg) from mtcars group by am

# The following example

dt[,.(avg=mean(mpg)),by=.(am,cyl)]
    am cyl avg
1:  1   6 20.6
2:  1   4 28.1
3:  0   6 19.1
4:  0   8 15.0
5:  0   4 22.9
6:  1   8 15.4

# is analogous to an SQL statement like:

select am,cyl,avg(mpg) as avg from mtcars group by am,cyl

# The following example

dt[mpg > 20,.(avg=mean(mpg)),by=.(am,cyl)]
    am cyl  avg
1:  1   6 21.0
2:  1   4 28.1
3:  0   6 21.4
4:  0   4 22.9

# would be analogous to the following SQL statement

select am,cyl,avg(mpg) as avg from mtcars where mpg > 20 group by am,cyl


As previously mentioned one does not need to know SQL to use data.table. However, if you do it can help you understand some of the motivations behind the package.

Tabulation

Here are some more examples that illustrate how we can count and tabulate things. Within a data table the special variable .N represents the count of rows. If there is a group by index then it represents the number of rows within each group.

dt[, .N] # How many rows
[1] 32

dt[, .N, by=cyl]  # How many cars in each cylinder group
   cyl  N
1:   6  7
2:   4 11
3:   8 14

# For rows where the weight is over 1,500 lbs (wt > 1.5) count the number
# of cars by transmission type.

dt[wt > 1.5, .(count=.N), by=am] 
   am count
1:  1    13
2:  0    19

We can also do sorting quite easily and very fast.


# Present the 5 cars with the best MPG

head(dt[order(-mpg)],5) 
 mpg cyl disp  hp drat   wt qsec vs am gear carb
1: 33.9   4 71.1  65 4.22 1.83 19.9  1  1    4    1
2: 32.4   4 78.7  66 4.08 2.20 19.5  1  1    4    1
3: 30.4   4 75.7  52 4.93 1.61 18.5  1  1    4    2
4: 30.4   4 95.1 113 3.77 1.51 16.9  1  1    5    2
5: 27.3   4 79.0  66 4.08 1.94 18.9  1  1    4    1

# Since data table inherits from a data frame we could have also done

dt[order(-mpg)][1:5]
mpg cyl disp  hp drat   wt qsec vs am gear carb
1: 33.9   4 71.1  65 4.22 1.83 19.9  1  1    4    1
2: 32.4   4 78.7  66 4.08 2.20 19.5  1  1    4    1
3: 30.4   4 75.7  52 4.93 1.61 18.5  1  1    4    2
4: 30.4   4 95.1 113 3.77 1.51 16.9  1  1    5    2
5: 27.3   4 79.0  66 4.08 1.94 18.9  1  1    4    1

# We could sort on multiple keys. Here we find the cars with the best
# gas mileage and then sort those on increasing weight

dt[order(-mpg,wt)][1:5]
   mpg cyl disp  hp drat   wt qsec vs am gear carb
1: 33.9   4 71.1  65 4.22 1.83 19.9  1  1    4    1
2: 32.4   4 78.7  66 4.08 2.20 19.5  1  1    4    1
3: 30.4   4 95.1 113 3.77 1.51 16.9  1  1    5    2
4: 30.4   4 75.7  52 4.93 1.61 18.5  1  1    4    2
5: 27.3   4 79.0  66 4.08 1.94 18.9  1  1    4    1

Chicago Crime Statistics

Let’s look at a more realistic example. I have a file that relates to Chicago crime data that you can download if you wish (that is if you want to work this example). It is about 81 megabytes so it isn’t terribly large.

url <- "http://steviep42.bitbucket.org/YOUTUBE.DIR/chi_crimes.csv"
download.file(url,"chi_crimes.csv")

So you might try reading it in with the typical import functions in R such as read.csv, which many people use as a default when reading in CSV files. I'm reading this file in on a five-year-old MacBook with about 8 GB of RAM. Your speed may vary. We'll first read in the file using read.csv and then use the fread function supplied by data.table to see if there is any appreciable difference.

system.time(df.crimes <- read.csv("chi_crimes.csv", header=TRUE,sep=","))

user system elapsed
30.251 0.283 30.569

nrow(df.crimes)
[1] 334141

# Now let's try fread

system.time(dt.crimes <- fread("chi_crimes.csv", header=TRUE,sep=","))

user system elapsed
1.045 0.037 1.362

attributes(dt.crimes)$class # dt.crimes is also a data.frame
[1] "data.table" "data.frame"

nrow(df.crimes)
[1] 334141

dt.crimes[,.N]
[1] 334141

That was a fairly significant difference. If the file were much larger we would see an even larger time difference which for me is a good thing since I routinely read in large files. Consequently fread has become a default for me even if I don’t wind up using the aggregation capability of the data.table package. Also note that the fread function returns a data.table object by default. Now let’s investigate some crime using some of the things we learned earlier. This data table has information on every call to the Chicago police in the year 2012. So we’ll want to see what factors there are in the data frame/table so we can do some summaries across groups.

 names(dt.crimes)
 [1] "Case Number"          "ID"                   "Date"
[4] "Block"                "IUCR"                 "Primary Type"
[7] "Description"          "Location Description" "Arrest"
[10] "Domestic"             "Beat"                 "District"
[13] "Ward"                 "FBI Code"             "X Coordinate"
[16] "Community Area"       "Y Coordinate"         "Year"
[19] "Latitude"             "Updated On"           "Longitude"
[22] "Location"  

# Let's see how many unique values there are for each column. Looks 
# like 30 FBI codes so maybe we could see the number of calls per FBI 
# code. What about District ? There are 25 of those.

sapply(dt.crimes,function(x) length(unique(x)))

        Case Number                   ID                 Date 
              334114               334139               121484 
               Block                 IUCR         Primary Type 
               28383                  358                   30 
         Description Location Description               Arrest 
                 296                  120                    2 
            Domestic                 Beat             District 
                   2                  302                   25 
                Ward             FBI Code         X Coordinate 
                  51                   30                60704 
      Community Area         Y Coordinate                 Year 
                  79                89895                    1 
            Latitude           Updated On            Longitude 
              180396                 1311               180393 
            Location 
              178534 


Now, I just used the sapply function to tell me how many unique values each column assumes. This is so we can identify potential summary factors. This is a common activity and I used a common R-like approach, although for the newcomer it looks a little verbose (Welcome to R my friend), and it turns out that we can accomplish the same thing using data.table itself. But first let's do some exploration and then we'll come back to this. So how many calls per District were there ? Let's order this result so that we see the Districts with the highest number of calls first.

dt.crimes[,.N,by=District][order(-N)]
    District     N
 1:        8 22386
 2:       11 21798
 3:        7 20150
 4:        4 19789
 5:       25 19658
 6:        6 19232
 7:        3 17649
 8:        9 16656
 9:       19 15608
10:        5 15258
11:       10 15016
12:       15 14385
13:       18 14178
14:        2 13448
15:       14 12537
16:        1 12107
17:       16 10753
18:       22 10745
19:       17  9673
20:       24  9498
21:       12  8774
22:       13  7084
23:       20  5674
24:       NA  2079
25:       31     6
District     N

Next let's randomly sample 500 rows and then count the number of calls grouped by FBI Code (whatever that corresponds to; check https://www2.fbi.gov/ucr/nibrs/manuals/v1all.pdf to see them all). Note that since .N is already the per-group count, wrapping it in mean() simply returns that count.

dt.crimes[sample(1:.N,500), .(mean=mean(.N)), by="FBI Code"]
  FBI Code mean
1:       14   60
2:       19    3
3:       24    6
4:       26   47
5:       06  109
6:      08B   83
7:       07   27
8:      08A   22
9:       05   34
10:       18   44
11:      04B   10
12:       03   19
13:       11   15
14:      04A    7
15:       09    1
16:       15    6
17:       16    3
18:       10    1
19:       17    1
20:       02    2

# Here we count the number of calls for each day of the year and order from the highest
# to the lowest. We see then that the day with the most calls to the police was
# New Year's Day, whereas Christmas Day had the fewest calls.

dt.crimes[,.N,by=substr(Date,1,10)][order(-N)]
         substr    N
  1: 01/01/2012 1297
  2: 06/01/2012 1163
  3: 07/01/2012 1158
  4: 09/01/2012 1124
  5: 07/15/2012 1122
 ---                
362: 12/27/2012  679
363: 01/20/2012  677
364: 11/22/2012  653
365: 12/26/2012  619
366: 12/25/2012  520

# Here is another example that would emulate the HAVING clause in SQL

dt.crimes[,.N,by=substr(Date,1,10)][N > 1122][order(-N)]
       substr    N
1: 01/01/2012 1297
2: 06/01/2012 1163
3: 07/01/2012 1158
4: 09/01/2012 1124

Other Things

Keep in mind that data.table isn't just for aggregation. You can do anything with it that you can do with a normal data frame. This includes creating new columns, modifying existing ones, writing your own functions to do aggregation, and many other activities (a short sketch of creating and modifying columns with := follows the examples below).


# Get the next to the last row from the data table

dt[.N-1]
mpg cyl disp  hp drat   wt qsec vs am gear carb
1:  15   8  301 335 3.54 3.57 14.6  0  1    5    8

dt[cyl %in% c(4,6)]
mpg cyl  disp  hp drat   wt qsec vs am gear carb
1: 21.0   6 160.0 110 3.90 2.62 16.5  0  1    4    4
2: 21.0   6 160.0 110 3.90 2.88 17.0  0  1    4    4
3: 22.8   4 108.0  93 3.85 2.32 18.6  1  1    4    1
4: 21.4   6 258.0 110 3.08 3.21 19.4  1  0    3    1
5: 18.1   6 225.0 105 2.76 3.46 20.2  1  0    3    1
6: 24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
7: 22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
8: 19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
9: 17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
10: 32.4   4  78.7  66 4.08 2.20 19.5  1  1    4    1
11: 30.4   4  75.7  52 4.93 1.61 18.5  1  1    4    2
12: 33.9   4  71.1  65 4.22 1.83 19.9  1  1    4    1
13: 21.5   4 120.1  97 3.70 2.46 20.0  1  0    3    1
14: 27.3   4  79.0  66 4.08 1.94 18.9  1  1    4    1
15: 26.0   4 120.3  91 4.43 2.14 16.7  0  1    5    2
16: 30.4   4  95.1 113 3.77 1.51 16.9  1  1    5    2
17: 19.7   6 145.0 175 3.62 2.77 15.5  0  1    5    6
18: 21.4   4 121.0 109 4.11 2.78 18.6  1  1    4    2

# Summarize different variables at once

dt[,.(avg_mpg=mean(mpg),avg_wt=mean(wt))]
avg_mpg avg_wt
1:    20.1   3.22

dt[,cyl:=NULL]  # Removes the cyl column

head(dt)
mpg disp  hp drat   wt qsec vs am gear carb
1: 21.0  160 110 3.90 2.62 16.5  0  1    4    4
2: 21.0  160 110 3.90 2.88 17.0  0  1    4    4
3: 22.8  108  93 3.85 2.32 18.6  1  1    4    1
4: 21.4  258 110 3.08 3.21 19.4  1  0    3    1
5: 18.7  360 175 3.15 3.44 17.0  0  0    3    2
6: 18.1  225 105 2.76 3.46 20.2  1  0    3    1
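As mentioned above, the := operator creates or modifies columns in place. Here is a short sketch; good_mpg mirrors the earlier dplyr mutate example and is removed again afterwards (recall that the cyl column was deleted from dt just above):

dt[, good_mpg := ifelse(mpg > 25,"good","bad")]   # create a new column
dt[, wt := round(wt,1)]                           # modify an existing column in place

head(dt,3)

dt[, good_mpg := NULL]                            # remove the example column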

Identifying Possible Factors in a Data Frame

Okay – I demonstrated above a way to identify the number of unique values contained in each column as a means to determine what the factors/categories might be. I used the following


sapply(dt.crimes,function(x) length(unique(x)))

However, the data.table package has a special variable called .SD and a function called uniqueN that can help us. Let's explore what these do. First let me create a smaller data table which makes it easier to see what is going on.

set.seed(123)
mydt <- data.table(group=sample(LETTERS[1:3],6,T),gender=sample(c("M","F"),6,T),
                   measure=round(rnorm(6,10),2))

mydt
   group gender measure
1:     A      F   10.46
2:     C      F    8.73
3:     B      F    9.31
4:     C      M    9.55
5:     C      F   11.22
6:     A      M   10.36

# The following groups the data.table by the "gender" column

mydt[,.SD,by=gender]
   gender group measure
1:      F     A   10.46
2:      F     C    8.73
3:      F     B    9.31
4:      F     C   11.22
5:      M     C    9.55
6:      M     A   10.36

In this case the .SD variable is a way to group the data by a key as if we were creating an index. (Note that the data.table package has a setkey function for this to formalize the creation of an index.)
What about this example ?

mydt[,print(.SD),by=gender] 
   group measure
1:     A   10.46
2:     C    8.73
3:     B    9.31
4:     C   11.22
   group measure
1:     C    9.55
2:     A   10.36
Empty data.table (0 rows) of 1 col: gender

This is somewhat similar to the native split function in R that lets one split a data frame on a given factor and store the results in a list. In this case, however, the "splits" aren't really stored anywhere because we are simply printing them. A more useful example might be:

mydt[,lapply(.SD,mean),by=gender,.SDcols="measure"] 
   gender measure
1:      F    9.93
2:      M    9.96 

Oh wow, so .SD pulls out all the columns in the data table except the grouping column (gender) and then the mean function is applied to the columns specified by the .SDcols variable. So this is another way of doing some aggregation (although perhaps a bit intimidating for those not familiar with the lapply function; if that is the case I have a cure for you: go read this post). Well, we might be getting a bit off track here because I wanted to show you how to replace the functionality of the sapply example from above. Here it is. The uniqueN function determines how many unique values there are in a column. The lapply function applies uniqueN to each column.


mydt[,lapply(.SD,uniqueN)]
   group gender measure
1:     3      2       6

C’est la fin ?

Before we finish this posting it is worth mentioning that the data.table package also provides support for setting a key, much in the same way one would create an index in a relational database to speed up queries. This is for situations wherein you might have a really large data table and expect to routinely interrogate it using the same column(s) as keys. In the next posting I will look at the dplyr package to show another way to handle large files and accomplish intuitive aggregation. Some R experts present data.table as a competitor of dplyr although one could mix the two. What I like about data.table is that it allows you to build sophisticated queries, summaries, and aggregations within the bracket notation. It has the added flexibility of allowing you to employ existing R functions or any that you decide to write.
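For instance, a minimal sketch of setting a key and then doing a keyed subset might look like the following (using a fresh copy of mtcars since we removed the cyl column from dt earlier):

dt2 <- data.table(mtcars)

setkey(dt2, cyl)    # physically sorts dt2 by cyl and marks cyl as the key

key(dt2)
[1] "cyl"

dt2[.(6)]           # fast keyed subset: all of the 6 cylinder cars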

Next up in our review of the family of apply commands we’ll look at the lapply function, which can be used to loop over the elements of a list (or a vector). This is a true convenience although for those with experience in other programming languages it can seem unnecessary since you are accustomed to writing your own loops. Rest assured you can take that approach in R but once you get an understanding of lists and lapply you will appreciate what it can do for you. This leads me to what I feel is an important observation. I find that most misunderstandings of the lapply command result primarily from a limited or incomplete knowledge of the list structure in R. As long as you know what lists are and how to effectively manipulate them then lapply becomes much easier to implement. So we’ll begin with a brief yet important tutorial on lists and how elements can be referenced.

The list structure represents R’s default data structure for containing heterogeneous information. Recall that vectors and matrices can accommodate only one data type at a time. That is you cannot have a vector or matrix with mixed data types. You can try to put numeric and character data into the same vector but R will convert it all to a single type. Don’t believe me ? Try it. R doesn’t complain but it will make everything a character.

somevec <- c(1,4,5,"4","5")

somevec
[1] "1" "4" "5" "4" "5"

So from a purely practical point of view some data structure must exist in R to accommodate mixed data. That is what the list structure is for. Where do lists show up in R ? All over the place it turns out. Many of the interesting statistical functions in R, as well as the many add on packages available on CRAN, return information in the form of lists.


# Let's do some regression using the mtcars data frame

mylm <- lm(mpg~wt, data = mtcars)

# What type of structure do we get back ? A list with 12 sub elements

str(mylm,max.level=0)
List of 12
 - attr(*, "class")= chr "lm"
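
Since mylm is just a list, we can inspect its element names and pull out individual pieces using the list notation we are about to review; a quick sketch:

names(mylm)          # the 12 element names: "coefficients", "residuals", "call", etc.

mylm$coefficients    # the intercept and slope of the fitted line
mylm[["call"]]       # the call that created the model: lm(mpg ~ wt, data = mtcars)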

Now isn’t that interesting ? As you might know, judging the quality of the regression process can be quite involved. R knows this so it returns lots of information encapsulated within a list to help you assess the model. The 12 elements include data types such as vectors, factors, lists, objects, and data frames. So if you are writing your own function and need to return diverse types of data then the list structure is for you ! But that’s all a bit too complex for the moment so let’s return to some basics. To motivate things I’ll present some variables here that look like they relate to characteristics of a family. We have a surname, the number of children in the family, their respective ages, and whether or not the children have had measles.

surname <- "Jones"
numofchild <- 2
ages <- c(5,7)
measles <- c("Y","N")

We could work with this information as individual variables though it might be useful to assemble it all into a single structure. So I will create a “named” list to contain this information. It is called a “named” list because each element receives a name as the list is being created. Named lists are easier for humans to manipulate and interrogate because we can refer to elements by name using the “$” notation which I will introduce momentarily.

family1 <- list(name="Jones",numofchild=2,ages=c(5,7),measles=c("Y","N"))

str(family1)
List of 4
 $ name      : chr "Jones"
 $ numofchild: num 2
 $ ages      : num [1:2] 5 7
 $ measles   : chr [1:2] "Y" "N"

Ah so we see that the elements "name" and "measles" are character vectors, element "numofchild" is a single-valued vector, and the "ages" element is a multi-valued vector. This proves that we can host data of differing types within a single data structure. Now how do we address these elements and retrieve their values ? Can we use numeric indexing as with a vector ? Can we use the names we created for each element ? The answer in both cases is "yes". If we have a list with named elements then we can use the "$" notation.

family1$ages
[1] 5 7

family1$measles
[1] "Y" "N"

# We can also reference by numeric position which is more useful if you are 
# writing your own loop structures but it is less intuitive 

family1[2]
$numofchild
[1] 2

family1[[2]]
[1] 2

Hmm. What's up with the double bracket vs the single bracket ? Well the way I think about it is that if you use the single bracket (as you would if this were a vector) you get back the name of the element as well as its value. While this is useful, it is usually more interesting to get the actual value(s) of the element, which (if you don't use the element name) requires use of the double brackets. Think of the double brackets as being more specific than the single brackets. Now even if you use the $ notation you can still address individual values of a given list element. So here I'll start with pulling out the age of the first child only.

family1$ages[1]
[1] 5

# We could pull out both ages using this approach

family1$ages[1:2]
[1] 5 7

# But this is the same as this:

family1$ages
[1] 5 7

# Which is the same as this:

family1[[3]]
[1] 5 7
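
Another way to see the single versus double bracket distinction is to check the class of what comes back; a quick sketch:

class(family1[2])     # single brackets return a sub-list
[1] "list"

class(family1[[2]])   # double brackets return the element's value itself
[1] "numeric"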

The way I would summarize the above information is that if you have a named list then you can use the “$” notation for the most part though if you want to address specific values within a multivalued element then you will also have to use the bracket notation in addition to the “$” notation. If you have an unnamed list then you must use the bracket notation exclusively since there are no names available. Unnamed lists result when no effort is made to name the elements such as in the following example. I can always apply names later if I wish.

family1 <- list("Jones",2,c(5,7),c("Y","N"))

# So when we print the list results we see only brackets - no names.

family1
[[1]]
[1] "Jones"

[[2]]
[1] 2

[[3]]
[1] 5 7

[[4]]
[1] "Y" "N"

# The cool thing is that we can make names for our elements even after we have
# created the list

names(family1) <- c("name","numofchild","ages","measles")

Introducing lapply

Admittedly this family1 list is a little basic but the above examples prove that there is flexibility in how you can address elements and values. So let’s present an example of the lapply function. I’ll use it to apply the built in “typeof” function to each element in the list.

lapply(family1, typeof)
$name
[1] "character"

$numofchild
[1] "double"

$ages
[1] "double"

$measles
[1] "character"

# Using lapply will return a list

str(lapply(family1, typeof))
List of 4
 $ name      : chr "character"
 $ numofchild: chr "double"
 $ ages      : chr "double"
 $ measles   : chr "character"

So each element of family1 is “silently” passed to the typeof function and the results for all elements are returned in a list. In fact the “l” in lapply comes from the fact that it will return to you a list. To make sure you understand what is happening here let me introduce a small variation to the example.


lapply(family1, typeof)

# is the same as

lapply(family1, function(x) typeof(x))

The second version does exactly the same thing as the first but illustrates two important facts: 1) I can pass an "anonymous" function to lapply. (If you read my blog on apply I discussed the importance and use of "anonymous" functions). 2) The variable "x" referenced in the anonymous function definition represents a placeholder for each element of the family1 list. If you need more help understanding this then look at this example where I use a function that simply prints back each element of the list and its associated value.

lapply(family1,function(x) x)
$name
[1] "Jones"

$numofchild
[1] 2

$ages
[1] 5 7

$measles
[1] "Y" "N"

# But this is the same as simply typing the name of the list at the prompt

family1

Now you could have also defined a function in advance instead of using an anonymous function. That is, you are not obligated to use anonymous functions. Some people find them to be error prone, confusing, and less readable. If you do then simply define your function in advance. It won't change the result or impact performance, at least in this case.

simplefunc <- function(x) {
   mytype <- typeof(x)
   return(mytype)
}

lapply(family1, simplefunc)
$name
[1] "character"

$numofchild
[1] "double"

$ages
[1] "double"

$measles
[1] "character"

Alright let’s get a little more advanced here. I’ll write a function that returns the mean value of each element. But I know what you are thinking. Not all of our list elements are numeric so to avoid an error I’ll have to insert some basic logic to test if the element is numeric. If it is numeric then I take the mean of the element value. If not then I ignore the element. I could implement this example two ways: 1) create a named function in advance which I then use in a call to lapply. 2) create an anonymous function when I call lapply. Here is the first scenario:


# Create a function in advance.

myfunc <- function(x) {
  if (is.numeric(x)) {
    mean(x)
  }
} 

# For numeric elements we get a meaningful result. For other elements we
# don't get back anything

lapply(family1, myfunc)
$name
NULL

$numofchild
[1] 2

$ages
[1] 6

$measles
NULL

What about approach #2 ? This is where I would define the function as I make the call to lapply. This works just as well but might be less readable to a newcomer.

lapply(family1, function(x) {if (is.numeric(x)) mean(x)})
$name
NULL

$numofchild
[1] 2

$ages
[1] 6

$measles
NULL

A list of lists !

Okay, let’s create some more lists that correspond to different families. If we want we can even create a master list whose elements are individual family lists. So in effect we are creating a list of lists ! In this case our master list has named elements so we can easily address the contents of the sub elements.

family2 <- list(name="Espinoza",numofchild=4,ages=c(5,7,9,11),measles=c("Y","N","Y","Y"))
family3 <- list(name="Ginsberg",numofchild=3,ages=c(9,13,18),measles=c("Y","N","Y"))
family4 <- list(name="Souza",numofchild=5,ages=c(3,5,7,9,11),measles=c("N","Y","Y","Y","N"))

allfams <- list(f1=family1,f2=family2,f3=family3,f4=family4)

str(allfams,max.level=1)
List of 4
 $ f1:List of 4
 $ f2:List of 4
 $ f3:List of 4
 $ f4:List of 4

allfams$f3$ages   # Get the ages of Family 3
[1]  9 13 18

# Same as

allfams[[3]]$ages
[1]  9 13 18

Okay so now what if we wanted to get the mean ages of each family’s children ? How could we do this using lapply ? It’s easy.

lapply(allfams, function(x) mean(x$ages))
$f1
[1] 6

$f2
[1] 8

$f3
[1] 13.33333

$f4
[1] 7

It might also be useful to get the average age across all children in all families. How might we do that ? It takes a little bit more work but not much. First, recognize that what we are getting back are all numeric values so we don’t really need a list to store that information. What I mean is that the only reason we use a list in the first place is to “host” data of differing types, but here our result is all numeric so let’s convert it to a vector.

unlist(lapply(allfams, function(x) mean(x$ages)))
      f1       f2       f3       f4 
 6.00000  8.00000 13.33333  7.00000 

# So check out the following. It gives us exactly what we want.

mean(unlist(lapply(allfams, function(x) mean(x$ages))))
[1] 8.583333

# An "expanded" version of this might have looked like:

mymeanages <- function(x) {
  return(mean(x$ages))
}

hold <- lapply(allfams, mymeanages)  # Get back a list with mean ages for each family

hold2 <- unlist(hold)    # Turn the result into a vector since everything is a numeric value

mean(hold2)   # Get the mean of all ages

Let’s ask another question that we could use lapply and a companion function to answer. Which families have 2 or 3 children ? Well since we only have 4 families in allfams we could just look at the lists and answer this question via visual inspection. But this might get really hard to do if our allfams list had 10, 100, or 1,000 families. So here is one way we could do this.

hold <- lapply(allfams,function(x) {x$numofchild >= 2 & x$numofchild <= 3}) 
which(hold == T)
f1 f3 
 1  3 

# Or we could do it all in one go

which(lapply(allfams,function(x) {x$numofchild >= 2 & x$numofchild <= 3}) == T)
f1 f3 
 1  3 
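As an aside, sapply (a close relative of lapply that we will meet again later) simplifies its result to a vector whenever possible, so the same question can be answered without the list comparison business. This is just an alternative sketch using the same allfams list:

which(sapply(allfams, function(x) x$numofchild >= 2 & x$numofchild <= 3))
f1 f3 
 1  3 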

Using the split command

Okay, how might we use this knowledge in another example ? Lists also show up in conjunction with the “split” command which, given a data frame and a factor, will split the data frame based on that factor into a list. This is best understood with an example. We’ll use the built in data frame called mtcars.

unique(mtcars$cyl)  # Cylinder takes on three distinct values
[1] 6 4 8

# We could split the data frame based on cylinder group.

mydfs <- split(mtcars,mtcars$cyl)

str(mydfs,max.level=1)
List of 3
 $ 4:'data.frame':	11 obs. of  11 variables:
 $ 6:'data.frame':	7 obs. of  11 variables:
 $ 8:'data.frame':	14 obs. of  11 variables:

So what we get back is a list called mydfs whose elements are data frames, each containing the observations for cars with a certain number of cylinders – 4, 6, or 8. This is a quick and efficient way to split up a data frame. It is worth pointing out that lots of people don’t take advantage of the split function, usually because they aren’t aware of it. If you don’t use the split function then you have to do it by hand using an approach like the following. While this will work it doesn’t scale very well, especially if you have a factor with many “levels”.


fourcyl  <- mtcars[mtcars$cyl==4,]  
sixcyl   <- mtcars[mtcars$cyl==6,]
eightcyl <- mtcars[mtcars$cyl==8,]

But let’s get back to the split function and our list of data frames. We have a list called mydfs whose elements are data frames with observations corresponding to cars of 4, 6, and 8 cylinders. We can use our knowledge of lists to look around some:

names(mydfs)
[1] "4" "6" "8"

mydfs$"4"
                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

How could we rapidly determine the mean MPG for each cylinder group using lapply ?

mydfs <- split(mtcars,mtcars$cyl)     
lapply(mydfs,function(x) mean(x$mpg))
$`4`
[1] 26.66364

$`6`
[1] 19.74286

$`8`
[1] 15.1

# Okay cool but we could bundle this up in one statement

lapply(split(mtcars,mtcars$cyl),function(x) mean(x$mpg))
$`4`
[1] 26.66364

$`6`
[1] 19.74286

$`8`
[1] 15.1

# Or more economically (though potentially confusing to a newcomer) 

unlist(lapply(split(mtcars,mtcars$cyl),function(x) mean(x$mpg)))

      4        6        8 
26.66364 19.74286 15.10000 

# Which is identical to the tapply function. 

tapply(mtcars$mpg,mtcars$cyl,mean)
       4        6        8 
26.66364 19.74286 15.10000 

I slipped that last one in on you to make a point that there are always multiple ways to solve problems using R. Some say that this flexibility is a great strength of R whereas others say it is a great source of confusion since newcomers don’t know which approach is best. When I was new to R I simply used whatever worked until I needed a faster or more flexible approach. My advice to you is don’t worry about which way is “right” because this will slow you down. Find an approach that solves your problems and change that approach when it becomes necessary. Okay that will wrap it up for the lapply intro. As always there are many other examples I could present but hopefully this blog will help in your mastery of lists and looping over them.

apply

The goal of this blog entry is to introduce basic and essential information about the apply function. Even established R users get confused when considering this family of functions especially when observing how many of them there are: apply, tapply, lapply, sapply, rapply, eapply, mapply. When I was new to R I was rarely satisfied with the all-too-basic explanations of these commands so I thought I would create a series on some of these functions to address the more common questions that newcomers might have. There is an add-on package called plyr that attempts to present a unified philosophy and approach to the process of “applying” functions to various R data structures though, for now, I’ll be sticking to the native R commands since they show up a lot in example code you are likely to encounter.

I’ll start with the apply command which expects a matrix as input. Depending on what function you specify when using the apply command, you will get back either a vector or a matrix. Let’s get started with a motivating example, which I hope will convince you of how useful the apply approach can be. Here we create a matrix of 16 elements drawn from a normal distribution with mean 10. We use the set.seed command to enable reproducibility. Imagine that each column represents numerical measurements and the rows represent individual samples. Scientific data, such as microarray and metabolomic information, can often follow this pattern. For those of you concerned with authenticity please don’t worry – I’ll introduce some “real” data later on in the posting.

set.seed(1)  # Makes the call to rnorm generate the same numbers every time
( mymat <- matrix(round(rnorm(16,10),2),4,4) )

      [,1]  [,2]  [,3]  [,4]
[1,]  9.37 10.33 10.58  9.38
[2,] 10.18  9.18  9.69  7.79
[3,]  9.16 10.49 11.51 11.12
[4,] 11.60 10.74 10.39  9.96

Doing it the hard way

Let’s say that we want to take the mean of each column. Another way to say this is that we want to apply the mean function to each column in the matrix. This is the way to start thinking about things like this in R. But first, how might we do this if we don’t know anything about the apply command ?

mean(mymat[,1])
[1] 10.0775

mean(mymat[,2])
[1] 10.185

mean(mymat[,3])
[1] 10.5425

mean(mymat[,4])
[1] 9.5625

# Or we could stash the means into a vector if we need to use that information elsewhere

( mymatcolmean <- c(mean(mymat[,1]),mean(mymat[,2]),mean(mymat[,3]),mean(mymat[,4])) )

[1] 10.0775 10.1850 10.5425  9.5625

See what I did there ? I took the mean of each column by invoking the mean function four times and substituting columns 1-4 on each invocation. This is a bit crude but entirely effective and we do get the correct answer. In the second example I created a vector called mymatcolmean to hold the information. The two approaches are equivalent though neither scales very well. Imagine if there were 10, 100, or 1,000 columns. Sure you could do cut and paste but it is still a tedious process. Surely there must be a better way. Well, if you have experience with programming languages like C, FORTRAN, Java, or Perl then you probably know how to use a for-loop structure to solve this problem. R supports for-loops too but it also offers some ways to avoid using them.

retvec <- vector()
for (ii in 1:ncol(mymat)) {
  retvec[ii] = mean(mymat[,ii])
}
retvec
[1] 10.0775 10.1850 10.5425  9.5625

# We could even put this into a function for later use in case we need it.

myloop <- function(somemat) {
 retvec <- vector()
 length(retvec) <- ncol(somemat)
 for (ii in 1:ncol(somemat)) {
   retvec[ii] <- mean(somemat[,ii])
 }
 return(retvec)
}

myloop(mymat)
[1] 10.0775 10.1850 10.5425  9.5625

# This will now work for any matrix but of course it is specific to the columns of the matrix.

set.seed(1)

newmat <- matrix(rnorm(100),10,10)

myloop(newmat)
 [1]  0.1322028  0.2488450 -0.1336732  0.1207302  0.1341367  0.1434569  0.4512100 -0.2477361
 [9]  0.1273603  0.1123413

So this would work but what if we wanted to take the mean of the rows instead of the columns ? We could do that but we then have to rewrite our function. And, while we are at it, in order to make our function more general what about providing an argument that indicates whether we want to operate on the rows or the columns ? I’ll use “1” to indicate rows and “2” to indicate columns.

myloop <- function(somemat,rc=2) {
  retvec <- vector()
  length(retvec) <- ncol(somemat)
  
  # Check to see if we are taking the mean of the columns or rows
  # 1 indicates rows, 2 indicates columns
  
  if (rc == 2) {
    for (ii in 1:ncol(somemat)) {
      retvec[ii] <- mean(somemat[,ii])
    }
  } else {
    for (ii in 1:nrow(somemat)) {
      retvec[ii] <- mean(somemat[ii,])
    }
  }
  return(retvec)
}

# Okay let's make sure it works. 

myloop(mymat,2)   # This is correct
[1] 10.0775 10.1850 10.5425  9.5625

myloop(mymat,1)   # This is correct
[1]  9.9150  9.2100 10.5700 10.6725

mean(mymat[1,])  # A spot check
[1] 9.915

Doing it the easy way

Well okay that’s good but what if we then wanted our myloop function to accommodate a function other than mean ? Like quantile, sum, fivenum, or even a function of our own design ? We could add another argument to our function to let the user specify the name of a function to be applied to the rows or columns. But before we do more work please consider that R already has something that will do the job for us – the apply command.
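Just to make concrete what we would be signing up for if we kept extending the hand-rolled loop, here is a hedged sketch of such a generalized version (myloop2 is a hypothetical name, and it assumes the supplied function returns a single value). It is purely illustrative – as the calls below show, apply already does this job for us.

# Hypothetical generalization of myloop: rc selects rows (1) or columns (2),
# and FUN is any single-valued summary function the caller supplies
myloop2 <- function(somemat, rc = 2, FUN = mean, ...) {
  FUN <- match.fun(FUN)
  n <- if (rc == 2) ncol(somemat) else nrow(somemat)
  retvec <- numeric(n)
  for (ii in 1:n) {
    vec <- if (rc == 2) somemat[, ii] else somemat[ii, ]
    retvec[ii] <- FUN(vec, ...)   # extra arguments are passed straight through
  }
  return(retvec)
}

myloop2(mymat, 2, mean)   # should match apply(mymat, 2, mean)
myloop2(mymat, 1, sum)    # row sums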

apply(mymat, 2, mean)
[1] 10.0775 10.1850 10.5425  9.5625

apply(mymat, 1, mean)
[1]  9.9150  9.2100 10.5700 10.6725

See how much easier that is than writing our own looping function ? It has been my observation that those well versed in traditional programming languages have a bigger problem getting used to the apply function than newcomers simply because experienced programmers are more accustomed to writing their own summary functions. They just dive in and start coding. But R short circuits this approach by providing the apply family of commands. Note also that we can substitute in any function we want to.


apply(mymat,2,class)  # What class do the columns belong to ?
[1] "numeric" "numeric" "numeric" "numeric"

apply(mymat,2,sum)    # Get the sum of all the columns
[1] 40.31 40.74 42.17 38.25

apply(mymat,1,range)  # Get the range of all rows
      [,1]  [,2]  [,3]  [,4]
[1,]  9.37  7.79  9.16  9.96
[2,] 10.58 10.18 11.51 11.60

apply(mymat,2,fivenum) # Get the fivenum summary for each column
       [,1]   [,2]   [,3]   [,4]
[1,]  9.160  9.180  9.690  7.790
[2,]  9.265  9.755 10.040  8.585
[3,]  9.775 10.410 10.485  9.670
[4,] 10.890 10.615 11.045 10.540
[5,] 11.600 10.740 11.510 11.120

Some activities on matrices are so common that R actually has dedicated functions for them that are very efficient and fast:

rowMeans(mymat)     # Equivalent to apply(mymat,1,mean)
[1]  9.9150  9.2100 10.5700 10.6725

colMeans(mymat)     # Equivalent to apply(mymat,2,mean)
[1] 10.0775 10.1850 10.5425  9.5625

rowSums(mymat)      # Equivalent to apply(mymat,1,sum)
[1] 39.66 36.84 42.28 42.69

colSums(mymat)      # Equivalent to apply(mymat,2,sum)
[1] 40.31 40.74 42.17 38.25
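If you want to reassure yourself that these shortcuts and the corresponding apply calls really do agree, a quick check along these lines should return TRUE in both cases:

# Sanity check: dedicated shortcuts versus the equivalent apply calls
all.equal(colMeans(mymat), apply(mymat, 2, mean))
all.equal(rowSums(mymat),  apply(mymat, 1, sum))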

Passing additional arguments

Many explanations I’ve seen for the apply family of functions omit any discussion on how to provide arguments to the function you pass to apply. I’m not sure why as this is an extremely important consideration. To show you what I mean let’s go back to the apply example that uses the mean function.

apply(mymat, 2, mean)
[1] 10.0775 10.1850 10.5425  9.5625

This is easy to understand though what happens if we wish to pass arguments to the mean function such as “trim” or “na.rm” ? For example maybe we wish to first trim off some portion of the data before we apply the mean function. If we did this using the mean function directly it would look like this:

mean(mymat[,1],trim=0.5)
[1] 9.775

# You might then be tempted to do something like this when using the apply command

apply(mymat, 2, mean(trim=0.5))
Error in mean.default(trim = 0.5) : 
  argument "x" is missing, with no default

# Or maybe this ?

apply(mymat, 2, mean(x,trim=0.5))
Error in mean(x, trim = 0.5) : object 'x' not found

To get some idea on how to do this correctly look at the help information for apply. It’s not exactly jumping out at you but the answer is there. Do you see it ?

     apply(X, MARGIN, FUN, ...)
     
Arguments:

       X: an array, including a matrix.

  MARGIN: a vector giving the subscripts which the function will be
          applied over.  E.g., for a matrix ‘1’ indicates rows, ‘2’
          indicates columns, ‘c(1, 2)’ indicates rows and columns.
          Where ‘X’ has named dimnames, it can be a character vector
          selecting dimension names.

     FUN: the function to be applied: see ‘Details’.  In the case of
          functions like ‘+’, ‘%*%’, etc., the function name must be
          backquoted or quoted.

     ...: optional arguments to ‘FUN’.

Oh, so the ellipsis implies that we can simply append arguments after the reference to the function. Of course it is reasonable to assume that the arguments you pass must make sense to the function. That is, you cannot pass an argument that isn’t valid for that function !

apply(mymat,2,mean,trim=0.5)
[1]  9.775 10.410 10.485  9.670

# We can pass additional arguments

apply(mymat,2,mean,trim=0.5, na.rm=T)
[1]  9.775 10.410 10.485  9.670

Using our own functions with apply

We can also use a function that we have written and pass it to the apply command. As an example let’s say that we want to express each element in a column as a proportion of the sum of the column it occupies. Here is how we would do this without using apply. As in the opening example this approach doesn’t scale very well although we do get the correct answers.

mymat[,1]/sum(mymat[,1])
[1] 0.2324485 0.2525428 0.2272389 0.2877698

mymat[,2]/sum(mymat[,2])
[1] 0.2535592 0.2253314 0.2574865 0.2636230

mymat[,3]/sum(mymat[,3])
[1] 0.2508893 0.2297842 0.2729429 0.2463837

mymat[,4]/sum(mymat[,4])
[1] 0.2452288 0.2036601 0.2907190 0.2603922

But using our new found knowledge of the apply command we can make this easier and more general. First, let’s write a function that given a vector, (which is what each column or row of a matrix is), will return a vector where each element is a proportion of the sum of the input vector.

myfunc <- function(x) {
  return(x/sum(x))
}

# Check it out to make sure it works

myfunc(mymat[,1])
[1] 0.2324485 0.2525428 0.2272389 0.2877698

all.equal(myfunc(mymat[,1]), mymat[,1]/sum(mymat[,1]))
[1] TRUE

So now we can pass this to the apply command directly


apply(mymat, 2, myfunc)
          [,1]      [,2]      [,3]      [,4]
[1,] 0.2324485 0.2535592 0.2508893 0.2452288
[2,] 0.2525428 0.2253314 0.2297842 0.2036601
[3,] 0.2272389 0.2574865 0.2729429 0.2907190
[4,] 0.2877698 0.2636230 0.2463837 0.2603922

# If this is correct then the sum of each column should be 1.

colSums(apply(mymat,2,myfunc))
[1] 1 1 1 1

# which is equivalent to the following nested apply call

apply( apply(mymat,2,myfunc), 2, sum)
[1] 1 1 1 1

The second example above, the one that has a call to apply within a call to apply, is known as a “composite call”, which can be confusing to newcomers. However, in the world of R, which aggressively implements functional programming concepts, the use of composite function calls is quite common and encouraged. But, there is no requirement to do things this way. Also, I should point out that R has a dedicated function for computing proportions in a matrix. Check out the prop.table command.
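As a quick hedged sketch of that last point, prop.table with margin = 2 divides each element by its column sum, so it should reproduce what we just did with apply and myfunc:

# prop.table(x, margin = 2) scales each element by its column total
prop.table(mymat, margin = 2)

# and the result should match our apply version (this should return TRUE)
all.equal(prop.table(mymat, 2), apply(mymat, 2, myfunc))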

Working anonymously

In reality we don’t have to predefine the function. We simply pass the function to apply in an “anonymous” fashion, which means that we don’t even give it a name. It lives only for the duration of the call to apply.

apply(mymat, 2, function(x) x/sum(x))
          [,1]      [,2]      [,3]      [,4]
[1,] 0.2324485 0.2535592 0.2508893 0.2452288
[2,] 0.2525428 0.2253314 0.2297842 0.2036601
[3,] 0.2272389 0.2574865 0.2729429 0.2907190
[4,] 0.2877698 0.2636230 0.2463837 0.2603922

This can be confusing to newcomers and, if it is, then you can simply predefine the function using a name and then refer to it by name when using the apply command. Sometimes you see calls to apply that look like the following. This is done in an attempt to improve readability though that is subjective. In terms of results or efficiency there are no differences.

apply(mymat, 2, function(x) {
                      return(x/sum(x))
                })

Performance vs the for loop

If you’ve read any of my other postings you will know that I have an interest in performance. Is there an efficiency gain when using the apply approach vs. the for loop approach ? To test this let’s read in some “real” Microarray data, which comes from Su et al, 2002.

url <- "http://steviep42.bitbucket.org/data/su_et_al_2002.txt"
mydata <- read.csv(url,header=T)

# B = brain, BF = fetal brain, L = liver, LF = fetal liver

str(mydata)
'data.frame':	12626 obs. of  8 variables:
 $ B1 : num  9.37 11.57 7.79 6.56 7.26 ...
 $ B2 : num  9.5 11.51 6.69 7.45 7.39 ...
 $ FB1: num  7.89 10.56 7.91 5.93 8.75 ...
 $ FB2: num  7.96 10.34 7.57 7.53 8.19 ...
 $ FL1: num  7.7 10.26 8.25 7.84 9.45 ...
 $ FL2: num  8.1 10.13 8.53 8.16 8.19 ...
 $ L1 : num  9.22 10.23 8.39 11.02 9.19 ...
 $ L2 : num  8.9 10.49 8.47 11.09 9.42 ...

# Let's check the mean and standard deviation for each column

apply(mydata,2,function(x) c(mu=mean(x),sd=sd(x)))
         B1       B2      FB1      FB2      FL1      FL2       L1       L2
mu 7.615090 7.431602 7.879343 7.770545 7.730293 7.601395 7.415937 7.518259
sd 2.257218 2.452364 2.150741 2.254736 2.207076 2.355612 2.405326 2.365394

Okay well let’s do a t-test across all rows for the brain tissue types. This becomes really easy to do with the apply command.


mypvals <- apply(mydata, 1, function(x) { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value } )

round(mypvals[1:5],4)
 100_g_at   1000_at   1001_at 1002_f_at 1003_s_at 
   0.0090    0.0494    0.5211    0.8001    0.1349 
 

But is this faster than writing our own loop to do the t-test ? We’ll first need to write our own code to do the t-test. It might look something like this:


myttest <- function(somemat) {
  retvec <- vector()
  length(retvec) <- nrow(somemat)
  for (ii in 1:nrow(somemat)) {
    retvec[ii] <- t.test(somemat[ii,c("B1","B2")], somemat[ii,c("FB1","FB2")])$p.value
  }
  names(retvec) <- row.names(somemat)
  return(retvec)
}

Next we’ll make sure that this function returns the same results as the apply version after which we will time each approach.

v1 <- myttest(mydata)

v2 <- apply(mydata, 1, function(x) { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value } )

all.equal(v1,v2)
[1] TRUE

# Let's time the two different approaches

library(rbenchmark)   # benchmark() comes from the rbenchmark package

benchmark(forloop = myttest(mydata), apply=apply(mydata, 1, function(x) 
                         { t.test(x[c("B1","B2")], x[c("FB1","FB2")])$p.value }),
                           replications=5,columns=c("test","elapsed"))
     test elapsed
2   apply  19.616
1 forloop  61.800

In this case it appears that the apply approach is indeed quicker than writing our own function though in my experience, with more general cases, the difference in execution time isn’t as drastic. In the end I suggest you use whichever approach makes the most sense to you to get the job done. However, I think that as you become more confident with R you will be attracted to the apply command since it basically replaces the for-loop. So whenever you find yourself tempted to write a for-loop, even though you are used to it in other languages, you should remind yourself that the “R-like” way to do it is to use apply.

I’m frequently asked by newcomers to R to provide an easy to follow generic set of instructions on how to download data, transform it, aggregate it, make graphs, and write it all up for publication in a high impact journal – all by the end of the day ! While such a request is somewhat understandable coming from a student it’s somewhat awkward when coming from a seasoned research investigator, (i.e. someone who should know better). Obviously, no such set of instructions exists given the astonishing variety of available data stored in differing formats with varying degrees of adherence to a standard(s). Simply put, “real world” data wrangling and analysis can be a complex process independent of the language you choose to employ. Nonetheless, what I’ll attempt to do here is provide a “tour” involving some data I picked at random. I’ll be sticking to the problems of downloading, filtering, and graphing some of the data. And rather than dictate to you what is the “best way” to do something I’ll simply show you what is on offer and you can make your own decisions. R is amazingly flexible, which means that there are many different ways to do the same thing.

I have chosen information relating to San Francisco restaurant inspections which can be found at https://data.sfgov.org/ along with an impressive variety of data sets involving crime reports, parking availability, bus routes, campaign funding, and many other types of civic info. For reference we can map the zipcodes mentioned in the health reports using methods outlined in my blog post on Geocoding. This is basically the area we are considering.

San Francisco restaurants – Mmm. Let’s eat !

I’m hungry !

Without even knowing how the restaurant information is organized one can easily imagine a number of interesting questions such as, “What is the average health score for all restaurants ? What restaurants have the highest, (or lowest), scores ? What is the distribution of scores across zip codes ? Are there seasonal scoring trends ?”. While you can download the files from the above website I have saved a copy on my web site in case the original location changes at some point. The data is available in a zipped file that contains three CSV files that we can easily unpack into data frames.

url <- "http://steviep42.bitbucket.org/data/SFFoodProgram_Complete_Data.zip"
download.file(url,"SFFood.zip")
system("unzip -l SFFood.zip")

Archive:  SFFood.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
  4613344  04-23-14 09:51   violations_plus.csv
  1175067  04-23-14 09:51   businesses_plus.csv
  1754811  04-23-14 09:51   inspections_plus.csv
 --------                   -------
  7543222                   3 files

There are three files each of which has some key information relating to the restaurant inspection process. I should point out that a description file comes with the dataset, which you can read to get a better idea about the column names and formats. But we can also use R to help us understand what is going on. Let’s read these files into data frames:

businesses  <- read.csv(unz("SFFood.zip","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("SFFood.zip","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("SFFood.zip","inspections_plus.csv"),header=T,stringsAsFactors=F)
str(violations)

'data.frame':	55800 obs. of  5 variables:
 $ business_id    : int  10 10 10 10 10 10 17 17 17 17 ...
 $ date           : int  20140114 20140114 20140114 20120403 20110428 20121114 20130605 20130605 20130605 20140312 ...
 $ ViolationTypeID: int  103119 103145 103154 103154 103119 103154 103154 103144 103142 103142 ...
 $ risk_category  : Factor w/ 4 levels "High Risk","Low Risk",..: 3 2 2 2 3 2 2 2 2 2 ...
 $ description    : Factor w/ 70 levels "","Consumer advisory not provided for raw or undercooked foods",..: 22 17 62 62 22 62 62 58 61 61 ...

str(inspections)

'data.frame':	40935 obs. of  4 variables:
 $ business_id: int  10 10 10 10 10 10 10 10 17 17 ...
 $ Score      : int  NA 92 98 NA 98 100 NA 96 94 NA ...
 $ date       : int  20140124 20140114 20121114 20120920 20120403 20110928 20110601 20110428 20140312 20130711 ...
 $ type       : Factor w/ 14 levels "Administrative or Document Review",..: 11 13 13 11 13 13 11 13 13 11 ...

str(businesses)

'data.frame':	6109 obs. of  17 variables:
 $ business_id         : int  10 17 19 24 29 31 37 45 48 50 ...
 $ name                : Factor w/ 5562 levels " HOL N JAM LEMONADE STAND  #2",..: 5110 1901 3525 3562 1067 3509 793 1027 265 4697 ...
 $ address             : Factor w/ 5444 levels "","   ","  VARIOUS LOCATIONS  ",..: 119 1868 548 3892 713 2478 3506 2870 4814 1446 ...
 $ city                : Factor w/ 1 level "San Francisco": 1 1 1 1 1 1 1 1 1 1 ...
 $ state               : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
 $ postal_code         : Factor w/ 47 levels "","00000","04102",..: 12 29 16 12 10 35 24 17 27 10 ...
 $ latitude            : num  37.8 37.7 37.8 37.8 37.8 ...
 $ longitude           : num  -122 -122 -122 -122 -122 ...
 $ phone_number        : num  NA 1.42e+10 NA NA 1.42e+10 ...
 $ TaxCode             : Factor w/ 32 levels "AA","H07","H08",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ business_certificate: int  779059 78443 NA 352312 939325 346882 315736 340024 318022 NA ...
 $ application_date    : Factor w/ 2327 levels "","01/01/2005",..: 1 620 1 1 2137 1 1473 1836 1 1 ...
 $ owner_name          : Factor w/ 5294 levels "","\tCARLIN'S CORNER COFFEESHOP LLC",..: 4746 2704 61 3387 186 3335 4297 1809 5198 333 ...
 $ owner_address       : Factor w/ 5366 levels "","\t1 INDEPENDENCE POINTE #305",..: 2898 4397 484 3785 1479 2492 3413 414 1103 2206 ...
 $ owner_city          : Factor w/ 335 levels "","\tBRISBANE",..: 239 88 239 239 88 239 241 226 241 280 ...
 $ owner_state         : Factor w/ 51 levels "","ARIZONA","AZ",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ owner_zip           : Factor w/ 376 levels "","\t28273","\t29615",..: 165 139 170 165 139 199 183 178 188 177 ...

It looks like all three data frames have a column in common called “business_id”, which we can use as a “key” to link/merge them. Even without merging we can already answer some basic questions: there were 40,935 inspections, which led to 55,800 violations, over a time range of 04/25/2011 to 04/22/2014. I use the strptime function here to turn the dates (stored as numbers like 20140114) into POSIX dates.

inspections$date <- strptime(inspections$date,format="%Y%m%d")
violations$date  <- strptime(violations$date,format="%Y%m%d")
range(inspections$date)
[1] "2011-04-25 EDT" "2014-04-22 EDT"

Okay let’s extract information just for the year 2013 and use that as our base data set. Note that I could merge the data frames into a single data frame and we will eventually look at how to do that but for now let’s see what we can get out of each data frame individually.

start_time <- strptime("20130101","%Y%m%d")
stop_time  <- strptime("20131231","%Y%m%d")
inspections <- subset(inspections, date > start_time & date < stop_time)
violations <- subset(violations, date > start_time & date < stop_time)

range(inspections$date)
[1] "2013-01-02 EST" "2013-12-30 EST"

nrow(inspections)
[1] 13110

nrow(violations)
[1] 19091

Relative to the restaurant inspections process, the way I understand it is that an inspector will visit a restaurant at some frequency or in response to a diner’s complaint. The inspector will evaluate the establishment, document any code violations at a severity level of low, medium, or high, and ultimately assign a numeric score from 0 to 100 with the following possible meanings: 0-70: Poor, 71-85: Needs Improvement, 86-89: Adequate, 90-100: Good. If violations were noted then one or more followup inspections will usually occur to ensure that the violations are adequately addressed. Such visits usually do not involve a “re-scoring”, just a verification that the establishment dealt with the previously observed violations. Evidently this could take multiple visits.

We might then expect that in the inspections file there are multiple rows for a business, starting with the original visit and any associated followups. In the case of the followups it looks like the Score is set to NA. Let’s see if that is true. According to the inspections file a “Routine - Unscheduled” inspection was performed on 06/05/2013 for business_id 17 that resulted in a score of 94, which is good, although in consulting the violations file it seems that there were three “Low Risk” violations noted. A reinspection happened on 07/11/2013.


head(inspections[order(inspections$business_id),])

   business_id Score       date                  type
10          17    NA 2013-07-11 Reinspection/Followup
11          17    94 2013-06-05 Routine - Unscheduled
18          19    96 2013-09-04 Routine - Unscheduled
23          24   100 2013-11-18 Routine - Unscheduled
27          29    87 2013-10-01 Routine - Unscheduled
34          37    96 2013-08-08 Routine - Unscheduled


head(violations[order(violations$business_id),])

   business_id       date ViolationTypeID risk_category                                      description
7           17 2013-06-05          103154      Low Risk     Unclean or degraded floors walls or ceilings
8           17 2013-06-05          103144      Low Risk Unapproved or unmaintained equipment or utensils
9           17 2013-06-05          103142      Low Risk                 Unclean nonfood contact surfaces
18          19 2013-09-04          103133 Moderate Risk           Foods not protected from contamination
23          29 2013-10-01          103142      Low Risk                 Unclean nonfood contact surfaces
24          29 2013-10-01          103120 Moderate Risk          Moderate risk food holding temperature 

Grading

By referring to the inspections file we can answer one of our questions easily. What is the average health score across all restaurants ?

mean(inspections$Score,na.rm=T)
[1] 91.95652 

But let’s dig a little deeper. I’ll create a factor called rating that implements the scoring interpretation given above. This is easy to do using the cut command. We’ll then summarize the mean score for each category and follow that up with a boxplot of the scores per category. This should give us some initial ideas about the data.


inspections$rating <- cut(inspections$Score,breaks=c(0,70,85,89,100),right=T,
    labels=c("P","NI","A","G")) 

# P = Poor, NI = Needs Improvement, A = Adequate, G = Good

tapply(inspections$Score,inspections$rating,mean)
       P       NI        A        G 
64.34337 79.97907 87.55324 95.94264 

ylim <- c(min(inspections$Score,na.rm=T)-5,max(inspections$Score,na.rm=T)+5)

# Capture the boxplot output to get the number of observations per category

myb <- boxplot(Score~rating,data=inspections,main="Scores per Rating Category",
               col=rainbow(4),ylim=ylim)

leg.txt <- paste(levels(inspections$rating),myb$n,sep=" : ")

legend(3,70,leg.txt,title="Obs. per Category",cex=0.8,pch=19,col=rainbow(4))

Boxplots of score per category

Hmm, this information is a little shocking since no one wants to think that there are restaurants anywhere with a health score in the “Poor” category let alone 166 of them as we see here. Who does the lowest score of 2013 belong to ? Business id 74522. Note that it would be really easy to consult the “businesses” data frame at this point to determine the full name associated with this id though since this is a blog on R and not on restaurant reviews I’ll leave it to you to determine what business name that is so you can make your dining plans accordingly.

# What business id had the lowest score in 2013 ?

inspections[which(inspections$Score==min(inspections$Score,na.rm=T)),]

      business_id Score       date                  type rating
39498       74522    42 2013-10-03 Routine - Unscheduled      P

# Let's see what the violations were

violations[violations$business_id==74522,][,-3]

 business_id       date  risk_category                                        description
       74522 2013-10-03                Low Risk                   Unclean nonfood contact surfaces
       74522 2013-10-03               High Risk            Unclean hands or improper use of gloves
       74522 2013-10-03                Low Risk       Unclean or degraded floors walls or ceilings
       74522 2013-10-03                Low Risk   Improper storage of equipment utensils or linens
       74522 2013-10-03                Low Risk                              Improper food storage
       74522 2013-10-03           Moderate Risk                          Improper thawing methods 
       74522 2013-10-03           Moderate Risk Inadequate and inaccessible handwashing facilities
       74522 2013-10-03               High Risk                       High risk vermin infestation
       74522 2013-10-03               High Risk        Unclean or unsanitary food contact surfaces
       74522 2013-10-03               High Risk                High risk food holding temperature 
       74522 2013-10-03               High Risk                   Contaminated or adulterated food
       74522 2013-10-03               High Risk              Improper cooking time or temperatures

So let’s just go ahead and find the 25 restaurants that have the highest number of “High Risk” violations. How would we find those ?

# Pull out the high risk violations

violations_high <- violations[violations$risk_category == "High Risk",]

# Split the data frame by business_id

my_high_viols <- split(violations_high,violations_high$business_id)

# Count all violations for a given business id and then sort all
# business ids from highest to lowest

( violation_high_count <- sort(sapply(my_high_viols,nrow),T)[1:25] )

33446  2427   286  3480 70068  6695 16806 67330 70281 73926 74522   853  1000  1343  1358  1458 
   11     9     7     7     7     6     6     6     6     6     6     5     5     5     5     5 
 2189  2302  2401  3217  3948  4590 18800 21524 36201 
    5     5     5     5     5     5     5     5     5 

inspections[inspections$business_id==33446,]
      business_id Score       date                  type rating 
22145       33446    73 2013-12-11 Routine - Unscheduled     NI      

Notice that while business_id 74522 has the worst overall score, 42, it didn’t actually receive the highest number of “High Risk” violations in 2013. That honor goes to business_id 33446, which received 11 such violations and a score of 73 on 12/11/2013. But let’s go back to a general view here. I want to slice up the inspections into quarters of the year, (four groups of three months), to see if perhaps scores follow some seasonal pattern at least from a visual point of view. To do this we need to use some R date functions.

inspections$quarters <- quarters(inspections$date)
inspections$quarters <- factor(inspections$quarters,levels=c("Q1","Q2","Q3","Q4"),ordered=TRUE)

# Now let's build the barchart out of a table of quarters vs ratings

library(lattice)
barchart(table(inspections$quarters,inspections$rating),auto.key=list(columns=4),
         main="Violation Category Counts by Quarter")

Barchart of quarters vs. ratings

Well it doesn’t look like there are drastic differences between the scoring categories on a quarterly basis though there appear to be slightly fewer scores in the “NI” category in Quarter 1 than in the other quarters. We could drop down to a month level of granularity to check things out from month to month. All we would need to do is use the “months” date function instead of “quarters” and then apply level labels accordingly. I’ll leave that to you as an exercise, though a minimal sketch follows.
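If you want a head start on that exercise, here is one hedged way to begin. It assumes the same inspections data frame we have been working with and the lattice package loaded above; months() works on the POSIX dates we created earlier and the built-in month.abb constant supplies the abbreviated month labels.

# Month-level version of the quarterly barchart (sketch only)
inspections$month <- months(inspections$date, abbreviate = TRUE)
inspections$month <- factor(inspections$month, levels = month.abb, ordered = TRUE)

barchart(table(inspections$month, inspections$rating), auto.key = list(columns = 4),
         main = "Violation Category Counts by Month")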

Grades by zipcode

Well what’s next ? Let’s take a look at the distribution of scores across zip codes, which live in the businesses file. We’ll also do some merging of the data frames to make our process a bit easier. First, how many unique zip codes do we have ?

unique(businesses$postal_code)
 [1] "94104"     "94124"     "94109"     "94103"     "94133"     "94118"     "94110"     "94122"     "94115"    
[10] "94131"     "94111"     "94117"     "94107"     "94108"     "94102"     "94132"     "94105"     "94134"    
[19] "94116"     "94121"     "94112"     "94127"     "94123"     "94114"     "94513"     "94545"     "94066"    
[28] "941033148" ""          "94158"     "95105"     "94140"     "94013"     "94130"     "CA"        "92672"    
[37] "94120"     "94143"     "94609"     "94101"     "00000"     "CA  94122" "CA  94523" "94188"     "94014"    
[46] "04102"     "94129"

Most of them look legitimate whereas others don’t appear to be valid and are perhaps the result of data entry errors. For example while “CA  94122” does contain a legal zip code the expected format is just the five digits. Also, the “00000” might be used to signal a missing value but we don’t know. For simplicity we’ll keep only the values that are exactly five characters long and drop the “00000” entries. We also have some oddities like “04102”, (Portland), and “92672”, (San Clemente), although if we look closely at the businesses data frame for these records we see street names of, respectively, “366 GOLDEN GATE AVE” and “1530 Market Street”, which are San Francisco addresses. Looks like somebody messed up when entering data. So in a “real” analysis we would need to deal with these issues more definitively. But we’ll pass on these for now and merge the inspections data frame with the filtered version of the businesses data frame that has the cleaned-up zip codes. Finally, we’ll plot a barchart of the average score for each zip code.


myb <- businesses[nchar(businesses$postal_code)==5 & businesses$postal_code != "00000",]

unique(myb$postal_code)

[1] "94104" "94124" "94109" "94103" "94133" "94118" "94110" "94122" "94115" "94131" "94111" "94117" "94107"
[14] "94108" "94102" "94132" "94105" "94134" "94116" "94121" "94112" "94127" "94123" "94114" "94513" "94545"
[27] "94066" "94158" "95105" "94140" "94013" "94130" "92672" "94120" "94143" "94609" "94101" "94188" "94014"
[40] "04102" "94129"

# Let's merge the two data frames myb and inspections using business_id as a key.

mym = merge(myb[,c(1:2,6)],inspections,by="business_id")

head(mym)

business_id                               name postal_code Score       date
1          17               GEORGE'S COFFEE SHOP       94124    94 2013-06-05
2          17               GEORGE'S COFFEE SHOP       94124    NA 2013-07-11
3          19              NRGIZE LIFESTYLE CAFE       94109    96 2013-09-04
4          24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY       94104   100 2013-11-18
5          29                      CHICO'S PIZZA       94103    87 2013-10-01
6          37                        CAFE BISTRO       94118    NA 2013-01-30
                   type rating quarters
1 Routine - Unscheduled      G       Q2
2 Reinspection/Followup   <NA>       Q3
3 Routine - Unscheduled      G       Q3
4 Routine - Unscheduled      G       Q4
5 Routine - Unscheduled      A       Q4
6 Reinspection/Followup   <NA>       Q1

# Let's create an aggregation

hold <- aggregate(Score~postal_code,data=mym,mean)

hvec <- hold$Score

names(hvec) <- hold$postal_code

myCols <- colorRampPalette(c("blue", "red"))( length(hvec) )
mp <- barplot(rev(sort(hvec)),axisnames=F,col=myCols,main="Average Score per Zipcode")
axis(1,at=mp,labels=names(hvec),las=2,cex.axis=0.8)

Avg Score per Zip Code

Let’s see if the lowest scored restaurants are in the same zip codes as the highest scored restaurants. Here I pull out the zip codes attached to the 100 lowest and the 100 highest scoring inspection records. From the following it looks like there is significant overlap, so it’s not as if there is a big difference between the zip codes.

lo <- unique(mym[order(mym$Score),][1:100,]$postal_code)   # zip codes of the 100 lowest scores

lo
 [1] "94133" "94103" "94102" "94111" "94108" "94116" "94127" "94110" "94122"
[10] "94121" "94114" "94123" "94107" "94132" "94118" "94109"

hi <- unique(mym[order(-mym$Score),][1:100,]$postal_code)  # zip codes of the 100 highest scores
hi
 [1] "94104" "94103" "94111" "94122" "94117" "94108" "94115" "94132" "94105"
[10] "94110" "94123" "94118" "94107" "94112" "94133" "94131" "94121" "94102"
[19] "94109"

sum(lo %in% hi)
[1] 13

What’s for Dessert ?

What next ? This is really just the beginning. One approach that I didn’t take was to use the very cool sqldf package that would allow us to treat the data frames as tables within a relational database. However, this assumes some familiarity with SQL which, at least in my experience, newcomers to R don’t usually possess, although it is worth learning if you plan on a career in data mining. As a teaser here is how we could have done some things using sqldf. Check this out and see you next time.

library(sqldf)
businesses  <- read.csv(unz("SFFood.zip","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("SFFood.zip","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("SFFood.zip","inspections_plus.csv"),header=T,stringsAsFactors=F)

# How many restaurants are there ?

sqldf("select count(*) from businesses")
count(*)
1     6109

# What is the average score for all restaurants ?

sqldf("select avg(Score) from inspections")
  avg(Score)
1   91.95652

# What are the average scores per zip code ?
# Note that since this is the raw data we have some "bad" zipcodes

sqldf("select avg(Score),businesses.postal_code from inspections join businesses using(business_id) group by postal_code")

   avg(Score) postal_code
1    91.85986            
2    90.00000       00000
3    96.00000       04102
4    96.50000       92672
5    98.33333       94013
6    85.66667       94014
7   100.00000       94066
8    95.75000       94101
9    91.73356       94102
10   91.34171       94103
11   98.00000   941033148
12   95.65000       94104
13   94.11030       94105
14   94.72059       94107
15   90.39931       94108
16   90.48051       94109
17   91.54456       94110
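One more hedged sketch before we move on: the zip code cleanup we did earlier in R can also be pushed into the SQL itself, for example by keeping only five-character codes and dropping the “00000” placeholder. The exact query below is illustrative rather than the only way to write it.

sqldf("select avg(Score) as avg_score, postal_code
       from inspections join businesses using(business_id)
       where length(postal_code) = 5 and postal_code <> '00000'
       group by postal_code
       order by avg_score desc")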

Rolling Your Rs

In this article I discuss a general approach for Geocoding a location from within R, processing XML reports, and using R packages to create interactive maps. There are various ways to accomplish this, though using Google’s GeoCoding service is a good place to start. We’ll also talk a bit about the XML package that is a very useful tool for parsing reports returned from Google. XML is a powerful markup language that has wide support in many Internet databases so it is helpful. Lastly, we’ll use our knowledge to create a map of the tour dates on the Rolling Stones 1975 Tour of the Americas. Also, when I use the word “GeoCoding” this basically implies the process of taking a geographic location and turning it into a latitude / longitude pair.

What does Google Offer ?

Check out the main Geocoding page, which presents implementation details of the API as…


If you are a newcomer to R then you are probably quite busy learning the semantics of the language as you experiment with the apply family of commands or come up to speed on the grouping and conditioning capabilities offered by lattice graphics. And, along the way, you might have heard that R has the ability to “link in” code written in other languages such as C, C++, Fortran, and Java. This is true but until you are presented with a compelling use case you will most likely ignore such capability since you’ve already got plenty to do. But that’s where this blog can help. I’ll talk about how to integrate Fortran code with R and, in a later post, discuss doing the same with C and C++.

DISCLAIMER: To accomplish this work requires the presence of a working Fortran compiler such as the GNU suite of compilers (g77, gfortran, gcc, etc). Relative to operating systems I use Linux and OSX. I rarely use Windows though do know that the GNU suite is available for that OS so you should be able to link Fortran code into R there. However, I’ve never tried it. Additional Disclaimer: The timings I present in this post are based on an Apple MacBook @ OS level 10.9.2 with a 2.5 Ghz i5 processor and 4GB of RAM. Your timings may vary.

Why Bother ?

Good question. Most people wind up wanting to access Fortran from R for a few reasons. Perhaps they have some really fast and efficient Fortran code that they want to exploit within R. Or maybe they have written some code in R that winds up being incredibly slow so they write a much faster version in Fortran and then want to call it from R. Perhaps they need to access subroutines from external Fortran libraries. Lastly, it might simply be because your boss or faculty advisor is making you do it ! Whatever your reason(s) we’ll break the process of linking in Fortran code down into three general steps: 1) prepare the subroutine for compilation and generate a “shared object”, 2) load the shared object into an active R session, and 3) provide R variables to the shared object via the “.Fortran” call, which is part of the “Foreign Function Interface” within R.

Foreign                  package:base                  R Documentation

Foreign Function Interface

Description:

     Functions to make calls to compiled code that has been loaded into
     R.

Usage:

            .C(.NAME, ..., NAOK = FALSE, DUP = TRUE, PACKAGE, ENCODING)
      .Fortran(.NAME, ..., NAOK = FALSE, DUP = TRUE, PACKAGE, ENCODING)

Prepare the Fortran Subroutine

Next, I present a very simple Fortran 77 subroutine that computes the factorial of a number “n” and stashes the result into a variable called “answer”. And speaking of subroutines it is important to know that to use the .Fortran interface one must make reference to Fortran subroutines only – not Fortran functions or full on programs. So if you have some code that you want to bring in then you will need to embed that code within a subroutine definition. We also need to pay attention to the variable types as declared within the subroutine so we can match those types accordingly when calling the subroutine from R. This winds up being one of the more critical steps in the process.

        subroutine facto(n,answer)
c
c simple subroutine to compute factorial
c
        integer n, answer, i

        answer = 1
        do 100 i = 2,n
           answer = answer * i
  100   continue 
        
        end

You should make sure, of course, that this code does compile correctly. Our goal is to generate a shared object file ( a “.so” file) that can be linked in to R. Note also that our routine doesn’t print/write things to output. It simply uses its input variables and ultimately sets an output variable. Make your code lean.

$ ls
facto.f

$ gfortran -c facto.f

$ ls
facto.f	facto.o

$ gfortran -shared -o facto.so facto.o
$ ls
facto.f		facto.o		facto.so

So it looks like we are good to go here. However, instead of doing this compilation ourselves we could have let R help us. In fact that is the preferred way to do it since it ensures the compilation is done “under the supervision” of the R tools. Let’s remove the .o and .so files and start over.

$ rm *.*o

$ ls
facto.f

$ R CMD SHLIB facto.f
<you will see various compilation output messages>

$ ls
facto.f		facto.o		facto.so

Load it

So now what ? Let’s fire up R. We’ll use the dyn.load command which is part of the foreign interface capability. Its purpose is to load/unload shared objects, which are also known as DLLs (dynamically loadable libraries).

system("ls")
facto.f		facto.o		facto.so
dyn.load("facto.so")

Okay not much happened there. What’s going on ? Well all we did was simply load the shared object. We have yet to use it. To do that we rely upon the “.Fortran” function. Keep in mind that the subroutine “facto” has two arguments both of which are integers. We’ll supply a value of 5 for “n” and we’ll pass a single integer as a value for the “answer” variable though that will be overwritten once the subroutine computes the “answer”.

system("ls")
facto.f		facto.o		facto.so

dyn.load("facto.so")

.Fortran("facto",n=as.integer(5),answer=as.integer(1))

$n
[1] 5

$answer
[1] 120

# Or more directly

.Fortran("facto",n=as.integer(5),answer=as.integer(1))$answer
[1] 120

If you are wondering what types are supported or shared between R and Fortran here is a list from the .Fortran man page. It is worth some time to peruse the man page as it provides some caveats and deeper explanations on how to interact with Fortran.

       R          Fortran          
       integer    integer          
       numeric    double precision 
       - or -     real             
       complex    double complex   
       logical    integer          

Wrap it Up!

Okay that was cool but this is sort of an awkward way to call the Fortran subroutine. We should probably write our own wrapper function in R to do the loading of the shared object and the referencing to the .Fortran function. Take a look at this approach which winds up being very “R like”.

myfacto <- function(num) {
  dyn.load("facto.so")
  retvals <- .Fortran("facto",n = as.integer(num), answer = as.integer(1))
  return(retvals$answer)
}
         
myfacto(5)
[1] 120

sapply(1:10,myfacto)
 [1]       1       2       6      24     120     720    5040   40320  362880 3628800

So with the wrapper approach we can use the Fortran subroutine just as we would any other R function since the call to .Fortran is “buried” in the wrapper function. We could make this a bit more robust by putting in some logic to see if the shared object is already loaded.

myfacto <- function(num) {
  if (!is.loaded('facto')) {
     dyn.load("facto.so")
  }
  retvals <- .Fortran("facto",n = as.integer(num), answer = as.integer(1))
  return(retvals$answer)
}
         
myfacto(5)
[1] 120

It’s all Convoluted

Well that was okay but let’s look at a more involved example. Let’s consider the idea of doing discrete convolution between two vectors. (Note: This is discussed in the “Writing R Extensions” manual). Why did I pick such an example ? Well, first, it’s commonly referenced in R literature and, second, it is a good motivating case for using an external language to speed up the processing. The algorithm itself isn’t hard to code up either in R or Fortran. However, the performance in R isn’t so good once the vectors get larger. Check it out:

conr <- function(x, y) {
    lx <- length(x)
    ly <- length(y)
    cxy <- numeric(lx + ly - 1)
    for(i in 1:lx) {
        xi <- x[i]
        for(j in 1:ly) {
            ij <- i+j-1
            cxy[ij] <- cxy[ij] + xi * y[j]
        }
    }
    return(cxy)
}

# Let's check the timings for vectors of different sizes

v1 = rnorm(100); v2 = rnorm(100)

system.time(conr(v1,v2))
   user  system elapsed 
  0.034   0.000   0.035 

v1 = rnorm(2000); v2 = rnorm(2000)

system.time(conr(v1,v2))
   user  system elapsed 
 13.195   0.020  13.215 

v1 = rnorm(4000); v2 = rnorm(4000)

system.time(conr(v1,v2))
   user  system elapsed 
 57.757   0.130  58.008 

The timings grow significantly longer as the sizes of the vectors grow. So passing vectors of size 10,000 could take a very long time. While this blog isn’t specifically on performance let’s do a little bit more coding to get an idea of just how poorly the convolution written in R performs. We’ll use this for later comparison with the performance numbers resulting from the Fortran subroutine. Let’s write a wrapper function around the conr function. This will call conr with a variable x that represents the size of the vectors we wish to convolute. If you don’t understand exactly what is going on here don’t worry – just think of it as more exposure to the apply family of commands.

timewrapconr <- function(x) {
    times <- system.time(conr(rnorm(x),rnorm(x)))[3]
    return(c(size=x,times))
}

# time the convolution for vectors of size 100,1000,2000, and 4000

(convtimes <- sapply(c(100,1000,2000,4000),timewrapconr))
           [,1]     [,2]     [,3]    [,4]
size    100.000 1000.000 2000.000 4000.00
elapsed   0.033    3.552   13.811   62.15

# Let's plot this
library(lattice)

xyplot(convtimes[2,]~convtimes[1,],type=c("p","l","g"),
       xlab = "vector size", ylab = "elapsed time in seconds",
       main = "Execution times for Convolution in R", pch = 19)

Figure 1. Plot of execution times

How do we address this problem ? Well there are opportunities for improvement within the R code by using vectorization techniques. A good start would be to somehow avoid the second for loop and there are ways to do that. In fact there is a way to avoid both loops altogether and maybe we’ll explore such an approach in another post. But for now let’s see if writing the code in Fortran and then linking it in could help improve things. So here is the rewritten convolution algorithm, which we will save into a file called convolvef77.f

      subroutine convolvef77 (x, lx, y, ly, xy)
c
c A basic implementation of convolution algorithm for two vectors
c I use zero-based arrays here.
c
      integer lx, ly, i, j
      double precision x(0:lx-1), y(0:ly-1), xy(0:lx+ly-2)
      do 20 i = 0, (lx-1) 
         do 15 j = 0, (ly-1) 
            xy(i+j) = xy(i+j) + x(i) * y(j) 
  15     continue  
  20  continue 
      end

# Save the above to a file called convolvef77.f
# Now compile it to a shared library

$ R CMD SHLIB convolvef77.f 

Next we’ll write a function in R to call the convolvef77 function. Start up R.

convolvef77 <- function(x,y) {
  dyn.load('convolvef77.so')
  lx = length(x)
  ly = length(y)
  retdata <- .Fortran("convolvef77",
                       x = as.double(x),
                       lx = as.integer(lx), 
                       y = as.double(y), 
                       ly = as.integer(ly), 
                       xy = double(lx+ly-1))$xy
  return(retdata)
}

# Now let's throw some large vectors at it. Look at how much better the times are

v1 = rnorm(4000); v2 = rnorm(4000)

system.time(convolvef77(v1,v2))
   user  system elapsed 
  0.012   0.000   0.012 

v1 = rnorm(8000); v2 = rnorm(8000)

system.time(convolvef77(v1,v2))
   user  system elapsed 
  0.047   0.001   0.083 

So the speed looks really good. So now let’s repeat the timing exercise we applied to the convolutions done in R.

timewrapconf77 <- function(x) {
    times <- system.time(convolvef77(rnorm(x),rnorm(x)))[3]
    return(c(size=x,times))
}

(convtimes <- sapply(c(100,1000,2000,4000),timewrapconf77))
           [,1]  [,2]  [,3]    [,4]
size    100.000 1e+03 2e+03 4.0e+03
elapsed   0.045 2e-03 4e-03 1.3e-02

# Wow. This is FAST !!!!! Let's throw some bigger vectors at it.

(convtimes <- sapply(c(100,1000,2000,4000,10000,20000,50000),timewrapconf77))
         [,1]  [,2]  [,3]    [,4]    [,5]     [,6]      [,7]
size    1e+02 1e+03 2e+03 4.0e+03 1.0e+04 2.00e+04 50000.000
elapsed 1e-03 2e-03 4e-03 1.2e-02 7.2e-02 3.22e-01     2.074

# Plot the times

xyplot(convtimes[2,]~convtimes[1,],type=c("p","l","g"),
       xlab = "vector size", ylab = "elapsed time in seconds",
       main = "Execution times for Convolution in Fortran", pch = 19)

Execution times using Fortran

So using the Fortran subroutine took 2.0 seconds to convolute vectors of size 50,000 whereas using native R code to convolute a vector of size 1,000 took 3.5 seconds (these timings might vary depending on your architecture and OS). To get a better visual comparison let’s repeat the timings for both approaches, R and Fortran, and plot the results on the same graph so you can get some sense of proportion between the execution times. This isn’t hard to do. We’ll just rerun our timing functions:

(convtimesr <- sapply(c(100,1000,2000,4000,10000,20000),timewrapconr))

           [,1]     [,2]     [,3]     [,4]      [,5]      [,6]
size    100.000 1000.000 2000.000 4000.000 10000.000 20000.000
elapsed   0.034    3.374   14.118   64.894   355.409  1504.517
  
(convtimesf77 <- sapply(c(100,1000,2000,4000,10000,20000),timewrapconf77))
           [,1]    [,2]  [,3]    [,4]    [,5]     [,6]
size    100.000 1.0e+03 2e+03 4.0e+03 1.0e+04 2.00e+04
elapsed   0.071 2.3e-02 4e-03 1.2e-02 6.9e-02 2.99e-01

# Now plot them on the same graph

plot(convtimesr[1,],convtimesr[2,],xlab="Vector size",
     ylab="Elapsed time in seconds", 
     main="Convolution in R vs Fortran",
     type="l",col="red",lwd=2.5)

points(convtimesf77[1,],convtimesf77[2,],type="l",col="blue",lwd=2.5)

legend("topleft",c("R","Fortran"),col=c("red","blue"),
        lty=c(1,1),lwd=c(2.5,2.5))
grid()
Execution times for R and Fortran

Okay, I think you get the point here. Using the Fortran code definitely helped speed things up. However, speed might not be the only reason you choose to link in Fortran code. For example, I know of people who have written the bulk of their thesis analysis in Fortran and now seek to leverage that effort within R. Sure, they could recode their work in R, but that would probably come at a cost in performance. Any time you have a significant body of work in one language you would like to avoid having to recode it in another. Lastly, there are other ways to bring in Fortran that I haven’t discussed here. The “inline” package allows one to compile Fortran code inline within a given R program, which might be more appealing to some.
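To give a rough flavor of that route, here is a sketch modeled on the pattern in the inline package’s documentation as I recall it; treat the exact arguments as assumptions and consult ?cfunction before relying on it:

library(inline)

# Body of a Fortran subroutine (1-based indexing); cfunction() compiles it
# and builds an R wrapper that uses the .Fortran calling convention.
code <- "
      integer i, j
      do 20 i = 1, lx(1)
         do 15 j = 1, ly(1)
            xy(i+j-1) = xy(i+j-1) + x(i) * y(j)
   15    continue
   20 continue
"
convolveinline <- cfunction(signature(x = "numeric", lx = "integer",
                                      y = "numeric", ly = "integer",
                                      xy = "numeric"),
                            code, convention = ".Fortran")

v1 <- rnorm(4000); v2 <- rnorm(4000)
res <- convolveinline(v1, length(v1), v2, length(v2),
                      double(length(v1) + length(v2) - 1))$xy

Hope this has been helpful.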

Conditioning and grouping are two important concepts in graphing that allow us to rapidly refine our understanding of data under consideration. Conditioning, in particular, allows us to view relationships across “panels” with common scales. Each panel contains a plot whose data is “conditional” upon records drawn from the category that supports that particular panel (an example will help make this clear). Grouping allows us to view interactions within a given panel by making distinctions using different plot characters, size, or color. These ideas have been discussed in Cleveland’s book “Visualizing Data”. The Lattice, (aka “Trellis”), graphics package attempts to implement the concepts and values indicated in Cleveland’s book. These ideas can help us find interesting patterns in our data.

However, it is important to point out that all of the graphics packages available within R, (Base, Lattice, ggplot2, Grid), have their respective strengths, and good work can be accomplished using any of them, although for the newcomer Base graphics is a reasonable starting point. There is an abundance of information on Base graphics both in printed literature and via Google results so ample support is available. As a newcomer to Base graphics it doesn’t take long before you are creating graphs like that in Figure 1, which is an approximation of a conditioned plot. Strictly speaking it does not adhere to the principles of conditioned plots but it’s useful to see how close we can get. We have a three “panel” plot of MPG vs Weight from the mtcars data frame, which is built into R. Each panel contains a scatter plot of MPG vs Wt for one cylinder group, (4, 6, or 8). The paneling allows us to conveniently compare the MPG vs Wt relationship across groups.

par(mfrow=c(1,3))

cyls <- split(mtcars, mtcars$cyl)
for (ii in 1:length(cyls)) {
     tmpdf <- cyls[[ii]]
     sname <- names(cyls)[ii]
     plot(tmpdf$wt, tmpdf$mpg, 
          main = paste("MPG vs Wt",sname,"Cyl"),
          ylim = c(0,40), xlab = "Wt / 1,000",
          ylab = "MPG", pch=19, col="blue")
     grid()
}
Figure 1

We see that the higher weights correspond to records with a lower MPG figure. This idea is confirmed by the separate panel plots wherein we see lower MPG for increasing cylinder count, which is quite likely correlated with the weight of the automobile. (It is in fact, with a correlation coefficient of about 0.78; a quick check appears below.) In any case the data in mtcars isn’t terribly difficult to understand, (though that’s not really the point here – we are simply trying to come to terms with how to plot data). We split up the data frame, looped through the resulting list, and plotted MPG vs Wt for each subgroup. It is not a bad approach and is one that lends itself well to programming.
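That correlation claim is easy to verify directly:

# Correlation between cylinder count and car weight in mtcars
cor(mtcars$cyl, mtcars$wt)    # roughly 0.78

On the other hand, let’s check out how this kind of thing might be done using lattice graphics, which was designed in large part to make conditioned plots really easy. Let’s put it to the test.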

library(lattice)
xyplot(mpg~wt | factor(cyl), data=mtcars, pch=19,
                main="MPG vs Wt", xlab="Wt/1,000",  ylab="MPG",layout=c(3,1),type=c("p","g"))
Figure 2

What’s Your Condition ?

This isn’t bad for a one-line statement (two if you count the library call). Let’s break down what is happening in Figure 2. The “mpg~wt” part of the call to xyplot should be familiar as a formula, where mpg represents the y-axis and wt represents the x-axis. The formula interface is commonly used in other R procedures such as lm as well as aggregation functions, so if this is new to you then it is important to learn about it (a couple of quick examples follow).
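For instance, the very same notation drives model fitting and aggregation; nothing about it is specific to lattice:

# The formula interface outside of lattice
lm(mpg ~ wt, data = mtcars)                # regress MPG on weight
aggregate(mpg ~ cyl, data = mtcars, mean)  # mean MPG for each cylinder group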

The part of the call to xyplot that newcomers might not understand is the vertical bar character. The presence of this character indicates that what follows is a “conditioning” variable, which is typically a category/factor. It also lets us know that the resulting plot will contain a number of panels, one for each unique value assumed by the conditioning variable(s). In the case of mtcars$cyl this will be three, since the unique values assumed by cylinder are 4, 6, and 8. However, we really don’t have to know the number of values in advance. We can just plot it and view the defaults, after which we can go in and add the “layout” argument to refine the visual presentation.

unique(mtcars$cyl)
[1] 6 4 8

But it gets better in that we can use more than one conditioning variable at a time. We can look at a combination of factors, in effect looking at the MPG vs Wt relationship as “cells” of a table. Sound complicated ? Not really. Check this out. We are asking xyplot to present MPG vs Wt for each combination of Transmission and Cylinder type. This results in a 2 x 3 grid structure wherein each panel contains a scatterplot.

xyplot(mpg~wt | factor(cyl) + factor(am,labels=c("A","M")),
                data=mtcars, main="MPG vs Wt", 
                xlab="Wt/1,000", ylab="MPG",pch=19,type=c("p","g"))
Figure 3

Notice how lattice takes care of the “paneling” for us and supplies labels based on the given factor combinations. So we have a row for each Transmission type, (automatic or manual), each of which contains three columns/panels corresponding to the cylinder groups 4, 6, and 8. This plot makes it trivial to determine which combination gets the best gas mileage, assuming that is what we are looking for. The 4 Cylinder cars with Manual transmissions get the best gas mileage. Note that we could also further influence the layout and look of the graph and make adjustments to the “strip” labels in each panel (a small taste of that appears below), but since this is an introductory article we’ll otherwise focus on general capabilities.
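Just to give a flavor of strip adjustments, here is a sketch using lattice’s strip.custom(); the relabeled strips are my own invention and not something we will need again:

# Relabel and shade the panel strips (purely cosmetic)
xyplot(mpg ~ wt | factor(cyl), data = mtcars, layout = c(3,1),
       strip = strip.custom(factor.levels = c("4 cylinders", "6 cylinders", "8 cylinders"),
                            bg = "lightgrey"),
       xlab = "Wt/1,000", ylab = "MPG", pch = 19, type = c("p","g"))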

There is Safety in Groups !

Are there other ways that we can enhance the graph to reveal more meaning ? In fact there are. In particular we could employ the idea of “grouping”. However, before applying grouping to the above example, (Figure 3), let’s look at a stand-alone case to make sure you understand the concept. Suppose we want to look at the MPG vs Wt graph within a single panel, though we wish to see points colored according to the Transmission group they occupy. This is easy to accomplish with lattice by using the “groups” argument.

xyplot(mpg~wt,data=mtcars,groups=factor(am,labels=c("A","M")),
       pch=20,auto.key=list(columns=2),type=c("p","g"))
Figure 4

So that was pretty easy wasn’t it ? Instead of viewing the interaction across panels we can see it within a single panel by using different colors to partition the data into groups. Now let’s say that we want the points to reflect the cylinder group they occupy, except that instead of using color we’ll use point size to indicate the relationship. Here we use a vector from 3 down to 1 in conjunction with the cex argument to set the size of the points coming from, respectively, cylinder groups 4, 6, and 8. Points coming from cylinder group 4 are going to be larger than points coming from groups 6 and 8. The MPG is better with 4 cylinder autos and we want that reflected in terms of larger size.
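Something along these lines produces a plot like Figure 5; this is a sketch that relies on lattice recycling the cex vector across the three cylinder groups:

# Size the points by cylinder group: cex = c(3,2,1) maps to groups 4, 6 and 8
xyplot(mpg ~ wt, data = mtcars, groups = factor(cyl),
       cex = c(3, 2, 1), pch = 20, type = c("p","g"),
       auto.key = list(columns = 3))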

Figure 5

Now if we turn our attention back to the situation wherein we used two factors to condition the MPG vs Wt plots we should be able to combine grouping with conditioning to see if there is anything lurking in the data that we might have missed previously. So what I’ll do here is plot MPG vs Wt as conditioned by Cylinders while grouping by Transmission.

xyplot(mpg ~ wt | factor(cyl), data=mtcars, pch=16,layout=c(3,1),
                  groups = factor(am,labels=c("Auto","Manual")),
                  type=c("p","g"),auto.key = TRUE)
Figure 6

Do we see anything interesting here ? Perhaps. If you have a 4 cylinder car and a manual transmission then you get the best gas mileage. This isn’t anything new. However, if we look at the 8 cylinder panel we see that the cars with a manual transmission, (there are only two of them), don’t seem to perform as well on MPG as the manuals did in the 4 cylinder group. In the 6 cylinder panel we see that auto and manual transmissions are closer in MPG than in the 4 cylinder group. So we might form the hypothesis that having a manual transmission is related to significantly better gas mileage only if you have a 4 cylinder car. Or maybe we just need to find more 8 and 6 cylinder cars with manual transmissions to see if the pattern holds. It is important to note that these cars were surveyed in 1974, so our findings would probably change were we to use updated information. Well, this is all cool stuff, but for now we’ll stick with the graphics details rather than further analysis.

Out of curiosity, what would this example look like in Base graphics ? Here is one possibility. I’ll let you decide whether this is easier than using lattice and whether the result is as aesthetically pleasing. Base graphics is amazingly flexible but requires more of a programming approach in this case than lattice does, although lattice panel functions, (something I will explore in another Blog post), can also require a “programmer’s mentality”.

par(mfrow=c(1,3)) 
mysplits = split(mtcars,mtcars$cyl)
maxmpg = max(mtcars$mpg)
for (ii in 1:length(mysplits)) {
      tmpdf  <- mysplits[[ii]]
      auto <- tmpdf[tmpdf$am == 0,]
      man <- tmpdf[tmpdf$am == 1,]
      plot(tmpdf$wt, tmpdf$mpg,type="n",
           main=paste(names(mysplits[ii])," Cylinders"),
           ylim=c(0,maxmpg), xlab="wt",ylab="MPG")
      points(auto$wt,auto$mpg,col="blue",pch=19)
      points(man$wt,man$mpg,col="pink",pch=19)
      grid()
      legend("topright", inset=0.05, c("manual","auto"),
             pch = 19, col=c("pink","blue"))
}
Figure 7

Well, I haven’t been completely forthright here. Base graphics does in fact have a function dedicated to conditioned plotting, coplot(), although it doesn’t generalize to other plot types. It does, however, handle X/Y plots quite well. I include it here because it conforms to the principles of conditioned plotting, but if you are going to embrace the formal approach to conditioning then you might as well use lattice, since it generalizes across various plot types.

coplot(mpg~wt|factor(cyl)+factor(am),data=mtcars)
Figure 8: coplot

Other Lattice Functions

Okay, you might be wondering if the arguments to xyplot will “carry over” to other lattice plotting functions. The answer is “yes”, although they must make sense for the given chart type. For example, with histograms or boxplots we are plotting a single continuous quantity, but we can still “condition” that quantity using factors from the dataset. In this next example we look at another of R’s built-in datasets called “ChickWeight”. We’ll present a histogram of the weights of 50 chickens, measured over time, conditioned by the type of diet they were provided. There are 4 diet types, hence we will see 4 panels. All of the considerations important to lattice graphics are intact – common axes, optimal use of space, shared margins, and panels to indicate conditions.

histogram(~weight|factor(Diet),data=ChickWeight,
           main="Weight by Diet Type",xlab="Weight in grams")
Figure 9: lattice histogram

So check out this example. We create a boxplot of chicken weights conditioned on the time since birth, which is measured in days (0, 2, …, 21).

bwplot(~weight|factor(Time),data=ChickWeight,col="blue",
        main="Weight by Days Since Birth",xlab="Weight in grams")
Figure 10: lattice boxplot

Conclusions

So we’ve really only scratched the surface of what can be done with conditioning and grouping, although I feel this is a reasonable introduction. As I’ve mentioned previously, Base graphics has lots of flexibility, and if you become fluent with the high and low level functions offered as part of that environment you can accomplish a lot. Admittedly, if you do not have a programming background then writing looping structures to process data can be a bit intimidating, but it’s worth it to learn. The lattice examples I presented show how convenient it is to do conditioning and grouping, but then again, if you want more customization that will involve learning a bit more about lattice. So at some point you find yourself in the same position you would be in with Base graphics. It all depends on how detailed and custom you wish your plots to be. The ggplot2 package, which I will discuss in another Blog post, can also handle these activities quite nicely, though the jargon is a bit different (a one-line taste appears below).
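Just as a teaser, here is a sketch of roughly what the conditioned-and-grouped plot from Figure 6 looks like in ggplot2 terms (assuming ggplot2 is installed; the details will wait for that other post):

library(ggplot2)

# MPG vs Wt, paneled by cylinders and colored by transmission
ggplot(mtcars, aes(x = wt, y = mpg,
                   colour = factor(am, labels = c("Auto", "Manual")))) +
  geom_point() +
  facet_wrap(~ cyl, nrow = 1) +
  labs(x = "Wt/1,000", y = "MPG", colour = "Transmission")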

Regardless of which graphics package you choose to employ, it is important to realize that flexibility does not always imply ease of use. As with any software, there is absolutely no substitute for experience and getting your hands dirty, and R is no exception.