
HexBin Chart of Zika Cases in Brazil


I really thought I was done with the Express dplyr series, but on completion of the second part I received many messages requesting more examples of using dplyr with ggplot, along with other data sets such as the Zika virus data, which can be downloaded from GitHub. These examples are not drastically different from previous ones, although they let me show how each data set presents its own challenges and how transformations of the data will be necessary to visualize the information. The Zika data is spread across a number of countries but we will focus solely on cases in Brazil. Download the data using a Git client or straight from the site itself. If you are not using Git then you are missing out (see my YouTube playlist for a basic introduction). It’s a great way to manage your R code and larger projects. Also keep in mind that the RStudio IDE has built-in Git capabilities, so as you create software it is extremely easy to “push” changes to your reference code and equally easy for users of your code to “pull” changes down.

Processing the Data

Brazil Zika Data

Okay, we have seven .CSV files, each of which represents Zika information for a given date. (Note that more files may have been added since I wrote this post.) The format of the data isn’t particularly complicated, although it isn’t standard in the sense that not every row in a file represents the same type of information. In general you never really know what the data contains (or not) until you start working with it. There are a number of ways to approach this scenario. Ideally, if there is a code book that explains the format, start there, although it turns out that this data is simple enough to figure out. We could read in the data for a specific date, e.g. 2016-04-02.csv, but my inclination is to read in all the files and bind them together into a single data frame so I have everything I need in one spot. This isn’t strictly necessary, but since the data itself is not very large it is a reasonable approach.

library(dplyr)

# You will need to specify your own path to reflect your download location

path <- "/Users/fender/Downloads/zika/Brazil/Epidemiological_Bulletin/data"
setwd(path)

temp_files <- list.files(pattern=".csv")
temp_files
[1] "Epidemiological_Bulletin-2016-04-02.csv" "Epidemiological_Bulletin-2016-04-23.csv"
[3] "Epidemiological_Bulletin-2016-04-30.csv" "Epidemiological_Bulletin-2016-05-07.csv"
[5] "Epidemiological_Bulletin-2016-05-14.csv" "Epidemiological_Bulletin-2016-05-21.csv"
[7] "Epidemiological_Bulletin-2016-05-28.csv"

# Next we read all seven files; the contents of each file will go into a
# list element

myfiles <- lapply(temp_files,read.csv,stringsAsFactors=FALSE) 

str(myfiles,1)
List of 7
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:
 $ :'data.frame':    33 obs. of  9 variables:

The result of the list.files() function is a character vector containing the names of all the .CSV files in the current working folder. We then read all of these files in using the lapply() function. We could have used a for loop to do the same thing, although lapply() is the more “R-like” way to do it. If you don’t yet understand lapply() or need a review of what it does, please see my posting which explains it in considerable detail. It is a very useful function that shows up in many places in R, so you definitely want to get to know how it works. Each element of the resulting list contains a data frame holding the contents of one of the seven .CSV files. Let’s inspect the first two lines of the first two list elements to get an idea about the Zika data.
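For completeness, here is roughly what the for loop version would look like. It is just a sketch that produces the same list as the lapply() call above.

# for loop equivalent of the lapply() call (less idiomatic, but it works)
myfiles <- vector("list", length(temp_files))
for (ii in seq_along(temp_files)) {
  myfiles[[ii]] <- read.csv(temp_files[ii], stringsAsFactors = FALSE)
}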

# First let's get the column names for the data frames

lapply(myfiles,names)[1]
[[1]]
[1] "report_date"      "location"         "location_type"    "data_field"      
[5] "data_field_code"  "time_period"      "time_period_type" "value"           
[9] "unit"      

# Now look at the first two lines of the first two list elements
lapply(myfiles,head,2)[1:2]
[[1]]
  report_date        location location_type    data_field data_field_code time_period
1  2016-04-02           Norte        region zika_reported          BR0011          NA
2  2016-04-02 Brazil-Rondonia         state zika_reported          BR0011          NA
  time_period_type value  unit
1               NA  6295 cases
2               NA   618 cases

[[2]]
  report_date    location location_type    data_field data_field_code time_period
1  2016-04-23       Norte        region zika_reported          BR0011          NA
2  2016-04-23 Brazil-Acre         state zika_reported          BR0011          NA
  time_period_type value  unit
1               NA  8545 cases
2               NA   716 cases

So we could work with each file/data frame individually, although I want to create one large data structure. The do.call() function in R serves a number of purposes, but it is frequently used to “bind” together individual list elements into a single data structure, so it is perfect for this job.

# The following will take all the individual list elements each of which are
# individual data frames and bind them into a single data frame

brazil <- do.call(rbind,myfiles)

# Let's turn the report_date variable into a "real" date

brazil <- brazil %>% mutate(report_date = as.Date(report_date))

glimpse(brazil)
Observations: 231
Variables: 9
$ report_date      <date> 2016-04-02, 2016-04-02, 2016-04-02, 2016-04-02, 2016-04-02,...
$ location         <chr> "Norte", "Brazil-Rondonia", "Brazil-Acre", "Brazil-Amazonas"...
$ location_type    <chr> "region", "state", "state", "state", "state", "state", "stat...
$ data_field       <chr> "zika_reported", "zika_reported", "zika_reported", "zika_rep...
$ data_field_code  <chr> "BR0011", "BR0011", "BR0011", "BR0011", "BR0011", "BR0011", ...
$ time_period      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ time_period_type <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
$ value            <int> 6295, 618, 375, 1520, 44, 771, 74, 2893, 30286, 1202, 7, 156...
$ unit             <chr> "cases", "cases", "cases", "cases", "cases", "cases", "cases...
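As an aside, dplyr’s bind_rows() function accepts a list of data frames directly, so the do.call(rbind, ...) step above could also be written as the following minimal equivalent.

# dplyr alternative to do.call(rbind, myfiles)
brazil <- bind_rows(myfiles)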

The glimpse() function shows that we have some basic variables such as report_date, location, and value, the last of which appears to represent the number of cases for a given location. The location_type variable tells us whether the location is a state or a region, with a region being a container of possibly many states. The way I understand the information to be organized is that on a certain date a .CSV file was uploaded containing specific locations within Brazil and the number of reported Zika cases at that time. For each subsequent date (file) the same thing happened, with the new case numbers adjusted relative to the previously reported numbers. We now have all of that in a single data frame, so let’s work with it. We will remove columns 6 and 7 since all they contain are missing values, as confirmed by the glimpse() output above.

# Turn the data frame into a dplyr data table and get rid of
# columns 6 and 7

brazil <- brazil %>% select(-(6:7)) 

# Let's look at the first twenty rows now that columns 6 and 7
# (all missing values) are gone

brazil %>% slice(1:20)

# A tibble: 20 x 7
   report_date                   location location_type    data_field data_field_code value  unit
        <date>                      <chr>         <chr>         <chr>           <chr> <int> <chr>
1   2016-04-02                      Norte        region zika_reported          BR0011  6295 cases
2   2016-04-02            Brazil-Rondonia         state zika_reported          BR0011   618 cases
3   2016-04-02                Brazil-Acre         state zika_reported          BR0011   375 cases
4   2016-04-02            Brazil-Amazonas         state zika_reported          BR0011  1520 cases
5   2016-04-02             Brazil-Roraima         state zika_reported          BR0011    44 cases
6   2016-04-02                Brazil-Para         state zika_reported          BR0011   771 cases
7   2016-04-02               Brazil-Amapa         state zika_reported          BR0011    74 cases
8   2016-04-02           Brazil-Tocantins         state zika_reported          BR0011  2893 cases
9   2016-04-02                   Nordeste        region zika_reported          BR0011 30286 cases
10  2016-04-02            Brazil-Maranhao         state zika_reported          BR0011  1202 cases
11  2016-04-02               Brazil-Piaui         state zika_reported          BR0011     7 cases
12  2016-04-02               Brazil-Ceara         state zika_reported          BR0011   156 cases
13  2016-04-02 Brazil-Rio_Grande_do_Norte         state zika_reported          BR0011   640 cases
14  2016-04-02             Brazil-Paraiba         state zika_reported          BR0011  1060 cases
15  2016-04-02          Brazil-Pernambuco         state zika_reported          BR0011   333 cases
16  2016-04-02             Brazil-Alagoas         state zika_reported          BR0011  1479 cases
17  2016-04-02             Brazil-Sergipe         state zika_reported          BR0011   348 cases
18  2016-04-02               Brazil-Bahia         state zika_reported          BR0011 25061 cases
19  2016-04-02                    Sudeste        region zika_reported          BR0011 35505 cases
20  2016-04-02        Brazil-Minas_Gerais         state zika_reported          BR0011  6693 cases 

Understanding the Data

This data frame, though simple, is a little confusing. For each region a cumulative number of cases is presented, which is the sum across the constituent states that appear in the rows immediately below that record. For example, the first row of the data frame summarizes reported cases for the “Norte” region on 2016-04-02, and rows 2-8 are the constituent states with their respective case totals. So the sum of the cases in rows 2-8 equals the case number reported in row 1. There are further entries for the “Norte” region at later reporting dates in the data table. If you look at row 19 you see the same thing happening for the “Sudeste” region. For someone wanting summary information this is fine, but it makes the resulting data frame non-standard. As an analyst you would probably want the region as a factor, that is, another column in the data frame. Totals can easily be regenerated from that format so we lose nothing. But even with the current format we can still work with the data. For example, let’s pull out the totals for all regions and plot them.
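As a quick sanity check of that claim (assuming the row layout shown above), we can compare the “Norte” region total in row 1 with the sum of its constituent state rows 2 through 8.

# Sum of the state rows (2-8) for the Norte region on 2016-04-02
brazil %>% slice(2:8) %>% summarize(state_total = sum(value))

# Should match the region total reported in row 1
brazil %>% slice(1) %>% select(value)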

# For each report_date we have each of the 5 region case numbers.
# With 7 dates/files that gives 35 rows in this table

brazil %>% filter(location_type=="region")
# A tibble: 35 x 7
   report_date     location location_type    data_field data_field_code value  unit
        <date>        <chr>         <chr>         <chr>           <chr> <int> <chr>
1   2016-04-02        Norte        region zika_reported          BR0011  6295 cases
2   2016-04-02     Nordeste        region zika_reported          BR0011 30286 cases
3   2016-04-02      Sudeste        region zika_reported          BR0011 35505 cases
4   2016-04-02          Sul        region zika_reported          BR0011  1797 cases
5   2016-04-02 Centro-Oeste        region zika_reported          BR0011 17504 cases
6   2016-04-23        Norte        region zika_reported          BR0011  8545 cases
7   2016-04-23     Nordeste        region zika_reported          BR0011 43000 cases
8   2016-04-23      Sudeste        region zika_reported          BR0011 46318 cases
9   2016-04-23          Sul        region zika_reported          BR0011  2197 cases
10  2016-04-23 Centro-Oeste        region zika_reported          BR0011 20101 cases
# ... with 25 more rows

# Let's plot it (we need ggplot2 for the plotting functions)
library(ggplot2)

brazil %>% filter(location_type=="region") %>% 
           ggplot(aes(x=report_date, y=value,
                        group=location, color=location)) + 
           geom_line() +  
           geom_point() +
           ggtitle("Brazil Zika Cases by Region")


Zika Line Chart

Looks like the Southeast and Northeast have the most reported cases by far and the largest increase in reported cases over the available date range. This doesn’t necessarily mean that the actual number of cases in those regions changed that much over the period. It is possible that those cases were there all along and the collecting and reporting mechanisms simply caught up with the actual number of cases. (Note I have no idea if this is/was true or not.) Let’s do a bar chart of this information just for fun (and also to demonstrate an eccentricity of ggplot).

# Let's just pull out the regions into a table of its own

region <- brazil %>% filter(location_type=="region")

region %>% 
  ggplot(aes(x=location,y=value)) + geom_bar(stat="identity") +
  ylab("Number of Reported Cases") + xlab("Region") + 
  ggtitle("Reported Brazil Zika Cases 04/02/16 - 2016-05-28")

Zika Bar Chart

Dealing with Factors

You saw something like this in the previous posting on dplyr. It’s a bar plot and it’s a bit unsatisfying since the bars are not in order by number of cases. This data is easy to understand so it’s not a big deal, although I don’t like it. I want the bars ordered from most cases to least. To accomplish this requires creating a factor out of the location variable based on the total number of reported cases. First, here is the solution:

region %>% 
  slice(1:length(unique(region$location))) %>% 
  arrange(desc(value)) %>%
  mutate(location=factor(location,levels=location,ordered=TRUE)) %>%
  ggplot(aes(x=location,y=value)) + geom_bar(stat="identity") +
  ylab("Number of Reported Cases") + xlab("Region") + 
  ggtitle("Reported Brazil Zika Cases 04/02/16 - 2016-05-28")

Sorted Zika Cases

So how does this work ? Well check this out:


# This gets us the unique locations

region %>% 
    slice(1:length(unique(region$location)))
  report_date     location location_type    data_field data_field_code value  unit
1  2016-04-02        Norte        region zika_reported          BR0011  6295 cases
2  2016-04-02     Nordeste        region zika_reported          BR0011 30286 cases
3  2016-04-02      Sudeste        region zika_reported          BR0011 35505 cases
4  2016-04-02          Sul        region zika_reported          BR0011  1797 cases
5  2016-04-02 Centro-Oeste        region zika_reported          BR0011 17504 cases

# This arranges the unique locations by number of reported cases (descending)

region %>% 
     slice(1:length(unique(region$location))) %>% 
     arrange(desc(value))
  report_date     location location_type    data_field data_field_code value  unit
1  2016-04-02      Sudeste        region zika_reported          BR0011 35505 cases
2  2016-04-02     Nordeste        region zika_reported          BR0011 30286 cases
3  2016-04-02 Centro-Oeste        region zika_reported          BR0011 17504 cases
4  2016-04-02        Norte        region zika_reported          BR0011  6295 cases
5  2016-04-02          Sul        region zika_reported          BR0011  1797 cases

# So now we can create a factor out of the location variable using this order
# Notice how location is now an ordered factor

region %>% 
     slice(1:length(unique(region$location))) %>% 
     arrange(desc(value)) %>%
     mutate(location=factor(location,levels=location,ordered=TRUE)) %>% glimpse()
Observations: 5
Variables: 7
$ report_date     <date> 2016-04-02, 2016-04-02, 2016-04-02, 2016-04-02, 2016-04-02
$ location        <ord> Sudeste, Nordeste, Centro-Oeste, Norte, Sul
$ location_type   <chr> "region", "region", "region", "region", "region"
$ data_field      <chr> "zika_reported", "zika_reported", "zika_reported", "zika_re...
$ data_field_code <chr> "BR0011", "BR0011", "BR0011", "BR0011", "BR0011"
$ value           <int> 35505, 30286, 17504, 6295, 1797
$ unit            <chr> "cases", "cases", "cases", "cases", "cases"

Standardizing the Data Format

We could continue to work with the data this way, with summaries for regions and the country sitting alongside entries for states. But this mixture “violates” the standards of a “tidy” data frame. What I propose is to remove the summary info for the regions and country and make a factor column that gives us the region for each state. This is great because 1) all rows then represent the same type of information and 2) we can easily recompute the totals for each region. There are a number of ways to do this. I’m just going with a quick method here using a loop (a vectorized alternative is sketched after the verification below).


# Here we will pull out these totals so we can later check our work
brazil_totals <- brazil %>% filter(location=="Brazil") 
region_totals <- brazil %>% filter(location_type=="region") %>%
  group_by(report_date,location) %>%  
  summarize(tot=sum(value)) 
# Here we standardize the data frame and remove all summary rows because after all we can easily
# (re)compute any summary totals that we want

regvec <- vector()  # A vector to hold the region for a given state
length(regvec) <- nrow(brazil)
# Walk the rows top to bottom: region rows set the current region label,
# state rows inherit the most recently seen label (this assumes each region
# row precedes its constituent state rows, as in this data)
for (ii in 1:nrow(brazil)) {
  if (brazil[ii,]$location_type != "region")  {
       regvec[ii] <- newlab
  } else {
       newlab <- brazil[ii,]$location
       regvec[ii] <- newlab
  }
}

# Bind the region vector to the brazil data frame

statedf <- cbind(brazil,regvec)

# Next we eliminate the summary rows for the country and regions

statedf <- statedf %>% filter(location != "Brazil") 
statedf <- statedf %>% filter(location_type != "region") 

# Let's generate the region totals from the newly transformed data to make
# sure it matches what we had before. 
statedf %>% group_by(report_date,regvec) %>% 
            summarize(tot=sum(value)) -> totals

# Do a spot check on the totals - compare this to the original data frame 
# totals. There are differences in the column names but other than that
# it's the same

all.equal(totals,region_totals)
[1] "Cols in y but not x: location. Cols in x but not y: regvec. "

# Let's take a sampling from this new data frame. Note how there is now
# a factor column that shows us what region a given state location belongs to

statedf %>% select(report_date,location,value,regvec) %>% sample_n(10)
    report_date                   location value   regvec
22   2016-04-02      Brazil-Santa_Catarina    62      Sul
174  2016-05-28             Brazil-Paraiba  2865 Nordeste
173  2016-05-28 Brazil-Rio_Grande_do_Norte  2312 Nordeste
140  2016-05-21                Brazil-Para  1583    Norte
48   2016-04-23              Brazil-Parana  1847      Sul
163  2016-05-28            Brazil-Rondonia  1032    Norte
101  2016-05-07           Brazil-Sao_Paulo  3452  Sudeste
136  2016-05-21            Brazil-Rondonia   974    Norte
67   2016-04-30          Brazil-Pernambuco   450 Nordeste
92   2016-05-07 Brazil-Rio_Grande_do_Norte  1757 Nordeste
# The following will give the same line plot as above just to show you that we
# can easily regenerate the totals with no problem.

statedf %>% group_by(report_date,regvec) %>% summarize(cases=sum(value)) %>% 
   ggplot(aes(x=report_date,y=cases,group=regvec,color=regvec)) + 
   geom_line() + 
   geom_point() + 
   ggtitle("Brazil Zika Cases by Region")

Drawing a Cluster Map

Okay, what next? Well, how about generating a Leaflet cluster map where we see the number of cases reflected in a given region? If we zoom in we can then see the sub-clusters that make up the larger cluster. This is a cool feature of Leaflet. But to do this we will need to geocode the locations given in the data frame. This is easy since the ggmap package has geocoding capability built into it courtesy of the geocode() function, which uses Google to do the work. See how easy this is:

library(ggmap)
longlat <- geocode(unique(statedf$location)) %>% 
  mutate(loc=unique(statedf$location)) 

# Let's join this geocoded information to the statedf and save it into
# another data frame for mapping

statedf %>% filter(as.character(report_date)=="2016-05-28") %>% 
  group_by(location) %>% summarize(cases=sum(value)) %>% 
  inner_join(longlat,by=c("location"="loc")) %>% 
  mutate(LatLon=paste(lat,lon,sep=":")) -> formapping

head(formapping) 
# A tibble: 6 x 5
         location cases       lon       lat              LatLon
            <chr> <int>     <dbl>     <dbl>               <chr>
1     Brazil-Acre   843 -70.00000  -9.00000              -9:-70
2  Brazil-Alagoas  3577 -36.00000  -9.00000              -9:-36
3    Brazil-Amapa   189 -52.00000   1.00000               1:-52
4 Brazil-Amazonas  2176 -63.00000  -5.00000              -5:-63
5    Brazil-Bahia 45419 -38.51083 -12.97111 -12.97111:-38.51083
6    Brazil-Ceara  2144 -38.54306  -3.71722  -3.71722:-38.54306

Cluster Map of Brazil Zika Cases


Wait. This doesn’t look right because we know we have many more cases in many more regions. What happened? Well, the way the cluster mapping option works is that it looks at each row with its corresponding latitude and longitude and clusters together nearby rows. The data we mapped lists each state only once along with its location, so no wonder we get this limited output. The cases column tells us how many cases occurred at each location, so we need a way to generate a data frame with that many rows per location for the Leaflet cluster map to look reasonable. There are a number of ways to do this. Here we will take each row of the data frame and replicate it as many times as the number of cases it reports.

num_of_times_to_repeat <- formapping$cases
long_formapping <- formapping[rep(seq_len(nrow(formapping)),
                                  num_of_times_to_repeat),]

head(long_formapping)
# A tibble: 6 x 5
     location cases   lon   lat LatLon
        <chr> <int> <dbl> <dbl>  <chr>
1 Brazil-Acre   843   -70    -9 -9:-70
2 Brazil-Acre   843   -70    -9 -9:-70
3 Brazil-Acre   843   -70    -9 -9:-70
4 Brazil-Acre   843   -70    -9 -9:-70
5 Brazil-Acre   843   -70    -9 -9:-70
6 Brazil-Acre   843   -70    -9 -9:-70

# ah so this is the format we need
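As an aside, more recent versions of tidyr provide uncount(), which replicates each row according to a weights column and would build this long format directly. A hedged sketch, assuming a recent tidyr is installed:

library(tidyr)

# Repeat each row 'cases' times; .remove = FALSE keeps the cases column
long_formapping <- formapping %>% uncount(cases, .remove = FALSE)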

library(leaflet)

leaflet(long_formapping) %>% addTiles() %>% 
  addMarkers(clusterOptions=markerClusterOptions())

Accurate Cluster Map

So this still isn’t ideal, because if we click on one of the clusters to see the constituent sub-clusters the performance can be slow, especially in regions where there are lots of cases (like the Southeast and Northeast). Why don’t we make a hexbin chart instead, which might be more useful for representing the distribution of reported Zika cases in Brazil? Hexbin charts are an extension of the heat map concept, where shapes are used to capture density or counts in a region; the hexagonal shape is closer to a circle. The hex shapes on the map will be colored by the count of Zika cases in that area. You will need to install the hexbin package prior to use. It might arguably be better to use a density geometry or a choropleth map (see comments), but in creating the long_formapping data frame you will notice that it repeats the same latitude/longitude pair many times, so any resulting map is basically going to represent counts of the same locales. We could jitter the lat/lon pairs a bit to get a better data set for a density or choropleth map. For now we will stick with the hexbin since it overlays the map nicely.

library(hexbin)
mymap <- qmap(c(lon=-56.09667,lat=-15.59611),zoom=4)

# bins controls the hexagon size; 30 is the geom_hex default
mymap + 
  geom_hex(bins=30, data = long_formapping, aes(x = lon, y = lat),alpha=.6,
           inherit.aes=FALSE) + 
  geom_point(data = long_formapping, aes(x = lon, y=lat),
             inherit.aes=FALSE,position = "jitter")+
  scale_fill_gradient(low = "blue", high = "red")

HexBin Chart of Zika Cases in Brazil
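For comparison, here is a hedged sketch of the density-based alternative mentioned above: jitter the repeated coordinates slightly and overlay a 2D density estimate instead of hexagons.

# Jitter the repeated lat/lon pairs a little so the density estimate has
# some spread to work with, then overlay filled density contours
jittered <- long_formapping %>%
  mutate(lon = jitter(lon, amount = 0.3),
         lat = jitter(lat, amount = 0.3))

mymap +
  stat_density2d(data = jittered, aes(x = lon, y = lat, fill = ..level..),
                 geom = "polygon", alpha = 0.4, inherit.aes = FALSE) +
  scale_fill_gradient(low = "blue", high = "red")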

As usual there are many other ways to kick this data around, but I hope I have given you a realistic introduction to working with data that isn’t custom made for summary. Most of the data you encounter will involve cleaning and manipulation prior to getting the best format for visualization and analysis. Speaking of analysis, it can also be the case that you will need to juggle data formats a few times to accommodate your interests; often there is no single ideal format that serves every purpose. This is why it is essential to become experienced with tools like dplyr, the lapply function, and the R programming constructs that glue it all together. Thanks for reading, Steve Pittard © 2016. See my website for information on consulting and R training services.


Express dplyr Part II

Posted: July 7, 2016 in Data Mining, dplyr

This is Part II of the “Express dplyr” posting. If you haven’t already, you might want to review Part I of this topic before proceeding, although if you have some knowledge of dplyr then by all means proceed. Don’t worry, there are no tests! In this second part we will work with some “real” data to help solidify our understanding of the dplyr concepts we studied previously. Note that I put these examples together quickly, so there are probably opportunities for improvement and optimization. I want to present examples that can be understood by newcomers to dplyr, so I’ve constructed them with that in mind.

The Bay Area Bike Share program allows Bay area residents to use bicycles for commuting in the area. Here is a description of the service from the website:

Users of this service can purchase an annual membership online or get a 24-hour or 3-day membership from any station kiosk. 24-hour and 3-day members will receive a ride code from the station kiosk. Enter this code on the keypad next to any available bike. Annual members can bypass the kiosk and insert their membership key at any available dock. Wait for the green light, and pull the bike out while lifting the seat. After your trip, return the bike to any station in your service area. Push the bike firmly into an available dock, and wait for the green light to confirm your trip has ended. Enjoy unlimited trips up to 30 minutes each at no additional cost! Remember, trips longer than 30 minutes will incur overtime fees. 24-hour and 3-day members must request a new ride code for each trip by swiping their credit or debit card at the kiosk

I obtained data relating to this project from the Kaggle Project page. It represents anonymized bike usage data from 2013-2015. The data is about 660MB and you can download it, though you will need a Kaggle account. The download contains an SQLite database, which is very cool, though we’ll ignore that for now and work with the CSV files. I also put together a TAR file that has just the .csv files, which is much smaller. Go here to get it or just execute the following commands.

url <- "http://steviep42.bitbucket.org/YOUTUBE.DIR/SF.TAR"
download.file(url,"SF.TAR")
system("tar -xvf SF.TAR")

library(dplyr)
library(readr)

# Read in the station data
stations <- read_csv("station.csv")  

# Read in the trip data - you might get some messages about missing zip codes
# but don't worry if you do

trips    <- read_csv("trip.csv")

str(stations)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	70 obs. of  7 variables:
 $ id               : int  2 3 4 5 6 7 8 9 10 11 ...
 $ name             : chr  "San Jose Diridon Caltrain Station" "San Jose Civic Center" "Santa Clara at Almaden" "Adobe on Almaden" ...
 $ lat              : num  37.3 37.3 37.3 37.3 37.3 ...
 $ long             : num  -122 -122 -122 -122 -122 ...
 $ dock_count       : int  27 15 11 19 15 15 15 15 15 19 ...
 $ city             : chr  "San Jose" "San Jose" "San Jose" "San Jose" ...
 $ installation_date: chr  "8/6/2013" "8/5/2013" "8/6/2013" "8/5/2013" ...

str(trips)

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	669959 obs. of  11 variables:
 $ id                : int  4576 4607 4130 4251 4299 4927 4500 4563 4760 4258 ...
 $ duration          : int  63 70 71 77 83 103 109 111 113 114 ...
 $ start_date        : chr  "8/29/2013 14:13" "8/29/2013 14:42" "8/29/2013 10:16" "8/29/2013 11:29" ...
 $ start_station_name: chr  "South Van Ness at Market" "San Jose City Hall" "Mountain View City Hall" "San Jose City Hall" ...
 $ start_station_id  : int  66 10 27 10 66 59 4 8 66 10 ...
 $ end_date          : chr  "8/29/2013 14:14" "8/29/2013 14:43" "8/29/2013 10:17" "8/29/2013 11:30" ...
 $ end_station_name  : chr  "South Van Ness at Market" "San Jose City Hall" "Mountain View City Hall" "San Jose City Hall" ...
 $ end_station_id    : int  66 10 27 10 67 59 5 8 66 11 ...
 $ bike_id           : int  520 661 48 26 319 527 679 687 553 107 ...
 $ subscription_type : chr  "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
 $ zip_code          : int  94127 95138 97214 95060 94103 94109 95112 95112 94103 95060 ...

With regard to the two files above, a possible linking key for a join or merge of these two data frames is the id column from stations and the start_station_id and/or end_station_id from trips. However, it might not be necessary to join the data just yet, as some questions can be answered by referencing just a single table. In terms of starting to explore the data, just begin thinking of some fundamental questions to get the creative juices flowing. Unless you have been given a specific assignment or set of questions, there is no “right way” to do something like this. One of the biggest problems I see with students is that if you present them with an open-ended opportunity for investigation they freeze up. They would much rather be given a question set and work against that, but the “real world” isn’t really like that. Just dive in and you will be fine.

Asking Some Questions

How many bikes are there ?

Bikes have a unique identifier and are used any number of times by subscribers. The trips data table has a record of what bike was used in a given trip so we can select that column and then pipe into the distinct() function which will give us a list of all unique bike ids used. This results in a data table whose output we pipe into the nrow() function to count the number of unique bikes used.

trips %>% select(bike_id) %>% distinct()
Source: local data frame [700 x 1]

   bike_id
     (int)
1      520
2      661
3       48
4       26
5      319
6      527
7      679
8      687
9      553
10     107
..     ...

# But if we wanted a single result we could do this

trips %>% select(bike_id) %>% distinct() %>% nrow()
[1] 700

# So how many times was each bike used ? 

trips %>% group_by(bike_id) %>% summarize(times_used=n()) %>% arrange(desc(times_used))
Source: local data frame [700 x 2]

   bike_id times_used
     (int)      (int)
1      392       2061
2      489       1975
3      558       1955
4      267       1951
5      631       1948
6      518       1942
7      532       1933
8      592       1932
9      395       1927
10     368       1926
..     ...        ...

I should point out here that dplyr offers a function called count() that will do the tallying/counting of data and accomplish the group_by() and summarize() steps under the hood. So when determining the number of times each bike was used, consider this:

trips %>% count(bike_id,sort=TRUE)
Source: local data frame [700 x 2]

   bike_id     n
     (int) (int)
1      392  2061
2      489  1975
3      558  1955
4      267  1951
5      631  1948
6      518  1942
7      532  1933
8      592  1932
9      395  1927
10     368  1926
..     ...   ...

Finding Missing Data ?

Since there were some apparent problems when reading in the trips data, let’s see if we have a problem with complete cases. I suspect there are missing values in some of the zip codes. In native R there are functions like complete.cases(), is.na(), and na.omit() that can help you figure out the “missingness” of your data. The following example also gives me an opportunity to explain how dplyr can intersect with native R commands. We will see that the zip_code column has 17,685 missing values. Note that what I do here is pipe the trips data table into a non-dplyr function.

So what columns in the trips data frame contain missing values and if so how many ?

# A pure native R solution
sapply(trips,function(x) sum(is.na(x)))
                id           duration         start_date 
                 0                  0                  0 
start_station_name   start_station_id           end_date 
                 0                  0                  0 
  end_station_name     end_station_id            bike_id 
                 0                  0                  0 
 subscription_type           zip_code 
                 0              17685 


# A dplyr equivalent

trips %>% sapply(function(x) sum(is.na(x)))
                id           duration         start_date start_station_name 
                 0                  0                  0                  0 
  start_station_id           end_date   end_station_name     end_station_id 
                 0                  0                  0                  0 
           bike_id  subscription_type           zip_code 
                 0                  0              17685 
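With a newer version of dplyr (1.0 or later), the same per-column NA count can also be written without sapply(). This is a hedged sketch assuming dplyr 1.0+ is installed.

# One row, one column per variable, each cell holding the number of NAs
trips %>% summarise(across(everything(), ~ sum(is.na(.x))))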


So how many cities are covered by the service ? How many stations per city are there ?

stations %>% count(city)
Source: local data frame [5 x 2]

           city     n
          (chr) (int)
1 Mountain View     7
2     Palo Alto     5
3  Redwood City     7
4 San Francisco    35
5      San Jose    16

# We could also sort the result from highest count to lowest

 stations %>% count(city,sort=TRUE)
Source: local data frame [5 x 2]

           city     n
          (chr) (int)
1 San Francisco    35
2      San Jose    16
3 Mountain View     7
4  Redwood City     7
5     Palo Alto     5

Mapping

Here is something cool. We can pipe the locations of the stations to the leaflet package which provides an R interface to the popular Leaflet Javascript library. Once you install the package you can easily map all the stations. The following will produce an interactive map that from a high level will show circles containing numbers that indicate the number of stations in that area. Click on a circle to “drill down” to a more granular level to see where the stations are. If you click on the cluster circle over Palo Alto you will see it split into two clusters of 7 and 5 corresponding to the cities of Redwood City and Palo Alto respectively. The image below is just a screenshot of course but click it to enlarge the map.

library(leaflet)

m <- leaflet() %>% addTiles() %>% 
     addMarkers(lng=stations$long,lat=stations$lat,popup=stations$name,
                clusterOptions = markerClusterOptions())

m

LeafLet Cluster Map of Ride Share Stations

Sorting Out The Dates

Note that there is a start_date and an end_date for each trip. If we want to do any summaries on this information, it’s not a bad idea to parse them into actual dates recognized by R. The readr package comes with some date and time routines that we can use here, although for more general use the lubridate package is great for manipulating dates and times. This can be confusing since native R provides date functions of its own. For now we will use the parse_datetime() function from the readr package.

trips %>% mutate(start_date=parse_datetime(start_date,
                                           format="%m/%d/%Y %H:%M"),
                 end_date=parse_datetime(end_date,
                                           format="%m/%d/%Y %H:%M")) -> trips

# So it looks like the data set starts on August 29, 2013 and continued to 
# August 31, 2015

range(trips$start_date)
[1] "2013-08-29 09:08:00 UTC" "2015-08-31 23:26:00 UTC"

trips
Source: local data frame [669,959 x 11]

      id duration          start_date       start_station_name start_station_id            end_date         end_station_name
   (int)    (int)              (time)                    (chr)            (int)              (time)                    (chr)
1   4576       63 2013-08-29 14:13:00 South Van Ness at Market               66 2013-08-29 14:14:00 South Van Ness at Market
2   4607       70 2013-08-29 14:42:00       San Jose City Hall               10 2013-08-29 14:43:00       San Jose City Hall
3   4130       71 2013-08-29 10:16:00  Mountain View City Hall               27 2013-08-29 10:17:00  Mountain View City Hall
4   4251       77 2013-08-29 11:29:00       San Jose City Hall               10 2013-08-29 11:30:00       San Jose City Hall
5   4299       83 2013-08-29 12:02:00 South Van Ness at Market               66 2013-08-29 12:04:00           Market at 10th
6   4927      103 2013-08-29 18:54:00      Golden Gate at Polk               59 2013-08-29 18:56:00      Golden Gate at Polk
7   4500      109 2013-08-29 13:25:00   Santa Clara at Almaden                4 2013-08-29 13:27:00         Adobe on Almaden
8   4563      111 2013-08-29 14:02:00      San Salvador at 1st                8 2013-08-29 14:04:00      San Salvador at 1st
9   4760      113 2013-08-29 17:01:00 South Van Ness at Market               66 2013-08-29 17:03:00 South Van Ness at Market
10  4258      114 2013-08-29 11:33:00       San Jose City Hall               10 2013-08-29 11:35:00              MLK Library
..   ...      ...                 ...                      ...              ...                 ...                      ...
Variables not shown: end_station_id (int), bike_id (int), subscription_type (chr), zip_code (int)

Now should we assume that all trips are started and completed on the same day ? My initial guess is that they are although that might not be true and/or there might have been an error in the data capture. Either way it’s probably wise to check. Looks like there were 2,099 trips that were started on one day and finished the next.

How many trips did not finish on the same day they began ?

trips %>% filter(substr(start_date,1,10) 
                        != substr(end_date,1,10)) %>% 
                        summarize(different_days=n())
 different_days
           (int)
1           2099

How many trips were there for each year ?

trips %>% count(substr(start_date,1,4))
Source: local data frame [3 x 2]

  substr(start_date, 1, 4)      n
                     (chr)  (int)
1                     2013 100563
2                     2014 326339
3                     2015 243057

Okay, let’s extend this a little bit. We will look at the number of trips for each day of each year and use the googleVis package to create a heatmap to visualize this. We’ll need to modify the start_date to respect only the calendar date and ignore the hour and minute portion. The way we did this above was with the substr() function. However, the googleVis package expects an actual date as opposed to a character string (which is what substr() returns), so we’ll need to work a little bit harder (but not too much). We use as.Date() to truncate the full date string into just the year, month, and day. We’ll then filter out just the trips that began and ended on the same day, group by start date, and count the number of trips for each day. Remember, we are NOT changing the underlying trips data frame at all; we just use the pipe operator to mutate, filter, group, and summarize the data in a single command chain. There is no need to store temporary or intermediate results in a data frame unless you want to.

library(googleVis)

trips %>% mutate(start_date=as.Date(start_date), 
                 end_date=as.Date(end_date)) %>%
                 filter(start_date == end_date) %>% 
                 count(start_date) -> tripdates

# Create a Gvisplot and then plot it

plot( 
  gvisCalendar(data=tripdates, datevar="start_date", numvar="n",
               options=list(
                 title="Calendar Heat Map of Open Bike Trips",
                 calendar="{cellSize:10,
                 yearLabel:{fontSize:20, color:'#444444'},
                 focusedCellColor:{stroke:'red'}}",
                 width=590, height=320),
               chartid="Calendar")
)

Trips per Day of Year


Trips by Day of the Week

One of the advantages of turning the start and end dates into actual date objects is because we can then use built in R date functions. For example how many trips were there for each day of the week ? We can leverage the strength of the weekdays() function to figure this out.

trips %>% mutate(weekdays = weekdays(start_date)) %>% 
         count(weekdays, sort=TRUE)
Source: local data frame [7 x 2]

   weekdays      n
      (chr)  (int)
1   Tuesday 122259
2 Wednesday 120201
3  Thursday 119089
4    Monday 115873
5    Friday 109361
6  Saturday  44785
7    Sunday  38391

# But let's plot this

trips %>% mutate(weekdays = weekdays(start_date)) %>% 
  count(weekdays, sort=TRUE) %>%
   ggplot(aes(x=weekdays,y=n)) + 
   geom_bar(stat="identity") + 
   ggtitle("Trips per Day of The Week") +
   ylab("Total Trips")

Total Trips per Day of Week

Joins

Up until now we have been working with a single table and summaries thereof. Next let’s use our knowledge of joins and merging to learn more about the data. For example, what are the most popular bike stations ? How would we answer this question ? Well we have a list of all the trips in the trips data frame which includes the beginning and ending station id for each trip. So, for each station, is there a way to find the number of trips starting or ending there ? If so then we can find the stations with the highest number of starts and stops which will then tell us the most popular stations.

# For each station we count the number of trips initiated from the 
# station or ended there. 

# Here we get the number of times a trip started at a station

trips %>% count(start_station_id) -> start

Source: local data frame [70 x 2]

   start_station_id     n
              (int) (int)
1                 2  9558
2                 3  1594
3                 4  3861
4                 5  1257
5                 6  2917
6                 7  2233
7                 8  1692
8                 9  1910
9                10  2393
10               11  2034
..              ...   ...

# Here we get the number of times a trip ended at a station

trips %>% count(end_station_id) -> end

Source: local data frame [70 x 2]

   end_station_id     n
            (int) (int)
1               2  9415
2               3  1786
3               4  3705
4               5  1169
5               6  3163
6               7  2498
7               8  1707
8               9  2200
9              10  1658
10             11  2508
..            ...   ...

# Let's join these two tables using the station_ids as keys. While we do
# the join we can add together the start and end counts for each 
# station. This tells us how "popular" a given station was

inner_join(start,end,by=c("start_station_id"="end_station_id")) %>% 
    mutate(total_trips=n.x+n.y) %>% 
    arrange(desc(total_trips))

Source: local data frame [70 x 4]

   start_station_id   n.x   n.y total_trips
              (int) (int) (int)       (int)
1                70 49092 63179      112271
2                69 33742 35117       68859
3                50 32934 33193       66127
4                60 27713 30796       58509
5                61 25837 28529       54366
6                77 24172 28033       52205
7                65 23724 26637       50361
8                74 24838 25025       49863
9                55 26089 23080       49169
10               76 20165 19915       40080
..              ...   ...   ...         ...

# Okay now let's join this with the stations data frame to get some
# more information such as the latitude and longitude

inner_join(start,end,by=c("start_station_id"="end_station_id")) %>% 
     mutate(total=n.x+n.y) %>% 
     arrange(desc(total)) %>%
     inner_join(stations,c("start_station_id"="id")) %>% 
     select(c(1,4:7,9)) 

Source: local data frame [70 x 6]

   start_station_id total                                          name      lat
              (int)       (int)                                         (chr)    (dbl)
1                70      112271      San Francisco Caltrain (Townsend at 4th) 37.77662
2                69       68859       San Francisco Caltrain 2 (330 Townsend) 37.77660
3                50       66127          Harry Bridges Plaza (Ferry Building) 37.79539
4                60       58509                        Embarcadero at Sansome 37.80477
5                61       54366                               2nd at Townsend 37.78053
6                77       52205                             Market at Sansome 37.78963
7                65       50361                               Townsend at 7th 37.77106
8                74       49863                             Steuart at Market 37.79414
9                55       49169 Temporary Transbay Terminal (Howard at Beale) 37.78976
10               76       40080                                 Market at 4th 37.78630
..              ...         ...                                           ...      ...
Variables not shown: long (dbl), city (chr)

More Mapping

Wouldn’t it be nice to plot this information on a map? We could plot points representing the station locations and make the size of each point proportional to the number of trips starting and ending at that station. This would make it really easy to see which stations get a lot of traffic. To do this we know we are going to need the latitude and longitude information from the stations data frame (which we already have courtesy of the commands above). Also, for purposes of this exercise let’s focus only on San Francisco bike stations; that is, we won’t worry about Palo Alto, Redwood City, Mountain View, or San Jose. Let’s start by reissuing the last command and doing some more selecting to get the right information.

inner_join(start,end,by=c("start_station_id"="end_station_id")) %>% 
     mutate(total=n.x+n.y) %>% 
     arrange(desc(total)) %>%
     inner_join(stations,c("start_station_id"="id")) %>% 
     select(c(1,4:7,9)) %>% filter(city=="San Francisco") -> sfstats

Source: local data frame [35 x 6]

   start_station_id total                                          name      lat      long          city
              (int)       (int)                                         (chr)    (dbl)     (dbl)         (chr)
1                70      112271      San Francisco Caltrain (Townsend at 4th) 37.77662 -122.3953 San Francisco
2                69       68859       San Francisco Caltrain 2 (330 Townsend) 37.77660 -122.3955 San Francisco
3                50       66127          Harry Bridges Plaza (Ferry Building) 37.79539 -122.3942 San Francisco
4                60       58509                        Embarcadero at Sansome 37.80477 -122.4032 San Francisco
5                61       54366                               2nd at Townsend 37.78053 -122.3903 San Francisco
6                77       52205                             Market at Sansome 37.78963 -122.4008 San Francisco
7                65       50361                               Townsend at 7th 37.77106 -122.4027 San Francisco
8                74       49863                             Steuart at Market 37.79414 -122.3944 San Francisco
9                55       49169 Temporary Transbay Terminal (Howard at Beale) 37.78976 -122.3946 San Francisco
10               76       40080                                 Market at 4th 37.78630 -122.4050 San Francisco
..              ...         ...                                           ...      ...       ...           ...

# Next I scale the total number of trips variable down to a reasonable number
# for use with the geom_point function. You could come up with your own 
# scaling function to do this. The idea is that when we plot each station
# on the map we can adjust the size of the respective point to reflect how
# busy that station is. There are other ways to do this of course

sz <- round((sfstats$total)^(1/3)-20)

library(ggmap)
qmap(c(lon=-122.4071,lat=37.78836),zoom=14) -> mymap

mymap + 
  geom_point(data=sfstats, aes(x=long,y=lat),
             color="red", alpha=0.4,size=sz) + 
             geom_text(data=sfstats,
                       aes(x=long,y=lat,label=start_station_id),size=4)


Plot of SF Stations by Popularity


Two things to note here: 1) the numbers in the circles are the station ids of the associated stations. We could have put up the actual station names but that would look messy. 2) There is some overplotting in the lower part of the map because there are two Caltrain related stations very close to each other, both of which experience lots of use, so their ids are stacked on top of each other at the current map resolution. We could fix this by changing the resolution or plotting these points separately (left as an exercise for the reader, though one option is sketched after the station counts below).

       San Francisco Caltrain (Townsend at 4th)       49092
        San Francisco Caltrain 2 (330 Townsend)       33742
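One hedged option for the overlapping labels is the ggrepel package (not used in the original code above), which nudges overlapping labels apart automatically.

library(ggrepel)

mymap +
  geom_point(data=sfstats, aes(x=long,y=lat),
             color="red", alpha=0.4, size=sz) +
  geom_text_repel(data=sfstats,
                  aes(x=long,y=lat,label=start_station_id), size=4)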

The “trick” with the sz vector is to find a way to scale the total_trips column in the sfstats data table so that it can be used to specify the size of each point. Here I took the cube root of the total count and subtracted an arbitrary number. I experimented with this but generally find that root functions can be useful when scaling count data.
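Written out as a small helper function (scale_counts is a hypothetical name; the offset of 20 is the arbitrary constant used above):

# Cube-root scaling of trip counts into point sizes
scale_counts <- function(x, offset = 20) round(x^(1/3) - offset)
sz <- scale_counts(sfstats$total)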

What Next ?

We’ve just scratched the surface here. What about identifying the most frequently used trip segments? We could identify all trip pairs (starting and ending station ids) and count the number of times each occurs. We could then draw segments on the map to more easily visualize the popular routes. In a different direction, there is another CSV file that came with the data called weather.csv that has meteorological information for each day of the year. We could investigate the impact that weather has on bike usage. Lastly, the data set on Kaggle comes with an SQLite database which we could also use in a couple of ways: dplyr has the ability to connect to SQLite databases, and we could also use the RSQLite connection package to execute SQL queries against the database. I’ll explore these options in a future posting. Thanks, Steve Pittard
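For example, counting the trip pairs is a one-liner with count(); a quick sketch of the first idea above:

# Most frequently used (start, end) station pairs
trips %>% count(start_station_id, end_station_id, sort = TRUE)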