Posts Tagged ‘database joins’

Express Intro to dplyr

Posted: June 29, 2016 in Data Mining
Tags: ,

Working The Data Like a Boss !

I recently introduced the data.table package which provides a nice way to manage and aggregate large data sources using the standard bracket notation that is commonly employed when manipulating data frames in R. As data sources grow larger one must be prepared with a variety of approaches to efficiently handle this information. Using databases (both SQL and NoSQL) are a possibility wherein one queries for a subset of information although this assumes that the database is pre-existing or that you are prepared to create it yourself. The dplyr package offers ways to read in large files, interact with databases, and accomplish aggregation and summary. Some feel that dplyr is a competitor to the data.table package though I do not share that view. I think that each offers a well-conceived philosophy and approach and does a good job of delivering on their respective design goals. That there is overlap in their potential applications simply means to me that there is another way to do something. They are just great tools in a larger toolbox so I have no complaints. Let’s dig into dplyr to learn what it can do. Note that this post is part one of two. The second dplyr blog will apply the knowledge learned in this post.

Upcoming Class

Before we get too deep into this I wanted to indicate that I will be teaching a 3-day Intro to R BootCamp in the Atlanta, GA area of the US sometime in August or September. I say “sometime” because the logistics are still under development. If interested please feel free to email me and once I get everything lined up I will get back to you with the details. You can also visit my home page. Thanks for indulging my self-promotion. Steve – Now Back to the Action…

Verbs in Action !

dplyr is based on the idea that when working with data there are a number of common activities one will pursue: reading, filtering rows on some condition, selecting or excluding columns, arranging/sorting, grouping, summarize, merging/joining, and mutating/transforming columns. There are other activities but these describe the main categories. dplyr presents a number of commands or “verbs” that help you accomplish the work. Note that dplyr does not replace any existing commands – it simply gives you new commands:

Command Purpose
select() Select columns from a data frame
filter() Filter rows according to some condition(s)
arrange() Sort / Re-order rows in a data frame
mutate() Create new columns or transform existing ones
group_by() Group a data frame by some factor(s) usually in conjunction to summary
summarize() Summarize some values from the data frame or across groups
inner_join(x,y,by=”col”) return all rows from ‘x’ where there are matching values in ‘x’, and all columns from ‘x’ and ‘y’. If there are multiple matches between ‘x’ and ‘y’, all combination of the matches are returned.
left_join(x,y,by=”col”) return all rows from ‘x’, and all columns from ‘x’ and ‘y’. Rows in ‘x’ with no match in ‘y’ will have ‘NA’ values in the new columns. If there are multiple matches between ‘x’ and ‘y’, all combinations of the matches are returned.
right_join(x,y,by=”col”) return all rows from ‘y’, and all columns from ‘x’ and y. Rows in ‘y’ with no match in ‘x’ will have ‘NA’ values in the new columns. If there are multiple matches between ‘x’ and ‘y’, all combinations of the matches are returned
anti_join(x,y,by=”col”) return all rows from ‘x’ where there are not matching values in ‘y’, keeping just columns from ‘x’

 

readr

There is also an associated package called readr that is more efficient at ingesting CSV files than the base R functions such as read.csv. While it is not part of the actual dplyr package it does in fact produce a dplyr structure as it reads in files. readr provides the read_csv function to do the work. It is also pretty smart and can figure things out like if there is a header or not so you don’t have to provide a lot of additional arguments. Here is an example using a file that contains information on weather station measurements in the year 2013.

install.packages("readr")  # one time only 
library(readr)

url <- "http://steviep42.bitbucket.org/YOUTUBE.DIR/weather.csv"
download.file(url,"weather.csv")

system("head -5 weather.csv")  # Take a peak at the first 5 lines

"origin","year","month","day","hour","temp","dewp","humid","wind_dir","wind_speed","wind_gust","precip","pressure","visib"
"EWR",2013,1,1,0,37.04,21.92,53.97,230,10.35702,11.9186514756,0,1013.9,10
"EWR",2013,1,1,1,37.04,21.92,53.97,230,13.80936,15.8915353008,0,1013,10
"EWR",2013,1,1,2,37.94,21.92,52.09,230,12.65858,14.5672406924,0,1012.6,10
"EWR",2013,1,1,3,37.94,23,54.51,230,13.80936,15.8915353008,0,1012.7,10

weather <- read_csv("weather.csv")

weather
Source: local data frame [8,719 x 14]

   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
    (chr) (int) (int) (int) (int) (dbl) (dbl) (dbl)    (int)      (dbl)
1     EWR  2013     1     1     0 37.04 21.92 53.97      230   10.35702
2     EWR  2013     1     1     1 37.04 21.92 53.97      230   13.80936
3     EWR  2013     1     1     2 37.94 21.92 52.09      230   12.65858
4     EWR  2013     1     1     3 37.94 23.00 54.51      230   13.80936
5     EWR  2013     1     1     4 37.94 24.08 57.04      240   14.96014
6     EWR  2013     1     1     6 39.02 26.06 59.37      270   10.35702
7     EWR  2013     1     1     7 39.02 26.96 61.63      250    8.05546
8     EWR  2013     1     1     8 39.02 28.04 64.43      240   11.50780
9     EWR  2013     1     1     9 39.92 28.04 62.21      250   12.65858
10    EWR  2013     1     1    10 39.02 28.04 64.43      260   12.65858
..    ...   ...   ...   ...   ...   ...   ...   ...      ...        ...
Variables not shown: wind_gust (dbl), precip (dbl), pressure (dbl), visib (dbl)

tbl_df

It is important to note that dplyr works transparently with existing R data frames though ideally one should explicitly create or transform an existing data frame to a dplyr structure to get the full benefit of the package. Let’s use the dplyr tbl_df command to wrap an existing data frame. We’ll convert the infamous mtcars data frame into a dplyr table since it is a small data frame that is easy to understand. The main advantage in using a ‘tbl_df’ over a regular data frame is the printing: tbl objects only print a few rows and all the columns that fit on one screen, describing the rest of it as text.

dp_mtcars <- tbl_df(mtcars)

# dp_mtcars is a data frame as well as a dplyr object

class(dp_mtcars)
[1] "tbl_df"     "tbl"        "data.frame"

In the example below (as with the readr example above) notice how only a subset of the data gets printed by default. This is actually very nice especially if you have ever accidentally typed the name of a really, really large native data frame. R will dutifully try to print a large portion of the data even if it locks up your R session. So wrapping the data frame in a dplyr table will prevent this. Also notice how you get a summary of the number of rows and columns as well as the type of each column.

dp_mtcars
Source: local data frame [32 x 11]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
..

Now we could start to operate on this data frame / dplyr table by using some of the commands on offer from dplyr. They do pretty much what the name implies and you could use them in isolation though the power of dplyr comes through when using the piping operator to chain together commands. We’ll get there soon enough. Here are some basic examples:

filter()


# Find all rows where MPG is >= 30 and Weight is over 1.8 tons

filter(dp_mtcars, mpg >= 30 & wt > 1.8)
Source: local data frame [2 x 11]

    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
2  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1

select()

The following example illustrates how the select() function works. We will select all columns whose name begins with the letter “m”. This is more useful when you have lots of columns that are named according to some pattern. For example some Public Health data sets can have many, many columns (hundreds even) so counting columns becomes impractical which is why select() supports a form of regular expressions to find columns by name. Other helpful arguments in this category include:

Argument Purpose
ends_with(x, ignore.case=TRUE) Finds columns whose nqme ends with “x”
contains(x, ignore.case=TRUE) Finds columns whose nqme contains “x”
matches(x, ignore.case=TRUE) Finds columns whose names match the regular expression “x”
num_range(“x”,1:5, width=2) selects all variables (numerically) from x01 to x05
one_of(“x”, “y”, “z”) Selects variables provided in a character vector
select(dp_mtcars,starts_with("m"))
Source: local data frame [32 x 1]

     mpg
   (dbl)
1   21.0
2   21.0
3   22.8
4   21.4
5   18.7
6   18.1
7   14.3
8   24.4
9   22.8
10  19.2

# Get all columns except columns 5 through 10 

select(dp_mtcars,-(5:10))
Source: local data frame [32 x 5]

     mpg   cyl  disp    hp  carb
   (dbl) (dbl) (dbl) (dbl) (dbl)
1   21.0     6 160.0   110     4
2   21.0     6 160.0   110     4
3   22.8     4 108.0    93     1
4   21.4     6 258.0   110     1
5   18.7     8 360.0   175     2
6   18.1     6 225.0   105     1
7   14.3     8 360.0   245     4
8   24.4     4 146.7    62     2
9   22.8     4 140.8    95     2
10  19.2     6 167.6   123     4
..   ...   ...   ...   ...   ...

mutate()

Here we use the mutate() function to transform the wt variable by multiplying it by 1,000 and then we create a new variable called “good_mpg” which takes on a value of “good” or “bad” depending on if a given row’s MPG value is > 25 or not

mutate(dp_mtcars, wt=wt*1000, good_mpg=ifelse(mpg > 25,"good","bad"))
Source: local data frame [32 x 12]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb good_mpg
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)    (chr)
1   21.0     6 160.0   110  3.90  2620 16.46     0     1     4     4      bad
2   21.0     6 160.0   110  3.90  2875 17.02     0     1     4     4      bad
3   22.8     4 108.0    93  3.85  2320 18.61     1     1     4     1      bad
4   21.4     6 258.0   110  3.08  3215 19.44     1     0     3     1      bad
5   18.7     8 360.0   175  3.15  3440 17.02     0     0     3     2      bad
6   18.1     6 225.0   105  2.76  3460 20.22     1     0     3     1      bad
7   14.3     8 360.0   245  3.21  3570 15.84     0     0     3     4      bad
8   24.4     4 146.7    62  3.69  3190 20.00     1     0     4     2      bad
9   22.8     4 140.8    95  3.92  3150 22.90     1     0     4     2      bad
10  19.2     6 167.6   123  3.92  3440 18.30     1     0     4     4      bad
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...      ...

arrange()

Next we could sort or arrange the data according to some column values. This is usually to make visual inspection of the data easier. Let’s sort the data frame by cars with the worst MPG and then sort by weight from heaviest to lightest.

arrange(dp_mtcars,mpg,desc(wt))
Source: local data frame [32 x 11]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1   10.4     8 460.0   215  3.00 5.424 17.82     0     0     3     4
2   10.4     8 472.0   205  2.93 5.250 17.98     0     0     3     4
3   13.3     8 350.0   245  3.73 3.840 15.41     0     0     3     4
4   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
5   14.7     8 440.0   230  3.23 5.345 17.42     0     0     3     4
6   15.0     8 301.0   335  3.54 3.570 14.60     0     1     5     8
7   15.2     8 275.8   180  3.07 3.780 18.00     0     0     3     3
8   15.2     8 304.0   150  3.15 3.435 17.30     0     0     3     2
9   15.5     8 318.0   150  2.76 3.520 16.87     0     0     3     2
10  15.8     8 351.0   264  4.22 3.170 14.50     0     1     5     4
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...

Ceci n’est pas une pipe

While the above examples are instructive they are not, at least in my opinion, the way to best use dplyr. Once you get up to speed with dplyr functions I think you will soon agree that using “pipes” to create chains of commands is the way to go. If you come from a UNIX background you will no doubt have heard of “pipes” which is a construct allowing you to take the output of one command and route it or “pipe” it to the input of another program. This can be done for several commands thus creating a chain of piped commands. One can save typing while creating, in effect, a “one line program” that does a lot. Here is an example of a UNIX pipeline. I’m using Apple OSX but this should work on a Linux machine just as well. This example will pipe the output of the /etc/passwd file into the input of the awk command (a program used to parse files) and the output of the awk command will go into the input of the tail command which lists the last 10 lines of the final result.

 $ cat /etc/passwd | awk -F: '{print $1}' | tail
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_xcsbuildagent
_xcscredserver
_launchservicesd

$ 

This is a great paradigm for working on UNIX that also maps well for what one does in data exploration. When first encountering data you rarely know what it is you want to get from it (unless you are a student and your teacher told you specifically what she or he wants). So you embark on some exploratory work and start to interrogate the data which might first require some filtering and maybe exclusion of incomplete data or maybe some imputation for missing values. Until you have worked with it for a while you don’t want to change the data – you just want to experiment with various transformed and grouped versions of it which is much easier if you use dplyr. Just pipe various commands together to clean up your data, make some visualizations, and perhaps generate some hypotheses about your data. You find yourself generating some pretty involved adhoc command chains without having to create a standalone script file. The dplyr package uses the magrittr package to enable this piping capability within R. The “pipe” character is “%>%” which is different from the traditional UNIX pipe which is the vertical bar “|”. But don’t let the visual difference confuse you as, conceptually, pipes in R work just like they do in UNIX. The magrittr package has a motto “Ceci n’est pas une pipe” presumably in acknowledgement of the noted difference and also as a tribute to the painter Rene Magritte’s work La trahison des images.


# Here we filter rows where MPG is >= 25 and then select only rows 1-4
# and 10-11.

dp_mtcars %>% filter(mpg >= 25) %>% select(-c(5:9)) 
Source: local data frame [6 x 6]

    mpg   cyl  disp    hp  gear  carb
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1  32.4     4  78.7    66     4     1
2  30.4     4  75.7    52     4     2
3  33.9     4  71.1    65     4     1
4  27.3     4  79.0    66     4     1
5  26.0     4 120.3    91     5     2
6  30.4     4  95.1   113     5     2

Next we filter rows where MPG is >= 25 and then select only rows 1-4 and 10-11 after which we sort the result by MPG from highest to lowest. You can keep adding as many pipes as you wish. At first, while you are becoming familiar with the idea, it is best to keep the pipeline relatively short so you can check your work. But it will not be long before you are stringing together lots of different commands. dplyr enables and encourages this type of activity so don’t be shy.

dp_mtcars %>% filter(mpg >= 25) %>% select(-c(5:9)) %>% arrange(desc(mpg))
Source: local data frame [6 x 6]

    mpg   cyl  disp    hp  gear  carb
  (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1  33.9     4  71.1    65     4     1
2  32.4     4  78.7    66     4     1
3  30.4     4  75.7    52     4     2
4  30.4     4  95.1   113     5     2
5  27.3     4  79.0    66     4     1
6  26.0     4 120.3    91     5     2

That was pretty cool wasn’t it ? We don’t need to alter dp_mtcars at all to explore it. We could change our minds about how and if we want to filter, select, or sort. The way this works is that the output of the dp_mtcars data frame/table gets sent to the input of the filter function that is aware of the source which is why we don’t need to explicitly reference dp_mtcars by name. The output of the filter step gets sent to the select function which in turns pipes or chains its output into the input of the arrange function which sends its output to the screen. We could even pipe the output of these operations to the ggplot2 package. But first let’s convert some of the columns into factors so the resulting plot will look better.

# Turn the cyl and am variables into factors. Notice that the resulting
# output reflects the change

dp_mtcars %>%
mutate(cyl=factor(cyl,levels=c(4,6,8)),
am=factor(am,labels=c("Auto","Manual" )))
mpg    cyl  disp    hp  drat    wt  qsec    vs     am  gear  carb
(dbl) (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (fctr) (dbl) (dbl)
1   21.0      6 160.0   110  3.90 2.620 16.46     0 Manual     4     4
2   21.0      6 160.0   110  3.90 2.875 17.02     0 Manual     4     4
3   22.8      4 108.0    93  3.85 2.320 18.61     1 Manual     4     1
4   21.4      6 258.0   110  3.08 3.215 19.44     1   Auto     3     1
5   18.7      8 360.0   175  3.15 3.440 17.02     0   Auto     3     2
6   18.1      6 225.0   105  2.76 3.460 20.22     1   Auto     3     1
7   14.3      8 360.0   245  3.21 3.570 15.84     0   Auto     3     4
8   24.4      4 146.7    62  3.69 3.190 20.00     1   Auto     4     2
9   22.8      4 140.8    95  3.92 3.150 22.90     1   Auto     4     2
10  19.2      6 167.6   123  3.92 3.440 18.30     1   Auto     4     4
..   ...    ...   ...   ...   ...   ...   ...   ...    ...   ...   ...

But that was kind of boring – Let’s visualize this using the ggplot package whose author, Hadley Wickham, is also the author of dplyr.

dp_mtcars %>% mutate(cyl=factor(cyl,levels=c(4,6,8)),
                     am=factor(am,labels=c("Auto","Manual" ))) %>%
                     ggplot(aes(x=wt,y=mpg,color=cyl)) +
                     geom_point() + facet_wrap(~am)

ggplot_out

Okay well that might have been too much for you and that’s okay if it is. Let’s break this down into two steps. First let’s save the results of the mutate operation into a new data frame.

new_dp_mtcars <- dp_mtcars %>% mutate(cyl=factor(cyl,levels=c(4,6,8)),
                     am=factor(am,labels=c("Auto","Manual" )))

# Now we can call the ggplot command separately

ggplot(new_dp_mtcars,aes(x=wt,y=mpg,color=cyl)) +
                     geom_point() + facet_wrap(~am)

Pick whatever approach you want to break things down to the level you need. However, I guarantee that after a while you will probably wind up writing lots of one line programs.

Split-Apply-Combine

There are two more commands from the dplyr package that are particularly useful in aggregating data. The group_by() and summarize() functions help us group a data frame according to some factors and then apply some summary functions across those groups. The idea is to first “split” the data into groups, “apply” some functions (e.g. mean()) to some continuous quantity relating to each group, and then combine those group specific results back into an integrated result. In the next example we will group (or split) the data frame by the cylinder variable and then summarize the mean MPG for each group and then combine that into a final aggregated result.

dp_mtcars %>% group_by(cyl) %>% summarize(avg_mpg=mean(mpg))
Source: local data frame [3 x 2]

    cyl  avg_mpg
  (dbl)    (dbl)
1     4 26.66364
2     6 19.74286
3     8 15.10000

# Let's group by cylinder then by transmission type and then apply the mean
# and sd functions to mpg

dp_mtcars %>% group_by(cyl,am) %>% summarize(avg_mpg=mean(mpg),sd=sd(mpg))
Source: local data frame [6 x 4]
Groups: cyl [?]

    cyl    am  avg_mpg        sd
  (dbl) (dbl)    (dbl)     (dbl)
1     4     0 22.90000 1.4525839
2     4     1 28.07500 4.4838599
3     6     0 19.12500 1.6317169
4     6     1 20.56667 0.7505553
5     8     0 15.05000 2.7743959
6     8     1 15.40000 0.5656854

# Note that just grouping a data frame without summary doesn't appear to do 
# much from a visual point of view. 

dp_mtcars %>% group_by(cyl)
Source: local data frame [32 x 11]
Groups: cyl [3]

     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
..   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...   ...

Merging Data Frames

One of the strengths of dplyr is it’s ability to do merges via various “joins” like those associated with database joins. There is already a built-in R command called merge that can handle merging duties but dplyr offers flexible and extended capabilities in this regard. Moreover it does so in a way that is consistent (for the most part) with SQL which you can use for a wide variety of data mining tasks. If you already know SQL then you will understand these commands without much effort. Let’s set up two example simple data frames to explain the concept of joining.

df1 <- data.frame(id=c(1,2,3),m1=c(0.98,0.45,0.22))
df2 <- data.frame(id=c(3,4),m1=c(0.17,0.66))

df1
  id   m1
1  1 0.98
2  2 0.45
3  3 0.22

df2
  id   m1
1  3 0.17
2  4 0.66

Left Join

Think about what it means to merge these data frames. It makes sense to want to join the data frames with respect to some common column name. In this case it is clear that the id column is in both data frames. So let’s join the data frames using “id” as a “key”. The question is what to do about the fact that there is no id in df2 corresponding to id number 2. This is why different types of joins exist. Let’s see how they work. We’ll start with the left join:

left_join(df1,df2,by="id")
  id m1.x m1.y
1  1 0.98   NA
2  2 0.45   NA
3  3 0.22 0.17

So the left join looks at the first data frame df1 and then attempts to find corresponding “id” values in df2 that match all id values in df1. Of course there are no ids matching 2 or 3 in df2 so what happens ? The left join will insert NAs in the m1.y column since there are no values in df2. Note that there is in fact an id of value 3 in both data frames so it fills in both measurement columns with the values. Also note that since in both data frames there is a column named “m1” so it has to create unique names to accommodate both columns. The “x” and “y” come from the fact that df1 comes before df2 in the calling sequence to left_join. Thus “x” matches df1 and “y” matches df2.

Inner Join

Let’s join the two data frames in a way that yields only the intersection of the two data structures based on “id”. Using visual examination we can see that there is only one id in common to both data frames – id 3.

inner_join(df1,df2,by="id")
  id m1.x m1.y
1  3 0.22 0.17

More Involved Join Examples

Now we’ll look at a more advanced example. Let’s create two data frames where the first, (we’ll call it “authors”), presents a list of, well, authors. The second data frame presents a list of books published by various authors. Each data frame has some additional attributes of interest.

# For reference sake - these data frames come from the examples contained in 
# the help pages for the built-in R merge command

authors <- data.frame(
         surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
         nationality = c("US", "Australia", "US", "UK", "Australia"),
         deceased = c("yes", rep("no", 4)))
     
books <- data.frame(
         name = I(c("Tukey", "Venables", "Tierney",
                  "Ripley", "Ripley", "McNeil", "R Core")),
         title = c("Exploratory Data Analysis",
                   "Modern Applied Statistics ...",
                   "LISP-STAT",
                   "Spatial Statistics", "Stochastic Simulation",
                   "Interactive Data Analysis",
                   "An Introduction to R"),
         other.author = c(NA, "Ripley", NA, NA, NA, NA,
                          "Venables & Smith"))

authors
   surname nationality deceased
1    Tukey          US      yes
2 Venables   Australia       no
3  Tierney          US       no
4   Ripley          UK       no
5   McNeil   Australia       no
 
books
      name                         title     other.author
1    Tukey     Exploratory Data Analysis             <NA>
2 Venables Modern Applied Statistics ...           Ripley
3  Tierney                     LISP-STAT             <NA>
4   Ripley            Spatial Statistics             <NA>
5   Ripley         Stochastic Simulation             <NA>
6   McNeil     Interactive Data Analysis             <NA>
7   R Core          An Introduction to R Venables & Smith

At first glance it appears that there is nothing in common between these two data frames in terms of column names. However, it is fairly obvious that the “surname” column in the authors data frame matches the “name” column in books so we could probably use those as keys to join the two data frames. We also see that there is an author ,”R Core” (meaning the R Core Team), who appears in the books table though is not listed as an author in the authors data frame. This kind of thing happens all the time in real life so better get used to it. Let’s do some reporting using these two data frames:

Let’s find all authors listed in the authors table who published a book along with their book titles, other authors, nationality, and living status. Let’s try an inner join on this. Because we don’t have any common column names between books and authors we have to tell the join what columns to use for matching. The by argument exists for this purpose. Note also that the author “R Core” listed in books isn’t printed here because that author does not also exist in the authors table. This is because the inner join looks for the intersection of the tables.


inner_join(books,authors,by=c("name"="surname"))
      name                         title other.author nationality deceased
1    Tukey     Exploratory Data Analysis         <NA>          US      yes
2 Venables Modern Applied Statistics ...       Ripley   Australia       no
3  Tierney                     LISP-STAT         <NA>          US       no
4   Ripley            Spatial Statistics         <NA>          UK       no
5   Ripley         Stochastic Simulation         <NA>          UK       no
6   McNeil     Interactive Data Analysis         <NA>   Australia       no

# We could have also done a right join since this will require a result that has
# all rows form the "right" data frame (in the "y" position) which in this case is 
# authors

right_join(books,authors,by=c("name"="surname"))
      name                         title other.author nationality deceased
1    Tukey     Exploratory Data Analysis         <NA>          US      yes
2 Venables Modern Applied Statistics ...       Ripley   Australia       no
3  Tierney                     LISP-STAT         <NA>          US       no
4   Ripley            Spatial Statistics         <NA>          UK       no
5   Ripley         Stochastic Simulation         <NA>          UK       no
6   McNeil     Interactive Data Analysis         <NA>   Australia       no

Next, find any and all authors who published a book even if they do not appear in the authors table. The result should show names, titles, other authors, nationality, and living status. Let’s do a left join which will pull in all rows from “x” (books) and where there is no matching key/name in authors then NAs will be inserted for columns existing in the “y” (authors) table.

left_join(books,authors,by=c("name"="surname"))
      name                         title     other.author nationality deceased
1    Tukey     Exploratory Data Analysis             <NA>          US      yes
2 Venables Modern Applied Statistics ...           Ripley   Australia       no
3  Tierney                     LISP-STAT             <NA>          US       no
4   Ripley            Spatial Statistics             <NA>          UK       no
5   Ripley         Stochastic Simulation             <NA>          UK       no
6   McNeil     Interactive Data Analysis             <NA>   Australia       no
7   R Core          An Introduction to R Venables & Smith        <NA>     <NA>

Do the same as above but the result should show only the book title and name columns
in that order. This is simply a matter of doing the previous join and piping the result to a filter statement

left_join(books,authors,by=c("name"="surname")) %>% select(title,name)
                          title     name
1     Exploratory Data Analysis    Tukey
2 Modern Applied Statistics ... Venables
3                     LISP-STAT  Tierney
4            Spatial Statistics   Ripley
5         Stochastic Simulation   Ripley
6     Interactive Data Analysis   McNeil
7          An Introduction to R   R Core

Find the book names of all US authors and who are not deceased. Well first we filter the authors table to filter out rows according the specified conditions. Then we can pass the result to an inner_join() statement to get the book titles and then we pass that result to select only the book titles. Note that because we are piping the output from the filter() results we don’t need to specify that in the call to inner_join(). That is, the inner_join function assumes that the filter() results represent the “x” position in the call to inner_join()

authors %>% filter(deceased == "no" & nationality == "US") %>%
            inner_join(books,by=c("surname"="name")) %>% select(title)surname 
title
1 LISP-STAT

Find any book titles for authors who do not appear in the authors data frame. Here we use an anti_join() which returns all rows from books where there are no matching values in authors, keeping just columns from books – and then we pass that result to select for title and name

anti_join(books,authors,by=c("name"="surname")) %>% select(title,name)
                 title   name
1 An Introduction to R R Core

Up Next – Biking in San Francisco

That’s it for now and we have covered a lot of ground in one go although once you invest some time in playing with dplyr (especially the pipes) then it becomes difficult to go back to the “old ways” of doing things. Next up we will look at some “real” data and apply our new found knowledge to working with it. The data set actually comes from the Kaggle Project page for the San Francisco Bay Area Bike Share Service. The data is about 660MB and you can download it though you will need a Kaggle account. You might want to go ahead and download that data in anticipation of the next posting.

Advertisements