Archive for the ‘Graphics’ Category

As we move into the Holidays I’m reminded of the family reunions taking place the world over and how such encounters can sometimes lead to skirmishes initiated by unsolicited advice  – “Let me tell you what your problem is” or “When are you going to get a real job ?” While I can not help you navigate the troubled waters of family politics, I do sometimes see similar behavior exhibited within the Data Science community particularly as it relates to graphics packages and whether Base R conventions have a place alongside the tidyverse or if data.table is truly superior to dplyr. For practical purposes I’ll focus on R graphics packages in this post.

Whatever approach you prefer will depend on your background, what you are trying to do, in what time frame, and for whom. I won’t contribute to the ongoing and never ending “R vs Python for Data Science” battle just as I won’t claim that one graphics package is superior to all others. They all have considerable strengths. Historically, Base graphics has been there all along, followed by lattice and then ggplot2. There is also the very cool Grid package which provides powerful low level functions (used by the other packages). So my how the R family has grown. This has also led to inter-family squabbles which I feel are unnecessary.

Family Rivalry

“Back in my day, we made our histograms by hand”

Interestingly, I recently encountered someone who referred to herself a “ggplot programmer” as opposed to, for example, an “R Programmer“.  While it does not bother me that the tidyverse provides a well-conceived point of entry into the world of R, it DOES bother me that someone might then suggest that the previous ways of dealing with things  have somehow become invalid or inferior. Many of my students go on to work at places like the CDC which has a large body of R code that involves liberal use of Base R functions and supporting logic, thus I wouldn’t be doing my job unless I made sure they knew something about it. Put another way, Base Graphics is not the “drunk uncle” of R visualization tools – a beloved aging family member who refuses to acknowledge that their prime occurred years ago – (I’m thinking Uncle Rico from Napoleon Dynamite as I write this). I tend to use a combination of approaches (both old and new) but then, in my family, I’m the “peace maker” / diplomat so I try to focus on the positive aspects contributed by each.

An Example Is In Order

I have an example that I hope will provoke some thought. The mlbench package has a data frame called PimaIndians which contains mostly measured / continuous data on 768 participants who have also been labelled as having diabetes or not.

# Load mlbench as well as the data set of interest


# Get a shorter name

pm <- PimaIndiansDiabetes

'data.frame':    768 obs. of  9 variables:
 $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
 $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
 $ pressure: num  72 66 64 66 40 74 50 0 70 96 ...
 $ triceps : num  35 29 0 23 35 0 32 0 45 0 ...
 $ insulin : num  0 0 0 94 168 0 88 0 543 0 ...
 $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
 $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
 $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

Respect Your Elders

I will use Base graphics to make a series of boxplots that compare all features for each level of the diabetes factor which is either “pos” or “neg”. The biggest “hack” here is the use of the as.formula() function to coerce a character string into something that can be then be parsed / recognized by the boxplot() function. To me, this isn’t a hack at all, it’s merely using a long existing function as it was intended. In fact most of the cognitive load here relates to understanding the as.formula() function and perhaps the par() function neither of which produces the actual boxplot. I also realize that the resulting figures do not share a common axis which is something that one would typically want although, again, in this case, I just wanted to get a look at the data with each feature reflecting its respective scale.

# Load mlbench as well as the data set of interest


# Get a shorter name

pm <- PimaIndiansDiabetes

# Set up the Plot Window manually


for (ii in 1:(nrow(pm)-1)) {
  form <- as.formula(paste(names(pm)[ii]," ~ diabetes",sep=""))
par(mfrow=c(1,1)) # Reset the plot window

Figure 1 - Boxplots Base R


What About The Tidyverse ?

Now consider the following code which uses elements of the tidyverse to produce a similar plot. In particular, we use the gather() function to reshape the data into a tidier format that lends itself to plotting with ggplot()

# Make the data tidy and then pipe to ggplot


gather(pm,key="variable",val="value",-diabetes) %>%
       ggplot(aes(x=diabetes,y=value)) + 
       geom_boxplot() + 
       facet_wrap(~variable,scales="free") + 

Figure 2 - Boxplots ggplot2

Which is the better figure ? Note, I’m NOT asking which figure LOOKS better but rather which approach is the best. Sorry for the abuse of all caps but I needed to make a point.

Looks Aren’t Everything

The Base graphics solution feels very natural to me as someone who entered the world of graphics programming by using “primitives” to build plots in layers which works well with the “pen on paper” model adopted by Base graphics. Once you draw something in the plot window then it’s there. You can draw over it , but sometimes it’s just as easy to start over. When I want to reproduce a plot I’ve seen in a scientific journal, (or even a consumer graphic in a newspaper). I can usually knock it out using a combination of high and low level Base graphics commands in about half the time it would take to produce a comparable ggplot or lattice version.

On the other hand, I appreciate the elegance and flow of tidyverse packages which allows me to think more about shaping data (and what I want to do with it) more so than the underlying options specific to a given geometry (well most of the time). To be fair, ggplot also allows one to build a plot in layers which becomes essential when reproducing against a reference example. Both lattice and ggplot will do a lot of the heavy lifting for you which is truly a marquee feature though sometimes you will definitely need to first decompose an intended plot into the appropriate aesthetics, mappings, geometries, etc. It could be argued that that is an essential step anyway. I once had a student ask, “How do I keep ggplot from doing so much for me “. Of course, as one acquires more experience then custom annotations and plots become much easier so there is an answer to that question.

In any case, were I to give someone an assignment to reproduce the plots above, I wouldn’t care what they used. Particularly for exploratory work – use whatever you darn well want.


Another Example

Here is another Base graphics solution for creating density plots from the same data. I showed this to a ggplot2 advocate who then left the room ! Granted there are some ways to clean up the code but then I didn’t spend very much time on it all and still managed to get a decent result. As with the previous Base example, most of the confusion one might experience in looking at this code relates to prepping the data and making the input variables palatable to the density function. Nothing here is a “hack” if you already know about manipulating formulae in R which I believe is a reasonable expectation.

splitdf <- split(pm,pm$diabetes)

 for (ii in 1:8) {
    a <- paste("splitdf[[1]]$",names(pm)[ii],sep="")
    a <- paste("splitdf[[2]]$",names(pm)[ii],sep="")


Figure 3 - Base R Density Plots

Lastly, I made a comment in this thread about how using Base R graphics simplifies the ordering or bar charts which is something that takes a few extra steps in ggplot. As an example, I tossed out the following of how to do it in Base R using the composite function approach.

barplot(rev(sort(table(mtcars$cyl))),horiz = TRUE)

I received a response that the following would work although it, like the example above, horrified my tidyverse colleague who views it as an entirely inappropriate application of pipes. However, for those of us with a UNIX background, using pipes to string together anything and everything is a completely legitimate thing to do – pipes don’t cease to work in the presence of Base R commands.

mtcars$cyl %>%
     table() %>% sort() %>% rev() %>% barplot(horiz = TRUE) 


Happy Holidays

So there is room at the family table for all graphics packages. No need to bicker about the differences which we should learn to appreciate and accept as strengths. None are perfect and some are younger than others but the wisdom of the older siblings (and parents) should not be hastily discarded as things tend to recycle. To wit – I never though I would see pipes in R but I’m glad it happened. And hey ! I didn’t even get into talking about the lattice package but I have a friend who uses nothing else and is amazingly productive with it).

Happy Holidays.

I’m frequently asked by newcomers to R to provide an easy to follow generic set of instructions on how to download data, transform it, aggregate it, make graphs, and write it all up for publication in a high impact journal – all by the end of the day ! While such a request is somewhat understandable coming from a student it’s somewhat awkward when coming from a seasoned research investigator, (i.e. someone who should know better). Obviously, no such set of instructions exists given the astonishing variety of available data stored in differing formats with varying degrees of adherence to a standard(s). Simply put, “real world” data wrangling and analysis can be a complex process independent of the language you choose to employ. Nonetheless, what I’ll attempt to do here is provide a “tour” involving some data I picked at random. I’ll be sticking to the problems of downloading, filtering, and graphing some of the data. And rather than dictate to you what is the “best way” to do something I’ll simply show you what is on offer and you can make your own decisions. R is amazingly flexible, which means that there are many different ways to do the same thing.

I have chosen information relating to San Francisco restaurant inspections which can be found at along with an impressive variety of data sets involving crime reports, parking availability, bus routes, campaign funding, and many other types of civic info. For reference we can map the zipcodes mentioned in the health reports using methods outlined in my blog post on Geocoding. This is basically the area we are considering.

San Francisco restaurants

Mmm. Let’s eat !

I’m hungry !

Without even knowing how the restaurant information is organized one can easily imagine a number of interesting questions such as, “What is the average health score for all restaurants ? What restaurants have the highest, (or lowest), scores ? What is the distribution of scores across zip codes ? Are there seasonal scoring trends ?”. While you can download the files from the above website I have saved a copy on my web site in case the original location changes at some point. The data is available in a zipped file that contains three CSV files that we can easily unpack into data frames.

url <- ""
system("unzip -l")

  Length     Date   Time    Name
 --------    ----   ----    ----
  4613344  04-23-14 09:51   violations_plus.csv
  1175067  04-23-14 09:51   businesses_plus.csv
  1754811  04-23-14 09:51   inspections_plus.csv
 --------                   -------
  7543222                   3 files

There are three files each of which has some key information relating to the restaurant inspection process. I should point out that a description file comes with the dataset, which you can read to get a better idea about the column names and formats. But we can also use R to help us understand what is going on. Let’s read these files into data frames:

businesses  <- read.csv(unz("","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("","inspections_plus.csv"),header=T,stringsAsFactors=F)

'data.frame':	55800 obs. of  5 variables:
 $ business_id    : int  10 10 10 10 10 10 17 17 17 17 ...
 $ date           : int  20140114 20140114 20140114 20120403 20110428 20121114 20130605 20130605 20130605 20140312 ...
 $ ViolationTypeID: int  103119 103145 103154 103154 103119 103154 103154 103144 103142 103142 ...
 $ risk_category  : Factor w/ 4 levels "High Risk","Low Risk",..: 3 2 2 2 3 2 2 2 2 2 ...
 $ description    : Factor w/ 70 levels "","Consumer advisory not provided for raw or undercooked foods",..: 22 17 62 62 22 62 62 58 61 61 ...


'data.frame':	40935 obs. of  4 variables:
 $ business_id: int  10 10 10 10 10 10 10 10 17 17 ...
 $ Score      : int  NA 92 98 NA 98 100 NA 96 94 NA ...
 $ date       : int  20140124 20140114 20121114 20120920 20120403 20110928 20110601 20110428 20140312 20130711 ...
 $ type       : Factor w/ 14 levels "Administrative or Document Review",..: 11 13 13 11 13 13 11 13 13 11 ...


'data.frame':	6109 obs. of  17 variables:
 $ business_id         : int  10 17 19 24 29 31 37 45 48 50 ...
 $ name                : Factor w/ 5562 levels " HOL N JAM LEMONADE STAND  #2",..: 5110 1901 3525 3562 1067 3509 793 1027 265 4697 ...
 $ address             : Factor w/ 5444 levels "","   ","  VARIOUS LOCATIONS  ",..: 119 1868 548 3892 713 2478 3506 2870 4814 1446 ...
 $ city                : Factor w/ 1 level "San Francisco": 1 1 1 1 1 1 1 1 1 1 ...
 $ state               : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
 $ postal_code         : Factor w/ 47 levels "","00000","04102",..: 12 29 16 12 10 35 24 17 27 10 ...
 $ latitude            : num  37.8 37.7 37.8 37.8 37.8 ...
 $ longitude           : num  -122 -122 -122 -122 -122 ...
 $ phone_number        : num  NA 1.42e+10 NA NA 1.42e+10 ...
 $ TaxCode             : Factor w/ 32 levels "AA","H07","H08",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ business_certificate: int  779059 78443 NA 352312 939325 346882 315736 340024 318022 NA ...
 $ application_date    : Factor w/ 2327 levels "","01/01/2005",..: 1 620 1 1 2137 1 1473 1836 1 1 ...
 $ owner_name          : Factor w/ 5294 levels "","\tCARLIN'S CORNER COFFEESHOP LLC",..: 4746 2704 61 3387 186 3335 4297 1809 5198 333 ...
 $ owner_address       : Factor w/ 5366 levels "","\t1 INDEPENDENCE POINTE #305",..: 2898 4397 484 3785 1479 2492 3413 414 1103 2206 ...
 $ owner_city          : Factor w/ 335 levels "","\tBRISBANE",..: 239 88 239 239 88 239 241 226 241 280 ...
 $ owner_state         : Factor w/ 51 levels "","ARIZONA","AZ",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ owner_zip           : Factor w/ 376 levels "","\t28273","\t29615",..: 165 139 170 165 139 199 183 178 188 177 ...

It looks like that all data frames have a column in common called “business_id”, which we can use as a “key” to link/merge these data frames though we can already answer some basic questions such as there were 40,935 inspections, which led to 55,800 violations over a time range of 04/25/2011 to 04/22/2014. I use the strptime function here to turn the character string dates into POSIX dates.

inspections$date <- strptime(inspections$date,format="%Y%m%d")
violations$date  <- strptime(violations$date,format="%Y%m%d")
[1] "2011-04-25 EDT" "2014-04-22 EDT"

Okay let’s extract information just for the year 2013 and use that as our base data set. Note that I could merge the data frames into a single data frame and we will eventually look at how to do that but for now let’s see what we can get out of each data frame individually.

start_time <- strptime("20130101","%Y%m%d")
stop_time  <- strptime("20131231","%Y%m%d")
inspections <- subset(inspections, date > start_time & date < stop_time)
violations <- subset(violations, date > start_time & date < stop_time)

[1] "2013-01-02 EST" "2013-12-30 EST"



Relative to the restaurant inspections process the way I understand it is that an inspector will visit a restaurant at some frequency or in response to a diner’s complaint. The inspector will evaluate the establishment, document any code violations at a severity level of low,medium,high, and ultimately assign a numeric score from 0 to 100 with the following possible meanings: 0-70: Poor, 71-85: Needs Improvement, 86-90: Adequate, 90-100: Good. If violations were noted then a followup inspection(s) will usually occur to insure that the violations are adequately addressed. Such visits usually do not involve a “re-scoring” just a verification that the establishment dealt with the previously observed violations. Evidently this could take multiple visits.

We might then expect that in the inspections file there are multiple rows for a business starting with the original visit and any associated followups. In the case of the followups it looks like the Score is set to NA. Let’s see if that is true. According to the inspections file a “Routine-Unscheduled” inspection was performed on 06/05/2013 for business_id 17 that resulted in score of 94 ,which is good, although in consulting the violations file it seems that there were three “Low Risk” violations noted. A re inspection happened on 07/11/2013.


   business_id Score       date                  type
10          17    NA 2013-07-11 Reinspection/Followup
11          17    94 2013-06-05 Routine - Unscheduled
18          19    96 2013-09-04 Routine - Unscheduled
23          24   100 2013-11-18 Routine - Unscheduled
27          29    87 2013-10-01 Routine - Unscheduled
34          37    96 2013-08-08 Routine - Unscheduled


   business_id       date ViolationTypeID risk_category                                      description
7           17 2013-06-05          103154      Low Risk     Unclean or degraded floors walls or ceilings
8           17 2013-06-05          103144      Low Risk Unapproved or unmaintained equipment or utensils
9           17 2013-06-05          103142      Low Risk                 Unclean nonfood contact surfaces
18          19 2013-09-04          103133 Moderate Risk           Foods not protected from contamination
23          29 2013-10-01          103142      Low Risk                 Unclean nonfood contact surfaces
24          29 2013-10-01          103120 Moderate Risk          Moderate risk food holding temperature 


By referring to the inspections file we can answer one of our questions easily. What is the average health score across all restaurants ?

[1] 91.95652 

But let’s dig a little deeper. I’ll create a factor called rating that implements the scoring interpretation given above. This is easy to do using the cut command. We’ll then summarize the mean score for each category and follow that up with a boxplot of the scores per category. This should give us some initial ideas about the data.

inspections$rating <- cut(inspections$Score,breaks=c(0,70,85,89,100),right=T,

# P = Poor, NI = Needs Improvement, A = Adequate, G = Good

       P       NI        A        G 
64.34337 79.97907 87.55324 95.94264 

ylim <- c(min(inspections$Score,na.rm=T)-5,max(inspections$Score,na.rm=T)+5)

# Capture the boxplot output to get the number of observations per category

myb <- boxplot(Score~rating,data=inspections,main="Scores per Rating Category",

leg.txt <- paste(levels(inspections$rating),myb$n,sep=" : ")

legend(3,70,leg.txt,title="Obs. per Category",cex=0.8,pch=19,col=rainbow(4))
Boxplots of score per category

Boxplots of score per category

Hmm, this information is a little shocking since no one wants to think that there are restaurants anywhere with a health score in the “Poor” category let alone 166 of them as we see here. Who does the lowest score of 2013 belong to ? Business id 74522. Note that it would be really easy to consult the “businesses” data frame at this point to determine the full name associated with this id though since this is a blog on R and not on restaurant reviews I’ll leave it to you to determine what business name that is so you can make your dining plans accordingly.

# What business id had the lowest score in 2013 ?


      business_id Score       date                  type rating
39498       74522    42 2013-10-03 Routine - Unscheduled      P

# Let's see what the violations were


 business_id       date  risk_category                                        description
       74522 2013-10-03                Low Risk                   Unclean nonfood contact surfaces
       74522 2013-10-03               High Risk            Unclean hands or improper use of gloves
       74522 2013-10-03                Low Risk       Unclean or degraded floors walls or ceilings
       74522 2013-10-03                Low Risk   Improper storage of equipment utensils or linens
       74522 2013-10-03                Low Risk                              Improper food storage
       74522 2013-10-03           Moderate Risk                          Improper thawing methods 
       74522 2013-10-03           Moderate Risk Inadequate and inaccessible handwashing facilities
       74522 2013-10-03               High Risk                       High risk vermin infestation
       74522 2013-10-03               High Risk        Unclean or unsanitary food contact surfaces
       74522 2013-10-03               High Risk                High risk food holding temperature 
       74522 2013-10-03               High Risk                   Contaminated or adulterated food
       74522 2013-10-03               High Risk              Improper cooking time or temperatures

So let’s just go ahead and find the 25 restaurants that have the highest number of “high risk” violations ? How would we find those ?

# Pull out the high risk violations

violations_high <- violations[violations$risk_category == "High Risk",]

# Split the data frame by business_id

my_high_viols <- split(violations_high,violations_high$business_id)

# Count all violations for a given business id and then sort all
# business ids from highest to lowest

violation_high_count <- sort(sapply(my_high_viols,nrow),T)[1:25]

33446  2427   286  3480 70068  6695 16806 67330 70281 73926 74522   853  1000  1343  1358  1458 
   11     9     7     7     7     6     6     6     6     6     6     5     5     5     5     5 
 2189  2302  2401  3217  3948  4590 18800 21524 36201 
    5     5     5     5     5     5     5     5     5 

      business_id Score       date                  type rating 
22145       33446    73 2013-12-11 Routine - Unscheduled     NI      

Notice that while business_id 74522 has the worst overall score, 43, it didn’t actually receive the highest number of “high risk” violations in 2013. That honor goes to business_id 33446 who received 11 violations and a score of 73 on 12/11/2013. But let’s go back to a general view here. I want to slice up the inspections into quarters of the year, (four groups of three months), to see if perhaps scores follow some seasonal pattern at least from a visual point of view. To do this we need to use some R date functions.

inspections$quarters <- quarters(inspections$date)
inspections$quarters <- factor(inspections$quarters,levels=c("Q1","Q2","Q3","Q4"),ordered=TRUE)

# Now let's build the barchart out of a table of quarters vs ratings

         main="Violation Category Counts by Quarter")

Barchart of quarters vs. ratings

Barchart of quarters vs. ratings

Well it doesn’t look like there are drastic differences between the scoring categories on a quarterly basis though there appears to be slightly less scores in the “NI” category in Quarter 1 than all other quarters. We could drop down to a month level of granularity to check things out from month to month. All we would need to that is to use the “months” data function above instead of the “quarters” and then apply level labels accordingly. I’ll leave that to you as an exercise.

Grades by zipcode

Well what’s next ? Let’s take a look at the distribution of scores across zip codes which we can access from the inspections file. We’ll also do some merging of the data frames to make our process a bit easier. First, how many unique zip codes do we have ?

 [1] "94104"     "94124"     "94109"     "94103"     "94133"     "94118"     "94110"     "94122"     "94115"    
[10] "94131"     "94111"     "94117"     "94107"     "94108"     "94102"     "94132"     "94105"     "94134"    
[19] "94116"     "94121"     "94112"     "94127"     "94123"     "94114"     "94513"     "94545"     "94066"    
[28] "941033148" ""          "94158"     "95105"     "94140"     "94013"     "94130"     "CA"        "92672"    
[37] "94120"     "94143"     "94609"     "94101"     "00000"     "CA  94122" "CA  94523" "94188"     "94014"    
[46] "04102"     "94129"

Most of them look legitimate whereas others don’t appear to be legal and are perhaps the result of data entry errors. For example while “CA 94122” does contain a legal zip code the expected format is the 5 numbers. Also, the “00000” might be used to signal a missing value but we don’t know. For simplicity we’ll simply strip out values that are NOT five digits. We also have some oddities like “04102”, (Portland), and “92672”, (San Clemente), although if we look close at the businesses data frame for these records we see street names of, respectively, 366 GOLDEN GATE AVE” and “1530 Market Street” which are San Francisco addresses. Look’s like somebody messed up when entering data. So in a “real” analysis we would need to deal with these issues more definitively. But we’ll pass on these for now and merge the inspections data frame with a new version of the inspections data frame that has the normalized zip codes. Finally, we’ll plot the barchart of the average score for each zip code.

myb <- businesses[nchar(businesses$postal_code)==5 & businesses$postal_code != "00000",]


[1] "94104" "94124" "94109" "94103" "94133" "94118" "94110" "94122" "94115" "94131" "94111" "94117" "94107"
[14] "94108" "94102" "94132" "94105" "94134" "94116" "94121" "94112" "94127" "94123" "94114" "94513" "94545"
[27] "94066" "94158" "95105" "94140" "94013" "94130" "92672" "94120" "94143" "94609" "94101" "94188" "94014"
[40] "04102" "94129"

# Let's merge the two data frames myb and inspections using business_id as a key.

mym = merge(myb[,c(1:2,6)],inspections,by="business_id")


business_id                               name postal_code Score       date
1          17               GEORGE'S COFFEE SHOP       94124    94 2013-06-05
2          17               GEORGE'S COFFEE SHOP       94124    NA 2013-07-11
3          19              NRGIZE LIFESTYLE CAFE       94109    96 2013-09-04
4          24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY       94104   100 2013-11-18
5          29                      CHICO'S PIZZA       94103    87 2013-10-01
6          37                        CAFE BISTRO       94118    NA 2013-01-30
                   type rating quarters
1 Routine - Unscheduled      G       Q2
2 Reinspection/Followup   <NA>       Q3
3 Routine - Unscheduled      G       Q3
4 Routine - Unscheduled      G       Q4
5 Routine - Unscheduled      A       Q4
6 Reinspection/Followup   <NA>       Q1

# Let's create an aggregation

hold <- aggregate(Score~postal_code,data=mym,mean)

hvec <- hold$Score

names(hvec) <- hold$postal_code

myCols <- colorRampPalette(c("blue", "red"))( length(hvec) )
mp <- barplot(rev(sort(hvec)),axisnames=F,col=myCols,main="Average Score per Zipcode")
Avg Score per Zip Code

Avg Score per Zip Code

Let’s see if the lowest scored 25 restaurants are in the same zip codes as the highest scored restaurants. From the following it looks like there is significant overlap so its not as if the there is a big difference between the zipcodes.

hi <- unique(mym[order(mym$Score),][1:100,]$postal_code)

 [1] "94133" "94103" "94102" "94111" "94108" "94116" "94127" "94110" "94122"
[10] "94121" "94114" "94123" "94107" "94132" "94118" "94109"

lo <- unique(mym[order(-mym$Score),][1:100,]$postal_code)
 [1] "94104" "94103" "94111" "94122" "94117" "94108" "94115" "94132" "94105"
[10] "94110" "94123" "94118" "94107" "94112" "94133" "94131" "94121" "94102"
[19] "94109"

sum(lo %in% hi)
[1] 13

What’s for Dessert ?

What next ? This is really just the beginning really. One approach that I didn’t take was to use the very cool sqldf package that would allow us to treat the dataframes as tables withing a relational database. However, this assumes some familiarity with SQL which, at least in my experience, newcomers to R don’t usually possess although it is worth it to learn if you plan on a career in data mining. As a teaser here is how we could have done some things using sqldf. Check this out and see you next time.

businesses  <- read.csv(unz("","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("","inspections_plus.csv"),header=T,stringsAsFactors=F)

# How many restaurants are there ?

sqldf("select count(*) from businesses")
1     6109

# What is the average score for all restaurants ?

sqldf("select avg(Score) from inspections")
1   91.95652

# What are the average scores per zip code ?
# Note that since this is the raw data we have some "bad" zipcodes

sqldf("select avg(Score),businesses.postal_code from inspections join businesses using(business_id) group by postal_code")

   avg(Score) postal_code
1    91.85986            
2    90.00000       00000
3    96.00000       04102
4    96.50000       92672
5    98.33333       94013
6    85.66667       94014
7   100.00000       94066
8    95.75000       94101
9    91.73356       94102
10   91.34171       94103
11   98.00000   941033148
12   95.65000       94104
13   94.11030       94105
14   94.72059       94107
15   90.39931       94108
16   90.48051       94109
17   91.54456       94110

Conditioning and grouping are two important concepts in graphing that allow us to rapidly refine our understanding of data under consideration. Conditioning, in particular, allows us to view relationships across “panels” with common scales. Each panel contains a plot whose data is “conditional” upon records drawn from the category that supports that particular panel (an example will help make this clear). Grouping allows us to view interactions within a given panel by making distinctions using different plot characters, size, or color. These ideas have been discussed in Cleveland’s book “Visualizing Data”. The Lattice, (aka “Trellis”), graphics package attempts to implement the concepts and values indicated in Cleveland’s book. These ideas can help us find interesting patterns in our data.

However, it is important to point out that all graphics packages available within R, (Base, Lattice, ggplot2, Grid), have their respective strengths and good work can be accomplished using any of them although for the newcomer Base graphics is a reasonable starting point. There is an abundance of information on Base graphics both in printed literature and via Google results so ample support is available. As a newcomer to Base graphics it doesn’t take long before you are creating graphs like that in Figure 1, which is an approximation of a conditioned plot. Strictly speaking it does not adhere to the principles of conditioned plots but its useful to see how close we can get. We have a three “panel” plot of MPG vs Weight from the mtcars data frame which is built-in to R. Each panel contains a scatter plot between MPG vs wt for each cylinder group, (4, 6, or 8). The paneling allows us to conveniently compare the MPG vs Wt relationship across groups.


cyls <- split(mtcars, mtcars$cyl)
for (ii in 1:length(cyls)) {
     tmpdf <- cyls[[ii]]
     sname <- names(cyls)[ii]
     plot(tmpdf$wt, tmpdf$mpg, 
          main = paste("MPG vs Wt",sname,"Cyl"),
          ylim = c(0,40), xlab = "Wt / 1,000",
          ylab = "MPG", pch=19, col="blue")
Figure 1

Figure 1

We see that the higher weights correspond to records with lower a MPG figure. This idea is confirmed by the separate panel plots wherein we see the lower MPG for increasing cylinder count, which is is quite likely correlated with the weight of the automobile. (It is in fact with a correlation coefficient of 0.78). In any case the data in mtcars isn’t terribly difficult to understand, (though that’s not really the point here – we are simply trying to come to terms with how to plot data). We split up the data frame and looped through the resulting list and plotted MPG vs Wt for each subgroup. It is not a bad approach and is one that lends itself well to programming. On the other hand let’s check out how this kind of thing might be done using lattice graphics, which was designed in large part to make conditioned plots really easy. Let’s put it to the test.

xyplot(mpg~wt | factor(cyl), data=mtcars, pch=19,
                main="MPG vs Wt", xlab="Wt/1,000",  ylab="MPG",layout=c(3,1),type=c("p","g"))
Figure 2

Figure 2

What’s Your Condition ?

This isn’t bad for one line (two if you include the library) statement. Let’s break down what is happening in Figure 2. The “mpg~wt” part of the call to xyplot should be familiar as a formula where mpg represents the y-axis and wt represents the x-axis. The formula interface is commonly used in other R procedures such as lm as well as aggregation functions so if this is new to you then it is important to learn about it.

The part of the call to xyplot that newcomers might not understand is the vertical bar character. The presence of this character indicates that what follows is a “conditioning” variable which is typically a category/factor. It also let’s us know that the resulting plot will contain a number of panels the number of which will correspond to the number of unique values assumed by the conditioning variable(s). In the case of mtcars$cyl this will be three since the unique values assumed by cylinder are 4,6,8. However, we really don’t have to know the number of values in advance. We can just plot it and view the defaults after which we can go in and add the “layout” argument to refine the visual presentation.

[1] 6 4 8

But it gets better in that we can use more than one conditioning character at a time. We can look at a combination of factors in effect looking at the MPG vs Wt relationship as “cells” of a table. Sound complicated ? Not really. Check this out. We are asking xyplot to present MPG vs wt for each combination of Transmission and Cylinder type. This results in a 2 x 3 grid structure wherein the panels contain a scatterplot.

xyplot(mpg~wt | factor(cyl) + factor(am,labels=c("A","M")),
                data=mtcars, main="MPG vs Wt", 
                xlab="Wt/1,000", ylab="MPG",pch=19,type=c("p","g"))
Figure 3

Figure 3

Notice how lattice takes care of the “paneling” for us and supplies labels based on the given factor combinations. So we have a row for each Transmission type, (automatic or manual) each of which contains three columns/panels corresponding to a cylinder group of 4,6, or 8. This plot makes it trivial to determine what combination gets the best gas mileage assuming that is what we are looking for. The 4 Cylinder cars with Manual transmissions get the best gas mileage. Note that we could also further influence the layout and look of the graph and make adjustments to the “strip” labels in each panel but since this is an introductory article we’ll focus on general capabilities for now.

There is Safety in Groups !

Are there other ways that we enhance the graph to reveal more meaning ? In fact there is. In particular we could employ the idea of “grouping”. However, before apply grouping to the above example, (Figure 3), let’s look at a stand-a-lone case to make sure you understand the concept. Suppose we want to look at the MPG vs wt graph within a single panel though we wish to see points colored according to the Transmission group they occupy. This is easy to accomplish with lattice by using the “groups” argument.

Figure 4

Figure 4

So that was pretty easy wasn’t it ? Instead of viewing the interaction across panels we can see it within a single panel by using the different colors to partition the data into groups. Now let’s say that we wanted the points to reflect the cylinder group they occupy except instead of using color we’ll use point size to indicate the relationship. Here we use a vector from 3 down to 1 in conjunction with the cex argument to reflect the size of the points coming from, respectively, cylinder groups 4, 6, and 8. Points coming from cylinder group 4 are going to be larger than points coming from groups 6 and 8. The MPG is better with 4 cylinder autos and we want that reflected in terms of larger size.

Figure 5

Figure 5

Now if we turn our attention back to the situation wherein we used two factors to condition the MPG vs Wt plots we should be able to combine grouping with conditioning to see if there is anything lurking in the data that we might have missed previously. So what I’ll do here is plot MPG vs Wt as conditioned by Cylinders while grouping by Transmission.

xyplot(mpg ~ wt | factor(cyl), data=mtcars, pch=16,layout=c(3,1),
                  groups = factor(am,labels=c("Auto","Manual")),
                  type=c("p","g"),auto.key = TRUE)
Figure 6

Figure 6

Do we see anything interesting here ? Perhaps. If you have a 4 cylinder car and a manual transmission then you get the best gas mileage. This isn’t anything new. However, if we look at the cars in the 8 cylinder panel we see that cars with a manual transmission, (there are only two of them), don’t seem to perform as well relative to MPG as they did in the 4 cylinder group. In the 6 cylinder panel we see that auto and manual transmissions are closer in MPG than in the 4 cylinder group. So we might form the hypothesis that having a manual transmission is related to significantly better gas mileage only if you have a 4 cylinder car. Or maybe we just need to find more 8 and 6 cylinder cars with manual transmissions to see if the pattern holds. It is important to note that these cars were surveyed in 1974 so our findings would probably change were we to use updated information. Well this is all cool stuff but for now we’ll stick with the graphics details instead of further analysis.

Out of curiosity what would this example look like in Base graphics ? Here is one possibility. I’ll let you make the decision if this is easier than using lattice and if the result is as aesthetically pleasing. Base graphics are amazingly flexible but require more of a programming approach in this case than does lattice although lattice panel functions, (something I will explore in another Blog post), can also require a “programmer’s mentality”.

mysplits = split(mtcars,mtcars$cyl)
maxmpg = max(mtcars$mpg)
for (ii in 1:length(mysplits)) {
      tmpdf  <- mysplits[[ii]]
      auto <- tmpdf[tmpdf$am == 0,]
      man <- tmpdf[tmpdf$am == 1,]
      plot(tmpdf$wt, tmpdf$mpg,type="n",
           main=paste(names(mysplits[ii])," Cylinders"),
           ylim=c(0,maxmpg), xlab="wt",ylab="MPG")
      legend("topright", inset=0.05, c("manual","auto"),
             pch = 19, col=c("pink","blue"))
Figure 7

Figure 7

Well I haven’t been completely forthright here. Base graphics does in fact have a function that is dedicated to conditioned plotting although it doesn’t generalize to other plot types. It does, however, handle X/Y plots quite well. I include it here because it does conform to the principles of conditioned plotting but if you are going to embrace the formal approach for conditioning then you might as well use lattice since it does generalize across various plot types.

Figure 8: coplot

Figure 8: coplot

Other Lattice Functions

Okay, you might be wondering if the arguments to xyplot will “carry over” to other lattice plotting functions. The answer is “yes” although they must make sense for the given chart type. For example with histograms or boxplots we are looking at the plot of a single continuous quantity but we can still “condition” that continuous quantity using factors from the dataset. In this next example we look at another internal dataset to R called “ChickWeight”. We’ll present a histogram of the weights of 50 chickens, measured over time, as conditioned by the type of diet they were provided. There are 4 diet types hence we will see 4 panels. All of the considerations important to lattice graphics are intact – common axes, optimal use of space, shared margins, and panels to indicate conditions.

           main="Weight by Diet Type",xlab="Weight in grams")
Figure 9

Figure 9: lattice histogram

So checkout this example. We create a boxplot of chicken weights conditioned on the passage of time since birth which is measured in days (0,2,…,21).

        main="Weight by Days Since Birth",xlab="Weight in grams")
lattice boxplot

Figure 10: lattice boxplot


So we’ve really only scratched the surface here of what can be done with conditioning and grouping although I feel that this is a reasonable introduction. As I’ve mentioned previously, Base graphics has lots of flexibility and if you become fluent with the high and low level functions offered as part of that environment you can accomplish a lot. Admittedly, if you do not have a programming background then writing looping structures to process data can be a bit intimidating but its worth it to learn. The examples I presented with lattice are to show how convenient it is to do conditioning and grouping but then again if you want to do more customization then that will involve learning a bit more about lattice. So at some point you find yourself in the same position that you would be with Base graphics. It all depends on how detailed and custom you wish your plots to be. The ggplot2 package, which I will discuss in another Blog post, can also handle these activities quite nicely though the jargon is a bit different.

Independently of what graphics package you choose to employ, it is important to realize that flexibility does not always imply ease of use. Like any software package there is absolutely no substitute for experience and getting your hands dirty. R is no exception.