
I'm frequently asked by newcomers to R to provide an easy-to-follow, generic set of instructions on how to download data, transform it, aggregate it, make graphs, and write it all up for publication in a high impact journal – all by the end of the day! While such a request is understandable coming from a student, it's a bit awkward coming from a seasoned research investigator (i.e. someone who should know better). Obviously, no such set of instructions exists given the astonishing variety of available data stored in differing formats with varying degrees of adherence to a standard. Simply put, "real world" data wrangling and analysis can be a complex process independent of the language you choose to employ. Nonetheless, what I'll attempt to do here is provide a "tour" involving some data I picked at random. I'll stick to the problems of downloading, filtering, and graphing some of the data. And rather than dictate the "best way" to do something, I'll simply show you what is on offer and let you make your own decisions. R is amazingly flexible, which means that there are many different ways to do the same thing.

I have chosen information relating to San Francisco restaurant inspections, which can be found at https://data.sfgov.org/ along with an impressive variety of data sets involving crime reports, parking availability, bus routes, campaign funding, and many other types of civic info. For reference, we can map the zip codes mentioned in the health reports using the methods outlined in my blog post on GeoCoding. This is basically the area we are considering.

San Francisco restaurants. Mmm, let's eat!

Without even knowing how the restaurant information is organized, one can easily imagine a number of interesting questions: What is the average health score for all restaurants? Which restaurants have the highest (or lowest) scores? What is the distribution of scores across zip codes? Are there seasonal scoring trends? While you can download the files from the website above, I have saved a copy on my web site in case the original location changes at some point. The data is available in a zipped file that contains three CSV files, which we can easily unpack into data frames.

url <- "http://steviep42.bitbucket.org/data/SFFoodProgram_Complete_Data.zip"
download.file(url,"SFFood.zip")
system("unzip -l SFFood.zip")

Archive:  SFFood.zip
  Length     Date   Time    Name
 --------    ----   ----    ----
  4613344  04-23-14 09:51   violations_plus.csv
  1175067  04-23-14 09:51   businesses_plus.csv
  1754811  04-23-14 09:51   inspections_plus.csv
 --------                   -------
  7543222                   3 files

There are three files each of which has some key information relating to the restaurant inspection process. I should point out that a description file comes with the dataset, which you can read to get a better idea about the column names and formats. But we can also use R to help us understand what is going on. Let’s read these files into data frames:

businesses  <- read.csv(unz("SFFood.zip","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("SFFood.zip","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("SFFood.zip","inspections_plus.csv"),header=T,stringsAsFactors=F)
str(violations)

'data.frame':	55800 obs. of  5 variables:
 $ business_id    : int  10 10 10 10 10 10 17 17 17 17 ...
 $ date           : int  20140114 20140114 20140114 20120403 20110428 20121114 20130605 20130605 20130605 20140312 ...
 $ ViolationTypeID: int  103119 103145 103154 103154 103119 103154 103154 103144 103142 103142 ...
 $ risk_category  : Factor w/ 4 levels "High Risk","Low Risk",..: 3 2 2 2 3 2 2 2 2 2 ...
 $ description    : Factor w/ 70 levels "","Consumer advisory not provided for raw or undercooked foods",..: 22 17 62 62 22 62 62 58 61 61 ...

str(inspections)

'data.frame':	40935 obs. of  4 variables:
 $ business_id: int  10 10 10 10 10 10 10 10 17 17 ...
 $ Score      : int  NA 92 98 NA 98 100 NA 96 94 NA ...
 $ date       : int  20140124 20140114 20121114 20120920 20120403 20110928 20110601 20110428 20140312 20130711 ...
 $ type       : Factor w/ 14 levels "Administrative or Document Review",..: 11 13 13 11 13 13 11 13 13 11 ...

str(businesses)

'data.frame':	6109 obs. of  17 variables:
 $ business_id         : int  10 17 19 24 29 31 37 45 48 50 ...
 $ name                : Factor w/ 5562 levels " HOL N JAM LEMONADE STAND  #2",..: 5110 1901 3525 3562 1067 3509 793 1027 265 4697 ...
 $ address             : Factor w/ 5444 levels "","   ","  VARIOUS LOCATIONS  ",..: 119 1868 548 3892 713 2478 3506 2870 4814 1446 ...
 $ city                : Factor w/ 1 level "San Francisco": 1 1 1 1 1 1 1 1 1 1 ...
 $ state               : Factor w/ 1 level "CA": 1 1 1 1 1 1 1 1 1 1 ...
 $ postal_code         : Factor w/ 47 levels "","00000","04102",..: 12 29 16 12 10 35 24 17 27 10 ...
 $ latitude            : num  37.8 37.7 37.8 37.8 37.8 ...
 $ longitude           : num  -122 -122 -122 -122 -122 ...
 $ phone_number        : num  NA 1.42e+10 NA NA 1.42e+10 ...
 $ TaxCode             : Factor w/ 32 levels "AA","H07","H08",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ business_certificate: int  779059 78443 NA 352312 939325 346882 315736 340024 318022 NA ...
 $ application_date    : Factor w/ 2327 levels "","01/01/2005",..: 1 620 1 1 2137 1 1473 1836 1 1 ...
 $ owner_name          : Factor w/ 5294 levels "","\tCARLIN'S CORNER COFFEESHOP LLC",..: 4746 2704 61 3387 186 3335 4297 1809 5198 333 ...
 $ owner_address       : Factor w/ 5366 levels "","\t1 INDEPENDENCE POINTE #305",..: 2898 4397 484 3785 1479 2492 3413 414 1103 2206 ...
 $ owner_city          : Factor w/ 335 levels "","\tBRISBANE",..: 239 88 239 239 88 239 241 226 241 280 ...
 $ owner_state         : Factor w/ 51 levels "","ARIZONA","AZ",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ owner_zip           : Factor w/ 376 levels "","\t28273","\t29615",..: 165 139 170 165 139 199 183 178 188 177 ...

It looks like all three data frames have a column in common called "business_id", which we can use as a "key" to link/merge them. Even before merging, we can answer some basic questions: there were 40,935 inspections, which led to 55,800 violations, over a time range of 04/25/2011 to 04/22/2014. I use the strptime function here to turn the dates, stored as numbers like 20140114, into POSIX dates.

inspections$date <- strptime(inspections$date,format="%Y%m%d")
violations$date  <- strptime(violations$date,format="%Y%m%d")
range(inspections$date)
[1] "2011-04-25 EDT" "2014-04-22 EDT"

Okay, let's extract information just for the year 2013 and use that as our base data set. Note that I could merge the data frames into a single data frame, and we will eventually look at how to do that, but for now let's see what we can get out of each data frame individually.

start_time <- strptime("20130101","%Y%m%d")
stop_time  <- strptime("20131231","%Y%m%d")
inspections <- subset(inspections, date > start_time & date < stop_time)
violations <- subset(violations, date > start_time & date < stop_time)

range(inspections$date)
[1] "2013-01-02 EST" "2013-12-30 EST"

nrow(inspections)
[1] 13110

nrow(violations)
[1] 19091

The restaurant inspection process, as I understand it, works like this: an inspector visits a restaurant at some frequency or in response to a diner's complaint. The inspector evaluates the establishment, documents any code violations at a severity level of low, medium, or high, and ultimately assigns a numeric score from 0 to 100 with the following interpretation: 0-70: Poor, 71-85: Needs Improvement, 86-89: Adequate, 90-100: Good. If violations were noted then one or more followup inspections will usually occur to ensure that the violations are adequately addressed. Such visits usually do not involve a "re-scoring", just a verification that the establishment dealt with the previously observed violations. Evidently this can take multiple visits.

We might then expect that the inspections file contains multiple rows for a business, starting with the original visit and any associated followups. In the case of the followups it looks like the Score is set to NA. Let's see if that is true. According to the inspections file a "Routine - Unscheduled" inspection was performed on 06/05/2013 for business_id 17 that resulted in a score of 94, which is good, although in consulting the violations file it seems that there were three "Low Risk" violations noted. A reinspection happened on 07/11/2013.
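Before looking at a specific business, here is a quick sanity check. This is just a sketch using the column names shown in the str() output above; it tabulates how often Score is missing for each inspection type, and we would expect the followup style inspections to account for most of the NA values.

# Tabulate missing scores by inspection type

with(inspections, table(type, missing_score = is.na(Score)))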


head(inspections[order(inspections$business_id),])

   business_id Score       date                  type
10          17    NA 2013-07-11 Reinspection/Followup
11          17    94 2013-06-05 Routine - Unscheduled
18          19    96 2013-09-04 Routine - Unscheduled
23          24   100 2013-11-18 Routine - Unscheduled
27          29    87 2013-10-01 Routine - Unscheduled
34          37    96 2013-08-08 Routine - Unscheduled


head(violations[order(violations$business_id),])

   business_id       date ViolationTypeID risk_category                                      description
7           17 2013-06-05          103154      Low Risk     Unclean or degraded floors walls or ceilings
8           17 2013-06-05          103144      Low Risk Unapproved or unmaintained equipment or utensils
9           17 2013-06-05          103142      Low Risk                 Unclean nonfood contact surfaces
18          19 2013-09-04          103133 Moderate Risk           Foods not protected from contamination
23          29 2013-10-01          103142      Low Risk                 Unclean nonfood contact surfaces
24          29 2013-10-01          103120 Moderate Risk          Moderate risk food holding temperature 

Grading

By referring to the inspections file we can answer one of our questions easily. What is the average health score across all restaurants ?

mean(inspections$Score,na.rm=T)
[1] 91.95652 

But let’s dig a little deeper. I’ll create a factor called rating that implements the scoring interpretation given above. This is easy to do using the cut command. We’ll then summarize the mean score for each category and follow that up with a boxplot of the scores per category. This should give us some initial ideas about the data.


inspections$rating <- cut(inspections$Score,breaks=c(0,70,85,89,100),right=T,
    labels=c("P","NI","A","G")) 

# P = Poor, NI = Needs Improvement, A = Adequate, G = Good

tapply(inspections$Score,inspections$rating,mean)
       P       NI        A        G 
64.34337 79.97907 87.55324 95.94264 

ylim <- c(min(inspections$Score,na.rm=T)-5,max(inspections$Score,na.rm=T)+5)

# Capture the boxplot output to get the number of observations per category

myb <- boxplot(Score~rating,data=inspections,main="Scores per Rating Category",
               col=rainbow(4),ylim=ylim)

leg.txt <- paste(levels(inspections$rating),myb$n,sep=" : ")

legend(3,70,leg.txt,title="Obs. per Category",cex=0.8,pch=19,col=rainbow(4))
Boxplots of score per category

Hmm, this information is a little shocking since no one wants to think that there are restaurants anywhere with a health score in the "Poor" category, let alone 166 of them as we see here. Who does the lowest score of 2013 belong to? Business id 74522. It would be easy to consult the "businesses" data frame at this point to determine the full name associated with this id, though since this is a blog on R and not on restaurant reviews I'll leave it to you to determine what business name that is so you can make your dining plans accordingly.

# What business id had the lowest score in 2013 ?

inspections[which(inspections$Score==min(inspections$Score,na.rm=T)),]

      business_id Score       date                  type rating
39498       74522    42 2013-10-03 Routine - Unscheduled      P

# Let's see what the violations were

violations[violations$business_id==74522,][,-3]

 business_id       date  risk_category                                        description
       74522 2013-10-03                Low Risk                   Unclean nonfood contact surfaces
       74522 2013-10-03               High Risk            Unclean hands or improper use of gloves
       74522 2013-10-03                Low Risk       Unclean or degraded floors walls or ceilings
       74522 2013-10-03                Low Risk   Improper storage of equipment utensils or linens
       74522 2013-10-03                Low Risk                              Improper food storage
       74522 2013-10-03           Moderate Risk                          Improper thawing methods 
       74522 2013-10-03           Moderate Risk Inadequate and inaccessible handwashing facilities
       74522 2013-10-03               High Risk                       High risk vermin infestation
       74522 2013-10-03               High Risk        Unclean or unsanitary food contact surfaces
       74522 2013-10-03               High Risk                High risk food holding temperature 
       74522 2013-10-03               High Risk                   Contaminated or adulterated food
       74522 2013-10-03               High Risk              Improper cooking time or temperatures
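If you do want to peek at who this is, the lookup is a one line subset of the businesses data frame using the column names we saw in str(businesses); I won't print the result here.

# Look up the name and address for the lowest scoring business id

businesses[businesses$business_id == 74522, c("name","address","postal_code")]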

So let's go ahead and find the 25 restaurants that have the highest number of "high risk" violations. How would we find those?

# Pull out the high risk violations

violations_high <- violations[violations$risk_category == "High Risk",]

# Split the data frame by business_id

my_high_viols <- split(violations_high,violations_high$business_id)

# Count all violations for a given business id and then sort all
# business ids from highest to lowest

violation_high_count <- sort(sapply(my_high_viols,nrow),decreasing=TRUE)[1:25]
violation_high_count

33446  2427   286  3480 70068  6695 16806 67330 70281 73926 74522   853  1000  1343  1358  1458 
   11     9     7     7     7     6     6     6     6     6     6     5     5     5     5     5 
 2189  2302  2401  3217  3948  4590 18800 21524 36201 
    5     5     5     5     5     5     5     5     5 
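As an aside, and in keeping with the idea that R offers many ways to do the same thing, the same counts can be produced more compactly with table(). This sketch should match the split()/sapply() result above.

# Alternative: count high risk violations per business id with table()

sort(table(violations_high$business_id), decreasing=TRUE)[1:25]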

inspections[inspections$business_id==33446,]
      business_id Score       date                  type rating 
22145       33446    73 2013-12-11 Routine - Unscheduled     NI      

Notice that while business_id 74522 has the worst overall score, 42, it didn't actually receive the highest number of "high risk" violations in 2013. That honor goes to business_id 33446, which received 11 such violations and a score of 73 on 12/11/2013. But let's go back to a general view here. I want to slice up the inspections into quarters of the year (four groups of three months) to see if perhaps scores follow some seasonal pattern, at least from a visual point of view. To do this we need to use some R date functions.

inspections$quarters <- quarters(inspections$date)
inspections$quarters <- factor(inspections$quarters,levels=c("Q1","Q2","Q3","Q4"),ordered=TRUE)

# Now let's build the barchart out of a table of quarters vs ratings

library(lattice)
barchart(table(inspections$quarters,inspections$rating),auto.key=list(columns=4),
         main="Violation Category Counts by Quarter")

Barchart of quarters vs. ratings

Well, it doesn't look like there are drastic differences between the scoring categories on a quarterly basis, though there appear to be slightly fewer scores in the "NI" category in Quarter 1 than in the other quarters. We could drop down to a month level of granularity to check things out from month to month. All we would need to do is use the months() function instead of quarters() and then apply level labels accordingly. I'll leave that to you as an exercise, though a quick sketch follows if you want a head start.
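Here is a minimal sketch of the month level version. Note that months() returns full month names in your current locale, so the levels argument may need adjusting if you are not working in English.

# Month level version of the barchart (sketch)

inspections$month <- factor(months(inspections$date), levels=month.name)
barchart(table(inspections$month,inspections$rating),auto.key=list(columns=4),
         main="Rating Category Counts by Month")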

Grades by zipcode

Well, what's next? Let's take a look at the distribution of scores across zip codes. The zip codes live in the businesses file, so we'll do some merging of the data frames to make the process a bit easier. First, how many unique zip codes do we have?

unique(businesses$postal_code)
 [1] "94104"     "94124"     "94109"     "94103"     "94133"     "94118"     "94110"     "94122"     "94115"    
[10] "94131"     "94111"     "94117"     "94107"     "94108"     "94102"     "94132"     "94105"     "94134"    
[19] "94116"     "94121"     "94112"     "94127"     "94123"     "94114"     "94513"     "94545"     "94066"    
[28] "941033148" ""          "94158"     "95105"     "94140"     "94013"     "94130"     "CA"        "92672"    
[37] "94120"     "94143"     "94609"     "94101"     "00000"     "CA  94122" "CA  94523" "94188"     "94014"    
[46] "04102"     "94129"

Most of them look legitimate whereas others don't appear to be legal and are perhaps the result of data entry errors. For example, while "CA 94122" does contain a legal zip code, the expected format is five digits. Also, the "00000" might be used to signal a missing value, but we don't know. For simplicity we'll simply strip out values that are NOT five digits. We also have some oddities like "04102" (Portland) and "92672" (San Clemente), although if we look closely at the businesses data frame for these records we see street addresses of, respectively, "366 GOLDEN GATE AVE" and "1530 Market Street", which are San Francisco addresses. Looks like somebody messed up when entering data. So in a "real" analysis we would need to deal with these issues more definitively. But we'll pass on them for now, merge the inspections data frame with a new version of the businesses data frame that has the cleaned-up zip codes, and then plot a barchart of the average score for each zip code.


myb <- businesses[nchar(businesses$postal_code)==5 & businesses$postal_code != "00000",]

unique(myb$postal_code)

[1] "94104" "94124" "94109" "94103" "94133" "94118" "94110" "94122" "94115" "94131" "94111" "94117" "94107"
[14] "94108" "94102" "94132" "94105" "94134" "94116" "94121" "94112" "94127" "94123" "94114" "94513" "94545"
[27] "94066" "94158" "95105" "94140" "94013" "94130" "92672" "94120" "94143" "94609" "94101" "94188" "94014"
[40] "04102" "94129"
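If you want the filter to literally mean "five digits" rather than "five characters", a regular expression will do it. For this particular set of postal codes it happens to select the same rows, but it is a more defensive check.

# A stricter filter: keep only postal codes made up of exactly five digits

myb <- businesses[grepl("^[0-9]{5}$",businesses$postal_code) & businesses$postal_code != "00000",]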

# Let's merge the two data frames myb and inspections using business_id as a key.

mym = merge(myb[,c(1:2,6)],inspections,by="business_id")

head(mym)

business_id                               name postal_code Score       date
1          17               GEORGE'S COFFEE SHOP       94124    94 2013-06-05
2          17               GEORGE'S COFFEE SHOP       94124    NA 2013-07-11
3          19              NRGIZE LIFESTYLE CAFE       94109    96 2013-09-04
4          24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY       94104   100 2013-11-18
5          29                      CHICO'S PIZZA       94103    87 2013-10-01
6          37                        CAFE BISTRO       94118    NA 2013-01-30
                   type rating quarters
1 Routine - Unscheduled      G       Q2
2 Reinspection/Followup   <NA>       Q3
3 Routine - Unscheduled      G       Q3
4 Routine - Unscheduled      G       Q4
5 Routine - Unscheduled      A       Q4
6 Reinspection/Followup   <NA>       Q1

# Let's create an aggregation

hold <- aggregate(Score~postal_code,data=mym,mean)

hvec <- hold$Score

names(hvec) <- hold$postal_code

myCols <- colorRampPalette(c("blue", "red"))( length(hvec) )
mp <- barplot(rev(sort(hvec)),axisnames=F,col=myCols,main="Average Score per Zipcode")
axis(1,at=mp,labels=names(hvec),las=2,cex.axis=0.8)
Avg Score per Zip Code

Let's see if the zip codes of the 100 lowest-scored restaurants overlap with the zip codes of the 100 highest-scored restaurants. From the following it looks like there is significant overlap, so it's not as if there is a big difference between the zip codes.

lo <- unique(mym[order(mym$Score),][1:100,]$postal_code)    # zip codes of the 100 lowest scores

lo
 [1] "94133" "94103" "94102" "94111" "94108" "94116" "94127" "94110" "94122"
[10] "94121" "94114" "94123" "94107" "94132" "94118" "94109"

hi <- unique(mym[order(-mym$Score),][1:100,]$postal_code)   # zip codes of the 100 highest scores
hi
 [1] "94104" "94103" "94111" "94122" "94117" "94108" "94115" "94132" "94105"
[10] "94110" "94123" "94118" "94107" "94112" "94133" "94131" "94121" "94102"
[19] "94109"

sum(hi %in% lo)
[1] 13

What’s for Dessert ?

What next? This is really just the beginning. One approach I didn't take was to use the very cool sqldf package, which would allow us to treat the data frames as tables within a relational database. However, this assumes some familiarity with SQL which, at least in my experience, newcomers to R don't usually possess, although it is worth learning if you plan on a career in data mining. As a teaser, here is how we could have done some of the above using sqldf. Check this out and see you next time.

library(sqldf)
businesses  <- read.csv(unz("SFFood.zip","businesses_plus.csv"),header=T,stringsAsFactors=F)
violations  <- read.csv(unz("SFFood.zip","violations_plus.csv"),header=T,stringsAsFactors=F)
inspections <- read.csv(unz("SFFood.zip","inspections_plus.csv"),header=T,stringsAsFactors=F)

# How many restaurants are there ?

sqldf("select count(*) from businesses")
count(*)
1     6109

# What is the average score for all restaurants ?

sqldf("select avg(Score) from inspections")
  avg(Score)
1   91.95652

# What are the average scores per zip code ?
# Note that since this is the raw data we have some "bad" zipcodes

sqldf("select avg(Score),businesses.postal_code from inspections join businesses using(business_id) group by postal_code")

   avg(Score) postal_code
1    91.85986            
2    90.00000       00000
3    96.00000       04102
4    96.50000       92672
5    98.33333       94013
6    85.66667       94014
7   100.00000       94066
8    95.75000       94101
9    91.73356       94102
10   91.34171       94103
11   98.00000   941033148
12   95.65000       94104
13   94.11030       94105
14   94.72059       94107
15   90.39931       94108
16   90.48051       94109
17   91.54456       94110
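As one more teaser, the "top 25 high risk offenders" exercise from earlier collapses into a single SQL query. Keep in mind that these data frames were re-read from the raw files, so the counts cover the full date range rather than just 2013.

# Top 25 businesses by number of high risk violations

sqldf("select business_id, count(*) as n from violations
       where risk_category = 'High Risk'
       group by business_id order by n desc limit 25")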


Welcome to Part 2 of the GeoCoding, R, and the Rolling Stones blog. Let’s apply some of the things we learned in Part 1 to a practical real world example.

Mapping the Stones – A Real Example

The Rolling Stones have toured for many years. You can go to Wikipedia and see information on the various tours. Here we focus only on the dates and concerts of the 1975 "Tour of the Americas". I've scraped the information from the Wikipedia page and put it into a data frame. The idea is that we will GeoCode each city to obtain a latitude and longitude and then use them to create an interactive map of the tour using the Google Charting Tools.

If you want your own copy of this data frame then do the following:

url = "http://steviep42.bitbucket.org/data/stones75.csv"
stones75 = read.csv(url)

Here are the first 10 rows of the data frame. The format is really simple:

head(stones75,10)
           Date        City         State                Venue
1   1 June 1975 Baton Rouge     Louisiana  LSU Assembly Center
2   3 June 1975 San Antonio         Texas    Convention Center
3   4 June 1975 San Antonio         Texas    Convention Center
4   6 June 1975 Kansas City      Missouri    Arrowhead Stadium
5   8 June 1975   Milwaukee     Wisconsin       County Stadium
6   9 June 1975  Saint Paul     Minnesota         Civic Center
7  11 June 1975      Boston Massachusetts        Boston Garden
8  14 June 1975   Cleveland          Ohio    Municipal Stadium
9  15 June 1975     Buffalo      New York  Memorial Auditorium
10 17 June 1975     Toronto       Ontario   Maple Leaf Gardens

Okay, let's process the cities. As in Part 1, we'll use the sapply command together with the myGeo function we wrote there to get back the coordinates, after which we'll use cbind to attach the results to the data frame. We might get some warnings about row names when we do this but don't worry about it. After all, "you can't always get what you want".

hold = data.frame(t(sapply(paste(stones75$City,stones75$State,sep=","),myGeo)))
stones75 = cbind(stones75,hold)
 
head(stones75,10)

           Date        City         State                Venue  lat   lon
1   1 June 1975 Baton Rouge     Louisiana  LSU Assembly Center 30.5 -91.1
2   3 June 1975 San Antonio         Texas    Convention Center 29.4 -98.5
3   4 June 1975 San Antonio         Texas    Convention Center 29.4 -98.5
4   6 June 1975 Kansas City      Missouri    Arrowhead Stadium 39.1 -94.6
5   8 June 1975   Milwaukee     Wisconsin       County Stadium 43.0 -87.9
6   9 June 1975  Saint Paul     Minnesota         Civic Center 45.0 -93.1
7  11 June 1975      Boston Massachusetts        Boston Garden 42.4 -71.1
8  14 June 1975   Cleveland          Ohio    Municipal Stadium 41.5 -81.7
9  15 June 1975     Buffalo      New York  Memorial Auditorium 42.9 -78.9
10 17 June 1975     Toronto       Ontario   Maple Leaf Gardens 43.7 -79.4

Great ! So now we have the lat and lon for each city. As you might notice in the data frame the Stones played several nights in the same city so we should probably keep track of this.


stones75[9:18,]

           Date          City        State                 Venue
9  15 June 1975       Buffalo     New York   Memorial Auditorium
10 17 June 1975       Toronto      Ontario    Maple Leaf Gardens
11 22 June 1975 New York City     New York Madison Square Garden
12 23 June 1975 New York City     New York Madison Square Garden
13 24 June 1975 New York City     New York Madison Square Garden
14 25 June 1975 New York City     New York Madison Square Garden
15 26 June 1975 New York City     New York Madison Square Garden
16 27 June 1975 New York City     New York Madison Square Garden
17 29 June 1975  Philadelphia Pennsylvania          The Spectrum
18  1 July 1975         Largo     Maryland        Capital Center

As you can see above, they did a six-night stand at the famous Madison Square Garden arena in New York City. Ideally, our program should check for duplicate city names before we bug Google for information that we already have. That is left as an assignment for you, though one possible sketch follows.
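Here is a sketch of one way to do it: GeoCode each unique City/State pair once and then map the results back onto the full tour data frame. This would take the place of the sapply() call we used above.

# GeoCode each unique City/State combination only once (sketch)

key = paste(stones75$City,stones75$State,sep=",")
uniq.coord = t(sapply(unique(key),myGeo))   # one Google query per unique location
stones75$lat = uniq.coord[key,"lat"]        # expand back out to one row per show
stones75$lon = uniq.coord[key,"lon"]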

Creating a Map of the Tour Using googleVis

Anyway, let's now build a map of the tour dates. For this example we will use a package called "googleVis". You might not know that Google has a number of mapping services for which R APIs exist. Look at the table at the end of this post, which lists existing packages for interfacing programmatically with the various Google mapping and chart services. You can find these packages on CRAN. In our case we'll need to install googleVis. After that we can create a map.

install.packages("googleVis",dependencies=TRUE)
library(googleVis)

The cool thing about the googleVis package is that we get back a map in a web browser that has scroll bars and zoom tools. Additionally we can use information from the data frame to annotate the chart we plan to create. So, for example, for each tour stop that the band made we can put in meta info like the name of the venue they played as well as the date.

We have to do this in a way that accommodates the requirements of googleVis. This means we have to read through the googleVis manual pages and play around with the examples. However, hopefully I’m presenting a pretty good example here so you don’t have to immerse yourself in the manual (at least not yet).

The first thing we need to do is create a single column for the latitude and longitude because googleVis wants this. This is easy to do. Let's take the existing stones75 data frame and change it:

head(stones75)

         Date        City     State                Venue  lat   lon
1 1 June 1975 Baton Rouge Louisiana  LSU Assembly Center 30.5 -91.1
2 3 June 1975 San Antonio     Texas    Convention Center 29.4 -98.5
3 4 June 1975 San Antonio     Texas    Convention Center 29.4 -98.5
4 6 June 1975 Kansas City  Missouri    Arrowhead Stadium 39.1 -94.6
5 8 June 1975   Milwaukee Wisconsin       County Stadium 43.0 -87.9
6 9 June 1975  Saint Paul Minnesota         Civic Center 45.0 -93.1

stones75$LatLon = paste(round(stones75$lat,1),round(stones75$lon,1),sep=":")
stones75 = stones75[,-5:-6]  # Remove the old lat and lon columns

head(stones75)
         Date        City     State                Venue     LatLon
1 1 June 1975 Baton Rouge Louisiana  LSU Assembly Center 30.5:-91.1
2 3 June 1975 San Antonio     Texas    Convention Center 29.4:-98.5
3 4 June 1975 San Antonio     Texas    Convention Center 29.4:-98.5
4 6 June 1975 Kansas City  Missouri    Arrowhead Stadium 39.1:-94.6
5 8 June 1975   Milwaukee Wisconsin       County Stadium   43:-87.9
6 9 June 1975  Saint Paul Minnesota         Civic Center   45:-93.1

Next up, we can create a column in our data frame that contains all the information we want to use to annotate each concert date. This can include HTML tags to better format the output. As an example, the statement below creates a new column in the data frame called "Tip" that has the following info: the stop number on the tour, the Venue where the show was held, and the Date of the concert. Once we have a map we can click on the "pin" for each location and see the annotation info.

stones75$Tip = paste(rownames(stones75),stones75$Venue,stones75$Date,"<BR>",sep=" ")

# Now we can create a chart !  

# Click on the Atlanta locator and you'll see that it was the 37th stop of the tour. 
# The show took place at The Omni on July 30th, 1975

stones.plot = gvisMap(stones75,"LatLon","Tip")
plot(stones.plot)

Interactive Google map of the tour stops

Refining the Plot Annotations

Pretty cool huh? We can also zoom in on different parts of the map. The gvisMap function has a number of options that would allow us to draw a line between the cities, select a different type of map, and adopt certain zoom levels by default (see the sketch below). So what else could / should we do?
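For example, here is a sketch that draws a line connecting the stops, enables scroll wheel zooming, and starts from a terrain map. I'm quoting these option names from memory, so check ?gvisMap for the authoritative list.

# gvisMap with a few extra options (sketch)

stones.plot2 = gvisMap(stones75,"LatLon","Tip",
                       options=list(showTip=TRUE,showLine=TRUE,enableScrollWheel=TRUE,
                                    mapType="terrain",useMapTypeControl=TRUE))
plot(stones.plot2)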

Well, we have a problem here in that the Stones played more than one show in several cities but we don't take that into account when building the annotation data. What we might want to do is process the data frame and, for those cities that had multiple shows (e.g. New York), capture all the meta data in one go. We saw this before with the New York dates.

stones75[9:18,]

           Date          City        State                 Venue
9  15 June 1975       Buffalo     New York   Memorial Auditorium
10 17 June 1975       Toronto      Ontario    Maple Leaf Gardens
11 22 June 1975 New York City     New York Madison Square Garden
12 23 June 1975 New York City     New York Madison Square Garden
13 24 June 1975 New York City     New York Madison Square Garden
14 25 June 1975 New York City     New York Madison Square Garden
15 26 June 1975 New York City     New York Madison Square Garden
16 27 June 1975 New York City     New York Madison Square Garden
17 29 June 1975  Philadelphia Pennsylvania          The Spectrum
18  1 July 1975         Largo     Maryland        Capital Center

Currently our plot has only the last New York show's information, but we want the info for all of the NYC shows. Here is one way to approach this problem. Note that there are probably more elegant ways to clean up the data but this will do the job for now.

test  = stones75          # Create some temporary work variables
str   = ""
tmpdf = list()
ii    = 1

repeat {                  # Loop through the copy of the stones75 data frame

   # Grab the Tip entries for every show at the current venue (column 4)

   hold = test[test$Venue == test[ii,4],"Tip"]

   # Do we have a multi-night stand at this venue ?

   if (length(hold) > 1) {
        str = paste(hold,collapse="")
        test[ii,6] = str              # Column 6 is the Tip column
        tmpdf[[ii]] = test[ii,]
        str = ""

        # "Jump over" the shows we've just consolidated

        ii = ii + length(hold)

   # Here we process the "one night stands"

   } else {
        tmpdf[[ii]] = test[ii,]
        ii = ii + 1
   }
   if (ii > nrow(test)) break         # stop once every row has been handled
}

tmpdf = tmpdf[!sapply(tmpdf,is.null)]    # Remove NULL list elements
stones = do.call(rbind,tmpdf)                # Bind the list back into a data frame

stones.plot = gvisMap(stones,"LatLon","Tip")
plot(stones.plot)

Google map of the tour stops with consolidated annotations

Okay. Now depending on your background in R you might think that was a lot of work, (or maybe not). In either case this is fairly typical of what we have to do to clean up and/or consolidate data to get it into a format that is suitable for use with the package we are using. Don’t think that this type of effort is peculiar to googleVis because other packages would require a comparable level of processing also. Welcome to the real world of data manipulation.

Anyway, let's take a look at the new plot. At first glance it seems just like the old one, but click on the New York locator and you will now see that the info for all of the Madison Square Garden shows is present. Shows 11 through 16 took place in NYC.

R packages for interfacing with Google

Here is a table listing other R packages for interfacing with various Google services, along with some related packages for interactive graphics. Each one is worth investigating. Keep in mind that similar packages exist for other languages, so if you prefer to do your coding in Perl or Python you can work with the Google APIs as well.

PACKAGE          DESCRIPTION
googleVis        Create web pages with interactive charts based on R data frames
plotGoogleMaps   Plot HTML output with Google Maps API and your own data
RgoogleMaps      Overlays on Google map tiles in R
animation        A Gallery of Animations in Statistics and Utilities
gridSVG          Export grid graphics as SVG
SVGAnnotation    Tools for Post-Processing SVG Plots Created in R
RSVGTipsDevice   An R SVG graphics device with dynamic tips and hyperlinks
iWebPlots        Interactive web-based plots

In this article I discuss a general approach for GeoCoding a location from within R, processing XML reports, and using R packages to create interactive maps. There are various ways to accomplish this, though using Google's GeoCoding service is a good place to start. We'll also talk a bit about the XML package, which is a very useful tool for parsing reports returned from Google. XML is a powerful markup language with wide support in many Internet databases, so knowing how to handle it is helpful. This post sets us up for Part II, wherein we'll use our knowledge to create a map of the tour dates on the Rolling Stones 1975 Tour of the Americas. Also, when I use the word "GeoCoding" I mean the process of taking a geographic location and turning it into a latitude / longitude pair.

What does Google Offer ?

Check out the main Geocoding page, which presents implementation details of the API as well as use policies and limitations.

https://developers.google.com/maps/documentation/geocoding/

As an example let's find out what the latitude and longitude are for "Atlanta, GA". Actually we can get more specific than this by specifying an address or a zip code, but let's keep it simple at first. The Google API (Application Programming Interface) is very forgiving and will work with almost any valid geographic information we provide. We could enter just a zip code, a complete address, or an international location and Google will happily process it. Back to our example: according to the Google specs we would create a URL that looks like this:

http://maps.googleapis.com/maps/api/geocode/xml?address=Atlanta,GA&sensor=false

Notice that I have placed Atlanta,GA into the URL after the "xml?address=" tag. According to the Google specs we must also append the string "sensor=false". This tells Google that the query is not coming from a device with a location sensor but from a person or a program. If you paste the above URL into a web browser you will get something back that looks very much like the following.

<GeocodeResponse>
   <status>OK</status>
   <result>
      <type>locality</type>
      <type>political</type>
      <formatted_address>Atlanta, GA, USA</formatted_address>
      ..
      ..
      <geometry>
         <location>
            <lat>33.7489954</lat>
            <lng>-84.3879824</lng>
         </location>
      ..
      ..
      </geometry>
    </result>
</GeocodeResponse>

This is what is known as an XML document (eXtensible Markup Language). It might look scary at first, but in reality this report contains a lot of helpful information. Note that there are "tags" that describe the content as it is being returned. That is, what we get back is a document that basically describes itself. This is a very useful attribute of XML: we can pick out only the information we want while ignoring the rest.

In this case all we care about are the "lat" and "lng" tags. We can scan through the document and see that the latitude and longitude for Atlanta, GA are 33.7489954 and -84.3879824. It is this self-describing quality of XML that makes it ideal for processing by programming languages such as R (of course), Perl, Python, Java, and others.

But would you want to visually accomplish this for 10, 100, or 1,000 locations ? I wouldn’t even want to do it for 2 ! So this is when we put the power of R to work. Let’s start up R and install some packages that will help us “talk” to the Google API.

install.packages("XML",dependencies=T)
library(XML)

install.packages("RCurl",dependencies=T)
library(RCurl)

Creating a URL string and Passing it to Google

We have a destination of “Atlanta, GA” that we wish to GeoCode into a latitude longitude pair. We’ll need to build a URL for eventual passing to the Google API. We can easily do this using the paste function.

google.url = "http://maps.googleapis.com/maps/api/geocode/xml?address="
query.url = paste(google.url, "Atlanta,GA", "&sensor=false", sep="")

query.url
[1] "http://maps.googleapis.com/maps/api/geocode/xml?address=Atlanta,GA&sensor=false"

Okay great. We are now ready to send this over to Google. This is easy using the getURL function that is part of the RCurl package.

txt = getURL(query.url)
xml.report = xmlTreeParse(txt,useInternalNodes=TRUE)

What we have here is a variable called “xml.report” that now contains the XML report that we saw in the previous section. If you don’t believe me then check out its contents:

xml.report

<?xml version="1.0" encoding="UTF-8"?>
<GeocodeResponse>
  <status>OK</status>
  <result>
    <type>locality</type>
    <type>political</type>
    <formatted_address>Atlanta, GA, USA</formatted_address>
    ..
    <geometry>
      <location>
        <lat>33.7489954</lat>
        <lng>-84.3879824</lng>
      </location>
      ..
      ..
      </GeocodeResponse>

Parsing the Returned XML Result

Don't get too worried about how the xmlTreeParse function works at this point. Just use it as a "black box" that gets the XML report into an internal form R can work with. The next step uses the function getNodeSet to locate the latitude and longitude strings in the report. This is something we did visually in the previous section, though to do it from within R we need to use an "XPath expression" to search through the document for the lat/lon.

place = getNodeSet(xml.report,"//GeocodeResponse/result[1]/geometry/location[1]/*")

XPath is the XML Path Language, which is a query language for selecting “nodes” from an XML document. A full discussion of it is beyond the scope of the document. At this point just think of it as a way to search an XML document to match specific lines in the document. If you look at the string we pass to getNodeSet you might wonder what it means.

//GeocodeResponse/result[1]/geometry/location[1]/*

This is a string that we use to match specific tags in the XML report. We can "read" it as follows: using xml.report we match the "GeocodeResponse" node, then the first "result" node, then the "geometry" node, and then the first "location" node. This gives us the section of the report that relates to the latitude and longitude. To verify manually, you can visually scan through the document.

<GeocodeResponse>
  <result>
        <geometry>
             <location>
                <lat>33.7489954</lat>
                <lng>-84.3879824</lng>

Notice that in the xml.report the lines are indented, which suggests that some rows are contained within sections; in other words, there is a hierarchy in the document. When you "match" the GeocodeResponse line, that is a "node" in the document which contains other nodes. So in our example the "result" node is contained within the "GeocodeResponse" node, the "geometry" node is contained within the "result" node, and the "location" node is contained within the "geometry" node.

We can then extract the values contained in the "location" node. Note that in real world applications you would first use a tool such as XMLFinder to develop the correct XPath expression, after which you would implement it in R as we have done here. The Firefox browser also has a plugin that allows one to inspect XML reports. Finally, we use an R statement to turn the variable place (from up above) into numeric values.

lat.lon = as.numeric(sapply(place,xmlValue))
lat.lon
[1]  33.7 -84.4

Write a Function to Do This Work !

So you might think that was a lot of work. Well, maybe it was, but consider that we can now parse any XML document returned by the Google GeoCoding service and find the lat / lon pair. Not only that, we can extract any other information contained in the returned report. Since we wouldn't want to do this manually for every GeoCoding query we might have, let's write a function in R to do it for us. Here is one way to do it:

myGeo <- function(address="Atlanta,GA") {

# Make sure we have the required libraries to do the work

  stopifnot(require(RCurl))
  stopifnot(require(XML))
  
# Remove any spaces in the address field and replace with "+"

   address = gsub(" ","\\+",address)
   
# Create the request URL according to Google specs

  google.url = "http://maps.googleapis.com/maps/api/geocode/xml?address="
  my.url = paste(google.url,address,"&sensor=false",sep="")

  Sys.sleep(0.5) # Just so we don't beat up the Google servers with rapid fire requests
  
# Send the request to Google and turn it into an internal XML document
  
  txt = getURL(my.url)
  xml.report = xmlTreeParse(txt, useInternalNodes=TRUE)

# Pull out the lat/lon pair and return it  

  place = getNodeSet(xml.report,  "//GeocodeResponse/result[1]/geometry/location[1]/*")
  lat.lon = as.numeric(sapply(place,xmlValue))
  names(lat.lon) = c("lat","lon")
  
  return(lat.lon)
}

myGeo()
 lat   lon 
 33.7 -84.4 

myGeo("Palo Alto,CA")
   lat    lon 
  37.4 -122.1 
  
myGeo("Paris,FR")
  lat   lon 
48.86  2.35 

Now, we've actually got more capability than we might think. Remember that the Google API will accept almost any reasonable geographic information. So check out the following. With just a few lines of code we have a pretty powerful way to GeoCode generally specified addresses without having to do lots of preparation and string processing within our R function.

myGeo("1600 Pennsylvania Avenue, Washington, DC")
  lat   lon 
 38.9 -77.0 

myGeo("Champs-de-Mars, Paris 75007")  # Address of the Eiffel Tower
lat  lon 
48.9  2.3 

Using the myGeo function in a Real Example

Let's see how we might use this in a real situation. Here is a data frame called geog, which contains a column of city names and a column of state abbreviations.

geog
            city state
1       Glendale    CA
2      DesMoines    IA
3    Albuquerque    NM
4           Waco    TX
5       Honolulu    HI
6  Indianaopolis    IN
7     Pittsburgh    PA
8     Clearwater    FL
9      Sunnyvale    CA
10    Bridgeport    CT
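The construction of geog isn't shown above, so if you want to follow along here is a quick way to rebuild it. The values, including the "Indianaopolis" typo, are copied straight from the listing.

# Rebuild the geog data frame from the values shown above

geog = data.frame(city=c("Glendale","DesMoines","Albuquerque","Waco","Honolulu",
                         "Indianaopolis","Pittsburgh","Clearwater","Sunnyvale","Bridgeport"),
                  state=c("CA","IA","NM","TX","HI","IN","PA","FL","CA","CT"),
                  stringsAsFactors=FALSE)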

Now. Look how easy it is to process all of these rows:

t(sapply(paste(geog$city,geog$state),myGeo))

                  lat    lon
Glendale CA      34.1 -118.3
DesMoines IA     41.6  -93.6
Albuquerque NM   35.1 -106.6
Waco TX          31.5  -97.1
Honolulu HI      21.3 -157.9
Indianaopolis IN 39.8  -86.2
Pittsburgh PA    40.4  -80.0
Clearwater FL    28.0  -82.8
Sunnyvale CA     37.4 -122.0
Bridgeport CT    41.2  -73.2

Our function should also work with a vector of zip codes.

myzipcodes
 [1] "90039" "50301" "87101" "76701" "96801" "46201" "15122" "33755" "94085" "06601"
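If you want to reproduce this, the vector is just the zip codes shown above. Keep them as character strings so the leading zero in "06601" survives.

myzipcodes = c("90039","50301","87101","76701","96801",
               "46201","15122","33755","94085","06601")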
 
t(sapply(myzipcodes,myGeo))

       lat    lon
90039 34.1 -118.3
50301 41.6  -93.6
87101 35.1 -106.6
76701 31.6  -97.1
96801 21.3 -157.9
46201 39.8  -86.1
15122 40.4  -79.9
33755 28.0  -82.8
94085 37.4 -122.0
06601 41.2  -73.2

Okay, please check out Part 2 of this post to see how we process the tour data from the 1975 Rolling Stones "Tour of the Americas".