Tag Archives: dplyr

Going Bananas With Hilbert

It seemed that everything is in ruins, and that all the basic mathematical concepts have lost their meaning (Naum Vilenkin, Russian mathematician, regarding to the discovery of Peano’s curve)

Giuseppe Peano found in 1890 a way to draw a curve in the plane that filled the entire space: just a simple line covering completely a two dimensional plane. Its discovery meant a big earthquake in the traditional structure of mathematics. Peano’s curve was the first but not the last: one of these space-filling curves was discovered by Hilbert and takes his name. It is really beautiful:
hilbert_n5

Hilbert’s curve can be created iteratively. These are the first six iterations of its construction:
hilbert

As you will see below, R code to create Hilbert’s curve is extremely easy. It is also very easy to play with the curve, altering the order in which points are sorted. Changing the initial matrix(1) by some other number, resulting curves are quite appealing:

Let’s go futher. Changing ggplot geometry from geom_path to geom_polygon generate some crazy pseudo-tessellations:

And what if you change the matrix exponent?

And what if you apply polar coordinates?

We started with a simple line and with some small changes we have created fantastical images. And all these things only using black and white. Do you want to add some colors? Try with the following code (if you draw something interesting, please let me know):

library(reshape2)
library(dplyr)
library(ggplot2)
opt=theme(legend.position="none",
          panel.background = element_rect(fill="white"),
          panel.grid=element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text=element_blank())
hilbert = function(m,n,r) {
  for (i in 1:n)
  {
    tmp=cbind(t(m), m+nrow(m)^2)
    m=rbind(tmp, (2*nrow(m))^r-tmp[nrow(m):1,]+1)
  }
  melt(m) %>% plyr::rename(c("Var1" = "x", "Var2" = "y", "value"="order")) %>% arrange(order)}
# Original
ggplot(hilbert(m=matrix(1), n=1, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=2, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=3, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=4, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=6, r=2), aes(x, y)) + geom_path()+ opt
# Changing order
ggplot(hilbert(m=matrix(.5), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(0), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(tan(1)), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(3), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(-1), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(log(.1)), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(-15), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(-0.001), n=5, r=2), aes(x, y)) + geom_path()+ opt
# Polygons
ggplot(hilbert(m=matrix(log(1)), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(.5), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(tan(1)), n=5, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-15), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-25), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(0), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(1000000), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-1), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-.00001), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
# Changing exponent
gplot(hilbert(m=matrix(log(1)), n=4, r=-1), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(.5), n=4, r=-2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(tan(1)), n=4, r=6), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-15), n=3, r=sin(2)), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-25), n=4, r=-.0001), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(0), n=4, r=200), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(1000000), n=3, r=.5), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-1), n=4, r=sqrt(2)), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-.00001), n=4, r=52), aes(x, y)) + geom_polygon()+ opt
# Polar coordinates
ggplot(hilbert(m=matrix(1), n=4, r=2), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(-1), n=5, r=2), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(.1), n=2, r=.5), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(1000000), n=2, r=.1), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(.25), n=3, r=3), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(tan(1)), n=5, r=1), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(1), n=4, r=1), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(log(1)), n=3, r=sin(2)), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(-.0001), n=4, r=25), aes(x, y)) + geom_polygon()+ coord_polar()+opt

A Checkpoint Of Spanish Football League

I am an absolute beginner, but I am absolutely sane (Absolute Beginners, David Bowie)

Some time ago I wrote this post, where I predicted correctly the winner of the Spanish Football League several months before its ending. After thinking intensely about taking the risk of ruining my reputation repeating the analysis, I said “no problem, Antonio, do it again: in the end you don’t have any reputation to keep”. So here we are.

From a technical point of view there are many differences between both analysis. Now I use webscraping to download data, dplyr and pipes to do transformations and interactive D3.js graphs to show results. I think my code is better now and it makes me happy.

As I did the other time, Bradley-Terry Model gives an indicator of  the power of each team, called ability, which provides a natural mechanism for ranking teams. This is the evolution of abilities of each team during the championship (last season was played during the past weekend):

liga1_ability2

Although it is a bit messy, the graph shows two main groups of teams: on the one hand, Barcelona, Atletico de Madrid, Real Madrid and Villareal; on the other hand, the rest. Let’s have a closer look to evolution of the abilities of the top 4 teams:

liga2_ability2

While Barcelona, Atletico de Madrid and Real Madrid walk in parallel,  Villareal seems to be a bit stacked in the last seasons; the gap between them and Real Madrid is increasing little by little. Maybe is the Zidane’s effect. It is quite interesting discovering what teams are increasing their abilities: they are Malaga, Eibar and Getafe. They will probably finish the championship in a better position than they have nowadays (Eibar could reach fifth position):

liga3_ability2

What about Villareal? Will they go up some position? I don’t think so. This plot shows the probability of winning any of the top 3:

liga4_villareal2

As you can see, probability is decreasing significantly. And what about Barcelona? Will win? It is a very difficult question. They are almost tied with Atletico de Madrid, and only 5 and 8 points above Real Madrid and Villareal. But it seems Barcelona keep them at bay. This plot shows the evolution of the probability of be beaten by Atletico, Real Madrid and Villareal:

liga5_Barcelona2

All probabilities are under 50% and decreasing (I supposed a scoring of 2-0 for Barcelona in the match against Sporting of season 16 that was postponed to next February 17th).

Data science is a profession for brave people so it is time to do some predictions. These are mine, ordered by likelihood:

  • Barcelona will win, followed by Atletico (2), Real Madrid (3), Villareal (4) and Eibar (5)
  • Malaga and Getafe will go up some positions
  • Next year I will do the analysis again

Here you have the code:

library(rvest)
library(stringr)
library(BradleyTerry2)
library(dplyr)
library(reshape)
library(rCharts)
nseasons=20
results=data.frame()
for (i in 1:nseasons)
{
  webpage=paste0("http://www.marca.com/estadisticas/futbol/primera/2015_16/jornada_",i,"/")
  html(webpage) %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table(header=FALSE, fill=TRUE) %>%
    mutate(X4=i) %>%
    rbind(results)->results
}
colnames(results)=c("home", "score", "visiting", "season")
results %>% 
  mutate(home     = iconv(home,     from="UTF8",to="ASCII//TRANSLIT"),
         visiting = iconv(visiting, from="UTF8",to="ASCII//TRANSLIT")) %>%
  #filter(grepl("-", score)) %>%
  mutate(score=replace(score, score=="18:30 - 17/02/2016", "0-2")) %>% # resultado fake para el Barcelona
  mutate(score_home     = as.numeric(str_split_fixed(score, "-", 2)[,1])) %>%
  mutate(score_visiting = as.numeric(str_split_fixed(score, "-", 2)[,2])) %>%
  mutate(points_home     =ifelse(score_home > score_visiting, 3, ifelse(score_home < score_visiting, 0, 1))) %>%
  mutate(points_visiting =ifelse(score_home > score_visiting, 0, ifelse(score_home < score_visiting, 3, 1))) -> data
prob_BT=function(x, y) {exp(x-y) / (1 + exp(x-y))}
BTabilities=data.frame()
for (i in 13:nseasons)
{
  data %>% filter(season<=i) %>%
    BTm(cbind(points_home, points_visiting), home, visiting, data=.) -> footballBTModel
  BTabilities(footballBTModel) %>%
  as.data.frame()  -> tmp 
  cbind(tmp, as.character(rownames(tmp)), i) %>% 
  mutate(ability=round(ability, digits = 2)) %>%
  rbind(BTabilities) -> BTabilities
}
colnames(BTabilities)=c("ability", "s.e.", "team", "season")
sort(unique(BTabilities[,"team"])) -> teams
BTprobabilities=data.frame()
for (i in 13:nseasons)
{
  BTabilities[BTabilities$season==i,1] %>% outer( ., ., prob_BT) -> tmp
  colnames(tmp)=teams
  rownames(tmp)=teams  
  cbind(melt(tmp),i) %>% rbind(BTprobabilities) -> BTprobabilities
}
colnames(BTprobabilities)=c("team1", "team2", "probability", "season")
BTprobabilities %>% 
  filter(team1=="Villarreal") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team2 %in% c("R. Madrid", "Barcelona", "Atletico")) -> BTVillareal
BTprobabilities %>% 
  filter(team2=="Barcelona") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team1 %in% c("R. Madrid", "Villarreal", "Atletico")) -> BTBarcelona
AbilityPlot <- nPlot(
  ability ~ season, 
  data = BTabilities, 
  group = "team",
  type = "lineChart")
AbilityPlot$yAxis(axisLabel = "Estimated Ability", width = 62)
AbilityPlot$xAxis(axisLabel = "Season")
VillarealPlot <- nPlot(
  probability ~ season, 
  data = BTVillareal, 
  group = "team2",
  type = "lineChart")
VillarealPlot$yAxis(axisLabel = "Probability of beating", width = 62)
VillarealPlot$xAxis(axisLabel = "Season")
BarcelonaPlot <- nPlot(
  probability ~ season, 
  data = BTBarcelona, 
  group = "team1",
  type = "lineChart")
BarcelonaPlot$yAxis(axisLabel = "Probability of being beaten", width = 62)
BarcelonaPlot$xAxis(axisLabel = "Season")

Polar Circles

You cannot find peace by avoiding life (Virginia Woolf)

Combining polar coordinates, RColorBrewer palettes, ggplot2 and a simple trigonometric function to define the width of the tiles is easy to produce nice circular plots like these:

polar_flower4polar_flower2polar_flower1polar_flower3

Do you want to try? Here you have the code:

library(ggplot2)
library(dplyr)
library(RColorBrewer)
n=500
m=50
w=sapply(seq(from=-3.5*pi, to=3.5*pi, length.out=n), function(x) {abs(sin(x))})
x=c(1)
for (i in 2:n) {x[i]=x[i-1]+1/2*(w[i-1]+w[i])}
expand.grid(x=x, y=1:m) %>%
  mutate(w=rep(w, m))-> df
opt=theme(legend.position="none",
          panel.background = element_rect(fill="white"),
          panel.grid=element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text=element_blank())
ggplot(df, aes(x=x,y=y))+geom_tile(aes(fill=x, width=w))+ 
  scale_fill_gradient(low=brewer.pal(9, "Greens")[1], high=brewer.pal(9, "Greens")[9])+
  coord_polar(start = runif(1, min = 0, max = 2*pi))+opt
ggplot(df, aes(x=x,y=y))+geom_tile(aes(fill=w, width=w))+ 
  scale_fill_gradient(low=brewer.pal(9, "Reds")[1], high=brewer.pal(9, "Reds")[9])+ 
  coord_polar(start = runif(1, min = 0, max = 2*pi))+opt
ggplot(df, aes(x=x,y=y))+geom_tile(aes(fill=y, width=w))+ 
  scale_fill_gradient(low=brewer.pal(9, "Purples")[1], high=brewer.pal(9, "Purples")[9])+ 
  coord_polar(start = runif(1, min = 0, max = 2*pi))+opt
ggplot(df, aes(x=x,y=y))+geom_tile(aes(fill=w*y, width=w))+ 
  scale_fill_gradient(low=brewer.pal(9, "Blues")[9], high=brewer.pal(9, "Blues")[1])+ 
  coord_polar(start = runif(1, min = 0, max = 2*pi))+opt

A Visualization Of The 100 Greatest Love Songs ft. D3.js

What would you do? If my heart was torn in two (More Than Words, Extreme)

Playing with rCharts package I had the idea of representing the list of 100 best love songs as a connected set of points which forms a heart. Songs can be seen putting mouse cursor over each dot. This is an screenshot of the graph:

HeartScreenshotIf you want to play with the visualization, you can download it here. Or better than this, you can reproduce it with this simple code:

library(dplyr)
library(rCharts)
library(rvest)
setwd("YOUR WORKING DIRECTORY HERE")
heart <- function(r,x) {ifelse(abs(x)<2, ifelse(r%%2==0, sqrt(1-(abs(x)-1)^2), acos(1-abs(x))-pi), 0)} data.frame(x=seq(from=-3, to=3, length.out=100)) %>% 
  mutate(y=jitter(heart(row_number(), x), amount=.1)) -> df
love_songs <- html("http://www.cs.ubc.ca/~davet/music/list/Best13.html") love_songs %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(header=TRUE, fill = TRUE) %>%
  cbind(df) -> df
m1=mPlot(x = "x",  y = "y",  data = df,  type = "Line")
m1$set(pointSize = 5, 
       lineColors = c('red', 'red'),
       width = 850,
       height = 600,
       lineWidth = 2,
       hoverCallback = "#! function(index, options, content){
       var row = options.data[index]
       return '<b>' + row.ARTIST + '</b>' + '<br/>' + row.TITLE} !#",
       grid=FALSE,
       axes=FALSE)
m1$save('Top_100_Greatest_Love_Songs.html', standalone = TRUE)

A Simple Interactive Map Of US Prisons With Leaflet

The love of one’s country is a splendid thing. But why should love stop at the border? (Pablo Casals, Spanish cellist)

Some time ago, I discovered Enigma, an amazing open platform that unifies billions of records from thousands of government sources to make the world of public data universally accessible and useful. This is the first experiment I have done using data from Enigma. This is what I did:

  1. Create a free account, search and download data. Save the csv file in your working directory. File contains information about all prison facilities in the United States (private and state run) as recorded by the Department of Corrections in each state. Facility types, names, addresses (or lat/long coordinates) ownership names and detailed. In sum, there is information about 1.248 prison facilities.
  2. Since most of the prisons of the file do not contain geographical coordinates, I obtain latitude and longitude using geocode function from ggmap package. This step takes some time. I also remove closed facilities. Finally, I obtain a data set with complete information of 953 prison facilities.
  3. After cleaning and filling out data, generating the map is very easy using leaflet package for R. I create a column named popup_info pasting name and address to be shown in the popup. Instead using default OpenStreetMap basemap I use a CartoDB one.

In my opinion, resulting map is very appealing with a minimal effort. Since I cannot embed the map here, this is a screenshot of it:

jailsThis plot could be a good example of visual correlation, because it depends on this. Here you have the code. To see the map in your browser, press Show in new window option, a little arrow on the upper left side of the RStudio viewer window:

library(dplyr)
library(ggmap)
library(leaflet)
setwd("YOUR WORKING DIRECTORY HERE")
prisons = read.csv(file="enigma-enigma.prisons.all-facilities-bd6a927c4024c16d8ba9e21d52292b0f.csv", stringsAsFactors=FALSE)
prisons %>% 
  mutate(address=paste(facility_address1, city, state, zip, "EEUU", sep=", ")) %>%
  select(address) %>% 
  lapply(function(x){geocode(x, output="latlon")})  %>% 
  as.data.frame %>% 
  cbind(prisons) -> prisons
prisons %>%  
  mutate(popup_info=paste(sep = "<br/>", paste0("<b>", facility_name, "</b>"), facility_address1, city, state, zip)) %>% 
  mutate(lon=ifelse(is.na(longitude), address.lon, longitude),
         lat=ifelse(is.na(latitude),  address.lat, latitude)) %>%
  filter(!is.na(lon) & !grepl("CLOSED", facility_name)) -> prisons
leaflet(prisons) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(lng = ~lon, 
                   lat = ~lat, 
                   radius = 3, 
                   color = "red",
                   stroke=FALSE,
                   fillOpacity = 0.5,
                   popup = ~popup_info)

The World We Live In #5: Calories And Kilograms

I enjoy doing new tunes; it gives me a little bit to perk up, to pay a little bit more attention (Earl Scruggs, American musician)

I recently finished reading The Signal and the Noise, a book by Nate Silver, creator of the also famous FiveThirtyEight blog. The book is a very good reading for all data science professionals, and is a must in particular for all those who work trying to predict the future. The book praises the bayesian way of thinking as the best way to face and modify predictions and criticizes rigid ways of thinking with many examples of disastrous predictions. I enjoyed a lot the chapter dedicated to chess and how Deep Blue finally took over Kasparov. In a nutshell: I strongly recommend it.
One of the plots of Silver’s book present a case of false negative showing the relationship between obesity and calorie consumption across the world countries. The plot shows that there is no evidence of a connection between both variables. Since it seemed very strange to me, I decided to reproduce the plot by myself.

I compared these two variables:

  • Dietary Energy Consumption (kcal/person/day) estimated by the FAO Food Balance Sheets.
  • Prevalence of Obesity as percentage of defined population with a body mass index (BMI) of 30 kg/m2 or higher estimated by the World Health Organization

And this is the resulting plot:

Calories And KilogramsAs you can see there is a strong correlation between two variables. Why the experiment of Nate Silver shows the opposite? Obviously we did not plot the same data (although, in principle, both of us went to the same source). Anyway: to be honest, I prefer my plot because shows what all of we know: the more calories you eat, the more weight you will see in your bathroom scale. Some final thoughts seeing the plot:

  • I would like to be Japanese: they don’t gain weight!
  • Why US people are fatter than Austrian?
  • What happens in Samoa?

Here you have the code to do the plot:

library(xlsx)
library(dplyr)
library(ggplot2)
library(scales)
setwd("YOUR WORKING DIRECTORY HERE")
url_calories = "http://www.fao.org/fileadmin/templates/ess/documents/food_security_statistics/FoodConsumptionNutrients_en.xls"
download.file(url_calories, method="internal", destfile = "FoodConsumptionNutrients_en.xls", mode = "ab")
calories = read.xlsx(file="FoodConsumptionNutrients_en.xls", startRow = 4, colIndex = c(2,6), colClasses = c("character", "numeric"), sheetName="Dietary Energy Cons. Countries", stringsAsFactors=FALSE) 
colnames(calories)=c("Country", "Kcal")
url_population = "http://esa.un.org/unpd/wpp/DVD/Files/1_Excel%20(Standard)/EXCEL_FILES/1_Population/WPP2015_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.XLS"
download.file(url_population, method="internal", destfile = "Population.xls", mode = "ab")
population = read.xlsx(file="Population.xls", startRow = 17, colIndex = c(3,71), colClasses = c("character", "numeric"), sheetName="ESTIMATES", stringsAsFactors=FALSE) 
colnames(population)=c("Country", "Population")
# http://apps.who.int/gho/data/node.main.A900A?lang=en
url_obesity = "http://apps.who.int/gho/athena/data/xmart.csv?target=GHO/NCD_BMI_30A&profile=crosstable&filter=AGEGROUP:*;COUNTRY:*;SEX:*&x-sideaxis=COUNTRY&x-topaxis=GHO;YEAR;AGEGROUP;SEX&x-collapse=true"
obesity = read.csv(file=url_obesity, stringsAsFactors=FALSE)
obesity %>% select(matches("Country|2014.*Both")) -> obesity
colnames(obesity)=c("Country", "Obesity")
obesity %>% filter(Obesity!="No data") -> obesity
obesity %>% mutate(Obesity=as.numeric(substr(Obesity, 1, regexpr(pattern = "[[]", obesity$Obesity)-1))) -> obesity
population %>% inner_join(calories,by = "Country") %>% inner_join(obesity,by = "Country") -> data
opts=theme(
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=15),
  axis.title = element_text(size=18, colour="gray10"),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 40, colour="gray10"))
ggplot(data, aes(x=Kcal, y=Obesity/100, size=log(Population), label=Country), guide=FALSE)+
  geom_point(colour="white", fill="sandybrown", shape=21, alpha=.55)+
  scale_size_continuous(range=c(2,40))+
  scale_x_continuous(limits=c(1500,4100))+
  scale_y_continuous(labels = percent)+
  labs(title="The World We Live In #5: Calories And Kilograms",
       x="Dietary Energy Consumption (kcal/person/day)",
       y="% population with body mass index >= 30 kg/m2")+
  geom_text(data=subset(data, Obesity>35|Kcal>3700), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Kcal<2000), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Obesity<10 & Kcal>2600), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(aes(3100, .01), colour="gray25", hjust=0, label="Source: United Nations (size of bubble depending on population)", size=4.5)+opts