Monthly Archives: January 2016

A Checkpoint Of Spanish Football League

I am an absolute beginner, but I am absolutely sane (Absolute Beginners, David Bowie)

Some time ago I wrote this post, where I predicted correctly the winner of the Spanish Football League several months before its ending. After thinking intensely about taking the risk of ruining my reputation repeating the analysis, I said “no problem, Antonio, do it again: in the end you don’t have any reputation to keep”. So here we are.

From a technical point of view there are many differences between both analysis. Now I use webscraping to download data, dplyr and pipes to do transformations and interactive D3.js graphs to show results. I think my code is better now and it makes me happy.

As I did the other time, Bradley-Terry Model gives an indicator of  the power of each team, called ability, which provides a natural mechanism for ranking teams. This is the evolution of abilities of each team during the championship (last season was played during the past weekend):

liga1_ability2

Although it is a bit messy, the graph shows two main groups of teams: on the one hand, Barcelona, Atletico de Madrid, Real Madrid and Villareal; on the other hand, the rest. Let’s have a closer look to evolution of the abilities of the top 4 teams:

liga2_ability2

While Barcelona, Atletico de Madrid and Real Madrid walk in parallel,  Villareal seems to be a bit stacked in the last seasons; the gap between them and Real Madrid is increasing little by little. Maybe is the Zidane’s effect. It is quite interesting discovering what teams are increasing their abilities: they are Malaga, Eibar and Getafe. They will probably finish the championship in a better position than they have nowadays (Eibar could reach fifth position):

liga3_ability2

What about Villareal? Will they go up some position? I don’t think so. This plot shows the probability of winning any of the top 3:

liga4_villareal2

As you can see, probability is decreasing significantly. And what about Barcelona? Will win? It is a very difficult question. They are almost tied with Atletico de Madrid, and only 5 and 8 points above Real Madrid and Villareal. But it seems Barcelona keep them at bay. This plot shows the evolution of the probability of be beaten by Atletico, Real Madrid and Villareal:

liga5_Barcelona2

All probabilities are under 50% and decreasing (I supposed a scoring of 2-0 for Barcelona in the match against Sporting of season 16 that was postponed to next February 17th).

Data science is a profession for brave people so it is time to do some predictions. These are mine, ordered by likelihood:

  • Barcelona will win, followed by Atletico (2), Real Madrid (3), Villareal (4) and Eibar (5)
  • Malaga and Getafe will go up some positions
  • Next year I will do the analysis again

Here you have the code:

library(rvest)
library(stringr)
library(BradleyTerry2)
library(dplyr)
library(reshape)
library(rCharts)
nseasons=20
results=data.frame()
for (i in 1:nseasons)
{
  webpage=paste0("http://www.marca.com/estadisticas/futbol/primera/2015_16/jornada_",i,"/")
  html(webpage) %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table(header=FALSE, fill=TRUE) %>%
    mutate(X4=i) %>%
    rbind(results)->results
}
colnames(results)=c("home", "score", "visiting", "season")
results %>% 
  mutate(home     = iconv(home,     from="UTF8",to="ASCII//TRANSLIT"),
         visiting = iconv(visiting, from="UTF8",to="ASCII//TRANSLIT")) %>%
  #filter(grepl("-", score)) %>%
  mutate(score=replace(score, score=="18:30 - 17/02/2016", "0-2")) %>% # resultado fake para el Barcelona
  mutate(score_home     = as.numeric(str_split_fixed(score, "-", 2)[,1])) %>%
  mutate(score_visiting = as.numeric(str_split_fixed(score, "-", 2)[,2])) %>%
  mutate(points_home     =ifelse(score_home > score_visiting, 3, ifelse(score_home < score_visiting, 0, 1))) %>%
  mutate(points_visiting =ifelse(score_home > score_visiting, 0, ifelse(score_home < score_visiting, 3, 1))) -> data
prob_BT=function(x, y) {exp(x-y) / (1 + exp(x-y))}
BTabilities=data.frame()
for (i in 13:nseasons)
{
  data %>% filter(season<=i) %>%
    BTm(cbind(points_home, points_visiting), home, visiting, data=.) -> footballBTModel
  BTabilities(footballBTModel) %>%
  as.data.frame()  -> tmp 
  cbind(tmp, as.character(rownames(tmp)), i) %>% 
  mutate(ability=round(ability, digits = 2)) %>%
  rbind(BTabilities) -> BTabilities
}
colnames(BTabilities)=c("ability", "s.e.", "team", "season")
sort(unique(BTabilities[,"team"])) -> teams
BTprobabilities=data.frame()
for (i in 13:nseasons)
{
  BTabilities[BTabilities$season==i,1] %>% outer( ., ., prob_BT) -> tmp
  colnames(tmp)=teams
  rownames(tmp)=teams  
  cbind(melt(tmp),i) %>% rbind(BTprobabilities) -> BTprobabilities
}
colnames(BTprobabilities)=c("team1", "team2", "probability", "season")
BTprobabilities %>% 
  filter(team1=="Villarreal") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team2 %in% c("R. Madrid", "Barcelona", "Atletico")) -> BTVillareal
BTprobabilities %>% 
  filter(team2=="Barcelona") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team1 %in% c("R. Madrid", "Villarreal", "Atletico")) -> BTBarcelona
AbilityPlot <- nPlot(
  ability ~ season, 
  data = BTabilities, 
  group = "team",
  type = "lineChart")
AbilityPlot$yAxis(axisLabel = "Estimated Ability", width = 62)
AbilityPlot$xAxis(axisLabel = "Season")
VillarealPlot <- nPlot(
  probability ~ season, 
  data = BTVillareal, 
  group = "team2",
  type = "lineChart")
VillarealPlot$yAxis(axisLabel = "Probability of beating", width = 62)
VillarealPlot$xAxis(axisLabel = "Season")
BarcelonaPlot <- nPlot(
  probability ~ season, 
  data = BTBarcelona, 
  group = "team1",
  type = "lineChart")
BarcelonaPlot$yAxis(axisLabel = "Probability of being beaten", width = 62)
BarcelonaPlot$xAxis(axisLabel = "Season")
Advertisements

Climatic Change At A Glance

Mmm. Lost a planet, Master Obi-Wan has. How embarrassing (Yoda, Attack Of The Clones)

Some time ago I published this post in KDnuggets in which I analyze historical temperatures to show how we are gradually heading toward a warmer planet. Simple data science to obtain a simple (and increasingly accepted) conclusion: the global warming is real. Despite I was criticized I still believe what I said then: you don’t have to be a climatologist to empirically confirm global warming.

This experiment is another example of that. It is still simpler than that since it is only based on visual perception but I think is also quite conclusive. In this case, I represent U.S. temperature outliers from 1964 to 2013; a map per year. Dataset contains station ID, name, min/max temperature, as well as degree coordinates of the recorded weather. Original weather data collected from NOAA and anomalies analysis by Enigma. You can download data here.

Anomalies are divided into four categories: Strong Hot, Weak Hot, Weak Cold and Strong Cold. For each station, I represent difference between number of Cold and Hot anomalies (independently of the strength) so Blue bubbles represent stations where total number of Cold anomalies during the year is greater that total number of Hot ones and Red ones represent the opposite. Size of bubbles is also proportional to this indicator. As an example, following you can see the map of year 1975:

tonopah
It seems 1975 was hot in the right a cold on the left side. Concretely, in TONOPAH Station (Nevada) were registered 30 anomalies and most of them (26) where due to cold temperatures. This is why bubble is blue. This GIF shows the evolution of all these maps from 1964 to 2013:

anomalies

Maybe it is just my personal feeling but don’t you see how red bubbles are gradually winning to blue ones? Maybe I am a demagogue.

This code generates a dynamic map by year in html format:

library(data.table)
library(stringr)
library(leaflet)
library(RColorBrewer)
library(classInt)
library(dplyr)
library(htmlwidgets)
anomalies = fread("enigma-enigma.weather.anomalies.outliers-1964-2013-05ce047dbf3e67f83db9ae841545a333.csv")
anomalies %>%
  mutate(year=substr(date_str, 1, 4)) %>%
  group_by(year, longitude, latitude, id, station_name) %>%
  summarise(
    Strong_Hot=sum(str_count(type,"Strong Hot")),
    Weak_Hot=sum(str_count(type,"Weak Hot")),
    Weak_Cold=sum(str_count(type,"Weak Cold")),
    Strong_Cold=sum(str_count(type,"Strong Cold")),
    total=n()) %>%
  mutate(score=sign(-Strong_Hot-Weak_Hot+Weak_Cold+Strong_Cold)) %>%
  mutate(color=ifelse(score==1, "Blue",ifelse(score==0, "White", "Red"))) -> anomalies2
for (i in unique(anomalies2$year))
{
  anomalies2 %>%
    filter(year==i) %>%
    leaflet() %>%
    fitBounds(-124, 34, -62, 40) %>%
    addProviderTiles("Stamen.TonerLite") %>%
    addCircleMarkers(lng = ~longitude,
                     lat = ~latitude,
                     radius = ~ifelse(total < 20, 2, ifelse(total < 27, 4, 8)),
                     color= ~color,
                     stroke=FALSE,
                     fillOpacity = 0.5,
                     popup = ~paste(sep = "
", paste0("<b>", station_name, "</b>"),
                                    paste0("Strong Hot: ", Strong_Hot),
                                    paste0("Weak Hot: ", Weak_Hot),
                                    paste0("Weak Cold: ", Weak_Cold),
                                    paste0("Strong Cold: ", Strong_Cold))) -> m
    saveWidget(m, file=paste0("m", i, ".html"))
}