Tag Archives: R

The Hype Bubble Map for Dog Breeds

In the whole history of the world there is but one thing that money can not buy… to wit the wag of a dog’s tail (Josh Billings)

In this post I combine several things:

  • Simple webscraping to read the list of companion dogs from Wikipedia. I love rvest package to do these things.
  • Google Trends queries to download the evolution of searchings of breeds during last 6 months. I use gtrendsR package to do this and works quite well.
  • A dinamic Highchart visualization using the awesome highcharter package
  • A static ggplot visualization.

The experiment is based on a simple idea: what people search on the Internet is what people do. Can be Google Trends an useful tool to know which breed will become fashionable in the future? To be honest, I don’t really know but I will make my own bet.

What I have done is to extract last 6 months of Google trends of this list of companion breeds. After some simple text mining, I divide the set of names into 5-elements subsets because Google API doesn’t allow searchings with more than 5 items. The result of the query to Google trends is a normalized time series, meaning the 0 – 100 values are relative, not absolute, measures. This is done by taking all of the interest data for your keywords and dividing it by the highest point of interest for that date range. To make all 5-items of results comparable I always include King Charles Spaniel breed in all searchings (as a kind of undercover agent I will use to compare searching levels). The resulting number is my “Level” Y-Axis of the plot. I limit searchings to code=”0-66″ which is restrict results to Animals and pets category. Thanks, Philippe, for your help in this point. I also restrict rearchings To the United States of America.

There are several ways to obtain an aggregated trend indicator of a time series. My choice here was doing a short moving average order=2 to the resulting interest over time obtained from Google. The I divide the weekly variations by the smoothed time series. The trend indicator is the mean of these values. To obtains a robust indicator, I remove outliers of the original time series. This is my X-axis.

This is how dog breeds are arranged with respect my Trend and Level indicators:


Inspired by Gartner’s Hype Cycle of Emerging Technologies I distinguish two sets of dog breeds:

  • Plateau of Productivity Breeds (succesful breeds with very high level indicator and possitive trend): Golden Retriever, Pomeranian, Chihuahua, Collie and Shih Tzu.
  • Innovation Trigger Breeds (promising dog breeds with very high trend indicator and low level): Mexican Hairless Dog, Keeshond, West Highland White Terrier and German Spitz.

I discovered recently a wonderful package called highcharter which allows you to create incredibly cool dynamic visualizations. I love it and I could not resist to use it to do the previous plot with the look and feel of The Economist. This is an screenshot (reproduce it to play with tits interactivity):

And here comes my prediction. After analyzing the set Innovation Trigger Breeds, my bet is Keeshond will increase its popularity in the nearly future: don’t you think it is lovely?

Photo by Terri BrownFlickr: IMG_4723, CC BY 2.0

Here you have the code:


read_html(x) %>% 
  html_nodes("ul:nth-child(19)") %>% 
  html_text() %>% 
  strsplit(., "\n") %>% 
  unlist() -> breeds

breeds=iconv(breeds[breeds!= ""], "UTF-8")

gconnect(usr, psw)

ref="King Charles Spaniel"

#New set
breeds=setdiff(breeds, ref)

#Subsets. Do not worry about warning message
sub.breeds=split(breeds, 1:ceiling(length(breeds)/4))

for (i in 1:length(sub.breeds))
  res <- gtrends(unlist(union(ref, sub.breeds[i])), 
          start_date = Sys.Date()-180,

trends=data.frame(name=character(0), level=numeric(0), trend=numeric(0))
for (i in 1:length(results))
  for (j in 3:ncol(df))
    s=rm.outlier(df[,j], fill = TRUE)
    t=mean(diff(ma(s, order=2))/ma(s, order=2), na.rm = T)
    trends=rbind(data.frame(name=colnames(df)[j], level=l, trend=t), trends)

trends %>% 
  group_by(name) %>% 
  summarize(level=mean(level), trend=mean(trend*100)) %>% 
  filter(level>0 & trend > -10 & level<500) %>% 
  na.omit() %>% 
  mutate(name=str_replace_all(name, ".US","")) %>% 
  mutate(name=str_replace_all(name ,"[[:punct:]]"," ")) %>% 
    x = trend,
    y = level
  ) -> trends

#Dinamic chart as The Economist
highchart() %>% 
  hc_title(text = "The Hype Bubble Map for Dog Breeds") %>%
  hc_subtitle(text = "According Last 6 Months of Google Searchings") %>% 
  hc_xAxis(title = list(text = "Trend"), labels = list(format = "{value}%")) %>% 
  hc_yAxis(title = list(text = "Level")) %>% 
  hc_add_theme(hc_theme_economist()) %>%
  hc_add_series(data = list.parse3(trends), type = "bubble", showInLegend=FALSE, maxSize=40) %>% 
  hc_tooltip(formatter = JS("function(){
                            return ('<b>Trend: </b>' + Highcharts.numberFormat(this.x, 2)+'%' + '<br><b>Level: </b>' + Highcharts.numberFormat(this.y, 2) + '<br><b>Breed: </b>' + this.point.name)

#Static chart
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="gray25", size=15),
  axis.text.x = element_text(colour="gray25", size=15),
  text = element_text(size=20),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 30))
ggplot(trends, aes(x=x/100, y=y, label=name), guide=FALSE)+
  geom_point(colour="white", fill="darkorchid2", shape=21, alpha=.3, size=9)+
  scale_x_continuous(limits=c(-.02,.02), labels = percent)+
  labs(title="The Hype Bubble Map for Dog Breeds",
  geom_text(data=subset(trends, x> .2 & y > 50), size=4, colour="gray25")+
  geom_text(data=subset(trends, x > .7), size=4, colour="gray25")+opts

The Coaster Maker by Shiny

The word you invented is well formed and could be used in the Italian language (The Accademia della Crusca regarding to the word “Petaloso”, recently invented by an eight-year-old boy)

Are you tired of your old coasters? Do you like to make things by your own? Do you have a PC and a printer at home? If you answered yes to all these questions, just follow these simple instructions:

  • Install R and RStudio in your PC
  • Open RStudio and create a new Shiny Web App multiple file (ui.R/server.R)
  • Substitute sample code of each file by the code below
  • Press Run App
  • Press buttom Get your coaster! until you obtain a image you like
  • Print the image
  • Cut out the image
  • Place on the coaster your favorite drinking

These are some examples:

This is the code of ui.R

# This is the user-interface definition of a Shiny web application. You can
# run the application by clicking 'Run App' above.
# Find out more about building applications with Shiny here:
#    http://shiny.rstudio.com/
  titlePanel("The coaster maker"),

      # adding the new div tag to the sidebar
      tags$div(class="header", checked=NA,
               tags$p("This coasters are generated by hypocycloid curves.The curve is formed by the locus of a point,
                      attached to a circle, that rolls on the inside of another circle.
                      In the curve's equation the first part denotes the relative position between the two circles,
                      the second part denotes the rotation of the rolling circle.")),
      tags$div(class="header", checked=NA,

More info <a href=\"http://www.2dcurves.com/roulette/rouletteh.html#rhodon\">here</a>")
      actionButton('rerun','Get your coaster!')

This is the code of server.R

# This is the server logic of a Shiny web application. You can run the
# application by clicking 'Run App' above.
# Find out more about building applications with Shiny here:
#    http://shiny.rstudio.com/
CreateDS = function ()
  t=seq(-31*pi, 31*pi, 0.002)
  a=sample(seq(from=1/31, to=29/31, by=2/31), 1)
  b=runif(1, min = 1, max = 3)
  data.frame(x=(1-a)*cos(a*t)+a*b*cos((1-a)*t), y=(1-a)*sin(a*t)-a*b*sin((1-a)*t))
shinyServer(function(input, output) {
  dat<-reactive({if (input$rerun) dat=CreateDS() else dat=CreateDS()})
      geom_point(data=data.frame(x=0,y=0), aes(x,y), color=rgb(rbeta(1, .5, .5), rbeta(1, .5, .5), rbeta(1, .5, .5)) , shape=19, fill="yellow", size=220)+
      geom_polygon(aes(x, y), fill=rgb(rbeta(1, 2, 2), rbeta(1, 2, 2), rbeta(1, 2, 2))) +
            panel.background = element_rect(fill="white"),
  }, height = 500, width = 500)


The world is full of wonderful things, like sunflowers (Machanguito, my islander friend)

Sunflower seeds are arranged following a mathematical pattern where golden ratio plays a starring role. There are tons of web sites explaining this amazing fact. In general, the arrangement of leaves on a plant stem are ruled by spirals. This fact is called phyllotaxis, and I did this experiment about it some time ago. Voronoi tessellation originated by points arranged according the golden angle spiral give rise to this sunflowers:sunflowers

I know this drawing will like to my friend Machanguito because he loves sunflowers. He also loves dancing, chocolate cookies, music and swimming in the sea. Machanguito loves life, it is just that simple. He is also a big defender of renewable energy and writes down his thoughts on recycled papers. You can follow his adventures here.

This is the code:

opt = theme(legend.position  = "none",
            panel.background = element_rect(fill="red4"),
            axis.ticks       = element_blank(),
            panel.grid       = element_blank(),
            axis.title       = element_blank(),
            axis.text        = element_blank())
CreateSunFlower <- function(nob=500, dx=0, dy=0) {   data.frame(r=sqrt(1:nob), t=(1:nob)*(3-sqrt(5))*pi) %>%
    mutate(x=r*cos(t)+dx, y=r*sin(t)+dy)
g=seq(from=0, by = 45, length.out = 4)
jitter(g, amount=2) %>%
  expand.grid(jitter(g, amount=2)) %>%
  apply(1, function(x) CreateSunFlower(nob=round(jitter(220, factor=15)), dx=x[1], dy=x[2])) %>%
  do.call("rbind", .) %>% deldir() %>% .$dirsgs -> sunflowers
ggplot(sunflowers) +
  geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2), color="greenyellow") +

A Silky Drawing and a Tiny Experiment

It is a capital mistake to theorize before one has data (Sherlock Holmes, A Scandal in Bohemia)

One of my favorite entertainments is drawing things: crazy curves, imaginary flowerscelestial bodies, fractalic acacias … but sometimes I wonder myself if these drawings result interesting to whom arrive to my blog. One way to define interesting could be wanting to reproduce the drawing. I know some people do it because they sometimes share with me their creations. So, how many people appreciate the code I write? I manage some a priori for this number (which I will maintain for myself) but I want to refine my estimation with the next experiment. I have done this drawing, which shows that simple mathematics can produce very nice patterns:


To estimate how many people is really interested in this plot, at the end of the post I will publish all the code except for a line. If you want the line, you will have to ask it to me. How? It is very easy: you will have to send me a direct message in Twitter. If you don’t follow me, do it here and I will follow you back. If you already follow me but I don’t, tweet something mentioning me and I will follow you back. Then you will be able to send me the direct message. If you prefer, you can send me an email. You can find my email address here.

I know this experiment can be quite biased, but I am also pretty sure that the resulting estimation will be much better than the one I manage nowadays. This is the kidnapped code:

opt = theme(legend.position  = "none",
panel.background = element_rect(fill="violetred4"),
axis.ticks       = element_blank(),
panel.grid       = element_blank(),
axis.title       = element_blank(),
axis.text        = element_blank())
seq(from=-10, to=10, by = 0.05) %>%
expand.grid(x=., y=.) %>%
geom_point(alpha=.1, shape=20, size=1, color="white") + opt

The Gender of Big Data

When I grow up I want to be a dancer (Carmen, my beautiful daughter)

The presence of women in positions of responsibility inside Big Data companies is quite far of parity: while approximately 5o% of world population are women, only 7% of CEOs of Top 100 Big Data Companies are.

To do this experiment, I did some webscraping to download the list of big data companies from here. I also used a very interesting package called genderizeR, which makes gender prediction based on first names (more info here).

Here you have the code:

paste0("http://www.crn.com/slide-shows/data-center/300076704/2015-big-data-100-business-analytics.htm/pgno/0/", 1:45) %>%
c(., paste0("http://www.crn.com/slide-shows/data-center/300076709/2015-big-data-100-data-management.htm/pgno/0/",1:30)) %>%
c(., paste0("http://www.crn.com/slide-shows/data-center/300076740/2015-big-data-100-infrastructure-tools-and-services.htm/pgno/0/",1:25)) -> webpages
for(x in webpages)
read_html(x) %>% html_nodes("p:nth-child(1)") %>% .[[2]] %>% html_text() -> Company
read_html(x) %>% html_nodes("p:nth-child(2)") %>% .[[1]] %>% html_text() -> Executive
results=rbind(results, data.frame(Company, Executive))
results=data.frame(lapply(results, as.character), stringsAsFactors=FALSE)
results[74,]=c("Trifacta", "Top Executive: CEO Adam Wilson")
results %>% mutate(Name=gsub("Top|\\bExec\\S*|\\bCEO\\S*|President|Founder|and|Co-Founder|\\:", "", Executive)) %>%
mutate(Name=word(str_trim(Name))) -> results
results %>%
select(Name) %>%
findGivenNames() %>%
filter(probability > 0.9 & count > 15) %>%
as.data.frame() -> data
data %>% group_by(gender) %>% summarize(Total=n()) -> dat
legend="{ position: 'bottom', textStyle: {fontSize: 10}}",
title='TOP 100 BIG DATA COMPANIES 2015
Gender of CEOs',

The Unbereable Insolence of Prime Numbers or (Playing to be Ulam)

So rock me mama like a wagon wheel, rock me mama anyway you feel (Wagon Wheel, Old Crow Medicine Show)

This is the third iteration of Hilbert curve. I placed points in its corners. Since the curve has beginning and ending, I labeled each vertex with the order it occupies:hilbert_primes3_1Dark green vertex are those labeled with prime numbers and light ones with non-prime. This is the sixth iteration colored as I described before (I removed lines and labels):hilbert_primes6_2

Previous plot has 4.096 points. There are 564 primes lower than 4.096. What If I color 564 points randomly instead coloring primes? This is an example:
Do you see any difference? I do. Let me place both images together (on the left, the one with primes colored):

The dark points are much more ordered in the first plot. The second one is more noisy. This is my particular tribute to Stanislaw Ulam and its spiral: one of the most amazing fruits of boredom in the history of mathematics.

This is the code:

          panel.background = element_rect(fill="white"),
hilbert = function(m,n,r) {
  for (i in 1:n)
    tmp=cbind(t(m), m+nrow(m)^2)
    m=rbind(tmp, (2*nrow(m))^r-tmp[nrow(m):1,]+1)
  melt(m) %>% plyr::rename(c("Var1" = "x", "Var2" = "y", "value"="order")) %>% arrange(order)}
iter=3 #Number of iterations
df=hilbert(m=matrix(1), n=iter, r=2)
df %>%  mutate(prime=order %in% subprimes,
               random=sample(x=c(TRUE, FALSE), size=nrow(df), prob=c(length(subprimes),(nrow(df)-length(subprimes))), replace = TRUE)) -> df
#Labeled (primes colored)
ggplot(df, aes(x, y, colour=prime)) +
  geom_path(color="gray75", size=3)+
  scale_colour_manual(values = c("olivedrab1", "olivedrab"))+
  scale_x_continuous(expand=c(0,0), limits=c(0,2^iter+1))+
  scale_y_continuous(expand=c(0,0), limits=c(0,2^iter+1))+
  geom_text(aes(label=order), size=8, color="white")+
#Non labeled (primes colored)
ggplot(df, aes(x, y, colour=prime)) +
  scale_colour_manual(values = c("olivedrab1", "olivedrab"))+
  scale_x_continuous(expand=c(0,0), limits=c(0,2^iter+1))+
  scale_y_continuous(expand=c(0,0), limits=c(0,2^iter+1))+
#Non labeled (random colored)
ggplot(df, aes(x, y, colour=random)) +
  scale_colour_manual(values = c("olivedrab1", "olivedrab"))+
  scale_x_continuous(expand=c(0,0), limits=c(0,2^iter+1))+
  scale_y_continuous(expand=c(0,0), limits=c(0,2^iter+1))+

Going Bananas With Hilbert

It seemed that everything is in ruins, and that all the basic mathematical concepts have lost their meaning (Naum Vilenkin, Russian mathematician, regarding to the discovery of Peano’s curve)

Giuseppe Peano found in 1890 a way to draw a curve in the plane that filled the entire space: just a simple line covering completely a two dimensional plane. Its discovery meant a big earthquake in the traditional structure of mathematics. Peano’s curve was the first but not the last: one of these space-filling curves was discovered by Hilbert and takes his name. It is really beautiful:

Hilbert’s curve can be created iteratively. These are the first six iterations of its construction:

As you will see below, R code to create Hilbert’s curve is extremely easy. It is also very easy to play with the curve, altering the order in which points are sorted. Changing the initial matrix(1) by some other number, resulting curves are quite appealing:

Let’s go futher. Changing ggplot geometry from geom_path to geom_polygon generate some crazy pseudo-tessellations:

And what if you change the matrix exponent?

And what if you apply polar coordinates?

We started with a simple line and with some small changes we have created fantastical images. And all these things only using black and white. Do you want to add some colors? Try with the following code (if you draw something interesting, please let me know):

          panel.background = element_rect(fill="white"),
hilbert = function(m,n,r) {
  for (i in 1:n)
    tmp=cbind(t(m), m+nrow(m)^2)
    m=rbind(tmp, (2*nrow(m))^r-tmp[nrow(m):1,]+1)
  melt(m) %>% plyr::rename(c("Var1" = "x", "Var2" = "y", "value"="order")) %>% arrange(order)}
# Original
ggplot(hilbert(m=matrix(1), n=1, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=2, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=3, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=4, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(1), n=6, r=2), aes(x, y)) + geom_path()+ opt
# Changing order
ggplot(hilbert(m=matrix(.5), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(0), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(tan(1)), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(3), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(-1), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(log(.1)), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(-15), n=5, r=2), aes(x, y)) + geom_path()+ opt
ggplot(hilbert(m=matrix(-0.001), n=5, r=2), aes(x, y)) + geom_path()+ opt
# Polygons
ggplot(hilbert(m=matrix(log(1)), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(.5), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(tan(1)), n=5, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-15), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-25), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(0), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(1000000), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-1), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-.00001), n=4, r=2), aes(x, y)) + geom_polygon()+ opt
# Changing exponent
gplot(hilbert(m=matrix(log(1)), n=4, r=-1), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(.5), n=4, r=-2), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(tan(1)), n=4, r=6), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-15), n=3, r=sin(2)), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-25), n=4, r=-.0001), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(0), n=4, r=200), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(1000000), n=3, r=.5), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-1), n=4, r=sqrt(2)), aes(x, y)) + geom_polygon()+ opt
ggplot(hilbert(m=matrix(-.00001), n=4, r=52), aes(x, y)) + geom_polygon()+ opt
# Polar coordinates
ggplot(hilbert(m=matrix(1), n=4, r=2), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(-1), n=5, r=2), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(.1), n=2, r=.5), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(1000000), n=2, r=.1), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(.25), n=3, r=3), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(tan(1)), n=5, r=1), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(1), n=4, r=1), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(log(1)), n=3, r=sin(2)), aes(x, y)) + geom_polygon()+ coord_polar()+opt
ggplot(hilbert(m=matrix(-.0001), n=4, r=25), aes(x, y)) + geom_polygon()+ coord_polar()+opt

A Checkpoint Of Spanish Football League

I am an absolute beginner, but I am absolutely sane (Absolute Beginners, David Bowie)

Some time ago I wrote this post, where I predicted correctly the winner of the Spanish Football League several months before its ending. After thinking intensely about taking the risk of ruining my reputation repeating the analysis, I said “no problem, Antonio, do it again: in the end you don’t have any reputation to keep”. So here we are.

From a technical point of view there are many differences between both analysis. Now I use webscraping to download data, dplyr and pipes to do transformations and interactive D3.js graphs to show results. I think my code is better now and it makes me happy.

As I did the other time, Bradley-Terry Model gives an indicator of  the power of each team, called ability, which provides a natural mechanism for ranking teams. This is the evolution of abilities of each team during the championship (last season was played during the past weekend):


Although it is a bit messy, the graph shows two main groups of teams: on the one hand, Barcelona, Atletico de Madrid, Real Madrid and Villareal; on the other hand, the rest. Let’s have a closer look to evolution of the abilities of the top 4 teams:


While Barcelona, Atletico de Madrid and Real Madrid walk in parallel,  Villareal seems to be a bit stacked in the last seasons; the gap between them and Real Madrid is increasing little by little. Maybe is the Zidane’s effect. It is quite interesting discovering what teams are increasing their abilities: they are Malaga, Eibar and Getafe. They will probably finish the championship in a better position than they have nowadays (Eibar could reach fifth position):


What about Villareal? Will they go up some position? I don’t think so. This plot shows the probability of winning any of the top 3:


As you can see, probability is decreasing significantly. And what about Barcelona? Will win? It is a very difficult question. They are almost tied with Atletico de Madrid, and only 5 and 8 points above Real Madrid and Villareal. But it seems Barcelona keep them at bay. This plot shows the evolution of the probability of be beaten by Atletico, Real Madrid and Villareal:


All probabilities are under 50% and decreasing (I supposed a scoring of 2-0 for Barcelona in the match against Sporting of season 16 that was postponed to next February 17th).

Data science is a profession for brave people so it is time to do some predictions. These are mine, ordered by likelihood:

  • Barcelona will win, followed by Atletico (2), Real Madrid (3), Villareal (4) and Eibar (5)
  • Malaga and Getafe will go up some positions
  • Next year I will do the analysis again

Here you have the code:

for (i in 1:nseasons)
  html(webpage) %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table(header=FALSE, fill=TRUE) %>%
    mutate(X4=i) %>%
colnames(results)=c("home", "score", "visiting", "season")
results %>% 
  mutate(home     = iconv(home,     from="UTF8",to="ASCII//TRANSLIT"),
         visiting = iconv(visiting, from="UTF8",to="ASCII//TRANSLIT")) %>%
  #filter(grepl("-", score)) %>%
  mutate(score=replace(score, score=="18:30 - 17/02/2016", "0-2")) %>% # resultado fake para el Barcelona
  mutate(score_home     = as.numeric(str_split_fixed(score, "-", 2)[,1])) %>%
  mutate(score_visiting = as.numeric(str_split_fixed(score, "-", 2)[,2])) %>%
  mutate(points_home     =ifelse(score_home > score_visiting, 3, ifelse(score_home < score_visiting, 0, 1))) %>%
  mutate(points_visiting =ifelse(score_home > score_visiting, 0, ifelse(score_home < score_visiting, 3, 1))) -> data
prob_BT=function(x, y) {exp(x-y) / (1 + exp(x-y))}
for (i in 13:nseasons)
  data %>% filter(season<=i) %>%
    BTm(cbind(points_home, points_visiting), home, visiting, data=.) -> footballBTModel
  BTabilities(footballBTModel) %>%
  as.data.frame()  -> tmp 
  cbind(tmp, as.character(rownames(tmp)), i) %>% 
  mutate(ability=round(ability, digits = 2)) %>%
  rbind(BTabilities) -> BTabilities
colnames(BTabilities)=c("ability", "s.e.", "team", "season")
sort(unique(BTabilities[,"team"])) -> teams
for (i in 13:nseasons)
  BTabilities[BTabilities$season==i,1] %>% outer( ., ., prob_BT) -> tmp
  cbind(melt(tmp),i) %>% rbind(BTprobabilities) -> BTprobabilities
colnames(BTprobabilities)=c("team1", "team2", "probability", "season")
BTprobabilities %>% 
  filter(team1=="Villarreal") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team2 %in% c("R. Madrid", "Barcelona", "Atletico")) -> BTVillareal
BTprobabilities %>% 
  filter(team2=="Barcelona") %>% 
  mutate(probability=round(probability, digits = 2)) %>%
  filter(team1 %in% c("R. Madrid", "Villarreal", "Atletico")) -> BTBarcelona
AbilityPlot <- nPlot(
  ability ~ season, 
  data = BTabilities, 
  group = "team",
  type = "lineChart")
AbilityPlot$yAxis(axisLabel = "Estimated Ability", width = 62)
AbilityPlot$xAxis(axisLabel = "Season")
VillarealPlot <- nPlot(
  probability ~ season, 
  data = BTVillareal, 
  group = "team2",
  type = "lineChart")
VillarealPlot$yAxis(axisLabel = "Probability of beating", width = 62)
VillarealPlot$xAxis(axisLabel = "Season")
BarcelonaPlot <- nPlot(
  probability ~ season, 
  data = BTBarcelona, 
  group = "team1",
  type = "lineChart")
BarcelonaPlot$yAxis(axisLabel = "Probability of being beaten", width = 62)
BarcelonaPlot$xAxis(axisLabel = "Season")

Climatic Change At A Glance

Mmm. Lost a planet, Master Obi-Wan has. How embarrassing (Yoda, Attack Of The Clones)

Some time ago I published this post in KDnuggets in which I analyze historical temperatures to show how we are gradually heading toward a warmer planet. Simple data science to obtain a simple (and increasingly accepted) conclusion: the global warming is real. Despite I was criticized I still believe what I said then: you don’t have to be a climatologist to empirically confirm global warming.

This experiment is another example of that. It is still simpler than that since it is only based on visual perception but I think is also quite conclusive. In this case, I represent U.S. temperature outliers from 1964 to 2013; a map per year. Dataset contains station ID, name, min/max temperature, as well as degree coordinates of the recorded weather. Original weather data collected from NOAA and anomalies analysis by Enigma. You can download data here.

Anomalies are divided into four categories: Strong Hot, Weak Hot, Weak Cold and Strong Cold. For each station, I represent difference between number of Cold and Hot anomalies (independently of the strength) so Blue bubbles represent stations where total number of Cold anomalies during the year is greater that total number of Hot ones and Red ones represent the opposite. Size of bubbles is also proportional to this indicator. As an example, following you can see the map of year 1975:

It seems 1975 was hot in the right a cold on the left side. Concretely, in TONOPAH Station (Nevada) were registered 30 anomalies and most of them (26) where due to cold temperatures. This is why bubble is blue. This GIF shows the evolution of all these maps from 1964 to 2013:


Maybe it is just my personal feeling but don’t you see how red bubbles are gradually winning to blue ones? Maybe I am a demagogue.

This code generates a dynamic map by year in html format:

anomalies = fread("enigma-enigma.weather.anomalies.outliers-1964-2013-05ce047dbf3e67f83db9ae841545a333.csv")
anomalies %>%
  mutate(year=substr(date_str, 1, 4)) %>%
  group_by(year, longitude, latitude, id, station_name) %>%
    Strong_Hot=sum(str_count(type,"Strong Hot")),
    Weak_Hot=sum(str_count(type,"Weak Hot")),
    Weak_Cold=sum(str_count(type,"Weak Cold")),
    Strong_Cold=sum(str_count(type,"Strong Cold")),
    total=n()) %>%
  mutate(score=sign(-Strong_Hot-Weak_Hot+Weak_Cold+Strong_Cold)) %>%
  mutate(color=ifelse(score==1, "Blue",ifelse(score==0, "White", "Red"))) -> anomalies2
for (i in unique(anomalies2$year))
  anomalies2 %>%
    filter(year==i) %>%
    leaflet() %>%
    fitBounds(-124, 34, -62, 40) %>%
    addProviderTiles("Stamen.TonerLite") %>%
    addCircleMarkers(lng = ~longitude,
                     lat = ~latitude,
                     radius = ~ifelse(total < 20, 2, ifelse(total < 27, 4, 8)),
                     color= ~color,
                     fillOpacity = 0.5,
                     popup = ~paste(sep = "
", paste0("<b>", station_name, "</b>"),
                                    paste0("Strong Hot: ", Strong_Hot),
                                    paste0("Weak Hot: ", Weak_Hot),
                                    paste0("Weak Cold: ", Weak_Cold),
                                    paste0("Strong Cold: ", Strong_Cold))) -> m
    saveWidget(m, file=paste0("m", i, ".html"))

Phyllotaxis By Shiny

Antonio, you don’t know what empathy is! (Cecilia, my beautiful wife)

Spirals are nice. In the wake of my previous post I have done a Shiny app to explore patterns generated by changing angle, shape and number of points of Fermat’s spiral equation. You can obtain an almost infinite number of images. This is just an example:

I like thinking in imaginary flowers. This is why I called this experiment Phyllotaxis.
More examples:

Just one comment about code: I did the Shiny in just one R file as this guy suggested me some time ago because of this post.

This is the code. Do your own imaginary flowers:

CreatePlot = function (ang=pi*(3-sqrt(5)), nob=150, siz=15, alp=0.8, sha=16, col="black", bac="white") {
ggplot(data.frame(r=sqrt(1:nob), t=(1:nob)*ang*pi/180), aes(x=r*cos(t), y=r*sin(t)))+
geom_point(colour=col, alpha=alp, size=siz, shape=sha)+
scale_x_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4))+
scale_y_continuous(expand=c(0,0), limits=c(-sqrt(nob)*1.4, sqrt(nob)*1.4))+
panel.background = element_rect(fill=bac),
ui = fluidPage(
titlePanel("Phyllotaxis by Shiny"),
selectInput("col", label = "Colour of points:", choices = colors(), selected = "black"),
selectInput("bac", label = "Background colour:", choices = colors(), selected = "white"),
selectInput("sha", label = "Shape of points:",
choices = list("Empty squares" = 0, "Empty circles" = 1, "Empty triangles"=2,
"Crosses" = 3, "Blades"=4, "Empty diamonds"=5,
"Inverted empty triangles"=6, "Bladed squares"=7,
"Asterisks"=8, "Crosed diamonds"=9, "Crossed circles"=10,
"Stars"=11, "Cubes"=12, "Bladed circles"=13,
"Filled squares" = 15, "Filled circles" = 16, "Filled triangles"=17,
"Filled diamonds"=18), selected = 16),
sliderInput("ang", label = "Angle (degrees):", min = 0, max = 360, value = 180*(3-sqrt(5)), step = .05),
sliderInput("nob", label = "Number of points:", min = 1, max = 1500, value = 60, step = 1),
sliderInput("siz", label = "Size of points:", min = 1, max = 60, value = 10, step = 1),
sliderInput("alp", label = "Transparency:", min = 0, max = 1, value = .5, step = .01)
server = function(input, output) {
CreatePlot(ang=input$ang, nob=input$nob, siz=input$siz, alp=input$alp, sha=as.numeric(input$sha), col=input$col, bac=input$bac)
}, height = 650, width = 650 )}