Monthly Archives: April 2015

Analysing The Rock ‘n’ Roll Madrid Marathon

Nobody’s going to win all the time. On the highway of life you can’t always be in the fast lane (Haruki Murakami, What I Talk About When I Talk About Running)

I started running two years ago and one if my dreams is to run a marathon someday. One month ago I run my first half marathon so this day is become closer but I am in no hurry to do it. Meanwhile, I prefer analysing the results of the last edition of the Rock ‘n’ Roll Madrid marathon which, by the way, will be hold again next weekend. This is the first time I do webscraping to download data from a website and it has been quite easy thanks to rvest package.

Once I go over this website form, I have timings of every runner (more than 11.000) at 5, 10, 15, 20, 21.097, 25, 30, 35, 40, and 42.195 kilometers. This, togheter with the category of each runner is the base for my analysis. I remove categories with a small number of runners.

This is the Box plot of the finish time by category:

Finish Time

Who maintains the rhythm better? Since I have the time at the end and at the middle (21.097 km), I can do a Box plot with the variation time between first and second half of the Marathon:

Variation Time

Only a handful of people pull out of the Marathon before finishing but once again this rate is not the same between categories. This is the survival rate by category along the race:

Survival Rate

This plots show some interesting things:

  • Fastest category is 35-40 years old for both genders
  • Fastest individuals are inside 24-35 years old for both genders
  • Youngest ages are not the fastest
  • Between 40-45 category is the second fastest for men and the third one for women
  • Females between 40-45 keep the most constant rhythm of all categories.
  • Young men between 22-24 years old are the most unconstant: their second half rhythm is much slower than the first one.
  • All females between 55-60 years old ended the marathon
  • On the other hand, males between 60-65 are the category with most ‘deads’ during the race

Long life forties:

library(rvest)
library(lubridate)
library(ggplot2)
library(plyr)
library(sqldf)
library(scales)
library(gplots)
setwd("YOUR WORKING DIRECTORY HERE")
maraton_web="http://www.maratonmadrid.org/resultados/clasificacion.asp?carrera=13&parcial=par1&clasificacion=1&dorsal=&nombre=&apellidos=&pais=&pagina=par2"
#Grid with parameters to navigate in the web to do webscraping
searchdf=rbind(expand.grid( 1, 0:32), expand.grid( 2, 0:55), expand.grid( 3, 0:56), expand.grid( 4, 0:56),
               expand.grid( 5, 0:56), expand.grid( 6, 0:55), expand.grid( 7, 0:55), expand.grid( 8, 0:55),
               expand.grid( 9, 0:53), expand.grid(10, 0:55))
colnames(searchdf)=c("parcial", "pagina")
#Webscraping. I open the webpage and download the related table with partial results
results=data.frame()
for (i in 1:nrow(searchdf))
{
  maraton_tmp=gsub("par2", searchdf[i,2], gsub("par1", searchdf[i,1], maraton_web))
  df_tmp=html(maraton_tmp) %>% html_nodes("table") %>% .[[3]] %>% html_table()
  results=rbind(results, data.frame(searchdf[i,1], df_tmp[,1:7]))
}
#Name the columns
colnames(results)=c("Partial", "Place", "Bib", "Name", "Surname", "Cat", "Gross", "Net")
#Since downloadind data takes time, I save results in RData format
save(results, file="results.RData")
load("results.RData")
#Translate Net timestamp variable into hours
results$NetH=as.numeric(dhours(hour(hms(results$Net)))+dminutes(minute(hms(results$Net)))+dseconds(second(hms(results$Net))))/3600
results$Sex=substr(results$Cat, 3, 4)
#Translate Cat into years and gender
results$Cat2=revalue(results$Cat, 
                     c("A-F"="18-22 Females", "A-M"="18-22 Males", "B-F"="22-24 Females", "B-M"="22-24 Males", "C-F"="24-35 Females", "C-M"="24-35 Males", 
                       "D-F"="35-40 Females", "D-M"="35-40 Males", "E-F"="40-45 Females", "E-M"="40-45 Males", "F-F"="45-50 Females", "F-M"="45-50 Males", 
                       "G-F"="50-55 Females", "G-M"="50-55 Males", "H-F"="55-60 Females", "H-M"="55-60 Males", "I-F"="60-65 Females", "I-M"="60-65 Males", 
                       "J-F"="65-70 Females", "J-M"="65-70 Males", "K-F"="+70 Females", "K-M"="+70 Males"))
#Translate partial code into kilometers
results$PartialKm=mapvalues(results$Partial, from = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), to = c(5, 10, 15, 20, 21.097, 25, 30, 35, 40, 42.195))
#There are some categories with very few participants. I will remove from the analysis
count(subset(results, Partial==10), "Cat")
#General options for ggplot
opts=theme(
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=15),
  axis.title = element_text(size=20, colour="gray10"),
  legend.key = element_blank(),
  legend.background = element_blank(),
  plot.title = element_text(size = 32, colour="gray10"))
#Data set with finish times
results1=subset(results, Partial==10 & !(Cat %in% c("A-F", "A-M", "B-F", "I-F", "J-F", "K-F", "K-M")))
ggplot(results1, aes(x=reorder(Cat2, NetH, FUN=median), y=NetH)) + geom_boxplot(aes(fill=Sex), colour = "gray25")+
  scale_fill_manual(values=c("hotpink", "lawngreen"), name="Sex", breaks=c("M", "F"), labels=c("Males", "Females"))+
  labs(title="Finish Time by Category of Rock 'n' Roll Madrid Marathon 2014", x="Category (age and sex)", y="Finish time (hours)")+
  theme(axis.text.x = element_text(angle = 90, vjust=.5, hjust = 0), legend.justification=c(1,0), legend.position=c(1,0))+opts
#Data set with variation times
results2=sqldf("SELECT 
               a.Bib, a.Sex, b.Cat2, (a.NetH-2*b.NetH)*60 as VarMin
               FROM results1 a INNER JOIN results b ON (a.Bib = b.Bib AND b.Partial=5) order by VarMin asc")
ggplot(results2, aes(x=reorder(Cat2, VarMin, FUN=median), y=VarMin)) + geom_boxplot(aes(fill=Sex))+
  scale_fill_manual(values=c("hotpink", "lawngreen"), name="Sex", breaks=c("M", "F"), labels=c("Males", "Females"))+
  labs(title="Time Variation by Category Between First and Last Half\nof Rock 'n' Roll Madrid Marathon 2014", x="Category (age and sex)", y="Variation (minutes)")+
  theme(axis.text.x = element_text(angle = 90, vjust=.5, hjust = 0), legend.justification=c(1,0), legend.position=c(1,0))+opts

results3_tmp1=expand.grid(Cat2=unique(results$Cat2), PartialKm=unique(results$PartialKm))
results3_tmp2=sqldf("SELECT Bib, Sex, Cat2, Max(PartialKm) as PartialKmMax FROM results 
                   WHERE Cat NOT IN ('A-F', 'A-M', 'B-F', 'I-F', 'J-F', 'K-F', 'K-M') GROUP BY 1,2,3")
results3_tmp3=sqldf("SELECT PartialKmMax, Sex, Cat2, COUNT(*) AS Runners FROM results3_tmp2 GROUP BY 1,2,3")
results3_tmp4=sqldf("SELECT a.Cat2, a.PartialKm, SUM(Runners) as Runners FROM results3_tmp1 a INNER JOIN results3_tmp3 b
                    on (a.Cat2 = b.Cat2 AND b.PartialKmMax>=a.PartialKm) 
                    GROUP BY 1,2")
#Data set with survival rates
results3=sqldf("SELECT a.Cat2, a.PartialKm, a.Runners*1.00/b.Runners*1.00 as Po_Survivors
               FROM results3_tmp4 a INNER JOIN (SELECT Cat2, COUNT(*) as Runners FROM results3_tmp2 GROUP BY 1) b
               ON (a.Cat2 = b.Cat2)")
ggplot(results3, aes(x=PartialKm, y=Po_Survivors, group=Cat2, colour=Cat2)) + geom_line(lwd=3)+
  scale_color_manual(values=alpha(rich.colors(15, palette="temperature"), 0.3), name="Category")+
  scale_x_continuous(breaks = unique(results3$PartialKm), labels=c("5", "10", "15", "20", "21.097", "25", "30", "35", "40", "42.195"))+
  scale_y_continuous(labels = percent)+
  labs(title="Survival Rate by Category of Rock 'n' Roll Madrid Marathon 2014", x="Kilometer", y="% of survivors")+
  theme(axis.text.x = element_text(angle = 90, vjust=.5, hjust = 0), legend.justification=c(1,0), legend.position=c(.15,0))+opts

Discovering Shiny

It is not an experiment if you know it is going to work (Jeff Bezos)

From time to time, I discover some of my experiments translated into Shiny Apps, like this one. Some days ago, I discovered one of these translations and I contacted the author, who was a guy from Vietnam called Vu Anh. I asked him to do a Shiny App from this experiment. Vu was enthusiastic with the idea. We defined some parameters to play with shape, number, width and alpha of lines as well as background color and I received a perfect release of the application in just a few hours.With just a handful of parameters, possible outputs are almost infinite. Following you can find some of them:

SinyCollageI think the code is a nice example to take the first steps in Shiny. If you are not used to Markdown files, you can follow this instructions to run the code.

Vu is a talented guy, who loves maths and programming. He represents the future of our nice profession and I predict a successful future for him. Do not miss his brand new blog. I am sure you will find amazing things there.

This is the code of the app:

---
title: "Maths, Music and Merkbar"
author: "Brother Rain"
date: "18/03/2015"
output: html_document
runtime: shiny
---

Original Post: https://aschinchon.wordpress.com/2015/01/15/maths-music-and-merkbar/

## Load Data

```{r}
library(circlize)
library(scales)
factors = as.factor(0:9)
lines = 2000 #Number of lines to plot in the graph
alpha = 0.4  #Alpha for color lines
colors0=c(
    rgb(239,143,121, max=255),
    rgb(126,240,188, max=255),
    rgb(111,228,235, max=255),
    rgb(127,209,249, max=255),
    rgb( 74,106,181, max=255),
    rgb(114,100,188, max=255),
    rgb(181,116,234, max=255),
    rgb(226,135,228, max=255),
    rgb(239,136,192, max=255),
    rgb(233,134,152, max=255)
)
# You can find the txt file here:
# http://www.goldennumber.net/wp-content/uploads/2012/06/Phi-To-100000-Places.txt
phi=readLines("data/Phi-To-100000-Places.txt")[5]
```

## Visualization

```{r, echo=FALSE}
fluidPage(
  fluidRow(
    column(width = 4,
        sidebarPanel(
            sliderInput("lines", "Number of lines:", min=100, max=100000, step=100, value=500), 
            sliderInput("alpha", "Alpha:", min=0.01, max=1, step=0.01, value=0.4),
            sliderInput("lwd", "Line width", min=0, max=1, step=0.05, value=0.2),
            selectInput("background", "Background:",
                c("Purple" = "mediumpurple4", "Gray" = "gray25", "Orange"="orangered4", 
                  "Red" = "red4", "Brown"="saddlebrown", "Blue"="slateblue4", 
                  "Violet"="palevioletred4", "Green"="forestgreen", "Pink"="deeppink"), selected="Purple"),
            sliderInput("h0", "h0:", min=0, max=0.4,
                    step=0.0005, value=0.1375),
           sliderInput("h1", "h1:", min=0, max=0.4,
                step=0.0005, value=0.1125),
            width=12
        )
    ),
    column(width = 8,
        renderPlot({
            # get data
            phi=gsub("\\.","", substr(phi,1,input$lines))
            phi=gsub("\\.","", phi)
            position=1/(nchar(phi)-1)
            
            # create circos
            circos.clear()
            par(mar = c(1, 1, 1, 1), lwd = 0.1,
                cex = 0.7, bg=alpha(input$background, 1))
            circos.par(
                "cell.padding"=c(0.01,0.01),
                "track.height" = 0.025,
                "gap.degree" = 3
            )
            circos.initialize(factors = factors, xlim = c(0, 1))
            circos.trackPlotRegion(factors = factors, ylim = c(0, 1))
            ## create first region
            for (i in 0:9) {
                circos.updatePlotRegion(
                    sector.index = as.character(i),
                    bg.col = alpha(input$background, 1),
                    bg.border=alpha(colors0[i+1], 1)
                )
            }
            for (i in 1:(nchar(phi)-1)) {
                m=min(as.numeric(substr(phi, i, i)), as.numeric(substr(phi, i+1, i+1)))
                M=max(as.numeric(substr(phi, i, i)), as.numeric(substr(phi, i+1, i+1)))
                d=min((M-m),((m+10)-M))
                col=t(col2rgb(colors0[(as.numeric(substr(phi, i, i))+1)]))
                for(index in 1:3){
                    col[index] = max(min(255, col[index]), 0)
                }
                if (d>0) {
                    circos.link(
                        substr(phi, i, i), position*(i-1),
                        substr(phi, i+1, i+1), position*i,
                        h = input$h0 * d + input$h1,
                        lwd=input$lwd,
                        col=alpha(rgb(col, max=255), input$alpha), rou = 0.92
                    )
                }
            }
            }, width=600, height=600, res=192)
    )
  )
)


```