A Segmentation Of The World According To Migration Flows ft. Leaflet

Up in the sky you just feel fine, there is no running out of time and you never cross a line (Up In The Sky, 77 Bombay Street)

In this post I analyze two datasets from Enigma:

  • Migration flows: Every 10 years, since 1960, the World Bank estimates migrations worldwide (267.960 rows)
  • World population: Values and percentages of populations for each nation examined beginning in year 1960, by the World Bank’s Health, Nutrition and Population project (4.168.185 rows)

Since the second dataset is very large, I load it into R using fread function of data.table package, which is extremely fast. To filter datasets, I also use dplyr and pipes of magrittr package (my life changed since I discovered it).

To build a comparable indicator across countries, I divide migration flows (from and to each country) by the mean population in each decade. I do this because migration flows are aggregated for each decade since 1960. For example, during the first decade of 21st century, Argentina reveived 1.537.850 inmigrants, which represents a 3,99% of the mean population of the country in this decade. In the same period, inmigration to Burundi only represented a 0,67% of its mean population.

What happened in the whole world in that decade? There were around 166 million people who moved to other countries. It represents a 2.58% of the mean population of the world. I use this figure to divide countries into four groups:

  • Isolated: countries with both % of inmigrants and % of migrants under 2.58%
  • Emitter: countries with % of inmigrants under 2.58% and % of migrants over 2.58%
  • Receiver: countries with % of inmigrants over 2.58% and % of migrants under 2.58%
  • Transit: countries with both % of inmigrants and % of migrants over 2.58%

To create the map I use leaflet package as I did in my previous post. Shapefile of the world can be downloaded here. This is how the world looks like according to this segmentation:

Migration Flows

Some conclusions:

  • There are just sixteen receiver countries: United Arab Emirates, Argentina, Australia, Bhutan, Botswana, Costa Rica, Djibouti, Spain, Gabon, The Gambia, Libya, Qatar, Rwanda, Saudi Arabia, United States and Venezuela
  • China and India (the two most populous countries in the world) are isolated
  • Transit countries are concentrated in the north hemisphere and most of them are located in cold latitudes
  • There are six emitter countries with more than 30% of emigrants between 2000 and 2009: Guyana, Tonga, Tuvalu, Jamaica, Bosnia and Herzegovina and Albania

This is the code you need to reproduce the map:

populflows = read.csv(file="enigma-org.worldbank.migration-remittances.migrants.migration-flow-c57405e33412118c8757b1052e8a1490.csv", stringsAsFactors=FALSE)
population = fread("enigma-org.worldbank.hnp.data-eaa31d1a34fadb52da9d809cc3bec954.csv")
# Population
population %>% 
  filter(indicator_name=="Population, total") %>% 
  as.data.frame %>% 
  mutate(decade=(year-year%%10)) %>% 
  group_by(country_name, country_code, decade) %>% 
  summarise(avg_pop=mean(value)) -> population2
# Inmigrants by country
populflows %>% filter(!is.na(total_migrants)) %>% 
  group_by(migration_year, destination_country) %>% 
  summarise(inmigrants = sum(total_migrants))  %>% 
  merge(population2, by.x = c("destination_country", "migration_year"), by.y = c("country_name", "decade"))  %>% 
  mutate(p_inmigrants=inmigrants/avg_pop) -> inmigrants
# Migrants by country
populflows %>% filter(!is.na(total_migrants)) %>% 
  group_by(migration_year, country_of_origin) %>% 
  summarise(migrants = sum(total_migrants)) %>%  
  merge(population2, by.x = c("country_of_origin", "migration_year"), by.y = c("country_name", "decade"))  %>%
  mutate(p_migrants=migrants/avg_pop) -> migrants
# Join of data sets
migrants %>% 
  merge(inmigrants, by.x = c("country_code", "migration_year"), by.y = c("country_code", "migration_year")) %>%
  filter(migration_year==2000) %>% 
  select(country_of_origin, country_code, avg_pop.x, migrants, p_migrants, inmigrants, p_inmigrants) %>% 
  plyr::rename(., c("country_of_origin"="Country", 
                    "p_inmigrants"="p.of.inmigrants")) -> populflows2000
# Threshold to create groups
populflows2000 %>% 
  summarise(x=sum(Total.migrants), y=sum(Total.inmigrants), z=sum(Population.mean)) %>% 
  mutate(m=y/z) %>% 
  select(m)  %>% 
  as.numeric -> avg
# Segmentation
populflows2000[populflows2000$p.of.migrants>avg & populflows2000$p.of.inmigrants>avg, "Group"]="Transit"
populflows2000[populflows2000$p.of.migrants<avg & populflows2000$p.of.inmigrants<avg, "Group"]="Isolated" 
populflows2000[populflows2000$p.of.migrants>avg & populflows2000$p.of.inmigrants<avg, "Group"]="Emitter" 
#Loading shapefile from http://data.okfn.org/data/datasets/geo-boundaries-world-110m 
countries=readOGR("json/countries.geojson", "OGRGeoJSON") 
# Join shapefile and enigma information 
joined=merge(countries, populflows2000, by.x="wb_a3", by.y="Country.code", all=FALSE, sort = FALSE) 
# To define one color by segment 
factpal=colorFactor(brewer.pal(4, "Dark2"), joined$Group) 
leaflet(joined) %>%
  addPolygons(stroke = TRUE, color="white", weight=1, smoothFactor = 0.2, fillOpacity = .8, fillColor = ~factpal(Group)) %>%
  addTiles() %>%
  addLegend(pal = factpal, values=c("Emitter", "Isolated", "Receiver", "Transit"))

A Simple Interactive Map Of US Prisons With Leaflet

The love of one’s country is a splendid thing. But why should love stop at the border? (Pablo Casals, Spanish cellist)

Some time ago, I discovered Enigma, an amazing open platform that unifies billions of records from thousands of government sources to make the world of public data universally accessible and useful. This is the first experiment I have done using data from Enigma. This is what I did:

  1. Create a free account, search and download data. Save the csv file in your working directory. File contains information about all prison facilities in the United States (private and state run) as recorded by the Department of Corrections in each state. Facility types, names, addresses (or lat/long coordinates) ownership names and detailed. In sum, there is information about 1.248 prison facilities.
  2. Since most of the prisons of the file do not contain geographical coordinates, I obtain latitude and longitude using geocode function from ggmap package. This step takes some time. I also remove closed facilities. Finally, I obtain a data set with complete information of 953 prison facilities.
  3. After cleaning and filling out data, generating the map is very easy using leaflet package for R. I create a column named popup_info pasting name and address to be shown in the popup. Instead using default OpenStreetMap basemap I use a CartoDB one.

In my opinion, resulting map is very appealing with a minimal effort. Since I cannot embed the map here, this is a screenshot of it:

jailsThis plot could be a good example of visual correlation, because it depends on this. Here you have the code. To see the map in your browser, press Show in new window option, a little arrow on the upper left side of the RStudio viewer window:

prisons = read.csv(file="enigma-enigma.prisons.all-facilities-bd6a927c4024c16d8ba9e21d52292b0f.csv", stringsAsFactors=FALSE)
prisons %>% 
  mutate(address=paste(facility_address1, city, state, zip, "EEUU", sep=", ")) %>%
  select(address) %>% 
  lapply(function(x){geocode(x, output="latlon")})  %>% 
  as.data.frame %>% 
  cbind(prisons) -> prisons
prisons %>%  
  mutate(popup_info=paste(sep = "<br/>", paste0("<b>", facility_name, "</b>"), facility_address1, city, state, zip)) %>% 
  mutate(lon=ifelse(is.na(longitude), address.lon, longitude),
         lat=ifelse(is.na(latitude),  address.lat, latitude)) %>%
  filter(!is.na(lon) & !grepl("CLOSED", facility_name)) -> prisons
leaflet(prisons) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(lng = ~lon, 
                   lat = ~lat, 
                   radius = 3, 
                   color = "red",
                   fillOpacity = 0.5,
                   popup = ~popup_info)

The World We Live In #5: Calories And Kilograms

I enjoy doing new tunes; it gives me a little bit to perk up, to pay a little bit more attention (Earl Scruggs, American musician)

I recently finished reading The Signal and the Noise, a book by Nate Silver, creator of the also famous FiveThirtyEight blog. The book is a very good reading for all data science professionals, and is a must in particular for all those who work trying to predict the future. The book praises the bayesian way of thinking as the best way to face and modify predictions and criticizes rigid ways of thinking with many examples of disastrous predictions. I enjoyed a lot the chapter dedicated to chess and how Deep Blue finally took over Kasparov. In a nutshell: I strongly recommend it.
One of the plots of Silver’s book present a case of false negative showing the relationship between obesity and calorie consumption across the world countries. The plot shows that there is no evidence of a connection between both variables. Since it seemed very strange to me, I decided to reproduce the plot by myself.

I compared these two variables:

  • Dietary Energy Consumption (kcal/person/day) estimated by the FAO Food Balance Sheets.
  • Prevalence of Obesity as percentage of defined population with a body mass index (BMI) of 30 kg/m2 or higher estimated by the World Health Organization

And this is the resulting plot:

Calories And KilogramsAs you can see there is a strong correlation between two variables. Why the experiment of Nate Silver shows the opposite? Obviously we did not plot the same data (although, in principle, both of us went to the same source). Anyway: to be honest, I prefer my plot because shows what all of we know: the more calories you eat, the more weight you will see in your bathroom scale. Some final thoughts seeing the plot:

  • I would like to be Japanese: they don’t gain weight!
  • Why US people are fatter than Austrian?
  • What happens in Samoa?

Here you have the code to do the plot:

url_calories = "http://www.fao.org/fileadmin/templates/ess/documents/food_security_statistics/FoodConsumptionNutrients_en.xls"
download.file(url_calories, method="internal", destfile = "FoodConsumptionNutrients_en.xls", mode = "ab")
calories = read.xlsx(file="FoodConsumptionNutrients_en.xls", startRow = 4, colIndex = c(2,6), colClasses = c("character", "numeric"), sheetName="Dietary Energy Cons. Countries", stringsAsFactors=FALSE) 
colnames(calories)=c("Country", "Kcal")
url_population = "http://esa.un.org/unpd/wpp/DVD/Files/1_Excel%20(Standard)/EXCEL_FILES/1_Population/WPP2015_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.XLS"
download.file(url_population, method="internal", destfile = "Population.xls", mode = "ab")
population = read.xlsx(file="Population.xls", startRow = 17, colIndex = c(3,71), colClasses = c("character", "numeric"), sheetName="ESTIMATES", stringsAsFactors=FALSE) 
colnames(population)=c("Country", "Population")
# http://apps.who.int/gho/data/node.main.A900A?lang=en
url_obesity = "http://apps.who.int/gho/athena/data/xmart.csv?target=GHO/NCD_BMI_30A&profile=crosstable&filter=AGEGROUP:*;COUNTRY:*;SEX:*&x-sideaxis=COUNTRY&x-topaxis=GHO;YEAR;AGEGROUP;SEX&x-collapse=true"
obesity = read.csv(file=url_obesity, stringsAsFactors=FALSE)
obesity %>% select(matches("Country|2014.*Both")) -> obesity
colnames(obesity)=c("Country", "Obesity")
obesity %>% filter(Obesity!="No data") -> obesity
obesity %>% mutate(Obesity=as.numeric(substr(Obesity, 1, regexpr(pattern = "[[]", obesity$Obesity)-1))) -> obesity
population %>% inner_join(calories,by = "Country") %>% inner_join(obesity,by = "Country") -> data
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=15),
  axis.title = element_text(size=18, colour="gray10"),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 40, colour="gray10"))
ggplot(data, aes(x=Kcal, y=Obesity/100, size=log(Population), label=Country), guide=FALSE)+
  geom_point(colour="white", fill="sandybrown", shape=21, alpha=.55)+
  scale_y_continuous(labels = percent)+
  labs(title="The World We Live In #5: Calories And Kilograms",
       x="Dietary Energy Consumption (kcal/person/day)",
       y="% population with body mass index >= 30 kg/m2")+
  geom_text(data=subset(data, Obesity>35|Kcal>3700), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Kcal<2000), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Obesity<10 & Kcal>2600), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(aes(3100, .01), colour="gray25", hjust=0, label="Source: United Nations (size of bubble depending on population)", size=4.5)+opts

Going Bananas #2: A Needle In A Haystack

Now I’m gonna tell my momma that I’m a traveller, I’m gonna follow the sun (The Sun, Parov Stelar)

Inspired by this book I read recently, I decided to do this experiment. The idea is comparing how easy is to find sequences of numbers inside Pi, e, Golden Ratio (Phi) and a randomly generated number. For example, since Pi is 3.1415926535897932384… the 4-size sequence 5358 can be easily found at the begining as well as the 5-size sequence 79323. I considered interesting comparing Pi with a random generated number. What I though before doing the experiment is that it would be easier finding sequences inside the andom one. Why? Because despite of being irrational and transcendental I thought there should be some kind of residual pattern in Pi that should make more difficult to find random sequences inside it than do it inside a randomly generated number.

  • I downloaded Pi, e and Phi from the Internet and extract first 100.000 digits of all of them. I generate a random 100.000 number on the fly.
  • I generate a representative sample of 4-size sequences
  • I look for each of these sequences inside first 5.000 digits of Pi, e, Phi and the randomly generated one. I repeat searching for first 10.000, first 15.000 and so on until I search into the whole 100.000 -size number
  • I store how many sequences I find for each searching
  • I repeat this for 5 and 6-size sequences.

At first sight, is equally easy (or difficult), to find random sequences inside all numbers: my hypothesis was wrong.

As you can see here, 100.000 digits is more than enough to find 4-size sequences. In fact, from 45.000 digits I reach 100% of successful matches:


I only find 60% of 5-size sequences inside 100.000 digits of numbers:


And only 10% of 6-size sequences:


Why these four numbers are so equal in order to find random sequences inside them? I don’t know. What I know is that if you want to find your telephone number inside Pi, you will probably need an enormous number of digits.

library(extrafont);windowsFonts(Comic=windowsFont("Comic Sans MS"))
p = html("http://www.geom.uiuc.edu/~huberty/math5337/groupe/digits.html")
f = html("http://www.goldennumber.net/wp-content/uploads/2012/06/Phi-To-100000-Places.txt")
e = html("http://apod.nasa.gov/htmltest/gifcity/e.2mil")
p %>%  
  html_text() %>% 
  substr(., regexpr("3.14",.), regexpr("Go to Historical",.)) %>% 
  gsub("[^0-9]", "", .)  %>% 
  substr(., 1, 100000) -> p
f %>%  
  html_text() %>% 
  substr(., regexpr("1.61",.), nchar(.)) %>% 
  gsub("[^0-9]", "", .) %>%  
  substr(., 1, 100000) -> f
e %>%  
  html_text() %>% 
  substr(., regexpr("2.71",.), nchar(.)) %>% 
  gsub("[^0-9]", "", .) %>% 
  substr(., 1, 100000) -> e
r = paste0(sample(0:9, 100000, replace = TRUE), collapse = "")
results=data.frame(Cut=numeric(0), Pi=numeric(0), Phi=numeric(0), e=numeric(0), Random=numeric(0))
samp=min(10^dgts*2/100, 10000)
for (i in 1:bins) {
  p0=substr(p, start=0, stop=cut)
  f0=substr(f, start=0, stop=cut)
  e0=substr(e, start=0, stop=cut)
  r0=substr(r, start=0, stop=cut)
  sample(0:(10^dgts-1), samp, replace = FALSE) %>% str_pad(dgts, pad = "0") -> comb
  comb %>% sapply(function(x) grepl(x, p0)) %>% sum() -> p1
  comb %>% sapply(function(x) grepl(x, f0)) %>% sum() -> f1
  comb %>% sapply(function(x) grepl(x, e0)) %>% sum() -> e1
  comb %>% sapply(function(x) grepl(x, r0)) %>% sum() -> r1
  results=rbind(results, data.frame(Cut=cut, Pi=p1, Phi=f1, e=e1, Random=r1))
results=melt(results, id.vars=c("Cut") , variable.name="number", value.name="matches")
  panel.background = element_rect(fill="darkolivegreen1"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="white", linetype = 1),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="black"),
  axis.text.x = element_text(colour="black"),
  text = element_text(size=20, family="Comic"),
  legend.text = element_text(size=25),
  legend.key = element_blank(),
  legend.position = c(.75,.2),
  legend.background = element_blank(),
  plot.title = element_text(size = 30))
ggplot(results, aes(x = Cut, y = matches/samp, color = number))+
  geom_line(size=1.5, alpha=.8)+
  scale_color_discrete(name = "")+
  scale_x_continuous(breaks=seq(100000/bins, 100000, by=100000/bins))+
  scale_y_continuous(labels = percent)+
  theme(axis.text.x = element_text(angle = 90, vjust=.5, hjust = 1))+
  labs(title=paste0("Finding ",dgts, "-size strings into 100.000-digit numbers"), 
       x="Cut Position", 
       y="% of Matches")+opts

The Moon And The Sun

Do not swear by the moon, for she changes constantly. Then your love would also change (William Shakespeare, Romeo and Juliet)

The sun is a big point ant the moon is a cardioid:


Here you have the code. It is a simple example of how to use ggplot:

t0=seq(from=3, to=2*n+1, by=2) %% n
df=data.frame(x1=cos((t1-1)*2*pi/n), y1=sin((t1-1)*2*pi/n), x2=cos((t2-1)*2*pi/n), y2=sin((t2-1)*2*pi/n))
panel.background = element_rect(fill="white"),
panel.grid = element_blank(),
axis.text =element_blank())
ggplot(df, aes(x = x1, y = y1, xend = x2, yend = y2)) +
geom_point(x=0, y=0, size=245, color="gold")+
geom_segment(color="white", alpha=.5)+opt

Trigonometric Pattern Design

Triangles are my favorite shape, three points where two lines meet (Tessellate, Alt-J)

Inspired by recurrence plots and by the Gauss error function, I have done the following plots. The first one represents the recurrence plot of f\left ( x \right )= sec\left ( x \right ) where distance between points is measured by Gauss error function:

sec1This one is the same for f\left ( x \right )= tag\left ( x \right )

tan1And this one represents f\left ( x \right )= sin\left ( x \right )

sin1I like them: they are elegant, attractive and easy to make. Try your own functions. One final though: the more I use magrittr package, the more I like it. This is the code for the first plot.

RecurrencePlot = function(from, to, col1, col2) {
  opt = theme(legend.position  = "none",
              panel.background = element_blank(),
              axis.ticks       = element_blank(),
              panel.grid       = element_blank(),
              axis.title       = element_blank(),
              axis.text        = element_blank()) 
  seq(from, to, by = .1) %>% expand.grid(x=., y=.) %>% 
    ggplot( ., aes(x=x, y=y, fill=erf(sec(x)-sec(y)))) + geom_tile() + 
    scale_fill_gradientn(colours=colorRampPalette(c(col1, col2))(2)) + opt}
RecurrencePlot(from = -5*pi, to = 5*pi, col1 = "black", col2= "white")

The 2D-Harmonograph In Shiny

If you wish to make an apple pie from scratch, you must first invent the universe (Carl Sagan)

I like Shiny and I can’t stop converting into apps some of my previous experiments: today is the turn of the harmonograph. This is a simple application since you only can press a button to generate a random harmonograph-simulated curve. I love how easy is to create a nice interactive app to play with from an existing code. The only trick in this case in to add a rerun button in the UI side and transfer the interaction to the server side using a simple if. Very easy. This is a screenshot of the application:


Press the button and you will get a new drawing. Most of them are nice scrawls and from time to time you will obtain beautiful shapely curves.

And no more candy words: It is time to complain. I say to RStudio with all due respect, you are very cruel. You let me to deploy my previous app to your server but you suspended it almost immediately for fifteen days due to “exceeded usage hours”. My only option is paying at least $440 per year to upgrade my current plan. I tried the ambrosia for an extremely short time. RStudio: Why don’t you launch a cheaper account? Why don’t you launch a free account with just one perpetual alive app at a time? Why don’t you increase the usage hours threshold? I can help you to calculate the return on investment of these scenarios.

Or, Why don’t you make me a gift for my next birthday? I promise to upload a new app per month to promote your stunning tool. Think about it and please let me know your conclusions.

Meanwhile I will run the app privately. This is the code to do it:


# This is the user-interface definition of a Shiny web application.
# You can find out more about building applications with Shiny here:
# http://www.rstudio.com/shiny/


  titlePanel("Mathematical Beauties: The Harmonograph"),

      # adding the new div tag to the sidebar            
      tags$div(class="header", checked=NA,
               tags$p("A harmonograph is a mechanical apparatus that employs pendulums to create a 
                       geometric image. The drawings created typically are Lissajous curves, or 
                       related drawings of greater complexity. The devices, which began to appear 
                       in the mid-19th century and peaked in popularity in the 1890s, cannot be 
                       conclusively attributed to a single person, although Hugh Blackburn, a professor 
                       of mathematics at the University of Glasgow, is commonly believed to be the official 
                       inventor. A simple, so-called \"lateral\" harmonograph uses two pendulums to control the movement 
                       of a pen relative to a drawing surface. One pendulum moves the pen back and forth along 
                       one axis and the other pendulum moves the drawing surface back and forth along a 
                       perpendicular axis. By varying the frequency and phase of the pendulums relative to 
                       one another, different patterns are created. Even a simple harmonograph as described 
                       can create ellipses, spirals, figure eights and other Lissajous figures (Source: Wikipedia)")),
               tags$div(class="header", checked=NA,
                       HTML("<p>Click <a href=\"http://paulbourke.net/geometry/harmonograph/harmonograph3.html\">here</a> to see an image of a real harmonograph</p>")
        actionButton('rerun','Launch the harmonograph!')


# This is the server logic for a Shiny web application.
# You can find out more about building applications with Shiny here:
# http://www.rstudio.com/shiny/


CreateDS = function () 
  f=jitter(sample(c(2,3),4, replace = TRUE))
  xt = function(t) exp(-d[1]*t)*sin(t*f[1]+p[1])+exp(-d[2]*t)*sin(t*f[2]+p[2])
  yt = function(t) exp(-d[3]*t)*sin(t*f[3]+p[3])+exp(-d[4]*t)*sin(t*f[4]+p[4])
  t=seq(1, 200, by=.0005)
  data.frame(t=t, x=xt(t), y=yt(t))}

shinyServer(function(input, output) {
    dat<-reactive({if (input$rerun) dat=CreateDS() else dat=CreateDS()})
   plot(rnorm(1000),xlim =c(-2,2), ylim =c(-2,2), type="n")
   with(dat(), plot(x,y, type="l", xlim =c(-2,2), ylim =c(-2,2), xlab = "", ylab = "", xaxt='n', yaxt='n', col="gray10", bty="n"))
  }, height = 650, width = 650)

Shiny Wool Skeins

Chaos is not a pit: chaos is a ladder (Littlefinger in Game of Thrones)

Some time ago I wrote this post to show how my colleague Vu Anh translated into Shiny one of my experiments, opening my eyes to an amazing new world. I am very proud to present you the first Shiny experiment entirely written by me.

In this case I took inspiration from another previous experiment to draw some kind of wool skeins. The shiny app creates a plot consisting of chords inside a circle. There are to kind of chords:

  • Those which form a track because they are a set of glued chords; number of tracks and number of chords per track can be selected using Number of track chords and Number of scrawls per track sliders of the app respectively.
  • Those forming the background, randomly allocated inside the circle. Number of background chords can be chosen as well in the app

There is also the possibility to change colors of chords. This are the main steps I followed to build this Shiny app:

  1. Write a simple R program
  2. Decide which variables to parametrize
  3. Open a new Shiny project in RStudio
  4. Analize the sample UI.R and server.R files generated by default
  5. Adapt sample code to my particular code (some iterations are needed here)
  6. Deploy my app in the Shiny Apps free server

Number 1 is the most difficult step, but it does not depends on Shiny: rest of them are easier, specially if you have help as I had from my colleague Jorge. I encourage you to try. This is an snapshot of the app:


You can play with the app here.

Some things I thought while developing this experiment:

  • Shiny gives you a lot with a minimal effort
  • Shiny can be a very interesting tool to teach maths and programming to kids
  • I have to translate to Shiny some other experiment
  • I will try to use it for my job

Try Shiny: is very entertaining. A typical Shiny project consists on two files, one to define the user interface (UI.R) and the other to define the back end side (server.R).

This is the code of UI.R:

# This is the user-interface definition of a Shiny web application.
# You can find out more about building applications with Shiny here:
# http://shiny.rstudio.com



  # Application title
  titlePanel("Shiny Wool Skeins"),
  HTML("<p>This experiment is based on <a href=\"https://aschinchon.wordpress.com/2015/05/13/bertrand-or-the-importance-of-defining-problems-properly/\">this previous one</a> I did some time ago. It is my second approach to the wonderful world of Shiny.</p>"),
  # Sidebar with a slider input for number of bins
        sliderInput("lin", label = "Number of track chords:",
                    min = 1, max = 20, value = 5, step = 1),
        sliderInput("rep", label = "Number of scrawls per track:",
                    min = 1, max = 50, value = 10, step = 1),
        sliderInput("nbc", label = "Number of background chords:",
                    min = 0, max = 2000, value = 500, step = 2),
        selectInput("col1", label = "Track colour:",
                    choices = colors(), selected = "darkmagenta"),
        selectInput("col2", label = "Background chords colour:",
                    choices = colors(), selected = "gold")

    # Show a plot of the generated distribution

And this is the code of server.R:

# This is the server logic for a Shiny web application.
# You can find out more about building applications with Shiny here:
# http://shiny.rstudio.com

shinyServer(function(input, output) {

    ini=runif(n=input$lin, min=0,max=2*pi)
    ini %>% 
      +runif(n=input$lin, min=pi/2,max=3*pi/2) %>% 
      cbind(ini, end=.) %>% 
      as.data.frame() -> Sub1
    Sub1=Sub1[rep(seq_len(nrow(Sub1)), input$rep),]
    Sub1 %>% apply(c(1, 2), jitter) %>% as.data.frame() -> Sub1
    Sub1=with(Sub1, data.frame(col=input$col1, x1=cos(ini), y1=sin(ini), x2=cos(end), y2=sin(end)))
    Sub2=runif(input$nbc, min = 0, max = 2*pi)
    Sub2=data.frame(x=cos(Sub2), y=sin(Sub2))
    Sub2=cbind(input$col2, Sub2[(1:(input$nbc/2)),], Sub2[(((input$nbc/2)+1):input$nbc),])
    colnames(Sub2)=c("col", "x1", "y1", "x2", "y2")
    rbind(Sub1, Sub2)
             panel.background = element_rect(fill="white"),
             panel.grid = element_blank(),
             axis.text =element_blank())
    p=ggplot(df())+geom_segment(aes(x=x1, y=y1, xend=x2, yend=y2), colour=df()$col, alpha=runif(nrow(df()), min=.1, max=.3), lwd=1)+opts;print(p)
  }, height = 600, width = 600 )


Simple Data Science To Maximize Return On Lottery Investment

Every finite game has an equilibrium point (John Nash, Non-Cooperative Games, 1950)

I read recently this amazing book, where I discovered that we (humans) are not capable of generating random sequences of numbers by ourselves when we play lottery. John Haigh demonstrates this fact analyzing a sample of 282 raffles of 6/49 UK Lotto. Once I read this, I decided to prove if this disability is property only of British population or if it is shared with Spanish people as well. I am Spanish, so this experiment can bring painful results to myself, but here I come.

The Spanish equivalent of 6/40 UK Lotto is called “Lotería Primitiva” (or “Primitiva”, to abbreviate). This is a ticket of Primitiva lotto:


As you can see, one ticket gives the chance to do 8 bets. Each bet consists on 6 numbers between 1 and 49 to be chosen in a grid of 10 rows by 5 columns. People tend to choose separate numbers because we think that they are more likely to come up than combinations with some consecutive numbers. We think we have more chances to get rich choosing 4-12-23-25-31-43 rather than 3-17-18-19-32-33, for instance. To be honest, I should recognize I am one of these persons.

Primitiva lotto is managed by Sociedad Estatal Loterías y Apuestas del Estado, a public business entity belonging to the Spanish Ministry of Finance and Public Administrations. They know what people choose and they could do this experiment more exactly than me. They could analyze just human bets (those made by players by themselves) and discard machine ones (those made automatically by vending machines) but anyway it is possible to confirm the previous thesis with some public data.

I analysed 432 raffles of Primitiva carried out between 2011 and 2015; for each raffle I have this information:

  • The six numbers that form the winning combination
  • Total number of bets
  • Number of bets which hit the six numbers (Observed Winners)

The idea is to compare observed winners of raffles with the expected number of them, estimated as follows:

Expected\, Winners=\frac{Total\, Bets}{C_{6}^{49}},\: where\: C_{6}^{49}=\binom{49}{6}=\frac{49!}{43!6!}

This table compare the number of expected and observed winners between raffles which contain consecutive and raffles which not:


There are 214 raffles without consecutive with 294 winners while the expected number of them was 219. In other words, a winner of a non-consecutive-raffle must share the prize with a 33% of some other person. On the other hand, the number of observed winners of a raffle with consecutive numbers 17% lower than the expected one. Simple and conclusive. Spanish are like British, at least in what concerns to this particular issue.

Let’s go further. I can do the same for any particular number. For example, there were 63 raffles containing number 45 in the winning combination and 57 (observed) winners, although 66 were expected. After doing this for every number, I can draw this plot, where I paint in blue those which ratio of observed winners between expected is lower than 0.9:


It seems that blue numbers are concentrated on the right side of the grid. Do we prefer small numbers rather than big ones? There are 15 primes between 1 and 49 (rate: 30%) but only 3 primes between blue numbers (rate: 23%). Are we attracted by primes?

Let’s combine both previous results. This table compares the number of expected and observed winners between raffles which contain consecutive and blues (at least one) and raffles which not:


Now, winning combinations with some consecutive and some blue numbers present 20% less of observed winners than expected. After this, which combination would you choose for your next bet? 27-35-36-41-44-45 or 2-6-13-15-26-28? I would choose the first one. Both of them have the same probability to come up, but probably you will become richer with the first one if it happens.

This is the code of this experiment. If someone need the dataset set to do their own experiments, feel free to ask me (you can find my email here):

file = "SORTEOS_PRIMITIVA_2011_2015.xls"
data=read.xlsx(file, sheetName="ALL", colClasses=c("numeric", "Date", rep("numeric", 21)))  
#Impute null values to zero
data$C1_EUROS=with(data, impute(C1_EUROS, 0))
data$CE_WINNERS=with(data, impute(CE_WINNERS, 0))
#Expected winners for each raffle
#Consecutives indicator
data$DIFFMIN=apply(data[,3:8], 1, function (x) min(diff(sort(x))))
#Consecutives vs non-consecutives comparison
      FROM data GROUP BY CONS")
colnames(df1)=c("Contains consecutives?", "Number of  raffles", "Expected Winners", "Observed Winners")
Table1=gvisTable(df1, formats=list('Expected Winners'='#,###'))
#Heat map of each number
results=data.frame(BALL=numeric(0), EXP_WINNER=numeric(0), OBS_WINNERS=numeric(0))
for (i in 1:49)
  data$TF=apply(data[,3:8], 1, function (x) i %in% x + 0)
  results=rbind(results, v)
results$ObsByExp2=with(results, cut(ObsByExp, breaks=c(-Inf,.9,Inf), right = FALSE))
          panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.text =element_blank())
ggplot(results, aes(y=ROW, x=COL)) +
  geom_tile(aes(fill = ObsByExp2), colour="gray85", lwd=2) +
  geom_text(aes(family="Garamond"), label=results$BALL, color="gray10", size=12)+
  scale_fill_manual(values = c("dodgerblue", "gray98"))+
#Blue numbers
Bl=subset(results, ObsByExp2=="[-Inf,0.9)")[,1]
data$BLUES=apply(data[,3:8], 1, function (x) length(intersect(x,Bl)))
#Combination of consecutives and blues
      FROM data GROUP BY IND")
colnames(df2)=c("Contains consecutives and blues?", "Number of  raffles", "Expected Winners", "Observed Winners")
Table2=gvisTable(df2, formats=list('Expected Winners'='#,###'))

Bertrand or (The Importance of Defining Problems Properly)

We better keep an eye on this one: she is tricky (Michael Banks, talking about Mary Poppins)

Professor Bertrand teaches Simulation and someday, ask his students:

Given a circumference, what is the probability that a chord chosen at random is longer than a side of the equilateral triangle inscribed in the circle?

Since they must reach the answer through simulation, very approximate solutions are welcome.

Some students choose chords as the line between two random points on the circumference and conclude that the asked probability is around 1/3. This is the plot of one of their simulations, where 1000 random chords are chosen according this method and those longer than the side of the equilateral triangle are red coloured (smalller in grey):


Some others choose a random radius and a random point in it. The chord then is the perpendicular through this point. They calculate that the asked probability is around 1/2:


And some others choose a random point inside the circle and define the chord as the only one with this point as midpoint. For them, the asked probability is around 1/4:


Who is right? Professor Bertrand knows that everybody is. In fact, his main purpose was to show how important is to define problems properly. Actually, he used this to give an unforgettable lesson to his students.

          panel.background = element_rect(fill="white"),
          panel.grid = element_blank(),
          axis.text =element_blank())
#First approach
angle=runif(2*n, min = 0, max = 2*pi)
pt1=data.frame(x=cos(angle), y=sin(angle))
df1=cbind(pt1[1:n,], pt1[((n+1):(2*n)),])
colnames(df1)=c("x1", "y1", "x2", "y2")
p1=ggplot(df1) + geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2, colour=length>sqrt(3)), alpha=.4, lwd=.6)+
  scale_colour_manual(values = c("gray75", "red"))+opt
#Second approach
pt2=data.frame(aa=cos(angle), bb=sin(angle))
p2=ggplot(df2) + geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2, colour=length>sqrt(3)), alpha=.4, lwd=.6)+
scale_colour_manual(values = c("gray75", "red"))+opt
#Third approach
pt3=data.frame(x0=sqrt(radius)*cos(angle), y0=sqrt(radius)*sin(angle))
p3=ggplot(df3) + geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2, colour=length>sqrt(3)), alpha=.4, lwd=.6)+scale_colour_manual(values = c("gray75", "red"))+opt