Monthly Archives: December 2014

Mixing Benford, GoogleVis And On-Line Encyclopedia of Integer Sequences

The chess-board is the world; the pieces are the phenomena of the universe; the rules of the game are what we call the laws of Nature (T. H. Huxley)

One of the greatest packages I discovered recently is googleVis. While plotting with ggplot can be sometimes very arduous, doing plots with googleVis is extremely easy. Here you can find many examples of what you can do with this great package.

Not long ago, I also discovered The On-Line Encyclopedia of Integer Sequences (OEIS), a huge database of about 250.000 integer sequences where, for example, you can find the number of ways to lace a shoe that has n pairs of eyelets or the smallest number of stones in Tchoukaillon (or Mancala, or Kalahari) solitaire which make use of n-th hole. Many mathematicians, as Ralph Stephan, use this useful resource to develop their theories.

The third protagonist of this story is Frank Benford, who formulated in 1938 his famous law which states that considering different lists of numbers, 1 occurs as the leading digit about the 30% of time, while larger digits occur in that position less frequently.

In this experiment I read 20 random sequences from the OEIS. For each sequence, I obtain the distribution of first digit of the numbers and calculate the similarity with the theoretical distribution given by Benford’s Law so the more similar is the distribution, the closer is this number to 1. Sequences of OEIS are labeled with a seven characters code (an “A” followed by 6 digits). A nice way to show the output of this experiment is using the Gauge visualization of googleVis:

IMG1

Sequence A001288 is the closest to the Benford’s Law. This sequence is the number of distinct 11-element subsets that can be formed from a n element set. Why is so close to the Benford’s Law? No idea further than binomial coefficients are related to some biological laws as number of descendants of a couple of rabbits.

I would like to wish you all a Merry Christmas and a Happy New Year:

library(plyr)
bendford=data.frame(first=0:9, freq=c(0,log10(1+1/(1:9))))
SequencesIds=formatC(sample(1:250000, 20, replace=FALSE), width = 6, format = "d", flag = "0")
results=data.frame(SEQID=character(0), BENDFORNESS=numeric(0))
for(i in 1:length(SequencesIds))
{
  SEQID = SequencesIds[i]
  TEXTFILE=paste("b", SEQID, ".txt", sep="")
  if (!file.exists(TEXTFILE)) download.file(paste("http://oeis.org/A",SEQID, "/b", SEQID, ".txt",sep=""), destfile = TEXTFILE)
  SEQ=readLines(paste("b", SEQID, ".txt", sep=""))
  SEQ=SEQ[SEQ != ""]
  SEQ=SEQ[unlist(gregexpr(pattern ='synthesized',SEQ))<0]
  m=t(sapply(SEQ, function(x) unlist(strsplit(x, " "))))
  df=data.frame(first=substr(gsub("[^0-9]","",m[,2]), 1, 1), row.names = NULL)
  df=count(df, vars = "first")
  df$freq=df$freq/sum(df$freq)
  df2=merge(x = bendford, y = df, by = "first", all.x=TRUE)
  df2[is.na(df2)]=0
  results=rbind(results, data.frame(SEQID=paste("A", SEQID, sep=""), BENDFORNESS=1-sqrt(sum((df2$freq.x - df2$freq.y) ^ 2))))
}
results$BENDFORNESS=as.numeric(format(round(results$BENDFORNESS, 2), nsmall = 2))
Gauge=gvisGauge(results, options=list(min=0, max=1, greenFrom=.75, greenTo=1, yellowFrom=.25, yellowTo=.75, redFrom=0, redTo=.25, width=400, height=300))
plot(Gauge)

The Awesome Parrondo’s Paradox

A technique succeeds in mathematical physics, not by a clever trick, or a happy accident, but because it expresses some aspect of physical truth (O. G. Sutton)

Imagine three unbalanced coins:

  • Coin 1: Probability of head=0.495 and probability of tail=0.505
  • Coin 2: Probability of head=0.745 and probability of tail=0.255
  • Coin 3: Probability of head=0.095 and probability of tail=0.905

Now let’s define two games using these coins:

  • Game A: You toss coin 1 and if it comes up head you receive 1€ but if not, you lose 1€
  • Game B: If your present capital is a multiple of 3, you toss coin 2. If not, you toss coin 3. In both cases, you receive 1€ if coin comes up head and lose 1€ if not.

Played separately, both games are quite unfavorable. Now let’s define Game A+B in which you toss a balanced coin and if it comes up head, you play Game A and play Game B otherwise. In other words, in Game A+B you decide between playing Game A or Game B randomly.

Starting with 0€, it is easy to simulate the three games along 500 plays. This is an example of one of these simulations:#Rstats #R

Resulting profit of Game A+B after 500 plays  is +52€ and is -9€ and -3€ for Games A and B respectively. Let’s do some more simulations (I removed legends and titles but colors of games are the same):

#Rstats #R

As you can see, Game A+B is the most profitable in almost all the previous simulations. Coincidence? Not at all. This is a consequence of the stunning Parrondo’s Paradox which states that two losing games can combine into a winning one.

If you still don’t believe in this brain-crashing paradox, following you can see the empirical distributions of final profits of three games after 1.000 plays:#Rstats #R

After 1000 plays, mean profit of Game A is -13€, is -7€ for Game B and 17€ for Game A+B.

This paradox was discovered in the last nineties by the Spanish physicist Juan Parrondo and can help to explain, among other things, why investing in losing shares can result in obtaining big profits. Amazing:

require(ggplot2)
require(scales)
library(gridExtra)
opts=theme(
  legend.position = "bottom",
  legend.background = element_rect(colour = "black"),
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="gray25", size=15),
  axis.text.x = element_text(colour="gray25", size=15),
  text = element_text(size=20),
  plot.title = element_text(size = 35))
PlayGameA = function(profit, x, c) {if (runif(1) < c-x) profit+1 else profit-1}
PlayGameB = function(profit, x1, c1, x2, c2) {if (profit%%3>0) PlayGameA(profit, x=x1, c=c1) else PlayGameA(profit, x=x2, c=c2)}
####################################################################
#EVOLUTION
####################################################################
noplays=500
alpha=0.005
profit0=0
results=data.frame(Play=0, ProfitA=profit0, ProfitB=profit0, ProfitAB=profit0)
for (i in 1:noplays) {results=rbind(results, c(i,
    PlayGameA(profit=results[results$Play==(i-1),2], x =alpha, c =0.5),
    PlayGameB(profit=results[results$Play==(i-1),3], x1=alpha, c1=0.75, x2=alpha, c2=0.1),
    if (runif(1)<0.5) PlayGameA(profit=results[results$Play==(i-1),4], x =alpha, c =0.5) else PlayGameB(profit=results[results$Play==(i-1),4], x1=alpha, c1=0.75, x2=alpha, c2=0.1)
    ))}
results=rbind(data.frame(Play=results$Play, Game="A",   Profit=results$ProfitA),
              data.frame(Play=results$Play, Game="B",   Profit=results$ProfitB),
              data.frame(Play=results$Play, Game="A+B", Profit=results$ProfitAB))
ggplot(results, aes(Profit, x=Play, y=Profit, color = Game)) +
  scale_x_continuous(limits=c(0,noplays), "Plays")+
  scale_y_continuous(limits=c(-75,75), expand = c(0, 0), "Profit")+
  labs(title="Evolution of profit games along 500 plays")+
  geom_line(size=3)+opts
####################################################################
#DISTRIBUTION
####################################################################
noplays=1000
alpha=0.005
profit0=0
results2=data.frame(Play=numeric(0), ProfitA=numeric(0), ProfitB=numeric(0), ProfitAB=numeric(0))
for (j in 1:100) {results=data.frame(Play=0, ProfitA=profit0, ProfitB=profit0, ProfitAB=profit0)
  for (i in 1:noplays) {results=rbind(results, c(i,
      PlayGameA(profit=results[results$Play==(i-1),2], x =alpha, c =0.5),
      PlayGameB(profit=results[results$Play==(i-1),3], x1=alpha, c1=0.75, x2=alpha, c2=0.1),
      if (runif(1)<0.5) PlayGameA(profit=results[results$Play==(i-1),4], x =alpha, c =0.5)
      else PlayGameB(profit=results[results$Play==(i-1),4], x1=alpha, c1=0.75, x2=alpha, c2=0.1)))}
      results2=rbind(results2, results[results$Play==noplays, ])}
results2=rbind(data.frame(Game="A", Profit=results2$ProfitA),
data.frame(Game="B", Profit=results2$ProfitB),
data.frame(Game="A+B", Profit=results2$ProfitAB))
ggplot(results2, aes(Profit, fill = Game)) +
  scale_x_continuous(limits=c(-150,150), "Profit")+
  scale_y_continuous(limits=c(0,0.02), expand = c(0, 0), "Density", labels = percent)+
  labs(title=paste("Parrondo's Paradox (",as.character(noplays)," plays)",sep=""))+
  geom_density(alpha=.75)+opts

Going Bananas #1: Superman Was Born In Kentucky!

Who knows? Could be the tropic heat or something that I eat, that makes me gonzo (I’m Going Bananas, Madonna)

This is the first post of a new category called “Going Bananas” in which I will post totally crazy experiments. My aim is to try some techniques without taking care of making serious things. These are experiments without rhyme or reason (or, as we say in Spanish, sin pies ni cabeza). So here we go.

Some days ago, my friend Juanjo show me how to download a webpage and to extract a table in R. He did an amazing script to locate on a map all spanish football stadiums in just a handful of lines. And this is what what I will do in this experiment with a slight difference. I will not locate stadiums: I will try to locate the 118 chemical elements discovered up today. How to do it? It is extremely easy: as I did it in this previous post, using geocode function of ggmap package you can locate any address, even if it is the name of a chemical element.

It must be taking into account that almost all names of the elements come from greek or from latin as you can see here. This is the result of locating in the world the 118 chemical elements:

Bananas01

Some conclusions:

  • Only 88 elements (a 75%) have a place on Earth.
  • 40 of the elements with a place on Earth (a 45%) are located in the United States of America, very far from Greek-Latin cultures, by the way
  • Only 16 of them are located in Europe (a 16%)
  • South-Africa and Philipines have almost all the elements outside of EEUU and Europe.
  • And last but not least, Krypton is in Kentucky, as you can see here:

Bananas02

Clark Kent, Krypton, Kentucky … there are a lot of misteries surrounding Superman … and a lot of K letters!!!!

library(RCurl)
library(XML)
library(ggmap)
library (plotGoogleMaps)
webpage=getURL("http://en.wikipedia.org/wiki/List_of_elements", .opts = opts)
table=readHTMLTable(webpage)[[1]]
table=table[nchar(as.character(table$Element))>2,c("Sym", "Element")]
for (i in 1:nrow(table)) {
 valor=geocode(as.character(table$Element[i]))
 table$lon[i]=valor$lon;table$lat[i]=valor$lat}
table2paint=na.omit(table[, c("lat", "lon", "Sym", "Element")])
coordinates(table2paint)=~lon+lat
proj4string(table2paint)=CRS('+proj=longlat +datum=WGS84')
plotGoogleMaps(table2paint, col="yellow", control.width="0%", control.height="0%", mapTypeId='ROADMAP', legend=FALSE)

The World We Live In #3: Breastfeeding

Facts are stubborn, but statistics are more pliable (Mark Twain)

According to World Health Organization, exclusive breastfeeding is recommended up to 6 months of age, with continued breastfeeding along with appropriate complementary foods up to two years of age or beyond. Thus, the defining characteristic of continued breastfeeding is that the infant between 6 months and 2 years of age receives at least some breast milk regardless of the quantity or the presence of other foods or liquids in the diet.

On the other hand, as can be read in The World Factbook of Central Intelligence Agency, the Total Fertility Rate (TFR) is the average number of children that would be born to a woman over her lifetime if she were to experience the exact current age-specific fertility rates through her lifetime and she were to survive from birth through the end of her reproductive life. It is obtained by summing the single-year age-specific rates at a given time.

This is how the world is arranged according to these two rates:

#Rstats #R There are many differences between countries. Both rates are very low in some east European countries like Ukraine, Bosnia, Belarus and Moldova. On the other hand both of them are very high in Benin, Rwanda, Burkina Faso and Malawi, all of them African. Also African countries are Angola, Nigeria and Somalia where fertility rate is very high but breastfeeding is not very established (Timor-Leste in Asia belongs to this segment as well); and women in Nepal, Bangladesh, Sri-Lanka and India feed their moderate number of descendants with their own milk.

We live in a complex and beautiful world which cannot be measured only with averages nor standard deviations:

#Continued breastfeeding rate: http://data.un.org/Data.aspx?d=SOWC&f=inID%3a89
#Total fertility rate (TFR): http://data.un.org/Data.aspx?d=SOWC&f=inID%3a127
#Population: http://data.un.org/Data.aspx?d=SOWC&f=inID%3a105
require("sqldf")
require("ggplot2")
require("scales")
breastfeeding=read.csv("UNdata_Export_20141122_122134175.csv", nrows=124, header=T, row.names=NULL)
fertility=read.csv("UNdata_Export_20141122_122330581.csv", nrows=570, header=T, row.names=NULL)
population=read.csv("UNdata_Export_20141122_142359579.csv", nrows=999, header=T, row.names=NULL)
colnames(breastfeeding)[1]="Country"
colnames(fertility)[1]="Country"
colnames(population)[1]="Country"
data=sqldf("SELECT a.Country, a.Value as Pop, b.Value as Fertility, c.Value as Breastfeeding
           FROM population a inner join fertility b
           on (a.Country=b.Country) INNER JOIN breastfeeding c
           on (a.Country=c.Country)
           where a.Subgroup = 'Total' AND b.Year = 2011
           AND a.Country NOT IN ('World', 'South Asia',
           'Sub-Saharan Africa', 'Least Developed Countries/Territories', 'Eastern and Southern Africa',
           'East Asia and Pacific')")
opts=theme(
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="gray25", size=15),
  axis.text.x = element_text(colour="gray25", size=15),
  text = element_text(size=20),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 45))
ggplot(data, aes(x=Fertility, y=Breastfeeding/100, size=log(Pop), label=Country), guide=FALSE)+
  geom_point(colour="white", fill="darkorchid2", shape=21, alpha=.55)+
  scale_size_continuous(range=c(2,40))+
  scale_x_continuous(limits=c(1,7))+
  scale_y_continuous(limits=c(0,1), labels = percent)+
  labs(title="The World We Live In #3: Breastfeeding",
       x="Total fertility rate (TFR)",
       y="Continued breastfeeding rate")+
  geom_text(data=subset(data, Fertility>5 & (Breastfeeding>75|Breastfeeding<40)), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Fertility<3 & Breastfeeding>75), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Fertility<2 & Breastfeeding<12), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(aes(5, 0), colour="gray25", hjust=0, label="Source: United Nations (size of bubble depending on population)", size=4)+opts