How Big Is The Vatican City?

You say the river finds its way to the sea, and like the river you will come to me (Miss Sarajevo, U2)

One way to approximate the area of a place is to enclose it in a polygon whose area you know. Then generate coordinates inside the polygon and count how many of them fall inside the place. The proportion of coordinates falling inside the place, multiplied by the area of the polygon, is an approximation of the desired area.
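
To make the idea concrete, here is a minimal toy sketch of mine (not part of the Vatican experiment): approximating the area of the unit circle from a grid of points inside a 2×2 square of known area.

#Toy example: the proportion of grid points inside the unit circle,
#times the area of the enclosing square, approximates pi
g=expand.grid(x=seq(-1, 1, length.out=100), y=seq(-1, 1, length.out=100))
inside=g$x^2+g$y^2<=1
mean(inside)*4 #about 3.14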

I applied this technique to calculate the area of the Vatican City. I generated a square grid of coordinates around the Capella Sistina (located inside the Vatican City). To calculate the area, I first obtain the convex hull of the coordinates using the chull function of the grDevices package. Then I calculate the area of the resulting polygon using the areaPolygon function of the geosphere package.

To determine how many coordinates of the grid fall inside the Vatican City, I use the revgeocode function of the ggmap package (I love this function). For me, a coordinate is inside the Vatican City if its reverse-geocoded address contains the words “Vatican City”.

What happens when generating a 20×20 grid of coordinates? I obtain that the area of the Vatican City is about 0.32 km², but according to Wikipedia the area is 0.44 km²: the method underestimates the area by around 27%. But why? Look at this:

[Plot: Vatican2]

This plot shows which coordinates of the grid fall inside the Vatican City (ones) and which do not (zeros). As you can see, there is a big zone in the south, and a smaller one in the north of the city, where reverse geocoding does not return “Vatican City” addresses.

Maybe Pope Francis should phone Larry Page and Sergey Brin to claim this 27% of his wonderful country.

I had been wanting to do this experiment since I wrote this post. This is the code:

require(geosphere)
require(ggmap)
require(plotGoogleMaps)
require(grDevices)
setwd("YOUR-WORKING-DIRECTORY-HERE")
#Coordinates of Capella Sistina
capella=geocode("capella sistina, Vatican City, Roma")
#20x20 grid of coordinates around the Capella
g=expand.grid(lon = seq(capella$lon-0.010, capella$lon+0.010, length.out=20),
lat = seq(capella$lat-0.005, capella$lat+0.005, length.out=20))
#Hull Polygon containing coordinates
p=g[c(chull(g),chull(g)[1]),]
#Address of each coordinate of grid
a=apply(g, 1, revgeocode)
#Estimated area of the Vatican City in km2 (areaPolygon returns square meters)
length(grep("Vatican City", a))/length(a)*areaPolygon(p)/1000/1000
#Label each coordinate: 1 if its address contains "Vatican City", 0 otherwise
s=cbind(g, a)
s$InOut=apply(s, 1, function(x) grepl('Vatican City', x[3]))+0
coordinates(s)=~lon+lat
proj4string(s)=CRS('+proj=longlat +datum=WGS84')
ic=iconlabels(s$InOut, height=12)
plotGoogleMaps(s, iconMarker=ic, mapTypeId="ROADMAP", legend=FALSE)

The United States In Two Words

Sweet home Alabama, Where the skies are so blue; Sweet home Alabama, Lord, I’m coming home to you (Sweet home Alabama, Lynyrd Skynyrd)

This is the second post I have written to show the abilities of the twitteR package, and also the second one I have written for KDnuggets. In this case my goal is to get an insight into what people tweet about American states. To do this, I look for tweets containing the exact phrase “[STATE NAME] is” for every state. Once I have the set of tweets for each state, I do some simple text mining: cleaning, standardizing, removing stop words and crossing with these sentiment lexicons. Then I choose the two most common words to describe each state. You can read the original post here. This is the visualization I produced to show the result of the algorithm:

[Map: States In Two Words v2]

Since the right side of the map is a little messy, in the original post you can see a table with the pair of words describing each state. This is just an experiment to show how to use and combine some interesting tools of R. If you don’t like what Twitter says about your state, don’t take it too seriously.

This is the code I wrote for this experiment:

# Do this if you have not registered your R app in Twitter
library(twitteR)
library(RCurl)
setwd("YOUR-WORKING-DIRECTORY-HERE")
if (!file.exists('cacert.pem'))
{
  download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
}
requestURL="https://api.twitter.com/oauth/request_token"
accessURL="https://api.twitter.com/oauth/access_token"
authURL="https://api.twitter.com/oauth/authorize"
consumerKey = "YOUR-CONSUMER_KEY-HERE"
consumerSecret = "YOUR-CONSUMER-SECRET-HERE"
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=requestURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl"))
save(Cred, file="twitter authentification.Rdata")
# Start here if you have already your twitter authentification.Rdata file
library(twitteR)
library(RCurl)
library(XML)
load("twitter authentification.Rdata")
registerTwitterOAuth(Cred)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
#Read state names from wikipedia
webpage=getURL("http://simple.wikipedia.org/wiki/List_of_U.S._states")
table=readHTMLTable(webpage, which=1)
table=table[!(table$"State name" %in% c("Alaska", "Hawaii")), ]
#Extract tweets for each state
results=data.frame()
for (i in 1:nrow(table))
{
  tweets=searchTwitter(searchString=paste("'\"", table$"State name"[i], " is\"'",sep=""), n=200, lang="en")
  tweets.df=twListToDF(tweets)
  results=rbind(cbind(table$"State name"[i], tweets.df), results)
}
results=results[,c(1,2)]
colnames(results)=c("State", "Text")
library(tm)
#Lexicons
pos = scan('positive-words.txt',  what='character', comment.char=';')
neg = scan('negative-words.txt',  what='character', comment.char=';')
posneg=c(pos,neg)
results$Text=tolower(results$Text)
results$Text=gsub("[[:punct:]]", " ", results$Text)
# Extract most important words for each state
words=data.frame(Abbreviation=character(0), State=character(0), word1=character(0), word2=character(0), word3=character(0), word4=character(0))
for (i in 1:nrow(table))
{
  doc=subset(results, State==as.character(table$"State name"[i]))
  doc.vec=VectorSource(doc[,2])
  doc.corpus=Corpus(doc.vec)
  stopwords=c(stopwords("english"), tolower(unlist(strsplit(as.character(table$"State name"), " "))), "like")
  doc.corpus=tm_map(doc.corpus, removeWords, stopwords)
  TDM=TermDocumentMatrix(doc.corpus)
  TDM=TDM[Reduce(intersect, list(rownames(TDM),posneg)),]
  v=sort(rowSums(as.matrix(TDM)), decreasing=TRUE)
  words=rbind(words, data.frame(Abbreviation=as.character(table$"Abbreviation"[i]), State=as.character(table$"State name"[i]),
                                   word1=attr(head(v, 4),"names")[1],
                                   word2=attr(head(v, 4),"names")[2],
                                   word3=attr(head(v, 4),"names")[3],
                                   word4=attr(head(v, 4),"names")[4]))
}
# Visualization
require("sqldf")
statecoords=as.data.frame(cbind(x=state.center$x, y=state.center$y, abb=state.abb))
#To make names of right side readable
texts=sqldf("SELECT a.abb,
            CASE WHEN a.abb IN ('DE', 'NJ', 'RI', 'NH') THEN a.x+1.7
            WHEN a.abb IN ('CT', 'MA') THEN a.x-0.5  ELSE a.x END as x,
            CASE WHEN a.abb IN ('CT', 'VA', 'NY') THEN a.y-0.4 ELSE a.y END as y,
            b.word1, b.word2 FROM statecoords a INNER JOIN words b ON a.abb=b.Abbreviation")
texts$col=rgb(sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),max=255)
library(maps)
jpeg(filename = "States In Two Words v2.jpeg", width = 1200, height = 600, quality = 100)
map("state", interior = FALSE, col="gray40", fill=FALSE)
map("state", boundary = FALSE, col="gray", add = TRUE)
text(x=as.numeric(as.character(texts$x)), y=as.numeric(as.character(texts$y)), apply(texts[,4:5] , 1 , paste , collapse = "\n" ), cex=1, family="Humor Sans", col=texts$col)
dev.off()

Mixing Waves

Fill a cocktail shaker with ice; add vodka, triple sec, cranberry, and lime, and shake well; strain into a chilled cocktail glass and garnish with orange twist (Cosmopolitan Cocktail Recipe)

This is a tribute to Blaise Pascal and Joseph Fourier, two of the greatest mathematicians in history. As Pascal did in his famous triangle, I generate a set of random curves (sines or cosines with random frequencies between 1 and 50) and arrange them along the lateral edges of the triangle. Each inner curve in the triangle is the sum of the two curves directly above it. This is the result for a triangle with 6 rows:

[Plot: Adding Waves]

Two comments:

  1. Inner curves are noisy. The greater the distance from the edges, the higher the entropy. When I was a child, I used to play a game called broken telephone; I can see some kind of connection between this graphic and that game.
  2. I have read that using eval+parse is a symptom of being a bad programmer. Does anyone have an idea of how to do this in some other way without filling the screen with code? (One possible approach is sketched right below.)
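
One possible alternative, sketched below under my own assumptions (a nested list f[[i]][[j]] playing the role of fij, and closures instead of eval+parse):

nrows=6
#Random edge wave: a sine or cosine with a random frequency between 1 and 50
make_wave=function() {
  trig=sample(list(sin, cos), 1)[[1]]
  a=runif(1, min=1, max=50)
  function(x) trig(a*x) #the closure captures trig and a
}
#Sum of two waves; force() avoids the lazy-evaluation trap inside loops
add_waves=function(g1, g2) {force(g1); force(g2); function(x) g1(x)+g2(x)}
f=lapply(1:nrows, function(i) vector("list", i))
for (i in 1:nrows) {f[[i]][[1]]=make_wave(); f[[i]][[i]]=make_wave()}
for (i in 3:nrows) for (j in 2:(i-1)) f[[i]][[j]]=add_waves(f[[i-1]][[j-1]], f[[i-1]][[j]])
#f[[i]][[j]] can now be passed directly to stat_function(fun=f[[i]][[j]])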

This is the original code:

library(ggplot2)
library(grid) #for viewport, grid.layout and pushViewport
library(gridExtra)
nrows=6
for (i in 1:nrows){
  eval(parse(text=paste("f",i,1,"=function(x) ", sample(c("sin(","cos("),1), runif(min=1, max=50,1) ,"*x)",sep="")))
  eval(parse(text=paste("f",i,i,"=function(x) ", sample(c("sin(","cos("),1), runif(min=1, max=50,1) ,"*x)",sep="")))}
for (i in 3:nrows) {
  for (j in 2:(i-1)) eval(parse(text=paste("f",i, j, "=function(x) f",(i-1),(j-1), "(x) + f",(i-1),j,"(x)",sep="")))}
vplayout=function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
opts=theme(legend.position="none",
           panel.background = element_rect(fill="gray95"),
           plot.background = element_rect(fill="gray95", colour="gray95"),
           panel.grid = element_blank(),
           axis.ticks=element_blank(),
           axis.title=element_blank(),
           axis.text =element_blank())
setwd("YOUR WORKING DIRECTORY HERE")
jpeg(file="Adding Waves.jpeg", width=1800,height=1000, bg = "gray95", quality = 100)
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrows, 2*nrows-1)))
for (i in 1:nrows) {
  for (j in 1:i) {
    print(ggplot(data.frame(x = c(0, 20)), aes(x)) + stat_function(fun = eval(parse(text=paste("f",i,j,sep=""))), colour = "black", alpha=.75)+opts, vp = vplayout(i, nrows+(2*j-(i+1))))
  }
}
dev.off()

Visualizing Home Ownership With Small Multiples And R

If everybody had an ocean, across the U.S.A., then everybody’d be surfin’ like California (Beach Boys, Surfin’ U.S.A.)

[Plot: home_ownership]

I was invited to write a post for Domino Data Lab, a company based in California which provides a cloud-based machine learning platform enabling companies to use the power of the cloud to build analytical projects. I also recently discovered this book, which supports the premise of companies like Domino Data Lab that are leading the change in the way data science is done. How I wish to forget in the future expressions like execution time, version updates and memory limits!

Since I like small multiples a lot, I decided to plot the evolution of home ownership across the United States (the more I use the gridExtra package, the more I like it). You can read the post here (code included).

By the way, if you want to go to Gigaom Structure Data 2015 for free, Domino Data Lab is giving away 2 tickets here.

How e Can Help You To Find The Love Of Your Life

Match.com will bring more love to the planet than anything since Jesus Christ (Gary Kremen, founder of Match.com)

Sarah is a brilliant 39-year-old mathematician living in Massachusetts. She lives alone and has dedicated her whole life to study. She has realized lately that theorems and books no longer satisfy her. Sarah has realized that she needs to find love.

To find the love of her life, Sarah joined Match.com, planning to have a date with a man every week for a year (52 dates in total). She has her own method to rate each man according to his friendliness, his physical appearance, his punctuality, his conversation and his hobbies. This method allows her to compare candidates with each other. Sarah wants to choose the top-scored man, but she is afraid of wasting time. If she waits until she has had all the dates, it could be too late to call back the best candidate, especially if he was one of the first dates. So she wants to be agile and decide immediately. Her plan is as follows: she will start by having some dates only to assess the candidates, and after this period she will try to win over the first man who is better than all of those first candidates, according to her scoring.

But how many men should she discard to maximize the probability of choosing the top-scored one? If she discards just one, the probability of having a date with a better man afterwards is very high, but he will probably not be the man she is looking for. On the other hand, discarding many men makes it very likely that the top-scored one is discarded as well.

Sarah ran a simulation in R of the 52 dates to approximate the probability of choosing the best man depending on the number of discards. She found that the probability of choosing the top-scored man is maximal when discarding the first 19 men, as can be seen in the following graph:

[Plot: Prince]

Why 19? Because 19 is approximately 52/e. This is one of those unexpected places where the number e can be found. You can see an explanation of the phenomenon here.
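
For the mathematically inclined, here is a sketch of the standard argument behind Sarah’s result (my addition, not part of the original story). If she discards the first k of n candidates and then accepts the first one better than all of them, the probability of ending up with the best candidate is

P(k) = \sum_{j=k+1}^{n} \frac{1}{n} \cdot \frac{k}{j-1} \approx \frac{k}{n} \ln\frac{n}{k}

since the best candidate must be the j-th date (probability 1/n) and the best of the previous j-1 dates must fall among the k discarded ones (probability k/(j-1)). This expression is maximized at k = n/e, where the probability itself is about 1/e ≈ 0.37; for n = 52 this gives k ≈ 52/e ≈ 19.1, hence the 19 discards.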

Note: This is just a story to illustrate the secretary problem without repeating its original formulation. My apologies if I have caused offense to anyone. This is a blog about mathematics and R, and that is how it must be understood.

require(extrafont)
require(ggplot2)
n=52
sims=1000
results=data.frame(discards=numeric(0), triumphs=numeric(0))
for(i in 0:n)
{
  triumphs=0
  for (j in 1:sims) {
    opt=sample(n) #random permutation: opt[k] is the score of the k-th date
    if (max(opt[1:i])==n)  triumphs=triumphs+0 #the top-scored man was discarded
    else triumphs=triumphs+(opt[(i+1):n][min(which(opt[(i+1):n] > max(opt[1:i])))]==n)}
  results=rbind(results, data.frame(discards=i, triumphs=triumphs/sims))
}
opts=theme(
  panel.background = element_rect(fill="darkolivegreen1"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="white", linetype = 1),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="black", size=20),
  axis.text.x = element_text(colour="black", size=20),
  text = element_text(size=25, family="xkcd"),
  legend.key = element_blank(),
  legend.background = element_blank(),
  plot.title = element_text(size = 40))
ggplot(results, aes(discards, triumphs))+
  geom_vline(xintercept = n/exp(1), size = 1, linetype=2, colour = "black", alpha=0.8)+
  geom_line(color="green4", size=1.5)+
  geom_point(color="gray92", size=8, pch=16)+
  geom_point(color="green4", size=6, pch=16)+
  ggtitle("How e can help you to find the love of your life")+
  xlab("Discards") +
  ylab("Prob. of finding the love of your life")+
  scale_x_continuous(breaks=seq(0, n, by = 2))+opts

Simple Data Science Of Global Warming In KDnuggets

Would love to get a post from you for KDnuggets (Gregory Piatetsky, KDnuggets President)

Some days ago, Gregory Piatetsky invited me to write a post for KDnuggets. I couldn’t say no. He suggested some topics to me and I decided to experiment around climate change to demonstrate how easy it is to see some evidence of this alarming menace. You can read the post here.

This is the code I wrote to do this experiment:

require(sqldf)
require(googleVis)
require(ggplot2)
require(ggthemes)
setwd("YOUR WORKING DIRECTORY HERE") #This line doen's work until you type a valid path
#Data Avaliable in http://eca.knmi.nl/utils/downloadfile.php?file=download/ECA_blend_tg.zip
files=list.files(pattern = "TG_STAID")
results=data.frame(staid=character(0), trnd=numeric(0), nobs=numeric(0))
#Loop to calculate linear trends
for (i in 1:length(files))
{
  table=read.table(files[i], header=TRUE, sep = ',', skip=20)
  table=table[table$Q_TG==0,]
  table$date=as.Date(as.character(table$DATE), "%Y%m%d")
  results=rbind(data.frame(staid=files[i], trnd=lm.fit (x = matrix(table$date), y = table$TG)$coefficients, nobs=nrow(table)), results)
}
write.csv(results, file="results.csv", row.names = FALSE)#Save your work
results=read.csv(file="results.csv")#Read your work. You can start here if you stopped your work after the previous line
#Remove outliers
results=results[!results$trnd %in% boxplot.stats(results$trnd, coef = 4)$out,]
#Histogram
hist(x=results$trnd, breaks = 50,
     col = "orange",
     border = "pink",
     freq=TRUE,
     xlim=c(-0.03, 0.03),
     ylim=c(0, 300),
     xlab="Historical trend of daily mean temperatures",
     ylab="Number of stations",
     main="Evolution of temperatures in Europe and the Mediterranean",
     sub="Source: European Climate Assessment & Dataset project")
results$staid2=as.numeric(gsub("[^0-9]","",results$staid)) #To join results with geographical coordinates
#Read table of geographical coordinates
staids=read.table("http://www.ecad.eu/download/ECA_all_stations.txt", header=TRUE, sep = ',', skip=17)
#Right tail of the distribution
hotpoints=sqldf("SELECT a.staid, a.staid2, a.trnd, a.nobs, b.staname, b.lat, b.lon
      FROM results a INNER JOIN staids b ON (a.staid2=b.staid) WHERE TRND >= 0.02")
#Convert degrees:minutes:seconds to decimal coordinates
hotpoints=within(hotpoints, {
  dms=do.call(rbind, strsplit(as.character(LAT), ":"))
  lat=sign(as.numeric(dms[,1]))*(abs(as.numeric(dms[,1]))+(as.numeric(dms[,2]) + as.numeric(dms[,3])/60)/60);rm(dms)
})
hotpoints=within(hotpoints, {
  dms=do.call(rbind, strsplit(as.character(LON), ":"))
  lon=sign(as.numeric(dms[,1]))*(abs(as.numeric(dms[,1]))+(as.numeric(dms[,2]) + as.numeric(dms[,3])/60)/60);rm(dms)
})
#To make the name of the station readable in the map
hotpoints$staname=gsub("^\\s+|\\s+$", "", hotpoints$STANAME)
#To prepare coordinates to gvis function
hotpoints$LatLong=with(hotpoints, paste(lat, lon, sep=":"))
#The amazing gvisMap function of googleVis package
hotpointsMap=gvisMap(hotpoints, "LatLong" , "STANAME",
        options=list(showTip=TRUE,
                     showLine=TRUE,
                     enableScrollWheel=TRUE,
                     mapType='terrain',
                     useMapTypeControl=TRUE))
plot(hotpointsMap)
#Plot one of these hot stations
InAmenas=read.table("TG_STAID000312.txt", header=TRUE, sep = ',', skip=20)
InAmenas=InAmenas[InAmenas$Q_TG==0,]
InAmenas$date=as.Date(as.character(InAmenas$DATE), "%Y%m%d")
ggplot(InAmenas, aes(date, TG)) +
  geom_line(color="red", alpha=0.8) +
  xlab("") +
  ylab("Mean temperature in 0.1C")+
  ggtitle("Mean temperature in IN-AMENAS (ALGERIA) 1958- 1998")+
  geom_smooth(method = "lm", se=FALSE, color="red", lwd=2)+
  theme_economist(base_size = 20, base_family = "sans", horizontal = TRUE,
                  dkpanel = FALSE, stata = FALSE)

Maths, Music and Merkbar

Control is what we already know. Control is where we have already ventured. Control is what helps us predict the future. (Merkbar)

Maths and music get along very well. Last December I received a mail from a guy called Jesper. He is one of the two members of Merkbar, an electronic music band from Denmark. As can be read on their website:

Merkbar is Jesper and David who are both interested in the psychedelic worlds and oriental spiritualism. They both studied Computer Music, where they’ve done research in sound synthesis, generative composition and the design of new digital instrument.

They asked me for a front cover for their new album, which will be released at the beginning of 2015. Why? Because they liked this post I did about circlizing numbers. To make this plot I circlized the Golden Ratio number (Phi), but in this case I replaced ribbons (where all equal pairs of consecutive digits are gathered together) with lines (where every pair of consecutive digits forms its own line). As I did before, I used the circlize package, which implements in R the features of Circos, a software package to create stunning circular visualizations.

The final plot represents the first 2,000 digits of Phi:

[Album cover: Merkbar Cover2]

You can hear a preview of their new album, which is called “Phi”, here. Enjoy their sensitive and full-of-shades music: you will be delightfully surprised, as I was.

This is the code to circlize Phi:

library(circlize)
library(scales)
factors = as.factor(0:9)
lines = 2000 #Number of lines to plot in the graph
alpha = 0.4  #Alpha for color lines
colors0=c(rgb(239,143,121, max=255), rgb(126,240,188, max=255), rgb(111,228,235, max=255),
          rgb(127,209,249, max=255), rgb( 74,106,181, max=255), rgb(114,100,188, max=255),
          rgb(181,116,234, max=255), rgb(226,135,228, max=255), rgb(239,136,192, max=255),
          rgb(233,134,152, max=255))
#You can find the txt file here: http://www.goldennumber.net/wp-content/uploads/2012/06/Phi-To-100000-Places.txt
phi=readLines("Phi-To-100000-Places.txt")[5]
phi=gsub("\\.","", substr(phi,1,lines))
phi=gsub("\\.","", phi)
position=1/(nchar(phi)-1)
#This code generates a pdf file in your working directory
pdf(file="CirclizePhi.pdf", width=25, height=25)
circos.clear()
par(mar = c(1, 1, 1, 1), lwd = 0.1, cex = 0.7, bg=alpha("black", 1))
circos.par("cell.padding"=c(0.01,0.01), "track.height" = 0.025, "gap.degree" = 3)
circos.initialize(factors = factors, xlim = c(0, 1))
circos.trackPlotRegion(factors = factors, ylim = c(0, 1))
for (i in 0:9) {circos.updatePlotRegion(sector.index = as.character(i), bg.col = alpha("black", 1), bg.border=alpha(colors0[i+1], 1))}
for (i in 1:(nchar(phi)-1)) {
  m=min(as.numeric(substr(phi, i, i)), as.numeric(substr(phi, i+1, i+1)))
  M=max(as.numeric(substr(phi, i, i)), as.numeric(substr(phi, i+1, i+1)))
  d=min((M-m),((m+10)-M))
  col=t(col2rgb(colors0[(as.numeric(substr(phi, i, i))+1)]))
  if (col[1]>255) col[1]=255;if (col[2]>255) col[2]=255;if (col[3]>255) col[3]=255
  if (col[1]<0) col[1]=0;if (col[2]<0) col[2]=0;if (col[3]<0) col[3]=0
  if (d>0) circos.link(substr(phi, i, i), position*(i-1), substr(phi, i+1, i+1), position*i, h = 0.1375*d+0.1125, lwd=0, col=alpha(rgb(col, max=255), alpha), rou = 0.92)
}
dev.off()

First Anniversary Of Ripples

If people do not believe that mathematics is simple, it is only because they do not realize how complicated life is (John von Neumann)

I started this blog one year ago and the experience has been better than I could have imagined:

  • Ripples has been viewed about 30,000 times
  • 49 posts published (The Sound Of Mandelbrot Set was the most viewed)
  • R-bloggers is my top referring site (thanks to Tal Galili for his support)
  • Visitors came from 141 countries: most of them from the United States, with Spain and the U.K. not far behind (below you can see my map of the empire)

I have met a lot of wonderful people along the way, like Andrew Wyer, with whom I wrote a post about 3D harmonographs, the people of Merkbar, musicians who asked me for a plot for the cover of their new album (I will post about this project in the coming days), and Gregory Piatetsky from KDnuggets (I hope to do something with him again in the near future).

I think this new year is going to be fun. I still have a lot of experiments in my mind that I want to explore with R.

Happy 2015!

 

[Map: visitors around the world]

Mixing Benford, GoogleVis And On-Line Encyclopedia of Integer Sequences

The chess-board is the world; the pieces are the phenomena of the universe; the rules of the game are what we call the laws of Nature (T. H. Huxley)

One of the greatest packages I have discovered recently is googleVis. While plotting with ggplot can sometimes be very arduous, doing plots with googleVis is extremely easy. Here you can find many examples of what you can do with this great package.

Not long ago, I also discovered The On-Line Encyclopedia of Integer Sequences (OEIS), a huge database of about 250,000 integer sequences where, for example, you can find the number of ways to lace a shoe that has n pairs of eyelets, or the smallest number of stones in Tchoukaillon (or Mancala, or Kalahari) solitaire which makes use of the n-th hole. Many mathematicians, such as Ralph Stephan, use this useful resource to develop their theories.

The third protagonist of this story is Frank Benford, who in 1938 formulated his famous law, which states that in many lists of numbers, 1 occurs as the leading digit about 30% of the time, while larger digits occur in that position less frequently.
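
In R, Benford’s theoretical distribution of the leading digit is a one-liner (the same vector that the code below stores in bendford):

#Benford's Law: P(d)=log10(1+1/d) for leading digit d
round(log10(1+1/(1:9)), 3)
# 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046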

In this experiment I read 20 random sequences from the OEIS. For each sequence, I obtain the distribution of the first digits of its terms and calculate its similarity to the theoretical distribution given by Benford’s Law (in the code below, one minus the Euclidean distance between the two distributions), so the more similar the distributions, the closer this number is to 1. Sequences in the OEIS are labeled with a seven-character code (an “A” followed by 6 digits). A nice way to show the output of this experiment is the Gauge visualization of googleVis:

[Gauge charts: Benford-ness of the 20 random OEIS sequences]

Sequence A001288 is the closest to Benford’s Law. This sequence gives the number of distinct 11-element subsets that can be formed from an n-element set. Why is it so close to Benford’s Law? No idea, beyond the fact that binomial coefficients are related to some biological laws, such as the number of descendants of a couple of rabbits.
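
As a quick sanity check (my own addition, not part of the original experiment), you can compare the leading digits of the terms of A001288, the binomial coefficients C(n,11), with the Benford frequencies:

#Leading digits of C(n,11) for n=11..5000 versus Benford's frequencies
v=choose(11:5000, 11)
first=as.numeric(substr(formatC(v, format="e"), 1, 1))
round(table(first)/length(first), 3) #empirical leading-digit frequencies
round(log10(1+1/(1:9)), 3)           #theoretical Benford frequencies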

I would like to wish you all a Merry Christmas and a Happy New Year:

library(plyr)
bendford=data.frame(first=0:9, freq=c(0,log10(1+1/(1:9)))) #Benford frequencies; digit 0 gets frequency 0
SequencesIds=formatC(sample(1:250000, 20, replace=FALSE), width = 6, format = "d", flag = "0")
results=data.frame(SEQID=character(0), BENDFORNESS=numeric(0))
for(i in 1:length(SequencesIds))
{
  SEQID = SequencesIds[i]
  TEXTFILE=paste("b", SEQID, ".txt", sep="")
  if (!file.exists(TEXTFILE)) download.file(paste("http://oeis.org/A",SEQID, "/b", SEQID, ".txt",sep=""), destfile = TEXTFILE)
  SEQ=readLines(paste("b", SEQID, ".txt", sep=""))
  SEQ=SEQ[SEQ != ""]
  SEQ=SEQ[unlist(gregexpr(pattern ='synthesized',SEQ))<0]
  m=t(sapply(SEQ, function(x) unlist(strsplit(x, " "))))
  df=data.frame(first=substr(gsub("[^0-9]","",m[,2]), 1, 1), row.names = NULL)
  df=count(df, vars = "first")
  df$freq=df$freq/sum(df$freq)
  df2=merge(x = bendford, y = df, by = "first", all.x=TRUE)
  df2[is.na(df2)]=0
  results=rbind(results, data.frame(SEQID=paste("A", SEQID, sep=""), BENDFORNESS=1-sqrt(sum((df2$freq.x - df2$freq.y) ^ 2))))
}
results$BENDFORNESS=as.numeric(format(round(results$BENDFORNESS, 2), nsmall = 2))
Gauge=gvisGauge(results, options=list(min=0, max=1, greenFrom=.75, greenTo=1, yellowFrom=.25, yellowTo=.75, redFrom=0, redTo=.25, width=400, height=300))
plot(Gauge)

The Awesome Parrondo’s Paradox

A technique succeeds in mathematical physics, not by a clever trick, or a happy accident, but because it expresses some aspect of physical truth (O. G. Sutton)

Imagine three unbalanced coins:

  • Coin 1: Probability of heads=0.495 and probability of tails=0.505
  • Coin 2: Probability of heads=0.745 and probability of tails=0.255
  • Coin 3: Probability of heads=0.095 and probability of tails=0.905

Now let’s define two games using these coins:

  • Game A: You toss coin 1 and if it comes up heads you receive 1€; if not, you lose 1€
  • Game B: If your present capital is a multiple of 3, you toss coin 3. If not, you toss coin 2. In both cases, you receive 1€ if the coin comes up heads and lose 1€ if not. (Note: this matches the code below, where the unfavorable coin is tossed when the capital is a multiple of 3.)

Played separately, both games are quite unfavorable. Now let’s define Game A+B, in which you toss a balanced coin and play Game A if it comes up heads and Game B otherwise. In other words, in Game A+B you decide randomly between playing Game A or Game B.

Starting with 0€, it is easy to simulate the three games over 500 plays. This is an example of one of these simulations:

[Plot: evolution of profit along 500 plays]

The resulting profit of Game A+B after 500 plays is +52€, versus -9€ and -3€ for Games A and B respectively. Let’s do some more simulations (I removed legends and titles, but the colors of the games are the same):

[Plots: more simulations of the three games]

As you can see, Game A+B is the most profitable in almost all the previous simulations. Coincidence? Not at all. This is a consequence of the stunning Parrondo’s Paradox, which states that two losing games can combine into a winning one.
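
To see why this happens without relying on simulation, here is a minimal analytic sketch of mine (the function expectedGain is hypothetical, but the probabilities are the ones defined above): the capital modulo 3 is a Markov chain, and its stationary distribution tells how often each coin is tossed, hence the expected profit per play.

#Expected profit per play when the coin with head probability p0 is tossed
#if capital is a multiple of 3, and the one with probability p12 otherwise
expectedGain=function(p0, p12) {
  #Transition matrix of capital mod 3 (winning adds 1, losing subtracts 1)
  P=matrix(c(0,     p0,    1-p0,
             1-p12, 0,     p12,
             p12,   1-p12, 0), nrow=3, byrow=TRUE)
  st=Re(eigen(t(P))$vectors[,1]) #stationary distribution (eigenvalue 1)
  st=st/sum(st)
  sum(st*(2*c(p0, p12, p12)-1)) #expected profit per toss
}
expectedGain(0.495, 0.495)                     #Game A: -0.01
expectedGain(0.095, 0.745)                     #Game B: about -0.009
expectedGain((0.495+0.095)/2, (0.495+0.745)/2) #Game A+B: about +0.016

Game B loses on its own because the bad coin is tossed more often than a third of the time; mixing in Game A changes how often each state is visited and tips the average into positive territory.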

If you still don’t believe in this brain-crashing paradox, here you can see the empirical distributions of the final profits of the three games after 1,000 plays:

[Plot: distributions of final profits after 1,000 plays]

After 1,000 plays, the mean profit is -13€ for Game A, -7€ for Game B and +17€ for Game A+B.

This paradox was discovered in the late nineties by the Spanish physicist Juan Parrondo and can help to explain, among other things, why investing in losing shares can result in big profits. Amazing:

require(ggplot2)
require(scales)
library(gridExtra)
opts=theme(
  legend.position = "bottom",
  legend.background = element_rect(colour = "black"),
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="gray25", size=15),
  axis.text.x = element_text(colour="gray25", size=15),
  text = element_text(size=20),
  plot.title = element_text(size = 35))
#Game A: toss a coin with head probability c-x; +1€ if heads, -1€ if tails
PlayGameA = function(profit, x, c) {if (runif(1) < c-x) profit+1 else profit-1}
#Game B: coin with head prob c1-x1 if capital is not a multiple of 3, c2-x2 if it is
PlayGameB = function(profit, x1, c1, x2, c2) {if (profit%%3>0) PlayGameA(profit, x=x1, c=c1) else PlayGameA(profit, x=x2, c=c2)}
####################################################################
#EVOLUTION
####################################################################
noplays=500
alpha=0.005
profit0=0
results=data.frame(Play=0, ProfitA=profit0, ProfitB=profit0, ProfitAB=profit0)
for (i in 1:noplays) {results=rbind(results, c(i,
    PlayGameA(profit=results[results$Play==(i-1),2], x =alpha, c =0.5),
    PlayGameB(profit=results[results$Play==(i-1),3], x1=alpha, c1=0.75, x2=alpha, c2=0.1),
    if (runif(1)<0.5) PlayGameA(profit=results[results$Play==(i-1),4], x =alpha, c =0.5) else PlayGameB(profit=results[results$Play==(i-1),4], x1=alpha, c1=0.75, x2=alpha, c2=0.1)
    ))}
results=rbind(data.frame(Play=results$Play, Game="A",   Profit=results$ProfitA),
              data.frame(Play=results$Play, Game="B",   Profit=results$ProfitB),
              data.frame(Play=results$Play, Game="A+B", Profit=results$ProfitAB))
ggplot(results, aes(x=Play, y=Profit, color = Game)) +
  scale_x_continuous(limits=c(0,noplays), "Plays")+
  scale_y_continuous(limits=c(-75,75), expand = c(0, 0), "Profit")+
  labs(title="Evolution of profit games along 500 plays")+
  geom_line(size=3)+opts
####################################################################
#DISTRIBUTION
####################################################################
noplays=1000
alpha=0.005
profit0=0
results2=data.frame(Play=numeric(0), ProfitA=numeric(0), ProfitB=numeric(0), ProfitAB=numeric(0))
for (j in 1:100) {results=data.frame(Play=0, ProfitA=profit0, ProfitB=profit0, ProfitAB=profit0)
  for (i in 1:noplays) {results=rbind(results, c(i,
      PlayGameA(profit=results[results$Play==(i-1),2], x =alpha, c =0.5),
      PlayGameB(profit=results[results$Play==(i-1),3], x1=alpha, c1=0.75, x2=alpha, c2=0.1),
      if (runif(1)<0.5) PlayGameA(profit=results[results$Play==(i-1),4], x =alpha, c =0.5)
      else PlayGameB(profit=results[results$Play==(i-1),4], x1=alpha, c1=0.75, x2=alpha, c2=0.1)))}
      results2=rbind(results2, results[results$Play==noplays, ])}
results2=rbind(data.frame(Game="A", Profit=results2$ProfitA),
data.frame(Game="B", Profit=results2$ProfitB),
data.frame(Game="A+B", Profit=results2$ProfitAB))
ggplot(results2, aes(Profit, fill = Game)) +
  scale_x_continuous(limits=c(-150,150), "Profit")+
  scale_y_continuous(limits=c(0,0.02), expand = c(0, 0), "Density", labels = percent)+
  labs(title=paste("Parrondo's Paradox (",as.character(noplays)," plays)",sep=""))+
  geom_density(alpha=.75)+opts