Category Archives: Curiosities

Visualizing Stirling’s Approximation With Highcharts

I said, “Wait a minute, Chester, you know I’m a peaceful man”, He said, “That’s okay, boy, won’t you feed him when you can” (The Weight, The Band)

It is quite easy to calculate the probability of obtaining the same number of heads and tails when tossing a coin N times, with N even. There are 2^{N} possible outcomes, of which C_{N/2}^{N} are favorable, so the exact probability is the quotient of these two numbers (# of favorable divided by # of possible).

There is another way to approximate this number incredibly well: using Stirling's formula, which gives 1/\sqrt{\pi\cdot N/2}.
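For completeness, this comes straight from Stirling's formula n!\approx\sqrt{2\pi n}\,(n/e)^{n} applied to the central binomial coefficient:

\frac{1}{2^{N}}\binom{N}{N/2}=\frac{N!}{2^{N}\,((N/2)!)^{2}}\approx\frac{\sqrt{2\pi N}\,(N/e)^{N}}{2^{N}\,\pi N\,(N/(2e))^{N}}=\sqrt{\frac{2}{\pi N}}=\frac{1}{\sqrt{\pi\cdot N/2}}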

This plot represents both calculations for N from 2 to 200:

Stirling

Although for small values of N Stirling's approximation tends to overestimate the probability …

Stirling 2

… it becomes extremely precise as N grows:

Stirling 3

James Stirling published this amazing formula in 1730. It simplifies the calculation to the extreme and also gives a quick way to answer a very interesting question: how many tosses are needed to make the probability of obtaining the same number of heads and tails fall below any given threshold? Just solve the formula for N and you will obtain the answer. And the formula is also another example of the presence of pi in the most unexpected places, as happens here.
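For instance, inverting the approximation, 1/\sqrt{\pi\cdot N/2}<t is equivalent to N>2/(\pi t^{2}). For a threshold of 1% (a made-up value, just to illustrate):

2/(pi*0.01^2) #6366.198: any even N above this value works
1/sqrt(pi*6368/2) #0.009998..., already below the 1% threshold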

Just another thing: the more I use the highcharter package, the more I like it.

This is the code:

library(highcharter)
library(dplyr)
#Exact probability vs. Stirling's approximation for N=2, 4, ..., 200
data.frame(N=seq(from=2, by=2, length.out = 100)) %>%
  mutate(Exact=choose(N,N/2)/2**N, Stirling=1/sqrt(pi*N/2))->data
hc <- highchart() %>% 
  hc_title(text = "Stirling's Approximation") %>% 
  hc_subtitle(text = "How likely is getting 50% heads and 50% tails tossing a coin N times?") %>% 
  hc_xAxis(title = list(text = "N: Number of tosses"), categories = data$N) %>% 
  hc_yAxis(title = list(text = "Probability"), labels = list(format = "{value}%", useHTML = TRUE)) %>% 
  hc_add_series(name = "Stirling", data = data$Stirling*100,  marker = list(enabled = FALSE), color="blue") %>% 
  hc_add_series(name = "Exact", data = data$Exact*100,  marker = list(enabled = FALSE), color="lightblue") %>% 
  hc_tooltip(formatter = JS("function(){return ('<b>Number of tosses: </b>'+this.x+'<br><b>Probability: </b>'+Highcharts.numberFormat(this.y, 2)+'%')}")) %>%
  hc_exporting(enabled = TRUE) %>%
  hc_chart(zoomType = "xy")
hc

Amazing Things That Happen When You Toss a Coin 12 Times

If there is a God, he’s a great mathematician (Paul Dirac)

Imagine you toss a coin 12 times and you count how many heads and tails you have obtained after each toss (the coin is fair, so heads and tails are equally likely). At some point, it can happen that the number of heads and the number of tails are the same. For example, if you obtain the sequence T-H-T-T-H-T-H-H-T-T-H-H, after the second toss the number of heads is equal to the number of tails (both equal to one). It happens again after the 8th toss and after the last one. In this example, the last toss where equality occurs is number 12. Obviously, equality can only be observed after an even number of tosses.
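You can quickly check the example in R, coding tails as -1 and heads as +1 (a throwaway sketch, separate from the post's code below):

x=c(-1,1,-1,-1,1,-1,1,1,-1,-1,1,1) #the sequence T-H-T-T-H-T-H-H-T-T-H-H
which(cumsum(x)==0) #2 8 12: the tosses where equality occurs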

If you repeat the experiment 10,000 times and plot the relative frequency of the last toss at which the cumulated number of heads equals the cumulated number of tails, you will find something like this:

Coin12Times
From my point of view there are three amazing things in this plot:

  1. It is symmetrical, so prob(n)=prob(12-n)
  2. The least likely toss at which the last equality can occur is the central one.
  3. As a corollary, the most likely outcomes are never obtaining any equality (the number of heads is never the same as the number of tails) or obtaining the last equality in the last toss: two extremely different scenarios with the same chance of being observed.

Behind the simplicity of tossing coins there is a beautiful universe of mathematical surprises.
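In fact, the exact distribution is known: by the discrete arcsine law, the probability that the last tie in 2n tosses occurs at toss 2k is u_{2k}\cdot u_{2n-2k}, where u_{2m}=\binom{2m}{m}/4^{m}. A short sketch to check it against the simulation (the formula is standard; the code here is just mine):

u=function(m) choose(2*m, m)/4^m
exact=sapply(0:6, function(k) u(k)*u(6-k)) #probabilities of the last tie at toss 0, 2, ..., 12
round(exact, 4) #0.2256 0.1230 0.1025 0.0977 0.1025 0.1230 0.2256
sum(exact) #sanity check: the probabilities add up to 1

And this is the code of the simulation: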

library(dplyr)
library(ggplot2)
library(scales)
tosses=12
iter=10000
results=data.frame(nmax=numeric(0), count=numeric(0), iter=numeric(0))
tmp=data.frame(nmax=numeric(0))
#For each iteration, record the last toss where heads and tails are even (-Inf if never)
for (j in 1:iter)
{
data.frame(x=sample(c(-1,1), size=tosses, replace=TRUE)) %>%
mutate(n=row_number(), cumsum=cumsum(x)) %>%
filter(cumsum==0) %>%
summarize(nmax=max(n)) %>% rbind(tmp)->tmp
}
tmp %>%
group_by(nmax) %>%
summarize(count=n()) %>%
mutate(nmax=ifelse(is.finite(nmax), nmax, 0), iter=iter) %>%
rbind(results)->results
opts=theme(
panel.background = element_rect(fill="darkolivegreen1"),
panel.border = element_rect(colour="black", fill=NA),
axis.line = element_line(size = 0.5, colour = "black"),
axis.ticks = element_line(colour="black"),
panel.grid.major = element_line(colour="white", linetype = 1),
panel.grid.minor = element_blank(),
axis.text.y = element_text(colour="black"),
axis.text.x = element_text(colour="black"),
text = element_text(size=20),
legend.key = element_blank(),
plot.title = element_text(size = 30)
)
ggplot(results, aes(x=nmax, y=count/iter)) +
geom_line(size=2, color="green4")+
geom_point(size=8, fill="green4", colour="darkolivegreen1",pch=21)+
scale_x_continuous(breaks = seq(0, tosses, by=2))+
scale_y_continuous(labels=percent, limits=c(0, .25))+
labs(title="What happens when you toss a coin 12 times?",
x="Last throwing where cumulated #tails = #heads",
y="Probability (estimated)")+opts

The Unbearable Insolence of Prime Numbers (or Playing to Be Ulam)

So rock me mama like a wagon wheel, rock me mama anyway you feel (Wagon Wheel, Old Crow Medicine Show)

This is the third iteration of the Hilbert curve. I placed points at its corners. Since the curve has a beginning and an end, I labeled each vertex with the order it occupies:

hilbert_primes3_1

Dark green vertices are those labeled with prime numbers and light ones those labeled with non-primes. This is the sixth iteration colored as I described before (I removed lines and labels):

hilbert_primes6_2

The previous plot has 4,096 points. There are 564 primes lower than 4,096. What if I color 564 points at random instead of coloring the primes? This is an example:
hilbert_primes6_2rand
Do you see any difference? I do. Let me place both images together (on the left, the one with primes colored):
hilbert_primes6_3

The dark points are much more ordered in the first plot; the second one is noisier. This is my particular tribute to Stanislaw Ulam and his spiral: one of the most amazing fruits of boredom in the history of mathematics.

This is the code:

library(reshape2)
library(dplyr)
library(ggplot2)
library(pracma)
opt=theme(legend.position="none",
          panel.background = element_rect(fill="white"),
          panel.grid=element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text=element_blank())
#Coordinates and visiting order of the n-th iteration of the Hilbert curve
hilbert = function(m,n,r) {
  for (i in 1:n)
  {
    tmp=cbind(t(m), m+nrow(m)^2)
    m=rbind(tmp, (2*nrow(m))^r-tmp[nrow(m):1,]+1)
  }
  melt(m) %>% plyr::rename(c("Var1" = "x", "Var2" = "y", "value"="order")) %>% arrange(order)}
iter=3 #Number of iterations (use 6 to reproduce the unlabeled plots)
df=hilbert(m=matrix(1), n=iter, r=2)
subprimes=primes(nrow(df))
df %>%  mutate(prime=order %in% subprimes,
               #random is TRUE with probability #primes/#points, so roughly as many random points as primes
               random=sample(x=c(TRUE, FALSE), size=nrow(df), prob=c(length(subprimes),(nrow(df)-length(subprimes))), replace = TRUE)) -> df
#Labeled (primes colored)
ggplot(df, aes(x, y, colour=prime)) +
  geom_path(color="gray75", size=3)+
  geom_point(size=28)+
  scale_colour_manual(values = c("olivedrab1", "olivedrab"))+
  scale_x_continuous(expand=c(0,0), limits=c(0,2^iter+1))+
  scale_y_continuous(expand=c(0,0), limits=c(0,2^iter+1))+
  geom_text(aes(label=order), size=8, color="white")+
  opt
#Non labeled (primes colored)
ggplot(df, aes(x, y, colour=prime)) +
  geom_point(size=5)+
  scale_colour_manual(values = c("olivedrab1", "olivedrab"))+
  scale_x_continuous(expand=c(0,0), limits=c(0,2^iter+1))+
  scale_y_continuous(expand=c(0,0), limits=c(0,2^iter+1))+
  opt
#Non labeled (random colored)
ggplot(df, aes(x, y, colour=random)) +
  geom_point(size=5)+
  scale_colour_manual(values = c("olivedrab1", "olivedrab"))+
  scale_x_continuous(expand=c(0,0), limits=c(0,2^iter+1))+
  scale_y_continuous(expand=c(0,0), limits=c(0,2^iter+1))+
  opt

Simple Data Science To Maximize Return On Lottery Investment

Every finite game has an equilibrium point (John Nash, Non-Cooperative Games, 1950)

I recently read this amazing book, where I discovered that we (humans) are not capable of generating random sequences of numbers by ourselves when we play the lottery. John Haigh demonstrates this fact by analyzing a sample of 282 raffles of the 6/49 UK Lotto. Once I read this, I decided to test whether this inability is exclusive to the British population or is shared by Spanish people as well. I am Spanish, so this experiment may bring painful results to myself, but here I come.

The Spanish equivalent of the 6/49 UK Lotto is called "Lotería Primitiva" (or "Primitiva", for short). This is a Primitiva lotto ticket:

JugarPrimitiva

As you can see, one ticket gives the chance to make 8 bets. Each bet consists of choosing 6 numbers between 1 and 49 in a grid of 10 rows by 5 columns. People tend to choose scattered numbers because we think they are more likely to come up than combinations with some consecutive numbers. We think we have better chances of getting rich choosing 4-12-23-25-31-43 than 3-17-18-19-32-33, for instance. To be honest, I must admit I am one of these people.

Primitiva lotto is managed by Sociedad Estatal Loterías y Apuestas del Estado, a public business entity belonging to the Spanish Ministry of Finance and Public Administrations. They know what people choose and could run this experiment more precisely than me: they could analyze just human bets (those made by players themselves) and discard machine ones (those generated automatically by vending machines). But anyway, it is possible to confirm the previous thesis with some public data.

I analysed 432 raffles of Primitiva carried out between 2011 and 2015; for each raffle I have this information:

  • The six numbers that form the winning combination
  • Total number of bets
  • Number of bets which hit the six numbers (Observed Winners)

The idea is to compare the observed number of winners in each raffle with the expected number, estimated as follows:

Expected\, Winners=\frac{Total\, Bets}{C_{6}^{49}},\: where\: C_{6}^{49}=\binom{49}{6}=\frac{49!}{43!6!}
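As a reference, C_{6}^{49}=\binom{49}{6}=13,983,816, so a raffle with, say, 20 million bets (an invented figure, just to illustrate) would have:

choose(49,6) #13983816 possible combinations
20e6/choose(49,6) #about 1.43 expected winners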

This table compares the number of expected and observed winners between raffles containing consecutive numbers and raffles that do not:

Table1

There are 214 raffles without consecutive numbers, with 294 observed winners while only 219 were expected: about a third more winners than predicted, so on average a winner of a non-consecutive raffle must share the prize with a third of another person. On the other hand, the number of observed winners of raffles with consecutive numbers is 17% lower than the expected one. Simple and conclusive. Spaniards are like Britons, at least concerning this particular issue.

Let's go further. I can do the same for any particular number. For example, there were 63 raffles containing the number 45 in the winning combination, with 57 observed winners although 66 were expected. After doing this for every number, I can draw this plot, where I paint in blue those numbers whose ratio of observed to expected winners is lower than 0.9:

LotteryBlues

It seems that blue numbers are concentrated on the right side of the grid. Do we prefer small numbers to big ones? There are 15 primes between 1 and 49 (a rate of 30%) but only 3 primes among the blue numbers (a rate of 23%). Are we attracted by primes?

Let's combine both previous results. This table compares the number of expected and observed winners between raffles containing both consecutive numbers and blue numbers (at least one of each) and the rest:

Table2

Now, winning combinations with some consecutive and some blue numbers present 20% fewer observed winners than expected. After this, which combination would you choose for your next bet? 27-35-36-41-44-45 or 2-6-13-15-26-28? I would choose the first one. Both have the same probability of coming up, but you will probably become richer with the first one if it does.

This is the code of the experiment. If someone needs the dataset to run their own experiments, feel free to ask me (you can find my email here):

library("xlsx")  
library("sqldf")
library("Hmisc")
library("lubridate")
library("ggplot2")
library("extrafont")
library("googleVis")
windowsFonts(Garamond=windowsFont("Garamond"))
setwd("YOUR WORKING DIRECTORY HERE")
file = "SORTEOS_PRIMITIVA_2011_2015.xls"
data=read.xlsx(file, sheetName="ALL", colClasses=c("numeric", "Date", rep("numeric", 21)))  
#Impute null values to zero
data$C1_EUROS=with(data, impute(C1_EUROS, 0))
data$CE_WINNERS=with(data, impute(CE_WINNERS, 0))
#Expected winners for each raffle
data$EXPECTED=data$BETS/(factorial(49)/(factorial(49-6)*factorial(6)))
#Consecutives indicator
data$DIFFMIN=apply(data[,3:8], 1, function (x) min(diff(sort(x))))
#Consecutives vs non-consecutives comparison
df1=sqldf("SELECT CASE WHEN DIFFMIN=1 THEN 'Yes' ELSE 'No' END AS CONS, 
      COUNT(*) AS RAFFLES,
      SUM(EXPECTED) AS EXP_WINNERS, 
      SUM(CE_WINNERS+C1_WINNERS) AS OBS_WINNERS
      FROM data GROUP BY CONS")
colnames(df1)=c("Contains consecutives?", "Number of  raffles", "Expected Winners", "Observed Winners")
Table1=gvisTable(df1, formats=list('Expected Winners'='#,###'))
plot(Table1)
#Heat map of each number
results=data.frame(BALL=numeric(0), EXP_WINNER=numeric(0), OBS_WINNERS=numeric(0))
for (i in 1:49)
{
  data$TF=apply(data[,3:8], 1, function (x) (i %in% x)+0) #1 if the winning combination contains number i
  v=data.frame(BALL=i, sqldf("SELECT SUM(EXPECTED) AS EXP_WINNERS, SUM(CE_WINNERS+C1_WINNERS) AS OBS_WINNERS FROM data WHERE TF = 1"))
  results=rbind(results, v)
}
results$ObsByExp=results$OBS_WINNERS/results$EXP_WINNERS
results$ROW=results$BALL%%10+1
results$COL=floor(results$BALL/10)+1
results$ObsByExp2=with(results, cut(ObsByExp, breaks=c(-Inf,.9,Inf), right = FALSE))
opt=theme(legend.position="none",
          panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.ticks=element_blank(),
          axis.title=element_blank(),
          axis.text =element_blank())
ggplot(results, aes(y=ROW, x=COL)) +
  geom_tile(aes(fill = ObsByExp2), colour="gray85", lwd=2) +
  geom_text(aes(family="Garamond"), label=results$BALL, color="gray10", size=12)+
  scale_fill_manual(values = c("dodgerblue", "gray98"))+
  scale_y_reverse()+opt
#Blue numbers
Bl=subset(results, ObsByExp2=="[-Inf,0.9)")[,1]
data$BLUES=apply(data[,3:8], 1, function (x) length(intersect(x,Bl)))
#Combination of consecutives and blues
df2=sqldf("SELECT CASE WHEN DIFFMIN=1 AND BLUES>0 THEN 'Yes' ELSE 'No' END AS IND, 
      COUNT(*) AS RAFFLES,
      SUM(EXPECTED) AS EXP_WINNERS, 
      SUM(CE_WINNERS+C1_WINNERS) AS OBS_WINNERS
      FROM data GROUP BY IND")
colnames(df2)=c("Contains consecutives and blues?", "Number of  raffles", "Expected Winners", "Observed Winners")
Table2=gvisTable(df2, formats=list('Expected Winners'='#,###'))
plot(Table2)

How Big Is The Vatican City?

You say the river finds the way to the sea, and like the river you will come to me (Miss Sarajevo, U2)

One way to approximately calculate the area of a place is to circumscribe it in a polygon whose area you know, generate coordinates inside the polygon, and count how many of them fall within the place. The proportion of coordinates inside the place, multiplied by the area of the polygon, is an approximation of the desired area.
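The same idea in miniature: estimating the area of the unit circle from a grid of points in the square [-1,1]×[-1,1], whose area is 4 (a toy sketch of the technique, unrelated to the Vatican code below):

g=expand.grid(x=seq(-1, 1, length.out=100), y=seq(-1, 1, length.out=100))
mean(g$x^2+g$y^2<=1)*4 #proportion of points inside the circle times the square's area: close to pi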

I applied this technique to calculate the area of the Vatican City. I generated a square grid of coordinates around the Capella Sistina (located inside the Vatican City). I easily obtained the convex hull of the coordinates using the chull function of the grDevices package, and then calculated its area using the areaPolygon function of the geosphere package.

To find out how many coordinates of the grid fall inside the Vatican City, I used the revgeocode function of the ggmap package (I love this function). For me, a coordinate is inside the Vatican City if its associated address contains the words "Vatican City".

What happens when generating a 20×20 grid of coordinates? I obtain that the area of the Vatican City is about 0.32 km², but according to Wikipedia the area is 0.44 km²: this method underestimates the area by around 27%. But why? Look at this:

Vatican2

This plot shows which addresses of the grid fall inside the Vatican City (ones) and which do not (zeros). As you can see, there is a big zone in the south, and a smaller one in the north of the city, where reverse geocoding does not return "Vatican City" addresses.

Maybe Pope Francis should phone Larry Page and Sergey Brin to claim this 27% of his wonderful country.

I was willing to do this experiment since I wrote this post. This is the code:

require(geosphere)
require(ggmap) #note: recent versions of ggmap require a Google API key (see ?register_google)
require(plotGoogleMaps)
require(grDevices)
setwd("YOUR-WORKING-DIRECTORY-HERE")
#Coordinates of Capella Sistina
capella=geocode("capella sistina, Vatican City, Roma")
#20x20 grid of coordinates around the Capella
g=expand.grid(lon = seq(capella$lon-0.010, capella$lon+0.010, length.out=20),
lat = seq(capella$lat-0.005, capella$lat+0.005, length.out=20))
#Hull Polygon containing coordinates
p=g[c(chull(g),chull(g)[1]),]
#Address of each coordinate of grid
a=apply(g, 1, revgeocode)
#Estimated area of the Vatican City (in square kilometers)
length(grep("Vatican City", a))/length(a)*areaPolygon(p)/1000/1000
s=cbind(g, a)
s$InOut=apply(s, 1, function(x) grepl('Vatican City', x[3]))+0
coordinates(s)=~lon+lat
proj4string(s)=CRS('+proj=longlat +datum=WGS84')
ic=iconlabels(s$InOut, height=12)
plotGoogleMaps(s, iconMarker=ic, mapTypeId="ROADMAP", legend=FALSE)

Mixing Waves

Fill a cocktail shaker with ice; add vodka, triple sec, cranberry, and lime, and shake well; strain into a chilled cocktail glass and garnish with orange twist (Cosmopolitan Cocktail Recipe)

This is a tribute to Blaise Pascal and Joseph Fourier, two of the greatest mathematicians in history. As Pascal did in his famous triangle, I generate a set of random curves (sines or cosines with random frequencies between 1 and 50) and arrange them along the lateral edges of the triangle. Each inner curve of the triangle is the sum of the two curves directly above it. This is the result for a triangle with 6 rows:

Adding Waves

Two comments:

  1. Inner curves are noisy. The greater the distance from the edges, the higher the entropy. When I was a child I used to play a game called broken telephone; I can see some kind of connection between this graphic and that game.
  2. I have read that using eval+parse is a symptom of being a bad programmer. Does anyone have an idea of how to do this some other way without filling the screen with code? (One possibility is sketched right after this list.)
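Answering my own question, one eval+parse-free possibility (just a sketch with names of my own invention, not the code I actually used) is to keep the curves in a nested list and build each inner one as a closure over its two parents:

make_base=function() { #a random sine or cosine with random frequency in [1, 50]
  a=runif(1, min=1, max=50)
  if (runif(1)<0.5) function(x) sin(a*x) else function(x) cos(a*x)
}
make_sum=function(f1, f2) {force(f1); force(f2); function(x) f1(x)+f2(x)}
f=list()
for (i in 1:6) {
  f[[i]]=vector("list", i)
  f[[i]][[1]]=make_base()
  f[[i]][[i]]=make_base()
  if (i>2) for (j in 2:(i-1)) f[[i]][[j]]=make_sum(f[[i-1]][[j-1]], f[[i-1]][[j]])
}
f[[4]][[2]](pi/3) #any curve of the triangle can be evaluated directly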

This is the code:

library(ggplot2)
library(gridExtra)
library(grid) #viewport, grid.newpage and pushViewport live here
nrows=6
for (i in 1:nrows){
  eval(parse(text=paste("f",i,1,"=function(x) ", sample(c("sin(","cos("),1), runif(min=1, max=50,1) ,"*x)",sep="")))
  eval(parse(text=paste("f",i,i,"=function(x) ", sample(c("sin(","cos("),1), runif(min=1, max=50,1) ,"*x)",sep="")))}
for (i in 3:nrows) {
  for (j in 2:(i-1)) eval(parse(text=paste("f",i, j, "=function(x) f",(i-1),(j-1), "(x) + f",(i-1),j,"(x)",sep="")))}
vplayout=function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
opts=theme(legend.position="none",
           panel.background = element_rect(fill="gray95"),
           plot.background = element_rect(fill="gray95", colour="gray95"),
           panel.grid = element_blank(),
           axis.ticks=element_blank(),
           axis.title=element_blank(),
           axis.text =element_blank())
setwd("YOUR WORKING DIRECTORY HERE")
grid.newpage()
jpeg(file="Adding Waves.jpeg", width=1800,height=1000, bg = "gray95", quality = 100)
pushViewport(viewport(layout = grid.layout(nrows, 2*nrows-1)))
for (i in 1:nrows) {
  for (j in 1:i) {
    print(ggplot(data.frame(x = c(0, 20)), aes(x)) + stat_function(fun = eval(parse(text=paste("f",i,j,sep=""))), colour = "black", alpha=.75)+opts, vp = vplayout(i, nrows+(2*j-(i+1))))
  }
}
dev.off()

How e Can Help You To Find The Love Of Your Life

Match.com will bring more love to the planet than anything since Jesus Christ (Gary Kremen, founder of Match.com)

Sarah is a brilliant 39-year-old mathematician living in Massachusetts. She lives alone and has dedicated her whole life to study. Lately she has realized that theorems and books no longer satisfy her: she needs to find love.

To find the love of her life, Sarah joined Match.com, planning to have a date with a man every week for a year (52 dates in total). She has her own method to rate each man according to his friendliness, his physical appearance, his punctuality, his conversation and his hobbies. This method allows her to compare candidates with each other. Sarah wants to choose the top-scored man, but she is afraid of wasting time. If she waits until having all the dates, it could be too late to call back the best candidate, especially if he was one of the first dates. So she wants to be agile and decide immediately. Her plan is as follows: she will use some initial dates only to assess the candidates, and after this period she will try to win over the first man who scores better than all of those first candidates.

But how many men should she discard to maximize the probability of choosing the top-scored one? If she discards just one, the probability of meeting a better man on the next date is very high, but he will probably not be the man she is looking for. On the other hand, discarding many men makes it very likely that the top-scored one is discarded as well.

Sarah did a simulation in R of the 52 dates to approximate the probability of choosing the best man depending on the number of discards. She found that the probability of choosing the top-scored man is maximal when discarding the first 19 men, as can be seen in the following graph:

Prince

Why 19? Because 19 is approximately 52/e. This is one of those rare places where the number e can be found. You can see an explanation of the phenomenon here.
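The simulation agrees with the exact analysis of the secretary problem: discarding r candidates out of n, the probability of ending up with the best one is P(r)=\frac{r}{n}\sum_{i=r+1}^{n}\frac{1}{i-1} (and 1/n for r=0), which is maximized near n/e. A quick check (a sketch, independent of Sarah's simulation):

n=52
P=function(r) if (r==0) 1/n else (r/n)*sum(1/(r:(n-1)))
probs=sapply(0:(n-1), P)
which.max(probs)-1 #optimal number of discards: 19
n/exp(1) #19.12846, as the heuristic predicts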

Note: this is just a story to illustrate the secretary problem without repeating the original statement of the problem. My apologies if I have caused offense to anyone. This is a blog about mathematics and R, and that is how it must be understood.

require(extrafont)
require(ggplot2)
n=52
sims=1000
results=data.frame(discards=numeric(0), triumphs=numeric(0))
for(i in 0:n)
{
  triumphs=0
  for (j in 1:sims) {
    opt=sample(1:n) #random arrival order of the candidates (n is the best one)
    if (i==0) triumphs=triumphs+(opt[1]==n) #no discards: she keeps the very first candidate
    else if (max(opt[1:i])==n) triumphs=triumphs+0 #the best candidate was among the discarded
    else triumphs=triumphs+(opt[(i+1):n][min(which(opt[(i+1):n] > max(opt[1:i])))]==n)}
  results=rbind(results, data.frame(discards=i, triumphs=triumphs/sims))
}
opts=theme(
  panel.background = element_rect(fill="darkolivegreen1"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="white", linetype = 1),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="black", size=20),
  axis.text.x = element_text(colour="black", size=20),
  text = element_text(size=25, family="xkcd"),
  legend.key = element_blank(),
  legend.background = element_blank(),
  plot.title = element_text(size = 40))
ggplot(results, aes(discards, triumphs))+
  geom_vline(xintercept = n/exp(1), size = 1, linetype=2, colour = "black", alpha=0.8)+
  geom_line(color="green4", size=1.5)+
  geom_point(color="gray92", size=8, pch=16)+
  geom_point(color="green4", size=6, pch=16)+
  ggtitle("How e can help you to find the love of your life")+
  xlab("Discards") +
  ylab("Prob. of finding the love of your life")+
  scale_x_continuous(breaks=seq(0, n, by = 2))+opts

Mixing Benford, GoogleVis And On-Line Encyclopedia of Integer Sequences

The chess-board is the world; the pieces are the phenomena of the universe; the rules of the game are what we call the laws of Nature (T. H. Huxley)

One of the greatest packages I have discovered recently is googleVis. While plotting with ggplot can sometimes be very arduous, doing plots with googleVis is extremely easy. Here you can find many examples of what you can do with this great package.

Not long ago, I also discovered The On-Line Encyclopedia of Integer Sequences (OEIS), a huge database of about 250,000 integer sequences where, for example, you can find the number of ways to lace a shoe that has n pairs of eyelets, or the smallest number of stones in Tchoukaillon (or Mancala, or Kalahari) solitaire which make use of the n-th hole. Many mathematicians, such as Ralph Stephan, use this useful resource to develop their theories.

The third protagonist of this story is Frank Benford, who in 1938 formulated his famous law, which states that in many lists of numbers, 1 occurs as the leading digit about 30% of the time, while larger digits occur in that position less frequently.
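In R, these theoretical frequencies are simply log10(1+1/d) for each digit d, which is exactly what the code below uses:

round(log10(1+1/(1:9)), 3) #0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046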

In this experiment I read 20 random sequences from the OEIS. For each sequence, I obtain the distribution of the first digits of its numbers and calculate its similarity with the theoretical distribution given by Benford's Law (one minus the Euclidean distance between both distributions), so the more similar the distribution, the closer this number is to 1. Sequences of the OEIS are labeled with a seven-character code (an "A" followed by 6 digits). A nice way to show the output of this experiment is using the Gauge visualization of googleVis:

IMG1

Sequence A001288 is the closest to Benford's Law. This sequence is the number of distinct 11-element subsets that can be formed from an n-element set. Why is it so close to Benford's Law? No idea, beyond the fact that binomial coefficients are related to some biological laws, such as the number of descendants of a couple of rabbits.

I would like to wish you all a Merry Christmas and a Happy New Year:

library(plyr)
#Benford's Law: theoretical frequency of each leading digit
benford=data.frame(first=0:9, freq=c(0,log10(1+1/(1:9))))
SequencesIds=formatC(sample(1:250000, 20, replace=FALSE), width = 6, format = "d", flag = "0")
results=data.frame(SEQID=character(0), BENFORDNESS=numeric(0))
for(i in 1:length(SequencesIds))
{
  SEQID = SequencesIds[i]
  TEXTFILE=paste("b", SEQID, ".txt", sep="")
  if (!file.exists(TEXTFILE)) download.file(paste("http://oeis.org/A",SEQID, "/b", SEQID, ".txt",sep=""), destfile = TEXTFILE)
  SEQ=readLines(paste("b", SEQID, ".txt", sep=""))
  SEQ=SEQ[SEQ != ""]
  SEQ=SEQ[unlist(gregexpr(pattern ='synthesized',SEQ))<0]
  m=t(sapply(SEQ, function(x) unlist(strsplit(x, " "))))
  df=data.frame(first=substr(gsub("[^0-9]","",m[,2]), 1, 1), row.names = NULL)
  df=count(df, vars = "first")
  df$freq=df$freq/sum(df$freq)
  df2=merge(x = benford, y = df, by = "first", all.x=TRUE)
  df2[is.na(df2)]=0
  results=rbind(results, data.frame(SEQID=paste("A", SEQID, sep=""), BENFORDNESS=1-sqrt(sum((df2$freq.x - df2$freq.y) ^ 2))))
}
results$BENFORDNESS=as.numeric(format(round(results$BENFORDNESS, 2), nsmall = 2))
Gauge=gvisGauge(results, options=list(min=0, max=1, greenFrom=.75, greenTo=1, yellowFrom=.25, yellowTo=.75, redFrom=0, redTo=.25, width=400, height=300))
plot(Gauge)

The Awesome Parrondo’s Paradox

A technique succeeds in mathematical physics, not by a clever trick, or a happy accident, but because it expresses some aspect of physical truth (O. G. Sutton)

Imagine three unbalanced coins:

  • Coin 1: Probability of head=0.495 and probability of tail=0.505
  • Coin 2: Probability of head=0.745 and probability of tail=0.255
  • Coin 3: Probability of head=0.095 and probability of tail=0.905

Now let’s define two games using these coins:

  • Game A: You toss coin 1 and if it comes up heads you receive 1€; if not, you lose 1€
  • Game B: If your present capital is a multiple of 3, you toss coin 3; if not, you toss coin 2 (this is what the code below actually plays). In both cases, you receive 1€ if the coin comes up heads and lose 1€ if not.

Played separately, both games are quite unfavorable. Now let’s define Game A+B in which you toss a balanced coin and if it comes up head, you play Game A and play Game B otherwise. In other words, in Game A+B you decide between playing Game A or Game B randomly.
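To convince yourself that Game B alone really is losing, you can treat the capital modulo 3 as a Markov chain and compute the expected profit per play under its stationary distribution (a hedged sketch, not part of the simulation code below):

p3=0.095; p2=0.745 #head probabilities when capital is/is not a multiple of 3
P=rbind(c(0, p3, 1-p3), #from state 0 (multiple of 3): toss coin 3
        c(1-p2, 0, p2), #from state 1: toss coin 2
        c(p2, 1-p2, 0)) #from state 2: toss coin 2
pi_s=Re(eigen(t(P))$vectors[,1]); pi_s=pi_s/sum(pi_s) #stationary distribution
sum(pi_s*c(2*p3-1, 2*p2-1, 2*p2-1)) #about -0.009€ per play: slightly losing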

Starting with 0€, it is easy to simulate the three games over 500 plays. This is an example of one of these simulations:

The resulting profit of Game A+B after 500 plays is +52€, while it is -9€ and -3€ for Games A and B respectively. Let's do some more simulations (I removed legends and titles, but the colors of the games are the same):


As you can see, Game A+B is the most profitable in almost all the previous simulations. Coincidence? Not at all: this is a consequence of the stunning Parrondo's Paradox, which states that two losing games can combine into a winning one. Roughly speaking, the random switches to Game A shake the distribution of the capital modulo 3, so the favorable coin 2 is tossed more often than in Game B alone.

If you still don't believe in this brain-crashing paradox, below you can see the empirical distributions of the final profits of the three games after 1,000 plays:

After 1,000 plays, the mean profit of Game A is -13€, while it is -7€ for Game B and +17€ for Game A+B.

This paradox was discovered in the late nineties by the Spanish physicist Juan Parrondo, and it can help to explain, among other things, why investing in losing shares can result in big profits. Amazing:

require(ggplot2)
require(scales)
library(gridExtra)
opts=theme(
  legend.position = "bottom",
  legend.background = element_rect(colour = "black"),
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="gray25", size=15),
  axis.text.x = element_text(colour="gray25", size=15),
  text = element_text(size=20),
  plot.title = element_text(size = 35))
PlayGameA = function(profit, x, c) {if (runif(1) < c-x) profit+1 else profit-1}
PlayGameB = function(profit, x1, c1, x2, c2) {if (profit%%3>0) PlayGameA(profit, x=x1, c=c1) else PlayGameA(profit, x=x2, c=c2)}
####################################################################
#EVOLUTION
####################################################################
noplays=500
alpha=0.005
profit0=0
results=data.frame(Play=0, ProfitA=profit0, ProfitB=profit0, ProfitAB=profit0)
for (i in 1:noplays) {results=rbind(results, c(i,
    PlayGameA(profit=results[results$Play==(i-1),2], x =alpha, c =0.5),
    PlayGameB(profit=results[results$Play==(i-1),3], x1=alpha, c1=0.75, x2=alpha, c2=0.1),
    if (runif(1)<0.5) PlayGameA(profit=results[results$Play==(i-1),4], x =alpha, c =0.5) else PlayGameB(profit=results[results$Play==(i-1),4], x1=alpha, c1=0.75, x2=alpha, c2=0.1)
    ))}
results=rbind(data.frame(Play=results$Play, Game="A",   Profit=results$ProfitA),
              data.frame(Play=results$Play, Game="B",   Profit=results$ProfitB),
              data.frame(Play=results$Play, Game="A+B", Profit=results$ProfitAB))
ggplot(results, aes(x=Play, y=Profit, color = Game)) +
  scale_x_continuous(limits=c(0,noplays), "Plays")+
  scale_y_continuous(limits=c(-75,75), expand = c(0, 0), "Profit")+
  labs(title="Evolution of profit games along 500 plays")+
  geom_line(size=3)+opts
####################################################################
#DISTRIBUTION
####################################################################
noplays=1000
alpha=0.005
profit0=0
results2=data.frame(Play=numeric(0), ProfitA=numeric(0), ProfitB=numeric(0), ProfitAB=numeric(0))
for (j in 1:100) {results=data.frame(Play=0, ProfitA=profit0, ProfitB=profit0, ProfitAB=profit0)
  for (i in 1:noplays) {results=rbind(results, c(i,
      PlayGameA(profit=results[results$Play==(i-1),2], x =alpha, c =0.5),
      PlayGameB(profit=results[results$Play==(i-1),3], x1=alpha, c1=0.75, x2=alpha, c2=0.1),
      if (runif(1)<0.5) PlayGameA(profit=results[results$Play==(i-1),4], x =alpha, c =0.5)
      else PlayGameB(profit=results[results$Play==(i-1),4], x1=alpha, c1=0.75, x2=alpha, c2=0.1)))}
      results2=rbind(results2, results[results$Play==noplays, ])}
results2=rbind(data.frame(Game="A", Profit=results2$ProfitA),
data.frame(Game="B", Profit=results2$ProfitB),
data.frame(Game="A+B", Profit=results2$ProfitAB))
ggplot(results2, aes(Profit, fill = Game)) +
  scale_x_continuous(limits=c(-150,150), "Profit")+
  scale_y_continuous(limits=c(0,0.02), expand = c(0, 0), "Density", labels = percent)+
  labs(title=paste("Parrondo's Paradox (",as.character(noplays)," plays)",sep=""))+
  geom_density(alpha=.75)+opts

Hi

Why do some mathematicians wear a white coat? Are they afraid to be splashed by an integral? (Read on Twitter)

If you run into someone wearing a white coat who tells you something like

e raised to minus 3 by zero point five plus x squared plus y squared between two plus e raised to minus x squared minus y squared between two by cosine of four by x

do not be afraid: it is just a harmless mathematician waving to you. Look at this:

HI2
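In standard notation, the function being verbalized (the same one defined in the code below) is:

f(x,y)=e^{-3\left((0.5+x)^{2}+\frac{y^{2}}{2}\right)}+e^{-x^{2}-\frac{y^{2}}{2}}\cos(4x)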

This is the code to draw these mathematical greetings:

levelpersp=function(x, y, z, colors=heat.colors, ...) {
  ## getting the value of the midpoint
  zz=(z[-1,-1] + z[-1,-ncol(z)] + z[-nrow(z),-1] + z[-nrow(z),-ncol(z)])/4
  ## calculating the breaks
  breaks=hist(zz, plot=FALSE)$breaks
  ## cutting up zz
  cols=colors(length(breaks)-1)
  zzz=cut(zz, breaks=breaks, labels=cols)
  ## plotting
  persp(x, y, z, col=as.character(zzz), ...)
}
x=seq(-5, 5, length= 30);y=x
f=function(x,y) {exp(-3*((0.5+x)^2+y^2/2))+exp(-x^2-y^2/2)*cos(4*x)}
z=outer(x, y, f)
z[z>.001]=.001; z[z<0]=z[1,1] #cap the peaks and raise negative values to the (near-zero) corner value
levelpersp(x, y, z, theta = 30, phi = 55, expand = 0.5, axes=FALSE, box=FALSE, shade=.25)