NASDAQ 100 Couples

Heaven, I’m in heaven, and my heart beats so that I can hardly speak, and I seem to find the happiness I seek, when we’re out together dancing cheek to cheek (Cheek To Cheek, Irving Berlin)

There are about 6.500 available packages in CRAN repository. If I were a superhuman, able to learn one package a day, I would spend almost 18 years of my life studying R. And how many packages would be uploaded to CRAN during this period? Who knows: R is infinite.

Today, my experiment deals with quantmod package, which allows you to play to be quant for a while. I download the daily quotes of NASDAQ 100 companies and measure distances between each pair of companies. Distance is based on the cross-correlation between two series so high-correlated series (not exceeding a maximum lag) are closer than low-correlated ones. You can read a good description of this distance here. Since NASDAQ 100 contains 107 companies, I calculate distances for 5.671 different couples. Next plot represent distances between each pair of companies. The darker is the color, the closer are the related companies:

Nasdaq100

Yes, I know is not a graph for someone with visual problems. Let me show you an example of what is behind one of these little tiles. Distance between Mattel Inc. and 21st Century Fox is very small (its related tile is dark coloured). Why? Because of this:

MattelFox
These two companies have been dancing cheek to cheek for more than seven years. It is also curious how some companies are far from any of their NASDAQ 100 colleagues. Some examples of these unpaired companies are Express Scripts Holding Company (ESRX), Expeditors International of Washington Inc. (EXPD) and Fastenal Company (FAST). I do not why but there must be an explanation, do not you think so?

Something tells me I will do some other experiment using quantmod package:

library("quantmod")
library("TSdist")
library("ade4")
library("ggplot2")
library("Hmisc")
library("zoo")
library("scales")
library("reshape2")
setwd("YOUR WORKING DIRECTORY HERE")
temp=tempfile()
download.file("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx?render=download",temp)
data=read.csv(temp, header=TRUE)
for (i in 1:nrow(data)) getSymbols(as.character(data[i,1]))
results=t(apply(combn(sort(as.character(data[,1]), decreasing = TRUE), 2), 2,
      function(x)
      {
        ts1=drop(Cl(eval(parse(text=x[1]))))
        ts2=drop(Cl(eval(parse(text=x[2]))))
        c(symbol1=x[1], symbol2=x[2], tsDistances(ts1, ts2, distance="crosscorrelation"))
      }))
results=as.data.frame(results)
colnames(results)=c("Sym1", "Sym2", "TSdist")
results$TSdist=as.numeric(as.character(results$TSdist))
results=rbind(results, data.frame(Sym1=as.character(data[,1]), Sym2=as.character(data[,1]), TSdist=0))
results$TSdist2=as.numeric(cut2(results$TSdist, g=4))
opts=theme(axis.text.x = element_text(angle = 90, vjust=.5, hjust = 0),
           panel.background = element_blank(),
           axis.text = element_text(colour="gray25", size=8),
           legend.position = "none",
           panel.grid = element_blank())
ggplot(results,aes(x=Sym2,y=Sym1))+
  geom_tile(aes(fill = TSdist2), colour="gray80")+
  scale_size_continuous(range=c(1,10))+
  scale_x_discrete("", limits=sort(unique(as.character(results$Sym1))))+
  scale_y_discrete("", limits=sort(unique(as.character(results$Sym2)), decreasing = TRUE))+
  scale_fill_gradient(low = "steelblue", high = "white")+
  opts
MAT.close=Cl(MAT)
FOX.close=Cl(FOX)
cls=merge(MAT.close, FOX.close, all = FALSE)
df=data.frame(date = time(cls), coredata(cls))
names(df)[-1]=c("mat", "fox")
df1=melt(df, id.vars = "date", measure.vars = c("mat", "fox"))
opts2=theme(
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=15),
  axis.title = element_text(size=18, colour="gray10"),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 40, colour="gray10"))
ggplot(df1, aes(x = date, y = value, color = variable))+
  geom_line(size = I(1.2))+
  scale_color_discrete(guide = "none")+
  scale_x_date(labels = date_format("%Y-%m-%d"))+
  labs(title="Nasdaq 100 Couples: Mattel And Fox", x="Date", y="Closing Price")+
  annotate("text", x = as.Date("2011-01-01", "%Y-%m-%d"), y = c(10, 30), label = c("21st Century Fox", "Mattel Inc."), size=7, colour="gray25")+
  opts2

The World We Live In #4: Marriage Ages

It is time for women to stop being politely angry (Leymah Gbowee, Nobel Prize Peace Winner)

Sometimes very simple plots give insight into we live in a world of differences. This plot shows the mean age at marriage for men and women across countries:

Marriage Ages

Being a woman in some countries of this world must be a hard experience:

#Singulate mean age at marriage: http://data.un.org/Data.aspx?d=GenderStat&f=inID%3a20
#Population: http://data.un.org/Data.aspx?d=SOWC&f=inID%3a105
require("sqldf")
require("ggplot2")
setwd("YOUR WORKING DIRECTORY HERE")
mar=read.csv("UNdata_Export_20150309_171525152.csv", nrows=321, header=T, row.names=NULL)
pop=read.csv("UNdata_Export_20150309_172046384.csv", nrows=999, header=T, row.names=NULL)
colnames(mar)[1]="Country"
colnames(pop)[1]="Country"
data=sqldf("SELECT
  a.Country,
  a.Value as Pop,
  b.Value as Female,
  c.Value as Male
FROM
  pop a INNER JOIN mar b
  ON (a.Country=b.Country AND b.Subgroup='Female') INNER JOIN mar c
  ON (a.Country=c.Country AND c.Subgroup='Male')
WHERE a.Subgroup = 'Total'")
opts=theme(
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=15),
  axis.title = element_text(size=18, colour="gray10"),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 40, colour="gray10"))
ggplot(data, aes(x=Female, y=Male, size=log(Pop), label=Country), guide=FALSE)+
  geom_point(colour="white", fill="chartreuse3", shape=21, alpha=.55)+
  scale_size_continuous(range=c(2,36))+
  scale_x_continuous(limits=c(16,36), breaks=seq(16, 36, by = 2), expand = c(0, 0))+
  scale_y_continuous(limits=c(16,36), breaks=seq(16, 36, by = 2), expand = c(0, 0))+
  geom_abline(intercept = 0, slope = 1, colour = "gray10", linetype=2)+
  labs(title="The World We Live In #4: Marriage Ages",
       x="Females mean age at marriage",
       y="Males mean age at marriage")+
  geom_text(data=subset(data, abs(Female-Male)>7), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Female>=32|Female<=18), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(aes(24, 17), colour="gray25", hjust=0, label="Source: United Nations (size of bubble depending on population)", size=5)+opts

Visual Complexity

Oh, can it be, the voices calling me, they get lost and out of time (Little Black Submarines, The Black Keys)

Last October I did this experiment about complex domain coloring. Since I like giving my posts a touch of randomness, I have done this experiment. I plot four random functions on the form p1(x)*p2(x)/p3(x) where pi(x) are polynomials up-to-4th-grade with random coefficients following a chi-square distribution with degrees of freedom between 2 and 5. I measure the function over the complex plane and arrange the four resulting plots into a 2×2 grid. This is an example of the output:
Surrealism Every time you run the code you will obtain a completely different output. I have run it hundreds of times because results are always surprising. Do you want to try? Do not hesitate to send me your creations. What if you change the form of the functions or the distribution of coefficients? You can find my email here.

setwd("YOUR WORKING DIRECTORY HERE")
require(polynom)
require(ggplot2)
library(gridExtra)
ncol=2
for (i in 1:(10*ncol)) {eval(parse(text=paste("p",formatC(i, width=3, flag="0"),"=as.function(polynomial(rchisq(n=sample(2:5,1), df=sample(2:5,1))))",sep="")))}
z=as.vector(outer(seq(-5, 5, by =.02),1i*seq(-5, 5, by =.02),'+'))
opt=theme(legend.position="none",
          panel.background = element_blank(),
          panel.margin = unit(0,"null"),
          panel.grid = element_blank(),
          axis.ticks= element_blank(),
          axis.title= element_blank(),
          axis.text = element_blank(),
          strip.text =element_blank(),
          axis.ticks.length = unit(0,"null"),
          axis.ticks.margin = unit(0,"null"),
          plot.margin = rep(unit(0,"null"),4))
for (i in 1:(ncol^2))
{
  pols=sample(1:(10*ncol), 3, replace=FALSE)
  p1=paste("p", formatC(pols[1], width=3, flag="0"), "(x)*", sep="")
  p2=paste("p", formatC(pols[2], width=3, flag="0"), "(x)/", sep="")
  p3=paste("p", formatC(pols[3], width=3, flag="0"), "(x)",  sep="")
  eval(parse(text=paste("p = function (x) ", p1, p2, p3, sep="")))
  df=data.frame(x=Re(z),
                y=Im(z),
                h=(Arg(p(z))<0)*1+Arg(p(z))/(2*pi),
                s=(1+sin(2*pi*log(1+Mod(p(z)))))/2,
                v=(1+cos(2*pi*log(1+Mod(p(z)))))/2)
  g=ggplot(data=df[is.finite(apply(df,1,sum)),], aes(x=x, y=y)) + geom_tile(fill=hsv(df$h,df$s,df$v))+ opt
  assign(paste("hsv_g", formatC(i, width=3, flag="0"), sep=""), g)
}
jpeg(filename = "Surrealism.jpg", width = 800, height = 800, quality = 100)
grid.arrange(hsv_g001, hsv_g002, hsv_g003, hsv_g004, ncol=ncol)
dev.off()

Silhouettes

Romeo, Juliet, balcony in silhouette, makin o’s with her cigarette, it’s juliet (Flapper Girl, The Lumineers)

Two weeks ago I published this post for which designed two different visualizations. At the end, I decided to place words on the map of the United States. The discarded visualization was this other one, where I place the words over the silhouette of each state:

States In Two Words v1

I do not want to set aside this chart because I really like it and also because I think it is a nice example of the possibilities one have working with R.

Here you have the code. It substitutes the fragment of the code headed by “Visualization” of the original post:

library(ggplot2)
library(maps)
library(gridExtra)
library(extrafont)
opt=theme(legend.position="none",
             panel.background = element_blank(),
             panel.grid = element_blank(),
             axis.ticks=element_blank(),
             axis.title=element_blank(),
             axis.text =element_blank(),
             plot.title = element_text(size = 28))
vplayout=function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
grid.newpage()
jpeg(filename = "States In Two Words.jpeg", width = 1200, height = 600, quality = 100)
pushViewport(viewport(layout = grid.layout(6, 8)))
for (i in 1:nrow(table))
{
  wd=subset(words, State==as.character(table$"State name"[i]))
  p=ggplot() + geom_polygon( data=subset(map_data("state"), region==tolower(table$"State name"[i])), aes(x=long, y=lat, group = group), colour="white", fill="gold", alpha=0.6, linetype=0 )+opt
  print(p, vp = vplayout(floor((i-1)/8)+1, i%%8+(i%%8==0)*8))
  txt=paste(as.character(table$"State name"[i]),"\n is", wd$word1,"\n and", wd$word2, sep=" ")
  grid.text(txt, gp=gpar(font=1, fontsize=16, col="midnightblue", fontfamily="Humor Sans"), vp = viewport(layout.pos.row = floor((i-1)/8)+1, layout.pos.col = i%%8+(i%%8==0)*8))
}
dev.off()

How Big Is The Vatican City?

Dici che il fiume trova la via al mare e come il fiume giungerai a me (Miss Sarajevo, U2)

One way to calculate approximately the area of some place is to circumscribe it into a polygon of which you know its area. After that, generate coordinates inside the polygon and count how many of them fall into the place. The percentage of coordinates inside the place by the area of the polygon is an approximation of the desired area.

I applied this technique to calculate the area of the Vatican City. I generated a squared grid of coordinates around the Capella Sistina (located inside the Vatican City). To calculate the area I easily obtain the convex hull polygon of the coordinates using chull function of grDevices package. Then, I calculate the area of the polygon using areaPolygon function of geosphere package.

To obtain how many coordinates of the grid fall inside the Vatican City, I use revgeocode function of ggmap package (I love this function). For me, one coordinate is inside the Vatican City if its related address contains the words “Vatican City”.

What happens generating a grid of 20×20 coordinates? I obtain that the area of the Vatican City is about 0.32Km2 but according to Wikipedia, the area is 0.44Km2: this method underestimates the area around a 27%. But why? Look at this:

Vatican2

This plot shows which addresses of the grid fall inside the Vatican City (ones) and which of them do not fall inside (zeros). As you can see, there is a big zone in the South, and a smaller one in the North of the city where reverse geocode do not return “Vatican City” addresses.

Maybe Pope Francis should phone Larry Page and Sergey Brin to claim this 27% of his wonderful country.

I was willing to do this experiment since I wrote this post. This is the code:

require(geosphere)
require(ggmap)
require(plotGoogleMaps)
require(grDevices)
setwd("YOUR-WORKING-DIRECTORY-HERE")
#Coordinates of Capella Sistina
capella=geocode("capella sistina, Vatican City, Roma")
#20x20 grid of coordinates around the Capella
g=expand.grid(lon = seq(capella$lon-0.010, capella$lon+0.010, length.out=20),
lat = seq(capella$lat-0.005, capella$lat+0.005, length.out=20))
#Hull Polygon containing coordinates
p=g[c(chull(g),chull(g)[1]),]
#Address of each coordinate of grid
a=apply(g, 1, revgeocode)
#Estimated area of the vatican city
length(grep("Vatican City", a))/length(a)*areaPolygon(p)/1000/1000
s=cbind(g, a)
s$InOut=apply(s, 1, function(x) grepl('Vatican City', x[3]))+0
coordinates(s)=~lon+lat
proj4string(s)=CRS('+proj=longlat +datum=WGS84')
ic=iconlabels(s$InOut, height=12)
plotGoogleMaps(s, iconMarker=ic, mapTypeId="ROADMAP", legend=FALSE)

The United States In Two Words

Sweet home Alabama, Where the skies are so blue; Sweet home Alabama, Lord, I’m coming home to you (Sweet home Alabama, Lynyrd Skynyrd)

This is the second post I write to show the abilities of twitteR package and also the second post I write for KDnuggets. In this case my goal is to have an insight of what people tweet about american states. To do this, I look for tweets containing the exact phrase “[STATE NAME] is” for every states. Once I have the set of tweets for each state I do some simple text mining: cleaning, standardizing, removing empty words and crossing with these sentiment lexicons. Then I choose the two most common words to describe each state. You can read the original post here. This is the visualization I produced to show the result of the algorithm:

States In Two Words v2

Since the right side of the map is a little bit messy, in the original post you can see a table with the couple of words describing each state. This is just an experiment to show how to use and combine some interesting tools of R. If you don’t like what Twitter says about your state, don’t take it too seriously.

This is the code I wrote for this experiment:

# Do this if you have not registered your R app in Twitter
library(twitteR)
library(RCurl)
setwd("YOUR-WORKING-DIRECTORY-HERE")
if (!file.exists('cacert.perm'))
{
  download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile='cacert.perm')
}
requestURL="https://api.twitter.com/oauth/request_token"
accessURL="https://api.twitter.com/oauth/access_token"
authURL="https://api.twitter.com/oauth/authorize"
consumerKey = "YOUR-CONSUMER_KEY-HERE"
consumerSecret = "YOUR-CONSUMER-SECRET-HERE"
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=requestURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl"))
save(Cred, file="twitter authentification.Rdata")
# Start here if you have already your twitter authentification.Rdata file
library(twitteR)
library(RCurl)
library(XML)
load("twitter authentification.Rdata")
registerTwitterOAuth(Cred)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
#Read state names from wikipedia
webpage=getURL("http://simple.wikipedia.org/wiki/List_of_U.S._states")
table=readHTMLTable(webpage, which=1)
table=table[!(table$"State name" %in% c("Alaska", "Hawaii")), ]
#Extract tweets for each state
results=data.frame()
for (i in 1:nrow(table))
{
  tweets=searchTwitter(searchString=paste("'\"", table$"State name"[i], " is\"'",sep=""), n=200, lang="en")
  tweets.df=twListToDF(tweets)
  results=rbind(cbind(table$"State name"[i], tweets.df), results)
}
results=results[,c(1,2)]
colnames(results)=c("State", "Text")
library(tm)
#Lexicons
pos = scan('positive-words.txt',  what='character', comment.char=';')
neg = scan('negative-words.txt',  what='character', comment.char=';')
posneg=c(pos,neg)
results$Text=tolower(results$Text)
results$Text=gsub("[[:punct:]]", " ", results$Text)
# Extract most important words for each state
words=data.frame(Abbreviation=character(0), State=character(0), word1=character(0), word2=character(0), word3=character(0), word4=character(0))
for (i in 1:nrow(table))
{
  doc=subset(results, State==as.character(table$"State name"[i]))
  doc.vec=VectorSource(doc[,2])
  doc.corpus=Corpus(doc.vec)
  stopwords=c(stopwords("english"), tolower(unlist(strsplit(as.character(table$"State name"), " "))), "like")
  doc.corpus=tm_map(doc.corpus, removeWords, stopwords)
  TDM=TermDocumentMatrix(doc.corpus)
  TDM=TDM[Reduce(intersect, list(rownames(TDM),posneg)),]
  v=sort(rowSums(as.matrix(TDM)), decreasing=TRUE)
  words=rbind(words, data.frame(Abbreviation=as.character(table$"Abbreviation"[i]), State=as.character(table$"State name"[i]),
                                   word1=attr(head(v, 4),"names")[1],
                                   word2=attr(head(v, 4),"names")[2],
                                   word3=attr(head(v, 4),"names")[3],
                                   word4=attr(head(v, 4),"names")[4]))
}
# Visualization
require("sqldf")
statecoords=as.data.frame(cbind(x=state.center$x, y=state.center$y, abb=state.abb))
#To make names of right side readable
texts=sqldf("SELECT a.abb,
            CASE WHEN a.abb IN ('DE', 'NJ', 'RI', 'NH') THEN a.x+1.7
            WHEN a.abb IN ('CT', 'MA') THEN a.x-0.5  ELSE a.x END as x,
            CASE WHEN a.abb IN ('CT', 'VA', 'NY') THEN a.y-0.4 ELSE a.y END as y,
            b.word1, b.word2 FROM statecoords a INNER JOIN words b ON a.abb=b.Abbreviation")
texts$col=rgb(sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),sample(0:150, nrow(texts)),max=255)
library(maps)
jpeg(filename = "States In Two Words v2.jpeg", width = 1200, height = 600, quality = 100)
map("state", interior = FALSE, col="gray40", fill=FALSE)
map("state", boundary = FALSE, col="gray", add = TRUE)
text(x=as.numeric(as.character(texts$x)), y=as.numeric(as.character(texts$y)), apply(texts[,4:5] , 1 , paste , collapse = "\n" ), cex=1, family="Humor Sans", col=texts$col)
dev.off()

Mixing Waves

Fill a cocktail shaker with ice; add vodka, triple sec, cranberry, and lime, and shake well; strain into a chilled cocktail glass and garnish with orange twist (Cosmopolitan Cocktail Recipe)

This is a tribute to Blaise Pascal and Joseph Fourier, two of the greatest mathematicians in history. As Pascal did in his famous triangle, I generate a set of random curves (sines or cosines with random amplitudes between 1 and 50) and I arrange them over the lateral edges of the triangle. Each inner curve in the triangle is the sum of the two directly curves above it.  This is the result for a 6 rows triangle:

Adding Waves

Two comments:

  1. Inner curves are noisy. The greater is the distance from the edge, the higher the entropy. When I was a child, I used to play a game called the broken telephone; I can see some kind of connection between this graphic and the game.
  2. I have read that using eval+parse in sympton of being a bad programmer. Does anyone have an idea to do this in some other way without filling the screen of code?

This is the code:

library(ggplot2)
library(gridExtra)
nrows=6
for (i in 1:nrows){
  eval(parse(text=paste("f",i,1,"=function(x) ", sample(c("sin(","cos("),1), runif(min=1, max=50,1) ,"*x)",sep="")))
  eval(parse(text=paste("f",i,i,"=function(x) ", sample(c("sin(","cos("),1), runif(min=1, max=50,1) ,"*x)",sep="")))}
for (i in 3:nrows) {
  for (j in 2:(i-1)) eval(parse(text=paste("f",i, j, "=function(x) f",(i-1),(j-1), "(x) + f",(i-1),j,"(x)",sep="")))}
vplayout=function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
opts=theme(legend.position="none",
           panel.background = element_rect(fill="gray95"),
           plot.background = element_rect(fill="gray95", colour="gray95"),
           panel.grid = element_blank(),
           axis.ticks=element_blank(),
           axis.title=element_blank(),
           axis.text =element_blank())
setwd("YOUR WORKING DIRECTORY HERE")
grid.newpage()
jpeg(file="Adding Waves.jpeg", width=1800,height=1000, bg = "gray95", quality = 100)
pushViewport(viewport(layout = grid.layout(nrows, 2*nrows-1)))
for (i in 1:nrows) {
  for (j in 1:i) {
    print(ggplot(data.frame(x = c(0, 20)), aes(x)) + stat_function(fun = eval(parse(text=paste("f",i,j,sep=""))), colour = "black", alpha=.75)+opts, vp = vplayout(i, nrows+(2*j-(i+1))))
  }
}
dev.off()

Visualizing Home Ownership With Small Multiples And R

If everybody had an ocean, across the U.S.A., then everybody’d be surfin’ like California (Beach Boys, Surfin’ U.S.A.)

home_ownership

I was invited to write a post for Domino Data Lab, a company based in California which provides a cloud-based machine learning platform which enables companies to use the power of the cloud to build analytical projects. I also discovered recently this book which support the premises of companies like Domino Data Lab which are leading the change in the way of doing data science. How I wish to forget in the future expressions like execution time, update versions and memory limit!

Since I like a lot Small multiples, I decided to plot the evolution of homeownership across the United States (the more I use GridExtra package the more I like it). You can read the post here (code included).

By the way, if you want to go to Gigaom Structure Data 2015 for free, Domino Data Lab is giving away 2 tickets here.

How e Can Help You To Find The Love Of Your Life

Match.com will bring more love to the planet than anything since Jesus Christ (Gary Kremen, founder of Match.com)

Sarah is a brilliant 39 years old mathematician living in Massachusetts. She lives alone and has dedicated her whole life to study. She has realized lately that theorems and books no longer satisfy her. Sarah has realized that needs to find love.

To find the love of her life, Sarah joined Match.com to try to have a date with a man every week for a year (52 dates in total). She has her own method to rate each man according his sympathy, his physical appearance, his punctuality,  his conversation and his hobbies. This method allows her to compare candidates with each other. Sarah wants to choose the top-scored man but she is afraid to waste time. If she waits until having all the dates, it could be too late to call back the best candidate, especially if he was one of the first dates. So she wants to be agile and decide inmediately. Her plan is as follows: she will start having some dates only to assess the candidates and after this period, she will try to win over the first man better than any of the first candidates, according her scoring.

But, how many men should discard to maximize the probability of choosing the top-scored one? Discarding just one, probability of having a date with a better man in the next date is very high. But probably he will not be the man she is looking for. On the other hand, discarding many men makes very probable discarding also the top-scored one.

Sarah did a simulation in R of the 52 dates to approximate the probability of choosing the best man depending on the number of discards. She obtained that the probability of choosing the top-scored man is maximal discarding the 19 first men, as can be seen in the following graph:

Prince

Why 19? Because 19 is approximately 52/e. This is one of the rare places where can found the number e. You can see an explanation of the phenomenon here.

Note: This is just a story to illustrate the secretary problem without repeating the original argument of the problem. My apologies if I cause offense to someone. This is a blog about mathematics and R and this is the way as must be understood. 

require(extrafont)
require(ggplot2)
n=52
sims=1000
results=data.frame(discards=numeric(0), triumphs=numeric(0))
for(i in 0:n)
{
  triumphs=0
  for (j in 1:sims) {
    opt=sample(seq(1:n), n, replace=FALSE)
    if (max(opt[1:i])==n)  triumphs=triumphs+0
    else triumphs=triumphs+(opt[i+1:n][min(which(opt[i+1:n] > max(opt[1:i])))]==n)}
  results=rbind(results, data.frame(discards=i, triumphs=triumphs/sims))
}
opts=theme(
  panel.background = element_rect(fill="darkolivegreen1"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="white", linetype = 1),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="black", size=20),
  axis.text.x = element_text(colour="black", size=20),
  text = element_text(size=25, family="xkcd"),
  legend.key = element_blank(),
  legend.background = element_blank(),
  plot.title = element_text(size = 40))
ggplot(results, aes(discards, triumphs))+
  geom_vline(xintercept = n/exp(1), size = 1, linetype=2, colour = "black", alpha=0.8)+
  geom_line(color="green4", size=1.5)+
  geom_point(color="gray92", size=8, pch=16)+
  geom_point(color="green4", size=6, pch=16)+
  ggtitle("How e can help you to find the love of your life")+
  xlab("Discards") +
  ylab("Prob. of finding the love of your life")+
  scale_x_continuous(breaks=seq(0, n, by = 2))+opts

Simple Data Science Of Global Warming In KDnuggets

Would love to get a post from you for KDnuggets (Gregory Piatetsky, KDnuggets President)

logoSome days ago, Gregory Piatetsky invited me to write a post for KDnuggets. I couldn’t say no. He suggested to me some topics and I decided to experiment around climate change to demonstrate how easy is to see some evidences of this alarming menace. You can read the post here.

This is the code I wrote to do this experiment:

require(sqldf)
require(googleVis)
require(ggplot2)
require(ggthemes)
setwd("YOUR WORKING DIRECTORY HERE") #This line doen's work until you type a valid path
#Data Avaliable in http://eca.knmi.nl/utils/downloadfile.php?file=download/ECA_blend_tg.zip
files=list.files(pattern = "TG_STAID")
results=data.frame(staid=character(0), trnd=numeric(0), nobs=numeric(0))
#Loop to calculate linear trends
for (i in 1:length(files))
{
  table=read.table(files[i], header=TRUE, sep = ',', skip=20)
  table=table[table$Q_TG==0,]
  table$date=as.Date(as.character(table$DATE), "%Y%m%d")
  results=rbind(data.frame(staid=files[i], trnd=lm.fit (x = matrix(table$date), y = table$TG)$coefficients, nobs=nrow(table)), results)
}
write.csv(results, file="results.csv", row.names = FALSE)#Save your work
results=read.csv(file="results.csv")#Read your work. You can start here if you stop your work further this line
#Remove outliers
results=results[!results$trnd %in% boxplot.stats(results$trnd, coef = 4)$out,]
#Histogram
hist(x=results$trnd, breaks = 50,
     col = "orange",
     border = "pink",
     freq=TRUE,
     xlim=c(-0.03, 0.03),
     ylim=c(0, 300),
     xlab="Historical trend of daily mean temperatures",
     ylab="Number of stations",
     main="Evolution of temperatures in Europe and the Mediterranean",
     sub="Source: European Climate Assessment & Dataset project")
results$staid2=as.numeric(gsub("[^0-9]","",results$staid)) #To join results with geographical coordinates
#Read table of geographical coordinates
staids=read.table("http://www.ecad.eu/download/ECA_all_stations.txt", header=TRUE, sep = ',', skip=17)
#Right tail of the distribution
hotpoints=sqldf("SELECT a.staid, a.staid2, a.trnd, a.nobs, b.staname, b.lat, b.lon
      FROM results a INNER JOIN staids b ON (a.staid2=b.staid) WHERE TRND >= 0.02")
#Convert degrees:minutes:seconds to decimal coordinates
hotpoints=within(hotpoints, {
  dms=do.call(rbind, strsplit(as.character(LAT), ":"))
  lat=sign(as.numeric(dms[,1]))*(abs(as.numeric(dms[,1]))+(as.numeric(dms[,2]) + as.numeric(dms[,3])/60)/60);rm(dms)
})
hotpoints=within(hotpoints, {
  dms=do.call(rbind, strsplit(as.character(LON), ":"))
  lon=sign(as.numeric(dms[,1]))*(abs(as.numeric(dms[,1]))+(as.numeric(dms[,2]) + as.numeric(dms[,3])/60)/60);rm(dms)
})
#To make readable tha name of the station in the map
hotpoints$staname=gsub("^\\s+|\\s+$", "", hotpoints$STANAME)
#To prepare coordinates to gvis function
hotpoints$LatLong=with(hotpoints, paste(lat, lon, sep=":"))
#The amazing gvisMap function of googleVis package
hotpointsMap=gvisMap(hotpoints, "LatLong" , "STANAME",
        options=list(showTip=TRUE,
                     showLine=TRUE,
                     enableScrollWheel=TRUE,
                     mapType='terrain',
                     useMapTypeControl=TRUE))
plot(hotpointsMap)
#The plot one of this hot stations
InAmenas=read.table("TG_STAID000312.txt", header=TRUE, sep = ',', skip=20)
InAmenas=InAmenas[InAmenas$Q_TG==0,]
InAmenas$date=as.Date(as.character(InAmenas$DATE), "%Y%m%d")
ggplot(InAmenas, aes(date, TG)) +
  geom_line(color="red", alpha=0.8) +
  xlab("") +
  ylab("Mean temperature in 0.1C")+
  ggtitle("Mean temperature in IN-AMENAS (ALGERIA) 1958- 1998")+
  geom_smooth(method = "lm", se=FALSE, color="red", lwd=2)+
  theme_economist(base_size = 20, base_family = "sans", horizontal = TRUE,
                  dkpanel = FALSE, stata = FALSE)