Monthly Archives: March 2015

NASDAQ 100 Couples

Heaven, I’m in heaven, and my heart beats so that I can hardly speak, and I seem to find the happiness I seek, when we’re out together dancing cheek to cheek (Cheek To Cheek, Irving Berlin)

There are about 6.500 available packages in CRAN repository. If I were a superhuman, able to learn one package a day, I would spend almost 18 years of my life studying R. And how many packages would be uploaded to CRAN during this period? Who knows: R is infinite.

Today, my experiment deals with quantmod package, which allows you to play to be quant for a while. I download the daily quotes of NASDAQ 100 companies and measure distances between each pair of companies. Distance is based on the cross-correlation between two series so high-correlated series (not exceeding a maximum lag) are closer than low-correlated ones. You can read a good description of this distance here. Since NASDAQ 100 contains 107 companies, I calculate distances for 5.671 different couples. Next plot represent distances between each pair of companies. The darker is the color, the closer are the related companies:

Nasdaq100

Yes, I know is not a graph for someone with visual problems. Let me show you an example of what is behind one of these little tiles. Distance between Mattel Inc. and 21st Century Fox is very small (its related tile is dark coloured). Why? Because of this:

MattelFox
These two companies have been dancing cheek to cheek for more than seven years. It is also curious how some companies are far from any of their NASDAQ 100 colleagues. Some examples of these unpaired companies are Express Scripts Holding Company (ESRX), Expeditors International of Washington Inc. (EXPD) and Fastenal Company (FAST). I do not why but there must be an explanation, do not you think so?

Something tells me I will do some other experiment using quantmod package:

library("quantmod")
library("TSdist")
library("ade4")
library("ggplot2")
library("Hmisc")
library("zoo")
library("scales")
library("reshape2")
setwd("YOUR WORKING DIRECTORY HERE")
temp=tempfile()
download.file("http://www.nasdaq.com/quotes/nasdaq-100-stocks.aspx?render=download",temp)
data=read.csv(temp, header=TRUE)
for (i in 1:nrow(data)) getSymbols(as.character(data[i,1]))
results=t(apply(combn(sort(as.character(data[,1]), decreasing = TRUE), 2), 2,
      function(x)
      {
        ts1=drop(Cl(eval(parse(text=x[1]))))
        ts2=drop(Cl(eval(parse(text=x[2]))))
        c(symbol1=x[1], symbol2=x[2], tsDistances(ts1, ts2, distance="crosscorrelation"))
      }))
results=as.data.frame(results)
colnames(results)=c("Sym1", "Sym2", "TSdist")
results$TSdist=as.numeric(as.character(results$TSdist))
results=rbind(results, data.frame(Sym1=as.character(data[,1]), Sym2=as.character(data[,1]), TSdist=0))
results$TSdist2=as.numeric(cut2(results$TSdist, g=4))
opts=theme(axis.text.x = element_text(angle = 90, vjust=.5, hjust = 0),
           panel.background = element_blank(),
           axis.text = element_text(colour="gray25", size=8),
           legend.position = "none",
           panel.grid = element_blank())
ggplot(results,aes(x=Sym2,y=Sym1))+
  geom_tile(aes(fill = TSdist2), colour="gray80")+
  scale_size_continuous(range=c(1,10))+
  scale_x_discrete("", limits=sort(unique(as.character(results$Sym1))))+
  scale_y_discrete("", limits=sort(unique(as.character(results$Sym2)), decreasing = TRUE))+
  scale_fill_gradient(low = "steelblue", high = "white")+
  opts
MAT.close=Cl(MAT)
FOX.close=Cl(FOX)
cls=merge(MAT.close, FOX.close, all = FALSE)
df=data.frame(date = time(cls), coredata(cls))
names(df)[-1]=c("mat", "fox")
df1=melt(df, id.vars = "date", measure.vars = c("mat", "fox"))
opts2=theme(
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=15),
  axis.title = element_text(size=18, colour="gray10"),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 40, colour="gray10"))
ggplot(df1, aes(x = date, y = value, color = variable))+
  geom_line(size = I(1.2))+
  scale_color_discrete(guide = "none")+
  scale_x_date(labels = date_format("%Y-%m-%d"))+
  labs(title="Nasdaq 100 Couples: Mattel And Fox", x="Date", y="Closing Price")+
  annotate("text", x = as.Date("2011-01-01", "%Y-%m-%d"), y = c(10, 30), label = c("21st Century Fox", "Mattel Inc."), size=7, colour="gray25")+
  opts2
Advertisements

The World We Live In #4: Marriage Ages

It is time for women to stop being politely angry (Leymah Gbowee, Nobel Prize Peace Winner)

Sometimes very simple plots give insight into we live in a world of differences. This plot shows the mean age at marriage for men and women across countries:

Marriage Ages

Being a woman in some countries of this world must be a hard experience:

#Singulate mean age at marriage: http://data.un.org/Data.aspx?d=GenderStat&f=inID%3a20
#Population: http://data.un.org/Data.aspx?d=SOWC&f=inID%3a105
require("sqldf")
require("ggplot2")
setwd("YOUR WORKING DIRECTORY HERE")
mar=read.csv("UNdata_Export_20150309_171525152.csv", nrows=321, header=T, row.names=NULL)
pop=read.csv("UNdata_Export_20150309_172046384.csv", nrows=999, header=T, row.names=NULL)
colnames(mar)[1]="Country"
colnames(pop)[1]="Country"
data=sqldf("SELECT
  a.Country,
  a.Value as Pop,
  b.Value as Female,
  c.Value as Male
FROM
  pop a INNER JOIN mar b
  ON (a.Country=b.Country AND b.Subgroup='Female') INNER JOIN mar c
  ON (a.Country=c.Country AND c.Subgroup='Male')
WHERE a.Subgroup = 'Total'")
opts=theme(
  panel.background = element_rect(fill="gray98"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="gray75", linetype = 2),
  panel.grid.minor = element_blank(),
  axis.text = element_text(colour="gray25", size=15),
  axis.title = element_text(size=18, colour="gray10"),
  legend.key = element_blank(),
  legend.position = "none",
  legend.background = element_blank(),
  plot.title = element_text(size = 40, colour="gray10"))
ggplot(data, aes(x=Female, y=Male, size=log(Pop), label=Country), guide=FALSE)+
  geom_point(colour="white", fill="chartreuse3", shape=21, alpha=.55)+
  scale_size_continuous(range=c(2,36))+
  scale_x_continuous(limits=c(16,36), breaks=seq(16, 36, by = 2), expand = c(0, 0))+
  scale_y_continuous(limits=c(16,36), breaks=seq(16, 36, by = 2), expand = c(0, 0))+
  geom_abline(intercept = 0, slope = 1, colour = "gray10", linetype=2)+
  labs(title="The World We Live In #4: Marriage Ages",
       x="Females mean age at marriage",
       y="Males mean age at marriage")+
  geom_text(data=subset(data, abs(Female-Male)>7), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(data=subset(data, Female>=32|Female<=18), size=5.5, colour="gray25", hjust=0, vjust=0)+
  geom_text(aes(24, 17), colour="gray25", hjust=0, label="Source: United Nations (size of bubble depending on population)", size=5)+opts

Visual Complexity

Oh, can it be, the voices calling me, they get lost and out of time (Little Black Submarines, The Black Keys)

Last October I did this experiment about complex domain coloring. Since I like giving my posts a touch of randomness, I have done this experiment. I plot four random functions on the form p1(x)*p2(x)/p3(x) where pi(x) are polynomials up-to-4th-grade with random coefficients following a chi-square distribution with degrees of freedom between 2 and 5. I measure the function over the complex plane and arrange the four resulting plots into a 2×2 grid. This is an example of the output:
Surrealism Every time you run the code you will obtain a completely different output. I have run it hundreds of times because results are always surprising. Do you want to try? Do not hesitate to send me your creations. What if you change the form of the functions or the distribution of coefficients? You can find my email here.

setwd("YOUR WORKING DIRECTORY HERE")
require(polynom)
require(ggplot2)
library(gridExtra)
ncol=2
for (i in 1:(10*ncol)) {eval(parse(text=paste("p",formatC(i, width=3, flag="0"),"=as.function(polynomial(rchisq(n=sample(2:5,1), df=sample(2:5,1))))",sep="")))}
z=as.vector(outer(seq(-5, 5, by =.02),1i*seq(-5, 5, by =.02),'+'))
opt=theme(legend.position="none",
          panel.background = element_blank(),
          panel.margin = unit(0,"null"),
          panel.grid = element_blank(),
          axis.ticks= element_blank(),
          axis.title= element_blank(),
          axis.text = element_blank(),
          strip.text =element_blank(),
          axis.ticks.length = unit(0,"null"),
          axis.ticks.margin = unit(0,"null"),
          plot.margin = rep(unit(0,"null"),4))
for (i in 1:(ncol^2))
{
  pols=sample(1:(10*ncol), 3, replace=FALSE)
  p1=paste("p", formatC(pols[1], width=3, flag="0"), "(x)*", sep="")
  p2=paste("p", formatC(pols[2], width=3, flag="0"), "(x)/", sep="")
  p3=paste("p", formatC(pols[3], width=3, flag="0"), "(x)",  sep="")
  eval(parse(text=paste("p = function (x) ", p1, p2, p3, sep="")))
  df=data.frame(x=Re(z),
                y=Im(z),
                h=(Arg(p(z))<0)*1+Arg(p(z))/(2*pi),
                s=(1+sin(2*pi*log(1+Mod(p(z)))))/2,
                v=(1+cos(2*pi*log(1+Mod(p(z)))))/2)
  g=ggplot(data=df[is.finite(apply(df,1,sum)),], aes(x=x, y=y)) + geom_tile(fill=hsv(df$h,df$s,df$v))+ opt
  assign(paste("hsv_g", formatC(i, width=3, flag="0"), sep=""), g)
}
jpeg(filename = "Surrealism.jpg", width = 800, height = 800, quality = 100)
grid.arrange(hsv_g001, hsv_g002, hsv_g003, hsv_g004, ncol=ncol)
dev.off()

Silhouettes

Romeo, Juliet, balcony in silhouette, makin o’s with her cigarette, it’s juliet (Flapper Girl, The Lumineers)

Two weeks ago I published this post for which designed two different visualizations. At the end, I decided to place words on the map of the United States. The discarded visualization was this other one, where I place the words over the silhouette of each state:

States In Two Words v1

I do not want to set aside this chart because I really like it and also because I think it is a nice example of the possibilities one have working with R.

Here you have the code. It substitutes the fragment of the code headed by “Visualization” of the original post:

library(ggplot2)
library(maps)
library(gridExtra)
library(extrafont)
opt=theme(legend.position="none",
             panel.background = element_blank(),
             panel.grid = element_blank(),
             axis.ticks=element_blank(),
             axis.title=element_blank(),
             axis.text =element_blank(),
             plot.title = element_text(size = 28))
vplayout=function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
grid.newpage()
jpeg(filename = "States In Two Words.jpeg", width = 1200, height = 600, quality = 100)
pushViewport(viewport(layout = grid.layout(6, 8)))
for (i in 1:nrow(table))
{
  wd=subset(words, State==as.character(table$"State name"[i]))
  p=ggplot() + geom_polygon( data=subset(map_data("state"), region==tolower(table$"State name"[i])), aes(x=long, y=lat, group = group), colour="white", fill="gold", alpha=0.6, linetype=0 )+opt
  print(p, vp = vplayout(floor((i-1)/8)+1, i%%8+(i%%8==0)*8))
  txt=paste(as.character(table$"State name"[i]),"\n is", wd$word1,"\n and", wd$word2, sep=" ")
  grid.text(txt, gp=gpar(font=1, fontsize=16, col="midnightblue", fontfamily="Humor Sans"), vp = viewport(layout.pos.row = floor((i-1)/8)+1, layout.pos.col = i%%8+(i%%8==0)*8))
}
dev.off()