Monthly Archives: September 2014

PageRank For SQL Lovers

If you’re changing the world, you’re working on important things. You’re excited to get up in the morning (Larry Page, CEO and Co-Founder of Google)

This is my particular tribute to one of the most important, influential and life-changer R packages I have discovered in the last times: sqldf package.

Because of my job, transforming data through SQL queries is very natural for me. This, together with the power of R made this package indispensable for me since I knew of its existence.

Imagine you have a directed graph like this:PR1

Given a vertex V, these are the steps to calculate its PageRank, lets call it PR(V):

  • Initialize PR(V) to some value (I do it to 1 in my script)
  • Iterate this formula until converges: PR(V)=(1-d)+d*(PR(T1)/C(T1)+ ... +PR(Tn)/C(Tn)) where Ti are the vertex that point to V and C(Ti) is the number of edges going out of Ti

After doing this, result is:


Following you can find my code to do it with sqldf, which is quite simple from my point of view. I am pretty sure there must be some package which calculates PageRank but the main goal of this post is to show how easy is to calculate it with two simple queries, no more. The example is taken from here, where you can find a good explanation of how PageRank works:

net=data.frame(origin=c("A","A","B","C","D"), end=c("C","B","C","A","C"))
par(family="serif", cex=1, ps=25, bg="white", col.lab="black", col.axis="black")
plot(graph.edgelist(as.matrix(net)), edge.arrow.size=1, vertex.color="gray90", edge.color="black")
netou=sqldf("SELECT origin, COUNT(*) outs FROM net GROUP BY 1")
netpr=sqldf("SELECT origin vertex, 1.0 pagerank FROM net UNION SELECT end, 1.0 FROM net")
for (i in 1:50)
netx1=sqldf("SELECT vertex, pagerank/outs factor FROM netou a INNER JOIN netpr b ON (a.origin = b.vertex)")
netpr=sqldf("SELECT a.vertex, 0.15+SUM(0.85*COALESCE(factor,0)) AS pagerank
FROM netpr a LEFT OUTER JOIN net b ON (a.vertex = b.end) LEFT OUTER JOIN netx1 c
ON (b.origin = c.vertex) GROUP BY 1")
V(g)$name=sqldf("SELECT a.vertex||' (PR='||ROUND(b.pagerank,2)||')' as name from names a inner join netpr b ON (a.vertex=b.vertex)")$name
plot(g, edge.arrow.size=1, vertex.color="gray90", edge.color="black")

Space Invaders

I burned through all of my extra lives in a matter of minutes, and my two least-favorite words appeared on the screen: GAME OVER (Ernest Cline, Ready Player One)

Inspired by the book I read this summer and by this previous post, I decided to draw these aliens:

Invader1 Invader3

Do not miss to check this indispensable document to choose your favorite colors:

0,0,0,0,0,0,0,0,0,0,0,0), nrow=12, byrow = TRUE)
0,0,0,0,0,0,0,0,0,0,0,0), nrow=12, byrow = TRUE)
0,0,0,0,0,0,0,0,0,0,0,0), nrow=12, byrow = TRUE)
0,0,0,0,0,0,0,0,0,0,0,0), nrow=12, byrow = TRUE)
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
axis.text = element_blank())
p1=ggplot(melt(mars1), aes(x=X2, y=X1))+geom_tile(aes(fill=jitter(value, amount=.1)), colour="gray65", lwd=.025)+
scale_fill_gradientn(colours = c("chartreuse", "navy"))+scale_y_reverse()+opt
p2=ggplot(melt(mars2), aes(x=X2, y=X1))+geom_tile(aes(fill=jitter(value, amount=.1)), colour="gray65", lwd=.025)+
scale_fill_gradientn(colours = c("olivedrab1", "magenta4"))+scale_y_reverse()+opt
p3=ggplot(melt(mars3), aes(x=X2, y=X1))+geom_tile(aes(fill=jitter(value, amount=.1)), colour="gray65", lwd=.025)+
scale_fill_gradientn(colours = c("violetred4", "yellow"))+scale_y_reverse()+opt
p4=ggplot(melt(mars4), aes(x=X2, y=X1))+geom_tile(aes(fill=jitter(value, amount=.1)), colour="gray65", lwd=.025)+
scale_fill_gradientn(colours = c("tomato4", "lawngreen"))+scale_y_reverse()+opt

Princess Jasmine’s Trick

I’m history! No, I’m mythology! Nah, I don’t care what I am; I’m free hee! (Genie, when he is released from the magical oil lamp by Aladdin)

Princess JasmineA long time ago, in a kingdom far away, lived a beautiful princess named Jasmine. There also lived a very rich and evil wizard named Jafar, who was in love with the princess. In order to married with Jasmine, Jafar bought  her father’s will with treasures, but the princess was harder to convince. One day Jafar told the princess: Request me whatever you want and if  I am able to bring it to you, you will become my wife. The princess, tired of the insistence of Jafar, answered: I only want a gold chain, but I want you to give it to me as follows: the first day I should have just one link of the chain. The second day I should have two links. The third day, three … and so on. When you give me all the links of the chain I will marry you. Jafar, intrigued, asked: But how many links should have the chain?  And Jasmine replied: I want you to give me the longest chain that allows you to pay me breaking only 30 links. Jafar began to laugh out loud as he walked away and said to the princess: Tomorrow I’ll bring you such chain!. But as he went to his palace, his happiness turned into anger: he realized that there was not enough gold in the world to build the chain that asked Jasmine.

This is my own version of one of my favorite anti-common-sense mathematical curiosities. To explain it, let me start with an example. Imagine a simple chain with 7 links. If you open the 3rd link, the you split the chain into 3 pieces: a single link (the one you opened), a piece of 2 links and another one of 4 links. You could pay to Jasmine during seven days combining these 3 pieces:

  • Day 1: Give her the single link
  • Day 2: Give her the 2-links piece and take the single link, leaving her with 2 links
  • Day 3: Give her the single link again, leaving her with 3 links
  • Day 4: Give her the 4-links piece and take all pieces she has, leaving her with 4 links
  • Day 5: Give her the single link again, leaving her with 5 links
  • Day 6: Give her the 2-links piece and take 2-links piece, leaving her with 6 links
  • Day 7: Give her the single link piece, leaving her with all links

Is easy to see that having a chain with 63 links, you could pay Jasmine breaking only 3 links (positions 5th, 14th and 31st). It easy to prove that the length of the biggest chain you can manage breaking only n links is (2n+1-1)*(n+1)+n

Next plot represents the minimum number of breaks to pay Jasmine daily for a given chain’s length. I call it the Jasmine’s Staircase:

Generated With R #rstats

Some curiosities around chains:

  • Jasmine asked Jafar a chain of 66.571.993.087 links
  • Supposing one link weights 4 grams, the chain of Jasmine would weight around 266 tons. It is supposed to be around 171 tons of gold in the world
  • If you spend 1 second to climb the first step of the staircase, you will spend 302 years to climb the step number 100

Jafar was right. Jasmine was clever:

CalculateLength = function(n) {n+sum(sapply(0:n, function(x) 2^x*(n+1)))}
results=data.frame(breaks=1:max.breaks, length=sapply(1:max.breaks, CalculateLength))
results=sqldf("SELECT links.links, min(results.breaks) as minbreaks FROM links, results WHERE links.links <= results.length GROUP BY 1")
panel.background = element_rect(fill="mistyrose"),
panel.border = element_rect(colour="black", fill=NA),
axis.line = element_line(size = 0.5, colour = "black"),
axis.ticks = element_line(colour="black"),
panel.grid = element_line(colour="white", linetype = 2),
axis.text.y = element_text(colour="black"),
axis.text.x = element_text(colour="black"),
text = element_text(size=20, family="Humor Sans"),
plot.title = element_text(size = 40)
ggplot(results, aes(links,minbreaks))+
geom_area(fill="violet", alpha=.4)+
geom_step(color="violetred", lwd=1.5)+
labs(x="Chain's Length", y="Minimum Number of Breaks", title="Princess Jasmine's Staircase")+
scale_x_continuous(expand = c(0, 0), breaks = sapply(1:max.breaks, CalculateLength))+

Looking For Life


Machines take me by surprise with great frequency (Alan Turing)

Imagine a 8×8 grid in which cells can be alive (black colored) or empty (light gray colored): Generated with R #RstatsAs with the One-dimensional Cellular Automata, the next state of a cell is a function of the states of the cell’s nearest neighbors (the only neighbors that we consider are the eight cells that form the immediate perimeter of a cell). To evolve this community of cells over time, rules are quite simple:

  • If a live cell has less than two neighbors, then it dies (loneliness)
  • If a live cell has more than three neighbors, then it dies (overcrowding)
  • If an dead cell has three live neighbors, then it comes to life (reproduction)
  • Otherwise, a cell stays as is (stasis)

These are the rules of the famous Conway’s Game Of Life, a cellular automaton invented in the late 1960s by John Conway trying to refine the description of a cellular automaton to the simplest one that could support universal computation. In 1970 Martin Gardner described Conway’s work in his Scientific American column. Gardner’s article inspired many people around the world to experiment with Conway’s, which eventually led to the final pieces of how the Game Of Life could support universal computation in what was surely a global collaborative effort.

In this experiment I will try to find interesting objects that can be found in the Game Of Life: static (remain the same over the time), periodic (change but repeating their initial configuration in some iterations) or moving objects (travel through the grid before disappearing). Why are interesting? Because these are the kind of objects required to perform computations.

The plan is simple: I will generate some random grids and will evolve them over time a significant number of times. After doing this, I will check those grids having still some alive cells inside. Will I find there what I am looking for?

I generated 81 grids in which live cells are randomly located using binomial random variables with probabilities equal to i/80 with i from 0 (all cells empty) to 80 (all cells alive). This is a quick way to try a set of populations with a wide range of live cells. I measure % of alive cells after each iteration. I will analyze those grids which still have live cells after all iterations. This is what happens after 150 iterations:

Generated with R #Rstats

I find some interesting objects. Since I keep them in my script, I can list them with ls(pattern= "Conway", all.names = TRUE). Two of them are specially interesting because are not static. Are those which produce non-constant lines in the previous plot.

First one is a periodic object which reproduces itself after every 3 iterations:

Generated with R #Rstats

Second one is a bit more complex. After 8 iterations appears rotated:

Generated with R #Rstats

What kind of calculations can be done with these objects? I don’t now yet but let’s give time to time. Do you want to look for Life?

SumNeighbors = function (m) #Summarizes number of alive neighbors for each cell
    cbind(rbind(m[-1,],0)[,-1],0)+cbind(0, rbind(0,m[-nrow(m),])[,-nrow(m)])+
NextNeighborhood = function (m) #Calculates next state for each cell according to Conway's rules
  (1-(SumNeighbors(m)<2 | SumNeighbors(m)>3)*1)-(SumNeighbors(m)==2 & m==0)*1
splits=80 #Number of different populations to simulate. Each population s initialized randomly
          #according a binomial with probability i/splits with i from 0 to splits
rm(list = ls(pattern = "conway")) #Remove previous solutions (don't worry!)
for (i in 0:splits)
  z=matrix(rbinom(size=1, n=8^2, prob=i/splits), nrow=8); z0=z
  results=rbind(results, c(i/splits, 0, sum(z)/(nrow(z)*ncol(z))))
  for(j in 1:iter)
  {z=NextNeighborhood(z); results=rbind(results, c(i/splits, j, sum(z)/(nrow(z)*ncol(z))))}
  #Save interesting solutions
  if (sum(z)/(nrow(z)*ncol(z))>0) assign(paste("Conway",format(i/splits, nsmall = 4), sep=""), z)
colnames(results)=c("start", "iter", "sparsity")
#Plot reults of simulation
opt1=theme(panel.background = element_rect(fill="gray85"),
          panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          panel.grid.major.y = element_line(color="white", size=.5, linetype=2),
          plot.title = element_text(size = 45, color="black"),
          axis.title = element_text(size = 24, color="black"),
          axis.text = element_text(size=20, color="black"),
          axis.ticks = element_blank(),
          axis.line = element_line(colour = "black", size=1))
ggplot(results, aes(iter, sparsity, group = start))+
  geom_path(size=.8, alpha = 0.5, colour="black")+
  scale_x_continuous("Iteration", expand = c(0, 0), limits=c(0, iter), breaks = seq(0,iter,10))+
  scale_y_continuous("Alive Cells", labels = percent, expand = c(0, 0), limits=c(0, 1), breaks = seq(0, 1,.1))+
  labs(title = "Conway's Game Of Life Simulation")+opt1
#List of all interesting solutions
ls(pattern= "Conway", all.names = TRUE)
#Example to plot the evolution of an interesting solution (in this case "Conway0.5500")
            panel.background = element_blank(),
            panel.grid = element_blank(),
            axis.text =element_blank())
p1=ggplot(melt(Conway0.5500), aes(x=X1, y=X2))+geom_tile(aes(fill=value), colour="white", lwd=2)+
  scale_fill_gradientn(colours = c("gray85", "black"))+opt
p2=ggplot(melt(NextNeighborhood(Conway0.5500)), aes(x=X1, y=X2))+geom_tile(aes(fill=value), colour="white", lwd=2)+
  scale_fill_gradientn(colours = c("gray85", "black"))+opt
p3=ggplot(melt(NextNeighborhood(NextNeighborhood(Conway0.5500))), aes(x=X1, y=X2))+geom_tile(aes(fill=value), colour="white", lwd=2)+
  scale_fill_gradientn(colours = c("gray85", "black"))+opt
p4=ggplot(melt(NextNeighborhood(NextNeighborhood(NextNeighborhood(Conway0.5500)))), aes(x=X1, y=X2))+geom_tile(aes(fill=value), colour="white", lwd=2)+
  scale_fill_gradientn(colours = c("gray85", "black"))+opt
#Arrange four plots in a 2x2 grid
grid.arrange(p1, p2, p3, p4, ncol=2)