Batman’s Choice

A hero can be anyone, even a man doing something as simple and reassuring as putting a coat on a young boy’s shoulders to let him know the world hadn’t ended (Batman in The Dark Knight Rises)

Joker has captured Batman and keeps him into a dark and cold dungeon of Gotham City. Showing his sadistic character, Joker proposes the following game to Batman:

This is a six shooter revolver with two bullets in the cylinder. Bullets are inside two consecutive chambers. I will spin the cylinder and will fire the gun aiming to my head. If I survive you will have to do the same but you decide if you want to spin the cylinder before firing or not. If you still keep you head over your shoulders after firing, you will be free.

Joker fires and nothing happens. He survives and passes the revolver to Batman. What should Batman do? Spinning or not? What would you do?

From my point of view, answer is quite anti-intutive because the best option is not spinning the cylinder again. Spinning case is clear: probability of losing the head is 2/6=33% but what about not spinning? Doing the next shoot directly eliminates two possibilities: the previous shot of Joker and the second bullet according to direction of cylinder rotation (remember two bullets are consecutive and Joker is unfortunately  still alive). It means there is only 1 chance to dead between 4, so probability of losing the head in this scenario is 1/4=25% which is significantly lower than the first one.

Here you can find the resulting graph of simulating the game up to 500 times:

batmans_choice2

Will it be the end of Batman? Not sure.

This is the code of this experiment:

library(ggplot2)
library(extrafont)
niter=500
results=data.frame()
for (i in 1:niter)
{
bullet1=sample(1:6,1)
Joker=sample((1:6)[-c(bullet1, bullet1%%6+1)],1)
#Option 1: Shooting
Batman1=Joker%%6+1
dead1=(Batman1 %in% c(bullet1, bullet1%%6+1))*1
#Option 2: Rolling and Shooting
Batman2=sample(1:6,1)
dead2=(Batman2 %in% c(bullet1, bullet1%%6+1))*1
results=rbind(results, c(i, dead1, dead2))
}
colnames(results)=c("iter", "dead1", "dead2")
results$csum1=cumsum(results$dead1)/as.numeric(rownames(results))
results$csum2=cumsum(results$dead2)/as.numeric(rownames(results))
theme_xkcd=theme(
panel.background = element_rect(fill="darkolivegreen1"),
panel.border = element_rect(colour="black", fill=NA),
axis.line = element_line(size = 0.5, colour = "black"),
axis.ticks = element_line(colour="black"),
panel.grid = element_line(colour="white", linetype = 2),
axis.text.y = element_text(colour="black"),
axis.text.x = element_text(colour="black"),
text = element_text(size=18, family="Humor Sans"),
plot.title = element_text(size = 50)
)
p=ggplot(data=results, aes(x=iter, y=csum1))+
geom_abline(intercept = 1/4, slope = 0, size = 0.4, linetype=2, colour = "black", alpha=0.8)+
geom_abline(intercept = 1/3, slope = 0, size = 0.4, linetype=2, colour = "black", alpha=0.8)+
geom_line(aes(y=csum2), colour="green4", size=1.5, fill=NA)+
geom_line(colour="green4", size=1.5, fill=NA)+
coord_cartesian(ylim=c(.0, 1), xlim=c(1, niter))+
scale_y_continuous(breaks = c(0,round(1/4, digits = 2),round(1/3, digits = 2),1))+
geom_text(data=results[niter*.75, ], family="Humor Sans", colour="green4", y=0.38, label="Rotating Cylinder and Shooting ...", size=4, adjust=1)+
geom_text(data=results[niter*.75, ], family="Humor Sans", colour="green4", y=0.20, label="Shooting without Rotating Cylinder ...", size=4, adjust=1)+
labs(x="Number Of Trials", y="Prob. of Losing The Head", title="Batman's Choice")+
theme_xkcd
ggsave("batmans_choice.jpg", plot=p, width=8, height=5)

The Zebra Of Riemann

Mathematics is the art of giving the same name to different things (Henri Poincare)

Many surveys among experts point that demonstration of the Riemann Hypothesis is the most important pending mathematical issue in this world. This hypothesis is related to Riemann zeta function, which is supossed to be zero only for those complex whose real part is equal to 1/2 (this is the conjecture of Riemann itself). Confirming the conjecture would imply deep consequences in prime numbers teory and also in our knowledge of their properties. Next plot represents the argument of zeta function over complex with real part between -75 and 5 (x axis) and imaginary part between -40 and 40 (y axis). An explanation of this kind of graph can be found here. Does not it remind you of something?
zebra
Solving the hypothesis of Riemann is one of the seven Clay Mathematics Institute Millennium Prize Problems. Maybe zebras keep the secret to solve it in their skins.

This is the code to plot the graph:

require(pracma)
z=outer(seq(-75, 5, by =.1),1i*seq(-40, 40, by =.1),'+')
z=apply(z, c(1,2), function(x) Arg(zeta(x)))
opt=theme(legend.position="none",
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
axis.text =element_blank())
z=data.frame(expand.grid(x=seq(ncol(z)), y=seq(nrow(z))), z=as.vector(z))
require(ggplot2)
ggplot(z, aes(x=x, y=y, color=z)) + geom_tile() + scale_colour_gradientn(colours=c("black","white", "gray80", "gray40")) + opt

Four Simple Turtle Graphs To Play With Kids

Technology is just a tool: in terms of getting the kids working together and motivating them, the teacher is the most important (Bill Gates)

Some days ago I read this article in R-bloggers and I discovered the TurtleGraphics package. I knew about turtle graphics long time ago and I was thinking of writing a post about them, doing some draw on my own using this technique. In fact, some of my previous drawings could have been done using this package.

Turtle graphics are a good tool to teach kids some very important mathematical and computational concepts such as length, coordinates, location, angles, iterations … Small changes in some of this parameters (especially in angles) can produce very different drawings. Those who want to know more about turtle graphics can read the Wikipedia article.

Following you can find four simple graphics preceded by the code to generate them. Please, send me your creations if you want:

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
for (i in 1:150) {
  turtle_forward(dist=1+0.5*i)
  turtle_right(angle=89.5)}
turtle_hide()

turtle1

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
turtle_right(angle=234)
for (i in 1:100) {
  turtle_forward(dist=0.9*i)
  turtle_right(angle=144.3)}
turtle_hide()

turtle2

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
turtle_setpos(48,36)
d=50
for (i in 1:300) {
  turtle_forward(dist=d)
  if (i%%4==0) {
    turtle_right(angle=75)
    d=d*.95}
else turtle_right(angle=90)}
turtle_hide()

turtle3

library(TurtleGraphics)
turtle_init()
turtle_col("gray25")
turtle_setpos(50,35)
turtle_right(angle=30)
d=25
turtle_setpos(50-d/2,50-d/2*tan(pi/6))
for (i in 1:100) {
  turtle_forward(dist=d)
  d=d+.5
  turtle_right(angle=120+1)}
turtle_hide()

turtle4

How To Approximate Pi With A Short Pencil And A Big Paper

Experiment, be curious: though interfering friends may frown, get furious at each attempt to hold you down (Tony Bennett, Experiment)

Instructions:

  1. Take a pencil and measure it
  2. Take a piece of paper and draw parallel lines on it (you can use the pencil, of course); separation between lines should double the length of the pencil
  3. Toss the pencil over the paper 100 times (or more)
  4. Make note of how many times do the pencil cross some of the lines
  5. Calculate ratio between tosses and crosses: this is your approximation of Pi

Some time ago, I published a post about one of the most amazing places where PI was discovered. This is another example of the ubiquity of this mathematical constant. This experiment is based on Buffon’s needle problem, another amazing experiment of 18th century. Next plot represents ratio of tosses to crosses depending on the length of pencil. When the pencil is half the length of the separation between lines, the previous ratio is approximately PI:
Buffon
If you get very bored some afternoon you can replicate this experiment with your children. Use a short pencil. If not, you will need an extremely big piece of paper. Meanwhile here you have the code:

trials=100000
results=sapply(seq(.1, 2, by = .05), function(x)
{
  r=x #Length of pencil in relation to separation between lines
  Needles=t(sapply(1:trials, function(y) c(100*runif(1),2*pi*runif(1))))
  Needles=cbind(Needles,Needles[,1]+r*cos(Needles[,2]))
  Needles=data.frame(x1=Needles[,1], x2=Needles[,3], Cross=abs(trunc(Needles[,1])-trunc(Needles[,3])))
  c(r, trials/(trials-nrow(Needles[Needles$Cross==0,])))
})
results=data.frame(t(results))
colnames(results)=c(c("ratio", "inv.perc.crosses"))
require(ggplot2)
require(extrafont)
require(grid)
opts=theme(
  panel.background = element_rect(fill="darkolivegreen1"),
  panel.border = element_rect(colour="black", fill=NA),
  axis.line = element_line(size = 0.5, colour = "black"),
  axis.ticks = element_line(colour="black"),
  panel.grid.major = element_line(colour="white", linetype = 1),
  panel.grid.minor = element_blank(),
  axis.text.y = element_text(colour="black", size=20),
  axis.text.x = element_text(colour="black", size=20),
  text = element_text(size=25, family="xkcd"),
  legend.key = element_blank(),
  legend.position = c(.2,.75),
  legend.background = element_blank(),
  plot.title = element_text(size = 50))
c=ggplot(results, aes(ratio, inv.perc.crosses))
c +geom_abline(intercept = pi, slope = 0, size = 0.4, linetype=2, colour = "black", alpha=0.8)+
  geom_line(color="green4", size=1.5)+
  geom_point(color="gray92", size=8, pch=16)+
  geom_point(color="green4", size=6, pch=16)+
  geom_text(aes(0.55, 5), hjust=0, family="xkcd", label="PI is around here!", size=10)+
  ggtitle("Hot to approximate PI with \na short pencil and a big paper")+
  xlab("Length of pencil divided by separation between lines") +
  ylab("Number of tosses divided by number of crosses")+
  geom_segment(aes(x = .625, y = 4.6, xend = .52, yend = 3.45), size=1.5, arrow = arrow(length = unit(0.5, "cm")))+
  scale_y_continuous(breaks=c(pi, 4, 8, 12, 16), labels=c("3.141593","4","8","12","16"))+
  scale_x_continuous(breaks=seq(.1, 2, by = .1))+opts

The Goldbach’s Comet

Every even integer greater than 2 can be expressed as the sum of two primes (Christian Goldbach, 1742)

The point cloud known as Goldbach’s Comet represents the amount of different ways (y axis) an even number (x axis) can be writen as sum of two prime numbers. In this plot, x axis is between 2 and 50.000 (stars are just ornaments):

Goldbach

Mathematicians are still waiting for a proof of this conjecture. This is the code for drawing this plot:

require(schoolmath)
require(utils)
library(plyr)
n=50000
data(primlist)
primes=as.data.frame(t(combn(primlist[primlist>2 & primlist<n-2], 2)))
primes$V3=primes$V1+primes$V2
primes2=count(primes, "V3")
primes2=primes2[primes2$V3<=n,]
stars=cbind(runif(50, min=-n*0.05, max=n), runif(200, min=-n*0.001, max=max(primes2$freq)))
plot.new()
par(mai = rep(0, 4), bg = "gray12")
plot(NA,type="n", xlim=c(-n*0.05,n), xaxs="i", ylim=c(-n*0.001,max(primes2$freq)))
points(stars, col = "blue4", cex=.7, pch=16)
points(stars, col = "blue", cex=.3, pch=16)
points(stars, col = "gray75", cex=.1, pch=16)
apply(primes2, 1, function(x) points(x=x[1],y=x[2], col = if (runif(1)&gt;x[1]/n) {"white"} else {sample(colours(),1)}, cex=.1, pch=16))

The Ikeda’s Galaxy

Chaos is the score upon which reality is written (Henry Miller)

Nonlinear dynamical systems are an enormous seam of amazing images. The Ikeda Map is an example of strange attractor which represents the movement of particles under the rules of certain differential equations.

I have drawn the trajectories followed by of 200 particles under the 2D-Ikeda Map with the same tecnique I used in this previous post, resulting this nice galaxy:

ikeda0.918

Wold you like to create your own galaxies? Here you have the code:

u=0.918 #Parameter between 0 and 1
n=200 #Number of particles
m=40 #Number of iterations
ikeda=data.frame(it=1,x1=runif(n, min = -40, max = 40), y1=runif(n, min = -40, max = 40))
ikeda$x2=1+u*(ikeda$x1*cos(0.4-6/(1+ikeda$x1^2+ikeda$y1^2))-ikeda$y1*sin(0.4-6/(1+ikeda$x1^2+ikeda$y1^2)))
ikeda$y2=  u*(ikeda$x1*sin(0.4-6/(1+ikeda$x1^2+ikeda$y1^2))+ikeda$y1*cos(0.4-6/(1+ikeda$x1^2+ikeda$y1^2)))
for (k in 1:m)
{
df=as.data.frame(cbind(rep(k+1,n),
ikeda[ikeda$it==k,]$x2,
ikeda[ikeda$it==k,]$y2,
1+u*(ikeda[ikeda$it==k,]$x2*cos(0.4-6/(1+ikeda[ikeda$it==k,]$x2^2+ikeda[ikeda$it==k,]$y2^2))-ikeda[ikeda$it==k,]$y2*sin(0.4-6/(1+ikeda[ikeda$it==k,]$x2^2+ikeda[ikeda$it==k,]$y2^2))),
u*(ikeda[ikeda$it==k,]$x2*sin(0.4-6/(1+ikeda[ikeda$it==k,]$x2^2+ikeda[ikeda$it==k,]$y2^2))+ikeda[ikeda$it==k,]$y2*cos(0.4-6/(1+ikeda[ikeda$it==k,]$x2^2+ikeda[ikeda$it==k,]$y2^2)))))
names(df)=names(ikeda)
ikeda=rbind(df, ikeda)
}
plot.new()
par(mai = rep(0, 4), bg = "gray12")
plot(c(0,0),type="n", xlim=c(-35, 35), ylim=c(-35,35))
apply(ikeda, 1, function(x) lines(x=c(x[2],x[4]), y=c(x[3],x[5]), col = paste("gray", as.character(min(round(jitter(x[1]*80/(m-1)+(20*m-100)/(m-1), amount=5)), 100)), sep = ""), lwd=0.1))

The Gilbreath’s Conjecture

317 is a prime, not because we think so, or because our minds are shaped in one way rather than another, but because it is so, because mathematical reality is built that way (G.H. Hardy)

In 1958, the mathematician and magician Norman L. Gilbreath presented a disconcerting hypothesis conceived in the back of a napkin. Gilbreath wrote first prime numbers in a row. In the next rows, he wrote the difference  in absolute value of consecutive values of previous row. Beginning with first 20 primes in the first row, he obtained something like this:Gilbreath01

The conjecture is easy: except for the first one, all elements of the first column are 1. So far, no one has demonstrated this hypothesis. In fact, according to mathematician Richard Guy, it seems unlikely that we will see a demonstration of Gilbreath’s conjecture in the near future, though probably this conjecture is true.

In the previous chart, I coloured zeros in white, ones in violet and rest of numbers in gold. The conjecture says that except for the first element, the first column is entirely violet. Following you can see the coloured chart for first 20, 40, 60 and 80 prime numbers:

Gilbreath02

It is nice how zeros create triangular patterns similar to patterns created by cellular automata. This is the chart for first 200 primes:Gilbreath03

How much time will it take to demonstrate this simple conjecture? Who knows. Meanwhile, you can draw triangles with this code:

library(ggplot2)
create.gilbreath=function(n)
{
require(reshape)
require(schoolmath)
data(primlist)
gilbreath=t(matrix(primlist[2:(n+1)], n, n))
for (i in 2:n) {gilbreath[i,]=c(eval(parse(text=paste(c(paste("abs(diff(",collapse=""), "gilbreath[i-1,]", paste("))",collapse="")),collapse=""))),rep(NA,1))}
na.omit(melt(t(gilbreath)))
}
opt=theme(legend.position="none",
panel.background = element_blank(),
panel.grid = element_blank(),
axis.ticks=element_blank(),
axis.title=element_blank(),
axis.text =element_blank())
gilbreath=create.gilbreath(20)
gilbreath$value1=cut(gilbreath$value, breaks=c(-Inf,1:2,Inf), right = FALSE)
ggplot(gilbreath, aes(x=X1, y=X2)) +
geom_tile(aes(fill = value1), colour="grey") +
scale_fill_manual(values = c("white", "darkviolet", "gold"))+
geom_text(label=gilbreath$value, size=8)+
scale_y_reverse()+opt

The Three Little Pigs

Jesse, you asked me if I was in the meth business or the money business. Neither. I’m in the empire business (Walter White in Breaking Bad)

The game of pig has simple rules but complex strategies. It was described for the first time in 1945  by a magician called John Scarne. Playing the pig game is easy: each turn, a player repeatedly rolls a die until either a 1 is rolled or the player decides to hold:

  • If the player rolls a 1, they score nothing and it becomes the next player’s turn
  • If the player rolls any other number, it is added to their turn total and the player’s turn continues
  • If a player chooses to hold, their turn total is added to their score, and it becomes the next player’s turn

The first player who reach at least 100 points is the winner. For example: you obtain a 3 and then decide to roll again, obtaining a 1. Your score is zero in this turn. Next player gets the sequence 3-4-6 and decides to hold, obtaining a score of 13 points in this turn.

Despite its simplicity, the pig game has a very complex and against-intuition optimal strategy. It was calculated in 2004 by Todd W. Neller and Clifton Presser from Gettysburg College of Pennsilvania with the help of computers.

To illustrate the game, I simulated three players (pigs) playing the pig game with three different strategies:

  • The Coward pig, who only rolls the die a small number of times in every turn
  • The Risky pig, who rolls the die a more times than the coward one
  • The Ambitious pig, who tries to obtain in every turn more points than two others

I simulated several scenarios.

  • Some favorable scenarios for Coward pig:

pigs2Coward pigs4Coward

In first scenario, the Coward pig rolls the die between 1 and 5 times each round and wins if the Risky pig asumes an excessive level of risk (rolling each time between 10 and 15 times). Trying to obtain more than the Coward is a bad option for the Ambitious pig. Simulating this scenario 100 times gives victory to Coward a 51% of times (25% to Risky and 24% to Ambitious).

Second scenario puts closer Coward and Risky pigs (first one rolls the die between 4 and 7 times  each round and second one between 6 and 9 times). Coward wins 54% of times (34% Risky and only 12% Ambitious).

Being coward seems to be a good strategy when you play against a reckless or when you are just a bit more conservative than a Risky one.

  • Some favorable scenarios for Risky pig:

pigs5RiskyPigs1Risky

Rolling the die between 4 and 6 times each round seems to be a good option, even more when you are playing against a extremely conservative player who rolls no more than 3 times each time. Simulating 100 times these previous scenarios gives victory to Risky pig a 58% of times in first the case in which Coward rolls allways 1 and Risky 6 times each round (0% for Coward and only 42% form Ambitious) and 66% of times in the second one (only 5% to Coward and 29% to Ambitious).

Being Risky is a good strategy when you play against a chicken.

  • Some favorable scenarios for Ambitious pig:

pigs3Ambitious pigs6ambitious

The Ambitious pig wins when two others turn into extremely coward and risky pigs as can be seen in the first scenario in which Ambitious wins 65% of the times (31% for Coward and 4% for Risky). Ambitious pig also wins when two others get closer and hit the die a small number of times (2 rolls the Coward and 4 rolls the Risky). In this scenario the Ambitious wins 58% of times (5% for Coward and 37% for Risky). By the way, these two scenarios sound very unreal.

Being ambitious seems to be dangerous but works well when you play against a crazy and a chicken or against very conservative players.

From my point of view, this is a good example to experiment with simulations, game strategies and xkcd style graphics.

The code:

require(ggplot2)
require(extrafont)
#Number of hits for Coward
CowardLower=2
CowardUpper=2
#Number of hits for Risky
RiskyLower=4
RiskyUpper=4
game=data.frame(ROUND=0, part.p1=0, part.p2=0, part.p3=0, Coward=0, Risky=0, Ambitious=0)
while(max(game$Coward)<100 & max(game$Risky)<100 & max(game$Ambitious)<100)
{
#Coward Little Pig
p1=sample(1:6,sample(CowardLower:CowardUpper,1), replace=TRUE)
s1=min(min(p1-1),1)*sum(p1)
#Risky Little Pig
p2=sample(1:6,sample(RiskyLower:RiskyUpper,1), replace=TRUE)
s2=min(min(p2-1),1)*sum(p2)
#Ambitious Little Pig
s3=0
repeat {
p3=sample(1:6,1)
s3=(p3+s3)*min(min(p3-1),1)
if (p3==1|s3>max(s1,s2)) break
}
game[nrow(game)+1,]=c(max(game$ROUND)+1,s1,s2,s3,max(game$Coward)+s1,max(game$Risky)+s2,max(game$Ambitious)+s3)
}
opts=theme(
panel.background = element_rect(fill="darkolivegreen1"),
panel.border = element_rect(colour="black", fill=NA),
axis.line = element_line(size = 0.5, colour = "black"),
axis.ticks = element_line(colour="black"),
panel.grid.major = element_line(colour="white", linetype = 1),
panel.grid.minor = element_blank(),
axis.text.y = element_text(colour="black"),
axis.text.x = element_text(colour="black"),
text = element_text(size=25, family="xkcd"),
legend.key = element_blank(),
legend.position = c(.2,.75),
legend.background = element_blank(),
plot.title = element_text(size = 50)
)
ggplot(game, mapping=aes(x=game$ROUND, y=game$Coward)) +
geom_line(color="red", size=1.5) +
geom_line(aes(x=game$ROUND, y=game$Risky), color="blue", size=1.5) +
geom_line(aes(x=game$ROUND, y=game$Ambitious), color="green4", size=1.5) +
geom_point(aes(x=game$ROUND, y=game$Coward, colour="c1"), size=5.5) +
geom_point(aes(x=game$ROUND, y=game$Risky, colour="c2"), size=5.5) +
geom_point(aes(x=game$ROUND, y=game$Ambitious, colour="c3"), size=5.5) +
ggtitle("THE THREE LITTLE PIGS") +
xlab("ROUND") + ylab("SCORING") +
geom_text(aes(max(game$ROUND), max(max(game$Coward, game$Risky, game$Ambitious)), hjust=1.2, family="xkcd", label="WINNER!"), size=10)+
geom_hline(yintercept=100, linetype=2, size=1)+
scale_y_continuous(breaks=seq(0, max(max(game$Coward, game$Risky, game$Ambitious))+10, 10))+
scale_x_continuous(breaks=seq(0, max(game$ROUND), 1))+
scale_colour_manual("",
labels = c(paste("Coward: ", CowardLower, "-", CowardUpper, " hits", sep = ""), paste("Risky: ", RiskyLower, "-", RiskyUpper, " hits", sep = ""), "Ambitious"),
breaks = c("c1", "c2", "c3"),
values = c("red", "blue", "green4"))+ opts

floweR

It is the time you have wasted for your rose that makes your rose so important (Antoine de Saint-Exupéry, The Little Prince)

Yesterday I found a package called circular and I could not suppress to do this:

rose

Make your own floweRs:

require(circular)
for (i in 1:8)
{
  rose.diag(circular(runif(5000, 0, 2*pi)), 
            bins = (25-round(abs(jitter(0, amount=2*i)))), 
            axes=FALSE,
            border=rgb(255,50+204*((i-1)/8),50+204*((i-1)/8), alpha=255, max=255),
            ticks = FALSE,
            col=rgb(250,0+204*((i-1)/8),0+204*((i-1)/8), alpha=255, max=255), 
            control.circle=circle.control(lty=0), 
            shrink=0.22+(i-1)*(2-0.22)/8)
  par(new=TRUE)  
}
for (i in 1:150)
  {
  q = runif(1)*pi*2
  r = sqrt(runif(1))
  x = (0.18*r)* cos(q)
  y = (0.18*r)* sin(q)
  points(x, y, 
         col = rgb(255 ,sample(80:120, 1), 0, alpha= 95, max=255),
         bg = rgb(255 ,sample(100:200, 1), 0, alpha= 95, max=255), pch=21, cex=2)
  }

How Do Cities Feel?

If you are lost and feel alone, circumnavigate the globe (For You, Coldplay)

You can not consider yourself a R-blogger until you do an analysis of Twitter using twitteR package. Everybody knows it. So here I go.

Inspired by the fabulous work of Jonathan Harris I decided to compare human emotions of people living (or twittering in this case) in different cities. My plan was analysing tweets generated in different locations of USA and UK with one thing in common: all of them must contain the string “I FEEL”. These are the main steps I followed:

  • Locate cities I want to analyze using world cities database of maps package
  • Download tweets around these locations using searchTwitter function of twitteR package.
  • Cross tweets with positive and negative lists of words and calculate a simple scoring for each tweet as number of positive words – number of negative words
  • Calculate how many tweets have non-zero scoring; since these tweets put into words some emotion I call them sentimental tweets
  • Represent cities in a bubble chart where x-axis is percentage of sentimental tweets, y-axis is average scoring and size of bubble is population

This is the result of my experiment:HowDoCitiesFeel3

These are my conclusions (please, do not take it seriously):

  • USA cities seem to have better vibrations and are more sentimental than UK ones
  • Capital city is the happiest one for both countries
  • San Francisco (USA) is the most sentimental city of the analysis; on the other hand, Liverpool (UK) is the coldest one
  • The more sentimental, the better vibrations

From my point of view, this analysis has some important limitations:

  • It strongly depends on particular events (i.e. local football team wins the championship)
  • I have no idea of what kind of people is behind tweets
  • According to my experience, searchTwitter only works well for a small number of searches (no more than 300); for larger number of tweets to return, it use to give malformed JSON response error from server

Anyway, I hope it will serve as starting point of some other analysis in the future. At least, I learned interesting things about R doing it.

Here you have the code:

library(twitteR)
library(RCurl)
library(maps)
library(plyr)
library(stringr)
library(bitops)
library(scales)
#Register
if (!file.exists('cacert.perm'))
{
  download.file(url = 'http://curl.haxx.se/ca/cacert.pem', destfile='cacert.perm')
}
requestURL="https://api.twitter.com/oauth/request_token"
accessURL="https://api.twitter.com/oauth/access_token"
authURL="https://api.twitter.com/oauth/authorize"
consumerKey = "YOUR CONSUMER KEY HERE"
consumerSecret = "YOUR CONSUMER SECRET HERE"
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=requestURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo=system.file("CurlSSL", "cacert.pem", package="RCurl"))
#Save credentials
save(Cred, file="twitter authentification.Rdata")
load("twitter authentification.Rdata")
registerTwitterOAuth(Cred)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
#Cities to analyze
cities=data.frame(
  CITY=c('Edinburgh', 'London', 'Glasgow', 'Birmingham', 'Liverpool', 'Manchester',
         'New York', 'Washington', 'Las Vegas', 'San Francisco', 'Chicago','Los Angeles'),
  COUNTRY=c("UK", "UK", "UK", "UK", "UK", "UK", "USA", "USA", "USA", "USA", "USA", "USA"))
data(world.cities)
cities2=world.cities[which(!is.na(match(
str_trim(paste(world.cities$name, world.cities$country.etc, sep=",")),
str_trim(paste(cities$CITY, cities$COUNTRY, sep=","))
))),]
cities2$SEARCH=paste(cities2$lat, cities2$long, "10mi", sep = ",")
cities2$CITY=cities2$name
#Download tweets
tweets=data.frame()
for (i in 1:nrow(cities2))
{
  tw=searchTwitter("I FEEL", n=400, geocode=cities2[i,]$SEARCH)
  tweets=rbind(merge(cities[i,], twListToDF(tw),all=TRUE), tweets)
}
#Save tweets
write.csv(tweets, file="tweets.csv", row.names=FALSE)
#Import csv file
city.tweets=read.csv("tweets.csv")
#Download lexicon from http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
hu.liu.pos = scan('lexicon/positive-words.txt',  what='character', comment.char=';')
hu.liu.neg = scan('lexicon/negative-words.txt',  what='character', comment.char=';')
#Function to clean and score tweets
score.sentiment=function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  scores=laply(sentences, function(sentence, pos.word, neg.words) {
    sentence=gsub('[[:punct:]]','',sentence)
    sentence=gsub('[[:cntrl:]]','',sentence)
    sentence=gsub('\\d+','',sentence)
    sentence=tolower(sentence)
    word.list=str_split(sentence, '\\s+')
    words=unlist(word.list)
    pos.matches=match(words, pos.words)
    neg.matches=match(words, neg.words)
    pos.matches=!is.na(pos.matches)
    neg.matches=!is.na(neg.matches)
    score=sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress)
  scores.df=data.frame(score=scores, text=sentences)
  return(scores.df)
}
cities.scores=score.sentiment(city.tweets[1:nrow(city.tweets),], hu.liu.pos, hu.liu.neg, .progress='text')
cities.scores$pos2=apply(cities.scores, 1, function(x) regexpr(",",x[2])[1]-1)
cities.scores$CITY=apply(cities.scores, 1, function(x) substr(x[2], 1, x[3]))
cities.scores=merge(x=cities.scores, y=cities, by='CITY')
df1=aggregate(cities.scores["score"], by=cities.scores[c("CITY")], FUN=length)
names(df1)=c("CITY", "TWEETS")
cities.scores2=cities.scores[abs(cities.scores$score)>0,]
df2=aggregate(cities.scores2["score"], by=cities.scores2[c("CITY")], FUN=length)
names(df2)=c("CITY", "TWEETS.SENT")
df3=aggregate(cities.scores2["score"], by=cities.scores2[c("CITY")], FUN=mean)
names(df3)=c("CITY", "TWEETS.SENT.SCORING")
#Data frame with results
df.result=join_all(list(df1,df2,df3,cities2), by = 'CITY', type='full')
#Plot results
radius <- sqrt(df.result$pop/pi)
symbols(100*df.result$TWEETS.SENT/df.result$TWEETS, df.result$TWEETS.SENT.SCORING, circles=radius,
        inches=0.85, fg="white", bg="gold", xlab="Sentimental Tweets", ylab="Scoring Of Sentimental Tweets (Average)",
        main="How Do Cities Feel?")
text(100*df.result$TWEETS.SENT/df.result$TWEETS, df.result$TWEETS.SENT.SCORING, paste(df.result$CITY, df.result$country.etc, sep="-"), cex=1, col="gray50")