Category Archives: Maps

Visualizing the Gender of US Senators With R and Highmaps

I wake up every morning in a house that was built by slaves (Michelle Obama)

Some days ago I was invited by the people of Highcharts to write a post in their blog. What I have done is a simple but revealing map of women senators of the United States of America. Briefly, this is what I’ve done to generate it:

  • read from the US senate website a XML file with senators info
  • clean and obtain gender of senators from their first names
  • summarize results by state
  • join data with a US geojson dataset to create the highmap

You can find details and R code here.

It is easy creating a highcharts using highcharter, an amazing library as genderizeR, the one I use to obtain gender names. I like them a lot.

Climatic Change At A Glance

Mmm. Lost a planet, Master Obi-Wan has. How embarrassing (Yoda, Attack Of The Clones)

Some time ago I published this post in KDnuggets in which I analyze historical temperatures to show how we are gradually heading toward a warmer planet. Simple data science to obtain a simple (and increasingly accepted) conclusion: the global warming is real. Despite I was criticized I still believe what I said then: you don’t have to be a climatologist to empirically confirm global warming.

This experiment is another example of that. It is still simpler than that since it is only based on visual perception but I think is also quite conclusive. In this case, I represent U.S. temperature outliers from 1964 to 2013; a map per year. Dataset contains station ID, name, min/max temperature, as well as degree coordinates of the recorded weather. Original weather data collected from NOAA and anomalies analysis by Enigma. You can download data here.

Anomalies are divided into four categories: Strong Hot, Weak Hot, Weak Cold and Strong Cold. For each station, I represent difference between number of Cold and Hot anomalies (independently of the strength) so Blue bubbles represent stations where total number of Cold anomalies during the year is greater that total number of Hot ones and Red ones represent the opposite. Size of bubbles is also proportional to this indicator. As an example, following you can see the map of year 1975:

It seems 1975 was hot in the right a cold on the left side. Concretely, in TONOPAH Station (Nevada) were registered 30 anomalies and most of them (26) where due to cold temperatures. This is why bubble is blue. This GIF shows the evolution of all these maps from 1964 to 2013:


Maybe it is just my personal feeling but don’t you see how red bubbles are gradually winning to blue ones? Maybe I am a demagogue.

This code generates a dynamic map by year in html format:

anomalies = fread("")
anomalies %>%
  mutate(year=substr(date_str, 1, 4)) %>%
  group_by(year, longitude, latitude, id, station_name) %>%
    Strong_Hot=sum(str_count(type,"Strong Hot")),
    Weak_Hot=sum(str_count(type,"Weak Hot")),
    Weak_Cold=sum(str_count(type,"Weak Cold")),
    Strong_Cold=sum(str_count(type,"Strong Cold")),
    total=n()) %>%
  mutate(score=sign(-Strong_Hot-Weak_Hot+Weak_Cold+Strong_Cold)) %>%
  mutate(color=ifelse(score==1, "Blue",ifelse(score==0, "White", "Red"))) -> anomalies2
for (i in unique(anomalies2$year))
  anomalies2 %>%
    filter(year==i) %>%
    leaflet() %>%
    fitBounds(-124, 34, -62, 40) %>%
    addProviderTiles("Stamen.TonerLite") %>%
    addCircleMarkers(lng = ~longitude,
                     lat = ~latitude,
                     radius = ~ifelse(total < 20, 2, ifelse(total < 27, 4, 8)),
                     color= ~color,
                     fillOpacity = 0.5,
                     popup = ~paste(sep = "
", paste0("<b>", station_name, "</b>"),
                                    paste0("Strong Hot: ", Strong_Hot),
                                    paste0("Weak Hot: ", Weak_Hot),
                                    paste0("Weak Cold: ", Weak_Cold),
                                    paste0("Strong Cold: ", Strong_Cold))) -> m
    saveWidget(m, file=paste0("m", i, ".html"))

A Segmentation Of The World According To Migration Flows ft. Leaflet

Up in the sky you just feel fine, there is no running out of time and you never cross a line (Up In The Sky, 77 Bombay Street)

In this post I analyze two datasets from Enigma:

  • Migration flows: Every 10 years, since 1960, the World Bank estimates migrations worldwide (267.960 rows)
  • World population: Values and percentages of populations for each nation examined beginning in year 1960, by the World Bank’s Health, Nutrition and Population project (4.168.185 rows)

Since the second dataset is very large, I load it into R using fread function of data.table package, which is extremely fast. To filter datasets, I also use dplyr and pipes of magrittr package (my life changed since I discovered it).

To build a comparable indicator across countries, I divide migration flows (from and to each country) by the mean population in each decade. I do this because migration flows are aggregated for each decade since 1960. For example, during the first decade of 21st century, Argentina reveived 1.537.850 inmigrants, which represents a 3,99% of the mean population of the country in this decade. In the same period, inmigration to Burundi only represented a 0,67% of its mean population.

What happened in the whole world in that decade? There were around 166 million people who moved to other countries. It represents a 2.58% of the mean population of the world. I use this figure to divide countries into four groups:

  • Isolated: countries with both % of inmigrants and % of migrants under 2.58%
  • Emitter: countries with % of inmigrants under 2.58% and % of migrants over 2.58%
  • Receiver: countries with % of inmigrants over 2.58% and % of migrants under 2.58%
  • Transit: countries with both % of inmigrants and % of migrants over 2.58%

To create the map I use leaflet package as I did in my previous post. Shapefile of the world can be downloaded here. This is how the world looks like according to this segmentation:

Migration Flows

Some conclusions:

  • There are just sixteen receiver countries: United Arab Emirates, Argentina, Australia, Bhutan, Botswana, Costa Rica, Djibouti, Spain, Gabon, The Gambia, Libya, Qatar, Rwanda, Saudi Arabia, United States and Venezuela
  • China and India (the two most populous countries in the world) are isolated
  • Transit countries are concentrated in the north hemisphere and most of them are located in cold latitudes
  • There are six emitter countries with more than 30% of emigrants between 2000 and 2009: Guyana, Tonga, Tuvalu, Jamaica, Bosnia and Herzegovina and Albania

This is the code you need to reproduce the map:

populflows = read.csv(file="enigma-org.worldbank.migration-remittances.migrants.migration-flow-c57405e33412118c8757b1052e8a1490.csv", stringsAsFactors=FALSE)
population = fread("")
# Population
population %>% 
  filter(indicator_name=="Population, total") %>% %>% 
  mutate(decade=(year-year%%10)) %>% 
  group_by(country_name, country_code, decade) %>% 
  summarise(avg_pop=mean(value)) -> population2
# Inmigrants by country
populflows %>% filter(! %>% 
  group_by(migration_year, destination_country) %>% 
  summarise(inmigrants = sum(total_migrants))  %>% 
  merge(population2, by.x = c("destination_country", "migration_year"), by.y = c("country_name", "decade"))  %>% 
  mutate(p_inmigrants=inmigrants/avg_pop) -> inmigrants
# Migrants by country
populflows %>% filter(! %>% 
  group_by(migration_year, country_of_origin) %>% 
  summarise(migrants = sum(total_migrants)) %>%  
  merge(population2, by.x = c("country_of_origin", "migration_year"), by.y = c("country_name", "decade"))  %>%
  mutate(p_migrants=migrants/avg_pop) -> migrants
# Join of data sets
migrants %>% 
  merge(inmigrants, by.x = c("country_code", "migration_year"), by.y = c("country_code", "migration_year")) %>%
  filter(migration_year==2000) %>% 
  select(country_of_origin, country_code, avg_pop.x, migrants, p_migrants, inmigrants, p_inmigrants) %>% 
  plyr::rename(., c("country_of_origin"="Country", 
                    "p_inmigrants"="p.of.inmigrants")) -> populflows2000
# Threshold to create groups
populflows2000 %>% 
  summarise(x=sum(Total.migrants), y=sum(Total.inmigrants), z=sum(Population.mean)) %>% 
  mutate(m=y/z) %>% 
  select(m)  %>% 
  as.numeric -> avg
# Segmentation
populflows2000[populflows2000$p.of.migrants>avg & populflows2000$p.of.inmigrants>avg, "Group"]="Transit"
populflows2000[populflows2000$p.of.migrants<avg & populflows2000$p.of.inmigrants<avg, "Group"]="Isolated" 
populflows2000[populflows2000$p.of.migrants>avg & populflows2000$p.of.inmigrants<avg, "Group"]="Emitter" 
#Loading shapefile from 
countries=readOGR("json/countries.geojson", "OGRGeoJSON") 
# Join shapefile and enigma information 
joined=merge(countries, populflows2000, by.x="wb_a3", by.y="Country.code", all=FALSE, sort = FALSE) 
# To define one color by segment 
factpal=colorFactor(brewer.pal(4, "Dark2"), joined$Group) 
leaflet(joined) %>%
  addPolygons(stroke = TRUE, color="white", weight=1, smoothFactor = 0.2, fillOpacity = .8, fillColor = ~factpal(Group)) %>%
  addTiles() %>%
  addLegend(pal = factpal, values=c("Emitter", "Isolated", "Receiver", "Transit"))

A Simple Interactive Map Of US Prisons With Leaflet

The love of one’s country is a splendid thing. But why should love stop at the border? (Pablo Casals, Spanish cellist)

Some time ago, I discovered Enigma, an amazing open platform that unifies billions of records from thousands of government sources to make the world of public data universally accessible and useful. This is the first experiment I have done using data from Enigma. This is what I did:

  1. Create a free account, search and download data. Save the csv file in your working directory. File contains information about all prison facilities in the United States (private and state run) as recorded by the Department of Corrections in each state. Facility types, names, addresses (or lat/long coordinates) ownership names and detailed. In sum, there is information about 1.248 prison facilities.
  2. Since most of the prisons of the file do not contain geographical coordinates, I obtain latitude and longitude using geocode function from ggmap package. This step takes some time. I also remove closed facilities. Finally, I obtain a data set with complete information of 953 prison facilities.
  3. After cleaning and filling out data, generating the map is very easy using leaflet package for R. I create a column named popup_info pasting name and address to be shown in the popup. Instead using default OpenStreetMap basemap I use a CartoDB one.

In my opinion, resulting map is very appealing with a minimal effort. Since I cannot embed the map here, this is a screenshot of it:

jailsThis plot could be a good example of visual correlation, because it depends on this. Here you have the code. To see the map in your browser, press Show in new window option, a little arrow on the upper left side of the RStudio viewer window:

prisons = read.csv(file="enigma-enigma.prisons.all-facilities-bd6a927c4024c16d8ba9e21d52292b0f.csv", stringsAsFactors=FALSE)
prisons %>% 
  mutate(address=paste(facility_address1, city, state, zip, "EEUU", sep=", ")) %>%
  select(address) %>% 
  lapply(function(x){geocode(x, output="latlon")})  %>% %>% 
  cbind(prisons) -> prisons
prisons %>%  
  mutate(popup_info=paste(sep = "<br/>", paste0("<b>", facility_name, "</b>"), facility_address1, city, state, zip)) %>% 
  mutate(lon=ifelse(, address.lon, longitude),
         lat=ifelse(,, latitude)) %>%
  filter(! & !grepl("CLOSED", facility_name)) -> prisons
leaflet(prisons) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(lng = ~lon, 
                   lat = ~lat, 
                   radius = 3, 
                   color = "red",
                   fillOpacity = 0.5,
                   popup = ~popup_info)

How Big Is The Vatican City?

Dici che il fiume trova la via al mare e come il fiume giungerai a me (Miss Sarajevo, U2)

One way to calculate approximately the area of some place is to circumscribe it into a polygon of which you know its area. After that, generate coordinates inside the polygon and count how many of them fall into the place. The percentage of coordinates inside the place by the area of the polygon is an approximation of the desired area.

I applied this technique to calculate the area of the Vatican City. I generated a squared grid of coordinates around the Capella Sistina (located inside the Vatican City). To calculate the area I easily obtain the convex hull polygon of the coordinates using chull function of grDevices package. Then, I calculate the area of the polygon using areaPolygon function of geosphere package.

To obtain how many coordinates of the grid fall inside the Vatican City, I use revgeocode function of ggmap package (I love this function). For me, one coordinate is inside the Vatican City if its related address contains the words “Vatican City”.

What happens generating a grid of 20×20 coordinates? I obtain that the area of the Vatican City is about 0.32Km2 but according to Wikipedia, the area is 0.44Km2: this method underestimates the area around a 27%. But why? Look at this:


This plot shows which addresses of the grid fall inside the Vatican City (ones) and which of them do not fall inside (zeros). As you can see, there is a big zone in the South, and a smaller one in the North of the city where reverse geocode do not return “Vatican City” addresses.

Maybe Pope Francis should phone Larry Page and Sergey Brin to claim this 27% of his wonderful country.

I was willing to do this experiment since I wrote this post. This is the code:

#Coordinates of Capella Sistina
capella=geocode("capella sistina, Vatican City, Roma")
#20x20 grid of coordinates around the Capella
g=expand.grid(lon = seq(capella$lon-0.010, capella$lon+0.010, length.out=20),
lat = seq(capella$lat-0.005, capella$lat+0.005, length.out=20))
#Hull Polygon containing coordinates
#Address of each coordinate of grid
a=apply(g, 1, revgeocode)
#Estimated area of the vatican city
length(grep("Vatican City", a))/length(a)*areaPolygon(p)/1000/1000
s=cbind(g, a)
s$InOut=apply(s, 1, function(x) grepl('Vatican City', x[3]))+0
proj4string(s)=CRS('+proj=longlat +datum=WGS84')
ic=iconlabels(s$InOut, height=12)
plotGoogleMaps(s, iconMarker=ic, mapTypeId="ROADMAP", legend=FALSE)

What If You Dig A Hole Through The Earth?

It suddenly struck me that that tiny pea, pretty and blue, was the Earth. I put up my thumb and shut one eye, and my thumb blotted out the planet Earth. I didn’t feel like a giant. I felt very, very small (Neil Armstrong)

Where would you come out if you dig a hole straight downward from where you live through the center of the Earth? Supposing you survive to the extremely high pressure and temperature of the nucleus, would you find water or land at the other side? It maybe sound a ridiculous question (I will not refute that) but in this post I will estimate how many people would die drowned and how many will find land in the antipode of where they live. At least, knowledge does not take up any space.

I found a database of the United Nations with very useful information for my experiment: longitude, latitude and population of all capital cities of the world in 2011(1). I assumed that capital cities are a good sample of where people live. Maybe is not the best one since some very big countries are represented by only a city but is a good way to obtain a quick estimation. On the other hand, capital cities represent approximately 7% of the world population so in this sense is a very good sample.

Process is simple: loading the xls file, calculating the antipode of each point and checking where it is. Google provides information about country where a coordinate belongs. For coordinates on the sea, no information is returned. Once you have this, is easy to calculate proportion of people that will find water. My estimation is around 77% of people will find water on the other side. Taking into account that all people leave from land and approximately 70% of the Earth’s surface is water, this figure seems to be small but since both poles are symmetrical and are uninhabited, the estimation makes sense. Here you have a map with the result of the experimet. Points are capital cities and size is related with population. In blue, capitals with antipode on the sea and in brown, capitals with antipode in land:
By the way, I am one of the 23% of lucky people that would find land in the other side. I live in Madrid, Spain:
An if some rainy afternoon having little to do I dig a hole through the Earth, I will appear in a place called Weber, in New Zealand:
This estimation can be very silly but the physics involved in the experiment are very interesting as you can see here. By the way, there is a film called Total Recall (2012) where the only way to travel between last two cities in the world is using an elevator through the Earth. Here you have the code:

#The xls file is in
CapitalCities <- read.xlsx("WUP2011-F13-Capital_Cities.xls", sheetName="Capital_Cities", startRow=13,header=TRUE)
names(CapitalCities) = gsub("\\.", "", names(CapitalCities))
#Obtain symmetric coordinates for each capital
CapitalCities$LatitudeSym <- -CapitalCities$Latitude
CapitalCities$LongitudeSym <- -sign(CapitalCities$Longitude)*(180-abs(CapitalCities$Longitude))
CapitalCities$DigResult <- apply(CapitalCities, 1, function(x) {unlist(revgeocode(c(as.numeric(x[11]),as.numeric(x[10]))))})
CapitalCities$Drowned <-$DigResult)*1
#Percentage of population saved
world <- map_data("world")
opt <- theme(legend.position="none",
axis.text =element_blank(),
plot.title = element_text(size = 35),
panel.background = element_rect(fill="turquoise1"))
p <- ggplot()
p <- p + geom_polygon(data=world, aes(x=long, y=lat, group = group),colour="white", fill="lightgoldenrod2" )
p <- p + geom_point(data=CapitalCities, aes(x=Longitude, y=Latitude, color=Drowned, size = Populationthousands)) + scale_size(range = c(2, 20), name="Population (thousands)")
p <- p + labs(title = "What if you dig a hole through the Earth?")
p <- p + scale_colour_gradient(low = "brown", high = "blue")
p <- p + annotate("rect", xmin = -135, xmax = -105, ymin = -70, ymax = -45, fill = "white")
p <- p + annotate("text", label = "Drowned", x = -120, y = -60, size = 6, colour = "blue")
p <- p + annotate("text", label = "Saved", x = -120, y = -50, size = 6, colour = "brown")
p <- p + geom_point(aes(x = -120, y = -65), size=8, colour="blue")
p <- p + geom_point(aes(x = -120, y = -55), size=8, colour = "brown")
p + opt
# Get a map of Spain, centered and signed in Madrid
madrid <- geocode('Madrid, Spain') <- get_map( location = as.numeric(madrid), color = "color", maptype = "roadmap", scale = 2, zoom = 6)
ggmap( + geom_point(aes(x = lon, y = lat), data = madrid, colour = 'red', size = 4)
# Get a map of New Zealand, centered and signed in Weber (the antipode of Madrid)
weber <- geocode('Weber, New Zealand') <- get_map( location = as.numeric(weber), color = "color", maptype = "roadmap", scale = 2, zoom = 6)
ggmap( + geom_point(aes(x = lon, y = lat), data = weber, colour = 'red', size = 4)

(1) United Nations, Department of Economic and Social Affairs, Population Division (2012). World Urbanization Prospects: The 2011 Revision, CD-ROM Edition.