Finding a Good Data Source For a World Map

Last year I contributed a Flash-based world map to the open source Piwik Web Analytics project. At that time, all of the Piwik charts where rendered in Flash, but as with release 1.5, all Flash charts were replaced with JS/canvas charts. Now, my map plugin is the only Flash component in the whole project, which is why they asked me if I could rewrite it in SVG. I will take some time in October to do this and I want to write here about my work. In contrast to many past projects where I wrote making-of like posts after finishing, I now want to write while working – literally. This kind of invites everyone interested to take part in the design process and, besides of that, provides a good amount of documentation and tutorial-like resources.

So, let’s start the project. A year ago, I already wrote a proposal for some new features for the Piwik map, which I now want to implement. The first part of the work will be to prepare the maps, the second part will be to render the maps using SVG and, presumably, a JS/SVG framework like RaphaelJS. The process underlying the first part looks like this:

In this post I will focus on the initial step, the map data source.

Finding a map data source

The biggest challenge was to find map data of inner-country regions (like the U.S. states, the German Bundesländer or the Regions of France) for all countries of the world. Initially I thought of extracting the administrative boundaries from the huge OpenStreetMap database, but this is not as easy as it sounds. Finally I decided to look for other data sources – and succeeded. I found gadm.org, a website whose entire purpose is to provide free map files of the worlds administrative boundaries and they had exactly what I was looking for: a shapefile of all countries with their first-level sub-divisions.

Update: I will use the Natural Earth shapefile instead of GADM, see below.

Handling huge shapefiles

After unpacking, the shapefiles where 439MB big. I wanted to know a little more about the data before starting to actually work with them, so I wrote a small Python script to get some information. There are plenty of Python libraries for reading shapefiles, I kind of like this one, mostly because its simplicity. So here’s the script:

import shapefile, csv
# create a new csv file for writing and write csv header
csvout = csv.writer(open('shp_meta.csv', 'w'))
# open the shapefile and read records
sf = shapefile.Reader('gadm1_lev1.shp')
shape_records = sf.shapeRecords()
 
for record in shape_records:
	numPoints = len(record.shape.points)
	numParts = len(record.shape.parts)
	shapeMeta = record.record
	csvout.writerow([ numParts, numPoints] + record.record)

The first interesting fact is that executing the script took my Macbook a little more than three minutes and the RAM usage peaked at something around 3GB(!). This means that the script might not run on some machines. Maybe I should consider trying out other shapefile libraries later, to test their shapefile reading performance.

Analyzing the map data

However, here’s the resulting csv file. The most interesting thing about it is what is stored in the shape record dBase metadata. Fortunately, the shapefile stores the countries ISO codes, which I need for mapping them to the country ids used in Piwik. Next good thing is that we also have official two-letter codes and names language for the level-1 boundaries. The names are provided both in English and in the countries local language, which is very good since we can use them later for the map rendering.

Another thing of great interest is the quality of the shapefile. The big file size is a first indicator for a high level of detail, but I wanted to know more about it. This is why I added two more columns: the total number of points per shape and the number of shape parts (a shape may contain multiple parts, which are polygons in our case). To analyze this data in more detail, I loaded the csv file into R, which is a great (if not the best) software for data analyzing:

data = read.csv('shp_meta.csv', header=TRUE)
pc = data$point.count
sc = data$shape.count
country = data$iso3
region = data$region.code

The first thing I looked at was the total number of points in the shapefile, which is 28.5 million points(!) or something around 8500 points per shape.

> sum(pc)
[1] 28565495
> mean(pc)
[1] 8486.481
> plot(pc)

However, after looking at the scatterplot of point counts I saw that only few of the shapes seem to have that much of points.

Here’s a look at the distributions of parts per shape, points per shape, points per part and points per country. I used a boxplot of the log10 counts for each measurement. One can now see, that the median of parts per shape is at 10^0, which means that at least half of the shapes consist of only one part. However there are some shapes that are made up of almost 10^4 or 10000 shapes, which I assume to be near pole-regions that contain many small island-like parts. The points per shape has it’s median at 10^3 or 1000 with a few outliers to the top. Points per country peaks at 10^6, which means we have to deal with up to one million points per country.

Of course I wanted to know what countries and regions are responsible for those outliers in point counts, so I replaced the dots with labels. Interestingly, it’s Chile, Canada and the United States.

plot(pc, col="white")
text(1:length(pc), pc, iso3)

This is the region Magallanes and Antártica Chilena, which is the one with an outstanding number of about 1.2 million points.

One of the next smaller regions is Alaska, which has about 600 000 points.

alaska

Closing thoughts

Since the admin-level 1 shapefile doesn’t contain the admin-level 0 boundaries, at least not explicitly, we will either need to compute them using the “outer” regional boundaries of each country, or take them from the admin-level 0 shapefile from gadm.org.

Obviously, 1.2 mio points for a single shape are way too much for display in a browser using SVG. Therefore, in the next session we will proceed with map projection and the polygon simplification process. We will take a good look to Magallanes and Antártica Chilena then.

Update (Oct, 25)

Unfortunately, the GADM maps are licensed under a not-so-free license, which doesn’t permits commercial use. Also, it seems that GADM doesn’t update their shapefiles very frequently, since it still doesn’t contain South Sudan. Fortunately, Manu told me about Natural Earth, a site which I doesn’t only provide hi-res images of the earth (which I initially thought), but also provides detailed shapefiles. I checked the Admin 1 – States and Provinces shapefile and it contains South Sudan. And all data is licensed as public domain. That’s not a great reason to switch. So, no GADM anymore.