All the GeoJSON on GitHub

TLDR: I made a map of all the GeoJSON files on GitHub.

A while back I saw that GitHub has an archive of all files and their contents available in Google Cloud's BigQuery. The dataset no longer seems to be kept up to date (last updated November 2022), but I had always wanted to do something with it. I thought it might be interesting to see what kind of GeoJSON files people end up putting on GitHub.

The whole GitHub contents table in BigQuery is quite large and most of it isn't GeoJSON, so I started off with the most expensive part of this process: running a query to extract just the GeoJSON file contents. It looked something like:

  SELECT
    f.id,
    f.path,
    f.repo_name,
    c.content,
    c.size
  FROM `bigquery-public-data.github_repos.contents` c
  JOIN `bigquery-public-data.github_repos.files` f USING (id)
  WHERE f.path LIKE '%.geojson'

From there I exported the results to Parquet files and downloaded those to process locally.
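One way to do that export, sketched here with the google-cloud-bigquery client (the table and bucket names are placeholders, not the ones I actually used):

  from google.cloud import bigquery

  client = bigquery.Client()

  # Placeholder names: a table holding the saved query results and a
  # GCS bucket to stage the export.
  results_table = "my-project.github_geojson.geojson_contents"
  destination_uri = "gs://my-bucket/github-geojson-full/*.parquet"

  # Export as Parquet; the wildcard lets BigQuery shard the output
  # across multiple files.
  job_config = bigquery.ExtractJobConfig(destination_format="PARQUET")
  client.extract_table(results_table, destination_uri, job_config=job_config).result()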

DuckDB has become my go-to for working with all sorts of tabular & columnar datasets. It's fast, embeddable, has a great SQL interface, and has beta support for spatial data. Plus it reads and writes a lot of useful formats.
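One setup note: the ST_* geometry functions used in the queries below come from DuckDB's spatial extension, which has to be installed and loaded first:

  import duckdb

  con = duckdb.connect()  # in-memory database; a file path works too

  # The ST_* functions used below live in the spatial extension.
  con.execute("INSTALL spatial;")
  con.execute("LOAD spatial;")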

First issue: there are a lot of messy GeoJSON files. Some are invalid, sometimes on purpose (these files get used in tests) and sometimes not. To pull as many of the valid or close-to-valid features out of the files as possible, I made a UDF to parse, validate, and normalize the features. For now I just care about the geometries, not the properties.

  from duckdb.typing import VARCHAR

  con.create_function("extract_geojson", extract_valid_features, [VARCHAR], VARCHAR)
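A minimal sketch of what such a UDF can look like, using shapely for parsing and validation (illustrative, not the exact implementation; the bounding-box area in particular is just a stand-in for however the area property used below is actually computed):

  import json
  from shapely.geometry import shape, mapping
  from shapely.validation import make_valid

  def extract_valid_features(content):
      # Bail out early on unparseable JSON.
      try:
          data = json.loads(content)
      except (TypeError, ValueError):
          return None
      if not isinstance(data, dict):
          return None

      # Normalize FeatureCollection / Feature / bare geometry into a feature list.
      if data.get("type") == "FeatureCollection":
          candidates = data.get("features") or []
      elif data.get("type") == "Feature":
          candidates = [data]
      else:
          candidates = [{"type": "Feature", "geometry": data}]

      features = []
      for feature in candidates:
          try:
              geom = shape(feature["geometry"])
              if not geom.is_valid:
                  geom = make_valid(geom)
              features.append({
                  "type": "Feature",
                  # Bounding-box area as a rough size measure (illustrative).
                  "properties": {"area": int(geom.envelope.area * 1e6)},
                  "geometry": mapping(geom),
              })
          except Exception:
              continue  # drop features that can't be salvaged
      return json.dumps(features)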

With some acrobatics to parse the JSON, unnest the lists, and dedupe geometries, the final query looks something like this:

  WITH github_contents AS (
    SELECT * FROM "data/github-geojson-full/*.parquet" WHERE content is not null
  ),
  extract_geojson AS (
    SELECT
      json_array_length(json(extract_geojson(content))) as feature_count,
      repo_name, path, unnest(json_extract_string(extract_geojson(content), '$[*]')) as geojson
    FROM github_contents
  ),
  parsed_geoms AS (
    SELECT
      ST_AsWKB(ST_Normalize(ST_GeomFromGeoJSON(json_extract_string(geojson, '$.geometry')))) as geom,
      cast(json_extract_string(geojson, '$.properties.area') as int64) as area,
      array_to_string(array_agg(repo_name), ',') as repos,
      array_to_string(array_agg(path), ',') as paths,
      count(*) as count
    FROM extract_geojson
    GROUP BY 1, 2
  )
  SELECT * FROM parsed_geoms WHERE geom is not null

I ended up splitting out "large" features because there were so many features covering the whole world, and I didn't want to render all of those to as high a zoom level as the smaller polygons and points. So when exporting the Parquet files I made use of the area column calculated as part of the Python UDF. There are also a lot of exact duplicates of geometries, so we keep a count of those as well.
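The split export isn't shown above, but sketched in DuckDB it could look something like this, assuming the deduped results were materialized into a features table first (the cutoff value is arbitrary here):

  # Arbitrary cutoff separating world-spanning features from the rest.
  AREA_CUTOFF = 1_000_000

  con.execute(f"""
      COPY (SELECT * FROM features WHERE area >= {AREA_CUTOFF})
      TO 'data/github-features_large.parquet' (FORMAT PARQUET)
  """)
  con.execute(f"""
      COPY (SELECT * FROM features WHERE area < {AREA_CUTOFF} OR area IS NULL)
      TO 'data/github-features_small.parquet' (FORMAT PARQUET)
  """)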

Ideally in the future DuckDB will write GeoParquet and tippecanoe will read it, but until then the gpq CLI tool let me pretend the future is here. I used gpq to convert the standard Parquet files to include the needed GeoParquet metadata, and then to stream line-delimited GeoJSON to tippecanoe's stdin to create the vector tiles. FlatGeobuf is another good option here, as DuckDB and tippecanoe both support it, but I ran into unknown geometry type errors and haven't yet dug in enough to figure out why.

So for the large features, that ends up looking like:

  # convert to geoparquet
  gpq convert data/github-features_large.parquet data/geoparquet/github-features_large.parquet

  # geoparquet to geojson to mapbox vector tiles packed into pmtiles file
  gpq convert data/geoparquet/github-features_large.parquet --to=geojson \
    | tippecanoe --force --maximum-tile-bytes=3000000 --maximum-tile-features=300000 \
        -z9 -Z0 --drop-smallest-as-needed -P --base-zoom=9 --generate-ids \
        --layer=github-geojson-large --output=data/tilesets/github-geojson-features-large.pmtiles

The PMTiles file is then copied to Cloudflare R2, with a Cloudflare Worker running as a tile server. More on that setup in the PMTiles docs.

The Map!

Explore all the GeoJSON on GitHub. Click to see a list of features at that point and the repos they came from.

A few notes:

  • The data is from November 2022, so it's not up to date.
  • The contents table in BigQuery only includes files smaller than 1 MB, so large GeoJSON files aren't included.
  • The map tiles are a bit on the large side, so on slow connections or older devices the map might not render well.

All the GeoJSON on GitHub in 1 map

Full page version

The script to process the exported GitHub data and the minimal frontend are on GitHub.

If you have thoughts, feedback, or just like weird maps, my email is mick at mick.im.