How to Build a Coronavirus Dashboard - Part 1: Back End
At CartoLab, we build a lot of data-driven web applications. Map-based dashboards have become a popular way to share information about the spread of the novel Coronavirus, so we decided to build one. We built it from scratch in the span of a couple of days, and now we want to share how we did it. This is part one of a two-part series covering how we built our dashboard, starting with our data-processing pipeline.
Step 1: Choosing What Data to Show
There are already many map dashboards out there showing global case counts. We wanted ours to add features such as county-level time-series animations, 3D visualizations, and overlays of contextual data like demographics.
For our data source on virus counts we chose usafacts.org, which compiles cumulative counts on a county-by-county basis, alongside data from Johns Hopkins University.
We also wanted data that adds context to the counts, so we pulled county boundaries and county-level statistics from the U.S. Census Bureau, such as:
Total population per county
Percentage of people over 65
Percentage of people with health insurance
Median age
Step 2: Data ETL
We wrote ETL (extract-transform-load) pipelines to load the virus counts into a PostGIS database alongside the US Census Bureau data, including county boundaries and demographic statistics.
COVID-19 Data
The COVID-19 data from USAFacts is available as two CSV files: one with cumulative confirmed case counts (a running total for each day since reporting began) and one with deaths. We used a Python script to pull the CSV files into an in-memory IO stream and load them into a Pandas data frame. Pandas is a Python library that is invaluable for data ETL and cleanup operations.
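Here is a minimal sketch of that ingestion step using requests. The download URL shown is illustrative; check usafacts.org for the current file locations.

```python
import io

import pandas as pd
import requests

# Illustrative URL; confirm the current location on usafacts.org.
CONFIRMED_URL = (
    "https://usafactsstatic.blob.core.windows.net/public/data/"
    "covid-19/covid_confirmed_usafacts.csv"
)

resp = requests.get(CONFIRMED_URL)
resp.raise_for_status()

# Read the response body through an in-memory stream into a DataFrame
confirmed = pd.read_csv(io.StringIO(resp.text))
```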
The data from USAFacts needs a little cleanup to be truly useful: each date is its own column, so we needed to pivot the time-series data into rows. With Pandas we tidied the data (column headers, etc.), set an index on all columns except the date columns, and ran the Pandas stack function to pivot the dates into rows. Finally, we merged the two CSVs into a single table and removed any duplicate rows.
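A sketch of that pivot, with `confirmed` and `deaths` loaded as above and assuming the USAFacts identifier columns are named countyFIPS, County Name, State, and stateFIPS (adjust to the actual headers):

```python
def pivot_dates(df, value_name):
    """Pivot wide per-date columns into one (county, date, value) row each."""
    id_cols = ["countyFIPS", "County Name", "State", "stateFIPS"]  # assumed headers
    long = (
        df.set_index(id_cols)
          .stack()                      # date columns become an index level
          .reset_index()
          .rename(columns={"level_4": "date", 0: value_name})
    )
    long["date"] = pd.to_datetime(long["date"])
    return long

confirmed_long = pivot_dates(confirmed, "confirmed")
deaths_long = pivot_dates(deaths, "deaths")

# One row per county per day, with both measures
merged = (
    confirmed_long.merge(deaths_long)   # joins on the shared id and date columns
                  .drop_duplicates()
)
```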
We loaded the data into a PostgreSQL database (more on that later), so with each update we truncated the destination tables and reset their indexes before inserting the new data. We also used a database cursor to refresh some materialized views in the database (more on those below as well).
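A minimal sketch of that refresh cycle using SQLAlchemy (the connection string, table, and view names are placeholders; a raw database cursor works just as well):

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost:5432/covid")  # placeholder DSN

with engine.begin() as conn:
    # Clear out the previous load and reset identity sequences
    conn.execute(text("TRUNCATE TABLE covid_counts RESTART IDENTITY"))

# Bulk-insert the cleaned DataFrame from the previous step
merged.to_sql("covid_counts", engine, if_exists="append", index=False)

with engine.begin() as conn:
    # Refresh the materialized views that depend on the new counts
    conn.execute(text("REFRESH MATERIALIZED VIEW county_covid_map"))
```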
Census Data
For county boundaries and demographics, we obtained the data from the US Census Bureau. The county boundaries come from the Cartographic Boundary dataset. The demographic data was pulled from the American Community Survey. We used another script to pull specific metrics for all counties via the Bureau's API, but the same data could be downloaded from the Data Finder tools as well.
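A hedged sketch of pulling metrics from the API. The variable codes and vintage year are examples (B01003_001E is total population and B01002_001E is median age in the ACS); look up the exact variables you need at api.census.gov.

```python
import pandas as pd
import requests

URL = "https://api.census.gov/data/2018/acs/acs5"
params = {"get": "NAME,B01003_001E,B01002_001E", "for": "county:*"}

resp = requests.get(URL, params=params)
resp.raise_for_status()

rows = resp.json()                            # first row is the header
acs = pd.DataFrame(rows[1:], columns=rows[0])
acs["fips"] = acs["state"] + acs["county"]    # 5-digit county FIPS for joining
```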
We imported the county boundaries into our PostgreSQL database (which is spatially enabled with PostGIS) using OGR. We also imported the demographic data into a table in the database.
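For the boundary import, something like the following works. The shapefile name follows the Census Bureau's cb_&lt;year&gt;_us_county_&lt;resolution&gt; pattern, and the connection details and table name are placeholders:

```python
import subprocess

PG_CONN = "PG:host=localhost dbname=covid user=covid password=covid"  # placeholder

subprocess.run(
    [
        "ogr2ogr",
        "-f", "PostgreSQL", PG_CONN,
        "cb_2018_us_county_500k.shp",   # Cartographic Boundary shapefile
        "-nln", "county_boundaries",    # destination table name
        "-nlt", "MULTIPOLYGON",         # normalize mixed polygon geometries
        "-t_srs", "EPSG:4326",          # reproject to WGS84 for web mapping
    ],
    check=True,
)
```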
Step 3: Data Joins
The point of loading this data into a PostgreSQL database is that we can perform joins between tables. For example, we can join virus cases and deaths to the county boundaries, and join census demographics as well, producing a final table we can export, load into vector tiles, and display on a map. We used SQL along these lines to build a materialized view that combines the virus data and contextual data with the GIS boundaries.
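A sketch of what such a view could look like; the table and column names below are illustrative placeholders, not our actual schema:

```sql
-- Join virus counts and demographics onto county geometries for mapping
CREATE MATERIALIZED VIEW county_covid_map AS
SELECT
    b.geoid,
    b.name,
    b.geom,
    c.date,
    c.confirmed,
    c.deaths,
    d.total_population,
    d.pct_over_65,
    d.pct_insured,
    d.median_age
FROM county_boundaries b
JOIN covid_counts c ON c.county_fips = b.geoid
JOIN county_demographics d ON d.fips = b.geoid;
```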
Step 4: Building Mapbox Vector Tiles
At CartoLab, we like to use the Mapbox Vector Tile format for displaying data in web maps. Vector tiles are great because we get the performance of tiled data combined with the data-access capabilities of vector data. We feel that vector tiles, coupled with a WebGL-accelerated mapping library like Mapbox GL JS, are perfect for displaying large amounts of data (say, county-level time-series data) in a web map. So how do we get from a PostGIS table or view to a vector tile?
Since we do this kind of work a lot, we have utility scripts built around Python and a command-line tool called Tippecanoe. Essentially, a script reads a PostGIS table, converts it to GeoJSON, and then processes it with Tippecanoe. There are other ways to make vector tiles, but we feel Tippecanoe offers the most control over the final .mbtiles file in terms of size and how it renders in a browser. The script also handles uploading the tiles to Mapbox.com, which we are using as the tile server on this project. Once the vector tiles are uploaded to the tile server, they are ready to be used in a web map.
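A stripped-down sketch of that pipeline; the connection string, file names, and layer name are placeholders, and the Tippecanoe flags shown are just a reasonable starting point:

```python
import subprocess

PG_CONN = "PG:host=localhost dbname=covid user=covid password=covid"  # placeholder

# 1. Export the PostGIS materialized view to GeoJSON
subprocess.run(
    ["ogr2ogr", "-f", "GeoJSON", "counties.geojson", PG_CONN, "county_covid_map"],
    check=True,
)

# 2. Tile it with Tippecanoe
subprocess.run(
    [
        "tippecanoe",
        "-o", "counties.mbtiles",
        "-l", "counties",              # layer name inside the tileset
        "-zg",                         # let Tippecanoe guess an appropriate max zoom
        "--drop-densest-as-needed",    # thin features to keep tiles under size limits
        "counties.geojson",
    ],
    check=True,
)
```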
This concludes Part 1 of our series. Stay tuned for Part 2, in which we discuss developing the front end of the dashboard.