PostGIS Data Ingestion

Recently, I asked people on a forum about the pain points they run into when dealing with the services and infrastructure behind the web mapping applications they develop. In my experience, GIS developers, especially those just finding their feet, are largely focused on adding maps and widgets to applications. These are firmly user-facing, “front end” concerns, and developers at that stage often have little experience putting everything together into a fully cohesive application. One of the topics that came up was what to do after you have a PostGIS database set up in the cloud: how do you ingest data, and geospatial data in particular? At CartoLab, we use PostgreSQL with PostGIS extensively, so I attempted to give a thorough answer based on our experiences. Another poster on the thread reminded me of some other tools as well. I’ve compiled the information below.

You can break the approaches down into a couple of groups based on how much processing you need to perform on the data before you copy it over to the database. More often than not, we’ll retrieve data from a less-than-reliable source: weird date formats, inconsistent values that shouldn’t be there (fields where no one could decide on case or format, e.g. Washington DC, DC, washington, WASHINGTON, or District of Columbia), bad geometries, whitespace problems, and other, rarer issues. In those situations, we take a data-cleaning approach to standardize the inputs so they work better inside an RDBMS (Relational Database Management System). When we run across data like this, we’ll use tools like the following:

  1. Python - Of all the methods in this post, Python is probably the hardest to maintain in a small team or with complicated data, though once you get good with it, it makes all of the other methods feel much easier. I’ll generally use the pandas or geopandas packages to write a full data pipeline: download the data from a remote host, clean up and transform it using filters and joins, and then insert it into the database using pandas.io.sql and SQLAlchemy. Another option is psycopg2. (A minimal sketch of such a pipeline follows this list.)

  2. FME - Like Python above, FME is great for complicated data that needs processing before it is inserted into the database. We use FME extensively at CartoLab because it provides a low-to-no-code way for everyone in the organization to build reusable data pipelines that retrieve data from remote sources, transform it, and insert it into our databases or data lakes (or, usually, both!). FME pipelines are built as graphs, linking processing tools together visually.

  3. QGIS - Another option is QGIS, which is great for one-off projects where you just want to clean up some rarely changing data (a specific year’s census, for instance), the work only has to be done once, and it doesn’t need a full pipeline. You can clean up your data, then use the database plugin to connect to your database and create a new table or insert into an existing one. QGIS is also great for taking a look at your data after you insert it.
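
To make the Python option concrete, here is a minimal sketch of the kind of pipeline described in the first item, using geopandas and SQLAlchemy. The source URL, column names, table name, and connection string are all hypothetical placeholders, and the cleanup steps will vary with your data:

```python
import geopandas as gpd
from sqlalchemy import create_engine

# Hypothetical source and connection details -- substitute your own.
SOURCE_URL = "https://example.com/data/facilities.geojson"
engine = create_engine("postgresql://user:password@localhost:5432/gisdb")

# Download the remote data into a GeoDataFrame.
gdf = gpd.read_file(SOURCE_URL)

# Clean up the kinds of inconsistencies described above:
# trim whitespace and standardize a free-text state field.
gdf["state"] = gdf["state"].str.strip().replace(
    {"Washington DC": "DC", "washington": "DC",
     "WASHINGTON": "DC", "District of Columbia": "DC"}
)

# Drop rows with missing or invalid geometries rather than
# letting them break the load.
gdf = gdf[gdf.geometry.notna() & gdf.is_valid]

# Write to PostGIS (to_postgis requires the geoalchemy2 package).
gdf.to_postgis("facilities", engine, if_exists="replace", index=False)
```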

If your data doesn’t require a lot of cleanup and you just want to move files into PostGIS, there are a few other solutions available.

  1. GDAL - GDAL is a translation library for spatial data. It’s fully open source and is basically a geospatial Swiss Army knife. It’s a command-line tool available for Windows, macOS, and Linux, with bindings for Python. The GDAL programs that matter most to us are ogr2ogr, which translates any supported spatial format to any other (GeoJSON to PostGIS, for instance), and gdal_translate and gdalwarp for handling raster data. Alongside these, PostGIS itself ships raster2pgsql for moving rasters into PostGIS and shp2pgsql for moving shapefile data in. (See the sketch after this list.)

  2. The psql client, the PostGIS DBF and Shapefile Loader, or pgAdmin (or your database client of choice) - With a client installed on your local computer, you can use the COPY command to move data in formats like CSV from your local machine into the database. To use this, you need a PostgreSQL client whose version matches your server. pgAdmin includes a graphical, if somewhat confusing, way of doing this as well. On Windows, installing a matching version of PostGIS locally also gives you the PostGIS Shapefile Import/Export Manager, which is really just a GUI wrapped around shp2pgsql. (A COPY example follows this list.)
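
To illustrate ogr2ogr, here is a sketch using GDAL’s Python bindings (mentioned above); gdal.VectorTranslate is the programmatic equivalent of the ogr2ogr command line. The file name, connection details, and layer name are hypothetical:

```python
from osgeo import gdal

gdal.UseExceptions()

# Roughly equivalent to the command line:
#   ogr2ogr -f PostgreSQL PG:"dbname=gisdb user=gisuser" parcels.geojson \
#       -nln parcels -t_srs EPSG:4326
gdal.VectorTranslate(
    "PG:dbname=gisdb user=gisuser host=localhost",  # destination: PostGIS
    "parcels.geojson",                              # source: any OGR-supported format
    format="PostgreSQL",
    layerName="parcels",
    dstSRS="EPSG:4326",  # reproject on the way in
)
```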
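And to sketch the COPY route: because COPY itself reads relative to the server’s filesystem, client-side loads usually go through psql’s \copy or a driver’s COPY support. Below is a hedged example using psycopg2 (mentioned earlier) to stream a local CSV of points into a staging table, then build geometries with PostGIS functions. All table and column names are hypothetical, and it assumes a points table with a geometry column already exists:

```python
import psycopg2

conn = psycopg2.connect("dbname=gisdb user=gisuser host=localhost")
with conn, conn.cursor() as cur:
    # Stage the raw CSV (assumed to have name, lon, lat columns).
    cur.execute("""
        CREATE TEMP TABLE staging_points
        (name text, lon double precision, lat double precision)
    """)
    with open("points.csv") as f:
        cur.copy_expert(
            "COPY staging_points FROM STDIN WITH (FORMAT csv, HEADER true)", f
        )
    # Build real geometries from the staged coordinates.
    cur.execute("""
        INSERT INTO points (name, geom)
        SELECT name, ST_SetSRID(ST_MakePoint(lon, lat), 4326)
        FROM staging_points
    """)
```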

Finally, there is the other major way to ingest data into a PostGIS database: through a web application and API (Application Programming Interface) framework. Here, users and other programs make a secure connection to software that sits in front of the database and interacts with the tables and functions we expose through web endpoints. These APIs are often written in a server-side language like Python, NodeJS, Go, PHP, or .NET. A large part of what CartoLab does is create these API layers so that our web applications and our clients’ systems can work together. In a single application, we’ll often ingest data into a PostGIS database from static remote sources and webhooks, from scheduled tasks or cron jobs, and, of course, from user input.
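
As a bare-bones illustration of that last pattern, here is a sketch of a user-input endpoint using Flask and psycopg2. Every name here is hypothetical, and a production version would add authentication, validation, and connection pooling:

```python
import psycopg2
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/api/observations", methods=["POST"])
def create_observation():
    # Expect JSON like {"name": "...", "lon": -77.03, "lat": 38.90}.
    payload = request.get_json()
    conn = psycopg2.connect("dbname=gisdb user=apiuser host=localhost")
    with conn, conn.cursor() as cur:
        # Parameterized SQL keeps user input from becoming SQL injection.
        cur.execute(
            """
            INSERT INTO observations (name, geom)
            VALUES (%s, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
            RETURNING id
            """,
            (payload["name"], payload["lon"], payload["lat"]),
        )
        new_id = cur.fetchone()[0]
    conn.close()
    return jsonify({"id": new_id}), 201
```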

Hopefully, this post has provided a bit of clarity on how and when to use each of these approaches to get data into your PostGIS-enabled database.
