ETL in ArcGIS Online with ArcGIS Notebooks

Esri recently updated its ArcGIS Online platform with a new feature: Notebooks. This post covers what they are, why they can be useful, and how to use them to improve data operations in your ArcGIS Online organization.

Background

In 2015, I attended the Esri Developer Summit and stumbled upon a technical session by Chris Helm titled "Koop: Using 3rd party services within the ArcGIS Platform". Koop describes itself as an open source geospatial ETL engine. In short, it provides a standard way to pull data from any source and expose it in a format that web apps understand, primarily the Esri geoservices (Feature Service) specification.

The beauty of Koop is that it made it easy to integrate external data into Esri’s ArcGIS ecosystem. A popular example, presented by Daniel Fenton, illustrates how Koop can be used to pull rental housing data from Craigslist and expose it as a Feature Service, which can then be added as a layer in an ArcGIS Online web map.

I mention Koop because it is one of my favorite ways to perform ETL when building services for ArcGIS Online applications. My second favorite approach is to publish Hosted Feature Services directly in ArcGIS Online by writing custom ETL scripts that interact with the platform.

Both approaches excel in certain areas; however, they have some drawbacks.

Koop excels by letting an HTTP request trigger either a read from cache or a fresh ETL run to get new data. If the cache is disabled, you can ensure the end user is getting the latest features. The drawback is that you need to manage an architecture that accounts for heavy loads when your service goes ‘viral’, and have the forethought to handle errors so the whole thing doesn’t crash.
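
To make the cache-versus-fresh-ETL trade-off concrete, here is a minimal Python sketch of the idea. It is not Koop's code (Koop is a Node.js project); the CachedProvider class, its run_etl callback, and the ttl_seconds parameter are all hypothetical names used for illustration. Setting the time-to-live to zero mimics a disabled cache, and a failed ETL run falls back to the last good copy instead of crashing.

```python
import time


class CachedProvider:
    """Hypothetical request-triggered ETL with a time-based cache."""

    def __init__(self, run_etl, ttl_seconds, clock=time.monotonic):
        self._run_etl = run_etl        # callable that returns fresh features
        self._ttl = ttl_seconds        # 0 means "always re-run the ETL"
        self._clock = clock            # injectable so the cache can be tested
        self._data = None
        self._fetched_at = None

    def get_features(self):
        """Return cached features, refreshing them first if they are stale."""
        now = self._clock()
        stale = (
            self._fetched_at is None
            or (now - self._fetched_at) >= self._ttl
        )
        if stale:
            try:
                self._data = self._run_etl()
                self._fetched_at = now
            except Exception:
                # With no earlier copy to fall back on, surface the error.
                if self._data is None:
                    raise
                # Otherwise keep serving the last good copy.
        return self._data
```

Each incoming request would call get_features(); how aggressively the source is hit is then controlled by a single number.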

You can interact programmatically with the ArcGIS platform through APIs like the ArcGIS REST API, or use a library with a higher level of abstraction such as the ArcGIS API for Python. Using this approach, we can write our ETL locally and publish a layer directly on the ArcGIS platform as a Hosted Feature Service. The main benefit here is that we offload the job of hosting our service to Esri’s scalable architecture. The drawback is that the code generally runs on your local computer, or is shared with a team that has to sync development environments and dependencies for it to run.
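
As a rough illustration of the lower-level route, the sketch below builds (but does not send) a request to the ArcGIS REST API's addFeatures operation on a feature layer, using only the Python standard library. The layer URL and token are placeholders you would supply, and build_add_features_request is a hypothetical helper, not part of any Esri library.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_add_features_request(layer_url, features, token):
    """Build a POST request for a feature layer's addFeatures operation.

    layer_url: REST endpoint of one layer, e.g. ".../FeatureServer/0"
    features:  list of Esri JSON features (dicts with attributes/geometry)
    token:     an ArcGIS token obtained elsewhere (placeholder here)
    """
    body = urlencode({
        "f": "json",                       # ask for a JSON response
        "features": json.dumps(features),  # features travel as a JSON string
        "token": token,
    }).encode("utf-8")
    return Request(
        f"{layer_url.rstrip('/')}/addFeatures",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
```

Passing the result to urllib.request.urlopen would send it. With the ArcGIS API for Python, the same operation is wrapped by FeatureLayer.edit_features(adds=...), so you would not normally assemble the request by hand.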

The ideal solution for designing ETL processes for integration with the ArcGIS ecosystem should build on these two solutions by allowing:

  1. ETL scripts to be easily run manually or by some other trigger

  2. Development environment to be consistent and accessible across a team

  3. The process to be easily documentable and clear

  4. Services to be published and hosted by the ArcGIS platform

Working Towards a Solution

Esri recently released ArcGIS Notebooks for ArcGIS Online, which are essentially managed and hosted Jupyter notebooks. At a high level, Notebooks in ArcGIS are like shareable documents for documenting and running Python code live in your browser. Notebooks can provide a consistent environment for team members to run the same code, and already have access to hundreds of libraries for your ETL and other needs.

As a case study, I wanted to recreate the Koop example (using ArcGIS Notebooks), which pulled rental housing data from Craigslist and made it available in an ArcGIS Online Web Map.

The pseudocode is straightforward:

  1. Get rental listings from Craigslist via their API

  2. Remove old listings in a Feature Service

  3. Populate Feature Service with new listings

Implementing this in ArcGIS Notebooks with Python is just as straightforward.
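
A minimal sketch of steps 2 and 3, under a few assumptions: the listings have already been fetched (step 1) as a list of dicts with the hypothetical keys name, price, url, lat, and lon; the attribute names match the fields of your own layer; and layer is an object exposing delete_features(where=...) and edit_features(adds=...), as arcgis.features.FeatureLayer does. The helper names listing_to_feature and refresh_layer are made up for this example.

```python
def listing_to_feature(listing):
    """Convert one listing dict into an Esri JSON point feature (WGS84)."""
    return {
        "attributes": {
            # Hypothetical field names; they must exist in the target layer.
            "name": listing["name"],
            "price": listing["price"],
            "url": listing["url"],
        },
        "geometry": {
            "x": listing["lon"],
            "y": listing["lat"],
            "spatialReference": {"wkid": 4326},
        },
    }


def refresh_layer(layer, listings, chunk_size=500):
    """Replace the layer's contents with the given listings.

    Returns the number of features written.
    """
    # Listings without coordinates cannot be mapped, so skip them.
    features = [
        listing_to_feature(item)
        for item in listings
        if item.get("lat") is not None and item.get("lon") is not None
    ]
    if not features:
        # Leave the old data in place rather than publishing an empty layer.
        return 0

    # Step 2: remove old listings ("1=1" matches every row).
    layer.delete_features(where="1=1")

    # Step 3: add the new listings in batches to keep requests small.
    for start in range(0, len(features), chunk_size):
        layer.edit_features(adds=features[start:start + chunk_size])
    return len(features)
```

In a notebook, layer could come from GIS("home").content.get(item_id).layers[0], where item_id is the ID of your hosted feature layer item. The empty-result guard is a design choice: if the fetch step returns nothing, the previous listings stay on the map.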

The Feature Service now contains the latest rental housing data from Craigslist, and the listings appear as a layer once it is added to a Web Map.

Discussion

ArcGIS Notebooks are very accessible to someone wanting to write ETL that will grab external content and make it available in their ArcGIS Web Maps. They don’t require setting up a local Python environment, have access to most libraries you’ll ever need, and can be shared easily with other members of your ArcGIS Organization. Notebooks are simple to run and provide the benefit of being able to segment your code into cells, making it easier to write and debug code.

Features (in this example) are written to a Hosted Feature Service in ArcGIS Online, so we don’t have to worry about hosting anything on our own.  The ability to have the entire process completed ‘in the cloud’ is a big advantage.

I see ArcGIS Notebooks as a preferred method for doing ETL on services that don’t need to be updated that frequently. Koop has a concept of building ‘pass-through providers’, meaning ETL happens on every request to get the latest data. In ArcGIS Notebooks, you either need to run the code manually when you need updates or use Scheduled Tasks if you're hosting your own Notebook Server. Currently, Scheduled Tasks are only available through the ArcGIS Server role, but are on the roadmap to be included in ArcGIS Online.

A benefit of running ETL through ArcGIS Notebooks is that ‘who’ runs the notebooks can change. You can build ETL scripts in a Notebook and have another user in your Organization run it, either by transferring ownership of the item or by sharing the Notebook with them. This can be a helpful workflow for developers to write ETL and then empower their users to run it when needed to grab new content. It’s also easy to modify an existing notebook and save it as a new item, kind of like forking someone else’s code and making it your own.

If you’re running ArcGIS Notebooks in ArcGIS Online, they consume credits when you use the Advanced or other premium instances. These instances include additional compute and memory resources, and give you access to ArcPy. For general ETL, though, the Standard instance is free, has a large list of included libraries, and will most likely provide more than enough resources for your ETL jobs.

Conclusion

There are many reasons to try writing your ETL using ArcGIS Notebooks. They provide a very low hurdle for getting started with Python, and I like that I can empower other users to run ETL scripts on their own by giving them access to a Notebook. For simple ETL on data that doesn’t need to be ‘to the second’ up to date, it’s a great alternative to other solutions and products.

ArcGIS Notebooks doesn’t replace the need for libraries like Koop, or more intelligent solutions like FME.  It does, however, fit nicely into the toolbox of options that you can use for your next project.
