Phil White. Earth, Environment & Geospatial Librarian, University of Colorado Boulder
I'm currently working on a project with a colleague examining economic outcomes for mobile home park residents following sales of parks. My role is developing a geospatial data pipeline that processes parcel data for 3000+ US counties, Microsoft's US Building Footprints data, and Census Block geometries. The pipeline runs through several custom filtering mechanisms and uses building footprints to identify potential mobile home park parcels, which are then associated with both point data for transacted mobile home parks and their Census Blocks. As this analysis will include nearly every county in the lower 48 states, the process is parallelized to efficiently handle several terabytes of data. The data pipeline is complete, and I'm currently testing and preparing to run the code for the entire nation on CU's high-performance computing cluster, Alpine. Code for the processing pipeline is available on GitHub.
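As a rough illustration of the per-county fan-out described above (not the project code), here is a minimal sketch. Everything in it is an assumption made for the example: the `flag_parcels` and `run_counties` helpers, the thresholds, the in-memory county records, and the idea of flagging parcels that hold many small footprints. A thread pool stands in for whatever process-level or cluster-level parallelism the real pipeline uses.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median

# Hypothetical thresholds, chosen only for illustration.
MIN_BUILDINGS = 10
MAX_MEDIAN_AREA_M2 = 150.0


def flag_parcels(county):
    """Return (FIPS, parcel IDs) for parcels holding many small footprints."""
    flagged = []
    for parcel_id, footprint_areas in county["parcels"].items():
        many = len(footprint_areas) >= MIN_BUILDINGS
        if many and median(footprint_areas) <= MAX_MEDIAN_AREA_M2:
            flagged.append(parcel_id)
    return county["fips"], flagged


def run_counties(counties, max_workers=4):
    """Process each county independently and collect results keyed by FIPS."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(flag_parcels, counties))


# Made-up records: each parcel maps to the areas (m^2) of footprints inside it.
counties = [
    {"fips": "08013", "parcels": {"A": [90.0] * 25, "B": [400.0, 350.0]}},
    {"fips": "08031", "parcels": {"C": [120.0] * 12, "D": [80.0] * 3}},
]
results = run_counties(counties)
```

Because each county is handled on its own, the same function could in principle be handed to a process pool or split across cluster jobs without changing the filtering logic.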
This was a project I worked on a few years back with collaborators from Berkeley, UCLA, and the University of Houston, comparing how faculty use materials in our collections based on their citation habits. The project involved downloading about 100k citations from Web of Science using their SOAP API (very nasty XML with unstructured citations), then parsing them with regular expressions to get a DOI for each citation. These DOIs were then queried against the CrossRef REST API (very nicely structured JSON), and the results were written to CSV, analyzed using pandas, and finally turned into some attractive plots using Matplotlib. I wrote this notebook and code for each of my collaborators to run separately. It runs through the regex, CrossRef, pandas, and Matplotlib portions, which make up the bulk of the project's methods.
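The DOI-extraction step can be sketched roughly as follows. This is not the notebook's code: the pattern is a commonly used approximation for modern DOIs, the helper names `extract_doi` and `crossref_url` are hypothetical, and the citation string is made up. The sketch only builds the CrossRef `works` URL and does not send a request.

```python
import re
from urllib.parse import quote

# Approximate pattern for modern DOIs; it will miss some unusual ones.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def extract_doi(citation):
    """Return the first DOI-like string in a citation, or None if absent."""
    match = DOI_PATTERN.search(citation)
    if match is None:
        return None
    # Drop sentence punctuation that the pattern may have swallowed.
    return match.group(0).rstrip(".,;")


def crossref_url(doi):
    """Build the CrossRef REST API URL for a single work."""
    return "https://api.crossref.org/works/" + quote(doi)


# Made-up citation in the loose style of an unstructured reference string.
citation = "SMITH J, 2010, J EXAMPLE RES, V1, P1, DOI 10.1234/example.2010.001."
doi = extract_doi(citation)
url = crossref_url(doi)
```

Returning `None` for citations without a match makes it easy to count how many references could not be resolved before any API calls are made.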
Two examples. First, a notebook I developed for a web mapping workshop. It starts from very basic web mapping using GeoPandas and Folium and advances to a choropleth map with a custom color ramp and style function.
Next, a simple navigation web app I created using the Mapbox GL JS library for my spouse's workplace. It includes the ability to toggle between basemaps and a trails layer styled after their older paper map, with the ability to view attributes on click. This works in a desktop browser, but is ideally used on a mobile device on the property with location turned on so you can see your location and orientation. Code here.
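To give a sense of what a custom color ramp and style function can look like, here is a minimal sketch in plain Python. It is not taken from the workshop notebook: the helper names, the two ramp colors, and the `density` field are all assumptions for illustration. The returned dictionary uses the style keys that Folium's `GeoJson` layer accepts from a `style_function`.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def ramp_color(value, vmin, vmax, start="#ffffcc", end="#800026"):
    """Linearly interpolate between two colors, clamping to [vmin, vmax]."""
    t = 0.0 if vmax == vmin else (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))
    a, b = hex_to_rgb(start), hex_to_rgb(end)
    rgb = tuple(round(x + (y - x) * t) for x, y in zip(a, b))
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def make_style_function(field, vmin, vmax):
    """Build a function mapping one GeoJSON feature to a style dict."""
    def style_function(feature):
        value = feature["properties"][field]
        return {
            "fillColor": ramp_color(value, vmin, vmax),
            "color": "#333333",   # outline color
            "weight": 1,          # outline width in pixels
            "fillOpacity": 0.7,
        }
    return style_function


# Hypothetical feature with a made-up 'density' attribute.
style = make_style_function("density", vmin=0, vmax=100)
low = style({"properties": {"density": 0}})
high = style({"properties": {"density": 250}})  # clamped to the top of the ramp
```

With Folium installed, a function like `style` would typically be passed as `folium.GeoJson(data, style_function=style)`.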
I am a big collector of music in compact disc format. A while back I decided I wanted to take control of my collection, get it pseudo-cataloged, and back it up on cloud space that I manage. This script takes FLAC audio files in my collection, adds new items to my inventory, copies them to an external SSD, converts a copy to MP4 (for adding to my phone), manages the audio files' metadata, and finally uploads FLAC files to an AWS S3 bucket for cloud backup. Uses pandas, pydub, boto3, and other libraries.
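The "adds new items to my inventory" step might look something like the sketch below. It is a standard-library stand-in, not the actual script: the `find_new_albums` helper, the one-folder-per-album layout, and the demo folder names are assumptions, and the real script keeps its inventory with pandas.

```python
import tempfile
from pathlib import Path


def find_new_albums(library_root, known_albums):
    """Return album folders that contain .flac files but are not inventoried."""
    root = Path(library_root)
    # Assumes one folder per album; matching is case-sensitive on most systems.
    on_disk = {
        p.parent.relative_to(root).as_posix() for p in root.rglob("*.flac")
    }
    return sorted(on_disk - set(known_albums))


# Demo against a throwaway directory with two made-up albums.
with tempfile.TemporaryDirectory() as tmp:
    for album in ("Artist A/Album 1", "Artist A/Album 2"):
        folder = Path(tmp) / album
        folder.mkdir(parents=True)
        (folder / "01 Track.flac").touch()
    new_albums = find_new_albums(tmp, known_albums={"Artist A/Album 1"})
```

Only the albums returned here would need to go through the later copy, convert, and upload steps.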
I developed this workshop primarily for GEOG 4303/5303 students to give them a taste of open source raster manipulations. In contrast to ArcPy, Rasterio is more like bowling without bumpers. A lot of things hidden under the hood in ArcPy need more explicit management when you use Rasterio (or GDAL). These notebooks run through I/O operations, managing coordinate transformations, clipping/masking, visualizations, and some basic analyses using NumPy and SciPy. A plus side of Rasterio and most open source GIS options is that riding without the training wheels really teaches you more about the structure and management of geospatial information.
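One example of the explicit management mentioned above is the affine transform that ties pixel positions to map coordinates. The sketch below works through that arithmetic in plain Python rather than calling Rasterio. The `pixel_to_map` helper and the sample transform values are assumptions for illustration; the six coefficients follow the (a, b, c, d, e, f) ordering that Rasterio exposes through a dataset's `transform`.

```python
def pixel_to_map(transform, row, col, center=False):
    """Map a pixel (row, col) to (x, y) using six affine coefficients.

    x = a * col + b * row + c
    y = d * col + e * row + f
    """
    a, b, c, d, e, f = transform
    if center:
        # Shift from the pixel's upper-left corner to its center.
        row, col = row + 0.5, col + 0.5
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y


# Made-up north-up raster: 30 m pixels, upper-left corner at (500000, 4500000).
transform = (30.0, 0.0, 500000.0, 0.0, -30.0, 4500000.0)
corner = pixel_to_map(transform, row=2, col=3)
center = pixel_to_map(transform, row=2, col=3, center=True)
```

The negative `e` term is why y decreases as the row index increases in a north-up raster, which is easy to overlook when a library handles it silently.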
This was just an afternoon curio/rabbit-hole dive prompted by a classroom question. You learn a lot about spatial data by deconstructing and reengineering a pretty standard operation. This uses ArcPy just for I/O operations, but could easily be converted to work with Shapely geometries and applied to a GeoPandas GeoDataFrame. Fun one!