Hacking Shapefiles
One of the first tasks I had at CartoLab was writing a data pipeline system in Python. One part of the task was to get the most up-to-date data from an FTP site where it was stored as, of course, Shapefiles.
Since this was a long-running Python process, I wanted to be sure that the data we were using was actually fresh before I started the multi-hour process. A simple way to do this is to use a Python FTP client and parse out the file system dates in the FTP directory, and I initially did just that.
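For illustration, here is a rough sketch of that first approach using the standard library's ftplib. It is not the code I ran back then: it asks the server for a modification time with the MDTM command instead of parsing a directory listing, and not every server supports MDTM. The helper names parse_mdtm and remote_mtime are made up for this example.

```python
from datetime import datetime
from ftplib import FTP


def parse_mdtm(response: str) -> datetime:
    """Parse an FTP MDTM reply such as '213 20240131120000' into a datetime."""
    code, _, stamp = response.strip().partition(" ")
    if code != "213":
        raise ValueError(f"unexpected MDTM reply: {response!r}")
    # Some servers append fractional seconds; keep only YYYYMMDDHHMMSS.
    return datetime.strptime(stamp[:14], "%Y%m%d%H%M%S")


def remote_mtime(ftp: FTP, filename: str) -> datetime:
    """Ask the server when a file was last modified (assumes MDTM support)."""
    return parse_mdtm(ftp.sendcmd(f"MDTM {filename}"))
```

With a connected and logged-in FTP object, remote_mtime(ftp, "roads.dbf") would return the server-side timestamp, which is exactly the value that turned out to be unreliable here.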
What I found was that, sometimes, the administrator who was loading files into the FTP site was loading the same files, without changes, multiple times. This led me to run the whole process just to find out that the data from step 1 was the same as the month before.
The next method that I used to check the best-by date was to create an md5 hash of each of the files and store the hashes in a JSON file for comparison, but I found that to be clunky because, well, Shapefiles have so many files! So, I started looking for another solution.
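A minimal sketch of that hash-and-compare idea, assuming the downloaded files sit in one folder and the JSON manifest lives somewhere outside of it. The function names and the manifest layout (file name mapped to md5 hex digest) are my own choices for this example, not the original pipeline code.

```python
import hashlib
import json
from pathlib import Path


def md5_of(path: Path) -> str:
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(folder: Path, manifest: Path) -> list[str]:
    """Return names whose hash differs from the saved manifest, then update it."""
    old = json.loads(manifest.read_text()) if manifest.exists() else {}
    new = {p.name: md5_of(p) for p in sorted(folder.iterdir()) if p.is_file()}
    manifest.write_text(json.dumps(new, indent=2))
    return [name for name, md5 in new.items() if old.get(name) != md5]
```

It works, but every sidecar file of every Shapefile has to be downloaded and hashed before you know anything, which is where the clunkiness comes from.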
As I mentioned, and as you’re probably aware, the Shapefile is made up of many files, but there are 3 specific files all Shapefiles must have: the .shp, the .shx, and the .dbf.
Starting with the .shp file, I took a look at the file header and noticed that it only included some metadata, the shape type, and the different bounding boxes, with no date anywhere. So, I next looked at the .dbf file, the dBase file, since it’s been around forever (it’s literally older than I am).
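To show what that header holds, here is a small sketch that decodes the fixed 100-byte block at the start of a .shp file with struct. The byte offsets follow the published Shapefile layout; the function name and the returned dictionary keys are only for illustration.

```python
import struct


def read_shp_header(data: bytes) -> dict:
    """Decode the fixed 100-byte header at the start of a .shp file."""
    if len(data) < 100:
        raise ValueError("a .shp header is 100 bytes long")
    # File code and file length are big-endian; everything after is little-endian.
    file_code, length_words = struct.unpack(">i20xi", data[:28])
    if file_code != 9994:
        raise ValueError("not a shapefile: bad file code")
    version, shape_type = struct.unpack("<2i", data[28:36])
    xmin, ymin, xmax, ymax = struct.unpack("<4d", data[36:68])
    return {
        "file_length_bytes": length_words * 2,  # length is stored in 16-bit words
        "version": version,
        "shape_type": shape_type,  # e.g. 1 = Point, 3 = PolyLine, 5 = Polygon
        "bbox": (xmin, ymin, xmax, ymax),
    }
```

Reading the first 100 bytes of a file and passing them in gives you the geometry type and extent, but nothing that tells you when the data last changed.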
Looking at the headers, I noticed that the dBase header was much more feature rich than the .shp header, which makes sense, as it was used for databases, and database tables have a lot more metadata involved. Sure enough, right at the start of the file header was the date that the file was last updated, formatted as YYMMDD. This was promising; however, the 2 digit year gave me pause (I didn’t live through Y2K for nothing!). Looking deeper, I found that the year is not two decimal digits at all but a single byte (8 bits, so two hexadecimal digits) counted from 1900, so 00 is 1900, FF is going to be 2155 (add the Shapefile crisis to the list of computer scares), and everything in between of course.
So, I wrote some Python code to download the dBase files from the FTP server and then loop over them to check the date they were last updated. Below is the function I used to extract the date that a Shapefile, or really any dBase file, was last updated:
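import binascii
import os
from datetime import datetime


def check_shp_date(test_files: list) -> dict:
    tfd = {}
    # Replace with the actual file store path
    file_store = "path/to/file/store"
    for tf in test_files:
        dbf_path = os.path.join(file_store, tf)
        with open(dbf_path, 'rb') as f:
            # Byte 0 is the version flag; bytes 1-3 hold the year offset, month, and day
            date_hex = binascii.hexlify(f.read(4)[1:4])
        year = int(date_hex[0:2], 16) + 1900
        month = int(date_hex[2:4], 16)
        day = int(date_hex[4:], 16)
        fdate = datetime(year, month, day, 0, 0)
        print(f"{tf}: {fdate.strftime('%Y-%m-%d')}")
        strfdate = fdate.strftime(r'%Y-%m-%d')
        # Assumed completion: keep the date per file so the caller can compare runs
        tfd[tf] = strfdate
    return tfd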
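If you want to check the byte layout without a real file handy, the same decoding can be tried on a hand-built header. This is only a sketch: the helper name dbf_last_update and the sample bytes are made up for the example, and it uses struct instead of hexlify.

```python
import struct
from datetime import date


def dbf_last_update(header: bytes) -> date:
    """Decode the last-update date from the first four bytes of a dBase file."""
    # Byte 0 is the version flag; bytes 1-3 are the year offset, month, and day.
    _version, yy, mm, dd = struct.unpack("<4B", header[:4])
    return date(1900 + yy, mm, dd)


# Hypothetical header bytes: version 0x03, then 0x7A (122), 0x05, 0x11.
sample = bytes([0x03, 0x7A, 0x05, 0x11])
print(dbf_last_update(sample))  # 2022-05-17
```

The year byte 0x7A is 122 in decimal, and 1900 + 122 gives 2022, which is the whole trick behind the "2 digit" year.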
I was reminded of this old code snippet because some things never change. Five (or is it 6!) years later, I’m working on another project where I need to break down file headers to get to the gooey nugget of data that we need, and I’m going down the same sort of rabbit holes, only this time with less documentation.
Sources:
https://en.wikipedia.org/wiki/Shapefile#Overview
https://en.wikipedia.org/wiki/.dbf
Note: This blog is going to start being far more active and I have plans to really turn it into a place to discuss cool GIS problems we’ve solved and hacks that we’ve pulled off over the years.