Intro: This is part three of an eight-part series looking at the Elevator Explorer, a fun data interactive mostly coded between the hours of 10 PM and 2 AM during the week leading up to April Fools’ Day, 2013. I’m going to be looking at the things I learned, things I wish I could have done, and the reasoning behind my design choices. The code I’ll be referring to is in this tagged release on GitHub.
Consuming Data
Now that you’ve got some juicy government data, how about getting it into a database? Having done this quite a few times before, I’ve picked up some tips and tricks I’d like to share.
Make your models mimic the data
I’ve gone back and forth on this, and I’ve settled on making my Django models look like the source data. The reason most people do this is:
- It’s easy.
For Texas higher education data, I tried making the Django models an API to the data, but that proved to be a disaster. Writing the importers took longer, and when collaborators tried incorporating new data sources, they ended up creating their own models instead of using the existing ones; we wound up in a worse state than if we had just made the models mimic the data. The technique I’m trying now is making models that strongly resemble the source data, then creating additional models to hold denormalized and supplemental data. My co-worker, Noah, has gotten a little farther with this idea, and hopefully we’ll be able to unveil it soon. One benefit of this technique is the ability to wipe and re-import data while keeping supplemental data intact. For example, if the source data was dirty and said there was a 2,000-floor elevator in Lockhart, but a human corrects it, the curated data is preserved even between import cycles.
It should be obvious that I modeled my models.py on the CSV, since I even documented the source field names:
class Building(models.Model):
    """Some place with an address that holds `Elevator` object(s)."""
    # LICNO
    elbi = models.IntegerField(u'Building Number', unique=True)
    # BNAME1
    name_1 = models.CharField(max_length=100)
    # BNAME2
    name_2 = models.CharField(max_length=100)
    # BADDR1
    address_1 = models.CharField(max_length=100)
    # BADDR2
    address_2 = models.CharField(max_length=100)
    # BCITY
    city = models.CharField(max_length=50)
    # BZIP
    zip_code = models.CharField(max_length=5)
    # BCOUNTY
    county = models.CharField(max_length=20)
    # ONAME1
    owner = models.CharField(max_length=100)
    # CNAME1
    contact = models.CharField(max_length=100)
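To make the supplemental-data idea concrete, here’s a rough sketch of one shape the curated side could take. This model and its fields are placeholders for illustration, not something that exists in the project.

from django.db import models  # already at the top of models.py in practice

class BuildingCorrection(models.Model):
    """Hypothetical home for human-curated fixes, kept apart from imported data."""
    # Keyed off the stable source identifier instead of a foreign key, so wiping
    # and re-importing `Building` rows leaves the corrections untouched.
    elbi = models.IntegerField(unique=True)
    # Corrected values, filled in only where the source data is wrong.
    floors = models.IntegerField(null=True, blank=True)
    note = models.TextField(blank=True)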
Import all the data!
Since the models and the data are so tightly coupled, the importer is basically just a simple mapping of the CSV field names to the model field names. There’s one technique I’m playing with that I like a lot (a sketch of the full loop follows the list). It goes like this:
- Iterate over the csv rows.
- Separate your row data into fields that are unique, and fields that aren’t.
- Put the non-unique data into a `default_data` dict.
- Do a `get_or_create(id=row['id'], defaults=default_data)`.
- If the object wasn’t created, update it like `obj.__dict__.update(default_data)`.
- Except don’t do the update that way: only save the changed fields, and only if you have to. I wrote some helpers to do this:
import logging

logger = logging.getLogger(__name__)


def setfield(obj, fieldname, value):
    """Fancy setattr with debugging."""
    old = getattr(obj, fieldname)
    if str(old) != str(value):
        setattr(obj, fieldname, value)
        if not hasattr(obj, '_is_dirty'):
            obj._is_dirty = []
        obj._is_dirty.append("%s %s->%s" % (fieldname, old, value))


def update(obj, data):
    """
    Fancy way to update `obj` with `data` dict.

    Returns True if data changed and was saved.
    """
    for key, value in data.items():
        setfield(obj, key, value)
    if getattr(obj, '_is_dirty', None):
        logger.debug(obj._is_dirty)
        obj.save()
        del obj._is_dirty
        return True
- So instead of `obj.__dict__.update(default_data)`, I do `update(obj, default_data)`.
- Django 1.5 supports saving a subset of a model’s fields. It would be trivial to adapt the code to take advantage of this.
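Putting those steps together, here’s a minimal sketch of what the import loop might look like. The function name, file handling, and the handful of columns shown are just for illustration (the real importer maps every field); `update()` is the helper above, and the column names come from the comments on the model.

import csv

# Assumes the Building model above is importable,
# e.g. from elevators.models import Building

def import_buildings(path):
    """Hypothetical importer: one get_or_create/update per CSV row."""
    with open(path) as fh:
        for row in csv.DictReader(fh):
            # Everything except the unique key goes into the defaults dict.
            default_data = {
                'name_1': row['BNAME1'],
                'city': row['BCITY'],
                'zip_code': row['BZIP'],
                # ...and so on for the rest of the non-unique fields.
            }
            obj, created = Building.objects.get_or_create(
                elbi=int(row['LICNO']), defaults=default_data)
            if not created:
                # Only hits the database when something actually changed.
                update(obj, default_data)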
This technique is slow, but it’s the same speed as a typical Django data import. If speed is an issue, you should truncate your tables and do bulk inserts.
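If you go that route, a rough sketch might use Django’s bulk_create, which has been around since Django 1.4. The trade-off is that you lose the per-field change logging from `update()`, and the function name and column handling here are, again, just illustrative.

import csv

def fast_reload(path):
    """Hypothetical bulk loader: wipe the table, then insert everything in one batch."""
    Building.objects.all().delete()  # or a raw TRUNCATE for really big tables
    with open(path) as fh:
        Building.objects.bulk_create([
            Building(elbi=int(row['LICNO']), name_1=row['BNAME1'], city=row['BCITY'])
            for row in csv.DictReader(fh)
        ])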
Log all the things!
So you may have noticed the `logger.debug()` call inside the `update()` helper. Every time I do an import, I get a stream of how the data is changing from my logger:
['last_inspection 2011-11-01->2012-11-27']
['last_inspection 2012-02-29->2013-02-13', u'drive_type UNKNOWN->HYDRAULIC']
['year_installed 2008->2007']
[u'equipment_type ESCALATOR->PASSENGER', 'year_installed 2008->2007']
[u'equipment_type NEW EQUIPMENT->PASSENGER', 'last_inspection None->2013-02-21', 'floors 0->2', u'drive_type ELEVATOR DRIVE TYPE->HYDRAULIC', 'year_installed 0->1982']
And it’s just Python logging, so I can change the verbosity and where it gets sent. The best part is that it helps identify dirty data: if you import the same data twice in a row, any updates you see on the second pass mean there’s duplicate data in the source, because the same changes keep getting made.
The terrible thing about logging is that, by default, it has ugly output. I have a solution for that: in my settings.py logging configuration, I use a custom handler that does pretty colorized output. The handler is based on this gist by Vinay Sajip; I simplified it by stripping out Windows support, and I now include it in all my projects.
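The wiring in settings.py looks roughly like this. The dotted path to the handler class and the logger name are placeholders; they depend on where you drop the gist-derived handler and what your app is called.

# settings.py (sketch): route the importer's logger through the colorizing handler.
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {
            'level': 'DEBUG',
            # Placeholder dotted path to the gist-derived handler class.
            'class': 'project.logutils.ColorizingStreamHandler',
        },
    },
    'loggers': {
        'elevators': {  # placeholder app name
            'handlers': ['console'],
            'level': 'DEBUG',
        },
    },
}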
Conclusion
The elevator data set was easy to work with; there are fewer than 50,000 entries and only one CSV file. The techniques I’ve learned working with it will definitely translate to bigger projects. The create-or-update code could use a lot of work, but I’ve already used it on more complicated data importers elsewhere, with DateFields and DecimalFields and TextFields (Oh My!). And just in case you’re wondering, there is a make command for grabbing the latest data and importing it. Next time, I’m going to go over how I got the data out to do a visualization.