A Proposed Law of Data Maintenance

June 2021

Data is maintained to the level of detail shown by its most popular visualization.

You can have as many fiddly, esoteric fields in your data format as you want, but to a first approximation no one will ever populate them correctly unless something consumes that data and makes it evident when a value is obviously wrong or missing.

GTFS Example

We struggled with this a lot when we were first popularizing the GTFS format for public transit data interchange. It would have been a lot harder to get that effort off the ground had it not been tied to Google Maps.

The Google Maps UI provided the initial clear litmus test of whether GTFS data was encoded sensibly. (And because updating the content in Maps had a relatively long cycle time in those days, we also created a simple GTFS feed validator to provide a faster visualization to help feed publishers.)
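To make the "faster feedback" idea concrete, here is a minimal sketch of the kind of check such a validator might run. It is not the actual feed validator's code. The function name `check_stops` and the sample data are hypothetical, and for simplicity it assumes every row in `stops.txt` needs a `stop_lat` and `stop_lon`, which is stricter than the real spec.

```python
import csv
import io


def check_stops(stops_csv_text):
    """Return human-readable problems found in the text of a stops.txt file.

    Hypothetical helper for illustration only: it checks just that each
    stop has a latitude and longitude that parse as numbers and fall in
    the valid range.
    """
    problems = []
    reader = csv.DictReader(io.StringIO(stops_csv_text))
    # Data rows start on line 2, because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        stop_id = (row.get("stop_id") or "").strip() or "<missing stop_id>"
        for field, low, high in (
            ("stop_lat", -90.0, 90.0),
            ("stop_lon", -180.0, 180.0),
        ):
            raw = (row.get(field) or "").strip()
            if not raw:
                problems.append(
                    f"line {line_number}: stop {stop_id} has no {field}"
                )
                continue
            try:
                value = float(raw)
            except ValueError:
                problems.append(
                    f"line {line_number}: stop {stop_id} has "
                    f"non-numeric {field} {raw!r}"
                )
                continue
            if not low <= value <= high:
                problems.append(
                    f"line {line_number}: stop {stop_id} has {field} "
                    f"{value} outside [{low}, {high}]"
                )
    return problems


# Made-up sample feed: S2 is missing its latitude, S3 has a misplaced decimal.
SAMPLE = """stop_id,stop_name,stop_lat,stop_lon
S1,Main St,37.7749,-122.4194
S2,Oak Ave,,-122.4000
S3,Pine Rd,377.749,-122.4100
"""

for problem in check_stops(SAMPLE):
    print(problem)
```

Even a check this crude turns a silently wrong field into a visible complaint within seconds, without waiting for the next Maps update.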

In the bigger picture, I also came to believe that apart from a relatively small number of open knowledge nerds like me, basically no one cared about open data in its own right, but a lot of people cared about whether they could see their local bus times in their favorite maps app. So in the early days of GTFS we tried to harness the latter to incentivize the former.

This principle of data maintenance also informed the proposal process for GTFS format extensions: we tried to ensure that any proposal had been demonstrated by a real data producer and consumer (example). That is, any new field had to have some useful visualization. (This was also inspired by the IETF's practice of "rough consensus and running code".)

Related Concepts?

I'd love to hear about any similar (hopefully snappier) formulations of this idea that you've come across. Let me know via Twitter.