Fall is here, and I’m eagerly awaiting the first release of City of Toronto “open data”. I’ve been thinking about what I’d like to see offered, both for data-hungry citizens in general and, more greedily, for accelerating our progress on 5 Blocks Out. In that spirit, here’s a short wish list.
First, some general “tenet” suggestions:
- If you’re publishing it for humans, publish it for machines too. We need data in machine-readable open standard formats like JSON, XML, CSV, iCal, and so on. Not just PDF.
- Publish standard-format street addresses, at minimum, for all location-based data. Better yet, publish a latitude/longitude pair.
- Publish time-based information, such as events, in a calendar format like iCal.
- Any refreshable dataset needs unique durable IDs for every object in the data set so that machine readers can detect changes over time.
- Document the data at least enough for people to understand and use it productively. This sounds like a no-brainer, but apparently it has been a blocking issue for the use of open data in other cities.
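To make these tenets concrete, here is a rough sketch of what a machine-readable place list could look like and how little code it takes to consume. The column names (`id`, `name`, `address`, `latitude`, `longitude`) and the sample rows are my own invention for illustration, not anything the City has published.

```python
import csv
import io

# Hypothetical sample data: column names and rows are made up for illustration.
SAMPLE_CSV = """id,name,address,latitude,longitude
lib-001,Example Library,123 Main Street,43.6500,-79.3800
park-002,Example Park,245 Main Street,43.6600,-79.3900
"""


def load_places(text):
    """Parse CSV text into a dict of place records keyed by durable ID."""
    places = {}
    for row in csv.DictReader(io.StringIO(text)):
        places[row["id"]] = {
            "name": row["name"],
            "address": row["address"],
            "latitude": float(row["latitude"]),
            "longitude": float(row["longitude"]),
        }
    return places


places = load_places(SAMPLE_CSV)
for place_id, place in places.items():
    print(place_id, place["name"], place["latitude"], place["longitude"])
```

The same few lines would be a much bigger project if the source were a PDF or a web page meant only for human eyes.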
Second, here are a few specific data sets I would find useful for the work I’m doing:
1. Lists of places, including place name and location information.
A foundational part of what we do on 5 Blocks Out involves situating thousands of places on maps. We’re interested in all kinds of places, including businesses, private organizations, and government facilities such as parks, community centres, and libraries. To do this we need trustworthy data sources with at least a name and location for each place. Ideally the location is already stated as a latitude/longitude, but street addresses also suffice as they can be geocoded into latitude/longitude using various free geocoding web services.
In addition to a place name and location we try to describe each place. For example, if it’s a business or a private organization, what sort of products and services does it provide? If it’s a government facility, what services does it offer to the public? How might one contact this place via phone, email, or fax? Is there a website URL available? And so on. This descriptive info is useful, but not essential.
Lastly, we look for unique identifiers, so that we can tell places apart and identify changes in place information over time. For instance, if a business moves from 123 Main Street to 245 Main Street, how do we know it’s the same business? Some data sources include unique ID information that enables us to detect this sort of change. Without unique IDs to rely on, you need to come up with funky duplicate detection heuristics.
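Here is a minimal sketch of why durable IDs matter, assuming each published snapshot can be loaded as a dictionary keyed by ID. The IDs, field names, and the `diff_snapshots` helper are hypothetical; the point is only that change detection becomes a simple set comparison once the IDs are stable.

```python
def diff_snapshots(old, new):
    """Compare two snapshots (dicts keyed by durable ID) and report changes."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # A record counts as changed when its ID is in both snapshots but its fields differ.
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical data: the same business moves down the street between snapshots.
old_snapshot = {
    "biz-42": {"name": "Example Cafe", "address": "123 Main Street"},
    "biz-43": {"name": "Example Books", "address": "10 Side Street"},
}
new_snapshot = {
    "biz-42": {"name": "Example Cafe", "address": "245 Main Street"},
    "biz-44": {"name": "Example Bakery", "address": "12 Side Street"},
}

changes = diff_snapshots(old_snapshot, new_snapshot)
print(changes)
```

Because `biz-42` appears in both snapshots, the move shows up as a change to one business rather than as one business closing and an unrelated one opening.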
Here are examples of name-and-location information data sources the City of Toronto publishes today that we would love to have in easily machine-readable format:
- DineSafe lists restaurant names & addresses along with inspection data
2. Event data
We’re interested in publishing calendars of events happening throughout the city. We define “event” broadly to mean anything interesting that is time-based… everything from a major street festival, to a sporting event, to a city councillor doing a public consultation meeting. Jon Udell has done some great work on this with his ElmCity project; check it out if you’re publishing an event calendar already, or thinking about it. We endorse Jon’s idea of publishing in iCal and related formats. We’d like to see this go a step further and have each event item include a location, as described above.
Again, the City of Toronto already publishes some event data, but it’s not in an easily machine-readable format. Here’s the Toronto Festivals and Events Calendar, for example. (Yes, we could build a parser to consume this particular web page, but that would be missing the point.)
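For the curious, here is roughly what an event with a location might look like in iCal form, built with nothing but the Python standard library. This is a sketch under my own assumptions: the event details, the UID, and the `PRODID` string are invented, and I skip the line folding that the iCalendar spec (RFC 5545) asks for on long lines.

```python
from datetime import datetime, timezone


def escape_text(value):
    """Escape backslash, semicolon, comma, and newline in an iCalendar TEXT value."""
    return (
        value.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def format_utc(dt):
    """Format a timezone-aware datetime as an iCalendar UTC timestamp."""
    return dt.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def build_event(uid, summary, start, end, location, lat, lon, stamp):
    """Return a one-event calendar as a string (no line folding in this sketch)."""
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//Example//Open Data Sketch//EN",  # hypothetical product ID
        "BEGIN:VEVENT",
        f"UID:{uid}",
        f"DTSTAMP:{format_utc(stamp)}",
        f"DTSTART:{format_utc(start)}",
        f"DTEND:{format_utc(end)}",
        f"SUMMARY:{escape_text(summary)}",
        f"LOCATION:{escape_text(location)}",
        f"GEO:{lat:.6f};{lon:.6f}",
        "END:VEVENT",
        "END:VCALENDAR",
    ]
    return "\r\n".join(lines) + "\r\n"


# Hypothetical event data for illustration only.
ical = build_event(
    uid="event-0001@example.org",
    summary="Public consultation meeting",
    start=datetime(2009, 11, 5, 23, 0, tzinfo=timezone.utc),
    end=datetime(2009, 11, 6, 1, 0, tzinfo=timezone.utc),
    location="123 Main Street, Toronto",
    lat=43.65,
    lon=-79.38,
    stamp=datetime(2009, 10, 1, 12, 0, tzinfo=timezone.utc),
)
print(ical)
```

The `UID` plays the same role as the durable IDs in the tenets above, and `LOCATION` plus `GEO` carry the street address and the latitude/longitude pair.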
3. TTC route, stop, and vehicle location information.
This sort of data is obviously useful for building all kinds of apps that help people get around the city. Kieran and Kevin have done a great job reverse-engineering TTC route and stop timing data on MyTTC.ca. The TTC should provide an official data stream to enable apps like theirs. TTC already offers the data in PDF format as route schedules.
There’s lots more on my list, but these three buckets are near the top.
What’s on your wish list?