Several months back Toronto mayor David Miller announced the city would embark on an “Open Data” initiative, with first steps to show by this fall. Well, fall fast approaches, and the city’s Open Data website is still a blank slate. While we don’t know yet what Open Data will be, lots of people have notions of what it ought to be. Here’s mine:
Start with a Sandbox and some Dogfood
It’s been a difficult summer for the city. The strike took a lot of resources offline, including people who would otherwise have been helping formulate and deliver the first batch of data. So while fall is probably still doable, plans for a first release must surely have been scaled back. The first go-round will have to focus on low-hanging fruit: data that happens to be readily available, privacy-clean, politically non-threatening, and already in machine-readable format.
Let’s also recognize that the first release, like any version 1, wants to be a pilot / proof of concept, not a polished product. I imagine the city will publish some sample data feeds (the “dogfood”), encourage people to build a few apps that consume the data (“dogfooding”), and then evolve a repeatable process around that while putting together a viable longer-term plan. That would be just ducky.
Beyond the first release, getting the city into the business of publishing and consuming data is a huge challenge. Technology is the least of the difficulties. It’s a huge prioritization problem, for one… the city needs to develop clear tenets, guidelines, and processes for deciding which requests to bubble to the top of the stack. And there are many “soft” barriers to overcome, including union fears (must automation lead to job losses?), privacy concerns, liability risks, and — probably most difficult — the turf struggles that will surely arise from trying to pry data out of people’s hands.
But this is not a blog post about Fear, Uncertainty, and Doubt. It’s about a happy world where the city overcomes its inertia, rises to the challenge, and Does Great Things. So let’s consider the question of what the data itself should be.
A Framework for Data Selection
If I were running the show, my framework for data selection would look something like this:
- Solve real people’s problems: focus on data that real people are requesting in order to solve real world problems. Ignore data that’s “looking for a problem to solve”, even if that data happens to be convenient to obtain and process. In other words, stay customer- and solution-driven, not expediency- and politically-driven.
- Satisfy the customers: those “real people” we want to satisfy break down into three groups: citizens, non-government organizations (both for-profit and not-for-profit), and government itself.
- Produce net benefit to society: the data’s benefits to society at large should outweigh the data processing costs. Benefit will be hard to measure in some cases. In other cases the benefit will be crystal clear in terms of dollars, e.g. money saved, time saved. Either way I say let’s measure, and get better at measuring, so that we can set goals and quantify our progress over time.
- Keep it clean: obviously the data must be OK to release from a privacy perspective, and it shouldn’t expose the city to unreasonable legal risk. That said, I would be perfectly happy with a license that exempted the city from all liability due to things like errors in the data, and I bet a lot of other people and companies would too. After all, we’ve signed a bunch of other licenses just like that for most other online data services we consume, including mission critical services like email and online document storage.
- Keep it fresh: the data can (and indeed, must) be refreshed periodically so that it doesn’t go stale. That implies an up-front commitment to continual publishing. Open Data isn’t a one-shot deal.
Open Data = Data In + Data Out
Almost all the examples I’ve read about open data initiatives are “Data Out”, i.e. cities publishing municipal data such as budget and contract details, service records for road repair, traffic flow, and so on, for the general public to consume. This is useful and necessary stuff, but there’s another equally important category I’ll refer to as “Data In”.
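To make "Data Out" concrete, here's a minimal sketch of what consuming such a feed might look like once the data is machine-readable. Everything here is hypothetical — the CSV columns, the ward names, and the feed itself are invented for illustration; the city hasn't published anything like this yet.

```python
import csv
import io
from collections import Counter

# Hypothetical "Data Out" feed: road-repair service records as CSV.
# In practice this would be fetched from a city-published URL.
feed = """ward,street,repair_type,completed_date
Ward 20,Queen St W,pothole,2009-08-14
Ward 20,Spadina Ave,resurfacing,2009-08-20
Ward 27,Yonge St,pothole,2009-09-02
"""

# Parse the feed and tally repairs per ward -- the kind of
# lightweight analysis a citizen app could do in a few lines.
records = list(csv.DictReader(io.StringIO(feed)))
repairs_per_ward = Counter(r["ward"] for r in records)

for ward, count in repairs_per_ward.most_common():
    print(ward, count)
```

The point isn't the ten lines of code; it's that a machine-readable feed makes this trivial, where a PDF makes it a data-entry job.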
Data In is about society at large publishing data which the city consumes. For example, citizens noting the location of major potholes and failed streetlights; community service organizations reporting on how many people they are reaching, and how effectively (an idea Jane Zhang at TechSoup Canada is passionate about); schools reporting student attendance numbers, and so on. There's a massive amount of "scouting" that can be done by citizens on behalf of the city, in effect crowd-sourcing information to help the city operate more efficiently and decide where to focus its limited resources. Citizens have every incentive to do it: they want their tax dollars spent efficiently.
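"Data In" could be as simple as citizens submitting structured reports that the city ingests. The sketch below assumes a hypothetical JSON schema for a pothole report — every field name is invented — and shows the kind of basic validation an intake service would need before crowd-sourced data feeds into the city's work queues.

```python
import json

# Hypothetical citizen-submitted "Data In" record: a pothole report.
# The schema (field names, types, severity scale) is invented for illustration.
report_json = '{"type": "pothole", "lat": 43.6532, "lon": -79.3832, "severity": 3}'

REQUIRED_FIELDS = {"type", "lat", "lon", "severity"}

def validate_report(raw):
    """Parse a submitted report and reject malformed ones."""
    report = json.loads(raw)
    missing = REQUIRED_FIELDS - report.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not 1 <= report["severity"] <= 5:
        raise ValueError("severity must be between 1 and 5")
    return report

report = validate_report(report_json)
print(report["type"], report["severity"])
```

A real intake pipeline would also need de-duplication (ten people reporting the same pothole) and spam filtering, but the core idea stands: a published schema plus a validation step turns citizen "scouting" into data the city can actually act on.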
“Data In” is the reason I list government itself as one of the key Open Data customers. As part of the planning process the city should be asking each of its departments for their own wish lists of data that society at large could provide in order to help them do their jobs better. Furthermore, those departments should be dogfooding the exact same data services that we the public consume. This process — internal dogfooding, and being your own customer — has a powerful built-in bias towards self-correction and accountability. You can bet the quality of city-published data feeds will be high, for instance, if internal city processes depend on those same feeds.
More to come…
I’ll write more about Open Data in the coming months. I’m selfishly hoping the city will publish some data we find useful for 5 Blocks Out, if only to save us from transcribing information trapped in PDFs (what’s with disabling copy-paste in PDFs?), and from hearing “Sorry, you’ll have to file an Access to Information Request Form for that” when we call our friends at City Hall. We can do better. Much better. Onwards!