Police departments are starting to launch open data portals (Toronto, Fort Lauderdale, South Bend, and most recently, Minneapolis, to name a few…). This is excellent news. Not only does it open a window to a historically opaque part of the criminal justice system, it also has the potential to provide both the government and an engaged citizenry with the information they need to start solving real problems facing real people.
Take the traffic accident report published on Québec’s provincial open data portal. The report records the day each accident occurred, what kind of road it was on, who was involved (cyclists, pedestrians, etc.), and how many people were injured and how severely. What it doesn’t provide is the coordinates at which each accident happened.
This is one of those situations where data visualization is important. I can easily query the data to see how many of the accidents happened on days when it was raining, or how many accidents happened between midnight and 4:00am, but until I can plot the data on a map, I’m only seeing part of the picture.
Anyone with a bit of time can geocode physical addresses, but the address in the accident report isn’t the usual combination of route, street name, unit, city, and postal code. In fact, the address information doesn’t make much sense at first glance.
There’s a field for the civic address of the closest landmark, but it’s 62% empty. Since these are traffic accidents, there’s always a road name, but the location is frequently recorded as (and I’m translating here) “close to the Burger King on Demasse st”, or something similar. Next to the road name, there’s a column named tp_reprr_accdn which contains the values 0, 1, and 2.
None of this helps me out if I want to plot these accidents on a map.
The Solution: Refinement and Geocoding with Unity
Metadata is critical. The SAAQ (Société de l’assurance automobile du Québec) maintains very good documentation to go along with the accident report, but unless I want to keep a PDF open to explain what I’m looking at while I explore the data, I need a way to fold the information from the metadata into the data set itself.
From the metadata (and with the help of a few years of French immersion) I can tell that the column cd_etat_surfc describes the condition of the road surface. The values in this column are the numbers 11–20 and 99. Using the extract node in Unity, I can look for those numbers in the original data set and change them to their corresponding descriptions in the output data (“14” becomes “Sand/Gravel”, etc.). Since these values are used internally at SAAQ, I’m confident that they’ll stay the same from one update to the next, which means I won’t have to re-validate the descriptions every time the data set is refreshed.
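For readers without access to Unity, the same recoding idea can be sketched in plain Python. This is only an illustration of the technique, not how the extract node is implemented. The function name `recode_surface` and the sample rows are made up, and the lookup table contains only the one code documented in this post; the rest would have to come from the SAAQ metadata.

```python
# Hypothetical lookup table. Only code "14" is documented in this post;
# the remaining codes (11-20 and 99) would be filled in from the SAAQ metadata.
SURFACE_LABELS = {"14": "Sand/Gravel"}


def recode_surface(rows, column="cd_etat_surfc", labels=SURFACE_LABELS):
    """Return copies of the rows with the surface code replaced by its label.

    Codes missing from the lookup table are kept as they are, so nothing
    is silently dropped if the table is incomplete.
    """
    recoded = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        code = str(new_row.get(column, "")).strip()
        new_row[column] = labels.get(code, code)
        recoded.append(new_row)
    return recoded


# Made-up rows, for illustration only.
sample = [{"cd_etat_surfc": "14"}, {"cd_etat_surfc": "99"}]
print(recode_surface(sample))
```

Leaving unknown codes untouched is a deliberate choice here: an incomplete table produces output that is visibly unfinished rather than wrong.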
Looking through the metadata, I also learn that the column named tp_reprr_accdn describes the type of reference point on the road: “0” means “not specified”, “1” means “intersection”, and “2” means “other”.
This is important information for me to have. I didn’t want to geocode based on a street name alone, because I couldn’t be sure where the pin drop would go. But knowing which accidents took place at intersections means that I can geocode based on <route name 1> & <route name 2>, <municipality>, <province>.
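As a rough sketch of that rule outside of Unity, the query string could be assembled like this in Python. The column names rue_accdn, accdn_pres_de, and mrc are the ones used in this data set; the function name `intersection_query` and the sample row are invented for illustration, and skipping incomplete rows is my own assumption rather than documented behaviour.

```python
INTERSECTION = "1"  # tp_reprr_accdn value meaning "intersection"


def intersection_query(row, province="Quebec"):
    """Build '<route 1> & <route 2>, <municipality>, <province>'.

    Returns None when the accident is not flagged as an intersection or
    when any part of the address is missing.
    """
    if str(row.get("tp_reprr_accdn", "")).strip() != INTERSECTION:
        return None
    route_1 = (row.get("rue_accdn") or "").strip()
    route_2 = (row.get("accdn_pres_de") or "").strip()
    municipality = (row.get("mrc") or "").strip()
    if not (route_1 and route_2 and municipality):
        return None  # incomplete rows are skipped rather than guessed at
    return f"{route_1} & {route_2}, {municipality}, {province}"


# Made-up row, for illustration only.
row = {
    "tp_reprr_accdn": "1",
    "rue_accdn": "Rue A",
    "accdn_pres_de": "Rue B",
    "mrc": "Exampleville",
}
print(intersection_query(row))
```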
To do this, I have to perform a number of different transformations in Unity.
Here’s the simple breakdown of the transformations I’ve added:
- The if_then_else node looks for the value “1”, signifying an intersection, in the reference point type (tp_reprr_accdn) column.
- When it finds a “1” it pulls the values from route 1 (rue_accdn) and route 2 (accdn_pres_de). I’ve concatenated these two values and separated them by an ampersand, which is how they’ll be geocoded.
- These, in turn, get concatenated with the municipality name (mrc) and a static node with the value “Quebec”, because I want to make sure I limit my results to the right place. This gives me the string <route name 1> & <route name 2>, <municipality>, <province>.
- That output is then plugged into the geocoding node.
- Since I already know that not all of the data is ideally structured (recall the “close to the Burger King on Demasse st”), I’ve toggled the geocoding node to disallow partial matches. This limits the number of successful matches I get, but it ensures that the matches in the output data are accurate.
- The latitude and longitude are pulled from the geocoding node into a to_geo_point node, which formats them as WKT, our preferred format.
- The to_geo_point node output connects to the geometry column.
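The last few steps above (geocode, reject partial matches, format as WKT) can be approximated in Python as follows. This is a sketch under stated assumptions, not Unity’s implementation: the `geocoder` argument is any callable you supply, the result shape with "lat", "lon", and "partial_match" keys is a contract I made up for the example, and `fake_geocoder` returns invented coordinates rather than calling a real service.

```python
def to_wkt_point(lat, lon):
    """Format a coordinate pair as a WKT point (x = longitude comes first)."""
    return f"POINT ({lon} {lat})"


def geocode_to_geometry(query, geocoder, allow_partial=False):
    """Geocode a query string and return a WKT point, or None.

    `geocoder` is assumed to return a dict with "lat", "lon" and
    "partial_match" keys, or None when nothing is found.
    """
    if not query:
        return None
    result = geocoder(query)
    if result is None:
        return None
    if result.get("partial_match") and not allow_partial:
        return None  # fewer matches, but the ones that remain are trustworthy
    return to_wkt_point(result["lat"], result["lon"])


def fake_geocoder(query):
    """Stand-in geocoder with invented coordinates, for illustration only."""
    known = {
        "Rue A & Rue B, Exampleville, Quebec": {
            "lat": 46.5, "lon": -72.25, "partial_match": False,
        },
        "Rue C, Exampleville, Quebec": {
            "lat": 46.0, "lon": -72.0, "partial_match": True,
        },
    }
    return known.get(query)


print(geocode_to_geometry("Rue A & Rue B, Exampleville, Quebec", fake_geocoder))
```

Passing the geocoder in as an argument keeps the sketch independent of any particular geocoding service, and makes the partial-match rule easy to test with canned results.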
After I’ve made these and a few other transformations, I get a data set that looks like this.
By asking a bit more from the data, we can pull out points that help us understand accident trends. The success rate for this kind of geocoding is currently limited by the quality of the data, but it’s only once the data is presented this way that the value of plotting it on a map becomes clear.
A note regarding Unity screenshots: Unity is currently an internal feature and the GUI is in prototype. There are UX and UI components that are in the design phase and cannot be displayed.
Click here to view a sample of the refined data set. For clarity, I’ve limited the columns and translated all French in the data set to English. For more Unity case studies, or to try it out for yourself, please don’t hesitate to get in touch.