ON JANUARY 18, 2017

Unity Case Study:
Aggregating Address Points

The Data

Address points are among the most important data sets a city can release on its open data portal. They provide an invaluable geospatial base layer for virtually any app.

A quick search for “Address Points” on Namara tells me that there are over 700 address point data sets available on the platform. These range in size from the Kodiak Island Borough in Alaska (7,764 rows) to New York City (942,321 rows). Some of these data sets are statewide or provincial (way to go, Utah!) and a number of them are by county, but the vast majority are by city. Since they’re all released in different places, collected by different agencies, and published in different ways, it’s hard to find any two among the 700 that look the same.

The Problem

Addresses are usually maintained at the municipal level, which means they’re usually released at the municipal level too.

This is fine if you want to build an app that doesn’t extend past your city limits, but if you’ve set your mind on something bigger, you’re going to have to get your hands dirty.

Take these three address point data sets from the Waterloo, Toronto, and Ottawa open data portals:


A quick look at each tells me that there’s a lot of variation in what’s offered in a standard address point data set. Ultimately, though, it boils down to an address and a coordinate.

The Solution: Aggregating in Unity

To merge these data sets in Unity, I need to define a schema, my ideal output.

Note the default value in “province”.

The above contains the requisite address information, the metadata from the source (so I can keep track of when each data set was last updated), and geometry, so I can plot the aggregate on a map.
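Unity defines the schema visually, but the same idea can be sketched in Python. This is only an illustration, not Unity's own format: street_number, street_name, city, province, and ward_name come from this post, while ward_number, source_updated, latitude, and longitude are hypothetical names, and "Ontario" is an assumed default (all three cities are in Ontario).

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AddressPoint:
    # Address fields named in the post.
    street_number: Optional[int] = None
    street_name: Optional[str] = None
    city: Optional[str] = None
    ward_name: Optional[str] = None
    # Hypothetical field name for the ward number.
    ward_number: Optional[int] = None
    # Assumed default value: every source in this aggregate is in Ontario.
    province: str = "Ontario"
    # Hypothetical source metadata and geometry fields.
    source_updated: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None


# A made-up row: province is filled in by the default.
row = AddressPoint(street_number=100, street_name="QUEEN ST W", city="Toronto")
print(asdict(row))
```

Any field a source cannot supply simply stays empty, which is what lets three differently shaped data sets land in one table.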

After I’ve defined the schema, I need to bring the graphs into Unity and map their attributes to the schema that I’ve already set up.

Ottawa Addresses mapped to the ideal schema

The Ottawa address data set maps pretty cleanly to the ideal schema. Aside from the missing ward information and some differences in naming conventions (addrnum on theirs, street_number on mine, etc.), the only thing I actually need to add is a static node specifying that, for this graph, the default value in city should be “Ottawa”.
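As a rough Python equivalent of that mapping (a sketch, not how Unity works internally): rename the source columns and set a static city. Only the addrnum to street_number pair is taken from this post; the "fullname" source column and the sample record are hypothetical.

```python
# Source column -> schema column. Only "addrnum" is named in the post;
# "fullname" is a hypothetical source column used for illustration.
OTTAWA_FIELD_MAP = {
    "addrnum": "street_number",
    "fullname": "street_name",
}


def map_ottawa(record):
    """Rename mapped columns and attach the static city value."""
    out = {target: record.get(source) for source, target in OTTAWA_FIELD_MAP.items()}
    out["city"] = "Ottawa"  # the "static node": same value on every row
    return out


# Made-up sample record.
mapped = map_ottawa({"addrnum": 110, "fullname": "LAURIER AVE W"})
print(mapped)
```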

The Toronto address graph is also relatively straightforward. Since I prefer my street_name and ward_name uppercased, I’ve added that transformation. Looking at the Toronto data set, I can see that a lot of the values in fcode_des are marked as unknown. Since neither of the other data sets acknowledges unknown values, I added a replace node to look for instances of the word “Unknown” and replace them with null.
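The two Toronto transformations could look something like this in Python. It is a minimal sketch that assumes the record has already been renamed to schema columns, and that an unknown value is a cell containing just the word “Unknown” (compared case-insensitively); the sample values are made up.

```python
def clean_toronto(record):
    """Uppercase the name columns and null out 'Unknown' in fcode_des."""
    out = dict(record)  # leave the input record untouched

    # Uppercase transformation on the two name columns.
    for key in ("street_name", "ward_name"):
        if out.get(key) is not None:
            out[key] = out[key].upper()

    # Replace step: a cell holding only "Unknown" becomes null (None).
    # Treating only whole-cell matches as unknown is an assumption.
    value = out.get("fcode_des")
    if value is not None and value.strip().lower() == "unknown":
        out["fcode_des"] = None
    return out


# Made-up sample record.
cleaned = clean_toronto(
    {"street_name": "Queen St W", "ward_name": "Example Ward", "fcode_des": "Unknown"}
)
print(cleaned)
```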

Finally, since the original graph’s ward_name provides ward name and number in the same column, I added two extract nodes and plugged in a regular expression that pulls out the name and number separately. It took two minutes, and it means that later I’ll be able to query the data better.
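A regular expression for this kind of split might look like the sketch below. The post does not show the actual pattern or the exact layout of ward_name, so the format assumed here, a name followed by a number in parentheses, is a guess, and the sample value is illustrative only.

```python
import re

# Assumed layout: "<ward name> (<ward number>)". The real Toronto column
# may be formatted differently; adjust the pattern to match.
WARD_PATTERN = re.compile(r"^\s*(?P<name>.+?)\s*\((?P<number>\d+)\)\s*$")


def split_ward(value):
    """Return (ward_name, ward_number); ward_number is None if not found."""
    if value is None:
        return None, None
    match = WARD_PATTERN.match(value)
    if match is None:
        # No number found: keep the whole value as the name.
        return value.strip() or None, None
    return match.group("name"), int(match.group("number"))


print(split_ward("Example Ward (7)"))
```

Keeping the number as an integer in its own column is what makes later filtering and joining on ward straightforward.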

Waterloo’s address point data set doesn’t have a column that maps cleanly to street_number, but I can see that the data in civic_addr is always recorded with a space between the street number and the street name, so I split the value on the first space and plugged the first part into street_number. To make sure I’m only pulling numbers, I added a “to_integer” node.
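In Python terms, the split and the integer check could be sketched as follows. How Unity's “to_integer” node treats values that are not whole numbers is not described here, so returning None in that case is an assumption; the sample addresses are made up.

```python
def split_civic_addr(civic_addr):
    """Split 'NUMBER STREET NAME' on the first space.

    Returns (street_number, street_name). street_number is None when the
    first token is not a whole number (assumed behaviour, see lead-in).
    """
    number_part, _, street_part = civic_addr.strip().partition(" ")
    try:
        street_number = int(number_part)
    except ValueError:
        street_number = None
    return street_number, street_part.strip() or None


print(split_civic_addr("100 REGINA ST S"))
print(split_civic_addr("100A KING ST N"))
```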

Waterloo Addresses to the ideal schema

The rest of the graph lines up well, with the exception of ward. In the original data set, Waterloo only provides the ward number as a data point. I could easily map that and leave ward_name blank, but I would prefer to have both values in my final data set.

Since Waterloo has pretty great open data, I decided to see if they released a ward map data set. Turns out they do.

Using the Namara node on Unity, I can point towards the ward data set on Namara, match on ward_no, and output the ward name (which on the data set is called ward_2014).

So by matching the values in one data set with the corresponding value in a second data set, I can add a totally new attribute to my final aggregate.
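That matching step is essentially a lookup join. A minimal Python sketch, using the ward_no and ward_2014 column names from the ward data set and entirely made-up sample rows:

```python
def add_ward_names(addresses, wards):
    """Attach ward_name to each address by matching on ward_no."""
    # Build a lookup table: ward number -> ward name (ward_2014 column).
    lookup = {ward["ward_no"]: ward["ward_2014"] for ward in wards}
    # Addresses whose ward_no has no match get ward_name = None.
    return [{**addr, "ward_name": lookup.get(addr.get("ward_no"))} for addr in addresses]


# Made-up sample rows; the real ward names are not shown in this post.
wards = [
    {"ward_no": 1, "ward_2014": "Ward One Name"},
    {"ward_no": 2, "ward_2014": "Ward Two Name"},
]
addresses = [
    {"street_number": 100, "ward_no": 2},
    {"street_number": 5, "ward_no": 9},
]

joined = add_ward_names(addresses, wards)
print(joined)
```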

Once the data sets have all been mapped to the schema, I can kick off a Unity run and then upload the data to Namara, where it will update automatically.


