For the most part, external data feeds are complementary: they are used to augment carefully curated internal data sets. But this creates a new challenge: how can a company ingest new feeds of data generated outside its environment while maintaining the integrity of the reference data set created inside it? What happens if some of these sources overlap and contain duplicate, or near-duplicate, records? Then imagine something worse: these duplicates have different values for the same attribute. How should you consolidate these records?
Mapping such data points is known as an entity resolution (or entity mapping) problem, and it is not a trivial one. Organizations usually resort to internal master data management strategies to address it. However, these are quite context-specific, and hence laborious to put in place. There is also a trade-off between exact matching, which might dismiss a good number of valid records, and fuzzy matching that’s too relaxed, which results in a high number of false positives.
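The trade-off is easy to see in a minimal Python sketch. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; the company names and thresholds are hypothetical, chosen only for illustration, and are not drawn from any of the data sets discussed in this post.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def find_matches(name: str, candidates: list[str], threshold: float) -> list[str]:
    """Return the candidates whose similarity to `name` reaches `threshold`."""
    return [c for c in candidates if similarity(name, c) >= threshold]


# Hypothetical names: the second is the same company, the third is not.
CANDIDATES = ["ACME WIDGETS INC.", "Acme Widgets Incorporated", "Acme Wireless Inc."]

strict = find_matches("Acme Widgets Inc.", CANDIDATES, threshold=1.0)
relaxed = find_matches("Acme Widgets Inc.", CANDIDATES, threshold=0.5)

print(strict)   # exact matching misses the "Incorporated" variant
print(relaxed)  # a relaxed threshold also lets the unrelated company through
```

No single threshold separates the true variant from the unrelated company here, which is why name similarity alone is rarely enough.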
On the bright side, there are a number of widely recognized third-party solutions for different domains. In this post we are going to look at how we at ThinkData Works use some of Thomson Reuters’ APIs for business entity matching and data enrichment.
Let’s consider the following scenario: an organization wants to use Namara to integrate a few premium data sets with its internal CRM. They are particularly interested in:
Let’s take a quick look at the aforementioned sources.
Their internal data set is a fairly typical CRM with an internal identifier, company name, company category, location, and, of course, contact information (see the breakdown of the attributes by category below).
Canadian Company Capabilities is an Industry Canada database of local businesses that is aimed at opening up export opportunities, facilitating the search for prospective partners, and analyzing the competition. Because of this, it is quite rich in its classification of company activities, services provided, and goods manufactured, as well as in details of each company’s financial situation and its representatives’ contact information (up to 60 contacts for a single company!).
The corporate directory contains the information that Canadian corporations are required to file, including standard identifiers such as business and corporation numbers, the corporation’s name, its type, governing legislation, whether the corporation is active or not (and why), location information, and lists of activities and directors.
The TMX feed consists of information pertaining to a traded security, including its symbol, CUSIP, corresponding company names, nature of business, number of outstanding shares, dividend factor, etc.
Even though all of these data sets have at least one unique identifier for every record, these identifiers are internal to each data set and cannot be used as a key for a join. One possible way to address this would be to join on the company name. However, after a quick examination, we can clearly see that records like “Shopify Commerce Inc.”, “SHOPIFY INC” and “847871746RC0001 Inc.” would not be easily disambiguated.
This is where Thomson Reuters PermID comes into play. PermID is a unique identifier assigned to a variety of different business entities (organizations, persons, instruments, and quotes) in Thomson Reuters’ internal universe of linked data.
Open PermID comes with the following set of APIs for entity querying and retrieval:
We will focus only on the first two. The Record Matching API lets you run a set of business entities against the TR database and returns a ranked list of the best possible matches. To match an organization, you need to specify its name and, optionally, some of the following arguments:
You can see a sample response below:
So first, we’ll run all of the data sets through this API and assign PermIDs to the companies whose match score was above the cut-off point (internally set to 85%). Since all of the data sets now have a globally unique key, we will be able to join them seamlessly. Voilà!
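As a rough illustration of this step, here is a minimal Python sketch of filtering by the cut-off and joining on PermID. It assumes the matching results have already been collected into a plain dictionary; the record layout, the `local_id` field, the function names, and the placeholder PermIDs and scores are all hypothetical and do not reflect the actual API response format or our production pipeline. Treating a score exactly at the cut-off as a match is also an assumption.

```python
CUTOFF = 0.85  # the 85% cut-off mentioned above


def assign_permids(records, matches, cutoff=CUTOFF):
    """Key records by PermID, keeping only matches that reach the cut-off.

    records: list of dicts, each with a data-set-specific "local_id".
    matches: dict mapping local_id -> (permid, score), score in [0, 1].
    """
    keyed = {}
    for record in records:
        permid, score = matches.get(record["local_id"], (None, 0.0))
        if permid is not None and score >= cutoff:
            keyed[permid] = record
    return keyed


def join_on_permid(**datasets):
    """Inner-join data sets that are already keyed by PermID."""
    if not datasets:
        return {}
    shared = set.intersection(*(set(d) for d in datasets.values()))
    return {pid: {name: d[pid] for name, d in datasets.items()} for pid in sorted(shared)}


# Hypothetical records and match results (placeholder PermIDs and scores).
crm = [{"local_id": "c1", "name": "Example Co"}, {"local_id": "c2", "name": "Other Co"}]
tmx = [{"local_id": "t9", "symbol": "EXM"}]
crm_matches = {"c1": ("permid-1", 0.97), "c2": ("permid-2", 0.60)}
tmx_matches = {"t9": ("permid-1", 0.91)}

joined = join_on_permid(
    crm=assign_permids(crm, crm_matches),
    tmx=assign_permids(tmx, tmx_matches),
)
print(joined)
```

In this sketch, two records from the same data set that resolve to the same PermID overwrite each other (the last one wins); a real pipeline would need an explicit rule for that case.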
But that’s not all! Thomson Reuters Entity Search API allows you to access descriptive fields for 3,460,500 organizations, 240,000 equity instruments, and 1,170,000 equity quotes, using their PermID. You can see the attributes typically available for organizations below.
So the last step in our enrichment process will be to augment these joined data sets with metadata coming from the Entity Search API. We will query the metadata for all of the unique PermIDs attached to the original CRM via entity lookup and append it to the resulting data set.
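The enrichment step could be sketched in Python as follows. The `lookup` argument is a hypothetical stand-in for whatever function wraps the entity lookup call; the `permid` field, the `hq_country` attribute, and the sample rows are likewise assumptions made for illustration, not the actual attribute names returned by the API.

```python
def enrich(rows, lookup):
    """Append metadata to each row that carries a PermID.

    rows:   list of dicts; a row may have a "permid" key.
    lookup: callable taking a PermID and returning a dict of metadata
            (or None when nothing is found).
    """
    unique_ids = {row["permid"] for row in rows if row.get("permid")}
    # Query each distinct PermID once, however many rows share it.
    metadata = {pid: lookup(pid) or {} for pid in unique_ids}
    return [{**row, **metadata.get(row.get("permid"), {})} for row in rows]


# Hypothetical stand-in for the entity lookup; it records each call it receives.
calls = []


def fake_lookup(permid):
    calls.append(permid)
    return {"permid-1": {"hq_country": "Canada"}}.get(permid)


rows = [
    {"permid": "permid-1", "name": "Example Co"},
    {"permid": "permid-1", "name": "Example Co (branch)"},
    {"permid": None, "name": "Unmatched Co"},
]
enriched = enrich(rows, fake_lookup)
print(enriched)
```

Deduplicating the PermIDs before the lookup keeps the number of requests down. Note that in this sketch a metadata field overrides an existing row field with the same name, and rows without a PermID pass through unchanged.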
This is how, starting from a basic business directory with some 20 attributes, we developed a holistic representation of both private and public businesses with almost 150 different attributes. Click here to explore the final output data set on Namara.
Thomson Reuters is one of our exclusive data partners. Follow us on Twitter to stay up to date with the amazing work they are doing with us. Don’t hesitate to contact us if you have any questions. And if you liked this post, make sure to check out our case study on using Unity for joining and enriching geospatial data sets.