ON APRIL 9, 2018

Decoding the Cambridge Analytica Shell Game

Mark Turnbull, managing director of Cambridge Analytica (CA), was well aware of what the public reaction would be if the firm's connection to its clients was discovered.

In the wake of the Cambridge Analytica scandal, in which the consulting firm siphoned personal data from as many as 87 million Facebook users, important questions are being raised regarding privacy and digital ethics. Another side to the Cambridge Analytica story, however, is how the firm operated through shell companies, decentralizing its data gathering and spinning a web of corporate anonymity that was expressly designed to protect the firm itself from being exposed.
Signing contracts under a different company name may have seemed like a great way to avoid having anything traced back to Cambridge Analytica, but it didn’t work.
Thanks to the focused media attention on the firm after whistleblower Christopher Wylie exposed the shady data harvesting, it didn’t take long before journalists such as Wendy Siegelman were able to identify companies directly and indirectly connected to Cambridge Analytica. Siegelman even created a graphical representation of all related companies and individuals by pulling data from two open databases: UK Companies House and OpenCorporates.

Image originally published in Medium by Wendy Siegelman
Siegelman’s method was journalistic and intuitive. She determined that Emerdata Ltd. was related to Cambridge Analytica, for example, because two key individuals (Alexander Nix and Julian Wheatland) belonged to both organizations, and because it had a similar address to SCL Insight Ltd., which is operated by Cambridge Analytica’s parent company, SCL Group Ltd.

This is a pretty laborious process, though. According to a BBC report, the International Consortium of Investigative Journalists (ICIJ), which has a member list of nearly 100 media partners in 67 countries, has been investigating more than 785,000 offshore companies that are implicated in the Panama Papers, the Offshore Leaks, the Bahamas Leaks, and the Paradise Papers investigations. It has taken the ICIJ five years to tease out the connections between 290,000 of those companies (a little over one third of the total) and records in other databases.

At ThinkData Works, we have been working on a similar problem, and the result is a smart record linkage engine (RLE). Typically, when a lending institution such as a bank is trying to paint a full picture of a company in order to gauge its credibility, its methods resemble a journalist’s: analysts manually cross-reference different databases (import/export, procurement, public company data, and so on) in order to piece together matches based on company names, registration addresses, and stakeholder and CEO identities. As you can imagine, this process can be time consuming. Consider the concessions you would have to make for different languages and formats, for starters, or the time inevitably spent accounting for typos, abbreviations, and other variations in the data.

Bear in mind, these are the problems that occur when a company has nothing at all to hide. When organizations such as Cambridge Analytica and its sibling companies are actively trying to stop these connections from being made, the process can become much more complex.
ThinkData’s RLE was created with these difficulties in mind. We take advantage of AI and big data tools so that millions of records in different databases can be cross-compared and records belonging to the same entities can be efficiently identified. Using Cambridge Analytica as an example, the following list of closely related companies was found in the UK Companies House database.
The RLE is configured with Natural Language Processing (NLP) tools which enable it to “know” that England and the United Kingdom are often interchangeable in address fields. It can also recognize that "LEVEL 2,1 WESTFERRY CIRCUS", "PKF LITTLEJOHN 1 WESTFERRY CIRCUS", and "℅ PKF LITTLEJOHN 2ND FLOOR, 1 WESTFERRY CIRCUS" probably refer to the same address.
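To make the idea concrete, here is a minimal sketch of address normalization and token-overlap scoring. It is not the RLE's actual NLP pipeline: the `SYNONYMS` table, the `normalize` and `address_similarity` helpers, and the choice of an overlap coefficient are all assumptions made for illustration.

```python
import re

# Hypothetical synonym table (an assumption for this sketch, not the RLE's
# real vocabulary): maps variant tokens onto one canonical form.
SYNONYMS = {
    "ENGLAND": "UNITED KINGDOM",
    "UK": "UNITED KINGDOM",
    "2ND": "2",
    "LEVEL": "FLOOR",
}


def normalize(address):
    """Upper-case, strip punctuation, and canonicalize known variants."""
    text = address.upper().replace("℅", " ")   # drop the care-of symbol
    text = re.sub(r"[^A-Z0-9 ]", " ", text)    # punctuation becomes spaces
    canonical = " ".join(SYNONYMS.get(tok, tok) for tok in text.split())
    return canonical.split()                   # re-split multi-word synonyms


def address_similarity(a, b):
    """Overlap coefficient: shared tokens divided by the smaller token set."""
    tokens_a, tokens_b = set(normalize(a)), set(normalize(b))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))


# The three address variants quoted in the text.
ADDR_A = "LEVEL 2,1 WESTFERRY CIRCUS"
ADDR_B = "PKF LITTLEJOHN 1 WESTFERRY CIRCUS"
ADDR_C = "℅ PKF LITTLEJOHN 2ND FLOOR, 1 WESTFERRY CIRCUS"

if __name__ == "__main__":
    for left, right in [(ADDR_A, ADDR_B), (ADDR_A, ADDR_C), (ADDR_B, ADDR_C)]:
        print(f"{address_similarity(left, right):.2f}  {left!r} vs {right!r}")
```

The overlap coefficient is used here because the shorter variants are roughly subsets of the longest one; a production system would more likely rely on a proper address parser and a much larger vocabulary.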

Based on similarity scores in individual fields (in this case company name, address, town, county, country, and CEO), RLE’s AI component kicks in and makes the decision that these companies are closely related. Since the RLE has been built to scale gracefully by leveraging big data tools, the whole process to find companies related to Cambridge Analytica takes less than a second. This is very good news for anyone interested in connecting entities across numerous, often huge, databases.
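As a rough illustration of how per-field similarity scores might be combined into a single decision, the sketch below uses a fixed weighted sum and a cut-off. This is a deliberately simple stand-in, not the RLE's AI component: the `WEIGHTS`, the `THRESHOLD`, the helper names, and the sample records are all hypothetical, and the string comparison uses Python's standard `difflib`.

```python
from difflib import SequenceMatcher

# Hypothetical field weights and decision threshold (assumptions for this
# sketch). The fields mirror those named in the text.
WEIGHTS = {
    "name": 0.20,
    "address": 0.30,
    "town": 0.10,
    "county": 0.05,
    "country": 0.05,
    "ceo": 0.30,
}
THRESHOLD = 0.7


def field_similarity(a, b):
    """Similarity in [0, 1] between two field values; missing values score 0."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()


def link_score(rec_a, rec_b):
    """Weighted sum of the per-field similarity scores."""
    return sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in WEIGHTS.items()
    )


def closely_related(rec_a, rec_b):
    """Decide whether two records look like closely related entities."""
    return link_score(rec_a, rec_b) >= THRESHOLD


# Made-up placeholder records, not real Companies House data.
REC_1 = {"name": "EXAMPLE DATA LTD", "address": "1 SAMPLE SQUARE",
         "town": "LONDON", "county": "GREATER LONDON",
         "country": "UNITED KINGDOM", "ceo": "JANE DOE"}
REC_2 = dict(REC_1, name="EXAMPLE INSIGHT LTD")   # same details, new name
REC_3 = {"country": "UNITED KINGDOM"}             # shares only the country

if __name__ == "__main__":
    print(closely_related(REC_1, REC_2), round(link_score(REC_1, REC_2), 2))
    print(closely_related(REC_1, REC_3), round(link_score(REC_1, REC_3), 2))
```

In this toy setup a renamed company at the same address with the same CEO still clears the threshold, while two records that share only a country do not. A learned model would replace the hand-picked weights and threshold.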

None of this is intended to replace a journalist’s intuition, of course. The RLE is built to work with structured databases, which will give human experts time to dig into the individuals and companies behind the connections, or to digest unstructured data such as blogs, Twitter, and insider tips. Compared to Siegelman’s graph, the above list is incomplete, but it effectively automates the repetitive, time-consuming, and (one can only imagine) aggravating needle-in-a-haystack process, which gives experts looking into the data the time they need to focus on more valuable inputs.

Spreading money around isn’t a new phenomenon. There are many reasons (not all of them nefarious) why a company or individual might choose to distribute their assets across many holdings. There’s a problem, however, when the shell game played by corporations is designed to obfuscate in order to avoid detection. Whether it’s financial institutions trying to improve their anti-money-laundering efforts or reporters uncovering the relationships that drive investigative journalism, staying one step ahead of the curve is necessary. Data-driven AI tech can help make the connections that cut through the obscurity and paint a full picture of who’s really involved in a corporation, and why they might not want to be found.

