As we work toward a future driven by analyzing and operationalizing big data, it is evident that redundancy is a stumbling block in achieving efficiency. The amount of time we spend cleaning data in order to build effective models has gone up, not down, and the business decisions of tomorrow are being slowed by the practical considerations of today. A seemingly simple task like identifying and tracking entities as they change over time is, in fact, a complex problem that requires an innovative solution.
In order to develop a solution that works, there is an immense need for artificial intelligence to start comprehending different words for the same entities.
This is exactly what I tackled during my internship at ThinkData Works. The problem, known as Entity Resolution, is to identify, match, and link a piece of text with every other text that refers to the same entity. The text can be a company name, an address, or any string in general. The process seems straightforward until you realize that I.B.M., IBM, Ibm inc. and IBM Canada are all spelled differently but refer to the same entity. Furthermore, “Ouando 1998” and “Ouando 1999” are almost identical, yet they represent different entities. If this is confusing for humans, it is even harder to give computers rules that link texts to their meanings.
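To make the difficulty concrete, here is a minimal sketch of the simplest rule one might try: normalizing names before comparing them. It is an illustration only, not the approach used in the project, and the `LEGAL_SUFFIXES` list and `normalize` helper are hypothetical names invented for this example.

```python
import re

# Hypothetical, deliberately short list of legal suffixes to ignore.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "ltd", "limited", "co"}


def normalize(name: str) -> str:
    """Lowercase a name, drop punctuation, and strip trailing legal suffixes."""
    # Delete periods so "I.B.M." collapses to "IBM"; other punctuation becomes a space.
    cleaned = re.sub(r"[^\w\s]", " ", name.replace(".", "")).lower()
    tokens = cleaned.split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


if __name__ == "__main__":
    for raw in ["I.B.M.", "IBM", "Ibm inc.", "IBM Canada", "Ouando 1998", "Ouando 1999"]:
        print(f"{raw!r} -> {normalize(raw)!r}")
```

A rule like this groups the first three IBM spellings together and keeps the two "Ouando" names apart, but it leaves "IBM Canada" unmatched, which is the kind of case that hand-written rules struggle with.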
During the design phase of the project, we realized that there is more than one way to attack the problem. Teams in different industries are constantly coming up with new algorithms and libraries to tackle it. With so many people working on the same problem, it would be a mistake to adopt a rigid methodology that limits our ability to evolve the model over time. This is why we built our Entity Resolution (ER) system to support multiple client libraries, each of which processes the resolved entities in its own way. In other words, ER was designed so that, as better Machine Learning algorithms are discovered, we can integrate new libraries that perform the same functions through the same interface the system has now. In short, we want the system to improve as we do.
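One way to picture this kind of pluggable design is a shared interface that every client library implements, plus a registry on the server side. The sketch below is an assumption about the general shape only; `Resolver`, `LowercaseResolver`, and `ResolverRegistry` are hypothetical names and do not come from the actual ER codebase.

```python
from typing import Dict, Protocol


class Resolver(Protocol):
    """Hypothetical interface that every client library would implement."""

    def resolve(self, text: str) -> str:
        ...


class LowercaseResolver:
    """Trivial stand-in for a real library: trims and lowercases the text."""

    def resolve(self, text: str) -> str:
        return text.strip().lower()


class ResolverRegistry:
    """Keeps resolvers by name so new libraries can be added without changing callers."""

    def __init__(self) -> None:
        self._resolvers: Dict[str, Resolver] = {}

    def register(self, name: str, resolver: Resolver) -> None:
        self._resolvers[name] = resolver

    def resolve_all(self, text: str) -> Dict[str, str]:
        # Run the same input through every registered library.
        return {name: r.resolve(text) for name, r in self._resolvers.items()}


if __name__ == "__main__":
    registry = ResolverRegistry()
    registry.register("lowercase", LowercaseResolver())
    print(registry.resolve_all("  IBM Canada "))
```

Because callers only depend on the `resolve` method, swapping in a better algorithm later means registering one more class rather than rewriting the pipeline.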
After integrating the different client libraries with the main server, the pipeline was clear and terrifying at the same time. We had to decide on seemingly invisible details to optimize the efficiency of the bigger-picture pipeline, where millions of data entries will be streaming through every day. This included choosing programming languages, databases, frameworks, REST or CRUD, and, most importantly, the data model and the algorithms that would process it.
One of the most difficult parts of the project was testing. If you are teaching the program rules based on company names, it makes sense not to validate the program's answers using the same program. One smart way to speed this process up is to use the semantic distance between words to filter out completely irrelevant results. However, we, as humans, still needed to look at the results and decide whether the best predictions actually matched the rules we wanted the program to learn.
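The filtering idea can be sketched as a similarity threshold applied before human review. Note the assumption: the example below uses character-level similarity from Python's standard `difflib` module as a simple stand-in, not a true semantic distance, and the `similarity` and `filter_candidates` helpers and the threshold value are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a semantic measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_candidates(query: str, candidates: List[str], threshold: float = 0.6) -> List[str]:
    """Keep only candidates close enough to the query to be worth a human look."""
    return [c for c in candidates if similarity(query, c) >= threshold]


if __name__ == "__main__":
    names = ["IBM Canada Ltd.", "I.B.M. Canada", "Acme Widgets"]
    for name in filter_candidates("IBM Canada", names):
        print(name)
```

A filter like this only narrows the list; deciding whether the surviving matches are correct is still left to a person.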
While getting these different components working together, I had fun researching different code generation techniques and implementing my own unique code generation algorithm that would keep generating unique codes every millisecond for over 100 years. Eliminating the use of text as the ID of the records by replacing it with a unique generated code for each entity is a great addition to the Namara toolkit, and to the present and future products that are built on the platform, as it allows for easier communication between different components.
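The details of that algorithm are not described here, so the following is only a sketch of one well-known way to get similar properties: a timestamp-plus-sequence scheme. All names and bit widths are assumptions made for the illustration. A 42-bit millisecond counter covers roughly 139 years from the chosen epoch, and a 12-bit sequence separates codes issued within the same millisecond.

```python
import threading
import time

TIMESTAMP_BITS = 42            # assumed width: 2**42 ms is roughly 139 years
SEQUENCE_BITS = 12             # assumed width: up to 4096 codes per millisecond
EPOCH_MS = 1_500_000_000_000   # hypothetical custom epoch (mid-2017)


class CodeGenerator:
    """Issues unique, increasing integer codes from a millisecond clock."""

    def __init__(self, clock=lambda: int(time.time() * 1000)):
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_code(self) -> int:
        with self._lock:
            now = self._clock() - EPOCH_MS
            if now <= self._last_ms:
                # Same millisecond (or the clock went backwards): bump the sequence.
                now = self._last_ms
                self._sequence += 1
                if self._sequence >= (1 << SEQUENCE_BITS):
                    # Sequence exhausted: borrow the next millisecond slot.
                    now += 1
                    self._sequence = 0
            else:
                self._sequence = 0
            self._last_ms = now
            return (now << SEQUENCE_BITS) | self._sequence


if __name__ == "__main__":
    gen = CodeGenerator()
    print([gen.next_code() for _ in range(3)])
```

Integer codes like these are compact and sortable by creation time, which is one reason they are easier to pass between components than free-text identifiers.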
Dockerizing the program and seeing it working on the local network was amazing. After four months of understanding the contours of the problem, running tests, and building models, we had a working ER program – one that was customizable and developed with the needs of the end user in mind.
Interning at ThinkData Works has been a great and enriching experience. The internship gave me not only the chance to develop a product with immediate real-world applications, but also the opportunity to work directly alongside the CTO, Brendan Stennett, and the Principal Data Scientist, Mehrsa Golestaneh. The ability to work on real problems, develop actual solutions, and do so with mentors who excel in their field was invaluable.
Amr joined ThinkData for his first summer co-op after his second year studying Computer Science & Math at the University of Toronto. Joining ThinkData for this co-op term helped Amr combine web development and machine learning, his two favourite branches of computer science.
Want to learn more about ThinkData’s co-op program? Reach out to us at firstname.lastname@example.org for more information.