This was on our mind when we first discussed developing a strategy for removing and merging duplicated data points in a data set. Building a tool that could uncover nuggets of value in mountains of data sounded incredibly valuable. It also sounded incredibly difficult. We like things that are difficult. We thought we’d call it Yukon.
To identify duplicate rows, we first have to specify which columns Yukon will look at. In the sample below, we’ve selected columns 1 and 5 (FirstName and Company) as the composite key, or what defines a unique record. The entire row will be used if no columns are specified.
Let’s say we take this example data set:
By specifying columns 1 and 5, Yukon will produce this deduplicated and merged dataset:
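In code, that behaviour might look something like the sketch below; the `dedupe` helper, the `contacts.csv` file name, and the merge rule (keep the first non-empty value in each column) are illustrative assumptions, not Yukon's actual implementation:

```python
import csv

def dedupe(rows, key_columns=None):
    """Deduplicate rows on a composite key built from the given column indices.

    If key_columns is None, the entire row acts as the key. Rows that share a
    key are merged by keeping the first non-empty value in each column.
    """
    merged = {}   # composite key -> merged row
    order = []    # keys in first-seen order
    for row in rows:
        key = tuple(row[i] for i in key_columns) if key_columns else tuple(row)
        if key not in merged:
            merged[key] = list(row)
            order.append(key)
        else:
            target = merged[key]
            for i, cell in enumerate(row):
                if not target[i] and cell:
                    target[i] = cell
    return [merged[k] for k in order]

with open("contacts.csv", newline="") as f:
    rows = list(csv.reader(f))

# Columns 1 and 5 (FirstName and Company) as the composite key, written
# here as the 0-based indices 0 and 4.
deduped = dedupe(rows, key_columns=[0, 4])
```

Everything here sits in a single in-memory dictionary, which is perfectly fine for a sample this small.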
In theory, deduplication on small data sets like these sounds relatively simple. However, the data sets on Namara that need to undergo this deduplication and merging process can be greater than 100GB. To put that size into perspective, a 75-page essay with 23,000 words is only about 1MB, 100,000 times smaller.
Most computers will have difficulty opening and operating on even 1GB files in programs such as Microsoft Word or Excel because of the heavy demands on computer memory (RAM). As you can imagine, a file over 100GB simply cannot be loaded into a standard computer's memory without causing it to crash.
Though RAM presented us with limitations, we were able to solve this problem by utilizing disk memory. Disk memory is abundant on modern computers and is relatively cheap to upgrade in comparison to RAM.
Our first approach resembled how a Bloom filter is implemented: hash each row's composite key and record the hash in a file on disk, treating any hash collision as a potential duplicate. This meant having to check whether the rows that caused a collision were genuinely unique or true duplicates, which would determine whether or not to merge them into a single row.
We ran into issues with the portion of the hash we were using as the index (7 bits of a 32-bit hash). The size of the hash file grows exponentially with the number of hash bits used, but using a smaller portion of the hash increases the number of collisions. This meant leaving a lot of dirt in with the gold.
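To make that trade-off concrete, here is a rough sketch of the shape of this first approach; the crc32 hash, the in-memory bytearray standing in for the on-disk hash file, and the `probe` helper are illustrative assumptions, not Yukon's actual code:

```python
import zlib

HASH_BITS = 7                   # portion of the 32-bit hash used as an index
TABLE_SIZE = 1 << HASH_BITS     # each extra bit doubles the table: exponential growth

def key_hash(key: bytes) -> int:
    # crc32 stands in for whichever 32-bit hash is actually used
    return zlib.crc32(key) & 0xFFFFFFFF

# The presence table would be persisted to a file on disk; a bytearray stands in here.
seen = bytearray(TABLE_SIZE // 8)

def probe(key: bytes) -> bool:
    """Mark the key as seen; return True if it *may* have been seen before."""
    index = key_hash(key) & (TABLE_SIZE - 1)   # keep only the low HASH_BITS bits
    byte, bit = divmod(index, 8)
    hit = bool(seen[byte] & (1 << bit))
    seen[byte] |= 1 << bit
    return hit

# Like a Bloom filter, a hit from probe() is only a candidate duplicate:
# the colliding rows still have to be compared field by field to tell true
# duplicates from distinct rows that happen to share the same 7 hash bits.
```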
We wanted to see what would happen if we got rid of the hash file and relied only on memory (RAM). We started by finding the longest row in the data set and padding every row to match that length. This allowed us to seek directly to any row in the file using its index multiplied by the constant row length, which made rewriting and merging fast. It also cut down hashing collisions, since we could now use all 32 bits of the hash.
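A minimal sketch of that fixed-width layout, assuming CSV-style rows, space padding, and the hypothetical helper names below (none of this is Yukon's exact format):

```python
def write_fixed_width(rows, path):
    """Pad every row to the length of the longest one so that any row can be
    reached with a single seek to index * (row_length + 1)."""
    encoded = [",".join(row).encode("utf-8") for row in rows]
    row_length = max(len(r) for r in encoded)   # pass over the data to find the longest row
    with open(path, "wb") as f:
        for r in encoded:                        # write padded, fixed-width copies
            f.write(r.ljust(row_length, b" ") + b"\n")
    return row_length

def read_row(path, row_length, index):
    """Jump straight to row `index` without scanning the file."""
    with open(path, "rb") as f:
        f.seek(index * (row_length + 1))         # +1 accounts for the newline
        return f.read(row_length).rstrip(b" ").decode("utf-8")
```

With a constant row length, reading or overwriting a specific row is a single seek rather than a scan, which is what made the quick rewrites and merges possible.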
This approach had us reading the data set twice, once to find the longest row and once to write the padded copy, before doing any actual deduplication. The frequent file seeking also took a toll on runtime.
We wanted to utilize both RAM and disk memory to gain speed without sacrificing storage. Keeping track of the duplicated rows themselves (and their offsets) in RAM allowed us to access the duplicate rows directly and merge them, instead of seeking through disk memory.
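One way this hybrid might be put together is sketched below. The two-pass structure, the crc32 hash, and the merge rule (keep the first non-empty value in each column) are assumptions for illustration rather than Yukon's actual pipeline:

```python
import csv
import zlib

def key_hash(row, key_columns):
    """Full 32-bit hash of the composite key (crc32 stands in for the real hash)."""
    return zlib.crc32("\x1f".join(row[i] for i in key_columns).encode("utf-8"))

def parse_line(line):
    return next(csv.reader([line]))

def dedupe(in_path, out_path, key_columns):
    first_offset = {}   # key hash -> byte offset of the first row with that key
    merged = {}         # key hash -> merged row, held in RAM (duplicate groups only)

    # Pass 1: find duplicate groups and merge them in RAM.
    with open(in_path, "r", newline="") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            row = parse_line(line)
            h = key_hash(row, key_columns)
            if h not in first_offset:
                first_offset[h] = offset
                continue
            if h not in merged:
                # The stored offset lets us jump straight to the first occurrence.
                resume = f.tell()
                f.seek(first_offset[h])
                merged[h] = parse_line(f.readline())
                f.seek(resume)
            target = merged[h]
            for i, cell in enumerate(row):   # merge rule: keep the first non-empty value
                if not target[i] and cell:
                    target[i] = cell

    # Pass 2: write each key once, substituting the merged row where one exists.
    written = set()
    with open(in_path, "r", newline="") as f, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for line in f:
            row = parse_line(line)
            h = key_hash(row, key_columns)
            if h in written:
                continue
            written.add(h)
            writer.writerow(merged.get(h, row))
```

Only the duplicate groups and one hash-to-offset entry per row live in RAM, which is why the memory footprint tracks the number of rows rather than the raw size of the file.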
This process gives us the desired data set with all rows deduplicated and merged. No more dirt in the gold.
This approach gave us no collisions and had reasonable memory usage. We were also satisfied with the speed: about 35 seconds on a 5 million row data set, running on our development machines.
Benchmarks are given below. In the charts, the red line is Approach 1 and the blue line is Approach 3.
Algorithm growth, in this case, means how much slower the algorithm gets and how much more memory it uses as the file size increases. When one algorithm grows more slowly than another in terms of speed, the ratio of its processing time to file size is smaller than the slower algorithm's. The smaller that ratio, the better the algorithm scales to large files. For example, if doubling the file size makes one algorithm take twice as long but makes another take three times as long, the first scales better.
Given the benchmarks, Approach 3 is the clear winner. Though its memory usage is greater, that usage grows linearly as the file grows. The amount of memory required is about 35% of the original data set, and it is proportional to the number of rows rather than the size of the file. That is to say, a 100GB file with 50 million rows would take less memory than a 40GB file with 100 million rows.
The difference in processing time is dramatic: not only is Approach 3 significantly faster, it also grows at a slower rate. In our case, the extra memory usage is well worth the performance boost.
If you would like to know more about how Namara’s making external data easier to use, please get in touch!
Jason is a developer intern at ThinkdataWorks; keep track of what he gets up to on his GitHub.