Yukon — Optimized File Deduplication and Merge

Written by Jason Piao

July 26, 2017
Between 1896 and 1899 a stampede of over 100,000 prospectors made their way north, leaving San Francisco and Seattle behind to search for gold along the Klondike. Of those that went, fewer than a third made it to the Yukon, and those that did were constantly challenged by unforgiving conditions. The gold rushes of the 19th century are characterized by this kind of atmosphere. Gold is a tricky thing to mine: it takes intelligence, grit, and patience to sift through piles of dirt you don't want in order to uncover the chunk of something you do.

This was on our minds when we first discussed developing a strategy for removing and merging duplicated data points in a data set. Building a tool that could uncover nuggets of value in mountains of data sounded incredibly valuable. It also sounded incredibly difficult. We like things that are difficult. We thought we'd call it Yukon.

How it Works

To identify duplicate rows, we first have to specify which columns Yukon will look at. In the sample below, we’ve selected columns 1 and 5 (FirstName and Company) as the composite key, or what defines a unique record. The entire row will be used if no columns are specified.

Let’s say we take this example data set:

sample data set

By specifying columns 1 and 5, Yukon will produce this deduplicated and merged dataset:

deduplicated and merged data set
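To make the behavior concrete, here is a minimal in-memory sketch of composite-key deduplication and merging in Python. This is an illustration rather than Yukon's actual implementation, and the "fill blank fields from later duplicates" merge rule, the CSV input format, and the file names are assumptions made for the example.

    import csv

    def dedupe_and_merge(in_path, out_path, key_columns=None):
        """Deduplicate rows on a composite key, merging each duplicate into the
        first occurrence by filling in any fields that occurrence is missing."""
        seen = {}    # composite key -> first ("prime") row seen for that key
        order = []   # keys in first-seen order, to preserve row order

        with open(in_path, newline="") as f:
            for row in csv.reader(f):
                # Default to the whole row when no key columns are specified.
                key = tuple(row[i] for i in key_columns) if key_columns else tuple(row)
                if key not in seen:
                    seen[key] = row
                    order.append(key)
                else:
                    prime = seen[key]
                    # Merge rule assumed here: keep the prime row's values and
                    # fill in its blank fields from the duplicate.
                    for i, value in enumerate(row):
                        if i < len(prime) and not prime[i] and value:
                            prime[i] = value

        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            for key in order:
                writer.writerow(seen[key])

    # Columns 1 and 5 (FirstName and Company) as the composite key; the indices
    # are zero-based here, and the file names are hypothetical.
    # dedupe_and_merge("people.csv", "people_deduped.csv", key_columns=[0, 4])

This version keeps every unique row in RAM, which works for a sample like the one above but not for the file sizes discussed next.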

In theory, deduplication on small data sets like these sounds relatively simple. However, the data sets on Namara that need to undergo this deduplication and merging process can be larger than 100GB. To put that size into perspective, a 75-page essay with 23,000 words is only about 1MB, which is 100,000 times smaller.

Most computers will have difficulty opening and operating on 1GB files in programs such as Microsoft Word or Excel because of the heavy demands on computer memory (RAM). As you can imagine, a file over 100GB simply cannot be loaded into a standard computer's memory without causing it to crash.

Though RAM presented us with limitations, we were able to solve this problem by utilizing disk memory. Disk memory is abundant on modern computers and is relatively cheap to upgrade in comparison to RAM.

Holy Guacamole! Look at all the implicit value in that data set!

Approach 1

Our first approach resembled how a Bloom Filter is implemented:

  • Initialize a hash file using x buckets (dependent on data set size), where a bucket is a fixed-size string of bytes
  • Obtain hash of each row in the data set, and remember the current byte offset of the row
  • Using this hash, locate the corresponding bucket in the hash file and read the byte string for that bucket
  • If the byte string points to some offset in the original data set, then we’ve detected a collision. Otherwise, record the current byte offset in the bucket

This meant having to check whether the rows that caused a collision were unique or true duplicates. This would determine whether or not to merge them into a single row.
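As a rough sketch of those steps (an illustration, not the production code), the bucket mechanics could look like the following. The 8-byte bucket size, the CRC32 hash, and storing offset + 1 so that an all-zero bucket can mean "empty" are assumptions made for this example.

    import struct
    import zlib

    BUCKET_SIZE = 8   # each bucket is a fixed-size string of bytes (assumed 8)

    def build_hash_file(data_path, hash_path, hash_bits=7):
        """Record, per bucket, the byte offset of the first row that hashed to it;
        return the (earlier_offset, current_offset) pairs that collided."""
        n_buckets = 1 << hash_bits
        # Initialize the hash file with n_buckets empty (all-zero) buckets.
        with open(hash_path, "wb") as hf:
            hf.write(b"\x00" * (BUCKET_SIZE * n_buckets))

        collisions = []
        with open(data_path, "rb") as df, open(hash_path, "r+b") as hf:
            while True:
                offset = df.tell()          # byte offset of the current row
                line = df.readline()
                if not line:
                    break
                # 32-bit hash of the row, truncated to the low hash_bits bits.
                bucket = zlib.crc32(line) & (n_buckets - 1)
                hf.seek(bucket * BUCKET_SIZE)
                stored = struct.unpack(">Q", hf.read(BUCKET_SIZE))[0]
                if stored:
                    # The bucket already points at an earlier row: a collision that
                    # still has to be checked against the actual row contents.
                    collisions.append((stored - 1, offset))
                else:
                    hf.seek(bucket * BUCKET_SIZE)
                    hf.write(struct.pack(">Q", offset + 1))  # +1 so zero means empty
        return collisions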

Problems with Approach 1

We ran into issues regarding the size of the hash we were considering (7 bits of a 32-bit hash). The size of the hash file would grow exponentially with the number of hash bits we used to index it. However, using a smaller portion of the hash would also increase the number of collisions. This meant leaving a lot of dirt in with the gold.
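To put numbers to that trade-off: the hash file takes roughly bucket_size × 2^bits bytes, so it doubles with every extra bit of the hash used to index it (the 8-byte bucket size is the same assumption as in the sketch above).

    BUCKET_SIZE = 8  # bytes per bucket, as assumed in the sketch above

    # The hash file doubles in size with every extra bit of the hash used.
    for bits in (7, 16, 24, 32):
        buckets = 1 << bits
        print(f"{bits:>2} bits -> {buckets:>13,} buckets -> {buckets * BUCKET_SIZE / 2**30:.6f} GiB")

    #  7 bits ->           128 buckets -> 0.000001 GiB (but almost every row collides)
    # 32 bits -> 4,294,967,296 buckets -> 32.000000 GiB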

“Aw man look at all that dirt” 🙁

Approach 2

We wanted to see what would happen if we got rid of the hash file and relied only on memory (RAM). We started by calculating the length of the largest row in the data set and padding each row to match that length. This allowed us to easily seek through rows in the file using a row index multiplied by a constant row length, letting us re-write and merge quickly. It also cut down on hashing collisions, since we could use all 32 bits of the hash.
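A simplified sketch of that fixed-width idea (again an illustration, not Yukon's code): pad every row to the length of the longest one, so the i-th row always starts at i × row_length and can be read or rewritten in place.

    def pad_to_fixed_width(in_path, out_path):
        """Rewrite a file so every row occupies the same number of bytes, making
        row i addressable at byte offset i * row_length."""
        # Pass 1: find the longest row body (excluding the newline).
        with open(in_path, "rb") as f:
            max_body = max(len(line.rstrip(b"\n")) for line in f)
        row_length = max_body + 1   # every padded row is body + "\n"

        # Pass 2: right-pad every row with spaces to the fixed width.
        with open(in_path, "rb") as src, open(out_path, "wb") as dst:
            for line in src:
                dst.write(line.rstrip(b"\n").ljust(max_body) + b"\n")
        return row_length

    def read_row(path, index, row_length):
        """Seek straight to row `index` using the constant row length."""
        with open(path, "rb") as f:
            f.seek(index * row_length)
            return f.read(row_length).rstrip().decode()

Note that producing the padded copy requires two passes over the input before any deduplication happens, which is exactly the cost described below.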

Problems with Approach 2

This approach had us reading the data set twice, once to find the longest row and again to write the padded copy, before doing any actual deduplication. The frequent file seeking also took a toll on runtime.

Yeehaw

Approach 3

We wanted to utilize both RAM and disk memory to gain speed without sacrificing storage. Keeping track of the duplicated rows themselves (and their offsets) in RAM allowed us to directly access the duplicate rows and merge them, instead of seeking through disk memory.

Deduplicate:

  • Write all “prime” rows (the initial row that duplicates would be merged into) to an intermediary file, while keeping track, in memory, of the row offset and length of their first occurring duplicates
  • When a duplicate row is detected, merge it with the original while storing both offsets. Now we have references to (a) all rows that need to be merged (RAM), and (b) all unique rows and the prime rows of each duplicate group (disk memory)

Merge:

  • Read the intermediary file in, row by row
  • If the row we are examining is unique, we won’t have an offset to match a merged row, and it can be written to an output file. Otherwise, the offset should correspond to a merged row that we are already storing in memory. We can merge these and write them to our output file.

This process gives us the desired data set with all rows deduplicated and merged. No more dirt in the gold.
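One way to read those two phases in code is sketched below. It is not Yukon's actual implementation: it reuses the fill-in-blank-fields merge rule assumed earlier, and it tracks prime rows by their line index in the intermediary file, where the write-up above tracks byte offsets; the index plays the same role for a sequential read.

    import csv

    def merge_fields(base, extra):
        """Assumed merge rule: keep base's values, fill blank fields from extra."""
        merged = list(base)
        for i, value in enumerate(extra):
            if i < len(merged) and not merged[i] and value:
                merged[i] = value
        return merged

    def dedupe_and_merge_two_pass(in_path, inter_path, out_path, key_columns=None):
        key_to_index = {}   # composite key -> index of its prime row in inter_path
        merged_dups = {}    # prime-row index -> that key's duplicates, folded together

        # Deduplicate: prime rows go to the intermediary file on disk; duplicates
        # are folded together in RAM under the prime row's index.
        with open(in_path, newline="") as src, open(inter_path, "w", newline="") as inter:
            writer = csv.writer(inter)
            for row in csv.reader(src):
                key = tuple(row[i] for i in key_columns) if key_columns else tuple(row)
                if key not in key_to_index:
                    key_to_index[key] = len(key_to_index)
                    writer.writerow(row)
                else:
                    index = key_to_index[key]
                    prev = merged_dups.get(index)
                    merged_dups[index] = merge_fields(prev, row) if prev else row

        # Merge: read the intermediary file sequentially; unique rows pass straight
        # through, rows with an entry in merged_dups get merged before being written.
        with open(inter_path, newline="") as inter, open(out_path, "w", newline="") as out:
            writer = csv.writer(out)
            for index, row in enumerate(csv.reader(inter)):
                dup = merged_dups.get(index)
                writer.writerow(merge_fields(row, dup) if dup else row)

In this arrangement only the key map and the duplicated rows live in RAM, which is why the memory footprint discussed below tracks the number of rows rather than the raw file size.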

Yippee!

Approach 3 Conclusions

This approach gave us no collisions and had reasonable memory usage. We were also satisfied with the speed (about 35 seconds on a 5-million-row data set, running on our development machines).

Benchmarks are given below:

Comparisons of Approach 1 (red line) and Approach 3 (blue line):

File size vs. memory usage

File size vs. time

Conclusions

Algorithm growth, in this case, means how much more time and memory the algorithm needs as the file size increases. When one algorithm grows more slowly than another in terms of speed, the ratio of its processing time to file size is smaller than the slower algorithm's ratio. The smaller that ratio, the better the algorithm scales for large files. For example, if processing time only doubles when the file size doubles, the ratio stays constant and the algorithm scales linearly.

Given the benchmarks, Approach 3 is the clear winner. Though its memory usage is greater, that usage grows at a linear rate as the file size increases. In our benchmarks the memory required was about 35% of the size of the original data set, but it is really proportional to the number of rows rather than to the size of the file. That is to say, a 100GB file with 50 million rows would take less memory than a 40GB file with 100 million rows.

The difference in processing time is significant: not only is Approach 3 much faster, it also grows at a slower rate. In our case, the extra memory usage is well worth the performance boost.

 

If you would like to know more about how Namara’s making external data easier to use, please get in touch!

Jason is a developer intern at ThinkdataWorks; keep track of what he gets up to on his GitHub.