Before the world’s Open Data ends up on Namara, it’s pumped through a multi-component pipeline that we call our ELR (an ETL with some extra steps for linking, formatting, analyzing and forging new data). While our current transformer (dubbed Unity) is used in the ELR for standardization and normalization, it also exists as a standalone service. Unity can coalesce disjoint and disparate data sets into a single data set with a unified schema.
Behind the scenes, all data sources are partitioned and processed in a MapReduce-like fashion, and the outputs are then combined into a single data set. Unity runs on JRuby on Rails, using Sidekiq to handle the parallelization. The modelling component of the codebase is written in Ruby, while the data processing components are written in Java.
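To make that flow concrete, here is a minimal sketch of the partition/process/combine pattern in plain Java. It is illustrative only: in Unity the partitions are fanned out as Sidekiq jobs rather than stream tasks, and the class and method names here are placeholders, not part of Unity's API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Illustrative only: Unity fans partitions out as Sidekiq jobs, but the
// shape of the work is the same -- partition, process, combine.
public class PartitionSketch {

    // Split the source rows into fixed-size partitions.
    static List<List<String>> partition(List<String> rows, int size) {
        return IntStream.range(0, (rows.size() + size - 1) / size)
                .mapToObj(i -> rows.subList(i * size,
                        Math.min((i + 1) * size, rows.size())))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a,1", "b,2", "a,3", "c,4");

        Map<String, Long> combined = partition(rows, 2).parallelStream()
                // Process step: each partition builds its own partial result
                // (here, a count of rows per key).
                .map(part -> part.stream().collect(Collectors.groupingBy(
                        r -> r.split(",")[0], Collectors.counting())))
                // Combine step: merge the partial results into one data set.
                .flatMap(partial -> partial.entrySet().stream())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        Long::sum, TreeMap::new));

        System.out.println(combined); // {a=2, b=1, c=1}
    }
}
```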
Though we saw significant performance benefits using Java over a straight Ruby implementation, we noticed memory issues when processing large files (~9GB), or multiple large files at once. Documentation on memory management during large file operations is sparse, which led us to uncover a few gotchas.
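One gotcha in this class is worth sketching: any API that materializes the whole file (Files.readAllBytes, Files.readAllLines) needs heap proportional to the file size, so a ~9GB input fails before the real work even starts. The sketch below (the input path is a placeholder) shows the streaming alternative: Files.lines reads rows lazily as the stream is consumed, so heap usage stays flat regardless of file size.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class StreamingRead {
    public static void main(String[] args) throws IOException {
        Path input = Path.of("large-dataset.csv"); // placeholder path

        // Files.lines is lazy: each row is read only when the stream pulls it,
        // so a ~9GB file never has to fit in the heap at once.
        // try-with-resources matters -- the stream holds an open file handle.
        try (Stream<String> lines = Files.lines(input)) {
            long rows = lines.filter(line -> !line.isBlank()).count();
            System.out.println("rows: " + rows);
        }

        // By contrast, Files.readAllLines(input) materializes every row
        // up front and fails with an OutOfMemoryError on inputs this size.
    }
}
```

Worth remembering in this setup: under JRuby the Ruby and Java sides share a single JVM heap, so the -Xmx ceiling (passed as -J-Xmx to the jruby command) bounds the whole worker, not just the Java components.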