Streaming Large Files in JRuby

Written by Chris Sandison

February 10, 2017

Before the world’s Open Data ends up on Namara, it’s pumped through a multi-component pipeline that we call our ELR (an ETL with some extra steps for linking, formatting, analyzing and forging new data). While our current transformer (dubbed Unity) is used in the ELR for standardization and normalization, it also exists as a standalone service. Unity is capable of coalescing disjunct and disparate data sets into a single data set with a unified schema.

Behind the scenes, all data sources are partitioned and processed in a MapReduce-like fashion, and the outputs are then combined into a single data set. Unity runs on JRuby on Rails, using Sidekiq to handle the parallelization. The modelling component of the codebase is written in Ruby, while the data processing components are written in Java.
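
To make that shape concrete, here is a minimal sketch of the fan-out pattern, not Unity's actual code; the worker, the Java-backed Processor class and the ProcessingJob model are all hypothetical names used for illustration:

require 'sidekiq'

# Hypothetical worker: each Sidekiq job processes one partition of the input.
class PartitionWorker
  include Sidekiq::Worker

  def perform(job_id, partition_path)
    # The data processing itself is delegated to the Java side of the codebase
    # (illustrative package/class name only).
    result_path = Java::ComExampleUnity::Processor.new.process(partition_path)

    # Record the partial result; a separate step combines all partitions into
    # a single data set once every worker has finished.
    ProcessingJob.find(job_id).record_result(result_path)
  end
end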

Though we saw significant performance benefits using Java over a straight Ruby implementation, we noticed memory issues when processing large files (~9GB), or multiple large files at once. Documentation on memory management during large file operations is sparse, so we had to uncover a few gotchas ourselves.


The Java garbage collector only runs when free heap space begins to get tight. During normal operation, the memory used by the service would grow at a linear rate, but explicit invocation of the GC would always reduce it to some base amount, so we were confident that there wasn't a persistent leak.
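
As a rough way to check this, you can read the JVM's heap counters and request a collection directly from JRuby. A minimal sketch of that kind of check (the used_heap_mb helper is ours, not a standard API, and GC.start is only a request that the JVM is free to ignore):

require 'java'

# Current JVM heap usage in megabytes (total allocated minus free).
def used_heap_mb
  runtime = java.lang.Runtime.getRuntime
  (runtime.totalMemory - runtime.freeMemory) / (1024.0 * 1024.0)
end

puts format("before GC: %.1f MB used", used_heap_mb)

# In JRuby this asks the JVM for a full collection; the JVM may ignore it,
# e.g. when running with -XX:+DisableExplicitGC.
GC.start

puts format("after GC:  %.1f MB used", used_heap_mb)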

The data sets used as input to these operations are externally hosted, so the first step in the process was to stream the file onto disk. What we had looked something like this:

uri = URI("https://www.someurl.com/file.csv")
Net::HTTP.start(uri.host, uri.port, use_ssl: (uri.scheme == "https")) do |http|
  http.request(Net::HTTP::Get.new(uri)) do |response|
    # check response object
    File.open("some/path/for/writing.csv", 'wb') do |file|
      response.read_body do |file_chunk|
        file.write(file_chunk)
      end
    end
  end
end

Which in most cases worked perfectly. Until we started reading in large files, larger than our heap allocation. Or until we were doing this 30 times for 30 different 500MB files in 30 workers on the same machine. We noticed that explicit invocation of the GC during the file download was not deallocating any memory at all; memory use would simply grow until the download was completed.
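
One way to watch this happen is to log heap usage every few thousand chunks inside the download loop. A sketch along those lines, reusing the hypothetical used_heap_mb helper from above:

require 'java'
require 'net/http'

# Hypothetical helper (same as the earlier sketch): current heap usage in MB.
def used_heap_mb
  runtime = java.lang.Runtime.getRuntime
  (runtime.totalMemory - runtime.freeMemory) / (1024.0 * 1024.0)
end

uri = URI("https://www.someurl.com/file.csv")
chunks = 0

Net::HTTP.start(uri.host, uri.port, use_ssl: (uri.scheme == "https")) do |http|
  http.request(Net::HTTP::Get.new(uri)) do |response|
    File.open("some/path/for/writing.csv", 'wb') do |file|
      response.read_body do |file_chunk|
        file.write(file_chunk)
        chunks += 1
        # With File.open nested inside the session as above, this number
        # keeps climbing for the duration of the download.
        puts format("chunk %d: %.1f MB used", chunks, used_heap_mb) if (chunks % 1000).zero?
      end
    end
  end
end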

The solution is subtle:

uri = URI("https://www.someurl.com/file.csv")
File.open("some/path/for/writing", 'wb') do |file|
  Net::HTTP.start(uri.host, uri.port, use_ssl: (uri.scheme == "https")) do |http|
    http.request(Net::HTTP::Get.new(uri)) do |response|
      response.read_body do |file_chunk|
        file.write(file_chunk)
      end
    end
  end
end

The File.open call must wrap the entire session; otherwise the GC will not deallocate any of the streamed response text until the session block exits. With this change, we saw that memory usage would only fluctuate 10–15MB at a time, showing that the GC was freeing memory allocated during the stream.


An additional caveat

response.read_body must be called with a block. If you instead assign response = http.request(...) without a block and then call response.read_body, the entire response will already have been slurped into memory.
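
To make the contrast concrete, here is a small sketch of the two call styles, using the same placeholder URL as above:

uri = URI("https://www.someurl.com/file.csv")

Net::HTTP.start(uri.host, uri.port, use_ssl: (uri.scheme == "https")) do |http|
  # Buffered: without a block, the whole body is read into memory before the
  # call returns, and read_body just hands back that buffered string.
  response = http.request(Net::HTTP::Get.new(uri))
  whole_body = response.read_body # already fully in memory at this point

  # Streamed: with a block, the body arrives in chunks as it is read off the
  # socket, so only one chunk needs to be held in memory at a time.
  http.request(Net::HTTP::Get.new(uri)) do |streaming_response|
    streaming_response.read_body do |chunk|
      # handle chunk
    end
  end
end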