“Micro-optimizing” Parallelization
in JRuby using Sidekiq workers

Written by Chris Sandison

February 22, 2017

I’ve been reading up on improving JRuby/Java performance, and I keep running into sentiments like this:

If you are preoccupying yourself with “micro-optimizations,” then your approach needs to be reconsidered.

I don’t like it. It’s smug and easy, and it entirely misses the point.

I discussed Unity, the JRuby-based transformation engine behind our ETL service, in a previous post. Unity coalesces disjoint and disparate data sources into a new, unified schema using MapReduce-like processing: incoming data sources are partitioned into smaller files, and a worker is dispatched to compile a separate, identical instance of the processor (called our graph runner) to transform each partition. Once all partitions have been processed, they are concatenated into a new data source.

The workers are supported by several utility classes that do a lot of the auxiliary work: type conversion and validation, handling geometric values (WKT), looking up geolocation data, and so on. JRuby dispatches these workers on a single Java ThreadPool, so a single instance of each of these utility classes is shared amongst every worker. The number of workers spawned grows with the size of the data being processed, and when the work done inside each worker also depends on the size of the data source, these “micro-optimizations” suddenly become significant.

We found that by proactively managing object creation and minimizing thread overhead, we saw reductions in processing time of up to 25%. Given these findings, we have added some new best practices for developing on Unity.


Let’s say we have a utilities singleton class called TextUtils. TextUtils is going to be our junk drawer for dealing with any irregularities coming from cells in our data source.

package blog.namara;

public class TextUtils {
  // Eagerly create the single shared instance; because it already exists by
  // the time any thread calls getInstance(), the getter needs no locking.
  private static final TextUtils utils = new TextUtils();

  private TextUtils() {}

  public static TextUtils getInstance() {
    return utils;
  }
}

Every worker, when initialized, also grabs a reference to the shared TextUtils instance:

private TextUtils text_utils = TextUtils.getInstance();
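
To make the sharing concrete, here is a minimal sketch of the dispatch pattern, using a plain Java ExecutorService as a stand-in for the pool that JRuby manages; the class name, pool size, and partition names are all illustrative.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PartitionDispatcher {
  public static void main(String[] args) {
    // One pool for every worker, mirroring the single Java ThreadPool above.
    ExecutorService pool = Executors.newFixedThreadPool(8);
    List<String> partitions = Arrays.asList("part-00", "part-01", "part-02");

    for (String partition : partitions) {
      pool.submit(() -> {
        // Each worker resolves the same shared instance; nothing is copied.
        // Printing it shows every worker sees the same object identity.
        TextUtils text_utils = TextUtils.getInstance();
        System.out.println(partition + " sees " + text_utils);
      });
    }
    pool.shutdown();
  }
}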

Given this setup, there are some rules we have to enforce on TextUtils.


1. Utility classes should initialize anything they can at instantiation time

Let’s say that we want to add a function that will take a string and extract only the numbers from it. Normally, we might do something like this.

public String scrub_non_numerical_input(String input) {
  return input.replaceAll("[^0-9.-]", "");
}

This totally works, and all of our tests pass. However, the docs for String#replaceAll tell us that every invocation of this method calls Pattern#compile. Pattern compilation has been shown to be the truly expensive part of the operation, so we can use a pre-compiled pattern instead.

package blog.namara;

import java.util.regex.Pattern;

public class TextUtils {

  ...
  // Compiled once at instantiation time instead of on every call.
  private Pattern number_scrubber;

  private TextUtils() {
    this.number_scrubber = Pattern.compile("[^0-9.-]");
  }
  ...
  public String scrub_non_numerical_input(String input) {
    return number_scrubber.matcher(input).replaceAll("");
  }
}

A quick benchmark shows that String#replaceAll is over 2.5x slower than using the Matcher class with a pre-compiled Pattern. If scrub_non_numerical_input is called for every value that we process, this significantly impacts performance. And since Pattern is threadsafe (each call creates its own short-lived Matcher), our workers shouldn’t need any additional thread control.

2. Benchmark your operations

If you are making repeated calls to a library, be sure that the library won’t become a bottleneck for your application. Plug in different libraries and see what performs best in the context in which it will repeatedly be used.
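
As an example, the String#replaceAll comparison from the previous section can be reproduced with a rough harness along these lines (the sample input and iteration counts are illustrative, and a dedicated harness such as JMH would give more rigorous numbers):

import java.util.regex.Pattern;

public class ScrubBenchmark {
  public static void main(String[] args) {
    String input = "Phone: 416-555-0199 ext. 42";
    Pattern pattern = Pattern.compile("[^0-9.-]");
    int iterations = 1_000_000;

    // Warm up the JIT before timing anything.
    for (int i = 0; i < 100_000; i++) {
      input.replaceAll("[^0-9.-]", "");
      pattern.matcher(input).replaceAll("");
    }

    long start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      input.replaceAll("[^0-9.-]", ""); // recompiles the pattern every call
    }
    long replaceAll = System.nanoTime() - start;

    start = System.nanoTime();
    for (int i = 0; i < iterations; i++) {
      pattern.matcher(input).replaceAll(""); // reuses the compiled pattern
    }
    long precompiled = System.nanoTime() - start;

    System.out.printf("replaceAll: %d ms, pre-compiled: %d ms%n",
        replaceAll / 1_000_000, precompiled / 1_000_000);
  }
}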

We switched from Ruby’s CSV writer to OpenCSV for Java and we found that our processing time per worker was halved. We were stunned, but thrilled.
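
For reference, the OpenCSV side of that switch looks roughly like this (a minimal sketch; the file name and rows are made up):

import com.opencsv.CSVWriter;

import java.io.FileWriter;
import java.io.IOException;

public class PartitionCsvExample {
  public static void main(String[] args) throws IOException {
    // CSVWriter handles quoting and escaping for us.
    try (CSVWriter writer = new CSVWriter(new FileWriter("partition-out.csv"))) {
      writer.writeNext(new String[] {"id", "state", "value"});
      writer.writeNext(new String[] {"1", "AL", "42.0"});
    }
  }
}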

3. Be skeptical of third-party libraries (or at least do your research)

Let’s say that Unity has a utility class that, given a state name (or some abbreviation of one), will return its ISO 3166-2-compliant two-character state code. Let’s also say that it has been thoroughly tested and that I, as its developer, have complete confidence in it.

Let’s say that I’ve wrapped this class up in a library called MysteriousStatesman and shipped it out to anyone looking to keep their hands clean of this task. I’ve also licensed it so that you have no idea what the source code looks like. Now you have a dataset of 2M rows, where every row contains some sort of state abbreviation, each of which needs to be translated. You, dear reader, decide to do some benchmarking to see how this is behaving.

You find that calling MysteriousStatesman.fetch_code("Alabama") averages 2ms. A call to MysteriousStatesman.fetch_code("Florida") averages 50ms. A call to MysteriousStatesman.fetch_code("U. S. Virgin Islands") takes a ghastly 345ms. You have no idea what forces are at work here, but whatever they are, they cannot be trusted.

As bad as this is, you desperately need the functionality that MS guarantees. Then it dawns on you: a single data source is very likely to use the same kind of abbreviation for a state name across all of its rows. If Alabama is recorded as “Alab.” once, there is a good chance that it is always recorded like that. Why not do something like this?

package blog.namara;

import java.util.HashMap;
import java.util.Map;

import MysteriousStatesman;

public class TextUtils {

  ...
  // Remembers every abbreviation we have already paid to look up.
  private Map<String, String> state_memory_map = new HashMap<>();

  public String get_state_code(String input_value) {
    if (state_memory_map.containsKey(input_value)) {
      return state_memory_map.get(input_value);
    }
    String state_code = MysteriousStatesman.fetch_code(input_value);
    state_memory_map.put(input_value, state_code);
    return state_code;
  }
}

This is a solution that generalizes to just about every “look-up” operation that we need to do. All that changes is the caching strategy.

4. Memoize predictable operations

Since these utility classes are shared by all workers, objects like the state_memory_map will be built collaboratively by all workers. If that doesn’t make sense for your task, look into a caching strategy that is isolated to each worker, like the sketch below.
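
One way to get per-worker isolation is a ThreadLocal cache. This is a hedged sketch, not something Unity ships; the class name is illustrative, and it reuses the fictional MysteriousStatesman from above.

import java.util.HashMap;
import java.util.Map;

public class IsolatedCacheExample {
  // Each worker thread lazily gets its own private map: no sharing, no locks.
  private final ThreadLocal<Map<String, String>> local_state_map =
      ThreadLocal.withInitial(HashMap::new);

  public String get_state_code(String input_value) {
    return local_state_map.get()
        .computeIfAbsent(input_value, MysteriousStatesman::fetch_code);
  }
}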

5. Ensure that all utility classes are threadsafe when required

Continuing with the MysteriousStatesman example, we need to ensure that access to MS is safe. You have no guarantee that MS#fetch_code is threadsafe, and who knows what an unsafe-access failure would even look like. Do we want to synchronize the whole method? Will the locking overhead end up hurting performance on what will mostly be hashmap-reading calls?

package blog.namara;

import java.util.HashMap;
import java.util.Map;

import MysteriousStatesman;

public class TextUtils {

  ...
  private Map<String, String> state_memory_map = new HashMap<>();

  // Reads are deliberately unsynchronized: they are the common case.
  public String get_state_code(String input_value) {
    if (state_memory_map.containsKey(input_value)) {
      return state_memory_map.get(input_value);
    }
    return fetch_state_code(input_value);
  }

  // Only the non-pure path (the library call and the write) takes the lock,
  // so at worst a key is fetched more than once or a fresh entry is missed.
  private synchronized String fetch_state_code(String input_value) {
    String state_code = MysteriousStatesman.fetch_code(input_value);
    state_memory_map.put(input_value, state_code);
    return state_code;
  }
}

We saw that dogmatically making every method call synchronized significantly increased processing time. I’m sure you noticed that this code is susceptible to stale reads; that is a tradeoff you will have to accept or work around (one alternative is sketched below). Our own best practice is that method calls that are not purely functional should be synchronized. Combining this with memoization and pre-compilation strategies, we find that instances where threads need to be isolated don’t come up very often.
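
If the stale reads or duplicated fetches do become a problem, one option worth considering, again not Unity’s approach, is to let ConcurrentHashMap do the locking at the bucket level:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentCacheExample {
  private final Map<String, String> state_memory_map = new ConcurrentHashMap<>();

  public String get_state_code(String input_value) {
    // computeIfAbsent is atomic per key: reads stay cheap in the common case,
    // and each missing key triggers at most one call into the library.
    return state_memory_map.computeIfAbsent(
        input_value, MysteriousStatesman::fetch_code);
  }
}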

6. Have a solid stress test and run it often

Thread issues are hard to detect in tests; you are likely not even executing the parallel workers in your test suite. They are also hard to detect in toy examples and prescribed QA walkthroughs.
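
Before reaching for full-sized data, a crude contention harness can shake out some of the more obvious races. In this sketch the thread count, iteration count, and key set are all arbitrary; the point is simply to hammer the shared utilities from many threads at once.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TextUtilsStressTest {
  public static void main(String[] args) throws InterruptedException {
    int threads = 16;
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    CountDownLatch start = new CountDownLatch(1);

    for (int i = 0; i < threads; i++) {
      pool.submit(() -> {
        try {
          start.await(); // release every thread at once to maximize contention
          for (int j = 0; j < 100_000; j++) {
            TextUtils.getInstance().get_state_code("state-" + (j % 50));
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    start.countDown();
    pool.shutdown();
    pool.awaitTermination(10, TimeUnit.MINUTES);
  }
}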

Taking the time to develop, isolate, or generate test instances that will stress your application is worthwhile. I have two cases that I use often: one includes 3 source files that produce an output file over 500MB in size, and I’ll often run 2 or more instances of it at the same time when finalizing a feature. I have another 9GB file that I will dispatch on my lunch break once in a while to be sure that the application is still holding up. These have been instrumental in finding our nondeterministic errors.


I would argue that these are not micro-optimizations at all, but a return to thinking about your high-level language at a traditionally lower level. If you are running a processing application like ours, where the amount of work it has to do depends on the dimensions of the data being pumped through it, the numbers are telling us that spending time on object management pays off.