Sunday, September 21, 2008

What is a Memoizer and why should you care about it?

A Memoizer is a class created by Doug Lea/Brian Goetz. It's basically a create-and-store cache. That is, it has one very useful property:


  • A Memoizer is a generic class that encapsulates the process of creating a value and remembering that value (memoizing it), so that subsequent calls to get the value return the memoized value instead of calling the Factory method to create it again.

Why do you care?

  • A Memoizer encapsulates the process of "getOrCreate()", which is highly useful when you are retrieving, by key, a resource that does not change (think cache) and that is generally considered expensive to create.

But, you ask, I can already do that, so why do I need a Memoizer? Glad you asked. Memoizer is unique in its solution because it caches a value by key while remaining very efficient. It has the following additional properties:

  1. Calls to get items are concurrent

  2. Calls to get items for a particular key are guaranteed to call the Factory method for that key once and only once


It's important to understand that what the Memoizer does is both efficient and correct. It's relatively easy to roll your own "getOrCreate()" method that has only one of those properties (and most people do). It's not so easy, or obvious, to do both, and that's why you should use a Memoizer rather than roll your own.

To review, let's look at the strategy most people would use to guarantee that the Factory method is called once and only once:

/*** SUB-OPTIMAL IMPLEMENTATION THAT IS CORRECT, BUT NOT CONCURRENT ***/
private final Factory<A, V> factory = ...;
private final Map<A, V> map = new HashMap<A, V>();

public V getOrCreate(A key) {
    synchronized (map) {
        if (map.containsKey(key)) { return map.get(key); }

        // create the value and remember it
        V value = factory.create(key);
        map.put(key, value);
        return value;
    }
}

But of course, the line synchronized (map) should be a tip-off that this implementation is not concurrent, since there is only a single lock protecting access to our map. Can we do better? Sure. Let's use a ConcurrentHashMap to make our getOrCreate() method concurrent (but, as we'll see, it will not have the once-and-only-once property):

/*** SUB-OPTIMAL IMPLEMENTATION THAT IS CONCURRENT, BUT NOT CORRECT ***/
private final Factory<A, V> factory = ...;
private final ConcurrentMap<A, V> map = new ConcurrentHashMap<A, V>();

public V getOrCreate(A key) {
    if (map.containsKey(key)) { return map.get(key); }

    // map doesn't contain the key, so create a value -- note that the first
    // writer wins; all other writers just throw away their value
    map.putIfAbsent(key, factory.create(key));

    // return the value that won
    return map.get(key);
}
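To make the failure mode concrete, here is a small, hypothetical harness (my own illustration, not from the original example) that races two threads through this ConcurrentHashMap version. Callers always see a single winning value, but with unlucky timing the "factory" runs twice for the same key:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class DuplicateCreateDemo {
    static final AtomicInteger creates = new AtomicInteger();
    static final ConcurrentMap<String, String> map = new ConcurrentHashMap<String, String>();

    static String getOrCreate(String key) {
        if (map.containsKey(key)) { return map.get(key); }
        map.putIfAbsent(key, create(key)); // both racing threads may reach this line
        return map.get(key);
    }

    static String create(String key) {
        creates.incrementAndGet();         // count how many times the "factory" runs
        return "value-for-" + key;
    }

    public static void main(String[] args) throws InterruptedException {
        final CountDownLatch start = new CountDownLatch(1);
        Runnable r = new Runnable() {
            public void run() {
                try { start.await(); } catch (InterruptedException ignored) { }
                getOrCreate("key");
            }
        };
        Thread t1 = new Thread(r);
        Thread t2 = new Thread(r);
        t1.start();
        t2.start();
        start.countDown();
        t1.join();
        t2.join();
        // frequently prints 2: the value is correct, but create() ran more than once
        System.out.println("create() invocations: " + creates.get());
    }
}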

So, those are the two implementations: each achieves either correctness (once and only once) or performance (concurrent puts and gets), but neither accomplishes both simultaneously. Of course, that's the whole point of this article: to introduce you to Memoizer. So how does Memoizer ensure correctness (once and only once) while remaining concurrent?

The trick is to delay the call to the Factory create method into the future. We already know how to make the operation concurrent, but making the operation concurrent means that the first writer will win, and all the other writers will lose. So we delay the Factory create call by wrapping that call in a FutureTask. In other words, instead of putting the actual value into the map, we put a Future (which is a wrapper for getting the value sometime in the future) into the Map.

(Note: if you are not yet familiar with the Future and FutureTask classes introduced in JDK 1.5, you should be.)
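As a quick refresher, here is a minimal sketch (my own, with hypothetical names) of the FutureTask contract that Memoizer relies on: the wrapped Callable runs only when run() is invoked, and get() blocks until that single result is available, then returns it on every subsequent call:

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

public class FutureTaskPrimer {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        // A FutureTask wraps a Callable; the work does not happen at construction time
        FutureTask<String> task = new FutureTask<String>(new Callable<String>() {
            public String call() {
                return "expensive result";
            }
        });

        task.run();                     // execute the wrapped Callable (here, on the current thread)
        System.out.println(task.get()); // returns the already-computed result immediately
    }
}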

By putting a Future into the map - and not our actual value - we can move the work of calling the Factory create method into the future - specifically, until after the winner of the race to put the value into the map has been determined. The winner then runs the task (which calls create on the Factory), and every caller simply calls get() on the one and only Future instance contained in the Map.

The full code for Memoizer illustrates this trick. Note that the Computable interface specifies a generic "Factory". Also note that I have not been able to find a library or canonical reference for Memoizer. The best I have is here: Memoizer.java.

Here it is re-created for your convenience (note that I have adjusted it slightly, for example allowing the user to specify the number of segments created in the underlying ConcurrentHashMap):

import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

public interface Computable<A, V>
{
    V compute(A arg) throws InterruptedException;
}

public class Memoizer<A, V> implements Computable<A, V>
{
    private final ConcurrentMap<A, Future<V>> cache;
    private final Computable<A, V> c;

    public Memoizer(Computable<A, V> c)
    {
        this(c, 16);
    }

    public Memoizer(Computable<A, V> c, int segments)
    {
        // the third constructor argument is the concurrency level, i.e. the
        // number of segments in the underlying ConcurrentHashMap
        this.cache = new ConcurrentHashMap<A, Future<V>>(16, 0.75f, segments);
        this.c = c;
    }

    public V compute(final A arg) throws InterruptedException
    {
        while (true) {
            Future<V> f = cache.get(arg);
            if (f == null) {
                Callable<V> eval = new Callable<V>() {
                    public V call() throws InterruptedException {
                        return c.compute(arg);
                    }
                };
                FutureTask<V> ft = new FutureTask<V>(eval);
                f = cache.putIfAbsent(arg, ft);
                if (f == null) {
                    // we won the race: run the task, which calls the underlying Computable
                    f = ft;
                    ft.run();
                }
            }
            try {
                return f.get();
            } catch (CancellationException e) {
                // the computation was cancelled; remove the Future so it can be retried
                cache.remove(arg, f);
            } catch (ExecutionException e) {
                throw LaunderThrowable.launderThrowable(e.getCause());
            }
        }
    }
}
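As a usage sketch (my own illustration, not part of the original listing; the names MemoizerDemo and slowFactorial are made up), here is how you might wrap an expensive Computable in a Memoizer so that each key is computed once, even under concurrent access:

import java.math.BigInteger;

public class MemoizerDemo {
    public static void main(String[] args) throws InterruptedException {
        Computable<Integer, BigInteger> slowFactorial = new Computable<Integer, BigInteger>() {
            public BigInteger compute(Integer n) throws InterruptedException {
                Thread.sleep(1000);           // stand-in for an expensive create
                BigInteger result = BigInteger.ONE;
                for (int i = 2; i <= n; i++) {
                    result = result.multiply(BigInteger.valueOf(i));
                }
                return result;
            }
        };

        Computable<Integer, BigInteger> cached =
                new Memoizer<Integer, BigInteger>(slowFactorial);

        // the first call for 20 pays the one-second cost; the second call (and any
        // concurrent caller using the same key) gets the memoized value instead
        System.out.println(cached.compute(20));
        System.out.println(cached.compute(20));
    }
}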

You will of course need the LaunderThrowable implementation to compile this example. For a full code listing, hop on over to Terracotta, where I show not only a full working example, but also demonstrate how it works with Terracotta across two nodes:

Full Memoizer Example with Terracotta
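If you just want something that compiles locally, a minimal LaunderThrowable along the lines of the version in Java Concurrency in Practice looks like this (my sketch, not the listing from the Terracotta example):

public class LaunderThrowable {
    /**
     * Coerce an unchecked Throwable to a RuntimeException: return it if it is a
     * RuntimeException, rethrow it if it is an Error, and treat anything else
     * (a checked exception that should have been impossible) as a logic error.
     */
    public static RuntimeException launderThrowable(Throwable t) {
        if (t instanceof RuntimeException) {
            return (RuntimeException) t;
        } else if (t instanceof Error) {
            throw (Error) t;
        } else {
            throw new IllegalStateException("Not unchecked", t);
        }
    }
}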

9 comments:

Khalil said...

A couple of comments: Does the Memoizer have a clear/reset method to clear the cache and allow values to be GCed? Couldn't the ConcurrentMap in the Memoizer implementation be a ConcurrentReferenceMap? Secondly, how would you deal with multiple arguments: varargs, or box them in an array-like structure with equals and hashCode support?

Cheers,
Khalil

Buddy Casino said...

I played around with the Memoizer once, and it lacks a mechanism to expire items and/or limit the number of entries.
Basically, it's just an example of a concurrent caching pattern done right, not a full-blown caching solution.

I don't feel confident implementing those missing features, as I'm pretty sure I'd screw up the concurrency issues in a subtle way. Has anyone else gotten around to doing that?

Alex Miller said...

This reminded me of this link (from Doug Lea) on lazy map initialization, which is a separate but related topic:

http://artisans-serverintellect-com.si-eioswww6.com/default.asp?W122

Taylor said...

@Khalil and buddy casino

I agree, as written there is no remove. I've thought it through, and delegating remove to the CHM should be just fine; in fact I had implemented remove in a version of the Memoizer I use (I elided it for simplicity here, and I suppose I shouldn't have).

So, remove would be just:

public Future<V> remove(final A arg)
{
    return cache.remove(arg);
}

mario.gleichmann said...

Nice post!

... but hmmm, did I miss something, or is your example not exactly the one from the book 'Java Concurrency in Practice'?

Greetings

Mario Gleichmann

Taylor said...

@mario,

No, you are right, it's not the same - I found a reference from the concurrency mailing list.

I've also begun to make my own modifications; as I noted, one of them is to control the # of segments in the backing CHM. Another, which I removed for this post, is the remove method.

Ben Manes said...

Coincidentally, I put up a decorator implementation last night due to a user request of my library. Combined, this provides an ad-hoc cache per Buddy's concern.
http://code.google.com/p/concurrentlinkedhashmap/wiki/SelfPopulatingCache

Buddy Casino said...

Thanks for sharing this, Ben! Didn't expect that, looks great.
