Tuesday, September 26, 2006

Let's go distributed: how to build a parallel web spider in Java...

To combat the decline in Moore's law, CPU, OS, and application developers are beginning to have to go distributed. The only problem is, when developers think distributed programming, they think hard.

Today's parallel programming models don't help very much, especially if you're a Java developer. Java itself supports concurrent programming, which is great, but most application developers don't use it. To go distributed in our own apps, traditionally we Java programmers have had to go learn JDBC (which is overkill for transient data), or JMS (but we don't get coherency, and we now have to learn serialization), or JavaSpaces, or JBoss cache, and on and on.

There has to be an easier way!

Fortunately there is. With Terracotta, we can write 100% pure Java and let Terracotta do the heavy lifting for us. Over at the Server Side, Jonas Bonér describes the Master/Worker pattern using Terracotta. That pattern will help us write the parallel web spider, so I decided to implement the pattern and make it freely available here - as in speech and beer - along with the full source code and binary distributions of the parallel web spider project. I wrote both using Maven to help get you going as fast as possible.

Sidenote: while Jonas uses Spring in his example, I am using a pure POJO implementation. The concepts are the same, and anyone using my example can pull it into Spring with zero effort.

Getting back to that parallel web spider...

First, let's describe what the spider does in plain english:

"Take an input url, retrieve the page data for that url, parse the links in the page, and repeat the process until we decide to stop".

Sounds pretty simple. Let's break it down into an algorithm:

  1. Begin with the current URL and a stop depth
  2. Retrieve the page data for the current URL, return a list of URLs which represent links in the page
  3. For each URL, if its depth is less than the stop-depth, push it onto a (FIFO) list of URLs to process
  4. Remove the top URL from the list, and repeat at Step 2.
The parallel version of this breaks down into Master/Worker pretty easily. Steps 1, 3, and 4 constitute the "Master" side of the Master/Worker pattern, and Step 2 constitutes the "Worker" side of the Master/Worker pattern. Not only that, but as Jonas has shown us, in Java 1.5 these concepts are already implemented in the form of the java.util.concurrent package (formerly the oswego concurrent package).

There is one caveat, however. The ThreadPoolExecutor is already done for us, but it implements both sides of the Master/Worker pattern. In the real world we want our Masters and Workers to be separate, so we need something else.

Fortunately, the concurrent package helps us break the symmetry. The Master side of the equation is just a concrete implementation of a CompletionService, and the Worker side is just a concrete implementation of a Callable.

Since we are thinking in terms of Masters and Workers, not CompletionServices and Callables, I put conveniently named wrappers into the Master/Worker library.

Using the Master/Worker pattern is now just a simple matter of writing the Work tasks, running the Master in one VM and running the Workers in the other VM(s). The real magic here is the simplicity of it all. The underlying communication mechanism is just a POJO - in this case the LinkedBlockingQueue held by the Master. Making that POJO operate across VMs is Terracotta's job, not yours; leaving you free to focus on the business case at hand.

Now we are ready to implement the parallel web spider using the Master/Worker library.

First, we need to write the Spider Master. The Spider Master will be executing as a standalone application. In these cases, I like to implement my "main" logic using the Runnable interface in a separate class. This allows me to have a start class that instantiates my "main" class and invokes the run method from Runnable. Having my main class be Runnable allows me to start it in a separate Thread, or run it on the main thread just by changing the composition, not the code.

Thus the SpiderMaster class looks something like:

public class SpiderMaster implements Runnable
{
/**
* The Master is going to be a cluster-wide singleton. We need to
* share it as a root in DSO so separate VMs see the same
* instance.
*/
public static Master<List<Pagelink>> manager =
new Master<List<Pagelink>>();
...

public void run()
{
CompletionService<List<Pagelink>> service =
manager.getCompletionService();

service.submit(new SpiderPageTask(1, startUrl));
toDo++;
while (processed != toDo) {
future = service.take();
processed++;

for (PageLink link : future.get()) {
service.submit(new SpiderPageTask(link.getDepth(),
link.getUrl()));
toDo++;
}
}
}
}


That's it. We need to implement the unit of work that the Workers will execute. Remember that the unit of work is just a Callable, so we just need a class that implements Callable. Here is our SpiderPageTask:

public class SpiderPageTask implements Callable<List<Pagelink>>
{
...
public List<PageLink> call() throws Exception
{
List<PageLink> links = new ArrayList<PageLink>();
Source source = new Source(url);
List<StartTag> tags = source.findAllStartTags();

for (StartTag tag : tags) {
if (tag.getName().equals("a")) {
links.add(new PageLink(depth+1,
tag.getAttributeValue("href"));
continue;
}
if (tag.getName().equals("frame")) {
links.add(new PageLink(depth+1,
tag.getAttributeValue("src"));
continue;
}
}

return links;
}
}


All that is left is to start the SpiderMaster and the Worker threads and cluster the Master. This is done by setting the Master object in the SpiderMaster class as a Terracotta "root". You can think of a root as a cluster-wide singleton. All object instances in the cluster will see the exact same root object. (This is a somewhat foreign concept to grasp when you are new to Terracotta. The new Master(...) seems confusing - why doesn't everyone see their own version of the object if everyone does a new?. What really happens is that the first instance to execute -- in the cluster -- really does perform a new. Afterwards, everyone else just assigns the root field from the cluster wide instance).

So, to share the root, edit the Terracotta config file and add the root like so:

<roots>
<root>
<field-name>org.terracotta.project.pspider.SpiderMaster.manager</field-name>
</root>
</roots>


We need a pair of main() methods to launch our Master and Workers:

public class StartMaster
{
public static final int MAX_DEPTH = 3;

public static void main(String[] args)
{
... (parse arguments)
new SpiderMaster(maxDepth, url).run();
}
}

public class StartWorkers
{
public static void main(String[] args) throws InterruptedException
{
ThreadPoolExecutor executor =
new ThreadPoolExecutor(2,
2,
300L,
TimeUnit.SECONDS,
SpiderMaster.manager.getExecutorQueue());
new Workers(executor).run();
}
}


And we are done!

Note: This blog is somewhat out of date. You can find a working example in the "Open Data Grid" project on Terracotta

References

Terracotta
Java

Research it first...

If there's something I've learned the hard way more than once, it's that if you've thought of something, chances are good, in fact they're better than good, that someone else has beat you to it.

I'm not saying you might not have thought of something original, it's entirely possible, but you'll never know until you look it up first.

I've got some good links on the right side of my blog, so go there if you haven't found anything yet, but if you're like the rest of the world these days, you can just use google and that will suffice 9 times out of 10.

To reiterate, whether you've thought up a new gadget, a new algorithm, or a whole new pattern, I recommend looking for it first on the web. Think about what you've invented, and think about it hard. Not everyone names things the same so try to figure out the core of what your ivention is, and try to think up at least 3 or 4 names for it. Now go out and search for it!

Chances are you'll find someone else has already put some thought into it, and heck, maybe there's even an OSS project around it to get you started.

There's nothing worse than NIH syndrome, so do your research first and spend your time advancing the state-of-the-art, not re-stating what's already been done.

First Post

Thanks to my new position at Terracotta, I've headed back into the world of Java. I started coding Java right out of college, way back in 1996, so I'm no stranger.

Over the years I've collected a lot of experience in technology and coding, and have some thoughts on the subject. While this Blog is focused on Java in particular, many strategies that work in Java work in most other languages, so you will definitely find some cross-cutting concerns.

You'll notice some good links on the right hand side of my blog - these are giants that deserve respect for their on-going contributions to the community in general.

Thanks for visiting, welcome aboard -- I hope we both enjoy the ride...