Tuesday, September 26, 2006

Let's go distributed: how to build a parallel web spider in Java...

To combat the decline in Moore's law, CPU, OS, and application developers are beginning to have to go distributed. The only problem is, when developers think distributed programming, they think hard.

Today's parallel programming models don't help very much, especially if you're a Java developer. Java itself supports concurrent programming, which is great, but most application developers don't use it. To go distributed in our own apps, traditionally we Java programmers have had to go learn JDBC (which is overkill for transient data), or JMS (but we don't get coherency, and we now have to learn serialization), or JavaSpaces, or JBoss cache, and on and on.

There has to be an easier way!

Fortunately there is. With Terracotta, we can write 100% pure Java and let Terracotta do the heavy lifting for us. Over at the Server Side, Jonas Bonér describes the Master/Worker pattern using Terracotta. That pattern will help us write the parallel web spider, so I decided to implement the pattern and make it freely available here - as in speech and beer - along with the full source code and binary distributions of the parallel web spider project. I wrote both using Maven to help get you going as fast as possible.

Sidenote: while Jonas uses Spring in his example, I am using a pure POJO implementation. The concepts are the same, and anyone using my example can pull it into Spring with zero effort.

Getting back to that parallel web spider...

First, let's describe what the spider does in plain english:

"Take an input url, retrieve the page data for that url, parse the links in the page, and repeat the process until we decide to stop".

Sounds pretty simple. Let's break it down into an algorithm:

  1. Begin with the current URL and a stop depth
  2. Retrieve the page data for the current URL, return a list of URLs which represent links in the page
  3. For each URL, if its depth is less than the stop-depth, push it onto a (FIFO) list of URLs to process
  4. Remove the top URL from the list, and repeat at Step 2.
The parallel version of this breaks down into Master/Worker pretty easily. Steps 1, 3, and 4 constitute the "Master" side of the Master/Worker pattern, and Step 2 constitutes the "Worker" side of the Master/Worker pattern. Not only that, but as Jonas has shown us, in Java 1.5 these concepts are already implemented in the form of the java.util.concurrent package (formerly the oswego concurrent package).

There is one caveat, however. The ThreadPoolExecutor is already done for us, but it implements both sides of the Master/Worker pattern. In the real world we want our Masters and Workers to be separate, so we need something else.

Fortunately, the concurrent package helps us break the symmetry. The Master side of the equation is just a concrete implementation of a CompletionService, and the Worker side is just a concrete implementation of a Callable.

Since we are thinking in terms of Masters and Workers, not CompletionServices and Callables, I put conveniently named wrappers into the Master/Worker library.

Using the Master/Worker pattern is now just a simple matter of writing the Work tasks, running the Master in one VM and running the Workers in the other VM(s). The real magic here is the simplicity of it all. The underlying communication mechanism is just a POJO - in this case the LinkedBlockingQueue held by the Master. Making that POJO operate across VMs is Terracotta's job, not yours; leaving you free to focus on the business case at hand.

Now we are ready to implement the parallel web spider using the Master/Worker library.

First, we need to write the Spider Master. The Spider Master will be executing as a standalone application. In these cases, I like to implement my "main" logic using the Runnable interface in a separate class. This allows me to have a start class that instantiates my "main" class and invokes the run method from Runnable. Having my main class be Runnable allows me to start it in a separate Thread, or run it on the main thread just by changing the composition, not the code.

Thus the SpiderMaster class looks something like:

public class SpiderMaster implements Runnable
{
/**
* The Master is going to be a cluster-wide singleton. We need to
* share it as a root in DSO so separate VMs see the same
* instance.
*/
public static Master<List<Pagelink>> manager =
new Master<List<Pagelink>>();
...

public void run()
{
CompletionService<List<Pagelink>> service =
manager.getCompletionService();

service.submit(new SpiderPageTask(1, startUrl));
toDo++;
while (processed != toDo) {
future = service.take();
processed++;

for (PageLink link : future.get()) {
service.submit(new SpiderPageTask(link.getDepth(),
link.getUrl()));
toDo++;
}
}
}
}


That's it. We need to implement the unit of work that the Workers will execute. Remember that the unit of work is just a Callable, so we just need a class that implements Callable. Here is our SpiderPageTask:

public class SpiderPageTask implements Callable<List<Pagelink>>
{
...
public List<PageLink> call() throws Exception
{
List<PageLink> links = new ArrayList<PageLink>();
Source source = new Source(url);
List<StartTag> tags = source.findAllStartTags();

for (StartTag tag : tags) {
if (tag.getName().equals("a")) {
links.add(new PageLink(depth+1,
tag.getAttributeValue("href"));
continue;
}
if (tag.getName().equals("frame")) {
links.add(new PageLink(depth+1,
tag.getAttributeValue("src"));
continue;
}
}

return links;
}
}


All that is left is to start the SpiderMaster and the Worker threads and cluster the Master. This is done by setting the Master object in the SpiderMaster class as a Terracotta "root". You can think of a root as a cluster-wide singleton. All object instances in the cluster will see the exact same root object. (This is a somewhat foreign concept to grasp when you are new to Terracotta. The new Master(...) seems confusing - why doesn't everyone see their own version of the object if everyone does a new?. What really happens is that the first instance to execute -- in the cluster -- really does perform a new. Afterwards, everyone else just assigns the root field from the cluster wide instance).

So, to share the root, edit the Terracotta config file and add the root like so:

<roots>
<root>
<field-name>org.terracotta.project.pspider.SpiderMaster.manager</field-name>
</root>
</roots>


We need a pair of main() methods to launch our Master and Workers:

public class StartMaster
{
public static final int MAX_DEPTH = 3;

public static void main(String[] args)
{
... (parse arguments)
new SpiderMaster(maxDepth, url).run();
}
}

public class StartWorkers
{
public static void main(String[] args) throws InterruptedException
{
ThreadPoolExecutor executor =
new ThreadPoolExecutor(2,
2,
300L,
TimeUnit.SECONDS,
SpiderMaster.manager.getExecutorQueue());
new Workers(executor).run();
}
}


And we are done!

Note: This blog is somewhat out of date. You can find a working example in the "Open Data Grid" project on Terracotta

References

Terracotta
Java

6 comments:

Anonymous said...

不動産 婚活 結婚 会社設立 広島 不動産 尾道 賃貸 工事担任者 探偵 大阪 仙台 不動産 大阪 不動産 名古屋 不動産 福岡 不動産 京都 不動産 埼玉 不動産 静岡 不動産 神戸 不動産 浜松 不動産 堺市 不動産 川崎市 不動産 相模原市 不動産 姫路 不動産 明石 不動産 鹿児島 不動産 北九州市 不動産 熊本 不動産 広島 不動産 高松 不動産 徳島 不動産 松山市 不動産 高知 不動産 岡山 不動産

Anonymous said...

Do you like playing the game where you need to use flyff penya, when you do not have flyff money, you must borrow flyff gold from friends, or you buy flyff penya. If you get cheap penya, you can continue this game.
Do you like playing the game where you need to use wonderland Gold, when you do not have wonderland online Gold, you must borrow wonderland money from friends, or you
buy wonderland Gold. If you get cheap wonderland online Gold, you can continue this game.

Anonymous said...

Without hesitate, I bought second life linden , in the game I can find myself. I feel lonely, but I do not want to talk with anyone, so I buy lindens . At present, think the happy day I spend in knight, I am eager to enter it, and cheap linden . Own linden dollars , it means that you own the life of happiness. So I will not leave secondlife money . It is the origin of the happiness.

Anonymous said...

There are ed hardy shirts
,pretty ed hardy shirt for men, ed hardy womens in the ed hardy online store designed by ed hardy ,many cheap ed hardy shirt ,glasses,caps,trouers ed hardy shirts on sale ,
You can go to edhardyshirts.com to have a look ,you may find one of ed hardy clothing fit for you
http://straighthairgirl.blog126.fc2.com
http://www.seriousblogging.com/crazygirlsshirts
http://www.free-blog-site.com/iammyself
http://d.hatena.ne.jp/hotfishing

puma mens shoes
nike air max ltd
NIKE air shoes

nike max ltd
orange converse
adidas shoes
nike shoes
puma shoes
addidas shoes
cheap converse shoes
cheap nike shoes
cheap puma shoes

Anonymous said...

ctgouyv,

janet li said...

Hello!It is my first time visit your website.It is nice,just like our website.Yes,we are the cheapest ghd outlet.All of the Discount GHD Straighteners on sale here are high quality,and free shipping.I don't know whether you have already have a flat iron,we here have theghd uk for your choice.Australia is our main market as well.ghd australia also on sale.If you want to have a differdent hairstyle with less money,then come to our website,we have the confidence that we will bring you a nice mood everyday.Pick your beauty now,It is suitable for both men and women.Come on,your beauty is in our website waits for you.what's more,with a nice hairstyle,i think you will also need a nice christian louboutin clearance sale and Ghd Australia to match your hairstyle.Ghd Australia sincerely for your service!
At last,best flat iron thanks for sharing your articles.May you happy everyday.
ghd uk
ghd uk