Tuesday, September 09, 2008

Cluster Deadlocks *ROCK* with Terracotta

Am I crazy or what?

No, not really. You see, I just happen to have seen more than one customer run into a cluster deadlock, and it turns out that solving the issue with Terracotta is awesome (actually, Terracotta can automatically detect it in an upcoming version, but shhh, don't tell anyone I told you that).

It's funny, really, because I keep hearing this dumb idea that clustered deadlocks with Terracotta are this really scary thing -- ooooh watch out for that complicated Terracotta thing, it uses LOCKING (oh gosh) and that can lead to CLUSTERED DEADLOCKS. Oh my. (Anyone know where I can get a clustered deadlock costume? It's almost Halloween!)

Really. It's like a bad rumor I keep hearing over and over again. What do they call it when people try to scare other people with rumors that aren't true .... F.....oh never mind. Here's why deadlocks truly are better with Terracotta:

First, what do we get with Terracotta?
- Kill a JVM, release its lock.
- Kill a JVM, don't lose your state

Why does that matter? Well what do you do when you see a deadlock with a regular Java application? Since it's pretty much hosed, you have to restart it (usually you probably debug the hell out of it first and try to fix the deadlock). But the app is hosed. Unless you happen to have coded a "stateless" app - you've also lost your app state. Bummer :(
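For reference, here is a minimal sketch of the plain-Java case described above: two threads in one JVM take the same two locks in opposite order and get stuck. No Terracotta is involved, and the class and method names (`LocalDeadlockDemo`, `startDeadlock`, `waitForDeadlockedThreads`) are made up for illustration. The only real API leaned on is the JDK's `ThreadMXBean.findDeadlockedThreads()`, which reports threads caught in a lock cycle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LocalDeadlockDemo {
    private static final Object LOCK_A = new Object();
    private static final Object LOCK_B = new Object();

    /** Starts two daemon threads that take the same two locks in opposite order. */
    public static void startDeadlock() {
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);
        Thread t1 = new Thread(() -> lockInOrder(LOCK_A, LOCK_B, bothHoldFirstLock), "worker-1");
        Thread t2 = new Thread(() -> lockInOrder(LOCK_B, LOCK_A, bothHoldFirstLock), "worker-2");
        // Daemon threads so the stuck workers do not keep the JVM alive.
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                // Wait until the other thread also holds its first lock.
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // Never reached: each thread waits for the lock the other holds.
            }
        }
    }

    /** Polls until the JVM reports deadlocked threads or the timeout expires; returns how many. */
    public static int waitForDeadlockedThreads(long timeoutMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline) {
            long[] ids = threads.findDeadlockedThreads();
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        startDeadlock();
        System.out.println("deadlocked threads: " + waitForDeadlockedThreads(5000));
    }
}
```

Once the JVM is in this state, nothing inside it will untangle the two workers; the usual way out is the restart described above.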

Well, not so with an app running on Terracotta. First of all, you don't have to kill the whole app. In fact, if you do actually get a clustered deadlock, you just have to kill one half of the deadlock (because the locks are released, get it?) and the other half will actually get to complete its operation. How do you do that? Well, since the app state is highly available, you can kill any node at will.

So it's simple to resolve a clustered deadlock with Terracotta - just do a rolling restart of every client JVM. That's it. When you hit one half of the deadlock, and kill that JVM, the lock that the other side of the clustered deadlock wants will be freed, and it will go on its merry way.
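To make the "kill one half" idea concrete, here is a toy model in plain Java. It is not Terracotta's API or its lock manager; `ClusterLockModel`, the node names, and the lock names are all invented for this sketch. It only assumes what the post states: when a client JVM dies, the locks it held are released and a waiting node can take them.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy model of a cluster-wide lock table (hypothetical, not a Terracotta class). */
public class ClusterLockModel {
    private final Map<String, String> holderByLock = new HashMap<>();
    private final Map<String, Queue<String>> waitersByLock = new HashMap<>();

    /** Grants the lock if it is free (or already held by this node); otherwise queues the node. */
    public boolean acquire(String node, String lock) {
        String holder = holderByLock.get(lock);
        if (holder == null) {
            holderByLock.put(lock, node);
            return true;
        }
        if (holder.equals(node)) {
            return true;
        }
        waitersByLock.computeIfAbsent(lock, k -> new ArrayDeque<>()).add(node);
        return false;
    }

    /** Returns the node currently holding the lock, or null if it is free. */
    public String holderOf(String lock) {
        return holderByLock.get(lock);
    }

    /** Simulates killing a client JVM: drop its pending waits, release its locks to the next waiter. */
    public void killNode(String node) {
        for (Queue<String> waiters : waitersByLock.values()) {
            waiters.removeIf(node::equals);
        }
        for (Map.Entry<String, String> entry : holderByLock.entrySet()) {
            if (node.equals(entry.getValue())) {
                Queue<String> waiters = waitersByLock.get(entry.getKey());
                // Hand the lock to the first waiter, or mark it free (null) if nobody waits.
                entry.setValue(waiters == null ? null : waiters.poll());
            }
        }
    }

    public static void main(String[] args) {
        ClusterLockModel cluster = new ClusterLockModel();
        cluster.acquire("nodeA", "accounts");
        cluster.acquire("nodeB", "orders");
        cluster.acquire("nodeA", "orders");   // nodeA now waits on nodeB
        cluster.acquire("nodeB", "accounts"); // nodeB now waits on nodeA: clustered deadlock

        cluster.killNode("nodeB");            // kill one half of the deadlock
        System.out.println("orders is now held by " + cluster.holderOf("orders"));
    }
}
```

In this model it does not matter which half you kill: whichever node survives ends up holding both locks and can finish its work.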

Now of course, you still need to debug the hell out of your app :). When you fix the app, just update it in place, do another rolling restart, and voila! Fixed deadlock with no downtime.

So to summarize, deadlocks with Java:
- Have to restart
- Lose app state
- Downtime BAD

Deadlocks with Terracotta (i.e. Clustered Deadlocks):
- Rolling restart of application nodes
- Preserve application state
- No downtime GOOD

15 comments:

Orion Letizi said...

I think all that time in Bulgaria has left Taylor a little punchy.

Unknown said...

Amazing post. Turn a showstopper into a feature and even paint a detection of the showstopper as a unique capability.

Can we not assume that clustering increases the chance of a deadlock that might never have appeared in production in a non-clustered environment? What is the typical duration of a lock in a clustered environment and how does this compare with a non-clustered environment?

Also does your failover mechanism handle side effects of in transit IO, JDBC, JMS ... state changes?

William

Taylor said...

@william - I don't think this post was for you. See the part about people spreading scary rumors.

In any case, there's no such thing as "might never have appeared" for a bug.

See http://www.osnews.com/story/19731/The-25-Year-Old-UNIX-Bug

Unknown said...

Rumor: "a currently circulating story or report of uncertain or doubtful truth"

Well if indeed it is all rumors then you and your company have wasted too much time talking and tackling something that is "uncertain".

What is truly scary is that you think you can simply just kill a JVM (more frequently) and not pay a penalty.

Taylor said...

@william - look - I entertained your initial post - clearly a troll. Now you've done it again. You're rapidly becoming a pest. I didn't start an argument with you - you started it with me.

Did I *claim* Terracotta has a unique capability? No, you did. Did I *claim* there is no penalty for killing a JVM? No, you did.

I guess you have something to get off your chest - so why don't you just admit that you don't like Terracotta, no matter what the technology, and quit acting like this is an honest discussion.

Unknown said...

Nonsense, I even had a management extension in the works for your product. It took a bit longer than I had initially planned because of the way your product instruments the classloader: it throws a wobbler with encrypted bytecode and only offers to ignore (filter) a hardcoded set of class name prefixes.

I do have a low tolerance for stupidity though and this takes the biscuit:

"So it's simple to resolve a clustered deadlock with Terracotta - just do a rolling restart of every client JVM. That's it. When you hit one half of the deadlock, and kill that JVM, the lock that the other side of the clustered deadlock wants will be freed, and it will go on its merry way."

Jay Meyer said...

Taylor is right - Terracotta provides a way out of deadlocks that *ROCKS* (that is better). You still gotta restart JVMs, but at least you can make that completely transparent to users and not lose data.

I see where William is headed with his post - but I disagree and he's missing 2 fundamental truths:

1) deadlocks happen, apps are not perfect, JVMs are not perfect, I need a way out of a deadlock, and I'll take the least-ugly way out. Sounds like terracotta is less-ugly

2) apps that become clustered for the first time WOULD indeed have "failed otherwise" under some other clustering solution.

Terracotta makes the clustering easier, and therefore makes a software engineer decide to cluster an app that he would not otherwise have clustered.

Unknown said...

Jay "deadlocks happen, apps are not perfect, JVMs are not perfect, I need a way out of a deadlock, and I'll take the least-ugly way out. Sounds like terracotta is less-ugly"

Not a fundamental truth but one worth bearing in mind is that operations people (I assume a developer is not managing this) spend most of their time in incident management mode and less in problem management mode.

Any agent of change that increases the rate of incidents will be kicked out of production rather than having the underlying issues addressed, unless of course there is simply no alternative and the agent of change has value (which is rarely the view of ops).

Whilst developers and architects like data sharing, operations want execution isolation. They do not want to have a cascade effect across a number of JVMs. One deadlock, one JVM to kill.

Note my main argument is that with the product you have a greater likelihood of deadlock occurrences, and the deadlock coverage is not confined to one node (in theory).

Is it not also the case that additional synchronization is occasionally added to map the ** transparent ** cluster state changes to the application transaction boundaries?

If this was a clustered hello world example then yes this rocks but it is not. State is not always maintained by a single product and state changes (transactions) do not necessarily map to one synchronized block/call.

2)apps that become clustered for the first time WOULD indeed have "failed otherwise" under some other clustering solution.

Not necessarily. There are many data cache/grid products that do not require the use of locks and they do not map monitor acquisition to locks. Where used it is explicit, which means you have the possibility of dealing with exceptions at boundaries. With a transparent solution god knows when it is going to happen. How does one code for this?

I know one or two frameworks that use a single thread to maintain an internal data structure and that never assumed another JVM running the same framework would create a similar thread accessing the exact same data structure.

Taylor said...

@william - this is fun. let's see....

"Is it not also the case that additional synchronization is occasionally added to map the ** transparent ** cluster state changes to the application transaction boundaries?"

Umm....no? See the parts about you making stuff up.

"Whilst developers and architects like data sharing, operations want execution isolation. They do not want to have a cascade effect across a number of JVMs. One deadlock, one JVM to kill."

That's a good one - first you say keeping a JVM alive is of paramount importance, now it's the fact that you have to do a rolling restart. Backpedaling? Perhaps I forgot to mention that Terracotta can do a cluster wide thread dump, which you can inspect, to determine the holders of the deadlock, by which you can kill one and only one offending JVM? Oh. Oops.

"Note my main argument is that with the product you have a greater likelihood of deadlock occurrences, and the deadlock coverage is not confined to one node (in theory)."

Ummm....are you serious? What part of "it's a bug" did you not get?

"If this was a clustered hello world example then yes this rocks but it is not. State is not always maintained by a single product and state changes (transactions) do not necessarily map to one synchronized block/call."

Backpedaling...

"Not necessarily. There are many data cache/grid products that do not require the use of locks and they do not map monitor acquisition to locks. Where used it is explicit, which means you have the possibility of dealing with exceptions at boundaries. With a transparent solution god knows when it is going to happen. How does one code for this?"

Data cache/grid products oh yes. You must mean the ones Terracotta runs circles around. When we do 10x to 100x the speed of those products, umm, you think there might be something wrong with oh I don't know - *their* programming model?

You might want to check this out:

http://softarc.blogspot.com/2008/08/book-review-definitive-guide-to.html

and this:

http://www.byteonic.com/2008/gnip-online-message-oriented-middleware-mom/

Which offer non-biased opinions of Terracotta transparency and Terracotta vs. "other" technologies respectively.

Taylor said...

Linking to the links:

http://softarc.blogspot.com/2008/08/book-review-definitive-guide-to.html

http://www.byteonic.com/2008/gnip-online-message-oriented-middleware-mom/

Jay Meyer said...

I don't believe William is being a malicious troll. But he's missing the point of Terracotta's goals. Of course there's a downside - but is it worse than the other products? No, except for some odd cases. But then nobody at terracottatech.com has claimed that Terracotta is a panacea (except maybe at trade shows;)

Other clustering products are incapable of locking or choose not to use it. Of course this prevents deadlocks, but at the cost of stale data or data clobbering. Those products have no hope of achieving serializable transactions for high frequency changes in data. This is a classic engineering trade-off and it's a well-known issue from years ago. Terracotta chooses to face the challenge of locking to prevent stale or clobbered data, whereas other products choose to prevent deadlocks at the cost of staleness and difficult APIs.

Nobody claimed that Terracotta was perfect, or that developers need not test their solution, either. You still have to make some good choices and check the quality of the app (that panacea thing still doesn't apply). But you have to admit that it's pretty cool to resolve a deadlock without affecting the user experience at all. And that is all Taylor was claiming.

Unknown said...

rolling restart?

Why? Do you assume incorrectly that there was no clustering? You can cluster without having deadlocks that span multiple JVMs. I am not backpedaling; you are just reading what you want to read or already know.

Taylor said...

@william,

Turns out the algorithm you propose for deadlocks has already been covered at Wikipedia:

Ostrich Algorithm
