Merge problem in GMS/MERGE2
===========================

Author: Bela Ban
Version: $Id: MergeProblem.txt,v 1.7 2005/12/27 12:57:58 belaban Exp $

Symptom: first merge works, subsequent merges don't work
Unit test: MergeStressTest

The problem is that when many members join, we increment the VID (ViewID) every time we multicast a new view.
Say we have subgroups {A,B,C} and {D,E}. The merge is now started by MERGE2, which detects 2 coordinators (A and D).
The VIDs are 4 for {A,B,C} and 6 for {D,E}.
The merge coordinator (A) therefore picks VID=7 (the highest VID plus 1) for the MergeView. It then asks both A and D
to multicast the new MergeView to their respective subgroups.
However, let's assume that in the meantime, other members have joined one of the 2 subgroups, so that the VIDs are now:

{A,B,C}: VID=9
{D,E}:   VID=6

When the MergeView is received, it will be rejected by {A,B,C} because its own VID of 9 is greater than the
MergeView's VID of 7. However, subgroup {D,E} will accept the MergeView because its own VID of 6 is smaller than the
MergeView's VID of 7 !

So now:
A, B and C have a membership of {A,B,C} with VID=9
but
D and E have a membership of {A,B,C,D,E} with VID=7 !
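
The scenario above can be replayed with a small model. This is only a sketch, not JGroups code: the Member class
and handle_view() are hypothetical names, and the rule that a view with an *equal* VID is also rejected is an
assumption (the text only covers the greater and smaller cases).

```python
from dataclasses import dataclass

@dataclass
class Member:
    name: str
    members: list   # current membership as seen by this member
    vid: int        # ID of the currently installed view

    def handle_view(self, new_vid, new_members):
        """Install the view only if it is newer than the current one."""
        if new_vid <= self.vid:      # assumption: equal VIDs are rejected too
            return False             # stale view: rejected
        self.vid, self.members = new_vid, list(new_members)
        return True

# VIDs at the time the merge coordinator (A) computes the MergeView.
vid_abc, vid_de = 4, 6
merge_vid = max(vid_abc, vid_de) + 1          # 7

# In the meantime {A,B,C} has installed further views and is now at VID=9;
# {D,E} is unchanged.
group = {n: Member(n, ["A", "B", "C"], 9) for n in "ABC"}
group.update({n: Member(n, ["D", "E"], vid_de) for n in "DE"})

merged = ["A", "B", "C", "D", "E"]
accepted = {n: m.handle_view(merge_vid, merged) for n, m in group.items()}

for n, m in group.items():
    print(n, "accepted" if accepted[n] else "rejected", m.members, "VID=%d" % m.vid)
```

A, B and C reject the MergeView and stay at VID=9, while D and E install it at VID=7, which is exactly the
inconsistent state described above.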

One of the consequences of this is that D will cease sending out MERGE2 requests, because it thinks it is no longer
the coordinator ! This means no further MERGE2 events reach the GMS layer, events which would otherwise have
triggered another merge, this time with a higher VID for the MergeView.
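
Why the merge cannot recover can be sketched as follows. The sketch assumes that the coordinator is the first
member of a member's installed view and that a merge is only triggered when more than one coordinator is
detected; is_coordinator() and detected_coordinators() are hypothetical helpers, not JGroups API.

```python
def is_coordinator(name, members):
    # Assumption: the coordinator is the first member of the installed view.
    return bool(members) and members[0] == name

def detected_coordinators(views):
    """views maps each member to the membership it currently has installed."""
    return sorted(n for n, members in views.items() if is_coordinator(n, members))

# Before the failed merge: two subgroups, two coordinators.
before = {
    "A": ["A", "B", "C"], "B": ["A", "B", "C"], "C": ["A", "B", "C"],
    "D": ["D", "E"], "E": ["D", "E"],
}

# After the failed merge: only D and E installed the MergeView.
merged = ["A", "B", "C", "D", "E"]
after = dict(before, D=merged, E=merged)

print(detected_coordinators(before))  # A and D both act as coordinators
print(detected_coordinators(after))   # only A is left, so no merge is triggered
```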

Possible SOLUTIONs:

#1 Pick a VID for MergeViews that is high enough to be accepted by all subgroups
#2 Suspend generating view changes while a merge is in progress
#3 All JOIN, LEAVE and MERGE requests need to be handled by ViewBroadcaster (rename this!), check out
   ViewHandling.txt for details

Comments:

- #1 doesn't work well because then all previously generated views will be discarded, but clients already
  *have* the views !

- We have to process VIEWS and MERGE requests in order. Possibly we need to flush the ViewBroadcaster and suspend
  view generation until the merge has been handled, then resume view generation

- Therefore #2 looks more promising: on merge() we flush the ViewBroadcaster (send out all pending views) on *all*
  coordinators involved. Then we process the merge, generate and mcast the new MergeView, then resume
  ViewBroadcaster. Probably JOIN and LEAVE requests need to be handled by ViewBroadcaster too
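
A minimal sketch of the flush-then-merge idea behind #2, under stated assumptions: the Coordinator class, its
methods and the merge() function are hypothetical, each pending view is modelled as bumping the VID by 1 when
it is sent out, and the MergeView VID is taken as the highest VID plus 1 *after* all coordinators have flushed.

```python
from collections import deque

class Coordinator:
    def __init__(self, name, vid, pending_views=0):
        self.name = name
        self.vid = vid
        self.pending = deque("view-change" for _ in range(pending_views))
        self.suspended = False

    def request_view_change(self):
        """Queue a view change unless view generation is suspended."""
        if self.suspended:
            return False
        self.pending.append("view-change")
        return True

    def flush(self):
        """Send out all pending views, then suspend view generation."""
        while self.pending:
            self.pending.popleft()
            self.vid += 1            # every view sent out gets a new VID
        self.suspended = True

    def install_merge_view(self, merge_vid):
        accepted = merge_vid > self.vid
        if accepted:
            self.vid = merge_vid
        self.suspended = False       # resume view generation
        return accepted

def merge(coordinators):
    for c in coordinators:
        c.flush()                    # no VID can change after this point
    merge_vid = max(c.vid for c in coordinators) + 1
    return merge_vid, [c.install_merge_view(merge_vid) for c in coordinators]

a = Coordinator("A", vid=4, pending_views=5)   # 5 views still queued at A
d = Coordinator("D", vid=6)
merge_vid, results = merge([a, d])
print(merge_vid, results)
```

Because the VID is chosen only after every coordinator has flushed and suspended, the MergeView is newer than
anything the subgroups have installed, and both coordinators accept it.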


Design:
-------

On MERGE_REQ:
- suspendViewHandler()

On CANCEL_MERGE:
- resumeViewHandler()

On firing of MergeCanceller:
- If MergeCanceller's merge_id == current merge_id:
  - resumeViewHandler()

On INSTALL_MERGE_VIEW:
- Install view
- resumeViewHandler()

suspendViewHandler():
- Complete current task, remove all other requests from queue
- Make view handler discard all new requests from now on
- Start MergeCanceller (if not started yet)

resumeViewHandler():
- Stop MergeCanceller
- Make view handler accept all requests from now on
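
The design above can be sketched as a small state machine. This is a single-threaded model under stated
assumptions, not the actual GMS code: the ViewHandler class and its method names are hypothetical, the head of
the queue stands in for the "current task", and the MergeCanceller is modelled as a stored merge_id whose
firing is delivered as an explicit call rather than by a timer.

```python
class ViewHandler:
    def __init__(self):
        self.queue = []            # pending requests (e.g. JOIN, LEAVE)
        self.suspended = False
        self.merge_id = None       # ID of the merge in progress, if any
        self.canceller_id = None   # merge_id the MergeCanceller was started for
        self.view = None

    def add(self, request):
        if self.suspended:
            return False           # new requests are discarded during a merge
        self.queue.append(request)
        return True

    def suspend(self, merge_id):
        del self.queue[1:]         # keep the current task, drop all other requests
        self.suspended = True
        self.merge_id = merge_id
        if self.canceller_id is None:      # start MergeCanceller if not started yet
            self.canceller_id = merge_id

    def resume(self):
        self.canceller_id = None   # stop MergeCanceller
        self.merge_id = None
        self.suspended = False     # accept requests again

    # Event handlers, one per event in the design
    def on_merge_req(self, merge_id):
        self.suspend(merge_id)

    def on_cancel_merge(self):
        self.resume()

    def on_merge_canceller_fired(self, merge_id):
        if merge_id != self.merge_id:
            return False           # canceller belongs to another merge: ignore
        self.resume()
        return True

    def on_install_merge_view(self, view):
        self.view = view           # install view
        self.resume()

h = ViewHandler()
h.add("JOIN F")
h.add("JOIN G")
h.on_merge_req("merge-1")
print(h.queue, h.add("LEAVE B"))
h.on_install_merge_view(["A", "B", "C", "D", "E"])
print(h.suspended, h.add("JOIN H"))
```

Checking the merge_id when the MergeCanceller fires keeps a canceller left over from an earlier merge from
resuming the view handler in the middle of a newer merge.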