OpenDLM Recovery and Hierarchical State Machine

Copyright 2004 The OpenDLM Project

Author:  Ben Cahill (bc), ben.m.cahill@intel.com


1. Introduction
----------------

This document contains details of the OpenDLM cluster-wide lock state
recovery mechanism, which is invoked any time there is a change in
membership in the cluster.  It includes a discussion of the Hierarchical
State Machine (HSM) mechanism used for recovery.

This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of cluster recovery in OpenDLM.

This document is not intended as a user guide to OpenDLM.  Look in the
OpenDLM WHATIS-odlm document for an overview of OpenDLM.  Look in the
OpenDLM HOWTO-install for details of configuring and setting up OpenDLM.

This document may contain inaccurate statements, based on the author's
limited understanding.  Please contact the author (bc) if you see anything
wrong or unclear.

1.1. Terminology
----------------

"Client" may mean the Recovery State Machine, which uses (is the only
client of) the generic Hierarchical State Machine mechanism described
herein.  "Client-specific messages" and "client-defined handlers" apply to
this definition.

Alternatively, "client" may mean a user of OpenDLM (a client process, or a
struct client that represents the client process), that is, an application
or kernel code that requests locks.

Also, see the ogfs-glossary document.

2. Recovery Overview
---------------------

Since OpenDLM distributes the storage of lock information among all of the
computers in a cluster, this distribution must be adjusted each time a
computer enters or leaves the cluster, for whatever reason (boot, graceful
shutdown, crash, etc.).  Whenever membership changes, each and every active
node runs the recovery mechanism.  This mechanism re-distributes or deletes
resource directory information and lock master information as the
membership grows or shrinks.

When a node joins the cluster, the recovery mechanism re-distributes any
existing directory entries.  This is vital for two reasons:

-- so that the new node (actually, each node, new or not) carries its fair
   share of the directories,

-- *and* (even more important) so that each directory entry can be found by
   a querying node.

The distribution of the directory entries is based on a formula including a
hash function of each resource name, the number of active nodes in the
cluster, and the node ID of each computer.  When a node needs directory
information, it determines the node to query, based on that formula.  The
directory *must* reside on the proper node, and that node may change as
nodes enter/exit the cluster.

When a node leaves a cluster, the recovery mechanism re-distributes
directory information among the surviving nodes, and may reconstruct
directory entries that were lost on the exiting node (if the locks were
mastered on surviving nodes), or will simply forget about directories that
no surviving node knows or cares about.

(is this accurate? -- discuss graceful exit vs crash?)

(discuss redistribution of non-directory lock information)
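To make the formula above concrete, here is a minimal sketch of how a
resource name plus the current membership can pick a unique directory node.
This is illustrative only, not the actual OpenDLM code; the real placement
logic lives in clm_direct.c (see hashfunc()) and differs in detail:

  #include <stdio.h>
  #include <string.h>

  /* Hypothetical stand-in for OpenDLM's hashfunc(): any hash that every
   * node computes identically from the resource name will do. */
  static unsigned int hash_name(const char *name, size_t len)
  {
      unsigned int h = 5381;

      while (len--)
          h = (h * 33) ^ (unsigned char)*name++;
      return h;
  }

  /* Map a resource name onto one of the currently active node IDs.  Since
   * every node uses the same hash, the same name, and the same membership
   * list, all nodes agree on which node holds the directory entry -- and
   * the answer changes whenever membership changes. */
  static int directory_node(const char *name, size_t len,
                            const int *active_nodes, int n_active)
  {
      return active_nodes[hash_name(name, len) % n_active];
  }

  int main(void)
  {
      int before[] = { 1, 2, 3, 4 };   /* membership before node 3 dies */
      int after[]  = { 1, 2, 4 };      /* membership after node 3 dies  */
      const char *res = "some-resource-name";

      printf("directory node before: %d\n",
             directory_node(res, strlen(res), before, 4));
      printf("directory node after:  %d\n",
             directory_node(res, strlen(res), after, 3));
      return 0;
  }

The point of recovery, then, is to move each directory entry to whichever
node the new formula selects, so that lookups keep working after the
membership change.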
3. Hierarchical State Machine (HSM)
------------------------------------

OpenDLM contains an implementation of a generic hierarchical state machine
(HSM).  The HSM works by sending "messages" to "state handlers".  A state
handler may do any or all of the following:

-- perform an action,

-- send another message to itself,

-- request a transition to a new state.

The messages may originate from inside the HSM itself (only when
transitioning states), or from client-specific code.  Client code includes
the state handlers, as well as any other code that sends messages to
stimulate the behavior of the state machine.

A client (e.g. lock state recovery) of the HSM must define:

1) a state tree,

2) handler functions for each state within the tree, and

3) client-specific messages to drive the state machine from the outside

Currently, within OpenDLM, the lock state recovery mechanism is the only
client of the HSM.  For examples while you read about the HSM, see the
following:

1) state tree:  dlm_recover.h, dlm_recov_states[] array.  Each state
   definition includes a "parent" or "super" state, a state handler
   function name, and a state name.

2) state handlers:  dlm_recover.c, functions beginning with "clm_st_" or
   "rc_" are state handler functions.  "clm" relates to cluster management,
   and "rc" relates to lock state recovery.  The two are tightly related.

3) messages:  dlm_recover.h, an enum that starts with "CLM_MSG_NODE_INIT".

For detailed discussion of the OpenDLM Recovery State Machine (the only
client of the HSM), see sections 4 and 5 (and their subsections) within
this document.

3.1 State Tree Behavior
------------------------

The HSM allows for a given state to have sub-states.  The arrangement of
states and sub-states creates a tree, with a single all-encompassing "top"
state, and branches hanging down from there.  The tree is defined
separately by each client of the HSM.  The HSM enters states by descending
down the tree, and exits states by ascending up the tree.

Consider the following tree:

          TOP
         /   \
        A     B
            / | \
           1  2  3

To get from TOP to state 3, the state machine must:

-- descend and enter state B, then

-- (descend and) enter state 3.

To get from state 3 to state 1, it must:

-- ascend and exit state 3, then

-- enter state 1 by way of the "least common ancestor" (LCA), state B.  It
   neither enters nor exits state B during this transition; it is always
   considered within state B, merely transitioning between two sub-states
   of state B.

To get from state 1 to state A, it must:

-- (ascend and) exit state 1,

-- exit state B, then

-- enter state A by way of the LCA state, TOP.

3.2 State Handlers and HSM Internal Messages
---------------------------------------------

As the HSM traverses the tree, it sends "enter" or "exit" messages to the
handler functions.  These two messages are defined by the HSM, and are not
client-specific.  In our TOP -> state 3 example, above, the HSM would send
"enter" messages to the handlers for states B and 3.  In the 3 -> 1
example, it would send "exit" to state 3, and "enter" to state 1.  Within
many of the state handler functions defined by the recovery client,
enter/exit messages are treated as no-ops.  See dlm_recover.c, functions
that begin with "clm_st_" and "rc_".

When the HSM reaches a target/destination state, it sends a "start" message
to the state handler.  This message is also defined by the HSM, not the
client.  In the TOP -> 3 example, the HSM would send "start" to state 3
(but not state B, since that was not the target state).

Within the recovery state handlers, "start" often results in an action
and/or a transition to a new target state.  The HSM responds immediately,
resulting in more "enter" or "exit" messages while traversing, and a new
"start" message to the new target's state handler.  This can potentially
iterate many times as the HSM and handlers autonomously sequence through a
series of states.
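As an illustration of how such a tree can be encoded and traversed, the
example tree above can be expressed as an array of states with parent
links, and a transition path computed by ascending to the least common
ancestor.  This is a sketch only; the real table is the dlm_recov_states[]
array in dlm_recover.h, and its layout differs (it also names a handler
function for each state):

  #include <stdio.h>

  enum { ST_TOP, ST_A, ST_B, ST_1, ST_2, ST_3, N_STATES };

  /* Hypothetical state descriptor: a parent link plus a name. */
  struct state {
      int parent;              /* index of super-state, -1 for TOP */
      const char *name;
  };

  static const struct state tree[N_STATES] = {
      [ST_TOP] = { -1,     "TOP" },
      [ST_A]   = { ST_TOP, "A"   },
      [ST_B]   = { ST_TOP, "B"   },
      [ST_1]   = { ST_B,   "1"   },
      [ST_2]   = { ST_B,   "2"   },
      [ST_3]   = { ST_B,   "3"   },
  };

  static int depth(int s)
  {
      int d = 0;

      while (tree[s].parent >= 0) {
          s = tree[s].parent;
          d++;
      }
      return d;
  }

  /* Print the exit/enter sequence for a transition: ascend from the
   * current state to the least common ancestor (never exiting the LCA
   * itself), then descend into the target. */
  static void transition(int from, int to)
  {
      int path[N_STATES], n = 0, i;

      while (depth(from) > depth(to)) {       /* ascend the deeper side */
          printf("exit  %s\n", tree[from].name);
          from = tree[from].parent;
      }
      while (depth(to) > depth(from)) {
          path[n++] = to;
          to = tree[to].parent;
      }
      while (from != to) {                    /* ascend both until the LCA */
          printf("exit  %s\n", tree[from].name);
          from = tree[from].parent;
          path[n++] = to;
          to = tree[to].parent;
      }
      for (i = n - 1; i >= 0; i--)            /* descend into the target */
          printf("enter %s\n", tree[path[i]].name);
  }

  int main(void)
  {
      printf("-- TOP to 3:\n");  transition(ST_TOP, ST_3);
      printf("-- 3 to 1:\n");    transition(ST_3, ST_1);
      printf("-- 1 to A:\n");    transition(ST_1, ST_A);
      return 0;
  }

Running this prints exactly the enter/exit sequences listed in section 3.1
(enter B, enter 3; exit 3, enter 1; exit 1, exit B, enter A); the HSM would
additionally send "start" to the target state, as described above.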
3.3 Client-defined Messages
----------------------------

In addition to the HSM-internal enter/exit/start messages, the state
machine responds to messages from sources outside the state machine.  These
are the client-defined messages that drive the HSM to start various
autonomous sequences through the client-defined state tree.  See section 4
for information on the client-defined messages for the OpenDLM Recovery
State Machine.

3.3.1 Parents and Children
---------------------------

The hierarchy allows a parent/super state to handle any messages that would
be handled identically by some or all of its child states.  For example,
consider a situation in which 3 children would handle a message a certain
way, but one other child would handle the message differently.  The
parent/super state could handle the message for the 3, each of which would
simply ignore the message, causing it to be passed up to the parent, as
described below.  The one child with the different behavior would process
the message in its unique way, without the parent seeing the message.

To support the parent/super handling, the HSM executes the following
procedure.  The HSM attempts to deliver the message to the handler for the
current state.  However, if that handler doesn't process the message, the
handler returns the message to the HSM.  The HSM then tries the current
state's parent state (super-state) handler, then its parent, iterating to
the top of the tree until it finds a handler (the encompassing parent
handler) that will process the message (on behalf of one of its children).

3.4 HSM Implementation
-----------------------

The implementation of the HSM is in src/kernel/dlmdk/hsm.c and hsm.h.

The HSM engine waits for client-specific messages to be put on a message
queue via the hsm_process_event() function.  This is one of only two
functions that are called by code outside of the HSM or the state handlers.
The other public function is hsm_start() (see below).

When pulling messages off of the queue, and delivering them to the state
handlers, the dialog between the HSM engine and the state handlers is
synchronous.  This includes any processing done by, and any state changes
requested by, the current state's handler (these result in fairly immediate
state changes).

If a state handler sends a message to itself, that message gets placed on
the incoming message queue by hsm_process_event().  This is an asynchronous
operation; that message will get processed by the HSM only after all other
messages in front of it have been processed.  As an example, this means
that if a handler sends "itself" a message, then requests a state change,
the state change will occur *before* the message gets processed.

Important functions and macros are described in the next 3 sections:

3.4.1 Public HSM functions
---------------------------

Only two calls are used for driving the HSM from the "outside", that is,
from code not within the HSM engine itself.  hsm_process_event() may be
called from any client-specific code, whether within state handlers or not.

hsm_start() -- Starts the given state machine, putting it into TOP state.

hsm_process_event() -- This is the main entry point for driving the state
   machine from the outside world or from within the state handlers.  It
   adds a new message to the tail of the state machine's event queue, and
   processes the queue to deliver the messages to the state handlers.  It
   will *not* execute the delivery loop if this or another process/CPU is
   already in the delivery loop driving the state machine.  As an example,
   a given process will already be in the delivery loop when it receives a
   message from a state handler (being driven by the delivery loop).  The
   message will eventually get processed, however, because the loop will
   keep delivering until the queue is empty.

   In some cases, then, processing of the message may be synchronous
   (within the hsm_process_event() call), while in other cases it will not
   be (messages sent from within the state handlers are never processed
   synchronously).  Note that there is no indication to the caller as to
   whether the message was processed before returning to the caller.  If
   the caller needs to know, it must look at global state variables to
   determine the current state.
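A minimal sketch of this queue-and-deliver behavior (illustrative only; the
real hsm_process_event() in hsm.c keeps its queue and locking inside the
state machine structure): messages are appended to a FIFO, and only a
caller that finds the machine idle drains the queue, so a message posted
from inside a handler is delivered later by the loop that is already
running.

  #include <stdio.h>

  #define QLEN 16

  struct hsm {
      int queue[QLEN];             /* pending client-defined messages */
      int head, tail;
      int busy;                    /* non-zero while a delivery loop runs */
      void (*handler)(struct hsm *, int msg);
  };

  /* Hypothetical equivalent of hsm_process_event(): always enqueue; only
   * drain the queue if no delivery loop is already running.  A message
   * posted from inside a handler is therefore processed asynchronously,
   * after everything queued ahead of it. */
  static void hsm_post(struct hsm *m, int msg)
  {
      m->queue[m->tail++ % QLEN] = msg;

      if (m->busy)
          return;                  /* the running loop will get to it */

      m->busy = 1;
      while (m->head != m->tail)
          m->handler(m, m->queue[m->head++ % QLEN]);
      m->busy = 0;
  }

  static void demo_handler(struct hsm *m, int msg)
  {
      printf("handling message %d\n", msg);
      if (msg == 1)
          hsm_post(m, 2);          /* self-message: queued, not nested */
  }

  int main(void)
  {
      struct hsm m = { .handler = demo_handler };

      hsm_post(&m, 1);             /* prints message 1, then message 2 */
      return 0;
  }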
3.4.2 Private HSM functions
----------------------------

These (relatively) private functions support the dialog between the HSM and
the state handlers.  The first two of these functions are static within
hsm.c.  The third is called only from within state handlers (using the
STATE_TRAN macro, see next section).

hsm_process_event_private() -- Sends the external message to the handler
   for the current state, and if that handler doesn't consume (process) the
   message, the function iterates up the tree as far as the top level,
   looking for a consumer of the message.  Static within hsm.c.

hsm_execute_transition() -- Descends downward through the tree to a target
   state, sending "enter" messages to states it enters, and a "start"
   message to the target state.  It *must* begin with the current state
   being an ancestor (parent/grandparent/etc.) of the target state.  Static
   within hsm.c.

hsm_exit_private() -- Ascends upward through the tree to a least common
   ancestor (LCA) of the current state and a target state, sending "exit"
   messages to the states it leaves.  hsm_execute_transition() completes
   the transition, doing the descending/entering part of the journey.
   Invoked from within state handlers via the STATE_TRAN macro (see next
   section).

3.4.3 Important Macros
-----------------------

These macros are invoked from within state handlers to request a state
transition.  One or the other must be used, depending upon whether the path
through the state hierarchy tree involves an upward transition (through a
least-common-ancestor).

STATE_START -- Sets 'next' (target) state.  Used by state handlers to set a
   new target state before returning to hsm_execute_transition().  The
   target must be a child/grandchild/descendant of the current state, since
   hsm_execute_transition() knows only how to *descend* through the tree.

STATE_TRAN -- Sets 'next' (target) state, and invokes hsm_exit_private() to
   *ascend* to the LCA of the current and the new target state, before
   returning to hsm_execute_transition().  Used for new targets that are
   *not* a child/grandchild/descendant of the current state.
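As a sketch of the bubble-up dispatch that hsm_process_event_private()
performs (see also section 3.3.1), here is a hypothetical engine in which
each handler either consumes a message or hands it back, and the dispatcher
walks the parent chain until some ancestor consumes it.  This is not the
hsm.c code; the names and types are invented for illustration:

  #include <stdio.h>

  struct hsm;

  /* A handler returns 0 if it consumed the message, or returns the message
   * itself to let the engine try the parent (super) state. */
  typedef int (*handler_t)(struct hsm *, int msg);

  struct state {
      int parent;                  /* index of super-state, -1 for TOP */
      handler_t handler;
      const char *name;
  };

  struct hsm {
      const struct state *states;
      int current;                 /* index of current state */
  };

  /* Try the current state's handler, then bubble up toward TOP until
   * someone consumes the message. */
  static void dispatch(struct hsm *m, int msg)
  {
      int s = m->current;

      while (s >= 0) {
          if (m->states[s].handler(m, msg) == 0)
              return;                      /* consumed */
          s = m->states[s].parent;         /* bubble up */
      }
      printf("warning: message %d not handled\n", msg);
  }

  enum { MSG_NODE_DOWN = 1, MSG_UNKNOWN = 99 };
  enum { ST_TOP, ST_RUN };

  static int top_handler(struct hsm *m, int msg)
  {
      (void)m;
      if (msg == MSG_NODE_DOWN) {
          printf("TOP: node down, go drain\n");
          return 0;
      }
      return msg;                  /* not ours: default warning follows */
  }

  static int run_handler(struct hsm *m, int msg)
  {
      (void)m;
      return msg;                  /* this leaf ignores everything here */
  }

  static const struct state states[] = {
      [ST_TOP] = { -1,     top_handler, "TOP" },
      [ST_RUN] = { ST_TOP, run_handler, "RUN" },
  };

  int main(void)
  {
      struct hsm m = { states, ST_RUN };

      dispatch(&m, MSG_NODE_DOWN);         /* bubbles up to TOP */
      dispatch(&m, MSG_UNKNOWN);           /* falls off the tree: warning */
      return 0;
  }

The last line of dispatch() plays the role of TOP's "default" case in the
recovery client (see section 5.1): an unrecognized message is logged and
otherwise ignored.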
4. OpenDLM Recovery State Machine Client-Defined Messages
----------------------------------------------------------

Messages get sent to the state machine via the hsm_process_event() function
(see src/kernel/dlmdk/hsm.c), which is called from 4 different locations
within OpenDLM (all also within the src/kernel/dlmdk directory):

-- clm_main.c:  clm_master_loop(), based on input from the cluster manager
   or from the purge comm port

-- clm_info.c:  handle_new_topology(), when first initializing

-- clm_msg.c:  clm_sync(), when synchronizing recovery states for all nodes

-- dlm_recovery.c:  state handlers and other functions auto-sequence the
   state machine.

Also, the PORT_RECOVERY comm callback listens for messages from other
nodes.

4.1 Cluster Manager Messages
-----------------------------

The clm_master_loop() function (clm_main.c) waits for events from the
cluster manager.  When one arrives, it calls the following functions to
process the event, and eventually place the event on the tail of the
dlm_transition_queue, implemented in dlm_sched.c:

   dlm_process_cm_message() (clm_msg.c) -> handle_new_topology() (clm_info.c)

After calling these, clm_master_loop() then calls:

   new_change_request() (dlm_sched.c)

to remove an event from the head of the dlm_transition_queue (may or may
not be the one just placed on the queue), and then soon calls:

   hsm_process_event() (hsm.c)

to send the next event message to the recovery state machine.  (?? could
this long process be simplified ??)

The following messages result from cluster management events.  Next to each
message are shown the function calls that lead to placing the event onto
the dlm_transition_queue:

CLM_MSG_NODE_DELETE
   handle_new_topology() -> update_node_status() -> clms_node_delete()

CLM_MSG_NODE_ADD
   handle_new_topology() -> clms_node_add()

CLM_MSG_NODE_DOWN
   handle_new_topology() -> update_node_status() -> clms_node_down()

The purge message CLM_MSG_NODE_PURGE also uses the dlm_transition_queue
(see section 4.3).  Note that CLM_MSG_NODE_UP is not among the cluster
event messages; it is generated by the recovery state machine.

Another message is driven by handle_new_topology(), when the node is first
initializing, without using the dlm_transition_queue.  This tells our state
machine to query other nodes for directory information:

CLM_MSG_NODE_INIT

4.2 Sync Comm Port Message
---------------------------

When the cluster is recovering from an exiting node, there are several
steps at which all nodes in the cluster must be in sync with one another.
The state machine waits for this, and moves along only when it receives:

RC_MSG_GOT_BARRIER
   clm_sync() -> [handle_sync()] -> dlm_sync_notify_recovery()

This barrier message is driven from clm_sync() in clm_msg.c.  This function
sends MSYNC messages, via the PORT_SYNC comm port, to all nodes to drive
their state machines to a certain recovery state.  When reached, each node
lets the others know that the state has been reached, via the PORT_SYNCRET
comm port.  When our node sees that all nodes have reported the same state,
it sends RC_MSG_GOT_BARRIER to our state machine, so it can progress to the
next recovery state.
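A sketch of the barrier idea (illustrative; the real clm_sync()/PORT_SYNCRET
exchange carries more state than this): each node records which peers have
reported reaching the awaited recovery state, and only when every active
node has reported does the local node inject RC_MSG_GOT_BARRIER into its
own state machine.

  #include <stdio.h>

  #define MAX_NODES 8

  /* Hypothetical bookkeeping for one sync point. */
  struct barrier {
      int wanted_state;            /* recovery state we are waiting for */
      int n_active;                /* number of nodes that must report  */
      int reported[MAX_NODES];     /* which node IDs have reported      */
      int n_reported;
  };

  /* Called when a PORT_SYNCRET-style report arrives saying that 'node' has
   * reached 'state'.  Returns 1 when all active nodes have reported, i.e.
   * when RC_MSG_GOT_BARRIER should be posted to the local state machine. */
  static int barrier_report(struct barrier *b, int node, int state)
  {
      if (state != b->wanted_state || b->reported[node])
          return 0;                /* wrong state, or a duplicate report */

      b->reported[node] = 1;
      b->n_reported++;
      return b->n_reported == b->n_active;
  }

  int main(void)
  {
      struct barrier b = { .wanted_state = 2, .n_active = 3 };
      int i;

      for (i = 0; i < 3; i++)
          if (barrier_report(&b, i, 2))
              printf("all nodes at barrier: post RC_MSG_GOT_BARRIER\n");
      return 0;
  }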
4.3 Purge Comm Port Message
----------------------------

The purge message is driven from pti_read_request(), which monitors the
PORT_MSTR_CR comm port.  A DLM_PURGELOCKS port message will result in a
CLM_MSG_NODE_PURGE message to the recovery state machine:

CLM_MSG_NODE_PURGE
   pti_read_request() -> master_purge() -> clms_node_purge()

4.4 Recovery Comm Port Messages
--------------------------------

Many messages are received from the state machines in other nodes, via the
PORT_RECOVERY comm port.  The callback for that port is
clmr_message_handler() (dlm_recovery.c).  It forwards the following to our
state machine:

CLM_MSG_NODE_UP
RC_MSG_DIR_RESTORE
RC_MSG_DIR_END
RC_MSG_DIR_NAK
RC_MSG_DIR_NOT_UP
RC_MSG_DIR_DATA      /* directory query response */
RC_MSG_DIR_REQUEST   /* initial directory query */
RC_MSG_RES_REQ_1ST   /* new data to merge */
RC_MSG_RES_REQ       /* new data to merge */
RC_MSG_RES_DATA      /* master handle response */
RC_MSG_RES_RESTART   /* new node starting */

4.5 Recovery State Machine Messages
------------------------------------

Some messages originate within the state machine itself (dlm_recovery.c),
and drive it to perform various actions and sequence into different states.
These include:

RC_MSG_SEND_MORE
RC_MSG_PURGE_CHECK
RC_MSG_DIR_RETRY

5. Recovery State Machine Handlers
-----------------------------------

The recovery state machine is implemented within
src/kernel/dlmdk/dlm_recover.c.  The recovery state hierarchy tree is only
3 states deep.  Beneath the TOP level, there are 5 "CLM" states, 3 of which
have children "RC" states.

In addition to the state machine structure (which contains the current
state, among other information), there are also a few global variables,
defined within dlm_recover.c, that describe state.  Some simple methods are
available in clm_main.c to allow querying these state variables or the
current state machine state.  The query methods are:

-- clm_in_reconfig():  returns the state of variable
   "dlm_recov_in_reconfig", which means we are in CLM_ST_RECONFIG (or one
   of its sub-states), trying to (re)build the cluster-wide distribution of
   lock information.

-- clm_is_running():  returns TRUE if the recovery state machine is in
   CLM_ST_RUN, which is the normal, non-recovery, operating mode.

-- clm_in_delivery():  returns TRUE if the recovery state machine is in
   RC_DIR_INIT, which means that we are gathering directory information
   from other nodes.

-- clm_run_rcv():  returns the state of variable "dlm_recov_can_transact",
   which indicates that we are in a state in which it is legal to receive
   transactions (requests for lock operations).

There is also a global variable which has no query method:

-- grace_period:  Indicates that we are in a particular range of sub-states
   of the CLM_ST_RECONFIG state.  Entry into the RC_BARRIER2 sub-state sets
   this to 1.  The RC_BARRIER6 sub-state sets it to 0, just before entering
   CLM_ST_RUN, which also sets it to 0.  Used in clm_process() to help
   determine if we can accept a lock request from a client at the moment.

There are also a few static state variables used just within
dlm_recovery.c:

-- dlm_recov_in_purge:  TRUE if the current recovery pass was triggered by
   a CLM_MSG_NODE_PURGE message (see section 5.5).

-- dlm_recov_in_add_delete:  TRUE if the current recovery pass was
   triggered by a CLM_MSG_NODE_ADD or CLM_MSG_NODE_DELETE message (see
   section 5.5).

-- dlm_recov_updates_inflight:  Indicates the number of directory updates
   that have been sent to a directory node, but not yet acknowledged (see
   sections 5.6.2 and 5.6.4).

The states and messages supported by the Recovery State Machine handlers
are described below:

5.1 CLM_ST_TOP
---------------

The TOP state is a current state only for a brief instant, when the HSM is
entered and started by hsm_start(), when initializing the OpenDLM system
(kern_main() -> dlm_recov_init() -> hsm_start()).  It transitions
immediately to the CLM_ST_INIT state, never again to return to TOP as a
current state.

Therefore, TOP's main role is to be the highest level ancestor state for
all of the other states.  It serves as a default catch-all for messages
that are not handled by states lower in the hierarchy (see section 3.3).
It handles the following:

START:  Transitions immediately to CLM_ST_INIT.

CLM_MSG_NODE_DOWN:  Some node is leaving the cluster.
All states (except for, currently, CLM_ST_RUN, which treats it the same way
and is redundant) allow this message to bubble up to TOP, which (re)sets a
few static recovery variables:

   dlm_recov_in_add_delete = FALSE;
   dlm_recov_in_purge = FALSE;
   dlm_recov_leaving_node = node that sent NODE_DOWN message

It then transitions to CLM_ST_DRAIN.  In other words, no matter what state
we're in, when we get a CLM_MSG_NODE_DOWN we stop what we're doing and go
to CLM_ST_DRAIN.

CLM_MSG_NODE_UP:  Some node is joining the cluster.  Almost all states
allow this message to bubble up to TOP, which sends an RC_MSG_DIR_NAK back
to the sending node, meaning that we're not in a state which can respond to
a directory request.  The sending node can wait before trying again.  The
only two states that handle this message directly are CLM_ST_RUN (the only
state that can successfully handle it) and RC_DIR_INIT (which sends a
special RC_MSG_DIR_NOT_UP message back to the sending node).

default:  Any message that has no meaning to the current state or its
ancestors causes a warning message to be sent to the system log.  Operation
of the DLM continues as if the message did not occur.

5.2 CLM_ST_INIT
----------------

CLM_ST_INIT is entered only from TOP, just after hsm_start().  It
transitions immediately to its RC_INIT_WAIT sub-state, never to return to
CLM_ST_INIT as a current state.  This state has 2 leaf-level sub-states,
but handles no messages on their behalf, and handles no messages for itself
as a current state.  It handles only:

START:  Transition immediately to the RC_INIT_WAIT sub-state.

5.2.1 RC_INIT_WAIT
-------------------

RC_INIT_WAIT is the first leaf-level state to be current when initializing.
It is entered only from CLM_ST_INIT, and once it exits, it never returns
here again.  In this state, the node is just waiting to be told to
initialize its directory information.  It handles the following messages:

CLM_MSG_NODE_INIT:  This message is sent by *this* node to itself, when
handling the initial cluster topology configuration in
handle_new_topology().  If this node is joining a cluster with other active
members, it needs to get directory information from its neighbor, so we
transition to the RC_DIR_INIT state.  If this node is the only active
member in the cluster, there's no neighbor from which to get directory
info, so we transition directly to CLM_ST_RUN.

EXIT:  When exiting this state (to either RC_DIR_INIT or CLM_ST_RUN, or any
other state such as CLM_ST_DRAIN), the handler initializes the cccp
communications module via cccp_init().

5.2.2 RC_DIR_INIT
------------------

This leaf state is a bit more complex than the ones we've discussed so far.
In this state, we are trying to migrate directory information into our node
from (one? all?) other nodes within the cluster that we just joined.  Some
of these nodes may not be ready to give us that information, so there is
provision for retrying our requests, or skipping nodes that do not have
useful directory information.  It handles:

ENTRY:  It resets the purge logic, finds our closest neighbor in the
cluster, and falls through to RC_MSG_DIR_RETRY, to try to get directory
info from that neighbor.  (?? Other state handlers seem to send a message
instead of "falling through".  In this case, the message would be
RC_MSG_DIR_RETRY.  Is there a reason that this state is implemented
differently from others?  Why is this done on ENTRY instead of START??)

RC_MSG_DIR_RETRY:  Sends CLM_MSG_NODE_UP to the neighbor node, to request
directory information migration from that node.
This will be a *first* try, not a *re*-try, if it fell through from the
ENTRY message.  RC_MSG_DIR_RETRY always originates from within *this* node,
and from within this state, from the handling of other messages (see
below).

RC_MSG_DIR_NAK:  We receive this from the neighbor node if it is
temporarily not in a state in which it can send directory data.  We set a
timer (3 seconds) and send ourselves RC_MSG_DIR_RETRY, to try the same node
again.

RC_MSG_DIR_NOT_UP:  We receive this from the neighbor node if it is in the
same state that we are, trying to get directory info from a neighbor node.
Since it's just getting started, we know that it is gathering only the
directory information for itself, and won't have any that belongs to us.
We skip that node, find another node, and set the 3 second timer to send
ourselves RC_MSG_DIR_RETRY.  If we've already tried all the nodes (?? --
logic looks incomplete for this ??), we give up and enter the RUN state.

RC_MSG_DIR_END:  We receive this from the neighbor node if it has finished
sending us directory information.  We copy some migration control
parameters from it, and enter the RUN state.  Apparently, we need directory
information from just one other node, since there is no provision to query
other nodes once we receive this message (??? Shouldn't we gather dir info
from all active nodes??).

RC_MSG_DIR_RESTORE:  This message contains directory information from
another node!  We insert the entry into our local directory.

CLM_MSG_NODE_UP:  We receive this from another node that is joining the
cluster.  Since we are still trying to gather our own directory information
initially, we don't have any directory info that belongs on this new node.
We send the RC_MSG_DIR_NOT_UP message, so it can look for another node to
query.

5.3 CLM_ST_RUN
---------------

This is the normal operating state.  Since this is a "recovery" state
machine, this state doesn't need to do much except wait for interesting
things to happen in terms of cluster membership, then transition to the
appropriate state.  This is a "leaf" state, with no children.  It handles:

ENTRY:  Just set a few global state variables to indicate that we're
running:

   grace_period = 0;
   clm_active = TRUE;
   dlm_recov_can_transact = TRUE;

The call to purge_active() is a no-op for Linux.

CLM_MSG_NODE_PURGE:  Set a couple of global state variables to indicate
that we're in purge mode:

   dlm_recov_in_add_delete = FALSE;
   dlm_recov_in_purge = TRUE;

Then transition to CLM_ST_DRAIN.

CLM_MSG_NODE_ADD:
CLM_MSG_NODE_DELETE:  Set a couple of global state variables to indicate
that we're in add/delete mode:

   dlm_recov_in_add_delete = TRUE;
   dlm_recov_in_purge = FALSE;

Then transition to CLM_ST_DRAIN.

CLM_MSG_NODE_UP:  Set the following state variable so we know which node is
joining:

   dlm_recov_joining_node = node that sent NODE_UP message

Transition to CLM_ST_SEND_DIR to send directory info to the newly joining
node.

CLM_MSG_NODE_DOWN:  Set a few global variables:

   dlm_recov_in_add_delete = FALSE;
   dlm_recov_in_purge = FALSE;
   dlm_recov_leaving_node = node that sent NODE_DOWN message

It then transitions to CLM_ST_DRAIN.  This is identical to, and redundant
with, the NODE_DOWN handling in TOP, and could be removed from the RUN
handler.

5.4 CLM_ST_SEND_DIR
--------------------

In this state, we're migrating some of our directory data to a node that is
joining the cluster.  This is a "leaf" state, with no children.  It
handles:

ENTRY:  We're getting ready to send our directory information to a newly
joining node, so we reset the enumerator that will walk through our
directory database, so it will start at the beginning of the database.  We
then send an RC_MSG_SEND_MORE to ourself, to effectively fall through
into ...

RC_MSG_SEND_MORE:  We send directory entries to the newly joining node.
Using rl_enum_counted() to walk through the directory hash table, we send
the entries from up to 32 hash table slots, using clmr_send_dir_entry().
This function looks for those directory entries that belong on the newly
joining node, based on the resource names and the new number of nodes,
using clm_direct_dirsend() -> hashfunc().  See these functions to
understand how directory entries are distributed throughout the cluster.
(??? Once the 32 slots are done, how do we (or the other node) ask for
more???)
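A sketch of that batched walk (hypothetical code; the real walk uses
rl_enum_counted() and clmr_send_dir_entry(), and only the 32-slot batch
size comes from the text above): scan a fixed number of hash buckets per
RC_MSG_SEND_MORE, forwarding only the entries whose names now hash to the
joining node, and remember where to resume.

  #include <stdio.h>

  #define N_BUCKETS 256
  #define BATCH     32             /* buckets examined per RC_MSG_SEND_MORE */

  /* Hypothetical directory entry and hash table. */
  struct dir_entry {
      const char *res_name;
      struct dir_entry *next;
  };

  static struct dir_entry *dir_table[N_BUCKETS];
  static int next_bucket;          /* enumerator: where the walk resumes */

  /* Stand-ins for the hashfunc()-based placement (see section 2). */
  static unsigned int hash_name(const char *s)
  {
      unsigned int h = 5381;

      while (*s)
          h = (h * 33) ^ (unsigned char)*s++;
      return h;
  }

  static int directory_node(const char *name, int n_nodes)
  {
      return (int)(hash_name(name) % (unsigned int)n_nodes);
  }

  /* Handle one RC_MSG_SEND_MORE: scan up to BATCH buckets, sending any
   * entry that now belongs on the joining node.  Returns 1 if buckets
   * remain, i.e. if another RC_MSG_SEND_MORE should follow. */
  static int send_dir_batch(int joining_node, int n_nodes)
  {
      int end = next_bucket + BATCH;

      for (; next_bucket < N_BUCKETS && next_bucket < end; next_bucket++) {
          struct dir_entry *e;

          for (e = dir_table[next_bucket]; e; e = e->next)
              if (directory_node(e->res_name, n_nodes) == joining_node)
                  printf("send entry '%s' to node %d\n",
                         e->res_name, joining_node);
      }
      return next_bucket < N_BUCKETS;
  }

  int main(void)
  {
      static struct dir_entry a = { "resource-A", NULL };
      int n_nodes = 4;
      int joining = directory_node("resource-A", n_nodes);

      dir_table[hash_name("resource-A") % N_BUCKETS] = &a;

      while (send_dir_batch(joining, n_nodes))
          ;                        /* OpenDLM posts RC_MSG_SEND_MORE instead */
      return 0;
  }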
5.5 CLM_ST_DRAIN
-----------------

We transition to the DRAIN state when a node leaves the cluster.  After a
member node dies, this state (and its 2 leaf sub-states) cleans out old,
unusable information (about the dead node) that is still on this node.

We can arrive here from several states, for several reasons.  To define
each case, the following global variables must be set appropriately in the
previous state, before transitioning to DRAIN:

dlm_recov_in_add_delete -- TRUE if here because of a CLM_MSG_NODE_ADD or
   CLM_MSG_NODE_DELETE message (see RUN state handler). *

dlm_recov_in_purge -- TRUE if here because of a CLM_MSG_NODE_PURGE message
   (see RUN state handler). *

   * If here because of CLM_MSG_NODE_DOWN (see TOP state handler), both the
     dlm_recov_in_add_delete and dlm_recov_in_purge globals are FALSE.

dlm_recov_leaving_node -- indicates which node is leaving the cluster.
   (??? it doesn't seem to get set for NODE_ADD or NODE_DELETE ???)

This state handles the following messages:

START:  We transition immediately to RC_BARRIER1.

EXIT:  We set the global variable dlm_recov_can_transact to FALSE.  This
means that this node is not able to accept transaction requests from itself
or other nodes; it is entering a reconfiguration state (RC_BARRIER1).

5.5.1 RC_BARRIER1
------------------

This is the first of a series of "barrier" states.  A barrier state is one
that must be achieved by *all* nodes in the cluster, before we can move on
to a next state.  This state begins the process of recovery, after a node
has died.  This, like all of the other "RC_" states, is a leaf state.  It
handles the following messages:

ENTRY:  This does the real work of this state.  As the first phase of a
reconfiguration, it does the following on *this* node:

-- reset_inflight() marks any "inflight" requests (requests that have been
   sent to the dead node, but not acknowledged by the dead node) to try
   them again after reconfiguration is complete (and the dead node is
   running again).

-- clmr_purge_res() removes any locks, held by dead nodes, that are on any
   of the queues (grant/convert/wait) of the resources on our resource
   list.

-- clmr_dir_purge_chk() (driven by the rl_enum() iterator that goes through
   the directory entries on our node) removes any directory entries that
   point to the dead node(s).

-- clm_sync() waits until *all* nodes finish the above 3 jobs.  We'll
   receive RC_MSG_GOT_BARRIER when all nodes are done.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached a common point in the recovery sequence.  We simply
transition to RC_PURGE_WAIT.
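A sketch of the dead-node cleanup that clmr_purge_res() performs on each
resource (hypothetical structures; the real resource and lock records are
much richer): walk each of the three queues and unlink any lock whose
owning node is no longer a cluster member.

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical lock record: just enough to show the filtering. */
  struct lock {
      int owner_node;
      struct lock *next;
  };

  struct resource {
      struct lock *grant, *convert, *wait;   /* the three lock queues */
  };

  static int node_is_dead(int node, const int *dead, int n_dead)
  {
      int i;

      for (i = 0; i < n_dead; i++)
          if (dead[i] == node)
              return 1;
      return 0;
  }

  /* Unlink and free every lock on a queue that is owned by a dead node. */
  static void purge_queue(struct lock **q, const int *dead, int n_dead)
  {
      while (*q) {
          if (node_is_dead((*q)->owner_node, dead, n_dead)) {
              struct lock *gone = *q;

              *q = gone->next;
              printf("dropping lock owned by dead node %d\n",
                     gone->owner_node);
              free(gone);
          } else {
              q = &(*q)->next;
          }
      }
  }

  static void purge_resource(struct resource *r, const int *dead, int n_dead)
  {
      purge_queue(&r->grant,   dead, n_dead);
      purge_queue(&r->convert, dead, n_dead);
      purge_queue(&r->wait,    dead, n_dead);
  }

  int main(void)
  {
      struct resource r = { NULL, NULL, NULL };
      struct lock *a = malloc(sizeof(*a)), *b = malloc(sizeof(*b));
      int dead[] = { 3 };

      a->owner_node = 3;  a->next = NULL;     /* lock from the dead node */
      b->owner_node = 1;  b->next = a;        /* lock from a survivor    */
      r.grant = b;

      purge_resource(&r, dead, 1);            /* drops only node 3's lock */
      free(b);
      return 0;
  }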
5.5.2 RC_PURGE_WAIT
--------------------

This, like all of the other "RC_" states, is a leaf state.  It handles the
following messages:

ENTRY:  This state does the following on *this* node:

-- reconfig_free_client_purge() (driven via the client_apply() iterator on
   each OpenDLM client app on this node) continues any purge activity on
   this node that was blocked because a migration was in progress (??).
   Or, it starts a purge process that had not started because we were in
   recovery when it was requested.

-- sends an RC_MSG_PURGE_CHECK message to ourself, to continue on.

RC_MSG_PURGE_CHECK:  Received from ourself.  Does the following:

-- reconfig_check_client_purge() (driven via the client_apply() iterator on
   each OpenDLM client on this node) checks to make sure that all purging
   operations on this node are complete.

-- If so, transition to RC_BARRIER2 (sub-state of CLM_ST_RECONFIG).  (?? If
   not, there seems to be no way to continue on ????  There is no other
   source for RC_MSG_PURGE_CHECK to trigger another check!?!)

5.6 CLM_ST_RECONFIG
--------------------

This state has the largest number (7) of sub-states, and is used to
(re-)distribute resource (including locks on the resource) and directory
information across the cluster after cluster membership changes.

If a node joins the cluster, then we arrive here via ??

If a node leaves the cluster, then we arrive here via the CLM_ST_DRAIN
state (actually via its sub-states RC_BARRIER1 and RC_PURGE_WAIT), which
cleans out unusable information about the dead node.  In CLM_ST_RECONFIG
(and sub-states), we then re-distribute information about the surviving
nodes.

CLM_ST_RECONFIG is never a target state (i.e. never a final "current"
state).  However, it may be entered or exited when the state machine is
transitioning to or from its sub-states.  It also serves as a "catch-all"
handler for several messages, on behalf of its sub-states.

It handles the following transitional messages from the HSM:

ENTRY:  Sets the "dlm_recov_in_reconfig" global state variable to TRUE.

EXIT:  Sets the "dlm_recov_in_reconfig" global state variable to FALSE.

START:  Makes sure that dlm_recov_in_reconfig = TRUE.  Transitions
immediately to RC_BARRIER2.

These are the only places that the dlm_recov_in_reconfig variable is set or
reset.

It handles the following client-specific messages on behalf of its
sub-states.  In each case, this is the only place in which the message is
handled.  That is, each of the messages is valid for all of the sub-states
of CLM_ST_RECONFIG, and is passed up to here.  Conversely, they are invalid
for any other states (e.g. CLM_ST_DRAIN) in the recovery state machine, and
would be handled by TOP's default warning message.  The following are
handled here:

RC_MSG_DIR_REQUEST:  This message comes from another node's clmr_res()
function, and is received on this node via the PORT_RECOVERY comm port.

-- handle_dir_request(), dlm_recovery.c, makes sure that we have an active
   directory entry (i.e. we are the directory node for this resource), and
   sends an RC_MSG_DIR_DATA message back to the other node.  This tells the
   other node which node is the (perhaps new) master node for the resource
   (see immediately below).

RC_MSG_DIR_DATA:  This message comes from another node's
handle_dir_request() function, and is received on this node via the
PORT_RECOVERY comm port.  This message is a response from the (perhaps new)
directory node back to us, confirming that it has (perhaps created) a valid
directory entry for the resource, and telling us which node is the (perhaps
new) master.
We do the following for each resource structure on our node, depending on
whether we are the (perhaps new) master of the resource:

If master:

-- clmr_clean(), dlm_recovery.c, scans through all locks on our (master)
   copy of the resource, and removes any locks that were held by dead
   nodes.  It also invalidates (clears) the lock value block (LVB) for any
   resources that had exclusive (EX) or protected write (PW) locks held by
   a dead node.  Such a node might have changed the LVB contents without us
   knowing about it.

If not master:

-- clmr_rebuild_send(), dlm_recovery.c, cleans out our (secondary) copy of
   the resource, removing any locks that are not held on this node; as a
   secondary copy, we don't need to know about any other nodes' locks.
   Then, it sends either an RC_MSG_RES_REQ or an RC_MSG_RES_REQ_1ST
   message, via PORT_RECOVERY, to the master node, so the master node can
   know what we know about our own locks, and integrate it into the master
   copy of the resource.  These messages are handled by either the
   RC_BARRIER4 or the RC_RESOURCE_QUERY state.  The master node will
   respond with RC_MSG_RES_DATA (see immediately below).

RC_MSG_RES_DATA:  This message comes from another node's clmr_respond()
function, called from their handle_resource_request() function after
receiving an RC_MSG_RES_REQ(_1ST) (see immediately above) from us.  It is
received on this node via the PORT_RECOVERY comm port.  We copy the master
node's resource handle, and the node id of the master node, into our copy
of the resource.

5.6.1 RC_BARRIER2
------------------

This barrier state waits for all other nodes to get to barrier state 2
(meaning what??).  This, like all of the other "RC_" states, is a leaf
state.  It handles the following messages:

ENTRY:  Sets two global variables:

   grace_period = 1;
   clm_active = TRUE;

Then waits for all nodes to achieve barrier state 2.  Once achieved, we
receive the RC_MSG_GOT_BARRIER message, see below.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached barrier state 2.  We branch in two different directions,
depending on whether we are in "dlm_recov_in_purge" mode:  If in purge, we
simply transition to RC_BARRIER6.  If not in purge, we simply transition to
RC_REBUILD_DIR.

5.6.2 RC_REBUILD_DIR
---------------------

This state does a complete rebuild of directory information throughout the
cluster, using information in each node's resource table.  In this state,
we use info from resource master copies, not non-master copies.  The
non-master copies get used later, in the RC_RESOURCE_QUERY state.

This state steps through all of the resource structures stored on this
node, and sends the information about each resource to that resource's
directory node.  That directory node may be a new node, building a
directory from scratch, or it may be an old member node that is becoming
the directory node for a new set of resources, due to the change in cluster
membership.

ENTRY:  Several steps are executed here:

-- reconfig_del_dead_client(), clm_client.c, removes any knowledge that we
   have on *this* node about clients on dead nodes, by removing those
   clients from our "cli_refs" list.

Then, if we are in "dlm_recov_in_add_delete" mode:

-- clm_change_topology(), clm_info.c, reorganizes our node list to contain
   only the new cluster membership.
-- clmr_dir_move_check(), clm_recover.c, (called via rl_enum(), clm_rldb.c,
   for each directory entry on our node) deletes any active directory (but
   not cache) entries for resources for which we just were, but no longer
   are, the directory node.  That is, it deletes directory entries that are
   migrating away from us due to the change in cluster membership.

Then, regardless of mode:

-- resource_table_iterator_create(), clm_resource.c, creates an iterator
   for going through the resource table on this node.

-- send an RC_MSG_SEND_MORE message to ourself to do the next job.

RC_MSG_SEND_MORE:  We use the iterator to go through our node's list of
resources, calling:

-- clmr_dir(), clm_resource.c.  If we are the current resource master, we
   have the most up-to-date resource information.  We send the info to the
   resource's (perhaps new) directory node, via update_directory().
   update_directory() sends the info within an rl_update_t structure, and
   delivers it via PORT_DIRUPDT.  The info includes:

      resource name, name length
      type (UNIX vs VMS)
      flags
      master node
      pointer to our copy of the resource (for use after ack)

   The port listener function for PORT_DIRUPDT is dir_read_update(), in
   clm_direct.c.  On the directory node, it either modifies a pre-existing
   directory entry, or creates a new one, using the info sent above.

   We keep track of the number of updates "in flight" (sent to the
   directory node, but not yet acknowledged by the directory node) via the
   global variable "dlm_recov_updates_inflight", which we increment each
   time we invoke update_directory().  The counter is decremented via
   dlm_recov_dir_msg_retire(), called after we get an acknowledgement.

   If we are not the master, we don't send info, but take the opportunity
   to make sure that our copy of the resource contains only locks held by
   this node in its queues (that's all that a local copy should contain).

Then, we check "dlm_recov_updates_inflight" to see if all of our directory
updates have been sent and retired.  If so, we transition to RC_BARRIER3.
If not, we must get called again with RC_MSG_SEND_MORE, which is done by
dlm_recov_dir_msg_retire() once the number of inflight updates is below a
certain threshold (dlm_buff_limit = 500).

(?? Something looks odd with this mechanism ... the iterator goes through
all of the resources in one shot, so why would it need to be
re-stimulated??  Then, we get hundreds of RC_MSG_SEND_MORE messages before
the final one gets us to change state??  Even though the messages are to
ourself, that's still a lot of pretty useless messages??)

EXIT:  Destroy the iterator created during ENTRY.
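A sketch of that in-flight accounting (hypothetical control flow; the names
dlm_recov_updates_inflight and dlm_buff_limit come from the text above, but
the real code drives this through RC_MSG_SEND_MORE messages and the
resource-table iterator): each update sent bumps a counter, each
acknowledgement retires one, and the sender stops whenever the backlog
reaches the limit, resuming when acknowledgements bring it back down.

  #include <stdio.h>

  #define INFLIGHT_LIMIT 500       /* cf. dlm_buff_limit in the text */

  static int updates_inflight;     /* cf. dlm_recov_updates_inflight */
  static int next_resource;        /* iterator position               */
  static int n_resources = 1200;   /* resources this node must report */

  /* Send directory updates until we hit the in-flight limit or run out of
   * resources.  Returns 1 when every update has been sent and retired,
   * i.e. when it is safe to move on to the next barrier. */
  static int send_more(void)
  {
      while (next_resource < n_resources &&
             updates_inflight < INFLIGHT_LIMIT) {
          /* update_directory(resource[next_resource]) would go here */
          next_resource++;
          updates_inflight++;
      }
      return next_resource == n_resources && updates_inflight == 0;
  }

  /* Called when the directory node acknowledges one update; once the
   * backlog drops below the limit, re-stimulate the sender (in OpenDLM, by
   * posting another RC_MSG_SEND_MORE to the state machine). */
  static void retire_one(void)
  {
      updates_inflight--;
      if (updates_inflight < INFLIGHT_LIMIT)
          send_more();
  }

  int main(void)
  {
      send_more();                 /* first RC_MSG_SEND_MORE */
      while (updates_inflight > 0)
          retire_one();            /* simulate acknowledgements arriving */
      printf("all %d updates sent and retired\n", n_resources);
      return 0;
  }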
5.6.3 RC_BARRIER3
------------------

This barrier state waits for all other nodes to get to barrier state 3
(meaning what??).  This, like all of the other "RC_" states, is a leaf
state.  It handles the following messages:

ENTRY:  Waits for all nodes to achieve barrier state 3.  Once achieved, we
receive the RC_MSG_GOT_BARRIER message, see below.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached barrier state 3.  We simply transition to
RC_RESOURCE_QUERY.

5.6.4 RC_RESOURCE_QUERY
------------------------

ENTRY:  Sets one global variable:

   dlm_recov_can_del_dir = TRUE;

Then creates a resource table iterator, and sends an RC_MSG_SEND_MORE to
fall through to do the job below.

RC_MSG_SEND_MORE:  Does the following:

-- resend_updates(), clm_direct.c, handles any requests made to our node to
   delete directory entries while we were unable to; we put them on a
   holding stack to be processed now.

-- clmr_res(), dlm_recover.c, gets called for each resource known by our
   node, and does one of two things, depending on whether we are the master
   for the resource:

   master:  clean out locks, held by dead nodes, from our resource's lock
      queues (grant/convert/wait)

   not master:  send RC_MSG_DIR_REQUEST to the directory node.  This tells
      the directory node to make sure that it has a directory entry.  Also,
      the directory node sends back mastership info, so that we don't need
      to consult the directory again for this resource.

As in the RC_REBUILD_DIR state, we keep track of the number of updates "in
flight" via the global variable "dlm_recov_updates_inflight".  After we're
done iterating through the resources, we check that it is 0, then
transition to RC_BARRIER4.

RC_MSG_RES_REQ:
RC_MSG_RES_REQ_1ST:  We are the master node for a certain resource.
Another node is sending us (pushing it; we didn't request it, see
CLM_ST_RECONFIG state, RC_MSG_DIR_DATA message) information about locks it
holds on the resource.  We call:

-- handle_resource_request(), dlm_recover.c, which integrates that info
   into our master copy of the resource structure, and sends the other node
   a copy of our resource handle (a cluster-wide resource identifier,
   smaller than the resource name and type), using RC_MSG_RES_DATA.  See
   CLM_ST_RECONFIG state, RC_MSG_RES_DATA message.

Note that RC_MSG_RES_REQ(_1ST) is handled identically by the RC_BARRIER4
state.

EXIT:  Destroys the iterator created during ENTRY.

5.6.5 RC_BARRIER4
------------------

This barrier state waits for all other nodes to get to barrier state 4
(meaning what??).  This, like all of the other "RC_" states, is a leaf
state.  It handles the following messages:

ENTRY:  Waits for all nodes to achieve barrier state 4.  Once achieved, we
receive the RC_MSG_GOT_BARRIER message, see below.

RC_MSG_RES_REQ:
RC_MSG_RES_REQ_1ST:  We are the master node for a certain resource.
Another node is sending us (pushing it; we didn't request it, see
CLM_ST_RECONFIG state, RC_MSG_DIR_DATA message) information about locks it
holds on the resource.  We call:

-- handle_resource_request(), dlm_recover.c, which integrates that info
   into our master copy of the resource structure, and sends the other node
   a copy of our resource handle (a cluster-wide resource identifier,
   smaller than the resource name and type), using RC_MSG_RES_DATA.  See
   CLM_ST_RECONFIG state, RC_MSG_RES_DATA message.

Note that RC_MSG_RES_REQ(_1ST) is handled identically by the
RC_RESOURCE_QUERY state.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached barrier state 4.  We check resources that have been
reassigned to this node as master, after their previous master node died,
via:

-- clmr_lkvlb(), dlm_recover.c, which clears stale locks (marked by the
   LOCK_STALE flag) from each queue (grant/convert/wait).

If we are the new master, we have lost information from the dead node that
previously was the master.  We invalidate the lock value block (LVB) if
there was a possibility that the dead master might have changed it, that
is, if some dead node held an exclusive (EX) or protected write (PW) lock.
We can't know that for sure, so we check for any locks that *we* know about
that are >= concurrent write (CW).  If there are any, they would have kept
any dead nodes from owning a lock that allowed them to write the LVB.  But
if not, we invalidate the LVB, to be safe.

We then transition to RC_BARRIER5.
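A sketch of that LVB decision (hypothetical helper; the lock mode names use
the usual DLM ordering NL < CR < CW < PR < PW < EX): if no surviving
granted lock of at least concurrent-write (CW) mode exists, a dead node
could have held EX or PW and written the LVB, so the value block must be
invalidated.

  #include <stdio.h>

  /* Classic DLM lock modes, weakest to strongest. */
  enum lock_mode { LKM_NL, LKM_CR, LKM_CW, LKM_PR, LKM_PW, LKM_EX };

  struct lock {
      enum lock_mode mode;
      struct lock *next;
  };

  /* Decide whether a new master must invalidate the LVB after the old
   * master died.  Rule from the text: if any surviving lock we know about
   * is >= CW, no dead node could have held EX or PW, so the LVB is still
   * trustworthy; otherwise invalidate it to be safe. */
  static int must_invalidate_lvb(const struct lock *granted)
  {
      for (; granted; granted = granted->next)
          if (granted->mode >= LKM_CW)
              return 0;            /* someone blocked EX/PW writers */
      return 1;                    /* can't rule out a dead writer */
  }

  int main(void)
  {
      struct lock cr = { LKM_CR, NULL };
      struct lock pr = { LKM_PR, &cr };

      printf("only CR held:   invalidate? %d\n", must_invalidate_lvb(&cr));
      printf("PR and CR held: invalidate? %d\n", must_invalidate_lvb(&pr));
      return 0;
  }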
5.6.6 RC_BARRIER5
------------------

This barrier state waits for all other nodes to get to barrier state 5
(meaning what??).  This, like all of the other "RC_" states, is a leaf
state.  It handles the following messages:

ENTRY:  Waits for all nodes to achieve barrier state 5.  Once achieved, we
receive the RC_MSG_GOT_BARRIER message, see below.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached barrier state 5.  We remove temporary directory entries
that were created for local locks during reconfiguration (when??), via:

-- clmr_rm_locallock_dirs(), dlm_recover.c, which deletes a directory entry
   if the RLINFO_LOCAL flag is set.

We then transition to RC_BARRIER6.

5.6.7 RC_BARRIER6
------------------

This barrier state waits for all other nodes to get to barrier state 6
(meaning what??).  This, like all of the other "RC_" states, is a leaf
state.  It handles the following messages:

ENTRY:  Waits for all nodes to achieve barrier state 6.  Once achieved, we
receive the RC_MSG_GOT_BARRIER message, see below.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached barrier state 6.  This is the last stage of recovery,
just before going into the RUN state.  We do the following:

-- wake_all_locks(), clm_clmlocks.c, looks through all resources for which
   we are (now perhaps the new) master, and grants any locks that were
   blocked by dead nodes.

-- pti_flush_resend(), clm_batch.c, resends messages that were destined for
   a dead node, but failed to transmit successfully.  These messages were
   placed on our "resend" queue.  If the original destination is alive
   again, we send it there, otherwise we look for the appropriate new
   directory (or master?) node.

-- resend_inflight(), clm_batch.c, resends messages that were destined for
   a dead node, reached that node successfully (just before it died), but
   we never received a response to the request in the message.  These
   messages were placed on our "inflight" queue.  As with the resend queue,
   we try again sending to the original node if it is reachable, or look
   for the appropriate new directory node if not.

-- reconfig_allow_client_purge(), clm_purge.c, does purges that were
   delayed by the reconfiguration.

-- reset the "grace_period" global state variable to 0.

-- clm_change_topology(), clm_info.c, to make sure that we're set up for
   the latest cluster membership.

We then transition to CLM_ST_RUN.

6. Purging
-----------

"Purging" is the cluster-wide removal of now-useless lock information.
There are several possible reasons for needing a purge:

Client died -- A client app on a node died (PURGE_DEAD).  The PURGE_DEAD
   flag tells us to remove our knowledge of the client, in addition to our
   knowledge of the client's locks.

Client request -- A client app requests (how?) a purge of its own locks,
   even though it is still alive (PURGE_ME).

Purge orphan locks -- Orphan locks survive, even though the client or node,
   from which they were requested, dies.  This is a special type of lock
   used during a client recovery process.  Once recovery is complete, the
   orphan locks must be removed.

Node died -- An entire node died.

A purge operation is requested via a "transaction" structure, just like
other inter-node operations, that gets passed around within and between
nodes.  Within the transaction structure, there is a request-specific data
area (a union).
For purge requests, the area contains a struct "purgereq" (see clmstructs.h).
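As an illustration only (the actual purgereq layout lives in clmstructs.h
and is not reproduced here; the flag names PURGE_DEAD and PURGE_ME come
from the list above, but the fields and values below are invented):

  #include <stdio.h>

  #define PURGE_DEAD 0x1           /* client died: forget client and locks */
  #define PURGE_ME   0x2           /* live client asked to drop its locks  */

  /* Hypothetical shape of a purge request -- NOT the real purgereq -- just
   * the information the purge reasons above imply. */
  struct purge_request_sketch {
      int node_id;                 /* node whose client is affected */
      long client_pid;             /* client (process) on that node */
      unsigned int flags;
  };

  static void handle_purge(const struct purge_request_sketch *p)
  {
      printf("purge locks of client %ld on node %d\n",
             p->client_pid, p->node_id);
      if (p->flags & PURGE_DEAD)
          printf("also forget the client itself\n");
  }

  int main(void)
  {
      struct purge_request_sketch req = { 2, 1234, PURGE_DEAD };

      handle_purge(&req);
      return 0;
  }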