OpenDLM Recovery and Hierarchical State Machine

Copyright 2004 The OpenDLM Project

Author:  Ben Cahill (bc), ben.m.cahill@intel.com


1. Introduction
----------------

This document contains details of the OpenDLM cluster-wide lock state
recovery mechanism, which is invoked any time there is a change in
membership in the cluster.  It includes a discussion of the Hierarchical
State Machine (HSM) mechanism used for recovery.

This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of cluster recovery in OpenDLM.

This document is not intended as a user guide to OpenDLM.  Look in the
OpenDLM WHATIS-odlm document for an overview of OpenDLM.  Look in the
OpenDLM HOWTO-install for details of configuring and setting up OpenDLM.

This document may contain inaccurate statements, based on the author's
limited understanding.  Please contact the author (bc) if you see anything
wrong or unclear.

1.1. Terminology
----------------

"Client" may mean the Recovery State Machine, which uses (is the only
client of) the generic Hierarchical State Machine mechanism described
herein.  "Client-specific messages" and "client-defined handlers" apply to
this definition.

Alternatively, "client" may mean a user of OpenDLM (a client process, or a
struct client that represents the client process), that is, an application
or kernel code that requests locks.

Also, see the ogfs-glossary document.

2. Recovery Overview
---------------------

Since OpenDLM distributes the storage of lock information among all of the
computers in a cluster, this distribution must be adjusted each time a
computer enters or leaves the cluster, for whatever reason (boot, graceful
shutdown, crash, etc.).  Whenever membership changes, each and every active
node runs the recovery mechanism.  This mechanism re-distributes or deletes
resource directory information and lock master information as the
membership grows or shrinks.

When a node joins the cluster, the recovery mechanism re-distributes any
existing directory entries.  This is vital for two reasons:

-- so that the new node (actually, each node, new or not) carries its fair
   share of the directories,

-- *and* (even more important) so that each directory entry can be found by
   a querying node.

The distribution of the directory entries is based on a formula including a
hash function of each resource name, the number of active nodes in the
cluster, and the node ID of each computer.  When a node needs directory
information, it determines the node to query, based on that formula.  The
directory *must* reside on the proper node, and that node may change as
nodes enter/exit the cluster.

When a node leaves a cluster, the recovery mechanism re-distributes
directory information among the surviving nodes, and may reconstruct
directory entries that were lost on the exiting node (if the locks were
mastered on surviving nodes), or will simply forget about directories that
no surviving node knows or cares about.

(is this accurate? -- discuss graceful exit vs crash?)

(discuss redistribution of non-directory lock information)
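To make the formula above concrete, here is a minimal sketch of how a
resource name plus the current membership can pick a unique directory node.
This is illustrative only, not the actual OpenDLM code; the real placement
logic lives in clm_direct.c (see hashfunc()) and differs in detail:

  #include <stdio.h>
  #include <string.h>

  /* Hypothetical stand-in for OpenDLM's hashfunc(): any hash that every
   * node computes identically from the resource name will do. */
  static unsigned int hash_name(const char *name, size_t len)
  {
      unsigned int h = 5381;

      while (len--)
          h = (h * 33) ^ (unsigned char)*name++;
      return h;
  }

  /* Map a resource name onto one of the currently active node IDs.  Since
   * every node uses the same hash, the same name, and the same membership
   * list, all nodes agree on which node holds the directory entry -- and
   * the answer changes whenever membership changes. */
  static int directory_node(const char *name, size_t len,
                            const int *active_nodes, int n_active)
  {
      return active_nodes[hash_name(name, len) % n_active];
  }

  int main(void)
  {
      int before[] = { 1, 2, 3, 4 };   /* membership before node 3 dies */
      int after[]  = { 1, 2, 4 };      /* membership after node 3 dies  */
      const char *res = "some-resource-name";

      printf("directory node before: %d\n",
             directory_node(res, strlen(res), before, 4));
      printf("directory node after:  %d\n",
             directory_node(res, strlen(res), after, 3));
      return 0;
  }

The point of recovery, then, is to move each directory entry to whichever
node the new formula selects, so that lookups keep working after the
membership change.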
3. Hierarchical State Machine (HSM)
------------------------------------

OpenDLM contains an implementation of a generic hierarchical state machine
(HSM).  The HSM works by sending "messages" to "state handlers".  A state
handler may do any or all of the following:

-- perform an action,

-- send another message to itself,

-- request a transition to a new state.

The messages may originate from inside the HSM itself (only when
transitioning states), or from client-specific code.  Client code includes
the state handlers, as well as any other code that sends messages to
stimulate the behavior of the state machine.

A client (e.g. lock state recovery) of the HSM must define:

1) a state tree,

2) handler functions for each state within the tree, and

3) client-specific messages to drive the state machine from the outside

Currently, within OpenDLM, the lock state recovery mechanism is the only
client of the HSM.  For examples while you read about the HSM, see the
following:

1) state tree:  dlm_recover.h, dlm_recov_states[] array.  Each state
   definition includes a "parent" or "super" state, a state handler
   function name, and a state name.

2) state handlers:  dlm_recover.c, functions beginning with "clm_st_" or
   "rc_" are state handler functions.  "clm" relates to cluster management,
   and "rc" relates to lock state recovery.  The two are tightly related.

3) messages:  dlm_recover.h, an enum that starts with "CLM_MSG_NODE_INIT".

For detailed discussion of the OpenDLM Recovery State Machine (the only
client of the HSM), see sections 4 and 5 (and their subsections) within
this document.

3.1 State Tree Behavior
------------------------

The HSM allows for a given state to have sub-states.  The arrangement of
states and sub-states creates a tree, with a single all-encompassing "top"
state, and branches hanging down from there.  The tree is defined
separately by each client of the HSM.  The HSM enters states by descending
down the tree, and exits states by ascending up the tree.

Consider the following tree:

          TOP
         /   \
        A     B
            / | \
           1  2  3

To get from TOP to state 3, the state machine must:

-- descend and enter state B, then

-- (descend and) enter state 3.

To get from state 3 to state 1, it must:

-- ascend and exit state 3, then

-- enter state 1 by way of the "least common ancestor" (LCA), state B.  It
   neither enters nor exits state B during this transition; it is always
   considered within state B, merely transitioning between two sub-states
   of state B.

To get from state 1 to state A, it must:

-- (ascend and) exit state 1,

-- exit state B, then

-- enter state A by way of the LCA state, TOP.

3.2 State Handlers and HSM Internal Messages
---------------------------------------------

As the HSM traverses the tree, it sends "enter" or "exit" messages to the
handler functions.  These two messages are defined by the HSM, and are not
client-specific.  In our TOP -> state 3 example, above, the HSM would send
"enter" messages to the handlers for states B and 3.  In the 3 -> 1
example, it would send "exit" to state 3, and "enter" to state 1.  Within
many of the state handler functions defined by the recovery client,
enter/exit messages are treated as no-ops.  See dlm_recover.c, functions
that begin with "clm_st_" and "rc_".

When the HSM reaches a target/destination state, it sends a "start" message
to the state handler.  This message is also defined by the HSM, not the
client.  In the TOP -> 3 example, the HSM would send "start" to state 3
(but not state B, since that was not the target state).

Within the recovery state handlers, "start" often results in an action
and/or a transition to a new target state.  The HSM responds immediately,
resulting in more "enter" or "exit" messages while traversing, and a new
"start" message to the new target's state handler.  This can potentially
iterate many times as the HSM and handlers autonomously sequence through a
series of states.
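As an illustration of how such a tree can be encoded and traversed, the
example tree above can be expressed as an array of states with parent
links, and a transition path computed by ascending to the least common
ancestor.  This is a sketch only; the real table is the dlm_recov_states[]
array in dlm_recover.h, and its layout differs (it also names a handler
function for each state):

  #include <stdio.h>

  enum { ST_TOP, ST_A, ST_B, ST_1, ST_2, ST_3, N_STATES };

  /* Hypothetical state descriptor: a parent link plus a name. */
  struct state {
      int parent;              /* index of super-state, -1 for TOP */
      const char *name;
  };

  static const struct state tree[N_STATES] = {
      [ST_TOP] = { -1,     "TOP" },
      [ST_A]   = { ST_TOP, "A"   },
      [ST_B]   = { ST_TOP, "B"   },
      [ST_1]   = { ST_B,   "1"   },
      [ST_2]   = { ST_B,   "2"   },
      [ST_3]   = { ST_B,   "3"   },
  };

  static int depth(int s)
  {
      int d = 0;

      while (tree[s].parent >= 0) {
          s = tree[s].parent;
          d++;
      }
      return d;
  }

  /* Print the exit/enter sequence for a transition: ascend from the
   * current state to the least common ancestor (never exiting the LCA
   * itself), then descend into the target. */
  static void transition(int from, int to)
  {
      int path[N_STATES], n = 0, i;

      while (depth(from) > depth(to)) {       /* ascend the deeper side */
          printf("exit  %s\n", tree[from].name);
          from = tree[from].parent;
      }
      while (depth(to) > depth(from)) {
          path[n++] = to;
          to = tree[to].parent;
      }
      while (from != to) {                    /* ascend both until the LCA */
          printf("exit  %s\n", tree[from].name);
          from = tree[from].parent;
          path[n++] = to;
          to = tree[to].parent;
      }
      for (i = n - 1; i >= 0; i--)            /* descend into the target */
          printf("enter %s\n", tree[path[i]].name);
  }

  int main(void)
  {
      printf("-- TOP to 3:\n");  transition(ST_TOP, ST_3);
      printf("-- 3 to 1:\n");    transition(ST_3, ST_1);
      printf("-- 1 to A:\n");    transition(ST_1, ST_A);
      return 0;
  }

Running this prints exactly the enter/exit sequences listed in section 3.1
(enter B, enter 3; exit 3, enter 1; exit 1, exit B, enter A); the HSM would
additionally send "start" to the target state, as described above.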
3.3 Client-defined Messages
----------------------------

In addition to the HSM-internal enter/exit/start messages, the state
machine responds to messages from sources outside the state machine.  These
are the client-defined messages that drive the HSM to start various
autonomous sequences through the client-defined state tree.  See section 4
for information on the client-defined messages for the OpenDLM Recovery
State Machine.

3.3.1 Parents and Children
---------------------------

The hierarchy allows a parent/super state to handle any messages that would
be handled identically by some or all of its child states.  For example,
consider a situation in which 3 children would handle a message a certain
way, but one other child would handle the message differently.  The
parent/super state could handle the message for the 3, each of which would
simply ignore the message, causing it to be passed up to the parent, as
described below.  The one child with the different behavior would process
the message in its unique way, without the parent seeing the message.

To support the parent/super handling, the HSM executes the following
procedure.  The HSM attempts to deliver the message to the handler for the
current state.  However, if that handler doesn't process the message, the
handler returns the message to the HSM.  The HSM then tries the current
state's parent state (super-state) handler, then its parent, iterating to
the top of the tree until it finds a handler (the encompassing parent
handler) that will process the message (on behalf of one of its children).

3.4 HSM Implementation
-----------------------

The implementation of the HSM is in src/kernel/dlmdk/hsm.c and hsm.h.

The HSM engine waits for client-specific messages to be put on a message
queue via the hsm_process_event() function.  This is one of only two
functions that are called by code outside of the HSM or the state handlers.
The other public function is hsm_start() (see below).

When pulling messages off of the queue, and delivering them to the state
handlers, the dialog between the HSM engine and the state handlers is
synchronous.  This includes any processing done by, and any state changes
requested by, the current state's handler (these result in fairly immediate
state changes).

If a state handler sends a message to itself, that message gets placed on
the incoming message queue by hsm_process_event().  This is an asynchronous
operation; that message will get processed by the HSM only after all other
messages in front of it have been processed.  As an example, this means
that if a handler sends "itself" a message, then requests a state change,
the state change will occur *before* the message gets processed.

Important functions and macros are described in the next 3 sections:

3.4.1 Public HSM functions
---------------------------

Only two calls are used for driving the HSM from the "outside", that is,
from code not within the HSM engine itself.  hsm_process_event() may be
called from any client-specific code, whether within state handlers or not.

hsm_start() -- Starts the given state machine, putting it into TOP state.

hsm_process_event() -- This is the main entry point for driving the state
   machine from the outside world or from within the state handlers.  It
   adds a new message to the tail of the state machine's event queue, and
   processes the queue to deliver the messages to the state handlers.  It
   will *not* execute the delivery loop if this or another process/CPU is
   already in the delivery loop driving the state machine.  As an example,
   a given process will already be in the delivery loop when it receives a
   message from a state handler (being driven by the delivery loop).  The
   message will eventually get processed, however, because the loop will
   keep delivering until the queue is empty.

   In some cases, then, processing of the message may be synchronous
   (within the hsm_process_event() call), while in other cases it will not
   be (messages sent from within the state handlers are never processed
   synchronously).  Note that there is no indication to the caller as to
   whether the message was processed before returning to the caller.  If
   the caller needs to know, it must look at global state variables to
   determine the current state.
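A minimal sketch of this queue-and-deliver behavior (illustrative only; the
real hsm_process_event() in hsm.c keeps its queue and locking inside the
state machine structure): messages are appended to a FIFO, and only a
caller that finds the machine idle drains the queue, so a message posted
from inside a handler is delivered later by the loop that is already
running.

  #include <stdio.h>

  #define QLEN 16

  struct hsm {
      int queue[QLEN];             /* pending client-defined messages */
      int head, tail;
      int busy;                    /* non-zero while a delivery loop runs */
      void (*handler)(struct hsm *, int msg);
  };

  /* Hypothetical equivalent of hsm_process_event(): always enqueue; only
   * drain the queue if no delivery loop is already running.  A message
   * posted from inside a handler is therefore processed asynchronously,
   * after everything queued ahead of it. */
  static void hsm_post(struct hsm *m, int msg)
  {
      m->queue[m->tail++ % QLEN] = msg;

      if (m->busy)
          return;                  /* the running loop will get to it */

      m->busy = 1;
      while (m->head != m->tail)
          m->handler(m, m->queue[m->head++ % QLEN]);
      m->busy = 0;
  }

  static void demo_handler(struct hsm *m, int msg)
  {
      printf("handling message %d\n", msg);
      if (msg == 1)
          hsm_post(m, 2);          /* self-message: queued, not nested */
  }

  int main(void)
  {
      struct hsm m = { .handler = demo_handler };

      hsm_post(&m, 1);             /* prints message 1, then message 2 */
      return 0;
  }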
3.4.2 Private HSM functions
----------------------------

These (relatively) private functions support the dialog between the HSM and
the state handlers.  The first two of these functions are static within
hsm.c.  The third is called only from within state handlers (using the
STATE_TRAN macro, see next section).

hsm_process_event_private() -- Sends the external message to the handler
   for the current state, and if that handler doesn't consume (process) the
   message, the function iterates up the tree as far as the top level,
   looking for a consumer of the message.  Static within hsm.c.

hsm_execute_transition() -- Descends downward through the tree to a target
   state, sending "enter" messages to states it enters, and a "start"
   message to the target state.  It *must* begin with the current state
   being an ancestor (parent/grandparent/etc.) of the target state.  Static
   within hsm.c.

hsm_exit_private() -- Ascends upward through the tree to a least common
   ancestor (LCA) of the current state and a target state, sending "exit"
   messages to the states it leaves.  hsm_execute_transition() completes
   the transition, doing the descending/entering part of the journey.
   Invoked from within state handlers via the STATE_TRAN macro (see next
   section).

3.4.3 Important Macros
-----------------------

These macros are invoked from within state handlers to request a state
transition.  One or the other must be used, depending upon whether the path
through the state hierarchy tree involves an upward transition (through a
least-common-ancestor).

STATE_START -- Sets 'next' (target) state.  Used by state handlers to set a
   new target state before returning to hsm_execute_transition().  The
   target must be a child/grandchild/descendant of the current state, since
   hsm_execute_transition() knows only how to *descend* through the tree.

STATE_TRAN -- Sets 'next' (target) state, and invokes hsm_exit_private() to
   *ascend* to the LCA of the current and the new target state, before
   returning to hsm_execute_transition().  Used for new targets that are
   *not* a child/grandchild/descendant of the current state.
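As a sketch of the bubble-up dispatch that hsm_process_event_private()
performs (see also section 3.3.1), here is a hypothetical engine in which
each handler either consumes a message or hands it back, and the dispatcher
walks the parent chain until some ancestor consumes it.  This is not the
hsm.c code; the names and types are invented for illustration:

  #include <stdio.h>

  struct hsm;

  /* A handler returns 0 if it consumed the message, or returns the message
   * itself to let the engine try the parent (super) state. */
  typedef int (*handler_t)(struct hsm *, int msg);

  struct state {
      int parent;                  /* index of super-state, -1 for TOP */
      handler_t handler;
      const char *name;
  };

  struct hsm {
      const struct state *states;
      int current;                 /* index of current state */
  };

  /* Try the current state's handler, then bubble up toward TOP until
   * someone consumes the message. */
  static void dispatch(struct hsm *m, int msg)
  {
      int s = m->current;

      while (s >= 0) {
          if (m->states[s].handler(m, msg) == 0)
              return;                      /* consumed */
          s = m->states[s].parent;         /* bubble up */
      }
      printf("warning: message %d not handled\n", msg);
  }

  enum { MSG_NODE_DOWN = 1, MSG_UNKNOWN = 99 };
  enum { ST_TOP, ST_RUN };

  static int top_handler(struct hsm *m, int msg)
  {
      (void)m;
      if (msg == MSG_NODE_DOWN) {
          printf("TOP: node down, go drain\n");
          return 0;
      }
      return msg;                  /* not ours: default warning follows */
  }

  static int run_handler(struct hsm *m, int msg)
  {
      (void)m;
      return msg;                  /* this leaf ignores everything here */
  }

  static const struct state states[] = {
      [ST_TOP] = { -1,     top_handler, "TOP" },
      [ST_RUN] = { ST_TOP, run_handler, "RUN" },
  };

  int main(void)
  {
      struct hsm m = { states, ST_RUN };

      dispatch(&m, MSG_NODE_DOWN);         /* bubbles up to TOP */
      dispatch(&m, MSG_UNKNOWN);           /* falls off the tree: warning */
      return 0;
  }

The last line of dispatch() plays the role of TOP's "default" case in the
recovery client (see section 5.1): an unrecognized message is logged and
otherwise ignored.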
4. OpenDLM Recovery State Machine Client-Defined Messages
----------------------------------------------------------

Messages get sent to the state machine via the hsm_process_event() function
(see src/kernel/dlmdk/hsm.c), which is called from 4 different locations
within OpenDLM (all also within the src/kernel/dlmdk directory):

-- clm_main.c:  clm_master_loop(), based on input from the cluster manager
   or from the purge comm port

-- clm_info.c:  handle_new_topology(), when first initializing

-- clm_msg.c:  clm_sync(), when synchronizing recovery states for all nodes

-- dlm_recovery.c:  state handlers and other functions auto-sequence the
   state machine.

Also, the PORT_RECOVERY comm callback listens for messages from other
nodes.

4.1 Cluster Manager Messages
-----------------------------

The clm_master_loop() function (clm_main.c) waits for events from the
cluster manager.  When one arrives, it calls the following functions to
process the event, and eventually place the event on the tail of the
dlm_transition_queue, implemented in dlm_sched.c:

   dlm_process_cm_message() (clm_msg.c) -> handle_new_topology() (clm_info.c)

After calling these, clm_master_loop() then calls:

   new_change_request() (dlm_sched.c)

to remove an event from the head of the dlm_transition_queue (may or may
not be the one just placed on the queue), and then soon calls:

   hsm_process_event() (hsm.c)

to send the next event message to the recovery state machine.  (?? could
this long process be simplified ??)

The following messages result from cluster management events.  Next to each
message are shown the function calls that lead to placing the event onto
the dlm_transition_queue:

CLM_MSG_NODE_DELETE
   handle_new_topology() -> update_node_status() -> clms_node_delete()

CLM_MSG_NODE_ADD
   handle_new_topology() -> clms_node_add()

CLM_MSG_NODE_DOWN
   handle_new_topology() -> update_node_status() -> clms_node_down()

The purge message CLM_MSG_NODE_PURGE also uses the dlm_transition_queue
(see section 4.3).  Note that CLM_MSG_NODE_UP is not among the cluster
event messages; it is generated by the recovery state machine.

Another message is driven by handle_new_topology(), when the node is first
initializing, without using the dlm_transition_queue.  This tells our state
machine to query other nodes for directory information:

CLM_MSG_NODE_INIT

4.2 Sync Comm Port Message
---------------------------

When the cluster is recovering from an exiting node, there are several
steps at which all nodes in the cluster must be in sync with one another.
The state machine waits for this, and moves along only when it receives:

RC_MSG_GOT_BARRIER
   clm_sync() -> [handle_sync()] -> dlm_sync_notify_recovery()

This barrier message is driven from clm_sync() in clm_msg.c.  This function
sends MSYNC messages, via the PORT_SYNC comm port, to all nodes to drive
their state machines to a certain recovery state.  When reached, each node
lets the others know that the state has been reached, via the PORT_SYNCRET
comm port.  When our node sees that all nodes have reported the same state,
it sends RC_MSG_GOT_BARRIER to our state machine, so it can progress to the
next recovery state.
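A sketch of the barrier idea (illustrative; the real clm_sync()/PORT_SYNCRET
exchange carries more state than this): each node records which peers have
reported reaching the awaited recovery state, and only when every active
node has reported does the local node inject RC_MSG_GOT_BARRIER into its
own state machine.

  #include <stdio.h>

  #define MAX_NODES 8

  /* Hypothetical bookkeeping for one sync point. */
  struct barrier {
      int wanted_state;            /* recovery state we are waiting for */
      int n_active;                /* number of nodes that must report  */
      int reported[MAX_NODES];     /* which node IDs have reported      */
      int n_reported;
  };

  /* Called when a PORT_SYNCRET-style report arrives saying that 'node' has
   * reached 'state'.  Returns 1 when all active nodes have reported, i.e.
   * when RC_MSG_GOT_BARRIER should be posted to the local state machine. */
  static int barrier_report(struct barrier *b, int node, int state)
  {
      if (state != b->wanted_state || b->reported[node])
          return 0;                /* wrong state, or a duplicate report */

      b->reported[node] = 1;
      b->n_reported++;
      return b->n_reported == b->n_active;
  }

  int main(void)
  {
      struct barrier b = { .wanted_state = 2, .n_active = 3 };
      int i;

      for (i = 0; i < 3; i++)
          if (barrier_report(&b, i, 2))
              printf("all nodes at barrier: post RC_MSG_GOT_BARRIER\n");
      return 0;
  }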
4.3 Purge Comm Port Message
----------------------------

The purge message is driven from pti_read_request(), which monitors the
PORT_MSTR_CR comm port.  A DLM_PURGELOCKS port message will result in a
CLM_MSG_NODE_PURGE message to the recovery state machine:

CLM_MSG_NODE_PURGE
   pti_read_request() -> master_purge() -> clms_node_purge()

4.4 Recovery Comm Port Messages
--------------------------------

Many messages are received from the state machines in other nodes, via the
PORT_RECOVERY comm port.  The callback for that port is
clmr_message_handler() (dlm_recovery.c).  It forwards the following to our
state machine:

CLM_MSG_NODE_UP
RC_MSG_DIR_RESTORE
RC_MSG_DIR_END
RC_MSG_DIR_NAK
RC_MSG_DIR_NOT_UP
RC_MSG_DIR_DATA      /* directory query response */
RC_MSG_DIR_REQUEST   /* initial directory query */
RC_MSG_RES_REQ_1ST   /* new data to merge */
RC_MSG_RES_REQ       /* new data to merge */
RC_MSG_RES_DATA      /* master handle response */
RC_MSG_RES_RESTART   /* new node starting */

4.5 Recovery State Machine Messages
------------------------------------

Some messages originate within the state machine itself (dlm_recovery.c),
and drive it to perform various actions and sequence into different states.
These include:

RC_MSG_SEND_MORE
RC_MSG_PURGE_CHECK
RC_MSG_DIR_RETRY

5. Recovery State Machine Handlers
-----------------------------------

The recovery state machine is implemented within
src/kernel/dlmdk/dlm_recover.c.  The recovery state hierarchy tree is only
3 states deep.  Beneath the TOP level, there are 5 "CLM" states, 3 of which
have children "RC" states.

In addition to the state machine structure (which contains the current
state, among other information), there are also a few global variables,
defined within dlm_recover.c, that describe state.  Some simple methods are
available in clm_main.c to allow querying these state variables or the
current state machine state.  The query methods are:

-- clm_in_reconfig():  returns the state of variable
   "dlm_recov_in_reconfig", which means we are in CLM_ST_RECONFIG (or one
   of its sub-states), trying to (re)build the cluster-wide distribution of
   lock information.

-- clm_is_running():  returns TRUE if the recovery state machine is in
   CLM_ST_RUN, which is the normal, non-recovery, operating mode.

-- clm_in_delivery():  returns TRUE if the recovery state machine is in
   RC_DIR_INIT, which means that we are gathering directory information
   from other nodes.

-- clm_run_rcv():  returns the state of variable "dlm_recov_can_transact",
   which indicates that we are in a state in which it is legal to receive
   transactions (requests for lock operations).

There is also a global variable which has no query method:

-- grace_period:  Indicates that we are in a particular range of sub-states
   of the CLM_ST_RECONFIG state.  Entry into the RC_BARRIER2 sub-state sets
   this to 1.  The RC_BARRIER6 sub-state sets it to 0, just before entering
   CLM_ST_RUN, which also sets it to 0.  Used in clm_process() to help
   determine if we can accept a lock request from a client at the moment.

There are also a few static state variables used just within
dlm_recovery.c:

-- dlm_recov_in_purge:  TRUE if the current recovery pass was triggered by
   a CLM_MSG_NODE_PURGE message (see section 5.5).

-- dlm_recov_in_add_delete:  TRUE if the current recovery pass was
   triggered by a CLM_MSG_NODE_ADD or CLM_MSG_NODE_DELETE message (see
   section 5.5).

-- dlm_recov_updates_inflight:  Indicates the number of directory updates
   that have been sent to a directory node, but not yet acknowledged (see
   sections 5.6.2 and 5.6.4).

The states and messages supported by the Recovery State Machine handlers
are described below:

5.1 CLM_ST_TOP
---------------

The TOP state is a current state only for a brief instant, when the HSM is
entered and started by hsm_start(), when initializing the OpenDLM system
(kern_main() -> dlm_recov_init() -> hsm_start()).  It transitions
immediately to the CLM_ST_INIT state, never again to return to TOP as a
current state.

Therefore, TOP's main role is to be the highest level ancestor state for
all of the other states.  It serves as a default catch-all for messages
that are not handled by states lower in the hierarchy (see section 3.3).
It handles the following:

START:  Transitions immediately to CLM_ST_INIT.

CLM_MSG_NODE_DOWN:  Some node is leaving the cluster.
All states (except for, currently, CLM_ST_RUN, which treats it the same way
and is redundant) allow this message to bubble up to TOP, which (re)sets a
few static recovery variables:

   dlm_recov_in_add_delete = FALSE;
   dlm_recov_in_purge = FALSE;
   dlm_recov_leaving_node = node that sent NODE_DOWN message

It then transitions to CLM_ST_DRAIN.  In other words, no matter what state
we're in, when we get a CLM_MSG_NODE_DOWN we stop what we're doing and go
to CLM_ST_DRAIN.

CLM_MSG_NODE_UP:  Some node is joining the cluster.  Almost all states
allow this message to bubble up to TOP, which sends an RC_MSG_DIR_NAK back
to the sending node, meaning that we're not in a state which can respond to
a directory request.  The sending node can wait before trying again.  The
only two states that handle this message directly are CLM_ST_RUN (the only
state that can successfully handle it) and RC_DIR_INIT (which sends a
special RC_MSG_DIR_NOT_UP message back to the sending node).

default:  Any message that has no meaning to the current state or its
ancestors causes a warning message to be sent to the system log.  Operation
of the DLM continues as if the message did not occur.

5.2 CLM_ST_INIT
----------------

CLM_ST_INIT is entered only from TOP, just after hsm_start().  It
transitions immediately to its RC_INIT_WAIT sub-state, never to return to
CLM_ST_INIT as a current state.  This state has 2 leaf-level sub-states,
but handles no messages on their behalf, and handles no messages for itself
as a current state.  It handles only:

START:  Transition immediately to the RC_INIT_WAIT sub-state.

5.2.1 RC_INIT_WAIT
-------------------

RC_INIT_WAIT is the first leaf-level state to be current when initializing.
It is entered only from CLM_ST_INIT, and once it exits, it never returns
here again.  In this state, the node is just waiting to be told to
initialize its directory information.  It handles the following messages:

CLM_MSG_NODE_INIT:  This message is sent by *this* node to itself, when
handling the initial cluster topology configuration in
handle_new_topology().  If this node is joining a cluster with other active
members, it needs to get directory information from its neighbor, so we
transition to the RC_DIR_INIT state.  If this node is the only active
member in the cluster, there's no neighbor from which to get directory
info, so we transition directly to CLM_ST_RUN.

EXIT:  When exiting this state (to either RC_DIR_INIT or CLM_ST_RUN, or any
other state such as CLM_ST_DRAIN), the handler initializes the cccp
communications module via cccp_init().

5.2.2 RC_DIR_INIT
------------------

This leaf state is a bit more complex than the ones we've discussed so far.
In this state, we are trying to migrate directory information into our node
from (one? all?) other nodes within the cluster that we just joined.  Some
of these nodes may not be ready to give us that information, so there is
provision for retrying our requests, or skipping nodes that do not have
useful directory information.  It handles:

ENTRY:  It resets the purge logic, finds our closest neighbor in the
cluster, and falls through to RC_MSG_DIR_RETRY, to try to get directory
info from that neighbor.  (?? Other state handlers seem to send a message
instead of "falling through".  In this case, the message would be
RC_MSG_DIR_RETRY.  Is there a reason that this state is implemented
differently from others?  Why is this done on ENTRY instead of START??)

RC_MSG_DIR_RETRY:  Sends CLM_MSG_NODE_UP to the neighbor node, to request
directory information migration from that node.
This will be a *first* try, not a *re*-try, if it fell through from the
ENTRY message.  RC_MSG_DIR_RETRY always originates from within *this* node,
and from within this state, from the handling of other messages (see
below).

RC_MSG_DIR_NAK:  We receive this from the neighbor node if it is
temporarily not in a state in which it can send directory data.  We set a
timer (3 seconds) and send ourselves RC_MSG_DIR_RETRY, to try the same node
again.

RC_MSG_DIR_NOT_UP:  We receive this from the neighbor node if it is in the
same state that we are, trying to get directory info from a neighbor node.
Since it's just getting started, we know that it is gathering only the
directory information for itself, and won't have any that belongs to us.
We skip that node, find another node, and set the 3 second timer to send
ourselves RC_MSG_DIR_RETRY.  If we've already tried all the nodes (?? --
logic looks incomplete for this ??), we give up and enter the RUN state.

RC_MSG_DIR_END:  We receive this from the neighbor node if it has finished
sending us directory information.  We copy some migration control
parameters from it, and enter the RUN state.  Apparently, we need directory
information from just one other node, since there is no provision to query
other nodes once we receive this message (??? Shouldn't we gather dir info
from all active nodes??).

RC_MSG_DIR_RESTORE:  This message contains directory information from
another node!  We insert the entry into our local directory.

CLM_MSG_NODE_UP:  We receive this from another node that is joining the
cluster.  Since we are still trying to gather our own directory information
initially, we don't have any directory info that belongs on this new node.
We send the RC_MSG_DIR_NOT_UP message, so it can look for another node to
query.

5.3 CLM_ST_RUN
---------------

This is the normal operating state.  Since this is a "recovery" state
machine, this state doesn't need to do much except wait for interesting
things to happen in terms of cluster membership, then transition to the
appropriate state.  This is a "leaf" state, with no children.  It handles:

ENTRY:  Just set a few global state variables to indicate that we're
running:

   grace_period = 0;
   clm_active = TRUE;
   dlm_recov_can_transact = TRUE;

The call to purge_active() is a no-op for Linux.

CLM_MSG_NODE_PURGE:  Set a couple of global state variables to indicate
that we're in purge mode:

   dlm_recov_in_add_delete = FALSE;
   dlm_recov_in_purge = TRUE;

Then transition to CLM_ST_DRAIN.

CLM_MSG_NODE_ADD:
CLM_MSG_NODE_DELETE:  Set a couple of global state variables to indicate
that we're in add/delete mode:

   dlm_recov_in_add_delete = TRUE;
   dlm_recov_in_purge = FALSE;

Then transition to CLM_ST_DRAIN.

CLM_MSG_NODE_UP:  Set the following state variable so we know which node is
joining:

   dlm_recov_joining_node = node that sent NODE_UP message

Transition to CLM_ST_SEND_DIR to send directory info to the newly joining
node.

CLM_MSG_NODE_DOWN:  Set a few global variables:

   dlm_recov_in_add_delete = FALSE;
   dlm_recov_in_purge = FALSE;
   dlm_recov_leaving_node = node that sent NODE_DOWN message

It then transitions to CLM_ST_DRAIN.  This is identical to, and redundant
with, the NODE_DOWN handling in TOP, and could be removed from the RUN
handler.

5.4 CLM_ST_SEND_DIR
--------------------

In this state, we're migrating some of our directory data to a node that is
joining the cluster.  This is a "leaf" state, with no children.  It
handles:

ENTRY:  We're getting ready to send our directory information to a newly
joining node, so we reset the enumerator that will walk through our
directory database, so it will start at the beginning of the database.  We
then send an RC_MSG_SEND_MORE to ourself, to effectively fall through
into ...

RC_MSG_SEND_MORE:  We send directory entries to the newly joining node.
Using rl_enum_counted() to walk through the directory hash table, we send
the entries from up to 32 hash table slots, using clmr_send_dir_entry().
This function looks for those directory entries that belong on the newly
joining node, based on the resource names and the new number of nodes,
using clm_direct_dirsend() -> hashfunc().  See these functions to
understand how directory entries are distributed throughout the cluster.
(??? Once the 32 slots are done, how do we (or the other node) ask for
more???)
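A sketch of that batched walk (hypothetical code; the real walk uses
rl_enum_counted() and clmr_send_dir_entry(), and only the 32-slot batch
size comes from the text above): scan a fixed number of hash buckets per
RC_MSG_SEND_MORE, forwarding only the entries whose names now hash to the
joining node, and remember where to resume.

  #include <stdio.h>

  #define N_BUCKETS 256
  #define BATCH     32             /* buckets examined per RC_MSG_SEND_MORE */

  /* Hypothetical directory entry and hash table. */
  struct dir_entry {
      const char *res_name;
      struct dir_entry *next;
  };

  static struct dir_entry *dir_table[N_BUCKETS];
  static int next_bucket;          /* enumerator: where the walk resumes */

  /* Stand-ins for the hashfunc()-based placement (see section 2). */
  static unsigned int hash_name(const char *s)
  {
      unsigned int h = 5381;

      while (*s)
          h = (h * 33) ^ (unsigned char)*s++;
      return h;
  }

  static int directory_node(const char *name, int n_nodes)
  {
      return (int)(hash_name(name) % (unsigned int)n_nodes);
  }

  /* Handle one RC_MSG_SEND_MORE: scan up to BATCH buckets, sending any
   * entry that now belongs on the joining node.  Returns 1 if buckets
   * remain, i.e. if another RC_MSG_SEND_MORE should follow. */
  static int send_dir_batch(int joining_node, int n_nodes)
  {
      int end = next_bucket + BATCH;

      for (; next_bucket < N_BUCKETS && next_bucket < end; next_bucket++) {
          struct dir_entry *e;

          for (e = dir_table[next_bucket]; e; e = e->next)
              if (directory_node(e->res_name, n_nodes) == joining_node)
                  printf("send entry '%s' to node %d\n",
                         e->res_name, joining_node);
      }
      return next_bucket < N_BUCKETS;
  }

  int main(void)
  {
      static struct dir_entry a = { "resource-A", NULL };
      int n_nodes = 4;
      int joining = directory_node("resource-A", n_nodes);

      dir_table[hash_name("resource-A") % N_BUCKETS] = &a;

      while (send_dir_batch(joining, n_nodes))
          ;                        /* OpenDLM posts RC_MSG_SEND_MORE instead */
      return 0;
  }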
5.5 CLM_ST_DRAIN
-----------------

We transition to the DRAIN state when a node leaves the cluster.  After a
member node dies, this state (and its 2 leaf sub-states) cleans out old,
unusable information (about the dead node) that is still on this node.

We can arrive here from several states, for several reasons.  To define
each case, the following global variables must be set appropriately in the
previous state, before transitioning to DRAIN:

dlm_recov_in_add_delete -- TRUE if here because of a CLM_MSG_NODE_ADD or
   CLM_MSG_NODE_DELETE message (see RUN state handler). *

dlm_recov_in_purge -- TRUE if here because of a CLM_MSG_NODE_PURGE message
   (see RUN state handler). *

   * If here because of CLM_MSG_NODE_DOWN (see TOP state handler), both the
     dlm_recov_in_add_delete and dlm_recov_in_purge globals are FALSE.

dlm_recov_leaving_node -- indicates which node is leaving the cluster.
   (??? it doesn't seem to get set for NODE_ADD or NODE_DELETE ???)

This state handles the following messages:

START:  We transition immediately to RC_BARRIER1.

EXIT:  We set the global variable dlm_recov_can_transact to FALSE.  This
means that this node is not able to accept transaction requests from itself
or other nodes; it is entering a reconfiguration state (RC_BARRIER1).

5.5.1 RC_BARRIER1
------------------

This is the first of a series of "barrier" states.  A barrier state is one
that must be achieved by *all* nodes in the cluster, before we can move on
to a next state.  This state begins the process of recovery, after a node
has died.  This, like all of the other "RC_" states, is a leaf state.  It
handles the following messages:

ENTRY:  This does the real work of this state.  As the first phase of a
reconfiguration, it does the following on *this* node:

-- reset_inflight() marks any "inflight" requests (requests that have been
   sent to the dead node, but not acknowledged by the dead node) to try
   them again after reconfiguration is complete (and the dead node is
   running again).

-- clmr_purge_res() removes any locks, held by dead nodes, that are on any
   of the queues (grant/convert/wait) of the resources on our resource
   list.

-- clmr_dir_purge_chk() (driven by the rl_enum() iterator that goes through
   the directory entries on our node) removes any directory entries that
   point to the dead node(s).

-- clm_sync() waits until *all* nodes finish the above 3 jobs.  We'll
   receive RC_MSG_GOT_BARRIER when all nodes are done.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached a common point in the recovery sequence.  We simply
transition to RC_PURGE_WAIT.
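A sketch of the dead-node cleanup that clmr_purge_res() performs on each
resource (hypothetical structures; the real resource and lock records are
much richer): walk each of the three queues and unlink any lock whose
owning node is no longer a cluster member.

  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical lock record: just enough to show the filtering. */
  struct lock {
      int owner_node;
      struct lock *next;
  };

  struct resource {
      struct lock *grant, *convert, *wait;   /* the three lock queues */
  };

  static int node_is_dead(int node, const int *dead, int n_dead)
  {
      int i;

      for (i = 0; i < n_dead; i++)
          if (dead[i] == node)
              return 1;
      return 0;
  }

  /* Unlink and free every lock on a queue that is owned by a dead node. */
  static void purge_queue(struct lock **q, const int *dead, int n_dead)
  {
      while (*q) {
          if (node_is_dead((*q)->owner_node, dead, n_dead)) {
              struct lock *gone = *q;

              *q = gone->next;
              printf("dropping lock owned by dead node %d\n",
                     gone->owner_node);
              free(gone);
          } else {
              q = &(*q)->next;
          }
      }
  }

  static void purge_resource(struct resource *r, const int *dead, int n_dead)
  {
      purge_queue(&r->grant,   dead, n_dead);
      purge_queue(&r->convert, dead, n_dead);
      purge_queue(&r->wait,    dead, n_dead);
  }

  int main(void)
  {
      struct resource r = { NULL, NULL, NULL };
      struct lock *a = malloc(sizeof(*a)), *b = malloc(sizeof(*b));
      int dead[] = { 3 };

      a->owner_node = 3;  a->next = NULL;     /* lock from the dead node */
      b->owner_node = 1;  b->next = a;        /* lock from a survivor    */
      r.grant = b;

      purge_resource(&r, dead, 1);            /* drops only node 3's lock */
      free(b);
      return 0;
  }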
5.5.2 RC_PURGE_WAIT
--------------------

This, like all of the other "RC_" states, is a leaf state.  It handles the
following messages:

ENTRY:  This state does the following on *this* node:

-- reconfig_free_client_purge() (driven via the client_apply() iterator on
   each OpenDLM client app on this node) continues any purge activity on
   this node that was blocked because a migration was in progress (??).
   Or, it starts a purge process that had not started because we were in
   recovery when it was requested.

-- sends an RC_MSG_PURGE_CHECK message to ourself, to continue on.

RC_MSG_PURGE_CHECK:  Received from ourself.  Does the following:

-- reconfig_check_client_purge() (driven via the client_apply() iterator on
   each OpenDLM client on this node) checks to make sure that all purging
   operations on this node are complete.

-- If so, transition to RC_BARRIER2 (sub-state of CLM_ST_RECONFIG).  (?? If
   not, there seems to be no way to continue on ????  There is no other
   source for RC_MSG_PURGE_CHECK to trigger another check!?!)

5.6 CLM_ST_RECONFIG
--------------------

This state has the largest number (7) of sub-states, and is used to
(re-)distribute resource (including locks on the resource) and directory
information across the cluster after cluster membership changes.

If a node joins the cluster, then we arrive here via ??

If a node leaves the cluster, then we arrive here via the CLM_ST_DRAIN
state (actually via its sub-states RC_BARRIER1 and RC_PURGE_WAIT), which
cleans out unusable information about the dead node.  In CLM_ST_RECONFIG
(and sub-states), we then re-distribute information about the surviving
nodes.

CLM_ST_RECONFIG is never a target state (i.e. never a final "current"
state).  However, it may be entered or exited when the state machine is
transitioning to or from its sub-states.  It also serves as a "catch-all"
handler for several messages, on behalf of its sub-states.

It handles the following transitional messages from the HSM:

ENTRY:  Sets the "dlm_recov_in_reconfig" global state variable to TRUE.

EXIT:  Sets the "dlm_recov_in_reconfig" global state variable to FALSE.

START:  Makes sure that dlm_recov_in_reconfig = TRUE.  Transitions
immediately to RC_BARRIER2.

These are the only places that the dlm_recov_in_reconfig variable is set or
reset.

It handles the following client-specific messages on behalf of its
sub-states.  In each case, this is the only place in which the message is
handled.  That is, each of the messages is valid for all of the sub-states
of CLM_ST_RECONFIG, and is passed up to here.  Conversely, they are invalid
for any other states (e.g. CLM_ST_DRAIN) in the recovery state machine, and
would be handled by TOP's default warning message.  The following are
handled here:

RC_MSG_DIR_REQUEST:  This message comes from another node's clmr_res()
function, and is received on this node via the PORT_RECOVERY comm port.

-- handle_dir_request(), dlm_recovery.c, makes sure that we have an active
   directory entry (i.e. we are the directory node for this resource), and
   sends an RC_MSG_DIR_DATA message back to the other node.  This tells the
   other node which node is the (perhaps new) master node for the resource
   (see immediately below).

RC_MSG_DIR_DATA:  This message comes from another node's
handle_dir_request() function, and is received on this node via the
PORT_RECOVERY comm port.  This message is a response from the (perhaps new)
directory node back to us, confirming that it has (perhaps created) a valid
directory entry for the resource, and telling us which node is the (perhaps
new) master.
We do the following for each resource structure on our node, depending on
whether we are the (perhaps new) master of the resource:

If master:

-- clmr_clean(), dlm_recovery.c, scans through all locks on our (master)
   copy of the resource, and removes any locks that were held by dead
   nodes.  It also invalidates (clears) the lock value block (LVB) for any
   resources that had exclusive (EX) or protected write (PW) locks held by
   a dead node.  Such a node might have changed the LVB contents without us
   knowing about it.

If not master:

-- clmr_rebuild_send(), dlm_recovery.c, cleans out our (secondary) copy of
   the resource, removing any locks that are not held on this node; as a
   secondary copy, we don't need to know about any other nodes' locks.
   Then, it sends either an RC_MSG_RES_REQ or an RC_MSG_RES_REQ_1ST
   message, via PORT_RECOVERY, to the master node, so the master node can
   know what we know about our own locks, and integrate it into the master
   copy of the resource.  These messages are handled by either the
   RC_BARRIER4 or the RC_RESOURCE_QUERY state.  The master node will
   respond with RC_MSG_RES_DATA (see immediately below).

RC_MSG_RES_DATA:  This message comes from another node's clmr_respond()
function, called from their handle_resource_request() function after
receiving an RC_MSG_RES_REQ(_1ST) (see immediately above) from us.  It is
received on this node via the PORT_RECOVERY comm port.  We copy the master
node's resource handle, and the node id of the master node, into our copy
of the resource.

5.6.1 RC_BARRIER2
------------------

This barrier state waits for all other nodes to get to barrier state 2
(meaning what??).  This, like all of the other "RC_" states, is a leaf
state.  It handles the following messages:

ENTRY:  Sets two global variables:

   grace_period = 1;
   clm_active = TRUE;

Then waits for all nodes to achieve barrier state 2.  Once achieved, we
receive the RC_MSG_GOT_BARRIER message, see below.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached barrier state 2.  We branch in two different directions,
depending on whether we are in "dlm_recov_in_purge" mode:  If in purge, we
simply transition to RC_BARRIER6.  If not in purge, we simply transition to
RC_REBUILD_DIR.

5.6.2 RC_REBUILD_DIR
---------------------

This state does a complete rebuild of directory information throughout the
cluster, using information in each node's resource table.  In this state,
we use info from resource master copies, not non-master copies.  The
non-master copies get used later, in the RC_RESOURCE_QUERY state.

This state steps through all of the resource structures stored on this
node, and sends the information about each resource to that resource's
directory node.  That directory node may be a new node, building a
directory from scratch, or it may be an old member node that is becoming
the directory node for a new set of resources, due to the change in cluster
membership.

ENTRY:  Several steps are executed here:

-- reconfig_del_dead_client(), clm_client.c, removes any knowledge that we
   have on *this* node about clients on dead nodes, by removing those
   clients from our "cli_refs" list.

Then, if we are in "dlm_recov_in_add_delete" mode:

-- clm_change_topology(), clm_info.c, reorganizes our node list to contain
   only the new cluster membership.
-- clmr_dir_move_check(), clm_recover.c, (called via rl_enum(), clm_rldb.c,
   for each directory entry on our node) deletes any active directory (but
   not cache) entries for resources for which we just were, but no longer
   are, the directory node.  That is, it deletes directory entries that are
   migrating away from us due to the change in cluster membership.

Then, regardless of mode:

-- resource_table_iterator_create(), clm_resource.c, creates an iterator
   for going through the resource table on this node.

-- send an RC_MSG_SEND_MORE message to ourself to do the next job.

RC_MSG_SEND_MORE:  We use the iterator to go through our node's list of
resources, calling:

-- clmr_dir(), clm_resource.c.  If we are the current resource master, we
   have the most up-to-date resource information.  We send the info to the
   resource's (perhaps new) directory node, via update_directory().
   update_directory() sends the info within an rl_update_t structure, and
   delivers it via PORT_DIRUPDT.  The info includes:

      resource name, name length
      type (UNIX vs VMS)
      flags
      master node
      pointer to our copy of the resource (for use after ack)

   The port listener function for PORT_DIRUPDT is dir_read_update(), in
   clm_direct.c.  On the directory node, it either modifies a pre-existing
   directory entry, or creates a new one, using the info sent above.

   We keep track of the number of updates "in flight" (sent to the
   directory node, but not yet acknowledged by the directory node) via the
   global variable "dlm_recov_updates_inflight", which we increment each
   time we invoke update_directory().  The counter is decremented via
   dlm_recov_dir_msg_retire(), called after we get an acknowledgement.

   If we are not the master, we don't send info, but take the opportunity
   to make sure that our copy of the resource contains only locks held by
   this node in its queues (that's all that a local copy should contain).

Then, we check "dlm_recov_updates_inflight" to see if all of our directory
updates have been sent and retired.  If so, we transition to RC_BARRIER3.
If not, we must get called again with RC_MSG_SEND_MORE, which is done by
dlm_recov_dir_msg_retire() once the number of inflight updates is below a
certain threshold (dlm_buff_limit = 500).

(?? Something looks odd with this mechanism ... the iterator goes through
all of the resources in one shot, so why would it need to be
re-stimulated??  Then, we get hundreds of RC_MSG_SEND_MORE messages before
the final one gets us to change state??  Even though the messages are to
ourself, that's still a lot of pretty useless messages??)

EXIT:  Destroy the iterator created during ENTRY.
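A sketch of that in-flight accounting (hypothetical control flow; the names
dlm_recov_updates_inflight and dlm_buff_limit come from the text above, but
the real code drives this through RC_MSG_SEND_MORE messages and the
resource-table iterator): each update sent bumps a counter, each
acknowledgement retires one, and the sender stops whenever the backlog
reaches the limit, resuming when acknowledgements bring it back down.

  #include <stdio.h>

  #define INFLIGHT_LIMIT 500       /* cf. dlm_buff_limit in the text */

  static int updates_inflight;     /* cf. dlm_recov_updates_inflight */
  static int next_resource;        /* iterator position               */
  static int n_resources = 1200;   /* resources this node must report */

  /* Send directory updates until we hit the in-flight limit or run out of
   * resources.  Returns 1 when every update has been sent and retired,
   * i.e. when it is safe to move on to the next barrier. */
  static int send_more(void)
  {
      while (next_resource < n_resources &&
             updates_inflight < INFLIGHT_LIMIT) {
          /* update_directory(resource[next_resource]) would go here */
          next_resource++;
          updates_inflight++;
      }
      return next_resource == n_resources && updates_inflight == 0;
  }

  /* Called when the directory node acknowledges one update; once the
   * backlog drops below the limit, re-stimulate the sender (in OpenDLM, by
   * posting another RC_MSG_SEND_MORE to the state machine). */
  static void retire_one(void)
  {
      updates_inflight--;
      if (updates_inflight < INFLIGHT_LIMIT)
          send_more();
  }

  int main(void)
  {
      send_more();                 /* first RC_MSG_SEND_MORE */
      while (updates_inflight > 0)
          retire_one();            /* simulate acknowledgements arriving */
      printf("all %d updates sent and retired\n", n_resources);
      return 0;
  }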
5.6.3 RC_BARRIER3
------------------

This barrier state waits for all other nodes to get to barrier state 3
(meaning what??).  This, like all of the other "RC_" states, is a leaf
state.  It handles the following messages:

ENTRY:  Waits for all nodes to achieve barrier state 3.  Once achieved, we
receive the RC_MSG_GOT_BARRIER message, see below.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached barrier state 3.  We simply transition to
RC_RESOURCE_QUERY.

5.6.4 RC_RESOURCE_QUERY
------------------------

ENTRY:  Sets one global variable:

   dlm_recov_can_del_dir = TRUE;

Then creates a resource table iterator, and sends an RC_MSG_SEND_MORE to
fall through to do the job below.

RC_MSG_SEND_MORE:  Does the following:

-- resend_updates(), clm_direct.c, handles any requests made to our node to
   delete directory entries while we were unable to; we put them on a
   holding stack to be processed now.

-- clmr_res(), dlm_recover.c, gets called for each resource known by our
   node, and does one of two things, depending on whether we are the master
   for the resource:

   master:  clean out locks, held by dead nodes, from our resource's lock
      queues (grant/convert/wait)

   not master:  send RC_MSG_DIR_REQUEST to the directory node.  This tells
      the directory node to make sure that it has a directory entry.  Also,
      the directory node sends back mastership info, so that we don't need
      to consult the directory again for this resource.

As in the RC_REBUILD_DIR state, we keep track of the number of updates "in
flight" via the global variable "dlm_recov_updates_inflight".  After we're
done iterating through the resources, we check that it is 0, then
transition to RC_BARRIER4.

RC_MSG_RES_REQ:
RC_MSG_RES_REQ_1ST:  We are the master node for a certain resource.
Another node is sending us (pushing it; we didn't request it, see
CLM_ST_RECONFIG state, RC_MSG_DIR_DATA message) information about locks it
holds on the resource.  We call:

-- handle_resource_request(), dlm_recover.c, which integrates that info
   into our master copy of the resource structure, and sends the other node
   a copy of our resource handle (a cluster-wide resource identifier,
   smaller than the resource name and type), using RC_MSG_RES_DATA.  See
   CLM_ST_RECONFIG state, RC_MSG_RES_DATA message.

Note that RC_MSG_RES_REQ(_1ST) is handled identically by the RC_BARRIER4
state.

EXIT:  Destroys the iterator created during ENTRY.

5.6.5 RC_BARRIER4
------------------

This barrier state waits for all other nodes to get to barrier state 4
(meaning what??).  This, like all of the other "RC_" states, is a leaf
state.  It handles the following messages:

ENTRY:  Waits for all nodes to achieve barrier state 4.  Once achieved, we
receive the RC_MSG_GOT_BARRIER message, see below.

RC_MSG_RES_REQ:
RC_MSG_RES_REQ_1ST:  We are the master node for a certain resource.
Another node is sending us (pushing it; we didn't request it, see
CLM_ST_RECONFIG state, RC_MSG_DIR_DATA message) information about locks it
holds on the resource.  We call:

-- handle_resource_request(), dlm_recover.c, which integrates that info
   into our master copy of the resource structure, and sends the other node
   a copy of our resource handle (a cluster-wide resource identifier,
   smaller than the resource name and type), using RC_MSG_RES_DATA.  See
   CLM_ST_RECONFIG state, RC_MSG_RES_DATA message.

Note that RC_MSG_RES_REQ(_1ST) is handled identically by the
RC_RESOURCE_QUERY state.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached barrier state 4.  We check resources that have been
reassigned to this node as master, after their previous master node died,
via:

-- clmr_lkvlb(), dlm_recover.c, which clears stale locks (marked by the
   LOCK_STALE flag) from each queue (grant/convert/wait).

If we are the new master, we have lost information from the dead node that
previously was the master.  We invalidate the lock value block (LVB) if
there was a possibility that the dead master might have changed it, that
is, if some dead node held an exclusive (EX) or protected write (PW) lock.
We can't know that for sure, so we check for any locks that *we* know about
that are >= concurrent write (CW).  If there are any, they would have kept
any dead nodes from owning a lock that allowed them to write the LVB.  But
if not, we invalidate the LVB, to be safe.

We then transition to RC_BARRIER5.
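A sketch of that LVB decision (hypothetical helper; the lock mode names use
the usual DLM ordering NL < CR < CW < PR < PW < EX): if no surviving
granted lock of at least concurrent-write (CW) mode exists, a dead node
could have held EX or PW and written the LVB, so the value block must be
invalidated.

  #include <stdio.h>

  /* Classic DLM lock modes, weakest to strongest. */
  enum lock_mode { LKM_NL, LKM_CR, LKM_CW, LKM_PR, LKM_PW, LKM_EX };

  struct lock {
      enum lock_mode mode;
      struct lock *next;
  };

  /* Decide whether a new master must invalidate the LVB after the old
   * master died.  Rule from the text: if any surviving lock we know about
   * is >= CW, no dead node could have held EX or PW, so the LVB is still
   * trustworthy; otherwise invalidate it to be safe. */
  static int must_invalidate_lvb(const struct lock *granted)
  {
      for (; granted; granted = granted->next)
          if (granted->mode >= LKM_CW)
              return 0;            /* someone blocked EX/PW writers */
      return 1;                    /* can't rule out a dead writer */
  }

  int main(void)
  {
      struct lock cr = { LKM_CR, NULL };
      struct lock pr = { LKM_PR, &cr };

      printf("only CR held:   invalidate? %d\n", must_invalidate_lvb(&cr));
      printf("PR and CR held: invalidate? %d\n", must_invalidate_lvb(&pr));
      return 0;
  }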
5.6.6 RC_BARRIER5
------------------

This barrier state waits for all other nodes to get to barrier state 5
(meaning what??).  This, like all of the other "RC_" states, is a leaf
state.  It handles the following messages:

ENTRY:  Waits for all nodes to achieve barrier state 5.  Once achieved, we
receive the RC_MSG_GOT_BARRIER message, see below.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached barrier state 5.  We remove temporary directory entries
that were created for local locks during reconfiguration (when??), via:

-- clmr_rm_locallock_dirs(), dlm_recover.c, which deletes a directory entry
   if the RLINFO_LOCAL flag is set.

We then transition to RC_BARRIER6.

5.6.7 RC_BARRIER6
------------------

This barrier state waits for all other nodes to get to barrier state 6
(meaning what??).  This, like all of the other "RC_" states, is a leaf
state.  It handles the following messages:

ENTRY:  Waits for all nodes to achieve barrier state 6.  Once achieved, we
receive the RC_MSG_GOT_BARRIER message, see below.

RC_MSG_GOT_BARRIER:  This message comes from within our own node (see
dlm_sync_notify_recovery()), after our node determines that all active
nodes have reached barrier state 6.  This is the last stage of recovery,
just before going into the RUN state.  We do the following:

-- wake_all_locks(), clm_clmlocks.c, looks through all resources for which
   we are (now perhaps the new) master, and grants any locks that were
   blocked by dead nodes.

-- pti_flush_resend(), clm_batch.c, resends messages that were destined for
   a dead node, but failed to transmit successfully.  These messages were
   placed on our "resend" queue.  If the original destination is alive
   again, we send it there, otherwise we look for the appropriate new
   directory (or master?) node.

-- resend_inflight(), clm_batch.c, resends messages that were destined for
   a dead node, reached that node successfully (just before it died), but
   we never received a response to the request in the message.  These
   messages were placed on our "inflight" queue.  As with the resend queue,
   we try again sending to the original node if it is reachable, or look
   for the appropriate new directory node if not.

-- reconfig_allow_client_purge(), clm_purge.c, does purges that were
   delayed by the reconfiguration.

-- reset the "grace_period" global state variable to 0.

-- clm_change_topology(), clm_info.c, to make sure that we're set up for
   the latest cluster membership.

We then transition to CLM_ST_RUN.

6. Purging
-----------

"Purging" is the cluster-wide removal of now-useless lock information.
There are several possible reasons for needing a purge:

Client died -- A client app on a node died (PURGE_DEAD).  The PURGE_DEAD
   flag tells us to remove our knowledge of the client, in addition to our
   knowledge of the client's locks.

Client request -- A client app requests (how?) a purge of its own locks,
   even though it is still alive (PURGE_ME).

Purge orphan locks -- Orphan locks survive, even though the client or node,
   from which they were requested, dies.  This is a special type of lock
   used during a client recovery process.  Once recovery is complete, the
   orphan locks must be removed.

Node died -- An entire node died.

A purge operation is requested via a "transaction" structure, just like
other inter-node operations, that gets passed around within and between
nodes.  Within the transaction structure, there is a request-specific data
area (a union).
For purge requests, the area contains a struct "purgereq" (see clmstructs.h).
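As an illustration only (the actual purgereq layout lives in clmstructs.h
and is not reproduced here; the flag names PURGE_DEAD and PURGE_ME come
from the list above, but the fields and values below are invented):

  #include <stdio.h>

  #define PURGE_DEAD 0x1           /* client died: forget client and locks */
  #define PURGE_ME   0x2           /* live client asked to drop its locks  */

  /* Hypothetical shape of a purge request -- NOT the real purgereq -- just
   * the information the purge reasons above imply. */
  struct purge_request_sketch {
      int node_id;                 /* node whose client is affected */
      long client_pid;             /* client (process) on that node */
      unsigned int flags;
  };

  static void handle_purge(const struct purge_request_sketch *p)
  {
      printf("purge locks of client %ld on node %d\n",
             p->client_pid, p->node_id);
      if (p->flags & PURGE_DEAD)
          printf("also forget the client itself\n");
  }

  int main(void)
  {
      struct purge_request_sketch req = { 2, 1234, PURGE_DEAD };

      handle_purge(&req);
      return 0;
  }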