OpenDLM_VMS_Lock_API (Feb 23 2004)



                 OpenDLM Locking

Copyright 2004 The OpenDLM Project

Author:
Stanley Wang, stanley.wang@intel.com
Ben Cahill (bc), ben.m.cahill@intel.com


Introduction
------------

This document describes the implementation of the lock request/grant procedure
within OpenDLM, as well as the lock value block.

This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of lock management in OpenDLM.

This document is not intended as a user guide to OpenDLM.  Look in the OpenDLM
WHATIS-odlm document for an overview of OpenDLM.  Look in the OpenDLM
HOWTO-install for details of configuring and setting up OpenDLM.

This document may contain inaccurate statements, based on the authors' limited
understanding.  Please contact the authors if you see anything wrong or
unclear.


Overview
--------

The basic working model of the DLM Locking API can be divided into two phases.
During the first phase, the API validates the arguments provided by the caller,
builds a description of the request, and sends this description to the local
lock manager (on *this* node). The local lock manager then forwards it to the
proper lock manager (which may be on another node). The (forwarded) request is
serviced during the second phase.

The DLM Locking API is implemented in the src/api/ directory, in four source files:

api_scn.c
api_setnotify.c
api_vms.c
ast_stub.c

The OpenDLM Locking API can be used in either kernel or user space, and the
source code for the top-level API is shared (although, of course, the kernel
API builds separately from the user-space API).  The source code splits,
however, at the lower levels (just before committing the lock request
description to the local lock manager). At the level of k_clm_send() and below,
user-level code is implemented in src/api, and kernel-level code is implemented
in src/kernel/dlmdk.

The lock manager lives in kernel space, regardless of whether accessed via
the user- or kernel-space API. Source for the kernel level functionality
resides in the src/kernel/dlmdk directory.

The user-space API accesses the lock manager via a character device (IOCTL).
The kernel-space API calls directly into the lock manager. User- and
kernel-space functionality are almost the same, with one big exception:
only the user-space API supports synchronous (blocking) lock request calls.

There are nine DLM Locking Model-specific routines. They can be divided
into several groups:

	1. Locking operation 
		dlmlock()
			Makes an asynchronous request to acquire or convert a 
			lock on a lock resource.

		dlmlock_sync()
			Acquires or converts a lock on a lock resource and 
			obtains a synchronous return. (User space only)

		dlmlockx()
			Makes an asynchronous request to acquire or convert a
                        lock on a lock resource, and specifies a transaction 
			ID for that lock.

		dlmlockx_sync()
			Acquires or converts a lock on a lock resource, 
			specifies a transaction ID for that lock, and obtains a
			synchronous return. (User space only)

		dlmunlock()
			Releases a lock on a lock resource or cancels a lock 
			request that is in blocked or converting state.			

		dlmunlock_sync()
			Releases a lock on a lock resource or cancels a lock 
			request that is in blocked or converting state, and 
			obtains a synchronous return. (User space only)

	2. Asynchronous notify
		ASTpoll()
			Triggers the execution of pending AST routines. (User 
			space only)

		dlm_setnotify()
			Specifies which signal the lock manager should use to 
			notify your application of a pending AST. (User space 
			only)

	3. LVB operation
		dlm_scnop()
			Manipulates the SCN, a specialized use of the lock 
			value block associated with a lock resource. (Discussed
			in LVB section of this document.)

Locking Operation
-----------------
The following discussion focuses on the kernel-space implementation of the
"VMS" API, the interface used on the VAX computers from which OpenDLM was
derived.  A separate UNIX API was developed for the Unix environment, but it
is not covered here.

1. dlmlock()

This function can be found in api_vms.c, line 451. It can be used for acquiring
a new lock, or converting a pre-existing lock, on a lock resource. As mentioned
before, it can be divided into two phases. 

1.1 First phase

A description of the request is built and committed in the first phase. The 
request is represented by a data structure named "transaction", defined in 
include/clmstructs.h, line 636. This same structure is also used for responses
to lock requests. Important fields in "transaction":

	clm_client_space	- Client in user or kernel space
	clm_prog		- Desired service
	clm_status		- Status of request
	clm_type		- Request type, defined by "dlm_ops_t"
	clm_direction		- Indicates request or response, defined by 
				"ptype_t"
	clm_locktype		- Indicates UNIX or VMS lock
	clm_sequence		- Transaction sequence 
	pti_sequence		- Transaction sequence (What does "PTI" stand 
				for?)
	clm_route		- Information necessary to complete a lock 
				request after it has been granted.
	clm_pid			- Process ID of client (0x7F7F7F7F for kernel)
	clm_data		- Data of the request. For DLM Lock operation,
				it's a data structure named "lockreq"
	
Following is the function calling graph for the first phase of "dlmlock":

dlmlock
    |
    |------> clm_lock_reqresp  	       
    |		 |	      		     		 		 
    |		 |-----> api_reqresp 			 
    |		 |	     |       			 		
    |		 |	     |---------> k_clm_send    	       	       	 
    |		 |	     |       	     |       	 		 
    |		 |	     |               V	    	 		 
    |		 |	     |           clm_send   	 		 
    |		 |	     |               |	    	 		 
    |		 |	     |               V	    	 		 
    |		 |	     |           clm_process	 	      	 
    |		 |	     |               |	    	 	      	 
    |		 |	     |               V 	    	 	      	 
    |		 |	     |           clm_request	 	      	 
    |		 |	     |	             | 	    	 	      	 
    |		 |	     |               V 	    	 	      	
    |		 |	     |           lockswitch.begin_lock 	       	
    |		 |	     |	      	 (begin_lock_clm)       	      	
    |  	       	 |	     |       	     | 	       	       	       	 
    |		 |     	     |         	     |-------------> sendudp   	 
    |		 |	     |		     | 	 	      	|   
    |  	       	 |     	     | 	       	     |<-----------------| 
    |		 |	     |         	     |      	 
    |		 |	     |               V 	       	      	      	 
    |		 |	     |	         pti_call	 	      	 
    |		 |	     |               | 		 	      	 
    |		 |	     |               V 		 	     
    |		 |	     |	         dlm_pti_send	 	      	 
    |		 |	     |	       	     | 	       	       	      	 
    |		 |	     | 	       	     V 	     	 	      	
    |		 |	     | 	      	 cccp_send   	 	      	 
    |		 |	     | 	       	     | 		 	      	 
    |		 |	     |<--------------| 	       	       	      	
    |		 |	     |       	       		 	      	 
    |		 |	     |---------> k_clm_recv	 	      	 
    |		 |	     |         	     | 	     	 	      	
    |		 |	     | 	       	     V 	     	 	      	 
    |		 | 	     | 	       	 clm_recv    	 	      	 
    |  	       	 |     	     | 	       	     | 	       	       	       	
    |		 | 	     |<--------------| 	       	       	      	      
    |		 | 	     | 	       	       	       	       	       	      
    |  	       	 |<----------| 	       	       	       	       	       	
    |		 |     	      		     		 	      
    |<-----------| 			     			      
    |		 
    O

According to the graph above, "clm_lock_reqresp" and "api_reqresp" build a
description of the request, in the form of a transaction structure.
"api_reqresp" is the lowest level call in the shared user/kernel source.
For the kernel implementation, it calls "k_clm_send" to direct the transaction
toward the proper lock manager, as described below.

"k_clm_send" grabs a spinlock before calling "clm_send".
"clm_send" makes a copy (why?) of the transaction before calling "clm_process".
"clm_process" does a few sanity checks, then calls "clm_request".

"clm_request" fills in a few fields of the transaction structure, and selects
two functions based on the type of lock request (DLM_LOCK) and the type of
lock (VMS vs. UNIX API).  One function is the next function to call within
*this* node ("begin_lock_clm"), and the other is the function for starting
the operation of the second phase ("cont_lock_clm").  This will be called
from the port receiver on this or a remote node, depending on where the
transaction will be sent.

The selection of both functions uses the "lockswitch" table to make the
choice.  This table contains calls, much like a Linux "ops" structure, that
differentiate between the VMS API and UNIX API functionality.  The table is
defined in src/include/clm_locksw.h and src/kernel/dlmdk/clm_locksw.c.
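
As a purely illustrative sketch of the idea (the authoritative definition
lives in the files named above; the handler type and entry names below are
paraphrased from the functions mentioned in this document, not copied from
clm_locksw.h):

/* An "ops"-style switch, one entry per locking API, selected by the
 * transaction's clm_locktype (VMS vs. UNIX). */
struct transaction;                         /* opaque here                */
typedef int (*lock_handler_t)(struct transaction *tr);

struct lockswitch_sketch {
        lock_handler_t begin_lock;          /* phase 1, requesting node   */
        lock_handler_t master_lock;         /* phase 2, master node       */
        lock_handler_t cont_lock;           /* phase 2, back on requester */
        lock_handler_t begin_unlock;
        lock_handler_t master_unlock;
        lock_handler_t cont_unlock;
};

/* A VMS-API entry would then point at the *_clm functions named in this
 * document:
 *   { begin_lock_clm, master_lock_clm, cont_lock_clm,
 *     begin_unlock_clm, master_unlock_clm, cont_unlock_clm }
 */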
 
"begin_lock_clm" performs a number of sanity checks, and fills in more fields
of the transaction structure before sending it to the "master" layer.  It
differentiates between lock creation (new lock) and lock conversion (changing
the state of an existing lock).  Accordingly, some of the processing is
handled by either the "begin_clm_lock_new" or the "begin_clm_lock_convert"
function.
 
At this point, the transaction is ready to be sent to the lock master.
Before doing this, "begin_lock_clm" calls "sendudp" to send a reply
message (also a transaction) to the client that initiated this lock request.
This reply gives status for phase 1 of the request processing.  For clients
in the kernel, remember, synchronous (blocking) requests are not supported.
This reply will provide the basis for the return from the API call, providing
information on the validity of the request.  The actual request results will
be provided to the client later, after phase 2.

"sendudp" doesn't really send a UDP message between nodes; the client is
on *this* node.  It simply attaches the response transaction to a member of
the client's "client" structure (defined in src/include/clmstructs.h),
cp->cli_reply.  This is a queue of reply transactions.  Later, "k_clm_recv"
will look through this queue for the reply that matches the "clm_sequence"
field of the original request transaction.
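
A minimal, self-contained sketch of that matching step (structure and field
names are simplified stand-ins for the real "client"/"transaction" structures
in src/include/clmstructs.h):

struct reply_sketch {
        int                  clm_sequence;  /* copied from the request transaction */
        struct reply_sketch *next;
};

/* Walk the client's reply queue looking for the reply whose sequence number
 * matches the original request; if it is not there yet, the caller waits. */
struct reply_sketch *match_reply(struct reply_sketch *queue, int wanted_sequence)
{
        struct reply_sketch *r;

        for (r = queue; r != NULL; r = r->next)
                if (r->clm_sequence == wanted_sequence)
                        return r;
        return NULL;    /* not arrived yet; k_clm_recv() would block/retry */
}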

Now, the request transaction needs to be sent to the lock resource master.
So, which node is the lock resource master for this lock request? 

If the lock request is a "convert" request, the node ID of the lock resource
master can be found in the pre-existing lock record (struct reclock, defined in
src/include/clmstructs.h) of the lock in question, and the request transaction
is sent to the lock resource master node's "PORT_MSTR" port via the "CCCP"
transport.

On the other hand, if the lock request is a "lock" (new lock) request, there is
no pre-existing reclock. We need to obtain the node ID of the lock resource
master, either via the local (on this node) directory table or via the
"directory node", searching by the name (and type:  UNIX or VMS) of the
lockable resource, as follows:

-- Search the local directory table via "rl_lookup()" (called from
   "pti_call()").  If either an active directory entry (i.e. *this* node is
   the directory node) or a cached directory entry (i.e. *this* node has a
   copy of the directory node's info) is found, we obtain the master node ID
   locally, and send the request directly to the lock resource master node's
   "PORT_MSTR" port via "CCCP", just as with a "convert" request.

-- If no local directory entry exists, calculate the directory node via
   "clm_direct()" (called from "dlm_pti_send()"), which applies a simple hash
   function to the resource name to determine the directory node (see the
   sketch below).  Send the transaction to the directory node's "PORT_DIRECT"
   port via "CCCP".  If the directory node and the lock resource master are
   the same node, it will handle the transaction itself.  If not, the
   directory node will forward the transaction to the appropriate lock
   resource master node.
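
A hypothetical illustration of the hashing step in the second case above (the
real hash lives in "clm_direct()" and is not reproduced here):

/* Pick a directory node by hashing the resource name.  The actual hash used
 * by clm_direct() may differ; this only shows the shape of the mapping. */
static int directory_node(const char *name, int namelen, int num_nodes)
{
        unsigned int h = 0;
        int i;

        for (i = 0; i < namelen; i++)
                h = h * 31 + (unsigned char)name[i];     /* simple string hash     */
        return (int)(h % (unsigned int)num_nodes);       /* node ID 0..num_nodes-1 */
}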

After sending the lock request, execution eventually returns to "api_reqresp".
"k_clm_recv" is called at this point to retrieve the status of the operations
in phase 1. The return status is also represented as a transaction data
structure. "clm_recv" gets the status from a linked list; as mentioned before,
the reply was linked onto that list by "sendudp".

Caveat: There is a textbook race condition between "clm_recv" and "sendudp". An
ugly work-around is used to resolve this problem; see the comments in
"src/kernel/dlmdk/dlm_kernel.c", line 368, for more detail. For the current
execution chain, however, the reply gets put on the client structure's queue
before "clm_recv" gets called, so this race condition should not be an issue
in this case.

Finally, the status of the first phase is returned to dlmlock. The first phase
is finished!


1.2 Second phase

The second phase starts when the directory node or the lock resource master
receives the transaction. At this point, a callback function is called by the
"CCCP" transport layer. This is *not* the callback function, specified in the
request transaction, that continues the work of the request.  Instead, it is a
port handler which parses the request and sends it on its way to the
request callback.  Following is a list of all port handling callbacks, some
of which are not related to locking, per se:

	PORT		CALLBACK

	PORT_MSTR	pti_read_request		clm_pti.c
	PORT_DIRECT	dir_read_request		clm_direct.c
	PORT_RESPONSE	pti_read_response		clm_pti.c
	PORT_DIRUPDT	dir_read_update			clm_direct.c
	PORT_MSTR_CR	pti_read_request		clm_pti.c
	PORT_BOUNCE	dir_bounce_request		clm_direct.c
	PORT_GROUP	grp_proc_request		clm_group.c
	PORT_NACK	clmll_read_nack_response	clm_locallock.c
	PORT_SYNC	clm_inband_sync			clm_msg.c
	PORT_SYNCRET	clm_inband_sync_resp		clm_msg.c

This list is defined in "cb_tab", in "kernel/dlmdk/clm_main.c", line 143.
In addition, the following are set up by calls to cccp_listen() in other
places in the code base:

	PORT		CALLBACK

	PORT_DEADLOCK	clmdd_incoming			clm_deadlock.c	
	PORT_MIGR	clmm_incoming			clm_migrate.c	
	PORT_RECOVERY	clmr_message_handler		clm_migrate.c	


As mentioned before, the transaction may be sent to either the directory node
or the lock resource master. If it is the directory node that receives the
transaction, the "dir_read_request" port handler is called by "CCCP", which
then calls "dir_proc_request" to do the hard work.

If the corresponding lock record is found within the directory node, the
destination of the transaction is modified to be the lock resource master
node's "PORT_MSTR" port. If no lock record is found, the directory node
creates a new directory entry for the resource, and modifies the transaction's
destination to be the "PORT_MSTR_CR" ("CR" means "create" a new lock mastership)
of the originator of this transaction (the first node to request a lock on a
given resource becomes the master for that resource). The "CCCP" transport
layer forwards the transaction to the new destination when returning from
"dir_read_request".

All of the following operations are on the master node, and are the same,
regardless of whether the original request transaction was sent via the
directory node, or directly to the (known by the originator) master node.

If it is the lock resource master that receives the transaction, "CCCP" calls 
the PORT_MSTR handler, "pti_read_request". Following is the function calling 
graph for servicing the lock request (transaction):
						   			  
                       	       pti_read_request	   
			   	      |		
			       	      V		   
			       lockswitch.master_lock	
			       (master_lock_clm)     
				      |		  
       	       	       	lock   	      V	      convert	
		     -----------------------------------
		     |	   	       		       |
		     V	   	       		       V
		 clm_lock      	       	       	   clm_convert 
		     |		       		       |
      	       grant V block   	       		       V
	       -------------   	       	       	   convert_lock_clm
       	       |	   |	       		       |
       	       V	   V	       	       	 grant V block
       	 grant_lock    blocked_lock    	       	 -------------	
       	    			       		 |     	     |	
	    			       		 V     	     V	
	    			            grant_convert  blocked_convert

According to the previous graph, the receiving port is checked in
"pti_read_request". If the transaction was received on the "PORT_MSTR_CR"
port, the "DLM_RES_CREATE" flag is set to distinguish it from a lock request
for an existing lock resource. After processing the lock request, the result
(also described by a transaction data structure) is sent back to the
"PORT_RESPONSE" port of the sender of the lock request.

"pti_read_response" is the port handler called when the respond transaction is
received by the sender of the lock request. Following is the call graph:

pti_read_response 
    |
    |----> pti_proc_response
    |          	|
    |          	|-----------------> match_request
    |         	|      	       	          |
    |      	|<------------------------|
    |      	|
    |      	|-----------------> lockswitch.cont_lock
    |         	|      	              (cont_lock_clm)
    |         	|      	       	       	  | 
    |      	|      	       	lock   	  |     convert
    |         	|      	     -------------|-------------
    |         	|	     | 	       	  |    	       |
    |         	|	     V	   	  |    	       V
    |  		|    cont_lock_clm_new 	  |     cont_lock_clm_convert
    |         	|      	     | 	       	  |    	       |
    |         	|      grant V block   	  |    	 grant V block
    |  		|      -------------   	  |    	 -------------
    |  		|      |	   |	  |    	 |     	     |
    |  		|      V	   V	  |    	 V     	     V
    |  		|grant_lock blocked_lock  | grant_convert blocked_convert
    |  		|      |       	   |   	  |    	 |     	     |
    |  		|      |       	   |----->|<-----|     	     |
    |  		|      |----------------->|<-----------------|
    |  	       	|      	       	       	  |
    |  		|		  	  |---> clm_reply
    |  		|		  	  |         |
    |  		|		  	  |	    |----> dlm_reply_guts
    |  	       	|	  	      	  |	    |  	       |
    |  		|	  	      	  |	    |          |-->sendast
    |  		|	  	      	  |    	    |  	       |      |
    |  		|	  	      	  |	    |	       |<-----|
    |  		|	  	      	  |	    |	       |
    |  	       	|      	       	       	  |    	    |<---------|
    |		|			  |	    |
    |		|			  |<--------|
    |		|			  |
    |		|<------------------------|
    |		|
    |<----------|
    |
    O
 
After the response is received, the first task is to find the transaction
corresponding to the original lock request. This is done in "match_request",
using the "clm_sender" field of the transaction data structure. The callback
function ("cont_lock_clm") that continues the lock request is then retrieved.
The local lock record is updated by "cont_lock_clm_new" or
"cont_lock_clm_convert", based on the response transaction. After that,
"sendast" is called to deliver the AST to the client. For a user-space client,
the AST is added to a per-client linked list, and if a notify signal number
has been specified, a signal is also sent to the client. For a kernel-space
client, the AST is delivered directly using "kernel_thread". Once the AST is
delivered, the lock request is complete.

2. dlmlock_sync()

This function can be found in api_vms.c, line 540. It can be used for acquiring
or converting a lock on a lock resource in a synchronous fashion. It is
implemented on top of "dlmlock": it simply calls "dlmlock" with the
"LKM_BLOCK" flag.

How does "LKM_BLOCK" work? All tricks are in "sendudp". As mentioned before, 
"sendudp" is called in the first phase of the "dlmlock" and a reply message is
sent back to the client. If "LKM_BLOCK" is specified, "sendudp" does not send
the reply message and just skip it. Hence the client is blocked when it try to
retrieve the reply message. At the second phase, if "sendast" detects 
"LKM_BLOCK" flag, it clears "LKM_BLOCK" and calls "sendudp" to send reply 
message to the client. And lock request continues because it gets the reply
message.
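
A minimal sketch of that flag handling, with invented structure and helper
names (the real logic lives in "sendudp" and "sendast"):

#define LKM_BLOCK 0x1000                 /* illustrative value only          */

struct req_sketch { int flags; };

static void queue_reply(struct req_sketch *r) { (void)r; /* wake the caller */ }
static void deliver_ast(struct req_sketch *r) { (void)r; /* run client AST  */ }

/* Phase 1: with LKM_BLOCK set, no early reply is queued, so the caller
 * stays blocked waiting for it (in k_clm_recv). */
static void sendudp_sketch(struct req_sketch *r)
{
        if (r->flags & LKM_BLOCK)
                return;
        queue_reply(r);
}

/* Phase 2: once the AST has been delivered, clear the flag and queue the
 * reply, which finally unblocks the synchronous caller. */
static void sendast_sketch(struct req_sketch *r)
{
        deliver_ast(r);
        if (r->flags & LKM_BLOCK) {
                r->flags &= ~LKM_BLOCK;
                queue_reply(r);
        }
}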

3. dlmlockx()

This function can be found in api_vms.c, line 317. It can be used for acquiring
or converting a lock on a lock resource, and specifies a transaction ID for 
that lock.

It is almost the same as "dlmlock"; the only difference is that it provides
a transaction ID to the lock request when calling "clm_lock_reqresp". The
transaction ID is used as the client ID, instead of the PID used by "dlmlock".

4. dlmlockx_sync()

This function can be found in api_vms.c, line 405. It can be used for acquiring
or converting a lock on a lock resource in a synchronous fashion, and specifies
a transaction ID for that lock. It is implemented on top of "dlmlockx": it
simply calls "dlmlockx" with the "LKM_BLOCK" flag.

5. dlmunlock()

This function can be found in api_vms.c, line 651. It can be used for releasing
a lock on a lock resource or canceling a lock request that is in the blocked or
converting state. As with "dlmlock", its operation can be divided into two
phases.

5.1 First phase

Just like "dlmlock", a description of the unlock request is built and committed
in the first phase. The request is also represented by a transaction. Following
is the function calling graph for the first phase of "dlmunlock":

dlmunlock
    |	
    |-----> api_reqresp 
    |	     |       
    |	     |---------> k_clm_send 
    |	     |       	     |     
    |	     |               V	    	 
    |	     |           clm_send   	 		      
    |	     |               |	    	 		      
    |	     |               V	    	 		      
    |	     |           clm_process	 	      	      
    |	     |               |	      	 	      	      
    |	     |               V 	      	 	      	      	
    |	     |           clm_request  	 	      	      	
    |	     |	             | 	      	 	      	      	
    |	     |               V 	      	 	      	      	
    |	     |           lockswitch.begin_unlock 	       	     	
    |	     |	      	 (begin_unlock_clm)       	      	
    |	     |       	     | 	       	       	       	      	
    |        |         	     |------> pre_proc_unlock_clm    	     
    |	     |		     | 	      	      	|   	      	
    |        | 	       	     |<-----------------| 	      
    |	     |         	     |        	 		      	
    |  	     | 	 -------------------------     	  	      
    |  	     | 	 |     |     | 	   |   	 |  	  	      
    |	     |   V     V     V 	   V     V    	       	      	     	
    |	     |	(1)   (2)   (3)	  (4)   (5)            	      	     	
    |	     |   |     |     |     |     |    	      	      	     	
    |	     |   V     V     V     V     V    	      	      	     	
    |	     |	 -------------------------     	  	      	   
    |	     |	       	     | 	              	          
    |	     |<--------------| 	       	       	      	      	
    |	     |       	       	      	 	      	      	
    |	     |---------> k_clm_recv   	 	      	      	
    |	     |         	     | 	      	 	      	      	
    |	     | 	       	     V 	      	 	      	      	
    | 	     | 	       	 clm_recv     	 	      	      	
    |        | 	       	     | 	       	       	       	      
    | 	     |<--------------| 	       	       	      	      	
    |        | 	       	       	       	       	       	      
    |<-------|						      
    |		  				   
    O

The implementation of "dlmunlock" is very similar to "dlmlock", hence only the
difference point will be discussed in following part. 

The transaction that represents the unlock request is built in "dlmunlock", 
and then "api_reqresp" is called directly. Before entering "begin_unlock_clm",
the callback function ("cont_unlock_clm") for starting the operations of the 
second phase is specified in "clm_request".

In "begin_unlock_clm", "pre_proc_unlock_clm" is called to see if the unlock 
request can be processed completed on this node or at least before it gets
passed to the master. This routine handles the initial processing of an unlock 
request before it is forwarded to the distributed layer. If the lock resouce is
mastered locally, this optimization yields a significant performance 
improvement. Depending on the outcomes of "pre_proc_unlock_clm", there are five
possible results and corresponding operations:
	
	(1). An error is detected locally. The unlock request is aborted. The
	reply message and the unlock AST are sent to the client. No interaction
	with the transport layer is needed. This work is done by calling
	"clm_reply".

	(2). The lock resource is mastered locally, and processing is complete.
	The reply message and the unlock AST are sent to the client. No
	interaction with the transport layer is needed. This work is done by
	calling "send_unlck_ast" and "clm_reply" sequentially.

	(3). The lock resource is mastered remotely, and the local unlock is
	complete. The unlock request is forwarded to the lock resource master.
	The reply message and the unlock AST are sent to the client. This work
	is done by calling "send_unlck_ast", "pti_call" and "clm_reply"
	sequentially.

	(4). The lock resource is mastered remotely. The lock handle is found,
	but local unlock is not possible because of the local lock state. The
	unlock request is forwarded to the lock resource master. The reply
	message is sent to the client, and the unlock AST is delivered when the
	response comes back. This work is done by calling "sendudp" and
	"pti_call" sequentially.

	(5). The lock resource is mastered remotely. The lock handle is not
	found, and local unlock is not possible. (This condition occurs when
	trying to cancel an in-flight lock request.) The reply message is sent
	to the client. The unlock request is added to a pending queue in the
	lock record, and will be forwarded to the lock resource master when the
	response transaction for the original lock request is received. This
	work is done by calling "sendudp" and "hold_cancel" sequentially. (The
	held unlock request is resent by calling "submit_cancels" from
	"cont_lock_clm" in the second phase of "dlmlock".)

After this point, all operations are the same as in "dlmlock".

5.2 Second phase

Just like "dlmlock", the transaction also may be sent to the directory node or
the lock resource master. If it is the directory node that receive the 
transaction, "dir_read_request" is called by "CCCP". And then 
"dir_proc_request" is called to do all hard works. 

If the corresponding lock record is found, the destination of the transaction
is modified to the lock resource master's "PORT_MSTR" port. The "CCCP"
transport layer forwards the modified transaction to the new destination when
returning from "dir_read_request". Processing then continues exactly as if the
lock resource master had received the transaction directly.

If no lock record is found, only a "DLM_VOID" response transaction is built and
sent to the "PORT_RESPONSE" port of the sender of this transaction. This
condition occurs when trying to unlock a lock that was just purged.

If it is the lock resource master that receives the transaction,
"pti_read_request" is called by "CCCP". Following is the function calling
graph for servicing the unlock request (transaction):

	pti_read_request	         
        	 |     	       	                   
		 V     	   	       	
	lockswitch.master_unlock      		
	(master_unlock_clm)           
		 |     	        	      			  
		 V			
     	convert_locks	
     		 |   	
     		 V	
     	send_basts_again	
     		       	
According to the previous graph, all unlock/cancel work is done in
"master_unlock_clm". After that, "convert_locks" is called to see if there are
any locks in the "converting" or "waiting" state, and to try to grant them.
This routine also sends blocking ASTs to the corresponding clients by calling
"send_basts_again". Finally, the result (also described by a transaction data
structure) is sent back to the "PORT_RESPONSE" port of the sender of this
unlock request.

"pti_read_response" is called when the respond transaction is received by the
sender of the unlock request. Following is the call graph:

pti_read_response  	   	
    |			   		       	 
    |----> pti_proc_response	       	      		
    |          	|  		       	      	  		  
    |          	|-----------------> match_request 	  	
    |         	|      	       	          |	      
    |      	|<------------------------|	      
    |      	|      	       	       	     
    |      	|-----------------> lockswitch.cont_unlock   	  
    |         	|      	              (cont_unlock_clm)       
    |          	|      	       	       	  |  	   
    |  	       	| 			  V
    |		| 		    -------------
    |		| 		    |  	  |	|
    |		| 		    V	  V	V
    |		| 		   (1) 	 (2)   (3)
    |		| 		    |  	  |    	|
    |		| 		    V	  V	V
    |		| 		    -------------
    |		| 			  |
    |		|<------------------------|
    |	       	|      	       	       	  
    |<----------| 			  
    |  	       	  
    O		  
 
According to the previous graph, there are three possible operations in
"cont_unlock_clm", depending on the response transaction:

	(1). The transaction is a response to a timeout request, so processing
	still has to happen on the local node. Call the master unlock to send
	the timeout AST, but no client response for the cancel is required,
	because a client didn't send it - the timeout subsystem did. Finally,
	free the request. This work is done by calling "local_unlock_clm" and
	"clm_free_trans" sequentially.

	(2). Corresponding to condition (4) in the first phase, processing
	still has to happen on the local node. Call local unlock to handle the
	response, then send the reply to the client. This work is done by
	calling "local_unlock_clm" and "clm_reply" sequentially.

	(3). Corresponding to conditions (1), (2) and (3) in the first phase:
	after reading a master's response to an unlock request that has already
	been handled locally, simply toss it. This work is done by calling
	"local_unlock_clm" and "clm_free_trans" sequentially.

	NOTE: Condition (5) in the first phase will eventually be converted
	into one of the other four conditions.

After these operations, the whole unlock request completes.

6. dlmunlock_sync()
    
This function can be found in api_vms.c, line 573. It can be used for releasing
a lock on a lock resource, or canceling a lock request that is in the blocked
or converting state, in a synchronous fashion. It is almost the same as
"dlmunlock".

The most important difference between these two functions is that "RESP_AST" is
not specified in "dlmunlock_sync". Without this flag, the reply message is not
sent in conditions (4) and (5) of the first phase, and is instead sent in
condition (2) of the second phase. Hence "dlmunlock_sync" blocks until all of
the unlock work is done.

Asynchronous notify
-------------------

1. ASTpoll()
This function can be found in api/ast_stub.c, line 59. It can be used for
triggering the execution of pending AST routines.

As mentioned before, ASTs are placed on a per-client linked list. "ASTpoll"
accesses the lock manager via a character device (IOCTL), retrieves the
pending ASTs, and executes them.

2. dlm_setnotify()

This function can be found in api/api_setnotify.c, line 77. It can be used for 
specifying which signal the lock manager should use to notify your application 
of a pending AST.

This routine updates a global variable named "notify_signo". The value of
"notify_signo" is copied into the transaction data structure when "dlmlock" is
called, and the specified signal is sent to the client when the lock manager
adds an AST to the client's pending list.
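
The two calls are typically combined as in the following user-space sketch.
The exact prototypes of "dlm_setnotify" and "ASTpoll" are not reproduced from
the headers here, so treat the argument lists below as assumptions and check
the OpenDLM API header before copying this:

#include <signal.h>
#include <unistd.h>
/* also include the OpenDLM API header that declares dlm_setnotify()/ASTpoll() */

/* Assumption: ASTpoll(pid, threadid) runs this process's pending ASTs, and
 * dlm_setnotify(signo, &oldsig) registers the notification signal. */
static void ast_notify_handler(int signo)
{
        (void)signo;
        ASTpoll(getpid(), 0);                   /* run whatever ASTs are pending */
}

static int setup_ast_notification(void)
{
        int oldsig = 0;

        signal(SIGUSR1, ast_notify_handler);    /* notifications arrive here */
        return dlm_setnotify(SIGUSR1, &oldsig); /* tell the lock manager     */
}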

Lock management and LVB operation
---------------------------------

1. Lock record management

In order to manage all the locks, OpenDLM defines several data structures to
represent "resource", "lock" and "client".

1.1 "resource"

A "resource" can be any entity accessible to the operating system, or to one
or more process. "resource" is represented by a data structure named 
"resource", defined in include/clmstructs.h, line 1279. "resource" is created
when it is requested first time and is deleted when it is not referenced by 
any client. Following are the most important fields in this structure:

	refcount 	- Reference count
	rh		- Resource handle
	type		- Resource type
	name		- Dynamic resource name
	namelen		- Length of resource name
	master		- Site id of master node
	lockqs		- Grant, convert, wait lists for lock requests
	value		- Lock value block
	rsrc_flags	- State bits
	
The resource handle (rh) is the key field for managing all "resource"
structures. OpenDLM stores all "resource" structures in a resource table
("restab"), and the handle contains the index into that table used to locate
the resource. Note that the handle is identical for a given "resource", even
if it has multiple copies on different nodes. Hence, in the event of node
failure, the resource table is easily reconstructed from the elements known by
the other nodes. The resource handle is kept with a lock object (reclock,
discussed in the next section) and used for fast access to the resource entry
in "restab". It is sometimes necessary to search the resource table by name
(only on the master), e.g. when creating a new resource. To this end, a second
table is maintained, "reskeytab". This is a hash table of pointers into
"restab", providing the quick search capability used for ADD, FIND, and DELETE
operations. (Both lookup paths are sketched in code after the figure below.)

      		      "restab" 	       	       	 	    
      -------------------------------------------      	     
      |   seg 0	  |   seg 1   | ... |   seg n  	|      	       	
      -------------------------------------------   	
       	    |  	     	|     	     	  |    	   	   
       	    V  	       	V      	       	  V    	   	    
       ----------- ----------- 	     -----------   	    
   --->|resource | |resource | 	     |resource |<-----	    
   |   |---------| |---------| 	     |---------|     |	    
   |   |resource | |resource |<-     |resource |<--  |	    
   |   |---------| |---------| |     |---------|  |  |	    
   |   |    .  	 | |   	.    | |     | 	  .    |  |  |	    
   |   |    .	 | |	.    | |     |	  .    |  |  |	    
   |   |    .	 | |	.    | |     |	  .    |  |  |	    
   |   |    .	 | |	.    | |     |	  .    |  |  |	    
   |   |    .	 | |	.    | |     |	  .    |  |  |	    
   |   |    .	 | |	.    | |     |	  .    |  |  |	    
   |   |---------| |---------| |     |---------|  |  |	    
   | ->|resource | |resource | |  -->|resource |  |  |	    
   | | ----------- ----------- |  |  -----------  |  |	    
   | |      	    	       |  |  	       	  |  |	    
   | | 	       	       	       |  |    	       	  |  |	    
   | | 	       	       	       |  |    	       	  |  |	    
   | |			       |  |	     	  |  |	    
   | | 	       	       	       |  |    	     	  |  |	    
   | |	    .	      	.      |  |	  .  	  |  |	    
   | | 	    .  	       	.      |  |    	  .    	  |  |	    
   | | 	    .	      	.      |  |	  .  	  |  |	    
   | | 	    .	      	.      |  |	  .  	  |  |	    
   | | 	    .	      	.      |  |	  .  	  |  |	    
   | | 	    .	       	.      |  |	  .  	  |  |	    
   | |      ^           ^      |  |       ^       |  |	 
   | |      |           |      |  |       |       |  |	
   | | ----------- ----------- |  |  -----------  |  |	
   | --| reshash | | reshash |--  |  | reshash |---  |	 
   |   ----------- ----------- 	  |  -----------     |	
   |        ^           ^         |       ^          |	
   |        |  	        |         |       |    	     |	
   |   ----------- ----------- 	  |  -----------     |	 
   |---| reshash | | reshash |-----  | reshash |------	
       ----------- ----------- 	     -----------  
       	    ^  	       	^      	       	  ^  	  	   
       	    |  	       	|      	       	  |    	  	    
      ------------------------------------------- 	    
      |   key 0	  |   key 1   | ... |   key n  	| 	    
      ------------------------------------------- 	    
      	       	     "reskeytab"		  	    
 
On the directory node of a "resource", the "resource" is described in a
different way: it is represented by the "rl_info_t" data structure. To save
space, "rl_info_t" is just a subset of "resource"; it is only used to find out
which node is the master of the "resource". All "rl_info_t" structures are
stored in a hash table ("rldbtab").

      		      "rldbtab" 
      -------------------------------------------  
      |   slot 0  |   slot 1  | ... |   slot n 	| 
      -------------------------------------------  
       	    |  	     	|     	     	  |    	   
       	    V  	       	V      	       	  V    	   
       ----------- ----------- 	     -----------   
       |rl_info_t| |rl_info_t| 	     |rl_info_t|   
       ----------- ----------- 	     -----------   
            |           |                 |        
            V           V                 V        
       ----------- -----------       -----------   
       |rl_info_t| |rl_info_t|       |rl_info_t|   
       ----------- -----------       -----------   
            |  	       	|		  |
            V		V		  V
            .		.		  .
            .		.		  .
            .		.		  .
            .		.		  .
	    .		.         	  .
	    .		.       	  .
	

1.2 "lock"

When a client requests a lock for the first time, the name of the resource is
used to specify the lock being requested. OpenDLM associates a lock ID with the
lock request, and returns the lock ID to the client. This lock ID uniquely
identifies the lock record ("reclock") used by the lock manager to describe
the lock request. Following are all the fields in "reclock":

	link		- List structure used to link reclock into "resource".
	tq_node		- Pointer to timeout queue structure.
	route		- Callback information.
	lck		- All information about the lock request.
	dl		- Used for deadlock detection
	seg		- Lock table segment index
	set		- Timer set/reset

"lck" field is a data structure named "lockinfo", defined in 
include/clmstructs.h, line 467. Following are important fields in "lockinfo":

	mode		- Granted mode
	reqmode		- Requested mode
	bastmode	- Mode passed back in blocking AST
	site		- Site ID where request was made
	lk_flags	- Flags from application
	pid		- Process ID that requested lock
	remlockid	- Remote lockid
	lockid		- Lock ID
	state		- Internal state flags
	rh		- Resource handle
	lkvalue		- Lock value block
	request		- Pointer to create request
	seqnum		- Sequence number on queue
	type		- Lock type
	xid		- Transaction ID
	
When a new lock request is submitted, and before the response message is
received, it is linked into a per-client pending queue (the "cli_queue" field
of "clm_client_t"). After the lock request is serviced, it is linked into the
"resource" structure (grant/convert/wait queue; note that the convert queue is
not possible for a new lock request). Subsequent references to this lock are
made by lock ID, not by resource name. To improve performance, a hash table
("rlh_table") is used to locate the lock record from the lock ID.

      		      "resource" 	       	       	 	    
      -------------------------------------------      
      | grant queue |convert queue| wait queue  |       	 
      -------------------------------------------    	  
       	    |  	       	   |  	     	  |    	      	    
       	    V  	       	   V   	       	  V    	      	    
       -----------    -----------    -----------      	     	    
   --->| reclock |    | reclock |<-- | reclock |<-----	     	    
   |   -----------    -----------  | -----------     |	     	    
   |        |              |       |      |          |	    
   |        V              V       |      V          |	    
   |   -----------    -----------  | -----------     |	    
   | ->| reclock |    |	reclock |<-| | reclock |<--  |	    
   | | -----------    ----------- || -----------  |  |	    
   | |      |  	           |      ||   	  |    	  |  |
   | |      V	           V      ||	  V   	  |  |
   | |      .	           .      ||	  .   	  |  |
   | |      .              .      || 	  .   	  |  |
   | |      .              .      ||   	  .    	  |  |
   | |      .              .      || 	  .   	  |  |
   | |      .  	       	   .      ||   	  .    	  |  |	    
   | | 	    .  	       	   .      ||   	  .    	  |  |	    
   | | 	       	       	          ||   	       	  |  |	    
   | |	    	    	          || 	      	  |  |	    
   | | 	       	       	          ||   	      	  |  |	    
   | |	    .	      	.         || 	  .   	  |  |	    
   | | 	    .  	       	.         ||   	  .    	  |  |	    
   | | 	    .	      	.         || 	  .   	  |  |	    
   | | 	    .	      	.         || 	  .   	  |  |	    
   | | 	    .	      	.         || 	  .   	  |  |	    
   | | 	    .	       	.         || 	  .   	  |  |	    
   | |      ^           ^         ||      ^       |  |	   
   | |      |           |         ||      |       |  |	   
   | | ----------- -----------    || -----------  |  |	  
   | --| rl_hash | | rl_hash |-----| | rl_hash |---  |	  
   |   ----------- ----------- 	   | -----------     |	   
   |        ^           ^          |      ^          |	   
   |        |  	        |          |      |    	     |	   
   |   ----------- ----------- 	   | -----------     |	   
   |---| rl_hash | | rl_hash |------ | rl_hash |------	   
       ----------- ----------- 	     -----------     	
       	    ^  	       	^      	       	  ^   	     	    
       	    |  	       	|      	       	  |    	     	    
      -------------------------------------------    	    
      |   key 0	  |   key 1   | ... |   key n  	|    	    
      -------------------------------------------    	    
      	       	     "rlh_table"     		     	    
 
Note that there are actually two lock IDs for a lock request:

	lockid		- Local lock ID, used to locate the lock record.
	remlockid	- Remote lock ID, which must be converted to a local
			lock ID to locate the lock record.

The mapping between local lock IDs and remote lock IDs is stored in "hashent"
structures, which are linked into a hash table ("idhash"). (A small sketch of
this mapping follows the figure below.)

      		      "idhash" 	       	       	 
      -------------------------------------------  
      |   key 0   |   key 1   | ... |   key n  	|   	
      -------------------------------------------   	
       	    |  	     	|     	     	  |    	    	
       	    V  	       	V      	       	  V    	    
       ----------- ----------- 	     ----------- 
       | hashent | | hashent | 	     | hashent |    
       ----------- ----------- 	     -----------  
            |           |                 |      
            V           V                 V    
       ----------- -----------       ----------- 
       | hashent | | hashent |       | hashent | 
       ----------- -----------       -----------
            |  	       	|		  |
            V		V		  V
            .		.		  .
            .		.		  .
            .		.		  .
            .		.		  .
	    .		.         	  .
	    .		.       	  .
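
A small self-contained sketch of that remote-to-local mapping (structure and
field names are simplified, not the real "hashent"):

struct hashent_sketch {
        int                    remote_id;    /* lock ID assigned by the peer   */
        int                    local_id;     /* our lock ID, indexes rlh_table */
        struct hashent_sketch *next;
};

/* Look up the local lock ID for a remote one via an "idhash"-style table. */
int remote_to_local(struct hashent_sketch **idhash, int nbuckets, int remote_id)
{
        struct hashent_sketch *e;

        for (e = idhash[(unsigned int)remote_id % nbuckets]; e != NULL; e = e->next)
                if (e->remote_id == remote_id)
                        return e->local_id;
        return -1;                            /* unknown remote lock ID */
}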

1.3 "client"

In OpenDLM, "client" is the system entity that requests "lock" on one or more
"resource", and it can be a process, a transaction or a group of processes. A
database of lock manager clients is maintained on every nodes. Each node 
tracks the clients locally and maintains a record of the sequence number 
contained in the most recent request received. In this way the lock manager 
can detect when a UDP packet has been dropped. On the "resource" master, all
clients that reference this "resource" are recorded in this database. Again,
only local clients are checked for dropped packets. However, every client in
the database has associated with its entry a list of all reclock structures
held by that client. This structure is traversed when performing global
deadlock detection.

"client" is represented by a data structure named "clm_client_t", defined in
"include/clmstructs.h", line 790. Following is some important fields in 
"clm_client_t":

	cli_id		- Client ID
	cli_site	- Site of client
	cli_groupid	- Group id of client if any
	cli_type	- Client type: PROCESS, GROUP or TRANSACTION
	cli_sequence	- Client sequence
	sync_seq	- Sequence number of lock request
	cli_queue	- Owned locks list
	cli_ast_pnd	- List of pending ASTs
	cli_reply	- Used to queue the UDP reply
	cli_reply_event	- Wait queue for waiting on the UDP reply
	cli_state	- State code & flags

The "client" database is actually a sorted array ("cli_refs"). And it is 
protected by a spin lock ("cli_refs_lock").

      		      "cli_refs" 	       	       	 
      -------------------------------------------  
      |  client 0 |  client 1 | ... |  client n	|   	
      -------------------------------------------   	
       	    |  	     	   |  	     	  |    	    	
       	    V  	       	   V   	       	  V    	    
      -------------- --------------  --------------
      |clm_client_t| |clm_client_t|  |clm_client_t| 
      -------------- --------------  --------------

2. LVB operation

As described in the lock management section above, the lock value block (LVB)
is stored in both "resource" and "reclock". It is manipulated based on lock
modes:

	(1). If the lock is moving from a mode >= LKM_PWMODE to a lower mode,
	update the "resource" value block from the lock's value block;
	otherwise update the lock's value block from the "resource".

	(2). If the "LKM_OLD_LVB" flag is cleared, a conversion to the same
	mode can also set the LVB.

All of the above is implemented in "valueblock", defined in
"kernel/dlmdk/clm_clmlock.c", line 2748. It is called when a lock/convert
request is granted, if "LKM_VALBLK" is specified. (A conceptual sketch
follows.)
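
The following is a conceptual sketch of rule (1), assuming "lower" means a
mode lower than the currently granted one; it is not the real "valueblock"
routine, and the LKM_PWMODE value is illustrative:

#include <string.h>

#define MAXLOCKVAL  16    /* LVB size noted elsewhere in this document       */
#define LKM_PWMODE  4     /* illustrative value; take the real one from the  */
                          /* OpenDLM headers                                 */

/* "granted" and "requested" are lock modes on a conversion path. */
void update_lvb_sketch(int granted, int requested,
                       char *resource_lvb, char *lock_lvb)
{
        if (granted >= LKM_PWMODE && requested < granted)
                memcpy(resource_lvb, lock_lvb, MAXLOCKVAL);  /* write back   */
        else
                memcpy(lock_lvb, resource_lvb, MAXLOCKVAL);  /* refresh copy */
}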

2.1 "dlm_scnop"

Manipulates a System Commit Number (SCN), a specialized use of the lock value
block associated with a resource.  The SCN is a cluster-global counter used
to ??

The SCN is stored in the LVB associated with a lock resource.
An LVB is an array of MAXLOCKVAL bytes (16 bytes as defined currently, but
for OpenGFS, we need to change it to 32 bytes). The lock manager represents
an SCN value of up to 128 bits (this fills the 16-byte LVB size) by using four
end-to-end concatenated unsigned integers. 

These integers are the four fields contained in the "scn_t" structure, defined
in "include/clmstructs.h", line 871. The following figure illustrates how the 
"scn_t" structure overlays the bytes in the LVB:

       	bytes						    
       	-------------------------------------------------   
   LVB	| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|   
	-------------------------------------------------   
	     	       		    	     		    
	unsigned integers	    	     		    
       	-------------------------------------------------   
   SCN 	|   base    |    wrap1  |    wrap2  |    wrap3 	|    
	-------------------------------------------------   
	
You define the range of the SCN counter with the "bit_len" parameter passed to
"dlm_scnop". If you specify a "bit_len" value of 32 or less, the lock manager
uses only the "base" field of the structure. If you specify a "bit_len" value
of 64 or less, the lock manager uses the "base" and "wrap1" fields of the
structure; the integer value in the "base" field overflows into the "wrap1"
field. The value of the SCN should be interpreted by concatenating the integer
fields, not by adding them. For example, the following figure shows the
representation of an SCN value of 2^32:

       	-------------------------------------------------   
   SCN 	|0000......0|000......01|    N/A    |    N/A    |    
	-------------------------------------------------   
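
A small sketch of how "scn_t" overlays the LVB and how the concatenation reads
back as a number (field names follow the figures above; 32-bit integer sizes
are assumed, and the real definition is in include/clmstructs.h):

#include <stdint.h>

typedef struct {
        uint32_t base;     /* LVB bytes  0..3   */
        uint32_t wrap1;    /* LVB bytes  4..7   */
        uint32_t wrap2;    /* LVB bytes  8..11  */
        uint32_t wrap3;    /* LVB bytes 12..15  */
} scn_sketch_t;

/* With bit_len <= 64, the counter is the concatenation wrap1:base, so an SCN
 * of 2^32 is base == 0, wrap1 == 1, as shown in the figure above. */
uint64_t scn_to_u64(const scn_sketch_t *scn)
{
        return ((uint64_t)scn->wrap1 << 32) | scn->base;
}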
	
"dlm_scnop" is implemented in "api/api_scn.c", line 64. The SCN operation 
("DLM_SCN_OP") is also one of the basic request types in OpenDLM, and it is
implemented in very similar way that "dlmlock"/"dlmunlock" is. It can be also
divided into two phase.

2.1.1 First phase

Just like "dlmunlock", a description of the SCN request is built and 
committed in the first phase. The request is also represented by a transaction.
Following is the function calling graph for the first phase of "dlm_scnop":

dlm_scnop							      
    |	 						      				     
    |-----> api_reqresp  			 	      	     							     
    |	     |       	 		 		      	     		 		 		     
    |	     |---------> k_clm_send    	       	       	      
    |	     |       	     |       	 		      
    |	     |               V	    	 		      
    |	     |           clm_send   	 		      
    |	     |               |	    	 		      
    |	     |               V	    	 		      
    |	     |           clm_process	 	      	      
    |	     |               |	      	 	      	      
    |	     |               V 	      	 	      	      	
    |	     |           clm_request  	 	      	      	
    |	     |	             | 	      	 	      	      	
    |	     |               V 	      	 	      	      	
    |	     |           begin_scn_op  	       	 	       	     	
    |	     |	      	     | 	       	       	  	      	
    |	     |       	     V 	       	       	       	      	
    |        |         	 pti_call                             	     
    |	     |		     | 	      	      	       	      	
    |        | 	       	     V                    	      
    |	     |         	 dlm_pti_send 	 		      	
    |  	     | 	             |                 	  	      
    |  	     | 	             V 	       	    	  	      
    |	     |           cccp_send            	       	      	     	
    |	     |	             |                         	      	     	
    |	     |<--------------|                	      	      	     	
    |	     |       	       	      	 	      	      	
    |	     |---------> k_clm_recv   	 	      	      	
    |	     |         	     | 	      	 	      	      	
    |	     | 	       	     V 	      	 	      	      	
    | 	     | 	       	 clm_recv     	 	      	      	
    |        | 	       	     | 	       	       	       	      
    | 	     |<--------------| 	       	       	      	      	
    |        | 	       	       	       	       	       	      
    |<-------|						      
    |		  				      	      	     			     
    O		  				      	      	     			     
 
The implementation of "dlm_scnop" is very similar to "dlmlock", hence only the
difference point will be discussed in following part.

The transaction that represents the SCN request is built in "dlm_scnop", and
then "api_reqresp" is called directly. Before entering "begin_scn_op", the 
callback function ("cont_scn_op") for starting the operations of the second
phase is specified in "clm_request". Since a "NL" mode lock must be acquired
on the "resource" in question, the node ID of "resource" master can be found
in local "resource" copy. And this SCN request is sent to master's "PORT_MSTR"
port dircetly. Note, the UDP reply message is not sent out, hence the client
will be blocked until the respond from master is received.

2.1.2 Second phase

When the lock resource master receives the transaction, "pti_read_request" is
called by "CCCP". Following is the function calling graph for servicing the
SCN operation request (transaction):

	pti_read_request	              	       	       			      			  
              |      	       	                      	       			  
	      V	      	   	       		       			  
	master_scn_op      		       			  

According to the above graph, all SCN operations are done in "master_scn_op".
After that, the result (also described by a transaction data structure) is
sent back to the "PORT_RESPONSE" port of the sender of the SCN operation
request.

"pti_read_response" is called when the respond transaction is received by the
sender of the SCN operation request. Following is the call graph:

pti_read_response  	   		       	      			    
    |			   		       	      			    
    |----> pti_proc_response	       	      			    		    
    |          	|  		       	      	  		    		    		
    |          	|-----------------> match_request 	  		    	    		
    |         	|      	       	          |	      			    	    	       	
    |      	|<------------------------|	      			    	    		
    |      	|      	       	       	      			    		    		
    |      	|-----------------> cont_scn_op   	       	      	    		
    |          	|      	       	       	  |  	   		  		    	       	
    |  	       	| 			  V
    |		| 		    clm_reply
    |		| 			  |
    |		|<------------------------|
    |	       	|      	       	       	  
    |<----------| 			  
    |  	       	  
    O		  

According to the above graph, "cont_scn_op" returns the result of the request
to the client by calling "clm_reply". The client is unblocked at this point
and receives the result of the operation.


Deadlock Detection
------------------

While OpenDLM permits processes to synchronize their access to shared
resources, it cannot prevent them from creating deadlocks. Deadlocks are not a
problem for OpenDLM itself, but a mechanism for detecting and resolving
deadlocks is built into OpenDLM to assist applications that use it.

The deadlock detection module is implemented in "kernel/dlmdk/clm_deadlock.c".
This module appears to be implemented based on a distributed deadlock
detection algorithm ("path-pushing").

In this algorithm, a local TWFG (Transaction Wait-For Graph) is built on each
node, gathering both resource waits and communication waits. Resources that
are not mastered locally, and non-local clients, are marked as external
vertices in the TWFG. To detect deadlocks, the following actions are performed
on each node:

	1. Wait for, and receive, deadlock-related information (produced in the
	previous deadlock detection iteration) from other nodes.

	2. Combine the received information with the local TWFG and search for
	cycles.

	3. Resolve local deadlocks by choosing a victim from each local cycle
	that does not contain an external vertex.

	4. Send information about each cycle that contains an external vertex
	to the node on which the external resource is mastered, or on which the
	external client resides.

The deadlock detection cycle ends when a node finds no deadlocks and no
external vertices in its TWFG (including both the local TWFG and any received
TWFG). The details of this module are discussed in the following sections.

1. Start deadlock detecting

In OpenDLM, deadlock detection is a periodic task. It is executed every
"deadlock_timeout/2" seconds (default value: 30/2 seconds). Since it is based
on a distributed deadlock detection algorithm, inter-node coordination is
necessary, and all communication among the nodes is transferred via the
"PORT_DEADLOCK" port of "CCCP".

In "clmdd_init", the communication infrastructure for detecting deadlock is 
initialized. "clmdd_incoming" is assigned to process all incoming message from
"PORT_DEADLOCK". Following graph illustrates how the deadlock detecting starts
every time:

clm_master_loop	       	       	       	       	     clmdd_incoming   
      |						       	    |	      
      |-> do_timer_stuff			       	    |MSG_STRT 
      |		|				    ----------------- 
      |		|-> clmdd_task			    |  	|   |  	|   | 
      |		|      	 |    	       		    V  	V   V  	V   V 
      |		|	 |-> clmdd_primo	   (1) (2) (3) (4) (5)
      |		|	 | 	  |    					      
      |		|	 |   N 	  V    Y				      
      |		|	 |   ----------- 				      
      |		|	 |   | 	       | 				      
      |		|	 |<--| 	       V 				      
      |		|	 |     	 clmdd_startstart			      
      |		|	 |	       |       				      
      |		|	 |	       |-> clmdd_suspect		      
      |		|	 |	       |       	|      			      
      |		|	 |	       |<-------|			      
      |		|	 |	       |       				      
      |		|	 |	       |-> clm_send			      
      |		|	 |	       |        |  			      
      |		|	 |	       |<-------|			      
      |		|	 |	       | 				      
      |		|	 |<------------| 				      
      |		|	 |	        				      
      |		|<-------|	        
      |		|      	  	        
      |<--------|	  	        
      |	       	 	  	        
      O	       	       	  	        
       		 	  	       
According to the above graph, the inter-node coordination involved in starting
deadlock detection is a little complex. It is necessary, however, because a
deadlock condition is likely to block several clients' requests at about the
same time. If each node kicked off its periodic deadlock detection cycle
without some kind of coordination, detection cycles would start on all of the
affected nodes at roughly the same time.

This method is used to synchronize the beginning of a cycle. The cycle begins
on the lowest-numbered site (determined by "clmdd_primo"), triggered by the
periodic task "clmdd_task". In "clmdd_startstart", the first node calls
"clmdd_suspect" to check its deadlock detection queue, "dl_timeout", for
suspicious requests. All blocked lock requests are linked into the "dl_timeout"
queue, and the criterion for a suspicious request is that it has been blocked
for more than "deadlock_timeout" seconds. If such a request is found, the node
sets the start message timestamp to the age of the oldest one; otherwise it
sets "sdd->sdd_node" to NO_NODE. The start message is then sent to the next
node in the cluster. The start message is represented by a structure named
"start_msg_t". Following are all the fields in "start_msg_t":

	sdd_cycle		- Deadlock cycle number
	sdd_instigator		- Instigator of this cycle
	sdd_node		- Node of oldest waiting lock request
	sdd_tv			- Age of oldest waiting lock request
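
A hedged sketch of what "start_msg_t" might look like; the field types and the
use of struct timeval are assumptions based on the descriptions above, not
taken from the real header:

#include <sys/time.h>

#define NO_NODE (-1)                    /* assumed sentinel: no suspicious request  */

typedef struct start_msg_s {
        int            sdd_cycle;       /* deadlock cycle number                    */
        int            sdd_instigator;  /* node that instigated this cycle          */
        int            sdd_node;        /* node of oldest waiting lock request,
                                           or NO_NODE if none was found             */
        struct timeval sdd_tv;          /* age of the oldest waiting lock request   */
} start_msg_t;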

When a node receives the start message (type "MSG_STRT"), "clmdd_incoming" is
called, and there are five possibilities:

	(1). The receiving node is the instigator and no suspicious request is
	found.
	
	(2). The receiving node is the instigator and the oldest suspicious
	request also resides on the receiving node.

	(3). The receiving node is the instigator and the oldest suspicious
	request doesn't reside on the receiving node.

	(4). The receiving node isn't the instigator and the oldest suspicious
	request resides on the receiving node.

	(5). The receiving node isn't the instigator and the oldest suspicious
	request doesn't reside on the receiving node. (Or not found)

For condition (1), there is no need to start a deadlock detecting cycle; the
current cycle reaches its end.
For conditions (2) and (4), the deadlock detecting cycle is started by calling
"clmdd_check".
For condition (3), since the start message has already traveled around all the
nodes, it is forwarded to the node that owns the oldest suspicious request.
For condition (5), the receiving node checks its own queue, and if its oldest
request is older than the one in the start message, it sets the timestamp and
node ID to its own values. All of this is done by calling "clmdd_dostart".
A sketch of this decision logic is shown below.
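
The decision logic can be summarized in a short sketch. None of the helper
signatures below come from the OpenDLM source; the two node-id parameters stand
for the "sdd_instigator" and "sdd_node" fields of the start message:

#define NO_NODE (-1)                             /* assumed sentinel value        */

extern void clmdd_check(void);                   /* begin detection on this node  */
extern void clmdd_dostart(void *msg);            /* compare ages, update message  */
extern void forward_start_msg(int node, void *msg);          /* hypothetical      */

void handle_msg_strt(void *msg, int instigator_node,
                     int oldest_node, int this_node)
{
        int i_am_instigator = (instigator_node == this_node);

        if (i_am_instigator && oldest_node == NO_NODE)
                return;                                 /* (1) cycle ends here     */
        if (oldest_node == this_node)
                clmdd_check();                          /* (2) and (4)             */
        else if (i_am_instigator)
                forward_start_msg(oldest_node, msg);    /* (3) send to the owner   */
        else
                clmdd_dostart(msg);                     /* (5) compare ages locally */
}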

Note that we need to use the *age* of requests, rather than the timestamp, due
to the implementation of gettimeofday() in the Linux kernel.
			 
2. Build deadlock detecting stack

The deadlock detecting stack is the core data structure for the whole deadlock
detecting cycle. It is used to transfer the TWFG among nodes. It is represented
as a queue of "ddx_t" structures; "ddx_t" is defined as follows:

typedef struct ddx_s {
        ddx_flavor_t    ddx_type;               /* entry type (see ddx_flavor_t below) */
        union {
                struct {                        /* CLIENT / XCLIENT entry */
                        int             cli_site;
                        clm_client_id_t cli_id;
                        unsigned long   cli_rid;
                        int             cli_marked;
                } un_cli;
                struct {                        /* RSRC / XRSRC entry */
                        int             rsc_namelen;
                        char            *rsc_name;
                        short           rsc_mode;
                        int             rsc_lockid;
                        int             rsc_locknode;
                } un_rsc;
                struct {                        /* blocked lock request */
                        unsigned long   blk_lockid;
                        int             blk_site;
                        unsigned long   blk_age;
                } un_blk;
        } ddx_un;
} ddx_t;

"ddx_t" is used for describing the in-core wait-for graph. A node in the graph
is a discriminated union; permissible types are given in the enum
"ddx_flavor_t":

	INVALID		- Invalid entry, for bombproofing
	CLIENT		- Known client reference
	XCLIENT		- Unknown client reference
	RSRC		- Resource reference
	XRSRC		- Unknown resource reference
	BRANCH		- Beginning of a set
	RETURN		- End of a set
	SCRUBBED	- Null entry, but not invalid
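
A hedged C sketch of the enum; the member order and values are assumptions,
and only the names and meanings come from the list above:

typedef enum ddx_flavor_e {
        INVALID = 0,    /* invalid entry, for bombproofing          */
        CLIENT,         /* known client reference                   */
        XCLIENT,        /* unknown (i.e. remote) client reference   */
        RSRC,           /* resource reference                       */
        XRSRC,          /* unknown (i.e. remote) resource reference */
        BRANCH,         /* beginning of a set                       */
        RETURN,         /* end of a set                             */
        SCRUBBED        /* null entry, but not invalid              */
} ddx_flavor_t;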

As mentioned before, the deadlock detecting cycle starts by calling
"clmdd_check", and its first task is to build the local TWFG:

clmdd_check
     | 	   
     |-> clmdd_analyze
     |	      |	      
     |	      |------------
     |	      |	  	  |
     |	      |	  	  V
     |	      |	---------------------
     |	      |	| |->clmdd_pushr--- |
     |	      |	| |    	       	  | |
     |	      |	| ---clmdd_pushc<-| |
     |	      |	---------------------
     |	      |	  	  |    	    
     |	      |<----------|
     |	      |	  	  
     |<-------|	  
     | 	       	  
     V 	       	
clmdd_processddx
 
In "clmdd_check", every suspicious request is checked by calling 
"clmdd_analyze". "clmdd_pushr" and "clmdd_pushc" is invoked recursively to
build the local TWFG:

	clmdd_pushr	- Push a resource corresponding to a lock request onto
			the deadlock detection stack. A resource will 
			eventually expand into a list of clients holding locks
			on that resource (i.e., in the wait/convert queues) that
			are incompatible with the lock request passed in as an
			argument. If the resource is mastered locally, the 
			expansion is done here. If the resource is remote, an
			"XRSRC" entry is pushed onto the deadlock detection 
			stack. In case of an error the deadlock stack is freed.

	clmdd_pushc	- Push a client onto the deadlock detection stack.
			Eventually, a client expands into a list of all its
			blocked lock requests. If the client is local, it is
			marked and expanded. If the client is remote, an "XCLIENT"
			entry is pushed onto the deadlock detection stack. In
			case of an error, free the deadlock stack.
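
Below is a very rough sketch of this mutual recursion, relying on the
"ddx_flavor_t" constants sketched earlier. All of the list-walking helpers are
assumptions, and the real "clmdd_pushr"/"clmdd_pushc" also handle error cleanup
and stack-encoding details that are omitted here. The BRANCH/RETURN markers
bracket each resource expansion, as in the figure that follows.

#include <stddef.h>

struct ddx_stack;                       /* the deadlock detection stack             */
struct resource;
struct client;

extern int  resource_is_local(struct resource *r);
extern int  client_is_local(struct client *c);
extern int  client_is_marked(struct client *c);
extern void mark_client(struct client *c);
extern void push_entry(struct ddx_stack *s, int flavor, void *obj);
extern struct client   *next_incompatible_holder(struct resource *r,
                                                 void **iter);
extern struct resource *next_blocked_on(struct client *c, void **iter);

static void pushc_sketch(struct ddx_stack *s, struct client *c);

static void pushr_sketch(struct ddx_stack *s, struct resource *r)
{
        void *iter = NULL;
        struct client *holder;

        if (!resource_is_local(r)) {
                push_entry(s, XRSRC, r);        /* let the master node expand it    */
                return;
        }
        push_entry(s, BRANCH, NULL);            /* open this resource's expansion   */
        while ((holder = next_incompatible_holder(r, &iter)) != NULL)
                pushc_sketch(s, holder);        /* clients blocking the request     */
        push_entry(s, RETURN, NULL);            /* close the expansion              */
}

static void pushc_sketch(struct ddx_stack *s, struct client *c)
{
        void *iter = NULL;
        struct resource *r;

        if (!client_is_local(c)) {
                push_entry(s, XCLIENT, c);      /* expanded later on its home node  */
                return;
        }
        push_entry(s, CLIENT, c);               /* record the client                */
        if (client_is_marked(c))
                return;                         /* seen before: a possible circle   */
        mark_client(c);
        while ((r = next_blocked_on(c, &iter)) != NULL)
                pushr_sketch(s, r);             /* resources this client waits for  */
}

Tracing these two functions on the example in the next figure yields exactly
the stack shown there (C1, BR, C2, BR, C3, RT, BR, C1, RT, RT).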

The following figure illustrates a deadlock detecting stack (for simplicity,
all resources and clients are local):

               TWFG              ddx_stack

           ------                 ------              
       	   | C1	|<----|		  | C1 |	     
      	   ------     |           |----|
      	     | 	      |	  	  | BR |
       	     V 	      |   	  |----|
	    (R1)      |	  	  | C2 |
	     | 	      |   	  |----|
	     V	      |	  	  | BR |
	   ------     |   	  |----|
	   | C2 |     |	  	  | C3 |
	   ------     |	  	  |----|
	     |	      |		  | RT |
	     |	      |		  |----|
	   ------     |		  | BR |
	   |   	|     |		  |----|
	   V	V     |		  | C1 |
	 (R2)  (R3)   |		  |----|
	   |   	|     |		  | RT |
	   V  	-------		  |----|
	------			  | RT |
	| C3 |			  ------
	------
	
	C - Client
	R - Resource
	BR - Branch
	RT - Return (end of branch)

3. Deal with "ddx" stack

After building the local TWFG, "clmdd_processddx" is called with the ddx stack
as input. As mentioned before, only circles without external vertices are
checked, and broken if any exist. For circles that contain an external vertex,
the related information (part of the TWFG) is sent to the proper node. The
deadlock detecting cycle ends when a node finds neither deadlocks nor external
vertices in its TWFG (including both the local TWFG and any received TWFG).
The following is the calling graph for the deadlock detecting cycle:

                           clmdd_processddx
       	       	       	       	   |		
			   	   V				 
			      clmdd_walk			  
			       	   |   	 			  
		   N   	       	   V   	       	   Y		  
		   ---------------------------------		  
		   |   	       	       	       	   |		  
		   V   	       	       	       	   V		  
	    clmdd_massage      	       	     shoot_me_now	  
	     	   |   	     	   	 	   |   	 	  
	     	   V 	       	       	   remote  V   local	  
	    clmdd_encode     	   	   -----------------	  
		   |   	     	   	   |   	       	   |	  
	       	   V 	     	   	   V	       	   V	  
	    cccp_send	     	       cccp_send      clmdd_purgeclnt
	    (MSG_TWFG) 	       	       (MSG_PURG)      	   |	     
			     	       	       	   remote  V   local 
			     	   	       	   ----------------- 
			     	   	       	   |   	       	   | 
			     	   	       	   V		   V 
			     	   	       cccp_send      clmdd_purgersrc
			     	   	       (MSG_BLOW)      
			     	   	       	    	       
			     clmdd_incoming    	    	       
			     	   |	       	    	       
			     	   V	       	    	       
		   --------------------------------- 	       
		   |   	       	   |	       	   | 	       
		   |   	       	   |   	       	   |   	       
	       (MSG_TWFG)      (MSG_PURG)      (MSG_BLOW)      
		   |   	       	   |   	       	   |	       
		   V	     	   V	       	   V	       
	    clmdd_decode       clmdd_purgeclnt clmdd_purgersrc 
	           |	       	   |	       	    	       
	       	   V	       	   |	       	    	       
	    clmdd_processddx   	   |	       		       
		       	       	   |   	       	       	       
		           remote  V   local           	       
		       	   -----------------   	       	       
		       	   |   	       	   |	    	       
		       	   V		   V	    	       
		        cccp_send    clmdd_purgersrc           
		        (MSG_BLOW)      	   	       
		        	       	       	   	       	
According to the above graph, "clmdd_walk" is called first to detect circles
without external vertices. If no such circle is found, "clmdd_massage" is
called to decide the destination of the "MSG_TWFG" message. The other effect
of "clmdd_massage" is to scrub the "ddx" stack, so that only the part of the
TWFG that contains external edges is sent out. Next, the "ddx" stack is
encoded into a "MSG_TWFG" message by "clmdd_encode", and the message is sent
to port "PORT_DEADLOCK" of the destination node. A rough sketch of this
decision follows.
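
A rough sketch of the decision just described; the helper names are
assumptions, and "handle_local_circle" is sketched after the next two
paragraphs:

struct ddx_stack;

extern int  walk_finds_local_circle(struct ddx_stack *s);    /* clmdd_walk      */
extern void handle_local_circle(struct ddx_stack *s);        /* sketched below  */
extern int  massage_and_pick_dest(struct ddx_stack *s);      /* clmdd_massage   */
extern void encode_and_send_twfg(int node, struct ddx_stack *s);
                                        /* clmdd_encode + cccp_send(MSG_TWFG)   */

void processddx_sketch(struct ddx_stack *s)
{
        if (walk_finds_local_circle(s)) {
                handle_local_circle(s);         /* a deadlock: pick a victim        */
        } else {
                int dest = massage_and_pick_dest(s);    /* keep only external edges */

                if (dest >= 0)
                        encode_and_send_twfg(dest, s);  /* MSG_TWFG to PORT_DEADLOCK */
                /* otherwise: no external edges left, the cycle ends on this node    */
        }
}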

If "clmdd_walk" does find a circle without an external vertex, a deadlock
condition exists. "shoot_me_now" is then called to find the most recently
blocked request in this circle. If that request resides on a remote node, a
"MSG_PURG" message is sent to port "PORT_DEADLOCK" of that node. If it
resides on the local node, "clmdd_purgeclnt" is called to deny the request.

"clmdd_purgeclnt" needs to find the requested resource first. If this resource
is mastered locally, "clmdd_purgersrc" is called to deny the blocked request
directly. If this resource is mastered on remote node, a "MSG_BLOW" message is
sent to port "PORT_DEADLOCK" of the resource master.
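
A rough sketch of this victim path, again with assumed helper names and
message-type values; the most recently blocked request in the circle is denied
by whichever node can actually do it:

struct ddx_stack;

struct victim {
        int req_site;           /* node where the blocked client resides          */
        int rsrc_master;        /* node mastering the requested resource          */
};

enum { MSG_STRT = 1, MSG_TWFG, MSG_PURG, MSG_BLOW };    /* values assumed          */

extern struct victim *pick_youngest_in_circle(struct ddx_stack *s);
                                                        /* shoot_me_now's job      */
extern void cccp_send_sketch(int node, int msgtype, void *payload); /* assumed     */
extern void deny_request_locally(struct victim *v);     /* clmdd_purgersrc's job   */
extern int  this_node;

static void purgeclnt_sketch(struct victim *v)          /* clmdd_purgeclnt's logic */
{
        if (v->rsrc_master != this_node)
                cccp_send_sketch(v->rsrc_master, MSG_BLOW, v);  /* resource master */
        else
                deny_request_locally(v);                        /* clmdd_purgersrc */
}

void handle_local_circle(struct ddx_stack *s)
{
        struct victim *v = pick_youngest_in_circle(s);  /* most recently blocked   */

        if (v->req_site != this_node)
                cccp_send_sketch(v->req_site, MSG_PURG, v);     /* that node purges */
        else
                purgeclnt_sketch(v);                            /* clmdd_purgeclnt  */
}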

On the other hand, "clmdd_incoming" is invoked whenever a message arrives on
port "PORT_DEADLOCK". Depending on the type of the message, different
functions are called to process the incoming information.

On receiving a "MSG_TWFG" message, "clmdd_decode" rebuilds the "ddx" stack
from the incoming message and local information, and "clmdd_processddx" is
then called with that stack as input. Processing continues exactly as when
"clmdd_processddx" is called on the first node: the deadlock detecting cycle
ends when a node finds neither deadlocks nor external vertices in its TWFG
(including both the local TWFG and any received TWFG).

On receiving a "MSG_PURG" message, "clmdd_purgeclnt" is called to deny the
specified request; processing then continues exactly as when "clmdd_purgeclnt"
is called on the first node.

On receiving a "MSG_BLOW" message, "clmdd_purgersrc" is called to deny the
specified request directly. A sketch of this message dispatch follows.
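
Putting the receive side together, here is a sketch of the dispatch in
"clmdd_incoming"; the message-type values and helper signatures are
assumptions, and "processddx_sketch" refers to the sketch given earlier:

struct ddx_stack;

enum { MSG_STRT = 1, MSG_TWFG, MSG_PURG, MSG_BLOW };    /* values assumed          */

extern struct ddx_stack *decode_twfg_msg(void *msg);    /* clmdd_decode            */
extern void processddx_sketch(struct ddx_stack *s);     /* sketched earlier        */
extern void deny_request_for_client(void *msg);         /* clmdd_purgeclnt's job   */
extern void deny_request_at_resource(void *msg);        /* clmdd_purgersrc's job   */

void incoming_sketch(int msgtype, void *msg)
{
        switch (msgtype) {
        case MSG_TWFG:          /* merge the remote TWFG and keep walking          */
                processddx_sketch(decode_twfg_msg(msg));
                break;
        case MSG_PURG:          /* deny the specified blocked request              */
                deny_request_for_client(msg);
                break;
        case MSG_BLOW:          /* deny it directly, this node masters the resource */
                deny_request_at_resource(msg);
                break;
        default:                /* MSG_STRT is handled as described in section 1   */
                break;
        }
}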

At this point, a complete deadlock detecting cycle has ended.





