OpenDLM Locking

Copyright 2004 The OpenDLM Project

Author:  Stanley Wang, stanley.wang@intel.com
         Ben Cahill (bc), ben.m.cahill@intel.com

Introduction
------------

This document describes the implementation of the lock request/grant
procedure within OpenDLM, as well as the lock value block.

This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of lock management in OpenDLM.
It is not intended as a user guide to OpenDLM.  Look in the OpenDLM
WHATIS-odlm document for an overview of OpenDLM.  Look in the OpenDLM
HOWTO-install document for details of configuring and setting up OpenDLM.

This document may contain inaccurate statements, based on the authors'
limited understanding.  Please contact the authors if you see anything
wrong or unclear.

Overview
--------

The basic working model of the DLM Locking API can be divided into two
phases.  During the first phase, the API validates the arguments provided
by the caller, builds a description of the request, and sends this
description to the local lock manager (on *this* node).  The local lock
manager then forwards it to the proper lock manager (which may be on
another node).  This (forwarded) request is serviced during the second
phase.

The DLM Locking API is implemented in the src/api/ directory, in four
source files:  api_scn.c, api_setnotify.c, api_vms.c, ast_stub.c.

The OpenDLM Locking API can be used in either kernel or user space, and
the source code for the top level API is shared (although, of course, the
kernel API builds separately from the user-space API).  The source code
splits, however, for lower levels (just before committing the lock request
description to the local lock manager).  At the level of k_clm_send() and
below, user-level code is implemented in src/api, and kernel-level code is
implemented in src/kernel/dlmdk.

The lock manager lives in kernel space, regardless of whether it is
accessed via the user- or kernel-space API.  Source for the kernel level
functionality resides in the src/kernel/dlmdk directory.  The user space
API accesses the lock manager via a character device (IOCTL).  The kernel
space API calls directly into the lock manager.  User- and kernel-space
functionality are almost the same, with one big exception:  only the
user-space API supports synchronous (blocking) lock request calls.

There are nine DLM Locking Model-specific routines.  They can be divided
into several groups:

1. Locking operation

   dlmlock()         Makes an asynchronous request to acquire or convert a
                     lock on a lock resource.

   dlmlock_sync()    Acquires or converts a lock on a lock resource and
                     obtains a synchronous return.  (User space only)

   dlmlockx()        Makes an asynchronous request to acquire or convert a
                     lock on a lock resource, and specifies a transaction
                     ID for that lock.

   dlmlockx_sync()   Acquires or converts a lock on a lock resource,
                     specifies a transaction ID for that lock, and obtains
                     a synchronous return.  (User space only)

   dlmunlock()       Releases a lock on a lock resource or cancels a lock
                     request that is in blocked or converting state.

   dlmunlock_sync()  Releases a lock on a lock resource or cancels a lock
                     request that is in blocked or converting state, and
                     obtains a synchronous return.  (User space only)

2. Asynchronous notify

   ASTpoll()         Triggers the execution of pending AST routines.
                     (User space only)

   dlm_setnotify()   Specifies which signal the lock manager should use to
                     notify your application of a pending AST.
                     (User space only)
3. LVB operation

   dlm_scnop()       Manipulates the SCN, a specialized use of the lock
                     value block associated with a lock resource.
                     (Discussed in the LVB section of this document.)

Locking Operation
-----------------

The following discussion will focus on just the kernel space "VMS" API.
The VMS API is the one that was used on VAX computers, from which OpenDLM
was derived.  The UNIX API was developed for the Unix environment, but our
discussion will focus on the VMS API.

1. dlmlock()

This function can be found in api_vms.c, line 451.  It can be used for
acquiring a new lock, or converting a pre-existing lock, on a lock
resource.  As mentioned before, it can be divided into two phases.

1.1 First phase

A description of the request is built and committed in the first phase.
The request is represented by a data structure named "transaction",
defined in include/clmstructs.h, line 636.  This same structure is also
used for responses to lock requests.  Important fields in "transaction":

   clm_client_space - Client in user or kernel space
   clm_prog         - Desired service
   clm_status       - Status of request
   clm_type         - Request type, defined by "dlm_ops_t"
   clm_direction    - Indicates request or response, defined by "ptype_t"
   clm_locktype     - Indicates UNIX or VMS lock
   clm_sequence     - Transaction sequence
   pti_sequence     - Transaction sequence (What does "PTI" stand for?)
   clm_route        - Information necessary to complete a lock request
                      after it has been granted
   clm_pid          - Process ID of client (0x7F7F7F7F for kernel)
   clm_data         - Data of the request.  For a DLM lock operation, it
                      is a data structure named "lockreq".

Following is the function calling graph for the first phase of "dlmlock":

   dlmlock
     |
     |--> clm_lock_reqresp
            |
            |--> api_reqresp
                   |
                   |--> k_clm_send
                   |      |
                   |      V
                   |    clm_send
                   |      |
                   |      V
                   |    clm_process
                   |      |
                   |      V
                   |    clm_request
                   |      |
                   |      V
                   |    lockswitch.begin_lock (begin_lock_clm)
                   |      |
                   |      |--> sendudp
                   |      |
                   |      V
                   |    pti_call
                   |      |
                   |      V
                   |    dlm_pti_send
                   |      |
                   |      V
                   |    cccp_send
                   |
                   |--> k_clm_recv
                          |
                          V
                        clm_recv

According to the graph above, "clm_lock_reqresp" and "api_reqresp" build a
description of the request, in the form of a transaction structure.
"api_reqresp" is the lowest level call in the shared user/kernel source.
For the kernel implementation, it calls "k_clm_send" to direct the
transaction toward the proper lock manager, as described below.

"k_clm_send" grabs a spinlock before calling "clm_send".  "clm_send" makes
a copy (why?) of the transaction before calling "clm_process".
"clm_process" does a few sanity checks, then calls "clm_request".

"clm_request" fills in a few fields of the transaction structure, and
selects two functions based on the type of lock request (DLM_LOCK) and the
type of lock (VMS vs. UNIX API).  One function is the next function to
call within *this* node ("begin_lock_clm"), and the other is the function
for starting the operation of the second phase ("cont_lock_clm").  The
latter will be called from the port receiver on this or a remote node,
depending on where the transaction will be sent.  The selection of both
functions uses the "lockswitch" table to make the choice.  This table
contains calls, much like a Linux "ops" structure, that differentiate
between the VMS API and UNIX API functionality.
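As a rough illustration of that idea, a switch table of this kind might
look like the sketch below.  This is *not* the OpenDLM definition; the
structure layout and the stub bodies are invented for illustration (the
real table is in the files named in the next paragraph), only the member
names "begin_lock"/"cont_lock" and the handler names follow the text
above.

   /* Hypothetical sketch of an ops-style lock switch table.  The real
    * structure is in src/include/clm_locksw.h; this one is invented. */
   #include <stdio.h>

   struct transaction;                  /* stand-in for the real request */

   struct lockswitch {
           /* first-phase handler, runs on the requesting node */
           int (*begin_lock)(struct transaction *t);
           /* second-phase handler, runs where the response is received */
           int (*cont_lock)(struct transaction *t);
   };

   /* placeholder bodies, named after the real handlers */
   static int begin_lock_clm(struct transaction *t)
   { (void)t; printf("VMS begin\n");  return 0; }
   static int cont_lock_clm(struct transaction *t)
   { (void)t; printf("VMS cont\n");   return 0; }
   static int begin_lock_unix(struct transaction *t)
   { (void)t; printf("UNIX begin\n"); return 0; }
   static int cont_lock_unix(struct transaction *t)
   { (void)t; printf("UNIX cont\n");  return 0; }

   /* one entry per lock type: index 0 = VMS API, index 1 = UNIX API */
   static const struct lockswitch lockswitch_tab[] = {
           { begin_lock_clm,  cont_lock_clm  },
           { begin_lock_unix, cont_lock_unix },
   };

   int main(void)
   {
           int locktype = 0;            /* pretend clm_locktype says "VMS" */
           lockswitch_tab[locktype].begin_lock(NULL);
           return 0;
   }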
The table is defined in src/include/clm_locksw.h and
src/kernel/dlmdk/clm_locksw.c.

"begin_lock_clm" performs a number of sanity checks, and fills in more
fields of the transaction structure before sending it to the "master"
layer.  It differentiates between lock creation (new lock) and lock
conversion (changing the state of an existing lock).  Accordingly, some of
the processing is handled by either the "begin_clm_lock_new" or the
"begin_clm_lock_convert" function.

At this point, the transaction is ready to be sent to the lock master.
Before doing this, "begin_lock_clm" calls "sendudp" to send a reply
message (also a transaction) to the client that initiated this lock
request.  This reply gives status for phase 1 of the request processing.
For clients in the kernel, remember, synchronous (blocking) requests are
not supported.  This reply will provide the basis for the return from the
API call, providing information on the validity of the request.  The
actual request results will be provided to the client later, after
phase 2.

"sendudp" doesn't really send a UDP message between nodes; the client is
on *this* node.  It simply attaches the response transaction to a member
of the client's "client" structure (defined in src/include/clmstructs.h),
cp->cli_reply.  This is a queue of reply transactions.  Later,
"k_clm_recv" will look through this queue for the reply that matches the
"clm_sequence" field of the original request transaction.

Now, the request transaction needs to be sent to the lock resource master.
So, which node is the lock resource master for this lock request?

If the lock request is a "convert" request, the node ID of the lock
resource master can be found in the pre-existing lock record (struct
reclock, defined in src/include/clmstructs.h) of the lock in question, and
the request transaction is sent to the lock resource master node's
"PORT_MSTR" port via the "CCCP" transport.

On the other hand, if the lock request is a "lock" request, there is no
pre-existing reclock.  We need to obtain the node ID of the lock resource
master, via the local (on this node) directory table, or via the
"directory node", as follows, searching by the name (and type: UNIX or
VMS) of the lockable resource:

-- Search the local directory table via "rl_lookup()" (called from
   "pti_call()").  If either an active directory entry (i.e. *this* node
   is the directory node), or a cache directory entry (i.e. *this* node
   has a copy of the directory node's info) is found, we obtain the master
   node ID locally, and send the request directly to the lock resource
   master node's "PORT_MSTR" port via "CCCP", just as with a "convert"
   request.

-- If no local directory entry exists, calculate the directory node via
   "clm_direct()" (called from "dlm_pti_send()"), which applies a simple
   hash function to the resource name to determine the directory node.
   Send the transaction to the directory node's "PORT_DIRECT" port via
   "CCCP".  If the directory node and the lock resource master are the
   same node, it will deal with this transaction.  If not, the directory
   node will forward this transaction to the appropriate lock resource
   master node.

After sending the lock request, execution eventually returns to
"api_reqresp".  "k_clm_recv" is called at this point to retrieve the
status of the operations in phase 1.  The return status is also
represented as a transaction data structure.  "clm_recv" gets the status
from a linked list.  As mentioned before, it was linked to the list by
"sendudp".
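The reply-queue handling just described can be pictured with the following
simplified sketch.  It is hypothetical (invented type and field layout,
and no locking); the real code lives in the files mentioned above.

   /* Hypothetical sketch of the cli_reply queue: sendudp() attaches a
    * reply transaction to the client, and clm_recv() later picks out the
    * one whose clm_sequence matches the original request. */
   #include <stddef.h>

   struct reply {
           unsigned int  clm_sequence;  /* copied from the request */
           struct reply *next;
           /* ... status and result fields ... */
   };

   struct client {
           struct reply *cli_reply;     /* head of the reply queue */
   };

   /* what sendudp() conceptually does: queue the reply on the client */
   static void queue_reply(struct client *cp, struct reply *r)
   {
           r->next = cp->cli_reply;
           cp->cli_reply = r;
   }

   /* what clm_recv() conceptually does: find and unlink the match */
   static struct reply *take_reply(struct client *cp, unsigned int seq)
   {
           struct reply **pp;

           for (pp = &cp->cli_reply; *pp != NULL; pp = &(*pp)->next) {
                   if ((*pp)->clm_sequence == seq) {
                           struct reply *r = *pp;
                           *pp = r->next;       /* unlink from the queue */
                           return r;
                   }
           }
           return NULL;   /* not there yet; the real code must wait */
   }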
Caveat:  There is a textbook race condition between "clm_recv" and
"sendudp".  An ugly work-around is used to resolve this problem.  More
detailed information can be found in "src/kernel/dlmdk/dlm_kernel.c",
line 368.  For the current execution chain, however, the reply gets put on
the client structure's queue before "clm_recv" gets called, so this race
condition should not be an issue in this case.

At last, the status of the first phase is returned to dlmlock.  First
phase finished!

1.2 Second phase

The second phase starts when the directory node or the lock resource
master receives the transaction.  At this point, a callback function is
called by the "CCCP" transport layer.  This is *not* the callback
function, specified in the request transaction, that continues the work of
the request.  Instead, it is a port handler which will parse the request
and send it on its way to the request callback.  Following is a list of
all port handling callbacks, some of which are not related to locking,
per se:

   PORT            CALLBACK
   PORT_MSTR       pti_read_request             clm_pti.c
   PORT_DIRECT     dir_read_request             clm_direct.c
   PORT_RESPONSE   pti_read_response            clm_pti.c
   PORT_DIRUPDT    dir_read_update              clm_direct.c
   PORT_MSTR_CR    pti_read_request             clm_pti.c
   PORT_BOUNCE     dir_bounce_request           clm_direct.c
   PORT_GROUP      grp_proc_request             clm_group.c
   PORT_NACK       clmll_read_nack_response     clm_locallock.c
   PORT_SYNC       clm_inband_sync              clm_msg.c
   PORT_SYNCRET    clm_inband_sync_resp         clm_msg.c

This list is defined in "cb_tab", in "kernel/dlmdk/clm_main.c", line 143.
In addition, the following are set up by calls to cccp_listen() in other
places in the code base:

   PORT            CALLBACK
   PORT_DEADLOCK   clmdd_incoming               clm_deadlock.c
   PORT_MIGR       clmm_incoming                clm_migrate.c
   PORT_RECOVERY   clmr_message_handler         clm_migrate.c

As mentioned before, the transaction may be sent to the directory node or
to the lock resource master.

If it is the directory node that receives the transaction, the
"dir_read_request" port handler is called by "CCCP", and then
"dir_proc_request" is called to do all the hard work.  If the
corresponding lock record is found within the directory node, the
destination of the transaction is modified to be the lock resource master
node's "PORT_MSTR" port.  If no lock record is found, the directory node
creates a new directory entry for the resource, and modifies the
transaction's destination to be the "PORT_MSTR_CR" port ("CR" means
"create" a new lock mastership) of the originator of this transaction (the
first node to request a lock on a given resource becomes the master for
that resource).  The "CCCP" transport layer forwards the transaction to
the new destination when returning from "dir_read_request".

All of the following operations are on the master node, and are the same,
regardless of whether the original request transaction was sent via the
directory node, or directly to the (known by the originator) master node.

If it is the lock resource master that receives the transaction, "CCCP"
calls the PORT_MSTR handler, "pti_read_request".  Following is the
function calling graph for servicing the lock request (transaction):

   pti_read_request
     |
     V
   lockswitch.master_lock (master_lock_clm)
     |
     |--[lock]----> clm_lock
     |                |
     |                |--[grant]--> grant_lock
     |                |--[block]--> blocked_lock
     |
     |--[convert]-> clm_convert
                      |
                      V
                    convert_lock_clm
                      |
                      |--[grant]--> grant_convert
                      |--[block]--> blocked_convert

According to the previous graph, the receiving port is checked in
"pti_read_request".
If this transaction is received from the "PORT_MSTR_CR" port, the
"DLM_RES_CREATE" flag is specified to distinguish it from a lock request
for an existing lock resource.

After processing this lock request, the result (also described by a
transaction data structure) is sent back to the "PORT_RESPONSE" port of
the sender of this lock request.  "pti_read_response" is the port handler
called when the response transaction is received by the sender of the lock
request.  Following is the call graph:

   pti_read_response
     |
     |--> pti_proc_response
            |
            |--> match_request
            |
            |--> lockswitch.cont_lock (cont_lock_clm)
            |      |
            |      |--[lock]----> cont_lock_clm_new
            |      |                |
            |      |                |--[grant]--> grant_lock
            |      |                |--[block]--> blocked_lock
            |      |
            |      |--[convert]-> cont_lock_clm_convert
            |                       |
            |                       |--[grant]--> grant_convert
            |                       |--[block]--> blocked_convert
            |
            |--> clm_reply
                   |
                   |--> dlm_reply_guts
                          |
                          |--> sendast

After receiving the response, the first thing to do is to find the
transaction corresponding to the original lock request.  This work is done
in "match_request" by using the "clm_sender" field of the transaction data
structure.  Then the callback function ("cont_lock_clm") that continues
the lock request is retrieved.  The local lock record is updated by
"cont_lock_clm_new" or "cont_lock_clm_convert", based on the response
transaction.  After that, "sendast" is called to deliver the AST to the
client.  For a user space client, the AST is added to a per-client linked
list, and if the notify signal number has been specified, a signal is also
sent to the client.  For a kernel space client, the AST is delivered
directly by using "kernel_thread".  After the AST is delivered, the lock
request is complete.

2. dlmlock_sync()

This function can be found in api_vms.c, line 540.  It can be used for
acquiring or converting a lock on a lock resource in a synchronous
fashion.  It is implemented on top of "dlmlock":  it calls "dlmlock"
directly with the "LKM_BLOCK" flag.

How does "LKM_BLOCK" work?  All the tricks are in "sendudp".  As mentioned
before, "sendudp" is called in the first phase of "dlmlock", and a reply
message is sent back to the client.  If "LKM_BLOCK" is specified,
"sendudp" does not send the reply message; it simply skips it.  Hence the
client blocks when it tries to retrieve the reply message.  In the second
phase, if "sendast" detects the "LKM_BLOCK" flag, it clears "LKM_BLOCK"
and calls "sendudp" to send the reply message to the client.  The blocked
request then completes, because the client finally gets the reply message.

3. dlmlockx()

This function can be found in api_vms.c, line 317.  It can be used for
acquiring or converting a lock on a lock resource, while specifying a
transaction ID for that lock.  It is almost the same as "dlmlock"; the
only difference is that it provides a transaction ID to the lock request
when calling "clm_lock_reqresp".  The transaction ID is used as the client
ID, instead of the PID used by "dlmlock".

4. dlmlockx_sync()

This function can be found in api_vms.c, line 405.  It can be used for
acquiring or converting a lock on a lock resource in a synchronous
fashion, while specifying a transaction ID for that lock.  It is
implemented on top of "dlmlockx":  it calls "dlmlockx" directly with the
"LKM_BLOCK" flag.
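The "LKM_BLOCK" mechanism used by both synchronous variants can be
summarized with the sketch below.  The helper names and the flag value are
invented for illustration; only the behavior (withhold the phase-1 reply,
deliver it from "sendast" in phase 2) is taken from the description above.

   /* Hypothetical sketch of layering a synchronous call on the
    * asynchronous path using LKM_BLOCK.  Names and values are invented. */
   #include <stdio.h>

   #define LKM_BLOCK 0x1                /* illustrative value only */

   struct transaction { int flags; };

   static void queue_reply_to_client(struct transaction *t)  /* "sendudp" role */
   {
           (void)t;
           printf("reply queued, blocked caller can return\n");
   }

   static void deliver_ast(struct transaction *t)
   {
           (void)t;
           printf("AST delivered\n");
   }

   /* phase 1: normally the phase-1 reply is queued immediately */
   static void phase1_reply(struct transaction *t)
   {
           if (t->flags & LKM_BLOCK)
                   return;              /* skip it: the caller stays blocked */
           queue_reply_to_client(t);
   }

   /* phase 2 ("sendast" role): the request has really been serviced */
   static void phase2_complete(struct transaction *t)
   {
           if (t->flags & LKM_BLOCK) {
                   t->flags &= ~LKM_BLOCK;
                   queue_reply_to_client(t);  /* now the blocked caller wakes */
           }
           deliver_ast(t);
   }

   int main(void)
   {
           struct transaction t = { LKM_BLOCK };  /* dlmlock_sync()-style */
           phase1_reply(&t);            /* prints nothing: reply withheld */
           phase2_complete(&t);         /* reply + AST delivered together */
           return 0;
   }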
5. dlmunlock()

This function can be found in api_vms.c, line 651.  It can be used for
releasing a lock on a lock resource, or canceling a lock request that is
in blocked or converting state.  Like "dlmlock", it can be divided into
two phases.

5.1 First phase

Just like "dlmlock", a description of the unlock request is built and
committed in the first phase.  The request is also represented by a
transaction.  Following is the function calling graph for the first phase
of "dlmunlock":

   dlmunlock
     |
     |--> api_reqresp
            |
            |--> k_clm_send
            |      |
            |      V
            |    clm_send
            |      |
            |      V
            |    clm_process
            |      |
            |      V
            |    clm_request
            |      |
            |      V
            |    lockswitch.begin_unlock (begin_unlock_clm)
            |      |
            |      |--> pre_proc_unlock_clm
            |      |
            |      |--> one of five outcomes:  (1) (2) (3) (4) (5)
            |
            |--> k_clm_recv
                   |
                   V
                 clm_recv

The implementation of "dlmunlock" is very similar to that of "dlmlock",
hence only the differences will be discussed below.

The transaction that represents the unlock request is built in
"dlmunlock", and then "api_reqresp" is called directly.  Before entering
"begin_unlock_clm", the callback function ("cont_unlock_clm") for starting
the operations of the second phase is specified in "clm_request".

In "begin_unlock_clm", "pre_proc_unlock_clm" is called to see if the
unlock request can be processed completely on this node, or at least
partially before it gets passed to the master.  This routine handles the
initial processing of an unlock request before it is forwarded to the
distributed layer.  If the lock resource is mastered locally, this
optimization yields a significant performance improvement.  Depending on
the outcome of "pre_proc_unlock_clm", there are five possible results and
corresponding operations:

(1). An error is detected locally.  The unlock request is aborted.  The
     reply message and the unlock AST are sent to the client.  No
     interaction with the transport layer is needed.  This work is done by
     calling "clm_reply".

(2). The lock resource is mastered locally, and processing is complete.
     The reply message and the unlock AST are sent to the client.  No
     interaction with the transport layer is needed.  This work is done by
     calling "send_unlck_ast" and "clm_reply" sequentially.

(3). The lock resource is mastered remotely.  The local unlock completed.
     The unlock request is forwarded to the lock resource master.  The
     reply message and the unlock AST are sent to the client.  This work
     is done by calling "send_unlck_ast", "pti_call" and "clm_reply"
     sequentially.

(4). The lock resource is mastered remotely.  The lock handle is found,
     but a local unlock is not possible because of the local lock state.
     The unlock request is forwarded to the lock resource master.  The
     reply message is sent to the client, and the unlock AST is delivered
     when the response comes back.  This work is done by calling "sendudp"
     and "pti_call" sequentially.

(5). The lock resource is mastered remotely.  The lock handle is not
     found, and a local unlock is not possible.  (This condition occurs
     when trying to cancel an in-flight lock request.)  The reply message
     is sent to the client.  The unlock request is added to a pending
     queue in the lock record, and will be forwarded to the lock resource
     master when the response transaction for the original lock request is
     received.  This work is done by calling "sendudp" and "hold_cancel"
     sequentially.
(The held unlock request is resent by calling "submit_cancels" from
"cont_lock_clm" in the second phase of "dlmlock".)

After this point, all operations are the same as in "dlmlock".

5.2 Second phase

Just like "dlmlock", the transaction may be sent to the directory node or
to the lock resource master.

If it is the directory node that receives the transaction,
"dir_read_request" is called by "CCCP", and then "dir_proc_request" is
called to do all the hard work.  If the corresponding lock record is
found, the destination of the transaction is modified to be the lock
resource master's "PORT_MSTR" port.  The "CCCP" transport layer forwards
the modified transaction to the new destination when returning from
"dir_read_request".  The following operations are then the same as when
the lock resource master receives the transaction directly.  If no lock
record is found, only a "DLM_VOID" response transaction is built and sent
to the "PORT_RESPONSE" port of the sender of this transaction.  This
condition occurs when trying to unlock a lock that was just purged.

If it is the lock resource master that receives the transaction,
"pti_read_request" is called by "CCCP".  Following is the function calling
graph for servicing the unlock request (transaction):

   pti_read_request
     |
     V
   lockswitch.master_unlock (master_unlock_clm)
     |
     V
   convert_locks
     |
     V
   send_basts_again

According to the previous graph, all unlock/cancel work is done in
"master_unlock_clm".  After doing that, "convert_locks" is called to see
if there is any lock in "converting" or "waiting" state, and to try to
grant those locks.  This routine also sends blocking ASTs to the
corresponding clients by calling "send_basts_again".  At last, the result
(also described by a transaction data structure) is sent back to the
"PORT_RESPONSE" port of the sender of this unlock request.

"pti_read_response" is called when the response transaction is received by
the sender of the unlock request.  Following is the call graph:

   pti_read_response
     |
     |--> pti_proc_response
            |
            |--> match_request
            |
            |--> lockswitch.cont_unlock (cont_unlock_clm)
                   |
                   |--> one of three outcomes:  (1) (2) (3)

According to the previous graph, there are three possible operations in
"cont_unlock_clm", depending on the response transaction:

(1). The transaction is a response to a timeout request.  Processing still
     has to happen on the local node.  Call the master unlock to send the
     timeout AST, but no client response for the cancel is required,
     because a client didn't send it - the timeout subsystem did.  Free
     the request at the end.  This work is done by calling
     "local_unlock_clm" and "clm_free_trans" sequentially.

(2). Corresponding to condition (4) in the first phase, processing still
     has to happen on the local node.  Call the local unlock to handle the
     response, then send the reply to the client.  This work is done by
     calling "local_unlock_clm" and "clm_reply" sequentially.

(3). Corresponding to conditions (1), (2) and (3) in the first phase:
     after reading a master's response to an unlock request that has
     already been handled locally, toss it.  This work is done by calling
     "local_unlock_clm" and "clm_free_trans" sequentially.

NOTE:  Condition (5) in the first phase will eventually be converted into
one of the other four conditions.

After these operations, the whole unlock request completes.
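Returning to the grant pass performed by "convert_locks" above, the basic
idea can be sketched as follows.  The six lock modes and the compatibility
table are the usual VMS-style ones (an assumption here, not taken from the
OpenDLM source), and the queue handling is greatly simplified:  the real
routine must respect queue ordering rules and also issues blocking ASTs
via "send_basts_again".

   /* Hypothetical sketch of a grant pass over the convert/wait queues,
    * in the spirit of convert_locks().  Greatly simplified. */
   #include <stdbool.h>
   #include <stddef.h>

   enum mode { NL, CR, CW, PR, PW, EX };

   /* classic VMS-style lock-mode compatibility table (assumed) */
   static const bool compat[6][6] = {
           /*          NL    CR     CW     PR     PW     EX  */
           /* NL */ { true, true,  true,  true,  true,  true  },
           /* CR */ { true, true,  true,  true,  true,  false },
           /* CW */ { true, true,  true,  false, false, false },
           /* PR */ { true, true,  false, true,  false, false },
           /* PW */ { true, true,  false, false, false, false },
           /* EX */ { true, false, false, false, false, false },
   };

   struct lock {
           enum mode    granted;   /* current mode (grant queue) */
           enum mode    reqmode;   /* requested mode (convert/wait queues) */
           struct lock *next;
   };

   /* would the requested mode conflict with anything already granted? */
   static bool grantable(enum mode req, const struct lock *grant_q)
   {
           for (const struct lock *l = grant_q; l != NULL; l = l->next)
                   if (!compat[req][l->granted])
                           return false;
           return true;
   }

   /* try to grant converting locks first, then waiting locks */
   static void grant_pass(struct lock **grant_q, struct lock **convert_q,
                          struct lock **wait_q)
   {
           struct lock **queues[2] = { convert_q, wait_q };

           for (int q = 0; q < 2; q++) {
                   struct lock **pp = queues[q];
                   while (*pp != NULL) {
                           if (grantable((*pp)->reqmode, *grant_q)) {
                                   struct lock *l = *pp;
                                   *pp = l->next;        /* unlink */
                                   l->granted = l->reqmode;
                                   l->next = *grant_q;   /* move to grant queue */
                                   *grant_q = l;
                           } else {
                                   pp = &(*pp)->next;
                           }
                   }
           }
   }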
6. dlmunlock_sync()

This function can be found in api_vms.c, line 573.  It can be used for
releasing a lock on a lock resource, or canceling a lock request that is
in blocked or converting state, in a synchronous fashion.  It is almost
the same as "dlmunlock".  The most important difference between these two
functions is that "RESP_AST" is not specified in "dlmunlock_sync".
Without this flag, the reply message is not sent in conditions (4) and (5)
of the first phase, and is sent in condition (2) of the second phase.
Hence "dlmunlock_sync" blocks until all unlock work is done.

Asynchronous notify
-------------------

1. ASTpoll()

This function can be found in api/ast_stub.c, line 59.  It can be used for
triggering the execution of pending AST routines.  As mentioned before,
ASTs are placed in a per-client linked list.  "ASTpoll" accesses the lock
manager via a character device (IOCTL), retrieves the pending ASTs and
executes them.

2. dlm_setnotify()

This function can be found in api/api_setnotify.c, line 77.  It can be
used for specifying which signal the lock manager should use to notify
your application of a pending AST.  This routine updates a global variable
named "notify_signo".  The value of "notify_signo" is copied into the
transaction data structure when "dlmlock" is called, and the specified
signal is sent to the client when the lock manager adds an AST to the
client's pending list.

Lock management and LVB operation
---------------------------------

1. Lock record management

In order to manage all the locks, OpenDLM defines several data structures
to represent "resource", "lock" and "client".

1.1 "resource"

A "resource" can be any entity accessible to the operating system, or to
one or more processes.  A "resource" is represented by a data structure
named "resource", defined in include/clmstructs.h, line 1279.  A
"resource" is created when it is requested for the first time, and is
deleted when it is no longer referenced by any client.  Following are the
most important fields in this structure:

   refcount   - Reference count
   rh         - Resource handle
   type       - Resource type
   name       - Dynamic resource name
   namelen    - Length of resource name
   master     - Site id of master node
   lockqs     - Grant, convert, wait lists for lock requests
   value      - Lock value block
   rsrc_flags - State bits

The resource handle (rh) is the key field for managing all "resource"
structures.  OpenDLM stores all "resource" structures in a resource table
("restab"), and this handle contains the index into the resource table to
locate the resource.  Note that this handle is identical for a given
"resource", even if the resource has multiple copies on different nodes.
Hence, in the event of node failure, the resource table is easily
reconstructed from the elements known by the other nodes.  The "resource
handle" is kept with a lock object (reclock, discussed in the next
section) and used for fast access to the resource entry in "restab".

It is sometimes necessary to search the resource table (only on the
master) when creating a new resource.  To this end, a second table is
maintained, "reskeytab".  This is a hash table of pointers into "restab".
This provides the quick search capability used for ADD, FIND, and DELETE
operations.

The figure below shows the relationship between the two tables:  "restab"
is an array of segments, each holding an array of "resource" structures;
"reskeytab" is a hash table, keyed by resource name, whose "reshash"
entries point at the "resource" structures in "restab".

   "restab"
   -----------------------------------------
   |  seg 0  |  seg 1  |   ...   |  seg n  |
   -----------------------------------------
        |         |                   |
        V         V                   V
   ----------  ----------        ----------
   |resource|  |resource|        |resource|
   |--------|  |--------|        |--------|
   |resource|  |resource|        |resource|
   |--------|  |--------|        |--------|
   |  ...   |  |  ...   |        |  ...   |
   ----------  ----------        ----------
        ^          ^                  ^
        |          |                  |      (each "reshash" entry points
   ----------  ----------        ----------   to one "resource" in
   |reshash |  |reshash |        |reshash |   "restab"; entries with the
   ----------  ----------        ----------   same key are chained)
        ^          ^                  ^
        |          |                  |
   -----------------------------------------
   |  key 0  |  key 1  |   ...   |  key n  |
   -----------------------------------------
   "reskeytab"
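A name lookup through a "reskeytab"-style hash can be pictured roughly as
shown below.  This is a hypothetical, simplified sketch (invented types,
flat tables, invented hash function); the real tables are segmented and
the real hash lives in the OpenDLM source.

   /* Hypothetical sketch of a reskeytab-style lookup: hash the resource
    * name + type to a bucket, walk the chain of hash entries, and follow
    * the matching entry to its resource. */
   #include <stddef.h>
   #include <string.h>

   #define NBUCKETS 256

   struct resource {
           char name[32];
           int  type;                   /* UNIX or VMS */
           /* ... grant/convert/wait queues, LVB, master node, ... */
   };

   struct reshash {
           struct resource *rsrc;
           struct reshash  *next;
   };

   static struct reshash *reskeytab[NBUCKETS];

   static unsigned int res_hash(const char *name, size_t len, int type)
   {
           unsigned int h = (unsigned int)type;

           while (len--)
                   h = h * 31 + (unsigned char)*name++;
           return h % NBUCKETS;
   }

   static struct resource *res_find(const char *name, size_t len, int type)
   {
           unsigned int bucket = res_hash(name, len, type);

           for (struct reshash *e = reskeytab[bucket]; e != NULL; e = e->next)
                   if (e->rsrc->type == type &&
                       strncmp(e->rsrc->name, name, len) == 0 &&
                       e->rsrc->name[len] == '\0')
                           return e->rsrc;
           return NULL;
   }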
On the directory node of a "resource", the "resource" is described in a
different way.  It is represented by an "rl_info_t" data structure.  To
save space, "rl_info_t" is just a subset of "resource"; it is only used to
find out which node is the master of the "resource".  All "rl_info_t"
structures are stored in a hash table ("rldbtab"):

   "rldbtab"
   -----------------------------------------
   | slot 0  | slot 1  |   ...   | slot n  |
   -----------------------------------------
        |         |                   |
        V         V                   V
   -----------  -----------      -----------
   |rl_info_t|  |rl_info_t|      |rl_info_t|
   -----------  -----------      -----------
        |         |                   |
        V         V                   V
   -----------  -----------      -----------
   |rl_info_t|  |rl_info_t|      |rl_info_t|
   -----------  -----------      -----------
        |         |                   |
        V         V                   V
       ...         ...                ...

1.2 "lock"

When a client requests a lock for the first time, the name of the resource
is used to specify what the lock is requested on.  OpenDLM associates a
lock ID with the lock request, and returns the lock ID to the client.
This lock ID uniquely identifies the lock record ("reclock") used by the
lock manager to describe the lock request.  Following are all the fields
in "reclock":

   link    - List structure used to link the reclock into "resource"
   tq_node - Pointer to timeout queue structure
   route   - Callback information
   lck     - All information about the lock request
   dl      - Used for deadlock detection
   seg     - Lock table segment index
   set     - Timer set/reset

The "lck" field is a data structure named "lockinfo", defined in
include/clmstructs.h, line 467.  Following are important fields in
"lockinfo":

   mode      - Granted mode
   reqmode   - Requested mode
   bastmode  - Mode passed back in blocking AST
   site      - Site ID where request was made
   lk_flags  - Flags from application
   pid       - Process ID that requested lock
   remlockid - Remote lockid
   lockid    - Lock ID
   state     - Internal state flags
   rh        - Resource handle
   lkvalue   - Lock value block
   request   - Pointer to create request
   seqnum    - Sequence number on queue
   type      - Lock type
   xid       - Transaction ID

When a new lock request has been submitted, but the response message has
not yet been received, the request is linked into a per-client pending
queue (the "cli_queue" field of "clm_client_t").  After the lock request
is serviced, the lock request is linked into the "resource" structure
(grant/convert/wait queue.  Note:  the convert queue is not possible for a
new lock request).  Subsequent references to this lock are made by lock
ID, and not by resource name.  To improve performance, a hash table
("rlh_table") is used to locate the lock record for a given lock ID.
"resource" ------------------------------------------- | grant queue |convert queue| wait queue | ------------------------------------------- | | | V V V ----------- ----------- ----------- --->| reclock | | reclock |<-- | reclock |<----- | ----------- ----------- | ----------- | | | | | | | | V V | V | | ----------- ----------- | ----------- | | ->| reclock | | reclock |<-| | reclock |<-- | | | ----------- ----------- || ----------- | | | | | | || | | | | | V V || V | | | | . . || . | | | | . . || . | | | | . . || . | | | | . . || . | | | | . . || . | | | | . . || . | | | | || | | | | || | | | | || | | | | . . || . | | | | . . || . | | | | . . || . | | | | . . || . | | | | . . || . | | | | . . || . | | | | ^ ^ || ^ | | | | | | || | | | | | ----------- ----------- || ----------- | | | --| rl_hash | | rl_hash |-----| | rl_hash |--- | | ----------- ----------- | ----------- | | ^ ^ | ^ | | | | | | | | ----------- ----------- | ----------- | |---| rl_hash | | rl_hash |------ | rl_hash |------ ----------- ----------- ----------- ^ ^ ^ | | | ------------------------------------------- | key 0 | key 1 | ... | key n | ------------------------------------------- "rlh_table" Note, there are actual two lock ID for a lock request: lockid - Local lock ID, used to locate the lock record. remlockid - Remote lock ID, must be converted to local lock ID to locate the lock record. The convert map between local lock ID and remote lock ID is stored in "hashent" and eventually linked into a hash table ("idhash"). "idhash" ------------------------------------------- | key 0 | key 1 | ... | key n | ------------------------------------------- | | | V V V ----------- ----------- ----------- | hashent | | hashent | | hashent | ----------- ----------- ----------- | | | V V V ----------- ----------- ----------- | hashent | | hashent | | hashent | ----------- ----------- ----------- | | | V V V . . . . . . . . . . . . . . . . . . 1.3 "client" In OpenDLM, "client" is the system entity that requests "lock" on one or more "resource", and it can be a process, a transaction or a group of processes. A database of lock manager clients is maintained on every nodes. Each node tracks the clients locally and maintains a record of the sequence number contained in the most recent request received. In this way the lock manager can detect when a UDP packet has been dropped. On the "resource" master, all clients that reference this "resource" are recorded in this database. Again, only local clients are checked for dropped packets. However, every client in the database has associated with its entry a list of all reclock structures held by that client. This structure is traversed when performing global deadlock detection. "client" is represented by a data structure named "clm_client_t", defined in "include/clmstructs.h", line 790. Following is some important fields in "clm_client_t": cli_id - Client ID cli_site - Site of client cli_groupid - Group id of client if any cli_type - Client type: PROCESS, GROUP or TRANSACTION cli_sequence - Client sequence sync_seq - Sequence number of lock request cli_queue - Owned locks list cli_ast_pnd - List of pending ASTs cli_reply - Used to queue the UDP reply cli_reply_event - Wait queue for waitting UDP reply cli_state - State code & flags The "client" database is actually a sorted array ("cli_refs"). And it is protected by a spin lock ("cli_refs_lock"). "cli_refs" ------------------------------------------- | client 0 | client 1 | ... 
2. LVB operation

As described in the lock management section, the lock value block is
stored in both "resource" and "reclock".  It is manipulated based on
modes:

(1). If moving from a mode >= LKM_PWMODE to a lower mode, update the
     "resource" value block; otherwise update the lock's value block from
     "resource".

(2). If the "LKM_OLD_LVB" flag is cleared, a conversion to the same mode
     can set the LVB.

All of the above is implemented in "valueblock", defined in
"kernel/dlmdk/clm_clmlock.c", line 2748.  It is called when a lock/convert
request is granted, if "LKM_VALBLK" is specified.

2.1 "dlm_scnop"

Manipulates a System Commit Number (SCN), a specialized use of the lock
value block associated with a resource.  The SCN is a cluster-global
counter used to ??

The SCN is stored in the LVB associated with a lock resource.  An LVB is
an array of MAXLOCKVAL bytes (16 bytes as defined currently, but for
OpenGFS, we need to change it to 32 bytes).  The lock manager represents
an SCN value of up to 128 bits (this fills the 16-byte LVB size) by using
four end-to-end concatenated unsigned integers.  These integers are the
four fields contained in the "scn_t" structure, defined in
"include/clmstructs.h", line 871.  The following figure illustrates how
the "scn_t" structure overlays the bytes in the LVB:

         bytes
         -------------------------------------------------
   LVB   | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
         -------------------------------------------------

         unsigned integers
         -------------------------------------------------
   SCN   |    base    |   wrap1    |   wrap2   |  wrap3  |
         -------------------------------------------------

You define the range of the SCN counter via the "bit_len" parameter passed
to "dlm_scnop".  If you specify a "bit_len" value of 32 or less, the lock
manager uses only the "base" field of the structure.  If you specify a
"bit_len" value of 64 or less, the lock manager uses the "base" and
"wrap1" fields of the structure.  The integer value in the "base" field
overflows into the "wrap1" field.

The value of the SCN should be interpreted by concatenating the integer
fields, not by adding them.  For example, the following figure shows the
representation of an SCN value of 2^32:

         -------------------------------------------------
   SCN   |0000......0|000......01|    N/A    |   N/A   |
         -------------------------------------------------
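Putting the overlay and the "concatenate, don't add" rule together, the
idea can be expressed with the small sketch below.  It uses the layout
described above, with simplified types (plain uint32_t fields) and a
helper name invented for illustration.

   /* Sketch of the scn_t overlay on a 16-byte LVB, and of interpreting
    * the counter by concatenation rather than addition. */
   #include <stdint.h>
   #include <string.h>
   #include <stdio.h>
   #include <inttypes.h>

   #define MAXLOCKVAL 16                /* LVB size, per the text above */

   typedef struct {
           uint32_t base;
           uint32_t wrap1;
           uint32_t wrap2;
           uint32_t wrap3;
   } scn_t;

   /* interpret the low 64 bits (bit_len <= 64): wrap1 is the overflow of
    * base, so the value is wrap1:base concatenated */
   static uint64_t scn_to_u64(const scn_t *scn)
   {
           return ((uint64_t)scn->wrap1 << 32) | scn->base;
   }

   int main(void)
   {
           char  lvb[MAXLOCKVAL];
           scn_t scn = { .base = 0, .wrap1 = 1 };  /* the 2^32 example */

           /* the SCN structure simply overlays the LVB bytes */
           memcpy(lvb, &scn, sizeof(scn));

           printf("SCN = %" PRIu64 "\n", scn_to_u64(&scn));  /* 4294967296 */
           return 0;
   }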
"dlm_scnop" is implemented in "api/api_scn.c", line 64.  The SCN operation
("DLM_SCN_OP") is also one of the basic request types in OpenDLM, and it
is implemented in a way very similar to "dlmlock"/"dlmunlock".  It can
also be divided into two phases.

2.1.1 First phase

Just like "dlmunlock", a description of the SCN request is built and
committed in the first phase.  The request is also represented by a
transaction.  Following is the function calling graph for the first phase
of "dlm_scnop":

   dlm_scnop
     |
     |--> api_reqresp
            |
            |--> k_clm_send
            |      |
            |      V
            |    clm_send
            |      |
            |      V
            |    clm_process
            |      |
            |      V
            |    clm_request
            |      |
            |      V
            |    begin_scn_op
            |      |
            |      V
            |    pti_call
            |      |
            |      V
            |    dlm_pti_send
            |      |
            |      V
            |    cccp_send
            |
            |--> k_clm_recv
                   |
                   V
                 clm_recv

The implementation of "dlm_scnop" is very similar to that of "dlmlock",
hence only the differences will be discussed below.

The transaction that represents the SCN request is built in "dlm_scnop",
and then "api_reqresp" is called directly.  Before entering
"begin_scn_op", the callback function ("cont_scn_op") for starting the
operations of the second phase is specified in "clm_request".  Since an
"NL" mode lock must already be held on the "resource" in question, the
node ID of the "resource" master can be found in the local "resource"
copy, and the SCN request is sent to the master's "PORT_MSTR" port
directly.  Note that the UDP reply message is not sent out, hence the
client will block until the response from the master is received.

2.1.2 Second phase

When the lock resource master receives the transaction,
"pti_read_request" is called by "CCCP".  Following is the function calling
graph for servicing the SCN operation request (transaction):

   pti_read_request
     |
     V
   master_scn_op

According to the above graph, all SCN operations are done in
"master_scn_op".  After doing that, the result (also described by a
transaction data structure) is sent back to the "PORT_RESPONSE" port of
the sender of this SCN operation request.

"pti_read_response" is called when the response transaction is received by
the sender of the SCN operation request.  Following is the call graph:

   pti_read_response
     |
     |--> pti_proc_response
            |
            |--> match_request
            |
            |--> cont_scn_op
                   |
                   V
                 clm_reply

According to the above graph, "cont_scn_op" returns the result of the
request to the client by calling "clm_reply".  The client is unblocked at
this point and gets the result of the operation.

Deadlock Detection
------------------

While OpenDLM permits processes to synchronize their access to shared
resources, it cannot prevent them from creating deadlocks.  Deadlocks are
not a problem for OpenDLM itself, but a mechanism for detecting and
resolving deadlocks is built into OpenDLM to help applications that use
OpenDLM.

The deadlock detection module is implemented in
"kernel/dlmdk/clm_deadlock.c".  It appears to be based on a distributed
deadlock detection algorithm ("PATH-PUSHING").  In this algorithm, the
local TWFG (Transaction Wait-For Graph) is built at each node, and both
resource waits and communication waits are gathered.  Resources that are
not mastered locally, and non-local clients, are marked as external
vertices in the TWFG.  To detect deadlocks, the following actions are
taken on each node:

1. Wait for and receive deadlock related information (produced in the
   previous deadlock detection iteration) from other nodes.

2. Combine the received information with the local TWFG and search for
   circles.

3. Resolve local deadlocks by choosing a victim from each local circle
   that doesn't contain an external vertex.

4. Send information about each circle that contains an external vertex to
   the node on which the external resource is mastered, or on which the
   external client resides.

The deadlock detection cycle ends when a node finds no deadlocks and no
external vertices in its TWFG (including the local TWFG and any received
TWFG).  The details of this module are discussed in the following
sections.

1. Start deadlock detection

In OpenDLM, detecting deadlock is a periodic task.  It is executed every
"deadlock_timeout/2" seconds (default value:  30/2 seconds).  Since it is
based on a distributed deadlock detection algorithm, inter-node
coordination is necessary, and all of the communication among the nodes is
transferred via the "PORT_DEADLOCK" port of "CCCP".
In "clmdd_init", the communication infrastructure for detecting deadlock is initialized. "clmdd_incoming" is assigned to process all incoming message from "PORT_DEADLOCK". Following graph illustrates how the deadlock detecting starts every time: clm_master_loop clmdd_incoming | | |-> do_timer_stuff |MSG_STRT | | ----------------- | |-> clmdd_task | | | | | | | | V V V V V | | |-> clmdd_primo (1) (2) (3) (4) (5) | | | | | | | N V Y | | | ----------- | | | | | | | |<--| V | | | clmdd_startstart | | | | | | | |-> clmdd_suspect | | | | | | | | |<-------| | | | | | | | |-> clm_send | | | | | | | | |<-------| | | | | | | |<------------| | | | | |<-------| | | |<--------| | O According to the above graph, the inter-node coordination is a little bit complex when starting deadlock detecting. But it is necessary because if you have a deadlock condition it is possible that it will block several clients' requests at about the same time. If we kicked off a periodic deadlock detecting cycle on all nodes without some kind of coordination, this would start deadlock detection cycles on all of the affected nodes at roughly the same time. This method is used to synchronize the beginning of a cycle. The cycle begins on the lowest numbered site (determinated by "clmdd_primo"), triggered by the periodic task "clmdd_task". In "clmdd_startstart", the first node calls "clmdd_suspect" to checks its deadlock detection queue "dl_timeout" for suspicious requests. All blocked lock request are linked in "dl_timout" queue, and the criterion for suspicious request is that it is blocked more than "deadlock_timeout" seconds. If it finds one it sets the start message timestamp to the age of the oldest request, otherwise it sets "sdd->sdd_node" to NO_NODE. The start message is then sent to the next node in the cluster. The start message is represented by a structure named "start_msg_t". Following are all fields in "start_msg_t": sdd_cycle - Deadlock cycle number sdd_instigator - Instigator of this cycle sdd_node - Node of oldest waiting lock request sdd_tv - Age of oldest waiting lock request When a node receives the start message (type "MSG_STRT"), "clmdd_incoming" is called, and there are five possibilities for type "MSG_STRT": (1). The receiving node is the instigator and no suspicious request is found. (2). The receiving node is the instigator and the oldest suspicious request also resides on the receiving node. (3). The receiving node is the instigator and the oldest suspicious request doesn't reside on the receiving node. (4). The receiving node isn't the instigator and the oldest suspicious request resides on the receiving node. (5). The receiving node isn't the instigator and the oldest suspicious request doesn't reside on the receiving node. (Or not found) For condition (1), it is not necessary to start the deadlock detecting cycle, current cycle reach its end. For condition (2) and (4), start the deadlock detecting cycle by calling "clmdd_check". For condition (3), since the start message has traveled all nodes, the start message is forwarded to the node owns the oldest suspicious request. For condition (5), the receiving node checks its queue, and if its oldest request is older than the one in the start message it sets the timestamp and node ID to its own values. All this are done by calling "clmdd_dostart". Note that we need to use the *age* of requests, rather than the timestamp, due to the implementation of gettimeofday() in linux kernel. 2. 
2. Build the deadlock detection stack

The deadlock detection stack is the core data structure for the whole
deadlock detection cycle.  It is used for transferring the TWFG among
nodes.  It is represented as a queue of "ddx_t" structs; "ddx_t" is
defined as follows:

   typedef struct ddx_s {
           ddx_flavor_t ddx_type;
           union {
                   struct {
                           int             cli_site;
                           clm_client_id_t cli_id;
                           unsigned long   cli_rid;
                           int             cli_marked;
                   } un_cli;
                   struct {
                           int   rsc_namelen;
                           char *rsc_name;
                           short rsc_mode;
                           int   rsc_lockid;
                           int   rsc_locknode;
                   } un_rsc;
                   struct {
                           unsigned long blk_lockid;
                           int           blk_site;
                           unsigned long blk_age;
                   } un_blk;
           } ddx_un;
   } ddx_t;

"ddx_t" is used for describing the in-core wait-for graph.  A node in the
graph is a discriminated union; the permissible types are given in the
enum "ddx_flavor_t":

   INVALID  - Invalid entry, for bombproofing
   CLIENT   - Known client reference
   XCLIENT  - Unknown client reference
   RSRC     - Resource reference
   XRSRC    - Unknown resource reference
   BRANCH   - Beginning of a set
   RETURN   - End of a set
   SCRUBBED - Null entry, but not invalid

As mentioned before, the deadlock detection cycle starts by calling
"clmdd_check".  Its first task is to build the local TWFG:

   clmdd_check
     |
     |--> clmdd_analyze
     |      |
     |      |--> clmdd_pushr <--.
     |             |            |   (mutually recursive)
     |             `--> clmdd_pushc
     |
     |--> clmdd_processddx

In "clmdd_check", every suspicious request is checked by calling
"clmdd_analyze".  "clmdd_pushr" and "clmdd_pushc" are invoked recursively
to build the local TWFG:

   clmdd_pushr - Pushes a resource corresponding to a lock request onto
        the deadlock detection stack.  A resource will eventually expand
        into a list of clients holding locks on that resource (i.e., in
        the wait/convert queues) that are incompatible with the lock
        request passed in as an argument.  If the resource is mastered
        locally, the expansion is done here.  If the resource is remote,
        an "XRSRC" entry is pushed onto the deadlock detection stack.  In
        case of an error, the deadlock stack is freed.

   clmdd_pushc - Pushes a client onto the deadlock detection stack.
        Eventually, a client expands into a list of all its blocked lock
        requests.  If the client is local, it is marked and expanded.  If
        the client is remote, an "XCLIENT" entry is pushed onto the
        deadlock detection stack.  In case of an error, the deadlock stack
        is freed.

The following figure illustrates a deadlock detection stack (for
simplicity, all resources and clients are local):

              TWFG                          ddx_stack
                                            ----------
         C1 <---------.                     |   C1   |
          |           |                     |--------|
          V           |                     |   BR   |
        (R1)          |                     |--------|
          |           |                     |   C2   |
          V           |                     |--------|
          C2          |                     |   BR   |
          |           |                     |--------|
    .-----+-----.     |                     |   C3   |
    |           |     |                     |--------|
    V           V     |                     |   RT   |
  (R2)        (R3)----'                     |--------|
    |                                       |   BR   |
    V                                       |--------|
    C3                                      |   C1   |
                                            |--------|
                                            |   RT   |
                                            |--------|
                                            |   RT   |
                                            ----------

   C  - Client       R  - Resource
   BR - Branch       RT - Return (end of branch)

3. Deal with the "ddx" stack

After building the local TWFG, "clmdd_processddx" is called with the ddx
stack as its input.  As mentioned before, only circles without an external
vertex are checked, and broken if a deadlock exists.  For a circle that
contains an external vertex, the related information (part of the TWFG) is
sent to the proper node.  The deadlock detection cycle ends when a node
finds no deadlocks and no external vertices in its TWFG (including the
local TWFG and any received TWFG).
Following is the calling graph for the deadlock detection cycle:

   clmdd_processddx
     |
     V
   clmdd_walk
     |
     |--[N: no circle without external vertex]
     |     |
     |     V
     |   clmdd_massage
     |     |
     |     V
     |   clmdd_encode
     |     |
     |     V
     |   cccp_send (MSG_TWFG)
     |
     |--[Y: circle without external vertex found]
           |
           V
         shoot_me_now
           |
           |--[victim request on a remote node]---> cccp_send (MSG_PURG)
           |
           |--[victim request on the local node]--> clmdd_purgeclnt
                                                      (see below)

   clmdd_purgeclnt
     |
     |--[resource mastered on a remote node]--> cccp_send (MSG_BLOW)
     |
     |--[resource mastered locally]-----------> clmdd_purgersrc

   clmdd_incoming
     |
     |--[MSG_TWFG]--> clmdd_decode
     |                  |
     |                  V
     |                clmdd_processddx
     |
     |--[MSG_PURG]--> clmdd_purgeclnt  (handled as above)
     |
     |--[MSG_BLOW]--> clmdd_purgersrc

According to the above graph, "clmdd_walk" is called first, to detect any
circle without an external vertex.  If no such circle is found,
"clmdd_massage" is called to decide the destination of the "MSG_TWFG"
message.  The other effect of "clmdd_massage" is to scrub the "ddx" stack,
so that only the part of the TWFG that contains external edges is sent
out.  Then the "ddx" stack is encoded into a "MSG_TWFG" message by
invoking "clmdd_encode", and the "MSG_TWFG" message is sent to the
"PORT_DEADLOCK" port of the destination node.

If a circle without an external vertex is found by "clmdd_walk", a
deadlock condition has occurred.  Then "shoot_me_now" is called to find
the most recently blocked request in this circle.  If this request resides
on a remote node, a "MSG_PURG" message is sent to the "PORT_DEADLOCK" port
of that node.  If this request resides on the local node,
"clmdd_purgeclnt" is called to deny this request.  "clmdd_purgeclnt" needs
to find the requested resource first.  If this resource is mastered
locally, "clmdd_purgersrc" is called to deny the blocked request directly.
If this resource is mastered on a remote node, a "MSG_BLOW" message is
sent to the "PORT_DEADLOCK" port of the resource master.

On the other hand, "clmdd_incoming" is invoked when a message is received
from the "PORT_DEADLOCK" port.  Depending on the type of the message,
different functions are called to process the incoming information.

When a "MSG_TWFG" message is received, "clmdd_decode" rebuilds the "ddx"
stack based on the incoming message and local information.  Then
"clmdd_processddx" is called with the ddx stack as its input.  The
subsequent processing is exactly the same as when "clmdd_processddx" was
called on the first node.  The deadlock detection cycle ends when a node
finds no deadlocks and no external vertices in its TWFG (including the
local TWFG and any received TWFG).

When a "MSG_PURG" message is received, "clmdd_purgeclnt" is called to deny
the specified request.  The subsequent processing is exactly the same as
when "clmdd_purgeclnt" was called on the first node.

When a "MSG_BLOW" message is received, "clmdd_purgersrc" is called to deny
the specified request directly.

At this point, a complete deadlock detection cycle ends.

References
----------

1. Roy G. Davis.  1993.  "VAXcluster Principles".  Digital Press.

2. Kristin Thomas.  2001.  "Programming Locking Applications".