OpenDLM Locking
Copyright 2004 The OpenDLM Project
Authors:
Stanley Wang, stanley.wang@intel.com
Ben Cahill (bc), ben.m.cahill@intel.com
Introduction
------------
This document describes the implementation of the lock request/grant procedure
within OpenDLM, as well as the lock value block.
This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of lock management in OpenDLM.
This document is not intended as a user guide to OpenDLM. Look in the OpenDLM
WHATIS-odlm document for an overview of OpenDLM. Look in the OpenDLM
HOWTO-install for details of configuring and setting up OpenDLM.
This document may contain inaccurate statements, based on the authors' limited
understanding. Please contact the authors if you see anything wrong or
unclear.
Overview
--------
The basic working model of the DLM Locking API can be divided into two phases.
During the first phase, the API validates the arguments provided by the caller,
builds a description of the request, and sends this description to the local
lock manager (on *this* node). The local lock manager then forwards it to the
proper lock manager (which may be on another node). This (forwarded) request is
serviced during the second phase.
The DLM Locking API is implemented in the src/api/ directory, in four source files:
api_scn.c
api_setnotify.c
api_vms.c
ast_stub.c.
The OpenDLM Locking API can be used in either kernel or user space, and the
source code for the top level API is shared (although, of course, the kernel
API builds separately from the user-space API). The source code splits,
however, for the lower levels (just before committing the lock request
description to the local lock manager). At the level of k_clm_send() and below,
user-level code is implemented in src/api, and kernel-level code is implemented
in src/kernel/dlmdk.
The lock manager lives in kernel space, regardless of whether accessed via
the user- or kernel-space API. Source for the kernel level functionality
resides in the src/kernel/dlmdk directory.
The user space API accesses the lock manager via a character device (IOCTL).
The kernel space API calls directly into the lock manager. User- and
kernel-space functionality are almost the same, with one big exception:
only the user-space API supports synchronous (blocking) lock request calls.
There are nine DLM Locking Model-specific routines. They can be divided
into several groups:
1. Locking operation
dlmlock()
Makes an asynchronous request to acquire or convert a
lock on a lock resource.
dlmlock_sync()
Acquires or converts a lock on a lock resource and
obtains a synchronous return. (User space only)
dlmlockx()
Makes an asynchronous request to acquire or convert a
lock on a lock resource, and specifies a transaction
ID for that lock.
dlmlockx_sync()
Acquires or converts a lock on a lock resource,
specifies a transaction ID for that lock, and obtains a
synchronous return. (User space only)
dlmunlock()
Releases a lock on a lock resource or cancels a lock
request that is in blocked or converting state.
dlmunlock_sync()
Releases a lock on a lock resource or cancels a lock
request that is in blocked or converting statei, and
obtains a synchronous return. (User space only)
2. Asynchronous notify
ASTpoll()
Triggers the execution of pending AST routines. (User
space only)
dlm_setnotify()
Specifies which signal the lock manager should use to
notify your application of a pending AST. (User space
only)
3. LVB operation
dlm_scnop()
Manipulates the SCN, a specialized use of the lock
value block associated with a lock resource. (Discussed
in LVB section of this document.)
Locking Operation
-----------------
The following discussion focuses on the kernel space "VMS" API. The VMS API is
the one that was used on VAX computers, from which OpenDLM was derived; the
"UNIX" API, developed for the Unix environment, is not covered here.
1. dlmlock()
This function can be found in api_vms.c, line 451. It can be used for acquiring
a new lock, or converting a pre-existing lock, on a lock resource. As mentioned
before, it can be divided into two phases.
1.1 First phase
A description of the request is built and committed in the first phase. The
request is represented by a data structure named "transaction", defined in
include/clmstructs.h, line 636. This same structure is also used for responses
to lock requests. Important fields in "transaction":
clm_client_space - Client in user or kernel space
clm_prog - Desired service
clm_status - Status of request
clm_type - Request type, defined by "dlm_ops_t"
clm_direction - Indicate request or response, defined by
"ptype_t"
clm_locktype - Indicate UNIX or VMS lock
clm_sequence - Transaction sequence
pti_sequence - Transaction sequence (What does "PTI" stand
for?)
clm_route - Information necessary to complete a lock
request after it has been granted.
clm_pid - Process ID of client (0x7F7F7F7F for kernel)
clm_data - Data of the request. For DLM Lock operation,
it's a data structure named "lockreq"
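As a rough visual aid, the shape of such a request descriptor might look like
the sketch below. The field names follow the list above, but the types and the
embedded payload are simplified assumptions, not the actual definition in
include/clmstructs.h.

/*
 * Simplified sketch of a request/response "transaction" descriptor.
 * Field names follow the list above; types and the embedded payload
 * are illustrative assumptions, not the real clmstructs.h definition.
 */
struct example_transaction {
        int          clm_client_space;  /* client in user or kernel space    */
        int          clm_prog;          /* desired service                   */
        int          clm_status;        /* status of the request             */
        int          clm_type;          /* request type (dlm_ops_t)          */
        int          clm_direction;     /* request or response (ptype_t)     */
        int          clm_locktype;      /* UNIX or VMS lock                  */
        unsigned int clm_sequence;      /* transaction sequence              */
        unsigned int pti_sequence;      /* transaction sequence              */
        int          clm_pid;           /* process ID of client              */
        /* clm_route (callback information) omitted for brevity */
        char         clm_data[64];      /* request payload, e.g. a "lockreq" */
};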
Following is the function calling graph for the first phase of "dlmlock":
dlmlock
|
|------> clm_lock_reqresp
| |
| |-----> api_reqresp
| | |
| | |---------> k_clm_send
| | | |
| | | V
| | | clm_send
| | | |
| | | V
| | | clm_process
| | | |
| | | V
| | | clm_request
| | | |
| | | V
| | | lockswitch.begin_lock
| | | (begin_lock_clm)
| | | |
| | | |-------------> sendudp
| | | | |
| | | |<-----------------|
| | | |
| | | V
| | | pti_call
| | | |
| | | V
| | | dlm_pti_send
| | | |
| | | V
| | | cccp_send
| | | |
| | |<--------------|
| | |
| | |---------> k_clm_recv
| | | |
| | | V
| | | clm_recv
| | | |
| | |<--------------|
| | |
| |<----------|
| |
|<-----------|
|
O
According to the graph above, "clm_lock_reqresp" and "api_reqresp" build a
description of the request, in the form of a transaction structure.
"api_reqresp" is the lowest level call in the shared user/kernel source.
For the kernel implementation, it calls "k_clm_send" to direct the transaction
toward the proper lock manager, as described below.
"k_clm_send" grabs a spinlock before calling "clm_send".
"clm_send" makes a copy (why?) of the transaction before calling "clm_process".
"clm_process" does a few sanity checks, then calls "clm_request".
"clm_request" fills in a few fields of the transaction structure, and selects
two functions based on the type of lock request (DLM_LOCK) and the type of
lock (VMS vs. UNIX API). One function is the next function to call within
*this* node ("begin_lock_clm"), and the other is the function for starting
the operation of the second phase ("cont_lock_clm"). This will be called
from the port receiver on this or a remote node, depending on where the
transaction will be sent.
The selection of both functions uses the "lockswitch" table to make the
choice. This table contains calls, much like a Linux "ops" structure, that
differentiate between the VMS API and UNIX API functionality. The table is
defined in src/include/clm_locksw.h and src/kernel/dlmdk/clm_locksw.c.
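Conceptually, the lockswitch table is a small "ops"-style structure with one
entry per API flavor. A minimal sketch is shown below; the member names and
signatures are illustrative assumptions, not the actual contents of
clm_locksw.h.

/*
 * Minimal sketch of an "ops"-style lockswitch table, one entry per
 * API flavor (VMS vs. UNIX).  Member names and signatures are
 * assumptions for illustration, not the clm_locksw.h definitions.
 */
struct example_lockswitch {
        int (*begin_lock)(void *trans);     /* phase 1 on the requesting node   */
        int (*master_lock)(void *trans);    /* processing on the lock master    */
        int (*cont_lock)(void *trans);      /* phase 2, when the response lands */
        int (*begin_unlock)(void *trans);
        int (*master_unlock)(void *trans);
        int (*cont_unlock)(void *trans);
};

/* One table entry per lock type, e.g. lockswitch[vms].begin_lock(trans). */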
"begin_lock_clm" performs a number of sanity checks, and fills in more fields
of the transaction structure before sending it to the "master" layer. It
differentiates between lock creation (new lock) and lock conversion (changing
the state of an existing lock). Accordingly, some of the processing is
handled by either the "begin_clm_lock_new" or the "begin_clm_lock_convert"
function.
At this point, the transaction is ready to be sent to the lock master.
Before doing this, "begin_lock_clm" calls "sendudp" to send a reply
message (also a transaction) to the client that initiated this lock request.
This reply gives status for phase 1 of the request processing. For clients
in the kernel, remember, synchronous (blocking) requests are not supported.
This reply will provide the basis for the return from the API call, providing
information on the validity of the request. The actual request results will
be provided to the client later, after phase 2.
"sendudp" doesn't really send a UDP message between nodes; the client is
on *this* node. It simply attaches the response transaction to a member of
the client's "client" structure (defined in src/include/clmstructs.h),
cp->cli_reply. This is a queue of reply transactions. Later, "k_clm_recv"
will look through this queue for the reply that matches the "clm_sequence"
field of the original request transaction.
Now, the request transaction needs to be sent to the lock resource master.
So, which node is the lock resource master for this lock request?
If the lock request is a "convert" request, the node ID of the lock resource
master can be found in the pre-existing lock record (struct reclock, defined in
src/include/clmstructs.h) of the in-question lock, and this request transaction
is sent to the lock resource master node's "PORT_MSTR" port via the "CCCP"
transport.
On the other hand, if the lock request is a "lock" request, there is no
pre-existing reclock. We need to obtain the node ID of the lock resource
master, either via the local (on this node) directory table, or via the
"directory node", as follows, searching by the name (and type: UNIX or VMS)
of the lockable resource:
-- Search local directory table via "rl_lookup()" (called from "pti_call()").
If either an active directory entry (i.e. *this* node is the directory
node), or a cache directory entry (i.e. *this* node has a copy of the
directory node's info) are found, we obtain the master node ID locally,
and send the request directly to the lock resource master node's
"PORT_MSTR" port via "CCCP", just as with a "convert" request.
-- If no local directory entry exists, calculate the directory node via
"clm_direct()" (called from "dlm_pti_send()"), which applies a simple hash
function to the resource name to determine the directory node. Send the
transaction to the directory node's "PORT_DIRECT" port via "CCCP". If the
directory node and lock resource master are the same node, it will deal with
this transaction. If not, the directory node will forward this transaction
to the appropriate lock resource master node.
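Because the directory node is a pure function of the resource name, every node
computes the same answer without any communication. The sketch below shows the
general idea with a made-up hash; the actual hash used by "clm_direct()" may
differ.

/*
 * Illustration of directory-node selection: hash the resource name and
 * map it onto the set of cluster nodes.  The hash below is made up;
 * the real clm_direct() hash may differ.  Node IDs are assumed to be
 * 0 .. num_nodes-1.
 */
static int example_directory_node(const char *name, int namelen, int num_nodes)
{
        unsigned int hash = 0;
        int i;

        for (i = 0; i < namelen; i++)
                hash = hash * 31 + (unsigned char)name[i];

        return (int)(hash % (unsigned int)num_nodes);
}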
After sending the lock request, execution eventually returns to "api_reqresp".
"k_clm_recv" is called at this point to retrieve the status of the phase 1
operations. The return status is also represented as a transaction data
structure. "clm_recv" gets the status from a linked list; as mentioned before,
it was linked onto the list by "sendudp".
Caveat: There is a textbook race condition between "clm_recv" and "sendudp".
An ugly work-around is used to resolve this problem; more detailed information
can be found in "src/kernel/dlmdk/dlm_kernel.c", line 368. For the current
execution chain, however, the reply gets put on the client structure's queue
before "clm_recv" gets called, so this race condition should not be an issue
in this case.
At last, the status of the first phase is returned to dlmlock. First phase
finished!
1.2 Second phase
The second phase starts when the directory node or the lock resource master
receives the transaction. At this point, a callback function is called by the
"CCCP" transport layer. This is *not* the callback function, specified in the
request transaction, that continues the work of the request. Instead, it is a
port handler which parses the request and sends it on its way to the request
callback. Following is a list of all port handling callbacks, some of which
are not related to locking, per se:
PORT CALLBACK
PORT_MSTR pti_read_request clm_pti.c
PORT_DIRECT dir_read_request clm_direct.c
PORT_RESPONSE pti_read_response clm_pti.c
PORT_DIRUPDT dir_read_update clm_direct.c
PORT_MSTR_CR pti_read_request clm_pti.c
PORT_BOUNCE dir_bounce_request clm_direct.c
PORT_GROUP grp_proc_request clm_group.c
PORT_NACK clmll_read_nack_response clm_locallock.c
PORT_SYNC clm_inband_sync clm_msg.c
PORT_SYNCRET clm_inband_sync_resp clm_msg.c
This list is defined in "cb_tab", in "kernel/dlmdk/clm_main.c", line 143.
In addition, the following are set up by calls to cccp_listen() in other
places in the code base:
PORT CALLBACK
PORT_DEADLOCK clmdd_incoming clm_deadlock.c
PORT_MIGR clmm_incoming clm_migrate.c
PORT_RECOVERY clmr_message_handler clm_migrate.c
As mentioned before, the transaction may be sent to the directory node or the
lock resource master. If it is the directory node that receives the
transaction, the "dir_read_request" port handler is called by "CCCP", and
"dir_proc_request" is then called to do all the hard work.
If the corresponding lock record is found within the directory node, the
destination of the transaction is modified to be the lock resource master
node's "PORT_MSTR" port. If no lock record is found, the directory node
creates a new directory entry for the resource, and modifies the transaction's
destination to be the "PORT_MSTR_CR" ("CR" means "create" a new lock mastership)
of the originator of this transaction (the first node to request a lock on a
given resource becomes the master for that resource). The "CCCP" transport
layer forwards the transaction to the new destination when returning from
"dir_read_request".
All of the following operations occur on the master node, and are the same
regardless of whether the original request transaction was sent via the
directory node, or directly to the master node (already known to the
originator).
If it is the lock resource master that receives the transaction, "CCCP" calls
the PORT_MSTR handler, "pti_read_request". Following is the function calling
graph for servicing the lock request (transaction):
pti_read_request
|
V
lockswitch.master_lock
(master_lock_clm)
|
lock V convert
-----------------------------------
| |
V V
clm_lock clm_convert
| |
grant V block V
------------- convert_lock_clm
| | |
V V grant V block
grant_lock blocked_lock -------------
| |
V V
grant_convert blocked_convert
According to the previous graph, the receiving port is checked in
"pti_read_request". If this transaction is received from the "PORT_MSTR_CR"
port, the "DLM_RES_CREATE" flag is specified to distinguish it from a lock
request for an existing lock resource. After processing this lock request, the
result (also described by a transaction data structure) is sent back to the
"PORT_RESPONSE" port of the sender of this lock request.
"pti_read_response" is the port handler called when the respond transaction is
received by the sender of the lock request. Following is the call graph:
pti_read_response
|
|----> pti_proc_response
| |
| |-----------------> match_request
| | |
| |<------------------------|
| |
| |-----------------> lockswitch.cont_lock
| | (cont_lock_clm)
| | |
| | lock | convert
| | -------------|-------------
| | | | |
| | V | V
| | cont_lock_clm_new | cont_lock_clm_convert
| | | | |
| | grant V block | grant V block
| | ------------- | -------------
| | | | | | |
| | V V | V V
| |grant_lock blocked_lock | grant_convert blocked_convert
| | | | | | |
| | | |----->|<-----| |
| | |----------------->|<-----------------|
| | |
| | |---> clm_reply
| | | |
| | | |----> dlm_reply_guts
| | | | |
| | | | |-->sendast
| | | | | |
| | | | |<-----|
| | | | |
| | | |<---------|
| | | |
| | |<--------|
| | |
| |<------------------------|
| |
|<----------|
|
O
After receiving the response, the first thing to do is to find the transaction
corresponding to the original lock request. This is done in "match_request",
using the "clm_sender" field of the transaction data structure. The callback
function ("cont_lock_clm") that continues the lock request is then retrieved.
The local lock record is updated by "cont_lock_clm_new" or
"cont_lock_clm_convert", based on the response transaction. After that,
"sendast" is called to deliver the AST to the client. For a user space client,
the AST is added to a per-client linked list, and if a notify signal number has
been specified, a signal is also sent to the client. For a kernel space client,
the AST is delivered directly by using "kernel_thread". Once the AST is
delivered, the lock request is complete.
2. dlmlock_sync()
This function can be found in api_vms.c, line 540. It can be used for acquiring
or converting a lock on a lock resource in a synchronous fashion. It is
implemented on top of "dlmlock": it simply calls "dlmlock" with the "LKM_BLOCK"
flag.
How does "LKM_BLOCK" work? All tricks are in "sendudp". As mentioned before,
"sendudp" is called in the first phase of the "dlmlock" and a reply message is
sent back to the client. If "LKM_BLOCK" is specified, "sendudp" does not send
the reply message and just skip it. Hence the client is blocked when it try to
retrieve the reply message. At the second phase, if "sendast" detects
"LKM_BLOCK" flag, it clears "LKM_BLOCK" and calls "sendudp" to send reply
message to the client. And lock request continues because it gets the reply
message.
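The sketch below restates that trick in a self-contained form: the phase 1
reply is suppressed while the block flag is set, and is only delivered once the
AST fires in phase 2. The names are placeholders; this is not the actual
sendudp()/sendast() code.

/*
 * Self-contained sketch of the LKM_BLOCK idea.  Names and types are
 * placeholders, not the OpenDLM implementation.
 */
#define EX_LKM_BLOCK 0x1

struct ex_request {
        int flags;
        int reply_ready;        /* stands in for the cp->cli_reply queue */
};

static void ex_sendudp(struct ex_request *r)
{
        if (r->flags & EX_LKM_BLOCK)
                return;          /* skip the reply: the caller stays blocked */
        r->reply_ready = 1;      /* queue the reply: the caller can proceed  */
}

static void ex_sendast(struct ex_request *r)
{
        if (r->flags & EX_LKM_BLOCK) {
                r->flags &= ~EX_LKM_BLOCK;   /* clear the flag ...             */
                ex_sendudp(r);               /* ... and send the delayed reply */
        }
}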
3. dlmlockx()
This function can be found in api_vms.c, line 317. It can be used for acquiring
or converting a lock on a lock resource, and specifies a transaction ID for
that lock.
It is almost the same as "dlmlock"; the only difference is that it provides a
transaction ID to the lock request when calling "clm_lock_reqresp". The
transaction ID is then used as the client ID, instead of the PID that
"dlmlock" uses.
4. dlmlockx_sync()
This function can be found in api_vms.c, line 405. It can be used for acquiring
or converting a lock on a lock resource in a synchronous fashion, and specifies
a transaction ID for that lock. It is implemented on top of "dlmlockx": it
simply calls "dlmlockx" with the "LKM_BLOCK" flag.
5. dlmunlock()
This function can be found in api_vms.c, line 651. It can be used for releasing
a lock on a lock resource or canceling a lock request that is in the blocked or
converting state. As with "dlmlock", it can be divided into two phases.
5.1 First phase
Just like "dlmlock", a description of the unlock request is built and committed
in the first phase. The request is also represented by a transaction. Following
is the function calling graph for the first phase of "dlmunlock":
dlmunlock
|
|-----> api_reqresp
| |
| |---------> k_clm_send
| | |
| | V
| | clm_send
| | |
| | V
| | clm_process
| | |
| | V
| | clm_request
| | |
| | V
| | lockswitch.begin_unlock
| | (begin_unlock_clm)
| | |
| | |------> pre_proc_unlock_clm
| | | |
| | |<-----------------|
| | |
| | -------------------------
| | | | | | |
| | V V V V V
| | (1) (2) (3) (4) (5)
| | | | | | |
| | V V V V V
| | -------------------------
| | |
| |<--------------|
| |
| |---------> k_clm_recv
| | |
| | V
| | clm_recv
| | |
| |<--------------|
| |
|<-------|
|
O
The implementation of "dlmunlock" is very similar to "dlmlock", hence only the
difference point will be discussed in following part.
The transaction that represents the unlock request is built in "dlmunlock",
and then "api_reqresp" is called directly. Before entering "begin_unlock_clm",
the callback function ("cont_unlock_clm") for starting the operations of the
second phase is specified in "clm_request".
In "begin_unlock_clm", "pre_proc_unlock_clm" is called to see if the unlock
request can be processed completed on this node or at least before it gets
passed to the master. This routine handles the initial processing of an unlock
request before it is forwarded to the distributed layer. If the lock resouce is
mastered locally, this optimization yields a significant performance
improvement. Depending on the outcomes of "pre_proc_unlock_clm", there are five
possible results and corresponding operations:
(1). An error is detected locally. The unlock request is aborted. The
     reply message and the unlock AST are sent to the client. No
     interaction with the transport layer is needed. This work is done by
     calling "clm_reply".
(2). The lock resource is mastered locally, and processing is complete.
     The reply message and the unlock AST are sent to the client. No
     interaction with the transport layer is needed. This work is done by
     calling "send_unlck_ast" and "clm_reply" sequentially.
(3). The lock resource is mastered remotely, and the local unlock is
     complete. The unlock request is forwarded to the lock resource
     master. The reply message and the unlock AST are sent to the client.
     This work is done by calling "send_unlck_ast", "pti_call" and
     "clm_reply" sequentially.
(4). The lock resource is mastered remotely. The lock handle is found,
     but a local unlock is not possible because of the local lock state.
     The unlock request is forwarded to the lock resource master. The
     reply message is sent to the client, and the unlock AST is delivered
     when the response comes back. This work is done by calling "sendudp"
     and "pti_call" sequentially.
(5). The lock resource is mastered remotely. The lock handle is not
     found, and a local unlock is not possible. (This condition occurs
     when trying to cancel an in-flight lock request.) The reply message
     is sent to the client. The unlock request is added to a pending
     queue in the lock record, and will be forwarded to the lock resource
     master when the response transaction for the original lock request
     is received. This work is done by calling "sendudp" and
     "hold_cancel" sequentially. (The held unlock request is resent by
     calling "submit_cancels" from "cont_lock_clm" in the second phase of
     "dlmlock".)
After this point, all operations are the same as for "dlmlock".
5.2 Second phase
Just like "dlmlock", the transaction also may be sent to the directory node or
the lock resource master. If it is the directory node that receive the
transaction, "dir_read_request" is called by "CCCP". And then
"dir_proc_request" is called to do all hard works.
If the corresponding lock record is found, the destination of the transaction
is modified to the lock resource master's "PORT_MSTR" port. The "CCCP"
transport layer forwards the modified transaction to the new destination when
returning from "dir_read_request". The subsequent processing is the same as
when the lock resource master receives the transaction directly.
If no lock record is found, only a "DLM_VOID" response transaction is built and
sent to the "PORT_RESPONSE" port of the sender of this transaction. This
condition occurs when trying to unlock a lock that was just purged.
If it is the lock resource master that receives the transaction,
"pti_read_request" is called by "CCCP". Following is the function calling
graph for servicing the unlock request (transaction):
pti_read_request
|
V
lockswitch.master_unlock
(master_unlock_clm)
|
V
convert_locks
|
V
send_basts_again
According to the previous graph, all unlock/cancel work is done in
"master_unlock_clm". After that, "convert_locks" is called to see if there is
any lock in the "converting" or "waiting" state and to try to grant it. This
routine also sends blocking ASTs to the corresponding clients by calling
"send_basts_again". Finally, the result (also described by a transaction data
structure) is sent back to the "PORT_RESPONSE" port of the sender of this
unlock request.
"pti_read_response" is called when the respond transaction is received by the
sender of the unlock request. Following is the call graph:
pti_read_response
|
|----> pti_proc_response
| |
| |-----------------> match_request
| | |
| |<------------------------|
| |
| |-----------------> lockswitch.cont_unlock
| | (cont_unlock_clm)
| | |
| | V
| | -------------
| | | | |
| | V V V
| | (1) (2) (3)
| | | | |
| | V V V
| | -------------
| | |
| |<------------------------|
| |
|<----------|
|
O
According to the previous graph, there are three possible operations in
"cont_unlock_clm", depending on the response transaction:
(1). The transaction is a response to a timeout request. Processing
     still has to happen on the local node. Call the master unlock to
     send the timeout AST, but no client response for the cancel is
     required, because a client didn't send it - the timeout subsystem
     did. Finally, free the request. This work is done by calling
     "local_unlock_clm" and "clm_free_trans" sequentially.
(2). Corresponding to condition (4) in the first phase, processing still
     has to happen on the local node. Call the local unlock to handle the
     response, then send the reply to the client. This work is done by
     calling "local_unlock_clm" and "clm_reply" sequentially.
(3). Corresponding to conditions (1), (2) and (3) in the first phase,
     after reading a master's response to an unlock request that has
     already been handled locally, toss it. This work is done by calling
     "local_unlock_clm" and "clm_free_trans" sequentially.
NOTE: Condition (5) in the first phase is eventually converted into one of
the other four conditions.
After these operations, the whole unlock request completes.
6. dlmunlock_sync()
This function can be found in api_vms.c, line 573. It can be used for releasing
a lock on a lock resource, or canceling a lock request that is in the blocked
or converting state, in a synchronous fashion. It is almost the same as
"dlmunlock". The most important difference between these two functions is that
"RESP_AST" is not specified in "dlmunlock_sync". Without this flag, the reply
message is not sent in conditions (4) and (5) of the first phase, and is sent
in condition (2) of the second phase. Hence "dlmunlock_sync" blocks until all
unlock work is done.
Asynchronous notify
-------------------
1. ASTpoll()
This function can be found in api/ast_stub.c, line 59. It can be used for
triggering the execution of pending AST routines.
As mentioned before, the ASTs are placed in a per-client linked list. "ASTpoll"
accesses the lock manager via a character device (IOCTL), retrieves the pending
ASTs and executes them.
2. dlm_setnotify()
This function can be found in api/api_setnotify.c, line 77. It can be used for
specifying which signal the lock manager should use to notify your application
of a pending AST.
This routine updates a global variable named "notify_signo". The value of
"notify_signo" is copied into the transaction data structure when "dlmlock" is
called, and the specified signal is sent to the client when the lock manager
adds an AST to the client's pending list.
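Putting the two calls together, a user-space client typically installs a
signal handler, registers the signal with dlm_setnotify(), and calls ASTpoll()
whenever the signal arrives. The sketch below shows only that pattern; the
real prototypes of the two calls are in the OpenDLM headers, and the ex_*
names are placeholders.

#include <signal.h>

/*
 * Sketch of the user-space AST delivery pattern.  The handler only
 * sets a flag; the pending ASTs are run from the main loop, which is
 * where the application would call ASTpoll().
 */
static volatile sig_atomic_t ex_ast_pending;

static void ex_on_notify(int signo)
{
        (void)signo;
        ex_ast_pending = 1;      /* defer the real work out of the handler */
}

static void ex_event_loop(void)
{
        signal(SIGUSR1, ex_on_notify);  /* the signal chosen via dlm_setnotify() */

        for (;;) {
                /* ... application work ... */
                if (ex_ast_pending) {
                        ex_ast_pending = 0;
                        /* call ASTpoll() here to execute the queued ASTs */
                }
        }
}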
Lock management and LVB operation
---------------------------------
1. Lock record management
In order to manage all the locks, OpenDLM defines several data structures to
represent "resource", "lock" and "client".
1.1 "resource"
A "resource" can be any entity accessible to the operating system, or to one
or more process. "resource" is represented by a data structure named
"resource", defined in include/clmstructs.h, line 1279. "resource" is created
when it is requested first time and is deleted when it is not referenced by
any client. Following are the most important fields in this structure:
refcount - Reference count
rh - Resource handle
type - Resource type
name - Dynamic resource name
namelen - Length of resource name
master - Site id of master node
lockqs - Grant, convert, wait lists for lock requests
value - Lock value block
rsrc_flags - State bits
The resource handle (rh) is the key field for managing all "resource"
structures. OpenDLM stores all "resource" structures in a resource table
("restab"), and the handle contains the index into the resource table used to
locate the resource. Note that the handle is identical for a given "resource",
even if it has multiple copies on different nodes. Hence, in the event of a
node failure, the resource table is easily reconstructed from the elements
known by each of the other nodes. The resource handle is kept with a lock
object (reclock, discussed in the next section) and used for fast access to
the resource entry in "restab". It is sometimes necessary to search the
resource table (only on the master) when creating a new resource. To this end,
a second table is maintained, "reskeytab". This is a hash table of pointers
into "restab". It provides the quick search capability used for ADD, FIND,
and DELETE operations.
"restab"
-------------------------------------------
| seg 0 | seg 1 | ... | seg n |
-------------------------------------------
| | |
V V V
----------- ----------- -----------
--->|resource | |resource | |resource |<-----
| |---------| |---------| |---------| |
| |resource | |resource |<- |resource |<-- |
| |---------| |---------| | |---------| | |
| | . | | . | | | . | | |
| | . | | . | | | . | | |
| | . | | . | | | . | | |
| | . | | . | | | . | | |
| | . | | . | | | . | | |
| | . | | . | | | . | | |
| |---------| |---------| | |---------| | |
| ->|resource | |resource | | -->|resource | | |
| | ----------- ----------- | | ----------- | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | . . | | . | |
| | . . | | . | |
| | . . | | . | |
| | . . | | . | |
| | . . | | . | |
| | . . | | . | |
| | ^ ^ | | ^ | |
| | | | | | | | |
| | ----------- ----------- | | ----------- | |
| --| reshash | | reshash |-- | | reshash |--- |
| ----------- ----------- | ----------- |
| ^ ^ | ^ |
| | | | | |
| ----------- ----------- | ----------- |
|---| reshash | | reshash |----- | reshash |------
----------- ----------- -----------
^ ^ ^
| | |
-------------------------------------------
| key 0 | key 1 | ... | key n |
-------------------------------------------
"reskeytab"
On the directory node of a "resource", the "resource" is described in a
different way: it is represented by an "rl_info_t" data structure. To save
space, "rl_info_t" is just a subset of "resource"; it is only used to find out
which node is the master of the "resource". All "rl_info_t" structures are
stored in a hash table ("rldbtab").
"rldbtab"
-------------------------------------------
| slot 0 | slot 1 | ... | slot n |
-------------------------------------------
| | |
V V V
----------- ----------- -----------
|rl_info_t| |rl_info_t| |rl_info_t|
----------- ----------- -----------
| | |
V V V
----------- ----------- -----------
|rl_info_t| |rl_info_t| |rl_info_t|
----------- ----------- -----------
| | |
V V V
. . .
. . .
. . .
. . .
. . .
. . .
1.2 "lock"
When a client requests a lock for the first time, the name of the resource is
used to specify which resource the lock is requested on. OpenDLM associates a
lock ID with the lock request, and returns the lock ID to the client. This lock
ID uniquely identifies the lock record ("reclock") used by the lock manager to
describe the lock request. Following are all the fields in "reclock":
link - List structure used to link reclock into "resource".
tq_node - Pointer to timeout queue structure.
route - Callback information.
lck - All information about the lock request.
dl - Used for dead lock detecting
seg - Lock table segment index
set - Timer set/reset
"lck" field is a data structure named "lockinfo", defined in
include/clmstructs.h, line 467. Following are important fields in "lockinfo":
mode - Granted mode
reqmode - Requested mode
bastmode - Mode passed back in blocking AST
site - Site ID where request was made
lk_flags - Flags from application
pid - Process ID that requested lock
remlockid - Remote lockid
lockid - Lock ID
state - Internal state flags
rh - Resource handle
lkvalue - Lock value block
request - Pointer to create request
seqnum - Sequence number on queue
type - Lock type
xid - Transaction ID
When a new lock request is submitted, and before the response message is
received, the request is linked into a per-client pending queue (the
"cli_queue" field of "clm_client_t"). After the lock request is serviced, it is
linked into the "resource" structure (the grant/convert/wait queues; note that
the convert queue is not possible for a new lock request). Subsequent
references to this lock are made by lock ID, not by resource name. In order to
improve performance, a hash table ("rlh_table") is used to locate the lock
record for a given lock ID.
"resource"
-------------------------------------------
| grant queue |convert queue| wait queue |
-------------------------------------------
| | |
V V V
----------- ----------- -----------
--->| reclock | | reclock |<-- | reclock |<-----
| ----------- ----------- | ----------- |
| | | | | |
| V V | V |
| ----------- ----------- | ----------- |
| ->| reclock | | reclock |<-| | reclock |<-- |
| | ----------- ----------- || ----------- | |
| | | | || | | |
| | V V || V | |
| | . . || . | |
| | . . || . | |
| | . . || . | |
| | . . || . | |
| | . . || . | |
| | . . || . | |
| | || | |
| | || | |
| | || | |
| | . . || . | |
| | . . || . | |
| | . . || . | |
| | . . || . | |
| | . . || . | |
| | . . || . | |
| | ^ ^ || ^ | |
| | | | || | | |
| | ----------- ----------- || ----------- | |
| --| rl_hash | | rl_hash |-----| | rl_hash |--- |
| ----------- ----------- | ----------- |
| ^ ^ | ^ |
| | | | | |
| ----------- ----------- | ----------- |
|---| rl_hash | | rl_hash |------ | rl_hash |------
----------- ----------- -----------
^ ^ ^
| | |
-------------------------------------------
| key 0 | key 1 | ... | key n |
-------------------------------------------
"rlh_table"
Note that there are actually two lock IDs for a lock request:
lockid - Local lock ID, used to locate the lock record.
remlockid - Remote lock ID, must be converted to local lock ID
to locate the lock record.
The mapping between a local lock ID and a remote lock ID is stored in a
"hashent" structure, which is eventually linked into a hash table ("idhash").
"idhash"
-------------------------------------------
| key 0 | key 1 | ... | key n |
-------------------------------------------
| | |
V V V
----------- ----------- -----------
| hashent | | hashent | | hashent |
----------- ----------- -----------
| | |
V V V
----------- ----------- -----------
| hashent | | hashent | | hashent |
----------- ----------- -----------
| | |
V V V
. . .
. . .
. . .
. . .
. . .
. . .
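The remote-to-local translation implied by "idhash" can be pictured as a small
chained hash table keyed by the remote lock ID, as in the sketch below. The
hashent layout and table size here are assumptions, not the actual OpenDLM
definitions.

#include <stddef.h>

/* Sketch of remote-to-local lock ID translation via an "idhash"-style
 * table.  Layout and size are illustrative assumptions. */
struct ex_hashent {
        unsigned long      remlockid;   /* remote lock ID (the key)  */
        unsigned long      lockid;      /* corresponding local ID    */
        struct ex_hashent *next;        /* chain within one bucket   */
};

#define EX_IDHASH_SIZE 1024

static struct ex_hashent *ex_idhash[EX_IDHASH_SIZE];

static unsigned long ex_local_id(unsigned long remlockid)
{
        struct ex_hashent *h;

        for (h = ex_idhash[remlockid % EX_IDHASH_SIZE]; h != NULL; h = h->next)
                if (h->remlockid == remlockid)
                        return h->lockid;
        return 0;       /* 0 is used here to mean "no mapping found" */
}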
1.3 "client"
In OpenDLM, "client" is the system entity that requests "lock" on one or more
"resource", and it can be a process, a transaction or a group of processes. A
database of lock manager clients is maintained on every nodes. Each node
tracks the clients locally and maintains a record of the sequence number
contained in the most recent request received. In this way the lock manager
can detect when a UDP packet has been dropped. On the "resource" master, all
clients that reference this "resource" are recorded in this database. Again,
only local clients are checked for dropped packets. However, every client in
the database has associated with its entry a list of all reclock structures
held by that client. This structure is traversed when performing global
deadlock detection.
"client" is represented by a data structure named "clm_client_t", defined in
"include/clmstructs.h", line 790. Following is some important fields in
"clm_client_t":
cli_id - Client ID
cli_site - Site of client
cli_groupid - Group id of client if any
cli_type - Client type: PROCESS, GROUP or TRANSACTION
cli_sequence - Client sequence
sync_seq - Sequence number of lock request
cli_queue - Owned locks list
cli_ast_pnd - List of pending ASTs
cli_reply - Used to queue the UDP reply
cli_reply_event - Wait queue for waitting UDP reply
cli_state - State code & flags
The "client" database is actually a sorted array ("cli_refs"). And it is
protected by a spin lock ("cli_refs_lock").
"cli_refs"
-------------------------------------------
| client 0 | client 1 | ... | client n |
-------------------------------------------
| | |
V V V
-------------- -------------- --------------
|clm_client_t| |clm_client_t| |clm_client_t|
-------------- -------------- --------------
2. LVB operation
As described in the lock management section, the lock value block is stored in
both "resource" and "reclock". It is manipulated based on lock modes:
(1). If moving from a mode >= LKM_PWMODE to a lower mode, update the
     "resource" value block; otherwise update the lock's value block from
     the "resource".
(2). If the "LKM_OLD_LVB" flag is cleared, a conversion to the same mode
     can also set the LVB.
All of the above is implemented in "valueblock", defined in
"kernel/dlmdk/clm_clmlock.c", line 2748. It is called when a lock/convert
request is granted, if "LKM_VALBLK" is specified.
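The sketch below restates those two rules in code form, under the usual VMS
mode ordering (NL < CR < CW < PR < PW < EX). The mode values, structures and
the combined condition are simplifying assumptions, not the actual
valueblock() implementation.

#include <string.h>

#define EX_MAXLOCKVAL 16
#define EX_LKM_PWMODE 4    /* protected write; higher value = stronger mode */

struct ex_lvb { char value[EX_MAXLOCKVAL]; };

/* Rough restatement of the two LVB rules above (not the real code). */
static void ex_valueblock(struct ex_lvb *resource_lvb, struct ex_lvb *lock_lvb,
                          int old_mode, int new_mode, int old_lvb_cleared)
{
        if (old_mode >= EX_LKM_PWMODE &&
            (new_mode < old_mode || (new_mode == old_mode && old_lvb_cleared)))
                /* Rules (1)/(2): a write-capable holder moving down, or
                 * converting at the same mode with LKM_OLD_LVB cleared,
                 * updates the resource's value block ...                   */
                memcpy(resource_lvb->value, lock_lvb->value, EX_MAXLOCKVAL);
        else
                /* ... otherwise the lock's copy is refreshed from it.      */
                memcpy(lock_lvb->value, resource_lvb->value, EX_MAXLOCKVAL);
}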
2.1 "dlm_scnop"
Manipulates a System Commit Number (SCN), a specialized use of the lock value
block associated with a resource. The SCN is a cluster-global counter used
to ??
The SCN is stored in the LVB associated with a lock resource.
An LVB is an array of MAXLOCKVAL bytes (16 bytes as defined currently, but
for OpenGFS, we need to change it to 32 bytes). The lock manager represents
an SCN value of up to 128 bits (this fills the 16-byte LVB size) by using four
end-to-end concatenated unsigned integers.
These integers are the four fields contained in the "scn_t" structure, defined
in "include/clmstructs.h", line 871. The following figure illustrates how the
"scn_t" structure overlays the bytes in the LVB:
bytes
-------------------------------------------------
LVB | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
-------------------------------------------------
unsigned integers
-------------------------------------------------
SCN | base | wrap1 | wrap2 | wrap3 |
-------------------------------------------------
You define the range of the SCN counter by specifying the "bit_len" parameter
passed to "dlm_scnop". If you specify a "bit_len" value of 32 or less, the
lock manager uses only the "base" field of the structure. If you specify a
"bit_len" value of 64 or less, the lock manager uses the "base" and the
"wrap1" fields of the structure; the integer value in the "base" field
overflows into the "wrap1" field. The value of the SCN should be interpreted
by concatenating the integer fields, not by adding them. For example, the
following figure shows the representation of an SCN value of 2^32:
-------------------------------------------------
SCN |0000......0|000......01| N/A | N/A |
-------------------------------------------------
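For "bit_len" values of 64 or less, reading the SCN therefore amounts to
concatenating "wrap1" and "base", as in the sketch below (the scn_t layout is
assumed from the figure; this is not the dlm_scnop() code).

#include <stdint.h>

/* Assumed layout, following the figure above. */
struct ex_scn {
        uint32_t base;
        uint32_t wrap1;
        uint32_t wrap2;
        uint32_t wrap3;
};

/* Read a <=64-bit SCN: concatenate "wrap1" and "base", don't add them.
 * For the example above (SCN = 2^32): base = 0, wrap1 = 1.             */
static uint64_t ex_scn_value64(const struct ex_scn *scn)
{
        return ((uint64_t)scn->wrap1 << 32) | scn->base;
}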
"dlm_scnop" is implemented in "api/api_scn.c", line 64. The SCN operation
("DLM_SCN_OP") is also one of the basic request types in OpenDLM, and it is
implemented in very similar way that "dlmlock"/"dlmunlock" is. It can be also
divided into two phase.
2.1.1 First phase
Just like "dlmunlock", a description of the SCN request is built and
committed in the first phase. The request is also represented by a transaction.
Following is the function calling graph for the first phase of "dlm_scnop":
dlm_scnop
|
|-----> api_reqresp
| |
| |---------> k_clm_send
| | |
| | V
| | clm_send
| | |
| | V
| | clm_process
| | |
| | V
| | clm_request
| | |
| | V
| | begin_scn_op
| | |
| | V
| | pti_call
| | |
| | V
| | dlm_pti_send
| | |
| | V
| | cccp_send
| | |
| |<--------------|
| |
| |---------> k_clm_recv
| | |
| | V
| | clm_recv
| | |
| |<--------------|
| |
|<-------|
|
O
The implementation of "dlm_scnop" is very similar to "dlmlock", hence only the
difference point will be discussed in following part.
The transaction that represents the SCN request is built in "dlm_scnop", and
then "api_reqresp" is called directly. Before entering "begin_scn_op", the
callback function ("cont_scn_op") for starting the operations of the second
phase is specified in "clm_request". Since a "NL" mode lock must be acquired
on the "resource" in question, the node ID of "resource" master can be found
in local "resource" copy. And this SCN request is sent to master's "PORT_MSTR"
port dircetly. Note, the UDP reply message is not sent out, hence the client
will be blocked until the respond from master is received.
2.1.2 Second phase
When the lock resource master receives the transaction, "pti_read_request" is
called by "CCCP". Following is the function calling graph for servicing the
SCN operation request (transaction):
pti_read_request
|
V
master_scn_op
According to the above graph, all SCN operations are done in "master_scn_op".
After that, the result (also described by a transaction data structure) is
sent back to the "PORT_RESPONSE" port of the sender of this SCN operation
request.
"pti_read_response" is called when the respond transaction is received by the
sender of the SCN operation request. Following is the call graph:
pti_read_response
|
|----> pti_proc_response
| |
| |-----------------> match_request
| | |
| |<------------------------|
| |
| |-----------------> cont_scn_op
| | |
| | V
| | clm_reply
| | |
| |<------------------------|
| |
|<----------|
|
O
According to the above graph, "cont_scn_op" return the result of request to
the client by calling "clm_reply". The client is un-blocked at this point and
gets the operating result.
Deadlock Detecting
------------------
While OpenDLM permits processes to synchronize their access to shared
resources, it cannot prevent them from creating deadlocks. Deadlocks are not a
problem for OpenDLM itself, but a mechanism for detecting and resolving
deadlocks is built into OpenDLM to help applications that use it.
The deadlock detection module is implemented in "kernel/dlmdk/clm_deadlock.c".
It appears that this module is based on a distributed deadlock detection
algorithm ("PATH-PUSHING").
In this algorithm, a local TWFG (Transaction Wait-For Graph) is built on each
node, gathering both resource waits and communication waits. Resources that
are not mastered locally, and non-local clients, are marked as external
vertices in the TWFG. To detect deadlocks, the following actions are performed
on each node:
1. Wait for and receive deadlock-related information (produced in the
   previous deadlock detection iteration) from other nodes.
2. Combine the received information with the local TWFG and search for
   cycles.
3. Resolve a local deadlock by choosing a victim from a local cycle that
   does not contain an external vertex.
4. Send information about any cycle that contains an external vertex to
   the node on which the external resource is mastered, or on which the
   external client resides.
The deadlock detection cycle ends when a node finds no deadlocks and no
external vertices in its TWFG (including both the local TWFG and any received
TWFG). The details of this module are discussed in the following sections.
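The local part of step 2 amounts to a cycle search in a small directed graph.
The sketch below is a generic depth-first search over an adjacency matrix,
shown only to make the idea concrete; it is not the ddx-stack walk that
OpenDLM actually performs.

#define EX_MAXV 32

/* ex_edge[a][b] != 0 means transaction a waits for transaction b. */
static int ex_edge[EX_MAXV][EX_MAXV];

/* Depth-first search for a cycle reachable from vertex v.  "on_path"
 * marks vertices on the current DFS path; "done" marks finished ones. */
static int ex_has_cycle_from(int v, int nvert, int *on_path, int *done)
{
        int w;

        on_path[v] = 1;
        for (w = 0; w < nvert; w++) {
                if (!ex_edge[v][w] || done[w])
                        continue;
                if (on_path[w])         /* back edge: a cycle exists */
                        return 1;
                if (ex_has_cycle_from(w, nvert, on_path, done))
                        return 1;
        }
        on_path[v] = 0;
        done[v] = 1;
        return 0;
}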
1. Start deadlock detecting
In OpenDLM, deadlock detection is a periodic task. It is executed every
"deadlock_timeout/2" seconds (default value: 30/2 seconds). Since it is based
on a distributed deadlock detection algorithm, inter-node coordination is
necessary, and all of the communication among the nodes is transferred via the
"PORT_DEADLOCK" port of "CCCP".
In "clmdd_init", the communication infrastructure for deadlock detection is
initialized, and "clmdd_incoming" is assigned to process all incoming messages
from "PORT_DEADLOCK". The following graph illustrates how deadlock detection
starts each time:
clm_master_loop clmdd_incoming
| |
|-> do_timer_stuff |MSG_STRT
| | -----------------
| |-> clmdd_task | | | | |
| | | V V V V V
| | |-> clmdd_primo (1) (2) (3) (4) (5)
| | | |
| | | N V Y
| | | -----------
| | | | |
| | |<--| V
| | | clmdd_startstart
| | | |
| | | |-> clmdd_suspect
| | | | |
| | | |<-------|
| | | |
| | | |-> clm_send
| | | | |
| | | |<-------|
| | | |
| | |<------------|
| | |
| |<-------|
| |
|<--------|
|
O
According to the above graph, the inter-node coordination is a little bit
complex when starting deadlock detection. It is necessary, though, because a
deadlock condition is likely to block several clients' requests at about the
same time. If we kicked off a periodic deadlock detection cycle on all nodes
without some kind of coordination, this would start detection cycles on all of
the affected nodes at roughly the same time.
This method is used to synchronize the beginning of a cycle. The cycle begins
on the lowest numbered site (determined by "clmdd_primo"), triggered by the
periodic task "clmdd_task". In "clmdd_startstart", the first node calls
"clmdd_suspect" to check its deadlock detection queue "dl_timeout" for
suspicious requests. All blocked lock requests are linked into the "dl_timeout"
queue, and the criterion for a suspicious request is that it has been blocked
for more than "deadlock_timeout" seconds. If it finds one, it sets the start
message timestamp to the age of the oldest request; otherwise it sets
"sdd->sdd_node" to NO_NODE. The start message is then sent to the next node in
the cluster. The start message is represented by a structure named
"start_msg_t". Following are all the fields in "start_msg_t":
sdd_cycle - Deadlock cycle number
sdd_instigator - Instigator of this cycle
sdd_node - Node of oldest waiting lock request
sdd_tv - Age of oldest waiting lock request
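A sketch of the start message's shape, based on the field list above, is shown
below. The types are assumptions; see the OpenDLM sources for the real
start_msg_t definition.

#include <sys/time.h>

/* Assumed shape of the start message (types are guesses). */
struct ex_start_msg {
        int            sdd_cycle;       /* deadlock cycle number               */
        int            sdd_instigator;  /* node that instigated this cycle     */
        int            sdd_node;        /* node of oldest waiting lock request */
        struct timeval sdd_tv;          /* age of oldest waiting lock request  */
};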
When a node receives the start message (type "MSG_STRT"), "clmdd_incoming" is
called, and there are five possibilities for type "MSG_STRT":
(1). The receiving node is the instigator and no suspicious request is
found.
(2). The receiving node is the instigator and the oldest suspicious
request also resides on the receiving node.
(3). The receiving node is the instigator and the oldest suspicious
request doesn't reside on the receiving node.
(4). The receiving node isn't the instigator and the oldest suspicious
request resides on the receiving node.
(5). The receiving node isn't the instigator and the oldest suspicious
request doesn't reside on the receiving node. (Or not found)
For condition (1), it is not necessary to start the deadlock detection cycle;
the current cycle reaches its end.
For conditions (2) and (4), start the deadlock detection cycle by calling
"clmdd_check".
For condition (3), since the start message has traveled through all nodes, the
start message is forwarded to the node that owns the oldest suspicious request.
For condition (5), the receiving node checks its queue, and if its oldest
request is older than the one in the start message, it sets the timestamp and
node ID to its own values. All this is done by calling "clmdd_dostart".
Note that we need to use the *age* of requests, rather than the timestamp, due
to the implementation of gettimeofday() in the Linux kernel.
2. Build deadlock detecting stack
The deadlock detection stack is the core data structure for the whole deadlock
detection cycle. It is used to transfer the TWFG among nodes. It is represented
as a queue of "ddx_t" structures; "ddx_t" is defined as follows:
typedef struct ddx_s {
ddx_flavor_t ddx_type;
union {
struct {
int cli_site;
clm_client_id_t cli_id;
unsigned long cli_rid;
int cli_marked;
} un_cli;
struct {
int rsc_namelen;
char *rsc_name;
short rsc_mode;
int rsc_lockid;
int rsc_locknode ;
} un_rsc;
struct {
unsigned long blk_lockid;
int blk_site;
unsigned long blk_age;
} un_blk;
} ddx_un;
} ddx_t;
"ddx_t" is used for describing the in-core wait-for graph. A node in the graph
is a discriminated union; permissible types are given in the enum
"ddx_flavor_t":
INVALID - Invalid entry, for bombproofing
CLIENT - Known client reference
XCLIENT - Unknown client reference
RSRC - Resource reference
XRSRC - Unknown resource reference
BRANCH - Beginning of a set
RETURN - End of a set
SCRUBBED - Null entry, but not invalid
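Rendered as code, the discriminator could look like the sketch below; the
symbolic names follow the list above, but the actual values and spelling in
the OpenDLM headers may differ.

/* Sketch of the discriminator enum; values and spelling are assumed. */
typedef enum ex_ddx_flavor {
        EX_DDX_INVALID,     /* invalid entry, for bombproofing     */
        EX_DDX_CLIENT,      /* known client reference              */
        EX_DDX_XCLIENT,     /* unknown (remote) client reference   */
        EX_DDX_RSRC,        /* resource reference                  */
        EX_DDX_XRSRC,       /* unknown (remote) resource reference */
        EX_DDX_BRANCH,      /* beginning of a set                  */
        EX_DDX_RETURN,      /* end of a set                        */
        EX_DDX_SCRUBBED     /* null entry, but not invalid         */
} ex_ddx_flavor_t;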
As mentioned before, the deadlock detection cycle starts by calling
"clmdd_check". The first task is to build the local TWFG:
clmdd_check
|
|-> clmdd_analyze
| |
| |------------
| | |
| | V
| | ---------------------
| | | |->clmdd_pushr--- |
| | | | | |
| | | ---clmdd_pushc<-| |
| | ---------------------
| | |
| |<----------|
| |
|<-------|
|
V
clmdd_processddx
In "clmdd_check", every suspicious request is checked by calling
"clmdd_analyze". "clmdd_pushr" and "clmdd_pushc" is invoked recursively to
build the local TWFG:
clmdd_pushr - Push a resource corresponding to a lock request onto
the deadlock detection stack. A resource will
eventually expand into a list of clients holding locks
on that resource (i.e.,in the wait/convert queues) that
are incompatible with the lock request passed in as an
argument. If the resource is mastered locally, the
expansion is done here. If the resource is remote, an
"XRSRC" entry is pushed onto the deadlock detection
stack. In case of an error the deadlock stack is freed.
clmdd_pushc - Push a client onto the deadlock detection stack.
Eventually, a client expands into a list of all its
blocked lock requests. If the client is local, marks
and expands it. If the client is remote, an "XCLIENT"
entry is pushed onto the deadlock detection stack. In
case of an error, free the deadlock stack.
The following figure illustrates a deadlock detection stack (for the sake of
simplicity, all resources and clients are local):
TWFG ddx_stack
------ ------
| C1 |<----| | C1 |
------ | |----|
| | | BR |
V | |----|
(R1) | | C2 |
| | |----|
V | | BR |
------ | |----|
| C2 | | | C3 |
------ | |----|
| | | RT |
| | |----|
------ | | BR |
| | | |----|
V V | | C1 |
(R2) (R3) | |----|
| | | | RT |
V ------- |----|
------ | RT |
| C3 | ------
------
C - Client
R - Resource
BR - Branch
RT - Return (end of branch)
3. Deal with "ddx" stack
After building the local TWFG, "clmdd_processddx" is called with the ddx stack
as its input. As mentioned before, only a cycle without an external vertex is
checked for and broken, if it exists. For a cycle that contains an external
vertex, the related information (part of the TWFG) is sent to the proper node.
The deadlock detection cycle ends when a node finds no deadlocks and no
external vertices in its TWFG (including both the local TWFG and any received
TWFG). Following is the calling graph for the deadlock detection cycle:
clmdd_proccessddx
|
V
clmdd_walk
|
N V Y
---------------------------------
| |
V V
clmdd_massage shoot_me_now
| |
V remote V local
clmdd_encode -----------------
| | |
V V V
cccp_send cccp_send clmdd_purgeclnt
(MSG_TWFG) (MSG_PURG) |
remote V local
-----------------
| |
V V
cccp_send clmdd_purgersrc
(MSG_BLOW)
clmdd_incoming
|
V
---------------------------------
| | |
| | |
(MSG_TWFG) (MSG_PURG) (MSG_BLOW)
| | |
V V V
clmdd_decode clmdd_purgeclnt clmdd_purgersrc
| |
V |
clmdd_processddx |
|
remote V local
-----------------
| |
V V
cccp_send clmdd_purgersrc
(MSG_BLOW)
According to the above graph, "clmdd_walk" is called first to detect a cycle
without an external vertex. If no such cycle is found, "clmdd_massage" is
called to decide the destination of the "MSG_TWFG" message. The other effect
of "clmdd_massage" is to scrub the "ddx" stack, so that only a TWFG that
contains external edges is sent out. The "ddx" stack is then encoded into a
"MSG_TWFG" message by invoking "clmdd_encode", and the "MSG_TWFG" message is
sent to the "PORT_DEADLOCK" port of the destination.
If a cycle without an external vertex is found by "clmdd_walk", a deadlock
condition has occurred. "shoot_me_now" is then called to find the most
recently blocked request in this cycle. If this request resides on a remote
node, a "MSG_PURG" message is sent to the "PORT_DEADLOCK" port of that node.
If this request resides on the local node, "clmdd_purgeclnt" is called to deny
the request.
"clmdd_purgeclnt" needs to find the requested resource first. If this resource
is mastered locally, "clmdd_purgersrc" is called to deny the blocked request
directly. If this resource is mastered on remote node, a "MSG_BLOW" message is
sent to port "PORT_DEADLOCK" of the resource master.
On the other hand, "clmdd_incoming" is invoked when a message is received from
port "PORT_DEADLOCK". Depending on the type of the message, different funcitons
are called to process the incoming information.
In case of receiving a "MSG_TWFG" message, "clmdd_decode" re-builds the "ddx"
stack based on the incoming message and local information. Then
"clmdd_processddx" is called with the ddx stack as inputing information. The
following process is all the same as "clmdd_processddx" is called in the first
node. The deadlock detecing cycle ends when a node ensure no deadlocks or
external vertex in its TWFG (including lock TWFG the received TWFG).
In case of receiving a "MSG_PURG" message, "clmdd_purgeclnt" is called to deny
the specified request. The following process is all the same as
"clmdd_purgeclnt" is called in the first node.
In case of receiving a "MSG_BLOW" message, "clmdd_purgersrc" is called to deny
the specified request directly.
Till now, a complete deadlock detecting cycle ends.