WHATIS-OpenDLM

Copyright 2003 The OpenDLM Project
Author: Ben Cahill (bc), ben.m.cahill@intel.com

1.0 Introduction
-----------------

This document contains an overview of the OpenDLM Distributed Lock Manager.
It is aimed at developers, potential developers, students, and anyone who
wants to know about the details of distributed lock management under OpenDLM.

This document is not intended as a user guide to the OpenDLM API, although it
should serve as a good introduction to OpenDLM functionality.  For a
discussion of the API, as well as complementary discussion of the OpenDLM
locking model and configuration info, see "Programming Locking Applications"
(dlmbook_final.pdf).  Several topics are discussed in depth in that document;
the author has kept discussion of those topics in this document complementary
in nature.

Much of this document is based on a one-hour interview session with Peter
Badovinatz in October, 2003.  Peter led the effort to port IBM's DLM from the
AIX operating system to Linux, before IBM dropped the project.  Peter assisted
with getting the current OpenDLM project going on Sourceforge, at
http://opendlm.sourceforge.net, but is now stretched pretty thin as chair of
the Carrier Grade Linux (CGL) specification committee, sponsored by the Open
Source Development Lab (OSDL).  The author is grateful for the time Peter
spent describing the "big picture" of OpenDLM operation.

This document may contain inaccurate statements, based on the author's
limited understanding.  In particular, the author has not studied any OpenDLM
code (I'm trying to write up what Peter told me before it drops out of my
head)!  Please contact the author (bc) if you see anything wrong or unclear.

1.1 Terminology
---------------

"Hold" and "Own" may be used interchangeably, to indicate that the lock
holding/owning compute node has been granted a lock, and therefore may access
the entity protected by the lock.

2.0 Overview
-------------

OpenDLM is a true distributed lock manager.  Knowledge of each lock is
distributed among "interested" compute nodes within a cluster of computers,
and each compute node contains and executes a complete set of logic for
requesting and granting locks among the nodes.

With OpenDLM, there is no single central lock server (i.e. a single computer
that executes logic for managing locks for all nodes in the cluster), nor is
there even a single central lock *storage* server (storing lock data, but not
executing lock logic).  Therefore, there is no single point of failure (SPOF)
for OpenDLM.

OpenDLM is efficient in terms of inter-node messaging.  It uses a
query/request/grant messaging scheme, described a bit later.  This is much
more efficient than a load/store polling protocol that repeatedly polls a
central lock storage server, such as the scheme used by the "memexp" lock
module in the OpenGFS shared filesystem project.  Polling can generate a
*great deal* of inter-node LAN traffic when a lock is not immediately
available.  OpenDLM simply waits for a grant message, thus avoiding the
repeated polling.

Much of the incentive for developing OpenDLM is for using it with OpenGFS, a
clustered/shared filesystem (http://opengfs.sourceforge.net).  However,
OpenDLM is a general purpose inter-node locking system that can be used at
the application level as well, or for other kernel level uses beyond OpenGFS.
See "Programming Locking Applications" (dlmbook_final.pdf) for a discussion
of the user-level API.  The kernel level API is more or less a subset of the
user-level API.

3.0 Requesting and Obtaining a Lock
------------------------------------

Knowledge about a lock is divided into 3 different realms:

-- Directory (node queried as first stop)
-- Master (node containing lock state info)
-- Requestor/Grantee (node that holds or wants to hold the lock)

When a node needs a lock, it first queries the "directory" node.  Each node
in the cluster serves as a directory node for a range of lock names.  The
mapping of name ranges to nodes is well known across the cluster, so each
requesting node knows which directory node to query, even for never-before-
requested lock names.  The well-known algorithm is based on a simple function
of the lock name (e.g. the first character in the name).
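As a rough illustration of such a mapping (this is the author's hypothetical
sketch, not actual OpenDLM code; the function name and the choice of hash are
invented), every node could evaluate the same simple function over the same
agreed-upon list of cluster members, and all nodes would then agree on the
directory node for any lock name without exchanging any messages:

  #include <stddef.h>

  /*
   * Hypothetical sketch of a well-known name-to-directory-node mapping.
   * Every node runs the same function over the same agreed-upon member
   * list, so all nodes compute the same answer locally.  OpenDLM's real
   * algorithm may be even simpler (e.g. keyed off the first character of
   * the name, as mentioned above).
   */
  static unsigned int
  lockname_to_dirnode(const char *name, size_t namelen, unsigned int num_nodes)
  {
      unsigned int hash = 0;
      size_t i;

      for (i = 0; i < namelen; i++)
          hash = (hash * 31) + (unsigned char)name[i];

      return hash % num_nodes;  /* index into the cluster member list */
  }

Because the number of nodes enters the computation, the mapping changes
whenever a node joins or leaves the cluster, which is why directory
responsibility must be redistributed on membership changes (see section 6.1).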
The directory node knows:

-- whether a lock has ever been requested
-- which node is the lock's master node

When a directory node is queried about a never-before-requested lock name,
the directory assigns the querying node to be the "master" for the lock, and
creates a new directory entry for the new lock.  The directory entry simply
shows which node is the master; the directory contains no lock state
information.  Normally, the master will persist as the master of the lock
forever, and the directory entry will not need to change.  Since the new
master had queried the directory because it was interested in obtaining the
lock, the new master grants itself ownership of the lock.

When a directory node is queried about a previously-requested lock name, the
directory tells the requestor which node is the master.  The requestor then
requests the lock from the master.  The master stores the state of the lock
(which nodes hold the lock, and in which states), and is the single
management server for the lock.  With our earlier talk about SPOFs, you may
be wondering about this a bit.  We'll discuss it later under "Node Death".

Note that the query/request/grant protocol efficiently overlaps message
functionality between the directory, mastership, and ownership functions.
The first node to request a lock handles only these two messages:

-- the directory query (outbound)
-- the response (inbound), which tells it to be the master (and grant itself
   the lock)

Note also that a single node might be the directory, master, and owner of a
lock.  In this case, there would be no network traffic for this lock until
another node requests it.

4.0 Lock Modes
---------------

OpenDLM supports 6 lock modes.  Some of these make sense only when you
realize that they were intended for use in a hierarchical (parent/child) lock
scheme:

-- EX, exclusive mode, traditional write lock
-- PW, protected write, traditional update lock
-- PR, protected read, traditional shared read lock
-- CW, concurrent write, see below
-- CR, concurrent read, see below
-- NL, null, unlocked placeholder, see below

The EX mode is the most restrictive, and NL is the least restrictive.  If a
lock is requested in a restrictive mode, it will not be granted until all
holders of incompatible (less restrictive, or in some cases equally
restrictive) modes have given up their lock ownership.

The NL mode means that the lock is totally unlocked, but some node wants to
keep the lock from being garbage collected.  That is, the state is unlocked,
but there is a lock usage count applied by at least one node, so the master
and directory nodes maintain knowledge of the lock.

The CR and CW modes are intended for use with a parent/child hierarchy, in
which a relatively large area of memory, a directory, or other large entity
is protected by a CR or CW lock.  The following paragraph describes how this
could be used, but please note that the current OpenDLM implementation does
*not* support this operation, because (in part) it does not support
parent/child relationships (no parent is included as a parameter in the lock
request call):

While the CW lock would presumably allow concurrent writes of a large area by
multiple nodes, it really should not directly write-protect the entire large
area.  Instead, finer grained, more restrictive locks (PW or EX) should be
obtained to write into smaller areas contained within the larger area.  These
finer-grained locks would be requested as "children" of the larger CW lock
(with the parent lock included as a parameter in the lock request call).  The
all-encompassing CW lock, when released, would in turn cause the release of
all of the child locks, providing easy management of the large area and all
of the small areas that it contains.

There is more information on modes in the "Programming Locking Applications"
document.
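The six modes above are the classic VMS-style DLM modes.  As a reference
point, the conventional VMS/DLM compatibility matrix is sketched below.  This
is the author's summary of the usual convention, assumed (not verified
against OpenDLM source) to match what OpenDLM implements, and the identifiers
are illustrative rather than OpenDLM's actual names.  A request can be
granted only if its mode is compatible with every mode already granted on the
same lock resource:

  /*
   * Conventional VMS-style DLM mode compatibility (assumed, not taken
   * from OpenDLM source).  compat[held][requested] is non-zero when a
   * new request can be granted alongside an existing holder.
   */
  enum mode { NL, CR, CW, PR, PW, EX };

  static const char compat[6][6] = {
      /* requested:  NL CR CW PR PW EX        held: */
                   {  1, 1, 1, 1, 1, 1 },  /* NL */
                   {  1, 1, 1, 1, 1, 0 },  /* CR */
                   {  1, 1, 1, 0, 0, 0 },  /* CW */
                   {  1, 1, 0, 1, 0, 0 },  /* PR */
                   {  1, 1, 0, 0, 0, 0 },  /* PW */
                   {  1, 0, 0, 0, 0, 0 },  /* EX */
  };

  /* Grantable only if compatible with every currently granted mode. */
  static int grantable(enum mode requested, const enum mode *held, int nheld)
  {
      int i;

      for (i = 0; i < nheld; i++)
          if (!compat[held[i]][requested])
              return 0;
      return 1;
  }

Note that NL conflicts with nothing, which is what makes it useful as a
placeholder, while EX conflicts with everything except NL.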
5.0 Lock States
----------------

A "lock resource" is the object contained in the master's memory, indicating
the cluster-wide state of the locks associated with a given lock name.  A
given lock resource protects a single entity (e.g. file, data structure,
etc.), and is identified by a unique name.  Note that several locks may exist
on a given lock name/resource, since many of the lock modes are sharable;
several nodes or processes may simultaneously hold locks on the same lock
resource.

Locks may be in one of 3 states:

-- Granted, lock is now in the mode requested
-- Converting, waiting for grant to a more restrictive mode
-- Blocked, waiting for grant

For each of these states, there exists a queue attached to the lock resource,
containing the identities of the processes/nodes that hold locks in each of
those states (only for this lock resource, of course).

There is much more information on states in the "Programming Locking
Applications" document.
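As a purely illustrative sketch (these structure and field names are the
author's invention, not OpenDLM's actual data structures), the lock resource
kept in the master's memory can be pictured as a name plus one queue per
state:

  /* One lock held or wanted by one process on one node (hypothetical). */
  struct lock {
      int          node_id;    /* node holding or requesting the lock */
      int          pid;        /* process on that node */
      int          mode;       /* e.g. one of NL, CR, CW, PR, PW, EX */
      struct lock *next;       /* next lock on the same queue */
  };

  /*
   * Hypothetical lock resource, kept in the master node's memory, one per
   * unique lock name.  The three queues correspond to the three lock
   * states described above.
   */
  struct lock_resource {
      char         name[64];       /* unique lock/resource name */
      struct lock *grant_queue;    /* locks granted in requested mode */
      struct lock *convert_queue;  /* granted, waiting to change mode */
      struct lock *block_queue;    /* waiting for an initial grant */
  };

In such a scheme, whenever a lock is released or converted, the master would
re-examine the convert and block queues against the mode compatibility rules
(section 4.0) to see what can now be granted.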
6.0 Node Death, Lock Recovery, Orphan Locks
--------------------------------------------

OpenDLM relies on a (separate) cluster manager to determine which nodes are
active members of the cluster.  When the cluster manager detects that a node
has died, the rest of the cluster must recover the locks that were:

-- held by the dead node
-- mastered by the dead node
-- contained in the dead node's directory

A lot of the recovery process depends on whether any other node is interested
in the locks held/mastered/directoried by the dead node.  If no other node
was interested in a particular lock at the time of death, the lock can be
safely forgotten.  If another node later becomes interested (i.e. requests
the lock), the lock can be treated as a never-before-requested lock, thus
creating a new directory entry, and assigning a new master.  Actually, the
recovery process works just like requesting a never-before-requested lock!

6.1 Directory recovery
-----------------------

Directory support can migrate from node to node, and the distribution of
directory responsibility (range of lock resource names) is based on the
number of nodes in the cluster.  When OpenDLM (that is, each computer still
in the cluster) finds out that a node has entered or left the cluster, it
redistributes the directories across the new inventory of nodes.  This may
mean copying directory information from node to node (if possible), or it
may mean simply assuming responsibility for a range of names that was
previously handled by a node that just died.  In the latter case, the
directory entries must be rebuilt, and new masters assigned, as surviving
nodes query the directory (just like a never-before-requested lock).

6.2 Master recovery
--------------------

It's important to realize that often, more than one node knows about a lock
resource.  If the master node for a particular lock resource dies, any other
node that holds a lock on the resource, or is blocked waiting for a lock,
knows who the master is, and also knows (by means of the cluster manager)
that the master node has died.

When a master node dies, holders/requestors query the lock's directory node.
The directory node will assign the first such querying node to be the new
master, and direct other requestors to the new master (just like a
never-before-requested lock).

6.3 Owner recovery
-------------------

If a lock owner dies, the master node treats the situation as if the dead
node had just released the lock.  Based on that, it may grant locks to other
requesting nodes.

"Orphan" locks do not behave that way; the master does *not* treat the dead
node's lock as if it were released, and thus may not be able to grant locks
to other requestors.  See section 2.3.6 "Leaving the Grant Queue" in the
"Programming Locking Applications" doc for more info.

6.4 Multiple recovery
----------------------

There are many cases in which a single node may play various combinations of
the directory/master/holder/requestor roles.

7.0 Message Ordering
---------------------

OpenDLM does not assume that the communications protocol between nodes
guarantees proper ordering of messages.  OpenDLM has a significant amount of
logic dedicated to remembering and re-ordering messages that arrive out of
logical sequence.

Even if the comm protocol guaranteed message order, the fact that the various
nodes operate somewhat independently means that some crazy ordering can
occur.  For example, when a lock resource's master node dies, several nodes
may query its directory node.  The first will be assigned to be the new
master.  Others will request locks from the new master.  Lock requests
*could* arrive at the new master before the new master gets the message that
it is the new master!  In such a case, the new master would need to remember
these incoming lock requests, assuming that it *might* be becoming the new
master.

Logic for remembering and re-sequencing messages is split between the CCCP
communication module and the lock management module.  Peter suggested that
the logic is not very clean (i.e. fairly intertwined and interdependent), and
could benefit from some re-design.

References:
-----------

OpenDLM web site:

  http://opendlm.sourceforge.net (see especially "Docs/Info" page, via menu)

OpenGFS CVS (browsable/downloadable via OpenGFS website):

  opengfs/docs/*
  Many docs are also mirrored to http://opengfs.sourceforge.net/docs.php

Short descriptions of DLM and DLM API within HP OpenVMS:

  http://h71000.www7.hp.com/doc/731FINAL/5841/5841pro_021.html#145_conceptslocks
  http://h71000.www7.hp.com/doc/731FINAL/4527/4527pro_045.html