WHATIS-OpenDLM

Copyright 2003 The OpenDLM Project
Author: Ben Cahill (bc), ben.m.cahill@intel.com

1.0 Introduction
-----------------

This document contains an overview of the OpenDLM Distributed Lock Manager.
It is aimed at developers, potential developers, students, and anyone who
wants to know about the details of distributed lock management under OpenDLM.

This document is not intended as a user guide to the OpenDLM API, although it
should serve as a good introduction to OpenDLM functionality.  For a
discussion of the API, as well as complementary discussion of the OpenDLM
locking model and configuration info, see "Programming Locking Applications"
(dlmbook_final.pdf).  Several topics are discussed in depth in that document;
the author has kept discussion of those topics in this document complementary
in nature.

Much of this document is based on a one-hour interview session with Peter
Badovinatz in October, 2003.  Peter led the effort to port IBM's DLM from the
AIX operating system to Linux, before IBM dropped the project.  Peter assisted
with getting the current OpenDLM project going on Sourceforge, at
http://opendlm.sourceforge.net, but is now stretched pretty thin as chair of
the Carrier Grade Linux (CGL) specification committee, sponsored by the Open
Source Development Lab (OSDL).  The author is grateful for the time Peter
spent describing the "big picture" of OpenDLM operation.

This document may contain inaccurate statements, based on the author's
limited understanding.  In particular, the author has not studied any OpenDLM
code (I'm trying to write up what Peter told me before it drops out of my
head)!  Please contact the author (bc) if you see anything wrong or unclear.

1.1 Terminology
---------------

"Hold" and "Own" may be used interchangeably, to indicate that the lock
holding/owning compute node has been granted a lock, and therefore may access
the entity protected by the lock.

2.0 Overview
-------------

OpenDLM is a true distributed lock manager.  Knowledge of each lock is
distributed among "interested" compute nodes within a cluster of computers,
and each compute node contains and executes a complete set of logic for
requesting and granting locks among the nodes.

With OpenDLM, there is no single central lock server (i.e. a single computer
that executes logic for managing locks for all nodes in the cluster), nor is
there even a single central lock *storage* server (storing lock data, but not
executing lock logic).  Therefore, there is no single point of failure (SPOF)
for OpenDLM.

OpenDLM is efficient in terms of inter-node messaging.  It uses a
query/request/grant messaging scheme, described a bit later.  This is much
more efficient than a load/store polling protocol that repeatedly polls a
central lock storage server, such as the scheme used by the "memexp" lock
module in the OpenGFS shared filesystem project.  Polling can generate a
*great deal* of inter-node LAN traffic when a lock is not immediately
available.  OpenDLM simply waits for a grant message, thus avoiding the
repeated polling.

Much of the incentive for developing OpenDLM is for using it with OpenGFS, a
clustered/shared filesystem (http://opengfs.sourceforge.net).  However,
OpenDLM is a general purpose inter-node locking system that can be used at
the application level as well, or for other kernel level uses beyond OpenGFS.
See "Programming Locking Applications" (dlmbook_final.pdf) for a discussion
of the user-level API.  The kernel level API is more or less a subset of the
user-level API.

3.0 Requesting and Obtaining a Lock
------------------------------------

Knowledge about a lock is divided into 3 different realms:

-- Directory (node queried as first stop)
-- Master (node containing lock state info)
-- Requestor/Grantee (node that holds or wants to hold the lock)

When a node needs a lock, it first queries the "directory" node.  Each node
in the cluster serves as a directory node for a range of lock names.  The
mapping of name ranges to nodes is well known across the cluster, so each
requesting node knows which directory node to query, even for never-before-
requested lock names.  The well-known algorithm is based on a simple function
of the lock name (e.g. the first character in the name).
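As a rough illustration of such a mapping (this is the author's hypothetical
sketch, not actual OpenDLM code; the function name and the choice of hash are
invented), every node could evaluate the same simple function over the same
agreed-upon list of cluster members, and all nodes would then agree on the
directory node for any lock name without exchanging any messages:

  #include <stddef.h>

  /*
   * Hypothetical sketch of a well-known name-to-directory-node mapping.
   * Every node runs the same function over the same agreed-upon member
   * list, so all nodes compute the same answer locally.  OpenDLM's real
   * algorithm may be even simpler (e.g. keyed off the first character of
   * the name, as mentioned above).
   */
  static unsigned int
  lockname_to_dirnode(const char *name, size_t namelen, unsigned int num_nodes)
  {
      unsigned int hash = 0;
      size_t i;

      for (i = 0; i < namelen; i++)
          hash = (hash * 31) + (unsigned char)name[i];

      return hash % num_nodes;  /* index into the cluster member list */
  }

Because the number of nodes enters the computation, the mapping changes
whenever a node joins or leaves the cluster, which is why directory
responsibility must be redistributed on membership changes (see section 6.1).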
The directory node knows:

-- whether a lock has ever been requested
-- which node is the lock's master node

When a directory node is queried about a never-before-requested lock name,
the directory assigns the querying node to be the "master" for the lock, and
creates a new directory entry for the new lock.  The directory entry simply
shows which node is the master; the directory contains no lock state
information.  Normally, the master will persist as the master of the lock
forever, and the directory entry will not need to change.  Since the new
master had queried the directory because it was interested in obtaining the
lock, the new master grants itself ownership of the lock.

When a directory node is queried about a previously-requested lock name, the
directory tells the requestor which node is the master.  The requestor then
requests the lock from the master.  The master stores the state of the lock
(which nodes hold the lock, and in which states), and is the single
management server for the lock.  With our earlier talk about SPOFs, you may
be wondering about this a bit.  We'll discuss it later under "Node Death".

Note that the query/request/grant protocol efficiently overlaps message
functionality between the directory, mastership, and ownership functions.
The first node to request a lock handles only these two messages:

-- the directory query (outbound)
-- the response (inbound), which tells it to be the master (and grant itself
   the lock)

Note also that a single node might be the directory, master, and owner of a
lock.  In this case, there would be no network traffic for this lock until
another node requests it.

4.0 Lock Modes
---------------

OpenDLM supports 6 lock modes.  Some of these make sense only when you
realize that they were intended for use in a hierarchical (parent/child) lock
scheme:

-- EX, exclusive mode, traditional write lock
-- PW, protected write, traditional update lock
-- PR, protected read, traditional shared read lock
-- CW, concurrent write, see below
-- CR, concurrent read, see below
-- NL, null, unlocked placeholder, see below

The EX mode is the most restrictive, and NL is the least restrictive.  If a
lock is requested in a restrictive mode, it will not be granted until all
holders of incompatible (less restrictive, or in some cases equally
restrictive) modes have given up their lock ownership.

The NL mode means that the lock is totally unlocked, but some node wants to
keep the lock from being garbage collected.  That is, the state is unlocked,
but there is a lock usage count applied by at least one node, so the master
and directory nodes maintain knowledge of the lock.

The CR and CW modes are intended for use with a parent/child hierarchy, in
which a relatively large area of memory, a directory, or other large entity
is protected by a CR or CW lock.  The following paragraph describes how this
could be used, but please note that the current OpenDLM implementation does
*not* support this operation, because (in part) it does not support
parent/child relationships (no parent is included as a parameter in the lock
request call):

While the CW lock would presumably allow concurrent writes of a large area by
multiple nodes, it really should not directly write-protect the entire large
area.  Instead, finer grained, more restrictive locks (PW or EX) should be
obtained to write into smaller areas contained within the larger area.  These
finer-grained locks would be requested as "children" of the larger CW lock
(with the parent lock included as a parameter in the lock request call).  The
all-encompassing CW lock, when released, would in turn cause the release of
all of the child locks, providing easy management of the large area and all
of the small areas that it contains.

There is more information on modes in the "Programming Locking Applications"
document.
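The six modes above are the classic VMS-style DLM modes.  As a reference
point, the conventional VMS/DLM compatibility matrix is sketched below.  This
is the author's summary of the usual convention, assumed (not verified
against OpenDLM source) to match what OpenDLM implements, and the identifiers
are illustrative rather than OpenDLM's actual names.  A request can be
granted only if its mode is compatible with every mode already granted on the
same lock resource:

  /*
   * Conventional VMS-style DLM mode compatibility (assumed, not taken
   * from OpenDLM source).  compat[held][requested] is non-zero when a
   * new request can be granted alongside an existing holder.
   */
  enum mode { NL, CR, CW, PR, PW, EX };

  static const char compat[6][6] = {
      /* requested:  NL CR CW PR PW EX        held: */
                   {  1, 1, 1, 1, 1, 1 },  /* NL */
                   {  1, 1, 1, 1, 1, 0 },  /* CR */
                   {  1, 1, 1, 0, 0, 0 },  /* CW */
                   {  1, 1, 0, 1, 0, 0 },  /* PR */
                   {  1, 1, 0, 0, 0, 0 },  /* PW */
                   {  1, 0, 0, 0, 0, 0 },  /* EX */
  };

  /* Grantable only if compatible with every currently granted mode. */
  static int grantable(enum mode requested, const enum mode *held, int nheld)
  {
      int i;

      for (i = 0; i < nheld; i++)
          if (!compat[held[i]][requested])
              return 0;
      return 1;
  }

Note that NL conflicts with nothing, which is what makes it useful as a
placeholder, while EX conflicts with everything except NL.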
5.0 Lock States
----------------

A "lock resource" is the object contained in the master's memory, indicating
the cluster-wide state of the locks associated with a given lock name.  A
given lock resource protects a single entity (e.g. file, data structure,
etc.), and is identified by a unique name.  Note that several locks may exist
on a given lock name/resource, since many of the lock modes are sharable;
several nodes or processes may simultaneously hold locks on the same lock
resource.

Locks may be in one of 3 states:

-- Granted, lock is now in the mode requested
-- Converting, waiting for grant to a more restrictive mode
-- Blocked, waiting for grant

For each of these states, there exists a queue attached to the lock resource,
containing the identities of the processes/nodes that hold locks in each of
those states (only for this lock resource, of course).

There is much more information on states in the "Programming Locking
Applications" document.
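As a purely illustrative sketch (these structure and field names are the
author's invention, not OpenDLM's actual data structures), the lock resource
kept in the master's memory can be pictured as a name plus one queue per
state:

  /* One lock held or wanted by one process on one node (hypothetical). */
  struct lock {
      int          node_id;    /* node holding or requesting the lock */
      int          pid;        /* process on that node */
      int          mode;       /* e.g. one of NL, CR, CW, PR, PW, EX */
      struct lock *next;       /* next lock on the same queue */
  };

  /*
   * Hypothetical lock resource, kept in the master node's memory, one per
   * unique lock name.  The three queues correspond to the three lock
   * states described above.
   */
  struct lock_resource {
      char         name[64];       /* unique lock/resource name */
      struct lock *grant_queue;    /* locks granted in requested mode */
      struct lock *convert_queue;  /* granted, waiting to change mode */
      struct lock *block_queue;    /* waiting for an initial grant */
  };

In such a scheme, whenever a lock is released or converted, the master would
re-examine the convert and block queues against the mode compatibility rules
(section 4.0) to see what can now be granted.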
6.0 Node Death, Lock Recovery, Orphan Locks
--------------------------------------------

OpenDLM relies on a (separate) cluster manager to determine which nodes are
active members of the cluster.  When the cluster manager detects that a node
has died, the rest of the cluster must recover the locks that were:

-- held by the dead node
-- mastered by the dead node
-- contained in the dead node's directory

A lot of the recovery process depends on whether any other node is interested
in the locks held/mastered/directoried by the dead node.  If no other node
was interested in a particular lock at the time of death, the lock can be
safely forgotten.  If another node later becomes interested (i.e. requests
the lock), the lock can be treated as a never-before-requested lock, thus
creating a new directory entry, and assigning a new master.  Actually, the
recovery process works just like requesting a never-before-requested lock!

6.1 Directory recovery
-----------------------

Directory support can migrate from node to node, and the distribution of
directory responsibility (range of lock resource names) is based on the
number of nodes in the cluster.  When OpenDLM (that is, each computer still
in the cluster) finds out that a node has entered or left the cluster, it
redistributes the directories across the new inventory of nodes.  This may
mean copying directory information from node to node (if possible), or it
may mean simply assuming responsibility for a range of names that was
previously handled by a node that just died.  In the latter case, the
directory entries must be rebuilt, and new masters assigned, as surviving
nodes query the directory (just like a never-before-requested lock).

6.2 Master recovery
--------------------

It's important to realize that often, more than one node knows about a lock
resource.  If the master node for a particular lock resource dies, any other
node that holds a lock on the resource, or is blocked waiting for a lock,
knows who the master is, and also knows (by means of the cluster manager)
that the master node has died.

When a master node dies, holders/requestors query the lock's directory node.
The directory node will assign the first such querying node to be the new
master, and direct other requestors to the new master (just like a
never-before-requested lock).

6.3 Owner recovery
-------------------

If a lock owner dies, the master node treats the situation as if the dead
node had just released the lock.  Based on that, it may grant locks to other
requesting nodes.

"Orphan" locks do not behave that way; the master does *not* treat the dead
node's lock as if it were released, and thus may not be able to grant locks
to other requestors.  See section 2.3.6 "Leaving the Grant Queue" in the
"Programming Locking Applications" doc for more info.

6.4 Multiple recovery
----------------------

There are many cases in which a single node may play various combinations of
the directory/master/holder/requestor roles.

7.0 Message Ordering
---------------------

OpenDLM does not assume that the communications protocol between nodes
guarantees proper ordering of messages.  OpenDLM has a significant amount of
logic dedicated to remembering and re-ordering messages that arrive out of
logical sequence.

Even if the comm protocol guaranteed message order, the fact that the various
nodes operate somewhat independently means that some crazy ordering can
occur.  For example, when a lock resource's master node dies, several nodes
may query its directory node.  The first will be assigned to be the new
master.  Others will request locks from the new master.  Lock requests
*could* arrive at the new master before the new master gets the message that
it is the new master!  In such a case, the new master would need to remember
these incoming lock requests, assuming that it *might* be becoming the new
master.

Logic for remembering and re-sequencing messages is split between the CCCP
communication module and the lock management module.  Peter suggested that
the logic is not very clean (i.e. fairly intertwined and interdependent), and
could benefit from some re-design.

References:
-----------

OpenDLM web site:

  http://opendlm.sourceforge.net (see especially "Docs/Info" page, via menu)

OpenGFS CVS (browsable/downloadable via OpenGFS website):

  opengfs/docs/*
  Many docs are also mirrored to http://opengfs.sourceforge.net/docs.php

Short descriptions of DLM and DLM API within HP OpenVMS:

  http://h71000.www7.hp.com/doc/731FINAL/5841/5841pro_021.html#145_conceptslocks
  http://h71000.www7.hp.com/doc/731FINAL/4527/4527pro_045.html