OpenDLM Important Software Structures
Copyright 2004 The OpenDLM Project
Author:
Ben Cahill (bc), ben.m.cahill@intel.com
1. Introduction
----------------
This document contains details of important structures within OpenDLM, and
information on their usage.
This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of structures in OpenDLM.
This document is not intended as a user guide to OpenDLM. Look in the OpenDLM
WHATIS-odlm document for an overview of OpenDLM. Look in the OpenDLM
HOWTO-install for details of configuring and setting up OpenDLM.
This document may contain inaccurate statements, based on the author's limited
understanding. Please contact the author (bc) if you see anything wrong or
unclear.
1.1 Structure management and definitions
-----------------------------------------
For many structure types, OpenDLM has a source file dedicated to managing a list
or lists of that type of structure. As an example, see section 2 below.
Unless otherwise noted, the structures discussed in this document are defined
in:
src/include/clmstructs.h
2 Resource (struct resource)
-----------------------------
A "resource" is a lockable object. A resource is identified by its name
and family/type (UNIX or VMS), and is represented within a node as a "struct
resource".
If a cluster member node is the first in the cluster to use a resource, that
node becomes the "resource master". As it and other nodes grab and release
locks on the resource, the master node keeps a master copy of the resource
structure, tracking information on all locks on the resource throughout the
cluster. A master resource structure contains "0" in its "master" field.
When a non-master node (i.e. not the first node to grab a lock on the resource)
uses the resource, it generates a local copy of the resource structure. This
copy includes info on just those locks held within that node, and contains the
non-zero node id of the master in its "master" field.
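A minimal sketch of how code could tell which role a given copy plays, based
on the "master" field convention above (the helper name is hypothetical, not
from the OpenDLM sources):

static int resource_is_master(const struct resource *res)
{
        /* a master copy stores 0; a slave copy stores the nonzero
           site id of the master node */
        return res->master == 0;
}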
In src/kernel/dlmdk/clm_migrate.c, functions clmm_master2slave() and
clmm_slave2master() convert resource structures from master to slave,
and vice versa.
All resources are maintained within the "restab" data structure, which is
managed by source file src/kernel/dlmdk/clm_resource.c. Take a look
at comments in that file (around line 175) for more information.
struct resource {
        int refcount;                   /* reference count */
        union dlm_rh rh;                /* resource handle */
        union dlm_rh mst_rh;            /* master resource handle */
        dlm_restype_t type;             /* resource type */
        char *name;                     /* dynamic resource name */
        short namelen;                  /* length of resource name */
        short master;                   /* site id of master node */
        dlm_list_t lockqs[ RLQ_MAX ];   /* grant, convert, wait lists */
        char value[MAXLOCKVAL];         /* lock value block */
        int rsrc_flags;                 /* state bits */
        clmm_info_t clminfo;            /* migration data */
        int lastnode;                   /* nodeid of last request */
        dlm_statistics_t stats;         /* resource statistics */
        union {
                struct resource *free_next;     /* next resource in freelist */
                struct resref *refs;            /* resource reference list */
        } reuse;
        void *migr_ack;                 /* pointer to migration ack message */
};
3 Directory (struct rl_info)
-----------------------------
A directory maps a resource to its master node. The rl_info structure
is considerably simpler than a resource structure, because *all* it needs
to do is map a resource name/type to its master.
A directory entry must reside on a certain node within the cluster, determined
by a hash function of the resource name and the number of active cluster
members. Using this "well known" (to all nodes) formula, any node can
calculate the node to query for directory information on a given resource.
The directory node assignment has nothing to do with which nodes might use
a given resource. Therefore, even though the info in a directory structure
is a subset of the info in a resource structure, it makes sense to have the
directory structure be independent and separate from (and smaller than)
the resource structure.
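As an illustration of the "well known" computation, here is a minimal sketch
(the hash shown here is made up; the real function lives in the OpenDLM
sources):

static int directory_node(const char *name, short namelen, int node_count)
{
        unsigned int hash = 0;
        int i;

        /* any deterministic hash of the resource name works, as long
           as every node computes exactly the same one */
        for (i = 0; i < namelen; i++)
                hash = hash * 31 + (unsigned char)name[i];

        /* map the hash onto the current set of active members */
        return (int)(hash % (unsigned int)node_count);
}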
Directory entries are linked to slot lists within a hash table "rldbtab",
managed by code within src/kernel/dlmdk/clm_rldb.c. Current code allocates
100000 slots, via rl_init(100000) in src/kernel/dlmdk/clm_main.c.
typedef struct rl_info {
        dlm_listnode_t link;            /* link into hash slot list */
        void *rl_name;                  /* name field of the resource */
        ushort rl_namelen;              /* length of the name value */
        ushort rl_flags;                /* RLDB flags, see below */
        ushort rl_node;                 /* node where resource is mastered */
        unsigned char rl_type;          /* type of resource (VMS vs UNIX) */
} rl_info_t;
When searching through the directory for a given entry, a match is found
if and only if the following fields match:
-- rl_name
-- rl_namelen
-- rl_type
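In code, that match test looks roughly like this (a sketch; the function name
is illustrative):

static int rl_match(const rl_info_t *entry, const void *name,
                    ushort namelen, unsigned char type)
{
        return entry->rl_namelen == namelen &&
               entry->rl_type == type &&
               memcmp(entry->rl_name, name, namelen) == 0;
}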
The following flags are used in the rl_flags field:
RLINFO_DIRECTORY  0x01  /* set for a directory entry, cleared for cache entry */
RLINFO_FROZEN     0x02  /* entry is frozen - don't use */
RLINFO_TOUCHED    0x04  /* used for LRU reclaim algorithm */
RLINFO_LOCAL      0x08  /* set for local lock in reconfig */
An active directory entry has RLINFO_DIRECTORY set. This means that the entry
is serving as the cluster-wide master copy of the directory entry, and resides
on the *directory node* for that resource.
Each node also keeps a cache of local directory entries: copies of directory
entries whose master copies reside on other (directory) nodes. Cache entries
live in the same hash table as the active directory entries, but they do not
have RLINFO_DIRECTORY set.
Directory cache entries may exist for two reasons:
-- cache entry is created when *this* node gets directory info from a
directory node (e.g. when requesting a lock on a new resource).
-- cache entry is left over from migrating a directory from this node
(the former directory node) to a new directory node after a change in
cluster membership.
The RLINFO_TOUCHED flag is used for what seems to be a weak least-recently-used
(LRU) reclaim algorithm. With current code, this algorithm is used *only* when
the Linux slab cache allocator doesn't work. (?? Could we eliminate this LRU
stuff ??)
The RLINFO_FROZEN flag seems to live up to the "don't use" comment; I can't
find anywhere in the code that RLINFO_FROZEN is used! (Look for use of
RSRC_FRZEN, though??).
The RLINFO_LOCAL flag??
4 Lock (struct reclock)
------------------------
The reclock structure contains three significant sub-structures describing a
given lock (routeinfo, lockinfo, and deadlock; see sections 4.1 - 4.3 below),
along with a few other elements for managing the lock within a hash table or
within the linked-list queues of the resource structure.
Reclock structures are managed via functions in src/kernel/dlmdk/clm_alloc.c.
struct reclock {
        dlm_listnode_t link;
        struct tq_node *tq_node;        /* ptr to timeout queue structure */
        struct routeinfo route;
        struct lockinfo lck;
        struct deadlock dl;
        unsigned char seg;              /* lock table segment index */
        unsigned char set;              /* timer set/reset */
};
4.1 Route Info (struct routeinfo)
----------------------------------
/*
* This structure is contained in a lock record as well as a lock
* request. It holds the information necessary to complete a lock
* request after it has been granted.
*/
struct routeinfo {
        short respflags;                /* response flags */
        short rt_master;                /* master id for response */
        void (*callback)(struct transaction *, struct transaction *);
                                        /* completion func on secondary */
        union {
                int signo;              /* signal to use for notification */
                union dlm_rh lcl_rh;    /* local resource handle */
        } rt_reuse;
        int pollfd;                     /* fd to use for notification */
        ast_handle_t asth;              /* for completion ASTs */
        ast_handle_t basth;             /* for blocking ASTs */
        ast_handle_t cbasth;            /* for convert pending blocking ASTs */
};
4.2 Lock Info (struct lockinfo)
--------------------------------
/*
* This is the latest copy of the lockinfo structure updated to contain
* a transaction identifier.
*
* This structure contains lock specific information - most of which
* originates in the API
*/
struct lockinfo {
        unsigned mode:3;                /* grant mode */
        unsigned reqmode:3;             /* request mode */
        unsigned bastmode:3;            /* mode passed back in blocking AST */
        unsigned site:16;               /* site # where request was made */
        int lk_flags;                   /* flags from application */
        int pid;                        /* process ID that requested lock */
        uint remlockid;                 /* remote lockid from secondary */
        uint lockid;                    /* lockid */
        short state;                    /* internal state flags */
        union dlm_rh rh;                /* resource handle */
        union reuse_li li_reuse;
        char lkvalue[MAXLOCKVAL];
        struct transaction *request;    /* pointer to create request */
        u_int seqnum;                   /* sequence number on queue */
        dlm_restype_t type;
        dlm_xid_t xid;                  /* transaction id */
        clm_orphan_holder_t orphan_holder;      /* holder of orphan lock */
};
4.3 Deadlock (struct deadlock)
-------------------------------
/*
 * Structure to hold all info a reclock needs to keep
 * about deadlock detection.
 */
struct deadlock {
        dlm_listnode_t tq;              /* Links for the waiter-timeout queue. */
        dlm_listnode_t cl;              /* Links for the client/owner list. */
        /* Timestamp for deadlock timeout queue. */
        struct timeval timestamp;       /* time lock was added to queue */
        struct timeval checkstamp;      /* time last checked for deadlock */
        /* Deadlock pass stamp to avoid redundant deadlock searches. */
        short deadlock_stamp;
};
5 Transaction (struct transaction)
-----------------------------------
struct transaction {
        int clm_client_space;           /* Client in user or kernel space */
        u_long clm_prog;                /* desired service */
        u_long clm_vers;                /* service version */
        dlm_stats_t clm_status;         /* status of request */
        dlm_ops_t clm_type;             /* request type */
        ptype_t clm_direction;          /* request or response */
        short clm_locktype;             /* UNIX or VMS lock */
        unsigned int clm_sequence;      /* transaction sequence # */
        unsigned int pti_sequence;      /* transaction sequence # */
        int clm_authpid;                /* authorized process id of group */
        void *clm_sender;               /* to detect lost replies */
        /* routing information */
        struct routeinfo clm_route;
        int clm_pid;                    /* process id of client */
        void *clm_next;                 /* next trans or msg */
        union trans_data {
                /* requests */
                struct clm_regreq _clm_register;        /* DLM_REGISTER */
                struct lockreq _clm_lockreq;            /* LOCK */
                struct lockinfo _clm_lockinfo;          /* LOCK, CANCEL, UNLOCK */
                struct purgereq _clm_purgelocks;        /* DLM_PURGELOCKSPID */
                struct scninfo _clm_scninfo;            /* DLM_SCN_OP */
                struct res_migr_params _clm_rmpinfo;    /* DLM_RMIGR_OP */
                struct glob_migr_params _clm_gmpinfo;   /* DLM_GMIGR_OP */
                struct getstats _clm_statsinfo;         /* DLM_STATS */
                /* responses */
                struct lockstatus _clm_lockstatus;
                union dlm_rh _clm_handle;
                struct appreg _clm_appreg;
                /* holder info will go here for DLM_TEST */
        } clm_data;
#if defined(TIME_STATS)
        timestats_t stats;
#endif
        /* make sure this stuff is at the end */
        void *clm_oldtrans;             /* pointer to saved copy of old version */
};
6 Client (struct clm_client)
-----------------------------
struct clm_client
{
        clm_client_id_t cli_id;         /* client id */
        int cli_site;                   /* site of client */
        int cli_groupid;                /* group id of client if any */
        clm_client_type_t cli_type;     /* client type: PROCESS,
                                           GROUP or TRANSACTION */
        unsigned int cli_sequence;      /* client sequence */
        dlm_list_t sync_seq;            /* seq # of lock request */
        dlm_list_t cli_queue;           /* owned locks list */
        dlm_list_t cli_ast_pnd;         /* list of pending ASTs */
        /* Used for syscall interface (see cllockd/clm_cti.c). -jjd- */
        struct transaction *cli_reply;
        wait_queue_head_t cli_reply_event;
        int cli_migrcount;              /* # res w/locks migrating */
        unsigned int cli_state;         /* state code & flags */
        /* flags for nodes replying */
        struct noderesp cli_resp[MAXNODES];
#ifdef CONFIG_PROC_FS
        struct proc_dir_entry *proc_entry;
#endif /* CONFIG_PROC_FS */
};
7 Node List (struct node_list)
-------------------------------
The node_list contains a complete list of active nodes, along with the range
of lock manager versions present in the cluster (simultaneous use of different
OpenDLM versions is supported). Each node keeps one active copy of this list.
Each time the cluster membership changes, this list must be updated in each
node. Each node obtains the info about itself from its own cluster manager
(e.g. heartbeat) and its own dlm_version.h, then passes the info around to
all of the other nodes.
This structure is managed by dlm_base.c, which contains two static instances
of the structure:
s_node_list
s_new_node_list
At any given instant, one of these is the current node list, while the other
can be used offline for building an updated list when cluster membership
changes. Don't let the names fool you; they swap roles in a ping-pong fashion.
s_new_node_list can be the current node list, even for a long time (?? should
we change their names for clarity ??).
These static structures are accessed from various source files via two global
pointers. Unlike the structures they point to, the pointers always maintain
roles in line with their names:
node_list -- pointer to current node list
new_node_list -- pointer to offline node list
These pointers swap their contents each time cluster membership changes (see
the end of clm_change_topology(), in clm_info.c). The handoff happens
atomically with:
node_list = new_node_list;
The node_list structure is defined in src/kernel/dlmdk/dlm_clust.h:
struct node_list
{
        int low_version;        /* oldest lock manager version in cluster */
        int high_version;       /* newest lock manager version in cluster */
        int node_count;         /* # active nodes in cluster */
        struct node_item node[ MAXNODES ];      /* info on each node */
};
The node_item structure looks like:
struct node_item
{
        long nodeid;            /* cluster wide node ID (from cluster mgr) */
        struct in_addr saddr;   /* service IP address (one to use now) */
        int version;            /* lock mgr (clm) version #, see dlm_version.h */
#ifdef CONFIG_PROC_FS
        struct proc_dir_entry *proc_entry;      /* each node gets a /proc entry! */
#endif /* CONFIG_PROC_FS */
};
The nodeid is obtained from the cluster manager (e.g. heartbeat), and reflects
the cluster configuration file written by the cluster administrator.
Note that the nodeid does *not* correlate with the index of "node" in the
struct node_list. The NODEID_TO_INDEX macro searches the node_list to find the
index number within *this* node. Note that different nodes may have different
index numbers for the same node, since the node_list on each node is built
independently.
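A sketch of what NODEID_TO_INDEX amounts to (the real macro is in the sources;
rendering it as a function here is purely illustrative):

static int nodeid_to_index(long nodeid)
{
        int i;

        /* linear search of *this* node's current node_list */
        for (i = 0; i < node_list->node_count; i++)
                if (node_list->node[i].nodeid == nodeid)
                        return i;
        return -1;      /* nodeid is not an active member */
}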
The version # is established by the src/include/dlm_version.h file with which
OpenDLM was built for a given node.
8 Lock ID (and struct hashent)
-------------------------------
A lockid is a node-specific "handle" for a given lock. It includes the
following bit fields:
0x00001fff -- 13 bits for the site (node) ID
0xffffe000 -- 19 bits for a "generation" number
Within a given node, the generation number is unique for each lock. It is a
monotonically increasing number, incremented each time a new lock structure
(struct reclock) is allocated. See get_le(), in clm_alloc.c.
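Illustrative macros for this layout (the names are made up; the actual
definitions live in the OpenDLM sources):

#define LOCKID_SITE_MASK        0x00001fff      /* low 13 bits: site id */
#define LOCKID_GEN_SHIFT        13              /* high 19 bits: generation */

#define LOCKID_SITE(id)  ((id) & LOCKID_SITE_MASK)
#define LOCKID_GEN(id)   (((unsigned int)(id)) >> LOCKID_GEN_SHIFT)
#define LOCKID_MAKE(site, gen) \
        ((((unsigned int)(gen)) << LOCKID_GEN_SHIFT) | ((site) & LOCKID_SITE_MASK))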
The combination of node ID and generation number is sufficient for identifying
a reclock structure uniquely throughout the cluster, and it is much more
compact than using the resource name as part of the identifier. However,
representations of the same lock within different nodes (e.g. the resource
master and another node that has a lock on the resource) will have *different*
lockids in their respective copies of the reclock structure.
So, for different nodes to recognize the same lock via different lockids, each
node contains mapping tables, one for each of the other nodes in the cluster.
Each table is a hash table, with DEFLOCKSEGSIZE (16384) slots (changeable
with a command line option). These tables are managed by clm_lockid.c.
Each slot contains a struct hashent, to which other hashents can be linked,
if there is more than one lock in a slot:
struct hashent {
        struct hashent *next;   /* next entry chained into this slot */
        int remoteid;           /* lockid as known on the remote node */
        int localid;            /* corresponding lockid on this node */
};
?? There seem to be several identical definitions of this structure scattered
through the code base, in clm_lockid.c, clm_lockid.h, and dlm_kernel.h. Can we
consolidate these into one definition??
The tables are allocated on an as-needed basis, as are additional hashent
structures for attaching map entries to the slots' lists. Pointers to each
table are kept in the global idhash[MAXNODES] array in clm_lockid.c.
MAXNODES is defined in src/include/dlm_cluster.h as 8.
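A hedged sketch of how a remote-to-local lockid lookup over these tables might
proceed (the slot addressing and the not-found convention here are assumptions;
the real code is in clm_lockid.c):

static int remote_to_local(int nodeid, int remoteid)
{
        struct hashent *he =
                &idhash[nodeid][(unsigned int)remoteid % DEFLOCKSEGSIZE];

        /* walk the slot's chain looking for the remote node's lockid */
        for (; he != NULL; he = he->next)
                if (he->remoteid == remoteid)
                        return he->localid;
        return 0;       /* no mapping found (convention assumed) */
}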
9 Global Variables
-------------------
OpenDLM's current implementation makes use of a significant number of global
variables. In addition to the ones described in other sections of this
document, global variables are defined in *at least* the following locations:
src/kernel/dlmdk/clm_data.c -- among other things, holds values from command
line or config file.