OpenDLM Important Software Structures

Copyright 2004 The OpenDLM Project
Author: Ben Cahill (bc), ben.m.cahill@intel.com

1. Introduction
----------------

This document contains details of important structures within OpenDLM, and
information on their usage.

This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of structures in OpenDLM.

This document is not intended as a user guide to OpenDLM.  Look in the
OpenDLM WHATIS-odlm document for an overview of OpenDLM.  Look in the
OpenDLM HOWTO-install document for details of configuring and setting up
OpenDLM.

This document may contain inaccurate statements, based on the author's
limited understanding.  Please contact the author (bc) if you see anything
wrong or unclear.

1.1 Structure management and definitions
-----------------------------------------

For many structure types, OpenDLM has a source file dedicated to managing a
list or lists of that type of structure.  As an example, see section 2
below.

Unless otherwise noted, the structures discussed in this document are
defined in:

src/include/clmstructs.h

2 Resource (struct resource)
-----------------------------

A "resource" is a lockable object.  A resource is identified by its name
and family/type (UNIX or VMS), and is represented within a node as a
"struct resource".

If a cluster member node is the first in the cluster to use a resource,
that node becomes the "resource master".  As it and other nodes grab and
release locks on the resource, the master node keeps a master copy of the
resource structure, tracking information on all locks on the resource
throughout the cluster.  A master resource structure contains "0" in its
"master" field.

When a non-master node (i.e. not the first node to grab a lock on the
resource) uses the resource, it generates a local copy of the resource
structure.  This copy includes info on just those locks held within that
node, and contains a non-zero "master" value indicating the master node.
In src/kernel/dlmdk/clm_migrate.c, functions clmm_master2slave() and
clmm_slave2master() convert resource structures from master to slave, and
vice versa.

All resources are maintained within the "restab" data structure, which is
managed by source file src/kernel/dlmdk/clm_resource.c.  Take a look at
comments in that file (around line 175) for more information.

struct resource {
    int refcount;                 /* reference count */
    union dlm_rh rh;              /* resource handle */
    union dlm_rh mst_rh;          /* master resource handle */
    dlm_restype_t type;           /* resource type */
    char *name;                   /* dynamic resource name */
    short namelen;                /* length of resource name */
    short master;                 /* site id of master node */
    dlm_list_t lockqs[RLQ_MAX];   /* grant, convert, wait lists */
    char value[MAXLOCKVAL];       /* lock value block */
    int rsrc_flags;               /* state bits */
    clmm_info_t clminfo;          /* migration data */
    int lastnode;                 /* nodeid of last request */
    dlm_statistics_t stats;       /* resource statistics */
    union {
        struct resource *free_next;  /* next resource in freelist */
        struct resref *refs;         /* resource reference list */
    } reuse;
    void *migr_ack;               /* pointer to migration ack message */
};
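As an illustration of the master/local convention, here is a minimal
sketch (the helper name is made up; this function is not in the OpenDLM
source):

static int resource_is_mastered_here(struct resource *res)
{
    /* A master copy stores 0 in "master"; a local copy stores the
     * (non-zero) site id of the master node. */
    return res->master == 0;
}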
3 Directory (struct rl_info)
-----------------------------

A directory maps a resource to its master node.  The rl_info structure is
considerably simpler than a resource structure, because *all* that it needs
to do is to map a resource name/type to its master.

A directory entry must reside on a certain node within the cluster,
determined by a hash function of the resource name and the number of active
cluster members.  Using this "well known" (to all nodes) formula, any node
can calculate the node to query for directory information for a given
resource (see the sketch at the end of this section).

The directory node assignment has nothing to do with which nodes might use
a given resource.  Therefore, even though the info in a directory structure
is a subset of the info in a resource structure, it makes sense to have the
directory structure be independent and separate from (and smaller than) the
resource structure.

Directory entries are linked to slot lists within a hash table "rldbtab",
managed by code within src/kernel/dlmdk/clm_rldb.c.  Current code allocates
100000 slots, via rl_init(100000) in src/kernel/dlmdk/clm_main.c.

typedef struct rl_info {
    dlm_listnode_t link;     /* link into hash slot list */
    void *rl_name;           /* name field of the resource */
    ushort rl_namelen;       /* length of the name value */
    ushort rl_flags;         /* RLDB flags, see below */
    ushort rl_node;          /* node where resource is mastered */
    unsigned char rl_type;   /* type of resource (VMS vs UNIX) */
} rl_info_t;

When searching through the directory for a given entry, a match is found if
and only if the following fields match:

-- rl_name
-- rl_namelen
-- rl_type

The following flags are used in the rl_flags field:

RLINFO_DIRECTORY  0x01   /* set for a directory entry, cleared for cache entry */
RLINFO_FROZEN     0x02   /* entry is frozen - don't use */
RLINFO_TOUCHED    0x04   /* used for LRU reclaim algorithm */
RLINFO_LOCAL      0x08   /* set for local lock in reconfig */

An active directory entry has RLINFO_DIRECTORY set.  This means that the
entry is serving as the cluster-wide master copy of the directory entry,
and resides on the *directory node* for that resource.

Each node has a cache of local directory entries.  These are copies of
directory entries on other directory nodes.  They live in the same hash
table as the active directory entries, but they do not have
RLINFO_DIRECTORY set.  Directory cache entries may exist for two reasons:

-- a cache entry is created when *this* node gets directory info from a
   directory node (e.g. when requesting a lock on a new resource).
-- a cache entry is left over from migrating a directory from this node
   (the former directory node) to a new directory node after a change in
   cluster membership.

The RLINFO_TOUCHED flag is used for what seems to be a weak least-recently-
used algorithm.  With current code, this algorithm is used *only* when the
Linux slab cache allocator doesn't work.  (?? Could we eliminate this LRU
stuff ??)

The RLINFO_FROZEN flag seems to live up to the "don't use" comment; I can't
find anywhere in the code that RLINFO_FROZEN is used!  (Look for use of
RSRC_FRZEN, though??)  The purpose of the RLINFO_LOCAL flag is also an open
question (??).
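Here is a sketch of the kind of "well known" directory-node calculation
described above.  The multiplier-31 string hash is only a stand-in for
whatever hash function OpenDLM actually uses; the point is that the result
depends only on the resource name and the active node count, so every node
computes the same answer:

static int directory_node_index(const char *name, unsigned short namelen,
                                int active_node_count)
{
    unsigned int hash = 0;
    int i;

    for (i = 0; i < namelen; i++)
        hash = hash * 31 + (unsigned char)name[i];

    /* active_node_count is always > 0 (this node is active) */
    return hash % active_node_count;
}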
4 Lock (struct reclock)
------------------------

The reclock structure contains 3 significant structures describing a given
lock, along with a few other elements for managing the lock within a hash
table or the linked list queues in the resource structure.  Reclock
structures are managed via functions in src/kernel/dlmdk/clm_alloc.c.

struct reclock {
    dlm_listnode_t link;
    struct tq_node *tq_node;   /* ptr to timeout queue structure */
    struct routeinfo route;
    struct lockinfo lck;
    struct deadlock dl;
    unsigned char seg;         /* lock table segment index */
    unsigned char set;         /* timer set/reset */
};

4.1 Route Info (struct routeinfo)
----------------------------------

/*
 * This structure is contained in a lock record as well as a lock
 * request.  It holds the information necessary to complete a lock
 * request after it has been granted.
 */
struct routeinfo {
    short respflags;           /* response flags */
    short rt_master;           /* master id for response */
    void (*callback)(struct transaction *, struct transaction *);
                               /* completion func on secondary */
    union {
        int signo;             /* signal to use for notification */
        union dlm_rh lcl_rh;   /* local resource handle */
    } rt_reuse;
    int pollfd;                /* fd to use for notification */
    ast_handle_t asth;         /* for completion ASTs */
    ast_handle_t basth;        /* for blocking ASTs */
    ast_handle_t cbasth;       /* for convert pending blocking ASTs */
};

4.2 Lock Info (struct lockinfo)
--------------------------------

/*
 * This is the latest copy of the lockinfo structure, updated to contain
 * a transaction identifier.
 *
 * This structure contains lock-specific information - most of which
 * originates in the API.
 */
struct lockinfo {
    unsigned mode:3;           /* grant mode */
    unsigned reqmode:3;        /* request mode */
    unsigned bastmode:3;       /* mode passed back in blocking AST */
    unsigned site:16;          /* site # where request was made */
    int lk_flags;              /* flags from application */
    int pid;                   /* process ID that requested lock */
    uint remlockid;            /* remote lockid from secondary */
    uint lockid;               /* lockid */
    short state;               /* internal state flags */
    union dlm_rh rh;           /* resource handle */
    union reuse_li li_reuse;
    char lkvalue[MAXLOCKVAL];
    struct transaction *request; /* pointer to create request */
    u_int seqnum;              /* sequence number on queue */
    dlm_restype_t type;
    dlm_xid_t xid;             /* transaction id */
    clm_orphan_holder_t orphan_holder; /* holder of orphan lock */
};

4.3 Deadlock (struct deadlock)
-------------------------------

/* Structure to hold all info a reclock needs to keep about deadlock
   detection. */
struct deadlock {
    dlm_listnode_t tq;         /* links for the waiter-timeout queue */
    dlm_listnode_t cl;         /* links for the client/owner list */

    /* Timestamps for the deadlock timeout queue. */
    struct timeval timestamp;  /* time lock was added to queue */
    struct timeval checkstamp; /* time last checked for deadlock */

    /* Deadlock pass stamp to avoid redundant deadlock searches. */
    short deadlock_stamp;
};
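The two timestamps let the deadlock code avoid redundant searches.  As a
purely illustrative sketch (the helper and the interval constant below are
made up, and not part of OpenDLM), a periodic checker might use checkstamp
like this:

#define DEADLOCK_CHECK_SEC 5   /* made-up placeholder interval */

static int deadlock_check_due(const struct deadlock *dl,
                              const struct timeval *now)
{
    /* timestamp  = when the lock joined the timeout queue;
     * checkstamp = when it was last examined for deadlock. */
    return (now->tv_sec - dl->checkstamp.tv_sec) >= DEADLOCK_CHECK_SEC;
}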
5 Transaction (struct transaction)
-----------------------------------

struct transaction {
    int clm_client_space;      /* client in user or kernel space */
    u_long clm_prog;           /* desired service */
    u_long clm_vers;           /* service version */
    dlm_stats_t clm_status;    /* status of request */
    dlm_ops_t clm_type;        /* request type */
    ptype_t clm_direction;     /* request or response */
    short clm_locktype;        /* UNIX or VMS lock */
    unsigned int clm_sequence; /* transaction sequence # */
    unsigned int pti_sequence; /* transaction sequence # */
    int clm_authpid;           /* authorized process id of group */
    void *clm_sender;          /* to detect lost replies */

    /* routing information */
    struct routeinfo clm_route;
    int clm_pid;               /* process id of client */
    void *clm_next;            /* next trans or msg */

    union trans_data {
        /* requests */
        struct clm_regreq _clm_register;       /* DLM_REGISTER */
        struct lockreq _clm_lockreq;           /* LOCK */
        struct lockinfo _clm_lockinfo;         /* LOCK,CANCEL,UNLOCK */
        struct purgereq _clm_purgelocks;       /* DLM_PURGELOCKSPID */
        struct scninfo _clm_scninfo;           /* DLM_SCN_OP */
        struct res_migr_params _clm_rmpinfo;   /* DLM_RMIGR_OP */
        struct glob_migr_params _clm_gmpinfo;  /* DLM_GMIGR_OP */
        struct getstats _clm_statsinfo;        /* DLM_STATS */

        /* responses */
        struct lockstatus _clm_lockstatus;
        union dlm_rh _clm_handle;
        struct appreg _clm_appreg;
        /* holder info will go here for DLM_TEST */
    } clm_data;

#if defined(TIME_STATS)
    timestats_t stats;
#endif

    /* make sure this stuff is at the end */
    void *clm_oldtrans;        /* pointer to saved copy of old version */
};

6 Client (struct clm_client)
-----------------------------

struct clm_client {
    clm_client_id_t cli_id;    /* client id */
    int cli_site;              /* site of client */
    int cli_groupid;           /* group id of client if any */
    clm_client_type_t cli_type; /* client type: PROCESS, GROUP or TRANSACTION */
    unsigned int cli_sequence; /* client sequence */
    dlm_list_t sync_seq;       /* seq # of lock request */
    dlm_list_t cli_queue;      /* owned locks list */
    dlm_list_t cli_ast_pnd;    /* list of pending ASTs */

    /* Used for syscall interface (see cllockd/clm_cti.c). -jjd- */
    struct transaction *cli_reply;
    wait_queue_head_t cli_reply_event;

    int cli_migrcount;         /* # res w/locks migrating */
    unsigned int cli_state;    /* state code & flags */

    /* flags for nodes replying */
    struct noderesp cli_resp[MAXNODES];

#ifdef CONFIG_PROC_FS
    struct proc_dir_entry *proc_entry;
#endif /* CONFIG_PROC_FS */
};

7 Node List (struct node_list)
-------------------------------

The node_list contains a complete list of active nodes, and the range of
the versions of the lock managers within the cluster (simultaneous use of
different OpenDLM versions is supported).

Each node contains one active copy of this list.  Each time the cluster
membership changes, this list must be updated in each node.  Each node
obtains the info about itself from its own cluster manager (e.g. heartbeat)
and its own dlm_version.h, then passes the info around to all of the other
nodes.

This structure is managed by dlm_base.c, which contains two static
instances of the structure:

s_node_list
s_new_node_list

At any given instant, one of these is the current node list, while the
other can be used offline for building an updated list when cluster
membership changes.  Don't let the names fool you; they swap roles in a
ping-pong fashion.  s_new_node_list can be the current node list, even for
a long time (?? should we change their names for clarity ??).

These static structures are accessed from various source files via 2 global
pointers.  Unlike the structures they point to, these pointers always
maintain roles in line with their names:

node_list     -- pointer to current node list
new_node_list -- pointer to offline node list

These pointers swap their contents each time cluster membership changes
(see the end of clm_change_topology(), in clm_info.c).  The handoff happens
atomically with:

node_list = new_node_list;

The node_list structure is defined in src/kernel/dlmdk/dlm_clust.h:

struct node_list {
    int low_version;    /* oldest lock manager version in cluster */
    int high_version;   /* newest lock manager version in cluster */
    int node_count;     /* # active nodes in cluster */
    struct node_item node[MAXNODES]; /* info on each node */
};

The node_item structure looks like:

struct node_item {
    long nodeid;            /* cluster-wide node ID (from cluster mgr) */
    struct in_addr saddr;   /* service IP address (one to use now) */
    int version;            /* lock mgr (clm) version #, see dlm_version.h */
#ifdef CONFIG_PROC_FS
    struct proc_dir_entry *proc_entry; /* each node gets a /proc entry! */
#endif /* CONFIG_PROC_FS */
};

The nodeid is obtained from the cluster manager (e.g. heartbeat), and
reflects the cluster configuration file written by the cluster
administrator.  Note that the nodeid does *not* correlate with the index of
"node" in the struct node_list.  The NODEID_TO_INDEX macro searches the
node_list to find the index number within *this* node (see the sketch
below).  Note that different nodes may have different index numbers for the
same node, since the node_list on each node is built independently.

The version # is established by the src/include/dlm_version.h file with
which OpenDLM was built for a given node.
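Here is a sketch of the kind of search NODEID_TO_INDEX performs.  The
function form below is illustrative only; the real implementation is a
macro in the OpenDLM sources:

static int nodeid_to_index(const struct node_list *nl, long nodeid)
{
    int i;

    /* Linear search of the current node_list for a cluster-wide
     * nodeid; returns this node's local index for it, or -1. */
    for (i = 0; i < nl->node_count; i++)
        if (nl->node[i].nodeid == nodeid)
            return i;

    return -1;
}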
8 Lock ID (and struct hashent)
-------------------------------

A lockid is a node-specific "handle" for a given lock.  It includes the
following bit fields:

0x00001fff -- 13 bits for the site (node) ID
0xFFFFE000 -- 19 bits for a "generation" number

Within a given node, the generation number is unique for each lock.  It is
a monotonically increasing number, incremented each time a new lock
structure (struct reclock) is allocated.  See get_le(), in clm_alloc.c.

The combination of node ID and generation number is sufficient for
identifying a reclock structure uniquely throughout the cluster.  And, it
is much more compact than using the resource name as part of the
identifier.

However, representations of the same lock within different nodes (e.g. the
resource master and another node that has a lock on the resource) will have
*different* lockids in their respective copies of the reclock structure.
So, for different nodes to recognize the same lock via different lockids,
each node contains mapping tables, one for each of the other nodes in the
cluster.

Each table is a hash table, with DEFLOCKSEGSIZE (16384) slots (changeable
with a command line option).  These tables are managed by clm_lockid.c.
Each slot contains a struct hashent, to which other hashents can be linked,
if there is more than one lock in a slot:

struct hashent {
    struct hashent *next;
    int remoteid;
    int localid;
};

?? There seem to be several seemingly identical definitions of this
structure scattered through the code base, in clm_lockid.c, clm_lockid.h,
and dlm_kernel.h.  Can we consolidate these into one definition??

The tables are allocated on an as-needed basis, as are additional hashent
structures for attaching map entries to the slots' lists.  Pointers to each
table are kept in the global idhash[MAXNODES] array in clm_lockid.c.
MAXNODES is defined in src/include/dlm_cluster.h as 8.
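Given the bit layout described above, packing and unpacking a lockid looks
like the following sketch.  The macro and helper names are made up for
illustration; only the masks and the 13-bit shift come from the layout
documented above:

#define LOCKID_SITE_MASK 0x00001fffU  /* 13 bits: site (node) ID */
#define LOCKID_GEN_MASK  0xffffe000U  /* 19 bits: generation number */
#define LOCKID_GEN_SHIFT 13

/* Hypothetical helpers, not part of the OpenDLM source. */
static unsigned int lockid_make(unsigned int site, unsigned int generation)
{
    return ((generation << LOCKID_GEN_SHIFT) & LOCKID_GEN_MASK) |
           (site & LOCKID_SITE_MASK);
}

static unsigned int lockid_site(unsigned int lockid)
{
    return lockid & LOCKID_SITE_MASK;
}

static unsigned int lockid_generation(unsigned int lockid)
{
    return (lockid & LOCKID_GEN_MASK) >> LOCKID_GEN_SHIFT;
}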
9 Global Variables
-------------------

OpenDLM's current implementation makes use of a significant number of
global variables.  In addition to the ones described in other sections of
this document, global variables are defined in *at least* the following
locations:

src/kernel/dlmdk/clm_data.c -- among other things, holds values from the
command line or config file.