OpenDLM_Structures (Feb 19 2004)



             OpenDLM Important Software Structures

Copyright 2004 The OpenDLM Project

Author:
Ben Cahill (bc), ben.m.cahill@intel.com

1.  Introduction
----------------
This document contains details of important structures within OpenDLM, and
information on their usage.

This document is aimed at developers, potential developers, students, and
anyone who wants to know about the details of structures in OpenDLM.

This document is not intended as a user guide to OpenDLM.  See the OpenDLM
WHATIS-odlm document for an overview of OpenDLM, and the OpenDLM
HOWTO-install document for details of configuring and setting up OpenDLM.

This document may contain inaccurate statements, based on the author's limited
understanding.  Please contact the author (bc) if you see anything wrong or
unclear.


1.1  Structure management and definitions
-----------------------------------------
For many structure types, OpenDLM has a source file dedicated to managing a list
or lists of that type of structure.  As an example, see section 2 below.

Unless otherwise noted, the structures discussed in this document are defined
in:

src/include/clmstructs.h


2  Resource (struct resource)
-----------------------------
A "resource" is a lockable object.  A resource is identified by its name
and family/type (UNIX or VMS), and is represented within a node as a "struct
resource".

If a cluster member node is the first in the cluster to use a resource, that
node becomes the "resource master".  As it and other nodes grab and release
locks on the resource, the master node keeps a master copy of the resource
structure, tracking information on all locks on the resource throughout the
cluster.  A master copy contains 0 in its "master" field.

When a non-master node (i.e. any node other than the first to grab a lock on
the resource) uses the resource, it generates a local copy of the resource
structure.  This copy includes info on just those locks held within that
node, and its "master" field contains the (non-zero) site id of the master
node.

In src/kernel/dlmdk/clm_migrate.c, functions clmm_master2slave() and
clmm_slave2master() convert resource structures from master to slave,
and vice versa.

All resources are maintained within the "restab" data structure, which is
managed by source file src/kernel/dlmdk/clm_resource.c.  Take a look
at comments in that file (around line 175) for more information.


struct resource {
	int	refcount;		/* reference count 		*/
	union dlm_rh rh;		/* resource handle 		*/
	union dlm_rh	mst_rh;		/* master resource handle       */
	dlm_restype_t type;		/* resource type 		*/
	char	*name;			/* dynamic resource name 	*/
	short 	namelen;		/* length of resource name 	*/
	short	master;			/* site id of master node       */
	dlm_list_t   lockqs[ RLQ_MAX ];	/* grant, convert, wait lists */
	char 	value[MAXLOCKVAL];	/* lock value block 		*/
	int 	rsrc_flags;		/* state bits 			*/
	clmm_info_t clminfo;		/* migration data 		*/
	int     lastnode;               /* nodeid of last request       */
	dlm_statistics_t stats;         /* resource statistics          */
	union {
		struct resource	*free_next;	/* next resource in freelist */
		struct resref *refs;		/* resource reference list */
	} reuse;
	void	*migr_ack;		/* pointer to migration ack message */
};


3  Directory (struct rl_info)
-----------------------------
A directory maps a resource to its master node.  The rl_info structure is
considerably simpler than a resource structure, because *all* it needs to do
is map a resource name/type to its master.

A directory entry must reside on a certain node within the cluster, determined
by a hash function of the resource name, and the number of active cluster
members.  Using this "well known" (to all nodes) formula, any node can
calculate the node to query for directory information for a given resource.
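
For illustration, a sketch of the general shape of such a formula (this is
a guess; dir_node_for() is a hypothetical name, and the real hash function
in the OpenDLM source may differ):

static int dir_node_for(const char *name, int namelen, int node_count)
{
        unsigned int hash = 0;
        int i;

        /* every node computes the same hash over the resource name ... */
        for (i = 0; i < namelen; i++)
                hash = (hash * 31) + (unsigned char)name[i];

        /* ... and maps it onto the current set of active members */
        return (int)(hash % node_count);
}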

The directory node assignment has nothing to do with which nodes might use
a given resource.  Therefore, even though the info in a directory structure
is a subset of the info in a resource structure, it makes sense to have the
directory structure be independent and separate from (and smaller than)
the resource structure.

Directory entries are linked to slot lists within a hash table "rldbtab",
managed by code within src/kernel/dlmdk/clm_rldb.c.  Current code allocates
100000 slots, via rl_init(100000) in src/kernel/dlmdk/clm_main.c.

typedef struct rl_info {
	dlm_listnode_t  link;       /* link into hash slot list */
	void           *rl_name;    /* name field of the resource */
	ushort          rl_namelen; /* length of the name value */
	ushort          rl_flags;   /* RLDB flags, see below */
	ushort          rl_node;    /* node where resource is mastered */
	unsigned char   rl_type;    /* type of resource (VMS vs UNIX) */
} rl_info_t;

When searching through the directory for a given entry, a match is found
if and only if the following fields match (see the sketch after this list):

-- rl_name
-- rl_namelen
-- rl_type
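
For illustration, the match test might look like this (rl_match() is a
hypothetical helper; the real lookup code lives in
src/kernel/dlmdk/clm_rldb.c):

#include <string.h>     /* memcmp */

static int rl_match(const struct rl_info *e, const void *name,
                    ushort namelen, unsigned char type)
{
        return (e->rl_namelen == namelen &&
                e->rl_type == type &&
                memcmp(e->rl_name, name, namelen) == 0);
}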

The following flags are used in the rl_flags field:

RLINFO_DIRECTORY 0x01  /* set for a directory entry, cleared for cache entry */
RLINFO_FROZEN    0x02  /* entry is frozen - don't use */
RLINFO_TOUCHED   0x04  /* used for LRU reclaim algorithm */
RLINFO_LOCAL     0x08  /* set for local lock in reconfig */

An active directory entry has RLINFO_DIRECTORY set.  This means that the entry
is serving as the cluster-wide master copy of the directory entry, and resides
on the *directory node* for that resource.

Each node has a cache of local directory entries, which are copies of
directory entries mastered on other directory nodes.  They reside in the
same hash table as the active directory entries, but do not have
RLINFO_DIRECTORY set.
Directory cache entries may exist for two reasons:

-- cache entry is created when *this* node gets directory info from a
   directory node (e.g. when requesting a lock on a new resource).
-- cache entry is left over from migrating a directory from this node
   (the former directory node) to a new directory node after a change in
   cluster membership.

The RLINFO_TOUCHED flag is used for what seems to be a weak least-recently-
used (LRU) algorithm.  With current code, this algorithm is used *only* when
the Linux slab cache allocator doesn't work.  (?? Could we eliminate this
LRU stuff ??)

The RLINFO_FROZEN flag seems to live up to the "don't use" comment; I can't
find anywhere in the code where RLINFO_FROZEN is used!  (Look for use of
RSRC_FRZEN, though??)

?? The RLINFO_LOCAL flag's usage is unclear; the comment says only that it
is set for a local lock during reconfiguration ??



4  Lock (struct reclock)
------------------------
The reclock structure contains three significant structures describing a
given lock, along with a few other elements for managing the lock within a
hash table or within the linked-list queues in the resource structure.

Reclock structures are managed via functions in src/kernel/dlmdk/clm_alloc.c.


struct reclock {
	dlm_listnode_t   link;
	struct tq_node	*tq_node;	/* ptr to timeout queue structure */
	struct routeinfo route;
	struct lockinfo	lck;
	struct deadlock dl;
	unsigned char	seg;		/* lock table segment index */
	unsigned char	set;		/* timer set/reset */
};


4.1  Route Info (struct routeinfo)
----------------------------------
/*
 * This structure is contained in a lock record as well as a lock
 * request.  It holds the information necessary to complete a lock
 * request after it has been granted.
 */
struct routeinfo {
	short	respflags;		/* response flags */
	short	rt_master;		/* master id for response */
	void	(*callback)(struct transaction *, struct transaction *);
					/* completion func on secondary */
	union {
		int 		signo;	/* signal to use for notification */
		union dlm_rh	lcl_rh;	/* local resource handle */
	} rt_reuse;
	int pollfd;		        /* fd to use for notification */
	ast_handle_t asth;		/* for completion ASTs */
	ast_handle_t basth;		/* for blocking ASTs */
	ast_handle_t cbasth;		/* for convert pending blocking ASTs */
};


4.2  Lock Info (struct lockinfo)
--------------------------------
/*
 * This is the latest copy of the lockinfo structure updated to contain
 * a transaction identifier.
 *
 * This structure contains lock specific information - most of which
 * originates in the API
 */
struct lockinfo {
	unsigned 	mode:3;		/* grant mode */
	unsigned	reqmode:3;	/* request mode */
	unsigned 	bastmode:3;	/* mode passed back in blocking AST */
	unsigned	site:16;	/* site # where request was made */	
	int 		lk_flags;	/* flags from application */
	int		pid;		/* process ID that requested lock */
	uint		remlockid;	/* remote lockid from secondary */
	uint		lockid;		/* lockid */
	short 		state;		/* internal state flags */
	union dlm_rh	rh;		/* resource handle */
	union reuse_li	li_reuse;
	char 		lkvalue[MAXLOCKVAL];
	struct transaction	*request;/* pointer to create request */
	u_int		seqnum;		/* sequence number on queue */
	dlm_restype_t   type;
	dlm_xid_t	xid;		/* transaction id */
	clm_orphan_holder_t	orphan_holder; /* holder of orphan lock */
};


4.3  Deadlock (struct deadlock)
-------------------------------
/*  Structure to hold all info a reclock needs to keep
    about deadlock detection.  */

struct deadlock {
	dlm_listnode_t   tq;    /*  Links for the waiter-timeout queue.  */
	dlm_listnode_t   cl;    /*  Links for the client/owner list.  */
	/*  Timestamp for Deadlock timeout queue.  */
	struct timeval timestamp;	/* time lock was added to queue */
	struct timeval checkstamp;	/* time last checked for deadlock */
	/*  Deadlock pass stamp to avoid redundant deadlock searches.  */
	short deadlock_stamp;
};


5  Transaction (struct transaction)
-----------------------------------
A transaction carries a single lock manager request or response (see the
clm_direction field), along with the routing information needed to complete
it and a union of request- or response-specific data.

struct transaction {
  int           clm_client_space; /* Client in user or kernel space */
  u_long        clm_prog;         /* desired service */
  u_long        clm_vers;         /* service version */
  dlm_stats_t   clm_status;       /* status of request */
  dlm_ops_t     clm_type;         /* request type */
  ptype_t       clm_direction;    /* request or response */
  short         clm_locktype;     /* UNIX or VMS lock */
  unsigned int  clm_sequence;     /* transaction sequence # */
  unsigned int  pti_sequence;     /* transaction sequence # */
  int           clm_authpid;      /* authorized process id of group */
  void         *clm_sender;       /* to detect lost replies */

  /* routing information */
  struct routeinfo clm_route;
  int           clm_pid;          /* process id of client */
  void         *clm_next;         /* next trans or msg */

  union trans_data {
    /* requests */
    struct clm_regreq       _clm_register;   /* DLM_REGISTER */
    struct lockreq          _clm_lockreq;    /* LOCK */
    struct lockinfo         _clm_lockinfo;   /* LOCK,CANCEL,UNLOCK*/
    struct purgereq         _clm_purgelocks; /* DLM_PURGELOCKSPID */
    struct scninfo          _clm_scninfo;    /* DLM_SCN_OP */
    struct res_migr_params  _clm_rmpinfo;    /* DLM_RMIGR_OP */
    struct glob_migr_params _clm_gmpinfo;    /* DLM_GMIGR_OP */
    struct getstats         _clm_statsinfo;  /* DLM_STATS */

    /* responses */
    struct lockstatus       _clm_lockstatus;
    union dlm_rh            _clm_handle;
    struct appreg           _clm_appreg;
    /* holder info will go here for DLM_TEST */
  } clm_data;

#if defined(TIME_STATS)
  timestats_t	stats;
#endif
  /* make sure this stuff is at the end */
  void	*clm_oldtrans;	/* pointer to saved copy of old version */
};



6  Client (struct clm_client)
-----------------------------
A clm_client structure represents a lock client -- a process, group, or
transaction (see cli_type) -- and tracks the locks the client owns and its
pending ASTs.

struct clm_client
{
  clm_client_id_t   cli_id;             /* client id */
  int               cli_site;           /* site of client */
  int               cli_groupid;        /* group id of client if any */
  clm_client_type_t cli_type;           /* client type: PROCESS, 
                                             GROUP or TRANSACTION */
  unsigned int      cli_sequence;       /* client sequence */
  dlm_list_t        sync_seq;           /* seq # of lock request */
  dlm_list_t        cli_queue;          /* owned locks list */
  dlm_list_t        cli_ast_pnd;        /* list of pending ASTs */

  /* Used for syscall interface (see cllockd/clm_cti.c).  -jjd- */
  struct transaction *cli_reply;

  wait_queue_head_t cli_reply_event;
  int               cli_migrcount ;     /* # res w/locks migrating */
  unsigned int      cli_state ;         /* state code & flags */
  /* flags for nodes replying */
  struct noderesp   cli_resp[MAXNODES] ;

#ifdef CONFIG_PROC_FS
  struct proc_dir_entry * proc_entry;
#endif /* CONFIG_PROC_FS */
};


7  Node List (struct node_list)
-------------------------------
The node_list contains a complete list of active nodes, along with the range
of lock manager versions present in the cluster (simultaneous use of
different OpenDLM versions is supported).  Each node maintains one active
copy of this list.

Each time the cluster membership changes, this list must be updated in each
node.  Each node obtains the info about itself from its own cluster manager
(e.g. heartbeat) and its own dlm_version.h, then passes the info around to
all of the other nodes.

This structure is managed by dlm_base.c, which contains two static instances
of the structure:

s_node_list
s_new_node_list

At any given instant, one of these is the current node list, while the other
can be used offline for building an updated list when cluster membership
changes.  Don't let the names fool you; they swap roles in a ping-pong fashion.
s_new_node_list can be the current node list, even for a long time (?? should
we change their names for clarity ??).

These static structures are accessed from various source files via two
global pointers.  Unlike the structures they point to, the pointers always
maintain roles in line with their names:

node_list -- pointer to current node list
new_node_list -- pointer to offline node list

These pointers swap their contents each time cluster membership changes (see
the end of clm_change_topology(), in clm_info.c).  The handoff happens
atomically with:

node_list = new_node_list;
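
For illustration, a sketch of the ping-pong handoff
(publish_new_node_list() is a hypothetical name; the real code is at the
end of clm_change_topology(), and any locking against concurrent readers
is omitted here):

struct node_list *node_list     = &s_node_list;      /* current list */
struct node_list *new_node_list = &s_new_node_list;  /* offline list */

static void publish_new_node_list(void)
{
        struct node_list *old = node_list;

        node_list = new_node_list;      /* the atomic handoff */
        new_node_list = old;            /* old list becomes the offline copy */
}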

The node_list structure is defined in src/kernel/dlmdk/dlm_clust.h:

struct node_list
{
    int low_version;       /* oldest lock manager version in cluster */
    int high_version;      /* newest lock manager version in cluster */
    int node_count;        /* # active nodes in cluster */
    struct node_item   node[ MAXNODES ];  /* info on each node */
} ;


The node_item structure looks like:

struct node_item
{
    long nodeid ;          /* cluster wide node ID (from cluster mgr) */
    struct in_addr saddr ; /* service IP address (one to use now) */
    int version ;          /* lock mgr (clm) version #, see dlm_version.h */
#ifdef CONFIG_PROC_FS
    struct proc_dir_entry * proc_entry; /* each node gets a /proc entry! */
#endif /* CONFIG_PROC_FS */
};

The nodeid is obtained from the cluster manager (e.g. heartbeat), and
reflects the cluster configuration file written by the cluster administrator.

Note that the nodeid does *not* correlate with the index of "node" in the
struct node_list.  The NODEID_TO_INDEX macro searches the node_list to find the
index number within *this* node.  Note that different nodes may have different
index numbers for the same node, since the node_list on each node is built
independently.
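
For illustration, a lookup along the lines of NODEID_TO_INDEX might look
like this (nodeid_to_index() is a hypothetical function; the real macro is
defined in the OpenDLM source):

static int nodeid_to_index(const struct node_list *nl, long nodeid)
{
        int i;

        for (i = 0; i < nl->node_count; i++)
                if (nl->node[i].nodeid == nodeid)
                        return i;

        return -1;      /* nodeid is not an active cluster member */
}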

The version # is established by the src/include/dlm_version.h file with which
OpenDLM was built for a given node.


8  Lock ID (and struct hashent)
-------------------------------
A lockid is a node-specific "handle" for a given lock.  It includes the
following bit fields:

0x00001FFF -- 13 bits for the site (node) ID
0xFFFFE000 -- 19 bits for a "generation" number

Within a given node, the generation number is unique for each lock.  It is a
monotonically increasing number, incremented each time a new lock structure
(struct reclock) is allocated.  See get_le(), in clm_alloc.c.
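
For illustration, the bit layout described above could be expressed like
this (the macro and function names are illustrative, not from the OpenDLM
source):

#define LKID_SITE_MASK  0x00001FFF      /* low 13 bits: site (node) ID */
#define LKID_GEN_SHIFT  13              /* high 19 bits: generation #  */

static unsigned int make_lockid(unsigned int site, unsigned int generation)
{
        return (generation << LKID_GEN_SHIFT) | (site & LKID_SITE_MASK);
}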

The combination of node ID and generation number is sufficient to identify a
reclock structure uniquely throughout the cluster, and it is much more
compact than using the resource name as part of the identifier.  However,
representations of the same lock within different nodes (e.g. the resource
master and another node that has a lock on the resource) will have
*different* lockids in their respective copies of the reclock structure.

So, for different nodes to recognize the same lock via different lockids, each
node contains mapping tables, one for each of the other nodes in the cluster.

Each table is a hash table, with DEFLOCKSEGSIZE (16384) slots (changeable
with a command line option).  These tables are managed by clm_lockid.c.

Each slot contains a struct hashent, to which other hashents can be linked,
if there is more than one lock in a slot:

struct hashent {
	struct hashent *next;
	int remoteid;
	int localid;
};
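
For illustration, a remote-to-local lockid lookup over such a table might
look like this (remote_to_local() and the slot calculation are assumptions;
see clm_lockid.c for the real code):

static int remote_to_local(struct hashent **table, int remoteid)
{
        struct hashent *he;

        /* pick the slot, then walk the chain of hashents linked to it */
        for (he = table[remoteid % DEFLOCKSEGSIZE]; he != NULL; he = he->next)
                if (he->remoteid == remoteid)
                        return he->localid;

        return 0;       /* no mapping found */
}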

?? There are several seemingly identical definitions of this structure
scattered through the code base, in clm_lockid.c, clm_lockid.h, and
dlm_kernel.h.  Can we consolidate these into one definition ??

The tables are allocated on an as-needed basis, as are additional hashent
structures for attaching map entries to the slots' lists.  Pointers to each
table are kept in the global idhash[MAXNODES] array in clm_lockid.c.
MAXNODES is defined in src/include/dlm_cluster.h as 8.


9  Global Variables
-------------------
OpenDLM's current implementation makes use of a significant number of global
variables.  In addition to the ones described in other sections of this
document, global variables are defined in *at least* the following locations:

src/kernel/dlmdk/clm_data.c -- among other things, holds values from command
    line or config file.

