| Reference counting in pnfs: |
| ========================== |
| |
| The are several inter-related caches. We have layouts which can |
| reference multiple devices, each of which can reference multiple data servers. |
| Each data server can be referenced by multiple devices. Each device |
| can be referenced by multiple layouts. To keep all of this straight, |
| we need to reference count. |
| |
| |
| struct pnfs_layout_hdr |
| ---------------------- |
| The on-the-wire command LAYOUTGET corresponds to struct |
| pnfs_layout_segment, usually referred to by the variable name lseg. |
| Each nfs_inode may hold a pointer to a cache of of these layout |
| segments in nfsi->layout, of type struct pnfs_layout_hdr. |
| |
| We reference the header for the inode pointing to it, across each |
| outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN, |
| LAYOUTCOMMIT), and for each lseg held within. |
| |
| Each header is also (when non-empty) put on a list associated with |
| struct nfs_client (cl_layouts). Being put on this list does not bump |
| the reference count, as the layout is kept around by the lseg that |
| keeps it in the list. |
| |
| deviceid_cache |
| -------------- |
| lsegs reference device ids, which are resolved per nfs_client and |
| layout driver type. The device ids are held in a RCU cache (struct |
| nfs4_deviceid_cache). The cache itself is referenced across each |
| mount. The entries (struct nfs4_deviceid) themselves are held across |
| the lifetime of each lseg referencing them. |
| |
| RCU is used because the deviceid is basically a write once, read many |
| data structure. The hlist size of 32 buckets needs better |
| justification, but seems reasonable given that we can have multiple |
| deviceid's per filesystem, and multiple filesystems per nfs_client. |
| |
| The hash code is copied from the nfsd code base. A discussion of |
| hashing and variations of this algorithm can be found at: |
| http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809 |
| |
| data server cache |
| ----------------- |
| file driver devices refer to data servers, which are kept in a module |
| level cache. Its reference is held over the lifetime of the deviceid |
| pointing to it. |
| |
| lseg |
| ---- |
| lseg maintains an extra reference corresponding to the NFS_LSEG_VALID |
| bit which holds it in the pnfs_layout_hdr's list. When the final lseg |
| is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED |
| bit is set, preventing any new lsegs from being added. |
| |
| layout drivers |
| -------------- |
| |
| PNFS utilizes what is called layout drivers. The STD defines 3 basic |
| layout types: "files" "objects" and "blocks". For each of these types |
| there is a layout-driver with a common function-vectors table which |
| are called by the nfs-client pnfs-core to implement the different layout |
| types. |
| |
| Files-layout-driver code is in: fs/nfs/nfs4filelayout.c && nfs4filelayoutdev.c |
| Objects-layout-deriver code is in: fs/nfs/objlayout/.. directory |
| Blocks-layout-deriver code is in: fs/nfs/blocklayout/.. directory |
| |
| objects-layout setup |
| -------------------- |
| |
| As part of the full STD implementation the objlayoutdriver.ko needs, at times, |
| to automatically login to yet undiscovered iscsi/osd devices. For this the |
| driver makes up-calles to a user-mode script called *osd_login* |
| |
| The path_name of the script to use is by default: |
| /sbin/osd_login. |
| This name can be overridden by the Kernel module parameter: |
| objlayoutdriver.osd_login_prog |
| |
| If Kernel does not find the osd_login_prog path it will zero it out |
| and will not attempt farther logins. An admin can then write new value |
| to the objlayoutdriver.osd_login_prog Kernel parameter to re-enable it. |
| |
| The /sbin/osd_login is part of the nfs-utils package, and should usually |
| be installed on distributions that support this Kernel version. |
| |
| The API to the login script is as follows: |
| Usage: $0 -u <URI> -o <OSDNAME> -s <SYSTEMID> |
| Options: |
| -u target uri e.g. iscsi://<ip>:<port> |
| (allways exists) |
| (More protocols can be defined in the future. |
| The client does not interpret this string it is |
| passed unchanged as received from the Server) |
| -o osdname of the requested target OSD |
| (Might be empty) |
| (A string which denotes the OSD name, there is a |
| limit of 64 chars on this string) |
| -s systemid of the requested target OSD |
| (Might be empty) |
| (This string, if not empty is always an hex |
| representation of the 20 bytes osd_system_id) |
| |
| blocks-layout setup |
| ------------------- |
| |
| TODO: Document the setup needs of the blocks layout driver |