| |
| Making Filesystems Exportable |
| ============================= |
| |
| Most filesystem operations require a dentry (or two) as a starting |
| point. Local applications have a reference-counted hold on suitable |
| dentrys via open file descriptors or cwd/root. However remote |
| applications that access a filesystem via a remote filesystem protocol |
| such as NFS may not be able to hold such a reference, and so need a |
| different way to refer to a particular dentry. As the alternative |
| form of reference needs to be stable across renames, truncates, and |
| server-reboot (among other things, though these tend to be the most |
| problematic), there is no simple answer like 'filename'. |
| |
| The mechanism discussed here allows each filesystem implementation to |
| specify how to generate an opaque (out side of the filesystem) byte |
| string for any dentry, and how to find an appropriate dentry for any |
| given opaque byte string. |
| This byte string will be called a "filehandle fragment" as it |
| corresponds to part of an NFS filehandle. |
| |
| A filesystem which supports the mapping between filehandle fragments |
| and dentrys will be termed "exportable". |
| |
| |
| |
| Dcache Issues |
| ------------- |
| |
| The dcache normally contains a proper prefix of any given filesystem |
| tree. This means that if any filesystem object is in the dcache, then |
| all of the ancestors of that filesystem object are also in the dcache. |
| As normal access is by filename this prefix is created naturally and |
| maintained easily (by each object maintaining a reference count on |
| its parent). |
| |
| However when objects are included into the dcache by interpreting a |
| filehandle fragment, there is no automatic creation of a path prefix |
| for the object. This leads to two related but distinct features of |
| the dcache that are not needed for normal filesystem access. |
| |
| 1/ The dcache must sometimes contain objects that are not part of the |
| proper prefix. i.e that are not connected to the root. |
| 2/ The dcache must be prepared for a newly found (via ->lookup) directory |
| to already have a (non-connected) dentry, and must be able to move |
| that dentry into place (based on the parent and name in the |
| ->lookup). This is particularly needed for directories as |
| it is a dcache invariant that directories only have one dentry. |
| |
| To implement these features, the dcache has: |
| |
| a/ A dentry flag DCACHE_DISCONNECTED which is set on |
| any dentry that might not be part of the proper prefix. |
| This is set when anonymous dentries are created, and cleared when a |
| dentry is noticed to be a child of a dentry which is in the proper |
| prefix. |
| |
| b/ A per-superblock list "s_anon" of dentries which are the roots of |
| subtrees that are not in the proper prefix. These dentries, as |
| well as the proper prefix, need to be released at unmount time. As |
| these dentries will not be hashed, they are linked together on the |
| d_hash list_head. |
| |
| c/ Helper routines to allocate anonymous dentries, and to help attach |
| loose directory dentries at lookup time. They are: |
| d_alloc_anon(inode) will return a dentry for the given inode. |
| If the inode already has a dentry, one of those is returned. |
| If it doesn't, a new anonymous (IS_ROOT and |
| DCACHE_DISCONNECTED) dentry is allocated and attached. |
| In the case of a directory, care is taken that only one dentry |
| can ever be attached. |
| d_splice_alias(inode, dentry) will make sure that there is a |
| dentry with the same name and parent as the given dentry, and |
| which refers to the given inode. |
| If the inode is a directory and already has a dentry, then that |
| dentry is d_moved over the given dentry. |
| If the passed dentry gets attached, care is taken that this is |
| mutually exclusive to a d_alloc_anon operation. |
| If the passed dentry is used, NULL is returned, else the used |
| dentry is returned. This corresponds to the calling pattern of |
| ->lookup. |
| |
| |
| Filesystem Issues |
| ----------------- |
| |
| For a filesystem to be exportable it must: |
| |
| 1/ provide the filehandle fragment routines described below. |
| 2/ make sure that d_splice_alias is used rather than d_add |
| when ->lookup finds an inode for a given parent and name. |
| Typically the ->lookup routine will end: |
| if (inode) |
| return d_splice(inode, dentry); |
| d_add(dentry, inode); |
| return NULL; |
| } |
| |
| |
| |
| A file system implementation declares that instances of the filesystem |
| are exportable by setting the s_export_op field in the struct |
| super_block. This field must point to a "struct export_operations" |
| struct which could potentially be full of NULLs, though normally at |
| least get_parent will be set. |
| |
| The primary operations are decode_fh and encode_fh. |
| decode_fh takes a filehandle fragment and tries to find or create a |
| dentry for the object referred to by the filehandle. |
| encode_fh takes a dentry and creates a filehandle fragment which can |
| later be used to find/create a dentry for the same object. |
| |
| decode_fh will probably make use of "find_exported_dentry". |
| This function lives in the "exportfs" module which a filesystem does |
| not need unless it is being exported. So rather that calling |
| find_exported_dentry directly, each filesystem should call it through |
| the find_exported_dentry pointer in it's export_operations table. |
| This field is set correctly by the exporting agent (e.g. nfsd) when a |
| filesystem is exported, and before any export operations are called. |
| |
| find_exported_dentry needs three support functions from the |
| filesystem: |
| get_name. When given a parent dentry and a child dentry, this |
| should find a name in the directory identified by the parent |
| dentry, which leads to the object identified by the child dentry. |
| If no get_name function is supplied, a default implementation is |
| provided which uses vfs_readdir to find potential names, and |
| matches inode numbers to find the correct match. |
| |
| get_parent. When given a dentry for a directory, this should return |
| a dentry for the parent. Quite possibly the parent dentry will |
| have been allocated by d_alloc_anon. |
| The default get_parent function just returns an error so any |
| filehandle lookup that requires finding a parent will fail. |
| ->lookup("..") is *not* used as a default as it can leave ".." |
| entries in the dcache which are too messy to work with. |
| |
| get_dentry. When given an opaque datum, this should find the |
| implied object and create a dentry for it (possibly with |
| d_alloc_anon). |
| The opaque datum is whatever is passed down by the decode_fh |
| function, and is often simply a fragment of the filehandle |
| fragment. |
| decode_fh passes two datums through find_exported_dentry. One that |
| should be used to identify the target object, and one that can be |
| used to identify the object's parent, should that be necessary. |
| The default get_dentry function assumes that the datum contains an |
| inode number and a generation number, and it attempts to get the |
| inode using "iget" and check it's validity by matching the |
| generation number. A filesystem should only depend on the default |
| if iget can safely be used this way. |
| |
| If decode_fh and/or encode_fh are left as NULL, then default |
| implementations are used. These defaults are suitable for ext2 and |
| extremely similar filesystems (like ext3). |
| |
| The default encode_fh creates a filehandle fragment from the inode |
| number and generation number of the target together with the inode |
| number and generation number of the parent (if the parent is |
| required). |
| |
| The default decode_fh extract the target and parent datums from the |
| filehandle assuming the format used by the default encode_fh and |
| passed them to find_exported_dentry. |
| |
| |
| A filehandle fragment consists of an array of 1 or more 4byte words, |
| together with a one byte "type". |
| The decode_fh routine should not depend on the stated size that is |
| passed to it. This size may be larger than the original filehandle |
| generated by encode_fh, in which case it will have been padded with |
| nuls. Rather, the encode_fh routine should choose a "type" which |
| indicates the decode_fh how much of the filehandle is valid, and how |
| it should be interpreted. |
| |
| |