New client READDIR

READDIR in the client has been rewritten for performance. The major changes are:

  • avoid an additional round trip (RTT) to detect EOF after the last "page" of direntries has been fetched. Previously, READDIR in the SLASH2 client (mount_slash) exactly mirrored the getdents(2) system calls issued by the process. That arrangement was unnecessary and wasteful given that SLASH2 client <-> MDS communication happens over the network.
  • asynchronous readahead of the next expected page is issued after a getdents(2). The readahead is capped so that a readdir(3) on a huge directory does not exhaust memory in the client (or in the MDS, for that matter). Within those limits, the client issues the next READDIR as soon as the previous one finishes, which keeps throughput high during reads of huge directories. The readahead can be issued only after the current page of direntries has arrived because of how dirent offsets work: in many modern file systems the offset is an opaque traversal cookie rather than a physical offset, since the on-disk format may be a non-linear data structure such as a B-tree. The starting offset of the next page is therefore not known until the current page is in hand. ZFS, the file system backing the MDS, also uses a cookie, but one whose properties should not cause problems for the heuristics in the client's direntry buffering cache.
  • pages of direntries are now cached. Much like file stat(2) attributes in SLASH2, pages are kept after a getdents(2) for use by other applications instead of being immediately marked for release. Cached data is reclaimed on demand rather than periodically as in the old code; the periodic behavior can easily be restored if necessary. Pages expire on a timeout (exactly like cached file stat(2) attributes), and operations such as rename(2), creat(2), unlink(2), symlink(2), etc. immediately remove dircache pages to avoid inconsistencies.
  • negative extended attributes are now cached. Modern Linux applications such as ls(1) call listxattr(2), which adds another synchronous RPC for each dirent returned by getdents(2). The MDS now performs this lookup on each entry before replying and returns only the number of extended attributes for each file. When this number is zero, the client sets a flag on the entry so that it does not query the MDS again when the application issues listxattr(2) on each returned entry shortly after getdents(2) finishes.
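
The caching behavior described in the list above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the mount_slash code: the names DirCache, fetch_page, fetch_xattrs, and walk, the (name, next_offset, nxattrs) tuple layout, and the 8-second timeout are all hypothetical, and asynchronous readahead and memory caps are left out. It shows three things: an EOF indication carried in the last reply so the final getdents(2) needs no RPC, zero-xattr flags that short-circuit listxattr(2), and invalidation after a namespace change.

```python
import time


class DirCache:
    """Toy client-side cache of direntry pages (hypothetical, for illustration)."""

    def __init__(self, fetch_page, fetch_xattrs, timeout=8.0, clock=time.monotonic):
        self.fetch_page = fetch_page      # (dir_id, offset) -> (entries, eof)
        self.fetch_xattrs = fetch_xattrs  # (dir_id, name) -> list of xattr names
        self.timeout = timeout            # page lifetime in seconds (made-up value)
        self.clock = clock
        self.pages = {}                   # dir_id -> {offset: (entries, fetched_at)}
        self.no_xattrs = set()            # (dir_id, name) known to have zero xattrs

    def readdir(self, dir_id, offset=0):
        """Return the page starting at offset; an empty list means EOF."""
        cached = self.pages.get(dir_id, {}).get(offset)
        if cached is not None:
            entries, fetched_at = cached
            if self.clock() - fetched_at <= self.timeout:
                return entries            # served locally, no RPC
            self.invalidate(dir_id)       # stale: drop everything for this dir
        entries, eof = self.fetch_page(dir_id, offset)  # one RPC to the MDS
        now = self.clock()
        pages = self.pages.setdefault(dir_id, {})
        pages[offset] = (entries, now)
        for name, _next_offset, nxattrs in entries:
            if nxattrs == 0:
                # Negative xattr cache; in this sketch the flag has no timer
                # of its own and is cleared only by invalidate().
                self.no_xattrs.add((dir_id, name))
        if eof and entries:
            # The reply already marks the last page, so store an empty terminal
            # page at the final cookie; the follow-up read is answered locally.
            pages[entries[-1][1]] = ([], now)
        return entries

    def listxattr(self, dir_id, name):
        if (dir_id, name) in self.no_xattrs:
            return []                     # known to be empty, skip the RPC
        return self.fetch_xattrs(dir_id, name)

    def invalidate(self, dir_id):
        """Call after rename/creat/unlink/symlink in dir_id, or on timeout."""
        self.pages.pop(dir_id, None)
        self.no_xattrs = {k for k in self.no_xattrs if k[0] != dir_id}


def walk(cache, dir_id):
    """Read a whole directory the way repeated getdents(2) calls would."""
    names, offset = [], 0
    while True:
        page = cache.readdir(dir_id, offset)
        if not page:
            return names
        names.extend(name for name, _next, _n in page)
        offset = page[-1][1]              # opaque cookie for the next page


rpcs = []  # log of simulated RPCs


def fake_fetch_page(dir_id, offset):
    # Stand-in for the MDS: two pages addressed by opaque cookies.
    listing = {0: ([("a", 17, 0), ("b", 42, 2)], False),
               42: ([("c", 99, 0)], True)}
    rpcs.append(("READDIR", offset))
    return listing[offset]


def fake_fetch_xattrs(dir_id, name):
    rpcs.append(("LISTXATTR", name))
    return ["user.tag", "user.owner"] if name == "b" else []


cache = DirCache(fake_fetch_page, fake_fetch_xattrs)
names = walk(cache, 1)
```

In this model, walking the three-entry directory costs two READDIR RPCs rather than three, a second walk costs none until invalidate() is called or the timeout passes, and listxattr() reaches the fake MDS only for the entry that reported a nonzero xattr count.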

With these improvements, readdir(3) really flies!