Partial reclaim support

Partial reclaim (not to be confused with partial truncation resolution) support has been added recently.

This entails sending notifications to IOSes whose backing file systems support fallocate(2) hole punching, signaling them to release the underlying storage for data which has been freed by SLASH2.

Such freeing may result from bmap replica ejection or from partial truncation clearing some bmaps. Partial truncation support itself is still not available, as it requires additional processing on the IOS, which must take provisions to ensure in-memory coherency.

Support for preclaim (partial reclaim), however, turned out to be very lightweight after the introduction of a general-purpose asynchronous batch RPC API. One remaining problem is pruning out-of-date preclaim updates that were added to a batch request at some point in the past but should no longer go out, such as a TRUNCATE followed by a WRITE, where a mistimed preclaim may clobber legitimate data received after the truncation. A proposal to track bmap generation numbers may suffice for this issue.
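
For an IOS whose backing file system supports hole punching, the core of a preclaim boils down to a single fallocate(2) call. Below is a minimal sketch under Linux semantics; the function name and error handling are illustrative and not lifted from sliod:

    /*
     * Sketch: release the backing storage for a region of a backing
     * file that SLASH2 has freed, without changing the file size.
     */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>

    static int
    preclaim_region(int fd, off_t off, off_t len)
    {
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
            off, len) == -1) {
            if (errno == EOPNOTSUPP)
                return (0);     /* no hole punching; nothing to do */
            perror("fallocate");
            return (-1);
        }
        return (0);
    }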

New client READDIR

READDIR in the client has been rewritten for performance considerations. The major changes are:

  • avoid an additional round trip (RTT) after the last "page" of direntries has been fetched, which denotes EOF. Previously, READDIR in the SLASH2 client mount_slash exactly modeled the getdents(2) system calls performed by the process, an arrangement both unnecessary and foolhardy considering the remote/network nature of the SLASH2 client <-> MDS communication.
  • asynchronous readahead of the next expected page after a getdents(2) is issued. This is capped within certain limits so that a readdir(3) on a huge directory does not exhaust memory in the client (or the MDS, for that matter). The client issues another large READDIR as soon as the previous one finishes to take advantage of throughput during huge directory reads, again within those memory limits. Because of the strange nature of dirent offsets, readahead is issued only after the current direntry page finishes: in many modern file systems the dirent offset is a traversal cookie rather than a physical offset, since the on-disk format may be a non-linear data structure such as a B-tree. In the case of the MDS backing file system, ZFS, a cookie is used, but with properties that shouldn't cause issues for the heuristics in the client readdir direntry buffering cache.
  • pages of direntries are now cached. Much in the style of file stat(2) attribute caching in SLASH2, pages are kept around after a getdents(2) for use by other applications instead of being immediately marked for release. This cached data is reclaimed on demand when memory is needed rather than periodically as in the old code (which could easily be resurrected if necessary). Expiration by timeout (exactly like the file stat(2) attribute caching) and namespace-modifying operations such as rename(2), creat(2), unlink(2), symlink(2), etc. immediately remove dircache pages to avoid inconsistency errors.
  • negative extended attributes are now cached. Modern Linux applications such as ls(1) call listxattr(2) on each dirent returned by getdents(2), which adds another synchronous RPC per entry. The MDS now performs this check on each entry before replying, returning just the number of extended attributes for each file; when that number is zero, the client sets a flag so it knows not to query the MDS again when the application, shortly after getdents(2) completes, issues listxattr(2) on each returned entry. (A rough sketch of this hint appears below.)

With these improvements, readdir(3) really flies!
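
As a rough illustration of the negative xattr hint mentioned above, the client-side logic amounts to remembering the per-entry xattr count piggybacked on the READDIR reply and answering listxattr locally when it was zero. None of the names below come from the SLASH2 source; they only sketch the idea:

    #include <stddef.h>
    #include <sys/types.h>

    #define FCMHF_NO_XATTRS 0x01        /* MDS reported zero xattrs */

    struct client_fcmh {                /* hypothetical cached file handle */
        int flags;
        int nxattrs;                    /* count piggybacked on READDIR reply */
    };

    /* Applied to each entry when a READDIR reply is processed. */
    static void
    readdir_apply_xattr_hint(struct client_fcmh *f, int nxattrs)
    {
        f->nxattrs = nxattrs;
        if (nxattrs == 0)
            f->flags |= FCMHF_NO_XATTRS;
    }

    /* Client listxattr path: answer "empty" locally when the hint is set. */
    static ssize_t
    client_listxattr(struct client_fcmh *f, char *buf, size_t size)
    {
        (void)buf;
        (void)size;
        if (f->flags & FCMHF_NO_XATTRS)
            return (0);                 /* no RPC to the MDS needed */
        /* ... otherwise fall through to the usual synchronous RPC ... */
        return (-1);
    }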

Garbage collection improvements

Zhihui has recently fixed a number of issues in the garbage reclamation code in the MDS that were preventing IOSes with large gaps between them from catching up to each other. On large deployments this obviously becomes a major concern.

Furthermore, a number of other improvements were made in the handling of garbage reclamation updates to IOSes, such as general RPC robustness, IOS selection, reduced log spam on error, better error discovery, and continued progress in the event of errors.

Finally, a new API was added that will, in the near future, convert the garbage reclamation update RPCs to an asynchronous mechanism, since large updates on slow hardware (such as tape archivers) can effectively time out the RPC response and look like a network error.
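
As a sketch of the asynchronous shape this is headed toward (with entirely hypothetical names), the MDS would queue a reclaim batch and continue, with a completion callback handling the eventual reply or rescheduling on error instead of blocking on a slow IOS:

    struct reclaim_batch {
        void   *entries;    /* reclaim records destined for one IOS */
        int     nentries;
        void  (*cbf)(struct reclaim_batch *, int);  /* completion hook */
    };

    /* Invoked by the RPC layer whenever the IOS finally replies. */
    static void
    reclaim_batch_cb(struct reclaim_batch *rb, int rc)
    {
        if (rc) {
            /* reschedule; a slow IOS (e.g. a tape archiver) is not fatal */
            return;
        }
        /* success: advance the reclaim progress marker for this IOS */
    }

    static void
    reclaim_batch_send(struct reclaim_batch *rb)
    {
        rb->cbf = reclaim_batch_cb;
        /* hand off to the RPC layer and return without waiting */
    }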

Disk usage accounting improvements

As it turns out, FreeBSD ZFS has some interesting behavioral oddities related to fsync(2) and st_blocks. Specifically, after a write(2) then fsync(2), the stat(2) st_blocks information returned does not immediately reflect the correct disk usage.

Without further investigation, cursory empirical testing has revealed that this field is only updated several seconds after the fsync(2). sliod was adjusted accordingly to account for this delay, so that correct usage accounting information gets propagated to the MDS in bmap CRC updates (BCRs) instead of erroneous values. The cost is that BCRs stay around in memory longer, increasing memory pressure on busy IOSes and widening the window of BCRs lost to the MDS if the sliod fails.
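
The shape of the workaround can be illustrated as follows; this is a sketch of the idea rather than the actual sliod code, and the function name and one-second polling interval are assumptions:

    #include <sys/stat.h>
    #include <unistd.h>

    /* Wait (up to max_secs) for st_blocks to settle after fsync(2). */
    static blkcnt_t
    settled_st_blocks(int fd, int max_secs)
    {
        struct stat stb;
        blkcnt_t prev = -1;
        int i;

        if (fsync(fd) == -1)
            return (-1);
        for (i = 0; i < max_secs; i++) {
            if (fstat(fd, &stb) == -1)
                return (-1);
            if (stb.st_blocks == prev)
                break;              /* value has stopped changing */
            prev = stb.st_blocks;
            sleep(1);
        }
        return (prev);
    }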

Avoiding syslog flooding

In a large deployment featuring many nodes, including clients and I/O servers, it is useful to centralize all error logs generated by the various daemons in one place in order to scan for problems and react as necessary. However, because of abuse from either (1) buggy SLASH2 code flooding syslog with activity or (2) other applications sharing the same syslog receivers, it is possible for rsyslogd to back up and become unresponsive.

In these situations, any application using syslog(3) with a remote configuration will essentially stall until the remote rsyslogd stops thrashing. Once cleared, service can return to normal, but any behavior in the interim that generates log messages, such as RPC timeouts, can drag the deployment into a chicken-and-egg dependency loop of threads stuck in our debug logging routines while awaiting syslog transmission.

Of course, fixing the spam is the main remedy, but that still does not address case #2 outlined above, which is exactly what happened in one of our SLASH2 deployments. The proper solution is to remove the network dependency from an essential code path in the SLASH2 code base.

The alleviation is the introduction of the PFL_SYSLOG_PIPE environment variable. Instead of issuing syslog(3) directly, this variable arranges for stderr to be written as a normal file somewhere on the system (which requires some local storage on netbooted machines if high volumes of debug logging are to be generated) and for a logger(1) process to be spawned that performs the syslog(3) on the application's behalf, so the application does not grind to a halt.
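
As a rough sketch of the mechanism (the real PFL wiring, file path, and tag are assumptions here, not lifted from the source), the daemon redirects its stderr to a local file and leaves the blocking syslog(3) work to a separate logger(1) process:

    #include <stdio.h>
    #include <stdlib.h>

    static void
    setup_syslog_pipe(const char *path, const char *tag)
    {
        char cmd[512];

        if (getenv("PFL_SYSLOG_PIPE") == NULL)
            return;                 /* default behavior: unchanged */

        /* Debug/error output now lands in a plain local file... */
        if (freopen(path, "a", stderr) == NULL)
            return;
        setvbuf(stderr, NULL, _IOLBF, 0);

        /* ...and a detached logger(1) forwards it to syslog(3) for us. */
        snprintf(cmd, sizeof(cmd), "tail -f %s | logger -t %s &", path, tag);
        (void)system(cmd);
    }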

Not having debug logs, or even genuine system activity logs for that matter, is unfortunate but a class altogether different from a completely unresponsive system.