More READDIR improvements

The previous rewrite of the SLASH2 client READDIR code, named dircache, which services readdir(3) system calls from user applications, unfortunately had some problems:

  1. Positional meaning was expected from the values the underlying file system (ZFS) assigns to the d_off field. This is a problem because modern file systems treat d_off as an opaque cookie that does not confer position but instead, for example, encodes a location in a B-tree (see the sketch after this list).
  2. Because the client cannot predict these cookie values without advance knowledge of the directory, which by design only the MDS holds in SLASH2, there was no good way to do readahead to improve the performance of reading all dirents from a large directory. The old code had to wait one RTT for each chunk of dirents; at best it could issue the request for the next chunk only after the reply to the current one arrived.
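
To make the cookie semantics concrete, here is a minimal client-side sketch, with hypothetical names and structures rather than the actual SLASH2 API: the offset sent with each READDIR request is simply the cookie handed back by the previous reply, never a value the client computes itself.

    #include <stdint.h>
    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical result of one READDIR RPC (not the actual SLASH2 structures). */
    struct readdir_chunk {
    	uint64_t	next_off;	/* opaque cookie for the next request */
    	size_t		nents;		/* dirents returned in this chunk */
    	int		eof;		/* nonzero once the directory is exhausted */
    };

    /*
     * Stand-in for the READDIR RPC: pretend the directory holds ten chunks
     * and hand back a cookie whose value the client must not interpret.
     */
    static int
    readdir_rpc(uint64_t dir_id, uint64_t off, struct readdir_chunk *c)
    {
    	uint64_t chunk = off / 7919;		/* the server decodes its own cookie */

    	(void)dir_id;
    	c->nents = 64;
    	c->next_off = (chunk + 1) * 7919;	/* opaque to the client */
    	c->eof = chunk + 1 >= 10;
    	return (0);
    }

    /*
     * Read an entire directory.  The offset sent with each RPC is exactly
     * the cookie the previous reply handed back; the client never tries to
     * predict or compute it.
     */
    static int
    read_whole_dir(uint64_t dir_id)
    {
    	struct readdir_chunk c;
    	uint64_t off = 0;			/* 0 conventionally means "start" */
    	size_t total = 0;
    	int rc;

    	do {
    		rc = readdir_rpc(dir_id, off, &c);
    		if (rc)
    			return (rc);
    		total += c.nents;
    		off = c.next_off;		/* only the server knows what this means */
    	} while (!c.eof);
    	printf("read %zu dirents\n", total);
    	return (0);
    }

    int
    main(void)
    {
    	return (read_whole_dir(42));
    }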

So the code has been restructured to push this readahead responsibility onto the MDS. When one READDIR request comes in, the MDS now sets up multiple RPCs in anticipation that the client will soon request the next chunks of dirents. This puts some additional load on the MDS, but the benefit to large directory listings should be clear; the readahead is currently limited to three simultaneous RPCs.
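
A rough sketch of the idea, again with hypothetical names rather than the real MDS code: when a READDIR request arrives, serve it and then keep up to three additional chunks in flight ahead of the client.

    #include <stdint.h>
    #include <stdio.h>

    #define READAHEAD_MAX	3	/* simultaneous readahead RPCs, per the text above */

    /* Hypothetical per-directory readahead state kept by the MDS. */
    struct dir_readahead {
    	uint64_t	next_off;	/* cookie of the next chunk to push */
    	int		inflight;	/* readahead RPCs currently outstanding */
    };

    /*
     * Stand-in for an asynchronous chunk send: prepare the chunk starting
     * at *offp, advance *offp to the following cookie, and return nonzero
     * at end of directory.
     */
    static int
    mds_send_chunk_async(uint64_t dir_id, uint64_t *offp)
    {
    	printf("dir %llu: sending chunk at cookie %llu\n",
    	    (unsigned long long)dir_id, (unsigned long long)*offp);
    	*offp += 7919;			/* pretend cookie of the next chunk */
    	return (*offp >= 10 * 7919);
    }

    /*
     * Called when a READDIR request for cookie `off' arrives: serve it,
     * then keep up to READAHEAD_MAX further chunks in flight so the client
     * does not pay a full RTT for each one.
     */
    static void
    mds_handle_readdir(struct dir_readahead *ra, uint64_t dir_id, uint64_t off)
    {
    	mds_send_chunk_async(dir_id, &off);	/* the chunk actually requested */

    	while (ra->inflight < READAHEAD_MAX) {
    		if (mds_send_chunk_async(dir_id, &ra->next_off))
    			break;			/* end of directory */
    		ra->inflight++;			/* dropped when the client consumes it */
    	}
    }

    int
    main(void)
    {
    	struct dir_readahead ra = { 7919, 0 };

    	mds_handle_readdir(&ra, 42, 0);
    	return (0);
    }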

I/O server affinity support now available

Support for allowing users to prefer reusing I/O system members, instead of round-robining across cohorts, has recently been added to SLASH2. It is available via the msctl(8) fattr interface. Happy data managing!

Bulk RPC data now protected on untrusted networks

Some changes have recently been made to the RPC layer in SLASH2 that provide cryptographic protection for bulk data sent via RPCs among SLASH2 nodes. Previously, only message headers were protected, and leaving bulk data unprotected was inconsistent with how SLASH2 handles all other network traffic with peers. The new protection provides message integrity (not confidentiality), at the computational cost of essentially digitally signing the data. Some bulk data already had specific handling to protect it, but now all of it does, so network bit flips and the like should now largely be caught by the cryptographic routines.
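
As a rough illustration of integrity-only protection, and not the actual SLASH2 RPC code, the sender can attach a digest of the bulk buffer to the already-protected message header and the receiver can recompute and compare it. A minimal sketch using OpenSSL's SHA-256 (compile with -lcrypto):

    #include <string.h>
    #include <openssl/sha.h>

    /*
     * Sender side: compute a SHA-256 digest over the bulk buffer.  In a
     * real RPC layer the digest would travel in the already-protected
     * message header.
     */
    static void
    bulk_digest(const void *buf, size_t len, unsigned char out[SHA256_DIGEST_LENGTH])
    {
    	SHA256(buf, len, out);
    }

    /* Receiver side: recompute and compare; nonzero means the data was corrupted. */
    static int
    bulk_check(const void *buf, size_t len,
        const unsigned char expect[SHA256_DIGEST_LENGTH])
    {
    	unsigned char got[SHA256_DIGEST_LENGTH];

    	SHA256(buf, len, got);
    	return (memcmp(got, expect, SHA256_DIGEST_LENGTH) != 0);
    }

    int
    main(void)
    {
    	static const char payload[] = "bulk RPC payload";
    	unsigned char d[SHA256_DIGEST_LENGTH];

    	bulk_digest(payload, sizeof(payload), d);
    	return (bulk_check(payload, sizeof(payload), d));	/* 0 = intact */
    }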

More ideas are on the way about how to have untrusted (non-root) clients on a SLASH2 network in a non-disruptive manner, how to split SLASH2 nodes across internal networks, and how to alleviate the performance overhead on fast, trusted networks…

MDS database performance improvements

Recently, work has been done on the MDS update scheduling engine. Under the hood, an SQLite database is used to stage all of the work the MDS has to do: scheduling replication activity, resolving partial truncate(2) blocks, and reclaiming garbage.

Until now, an elementary method was used to store this work and retrieve it when convenient. All execution threads inside the MDS wishing to issue queries performed them through a shared mode of access with no caching, placing limits on database performance.

The code has been restructured with the following improvements:

  • funnel as many database queries as possible through a single thread to prevent cache flushing.
  • give each thread its own database handle to allow concurrent read operations (see the sketch after this list).
  • move the database into RAM via a tmpfs mount (/dev/shm).
  • back up the database periodically so it can be restored in the event of a crash.
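
A minimal sketch of the second and third items, per-thread read handles on a database file kept under /dev/shm, using hypothetical paths and table names rather than the actual MDS schema (compile with -lsqlite3):

    #include <stdio.h>
    #include <sqlite3.h>

    /* Hypothetical path: keeping the file under /dev/shm (tmpfs) keeps it in RAM. */
    #define DB_PATH	"/dev/shm/upsch.db"

    /*
     * Each reader thread opens its own handle so reads can proceed
     * concurrently; a single writer handle is used for updates.
     */
    static sqlite3 *
    open_reader(void)
    {
    	sqlite3 *db;

    	if (sqlite3_open_v2(DB_PATH, &db, SQLITE_OPEN_READONLY, NULL) != SQLITE_OK) {
    		fprintf(stderr, "open: %s\n", sqlite3_errmsg(db));
    		sqlite3_close(db);
    		return (NULL);
    	}
    	return (db);
    }

    int
    main(void)
    {
    	sqlite3 *wr, *rd;

    	/* Writer handle: create a hypothetical work table. */
    	if (sqlite3_open_v2(DB_PATH, &wr,
    	    SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE, NULL) != SQLITE_OK)
    		return (1);
    	sqlite3_exec(wr, "CREATE TABLE IF NOT EXISTS work "
    	    "(id INTEGER PRIMARY KEY, kind TEXT, fid INTEGER)", NULL, NULL, NULL);

    	/* Each reader thread would open and use its own handle like this. */
    	rd = open_reader();
    	if (rd == NULL) {
    		sqlite3_close(wr);
    		return (1);
    	}
    	sqlite3_exec(rd, "SELECT count(*) FROM work", NULL, NULL, NULL);

    	sqlite3_close(rd);
    	sqlite3_close(wr);
    	return (0);
    }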

Obviously, between backups there is a window in which operations can be lost. This issue will need to be addressed.
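
One possible way to take those occasional backups, offered only as a sketch and not necessarily what the MDS does, is SQLite's online backup API, which copies the in-RAM database to a file on durable storage without stopping readers:

    #include <sqlite3.h>

    /*
     * Copy the live (in-RAM) database to a durable on-disk file.  Returns
     * 0 on success.  Paths and names are illustrative only.
     */
    int
    backup_db(sqlite3 *live, const char *dest_path)
    {
    	sqlite3 *dest;
    	sqlite3_backup *b;
    	int rc;

    	if (sqlite3_open(dest_path, &dest) != SQLITE_OK) {
    		sqlite3_close(dest);
    		return (-1);
    	}
    	b = sqlite3_backup_init(dest, "main", live, "main");
    	if (b == NULL) {
    		sqlite3_close(dest);
    		return (-1);
    	}
    	sqlite3_backup_step(b, -1);	/* -1: copy all pages in one pass */
    	rc = sqlite3_backup_finish(b);
    	sqlite3_close(dest);
    	return (rc == SQLITE_OK ? 0 : -1);
    }

Anything applied after the most recent completed backup would still be lost in a crash, which is exactly the window described above.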

SLASH2 released under GPL 2.0 license

The SLASH2 code base has now been released under the GPL 2.0 license!