Clean up direct I/O (DIO) mechanism

Whenever two or more clients have conflicting access requests on the same block of the same file (e.g., two write requests), the SLASH2 MDS must force all existing lease holders to downgrade to the so-called direct I/O (DIO) mode. In DIO mode, every client reads from and writes to the I/O server directly, bypassing its local page cache. This mechanism mostly worked, but there were some lingering issues.

First, we must not hand out a DIO lease until all existing lease holders have replied. Second, a new lease request must honor the DIO transitional period: just because a block is not currently in DIO mode does not mean a non-conflicting lease request should get a caching lease. Of course, the devil is always in the details. In theory, a client should have one lease per block per file, but in practice it can have more than one lease in flight (due to RPC delays). The MDS should handle that properly.
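The two rules above can be sketched as a small per-block state machine. This is only an illustration under stated assumptions, not the actual MDS code: the names `BlockLeases`, `Mode`, `request`, and `ack_downgrade` are hypothetical, a deferred request is modeled by returning `None`, and a conflict is assumed to mean that another client holds a lease and at least one side is writing. Leases are keyed by lease id rather than by client, so a client with two leases in flight must reply for each of them.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Mode(Enum):
    CACHING = auto()      # holders may use their page cache
    DIO_PENDING = auto()  # downgrade requested, replies still outstanding
    DIO = auto()          # every holder has acknowledged direct I/O


@dataclass
class Lease:
    client: str
    write: bool
    dio: bool


@dataclass
class BlockLeases:
    """Hypothetical per-block lease table kept by the metadata server."""
    mode: Mode = Mode.CACHING
    leases: dict = field(default_factory=dict)   # lease id -> Lease
    waiting: list = field(default_factory=list)  # deferred (client, write)
    next_id: int = 1

    def _grant(self, client, write, dio):
        lease_id = self.next_id
        self.next_id += 1
        self.leases[lease_id] = Lease(client, write, dio)
        return lease_id

    def _conflicts(self, client, write):
        # Assumed rule: another client holds a lease and one side writes.
        return any(l.client != client and (write or l.write)
                   for l in self.leases.values())

    def request(self, client, write):
        """Return a lease id, or None if the request must wait."""
        if self.mode is Mode.DIO:
            return self._grant(client, write, dio=True)
        if self.mode is Mode.CACHING and not self._conflicts(client, write):
            return self._grant(client, write, dio=False)
        # Either this request started the transition or one is already
        # under way: no caching lease, and no DIO lease until all replies.
        self.mode = Mode.DIO_PENDING
        self.waiting.append((client, write))
        return None

    def ack_downgrade(self, lease_id):
        """Record one holder's reply; return lease ids granted as a result."""
        self.leases[lease_id].dio = True
        if self.mode is not Mode.DIO_PENDING:
            return []
        if not all(l.dio for l in self.leases.values()):
            return []
        # Every existing lease has replied: DIO leases may now be handed out.
        self.mode = Mode.DIO
        granted = [self._grant(c, w, dio=True) for c, w in self.waiting]
        self.waiting.clear()
        return granted
```

In this sketch, if client A holds two read leases and client B asks to write, B's request waits. A later read request from client C also waits, even though it does not conflict with any lease currently held, and both B and C receive DIO leases only after A has acknowledged the downgrade for each of its two leases.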

After this cleanup, all the known corner cases are ironed out. The amount of code actually shrank as a result.

Recent rework of client I/O infrastructure

The disk-based archive system at PSC has been in production for about a year. For the most part, the system works. However, there was a period when we saw more client crashes than we would like when the system was under serious load or the I/O service was down. Initially, we tried to tackle these issues in baby steps, until one day we realized that perhaps a major rewrite was needed.

One problem with the old code is that it uses a lot of locking and flags to make sure key data structures are not freed prematurely. With enough mental gymnastics, we can convince ourselves that the old code works. However, a simpler and more robust way is to use reference counts. Another problem with the old code is that it has the same logic duplicated in different places.

So the client rewrite is actually a major cleanup, using reference counts to protect key data structures and consolidating duplicated logic. After this cleanup, the same code path is used regardless of whether a request is split into multiple RPCs, is done via AIO, or needs to be retried. The new code also makes no assumption about when an RPC will complete.
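The reference-counting idea can be sketched as follows. This is a minimal illustration, not the SLASH2 client code: `IORequest`, `get`, `put`, `rpc_done`, and `issue` are hypothetical names. Each outstanding RPC holds one reference on the request, every completion goes through the same `rpc_done` path, and the request is released only when the last reference is dropped, whatever order the RPCs finish in.

```python
import threading


class IORequest:
    """A client I/O request that may fan out into several RPCs."""

    def __init__(self, on_free):
        self._lock = threading.Lock()
        self._refs = 1            # reference held by the issuer
        self._on_free = on_free   # called once, when the last ref is dropped
        self.errors = []
        self.freed = False

    def get(self):
        with self._lock:
            if self._refs <= 0:
                raise RuntimeError("reference taken on a freed request")
            self._refs += 1

    def put(self):
        with self._lock:
            self._refs -= 1
            last = self._refs == 0
        if last:
            # No one else can reach the request now, so no lock is needed.
            self.freed = True
            self._on_free(self)

    def rpc_done(self, error=None):
        # One completion path for synchronous, AIO, and retried RPCs.
        if error is not None:
            with self._lock:
                self.errors.append(error)
        self.put()


def issue(req, nrpcs):
    """Take one reference per RPC, then drop the issuer's own reference."""
    for _ in range(nrpcs):
        req.get()
        # ... a real client would send the RPC here ...
    req.put()
```

Because the issuer keeps its own reference until every RPC has been accounted for, a reply that arrives early cannot free the request out from under the code that is still sending the remaining RPCs.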

Any newly written code is bound to have bugs, especially when it changes many of the assumptions that the old code relied on. Over the past month, we have dealt with some fallout from the new code, and the client code seldom crashes these days.

zfs-fuse 0.7.0 merged

The zfs-fuse 0.7.0 stack has been integrated into the SLASH2 metadata backend. This process started quite a while ago, with intermittent testing to ensure stability, and the integrated stack had been in use for quite a while before this announcement. The metadata server seems very stable now, especially during high activity.

Solaris port

Patches to allow SLASH2 to run on Solaris-derived operating systems have been submitted to the mainline SLASH2 source tree. At this time, only the I/O component (sliod) is supported, but since most of the hard work has already been done, supporting a client shouldn’t take much more work.

SLASH2 at XSEDE12

J Ray Scott gave a talk about SLASH2 and its application in PSC’s new storage archiver system. The talk covered the architecture of the system, from hardware all the way up to the application level, as well as the historical timeline and other logistics of operation since the system went into production.