Apr 8, 2013
by
yanovich
The Lustre networking stack version 1.8.7 has been merged into the SLASH2 tree. This merge addresses a number of fringe error conditions, such as exhaustion of memory descriptors, and alleviates some issues seen during periods of high activity/usage.
Apr 8, 2013
by
yanovich
At this time one year ago, PSC’s Data SuperCell disk-based storage archival system went into production! It has been an interesting year with a lot of great moments as well as some hiccups, but SLASH2 has really proven its worth in both engineering and design. Here’s to the next year!
Sep 13, 2011
by
pauln
To support the upcoming SLASH2-based PSC data archiver, an AIO (asynchronous I/O) implementation was needed and is now largely complete. As previously mentioned, AIO support in SLASH2 enables the system to tolerate long delays between the presentation of an initial I/O request and its completion. Such support is needed for the integration of tape archival systems as SLASH2 I/O systems.
The current implementation supports all types of I/O operations: read, write, read-ahead, and replication. The main caveat for the moment is that writes are forced to use direct I/O mode. Complexities surrounding the management of bmap write leases and client-side buffer cache management were the primary factors here. We definitely wanted to avoid creating scenarios where the client would be forced to ‘juggle’ write leases for arbitrary lengths of time. Further, using cached I/O on the client may create quality-of-service problems for writes to a single tape-resident file, because I/O to that file may consume the entire write-back cache. Since writes may cause tape reads (as described in a previous post), this could block the entire client for some arbitrary amount of time. By using direct I/O, these issues are completely avoided.
Both the client and sliod (aka the I/O server) use a similar mechanism to deal with AIO’d buffers. When it is determined that a request depends on an AIO’d buffer for completion, that request is placed on a list attached to the buffer (on the sliod that buffer is a struct slvr_ref; on the client, a struct bmap_pagecache_entry). Upon completion of the AIO, these buffers become ‘READY’ and any requests queued on the completion list are processed. This method suits both client read I/O and sliod replication I/O equally well. (Note: replication is more straightforward since only a single buffer may be involved.)
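To make the pattern concrete, here is a minimal C sketch of the completion-list idea. The names (aio_buf, pending_req, and so on) are invented for illustration; the real structures are struct slvr_ref on the sliod and struct bmap_pagecache_entry on the client.

/*
 * Sketch: requests that depend on a buffer still being filled by AIO are
 * parked on the buffer's waiter list and drained when the AIO completes.
 */
#include <stdio.h>
#include <stdlib.h>

enum buf_state { BUF_FAULTING, BUF_READY };

struct pending_req {
	int			 id;		/* stand-in for an RPC or fuse request */
	struct pending_req	*next;
};

struct aio_buf {
	enum buf_state		 state;
	struct pending_req	*waiters;	/* requests blocked on this buffer */
};

/* Queue the request if the buffer is still faulting; service it otherwise. */
static void
req_submit(struct aio_buf *b, struct pending_req *r)
{
	if (b->state != BUF_READY) {
		r->next = b->waiters;
		b->waiters = r;
		return;
	}
	printf("request %d serviced immediately\n", r->id);
	free(r);
}

/* Called when the underlying AIO completes: mark READY and drain waiters. */
static void
aio_complete(struct aio_buf *b)
{
	struct pending_req *r, *next;

	b->state = BUF_READY;
	for (r = b->waiters, b->waiters = NULL; r != NULL; r = next) {
		next = r->next;
		printf("request %d serviced on AIO completion\n", r->id);
		free(r);
	}
}

int
main(void)
{
	struct aio_buf b = { BUF_FAULTING, NULL };
	struct pending_req *r1 = calloc(1, sizeof(*r1));
	struct pending_req *r2 = calloc(1, sizeof(*r2));

	r1->id = 1;
	r2->id = 2;
	req_submit(&b, r1);	/* parked: buffer still faulting from tape */
	req_submit(&b, r2);	/* parked as well */
	aio_complete(&b);	/* both waiters are serviced here */
	return 0;
}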
Jul 22, 2011
by
pauln
Jared has been working on asynchronous I/O support so that users may read from a tape-based SLASH2 I/O service. The patches work around RPC timeout issues by immediately replying to the initial request and calling back the client once the I/O server has filled the buffer from tape. While the data is being fetched from tape, the client and I/O server threads do not block so that other application requests may be serviced.
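As a rough illustration of that non-blocking shape, the sketch below replies to the read immediately and notifies the client from a background thread once the buffer has been filled. All of the names here (tape_fill, send_reply, send_callback) are invented stand-ins for SLASH2’s actual RPC layer.

/*
 * Sketch: the read handler replies right away so the RPC does not time
 * out, and a background thread calls the client back when the slow tape
 * fetch finishes.  Build with: cc -pthread ...
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct read_req {
	int	client_id;
	char	buf[64];
};

static void
send_reply(int client_id, const char *status)
{
	printf("client %d: immediate reply: %s\n", client_id, status);
}

static void
send_callback(int client_id, const char *data)
{
	printf("client %d: callback with data: %s\n", client_id, data);
}

/* Simulate the slow tape fetch, then call the client back. */
static void *
tape_fill(void *arg)
{
	struct read_req *rq = arg;

	sleep(1);		/* stand-in for tape latency */
	snprintf(rq->buf, sizeof(rq->buf), "sliver contents");
	send_callback(rq->client_id, rq->buf);
	return NULL;
}

int
main(void)
{
	struct read_req rq = { .client_id = 7 };
	pthread_t thr;

	/* Handler path: reply immediately, then hand the work off. */
	send_reply(rq.client_id, "I/O in progress, callback to follow");
	pthread_create(&thr, NULL, tape_fill, &rq);

	/* The handler thread is now free to service other requests. */
	pthread_join(&thr, NULL);
	return 0;
}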
In testing some of these new patches I’ve noticed that writes issued by clients to IOSes of type “archival_fs” are having some issues. The reason is that the SLIOD tries to fault in the first 1MB of the file before processing the incoming write buffer. The entire sliver is needed to calculate the checksum, similar in nature to a read-modify-write on a RAID system. Presumably, if the incoming write request were for a full sliver this “read-prep-write” wouldn’t be necessary, but these are 64k writes. The plan was not to implement async I/O for writes, but it would seem that we have no choice.
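The sliver arithmetic behind that read-prep-write looks roughly like the sketch below, which assumes a 1 MiB sliver and uses invented helper names (sliver_fault_in, sliver_checksum): a 64k write never covers the whole sliver, so the existing contents must be faulted in before the checksum can be recomputed.

/*
 * Sketch: a write that does not cover a full sliver forces the sliver to
 * be faulted in first, because the checksum is computed over the whole
 * sliver.  Sizes and helpers here are illustrative only.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SLIVER_SIZE	(1024 * 1024)		/* assumed 1 MiB sliver */

static unsigned char sliver[SLIVER_SIZE];	/* in-memory sliver buffer */

/* Stand-in for faulting the sliver in from the backing file (or tape). */
static void
sliver_fault_in(void)
{
	printf("faulting in full %u-byte sliver before applying write\n",
	    (unsigned)SLIVER_SIZE);
	/* a read of the existing sliver contents would go here */
}

/* Stand-in for the per-sliver checksum; the real code uses a CRC. */
static uint64_t
sliver_checksum(void)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < SLIVER_SIZE; i++)
		sum += sliver[i];
	return sum;
}

static void
handle_write(uint32_t off, uint32_t len, const void *data)
{
	/*
	 * A write covering the whole sliver could skip the fault-in; a
	 * partial write (e.g. 64k) cannot, since the checksum spans the
	 * entire sliver.
	 */
	if (off % SLIVER_SIZE != 0 || len != SLIVER_SIZE)
		sliver_fault_in();

	memcpy(sliver + off % SLIVER_SIZE, data, len);
	printf("new sliver checksum: %llu\n",
	    (unsigned long long)sliver_checksum());
}

int
main(void)
{
	char payload[64 * 1024] = "64k client write";

	handle_write(0, sizeof(payload), payload);	/* partial: read-prep-write */
	return 0;
}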
The question for the moment is where to hold the write buffer. When the write hits the SLIOD, it’s presumably accompanied by a valid bmap lease. If the SLIOD were to return EWOULDBLOCK to the client (and not take the bulk), the client would be forced to retry later. This retry could even be prompted via a callback once the SLIOD has readied the respective sliver buffer. However, this method leads us to the problem of dirty buffers sitting on the client that are covered by expiring bmap leases. While the bmap leases may be refreshed, this approach is still problematic because it makes the client more susceptible to being caught with dirty buffers and no valid lease by which to flush them. I feel this situation should be avoided whenever possible.
Another method would be to put the heavy lifting onto the SLIOD, which would avoid the problem of delaying the client while it holds dirty data (a rough sketch follows the list):
- client sends write
- sliod sees EWOULDBLOCK on sliver fault
- sliod performs the bulk, taking the buffer from the client, but returns EWOULDBLOCK to the client
- sliod attaches the write buffer to the pending aio read
- the client fsthr handles this op just like an aio read, queuing the fuse reply so that it may be completed once the SLIOD notifies us
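A rough sketch of that flow follows, again with invented names and none of the real RPC or lease machinery: the sliod takes the bulk up front, parks the write on the pending sliver AIO, and applies it once the fault-in completes.

/*
 * Sketch: the sliod accepts the write bulk immediately, parks it on the
 * pending sliver AIO, and applies it when the fault-in (tape read)
 * completes, rather than bouncing the buffer back to the client.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct pending_write {
	size_t			 off, len;
	char			*data;		/* taken via bulk from the client */
	struct pending_write	*next;
};

struct sliver_aio {
	int			 faulted;	/* sliver contents present? */
	struct pending_write	*writes;	/* writes parked on this AIO */
};

/* sliod write path: take the bulk, park the write, tell the client to wait. */
static int
sliod_handle_write(struct sliver_aio *aio, size_t off, const char *buf, size_t len)
{
	struct pending_write *pw = calloc(1, sizeof(*pw));

	pw->off = off;
	pw->len = len;
	pw->data = malloc(len);
	memcpy(pw->data, buf, len);		/* "perform bulk" */
	pw->next = aio->writes;
	aio->writes = pw;
	return -1;				/* EWOULDBLOCK-style status */
}

/* Called once the sliver fault (tape read) completes: apply parked writes. */
static void
sliver_aio_complete(struct sliver_aio *aio)
{
	struct pending_write *pw, *next;

	aio->faulted = 1;
	for (pw = aio->writes, aio->writes = NULL; pw != NULL; pw = next) {
		next = pw->next;
		printf("applying parked write: off=%zu len=%zu\n", pw->off, pw->len);
		/* checksum recompute and client notification would go here */
		free(pw->data);
		free(pw);
	}
}

int
main(void)
{
	struct sliver_aio aio = { 0, NULL };
	char buf[16] = "client payload";

	if (sliod_handle_write(&aio, 0, buf, sizeof(buf)) < 0)
		printf("client told to wait; bulk already taken by sliod\n");
	sliver_aio_complete(&aio);
	return 0;
}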
Would direct I/O mode do the trick? That would simplify things a bit by removing cached pages from the equation, but the SLIOD would still have to perform a read-prep-write if the write doesn’t cover an entire sliver.
Jul 22, 2011
by
pauln
This example was given during the TG’11 talk, but I thought it might be of interest, so I’m copying the results here.
First, create a file to replicate:
(pauln@peel0:~)$ dd if=/dev/zero of=/p0_archive/pauln/big_file count=2k bs=1M
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 4.81828 seconds, 446 MB/s
Next, view its status with msctl:
(pauln@peel0:msctl)$ ./msctl -r /p0_archive/pauln/big_file
file-replication-status #valid #bmap %prog
================================================================================
/p0_archive/pauln/big_file
new-bmap-repl-policy: one-time
archsliod@PSCARCH 16 16 100%
++++++++++++++++
Request replication of the entire file:
(pauln@peel0:msctl)$ date && ./msctl -Q archlime@PSCARCH:*:/p0_archive/pauln/big_file
Wed Jul 20 02:51:56 EDT 2011
(pauln@peel0:msctl)$
Now check the status with msctl:
(pauln@peel0:msctl)$ date && ./msctl -r /p0_archive/pauln/big_file
Wed Jul 20 02:51:57 EDT 2011
file-replication-status #valid #bmap %prog
=================================================================================
/p0_archive/pauln/big_file
new-bmap-repl-policy: one-time
archsliod@PSCARCH 16 16 100%
++++++++++++++++
archlime@PSCARCH 0 16 0%
sqqqqqqqqqqqqqqq
(pauln@peel0:msctl)$ date && ./msctl -r /p0_archive/pauln/big_file
Wed Jul 20 02:52:05 EDT 2011
file-replication-status #valid #bmap %prog
=================================================================================
/p0_archive/pauln/big_file
new-bmap-repl-policy: one-time
archsliod@PSCARCH 16 16 100%
++++++++++++++++
archlime@PSCARCH 10 16 62.50%
++++++++++qqqqqq
(pauln@peel0:msctl)$ date && ./msctl -r /p0_archive/pauln/big_file
Wed Jul 20 02:52:20 EDT 2011
file-replication-status #valid #bmap %prog
=================================================================================
/p0_archive/pauln/big_file
new-bmap-repl-policy: one-time
archsliod@PSCARCH 16 16 100%
++++++++++++++++
archlime@PSCARCH 16 16 100%
++++++++++++++++