Direct I/O support has been in SLASH2 for a long time. It is
done through the direct_io flag, set on a per-file basis.
The Linux kernel, on the other hand, did not have O_DIRECT support
for FUSE file systems until release 3.4. Later, in release 3.10,
it added asynchronous direct I/O support for FUSE.
This is important for applications like MySQL that do things like
io_submit().
However, MySQL fails with EINVAL on SLASH2. We tried every release
from 3.10 all the way to 4.1. As it turns out, the two
ways to specify direct I/O (O_DIRECT versus direct_io) were not
well integrated until the Linux 4.1 release. For earlier releases, we have
to use the following workaround to make asynchronous direct I/O work on SLASH2:
# msctl -p sys.direct_io=0
Note that even if SLASH2 turns off its side of direct I/O, an application
can continue to reap the benefits of direct I/O if it opens a file with O_DIRECT.
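What such an application does boils down to opening with O_DIRECT and issuing aligned I/O. Here is a minimal sketch in Python, not SLASH2 code; the path and sizes are arbitrary, and the sketch falls back to buffered I/O on file systems (such as tmpfs) that reject O_DIRECT outright:

```python
import mmap
import os
import tempfile

PAGE = 4096
path = os.path.join(tempfile.mkdtemp(), "odirect_test")

# O_DIRECT requires the user buffer, file offset, and transfer size to be
# aligned (typically to the logical block size or page size); violating
# this is the classic source of EINVAL. An anonymous mmap gives us a
# page-aligned buffer.
buf = mmap.mmap(-1, PAGE)
buf.write(b"x" * PAGE)

flags = os.O_RDWR | os.O_CREAT
try:
    # not every file system accepts O_DIRECT; fall back for the demo
    fd = os.open(path, flags | getattr(os, "O_DIRECT", 0))
except OSError:
    fd = os.open(path, flags)

n = os.pwrite(fd, buf, 0)  # aligned 4 KiB write at an aligned offset
os.close(fd)
print(n)  # 4096
```

MySQL-style asynchronous direct I/O additionally pushes such aligned requests through io_submit() rather than pwrite(), which is exactly the path that needed the kernel fix below.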
For the curious, I believe that the following kernel patch fixes
the problem:
zhihui@krakatoa:~/linux-git$ git show 15316263649d9eed393d75095b156781a877eb06 | head -8
commit 15316263649d9eed393d75095b156781a877eb06
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Mon Mar 30 22:08:36 2015 -0400

    fuse: switch fuse_direct_io_file_operations to ->{read,write}_iter()

    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
“Global mount” is currently SLASH2’s way to support distributed
metadata operations.
It uses a few bits in the FID space to determine the MDS that a client
should talk to in order to access the corresponding file.
SLASH2 has a configuration file that can be used to find all the MDSes
without the need to have another higher-level manager.
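The FID-to-MDS mapping can be sketched as follows. The 10-bit site-id width and its position at the top of the 64-bit FID are assumptions for illustration, not the actual SLASH2 layout:

```python
# Hypothetical FID layout: assume the top 10 bits of a 64-bit FID
# carry the site id of the MDS that owns the file.
SITE_BITS = 10
FID_BITS = 64

def make_fid(site, inum):
    """Compose a FID from a site id and a per-site inode number."""
    return (site << (FID_BITS - SITE_BITS)) | inum

def fid_site(fid):
    """Extract the owning site's id, i.e. which MDS to contact."""
    return fid >> (FID_BITS - SITE_BITS)

fid = make_fid(5, 42)
print(fid_site(fid))  # 5
```

With a scheme like this, a client can route any operation on an existing file to the right MDS purely from the FID, using the configuration file only to translate a site id into a server address.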
When a SLASH2 instance is mounted on a client, it sends the ROOT
FID (1) to a metadata server.
Depending on whether global mount is enabled on the target MDS or not,
it can return either the super root or the root of the name space of the
target MDS.
Here is some ASCII art:
+-----+
| / | super root
+-----+
|
|
+-------------------+-------------------+
| | |
v v v
+-----+ +------+ +-----+
| PSC | | PITT | | CMU |
+-----+ +------+ +-----+
| | |
| | |
+---------+ +---------+ +---------+
| | | | | |
v v v v v v
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+
| dir1/ | | file1 | | dir2/ | | file2 | | dir3/ | | file3 |
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+
For example, if we mount against the MDS at PSC, we will see dir1 and
file1 under the root.
If global mount is enabled on PSC, then we will see PSC, PITT, CMU
under the root.
Afterward the client is responsible for contacting the correct MDS for
leases.
As expected, hardlinks across two MDSes will be rejected.
Symbolic links work, but it is advisable to use absolute target names.
This is consistent with regular file systems.
In addition, no regular files or directories can be created under the
super root.
In the past few months, SLASH2 has received some impressive I/O
performance improvements for both reads and writes.
On the write side, we used to do a read-before-write for misaligned
writes.
This badly hurt the performance of applications like genetorrent.
Luckily, we already had the logic to flush only the parts of a page
that are actually dirty. Now we add a few new fields to each page that
track the ranges that have been written with new data.
That way, we no longer have to read over the network if an
application is only interested in writing data, or if a read can be
satisfied by previously written data.
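The idea can be sketched as follows; the class, field names, and page size are made up for illustration and are not the actual SLASH2 structures. Each cached page remembers which byte ranges hold freshly written data, so a read that falls entirely inside those ranges never has to touch the network:

```python
class CachedPage:
    """Toy model of a cache page that tracks ranges holding new data."""

    def __init__(self, size=32768):
        self.size = size
        self.data = bytearray(size)
        self.written = []  # sorted, non-overlapping (start, end) ranges

    def write(self, off, buf):
        self.data[off:off + len(buf)] = buf
        self._merge(off, off + len(buf))

    def _merge(self, start, end):
        """Fold the new range into the list, absorbing overlaps."""
        merged = []
        for s, e in self.written:
            if e < start or s > end:            # disjoint: keep as-is
                merged.append((s, e))
            else:                               # overlapping or adjacent
                start, end = min(start, s), max(end, e)
        merged.append((start, end))
        self.written = sorted(merged)

    def covered(self, off, length):
        """True if [off, off+length) is served by written data alone."""
        return any(s <= off and off + length <= e for s, e in self.written)

page = CachedPage()
page.write(0, b"a" * 100)
page.write(100, b"b" * 100)
print(page.covered(50, 100))   # True: no network read needed
print(page.covered(150, 100))  # False: would need a read first
```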
Two caveats:

  * If a read has to go over the network, pending writes must be flushed
    first.
  * If a read over the network is in progress, a new write has to wait.

So the order matters.
On the read side, we used to have two problems.
The first problem was that we only launched readahead adjacent
to the end of the current read request.
In other words, we did not keep a pipeline of pre-read pages filled
ahead of incoming read requests.
In the new code, the readahead window can sit some distance, say 4 MiB,
ahead of the current read request.
This gives us a whopping 3-4 fold increase on some dd benchmarks.
The second problem was that our readahead was confined to a single bmap:
the readahead logic was reset each time we crossed a 128 MiB bmap
boundary.
The new code uses a readahead thread to launch the readahead that is
beyond the current bmap.
This gives us a further 10% boost.
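The two fixes can be sketched together; the helper name and the exact window arithmetic are invented for illustration (only the 4 MiB window and 128 MiB bmap size come from the text). Readahead is issued up to a fixed window ahead of the current read, and any portion that crosses the current bmap boundary is split off so it can be handed to the separate readahead thread:

```python
BMAP_SIZE = 128 * 1024 * 1024   # bmap covers 128 MiB of the file
RA_WINDOW = 4 * 1024 * 1024     # keep readahead up to 4 MiB ahead

def readahead_plan(read_off, read_len, ra_done_upto):
    """Return (in_bmap, beyond_bmap) extents to prefetch.

    ra_done_upto is how far readahead has already been issued; extend
    it to RA_WINDOW bytes past the end of the current read.
    """
    target = read_off + read_len + RA_WINDOW
    start = max(ra_done_upto, read_off + read_len)
    if start >= target:
        return None, None                  # window already full
    # split at the boundary of the bmap the current read lives in
    boundary = (read_off // BMAP_SIZE + 1) * BMAP_SIZE
    if target <= boundary:
        return (start, target), None       # stays inside this bmap
    if start >= boundary:
        return None, (start, target)       # entirely in the next bmap
    # the tail extent is what gets handed to the readahead thread
    return (start, boundary), (boundary, target)

# a read near the end of bmap 0: part of the window spills into bmap 1
in_bmap, beyond = readahead_plan(BMAP_SIZE - 1024 * 1024, 65536, 0)
```

Splitting at the boundary lets the main read path stay within its bmap lease while the readahead thread independently sets up the next bmap, which is where the extra 10% comes from.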
A few days ago, I committed bits to convert the replication arrangement
engine in the MDS to use batch RPC processing.
This gobbles up a bunch of tiny requests destined for the same IOS
into a single RPC and blasts it off, achieving more effective
throughput through less RPC overhead, especially for many small files.
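The batching idea, in a hedged sketch: the class, thresholds, and send_rpc hook below are invented for illustration, not the MDS code. Small requests destined for the same IOS accumulate until a size or age threshold is hit, then go out as one RPC:

```python
import time
from collections import defaultdict

class RpcBatcher:
    """Toy batcher: coalesce small per-IOS requests into single RPCs."""

    def __init__(self, send_rpc, max_items=128, max_age=0.1):
        self.send_rpc = send_rpc      # callback: send_rpc(ios, [requests])
        self.max_items = max_items    # flush when a batch reaches this size
        self.max_age = max_age        # ...or has waited this long (seconds)
        self.pending = defaultdict(list)
        self.first_queued = {}

    def add(self, ios, request):
        batch = self.pending[ios]
        if not batch:
            self.first_queued[ios] = time.monotonic()
        batch.append(request)
        if len(batch) >= self.max_items:
            self.flush(ios)

    def tick(self):
        """Flush any batch that has been waiting longer than max_age."""
        now = time.monotonic()
        for ios in list(self.pending):
            if self.pending[ios] and now - self.first_queued[ios] >= self.max_age:
                self.flush(ios)

    def flush(self, ios):
        batch, self.pending[ios] = self.pending[ios], []
        if batch:
            self.send_rpc(ios, batch)

sent = []
b = RpcBatcher(lambda ios, reqs: sent.append((ios, len(reqs))), max_items=3)
for i in range(7):
    b.add("ios1", f"replicate fid {i}")
b.flush("ios1")
print(sent)  # [('ios1', 3), ('ios1', 3), ('ios1', 1)]
```

Seven tiny requests turn into three RPCs instead of seven; with the real threshold of dozens or hundreds of requests per RPC, the per-message overhead mostly disappears.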
I noticed an interesting problem on the DSC deployment the other
day.
Performance was varying wildly with single threaded I/O tests (rsync
-P).
To gain some insight, I took a look at the leases the MDS was
assigning, counting the number of occurrences of leases issued to each
I/O system.
There was an obvious problem: sense2s5 was getting preferential treatment.
Examining the code, I see that we copy the list of I/O systems starting
from a position P in the list.
When we reach the end of the list, we start over from the beginning up
to position P.
Then, we increment P for the next request, as an approach to round-robin
selection of I/O systems.
In theory, this should work, but unavailable servers give an unfair
advantage to the first available server that follows a run of them in
the list: if there are N unavailable servers in the run, that server
gets hammered N + 1 times.
The solution was to take the list and shuffle it instead, resulting in a
much nicer load distribution.
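The effect is easy to reproduce in a toy simulation; the server names, counts, and helper functions below are invented, and this is a sketch of the selection logic rather than the MDS code. With two of four servers down, the first live server after the dead run is picked N + 1 = 3 times per cycle, while shuffling evens the load out:

```python
import random
from collections import Counter

def pick_round_robin(servers, alive, p):
    """Scan from position p, wrapping, and pick the first live server."""
    n = len(servers)
    for i in range(n):
        s = servers[(p + i) % n]
        if s in alive:
            return s

servers = ["ios0", "ios1", "ios2", "ios3"]
alive = {"ios0", "ios3"}      # ios1 and ios2 are unavailable (N = 2)

# P is incremented per request, so it cycles over all positions
rr = Counter(pick_round_robin(servers, alive, p % 4) for p in range(400))
print(rr["ios3"], rr["ios0"])  # 300 100: ios3 takes N + 1 slots per cycle

def pick_shuffled(servers, alive):
    """Shuffle a copy of the list, then pick the first live server."""
    for s in random.sample(servers, len(servers)):
        if s in alive:
            return s

sh = Counter(pick_shuffled(servers, alive) for _ in range(10000))
# the two live servers now split the leases roughly evenly
```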