io_submit() support for SLASH2

Direct I/O support has been in SLASH2 for a long time. It is controlled by the direct_io flag, which is set on a per-file basis.

On the other hand, the Linux kernel did not have O_DIRECT support for FUSE file systems until release 3.4. Later, in release 3.10, the kernel added asynchronous direct I/O support for FUSE. This is important for applications like MySQL that do asynchronous I/O via io_submit().

However, MySQL fails with EINVAL on SLASH2. We tried every kernel release from 3.10 all the way to 4.1. As it turns out, the two ways to specify direct I/O (O_DIRECT versus direct_io) were not well integrated until the 4.1 release. For earlier releases, we have to use the following workaround to make asynchronous direct I/O work on SLASH2:

        # msctl -p sys.direct_io=0

Note that even if SLASH2 turns off its side of direct I/O, an application can continue to reap the benefits of direct I/O if it opens a file with O_DIRECT.
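
To test this path directly, a small program is enough. Below is a minimal sketch using libaio that opens a file with O_DIRECT and submits one read via io_submit(), the same code path MySQL exercises; the 4096-byte size and alignment are assumptions that cover common block sizes:

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	io_context_t ctx = 0;
	struct iocb cb, *cbs[1] = { &cb };
	struct io_event ev;
	void *buf;
	int fd, rc;

	if (argc != 2) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return (1);
	}

	/* O_DIRECT is the application's side of direct I/O. */
	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	/* O_DIRECT requires aligned buffers; 4096 covers most devices. */
	if (posix_memalign(&buf, 4096, 4096))
		return (1);

	rc = io_setup(1, &ctx);
	if (rc < 0) {
		fprintf(stderr, "io_setup: %s\n", strerror(-rc));
		return (1);
	}

	io_prep_pread(&cb, fd, buf, 4096, 0);

	/* Submit the asynchronous direct read. */
	rc = io_submit(ctx, 1, cbs);
	if (rc < 0) {
		fprintf(stderr, "io_submit: %s\n", strerror(-rc));
		return (1);
	}

	rc = io_getevents(ctx, 1, 1, &ev, NULL);
	if (rc == 1)
		printf("read %ld bytes\n", (long)ev.res);

	io_destroy(ctx);
	free(buf);
	close(fd);
	return (0);
}

Compile with -laio and run it against a file on a SLASH2 mount; on an affected kernel, one of these calls fails with EINVAL.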

For the curious, I believe that the following kernel patch fixes the problem:

zhihui@krakatoa:~/linux-git$ git show 15316263649d9eed393d75095b156781a877eb06 | head -8
commit 15316263649d9eed393d75095b156781a877eb06
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Mon Mar 30 22:08:36 2015 -0400

    fuse: switch fuse_direct_io_file_operations to ->{read,write}_iter()
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Global mount support for SLASH2

“Global mount” is currently SLASH2’s way of supporting distributed metadata operations. It uses a few bits in the FID space to determine which MDS a client should talk to in order to access the corresponding file.
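
Here is a minimal sketch of that routing idea. The bit layout below is made up for illustration; the number of bits SLASH2 actually reserves for the site ID may differ:

#include <stdint.h>
#include <stdio.h>

/* Illustrative layout only; the real SLASH2 FID layout may differ. */
#define FID_SITE_BITS	10
#define FID_SITE_SHIFT	(64 - FID_SITE_BITS)

/* Extract the site ID so the client knows which MDS owns the file. */
static uint64_t
fid_get_siteid(uint64_t fid)
{
	return (fid >> FID_SITE_SHIFT);
}

int
main(void)
{
	uint64_t fid = ((uint64_t)3 << FID_SITE_SHIFT) | 0x1234;

	printf("fid %#llx belongs to site %llu\n",
	    (unsigned long long)fid,
	    (unsigned long long)fid_get_siteid(fid));
	return (0);
}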

SLASH2 has a configuration file that can be used to find all the MDSes without the need for another higher-level manager. When a SLASH2 instance is mounted on a client, the client looks up the ROOT FID (1) at a metadata server. Depending on whether global mount is enabled on the target MDS, it returns either the super root or the root of its own name space.

Here is some ASCII art:

			  +-----+
			  |  /  |      super root
			  +-----+
			     |
			     |
	 +-------------------+-------------------+
	 |                   |                   |
	 v                   v                   v
      +-----+            +------+             +-----+
      | PSC |            | PITT |             | CMU |
      +-----+            +------+             +-----+
	 |                   |                   |
	 |                   |                   |
    +---------+         +---------+         +---------+
    |         |         |         |         |         |
    v         v         v         v         v         v
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+
| dir1/ | | file1 | | dir2/ | | file2 | | dir3/ | | file3 |
+-------+ +-------+ +-------+ +-------+ +-------+ +-------+

For example, if we mount against the MDS at PSC, we will see dir1 and file1 under the root. If global mount is enabled on PSC, we will instead see PSC, PITT, and CMU under the root.

Afterward, the client is responsible for contacting the correct MDS for leases.

As expected, hard links across two MDSes are rejected. Symbolic links work, but it is advisable to use absolute target names. This is consistent with regular file systems.

In addition, no regular files or directories can be created under the super root.

Recent I/O improvements in SLASH2

In the past few months, SLASH2 has received some impressive I/O performance improvements for both reads and writes.

On the write side, we used to do a read-before-write for misaligned writes. This badly hurt the performance of applications like GeneTorrent. Luckily, we already had the logic to flush only the parts of the pages that are actually dirty. Now we add a few new fields to each page to track the area that has been written with new data. That way, we no longer have to read over the network if an application is only interested in writing data, or if a read can be satisfied by previously written data. Two caveats:

  • if a read has to go over the network, pending writes must be flushed first.
  • if a read over the network is in progress, a new write has to wait.

So the order matters.
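
Here is a minimal sketch of the page-tracking idea. The structure and field names are hypothetical, not SLASH2's actual page cache, and a single range is a simplification:

#include <stdint.h>

/* Hypothetical per-page bookkeeping; a single [start, end) range is a
 * simplification, as the real code must also cope with disjoint dirty
 * regions within one page. */
struct page_entry {
	char		data[32 * 1024];	/* page payload */
	uint32_t	new_start;		/* first byte of new data */
	uint32_t	new_end;		/* one past the last byte */
};

/* Record a write of [off, off + len): grow the tracked range. */
static void
page_record_write(struct page_entry *p, uint32_t off, uint32_t len)
{
	if (p->new_start == p->new_end) {	/* page had no new data */
		p->new_start = off;
		p->new_end = off + len;
	} else {
		if (off < p->new_start)
			p->new_start = off;
		if (off + len > p->new_end)
			p->new_end = off + len;
	}
}

/* A read of [off, off + len) can skip the network only when it falls
 * entirely within data the application has already written. */
static int
page_read_is_local(const struct page_entry *p, uint32_t off, uint32_t len)
{
	return (off >= p->new_start && off + len <= p->new_end);
}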

On the read side, we used to have two problems. The first was that we only launched readahead adjacent to the end of the current read request; in other words, we did not keep a pipeline of pre-read pages ahead of incoming read requests. In the new code, the readahead window can run some distance, say 4MiB, ahead of the current read request. This gives us a 3-4 fold increase on some dd benchmarks. The second problem was that our readahead was confined to a single bmap: the readahead logic was reset each time we crossed a 128MiB bmap boundary. The new code uses a dedicated readahead thread to launch readahead beyond the current bmap, which gives us a further 10% boost.
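
The window logic can be sketched in a few lines. This is a hypothetical illustration, not the actual client code; only the 4MiB window and the 128MiB bmap size come from the description above:

#include <stdint.h>
#include <stdio.h>

#define RA_WINDOW	(4 * 1024 * 1024)	/* how far readahead may run ahead */
#define BMAP_SIZE	(128 * 1024 * 1024)	/* SLASH2 bmap size */

/* Stubs standing in for the real I/O paths. */
static void
issue_readahead(uint64_t off, uint64_t len)
{
	printf("inline readahead: %llu bytes at %llu\n",
	    (unsigned long long)len, (unsigned long long)off);
}

static void
enqueue_for_ra_thread(uint64_t off, uint64_t len)
{
	printf("readahead thread: %llu bytes at %llu\n",
	    (unsigned long long)len, (unsigned long long)off);
}

/* Extend the readahead window to RA_WINDOW bytes past the current
 * read offset; *ra_off tracks how far readahead has reached. */
static void
launch_readahead(uint64_t off, uint64_t *ra_off)
{
	uint64_t target = off + RA_WINDOW;
	uint64_t boundary = (off / BMAP_SIZE + 1) * BMAP_SIZE;

	if (*ra_off < off)
		*ra_off = off;		/* the application jumped ahead */
	if (target <= *ra_off)
		return;			/* the window is already full */

	if (target <= boundary) {
		/* The entire extension stays within the current bmap. */
		issue_readahead(*ra_off, target - *ra_off);
	} else if (*ra_off >= boundary) {
		/* Already past the boundary: it all goes to the thread. */
		enqueue_for_ra_thread(*ra_off, target - *ra_off);
	} else {
		/* Fill up to the bmap boundary inline... */
		issue_readahead(*ra_off, boundary - *ra_off);
		/* ...and hand the rest to the dedicated thread so the
		 * window survives the 128MiB bmap crossing. */
		enqueue_for_ra_thread(boundary, target - boundary);
	}
	*ra_off = target;
}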

Batch processing of replication requests

A few days ago, I committed bits to convert the replication arrangement engine in the MDS to use batch RPC processing. This gathers a bunch of tiny requests intended for the same destination IOS into a single RPC and blasts it off, achieving better throughput through lower per-RPC overhead, especially for many small files.
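
The pattern is simple enough to sketch. The structures and names below are hypothetical, not SLASH2's actual replication engine:

#include <stdint.h>

#define BATCH_MAX	128		/* requests coalesced per RPC */

struct repl_req {
	uint64_t	fid;		/* file being replicated */
	uint32_t	bmapno;		/* which 128MiB bmap */
};

struct repl_batch {
	struct repl_req	reqs[BATCH_MAX];
	int		nreqs;
};

/* Queue one request for a destination IOS; a single RPC goes out only
 * when the batch fills (a timer would also flush partial batches). */
static void
batch_add(struct repl_batch *b, const struct repl_req *rq,
    void (*send_rpc)(const struct repl_req *, int))
{
	b->reqs[b->nreqs++] = *rq;
	if (b->nreqs == BATCH_MAX) {
		send_rpc(b->reqs, b->nreqs);	/* one RPC, many requests */
		b->nreqs = 0;
	}
}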

Fix unfair load distribution bug in IOS selection

I noticed an interesting problem on the DSC deployment the other day. Performance varied wildly with single-threaded I/O tests (rsync -P). I took a look at the leases the MDS was assigning to gain some insight:

yanovich@illusion2$ slmctl -sbml | awk '{print $3, $5}' | sort | uniq -c | column -t
      1 io-system		flags
      6 <any>		R---TB-----
      5 sense2s2@PSCARCH	-W--TB-----
      9 sense2s4@PSCARCH	-W--TB-----
    182 sense2s5@PSCARCH	-W--TB-----
      5 sense2s6@PSCARCH	-W--TB-----
      7 sense2s7@PSCARCH	-W--TB-----
      5 sense3s0@PSCARCH	-W--TB-----
      6 sense3s1@PSCARCH	-W--TB-----
      6 sense3s2@PSCARCH	-W--TB-----
      6 sense3s3@PSCARCH	-W--TB-----
      5 sense3s4@PSCARCH	-W--TB-----
      5 sense3s5@PSCARCH	-W--TB-----
      6 sense3s6@PSCARCH	-W--TB-----
      6 sense4s0@PSCARCH	-W--TB-----
      6 sense4s1@PSCARCH	-W--TB-----
      5 sense4s2@PSCARCH	-W--TB-----
      5 sense4s3@PSCARCH	-W--TB-----
      5 sense4s4@PSCARCH	-W--TB-----
      5 sense4s5@PSCARCH	-W--TB-----
      6 sense4s6@PSCARCH	-W--TB-----
     11 sense5s0@PSCARCH	-W--TB-----
      5 sense5s1@PSCARCH	-W--TB-----
      5 sense5s2@PSCARCH	-W--TB-----
      6 sense5s3@PSCARCH	-W--TB-----
     11 sense5s5@PSCARCH	-W--TB-----
      5 sense5s6@PSCARCH	-W--TB-----
      7 sense6s0@PSCARCH	-W--TB-----
      6 sense6s1@PSCARCH	-W--TB-----
      6 sense6s2@PSCARCH	-W--TB-----
      6 sense6s3@PSCARCH	-W--TB-----
      6 sense6s4@PSCARCH	-W--TB-----
      6 sense6s5@PSCARCH	-W--TB-----
      6 sense6s6@PSCARCH	-W--TB-----
      6 sense6s7@PSCARCH	-W--TB-----

This command counts the number of leases issued to each I/O system. There was an obvious problem: sense2s5 was receiving preferential treatment. Examining the code, I saw that we copy the list of I/O systems starting from a position P. When we reach the end of the list, we wrap around to the beginning and continue up to position P. P is then incremented for the next call, approximating a round-robin selection of I/O systems:

slashd/mds.c:

__static void
slm_resm_roundrobin(struct sl_resource *r, struct psc_dynarray *a)
{
	struct resprof_mds_info *rpmi = res2rpmi(r);
	struct sl_resm *m;
	int i, idx;

	RPMI_LOCK(rpmi);
	idx = slm_get_rpmi_idx(r);
	RPMI_ULOCK(rpmi);

	for (i = 0; i < psc_dynarray_len(&r->res_members); i++, idx++) {
		if (idx >= psc_dynarray_len(&r->res_members))
			idx = 0;

		m = psc_dynarray_getpos(&r->res_members, idx);
		psc_dynarray_add_ifdne(a, m);
	}
}

static __inline int
slm_get_rpmi_idx(struct sl_resource *res)
{
	struct resprof_mds_info *rpmi;
	int locked, n;

	rpmi = res2rpmi(res);
	locked = RPMI_RLOCK(rpmi);
	if (rpmi->rpmi_cnt >= psc_dynarray_len(&res->res_members))
		rpmi->rpmi_cnt = 0;
	n = rpmi->rpmi_cnt++;
	RPMI_URLOCK(rpmi, locked);
	return (n);
}

In theory, this should work, but unavailable servers give an unfair advantage to the first available server after a run of them. Because the caller walks the returned list in order and skips servers that are down, that first available server is chosen whenever the starting index lands on any of the N unavailable slots or on its own, so it gets hammered N + 1 times as often as its peers.

The solution was to take the list and shuffle it, resulting in a much nicer load distribution.
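
For reference, here is a minimal sketch of that approach using a Fisher-Yates shuffle; it is an illustration, not the actual SLASH2 change:

#include <stdlib.h>

/* Shuffle the copied member list so that no single position is
 * systematically favored when some servers are unavailable.
 * (Seeding and a bias-free random source are omitted for brevity.) */
static void
resm_shuffle(void **memb, int n)
{
	void *tmp;
	int i, j;

	for (i = n - 1; i > 0; i--) {
		j = rand() % (i + 1);	/* random slot in [0, i] */
		tmp = memb[i];
		memb[i] = memb[j];
		memb[j] = tmp;
	}
}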