Fix unfair load distribution bug in IOS selection
I noticed an interesting problem on the DSC deployment the other day: performance was varying wildly in single-threaded I/O tests (rsync -P). To gain some insight, I took a look at the leases the MDS was assigning:
yanovich@illusion2$ slmctl -sbml | awk '{print $3, $5}' | sort | uniq -c | column -t
1    io-system         flags
6    <any>             R---TB-----
5    sense2s2@PSCARCH  -W--TB-----
9    sense2s4@PSCARCH  -W--TB-----
182  sense2s5@PSCARCH  -W--TB-----
5    sense2s6@PSCARCH  -W--TB-----
7    sense2s7@PSCARCH  -W--TB-----
5    sense3s0@PSCARCH  -W--TB-----
6    sense3s1@PSCARCH  -W--TB-----
6    sense3s2@PSCARCH  -W--TB-----
6    sense3s3@PSCARCH  -W--TB-----
5    sense3s4@PSCARCH  -W--TB-----
5    sense3s5@PSCARCH  -W--TB-----
6    sense3s6@PSCARCH  -W--TB-----
6    sense4s0@PSCARCH  -W--TB-----
6    sense4s1@PSCARCH  -W--TB-----
5    sense4s2@PSCARCH  -W--TB-----
5    sense4s3@PSCARCH  -W--TB-----
5    sense4s4@PSCARCH  -W--TB-----
5    sense4s5@PSCARCH  -W--TB-----
6    sense4s6@PSCARCH  -W--TB-----
11   sense5s0@PSCARCH  -W--TB-----
5    sense5s1@PSCARCH  -W--TB-----
5    sense5s2@PSCARCH  -W--TB-----
6    sense5s3@PSCARCH  -W--TB-----
11   sense5s5@PSCARCH  -W--TB-----
5    sense5s6@PSCARCH  -W--TB-----
7    sense6s0@PSCARCH  -W--TB-----
6    sense6s1@PSCARCH  -W--TB-----
6    sense6s2@PSCARCH  -W--TB-----
6    sense6s3@PSCARCH  -W--TB-----
6    sense6s4@PSCARCH  -W--TB-----
6    sense6s5@PSCARCH  -W--TB-----
6    sense6s6@PSCARCH  -W--TB-----
6    sense6s7@PSCARCH  -W--TB-----
This command counts the number of leases issued to each I/O system. sense2s5 was obviously receiving preferential treatment. Examining the code, I see that we copy the list of I/O systems starting from a position P in the list. When we reach the end of the list, we wrap around to the beginning and continue up to position P. On the next call, P is incremented, which is meant to give round-robin selection of I/O systems:
slashd/mds.c:
__static void
slm_resm_roundrobin(struct sl_resource *r, struct psc_dynarray *a)
{
	struct resprof_mds_info *rpmi = res2rpmi(r);
	struct sl_resm *m;
	int i, idx;

	RPMI_LOCK(rpmi);
	idx = slm_get_rpmi_idx(r);
	RPMI_ULOCK(rpmi);

	for (i = 0; i < psc_dynarray_len(&r->res_members); i++, idx++) {
		if (idx >= psc_dynarray_len(&r->res_members))
			idx = 0;
		m = psc_dynarray_getpos(&r->res_members, idx);
		psc_dynarray_add_ifdne(a, m);
	}
}

static __inline int
slm_get_rpmi_idx(struct sl_resource *res)
{
	struct resprof_mds_info *rpmi;
	int locked, n;

	rpmi = res2rpmi(res);
	locked = RPMI_RLOCK(rpmi);
	if (rpmi->rpmi_cnt >= psc_dynarray_len(&res->res_members))
		rpmi->rpmi_cnt = 0;
	n = rpmi->rpmi_cnt++;
	RPMI_URLOCK(rpmi, locked);
	return (n);
}
In theory, this should work, but unavailable servers skew the result. A run of N unavailable servers gives an unfair advantage to the first available server after the run: every rotation that starts on one of the N unavailable servers falls through to it, so it gets hammered N + 1 times per cycle instead of once.
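To make the N + 1 effect concrete, here is a minimal standalone sketch, not taken from the SLASH2 tree. It assumes that whoever consumes the rotated list simply uses the first available member, and it models availability with a plain bool array; the function name count_first_picks is made up for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of slashd.
 *
 * For every rotation start p in [0, n), walk the member list from p
 * (wrapping around) and count the first available member as the one
 * that gets picked.  counts[] must have room for n entries.
 */
void
count_first_picks(const bool *up, size_t n, size_t *counts)
{
	size_t p, i, idx;

	for (i = 0; i < n; i++)
		counts[i] = 0;

	for (p = 0; p < n; p++) {
		for (i = 0; i < n; i++) {
			idx = (p + i) % n;
			if (up[idx]) {
				/* First available member wins this rotation. */
				counts[idx]++;
				break;
			}
		}
	}
}
```

With six members where members 1 through 3 are down, one full cycle of rotations picks member 4 four times (N + 1 with N = 3) and members 0 and 5 once each.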
The solution was to add the entire list and then shuffle it, resulting in a much more even load distribution.
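For illustration only, a shuffle of this kind could look like the Fisher-Yates sketch below. This is not the actual change that went into slashd/mds.c; shuffle_members is a hypothetical name, the list is modeled as a plain array of pointers rather than a psc_dynarray, and rand() stands in for whatever random source the real code uses.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of slashd.
 *
 * Shuffle an array of n pointers in place (Fisher-Yates).  Every
 * ordering is roughly equally likely, so no single member is
 * favored by its position relative to unavailable members.
 * rand() % k has a slight modulo bias, which is ignored here.
 */
void
shuffle_members(void **items, size_t n)
{
	size_t i, j;
	void *tmp;

	if (n < 2)
		return;

	for (i = n - 1; i > 0; i--) {
		/* Pick a slot in [0, i] and swap it into position i. */
		j = (size_t)rand() % (i + 1);
		tmp = items[i];
		items[i] = items[j];
		items[j] = tmp;
	}
}
```

Because each call produces an independent ordering, a run of unavailable members no longer funnels its share of requests to one fixed neighbor; the extra load is spread over all available members instead.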