Readahead Implementation

In an effort to improve read performance, I’ve been working on patches to dynamically prefetch data when a file’s I/O access pattern is determined to be sequential. First attempts at the implementation were partially successful: the readahead (RA) logic was able to prefetch the relevant pages into the client’s cache, but the performance increase was not as large as had been expected. For comparison, SLASH2 writes can easily saturate a GigE connection, and on a host running the client and sliod over loopback I’ve seen >300 MB/s writes. (Note: performance there was disk bound.)

Reads are typically in the range of 60 MB/s, and with RA, about 75 MB/s. My working theory is that read prefetches must not block the requesting client thread for longer than necessary. As expected, most of the latency involved in transferring large chunks is due to link bandwidth: on GigE, it takes about 8 ms to transfer a 1 MB buffer. In the current implementation, if the client requests 32 KB it waits for the entire 1 MB RA I/O to complete, so the thread sits idle for about 7 ms. My hope was that this cost would be sufficiently amortized by subsequent read I/Os that a simple, synchronous RA model could be employed. Testing has revealed this is not the case. My first attempt at an async model will be done without the use of a dedicated thread.
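The latency figures above can be sanity-checked with some quick arithmetic. This is a back-of-the-envelope sketch, assuming GigE moves roughly 125 MB/s of payload (1 Gbit/s, ignoring protocol overhead); the names here are illustrative, not from the SLASH2 code.

```python
# Rough payload rate of a GigE link: 1 Gbit/s ~= 125 MB/s.
GIGE_BYTES_PER_SEC = 125 * 1024 * 1024

def transfer_ms(nbytes):
    """Time in milliseconds to move nbytes over the link."""
    return nbytes / GIGE_BYTES_PER_SEC * 1000

ra_ms = transfer_ms(1024 * 1024)   # the full 1 MB readahead I/O
req_ms = transfer_ms(32 * 1024)    # the 32 KB the client actually asked for
idle_ms = ra_ms - req_ms           # time the client thread sits blocked

print(f"1 MB RA transfer:        {ra_ms:.2f} ms")   # ~8 ms
print(f"32 KB request:           {req_ms:.2f} ms")
print(f"idle while RA completes: {idle_ms:.2f} ms") # ~7.75 ms
```

The client only needed ~0.25 ms of that 8 ms window, which is why a synchronous RA model leaves most of the prefetch time as dead wait.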

MDS replication/update scheduler performance improvements

The update scheduler, which oversees operations such as replication, has been changed to schedule multiple replication operations simultaneously, as many as are necessary to saturate the bandwidth between I/O server pairs.

This should greatly improve the performance of replicating many small files, as well as take advantage of long, fat networks that benefit from pipelining more than one bmap (64 MB) of data at a time.

The old scheme was a proof of concept that allowed only a single bmap at a time to be in transmission between an I/O server pair engaged in replication. Now the system assigns a bandwidth limit to each pair of I/O servers and queues operations only when they would not exceed that limit.
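The queueing rule above can be sketched in a few lines. This is a minimal illustration, not the actual MDS scheduler: the class name, the per-pair cap value, and the per-bmap bandwidth estimate are all assumptions made for the example.

```python
# Hypothetical sketch of bandwidth-capped scheduling between one
# I/O server pair.  The real SLASH2 update scheduler is more involved.

class IosPair:
    """Tracks in-flight replication work between one I/O server pair."""
    def __init__(self, cap_mbps):
        self.cap_mbps = cap_mbps    # bandwidth limit assigned to this pair
        self.inflight_mbps = 0.0    # bandwidth consumed by queued bmaps
        self.queued = []

    def try_schedule(self, bmap_id, est_mbps):
        # Queue the bmap only if it would not exceed the pair's limit.
        if self.inflight_mbps + est_mbps > self.cap_mbps:
            return False
        self.inflight_mbps += est_mbps
        self.queued.append(bmap_id)
        return True

    def complete(self, bmap_id, est_mbps):
        # Finished transfers free up headroom for the next bmap.
        self.queued.remove(bmap_id)
        self.inflight_mbps -= est_mbps

pair = IosPair(cap_mbps=1000)  # e.g. a GigE-class path between sites
scheduled = [pair.try_schedule(i, est_mbps=300) for i in range(5)]
print(scheduled)  # only the bmaps that fit under the cap are queued
```

Under a 1000 Mbps cap and 300 Mbps per transfer, three bmaps are queued concurrently and the remaining two wait for a `complete()` to free headroom, in contrast to the old one-bmap-at-a-time scheme.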

In the future, this algorithm will probably only need to be adjusted to more accurately respect the real, dynamic limits between I/O servers spread across sites, based on their actual topological characteristics, as well as to accommodate special needs such as network reservation by administrator policy.

NARA / PSC SLASH2 collaboration file system installed

A test SLASH2 server was put up at the NARA ABL lab.

SLASH2 at SC10

Stop by the PSC booth during SC10 for a SLASH2 demo!

More testing: this time, chmod

(pauln@born-of-fire:pauln)$ (cd /s2/pjd-fstest-20080816/ && sudo prove -f tests/chmod/00.t)
tests/chmod/00.t .. 46/58 
not ok 56
tests/chmod/00.t .. Failed 1/58 subtests 

Test Summary Report
-------------------
tests/chmod/00.t (Wstat: 0 Tests: 58 Failed: 1)
  Failed test:  56
Files=1, Tests=58,  7 wallclock secs ( 0.05 usr  0.00 sys +  1.73 cusr  0.11 csys =  1.89 CPU)
Result: FAIL

===============
# POSIX: If the calling process does not have appropriate privileges, and if
# the group ID of the file does not match the effective group ID or one of the
# supplementary group IDs and if the file is a regular file, bit S_ISGID
# (set-group-ID on execution) in the file's mode shall be cleared upon
# successful return from chmod().

expect 0 create ${n0} 0755
expect 0 chown ${n0} 65535 65535
expect 0 -u 65535 -g 65535 chmod ${n0} 02755
expect 02755 stat ${n0} mode
expect 0 -u 65535 -g 65535 chmod ${n0} 0755
expect 0755 stat ${n0} mode

# Unfortunately FreeBSD doesn't clear set-gid bit, but returns EPERM instead.                                                           
case "${os}" in
FreeBSD)
	expect EPERM -u 65535 -g 65534 chmod ${n0} 02755
	expect 0755 stat ${n0} mode
	;;
*)
	expect 0 -u 65535 -g 65534 chmod ${n0} 02755
	expect 0755 stat ${n0} mode
	;;
esac

Test 56, the one that fails, is where we try to set the setgid bit when the effective GID does not match that of the file.
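The mode arithmetic the test checks can be shown with the stdlib `stat` module. This is just an illustration of the bits involved, assuming the POSIX behavior quoted above (setgid cleared on an unprivileged chmod by a non-matching group), not a reproduction of the fstest harness.

```python
import stat

requested = 0o2755                   # the chmod argument in test 56
assert requested & stat.S_ISGID      # setgid bit is set in the request

cleared = requested & ~stat.S_ISGID  # what POSIX says should be stored
print(oct(cleared))                  # 0o755, the mode the test expects
```

SLASH2 evidently leaves the setgid bit in place (or returns an unexpected status), which is why `expect 0755 stat ${n0} mode` reports `not ok 56`.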