Unix & Linux Stack Exchange is a question and answer site for users of Linux, FreeBSD and other Un*x-like operating systems.

While running Gromacs benchmarks in different setups (intra-node vs. 2, 3, and 4 nodes connected with InfiniBand), we have noticed severe performance degradation. To investigate, we have created a test program that uses MPI_Alltoall() to transfer data packets of various sizes (4 bytes to 2 MB) among all nodes. Both the internal timings of the test program and the statistics gathered from IntelMPI's I_MPI_STATS facility show the same pattern: for small payloads the resulting bandwidth is acceptable, whereas for larger payloads behaviour becomes erratic: some of the transfers take extremely long (about 2.15 seconds), so average performance collapses. These very long delays seem to occur stochastically, so they may be absent from small sample sizes (e.g. 100 transfers per payload size). Here is some sample data taken with 4 nodes at 1000 transfers per size (message sizes in bytes, times in µs):

#           Message size    Call count  Min time    Avr time    Max time    Total time

Alltoall
1           2097152         1000        5649.09     13420.98    2152225.97  13420980.69
2           1048576         1000        2874.85     13000.87    2151684.05  13000867.13
3           524288          1000        1404.05     8484.15     2149509.91  8484153.99
4           262144          1000        719.07      5308.87     2148617.98  5308866.74
5           131072          1000        364.78      9223.77     2148303.99  9223767.04
6           65536           1000        206.95      5124.41     2147943.97  5124409.44
7           32768           1000        120.88      12562.09    2147678.85  12562089.68
8           16384           1000        36.00       57.03       93.94       57034.25
9           8192            1000        22.89       34.80       103.00      34803.87

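Converting the timings above into effective bandwidth makes the collapse easy to see. A minimal sketch in Python, assuming the reported times are in microseconds (which matches the ~2.15 s outliers mentioned above) and that "Message size" is the per-call payload in bytes:

```python
# Effective bandwidth from the I_MPI_STATS-style timings above.
# Assumption: times are in microseconds and "Message size" is the
# payload in bytes; both are inferred from the question, not verified.

def bandwidth_mb_per_s(size_bytes, time_us):
    """Bytes moved per call divided by elapsed time, in MB/s.

    size_bytes / (time_us * 1e-6) / 1e6 simplifies to size_bytes / time_us.
    """
    return size_bytes / time_us

# (size in bytes, min time in us, avg time in us) -- rows 1 and 8 above
rows = [
    (2097152, 5649.09, 13420.98),
    (16384,   36.00,   57.03),
]

for size, tmin, tavg in rows:
    print(f"{size:>8} B  best {bandwidth_mb_per_s(size, tmin):7.1f} MB/s"
          f"  avg {bandwidth_mb_per_s(size, tavg):7.1f} MB/s")
```

The best-case 2 MB transfers reach a few hundred MB/s, while the occasional 2.15 s stalls drag the average far below the small-message figures.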
We are using QDR InfiniBand via an unmanaged switch, and IntelMPI 4.0.3. To check with MPI out of the picture, I set up a ring-like transfer (node1 -> node2 -> node3 -> node4 -> node1) with ib_send_bw, but did not observe any problematic behaviour:

#bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
16384      10000           1202.93            1202.91
32768      10000           1408.94            1367.46
65536      10000           1196.71            1195.85
131072     10000           1195.68            1180.71
262144     10000           1197.27            1167.45
524288     10000           1162.94            1154.15
1048576    10000           1184.48            1151.31
2097152    10000           1163.39            1143.60
4194304    10000           1157.77            1141.84
8388608    10000           1141.23            1138.36
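For reference, a ring measurement like the one above can be driven with the perftest tools; one plausible invocation per link (hostnames and the message size are assumptions for illustration, not taken from the actual scripts):

```shell
# On the receiving node of a link (e.g. node2), start the server side:
ib_send_bw -s 2097152 -n 10000

# On the sending node (e.g. node1), point the client at the receiver:
ib_send_bw -s 2097152 -n 10000 node2

# Repeat for node2 -> node3, node3 -> node4, and node4 -> node1.
```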

My question: is there any way to look deeper into this to find out what the root cause of the problem is? I have already looked through the IntelMPI reference manual, but have not seen anything helpful except for I_MPI_STATS.
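For reference, the statistics quoted above can be collected with settings along these lines (a sketch; the exact statistics level and the benchmark binary name are assumptions):

```shell
# Collect collective-operation statistics with Intel MPI 4.x.
export I_MPI_STATS=4           # statistics detail level (assumed; range 1-10)
export I_MPI_STATS_SCOPE=coll  # restrict gathering to collective operations
export I_MPI_DEBUG=5           # print fabric selection at startup for sanity checks
mpirun -n 4 -ppn 1 ./alltoall_benchmark   # hypothetical benchmark binary
```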

I don't think it is an MPI issue. Did you do extensive network benchmarks under full load? You should do some network tuning: modify the MTU of the interface, and tune the /proc/sys/net/ parameters on server and client. –  user55518 Apr 8 '14 at 13:44
    
@bersch: I did run a benchmark without MPI. The switch was not fully loaded, but I tried to duplicate the MPI setup as closely as possible: four nodes exchanging messages in a ring. As to network settings, can InfiniBand parameters be changed via /proc/sys/net? –  Ansgar Esztermann Apr 9 '14 at 10:19
    
You can tune InfiniBand via module parameters, and IPv{4,6} via /proc; see also publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/… I'd also try asking the Gromacs people. –  user55518 Apr 9 '14 at 20:00
