While running Gromacs benchmarks in different setups (intra-node vs. 2, 3, and 4 nodes connected via InfiniBand), we have noticed severe performance degradation.
In order to investigate, we wrote a test program that uses MPI_Alltoall()
to transfer messages of various sizes (4 bytes to 2 MB) among all nodes.
Both the internal timings of the test program and statistics gathered from IntelMPI's I_MPI_STATS
facility show the same pattern: for small payloads the resulting bandwidth is acceptable, whereas for larger payloads the behaviour becomes erratic: some of the transfers take extremely long (about 2.15 seconds), so the average performance collapses. These very long delays occur stochastically, so they may be absent at small sample sizes (e.g. 100 transfers per payload size). Here is some sample data taken with 4 nodes at 1000 transfers per size (times in microseconds):
# Message size Call count Min time Avr time Max time Total time
Alltoall
1 2097152 1000 5649.09 13420.98 2152225.97 13420980.69
2 1048576 1000 2874.85 13000.87 2151684.05 13000867.13
3 524288 1000 1404.05 8484.15 2149509.91 8484153.99
4 262144 1000 719.07 5308.87 2148617.98 5308866.74
5 131072 1000 364.78 9223.77 2148303.99 9223767.04
6 65536 1000 206.95 5124.41 2147943.97 5124409.44
7 32768 1000 120.88 12562.09 2147678.85 12562089.68
8 16384 1000 36.00 57.03 93.94 57034.25
9 8192 1000 22.89 34.80 103.00 34803.87
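To put the table in perspective, the effective per-call bandwidth can be derived directly from it (message size divided by time; times are in microseconds, so bytes/µs equals MB/s). A quick calculation over a few of the rows above:

```python
# Effective per-call bandwidth derived from the I_MPI_STATS rows above.
# Times are in microseconds, so bytes/us comes out as (decimal) MB/s.
rows = [
    # (message size in bytes, min time us, avg time us, max time us)
    (2097152, 5649.09, 13420.98, 2152225.97),
    (524288,  1404.05,  8484.15, 2149509.91),
    (16384,     36.00,    57.03,      93.94),
]

for size, tmin, tavg, tmax in rows:
    print(f"{size:8d} B: best {size / tmin:8.1f} MB/s, "
          f"avg {size / tavg:8.1f} MB/s, worst {size / tmax:8.2f} MB/s")
```

So even the best-case 2 MB calls only reach roughly 370 MB/s, and the ~2.15 s outliers drag the worst case below 1 MB/s.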
We are using QDR InfiniBand via an unmanaged switch, and IntelMPI 4.0.3. I tried to take MPI out of the equation by setting up a ring-like transfer (node1 -> node2 -> node3 -> node4 -> node1) with ib_send_bw, but did not observe any problematic behaviour:
#bytes #iterations BW peak[MB/sec] BW average[MB/sec]
16384 10000 1202.93 1202.91
32768 10000 1408.94 1367.46
65536 10000 1196.71 1195.85
131072 10000 1195.68 1180.71
262144 10000 1197.27 1167.45
524288 10000 1162.94 1154.15
1048576 10000 1184.48 1151.31
2097152 10000 1163.39 1143.60
4194304 10000 1157.77 1141.84
8388608 10000 1141.23 1138.36
My question: is there any way to dig deeper into this and find the root cause of the problem? I have already gone through the IntelMPI reference manual, but found nothing helpful apart from I_MPI_STATS.