Intel® MPI Benchmarks User Guide and Methodology Description
You can control all aspects of the Intel® MPI Benchmarks through the command line. The general command-line syntax is the following:
IMB-MPI1 [-h{elp}] [-npmin <P_min>] [-multi <outflag>] [-off_cache <cache_size[,cache_line_size]>] [-iter <msgspersample[,overall_vol[,msgs_nonaggr[,iter_policy]]]>] [-iter_policy <iter_policy>] [-time <max_runtime per sample>] [-mem <max. mem usage per process>] [-msglen <Lengths_file>] [-map <PxQ>] [-input <filename>] [-include] [benchmark1 [,benchmark2 [,...]]] [-exclude] [benchmark1 [,benchmark2 [,...]]] [-msglog [<minlog>:]<maxlog>] [-thread_level <level>] [-sync <mode>] [-root_shift <mode>] [benchmark1 [,benchmark2 [,...]]]
The command line is repeated in the output. The options may appear in any order.
Examples:
Get out-of-cache data for PingPong:
mpirun -np 2 IMB-MPI1 PingPong -off_cache -1
Run a very large configuration, restricted to at most 20 iterations, 1.5 seconds of run time per message size, and 2 GB of message buffer memory per process:
mpirun -np 512 IMB-MPI1 -npmin 512 alltoallv -iter 20 -time 1.5 -mem 2
Run the P_Read_shared benchmark with the minimum number of processes set to seven:
mpirun -np 14 IMB-IO P_Read_shared -npmin 7
Run the IMB-MPI1 benchmarks including PingPongSpecificSource and PingPingSpecificSource, but excluding the Alltoall and Alltoallv benchmarks. Set the transfer message sizes as 0, 4, 8, 16, 32, 64, 128:
mpirun -np 16 IMB-MPI1 -msglog 2:7 -include PingPongSpecificSource PingPingSpecificSource -exclude Alltoall Alltoallv
Run the PingPong, PingPing, PingPongSpecificSource and PingPingSpecificSource benchmarks with the transfer message sizes 0, 2^0, 2^1, 2^2, ..., 2^16:
mpirun -np 4 IMB-MPI1 -msglog 16 PingPong PingPing PingPongSpecificSource PingPingSpecificSource
Benchmark selection arguments are a sequence of blank-separated strings. Each string is the exact name of a benchmark; matching is case-insensitive.
For example, the arguments PingPong Allreduce after IMB-MPI1 specify that you want to run the PingPong and Allreduce benchmarks only:
mpirun -np 10 IMB-MPI1 PingPong Allreduce
By default, all benchmarks of the selected component are run.
The -npmin option specifies the minimum number of processes P_min to run all selected benchmarks on. The P_min value after -npmin must be an integer.
Given P_min, the benchmarks run on the processes with the numbers selected as follows:
P_min, 2P_min, 4P_min, ..., the largest 2^x P_min < P, and finally P
You may set P_min to 1. If you set P_min > P, Intel MPI Benchmarks interprets this value as P_min = P.
For example, to run the IMB-EXT benchmarks with the minimum number of processes set to five, call:
mpirun -np 11 IMB-EXT -npmin 5
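The process-count rule for -npmin can be sketched in a few lines of Python. This only illustrates the documented sequence and is not taken from the benchmark sources; the function name process_counts is hypothetical.

```python
def process_counts(p_min, p_total):
    """Return the process counts visited for a given -npmin value.

    Follows the documented sequence P_min, 2*P_min, 4*P_min, ...,
    up to the largest value below P, followed by P itself.
    """
    if p_min < 1:
        raise ValueError("P_min must be a positive integer")
    # The guide states that P_min > P is treated as P_min = P.
    p_min = min(p_min, p_total)
    counts = []
    n = p_min
    while n < p_total:
        counts.append(n)
        n *= 2
    counts.append(p_total)
    return counts


# Counts for: mpirun -np 11 IMB-EXT -npmin 5
print(process_counts(5, 11))
```

With 11 processes and -npmin 5, the sketch yields 5, 10, and 11 processes.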
By default, all active processes are selected as described in the Running Intel® MPI Benchmarks section.
The -multi option defines whether the benchmark runs in multiple mode. In this mode, MPI_COMM_WORLD is split into several groups, which run simultaneously. The argument after -multi is a meta-symbol <outflag> that can take an integer value of 0 or 1:
outflag = 0: display only the maximum timings (minimum throughputs) over all active groups.
outflag = 1: report on all groups separately. The report may be long in this case.
When the number of processes running the benchmark is more than half of the total number of processes in MPI_COMM_WORLD, the multiple benchmark coincides with the non-multiple one, because no more than one process group can be created.
For example, if you run this command:
mpirun -np 16 IMB-MPI1 -multi 0 bcast -npmin 12
The benchmark will run in non-multiple mode, as the benchmarking starts from 12 processes, which is more than half of MPI_COMM_WORLD.
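The "more than half" condition can be checked with simple integer arithmetic. The sketch below assumes that the groups are disjoint and that every group has exactly the benchmarked number of processes; the guide does not spell out the splitting rule, and the function name multi_group_count is hypothetical.

```python
def multi_group_count(world_size, group_size):
    """Number of simultaneous groups that fit into MPI_COMM_WORLD.

    Assumption: groups are disjoint and all have exactly group_size members.
    """
    if not 1 <= group_size <= world_size:
        raise ValueError("group_size must be between 1 and world_size")
    return world_size // group_size


# mpirun -np 16 IMB-MPI1 -multi 0 bcast -npmin 12
print(multi_group_count(16, 12))  # a single group, so the run is effectively non-multiple
```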
By default, Intel® MPI Benchmarks run non-multiple benchmark flavors.
Use the -off_cache flag to avoid cache re-use. If you do not use this flag (default), the same communications buffer is used for all repetitions of one message size sample. In this case, Intel® MPI Benchmarks reuses the cache, so throughput results might be unrealistic.
The argument after -off_cache can be a single number (cache_size), two comma-separated numbers (cache_size,cache_line_size), or -1:
cache_size is a float for an upper bound of the size of the last level cache, in MB.
cache_line_size is assumed to be the size of a last level cache line (can be an upper estimate).
-1 uses values defined in IMB_mem_info.h. In this case, make sure to define values for cache_size and cache_line_size in IMB_mem_info.h.
The sent/received data is stored in buffers of size ~2x MAX(cache_size, message_size). When repetitively using messages of a particular size, their addresses are advanced within those buffers so that a single message is at least 2 cache lines after the end of the previous message. When these buffers are filled up, they are reused from the beginning.
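The buffer layout described above can be sketched as follows. This is a simplified model, not the benchmark's implementation: all sizes are in bytes, the buffer is taken as exactly 2 x MAX(cache_size, message_size), and each message is assumed to start exactly 2 cache lines after the end of the previous one (the guide says "at least"). The function name off_cache_layout is hypothetical.

```python
def off_cache_layout(message_size, cache_size, cache_line_size, repetitions):
    """Return (buffer_size, offsets) for successive messages, all in bytes."""
    # Buffer of roughly 2 x MAX(cache_size, message_size), as stated in the guide.
    buffer_size = 2 * max(cache_size, message_size)
    # Assumption: the next message starts exactly 2 cache lines after the
    # end of the previous one.
    stride = message_size + 2 * cache_line_size
    offsets = []
    position = 0
    for _ in range(repetitions):
        if position + message_size > buffer_size:
            position = 0  # buffer exhausted: reuse it from the beginning
        offsets.append(position)
        position += stride
    return buffer_size, offsets


# Tiny, unrealistic numbers chosen so that the wrap-around is visible.
print(off_cache_layout(message_size=100, cache_size=200,
                       cache_line_size=50, repetitions=4))
```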
-off_cache is effective for IMB-MPI1 and IMB-EXT. Avoid using this option for IMB-IO.
Examples:
Use the default values defined in IMB_mem_info.h:
-off_cache -1
2.5 MB last level cache, default line size:
-off_cache 2.5
16 MB last level cache, line size 128:
-off_cache 16,128
The off_cache mode might also be influenced by possible internal caching within the Intel® MPI Library, which can complicate the interpretation of results.
Default: no cache control.
Use the -iter option to control the number of iterations executed by every benchmark.
By default, the number of iterations is controlled through parameters MSGSPERSAMPLE, OVERALL_VOL, MSGS_NONAGGR, and ITER_POLICY defined in IMB_settings.h.
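You can optionally add one or more comma-separated arguments after the -iter flag to override the default values defined in IMB_settings.h. As the command-line syntax and the examples below show, the numeric arguments are positional (MSGSPERSAMPLE, then OVERALL_VOL, then MSGS_NONAGGR), and an iteration policy name can follow as the last argument.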
Examples:
To define MSGSPERSAMPLE as 2000, and OVERALL_VOL as 100, use the following string:
-iter 2000,100
To define MSGS_NONAGGR as 150, you must also supply values for MSGSPERSAMPLE and OVERALL_VOL, as shown in the following string:
-iter 1000,40,150
To define MSGSPERSAMPLE as 2000 and set the multiple_np policy, use the following string (see -iter_policy):
-iter 2000,multiple_np
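The way the -iter argument string maps onto the settings can be sketched as a small parser. It is inferred from the syntax line and the examples only, so the error handling and the rule that a policy name must come last are assumptions; the name parse_iter_argument is hypothetical.

```python
ITER_FIELDS = ("MSGSPERSAMPLE", "OVERALL_VOL", "MSGS_NONAGGR")
POLICIES = ("dynamic", "multiple_np", "auto", "off")


def parse_iter_argument(arg):
    """Map an -iter argument string to the settings it overrides."""
    overrides = {}
    tokens = [token.strip() for token in arg.split(",")]
    for position, token in enumerate(tokens):
        if token in POLICIES:
            # Assumption: a policy name is only accepted as the last argument.
            if position != len(tokens) - 1:
                raise ValueError("the iteration policy must be the last argument")
            overrides["ITER_POLICY"] = token
        else:
            if position >= len(ITER_FIELDS):
                raise ValueError("too many numeric arguments")
            overrides[ITER_FIELDS[position]] = int(token)
    return overrides


for example in ("2000,100", "1000,40,150", "2000,multiple_np"):
    print(example, "->", parse_iter_argument(example))
```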
Use the -iter_policy option to set a policy for automatic calculation of the number of iterations. Use one of the following arguments to override the default ITER_POLICY value defined in IMB_settings.h:
Policy | Description |
---|---|
dynamic | Reduces the number of iterations when the maximum run time per sample (see -time) is expected to be reached. Using this policy ensures faster execution, but may lead to inaccuracy of the results. |
multiple_np | Reduces the number of iterations when the message size is getting bigger. Using this policy ensures the accuracy of the results, but may lead to longer execution time. You can control the execution time through the -time option. |
auto | Automatically chooses which policy to use. |
off | The number of iterations does not change during the execution. |
You can also set the policy through the -iter option. See -iter.
By default, the ITER_POLICY defined in IMB_settings.h is used.
The -time option specifies the number of seconds for the benchmark to run per message size (the maximum run time per sample). The argument after -time is a floating-point number.
The combination of this flag with the -iter flag or its default alternative ensures that the Intel® MPI Benchmarks always chooses the maximum number of repetitions that conform to all restrictions.
A rough number of repetitions per sample to fulfill the -time request is estimated in preparatory runs that use ~1 second overhead.
Default: -time is activated. The floating-point value specifying the run-time seconds per sample is set in the SECS_PER_SAMPLE variable defined in IMB_settings.h, or IMB_settings_io.h.
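How a time limit and an iteration limit combine can be sketched as taking the largest repetition count that satisfies both. This is a simplified model under stated assumptions: it considers only the -iter count and the -time limit (other restrictions are ignored), it takes the per-repetition time estimate as given, and it always runs at least one repetition. The function name repetitions_per_sample is hypothetical.

```python
def repetitions_per_sample(max_repetitions, max_seconds, seconds_per_repetition):
    """Largest repetition count that respects both the -iter and -time limits."""
    if seconds_per_repetition <= 0:
        raise ValueError("seconds_per_repetition must be positive")
    # Repetitions that fit into the time budget for one message size.
    by_time = int(max_seconds // seconds_per_repetition)
    # Assumption: at least one repetition is always executed.
    return max(1, min(max_repetitions, by_time))


print(repetitions_per_sample(1000, 2.0, 0.5))  # limited by -time
print(repetitions_per_sample(20, 2.0, 0.001))  # limited by -iter
```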
The -mem option specifies the number of GB to be allocated per process for the message buffers. If the size is exceeded, a warning is returned, stating how much memory is required for the overall run.
The argument after -mem is a floating-point number.
Default: the memory is restricted by MAX_MEM_USAGE defined in IMB_mem_info.h.
Use the -input option to select the benchmarks through an ASCII input file. For example, the IMB_SELECT_EXT file looks as follows:
#
# IMB benchmark selection file
#
# Every line must be a comment (beginning with #), or it
# must contain exactly one IMB benchmark name
#
#Window
Unidir_Get
#Unidir_Put
#Bidir_Get
#Bidir_Put
Accumulate
With the help of this file, the following command runs only Unidir_Get and Accumulate benchmarks of the IMB-EXT component:
mpirun .... IMB-EXT -input IMB_SELECT_EXT
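The selection-file rules (comment lines start with #, every other line holds exactly one benchmark name) can be sketched as a short parser. This illustrates the file format only and is not the benchmark's own reader; skipping blank lines is an assumption, and the name selected_benchmarks is hypothetical.

```python
SAMPLE = """\
#
# IMB benchmark selection file
#
# Every line must be a comment (beginning with #), or it
# must contain exactly one IMB benchmark name
#
#Window
Unidir_Get
#Unidir_Put
#Bidir_Get
#Bidir_Put
Accumulate
"""


def selected_benchmarks(text):
    """Return the benchmark names that are not commented out."""
    names = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines are skipped like comments.
        if not line or line.startswith("#"):
            continue
        names.append(line)
    return names


print(selected_benchmarks(SAMPLE))
```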
Enter any set of non-negative message lengths into an ASCII file, line by line, and call the Intel® MPI Benchmarks with arguments:
-msglen Lengths
The Lengths value overrides the default message lengths. For IMB-IO, the file defines the I/O portion lengths.
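A lengths file of this form can be generated with a short script. The sketch below writes one non-negative integer per line; the chosen lengths and the temporary directory are arbitrary examples, and the helper names format_lengths and write_lengths_file are hypothetical.

```python
import os
import tempfile


def format_lengths(lengths):
    """Return the file content: one non-negative message length per line."""
    for value in lengths:
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"invalid message length: {value!r}")
    return "".join(f"{value}\n" for value in lengths)


def write_lengths_file(path, lengths):
    """Write the lengths to an ASCII file suitable for -msglen."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(format_lengths(lengths))


# Example: write a file named Lengths into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "Lengths")
write_lengths_file(path, [0, 100, 1000, 10000])
print("pass this file to -msglen:", path)
```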
Use the -map <PxQ> option to number the processes along rows of the matrix:
0 | P | ... | (Q-2)P | (Q-1)P |
1 | | | | |
... | | | | |
P-1 | 2P-1 | | (Q-1)P-1 | QP-1 |
For example, to run Multi-PingPong between two nodes of size P, with each process on one node communicating with its counterpart on the other, call:
mpirun -np <2P> IMB-MPI1 -map <P>x2 PingPong
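The matrix for -map can be generated programmatically to see which processes end up in the same row. The sketch assumes that entry (i, j) of the P x Q matrix holds process number i + j*P, which matches the corner entries given in the guide, and that processes 0 to P-1 are placed on the first node. The function name map_matrix is hypothetical.

```python
def map_matrix(p, q):
    """Rows of the P x Q matrix; entry (i, j) holds process number i + j*P."""
    if p < 1 or q < 1:
        raise ValueError("P and Q must be positive")
    return [[i + j * p for j in range(q)] for i in range(p)]


# -map 4x2: each row pairs a process with its counterpart on the other node.
for row in map_matrix(4, 2):
    print(row)
```

For -map 4x2, the rows are [0, 4], [1, 5], [2, 6], and [3, 7].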
The -include option specifies the list of additional benchmarks to run. For example, to add PingPongSpecificSource and PingPingSpecificSource benchmarks, call:
mpirun -np 2 IMB-MPI1 -include PingPongSpecificSource PingPingSpecificSource
The -exclude option specifies the list of benchmarks to be excluded from the run. For example, to exclude Alltoall and Allgather, call:
mpirun -np 2 IMB-MPI1 -exclude Alltoall Allgather
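The combined effect of -include and -exclude can be pictured as set arithmetic on benchmark names: the default set, plus the included names, minus the excluded names. The sketch below is an interpretation of the examples in this guide, not the benchmark's own logic. It compares names case-insensitively, and the short defaults list in the usage line is a hypothetical stand-in for the real default set; the name resolve_selection is also hypothetical.

```python
def resolve_selection(defaults, include=(), exclude=()):
    """Return defaults plus included names, minus excluded names."""
    excluded = {name.lower() for name in exclude}
    selected = []
    seen = set()
    for name in list(defaults) + list(include):
        key = name.lower()  # benchmark names are matched case-insensitively
        if key in excluded or key in seen:
            continue
        seen.add(key)
        selected.append(name)
    return selected


# Hypothetical, shortened default list used only for illustration.
defaults = ["PingPong", "Alltoall", "Allgather"]
print(resolve_selection(defaults,
                        include=["PingPongSpecificSource"],
                        exclude=["alltoall", "ALLGATHER"]))
```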
The -msglog option allows you to control the lengths of the transfer messages. This setting overrides the MINMSGLOG and MAXMSGLOG values. The new message sizes are 0, 2^minlog, ..., 2^maxlog.
For example, if you run the following command line:
mpirun -np 2 IMB-MPI1 -msglog 3:7 PingPong
Intel® MPI Benchmarks selects the lengths 0, 8, 16, 32, 64, 128, as shown below:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[μsec]   Mbytes/sec
            0         1000         0.70         0.00
            8         1000         0.73        10.46
           16         1000         0.74        20.65
           32         1000         0.94        32.61
           64         1000         0.94        65.14
          128         1000         1.06       115.16
Alternatively, you can specify only the maxlog value:
mpirun -np 2 IMB-MPI1 -msglog 3 PingPong
In this case, Intel® MPI Benchmarks selects the lengths 0, 1, 2, 4, 8:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[μsec]   Mbytes/sec
            0         1000         0.69         0.00
            1         1000         0.72         1.33
            2         1000         0.71         2.69
            4         1000         0.72         5.28
            8         1000         0.73        10.47
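The message lengths selected by -msglog follow directly from the rule 0, 2^minlog, ..., 2^maxlog. The short sketch below reproduces them; treating an omitted minlog as 0 is inferred from the -msglog 3 example, and the function name msglog_lengths is hypothetical.

```python
def msglog_lengths(arg):
    """Return the message lengths for a -msglog argument such as '3:7' or '3'."""
    minlog_text, separator, maxlog_text = arg.rpartition(":")
    # Without a "<minlog>:" prefix, the sequence starts at 2^0.
    minlog = int(minlog_text) if separator else 0
    maxlog = int(maxlog_text)
    if minlog < 0 or minlog > maxlog:
        raise ValueError("expected 0 <= minlog <= maxlog")
    return [0] + [2 ** k for k in range(minlog, maxlog + 1)]


print(msglog_lengths("3:7"))
print(msglog_lengths("3"))
```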
The -thread_level option specifies the desired thread level for MPI_Init_thread(). See the description of MPI_Init_thread() for details. The option is available only if the Intel® MPI Benchmarks are built with the USE_MPI_INIT_THREAD macro defined. Possible values for <level> are single, funneled, serialized, and multiple.
This option is relevant only for benchmarks measuring collective operations. It controls whether all ranks are synchronized after every iteration step by means of the MPI_Barrier operation. The -sync option can take the following arguments:
Argument | Description |
---|---|
0, off, disable, no | Disables process synchronization at each iteration step. This is the default value. |
1, on, enable, yes | Enables process synchronization at each iteration step. |
This option is relevant only for benchmarks measuring collective operations that utilize the root concept (for example, MPI_Bcast, MPI_Reduce, and MPI_Gather). It defines whether the root is changed at every iteration step or not. The -root_shift option can take the following arguments:
Argument | Description |
---|---|
0, off, disable, no | Disables root change at each iteration step. Rank 0 acts as a root at each iteration step. |
1, on, enable, yes | Enables root change at each iteration step. Root rank is changed in a round-robin fashion. This is the default value. |
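The round-robin root selection can be pictured with a one-line rule. The sketch assumes that the rotation starts at rank 0 and advances by one rank per iteration; the guide only states that the root changes in a round-robin fashion, so both details are assumptions, as is the name root_for_iteration.

```python
def root_for_iteration(iteration, num_ranks, root_shift=True):
    """Root rank used at a given iteration step of a rooted collective."""
    if num_ranks < 1:
        raise ValueError("num_ranks must be positive")
    # With root shifting disabled, rank 0 is the root at every step.
    return iteration % num_ranks if root_shift else 0


print([root_for_iteration(i, 4) for i in range(6)])
print([root_for_iteration(i, 4, root_shift=False) for i in range(6)])
```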