HiPERiSM's Technical Reports

HiPERiSM - High Performance Algorism Consulting

HCTR-2011-5: Benchmarks with three compilers on AMD processors (2011)

 

BENCHMARKS WITH THREE COMPILERS ON AMD PROCESSORS (2011)

George Delic

HiPERiSM Consulting, LLC.

 

1.0  CHOICE OF BENCHMARK

1.1 The Stommel Ocean Model

HiPERiSM has used the Stommel Ocean Model (SOM) as a simple case study in training courses across various HPC platforms, and it is useful as a test bed for new architectures. It was described in a previous report (HCTR-2001-3). For this benchmark the problem size sets the number of interior grid points at N=60,000 for a Cartesian grid of 60,000 x 60,000, with a total memory image in excess of 80 Gbytes. This domain is divided into horizontal slabs, with each slab distributed to a separate MPI process. In the hybrid OpenMP+MPI version of SOM used here, each horizontal slab is further subdivided into thread-parallel chunks by an OpenMP work-scheduling algorithm. The chunk size depends on the product of the number of MPI processes and the number of OpenMP threads, but the parallel work-scheduling algorithm remains the same.
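The slab-and-chunk decomposition described above can be sketched as follows. This is a minimal illustration, not the actual SOM source: the even-split rule and the names `partition`, `n_mpi`, and `n_omp` are assumptions made for the example.

```python
def partition(n, parts, i):
    """Half-open range [lo, hi) of rows owned by part i when n rows
    are split as evenly as possible among `parts` parts."""
    base, rem = divmod(n, parts)
    lo = i * base + min(i, rem)
    hi = lo + base + (1 if i < rem else 0)
    return lo, hi

N = 60_000            # interior grid rows in this benchmark
n_mpi, n_omp = 4, 12  # e.g. 4 MPI processes x 12 OpenMP threads = 48 cores

# Each MPI process owns a horizontal slab of the grid ...
for rank in range(n_mpi):
    slab_lo, slab_hi = partition(N, n_mpi, rank)
    # ... and each OpenMP thread works on a chunk of that slab.
    for tid in range(n_omp):
        c_lo, c_hi = partition(slab_hi - slab_lo, n_omp, tid)
        chunk = (slab_lo + c_lo, slab_lo + c_hi)
```

Under this even split the chunk size depends only on the product of the two counts (here 60,000 / 48 = 1,250 rows per chunk), which is why the product, rather than the individual process and thread counts, fixes the chunk size.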

1.2 Hardware test bed

The hardware platform for this benchmark exercise is a 4-processor (4P) node with Advanced Micro Devices (AMD) 6176SE 12-core CPUs, as described in Table 1.1 of a preceding report (HCTR-2011-1). Of interest here is the multi-core performance of CPUs on a single motherboard sharing a bus architecture.

2.0  COMPILING THE BENCHMARK WITH AUTO PARALLELISM

To compile the hybrid OpenMP + MPI SOM model, three compilers were used in their respective newest (2011) releases: Absoft (11.1), Intel (12.0), and Portland (11.1). This report is an update to the preceding report, HCTR-2011-2, which used the previous (2010) releases of these three compilers. All compilations used the highest level of optimization available for this host, with double precision arithmetic in each case. However, in this report (as opposed to HCTR-2011-3) automatic parallelism (or concurrency) options were also enabled in addition to the OpenMP directives. The reason is that some of these compilers now enable this option at the highest optimization level (e.g. Absoft 11.1 with -O5). While the effect may be only a small incremental performance improvement for this benchmark, it equalizes the comparison between compilers on this multi-core platform. For all three compilers the MPICH mpirun command was used with the -all-local switch to contain executions on-node.

3.0  BENCHMARK RESULTS

3.1 Wall clock times

Wall clock times for the Absoft, Intel and Portland compilers are shown in Tables 3.1 - 3.3, respectively. The three compilers differ in performance and, in general, the best times are for the Portland compiler. Figs. 3.1 and 3.2 show the ratio of the wall clock times of the other two compilers to the corresponding Portland results. When the 2011 results are compared with the corresponding figures in HCTR-2011-2, which used the previous (2010) compiler releases, important differences emerge. For example, in Fig. 3.1 the Absoft 11.1 compiler shows a cluster of 12 points (out of 42) below a ratio of 1.0, in contrast to the Intel 12.0 versus Portland 11.1 result in Fig. 3.2, which shows only 4 points below 1.0. For the Absoft compiler it is unusual to see such a step up in performance for an incremental change in the release number.

Table 3.1. Absoft 11.1 compiler wall clock time in seconds with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows the MPI process count; blank cells are combinations whose product exceeds the 48 cores of the node.

MPI       1       2       4       6       8      10      12      24      48
  1  7594.2  4290.4  2324.3  1703.1  1375.2  1203.6  1251.8   993.8   952.7
  2  3806.6  2310.5  1377.2  1203.9  1076.5   974.5   949.3   882.7
  4  2284.5  1332.9  1091.3   919.6   867.9   842.6   865.6
  6  1618.1  1187.1   927.0   871.6   829.9
  8  1468.3  1029.6   888.8   853.9
 10  1315.2   988.2   813.9
 12  2077.5  1230.4  1007.5
 24   968.8   871.0
 48   817.0

Table 3.2. Intel 12.0 compiler wall clock time in seconds with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows the MPI process count; blank cells are combinations whose product exceeds the 48 cores of the node.

MPI       1       2       4       6       8      10      12      24      48
  1  5797.3  3611.7  2494.9  1512.4  1290.1  1325.0  1255.4  1315.3  1292.8
  2  2754.4  1724.2  1307.8  1280.1  1055.0  1218.2  1188.1  1374.7
  4  2135.9  1292.1   945.4  1008.5   915.4   936.4   890.6
  6  1434.8  1244.2   970.0  1158.7  1056.6
  8  1358.5  1023.1   903.4   876.1
 10  1178.5   942.2   885.1
 12  1285.7   978.8   850.2
 24   935.8   931.9
 48   895.6

Table 3.3. Portland 11.1 compiler wall clock time in seconds with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows the MPI process count; blank cells are combinations whose product exceeds the 48 cores of the node.

MPI       1       2       4       6       8      10      12      24      48
  1  4919.4  2951.5  1811.3  1292.2  1157.3  1092.9  1064.7  1108.4  1350.1
  2  2253.7  1634.0  1123.2   990.3   952.3  1042.0  1051.4  1259.3
  4  1624.2   965.6   858.7  1039.7   912.6   864.1   862.6
  6  1261.6   894.3   823.3   762.8   750.7
  8  1171.8   790.5   909.0   894.3
 10   986.6   826.3   833.7
 12   921.3   791.3   760.4
 24   776.0   742.8
 48   711.7

Fig 3.1. The ordinate shows the ratio of wall clock time of the Absoft 11.1 versus the Portland 11.1 compiler with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. The horizontal axis is the OpenMP thread count and the legend shows the number of MPI processes. The number of cores used is the product of the two values.

Fig 3.2. The ordinate shows the ratio of wall clock time of Intel 12.0 versus Portland 11.1 compiler with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. The horizontal axis is the log of the OpenMP thread count and the legend shows the number of MPI processes. The number of cores used is the product of the two values.

3.2 Scaling with thread count

Scaling by OpenMP thread count, at a fixed number of MPI processes, for the Absoft, Intel and Portland compilers is shown in Tables 3.4 - 3.6, respectively. All three compilers scale poorly when the number of MPI processes is 4 or larger; at 1 or 2 MPI processes, scaling is best for the Absoft 11.1 compiler.
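The scaling values in Tables 3.4 - 3.6 are the usual parallel speedup at a fixed MPI process count, S(t) = T(1 thread) / T(t threads). As a check, the MPI=1 row of Table 3.4 can be reproduced from the Absoft wall clock times in Table 3.1:

```python
# Speedup by OpenMP thread count at a fixed MPI process count:
# S(t) = T(1 thread) / T(t threads).
# Wall clock times (seconds) are the MPI=1 row of Table 3.1 (Absoft 11.1).
threads = [1, 2, 4, 6, 8, 10, 12, 24, 48]
wall = [7594.2, 4290.4, 2324.3, 1703.1, 1375.2, 1203.6, 1251.8, 993.8, 952.7]

speedup = [round(wall[0] / t, 2) for t in wall]
# Matches the MPI=1 row of Table 3.4:
# [1.0, 1.77, 3.27, 4.46, 5.52, 6.31, 6.07, 7.64, 7.97]
```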

Table 3.4. Absoft 11.1 compiler scaling by OpenMP thread count for a fixed number of MPI processes with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows the MPI process count.

MPI       1       2       4       6       8      10      12      24      48
  1    1.00    1.77    3.27    4.46    5.52    6.31    6.07    7.64    7.97
  2    1.00    1.65    2.76    3.16    3.54    3.91    4.01    4.31
  4    1.00    1.71    2.09    2.48    2.63    2.71    2.64
  6    1.00    1.36    1.75    1.86    1.95
  8    1.00    1.43    1.65    1.72
 10    1.00    1.33    1.62
 12    1.00    1.69    2.06
 24    1.00    1.11
 48    1.00

Table 3.5. Intel 12.0 compiler scaling by OpenMP thread count for a fixed number of MPI processes with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows the MPI process count.

MPI       1       2       4       6       8      10      12      24      48
  1    1.00    1.61    2.32    3.83    4.49    4.38    4.62    4.41    4.48
  2    1.00    1.60    1.10    2.15    2.61    2.26    2.32    2.00
  4    1.00    1.65    2.26    2.12    2.33    2.28    2.40
  6    1.00    1.15    1.48    1.24    1.36
  8    1.00    1.33    1.50    1.55
 10    1.00    1.25    1.33
 12    1.00    1.31    1.51
 24    1.00    1.00
 48    1.00

Table 3.6. Portland 11.1 compiler scaling by OpenMP thread count for a fixed number of MPI processes with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows the MPI process count.

MPI       1       2       4       6       8      10      12      24      48
  1    1.00    1.67    2.72    3.81    4.25    4.50    4.62    4.44    3.64
  2    1.00    1.38    1.24    2.28    2.37    2.16    2.14    1.79
  4    1.00    1.68    1.89    1.56    1.78    1.88    1.88
  6    1.00    1.41    1.53    1.65    1.68
  8    1.00    1.48    1.29    1.31
 10    1.00    1.19    1.18
 12    1.00    1.16    1.21
 24    1.00    1.04
 48    1.00

3.3 Scaling with MPI process count

Scaling by MPI process count, at a fixed number of OpenMP threads, for the Absoft, Intel and Portland compilers is shown in Tables 3.7 - 3.9, respectively. All three compilers scale poorly when the number of threads is 4 or larger; at 1 or 2 threads, scaling is best for the Absoft 11.1 compiler.

Table 3.7. Absoft 11.1 compiler scaling by MPI process count for a fixed number of OpenMP threads with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows the MPI process count.

MPI       1       2       4       6       8      10      12      24      48
  1    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
  2    2.00    1.86    1.69    1.41    1.28    1.24    1.32    1.13
  4    3.32    3.22    2.13    1.85    1.58    1.43    1.45
  6    4.69    3.61    2.51    1.95    1.66
  8    5.17    4.17    2.62    1.99
 10    5.77    4.34    2.86
 12    3.66    3.49    2.31
 24    7.84    4.93
 48    9.30

Table 3.8. Intel 12.0 compiler scaling by MPI process count for a fixed number of OpenMP threads with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows the MPI process count.

MPI       1       2       4       6       8      10      12      24      48
  1    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
  2    2.10    2.09    1.00    1.18    1.22    1.09    1.06    0.96
  4    2.71    2.80    2.64    1.50    1.41    1.41    1.41
  6    4.04    2.90    2.57    1.31    1.22
  8    4.27    3.53    2.76    1.73
 10    4.92    3.83    2.82
 12    4.51    3.69    2.93
 24    6.20    3.88
 48    6.47

Table 3.9. Portland 11.1 compiler scaling by MPI process count for a fixed number of OpenMP threads with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows the MPI process count.

MPI       1       2       4       6       8      10      12      24      48
  1    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00
  2    2.18    1.81    1.00    1.30    1.22    1.05    1.01    0.88
  4    3.03    3.06    2.11    1.24    1.27    1.26    1.23
  6    3.90    3.30    2.20    1.69    1.54
  8    4.20    3.73    1.99    1.44
 10    4.99    3.57    2.17
 12    5.34    3.73    2.38
 24    6.34    3.97
 48    6.91

3.4 Results for fixed chunk size and core count

The results above were for multiple combinations of MPI processes and OpenMP threads ranging from 1 to 48. This section selects the combinations where the product of the two counts is exactly 48, for example, 12 MPI processes and 4 OpenMP threads, or 4 MPI processes and 12 OpenMP threads. A further reason for this selection is that the parallel chunk size per thread is constant for all such combinations, which equalizes one variable affecting memory usage when comparing the three compilers. For this selection, Fig. 3.3 shows the wall clock times extracted along the diagonals of Tables 3.1-3.3, and Fig. 3.4 shows the corresponding ratios of these times to the Portland result. From Fig. 3.3 the Absoft 11.1 compiler reports the shortest wall clock times for 1, 2, 4, and 8 MPI processes, and Fig. 3.4 confirms this superior performance. A direct comparison with Fig. 3.4 in HCTR-2011-2 demonstrates the dramatic gains of the Absoft 11.1 release over the 11.0 release with 1-4 MPI processes.
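The selection rule of this section is small enough to state in code. The sketch below (names are illustrative) enumerates the (MPI, OpenMP) combinations whose product is exactly 48 and checks that the chunk size per thread is the same for all of them:

```python
# (MPI processes, OpenMP threads) combinations with product exactly 48,
# i.e. the diagonal entries of Tables 3.1-3.3 used in Figs. 3.3 and 3.4.
# (A count of 10 is absent because it does not divide 48.)
N = 60_000
mpi_counts = [1, 2, 4, 6, 8, 12, 24, 48]
combos = [(m, 48 // m) for m in mpi_counts]

# The parallel chunk size (interior rows per thread) is constant over the
# selection, removing one memory-usage variable from the comparison.
chunk_rows = {(m, t): N // (m * t) for m, t in combos}
assert all(rows == 1250 for rows in chunk_rows.values())
```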

Fig 3.3. Wall clock time of three compilers with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node in OpenMP+MPI hybrid mode such that the product of the number of MPI processes and OpenMP threads is 48. The horizontal axis shows the number of MPI processes.

Fig 3.4. Ratio of wall clock time of the Absoft 11.1 and Intel 12.0 compilers to the Portland 11.1 result with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node in OpenMP+MPI hybrid mode such that the product of the number of MPI processes and OpenMP threads is 48. The horizontal axis shows the number of MPI processes.

4.0 ANALYSIS OF RESULTS

Exploratory benchmarks comparing three compilers on a simple hybrid model with a regular data structure showed the smallest wall clock times for the Portland compiler over a broad parameter range of the parallel hybrid MPI+OpenMP SOM model, with some important exceptions. Relative to the corresponding Portland results, the Absoft compiler has the smallest wall clock times whenever the number of MPI processes is less than 8 (with the exception of 6). As a function of thread count, the Absoft 11.1 compiler shows the shortest wall clock time in 28% of cases in a sample of 42 values. There are divergences between all three compilers, and possible causes are cache effects or thread/process data affinity issues. The latter relates to where data resides relative to the host core for each thread or process. Two compilers (Intel and Portland) offer options to bind threads to cores, but such options were not used here, and all scheduling was left to the runtime libraries of the respective compilers and the operating system.

For scaling with increasing MPI process or OpenMP thread count, all three compilers showed acceptable results when these counts were less than or equal to 4. Outside this range scaling results were poor. This could be an artifact of insufficient arithmetic work inside the corresponding (smaller) parallel chunks, since parallel granularity becomes finer with increasing process and thread count.

5.0 CONCLUSIONS

Exploratory benchmark measurements on a 48-core AMD node confirm that all three compilers deliver improved performance in their latest releases compared with the previous ones. Performance at higher core counts was limited by the finer parallel granularity of the benchmark model. For wall clock time the Absoft 11.1 compiler shows the largest performance gains in this roundup. Actual performance of commodity solutions in real-world applications will vary, and results for specific Air Quality Models (AQM) are the subject of subsequent reports.


HiPERiSM Consulting, LLC, (919) 484-9803 (Voice)

(919) 806-2813 (Facsimile)