HiPERiSM Consulting, LLC: HCTR-2011-2

1.0 CHOICE OF BENCHMARK

1.1 The Stommel Ocean Model

HiPERiSM has used the Stommel Ocean Model (SOM) as a simple case study in training courses across various HPC platforms, and it is useful as a test bed for new architectures. It was described in a previous report (HCTR-2001-3). For this benchmark the problem size sets the number of interior grid points per dimension at N=60,000, giving a Cartesian grid of 60,000 x 60,000 with a total memory image in excess of 80 Gbytes. This domain is divided into horizontal slabs, with each slab distributed to a separate MPI process. In the hybrid OpenMP+MPI version of SOM used here, each horizontal slab is further subdivided into thread-parallel chunks by an OpenMP work scheduling algorithm. The chunk size depends on the product of the number of MPI processes and the number of OpenMP threads, but the parallel work scheduling algorithm remains the same.
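
The slab and chunk decomposition can be illustrated with a minimal Python sketch. It assumes that grid rows are divided as evenly as possible, first among MPI processes and then among the OpenMP threads of each process. The function names split_evenly and decompose are hypothetical, and the actual SOM scheduling code may differ.

```python
def split_evenly(n_items, n_parts):
    """Split n_items into n_parts contiguous ranges whose sizes differ by at most 1."""
    base, extra = divmod(n_items, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        # The first `extra` parts receive one additional item.
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def decompose(n_rows, n_mpi, n_omp):
    """Return {rank: [(row_start, row_end), ...]} with one row range per thread."""
    layout = {}
    for rank, (lo, hi) in enumerate(split_evenly(n_rows, n_mpi)):
        # Subdivide this rank's slab into one chunk per OpenMP thread.
        chunks = split_evenly(hi - lo, n_omp)
        layout[rank] = [(lo + a, lo + b) for a, b in chunks]
    return layout


N = 60_000  # grid rows, from the problem size in this report
for n_mpi, n_omp in [(4, 12), (12, 4), (2, 4)]:
    first_lo, first_hi = decompose(N, n_mpi, n_omp)[0][0]
    print(f"{n_mpi} MPI x {n_omp} OpenMP: {first_hi - first_lo} rows per chunk")
```

Under this assumption the chunk size depends only on the product of the two counts, so 4 x 12 and 12 x 4 give the same number of rows per chunk.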

1.2 Hardware test bed

The hardware platform for this benchmark exercise is a 4-processor (4P) node with Advanced Micro Devices (AMD) 6176SE 12 core CPUs, as described in Table 1.1 of a preceding report (HCTR-2011-1). The interest here is in comparing multi-core performance for CPUs on a single motherboard sharing a bus architecture.

2.0 COMPILING THE BENCHMARK

Three compilers were used to compile the hybrid OpenMP+MPI SOM model: Absoft (11.0), Intel (11.0) and Portland (10.6). All compilations used the highest level of optimization available for this host, and each used double precision arithmetic. For all three compilers the MPICH mpirun command was used with the -all-local switch to confine execution to the node.

3.0 BENCHMARK RESULTS

3.1 Wall clock times

Wall clock times for the Absoft, Intel and Portland compilers are shown in Tables 3.1 - 3.3, respectively. The three compilers differ in wall clock time and, in general, the best times are for the Portland compiler. Therefore, Figs. 3.1 and 3.2 show the ratio of the wall clock times for the other two compilers to the corresponding Portland results.
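
The ratio plotted in the figures can be reproduced from the tabulated times. The following Python sketch uses a few cells copied from Tables 3.1 - 3.3. The dictionary layout and the function name ratio_to_reference are illustrative assumptions, not part of the benchmark code.

```python
# Wall clock times in seconds, keyed by (MPI processes, OpenMP threads).
# Only a few cells from Tables 3.1 - 3.3 are included here.
absoft = {(1, 1): 7167.5, (4, 12): 1085.2, (12, 4): 1247.5, (48, 1): 831.0}
intel = {(1, 1): 6006.0, (4, 12): 1156.7, (12, 4): 965.4, (48, 1): 844.0}
portland = {(1, 1): 4874.0, (4, 12): 1021.9, (12, 4): 769.3, (48, 1): 746.5}


def ratio_to_reference(times, reference):
    """Divide each time by the reference time for the same configuration."""
    return {key: times[key] / reference[key] for key in times if key in reference}


absoft_ratio = ratio_to_reference(absoft, portland)
intel_ratio = ratio_to_reference(intel, portland)

for key in sorted(portland):
    print(f"MPI={key[0]:2d} OMP={key[1]:2d}  "
          f"Absoft/Portland={absoft_ratio[key]:.2f}  "
          f"Intel/Portland={intel_ratio[key]:.2f}")
```

A ratio above 1 means the compiler was slower than Portland for that configuration.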

Table 3.1. Absoft compiler wall clock time in seconds with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. Each row is a fixed MPI process count (first column) and each column is an OpenMP thread count (header row).

Absoft	OMP
MPI	1	2	4	6	8	10	12	24	48
1	7167.5	4018.3	2148.9	1897.4	1415.9	1408.8	1363.7	1357.6	1467.1
2	3551.7	2307.5	1634.5	1339	1374.4	1355.2	1401.5	1416.2
4	1933.7	2045.5	1036.4	1055.8	1116	985.4	1085.2
6	2026.4	1228.4	941.2	1078.3	978.1
8	1627.4	1162.5	885.2	978.3
10	1459.5	1001.8	1056.5
12	2250.2	1404.5	1247.5
24	1007.2	881
48	831

Table 3.2. Intel compiler wall clock time in seconds with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. Each row is a fixed MPI process count (first column) and each column is an OpenMP thread count (header row).

Intel	OMP
MPI	1	2	4	6	8	10	12	24	48
1	6006	3186.8	2118.4	1483.7	1402.4	1207	1190.9	1441.1	1525.7
2	3136.2	2009.6	1231.3	1274.2	1270.8	1322.9	1317.4	1521.3
4	2403.2	1442.2	1168.7	1016.7	1102.8	1070.3	1156.7
6	1773.7	1081.5	1059.7	1025.1	1052.7
8	1513.3	1148.1	1093	1000.9
10	1332	1048.5	1045
12	1406.4	1099.8	965.4
24	1029.7	934.7
48	844

Table 3.3. Portland compiler wall clock time in seconds with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. Each row is a fixed MPI process count (first column) and each column is an OpenMP thread count (header row).

Portland	OMP
MPI	1	2	4	6	8	10	12	24	48
1	4874	2628.7	1515.4	1235.7	1155.4	1090.3	1183	1092	1158.6
2	2488.3	1600.4	1096.8	945.2	1049	1005.1	934.2	1351.7
4	1786.9	1034.7	803.8	942.7	952.8	843.4	1021.9
6	1308.8	996	788.7	836.8	838
8	1165.2	868.6	926.1	809.1
10	1119.6	913.4	982.8
12	912.8	744.7	769.3
24	765.4	787
48	746.5

Fig 3.1. The ordinate shows the ratio of wall clock time of Absoft versus Portland compiler with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. The horizontal axis is the OpenMP thread count and the legend shows the number of MPI processes. The number of cores used is the product of the two values.

Fig 3.2. The ordinate shows the ratio of wall clock time of Intel versus Portland compiler with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. The horizontal axis is the log of the OpenMP thread count and the legend shows the number of MPI processes. The number of cores used is the product of the two values.

3.2 Scaling with thread count

Scaling by OpenMP thread count, with a fixed number of MPI processes, is shown in Tables 3.4 - 3.6 for the Absoft, Intel and Portland compilers, respectively. Scaling here is the wall clock time at 1 OpenMP thread divided by the time at the given thread count, for the same number of MPI processes. All three compilers scale poorly when the number of MPI processes is 4 or larger, and the scaling at 2 MPI processes is uneven.
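
As a worked example, the Python sketch below computes the thread scaling for the Absoft compiler with 1 MPI process, using the first row of Table 3.1. The function name thread_scaling is a hypothetical helper.

```python
OMP_COUNTS = [1, 2, 4, 6, 8, 10, 12, 24, 48]

# First row of Table 3.1: Absoft, 1 MPI process, wall clock seconds.
absoft_mpi1 = [7167.5, 4018.3, 2148.9, 1897.4, 1415.9,
               1408.8, 1363.7, 1357.6, 1467.1]


def thread_scaling(times):
    """Time at 1 thread divided by the time at each thread count."""
    base = times[0]
    return [base / t for t in times]


scaling = thread_scaling(absoft_mpi1)
for n_omp, s in zip(OMP_COUNTS, scaling):
    print(f"{n_omp:2d} threads: scaling {s:.2f}")
```

Rounded to two decimals, these values agree with the first row of Table 3.4.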

Table 3.4. Absoft compiler scaling by OpenMP thread count for a fixed number of MPI processes with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. Each row is a fixed MPI process count (first column) and each column is an OpenMP thread count (header row).

Absoft	OMP
MPI	1	2	4	6	8	10	12	24	48
1	1	1.78	3.34	3.78	5.06	5.09	5.26	5.28	4.89
2	1	1.54	2.17	2.65	2.58	2.62	2.53	2.51
4	1	0.95	1.87	1.83	1.73	1.96	1.78
6	1	1.65	2.15	1.88	2.07
8	1	1.40	1.84	1.66
10	1	1.46	1.38
12	1	1.60	1.80
24	1	1.14
48	1

Table 3.5. Intel compiler scaling by OpenMP thread count for a fixed number of MPI processes with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. Each row is a fixed MPI process count (first column) and each column is an OpenMP thread count (header row).

Intel	OMP
MPI	1	2	4	6	8	10	12	24	48
1	1	1.88	2.84	4.05	4.28	4.98	5.04	4.17	3.94
2	1	1.56	1.48	2.46	2.47	2.37	2.38	2.06
4	1	1.67	2.06	2.36	2.18	2.25	2.08
6	1	1.64	1.67	1.73	1.68
8	1	1.32	1.38	1.51
10	1	1.27	1.27
12	1	1.28	1.46
24	1	1.10
48	1

Table 3.6. Portland compiler scaling by OpenMP thread count for a fixed number of MPI processes with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. Each row is a fixed MPI process count (first column) and each column is an OpenMP thread count (header row).

Portland	OMP
MPI	1	2	4	6	8	10	12	24	48
1	1	1.85	3.22	3.94	4.22	4.47	4.12	4.46	4.21
2	1	1.55	1.64	2.63	2.37	2.48	2.66	1.84
4	1	1.73	2.22	1.90	1.88	2.12	1.75
6	1	1.31	1.66	1.56	1.56
8	1	1.34	1.26	1.44
10	1	1.23	1.14
12	1	1.23	1.19
24	1	0.97
48	1

3.3 Scaling with MPI process count

Scaling by MPI process count, with a fixed number of OpenMP threads, is shown in Tables 3.7 - 3.9 for the Absoft, Intel and Portland compilers, respectively. Scaling here is the wall clock time at 1 MPI process divided by the time at the given process count, for the same number of OpenMP threads. All three compilers scale poorly when the number of threads is 4 or larger, and the scaling at 4 threads is uneven.
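
Because the tables are triangular, a column of fixed thread count has entries only for those MPI process counts that fit within 48 cores. The Python sketch below shows one way to extract such a column from the Portland times in Table 3.3 and compute the scaling. The data layout and the function name mpi_scaling are assumptions made for illustration.

```python
OMP_COUNTS = [1, 2, 4, 6, 8, 10, 12, 24, 48]

# Table 3.3 (Portland), wall clock seconds. Key: MPI process count.
# Each list runs over OMP_COUNTS and stops where the table row stops.
portland_rows = {
    1: [4874, 2628.7, 1515.4, 1235.7, 1155.4, 1090.3, 1183, 1092, 1158.6],
    2: [2488.3, 1600.4, 1096.8, 945.2, 1049, 1005.1, 934.2, 1351.7],
    4: [1786.9, 1034.7, 803.8, 942.7, 952.8, 843.4, 1021.9],
    6: [1308.8, 996, 788.7, 836.8, 838],
    8: [1165.2, 868.6, 926.1, 809.1],
    10: [1119.6, 913.4, 982.8],
    12: [912.8, 744.7, 769.3],
    24: [765.4, 787],
    48: [746.5],
}


def mpi_scaling(rows, n_omp):
    """Scaling by MPI process count for one fixed OpenMP thread count."""
    col = OMP_COUNTS.index(n_omp)
    # Keep only the rows that have an entry in this column.
    column = {n_mpi: times[col] for n_mpi, times in rows.items()
              if col < len(times)}
    base = column[1]  # time with 1 MPI process
    return {n_mpi: base / t for n_mpi, t in column.items()}


for n_mpi, s in mpi_scaling(portland_rows, 1).items():
    print(f"{n_mpi:2d} MPI processes, 1 thread: scaling {s:.2f}")
```

For 1 OpenMP thread the rounded results agree with the first column of Table 3.9.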

Table 3.7. Absoft compiler scaling by MPI process count for a fixed number of OpenMP threads with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. Each row is a fixed MPI process count (first column) and each column is an OpenMP thread count (header row).

Absoft	OMP
MPI	1	2	4	6	8	10	12	24	48
1	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
2	2.02	1.74	1.31	1.42	1.03	1.04	0.97	0.96
4	3.71	1.96	2.07	1.80	1.27	1.43	1.26
6	3.54	3.27	2.28	1.76	1.45
8	4.40	3.46	2.43	1.94
10	4.91	4.01	2.03
12	3.19	2.86	1.72
24	7.12	4.56
48	8.63

Table 3.8. Intel compiler scaling by MPI process count for a fixed number of OpenMP threads with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. Each row is a fixed MPI process count (first column) and each column is an OpenMP thread count (header row).

Intel	OMP
MPI	1	2	4	6	8	10	12	24	48
1	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
2	1.92	1.59	1.00	1.16	1.10	0.91	0.90	0.95
4	2.50	2.21	1.81	1.46	1.27	1.13	1.03
6	3.39	2.95	2.00	1.45	1.33
8	3.97	2.78	1.94	1.48
10	4.51	3.04	2.03
12	4.27	2.90	2.19
24	5.83	3.41
48	7.12

Table 3.9. Portland compiler scaling by MPI process count for a fixed number of OpenMP threads with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. Each row is a fixed MPI process count (first column) and each column is an OpenMP thread count (header row).

Portland	OMP
MPI	1	2	4	6	8	10	12	24	48
1	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
2	1.96	1.64	1.00	1.31	1.10	1.08	1.27	0.81
4	2.73	2.54	1.89	1.31	1.21	1.29	1.16
6	3.72	2.64	1.92	1.48	1.38
8	4.18	3.03	1.64	1.53
10	4.35	2.88	1.54
12	5.34	3.53	1.97
24	6.37	3.34
48	6.53

3.4 Results for fixed chunk size and core count

The results above were for multiple combinations of MPI processes and OpenMP threads, each ranging from 1 to 48. This section shows results for those combinations where the product of the two counts is exactly 48, for example 12 MPI processes and 4 OpenMP threads, or 4 MPI processes and 12 OpenMP threads. One reason for this selection is that all 48 cores of the node are in use. The other is that the parallel chunk size per thread is constant for all such combinations, which equalizes one variable affecting memory usage when comparing the three compilers. For this selection, Fig. 3.3 shows the wall clock times extracted along the diagonal of Tables 3.1-3.3 (the entries where the product is 48), whereas Fig. 3.4 shows the corresponding ratios of these times to the Portland result.
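
The selection can be sketched in Python as follows. The diagonal times are copied from Tables 3.1 and 3.3. The rows-per-thread figure assumes that chunks are formed by dividing the N grid rows evenly over all threads, which is an assumption about the scheduling rather than a detail stated in this report.

```python
COUNTS = [1, 2, 4, 6, 8, 10, 12, 24, 48]  # counts used for both MPI and OpenMP
TOTAL_CORES = 48
N = 60_000

# All (MPI processes, OpenMP threads) pairs that use exactly 48 cores.
pairs = [(m, t) for m in COUNTS for t in COUNTS if m * t == TOTAL_CORES]

# Assumed chunk size: grid rows divided evenly over all threads.
rows_per_thread = {pair: N // (pair[0] * pair[1]) for pair in pairs}

# Diagonal wall clock times in seconds, keyed by MPI process count.
absoft_diag = {1: 1467.1, 2: 1416.2, 4: 1085.2, 6: 978.1,
               8: 978.3, 12: 1247.5, 24: 881.0, 48: 831.0}
portland_diag = {1: 1158.6, 2: 1351.7, 4: 1021.9, 6: 838.0,
                 8: 809.1, 12: 769.3, 24: 787.0, 48: 746.5}

absoft_vs_portland = {m: absoft_diag[m] / portland_diag[m] for m, _ in pairs}

for m, t in pairs:
    print(f"{m:2d} MPI x {t:2d} OpenMP: {rows_per_thread[(m, t)]} rows/thread, "
          f"Absoft/Portland = {absoft_vs_portland[m]:.2f}")
```

Note that 10 has no partner in the list of counts that gives a product of 48, so the row for 10 MPI processes does not contribute to this selection.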

Fig 3.3. Wall clock time of three compilers with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node in OpenMP+MPI hybrid mode such that the product for the number of MPI processes and OpenMP threads is 48. The horizontal axis shows the number of MPI processes.

Fig 3.4. Ratio of wall clock time of Absoft and Intel compilers to the Portland result with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node in OpenMP+MPI hybrid mode such that the product for the number of MPI processes and OpenMP threads is 48. The horizontal axis shows the number of MPI processes.

4.0 ANALYSIS OF RESULTS

Exploratory benchmarks comparing three compilers on a simple hybrid model with a regular data structure showed the smallest wall clock times for the Portland compiler over a broad parameter range of a parallel hybrid MPI+OpenMP SOM model. Relative to the corresponding Portland results, the variability in wall clock times was largest for the Absoft compiler when the number of MPI processes was less than 8, whereas the variability of Absoft and Intel wall clock times was similar for more than 8 MPI processes. The greatest divergences occur at thread counts of 1, 2, 4 and 12, and for MPI process counts of 1 and 12. Possible causes are cache effects or thread/process data affinity issues. The latter relate to where data resides relative to the host core for each thread or process. While it is possible to schedule MPI processes to specific (numbered) cores with the mpiexec command in MPI-2, no such effort was made here, and all scheduling was left to the runtime libraries of the respective compilers and the operating system.
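
To illustrate why affinity could matter, the following Python sketch maps threads to cores under a purely hypothetical compact placement. It assumes cores are numbered contiguously with 12 per socket, which may not match the actual core numbering on this node. No such placement was used in the runs reported here, and the function names are illustrative.

```python
CORES_PER_SOCKET = 12  # AMD 6176SE, 12 cores per CPU
N_SOCKETS = 4          # 4P node


def compact_placement(n_mpi, n_omp):
    """Hypothetical map (rank, thread) -> core id, filling cores in order."""
    if n_mpi * n_omp > CORES_PER_SOCKET * N_SOCKETS:
        raise ValueError("more threads than cores")
    return {(r, t): r * n_omp + t
            for r in range(n_mpi) for t in range(n_omp)}


def sockets_per_rank(n_mpi, n_omp):
    """Return {rank: sorted list of sockets hosting that rank's threads}."""
    result = {}
    for (rank, _thread), core in compact_placement(n_mpi, n_omp).items():
        result.setdefault(rank, set()).add(core // CORES_PER_SOCKET)
    return {rank: sorted(sockets) for rank, sockets in result.items()}


for n_mpi, n_omp in [(4, 12), (2, 24), (6, 8)]:
    print(n_mpi, n_omp, sockets_per_rank(n_mpi, n_omp))
```

Under this assumed numbering, 4 MPI processes with 12 threads each keep every process on one socket, whereas 2 processes with 24 threads each, or 6 processes with 8 threads each, have some processes whose threads span two sockets.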

For scaling with increasing MPI process or OpenMP thread count, all three compilers showed good results when these counts were less than or equal to 4. Outside this range scaling results were poor. This could be an artifact of insufficient arithmetic work inside the corresponding (smaller) parallel chunks, since parallel granularity becomes finer with increasing core count.

5.0 CONCLUSIONS

Exploratory benchmark measurements on a 48 core AMD node confirm that all three compilers deliver good scaling performance at low core counts. Performance at higher core counts was limited by a finer parallel granularity in the benchmark model. For wall clock time the Portland compiler is the best performer in this roundup. However, the Intel and Absoft compiler timing results were close with the exception of the case with 12 MPI processes. Actual performance of commodity solutions in real-world applications will vary and results for specific Air Quality Models (AQM) are the subject of subsequent reports.