HiPERiSM - High Performance Algorism Consulting

HCTR-2011-2: Benchmarks with three compilers on AMD processors (2010)

 

BENCHMARKS WITH THREE COMPILERS ON AMD PROCESSORS (2010)

George Delic

HiPERiSM Consulting, LLC.

 

1.0  CHOICE OF BENCHMARK

1.1 The Stommel Ocean Model

HiPERiSM has used the Stommel Ocean Model (SOM) as a simple case study in training courses across various HPC platforms, and it is useful as a test bed for new architectures. It has been described in a previous report (HCTR-2001-3). For this benchmark the problem size sets the number of interior grid points at N=60,000, giving a Cartesian grid of 60,000 x 60,000 with a total memory image in excess of 80 Gbytes. This domain is divided into horizontal slabs, with each slab assigned to a separate MPI process. In the hybrid OpenMP+MPI version of SOM used here, each horizontal slab is further subdivided into thread-parallel chunks by an OpenMP work scheduling algorithm. The chunk size depends on the product of the number of MPI processes and the number of OpenMP threads, but the parallel work scheduling algorithm remains the same.
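The two-level decomposition described above can be illustrated with a minimal Python sketch. It assumes the N rows are split as evenly as possible, first into one contiguous slab per MPI process and then into one contiguous chunk per OpenMP thread. The function names and the even-split rule are illustrative assumptions and are not taken from the SOM source.

```python
def split_range(start, stop, parts):
    """Split the half-open row range [start, stop) into `parts` contiguous pieces.

    Assumed rule: piece sizes differ by at most one row, larger pieces first.
    """
    base, extra = divmod(stop - start, parts)
    bounds = []
    lo = start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds


def decompose(n_rows, n_mpi, n_omp):
    """Return {(rank, thread): (first_row, last_row_exclusive)} for a hybrid run."""
    layout = {}
    # First level: one slab per MPI process.
    for rank, (s_lo, s_hi) in enumerate(split_range(0, n_rows, n_mpi)):
        # Second level: one chunk per OpenMP thread inside the slab.
        for thread, chunk in enumerate(split_range(s_lo, s_hi, n_omp)):
            layout[(rank, thread)] = chunk
    return layout


if __name__ == "__main__":
    N = 60_000  # interior grid points per side, as in this benchmark
    for n_mpi, n_omp in [(1, 1), (4, 12), (12, 4), (48, 1)]:
        lo, hi = decompose(N, n_mpi, n_omp)[(0, 0)]
        print(f"{n_mpi:2d} MPI x {n_omp:2d} OMP: {hi - lo} rows per chunk")
```

Under this assumed split, any combination whose product is 48 gives the same chunk of 1250 rows, while the chunk grows as fewer cores are used.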

1.2 Hardware test bed

The hardware platform for this benchmark exercise is a 4-processor (4P) node with Advanced Micro Devices (AMD) 6176SE 12-core CPUs, as described in Table 1.1 of a preceding report (HCTR-2011-1). The interest here is in multi-core performance when the CPUs share a bus architecture on a single motherboard.

2.0  COMPILING THE BENCHMARK

Three compilers were used to compile the hybrid OpenMP+MPI SOM model: Absoft (11.0), Intel (11.0) and Portland (10.6). All compilations used the highest level of optimization available for this host, and each used double precision arithmetic. For all three compilers the MPICH mpirun command was used with the -all-local switch to keep execution on-node.

3.0  BENCHMARK RESULTS

3.1 Wall clock times

Wall clock times for the Absoft, Intel and Portland compilers are shown in Tables 3.1 - 3.3, respectively. The three compilers differ in wall clock time and, in general, the best times are for the Portland compiler. Therefore, Figs. 3.1 and 3.2 show the ratio of the wall clock times of the other two compilers to the corresponding Portland results.

Table 3.1. Absoft compiler wall clock time in seconds with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows give the MPI process count.

Absoft  OMP
MPI             1        2        4        6        8       10       12       24       48
1          7167.5   4018.3   2148.9   1897.4   1415.9   1408.8   1363.7   1357.6   1467.1
2          3551.7   2307.5   1634.5     1339   1374.4   1355.2   1401.5   1416.2
4          1933.7   2045.5   1036.4   1055.8     1116    985.4   1085.2
6          2026.4   1228.4    941.2   1078.3    978.1
8          1627.4   1162.5    885.2    978.3
10         1459.5   1001.8   1056.5
12         2250.2   1404.5   1247.5
24         1007.2      881
48            831

Table 3.2. Intel compiler wall clock time in seconds with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows give the MPI process count.

Intel   OMP
MPI             1        2        4        6        8       10       12       24       48
1            6006   3186.8   2118.4   1483.7   1402.4     1207   1190.9   1441.1   1525.7
2          3136.2   2009.6   1231.3   1274.2   1270.8   1322.9   1317.4   1521.3
4          2403.2   1442.2   1168.7   1016.7   1102.8   1070.3   1156.7
6          1773.7   1081.5   1059.7   1025.1   1052.7
8          1513.3   1148.1     1093   1000.9
10           1332   1048.5     1045
12         1406.4   1099.8    965.4
24         1029.7    934.7
48            844

Table 3.3. Portland compiler wall clock time in seconds with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows give the MPI process count.

Portland  OMP
MPI             1        2        4        6        8       10       12       24       48
1            4874   2628.7   1515.4   1235.7   1155.4   1090.3     1183     1092   1158.6
2          2488.3   1600.4   1096.8    945.2     1049   1005.1    934.2   1351.7
4          1786.9   1034.7    803.8    942.7    952.8    843.4   1021.9
6          1308.8      996    788.7    836.8      838
8          1165.2    868.6    926.1    809.1
10         1119.6    913.4    982.8
12          912.8    744.7    769.3
24          765.4      787
48          746.5

Fig 3.1. The ordinate shows the ratio of wall clock time of Absoft versus Portland compiler with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. The horizontal axis is the OpenMP thread count and the legend shows the number of MPI processes. The number of cores used is the product of the two values.

Fig 3.2. The ordinate shows the ratio of wall clock time of Intel versus Portland compiler with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node. The horizontal axis is the log of the OpenMP thread count and the legend shows the number of MPI processes. The number of cores used is the product of the two values.
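As a hedged illustration of how the ratios plotted in Figs. 3.1 and 3.2 can be formed, the sketch below divides the Absoft and Intel times by the Portland times for the single-MPI-process case, using values copied from Tables 3.1 - 3.3. The function name `ratio_to_reference` is an illustrative choice and is not part of the original analysis.

```python
# Wall clock times in seconds for one MPI process, copied from
# Tables 3.1 - 3.3 (OpenMP thread counts 1 through 48).
THREADS = [1, 2, 4, 6, 8, 10, 12, 24, 48]
ABSOFT = [7167.5, 4018.3, 2148.9, 1897.4, 1415.9, 1408.8, 1363.7, 1357.6, 1467.1]
INTEL = [6006.0, 3186.8, 2118.4, 1483.7, 1402.4, 1207.0, 1190.9, 1441.1, 1525.7]
PORTLAND = [4874.0, 2628.7, 1515.4, 1235.7, 1155.4, 1090.3, 1183.0, 1092.0, 1158.6]


def ratio_to_reference(times, reference):
    """Element-wise ratio times[i] / reference[i]; values above 1 mean slower."""
    if len(times) != len(reference):
        raise ValueError("series must have the same length")
    return [t / r for t, r in zip(times, reference)]


if __name__ == "__main__":
    absoft_ratio = ratio_to_reference(ABSOFT, PORTLAND)
    intel_ratio = ratio_to_reference(INTEL, PORTLAND)
    print("threads  Absoft/Portland  Intel/Portland")
    for n, a, i in zip(THREADS, absoft_ratio, intel_ratio):
        print(f"{n:7d}  {a:15.2f}  {i:14.2f}")
```

The same function applies unchanged to any other row of the tables, as long as the compared series cover the same thread counts.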

3.2 Scaling with thread count

Scaling by OpenMP thread count, with a fixed number of MPI processes, is shown for the Absoft, Intel and Portland compilers in Tables 3.4 - 3.6, respectively. Each entry is the wall clock time with one OpenMP thread divided by the wall clock time at the given thread count, for the same number of MPI processes. All three compilers scale poorly when the number of MPI processes is 4 or larger, and the scaling at 2 MPI processes is uneven.
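A minimal sketch of the thread scaling calculation, assuming scaling is the one-thread wall clock time divided by the time at each thread count. It uses the Absoft times for one MPI process from Table 3.1, and the helper name `speedup` is illustrative.

```python
def speedup(times):
    """Speedup relative to the first entry: times[0] / times[i]."""
    base = times[0]
    return [base / t for t in times]


# Absoft wall clock times (s) for one MPI process, from Table 3.1.
threads = [1, 2, 4, 6, 8, 10, 12, 24, 48]
absoft_1mpi = [7167.5, 4018.3, 2148.9, 1897.4, 1415.9, 1408.8, 1363.7, 1357.6, 1467.1]

if __name__ == "__main__":
    for n, s in zip(threads, speedup(absoft_1mpi)):
        print(f"{n:2d} threads: speedup {s:.2f}")
```

Rounded to two decimals, these values agree with the row for one MPI process in Table 3.4.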

Table 3.4. Absoft compiler scaling by OpenMP thread count for a fixed number of MPI processes with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows give the MPI process count.

Absoft  OMP
MPI             1        2        4        6        8       10       12       24       48
1               1     1.78     3.34     3.78     5.06     5.09     5.26     5.28     4.89
2               1     1.54     2.17     2.65     2.58     2.62     2.53     2.51
4               1     0.95     1.87     1.83     1.73     1.96     1.78
6               1     1.65     2.15     1.88     2.07
8               1     1.40     1.84     1.66
10              1     1.46     1.38
12              1     1.60     1.80
24              1     1.14
48              1

Table 3.5. Intel compiler scaling by OpenMP thread count for a fixed number of MPI processes with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows give the MPI process count.

Intel   OMP
MPI             1        2        4        6        8       10       12       24       48
1               1     1.88     2.84     4.05     4.28     4.98     5.04     4.17     3.94
2               1     1.56     1.48     2.46     2.47     2.37     2.38     2.06
4               1     1.67     2.06     2.36     2.18     2.25     2.08
6               1     1.64     1.67     1.73     1.68
8               1     1.32     1.38     1.51
10              1     1.27     1.27
12              1     1.28     1.46
24              1     1.10
48              1

Table 3.6. Portland compiler scaling by OpenMP thread count for a fixed number of MPI processes with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows give the MPI process count.

Portland  OMP
MPI             1        2        4        6        8       10       12       24       48
1               1     1.85     3.22     3.94     4.22     4.47     4.12     4.46     4.21
2               1     1.55     1.64     2.63     2.37     2.48     2.66     1.84
4               1     1.73     2.22     1.90     1.88     2.12     1.75
6               1     1.31     1.66     1.56     1.56
8               1     1.34     1.26     1.44
10              1     1.23     1.14
12              1     1.23     1.19
24              1     0.97
48              1

3.3 Scaling with MPI process count

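Scaling by MPI process count, with a fixed number of OpenMP threads, is shown for the Absoft, Intel and Portland compilers in Tables 3.7 - 3.9, respectively. Each entry is the wall clock time with one MPI process divided by the wall clock time at the given process count, for the same number of OpenMP threads. All three compilers scale poorly when the number of threads is 4 or larger, and the scaling at 4 threads is uneven.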
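The MPI scaling values can be sketched in the same spirit, here for the Absoft times with one OpenMP thread from Table 3.1. The parallel efficiency (speedup divided by process count) is not reported in this study. It is added only as a derived illustration, and the name `mpi_scaling` is hypothetical.

```python
def mpi_scaling(times_by_procs):
    """Map {process count: seconds} to {process count: (speedup, efficiency)}.

    Speedup is relative to the smallest process count present; efficiency is
    speedup divided by the process-count ratio (1.0 would be ideal scaling).
    """
    p0 = min(times_by_procs)
    t0 = times_by_procs[p0]
    result = {}
    for p, t in sorted(times_by_procs.items()):
        s = t0 / t
        result[p] = (s, s / (p / p0))
    return result


# Absoft wall clock times (s) with one OpenMP thread, from Table 3.1.
absoft_1omp = {1: 7167.5, 2: 3551.7, 4: 1933.7, 6: 2026.4, 8: 1627.4,
               10: 1459.5, 12: 2250.2, 24: 1007.2, 48: 831.0}

if __name__ == "__main__":
    for p, (s, e) in mpi_scaling(absoft_1omp).items():
        print(f"{p:2d} processes: speedup {s:.2f}, efficiency {e:.0%}")
```

The speedup values reproduce the column for one OpenMP thread in Table 3.7 to two decimals.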

Table 3.7. Absoft compiler scaling by MPI process count for a fixed number of OpenMP threads with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows give the MPI process count.

Absoft  OMP
MPI             1        2        4        6        8       10       12       24       48
1            1.00     1.00     1.00     1.00     1.00     1.00     1.00     1.00     1.00
2            2.02     1.74     1.31     1.42     1.03     1.04     0.97     0.96
4            3.71     1.96     2.07     1.80     1.27     1.43     1.26
6            3.54     3.27     2.28     1.76     1.45
8            4.40     3.46     2.43     1.94
10           4.91     4.01     2.03
12           3.19     2.86     1.72
24           7.12     4.56
48           8.63

Table 3.8. Intel compiler scaling by MPI process count for a fixed number of OpenMP threads with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows give the MPI process count.

Intel   OMP
MPI             1        2        4        6        8       10       12       24       48
1            1.00     1.00     1.00     1.00     1.00     1.00     1.00     1.00     1.00
2            1.92     1.59     1.00     1.16     1.10     0.91     0.90     0.95
4            2.50     2.21     1.81     1.46     1.27     1.13     1.03
6            3.39     2.95     2.00     1.45     1.33
8            3.97     2.78     1.94     1.48
10           4.51     3.04     2.03
12           4.27     2.90     2.19
24           5.83     3.41
48           7.12

Table 3.9. Portland compiler scaling by MPI process count for a fixed number of OpenMP threads with problem size N=60,000 in the SOM benchmark on the AMD 12-core 6176SE 4P node. Columns give the OpenMP thread count and rows give the MPI process count.

Portland  OMP
MPI             1        2        4        6        8       10       12       24       48
1            1.00     1.00     1.00     1.00     1.00     1.00     1.00     1.00     1.00
2            1.96     1.64     1.00     1.31     1.10     1.08     1.27     0.81
4            2.73     2.54     1.89     1.31     1.21     1.29     1.16
6            3.72     2.64     1.92     1.48     1.38
8            4.18     3.03     1.64     1.53
10           4.35     2.88     1.54
12           5.34     3.53     1.97
24           6.37     3.34
48           6.53

3.4 Results for fixed chunk size and core count

The results above were for multiple combinations of MPI processes and OpenMP threads, each ranging from 1 to 48. This section shows results selected for combinations where the product of the number of MPI processes and the number of OpenMP threads is exactly 48, for example, 12 MPI processes and 4 OpenMP threads, or 4 MPI processes and 12 OpenMP threads. These combinations use all 48 cores of the node. They also keep the parallel chunk size per thread constant, which equalizes one variable affecting memory usage when comparing the three compilers. For this selection Fig. 3.3 shows the wall clock times taken from the diagonal of Tables 3.1 - 3.3 (the entries where the product of the MPI process count and the OpenMP thread count is 48), whereas Fig. 3.4 shows the corresponding ratios of these times to the Portland result.
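A short sketch, under the assumption of an even split of the N rows over all threads, enumerates the combinations that satisfy this selection rule and confirms that the rows per thread stay constant. The names `COUNTS` and `full_node_combinations` are illustrative.

```python
# Process and thread counts used in Tables 3.1 - 3.3.
COUNTS = [1, 2, 4, 6, 8, 10, 12, 24, 48]
CORES = 48   # 4 CPUs x 12 cores
N = 60_000   # interior grid points per side


def full_node_combinations(counts, cores):
    """All (mpi, omp) pairs drawn from `counts` whose product equals `cores`."""
    allowed = set(counts)
    return [(p, cores // p) for p in counts
            if cores % p == 0 and cores // p in allowed]


if __name__ == "__main__":
    for mpi, omp in full_node_combinations(COUNTS, CORES):
        # Assumed even split of the N rows over all mpi * omp threads.
        print(f"{mpi:2d} MPI x {omp:2d} OMP -> {N // (mpi * omp)} rows per thread")
```

The count of 10 drops out because 48 is not divisible by 10, which leaves eight combinations from 1 x 48 to 48 x 1.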

Fig 3.3. Wall clock time of three compilers with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node in OpenMP+MPI hybrid mode such that the product of the number of MPI processes and OpenMP threads is 48. The horizontal axis shows the number of MPI processes.

Fig 3.4. Ratio of wall clock time of Absoft and Intel compilers to the Portland result with problem size N=60,000 in the SOM benchmark on the AMD 12 core 6176SE 4P node in OpenMP+MPI hybrid mode such that the product of the number of MPI processes and OpenMP threads is 48. The horizontal axis shows the number of MPI processes.

4.0 ANALYSIS OF RESULTS

Exploratory benchmarks comparing three compilers on a simple hybrid model with a regular data structure showed the smallest wall clock times for the Portland compiler over a broad parameter range of a parallel hybrid MPI+OpenMP SOM model. Relative to the corresponding Portland results, the variability in wall clock times was largest for the Absoft compiler when the number of MPI processes was less than 8, whereas the variability of Absoft and Intel wall clock times was similar for more than 8. The greatest divergences occur at thread counts of 1, 2, 4 and 12, and for MPI process counts of 1 and 12. Possible causes are cache effects or thread/process data affinity issues. The latter relate to where data resides relative to the host core for each thread or process. While it is possible to schedule MPI processes to specific (numbered) cores with the mpiexec command in MPI-2, this was not attempted here, and all scheduling was left to the runtime libraries of the respective compilers and the operating system.
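No explicit placement was used in this benchmark. Purely to illustrate what a fixed assignment of processes and threads to numbered cores could look like, the sketch below builds a block map for a node with 4 CPUs of 12 cores each. The core numbering (ids 0 to 47, counted socket by socket) and the function name `block_core_map` are assumptions, and real core numbering may differ.

```python
def block_core_map(n_mpi, n_omp, cores_per_socket=12, sockets=4):
    """Assign each (rank, thread) a (core id, socket id) in rank-major order.

    Hypothetical layout: core ids are assumed to be numbered socket by
    socket, which may not match the numbering on a real system.
    """
    total = cores_per_socket * sockets
    if n_mpi * n_omp > total:
        raise ValueError("more threads than cores")
    placement = {}
    for rank in range(n_mpi):
        for thread in range(n_omp):
            core = rank * n_omp + thread
            placement[(rank, thread)] = (core, core // cores_per_socket)
    return placement


if __name__ == "__main__":
    placement = block_core_map(n_mpi=4, n_omp=12)
    for rank in range(4):
        sockets_used = {placement[(rank, t)][1] for t in range(12)}
        print(f"rank {rank}: sockets {sorted(sockets_used)}")
```

With 4 MPI processes and 12 threads this assumed map keeps each process on a single CPU, which is the kind of locality that default scheduling does not guarantee.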

For scaling with increasing MPI process count or OpenMP thread count, all three compilers showed good results when these counts were less than or equal to 4. Outside this range scaling results were poor. This could be an artifact of insufficient arithmetic work inside the corresponding (smaller) parallel chunks, since parallel granularity becomes finer with increasing core count.

5.0 CONCLUSIONS

Exploratory benchmark measurements on a 48-core AMD node confirm that all three compilers deliver good scaling performance at low core counts. Performance at higher core counts was limited by the finer parallel granularity in the benchmark model. For wall clock time the Portland compiler is the best performer in this roundup. However, the Intel and Absoft compiler timing results were close, with the exception of the case with 12 MPI processes. Actual performance of commodity solutions in real-world applications will vary, and results for specific Air Quality Models (AQM) are the subject of subsequent reports.

HiPERiSM Consulting, LLC, (919) 484-9803 (Voice)

(919) 806-2813 (Facsimile)