HiPERiSM's Technical Reports

HiPERiSM - High Performance Algorism Consulting

HCTR-2001-6: Compiler Performance 7

 



 
 
 

1.0 The Stommel Ocean Model: 2-D MPI decomposition + OpenMP Hybrid

1.1 Serial and MPI+OpenMP hybrid code

This report compares parallel MPI+OpenMP hybrid performance with SUN Fortran compilers on the SUN E10000™ platform for a floating-point intensive application. The application is the Stommel Ocean Model (SOM77); the Fortran 77 source code was developed by Jay Jayakumar (serial version) and Luke Lonnergan (MPI version) at the NAVO site, Stennis Space Center, MS, and is available at http://www.navo.hpc.mil/pet/Video/Courses/MPI_Finite. The algorithm is identical to the Fortran 90 version discussed in report HCTR-2001-2, but the Fortran 77 version allows more flexibility in the domain decomposition for MPI. The OpenMP hybrid version was developed by HiPERiSM Consulting, LLC, as part of case studies for its training courses.


This is a 2-dimensional domain decomposition in both the x and y directions, with square slabs of the domain assigned to different MPI processes (one square sub-domain per process). In the MPI version all parameters must be broadcast by the process with rank 0 to all processes before computation begins. Otherwise the code is identical to the serial version, except that each MPI process operates on its own square of the domain. At the beginning of each iteration the processes synchronize boundary values by exchanging adjacent ghost arrays (parallel to either the x or y direction) with the nearest-neighbor processes wherever square sides are adjacent (subroutine EXCHANGE). The exterior boundaries of the outermost squares do not exchange rows since they correspond to domain boundaries.

The hybrid version places OpenMP parallel regions around loop nests such as the compute kernel that performs a Jacobi iteration sweep over a two-dimensional finite difference grid (the number of iterations is set to 100). In this hybrid model there are two levels of parallel granularity, with MPI at the coarser grain and OpenMP at the finer grain. All MPI procedures are called by the master thread and none are called from any OpenMP parallel region. This "safe" coding practice makes no assumptions about the thread safety of the MPI library used; for this reason the call to MPI_ALLREDUCE has been moved out of the OpenMP parallel region. There is an implied barrier at the end of the OpenMP parallel region (all threads on all OpenMP nodes must have completed computation before the MPI reduction operation). Two reduction operations are therefore performed: one over the OpenMP threads on each node, and another over the MPI processes between nodes.

The problem sizes chosen were N=1000 for a 1000 x 1000 and N=2000 for a 2000 x 2000 finite difference grid. Static scheduling was used for the OpenMP work distribution, with a scheduling parameter isched chosen to vary with problem size, MPI process count, and OpenMP thread count.
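The compute kernel described above can be sketched as follows. This is a minimal serial Python illustration of one Jacobi sweep (the original SOM77 code is Fortran 77 with MPI and OpenMP; the function name, coefficient names a..e, and the list-based arrays here are hypothetical stand-ins):

```python
# Minimal serial sketch of the Jacobi sweep in the SOM77 compute kernel.
# The real code is Fortran 77; the coefficients (a..e) and list-of-lists
# arrays here are hypothetical stand-ins.

def jacobi_sweep(psi, force, a, b, c, d, e):
    """One Jacobi sweep over the interior of a 2-D grid.

    psi includes a one-cell boundary/ghost layer on each side.  Returns
    the updated grid and the accumulated residual -- the quantity that
    the hybrid code reduces twice: first over the OpenMP threads on a
    node, then across MPI ranks with MPI_ALLREDUCE.
    """
    ny, nx = len(psi), len(psi[0])
    new = [row[:] for row in psi]      # boundaries carried over unchanged
    diff = 0.0
    for j in range(1, ny - 1):         # OpenMP splits this loop (static
        for i in range(1, nx - 1):     # schedule, chunk size isched)
            new[j][i] = (a * psi[j][i + 1] + b * psi[j][i - 1]
                         + c * psi[j + 1][i] + d * psi[j - 1][i]
                         - force[j][i]) / e
            diff += abs(new[j][i] - psi[j][i])
    return new, diff
```

In the MPI version each process runs such a sweep on its own square sub-domain after refreshing its ghost layer from the nearest neighbors (subroutine EXCHANGE).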

 

2.1 MPI+OpenMP parallel performance for N=1000

For problem size N=1000 this section shows parallel performance for the Stommel Ocean Model (SOM77) in a 2-D MPI domain decomposition with the SUN Fortran 77 compiler using a hybrid MPI+OpenMP parallel model.

Table 2.1a shows results for the MPI+OpenMP version executed on the SUN E10000 for 1, 2 x 2 (= 4), and 4 x 4 (= 16) MPI processes and 1, 2, and 4 threads. The speed up shown there is relative to the case of one MPI process and one OpenMP thread. The E10000 used here is a 64-processor node, and the background workload impacted the last example (16 processes x 4 threads = 64 CPUs).
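As a quick arithmetic check, the speed-up entries in Table 2.1a are simply the (1 process, 1 thread) wall clock time divided by each configuration's wall clock time. A short Python sketch using the tabulated values (the dictionary layout is illustrative, not part of the original code):

```python
# Speed up relative to 1 MPI process x 1 OpenMP thread, using the
# wall clock times tabulated in Table 2.1a (N=1000).
wall = {  # (MPI processes, OpenMP threads) -> wall clock seconds
    (1, 1): 6.13, (4, 1): 0.88, (16, 1): 0.25,
    (1, 2): 3.13, (4, 2): 1.00, (16, 2): 0.50,
    (1, 4): 1.25, (4, 4): 0.50, (16, 4): 18.8,
}
base = wall[(1, 1)]
speedup = {cfg: base / t for cfg, t in wall.items()}
```

The computed ratios reproduce the table's speed-up column to within rounding; note that the (16 processes, 4 threads) entry falls below 1.0, reflecting the loaded 64-CPU node.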

Table 2.1a: Performance summary for hybrid MPI+OpenMP Stommel Ocean Model (SOM77) in a 2-D MPI decomposition (problem size 1000)

                   Wall clock time (sec)    |         Speed up
MPI processes:       1       4      16      |     1       4      16
1 thread            6.13    0.88    0.25    |    1.00    7.00   24.5
2 threads           3.13    1.00    0.50    |    1.96    6.13   12.3
4 threads           1.25    0.50   18.8     |    4.90   12.3     0.33

 

Table 2.1b shows how the value of the OpenMP scheduling parameter isched changes with the number of MPI processes and threads, together with the corresponding OpenMP speed up (relative to one thread at the same MPI process count).

Table 2.1b: OpenMP scheduling parameter isched (speed up) for hybrid MPI+OpenMP Stommel Ocean Model (SOM77) in a 2-D MPI decomposition (problem size 1000)

             1 MPI process   4 MPI processes   16 MPI processes
1 thread      500 (1.00)       250 (1.00)        125 (1.00)
2 threads     250 (1.96)       125 (0.88)         62 (0.50)
4 threads     125 (4.90)        62 (1.75)         31 (0.013)
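The tabulated isched values appear to follow a simple rule: the side of the local square sub-domain (N divided by the square root of the MPI process count) divided by twice the OpenMP thread count. The report does not state this formula explicitly, so the sketch below is an inference that reproduces the values in Tables 2.1b and 3.1b:

```python
import math

def isched(n, procs, threads):
    # Inferred rule behind the tabulated isched values: the local
    # sub-domain side (n over sqrt(procs), for the square 2-D
    # decomposition) divided by twice the OpenMP thread count.
    # The report tabulates the values but does not state the formula.
    side = n // math.isqrt(procs)
    return side // (2 * threads)
```

For example, isched(1000, 16, 4) gives 31, matching the last entry of Table 2.1b.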

 

Fig. 2.1. Time to solution in seconds for SOM 2-D when N=1000 in a hybrid MPI+OpenMP model on the SUN E10000 for 1, 2 x 2, and 4 x 4 MPI processes and 1, 2, and 4 threads.

 

Fig. 2.2. Speed up for SOM 2-D when N=1000 in a hybrid MPI+OpenMP model on the SUN E10000 for 1, 2 x 2, and 4 x 4 MPI processes and 1, 2, and 4 threads.

 

3.1 MPI+OpenMP parallel performance for N=2000

For problem size N=2000 this section shows parallel performance for the Stommel Ocean Model (SOM77) in a 2-D MPI domain decomposition with the SUN Fortran 77 compiler using a hybrid MPI+OpenMP parallel model. 

Table 3.1a shows results for the MPI+OpenMP version executed on the SUN E10000 for 1, 2 x 2 (= 4), and 4 x 4 (= 16) MPI processes and 1, 2, and 4 threads. The speed up shown there is relative to the case of one MPI process and one OpenMP thread. The E10000 used here is a 64-processor node, and the background workload impacted the last example (16 processes x 4 threads = 64 CPUs).

Table 3.1a: Performance summary for hybrid MPI+OpenMP Stommel Ocean Model (SOM77) in a 2-D MPI decomposition (problem size 2000)

                   Wall clock time (sec)    |         Speed up
MPI processes:       1       4      16      |     1       4      16
1 thread           44.88    7.5     0.75    |    1.00    5.98   59.8
2 threads          22.62    3.5     1.5     |    1.98   12.8    29.9
4 threads          12.00    2.0     5.25    |    3.74   22.4     8.55

Table 3.1b shows how the value of the OpenMP scheduling parameter isched changes with the number of MPI processes and threads, together with the corresponding OpenMP speed up (relative to one thread at the same MPI process count).

Table 3.1b: OpenMP scheduling parameter isched (speed up) for hybrid MPI+OpenMP Stommel Ocean Model (SOM77) in a 2-D MPI decomposition (problem size 2000)

             1 MPI process   4 MPI processes   16 MPI processes
1 thread     1000 (1.00)       500 (1.00)        250 (1.00)
2 threads     500 (1.98)       250 (2.14)        125 (0.50)
4 threads     250 (3.74)       125 (3.75)         62 (0.14)

 

Fig. 3.1. Time to solution in seconds for SOM 2-D when N=2000 in a hybrid MPI+OpenMP model on the SUN E10000 for 1, 2 x 2, and 4 x 4 MPI processes and 1, 2, and 4 threads.

 

Fig. 3.2. Speed up for SOM 2-D when N=2000 in a hybrid MPI+OpenMP model on the SUN E10000 for 1, 2 x 2, and 4 x 4 MPI processes and 1, 2, and 4 threads.

 

Fig. 3.3. Speed up relative to one thread for SOM 2-D when N=2000 in a hybrid MPI+OpenMP model on the SUN E10000 for 4 x 4, 2 x 2, and 1 MPI processes, respectively, and 1, 2, and 4 threads.

 


HiPERiSM Consulting, LLC, (919) 484-9803 (Voice)

(919) 806-2813 (Facsimile)