HiPERiSM's Technical Reports

HiPERiSM - High Performance Algorism Consulting

HCTR-2001-5: Compiler Performance 6

1.0 The Stommel Ocean Model: 1-D MPI decomposition + OpenMP Hybrid

1.1 Serial and MPI+OpenMP hybrid code

This report compares parallel MPI+OpenMP hybrid performance with SUN Fortran compilers on the SUN E10000™ platform for a floating-point intensive application. The application is the Stommel Ocean Model (SOM77). The Fortran 77 source code was developed by Jay Jayakumar (serial version) and Luke Lonnergan (MPI version) at the NAVO site, Stennis Space Center, MS, and is available at http://www.navo.hpc.mil/pet/Video/Courses/MPI_Finite. The algorithm is identical to the Fortran 90 version discussed in report HCTR-2001-2, but the Fortran 77 version allows more flexibility in the domain decomposition for MPI. The OpenMP hybrid version was developed by HiPERiSM Consulting, LLC, as part of case studies for its training courses.


This is a 1-dimensional domain decomposition in the y direction, with horizontal slabs of the domain assigned to different MPI processes (one slab per process). In the MPI version all parameters must be broadcast by the process with rank 0 to all processes before computation begins. Otherwise the code is identical to the serial version, except that each MPI process operates on its own slab of the domain. At the beginning of each iteration the slabs synchronize values by exchanging adjacent ghost rows (parallel to the x direction) with the nearest-neighbor processes wherever slabs are adjacent (subroutine EXCHANGE). The top row of the uppermost slab and the lowest row of the bottommost slab do not exchange rows since they correspond to domain boundaries.

The hybrid version places OpenMP parallel regions around loop nests such as the compute kernel that performs a Jacobi iteration sweep over a two-dimensional finite difference grid (where the number of iterations is set to 100). In this hybrid model there are two levels of parallel granularity, with MPI at the coarser grain and OpenMP at the finer grain. All MPI procedures are called by the master thread and none are called from any OpenMP parallel region. This ensures a "safe" coding practice that makes no assumptions about the thread safety of the MPI library used. For this reason the call to MPI_ALLREDUCE has been moved out of the OpenMP parallel region. There is an implied barrier at the end of the OpenMP parallel region (all threads on all OpenMP nodes must have completed computation before the MPI reduction operation). Two reduction operations are therefore performed: one over OpenMP threads on each node and another over MPI processes between nodes.

The problem size chosen was N=1000 for a 1000 x 1000 finite difference grid. Static scheduling was used for the OpenMP work distribution, with a scheduling parameter isched chosen to vary with the problem size and with the MPI process and OpenMP thread counts. Sketches of the ghost-row exchange and of the hybrid compute kernel are given below.
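
As an illustration of the exchange step, the following sketch shows how each slab can swap one ghost row with each adjacent slab using MPI_SENDRECV. This is a minimal sketch, not the SOM77 source: the argument names (psi, nx, j1, j2, top, bot) are assumed for this example. Passing MPI_PROC_NULL as the neighbor rank at a physical domain boundary turns the corresponding transfer into a no-op, which is how the uppermost and bottommost slabs skip the exchange at the domain edges.

c     Minimal sketch (not the SOM77 source) of the ghost-row exchange
c     for a 1-D slab decomposition.  psi holds this process's slab
c     (rows j1..j2) plus one ghost row on each side; top and bot are
c     the neighbor ranks, set to MPI_PROC_NULL at the physical
c     domain boundaries so that those transfers become no-ops.
      subroutine exchange(psi, nx, j1, j2, top, bot, comm)
      implicit none
      include 'mpif.h'
      integer nx, j1, j2, top, bot, comm, ierr
      integer status(MPI_STATUS_SIZE)
      double precision psi(0:nx+1, j1-1:j2+1)
c     send the highest interior row up, receive the lower ghost row
      call MPI_SENDRECV(psi(0,j2), nx+2, MPI_DOUBLE_PRECISION,
     &     top, 1, psi(0,j1-1), nx+2, MPI_DOUBLE_PRECISION,
     &     bot, 1, comm, status, ierr)
c     send the lowest interior row down, receive the upper ghost row
      call MPI_SENDRECV(psi(0,j1), nx+2, MPI_DOUBLE_PRECISION,
     &     bot, 2, psi(0,j2+1), nx+2, MPI_DOUBLE_PRECISION,
     &     top, 2, comm, status, ierr)
      return
      end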
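
The hybrid compute kernel can be sketched as follows, again with assumed identifiers (psinew, frc, and the coefficients a1 to a5 in a common block) rather than the actual SOM77 names. The sketch shows the Jacobi sweep over the local slab distributed over OpenMP threads with a static schedule of chunk size isched, the first reduction of the residual performed by the OpenMP REDUCTION clause, and the second reduction performed across MPI processes by MPI_ALLREDUCE after the parallel region has ended, so that only the master thread makes MPI calls.

c     Minimal sketch (not the SOM77 source) of the hybrid kernel.
c     One Jacobi sweep over this process's slab is spread over the
c     OpenMP threads with a static schedule of chunk size isched.
c     The residual diff is reduced first over threads (REDUCTION)
c     and then over MPI processes (MPI_ALLREDUCE).
      subroutine do_jacobi(psi, psinew, frc, nx, j1, j2, isched, gdiff)
      implicit none
      include 'mpif.h'
      integer nx, j1, j2, isched, i, j, ierr
      double precision psi(0:nx+1, j1-1:j2+1)
      double precision psinew(0:nx+1, j1-1:j2+1)
      double precision frc(0:nx+1, j1-1:j2+1)
      double precision gdiff, diff, pnew
      double precision a1, a2, a3, a4, a5
      common /coef/ a1, a2, a3, a4, a5
      diff = 0.0d0
!$OMP PARALLEL DO SCHEDULE(STATIC,isched) PRIVATE(i,pnew)
!$OMP&  REDUCTION(+:diff)
      do j = j1, j2
         do i = 1, nx
            pnew = a1*psi(i+1,j) + a2*psi(i-1,j) + a3*psi(i,j+1)
     &           + a4*psi(i,j-1) - a5*frc(i,j)
            diff = diff + abs(pnew - psi(i,j))
            psinew(i,j) = pnew
         end do
      end do
!$OMP END PARALLEL DO
c     implied barrier at END PARALLEL DO: all threads have finished
c     before the master thread performs the inter-process reduction
      call MPI_ALLREDUCE(diff, gdiff, 1, MPI_DOUBLE_PRECISION,
     &     MPI_SUM, MPI_COMM_WORLD, ierr)
      return
      end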

2.1 MPI+OpenMP hybrid parallel performance

This section shows parallel performance for the Stommel Ocean Model (SOM77) in a 1-D MPI domain decomposition with the SUN Fortran 77 compiler using a hybrid MPI+OpenMP parallel model.

Table 2.1a shows results for the MPI+OpenMP version executed on the SUN E10000 for 1, 2 and 4 MPI processes and 1, 2, and 4 threads. The speed up shown there is relative to the case of one MPI process and one OpenMP thread.
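
That is, speed up is the wall clock time for one MPI process with one OpenMP thread divided by the wall clock time for the configuration in question; for example, 4 MPI processes with 1 thread give a speed up of 6.25 / 0.875 = 7.14.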

Table 2.1a: Performance summary for hybrid MPI+OpenMP Stommel Ocean Model (SOM77) in a 1-D MPI decomposition (problem size 1000)

                   Wall clock time (sec)             Speed up
MPI processes:          1        2        4        1        2        4
1 thread             6.25     1.63    0.875     1.00     3.83     7.14
2 threads            3.25    0.875    0.500     1.92     7.14     12.5
4 threads            1.25    0.500    0.375     5.00     12.5     16.6

Table 2.1b shows the value of the OpenMP scheduling parameter isched, and (in parentheses) the corresponding OpenMP speed up relative to one thread, for each combination of MPI process and OpenMP thread counts.

Table 2.1b: OpenMP scheduling parameter isched (speed up) for hybrid MPI+OpenMP Stommel Ocean Model (SOM77) in a 1-D MPI decomposition (problem size 1000)

MPI processes:              1            2            4
1 thread           500 (1.00)   250 (1.00)   125 (1.00)
2 threads          250 (1.92)   125 (1.86)    62 (1.75)
4 threads          125 (5.00)    62 (3.26)    31 (2.33)
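
As an illustration only, the tabulated isched values for N=1000 are reproduced by a rule that gives each OpenMP thread two static chunks of its process's slab; the helper below (whose name and arguments are assumed, not taken from the SOM77 source) expresses this rule.

      integer function chunk(n, nprocs, nthreads)
c     Illustrative only: choose the OpenMP static chunk size so that
c     each thread receives two chunks of its process's slab.  With
c     n = 1000 this reproduces the isched values in Table 2.1b
c     (integer division gives 62 and 31 for the smallest entries).
      implicit none
      integer n, nprocs, nthreads
      chunk = n / (2 * nprocs * nthreads)
      return
      end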

 

The following figures show the time to solution in seconds (as reported by the MPI_WTIME procedure) and the corresponding speed up.
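
The assumed timing pattern brackets the iteration loop with calls to MPI_WTIME on each process (myrank below is assumed to hold the MPI rank of the process), for example:

c     Assumed timing pattern; mpif.h declares MPI_WTIME as a
c     double precision function returning elapsed seconds.
      double precision t0, t1
      t0 = MPI_WTIME()
c     ... 100 iterations: EXCHANGE, Jacobi sweep, copy psinew to psi ...
      t1 = MPI_WTIME()
      if (myrank .eq. 0) write(*,*) 'wall clock time (s) = ', t1 - t0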

 

 

Fig. 2.1. Time to solution in seconds for SOM 1-D when N=1000 in a hybrid MPI+OpenMP model on the SUN E10000 for 1, 2 and 4 MPI processes and 1, 2, and 4 threads.

 

 

Fig. 2.2. Speed up for SOM 1-D when N=1000 in a hybrid MPI+OpenMP model on the SUN E10000 for 1, 2, and 4 MPI processes and 1, 2, and 4 threads.

 

Fig. 2.3. Speed up relative to one thread for SOM 1-D when N=1000 in a hybrid MPI+OpenMP model on the SUN E10000 for 4, 2, and 1 MPI processes and 1, 2, and 4 threads.

 


HiPERiSM Consulting, LLC, (919) 484-9803 (Voice)

(919) 806-2813 (Facsimile)