1.0 The Stommel Ocean Model: 1-D MPI decomposition + OpenMP Hybrid
1.1 Serial and MPI+OpenMP hybrid code
This is a comparison of parallel MPI+OpenMP hybrid performance with SUN Fortran compilers on the SUN E10000 platform for a floating-point intensive application. The application is the Stommel Ocean Model (SOM77), and the Fortran 77 source code was developed by Jay Jayakumar (serial version) and Luke Lonnergan (MPI version) at the NAVO site, Stennis Space Center, MS. It is available at http://www.navo.hpc.mil/pet/Video/Courses/MPI_Finite. The algorithm is identical to the Fortran 90 version discussed in report HCTR-2001-2, but the Fortran 77 version allows more flexibility in the domain decomposition for MPI. The OpenMP hybrid version was developed by HiPERiSM Consulting, LLC, as part of case studies for the training courses.
This is a one-dimensional domain decomposition in the y direction, with horizontal slabs of the domain assigned to different MPI processes (one slab per process). In the MPI version all parameters must be broadcast by the process with rank 0 to all processes before computation begins; otherwise the code is identical to the serial version, except that each MPI process operates on its own slab of the domain. At the beginning of each iteration the slabs synchronize values by exchanging adjacent ghost rows (parallel to the x direction) with the nearest-neighbor processes wherever slabs are adjacent (subroutine EXCHANGE). The top row of the uppermost slab and the lowest row of the bottommost slab do not exchange rows, since they correspond to domain boundaries.
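The report does not list the EXCHANGE source, so the following Fortran 77 fragment is only a minimal sketch of the pattern just described; the array PSI(0:NX+1,J1-1:J2+1) (the slab plus one ghost row on each side) and the neighbor ranks ABOVE and BELOW are assumptions. Setting a neighbor rank to MPI_PROC_NULL at a domain boundary turns the corresponding transfer into a no-op, which is one common way to obtain the boundary behavior described above.

C     Minimal sketch of a ghost-row exchange for the 1-D decomposition
C     in y (names are assumptions, not the report's EXCHANGE listing).
C     PSI(0:NX+1,J1-1:J2+1) is this process's slab plus one ghost row
C     on each side; ABOVE and BELOW are the neighbor ranks, set to
C     MPI_PROC_NULL at the domain boundaries so that the top row of
C     the uppermost slab and the bottom row of the lowest slab are
C     never exchanged.
      SUBROUTINE EXCHANGE(PSI, NX, J1, J2, ABOVE, BELOW)
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER NX, J1, J2, ABOVE, BELOW
      INTEGER STATUS(MPI_STATUS_SIZE), IERR
      DOUBLE PRECISION PSI(0:NX+1,J1-1:J2+1)
C     Send the top interior row up, receive a ghost row from below
      CALL MPI_SENDRECV(PSI(0,J2),   NX+2, MPI_DOUBLE_PRECISION,
     &                  ABOVE, 10,
     &                  PSI(0,J1-1), NX+2, MPI_DOUBLE_PRECISION,
     &                  BELOW, 10, MPI_COMM_WORLD, STATUS, IERR)
C     Send the bottom interior row down, receive a ghost row from above
      CALL MPI_SENDRECV(PSI(0,J1),   NX+2, MPI_DOUBLE_PRECISION,
     &                  BELOW, 20,
     &                  PSI(0,J2+1), NX+2, MPI_DOUBLE_PRECISION,
     &                  ABOVE, 20, MPI_COMM_WORLD, STATUS, IERR)
      RETURN
      END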
The hybrid version placed OpenMP parallel regions around loop nests such as the compute kernel, which performs a Jacobi iteration sweep over a two-dimensional finite difference grid (the number of iterations is set to 100). In this hybrid model there are two levels of parallel granularity, with MPI at the coarser grain and OpenMP at the finer grain. All MPI procedures are called by the master thread and none are called from any OpenMP parallel region; this "safe" coding practice makes no assumptions about the thread safety of the MPI library used. For this reason the call to MPI_ALLREDUCE has been moved out of the OpenMP parallel region. There is an implied barrier at the end of the OpenMP parallel region, so all threads on all OpenMP nodes must have completed computation before the MPI reduction operation. Two reduction operations are therefore performed: one for the OpenMP threads on each node and another for the MPI processes between nodes. The problem size chosen was N=1000, for a 1000 x 1000 finite difference grid. Static scheduling was used for the OpenMP work distribution, with a scheduling parameter isched chosen to vary with the problem size and the MPI process and OpenMP thread counts.
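For illustration only, the fragment below is a minimal sketch of how such a hybrid kernel and two-level reduction can be written; the routine name DO_JACOBI, the array and coefficient names (PSI, PSINEW, FOR, A1-A5, DIFF, GDIFF), and the exact five-point update are assumptions rather than a listing of the report's source. The OpenMP REDUCTION clause forms the per-process (per-node) partial sum; only after the implied barrier at the end of the parallel region does the master thread call MPI_ALLREDUCE to combine the partial sums across MPI processes, so no MPI call appears inside the parallel region.

C     Minimal sketch of the hybrid compute kernel for one slab
C     (illustrative only; names and the update formula are assumed).
      SUBROUTINE DO_JACOBI(PSI, PSINEW, FOR, NX, J1, J2,
     &                     A1, A2, A3, A4, A5, ISCHED, GDIFF)
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER NX, J1, J2, ISCHED, I, J, IERR
      DOUBLE PRECISION A1, A2, A3, A4, A5, DIFF, GDIFF
      DOUBLE PRECISION PSI(0:NX+1,J1-1:J2+1),
     &                 PSINEW(0:NX+1,J1-1:J2+1),
     &                 FOR(0:NX+1,J1-1:J2+1)
      DIFF = 0.0D0
C     Fine grain: OpenMP threads share the rows of this slab in static
C     chunks of ISCHED rows and accumulate a per-process partial sum.
C$OMP PARALLEL DO PRIVATE(I,J) REDUCTION(+:DIFF)
C$OMP&SCHEDULE(STATIC,ISCHED)
      DO 100 J = J1, J2
         DO 90 I = 1, NX
            PSINEW(I,J) = A1*PSI(I+1,J) + A2*PSI(I-1,J)
     &                  + A3*PSI(I,J+1) + A4*PSI(I,J-1)
     &                  - A5*FOR(I,J)
            DIFF = DIFF + ABS(PSINEW(I,J) - PSI(I,J))
   90    CONTINUE
  100 CONTINUE
C$OMP END PARALLEL DO
C     Coarse grain: after the implied barrier the master thread alone
C     combines the per-process sums across all MPI processes.
      CALL MPI_ALLREDUCE(DIFF, GDIFF, 1, MPI_DOUBLE_PRECISION,
     &                   MPI_SUM, MPI_COMM_WORLD, IERR)
      RETURN
      END

Here ISCHED is used as the chunk size of the STATIC schedule; if the report's isched plays the same role, this is consistent with the way the tabulated isched values in Table 2.1b scale down as the process and thread counts grow.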
2.1 MPI+OpenMP hybrid parallel performance
This section shows parallel performance for the Stommel Ocean Model (SOM77) in a 1-D MPI domain decomposition with the SUN Fortran 77 compiler using a hybrid MPI+OpenMP parallel model. Table 2.1a shows results for the MPI+OpenMP version executed on the SUN E10000 for 1, 2, and 4 MPI processes and 1, 2, and 4 threads. The speed up shown there is relative to the case of one MPI process and one OpenMP thread.
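For example, with 2 MPI processes and 2 threads the speed up is the single-process, single-thread wall clock time divided by the time for that configuration: 6.25 s / 0.875 s, or about 7.14.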
Table 2.1a: Performance summary for hybrid MPI+OpenMP Stommel Ocean Model (SOM77) in a 1-D MPI decomposition (problem size 1000)

  Wall clock time (sec)
                 1 MPI process   2 MPI processes   4 MPI processes
  1 thread            6.25            1.63              0.875
  2 threads           3.25            0.875             0.500
  4 threads           1.25            0.500             0.375

  Speed up
                 1 MPI process   2 MPI processes   4 MPI processes
  1 thread            1.00            3.83              7.14
  2 threads           1.92            7.14             12.5
  4 threads           5.00           12.5              16.6
Table 2.1b shows how the value of the OpenMP scheduling parameter isched changes with the number of MPI processes and OpenMP threads, together with the corresponding OpenMP speed up relative to one thread at the same number of MPI processes.
Table 2.1b: OpenMP scheduling parameter isched (speed up) for hybrid MPI+OpenMP Stommel Ocean Model (SOM77) in a 1-D MPI decomposition (problem size 1000)

                 1 MPI process   2 MPI processes   4 MPI processes
  1 thread         500 (1.00)      250 (1.00)        125 (1.00)
  2 threads        250 (1.92)      125 (1.86)         62 (1.75)
  4 threads        125 (5.00)       62 (3.26)         31 (2.33)
The following figures show the time in seconds (as reported by the MPI_WTIME procedure) and the corresponding speed up.
Fig. 2.1. Time to solution in seconds for SOM 1-D when N=1000 in a hybrid
MPI+OpenMP model on the SUN E10000 for 1, 2 and 4 MPI processes and 1, 2, and 4 threads.
Fig. 2.2. Speed up for SOM 1-D when N=1000 in a hybrid MPI+OpenMP model on the SUN
E10000 for 1, 2, and 4 MPI processes and 1, 2, and 4 threads.
Fig. 2.3. Speed up relative to one thread for SOM
1-D when N=1000 in a hybrid MPI+OpenMP model on the SUN E10000 for 4, 2, and 1 MPI
processes and 1, 2, and 4 threads.
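The wall clock times plotted above come from the MPI_WTIME function. As a minimal sketch (not the report's actual instrumentation), a timing of this kind brackets the iteration loop with two calls:

      INCLUDE 'mpif.h'
      DOUBLE PRECISION T1, T2
      T1 = MPI_WTIME()
C     ... the 100 Jacobi iterations (EXCHANGE, compute kernel and
C     reductions) ...
      T2 = MPI_WTIME()
      WRITE(*,*) 'wall clock time (sec) = ', T2 - T1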