The
Princeton Ocean Model (POM) is a numerical ocean model originally developed by
George L. Mellor and Alan Blumberg around 1977 and coded in Fortran.
Subsequently, others contributed to development and use of the model and a list
of research papers that describe results and the numerical model may be found
at the POM URL http://www.aos.princeton.edu/WWWPUBLIC/htdocs.pom,
where a User Guide is also available. The POM is used for modeling of
estuaries, coastal regions and oceans.
The world-wide POM user community passed 340 in number by mid-1998 and
active use continues to this day, even though POM has been superceded by newer
models described at the NOAA Geophysical Fluid Dynamics Laboratory URL http://www.gfdl.noaa.gov.
The
Fortran source code for the serial version of the POM is available at the POM
URL. This site also lists numerous benchmark results on a wide variety of
platforms from High Performance Computers (HPC) to workstations and personal
computers. In the past the code was successfully implemented on vector
architectures. An analysis by HiPERiSM Consulting, LLC, in Course HC2, finds
over 500 vectorizable loops, and describes conversion
of the POM to an OpenMP version. Performance on
HiPERiSM Consulting, LLC’s IA-32 cluster for several
problem sizes is discussed in Section 10.4.
The source code used here for this MPI course was made
available to HiPERiSM Consulting, LLC, for educational purposes, by Dr. Steve Piacsek, Naval Research Laboratory,
Back to Intel Trace Analyzer™
The POM contains an
embedded moment turbulence closure sub-model (subroutines PROFQ and ADVQ) to
provide vertical mixing coefficients and implements complete thermodynamics. It solves a numerical model
in a discrete three dimensional grid with the following characteristics:
·
It is a sigma coordinate model in that the vertical
coordinate (z) is scaled on the water column depth and vertical time differencing
is implicit.
·
The horizontal grid (x,y) uses
curvilinear orthogonal coordinates in an “Arakawa C” differencing scheme that
is explicit.
·
The model has a free surface and a split time step:
1. The external mode portion of
the model is two-dimensional and uses a short time step based on the CFL
condition and the external wave speed.
2.
The internal mode is three dimensional and uses a long
time step based on the CFL condition and internal wave speed.
In
POM advection, horizontal diffusion, and (in the case of velocity) the pressure
gradient and Coriolis terms are evaluated in
subroutines ADVT, ADVQ, ADVCT, ADVU, ADVV and ADVAVE. The vertical diffusion is
handled in subroutines PROFT, PROFQ, PROFU and PROFV.
The Fortran symbols that define the POM problem sizes under
study here are defined in Table 10.1. The serial version uses upper case for
Fortran symbols throughout, whereas MP-POM uses mixed case for symbols. For
this discussion mixed case is used where it is not confusing.
Table 10.1
: Key grid and time step Fortran symbols in the |
|
Symbol |
Description |
i , j imax , jmax |
Horizontal grid indexes and Upper range
limits |
k kmax |
Vertical grid index (k=1 is top) and Upper range
limit (at bottom) |
iint |
DO loop index for internal mode time
step |
DTI |
Internal mode time step (sec) |
iend |
Upper range limit for iint |
iext |
DO loop index for external mode time
step |
DTE |
External mode time step (sec) |
isplit = DTI / DTE |
Upper range limit for iext |
The POM User Guide contains a detailed description of the
model equations, time integration method, and Fortran symbols. The following sections
describe only the essentials needed to perform the cases under study.
The conversion of POM into the MPI version, MP-POM, used
techniques and a specialized tool (TOPAZ) that are described in the paper at
the MP-POM URL (see Section 10.1.3). In the following only a few points of the
methodology are described.
Domain decomposition is applied to divide the computational
work over MPI processes, but this requires overlap regions (also known as ghost
cells, or halo regions) between domains. One key idea in the MP-POM conversion
process is to add overlap regions that are larger than the minimum of one
vector, as used, for example, in the SOM case study of Chapters 8 and 9. A
domain decomposition algorithm requires communication between domains to
exchange data at synchronization points. The idea in MP-POM is that if the halo
region is made broader the overlap elements can be used for computation, and,
only when all have been used up in computation, is communication required. The
communication approach in MP-POM is guided by a motivation for efficient
parallel code. Efficiency considerations require a minimal frequency of
synchronization and communication because:
·
Synchronization
is an overhead cost
·
Communication
has a latency overhead
·
Communication
delays occur if load imbalance between processes results in idle time (as some
wait for others to complete computation before communication begins).
Thus a fundamental issue is the choice of an optimal overlap
region width. Different choices in overlap region width have a different
balance between computation and communication. To address this issue Table 10.2
lists the criteria and their trends for the two possible extremes in overlap
region size.
Table
10.2 : Criteria and consequences of choice in overlap region size for MP-POM. |
||
Criterion |
Large
overlap region |
Small
overlap region |
Computation |
Increase |
Decrease |
Synchronization |
Decrease |
Increase |
Communication & Frequency |
Longer and less frequent messages |
Shorter and more frequent messages |
Redundancy |
More in computation on halo |
Less in computation on halo |
However, increasing the size of the overlap region increases
the amount of redundant computation because overlap elements on one processor
correspond to non-overlap elements on a neighbor where the same computation is
performed. Despite this redundancy there is a potential performance gain when
the cost of a local element computation is less than the cost of retrieving the
value of the same (non-overlap) element value from a neighboring processor.
This is because the communication factors listed above contribute to the cost
of retrieval.
Figure 10.1 shows the three parallel regions and synchronization
points in MP-POM relative to the DO loops for internal and external modes.
Figure 10.1:
Steps in
the MPI version MP-POM showing the three parallel regions and synchronization
points. For the latter the number in square parentheses is the number of arrays
that are synchronized between MPI processes.
----------- ----------- ------------
STEP Rank
0 Rank > 0
----------- ----------- ------------
1. setup (idle)
2. Internal mode DO 9000 IINT=1,IEND
Synchronization Point
#1 <<<<<< [16] <<<<<
Parallel Region #1
>>>>>>>>>>>>>>>>>>>>>>>
¯ ¯
3. External mode DO 8000 IEXT=1,ISPLIT
Synchronization
Point #2 << [29] <<<
Parallel Region
#2 >>>>>>>>>>>>>>>>>
¯ ¯
8000
CONTINUE
Synchronization Point
#3 <<<<<< [62] <<<<<
Parallel Region #3
>>>>>>>>>>>>>>>>>>>>>>>
¯ ¯
9000
CONTINUE
3. I/O and STOP PRINT OUTPUT
¯ ¯
call MPI_Finalize call MPI_Finalize
----------- ----------- -----------
Table 10.3 defines the four grids under study chosen such
that problem size scaling is in the ratio 1:3:
Table
10.3: Problem sizes for the grid in the |
|
Case |
Description |
GRID1 |
imax = 64, jmax= 64, kmax
= 21 |
GRID2 |
imax = 128, jmax= 128, kmax
= 16 |
GRID3 |
imax = 256, jmax= 256, kmax
= 16 |
GRID4 |
imax = 512, jmax= 512, kmax
= 16 |
Back to Intel Trace Analyzer™
The following examples were performed under Red Hat Linux
with the Portland Group pgf90 compiler on a 8-node, dual processor (per node),
Pentium III IA-32 cluster. Therefore the scripts shown in the directories
indicated below are for this specific case and need to be modified for the
local implementation.
The MPI version of the POM code, MP-POM, is located in the
subdirectory ~/hc6/mpi_f_cases/mppom. This has the source and makefiles
in the folder:
~/build
For the four problem sizes that have been selected for this
case study results (for the above implementation) are in the respective
folders:
~/pom_grid1_sse
~/pom_grid2_sse
~/pom_grid3_sse
~/pom_grid4_sse
The results shown in these directories are those for the
implementation described above and to prepare for execution on the local
platform new directories should be created as follows:
~/pom_grid1
~/pom_grid2
~/pom_grid3
~/pom_grid4
and all run scripts copied for each grid as in this example
for the first problem size:
cp
~/pom_grid1_sse/run* ~/pom_grid1/.
Each run script has a distinctive name to denote the grid
chosen and the number of MPI processes used as in the following example for run_grid1_8
/usr/bin/time mpirun
-np 8 pom >
pom_grid1_8.out
date >> pom_grid1_8.out
Note that the execution of each run script requires the
correct identification on the Linux time and date
commands, availability of the mpirun command, and the executable pom. The method of producing the executable pom is the subject of the next section.
The MP-POM source and makefiles
are located in the folder:
~/build
which has several different makefiles
for various platforms, but only the makefile Makefile.linux has been tested and is used in this case study. Before the
MP-POM executable pom can be produced with this makefile
it must first be adapted for the current platform. A README.txt file is available in this directory and should be studied
carefully before the makefile is used. Since this makefile uses the Portland Group pgf90 and GNU gcc compilers some important implementation-dependent
issues are repeated here. Edit the makefile Makefile.linux and observe the following, making changes (as needed), in
the respective parts of the makefile.
Edit
the makefile for the following information in PART I:
NPROCS=8
The
parameter NPROCS is the number of processes/threads/processors to use. Whenever
NPROCS is changed also delete the file dimensions.h
before executing make, but do not erase the file dimensions.H
since this is used by the preprocessor to generate dimensions.h.
Edit
the Makefile for the following information in PART
II:
Change
the following to select the location of source and build files in PART IV (this
is only required if PART IV is used – which is not the case here)
SRCDIR=~/hc6/mpi_f_cases/mppom/build
The
target in this implementation is the 32-bit Linux operating system with the
Portland Group and GNU gcc compilers so the following
needs to be set for the local implementation:
CC = cc
FC = pgf90
CPP = /usr/bin/cpp
LOADER = pgf90
FFLAGS = -c -Mnosecond_underscore -fast -Mvect=sse -r8
-Minfo
CFLAGS =
CPPFLAGS
= -P -traditional
LDFLAGS=
LIBS=
-lfmpich –lmpich
The
following are some notes on these changes to ensure that they will be correct
for the local Linux implementation.
# CC = cc # this is set for gcc
# FC = pgf90 # this is the Portland Group pgf90
compiler
# CPP = /usr/bin/cpp # location of cpp
preprocessor
# LOADER = pgf90 # Portland Group pgf90 loader
# FFLAGS = -c -Mnosecond_underscore
-fast -Mvect=sse -r8 -Minfo
# # Portland Group
pgf90 flags
# CFLAGS =
# CPPFLAGS = -P -traditional
#
# For cpp use the
following flags:
# /usr/bin/cpp -P -traditional .... etc
# The reason is that
by default (on Linux) cpp will affect the label
# field in processing fortran
files. The -traditional flag corrects this
Once the above changes to Makefile.linux have been made for the local implementation an executable
is generated by the following commands in the ~/build folder:
cp
Makefile.linux Makefile
make &> compile_grid1_8.out
After completion of the make execution scroll through the file compile_grid1_8.out and at the end confirm that the executable pom appears.
On
first installation of the above directories the files have been set for the
test problem of GIRD1 in Table 10.3 with 8 MPI processes. However, this case study allows for a total
of 16 combinations of MPI processor count and problem size with the following
options:
NPROCS=1,
4, 8, or 16 (or any power of 2)
File
grid replaced by any one of grid1,
grid2, grid3, or grid4
To execute any other problem size with any number of MPI
processes changes have to be made in the ~/build directory as described in the following example.
Example
for a different problem size and MPI process count
The default setup of the MP-POM installation is for the
problem size GRID1 with 8 MPI processes. The steps that need to be followed
when either the number of processes, or the problem size, is changed are
described here for the example of 16 processes and GRID2.
To
change the number of MPI processes follow these steps in folder ~/build:
1.
Edit the makefile and change the value NPROCS=16 ( the value used should be a power of 2).
2.
rm
dimensions.h.
3.
rm
*.o.
To
change the problem size perform this next step before execution of the make:
4.
cp grid2 grid
5.
make &> compile_grid2_16.out.
If
the problem size is not changed, then the output file of the make (in this example) is compile_grid1_16.out
After
completion of the make
execution scroll through the file *.out and at the end confirm that the executable pom appears. Once a combination of processor count and problem
size is selected, and corresponding executable pom created, follow the directions for execution of the MP-POM
code in the next section.
Once
the executable pom has been created (as described in the previous section) it
needs to be moved to the appropriate run directory:
~/pom_grid1
~/pom_grid2
~/pom_grid3
~/pom_grid4
This
example is for the default combination of 8 MPI processes and GRID1 and assumes
that the run scripts have been copied as described in Section 10.3.1.
Example of MP-POM
execution
The results shown in these directories are those for the default
implementation described above and assumes that the executable pom has been created in directory ~/build.
To
execute the default case follow these steps in folder ~/build:
1.
cd
pom_grid1
2.
cp ../pom .
3.
./run_grid1_8
Alternatively,
for the example of the previous section the steps would be:
4.
cd
pom_grid2
5.
cp ../pom .
6.
./run_grid2_16
Once
the executable run_grid1_8 has completed (as described in the previous section) in the
run directory:
~/pom_grid1
the
terminal returns the prompt with the results of the Linux time command and (for
the default combination) an example has been saved in the file ~/pom_grid1_sse/run_grid1_8.time. In addition the script run_grid1_8
produces the output file run_grid1_8.time.out that has the MP-POM output. This should be viewed with an
editor to see that at the end it shows the correct choice of problem size and
number of processors. In addition some internal timing information is also
presented.
Example of output
files
The results in *.out and *.time files for each of 16
possible combinations of the number of MPI processes and problem sizes are
shown in the respective default directories described in Section 10.3.1. In addition, results of the compilation that
produced the executable are shown in the compile_*.out files. Results generated on the local implementation should
be saved with a similar naming convention in the respective local directories:
~/pom_grid1
~/pom_grid2
~/pom_grid3
~/pom_grid4
Exercise 1.
Perform the above case studies on your implementation for problem sizes in
GRID2, GRID3, and GRID4, for MPI process counts in the range P=1, 4, 8, and 16
and compare performance with that discussed in the next Section.
Back to Intel Trace Analyzer™
The
MP-POM code was executed for the four problem sizes of Table 10.3 and MPI
process count P in the range 1 to 16. The results are for the Portland Group
compiler pgf90 (with SSE enabled) on an Intel Pentium III, 933MHz, CPU (see
Section 10.3). As was pointed out in Section 1.7, when two CPUs are in use on a
dual processor node they compete for memory bandwidth whereas if only one CPU
is in use it has the full memory bandwidth to itself. Hence the apparent loss
in performance on this cluster between P=8 and p=16 MPI processes is a result
of contention for memory bandwidth resources on the Intel dual processor node.
This
section reports the results for wall clock time summarized in Table 10.4. Figure 10.2 shows two views of these results with the upper panel
showing a 2-dimensional histogram chart for all four problem sizes and P=1, 4,
and 8 MPI processes. The second (lower) panel shows a 3-dimensional histogram
plot for P=1, 4, 8, and 16 MPI processes. In the 3-dimensional chart the lack
of significant time reduction between P=8 and P=16 is evident for all problem
sizes. However, the upper panel of Figure 10.2 shows that the reduction in wall clock time with
increasing processor count P, increases as the problem
size increases. The performance analysis in Section 10.5 gives more insight
into the reasons for this but suffice it to say that increasing problem size
corresponds to increasing coarseness in parallel granularity.
Table
10.4 : MP-POM wall clock time for the four problem sizes (Grid) defined in
Table 10.2 with the MPI process count values (P) shown. |
||||
Grid |
P=1 |
P=4 |
P=8 |
P=16 |
GRID1 |
263.6 |
230.6 |
217.5 |
270.2 |
GRID2 |
856.7 |
420.9 |
318.0 |
355.9 |
GRID3 |
3425.9 |
1199.3 |
745.1 |
681.1 |
GRID4 |
13573.0 |
3841.0 |
2153.4 |
1836.3 |
The
MP-POM code was executed for the four problem sizes of Table 10.3 and MPI
process count P in the range 1 to 16. The results are for the Portland Group
compiler pgf90 (with SSE enabled) on an Intel Pentium III, 933MHz, CPU (see
Section 10.3). As was pointed out in Section 1.7, when two CPUs are in use on a
dual processor node they compete for memory bandwidth whereas if only one CPU
is in use it has the full memory bandwidth to itself. Hence the apparent loss
in performance on this cluster between P=8 and p=16 MPI processes is a result
of contention for memory bandwidth resources on the Intel dual processor node.
This
section reports the results for parallel speedup summarized in Table 10.5. Figure 10.3 shows two views
of these results with the upper panel showing a 2-dimensional histogram chart
for all four problem sizes and P=1, 4, and 8 MPI processes. The second (lower)
panel shows a 3-dimensional histogram plot for P=1, 4, 8, and 16 MPI processes.
In the 3-dimensional chart the lack of significant increase between P=8 and
P=16 is evident for all problem sizes. However, the upper panel of Figure 10.3 shows that
the gain in parallel speed up with increasing processor count (P) increases as
the problem size increases because of the convergence to the ideal (linear
speed up) curve shown as the solid line with the largest problem size showing
the best parallel speed up result. The performance analysis in Section 10.5
gives more insight into the reasons for this but suffice it to say that
increasing problem size corresponds to increasing coarseness in parallel
granularity.
Table
10.5 : MP-POM parallel speed-up for the four problem sizes (Grid) defined in Table
10.2 with the MPI process count values (P) shown. |
||||
Grid |
P=1 |
P=4 |
P=8 |
P=16 |
GRID1 |
1 |
1.1 |
1.2 |
1.0 |
GRID2 |
1 |
2.0 |
2.7 |
2.4 |
GRID3 |
1 |
2.9 |
4.6 |
5.0 |
GRID4 |
1 |
3.5 |
6.3 |
7.4 |
|
|
Figure 10.2: Time to
solution for the MP-POM algorithm for the four problem sizes (grid number in
Table 10.3) and MPI process count P=1, 4, 8 (upper frame), and including 16
(lower frame). |
|
|
Figure
10.3: Parallel speedup for the MP-POM algorithm
for the four problem sizes (grid number in Table 10.3) and MPI process count
P=1, 4, 8 (upper frame), and including 16 (lower frame). |
The
MP-POM code was executed for the four problem sizes of Table 10.3 and MPI
process count P in the range 1 to 16. The results are for the Portland Group
compiler pgf90 (with SSE enabled) on an Intel Pentium III, 933MHz, CPU (see Section
9.3). As was pointed out in Section 1.7, when two CPUs are in use on a dual
processor node they compete for memory bandwidth whereas if only one CPU is in
use it has the full memory bandwidth to itself. Hence the apparent loss in
performance on this cluster between P=8 and p=16 MPI processes is a result of
contention for memory bandwidth resources on the Intel dual processor node.
This
section reports the results for parallel efficiency summarized in Table 10.6.
Figure
10.4 shows two
views of these results with the upper panel showing a 2-dimensional histogram
chart for all four problem sizes and P=1, 2, and 8 MPI processes. The second
(lower) panel shows a 3-dimensional histogram plot for P=1, 4, 8, and 16 MPI
processes. In the 3-dimensional chart the significant decrease between P=8 and
P=16 is evident for all problem sizes. However, the upper panel of Figure 10.4 shows that
the gain in parallel efficiency with increasing processor count (P) increases
as the problem size increases because of the convergence to the ideal value of
1 with the largest problem size showing the best parallel efficiency result.
The performance analysis in Section 10.5 gives more insight into the reasons
for this but suffice it to say that increasing problem size corresponds to
increasing coarseness in parallel granularity.
Table
10.6 : MP-POM parallel efficiency for the four problem sizes (Grid) defined in
Table 10.2 with the MPI process count values (P) shown. |
||||
Grid |
P=1 |
P=4 |
P=8 |
P=16 |
GRID1 |
1 |
0.29 |
0.15 |
0.06 |
GRID2 |
1 |
0.51 |
0.34 |
0.15 |
GRID3 |
1 |
0.71 |
0.57 |
0.31 |
GRID4 |
1 |
0.88 |
0.79 |
0.46 |
|
|
Figure
10.4: Parallel efficiency for the MP-POM algorithm for the
four problem sizes (grid number in Table 10.3) and MPI process count P=1, 4,
8 (upper frame), and including 16 (lower frame). |
The
MP-POM code was executed for the four problem sizes of Table 10.3 and MPI
process count P in the range 1 to 16. The results are for the Portland Group
compiler pgf90 (with SSE enabled) on an Intel Pentium III, 933MHz, CPU (see
Section 10.3). As was pointed out in Section 1.7, when two CPUs are in use on a
dual processor node they compete for memory bandwidth whereas if only one CPU
is in use it has the full memory bandwidth to itself. Hence the apparent loss
in performance on this cluster between P=8 and p=16 MPI processes is a result
of contention for memory bandwidth resources on the Intel dual processor node.
This
section reports the results for parallel efficiency summarized in Table 10.6 as
a function of problem size for fixed process count P. Figure 10.5 shows these
results with a 2-dimensional chart for all four problem sizes and P=4, 8 and 16
MPI processes. This shows that parallel efficiency increases as the problem size
increases for each choice of processor count (P). In general, parallel
efficiency values of 0.5 (or larger) are considered acceptable, while values
0.75 (and above) are in the good (to excellent) range. Thus the largest problem
size shows the best parallel efficiency results. The performance analysis in
Section 10.5 gives more insight into the reasons for this but suffice it to say
that increasing problem size corresponds to increasing coarseness in parallel
granularity.
|
Figure
10.5: Scaling
of parallel efficiency for the MP-POM algorithm with problem size (grid
number in Table 10.3) and fixed MPI process count P=4, 8, and 16. |
Back to Intel Trace Analyzer™
It
is often stated that enhanced parallel performance comes from enhanced
simultaneous activities. Usually this devolves into the separation of
communication performance and application performance. The latter is a function
of the choice of algorithm, data structure/partition, etc., whereas the former
is a function of the communication costs for the chosen MPI procedures. Vampir (see Section 1.9) can give insights into both
categories of performance issues because, through the many displays of a trace
file that it offers, it can reveal subtle performance issues. It supports the
programmer both in the discovery of problems and in the quantitative analysis
of performance improvements between different versions of a program after code
changes have been made.
The
following sections show how Vampir is used to
investigate performance costs in Section 10.6, by categories such as
statistical overview, runtime analysis, and MPI statistics. Vampir
also gives insight into load balance problems in Section 10.7 where statistics
for activity, messages, and process profile are discussed. Section 10.8
concludes with a brief summary of the performance analysis.
Performance
of the Princeton Ocean Model is evaluated by comparing Vampir
trace files for the MP-POM code when executed for the problem sizes GRID2,
GRID3, and GRID4, of Table 10.3 and a MPI process count P=4. This choice of
ascending grid size produces problem size scaling in the ratio 1:
Back to Intel Trace Analyzer™
Vampir
opens with the Summary Chart display by default. This shows in Figure 10.6 to Figure
10.11 the sum of the time consumed by all activities over
all selected processes for the respective trace files listed above. This is a
very condensed view of the entire program run that shows a comparison of time
spent in MPI procedures and user code. Each pair of figures corresponds to each
of the three grids. The first figure of each pair shows the absolute time for
the whole execution while the second shows the relative time in percent. For
the smallest grid size in Figure
10.6 and Figure
10.7 the ratio of times for user application to MPI
processes is 1, whereas for GRID3 and GRID4 it is 3:1 and 1:7.5, respectively.
This Summary Chart display is the first clear quantitative indication of
scalability in computation-to-communication ratio with increasing problem size.
Later sections (and other displays) explore more detail for this observation.
Figure
10.6: Summary chart display with absolute
times for pom_grid2_4.bvt. |
Figure
10.7: Summary chart display with the
relative times for pom_grid2_4.bvt. |
Figure 10.8: Summary
chart display with absolute times for pom_grid3_4.bvt. |
Figure 10.9: Summary chart display with the relative
times for pom_grid3_4.bvt |
Figure 10.10:
Summary chart
display with absolute times for pom_grid4_4.bvt. |
Figure
10.11: Summary chart display with the relative
times for pom_grid4_4.bvt. |
An
in-depth-analysis of the runtime behavior addresses the issue as to whether
user application or MPI procedures dominate. Because Vampir
has access to a trace that exactly describes when an event occurred in the
application, it can render these changes over time in so-called Timeline
Displays.
For
chronologically ordered information the Global
Timeline display for the respective problem sizes of the three grids is
shown in the upper frame of Figure 10.12, Figure
10.13, and Figure
10.14. The result shown is that obtained after several
applications of the Vampir zooming feature used for
magnification of part of the timeline. These views have selected closely similar
time slices of approximately 19.6 seconds. This display shows a color map that
uses red for MPI (communication) and green for the application (computation).
In these examples there are many MPI procedure calls that merge into black
bands at this resolution on the horizontal scale. These black banded regions
are the synchronization points of the MP-POM algorithm in Figure 10.1. These alternate computation and communication
regions recur with regular frequency and each region involves participation of
all MPI processes simultaneously. The comparison of the three Global Timeline
displays shows an increasing sparseness of the MPI communication regions (black
bands) compared to computation (green bands) as problem size increases.
Another
timeline display is the Summary Timeline, which shows all analyzed state changes for each process
(over the chosen time interval) in one display as a histogram over time bins.
Within each time bin the number of processors in either an MPI (red) or Application
(green) state is shown as a stacked column bar chart. Because a histogram is
used the Summary Timeline display
works well even for long trace times. An example of the summary timeline for
GRID2, GRID3, and GRID4 is shown in the lower frame of Figure 10.12, Figure
10.13, and Figure
10.14. respectively, again just for the selected timeslice. The Summary Timeline display shows that with increasing problem size, the
duration of application (computation) regions increases dramatically, while MPI
(communication) regions develop a sparse structure.
Figure
10.12: Global timeline for pom_grid2_4.bvt after zooming.
Figure 10.13 Global timeline for pom_grid3_4.bvt after zooming.
Figure 10.14: Global timeline for pom_grid4_4.bvt after zooming.
Runtime
statistics for MPI processes are available through several displays in Vampir and some of these are shown here for the three
problem sizes under study in the POM. One example of MPI
statistics is seen in
Figure
10.15, Figure
10.16, and Figure 10.17
for the three problem sizes. In
each case the MPI process that accounts for the largest proportion of the time
is MPI_Wait with respective proportions of 38.5%, 20.1% and 8.75% of
the total. The MPI_Send process is the next largest time consumer with proportions
of 4.0%, 2.5% and 2.0% of the total time in the respective problem sizes. It is
evident that the relative time for MPI_Wait reduces substantially as the problem sizes scales upward
and less so for the MPI_Send process.
Figure 10.15: Summary chart display for MPI
only showing the relative times of
pom_grid2_4.bvt. |
Figure 10.16: Summary chart display for MPI only showing
the relative times of pom_grid3_4.bvt. |
Figure 10.17: Summary chart display for MPI only showing the relative times of pom_grid4_4.bvt. |
Back to Intel Trace Analyzer™
Another
statistical display in Vampir is the Activity Chart.
It is especially useful in looking at distribution of time between MPI and user
code for all processes at-a-glance. This display shows a pie chart for each MPI
process with proportionate division by application, MPI, and Vampirtrace (VT_API) based on the total time utilization.
VT_API is listed in addition to MPI and Application if there is one (or more)
call(s) to a Vampirtrace API function. However, such
calls typically take negligible time and thus the display helps to understand
the uniformity of the load distribution by process at-a-glance.
The
corresponding Activity Chart displays for the three problem problem
sizes in MP-POM with four MPI processes are shown in Figure
10.18, Figure 10.19, and Figure
10.20, respectively. It is clear that by proportion, for all four processes, the amount of time spent in MPI
procedures decreases sharply as the problem sizes scales upward. It is also
worthwhile noting that these Activity Chart displays show proportions for MPI
or Application activity that are relatively homogeneous over MPI processes for
the largest problem size.
Figure 10.18: Global activity chart display for pom_grid2_4.bvt.
Figure 10.19: Global activity chart display for pom_grid3_4.bvt.
Figure 10.20: Global activity chart display for pom_grid4_4.bvt.
Another
important indicator of possible MPI performance problems is in the details of
messages passed between processes. In addition to the statistics discussed in
Section 10.6.1, Vampir provides a
feature-rich Message Statistics display for message counts, for the minimum,
maximum, or average of message length, rate, or duration.
The
default is to display the sum of all message lengths (not shown here) but some
detailed comparison is available by viewing the Message Length statistics
displays shown in Figure
10.21, Figure 10.22, and Figure 10.22.
In this display the number of messages in each length bin are shown. The
feature to note is that the bin size doubles for each increment in problem size
as does the vertical scale and this indicates doubling in message length.
Figure 10.21: Message length statistics for communications between MPI processes with pom_grid2_4.bvt.
Figure 10.22: Message length statistics for communications between MPI processes with pom_grid3_4.bvt.
Figure 10.23: Message length statistics for communications between MPI processes with pom_grid4_4.bvt.
Another
source for message statistics and information on possible MPI performance
problems is in the details of the process profile for specific messages passed
between processes. This is shown in the Process Profile displays which
are discussed in this section. These
displays give timings on a per process basis for specific states. Only the
results for the MPI_Wait state are shown here in view of the foregoing discussion on
which MPI process dominates.
The
Sum of time results for the MPI_Wait
state are shown Figure
10.24, Figure
10.26, and Figure
10.28 for the respective problem sizes in the POM with four
MPI processes. A companion set of results for the Average Duration is shown in Figure 10.25, Figure
10.27, and Figure
10.29. These
displays show that the average time consumed by the MPI_Wait state changes only mildly with increasing problem size
whereas the sum of time decreases drastically relative to the total time.
Consequently the total cost of synchronization in MP-POM decreases
substantially as the problem size increases.
Figure
10.24: Process profile for MPI activity and
the MPI_Wait
state showing sum of time for
pom_grid2_4.bvt. |
Figure
10.25: Process profile for MPI
activity and the MPI_Wait
state showing average duration for pom_grid2_4.bvt. |
Figure
10.26: Process profile for MPI activity and the MPI_Wait
state showing sum of time for pom_grid3_4.bvt. |
Figure
10.27: Process profile for MPI
activity and the MPI_Wait state
showing average duration for pom_grid3_4.bvt. |
Figure
10.28: Process
profile for MPI activity and the MPI_Wait
state showing sum of time for pom_grid4_4.bvt. |
Figure
10.29: Process
profile for MPI activity and the MPI_Wait
state showing average duration for pom_grid4_4.bvt. |
Back to Intel Trace Analyzer™
Only
by using Vampir to compare the three different problem
sizes of GRID2, GRID3, and GRID4 of Table 10.3 for MP-POM is a quantitative,
in-depth, understanding of the specifics of performance differences possible.
Section
10.6 presented results for performance costs with
increasing problem size in the following categories
1.
Statistical
overview to show the first
clear quantitative indication of scalability in computation-to-communication
ratio.
2.
Runtime
analysis to show that timelines have increasing computation regions
and diminishing communication intervals.
3.
Runtime
MPI statistics to show that MPI process time is
dominated by one process and its relative time decreases by approximately a
factor of two for each increase in grid size.
Section
10.7 presented results for load balance statistics with
increasing problem size in the following categories
1.
Activity to show that all MPI processes have a sharp
decrease in proportion to computation with improved homogeneity across
processes for the largest grid.
2.
Messages
to show a doubling of message length
3.
Process
profile to show that the
total cost of synchronization decreases substantially as problem sizes
increases.
Consequently,
it is possible to conclude that with increasing problem size
·
the computation-to-communication ratio
increases
·
synchronization costs decrease, and
·
message length increases.
This
performance analysis explains the scaling with problem size in parallel
efficiency as shown in Figure
10.5.
Back to Intel Trace Analyzer™
10 Case studies III: The Princeton Ocean Model......................................................................................... 207
10.1 Introduction............................................................................................................................................................ 207
10.1.1 The history of the Princeton Ocean Model (POM)............................................................................................. 207
10.1.2 Performance of the POM.......................................................................................................................................... 207
10.1.3 MPI parallel version of the POM (MP-POM)...................................................................................................... 207
10.2 POM Overview........................................................................................................................................................... 208
10.2.1 Brief description of the POM................................................................................................................................... 208
10.2.2 Conversion of POM to MP-POM............................................................................................................................ 209
10.2.3 Problem sizes under study....................................................................................................................................... 210
10.3 MP-POM on a Linux IA-32 cluster................................................................................................................... 211
10.3.1 Location of MP-POM code...................................................................................................................................... 211
10.3.2 Compiling and linking MP-POM........................................................................................................................... 212
10.3.3 Executing MP-POM.................................................................................................................................................. 214
10.3.4 MP-POM output files................................................................................................................................................ 214
10.3.5 Exercises with MP-POM.......................................................................................................................................... 215
10.4 MP-POM performance results on
an IA-32 cluster........................................................................... 215
10.4.1 Wall clock time for MP-POM.................................................................................................................................. 215
10.4.2 Parallel speed-up for MP-POM.............................................................................................................................. 216
10.4.3 Parallel efficiency for MP-POM............................................................................................................................. 219
10.4.4 Conclusions on scalability for MP-POM.............................................................................................................. 221
10.5 Performance analysis on an
IA-32 cluster.......................................................................................... 222
10.5.1 Comparing tracefiles in Vampir for MP-POM..................................................................................................... 222
10.5.2 Parallel granularity for MP-POM......................................................................................................................... 222
10.6 Identifying performance costs.................................................................................................................... 222
10.6.1 Statistical overview for MP-POM.......................................................................................................................... 222
10.6.2 Runtime analysis for MP-POM............................................................................................................................... 224
10.6.3 Runtime MPI statistics for MP-POM...................................................................................................................... 226
10.7 Identifying load balance
problems......................................................................................................... 227
10.7.1 Activity statistics for MP-POM............................................................................................................................... 227
10.7.2 Message statistics for MP-POM.............................................................................................................................. 229
10.7.3 Process profile statistics for MP-POM.................................................................................................................. 231
10.8 Summary of performance
analysis for MP-POM.............................................................................. 233
Back to Intel Trace Analyzer™