1.0 Introduction
We investigate parallel programming
paradigms and survey software and hardware vendor involvement in ongoing development.
These developments are having a profound impact on the way science and engineering models
are computed. Those involved in developing applications need to consider alternative
parallel programming paradigms and develop experience with them on both shared and
distributed memory architectures as well as workstation clusters. Such considerations are
an important part in a plan for transitioning long-life-cycle models to future High
Performance Computing (HPC) parallel architectures and software standards since both are
undergoing evolutionary (and revolutionary) change.
This Newsletter covers several topics in HPC hardware and
software trends and starts with a survey of HPC hardware developments in section 2. The
prevailing status is described and the anticipated growth over the next decade is
indicated. This review is a prelude to (and partial explanation of) the need to discuss
parallel programming paradigms. The focal point is compiler software and toolkit
development [PTC] (citations are shown with a mnemonic key listed in Section 7) by
both hardware and software vendors to take advantage of HPC hardware trends. The role of
parallel software tool development is important because major surveys of parallel systems
[NSF] have noted the perception that scalable parallel systems are not viewed as viable
general purpose scientific and engineering computing resources. The effective utilization
of scalable parallel computers hinges on obtaining performance (scalability), and ease of
use (usability). Table 1 summarizes the basic parallel programming paradigms that have
found vendor support, or world-wide acceptance, and citations that are sources of
information.
Table 1: Parallel Programming Paradigms
Paradigm | Citations
High Performance Fortran (HPF) | IDC, HPFD, HPFF, HPFH, PGI, CRIHPFC, CRIHPF, PGHPF
Message Passing Interface (MPI) | MPI, MPIF, MPIG, MPIP, MPIT, PAL, ST1
Parallel Virtual Machine (PVM) | PVM
OpenMP Application Program Interface (OpenMP) | OPENMP, KAI, ST2
Sections 2 and 3 summarize HPC hardware and software
trends, respectively, as a prelude to an analysis of parallel programming paradigms in
Section 4. Section 5 lists basic parallel performance metrics for future reference,
Section 6 is a summary, and Section 7 has a citation index with numerous World Wide Web
hyperlinks.
Grateful acknowledgements are due to Bob Kuhn (KAI
Software, a division of Intel Americas, Inc.[KAI]), and Doug Miles (The Portland Group,
Inc.[PGI]) for their time and for freely sharing information about their respective
company products. The material summarized here draws heavily from these sources
and those listed in the Citation Index (Section 7). The opinions expressed are not
necessarily those of the companies (or names) mentioned here. All trademarks and
tradenames cited are the property of the owners.
2.0 High Performance Computing Hardware Trends
In the last decade computer hardware performance has
improved by a factor of 500 [MESS1] due to use of both hundreds (or thousands) of
individual processors and also parallelism within processors (as in vector architectures).
Vendors have moved strongly towards off-the-shelf microprocessors and memory chips for
large scale, closely coupled, Distributed Memory Parallel (DMP) architectures with
multiple processor elements (PEs). Furthermore, some dedicated applications have found a
home on microprocessor-based clusters delivering cost effective performance [AVA, LANL,
NASA]. Vector processors in Shared Memory Parallel (SMP) configurations still have a
large installed base [TOP500, GUNTER], but are approaching an asymptotic limit in clock
cycle time. This trend, which is the result of solid-state design considerations, has (in
part) been compensated for by an increase in the number of pipelines or functional units
in the vector central processing unit (CPU). Nevertheless, the state-of-the-art today
sees growing activity in design, and marketing, of scalable parallel systems [BADER]. This
development is the result of improvements in floating point performance of
microprocessors, new connection schemes for many distributed processors (and memories),
and improvements in interface software tools for DMPs. As a result, science and
engineering users have an improved knowledge base on how to use DMPs compared to even 5
years ago. These market trends have raised expectations that DMPs can scale indefinitely
to meet future demand of compute intensive applications, but a warning has been sounded as
to the false basis of such expectations and the future course of the HPC industry [KUCK].
Messina reviews [MESS1] the current trends in parallel architecture development and
identifies the current dominant commercial architectures to include those shown in Tables
2 and 3.
Table 2: Distributed Memory Processors (DMP)
DMP implementation | Maximal configuration | Citation
Cray T3E | 1024 nodes @ 1 SMP PE per node | CRI
IBM SP | 512 nodes @ 4-8 SMP PEs per node | IBM
SGI/Cray Origin 2000 | 1024 PEs | SGI
HP Exemplar X-Class/SPP 2000 | 16 nodes @ 32 PEs per node | HP
Hitachi SR2201 MPP | 2048 PEs | HIT
Table 3: Vector Shared Memory Processors (VSMP)
VSMP implementation | Cycle time | Citation
Hitachi S-3800 | 2.0 nanoseconds | HIT
NEC SX/4 | 8.0 nanoseconds | NEC
SGI/Cray T90 | 2.2 nanoseconds | CRI
While the Vector SMP architectures are proprietary, those for DMPs are (or will be) based
on commodity microprocessors such as the Alpha 21164 [CRI, DEC], PowerPC [IBM], MIPS
R10000 [SGI], PA-8000 [HP], and a high performance RISC chip (unspecified) [HIT]. Recent
reviews of the history of the DEC Alpha chip [NTMAG] and marketing enterprises [ALPHA]
have further details.
Messina also reviews [MESS2] the expected developments in microprocessor clock speeds from
a current 450 MHz (2.2 nsec cycle) to 1100 MHz (0.91 nsec cycle) by the year 2010, a gain
factor of only 2.4. This slow-down in reduction of cycle time is the result of known
problems with the limits of feature size in CMOS on silicon devices. Moore's law predicts
[MOR] that the number of transistors on a chip will double every 18 months. So far the
micro-electronics industry has sustained this pace by reducing feature size and thereby
increasing circuit speed. A recent review of semiconductor research [HAM] notes that the
Technology Roadmap for Semiconductors [TRS], which charts progress in process and design
technologies, calls for an exponential scaling in feature size from a current (1999) 180
nanometers (21 million logic transistors) to 50 nanometers (1400 million logic
transistors) by 2012. Beyond 2006 new techniques will need to be developed to overcome
the deep-ultraviolet 193nm wavelength limit which confines the smallest feature size to
100 nm in the chip etching process. To face this challenge the microelectronics industry
has formed various alliances to address the difficult materials and design challenges
[SRC, STARC, SEL, MED], including X-ray proximity lithography [ASET], to break through the
100 nm limit. In addition to the challenges of overcoming physical properties, the
industry will need to find new solutions and tools for chip design if Moore's law is to
survive into the next millennium.
Coincident with this progress in chip manufacture the
increasing gap between processor speed and memory speed is expected to continue. This
problem has been compounded in DMP systems by latency due to physically distributed
memory. A proposed means of mitigating this trend is to ensure use of multiple levels of
memory hierarchy and an architecture that can work with many concurrent threads. The word
"thread" denotes a group of instructions scheduled as one unit. This important
idea is summarized by Messina [MESS2] : "The multithread approach hides memory
latency if the ratio of threads to processors is high enough, meaning that there is always
work to be done by a sufficient number of 'ready' threads. While ready threads are
executing, data needed by threads that are not yet scheduled can be migrated up the memory
hierarchy, behind the scenes." To sustain the growth in hardware performance
"microprocessors will have to support more concurrent operations, and more processors
will have to be used." Implicit in this proposal is the expectation of continued
development of parallel compiler technology to the stage where thread generation and data
movement is possible on the scale required.
Meanwhile at the low-end of the HPC scale, microprocessor-based, multi-CPU, desk-side
computers have arrived as viable science and engineering computing resources. Examples
include systems based on Intel Pentium [INTEL] and COMPAQ Alpha [DEC] microprocessors, now
available with 4 [COMPAQ, SGIV] to 14 [DEC] (or more [SUN]) PEs per configuration. The
performance per PE with such microprocessors approaches, or exceeds, that of traditional
vector processing mainframes [MM5]. In the marketplace over 800,000 personal workstations
were sold in 1996 [IDC]. This constituted over fifty percent of all workstation shipments
in 1996 and represented a growth of over thirty percent in either units or revenue when
compared to 1995. Concomitant with this sector growth is the recognition of the critical
need for advances in compiler technology to extract performance from commodity
microprocessors.
3.0 High Performance Computing Software Trends
The future demand for some two orders of magnitude increase
in parallelism [MESS2] will require new designs for parallel programming paradigms and
end-user understanding of them. Of prime concern in this enterprise are criteria of
usability, scalability, and portability.
It is known that two components are necessary for parallel
scalability in application performance: scalable hardware and scalable software. On the
scalable software front significant advances have been made in the past decade. To
describe and evaluate some of these advances the scope of this discussion is restricted to
source code preprocessing software, compiler technology, and applications development
tools for parallel architectures. Also included is some discussion of graphical interfaces
to static/dynamic debugging software and profiling/performance tuning tool kits.
Examples of scalable software paradigms include HPF and the OpenMP standard which aim to
hide the details of the architecture from the end-user through compiler-level (language)
constructs and directives. The opposite extreme is presented by (ostensibly) portable
message passing libraries such as MPI [MPI] or PVM [PVM] providing programmers greater
control (but also demanding total management) of parallelism at the task level.
Consequently the two extremes are also extremes in the level-of-effort in coding and using
parallel computers. Today most major vendors offer HPF, either as features in their own
compilers (Digital [DEC], IBM [IBM]), or implementations through collaborative agreements
with the Portland Group, Inc., [PGI] (Hewlett-Packard/Convex [HP], Intel [INTEL], SGI
[SGI]). The Portland Group, Inc., considers interoperability between HPF and MPI an
important feature. Furthermore, a recent study shows how OpenMP and MPI can be made to
work together within the same application [BOVA] for SMP workstation clusters.
HPF relies on a data parallel programming model with the following characteristics [IDC,
HPFF, HPFH]:
- Operation-level parallelism, or simultaneous execution of
identical instructions on different processors with different data segments,
- Single (global) thread of control where one program defines
all operations,
- Global name space with all data in a single logical memory
and available to all parts of the code,
- Loose synchronization where processors run independently with
synchronization only at specified points.
A message passing paradigm (such as MPI [ST1]) is characterized by:
- Process-level parallelism where independent processes
exchange data with message send/receive calls, multiple (independent) threads of control,
- Multiple, independent name spaces,
- Close synchronization of all processes.
Whereas, the data parallel model partitions data over available processors, the message
passing model distributes computational tasks over (groups) of processors. The OpenMP
proposal [OPENMP, ST2, BOVA] has some similarity to the data parallel model in that it
uses directive driven parallelism with global data but is designed for shared memory
parallel systems (SMP) and generates multiple threads of control in a fork/join model.
4.0 Parallel Programming Paradigms
4.1 Effective Parallel Processing
Effective and efficient parallel processing depends on a
combination of several key factors [NSF]:
- Performance, or achieving good scalability,
- Usability, or ease of use
On the performance side, scalability is the key issue and the level achieved depends on
the application, the parallel architecture, and the implementation environment. On the
usability side, ease of use addresses issues of how easy it is to port applications, achieve
implementation robustness, and maintain code to achieve good scalability. Implementation
features critical to the ease of use issue include the programming, debugging,
optimization, and execution environment software tools. Previous surveys [NSF] have found
divergent scalability results for different applications on different platforms and have
proposed more precise evaluation methodologies. Section 5 defines some detail of specific
parallel computer evaluation areas including:
- Macroperformance, or gross behavior of the
computer-application combination,
- Microperformance, or the underlying factors responsible for
the observed macroperformance,
- Usability, or program development environment.
Key concerns of end-users remain centered on issues such as:
- Portability, or ensuring an application will run on multiple
platforms without modification,
- Code maintenance, or code that is easy to read and maintain
as a single source for serial and parallel implementations
- Scalable hardware, or products with growth potential,
- Scalable software, or tools to enable scalability with
relative ease.
4.2 High Performance Fortran
In 1992 the High Performance Fortran Forum (HPFF), a
coalition of industry, academic and laboratory representatives, proposed High Performance
Fortran (HPF) as a set of extensions to Fortran 90 [HPFF]. Subsequently the HPF 1.0
Language Definition document was published [HPFD] and was updated in 1994 to version 1.1.
Currently an HPF 2.0 Language Specification document is in development. The programming
paradigm here is single-program-multiple-data stream (SPMD) based on the model that the
same operation is repeated on different data elements. A data parallel model therefore
attempts to distribute these operations across multiple processors to enhance
simultaneity. The goal of the HPFF is to define language extensions [HPFF] that support:
- Data parallel programming features,
- High performance on DMP and microprocessor based SMP
architectures,
- Performance tuning across architectures.
The data parallel paradigm implies single-threaded control
structures, a global name space, and loosely synchronous parallel execution. Secondary
goals of HPFF include:
- Portability of code from serial to parallel versions and
between parallel computers,
- Compatibility with the Fortran 95 standard,
- Simplicity of language constructs,
- Interoperability with other languages/paradigms.
In HPF applications the user implements directives to
support the data parallel programming model and the HPF compiler uses the Fortran 90/95
[FTN90, FTN95] source and embedded directives to generate an executable code that
automatically uses an SMP/DMP system [CRIHPF]. Utility routines appear in HPF either as
Fortran 95 intrinsic functions or in the form of HPF library routines contained in a
Fortran 95 module. The HPF library is a set of intrinsic procedures designed to support
optimized implementations of commonly used data parallel operations [HPFH].
Key HPF features of value to programmers have been identified as:
- Data distribution at the language level,
- Parallel loop constructs,
- Masking operations,
- Parallel function library
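These features can be illustrated with a minimal HPF sketch; the array and processor-arrangement names below are hypothetical and the directives assume an HPF 1.1 compliant compiler. The sketch shows data distribution at the language level, a parallel FORALL construct, and a masked (WHERE) operation:

    ! A minimal HPF sketch; array and processor names are illustrative only.
    PROGRAM hpf_sketch
      INTEGER, PARAMETER :: n = 1024
      INTEGER :: i, j
      REAL, DIMENSION(n,n) :: a, b
    !HPF$ PROCESSORS p(4)
    !HPF$ DISTRIBUTE a(BLOCK,*) ONTO p      ! data distribution at the language level
    !HPF$ ALIGN b(i,j) WITH a(i,j)          ! co-locate b with a to limit communication
      a = 1.0
      b = 2.0
      ! Parallel loop construct: the same operation on different data segments
      FORALL (i = 2:n-1, j = 2:n-1)
        a(i,j) = 0.25 * (b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1))
      END FORALL
      ! Masking operation
      WHERE (a > 1.5) a = 1.5
      PRINT *, 'Global sum = ', SUM(a)      ! Fortran 90/95 reduction intrinsic
    END PROGRAM hpf_sketch

The compiler, not the programmer, generates whatever communication the BLOCK distribution implies; since the directives are comments, the source remains a legal serial Fortran 95 program when they are ignored.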
While HPF does not address all issues of programming
parallel computers, it does provide a single source code version that can be used across
multiple architectures from a Cray T3E [CRI] to a cluster of workstations. For
applications appropriate to the data parallel paradigm [HPFUG] the issues of single source
code maintenance, performance tuning, and a reduced level of effort in porting to various
parallel architectures have proven to be critical advantages. In real world applications
portability now means that the same source can run on all the DMP architectures listed in
Table 2, including the Sun UltraHPC [SUN]. The most popular implementation of HPF is by
the Portland Group, Inc., [PGI], under the trade name of PGHPF [PGHPF], and is installed
at 40% of the first 100 sites of the top 500 [TOP500] HPC sites world-wide. The Portland
Group, Inc. [PGI] has a joint marketing agreement with SGI/Cray [PGHPF] and in this
release Cray's CRAFT programming model has been included [CRIHPFC, CRIHPF]. Some 80% of
PGI's compiler technology is both target-independent and language-independent which
enables rapid deployment for numerous architectures. A Transport Independent Interface
acts as a kernel of the PGHPF runtime system and has been efficiently implemented (in a
way transparent to the user) on top of message passing paradigms such as PVM [PVM], MPI
[MPI], and SHMEM [CRI]. As a result of this ease of retargeting, an interesting new market
trend has been the implementation for Intel Pentium [INTEL] platforms with multiple
CPUs under either LINUX or Windows NT 4.0 operating systems. This user, and
implementation base, is expected to grow even though memory bandwidth could be
problematic. Nevertheless, good scalability has been observed for quad Pentium
microprocessors [COMPAQ] when compared with DMP architectures such as the SGI Origin
[SGI], and issues of memory latency are being addressed with new workstation technologies
[SGIV].
The PGHPF model has low overhead for start-up of parallel regions through features such
as:
- Threads created at program start-up and re-cycled for each
parallel loop or region,
- Parallel regions implemented as inline assembly code
segments and not as subroutine calls,
- A parallelizer integrated with the other internal
compilation phases such as global optimization, vectorization, memory hierarchy
optimization, communication optimization, and interprocedural analysis.
The PGHPF implementation is also designed to be
interoperable with, and complementary to, MPI (see Section 4.3) and it is possible to call
routines which perform MPI message-passing as HPF_LOCAL extrinsics. This provides a way of
transitioning from a global name space model to a local model at a subroutine boundary.
Within an HPF_LOCAL routine the code has essentially the same view of the data that an MPI
program has and can do MPI message passing between the processors. This minimizes the
level of effort in coding message passing since the actual amount of code in MPI is
usually very small in relation to the whole application implemented in HPF. This hybrid
approach usually resolves problems of declining scalability.
Actual performance improvement is dependent on the application. The PGHPF implementation
[PGI, PGHPF] on the Cray T3E out-performs MPI in three of the NAS Parallel Benchmarks up to
128 PEs [NPB]. Several large scale models and a 3D reservoir model show either good
scaling, or results that are within a factor of 1.4 of an MPI implementation on an SGI Origin [SGI].
Scaling results for HPF implementations of two large-scale applications, The Princeton
Ocean Model (POM) and the RIEMANN code, have been studied. For the POM acceptable scaling
is observed up to 16 PEs on either a Cray T3E or the SGI Origin 2000, while for the
RIEMANN code scalability is exceptionally good on either the Origin 2000 or the
Cray T3E, to 128 or 256 PEs, respectively [BAL].
4.3 Message Passing Interface
The message passing programming model is dominated by the
Message Passing Interface (MPI) standard [MPI, MPIG, MPIP], although a predecessor, PVM:
Parallel Virtual Machine [PVM], is in common use. An effort is underway [PVM] to create a
new standard, PVMPI, that combines the virtual machine features of PVM and the message
passing features of MPI. The discussion here will center around MPI which has become
popular because it often provides superior scalability behaviour when compared to the HPF
paradigm. Furthermore, in a distributed memory architecture (such as the Cray T3E) it is
the only alternative to HPF if (ostensibly) portable parallel code is needed that is not
tied to a vendor specific message passing library. Nevertheless, MPI is available in
multiple implementations both public domain and vendor specific [MPIT], but portability
across implementations is not always transparent to the end user.
The MPI standard defines a library of functions and procedures that implement the message
passing model to control passing of data and messages between processes in a parallel
application. The fundamental idea is the concept of a computational task which is assigned
to one or more PEs. MPI is general and flexible and allows for explicit exchange of data
and synchronization between PEs, either individually, or in groups, but without the
assumption of a global address space shared by all processes. Features in the MPI model
allow for precise optimization of communication performance not otherwise possible in
either HPF or OpenMP. However, due in part to this greater flexibility, the MPI model is
also a difficult and labor-intensive way to write parallel code because data structures
must be explicitly partitioned. As a result the entire application must be parallelized to
work with the partitioned data, and all synchronization between PEs is then the
responsibility of the programmer. In the message passing model there is no incremental
path to parallelizing a new (or pre-existing) code - the whole code (or code segment) must
be rewritten. However, another initiative [HPFMPI] has proposed a standard set of
functions for coupling multiple HPF tasks to form task-parallel computations. This hybrid
could combine the ease of use characteristic of HPF with the communication performance
advantages of the message passing paradigm.
The MPI 1.1 standard proposed the following main features:
- Point-to-point communications routines,
- Routines for collective communication between groups of
processes,
- Communication context providing design support of safe
parallel libraries,
- Specification of communication topologies,
- Creation of derived datatypes describing messages of
non-contiguous data.
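The first two of these features can be sketched with a minimal Fortran program; the variable names and message tag below are illustrative only, and only MPI 1.1 library calls are used:

    ! A minimal MPI sketch of point-to-point and collective communication.
    PROGRAM mpi_sketch
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: ierr, rank, nprocs, status(MPI_STATUS_SIZE)
      REAL :: local, received, total
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      local = REAL(rank + 1)
      ! Point-to-point: PE 1 sends its value to PE 0 with message tag 99
      IF (nprocs > 1) THEN
        IF (rank == 1) THEN
          CALL MPI_SEND(local, 1, MPI_REAL, 0, 99, MPI_COMM_WORLD, ierr)
        ELSE IF (rank == 0) THEN
          CALL MPI_RECV(received, 1, MPI_REAL, 1, 99, MPI_COMM_WORLD, status, ierr)
        END IF
      END IF
      ! Collective communication: sum the local values onto PE 0
      CALL MPI_REDUCE(local, total, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      IF (rank == 0) PRINT *, 'Sum over all PEs = ', total
      CALL MPI_FINALIZE(ierr)
    END PROGRAM mpi_sketch

Even in this small example the programmer carries the full burden of matching sends to receives and choosing a root PE for the reduction, which is the level-of-effort cost noted above.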
Because of the lack of a standard way to start MPI tasks on
separate hosts, MPI 1.1 applications are not always portable across workstation networks.
To correct this (and other) problems the MPI-2 specification has been completed, and adds
120 functions to the 128 in the MPI-1 API specification. New features will include:
- MPI_SPAWN function to start MPI and non-MPI processes (i.e.
dynamic process creation),
- One-sided communication functions (e.g. put/get),
- Nonblocking collective communication functions,
- Language bindings for C++.
Some basic features of MPI are summarized in Table 4.
Table 4: MPI features |
MESSAGES & PROCESSES |
Local memory has one (or more) process(es)
associated with it. A message is made up of a body and an envelope, and is the only means
by which processes can access data in local memories or synchronize. Processes can
cooperate in groups to perform common tasks and each process can opt to read a message
based on the envelope contents. Processes are ranked (have a numerical identity) but this
may be assigned by the user to suit the virtual topology appropriate to the task which
need not correspond to the physical hardware connection scheme. In MPI a
"communicator" is always associated with a process group (task), a communication
context, and a virtual topology. |
DATATYPES |
MPI can use the predefined data types of the
host language (C or Fortran) but extends these by allowing the user to construct derived
data types so that both contiguous and noncontiguous data can be defined. MPI message
library functions include a data type argument and the user has considerable flexibility
in choosing data types suitable to virtual topologies or computational tasks. |
COMMUNICATIONS |
MPI is rich in communications options which
fall into two basic types: point-to-point or collective. Both types require specification
of a start address, length, and data type, followed by message envelope parameters. In
point-to-point communications send and receive parameters must match to ensure safe
transmission and allow both blocking and nonblocking transmission. A communication
procedure is blocking if it does not return before it is safe for a process to re-use the
resources identified in the call. Collective communications are valuable in either data
redistribution (broadcast, gather/scatter, etc.) or computation (minimum, maximum,
reduction, etc.). |
SPECIAL FEATURES |
- Support for user-written libraries that are independent of
user code and interoperable with other libraries,
- Support for heterogeneous networks of computers with
different data formats,
- Communications between non-overlapping groups of processes
(or separate tasks),
- Collective communication extended to all-to-all type,
- Simplification of often repeated point-to-point
communication procedures by caching of "persistent" call sequences.
|
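As a hedged illustration of the DATATYPES and point-to-point entries in Table 4 (the array size and tag are arbitrary), a derived datatype built with MPI_TYPE_VECTOR can describe noncontiguous data such as one row of a Fortran array, which is strided in column-major storage:

    ! A minimal sketch of an MPI derived datatype for noncontiguous data.
    PROGRAM mpi_rowtype
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER, PARAMETER :: n = 8
      INTEGER :: ierr, rank, nprocs, rowtype, status(MPI_STATUS_SIZE)
      REAL :: a(n,n)
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      a = REAL(rank)
      ! One row of a(n,n): n elements, blocks of length 1, separated by stride n
      CALL MPI_TYPE_VECTOR(n, 1, n, MPI_REAL, rowtype, ierr)
      CALL MPI_TYPE_COMMIT(rowtype, ierr)
      IF (nprocs > 1) THEN
        ! PE 0 sends its first row to PE 1, which receives it into its own first row
        IF (rank == 0) THEN
          CALL MPI_SEND(a(1,1), 1, rowtype, 1, 10, MPI_COMM_WORLD, ierr)
        ELSE IF (rank == 1) THEN
          CALL MPI_RECV(a(1,1), 1, rowtype, 0, 10, MPI_COMM_WORLD, status, ierr)
        END IF
      END IF
      CALL MPI_TYPE_FREE(rowtype, ierr)
      CALL MPI_FINALIZE(ierr)
    END PROGRAM mpi_rowtype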
Originally the MPI standard omitted specification of
debugging or performance profiling tools, although a standard profiling interface is provided. This
lack of powerful graphical tools has, in part, been compensated for by individual software
developers who have developed tracing tools such as Vampir/Vampirtrace [PAL], while other
groups support development of parallel tools [PTC].
4.4 OpenMP
The OpenMP application program interface (API) supports
multi-platform, shared-memory parallel (SMP) programming, on Unix and Microsoft Windows
NT platforms. OpenMP [OPENMP] partners include major computer hardware vendors
[COMPAQ, DEC, HP, IBM, INTEL, SGI, SUN]. The OpenMP model has also been endorsed by
key applications developers; instrumental in the development of OpenMP has been KAI
Software, a division of Intel Americas, Inc. [KAI].
OpenMP has been developed as a portable scalable model that gives SMP programmers a simple
but flexible interface for developing parallel applications across a range of platforms. A
white paper on a proposed standard was drafted [OPENMP] and an OpenMP Architecture Review
Board (OARB) was established and is undergoing incorporation. The OARB will provide long
term support and enhancements of specifications, develop future standards, address issues
of validation for implementations, and promote OpenMP as a de facto standard. OpenMP is
defined for Fortran, C, and C++ applications. OpenMP for Fortran has at its core a set of
standard compiler directives to enable expression of SMP parallelism. Unlike message
passing, or vendor sets of extensions for parallel software directives, OpenMP is portable
enabling creation of a single source for multiple platforms.
In the shared memory model every processor has access to the memory of all other
processors and the programmer can express parallelism through shared/private data
allocation. The lack of a portable standard has been a major reason for the limited
development of this model. The result in the past has been that different vendors provide
proprietary parallel extensions to Fortran or C. This situation has led programmers to
opt for a message passing model such as MPI or PVM for reasons such as portability or
performance. As a result it is a commonly held belief that scalability in parallel
software is only possible with a message passing paradigm. With the emergence of cache
coherent parallel architectures, or scalable shared memory parallel hardware, software
scalability is easily achieved with a shared memory model. OpenMP has been proposed with
the view that it can provide a model for incremental parallelization of existing code as
well as scalability of performance.
At the simplest level OpenMP is a set of standardized compiler directives and runtime
library routines that extend a base programming language such as Fortran, C, or
C++, to express shared memory parallelism. Directives are common in vendor-specific
parallel implementations (for example in Cray autotasking), but in OpenMP they are not
implementation-specific, and are therefore portable. OpenMP has features that are new and
differ from the coarse-grain parallel models (e.g. Cray autotasking for a parallel loop).
In OpenMP a parallel region may contain calls to subroutines that contain DO loops which
are lexically invisible to the parallel directive in the calling routine. The
directives on such DO loops are examples of orphaned directives and synchronization control can
be performed inside the called routine. This OpenMP feature enables successful
parallelization of nontrivial coarse grain parallel applications without the need of
moving the DO loops into the calling routine to make them visible to the parallel region
directive.
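A minimal sketch of this feature follows; the routine and array names are assumed for illustration only. The !$OMP DO inside the called routine is orphaned: it binds to the parallel region opened in the caller even though it is lexically outside it:

    ! A minimal OpenMP sketch of an orphaned work-sharing directive.
    PROGRAM orphan_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 1000
      REAL :: a(n)
    !$OMP PARALLEL SHARED(a)
      CALL update(a, n)          ! the parallel loop lives inside update()
    !$OMP END PARALLEL
      PRINT *, 'a(n) = ', a(n)
    END PROGRAM orphan_sketch

    SUBROUTINE update(a, n)
      IMPLICIT NONE
      INTEGER :: n, i
      REAL :: a(n)
    !$OMP DO                     ! orphaned: binds to the enclosing parallel region
      DO i = 1, n
        a(i) = SQRT(REAL(i))
      END DO
    !$OMP END DO
    END SUBROUTINE update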
The four design categories of the OpenMP standard are briefly described in Table 5.
Table 5: Four design categories of the OpenMP standard |
CONTROL STRUCTURES
(defining parallel/nonparallel iterations/regions) |
The design goal here is the
smallest set possible with inclusion only for those cases where the compiler can provide
both functionality and performance over what a user could reasonably code. Examples are
PARALLEL, DO, SINGLE with a sentinel !$OMP. |
DATA ENVIRONMENT
(scoping of data, or global objects such as threads) |
Each process has an associated
(unique) data environment with the objects having one of three basic attributes: SHARED,
PRIVATE, or REDUCTION. The last is used to specify a reduction construct which may be
differently computed on different architectures. The THREADPRIVATE directive makes global
objects private by creation of copies of the global object (one for each thread). |
SYNCHRONIZATION
(defining barriers, critical regions, etc.) |
Implicit synchronization points
exist at the beginning and end of PARALLEL directives, and at the end of control
directives (e.g. DO or SINGLE), but can be removed with a NOWAIT parameter. Explicit
synchronization directives (e.g. ATOMIC) allow the user to tune synchronization in an
application. All OpenMP synchronization directives may be orphaned. |
RUNTIME LIBRARY AND ENVIRONMENT
VARIABLES |
A callable runtime library (RTL)
and accompanying environment variables include functions such as query, runtime, and lock
routines. The programmer may set the number of threads in parallel regions, or when to
enable/disable nested parallelism. |
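A brief sketch (with illustrative array names) touches all four categories at once: a control structure (PARALLEL with DO and SINGLE), data environment clauses, a NOWAIT synchronization modifier, and a runtime library call:

    ! A minimal sketch of the four OpenMP design categories in Table 5.
    PROGRAM categories_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 1000
      INTEGER :: i
      INTEGER, EXTERNAL :: OMP_GET_NUM_THREADS
      REAL :: a(n), b(n)
    !$OMP PARALLEL SHARED(a, b) PRIVATE(i)
    !$OMP DO
      DO i = 1, n
        a(i) = REAL(i)
      END DO
    !$OMP END DO NOWAIT          ! remove the implicit barrier after this loop
    !$OMP DO
      DO i = 1, n
        b(i) = 2.0 * REAL(i)
      END DO
    !$OMP END DO                 ! implicit barrier retained here
    !$OMP SINGLE
      PRINT *, 'Threads in region: ', OMP_GET_NUM_THREADS()
    !$OMP END SINGLE
    !$OMP END PARALLEL
      PRINT *, a(n) + b(n)
    END PROGRAM categories_sketch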
A particularly attractive implementation of OpenMP is
the KAP/Pro Toolset by KAI Software, a division of
Intel Americas, Inc. [KAI]. This implementation is rich in graphical
interfaces and has three major components: Guide (parallelizer), Guideview (graphical
performance profiler), and Assure (parallel
code verification). The KAP/Pro Toolset is specifically targeted for SMP
architectures (or node clusters) with interoperability to MPI for inter-node communication
[BOVA]. An important distinction made by KAI is that their product analyzes the dynamic
performance of the application and is not a static source analysis [APRI]. Results for
scaling with the KAP/Pro Toolset implementation of MM5 for either the SGI
Origin 2000 or the DEC Alpha Server 8400 (under Windows NT) are excellent.
4.5 Comparing HPF, MPI,
PVM, and OpenMP
In this section the relative merits of HPF, MPI, PVM, and
OpenMP are assessed based on end-user experience and designer comparisons.
Key reasons users prefer HPF over message passing models include:
- The relative speed with which large scale applications can
be parallelized,
- It is a good approach to making parallel programming
simpler,
- The code is transparent to the scientific end-user and
algorithmic content is not obscured,
- Benefit of maintaining one source usable by both working
scientists and performance analysts,
- Portability is a key in smooth transitions among existing
and new architectures,
- Ease with which code can be maintained,
- A single address space across the physical processors, and
- Data mapped in a single computational node are accessible
through "extrinsic local" procedures written in Fortran 77/95, or C.
Table 6 lists positive and negative characteristics of HPF
identified by developers.
Table 6: Characteristics of HPF identified by developers
Positive | Negative
Productivity | Too rich in language specifications
Performance | Data duplication across processors
Portability | Scope limited to data parallel model
Maintenance | Problems with scalability
Opinions vary as to successes with HPF versus MPI [NSF] and
at some sites [PSC] users have abandoned the former in favor of the latter. One criticism
of the current PGHPF implementation has been that while it is compliant with HPF 1.1, it
is necessarily restricted to regular data distributions. The proposed HPF 2 standard would
resolve this problem by allowing dynamically distributed data and irregular
communications. However, it should be noted that work-arounds are often possible even
under HPF 1 [DMM].
Some evaluations find that scalability results are more limited with HPF compared to
message passing for the same architectures [NSF]. Others find good scalability results
[DMM] and both positions have been repeatedly confirmed. The only conclusion on the
scalability issue is that results are application dependent.
A comparison of HPF and OpenMP shows some overlap in functionality, although the latter has
a much richer and more flexible process control structure. Therefore the
OpenMP paradigm shares many of the advantages listed above for HPF: readable code,
portability, ease of parallelization and code maintenance. Most important is that OpenMP
allows for the incremental parallelization of a whole application without an immediate
need for a total rewrite. A simple example would be the incremental parallelization of
loops in a large application. These would be declared as parallel regions, and variables
appropriately scoped, as in a fork/join execution model. No data decomposition is needed,
nor does the number of PEs need to be specified, since this is transparent to the user.
The end of a loop is the end of a parallel region with an implicit barrier. By contrast,
in an MPI implementation, there is no globally shared data, and all data needs to be
explicitly broadcast to all PEs (creating a storage expense). Loop bounds have to be
explicitly managed by identification of each processor and the number used for the loop.
Also one PE needs to be reserved for I/O operations. While task level parallelism may have
its attractive side it does require a comprehensive book-keeping effort in managing the
task-to-PE mapping.
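As a hedged illustration of this contrast (the loop and variable names are hypothetical), the incremental OpenMP approach needs only a directive with appropriate scoping, whereas the MPI version of the same loop would require explicit partitioning of the index range and a reduction call:

    ! Incremental parallelization of one loop in a fork/join model.
    PROGRAM loop_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 100000
      INTEGER :: i
      REAL :: x(n), s
      DO i = 1, n
        x(i) = REAL(i)
      END DO
      s = 0.0
    !$OMP PARALLEL DO SHARED(x) PRIVATE(i) REDUCTION(+:s)
      DO i = 1, n
        s = s + x(i) * x(i)
      END DO
    !$OMP END PARALLEL DO
      PRINT *, 'Sum of squares = ', s      ! implicit barrier ends the parallel region
    END PROGRAM loop_sketch

No data decomposition or PE count appears in the source; removing the directive recovers the original serial loop.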
The MPI standard [MPI, MPIF] has established a widely accepted and efficient model of
parallel computing with new definitions in the areas of process group, collective
communications, and virtual topologies. MPI code is portable across multiple platforms and
allows for the development of portable application and library software. Such libraries
are useful where there is a need for standard parallel functionality, as in adapting
finite difference models to parallel computers, with block domain decomposition and
parallel I/O. If the programmer chooses a message passing approach this is labor-intensive
because it is a huge undertaking to build parallel libraries which anticipate and
incorporate the needs of a wide variety of applications. However, whereas it may take
months to build a library the first time it is needed, it is re-usable and thereafter is
probably no harder to use than a compiler directive, because the programmer needs only to
substitute library calls, define and assign a few new variables, and link the code with
the parallel library.
The relative merits of PVM versus MPI have been investigated [PVM]. MPI is popular because
of high communication performance on a given platform, but this is at the cost of some
features. One is the lack of interoperability between different MPI implementations so
that one vendor's MPI implementation cannot send messages to another vendor's MPI
implementation. At present there are some five different public domain implementations of
MPI in addition to vendor-specific versions. The MPI standard allows portability in that
an application developed on one platform can be compiled and executed on another. However,
unlike PVM, MPI executables compiled on different architectures need not be able to
communicate with each other because the MPI standard does not require heterogeneous
communication. On the question of portability, PVM is superior in that it "contains
resource management and process control functions that are important for creating portable
applications that run on clusters of workstations and MPPs" [PVM]. Even when MPI is
used in a vendor specific implementation, the performance achieved can still be
considerably lower than that possible with the vendor's proprietary message passing
protocols.
Another difference between PVM and MPI is language interoperability. Whereas a PVM
application can exchange messages between C and Fortran codes, the MPI standard does not
require this, even on the same platform. While MPI can be used with FORTRAN 77 code it
does not offer the level of integration of either HPF or OpenMP. As one example, MPI does
not take advantage of the Fortran 90/95 array syntax.
A further deficit in MPI is the lack of a feature set to support writing of fault tolerant
applications: "The MPI specification states that the only thing that is guaranteed
after an MPI error is the ability to exit the program" [PVM]. In this respect PVM can
be used in large heterogeneous clusters for long run times even when hosts or tasks fail.
Graphical interfaces vary considerably in quality between the HPF, MPI, and OpenMP
paradigms. While the MPI standard does specify a profiling interface, graphical
profilers are rare and not used in everyday applications. In the case of HPF an
application profiler, PGPROF, provides statistics on execution time and function calls in
a graphical interface. Both HPF and MPI code can be debugged using the TotalView
[DOL] multiprocessor debugger which is commonly available. TotalView has an
intuitive graphical interface that allows management and control of multiple processes
across languages (C, C++, FORTRAN) either on multiprocessor systems or distributed over
workstation clusters. By far the richest parallel interactive graphical user
interfaces are to be found in KAI's OpenMP implementation in the KAP/Pro
Toolset. Performance visualization and
tuning is facilitated by the GuideView graphical interface which shows what each processor is doing at
various levels of detail. Guideview provides interactive identification of source location
for performance bottlenecks and prioritized remedial actions. Similarly, the AssureView
graphical interface works with the Assure tool
for automatic parallel error detection and parallel code validation. Such features promise
drastic reductions in level-of-effort for debugging of parallel code because much of that
effort is shifted to the application environment and the platform.
5.0 Measuring efficiency in
parallel performance
For future reference this section summarises some basic
parallel performance metrics [NSF]. A detailed discussion of this subject can be found in
specialized monographs [GEL, KUCK]. Table 7 summarises the simplest scalability criteria
used to measure performance versus increasing number of PEs. These metrics are in common
use and are critical in assessing a successful parallel implementation on one
architecture.
Table 7: Parallel Computing Scalability Criteria
SPEEDUP = SERIAL TIME / PARALLEL TIME = TIME ON 1 PE / TIME ON N PEs
PARALLEL EFFICIENCY = SPEEDUP / N
The ideal situation corresponds to linear scaling when SPEEDUP = N and PARALLEL
EFFICIENCY = 1.
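As a worked example with illustrative figures, an application that takes 100 seconds on 1 PE and 8 seconds on 16 PEs has SPEEDUP = 100 / 8 = 12.5 and PARALLEL EFFICIENCY = 12.5 / 16 = 0.78.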
Overall performance in parallel computers is affected by having both good communication
performance between PEs and good computation performance on each PE. Assuming
communication time increases linearly with message size, then communication performance is
a combination of:
- Latency, or minimum communication time (seconds)
- Bandwidth, or asymptotic communication rate (MB/sec)
One metric of communication to computation balance achievable in a given parallel
architecture is:
BALANCE = BANDWIDTH (MB/sec) / PROCESSOR PEAK SPEED (Mflops)
where, for a floating point intensive application,
computational SPEED (more correctly rate) is measured in million floating point
operations per second (Mflops).
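As a worked example with illustrative figures, a PE with an interconnect bandwidth of 150 MB/sec and a peak speed of 600 Mflops has BALANCE = 150 / 600 = 0.25 MB per Mflop.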
Table 8: Macroperformance Metrics for an Application and Architecture Combination
CLOCK PERIOD = PROCESSOR CLOCK CYCLE TIME
EFFICIENCY = ACTUAL PERFORMANCE / PEAK PERFORMANCE
COST EFFECTIVENESS = PRICE ($) / PERFORMANCE (Mflops)
ABSOLUTE PERFORMANCE = 1 / TIME ON N PEs
When comparing the same application on different
architectures (for a fixed problem size) the appropriate comparative scalability criteria
include:
- SPEEDUP / Mflops
- PARALLEL EFFICIENCY / ( Mflops / PE)
Successfully scaled problem sizes often lead to greatly
enhanced COST EFFECTIVENESS over a serial solution. In studying scalability it is
important to distinguish fixed and scaled problem sizes. With a fixed problem size the
same problem is distributed over an increasing number of PEs. With increasing N this
eventually leads to a decrease in PARALLEL EFFICIENCY because smaller data
partitions per PE imply an increase in communication costs between PEs relative to the
amount of computation time. A scaled problem size seeks a homogeneous and optimal
distribution of data per PE while minimizing the relative communication costs.
The discrepancy between the ideal value of 1 and the actual PARALLEL EFFICIENCY
achieved is measured by
PARALLEL INEFFICIENCY = 1 - PARALLEL EFFICIENCY
                      = COMMUNICATION OVERHEAD + LOAD IMBALANCE + SERIAL OVERHEAD
with
COMMUNICATION OVERHEAD = (maximum time spent in communication among all PEs) / (total TIME ON N PEs)
and
LOAD IMBALANCE = { T(max) - T(avg) } / T(avg)
where
T(avg) = { T(1) + T(2) + ... + T(N) } / N
T(max) = max { T(i) }, i = 1,...,N
with T(i) the computation time on processor i.
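As a worked example with illustrative figures, computation times of 10, 12, 14, and 12 seconds on N = 4 PEs give T(avg) = 12, T(max) = 14, and LOAD IMBALANCE = (14 - 12) / 12 = 0.17.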
The SERIAL OVERHEAD is usually not as significant as the other two terms
contributing to PARALLEL INEFFICIENCY but can be estimated with the following
approach. If T(S) is the serial (uniprocessor) time, then T(S)/N is the
parallel time in the ideal case. The difference between T(S)/N and the parallel
time, T(P), is the "overhead" time of parallel execution:
PARALLEL OVERHEAD = T(P) - T(S)/N
Subtracting the total communication time (which can be measured) and the load imbalance
estimate gives an estimate of the SERIAL OVERHEAD as:
NET PARALLEL OVERHEAD = T(P) - T(S)/N - communication time - { T(max) - T(avg) }
Values of PARALLEL EFFICIENCY >
0.5 are considered acceptable [NSF], and values close to 1 are common for N < 10.
However, as N increases PARALLEL EFFICIENCY will diverge from the ideal value of
1 to an increasing extent, until it asymptotes to a constant value and eventually, for
sufficiently large N, decreases [KUCK]. This phenomenon is the empirical result of
mismatch between problem (data) size, processor cache size, PE count, and over-all
communication efficiency of the architecture. The smaller the group of processors assigned
to independent tasks the higher the parallel efficiency tends to be. Higher parallel
efficiency corresponds to a higher computation to communication ratio. Often the study of
scaling behavior is specific to a fixed problem size and this can be deceptive. Kuck
[KUCK] (see his Figs. 5.6 and 6.7) shows that a "sweet spot" is defined by a
surface mapped out by the sequence of SPEEDUP versus N curves for successively larger
problem sizes. This surface is unique to each application-architecture combination and no
generalizations may apply. Therefore caution is advised when evaluating parallel
performance on a specific architecture with either a fixed problem size, or one
application.
6.0 Summary
Developers should plan for transition of existing models to
future parallel architectures and evaluate the software options with respect to
suitability to this task based on the criteria of portability, usability, and scalability.
Such a plan should have focal points such as the following:
- Portability: A major NASA/NSF
report [NSF] found that the typical cost of parallelizing serial production code is 0.5 to
1 person months per 1000 lines of code.
- Usability: At sites with a large
base of legacy serial code there is a pent-up demand for simpler parallelization
strategies that do not require a complete rewrite of the code as the first step.
- Scalability: For SMP models message
passing is unnecessary and overly restrictive and the OpenMP paradigm provides a promising
solution for scalable parallelism on multiprocessor clusters.
Prototyping with easy-to-use parallelizing software, such as HPF and OpenMP based tools, provides input to a decision-making process on the
advisability of launching a larger effort with a message passing implementation. For large
applications such parallel prototyping may be the only way of determining the potential
for scalability when there is a pressing need to port parallel applications from a
single source to either single or clustered SMP nodes.
These issues will be the subjects of discussion in future
HiPERiSM Consulting, LLC, Newsletters and Technical Reports.
7.0 Citation Index
Legend |
Citation |
APRI |
Applied Parallel Research, Inc.,
http://www.apri.com. |
ASET |
Association for Super-Advanced
Electronics Technologies, http://www.aset.or.jp. |
ALPHA |
Alpha Processor, Inc., http://www.alpha-processor.com. |
AVA |
Avalon Alpha Beowulf cluster http://cnls.lanl.gov/avalon. |
BADER |
Parascope: A List of Parallel
Computing Sites, http://www.computer.org/parascope. |
BAL |
Scalability results of the RIEMANN
code by Dinshaw Balsara, http://www.ncsa.uiuc.edu/SCD/Perf/Tuning/mp_scale/
|
BOVA |
S. W. Bova et al. Parallel
Programming with Message Passing and Directives,preprint. |
COMPAQ |
Compaq Computer Corporation, http://www.Compaq.com, Compaq Pro 8000,
http://www.Compaq.com/products/workstations/pw8000/index.html |
CRI |
Cray C90 and T3E http://www.cray.com/products. |
CRIHPFC |
Cray release of PGI HPF_CRAFT for
the Cray T3E http://www.sgi.com/newsroom/press_releases/1997/july/cray_pgi_release.html. |
CRIHPF |
http://www.sgi.com/newsroom/press_releases/1997/august/crayupgrade_release.html. |
DEC |
COMPAQ DIGITAL Products and
Services, http://www.digital.com, Digital Alpha
Server 8400, http://www.digital.com/alphaserver/products.html |
DOL |
Dolphin Interconnect Solutions,
Inc. http://www.dolphinics.com. |
DMM |
L. Dagum, L. Meadows, and D.
Miles, Data Parallel Direct Simulation Monte Carlo in High Performance Fortran, Scientific
Programming, (1995). |
FTN90 |
Jeanne C. Adams, Walter S.
Brainerd, Jeanne T. Martin, Brian T. Smith, and Jerrold L. Wagener, Fortran 90 Handbook:
Complete ANSI/ISO Reference , Intertext Publications/Multiscience Press, Inc., McGraw-Hill
Book Company, New York, NY, 1992. |
FTN95 |
Jeanne C. Adams, Walter S.
Brainerd, Jeanne T. Martin, Brian T. Smith, and Jerrold L. Wagener, Fortran 95 Handbook:
Complete ISO/ANSI Reference, The MIT Press, Cambridge, MA, 1997. |
GEL |
Erol Gelenbe, Multiprocessor
Performance, Wiley & Sons, Chichester England, 1989. |
GUNTER |
List of the world's most powerful
computing sites, http://www.skyweb.net/~gunter. |
HAM |
S. Hamilton, Semiconductor
Research Corporation, Taking Moore's Law Into the Next Century, IEEE Computer, January,
1999, pp. 43-48. |
HIT |
Hitachi, http://www.hitachi.co.jp/Prod/comp.hpc/index.html. |
HP |
Hewlett-Packard Company, HP
Exemplar http://www.enterprisecomputing.hp.com |
HPFD |
Scientific Programming, Vol. 2, no.
1-2 (Spring and Summer 1993), pp. 1-170, John Wiley and Sons |
HPFF |
High Performance Fortran Forum, http://www.crpc.rice.edu/HPFF/index.html |
HPFH |
Charles H. Koelbel, David B.
Loveman, Robert S. Schreiber, Guy L. Steele, Jr., and Mary E. Zosel, The High Performance
Fortran Handbook, The MIT Press, Cambridge, MA, 1994. |
HPFMPI |
Task Parallelism and Fortran,
HPF/MPI: An HPF Binding for MPI, http://www.mcs.anl.gov/fortran-m. |
HPFUG |
High Performance Fortran (HPF)
User Group, http://www.lanl.gov/HPF |
IBM |
IBM, Inc., http://www.ibm.com, http://www.rs6000.ibm.com/hardware/largescale/index.html. |
IDC |
Christopher G. Willard,
Workstation and High-Performance Systems Bulletin: Technology Update: High-Performance
Fortran, International Data Corporation, November 1996 (IDC #12526, Volume:
2.High-performance Systems, Tab: 6.Technology Issues). http://www.idc.com. |
INTEL |
Intel Corporation, http://www.intel.com. |
KAI |
KAI Software, a division of Intel
Americas, Inc., http://www.kai.com. |
KUCK |
David J. Kuck, High Performance
Computing, Oxford University Press, New York, 1996. |
LANL |
Los Alamos National Laboratory,
Loki - Commodity Parallel Processing, http://loki-www.lanl.gov/index.html. |
MED |
Micro-Electronics Development for
European Applications, http://www.medea.org. |
MESS1 |
P. Messina, High Performance
Computers: The Next Generation (Part I), Computers in Physics, vol. 11, No. 5 (1997),
pp.454-466. |
MESS2 |
P. Messina, High Performance
Computers: The Next Generation (Part II), Computers in Physics vol. 11, No. 6 (1997),
pp.598-610. |
MM5 |
MM5 Version 2 Timing Results, http://www.mmm.ucar.edu/mm5. |
MOR |
Moore's Law http://webopedia.internet.com/TERM/M/Moores_Law.html. |
MPI |
The Message Passing Interface
(MPI) standard, http://www.mcs.anl.gov/mpi/index.html. |
MPIF |
MPI Forum. MPI: A Message-Passing
Interface Standard, International Journal of Supercomputer Applications, Vol. 8, no. 3/4
(1994), pp. 165-416. |
MPIG |
William Gropp, Ewing Lusk, and
Anthony Skjellum, Using MPI - Portable Parallel Programming with the Message-Passing
Interface, The MIT Press, Cambridge, MA, 1994. |
MPIP |
Peter S. Pacheco, Parallel
Programming with MPI, Morgan Kaufman Publishers, Inc., San Francisco, CA, 1997. |
MPIT |
MPI Software Technology, Inc.,
http://www.mpi-softtech.com. |
NASA |
NASA High Performance Computing
and Communications (HPCC) Program, Center of Excellence in Space Data and Information
Sciences (CESDIS), the Beowulf Parallel Workstation project http://cesdis.gsfc.nasa.gov/beowulf. |
NSF |
W. Pfeiffer, S. Hotovy, N.A.
Nystrom, D. Rudy, T. Sterling, and M. Straka, JNNIE: The Joint NSF-NASA Initiative on
Evaluation (of scalable parallel processors), July 1995, http://www.tc.cornell.edu/JNNIE/jnnietop.html. |
NTMAG |
A. Sakovich, Life in the Alpha
Family, Windows NT Magazine, January, 1999, http://www.ntmag.com. |
NEC |
NEC, Supercomputer SX-4 Series, http://www.hpc.comp.nec.co.jp/sx-e/Products/sx-4.html. |
NPB |
NAS Parallel Benchmarks, http://science.nas.nasa.gov/Software/NPB. |
OPENMP |
OpenMP: A Proposed Industry
Standard API for Shared Memory Programming, http://www.openmp.org. |
PAL |
Pallas, GmBH, http://www.pallas.de, MPI visualization tool http://www.pallas.de/pages/vampir.htm,
MPI profiling/performance monitor, http://www.pallas.de/pages/vampirt.htm. |
PGHPF |
PGHPF description for Cray
systems, http://www.sgi.com/Products/appsdirectory.dir/DeveloperIXThe_Portland_Group.html. |
PGI |
The Portland Group, Inc., http://www.pgroup.com |
PSC |
The Pittsburgh Supercomputing
Center, http://www.psc.edu. |
PTC |
The Parallel Tools Consortium, http://www.ptools.org. |
PVM |
PVM: Parallel Virtual Machine, http://www.epm.ornl.gov/pvm. |
SEL |
Semiconductor Leading Edge
Technologies, Inc., http://www.selete.co.jp. |
SGI |
Silicon Graphics, Inc.,
http://www.sgi.com, Cray Origin 2000, http://www.sgi.com/origin2000, |
SGIV |
Silicon Graphics, Inc., Windows NT
workstations, http://www.sgi.com/visual. |
SRC |
Semiconductor Research
Corporation, http://www.src.org/areas/design.dgw. |
STARC |
Semiconductor Technology Academic
Research Center, http://www.starc.or.jp. |
ST1 |
C. H. Still, Portable Parallel
Computing Via the MPI1 Message Passing Standard, Computers in Physics, 8 (1994), pp.
553-539. |
ST2 |
C. H. Still, Shared-Memory
Programming With OpenMP, Computers in Physics, 12 (1998), pp. 577-584. |
SUN |
SUN Microsystems, Inc., http://www.sun.com. |
TRS |
Technology Roadmap for
Semiconductors, http://notes.sematech.org/ntrs/Rdmpmem.nsf. |
TOP500 |
TOP500 Supercomputer Sites, http://www.netlib.org/benchmark/top500.html |
HiPERiSM Consulting, LLC, (919) 484-9803
(Voice)
(919) 806-2813 (Facsimile)