1.0 Introduction
We investigate parallel programming
paradigms and survey software and hardware vendor involvement in ongoing development.
These developments are having a profound impact on the way science and engineering models
are computed. Those involved in developing applications need to consider alternative
parallel programming paradigms and develop experience with them on both shared and
distributed memory architectures as well as workstation clusters. Such considerations are
an important part in a plan for transitioning long-life-cycle models to future High
Performance Computing (HPC) parallel architectures and software standards since both are
undergoing evolutionary (and revolutionary) change.
This Newsletter covers several topics in HPC hardware and
software trends and starts with a survey of HPC hardware developments in section 2. The
prevailing status is described and the anticipated growth over the next decade is
indicated. This review is a prelude to (and partial explanation of) the need to discuss
parallel programming paradigms. The focal point is compiler software and toolkit
development [PTC] (citations are shown with a mnemonic key listed in Section 7) by
both hardware and software vendors to take advantage of HPC hardware trends. The role of
parallel software tool development is important because major surveys of parallel systems
[NSF] have noted the perception that scalable parallel systems are not viewed as viable
general purpose scientific and engineering computing resources. The effective utilization
of scalable parallel computers hinges on obtaining performance (scalability), and ease of
use (usability). Table 1 summarizes the basic parallel programming paradigms that have
found vendor support, or world-wide acceptance, and citations that are sources of
information.
Table 1: Parallel Programming Paradigms
Paradigm | Citations
High Performance Fortran (HPF) | IDC, HPFD, HPFF, HPFH, PGI, CRIHPFC, CRIHPF, PGHPF
Message Passing Interface (MPI) | MPI, MPIF, MPIG, MPIP, MPIT, PAL, ST1
Parallel Virtual Machine (PVM) | PVM
OpenMP Application Program Interface (OpenMP) | OPENMP, KAI, ST2
Sections 2 and 3 summarize HPC hardware and software
trends, respectively, as a prelude to an analysis of parallel programming paradigms in
Section 4. Section 5 lists basic parallel performance metrics for future reference,
Section 6 is a summary, and Section 7 has a citation index with numerous World Wide Web
hyperlinks.
Grateful acknowledgements are due to Bob Kuhn (KAI
Software, a division of Intel Americas, Inc.[KAI]), and Doug Miles (The Portland Group,
Inc.[PGI]) for their time and for freely sharing information about their respective
company products. The material summarized here draws heavily from these sources
and those listed in the Citation Index (Section 7). The opinions expressed are not
necessarily those of the companies (or names) mentioned here. All trademarks and
tradenames cited are the property of the owners.
2.0 High Performance Computing Hardware Trends
In the last decade computer hardware performance has
improved by a factor of 500 [MESS1] due to use of both hundreds (or thousands) of
individual processors and also parallelism within processors (as in vector architectures).
Vendors have moved strongly towards off-the-shelf microprocessors and memory chips for
large scale, closely coupled, Distributed Memory Parallel (DMP) architectures with
multiple processor elements (PEs). Furthermore, some dedicated applications have found a
home on microprocessor-based clusters delivering cost effective performance [AVA, LANL,
NASA]. Vector processors in Shared Memory Parallel (SMP) configurations still have a
large installed base [TOP500, GUNTER], but are approaching an asymptotic limit in clock
cycle time. This trend, which is the result of solid-state design considerations, has (in
part) been compensated for by an increase in the number of pipelines or functional units
in the vector central processing unit (CPU). Nevertheless, the state-of-the-art today
sees growing activity in design, and marketing, of scalable parallel systems [BADER]. This
development is the result of improvements in floating point performance of
microprocessors, new connection schemes for many distributed processors (and memories),
and improvements in interface software tools for DMPs. As a result, science and
engineering users have an improved knowledge base on how to use DMPs compared to even 5
years ago. These market trends have raised expectations that DMPs can scale indefinitely
to meet future demand of compute intensive applications, but a warning has been sounded as
to the false basis of such expectations and the future course of the HPC industry [KUCK].
Messina reviews [MESS1] the current trends in parallel architecture development and
identifies the current dominant commercial architectures to include those shown in Tables
2 and 3.
Table 2: Distributed Memory Processors (DMP)
DMP implementation | Maximal configuration | Citation
Cray T3E | 1024 nodes @ 1 SMP PE per node | CRI
IBM SP | 512 nodes @ 4-8 SMP PEs per node | IBM
SGI/Cray Origin 2000 | 1024 PEs | SGI
HP Exemplar X-Class/SPP 2000 | 16 nodes @ 32 PEs per node | HP
Hitachi SR2201 MPP | 2048 PEs | HIT
Table 3: Vector Shared Memory Processors (VSMP)
VSMP implementation | Cycle time | Citation
Hitachi S-3800 | 2.0 nanoseconds | HIT
NEC SX/4 | 8.0 nanoseconds | NEC
SGI/Cray T90 | 2.2 nanoseconds | CRI
While the Vector SMP architectures are proprietary, those for DMPs are (or will be) based
on commodity microprocessors such as the Alpha 21164 [CRI, DEC], PowerPC [IBM], MIPS
R10000 [SGI], PA-8000 [HP], and a high performance RISC chip (unspecified) [HIT]. Recent
reviews of the history of the DEC Alpha chip [NTMAG] and marketing enterprises [ALPHA]
have further details.
Messina also reviews [MESS2] the expected developments in microprocessor clock speeds from
a current 450 MHz (2.2 nsec cycle) to 1100 MHz (0.91 nsec cycle) by the year 2010, a gain
factor of only 2.4. This slow-down in reduction of cycle time is the result of known
problems with the limits of feature size in CMOS on silicon devices. Moore's law predicts
[MOR] that the number of transistors on a chip will double every 18 months. So far the
micro-electronics industry has sustained this pace by reducing feature size and thereby
increasing circuit speed. A recent review of semiconductor research [HAM] notes that the
Technology Roadmap for Semiconductors [TRS], which charts progress in process and design
technologies, calls for an exponential scaling in feature size from a current (1999) 180
nanometers (21 million logic transistors) to 50 nanometers (1400 million logic
transistors) by 2012. Beyond 2006 new techniques will need to be developed to overcome
the deep-ultraviolet 193nm wavelength limit which confines the smallest feature size to
100 nm in the chip etching process. To face this challenge the microelectronics industry
has formed various alliances to address the difficult materials and design challenges
[SRC, STARC, SEL, MED], including X-ray proximity lithography [ASET], to break through the
100 nm limit. In addition to the challenges of overcoming physical properties, the
industry will need to find new solutions and tools for chip design if Moore's law is to
survive into the next millennium.
Coincident with this progress in chip manufacture the
increasing gap between processor speed and memory speed is expected to continue. This
problem has been compounded in DMP systems by latency due to physically distributed
memory. A proposed means of mitigating this trend is to ensure use of multiple levels of
memory hierarchy and an architecture that can work with many concurrent threads. The word
"thread" denotes a group of instructions scheduled as one unit. This important
idea is summarized by Messina [MESS2] : "The multithread approach hides memory
latency if the ratio of threads to processors is high enough, meaning that there is always
work to be done by a sufficient number of 'ready' threads. While ready threads are
executing, data needed by threads that are not yet scheduled can be migrated up the memory
hierarchy, behind the scenes." To sustain the growth in hardware performance
"microprocessors will have to support more concurrent operations, and more processors
will have to be used." Implicit in this proposal is the expectation of continued
development of parallel compiler technology to the stage where thread generation and data
movement is possible on the scale required.
Meanwhile at the low-end of the HPC scale, microprocessor-based, multi-CPU, desk-side
computers have arrived as viable science and engineering computing resources. Examples
include systems based on Intel Pentium [INTEL] and COMPAQ Alpha [DEC] microprocessors, now
available with 4 [COMPAQ, SGIV] to 14 [DEC] (or more [SUN]) PEs per configuration. The
performance per PE with such microprocessors approaches, or exceeds, that of traditional
vector processing mainframes [MM5]. In the marketplace over 800,000 personal workstations
were sold in 1996 [IDC]. This constituted over fifty percent of all workstation shipments
in 1996 and represented a growth of over thirty percent in either units or revenue when
compared to 1995. Concomitant with this sector growth is the recognition of the critical
need for advances in compiler technology to extract performance from commodity
microprocessors.
3.0 High Performance Computing Software Trends
The future demand for some two orders of magnitude increase
in parallelism [MESS2] will require new designs for parallel programming paradigms and
end-user understanding of them. Of prime concern in this enterprise are criteria of
usability, scalability, and portability.
It is known that two components are necessary for parallel
scalability in application performance: scalable hardware and scalable software. On the
scalable software front significant advances have been made in the past decade. To
describe and evaluate some of these advances the scope of this discussion is restricted to
source code preprocessing software, compiler technology, and applications development
tools for parallel architectures. Also included is some discussion of graphical interfaces
to static/dynamic debugging software and profiling/performance tuning tool kits.
Examples of scalable software paradigms include HPF and the OpenMP standard which aim to
hide the details of the architecture from the end-user through compiler-level (language)
constructs and directives. The opposite extreme is presented by (ostensibly) portable
message passing libraries such as MPI [MPI] or PVM [PVM] providing programmers greater
control (but also demanding total management) of parallelism at the task level.
Consequently the two extremes are also extremes in the level-of-effort in coding and using
parallel computers. Today most major vendors offer HPF, either as features in their own
compilers (Digital [DEC], IBM [IBM]), or implementations through collaborative agreements
with the Portland Group, Inc., [PGI] (Hewlett-Packard/Convex [HP], Intel [INTEL], SGI
[SGI]). The Portland Group, Inc., considers interoperability between HPF and MPI an
important feature. Furthermore, a recent study shows how OpenMP and MPI can be made to
work together within the same application [BOVA] for SMP workstation clusters.
HPF relies on a data parallel programming model with the following characteristics [IDC,
HPFF, HPFH]:
- Operation-level parallelism, or simultaneous execution of
identical instructions on different processors with different data segments,
- Single (global) thread of control where one program defines
all operations,
- Global name space with all data in a single logical memory
and available to all parts of the code,
- Loose synchronization where processors run independently with
synchronization only at specified points.
A message passing paradigm (such as MPI [ST1]) is characterized by:
- Process-level parallelism where independent processes
exchange data with message send/receive calls, multiple (independent) threads of control,
- Multiple, independent name spaces,
- Close synchronization of all processes.
Whereas, the data parallel model partitions data over available processors, the message
passing model distributes computational tasks over (groups) of processors. The OpenMP
proposal [OPENMP, ST2, BOVA] has some similarity to the data parallel model in that it
uses directive driven parallelism with global data but is designed for shared memory
parallel systems (SMP) and generates multiple threads of control in a fork/join model.
4.0 Parallel Programming Paradigms
4.1 Effective Parallel Processing
Effective and efficient parallel processing depends on a
combination of several key factors [NSF]:
- Performance, or achieving good scalability,
- Usability, or ease of use
On the performance side, scalability is the key issue and the level achieved depends on
the application, the parallel architecture, and the implementation environment. On the
usability side, ease of use addresses issues of how easy it is to port applications, achieve
implementation robustness, and maintain code to achieve good scalability. Implementation
features critical to the ease of use issue include the programming, debugging,
optimization, and execution environment software tools. Previous surveys [NSF] have found
divergent scalability results for different applications on different platforms and have
proposed more precise evaluation methodologies. Section 5 defines some detail of specific
parallel computer evaluation areas including:
- Macroperformance, or gross behavior of the
computer-application combination,
- Microperformance, or the underlying factors responsible for
the observed macroperformance,
- Usability, or program development environment.
Key concerns of end-users remain centered on issues such as:
- Portability, or ensuring an application will run on multiple
platforms without modification,
- Code maintenance, or code that is easy to read and maintain
as a single source for serial and parallel implementations
- Scalable hardware, or products with growth potential,
- Scalable software, or tools to enable scalability with
relative ease.
4.2 High Performance Fortran
In 1992 the High Performance Fortran Forum (HPFF), a
coalition of industry, academic and laboratory representatives, proposed High Performance
Fortran (HPF) as a set of extensions to Fortran 90 [HPFF]. Subsequently the HPF 1.0
Language Definition document was published [HPFD] and was updated in 1994 to version 1.1.
Currently an HPF 2.0 Language Specification document is in development. The programming
paradigm here is single-program-multiple-data stream (SPMD) based on the model that the
same operation is repeated on different data elements. A data parallel model therefore
attempts to distribute these operations across multiple processors to enhance
simultaneity. The goal of the HPFF is to define language extensions [HPFF] that support:
- Data parallel programming features,
- High performance on DMP and microprocessor based SMP
architectures,
- Performance tuning across architectures.
The data parallel paradigm implies single-threaded control
structures, a global name space, and loosely synchronous parallel execution. Secondary
goals of HPFF include:
- Portability of code from serial to parallel versions and
between parallel computers,
- Compatibility with the Fortran 95 standard,
- Simplicity of language constructs,
- Interoperability with other languages/paradigms.
In HPF applications the user implements directives to
support the data parallel programming model and the HPF compiler uses the Fortran 90/95
[FTN90, FTN95] source and embedded directives to generate an executable code that
automatically uses an SMP/DMP system [CRIHPF]. Utility routines appear in HPF either as
Fortran 95 intrinsic functions or in the form of HPF library routines contained in a
Fortran 95 module. The HPF library is a set of intrinsic procedures designed to support
optimized implementations of commonly used data parallel operations [HPFH].
Key HPF features of value to programmers have been identified as:
- Data distribution at the language level,
- Parallel loop constructs,
- Masking operations,
- Parallel function library
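These features can be illustrated with a minimal HPF sketch; the array and processor-arrangement names below are hypothetical and the directives assume an HPF 1.1 compliant compiler. The sketch shows data distribution at the language level, a parallel FORALL construct, and a masked (WHERE) operation:

    ! A minimal HPF sketch; array and processor names are illustrative only.
    PROGRAM hpf_sketch
      INTEGER, PARAMETER :: n = 1024
      INTEGER :: i, j
      REAL, DIMENSION(n,n) :: a, b
    !HPF$ PROCESSORS p(4)
    !HPF$ DISTRIBUTE a(BLOCK,*) ONTO p      ! data distribution at the language level
    !HPF$ ALIGN b(i,j) WITH a(i,j)          ! co-locate b with a to limit communication
      a = 1.0
      b = 2.0
      ! Parallel loop construct: the same operation on different data segments
      FORALL (i = 2:n-1, j = 2:n-1)
        a(i,j) = 0.25 * (b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1))
      END FORALL
      ! Masking operation
      WHERE (a > 1.5) a = 1.5
      PRINT *, 'Global sum = ', SUM(a)      ! Fortran 90/95 reduction intrinsic
    END PROGRAM hpf_sketch

The compiler, not the programmer, generates whatever communication the BLOCK distribution implies; since the directives are comments, the source remains a legal serial Fortran 95 program when they are ignored.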
While HPF does not address all issues of programming
parallel computers, it does provide a single source code version that can be used across
multiple architectures from a Cray T3E [CRI] to a cluster of workstations. For
applications appropriate to the data parallel paradigm [HPFUG] the issues of single source
code maintenance, performance tuning, and a reduced level of effort in porting to various
parallel architectures have proven to be critical advantages. In real world applications
portability now means that the same source can run on all the DMP architectures listed in
Table 2, including the Sun UltraHPC [SUN]. The most popular implementation of HPF is by
the Portland Group, Inc., [PGI], under the trade name of PGHPF [PGHPF], and is installed
at 40% of the first 100 sites of the top 500 [TOP500] HPC sites world-wide. The Portland
Group, Inc. [PGI] has a joint marketing agreement with SGI/Cray [PGHPF] and in this
release Cray's CRAFT programming model has been included [CRIHPFC, CRIHPF]. Some 80% of
PGI's compiler technology is both target-independent and language-independent which
enables rapid deployment for numerous architectures. A Transport Independent Interface
acts as a kernel of the PGHPF runtime system and has been efficiently implemented (in a
way transparent to the user) on top of message passing paradigms such as PVM [PVM], MPI
[MPI], and SHMEM [CRI]. As a result of this ease of retargeting, an interesting new market
trend has been the implementation for Intel Pentium [INTEL] platforms with multiple
CPUs under either LINUX or Windows NT 4.0 operating systems. This user, and
implementation base, is expected to grow even though memory bandwidth could be
problematic. Nevertheless, good scalability has been observed for quad Pentium
microprocessors [COMPAQ] when compared with DMP architectures such as the SGI Origin
[SGI], and issues of memory latency are being addressed with new workstation technologies
[SGIV].
The PGHPF model has low overhead for start-up of parallel regions through features such
as:
- Threads created at program start-up and re-cycled for each
parallel loop or region,
- Parallel regions implemented as inline assembly code
segments and not as subroutine calls,
- A parallelizer integrated with the other internal
compilation phases such as global optimization, vectorization, memory hierarchy
optimization, communication optimization, and interprocedural analysis.
The PGHPF implementation is also designed to be
interoperable with, and complementary to, MPI (see Section 4.3) and it is possible to call
routines which perform MPI message-passing as HPF_LOCAL extrinsics. This provides a way of
transitioning from a global name space model to a local model at a subroutine boundary.
Within an HPF_LOCAL routine the code has essentially the same view of the data that an MPI
program has and can do MPI message passing between the processors. This minimizes the
level of effort in coding message passing since the actual amount of code in MPI is
usually very small in relation to the whole application implemented in HPF. This hybrid
approach usually resolves problems of declining scalability.
Actual performance improvement is dependent on the application. The PGHPF implementation
[PGI, PGHPF] on the Cray T3E out-performs MPI in three of the NAS Parallel Benchmarks up to
128 PEs [NPB]. Several large scale models and a 3D reservoir model show either good
scaling, or results that are within a factor of 1.4 of an MPI implementation on an SGI Origin [SGI].
Scaling results for HPF implementations of two large-scale applications, The Princeton
Ocean Model (POM) and the RIEMANN code, have been studied. For the POM acceptable scaling
is observed up to 16 PEs on either a Cray T3E or the SGI Origin 2000, while for the
RIEMANN code scalability is exceptionally good on either the Origin 2000 or the
Cray T3E, to 128 or 256 PEs, respectively [BAL].
4.3 Message Passing Interface
The message passing programming model is dominated by the
Message Passing Interface (MPI) standard [MPI, MPIG, MPIP], although a predecessor, PVM:
Parallel Virtual Machine [PVM], is in common use. An effort is underway [PVM] to create a
new standard, PVMPI, that combines the virtual machine features of PVM and the message
passing features of MPI. The discussion here will center around MPI which has become
popular because it often provides superior scalability behaviour when compared to the HPF
paradigm. Furthermore, in a distributed memory architecture (such as the Cray T3E) it is
the only alternative to HPF if (ostensibly) portable parallel code is needed that is not
tied to a vendor specific message passing library. Nevertheless, MPI is available in
multiple implementations both public domain and vendor specific [MPIT], but portability
across implementations is not always transparent to the end user.
The MPI standard defines a library of functions and procedures that implement the message
passing model to control passing of data and messages between processes in a parallel
application. The fundamental idea is the concept of a computational task which is assigned
to one or more PEs. MPI is general and flexible and allows for explicit exchange of data
and synchronization between PEs, either individually, or in groups, but without the
assumption of a global address space shared by all processes. Features in the MPI model
allow for precise optimization of communication performance not otherwise possible in
either HPF or OpenMP. However, due in part to this greater flexibility, the MPI model is
also a difficult and labor-intensive way to write parallel code because data structures
must be explicitly partitioned. As a result the entire application must be parallelized to
work with the partitioned data, and all synchronization between PEs is then the
responsibility of the programmer. In the message passing model there is no incremental
path to parallelizing a new (or pre-existing) code - the whole code (or code segment) must
be rewritten. However, another initiative [HPFMPI] has proposed a standard set of
functions for coupling multiple HPF tasks to form task-parallel computations. This hybrid
could combine the ease of use characteristic of HPF with the communication performance
advantages of the message passing paradigm.
The MPI 1.1 standard proposed the following main features:
- Point-to-point communications routines,
- Routines for collective communication between groups of
processes,
- Communication context providing design support of safe
parallel libraries,
- Specification of communication topologies,
- Creation of derived datatypes describing messages of
non-contiguous data.
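The first two of these features can be sketched with a minimal Fortran program; the variable names and message tag below are illustrative only, and only MPI 1.1 library calls are used:

    ! A minimal MPI sketch of point-to-point and collective communication.
    PROGRAM mpi_sketch
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: ierr, rank, nprocs, status(MPI_STATUS_SIZE)
      REAL :: local, received, total
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      local = REAL(rank + 1)
      ! Point-to-point: PE 1 sends its value to PE 0 with message tag 99
      IF (nprocs > 1) THEN
        IF (rank == 1) THEN
          CALL MPI_SEND(local, 1, MPI_REAL, 0, 99, MPI_COMM_WORLD, ierr)
        ELSE IF (rank == 0) THEN
          CALL MPI_RECV(received, 1, MPI_REAL, 1, 99, MPI_COMM_WORLD, status, ierr)
        END IF
      END IF
      ! Collective communication: sum the local values onto PE 0
      CALL MPI_REDUCE(local, total, 1, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      IF (rank == 0) PRINT *, 'Sum over all PEs = ', total
      CALL MPI_FINALIZE(ierr)
    END PROGRAM mpi_sketch

Even in this small example the programmer carries the full burden of matching sends to receives and choosing a root PE for the reduction, which is the level-of-effort cost noted above.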
Because of the lack of a standard way to start MPI tasks on
separate hosts, MPI 1.1 applications are not always portable across workstation networks.
To correct this (and other) problems the MPI-2 specification has been completed, and adds
120 functions to the 128 in the MPI-1 API specification. New features will include:
- MPI_SPAWN function to start MPI and non-MPI processes (i.e.
dynamic process creation),
- One-sided communication functions (e.g. put/get),
- Nonblocking collective communication functions,
- Language bindings for C++.
Some basic features of MPI are summarized in Table 4.
Table 4: MPI features |
MESSAGES & PROCESSES |
Local memory has one (or more) process(es)
associated with it. A message is made up of a body and an envelope, and is the only means
by which processes can access data in local memories or synchronize. Processes can
cooperate in groups to perform common tasks and each process can opt to read a message
based on the envelope contents. Processes are ranked (have a numerical identity) but this
may be assigned by the user to suit the virtual topology appropriate to the task which
need not correspond to the physical hardware connection scheme. In MPI a
"communicator" is always associated with a process group (task), a communication
context, and a virtual topology. |
DATATYPES |
MPI can use the predefined data types of the
host language (C or Fortran) but extends these by allowing the user to construct derived
data types so that both contiguous and noncontiguous data can be defined. MPI message
library functions include a data type argument and the user has considerable flexibility
in choosing data types suitable to virtual topologies or computational tasks. |
COMMUNICATIONS |
MPI is rich in communications options which
fall into two basic types: point-to-point or collective. Both types require specification
of a start address, length, and data type, followed by message envelope parameters. In
point-to-point communications send and receive parameters must match to ensure safe
transmission and allow both blocking and nonblocking transmission. A communication
procedure is blocking if it does not return before it is safe for a process to re-use the
resources identified in the call. Collective communications are valuable in either data
redistribution (broadcast, gather/scatter, etc.) or computation (minimum, maximum,
reduction, etc.). |
SPECIAL FEATURES |
- Support for user-written libraries that are independent of
user code and interoperable with other libraries,
- Support for heterogeneous networks of computers with
different data formats,
- Communications between non-overlapping groups of processes
(or separate tasks),
- Collective communication extended to all-to-all type,
- Simplification of often repeated point-to-point
communication procedures by caching of "persistent" call sequences.
|
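As a hedged illustration of the DATATYPES and point-to-point entries in Table 4 (the array size and tag are arbitrary), a derived datatype built with MPI_TYPE_VECTOR can describe noncontiguous data such as one row of a Fortran array, which is strided in column-major storage:

    ! A minimal sketch of an MPI derived datatype for noncontiguous data.
    PROGRAM mpi_rowtype
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER, PARAMETER :: n = 8
      INTEGER :: ierr, rank, nprocs, rowtype, status(MPI_STATUS_SIZE)
      REAL :: a(n,n)
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      a = REAL(rank)
      ! One row of a(n,n): n elements, blocks of length 1, separated by stride n
      CALL MPI_TYPE_VECTOR(n, 1, n, MPI_REAL, rowtype, ierr)
      CALL MPI_TYPE_COMMIT(rowtype, ierr)
      IF (nprocs > 1) THEN
        ! PE 0 sends its first row to PE 1, which receives it into its own first row
        IF (rank == 0) THEN
          CALL MPI_SEND(a(1,1), 1, rowtype, 1, 10, MPI_COMM_WORLD, ierr)
        ELSE IF (rank == 1) THEN
          CALL MPI_RECV(a(1,1), 1, rowtype, 0, 10, MPI_COMM_WORLD, status, ierr)
        END IF
      END IF
      CALL MPI_TYPE_FREE(rowtype, ierr)
      CALL MPI_FINALIZE(ierr)
    END PROGRAM mpi_rowtype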
Originally the MPI standard omitted specification of
debugging or performance profiling tools, although a standard profiling interface is provided. This
lack of powerful graphical tools has, in part, been compensated for by individual software
developers who have developed tracing tools such as Vampir/Vampirtrace [PAL], while other
groups support development of parallel tools [PTC].
4.4 OpenMP
The OpenMP application program interface (API) supports
multi-platform, shared-memory parallel (SMP) programming, on Unix and Microsoft Windows
NT platforms. OpenMP [OPENMP] partners include major computer hardware vendors
[COMPAQ, DEC, HP, IBM, INTEL, SGI, SUN]. The OpenMP model has also been endorsed by
key applications developers; instrumental in the development of OpenMP has been KAI
Software, a division of Intel Americas, Inc. [KAI].
OpenMP has been developed as a portable scalable model that gives SMP programmers a simple
but flexible interface for developing parallel applications across a range of platforms. A
white paper on a proposed standard was drafted [OPENMP] and an OpenMP Architecture Review
Board (OARB) was established and is undergoing incorporation. The OARB will provide long
term support and enhancements of specifications, develop future standards, address issues
of validation for implementations, and promote OpenMP as a de facto standard. OpenMP is
defined for Fortran, C, and C++ applications. OpenMP for Fortran has at its core a set of
standard compiler directives to enable expression of SMP parallelism. Unlike message
passing, or vendor sets of extensions for parallel software directives, OpenMP is portable
enabling creation of a single source for multiple platforms.
In the shared memory model every processor has access to the memory of all other
processors and the programmer can express parallelism through shared/private data
allocation. The lack of a portable standard has been a major reason for the limited
development of this model. The result in the past has been that different vendors provide
proprietary parallel extensions to Fortran or C. This situation has led programmers to
opt for a message passing model such as MPI or PVM for reasons such as portability or
performance. As a result it is a commonly held belief that scalability in parallel
software is only possible with a message passing paradigm. With the emergence of cache
coherent parallel architectures, or scalable shared memory parallel hardware, software
scalability is easily achieved with a shared memory model. OpenMP has been proposed with
the view that it can provide a model for incremental parallelization of existing code as
well as scalability of performance.
At the simplest level OpenMP is a set of standardized compiler directives and runtime
library routines that extend a base programming language such as Fortran, C, or
C++, to express shared memory parallelism. Directives are common in vendor-specific
parallel implementations (for example in Cray autotasking), but in OpenMP they are not
implementation-specific, and are therefore portable. OpenMP has features that are new and
differ from the coarse-grain parallel models (e.g. Cray autotasking for a parallel loop).
In OpenMP a parallel region may contain calls to subroutines that contain DO loops which
are lexically invisible to the parallel directive in the calling routine. The
directives on such DO loops are examples of orphaned directives and synchronization control can
be performed inside the called routine. This OpenMP feature enables successful
parallelization of nontrivial coarse grain parallel applications without the need of
moving the DO loops into the calling routine to make them visible to the parallel region
directive.
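A minimal sketch of this feature follows; the routine and array names are assumed for illustration only. The !$OMP DO inside the called routine is orphaned: it binds to the parallel region opened in the caller even though it is lexically outside it:

    ! A minimal OpenMP sketch of an orphaned work-sharing directive.
    PROGRAM orphan_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 1000
      REAL :: a(n)
    !$OMP PARALLEL SHARED(a)
      CALL update(a, n)          ! the parallel loop lives inside update()
    !$OMP END PARALLEL
      PRINT *, 'a(n) = ', a(n)
    END PROGRAM orphan_sketch

    SUBROUTINE update(a, n)
      IMPLICIT NONE
      INTEGER :: n, i
      REAL :: a(n)
    !$OMP DO                     ! orphaned: binds to the enclosing parallel region
      DO i = 1, n
        a(i) = SQRT(REAL(i))
      END DO
    !$OMP END DO
    END SUBROUTINE update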
The four design categories of the OpenMP standard are briefly described in Table 5.
Table 5: Four design categories of the OpenMP standard |
CONTROL STRUCTURES
(defining parallel/nonparallel iterations/regions) |
The design goal here is the
smallest set possible with inclusion only for those cases where the compiler can provide
both functionality and performance over what a user could reasonably code. Examples are
PARALLEL, DO, SINGLE with a sentinel !$OMP. |
DATA ENVIRONMENT
(scoping of data, or global objects such as threads) |
Each process has an associated
(unique) data environment with the objects having one of three basic attributes: SHARED,
PRIVATE, or REDUCTION. The last is used to specify a reduction construct which may be
differently computed on different architectures. The THREADPRIVATE directive makes global
objects private by creation of copies of the global object (one for each thread). |
SYNCHRONIZATION
(defining barriers, critical regions, etc.) |
Implicit synchronization points
exist at the beginning and end of PARALLEL directives, and at the end of control
directives (e.g. DO or SINGLE), but can be removed with a NOWAIT parameter. Explicit
synchronization directives (e.g. ATOMIC) allow the user to tune synchronization in an
application. All OpenMP synchronization directives may be orphaned. |
RUNTIME LIBRARY AND ENVIRONMENT
VARIABLES |
A callable runtime library (RTL)
and accompanying environment variables include functions such as query, runtime, and lock
routines. The programmer may set the number of threads in parallel regions, or when to
enable/disable nested parallelism. |
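A brief sketch (with illustrative array names) touches all four categories at once: a control structure (PARALLEL with DO and SINGLE), data environment clauses, a NOWAIT synchronization modifier, and a runtime library call:

    ! A minimal sketch of the four OpenMP design categories in Table 5.
    PROGRAM categories_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 1000
      INTEGER :: i
      INTEGER, EXTERNAL :: OMP_GET_NUM_THREADS
      REAL :: a(n), b(n)
    !$OMP PARALLEL SHARED(a, b) PRIVATE(i)
    !$OMP DO
      DO i = 1, n
        a(i) = REAL(i)
      END DO
    !$OMP END DO NOWAIT          ! remove the implicit barrier after this loop
    !$OMP DO
      DO i = 1, n
        b(i) = 2.0 * REAL(i)
      END DO
    !$OMP END DO                 ! implicit barrier retained here
    !$OMP SINGLE
      PRINT *, 'Threads in region: ', OMP_GET_NUM_THREADS()
    !$OMP END SINGLE
    !$OMP END PARALLEL
      PRINT *, a(n) + b(n)
    END PROGRAM categories_sketch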
A particularly attractive implementation of OpenMP is
the KAP/Pro Toolset by KAI Software, a division of
Intel Americas, Inc. [KAI]. This implementation is rich in graphical
interfaces and has three major components: Guide (parallelizer), Guideview (graphical
performance profiler), and Assure (parallel
code verification). The KAP/Pro Toolset is specifically targeted for SMP
architectures (or node clusters) with interoperability to MPI for inter-node communication
[BOVA]. An important distinction made by KAI is that their product analyzes the dynamic
performance of the application and is not a static source analysis [APRI]. Results for
scaling with the KAP/Pro Toolset implementation of MM5 for either the SGI
Origin 2000 or the DEC Alpha Server 8400 (under Windows NT) are excellent.
4.5 Comparing HPF, MPI,
PVM, and OpenMP
In this section the relative merits of HPF, MPI, PVM, and
OpenMP are assessed based on end-user experience and designer comparisons.
Key reasons users prefer HPF over message passing models include:
- The relative speed with which large scale applications can
be parallelized,
- It is a good approach to making parallel programming
simpler,
- The code is transparent to the scientific end-user and
algorithmic content is not obscured,
- Benefit of maintaining one source usable by both working
scientists and performance analysts,
- Portability is a key in smooth transitions among existing
and new architectures,
- Ease with which code can be maintained,
- A single address space across the physical processors, and
- Data mapped in a single computational node are accessible
through "extrinsic local" procedures written in Fortran 77/95, or C.
Table 6 lists positive and negative characteristics of HPF
identified by developers.
Table 6: Characteristics of HPF identified by developers
Positive | Negative
Productivity | Too rich in language specifications
Performance | Data duplication across processors
Portability | Scope limited to data parallel model
Maintenance | Problems with scalability
Opinions vary as to successes with HPF versus MPI [NSF] and
at some sites [PSC] users have abandoned the former in favor of the latter. One criticism
of the current PGHPF implementation has been that while it is compliant with HPF 1.1, it
is necessarily restricted to regular data distributions. The proposed HPF 2 standard would
resolve this problem by allowing dynamically distributed data and irregular
communications. However, it should be noted that work-arounds are often possible even
under HPF 1 [DMM].
Some evaluations find that scalability results are more limited with HPF compared to
message passing for the same architectures [NSF]. Others find good scalability results
[DMM] and both positions have been repeatedly confirmed. The only conclusion on the
scalability issue is that results are application dependent.
A comparison of HPF and OpenMP shows some overlap in functionality, although the latter has
a much richer and more flexible process control structure. Therefore the
OpenMP paradigm shares many of the advantages listed above for HPF: readable code,
portability, ease of parallelization and code maintenance. Most important is that OpenMP
allows for the incremental parallelization of a whole application without an immediate
need for a total rewrite. A simple example would be the incremental parallelization of
loops in a large application. These would be declared as parallel regions, and variables
appropriately scoped, as in a fork/join execution model. No data decomposition is needed,
nor does the number of PEs need to be specified, since this is transparent to the user.
The end of a loop is the end of a parallel region with an implicit barrier. By contrast,
in an MPI implementation, there is no globally shared data, and all data needs to be
explicitly broadcast to all PEs (creating a storage expense). Loop bounds have to be
explicitly managed by identification of each processor and the number used for the loop.
Also one PE needs to be reserved for I/O operations. While task level parallelism may have
its attractive side it does require a comprehensive book-keeping effort in managing the
task-to-PE mapping.
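As a hedged illustration of this contrast (the loop and variable names are hypothetical), the incremental OpenMP approach needs only a directive with appropriate scoping, whereas the MPI version of the same loop would require explicit partitioning of the index range and a reduction call:

    ! Incremental parallelization of one loop in a fork/join model.
    PROGRAM loop_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: n = 100000
      INTEGER :: i
      REAL :: x(n), s
      DO i = 1, n
        x(i) = REAL(i)
      END DO
      s = 0.0
    !$OMP PARALLEL DO SHARED(x) PRIVATE(i) REDUCTION(+:s)
      DO i = 1, n
        s = s + x(i) * x(i)
      END DO
    !$OMP END PARALLEL DO
      PRINT *, 'Sum of squares = ', s      ! implicit barrier ends the parallel region
    END PROGRAM loop_sketch

No data decomposition or PE count appears in the source; removing the directive recovers the original serial loop.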
The MPI standard [MPI, MPIF] has established a widely accepted and efficient model of
parallel computing with new definitions in the areas of process group, collective
communications, and virtual topologies. MPI code is portable across multiple platforms and
allows for the development of portable application and library software. Such libraries
are useful where there is a need for standard parallel functionality, as in adapting
finite difference models to parallel computers, with block domain decomposition and
parallel I/O. If the programmer chooses a message passing approach this is labor-intensive
because it is a huge undertaking to build parallel libraries which anticipate and
incorporate the needs of a wide variety of applications. However, whereas it may take
months to build a library the first time it is needed, it is re-usable and thereafter is
probably no harder to use than a compiler directive, because the programmer needs only to
substitute library calls, define and assign a few new variables, and link the code with
the parallel library.
The relative merits of PVM versus MPI have been investigated [PVM]. MPI is popular because
of high communication performance on a given platform, but this is at the cost of some
features. One is the lack of interoperability between different MPI implementations so
that one vendor's MPI implementation cannot send messages to another vendor's MPI
implementation. At present there are some five different public domain implementations of
MPI in addition to vendor-specific versions. The MPI standard allows portability in that
an application developed on one platform can be compiled and executed on another. However,
unlike PVM, MPI executables compiled on different architectures need not be able to
communicate with each other because the MPI standard does not require heterogeneous
communication. On the question of portability, PVM is superior in that it "contains
resource management and process control functions that are important for creating portable
applications that run on clusters of workstations and MPPs" [PVM]. Even when MPI is
used in a vendor specific implementation, the performance achieved can still be
considerably lower than that possible with the vendor's proprietary message passing
protocols.
Another difference between PVM and MPI is language interoperability. Whereas a PVM
application can exchange messages between C and Fortran codes, the MPI standard does not
require this, even on the same platform. While MPI can be used with FORTRAN 77 code it
does not offer the level of integration of either HPF or OpenMP. As one example, MPI does
not take advantage of the Fortran 90/95 array syntax.
A further deficit in MPI is the lack of a feature set to support writing of fault tolerant
applications: "The MPI specification states that the only thing that is guaranteed
after an MPI error is the ability to exit the program" [PVM]. In this respect PVM can
be used in large heterogeneous clusters for long run times even when hosts or tasks fail.
Graphical interfaces vary considerably in quality between the HPF, MPI, and OpenMP
paradigms. While the MPI standard does specify a profiling interface, graphical
profilers are rare and not used in everyday applications. In the case of HPF an
application profiler, PGPROF, provides statistics on execution time and function calls in
a graphical interface. Both HPF and MPI code can be debugged using the TotalView
[DOL] multiprocessor debugger which is commonly available. TotalView has an
intuitive graphical interface that allows management and control of multiple processes
across languages (C, C++, FORTRAN) either on multiprocessor systems or distributed over
workstation clusters. By far the richest parallel interactive graphical user
interfaces are to be found in KAI's OpenMP implementation in the KAP/Pro
Toolset. Performance visualization and
tuning is facilitated by the GuideView graphical interface which shows what each processor is doing at
various levels of detail. Guideview provides interactive identification of source location
for performance bottlenecks and prioritized remedial actions. Similarly, the AssureView
graphical interface works with the Assure tool
for automatic parallel error detection and parallel code validation. Such features promise
drastic reductions in level-of-effort for debugging of parallel code because much of that
effort is shifted to the application environment and the platform.
5.0 Measuring efficiency in
parallel performance
For future reference this section summarises some basic
parallel performance metrics [NSF]. A detailed discussion of this subject can be found in
specialized monographs [GEL, KUCK]. Table 7 summarises the simplest scalability criteria
used to measure performance versus increasing number of PEs. These metrics are in common
use and are critical in assessing a successful parallel implementation on one
architecture.
Table 7: Parallel Computing Scalability Criteria
SPEEDUP = SERIAL TIME / PARALLEL TIME = TIME ON 1 PE / TIME ON N PEs
PARALLEL EFFICIENCY = SPEEDUP / N
The ideal situation corresponds to linear scaling when SPEEDUP = N and PARALLEL
EFFICIENCY = 1.
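As a worked example with illustrative figures, an application that takes 100 seconds on 1 PE and 8 seconds on 16 PEs has SPEEDUP = 100 / 8 = 12.5 and PARALLEL EFFICIENCY = 12.5 / 16 = 0.78.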
Overall performance in parallel computers is affected by having both good communication
performance between PEs and good computation performance on each PE. Assuming
communication time increases linearly with message size, then communication performance is
a combination of:
- Latency, or minimum communication time (seconds)
- Bandwidth, or asymptotic communication rate (MB/sec)
One metric of communication to computation balance achievable in a given parallel
architecture is:
BALANCE = BANDWIDTH (MB/sec) / PROCESSOR PEAK SPEED (Mflops)
where, for a floating point intensive application,
computational SPEED (more correctly rate) is measured in million floating point
operations per second (Mflops).
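As a worked example with illustrative figures, a PE with an interconnect bandwidth of 150 MB/sec and a peak speed of 600 Mflops has BALANCE = 150 / 600 = 0.25 MB per Mflop.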
Table 8: Macroperformance Metrics for an Application and Architecture Combination
CLOCK PERIOD = PROCESSOR CLOCK CYCLE TIME
EFFICIENCY = ACTUAL PERFORMANCE / PEAK PERFORMANCE
COST EFFECTIVENESS = PRICE ($) / PERFORMANCE (Mflops)
ABSOLUTE PERFORMANCE = 1 / TIME ON N PEs
When comparing the same application on different
architectures (for a fixed problem size) the appropriate comparative scalability criteria
include:
- SPEEDUP / Mflops
- PARALLEL EFFICIENCY / ( Mflops / PE)
Successfully scaled problem sizes often lead to greatly
enhanced COST EFFECTIVENESS over a serial solution. In studying scalability it is
important to distinguish fixed and scaled problem sizes. With a fixed problem size the
same problem is distributed over an increasing number of PEs. With increasing N this
eventually leads to a decrease in PARALLEL EFFICIENCY because smaller data
partitions per PE imply an increase in communication costs between PEs relative to the
amount of computation time. A scaled problem size seeks a homogeneous and optimal
distribution of data per PE while minimizing the relative communication costs.
The discrepancy between the ideal value of 1 and the actual PARALLEL EFFICIENCY
achieved is measured by
PARALLEL INEFFICIENCY = 1 - PARALLEL EFFICIENCY
                      = COMMUNICATION OVERHEAD + LOAD IMBALANCE + SERIAL OVERHEAD
with
COMMUNICATION OVERHEAD = (maximum time spent in communication among all PEs) / (total TIME ON N PEs)
and
LOAD IMBALANCE = { T(max) - T(avg) } / T(avg)
where
T(avg) = { T(1) + T(2) + ... + T(N) } / N
T(max) = max { T(i) }, i = 1,...,N
with T(i) the computation time on processor i.
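As a worked example with illustrative figures, computation times of 10, 12, 14, and 12 seconds on N = 4 PEs give T(avg) = 12, T(max) = 14, and LOAD IMBALANCE = (14 - 12) / 12 = 0.17.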
The SERIAL OVERHEAD is usually not as significant as the other two terms
contributing to PARALLEL INEFFICIENCY but can be estimated with the following
approach. If T(S) is the serial (uniprocessor) time, then T(S)/N is the
parallel time in the ideal case. The difference between T(S)/N and the parallel
time, T(P), is the "overhead" time of parallel execution:
PARALLEL OVERHEAD = T(P) - T(S)/N
Subtracting the total communication time (which can be measured) and the load imbalance
estimate gives an estimate of the SERIAL OVERHEAD as:
NET PARALLEL OVERHEAD = T(P) - T(S)/N - communication time - { T(max) - T(avg) }
Values of PARALLEL EFFICIENCY >
0.5 are considered acceptable [NSF], and values close to 1 are common for N < 10.
However, as N increases PARALLEL EFFICIENCY will diverge from the ideal value of
1 to an increasing extent, until it asymptotes to a constant value and eventually, for
sufficiently large N, decreases [KUCK]. This phenomenon is the empirical result of
mismatch between problem (data) size, processor cache size, PE count, and over-all
communication efficiency of the architecture. The smaller the group of processors assigned
to independent tasks the higher the parallel efficiency tends to be. Higher parallel
efficiency corresponds to a higher computation to communication ratio. Often the study of
scaling behavior is specific to a fixed problem size and this can be deceptive. Kuck
[KUCK] (see his Figs. 5.6 and 6.7) shows that a "sweet spot" is defined by a
surface mapped out by the sequence of SPEEDUP versus N curves for successively larger
problem sizes. This surface is unique to each application-architecture combination and no
generalizations may apply. Therefore caution is advised when evaluating parallel
performance on a specific architecture with either a fixed problem size, or one
application.
6.0 Summary
Developers should plan for transition of existing models to
future parallel architectures and evaluate the software options with respect to
suitability to this task based on the criteria of portability, usability, and scalability.
Such a plan should have focal points such as the following:
- Portability: A major NASA/NSF
report [NSF] found that the typical cost of parallelizing serial production code is 0.5 to
1 person months per 1000 lines of code.
- Usability: At sites with a large
base of legacy serial code there is a pent-up demand for simpler parallelization
strategies that do not require a complete rewrite of the code as the first step.
- Scalability: For SMP models message
passing is unnecessary and overly restrictive and the OpenMP paradigm provides a promising
solution for scalable parallelism on multiprocessor clusters.
Prototyping with easy-to-use parallelizing software, such as HPF and OpenMP based tools, provides input to a decision-making process on the
advisability of launching a larger effort with a message passing implementation. For large
applications such parallel prototyping may be the only way of determining the potential
for scalability when there is a pressing need to port parallel applications from a
single source to either single or clustered SMP nodes.
These issues will be the subjects of discussion in future
HiPERiSM Consulting, LLC, Newsletters and Technical Reports.
7.0 Citation Index
Legend |
Citation |
APRI |
Applied Parallel Research, Inc.,
http://www.apri.com. |
ASET |
Association for Super-Advanced
Electronics Technologies, http://www.aset.or.jp. |
ALPHA |
Alpha Processor, Inc., http://www.alpha-processor.com. |
AVA |
Avalon Alpha Beowulf cluster http://cnls.lanl.gov/avalon. |
BADER |
Parascope: A List of Parallel
Computing Sites, http://www.computer.org/parascope. |
BAL |
Scalability results of the RIEMANN
code by Dinshaw Balsara, http://www.ncsa.uiuc.edu/SCD/Perf/Tuning/mp_scale/
|
BOVA |
S. W. Bova et al. Parallel
Programming with Message Passing and Directives,preprint. |
COMPAQ |
Compaq Computer Corporation, http://www.Compaq.com, Compaq Pro 8000,
http://www.Compaq.com/products/workstations/pw8000/index.html |
CRI |
Cray C90 and T3E http://www.cray.com/products. |
CRIHPFC |
Cray release of PGI HPF_CRAFT for
the Cray T3E http://www.sgi.com/newsroom/press_releases/1997/july/cray_pgi_release.html. |
CRIHPF |
http://www.sgi.com/newsroom/press_releases/1997/august/crayupgrade_release.html. |
DEC |
COMPAQ DIGITAL Products and
Services, http://www.digital.com, Digital Alpha
Server 8400, http://www.digital.com/alphaserver/products.html |
DOL |
Dolphin Interconnect Solutions,
Inc. http://www.dolphinics.com. |
DMM |
L. Dagum, L. Meadows, and D.
Miles, Data Parallel Direct Simulation Monte Carlo in High Performance Fortran, Scientific
Programming, (1995). |
FTN90 |
Jeanne C. Adams, Walter S.
Brainerd, Jeanne T. Martin, Brian T. Smith, and Jerrold L. Wagener, Fortran 90 Handbook:
Complete ANSI/ISO Reference , Intertext Publications/Multiscience Press, Inc., McGraw-Hill
Book Company, New York, NY, 1992. |
FTN95 |
Jeanne C. Adams, Walter S.
Brainerd, Jeanne T. Martin, Brian T. Smith, and Jerrold L. Wagener, Fortran 95 Handbook:
Complete ISO/ANSI Reference, The MIT Press, Cambridge, MA, 1997. |
GEL |
Erol Gelenbe, Multiprocessor
Performance, Wiley & Sons, Chichester England, 1989. |
GUNTER |
List of the world's most powerful
computing sites, http://www.skyweb.net/~gunter. |
HAM |
S. Hamilton, Semiconductor
Research Corporation, Taking Moore's Law Into the Next Century, IEEE Computer, January,
1999, pp. 43-48. |
HIT |
Hitachi, http://www.hitachi.co.jp/Prod/comp.hpc/index.html. |
HP |
Hewlett-Packard Company, HP
Exemplar http://www.enterprisecomputing.hp.com |
HPFD |
Scientific Programming, Vol. 2, no.
1-2 (Spring and Summer 1993), pp. 1-170, John Wiley and Sons |
HPFF |
High Performance Fortran Forum, http://www.crpc.rice.edu/HPFF/index.html |
HPFH |
Charles H. Koelbel, David B.
Loveman, Robert S. Schreiber, Guy L. Steele, Jr., and Mary E. Zosel, The High Performance
Fortran Handbook, The MIT Press, Cambridge, MA, 1994. |
HPFMPI |
Task Parallelism and Fortran,
HPF/MPI: An HPF Binding for MPI, http://www.mcs.anl.gov/fortran-m. |
HPFUG |
High Performance Fortran (HPF)
User Group, http://www.lanl.gov/HPF |
IBM |
IBM, Inc., http://www.ibm.com, http://www.rs6000.ibm.com/hardware/largescale/index.html. |
IDC |
Christopher G. Willard,
Workstation and High-Performance Systems Bulletin: Technology Update: High-Performance
Fortran, International Data Corporation, November 1996 (IDC #12526, Volume:
2.High-performance Systems, Tab: 6.Technology Issues). http://www.idc.com. |
INTEL |
Intel Corporation, http://www.intel.com. |
KAI |
KAI Software, a division of Intel
Americas, Inc., http://www.kai.com. |
KUCK |
David J. Kuck, High Performance
Computing, Oxford University Press, New York, 1996. |
LANL |
Los Alamos National Laboratory,
Loki - Commodity Parallel Processing, http://loki-www.lanl.gov/index.html. |
MED |
Micro-Electronics Development for
European Applications, http://www.medea.org. |
MESS1 |
P. Messina, High Performance
Computers: The Next Generation (Part I), Computers in Physics, vol. 11, No. 5 (1997),
pp.454-466. |
MESS2 |
P. Messina, High Performance
Computers: The Next Generation (Part II), Computers in Physics vol. 11, No. 6 (1997),
pp.598-610. |
MM5 |
MM5 Version 2 Timing Results, http://www.mmm.ucar.edu/mm5. |
MOR |
Moore's Law http://webopedia.internet.com/TERM/M/Moores_Law.html. |
MPI |
The Message Passing Interface
(MPI) standard, http://www.mcs.anl.gov/mpi/index.html. |
MPIF |
MPI Forum. MPI: A Message-Passing
Interface Standard, International Journal of Supercomputer Applications, Vol. 8, no. 3/4
(1994), pp. 165-416. |
MPIG |
William Gropp, Ewing Lusk, and
Anthony Skjellum, Using MPI - Portable Parallel Programming with the Message-Passing
Interface, The MIT Press, Cambridge, MA, 1994. |
MPIP |
Peter S. Pacheco, Parallel
Programming with MPI, Morgan Kaufman Publishers, Inc., San Francisco, CA, 1997. |
MPIT |
MPI Software Technology, Inc.,
http://www.mpi-softtech.com. |
NASA |
NASA High Performance Computing
and Communications (HPCC) Program, Center of Excellence in Space Data and Information
Sciences (CESDIS), the Beowulf Parallel Workstation project http://cesdis.gsfc.nasa.gov/beowulf. |
NSF |
W. Pfeiffer, S. Hotovy, N.A.
Nystrom, D. Rudy, T. Sterling, and M. Straka, JNNIE: The Joint NSF-NASA Initiative on
Evaluation (of scalable parallel processors), July 1995, http://www.tc.cornell.edu/JNNIE/jnnietop.html. |
NTMAG |
A. Sakovich, Life in the Alpha
Family, Windows NT Magazine, January, 1999, http://www.ntmag.com. |
NEC |
NEC, Supercomputer SX-4 Series, http://www.hpc.comp.nec.co.jp/sx-e/Products/sx-4.html. |
NPB |
NAS Parallel Benchmarks, http://science.nas.nasa.gov/Software/NPB. |
OPENMP |
OpenMP: A Proposed Industry
Standard API for Shared Memory Programming, http://www.openmp.org. |
PAL |
Pallas, GmBH, http://www.pallas.de, MPI visualization tool http://www.pallas.de/pages/vampir.htm,
MPI profiling/performance monitor, http://www.pallas.de/pages/vampirt.htm. |
PGHPF |
PGHPF description for Cray
systems, http://www.sgi.com/Products/appsdirectory.dir/DeveloperIXThe_Portland_Group.html. |
PGI |
The Portland Group, Inc., http://www.pgroup.com |
PSC |
The Pittsburgh Supercomputing
Center, http://www.psc.edu. |
PTC |
The Parallel Tools Consortium, http://www.ptools.org. |
PVM |
PVM: Parallel Virtual Machine, http://www.epm.ornl.gov/pvm. |
SEL |
Semiconductor Leading Edge
Technologies, Inc., http://www.selete.co.jp. |
SGI |
Silicon Graphics, Inc.,
http://www.sgi.com, Cray Origin 2000, http://www.sgi.com/origin2000, |
SGIV |
Silicon Graphics, Inc., Windows NT
workstations, http://www.sgi.com/visual. |
SRC |
Semiconductor Research
Corporation, http://www.src.org/areas/design.dgw. |
STARC |
Semiconductor Technology Academic
Research Center, http://www.starc.or.jp. |
ST1 |
C. H. Still, Portable Parallel
Computing Via the MPI1 Message Passing Standard, Computers in Physics, 8 (1994), pp.
553-539. |
ST2 |
C. H. Still, Shared-Memory
Programming With OpenMP, Computers in Physics, 12 (1998), pp. 577-584. |
SUN |
SUN Microsystems, Inc., http://www.sun.com. |
TRS |
Technology Roadmap for
Semiconductors, http://notes.sematech.org/ntrs/Rdmpmem.nsf. |
TOP500 |
TOP500 Supercomputer Sites, http://www.netlib.org/benchmark/top500.html |
HiPERiSM Consulting, LLC, (919) 484-9803
(Voice)
(919) 806-2813 (Facsimile)