HiPERiSM - High Performance Algorism Consulting
2006: Clusters and Industry Trends - The multi-core revolution
All trademarks mentioned are the property of the owners and opinions expressed here are not necessarily shared by them. HiPERiSM Consulting, LLC, will not be liable for incidental or consequential damages in connection with, or arising out of, the furnishing, performance, or use of the products or source code discussed in these reports.
The changes in the industry call for some personal observations. HiPERiSM's newsletters are few and far between because we wait for the next revolution in commodity computing. In the interim we remain focused on our prime services: training, software sales and consulting for the High Performance Computing community working with clusters of computing nodes. The next revolution in commodity hardware has arrived so we take the opportunity to make some comments on this and a few other topics of current interest to us. We feel motivated to do this in part after the experience of converting the U.S. EPA's Community Multiscale Air Quality (CMAQ) model to hybrid OpenMP+MPI mode and delivering the threaded version to the U.S. EPA.
- George Delic, Ph.D.
This discussion will focus on developments in hardware, software and programming practices for cluster users. The main trend in the industry since 2001 has been the rapid assimilation of Linux and commodity clusters as the operating system (OS) and hardware platform of choice for High Performance Computing (HPC). This choice does not necessarily imply that commodity hardware is the best match to HPC models and we see evidence of this in measured performance metrics. Nevertheless, the price-performance break-point is driving the technical computing market to commodity solutions. Therefore it is essential to understand the strengths and weaknesses of commodity computing solutions when it comes to HPC applications.
IDC [cf. Earl Joseph in CCS2006] has reported that technical computing had a growth rate of 24% in 2005 when it became a $9.2 billion/year industry with an 18% share of the world computer market. Of this, clusters were one third of the market in 2004 and now (2005-2006) are approximately one half of the market. The top 500 list [TOP500] shows the clear trend towards commodity processors in the market place since 2001. These developments have been driven by technical computing that has experienced an average growth rate of 21.6% per annum in the period since the year 2000. The acceptance of 64-bit commodity hardware (x86_64) has been rapid in the last two years with an accelerated decline in the shipment of hardware for the 32-bit (x86_32) platforms. This suggests that in the very near term the x86_64 platform will dominate clusters. Furthermore, with some 73% of the high-end HPC platforms using Linux, the market penetration of Linux as the OS of choice has coincided with this hardware change. These changes have been driven by the cost benefit of commodity processors summarized in Table 1.
Meanwhile, and since we started operations, the richness and choice of features in compiler technology for commodity hardware has grown enormously. When HiPERiSM started in the mid-1990's, Linux was hardly known as a commodity solution and compilers for this OS were limited to one (or two) commercially supported products. Now there are at least five commercially supported compiler products from vendors such as: Intel, STMicroelectronics (aka The Portland Group), Lahey, NAG, and Absoft. In addition there are also compilers for the Apple Mac OS (OSX) such as those from Intel (recently), Absoft, and IBM, although (as of fall 2005) support for the latter has been discontinued for Mac OSX. Furthermore, whereas OpenMP, as a shared memory parallel (SMP) paradigm, was an esoteric add-on to the DEC Visual Fortran compiler in the mid-1990's, it is now supported by all the compilers currently available commercially for Linux and Mac OS.
In the following sections we revisit some of the topics covered over seven years ago in the previous newsletter. To make these more relevant we cite the examples of real-world code in use in the Air Quality Modeling (AQM) community that we have been servicing during this period. AQM models have followed the path of industry trends and have migrated from proprietary HPC platforms to commodity clusters. In our experience AQM models have much to gain from commodity technology trends but are largely ill-prepared to take advantage of them because of obsolete legacy code that will not enable modern compilers to use new hardware instruction sets and implement inherent task or data parallelism.
Two recent developments in commodity hardware have important consequences for software development with commodity clusters. The first of these is the proliferation (and rapid acceptance) of processor hardware that supports 64-bit memory addressing. Examples include the Intel Pentium Xeon 64EMT [INTEL], Advanced Micro Devices (AMD) Opteron [AMD], and IBM’s PowerPC G5 [IBM] processors. Each of these architectures supports 64-bit Linux kernels, and also Apple Mac OSX 10.4 (in the case of the G5). In this discussion we say little about the Itanium processor because we know of only two customers who use it. The second recent development is the availability of multi-core processors. At this time (2006) these are dual core CPU's with separate GPR's, functional units and cache hardware. Dual core processors are already in the market place, and examples of third party vendors offering such solutions with AMD dual core processors are Microway [MICRO], HPC Systems [HPCS] and SUN Microsystems [SUN].
Meanwhile Intel [INTEL] has announced a technology roadmap of processor fabrication with feature resolution as follows
Beyond 2011 Intel's research roadmap continues the trend in feature resolution from 16nm (2013) to 8nm (2017). From Table 2 the increase in density from current (2005) to future (2011) is approximately (65/22)^2 = 8.7, or nearly an order of magnitude. Power consumption, and heat output, will presumably dictate that future processors must have multiple cores. It is anticipated that developments in multi-core processors will be rapid in the next few years and by 2010 the number of cores per processor is expected to exceed 100 [cf. Jack Dongarra in LCI2006].
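The density estimate above rests on the assumption that the number of features per unit area scales as the inverse square of the feature resolution. A minimal sketch of that arithmetic follows; the function name is ours and is used for illustration only.

```python
def density_gain(current_nm, future_nm):
    """Relative increase in areal density when the feature size shrinks,
    assuming density scales as 1 / (feature size)^2."""
    if current_nm <= 0 or future_nm <= 0:
        raise ValueError("feature sizes must be positive")
    return (current_nm / future_nm) ** 2


# 65nm (2005) to 22nm (2011), as quoted in the text
print(round(density_gain(65, 22), 1))   # prints 8.7
```

The same function can be applied to the later steps of the research roadmap, e.g. density_gain(16, 8) for the step from 16nm to 8nm.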
A few other developments from Intel's HPC hardware roadmap are worth noting. The Woodcrest CPU now offers (a) four pipelines (versus the three offered previously), (b) double the SSE performance with four floating point instructions per cycle (versus the two offered previously), and (c) an L2 cache shared by the two cores. Intel quad core CPU's will become available in 2007. Furthermore, for the first time since the late 1990's Intel has designed two new half-sized (5.9" x 13") motherboards to fit two into a 1U rack solution. These are the Caretta (800MHz FSB) and the "Port Townsend" (1066MHz FSB), with support, respectively, for dual core and quad core CPUs. This suggests that dual quad core CPUs in a 1U format will arrive in less than a year.
Another hardware development of note has been the adoption of co-processor technology as an add-on to commodity processors. Examples include graphical processors [AMD] and Field Programmable Gate Array (FPGA) microprocessors [cf. David Morton, Vincent Natoli, in CCS2006]. Interest in these technologies is driven in part by the lower power requirements when compared to commodity CPUs and the high operation counts achievable. However, their applicability is limited to algorithms with deep vector/parallel structure and FPGA technology is not so suitable for general purpose computing. Another issue limiting applicability is ease-of-use because of a lack of simple and portable programming constructs that may be implemented in the major programming languages used in HPC. Nevertheless, developments in this hardware technology must be watched closely, as performance boosts of the order of 10 to 100 have been demonstrated for some applications. Furthermore, whereas commodity processors have traditionally shown a performance growth rate of the order of 1.4 Flops/year (Moore's law), FPGA's have shown a growth rate of 2.2 Flops/year, which is greater by a factor of about 1.6.
As in the past, the two dominant parallel programming paradigms are MPI and OpenMP so our emphasis will be on these and products that support them. The arrival of multi-core CPUs is coincident with the availability of large memory capacity so we anticipate a resurgence of interest in software development for large memory models within a Shared Memory Parallel (SMP) programming paradigm such as OpenMP. The OpenMP application program interface (API) is now at the 2.5 standard [OPENMP] and discussions are underway on what features should be included in the 3.0 standard. The OpenMP model is supported by all major compilers [for details see this page] and has been endorsed by key applications developers. It has a bright future in riding the wave of the multi-core revolution.
The Message Passing Interface (MPI) standard [MPI] is in common use and several major compiler vendors offer it in software that targets the commodity cluster computing market. However, with the advent of multi-core CPUs, for models that already use MPI, exploration of hybrid MPI and SMP levels of parallelism could be beneficial for performance scaling. Hybrid parallel computing models have been explored in the past, but their utility with dual processor commodity environments has been limited. Within the next year we will see the possibility of SMP models with 8 (or more) threads per node applied to real world models. The following discussion surveys these developments in commodity hardware and examines consequences for model throughput for future architectures. Case studies with models previously examined for performance bottlenecks are mentioned and some relevant HiPERiSM technical reports cited. In real-world code we will continue to seek opportunities for hybrid parallelism and, where they are successfully realized, measure parallel scalability as functions of the number of MPI processes versus the number of OpenMP threads. For classroom examples we refer the reader to HiPERiSM's HC8 training course.
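To make the idea of scanning MPI processes versus OpenMP threads concrete, the sketch below enumerates the ways a node with a given number of cores can be divided between the two levels of parallelism. It is an illustration only: the function name is hypothetical, and it assumes every core is used exactly once with no oversubscription.

```python
def hybrid_layouts(cores_per_node):
    """List (MPI processes per node, OpenMP threads per process) pairs
    that use every core on a node exactly once."""
    if cores_per_node < 1:
        raise ValueError("cores_per_node must be at least 1")
    return [(procs, cores_per_node // procs)
            for procs in range(1, cores_per_node + 1)
            if cores_per_node % procs == 0]


# Candidate layouts for a node with 8 cores (e.g. two quad core CPUs)
for procs, threads in hybrid_layouts(8):
    print(f"{procs} MPI process(es) x {threads} OpenMP thread(s)")
```

Each layout would then be timed separately, and the wall clock times compared, to see which balance of processes and threads scales best for a given model.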
Effective and efficient parallel processing depends on a combination of several key factors:
The above is what we said seven years ago and the story has not changed. What makes the search for effective parallel processing on commodity processor clusters particularly challenging is how (and with what frequency) technology changes.
Before parallel performance (Macroperformance) is evaluated the serial performance (Microperformance) of an application needs to be optimized. By serial performance we mean the on-core execution efficiency of the code. For example, code that does not take advantage of the extended SSE instruction set, or otherwise experiences serial performance bottlenecks, should not be executed in parallel mode before serial performance is optimized. Achievable serial execution efficiency will depend on several factors that change with time: memory architecture, FSB rates, CPU architecture (e.g. number of stages in the pipeline), scope of hardware resources (e.g. number of GPR's, size of the TLB cache, etc.), and the instruction set that comes with each new hardware generation. Performance in each of these factors can be precisely measured from hardware counters available on commodity processors at the user level.
At HiPERiSM we have developed a proprietary interface to automatically collect hardware performance data as an application executes. The operating system (OS) used for this is HiPERiSM Consulting, LLC’s modification of the Linux™ 2.6.9 kernel to include a patch that enables access to hardware performance counters. This modification allows the use of the Performance Application Programming Interface (PAPI) performance event library [PAPI] to collect hardware performance counter values as the code executes. The PAPI interface defines over a hundred hardware performance events, but not all of these events are available on all platforms. For the Intel hardware under discussion the number of hardware events that can be collected is, respectively, 28 (Pentium 4 Xeon) and 25 (Pentium 4 Xeon, 64EMT). Not all events can be collected in a single execution because the number of hardware counters is small (typically four). Thus, multiple executions are needed to collect all available events on any given platform. Performance metrics are defined using the PAPI events and measured in the expectation that they will give insight into how resource utilization differs between compilers. Results have been reported at several meetings in the last few years and more details may be found in the technical presentations [cf. George Delic in LCI2005, LCI2006, CCS2006].
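The need for multiple executions can be sketched as follows. This is not HiPERiSM's proprietary interface; it is a simplified illustration that assumes any four events can share the counters in one run, and the event names are placeholders. On real hardware some events conflict with one another, so the actual number of runs may be larger.

```python
def schedule_runs(events, counters_per_run=4):
    """Split a list of event names into groups, one group per execution,
    with at most counters_per_run events in each group."""
    if counters_per_run < 1:
        raise ValueError("counters_per_run must be at least 1")
    return [events[i:i + counters_per_run]
            for i in range(0, len(events), counters_per_run)]


# Placeholder event names; only the counts (28 and 25) come from the text
events_xeon = [f"EVENT_{n:02d}" for n in range(28)]
events_xeon64 = [f"EVENT_{n:02d}" for n in range(25)]

print(len(schedule_runs(events_xeon)))     # prints 7
print(len(schedule_runs(events_xeon64)))   # prints 7
```

Under this idealized assumption both platforms need at least seven executions of the same benchmark to cover every available event.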
The benchmarks chosen are a small set that we have used over the years because of their special characteristics, or because customers have requested performance analysis of them. Examples include:
The technical reports pages give more details of these algorithms and benchmark results will continue to be added as work on them progresses. Currently efforts are focused on the three AQM's listed above because they share the common characteristic of showing very little vector instruction issue, and consequently, negligible capability to take advantage of SSE instruction sets on commodity processors. It should be noted that CAMx is available in an OpenMP version and CMAQ is available in an MPI version. However, the performance issues so far examined at HiPERiSM are with the serial code. Studies of parallel performance will follow once we have converted CMAQ and AERMOD to SMP parallel mode.
It has been our practice to examine all available compilers for Linux systems, but we have tended to confine attention to the commercially supported ones. Even amongst these the main interest is in the compilers in use by the AQM community and these include those from Absoft, the Portland Group, and Intel. Even though HiPERiSM is a reseller of some of these compiler products, our approach is an agnostic one: we report what we find since our experience in scientific research leads us to believe in showing the objective facts and trying to understand (a) what is the measured performance? and (b) why does one benchmark give different performance compared to another? We have found that all compilers have undergone very significant maturation in the last two years with the arrival of the 64-bit hardware technology. Precisely measured performance comparisons show that even minor version upgrades of a specific compiler (say from x.0 to x.1) can show performance boosts in the range 12%-22%, whereas comparisons between compilers can show a "leap-frog" effect as new releases arrive asynchronously from these vendors. Thus, active users of commodity compiler products are well advised to exercise new compiler releases as they arrive with their favorite benchmarks. Furthermore, since new (major) compiler releases often come with compiler bugs, our advice to end-users is to apply at least two different compilers to the same benchmark in mission critical projects to validate numerical accuracy. A good recent example of a compiler bug is with Intel's ifort v9.1 (build 20060707, 9.1.036), which produces an "internal compiler error" and aborts compilation of CAMx 4.02, whereas v9.0 completes compilation successfully on the same platform (for a specific combination of switches). Such occurrences are to be expected in rapidly evolving compiler technology for commodity platforms.
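One simple way to act on the advice to validate numerical accuracy with two compilers is to compare the output values of the same benchmark built with each. The sketch below is a hypothetical helper, not a HiPERiSM tool; the tolerance of 1.0e-6 is an assumption that would need to be chosen for each model.

```python
def max_relative_difference(reference, candidate, floor=1.0e-30):
    """Largest relative difference between two equally long result lists."""
    if len(reference) != len(candidate):
        raise ValueError("result lists must have the same length")
    worst = 0.0
    for r, c in zip(reference, candidate):
        # floor guards against division by zero when both values are 0
        scale = max(abs(r), abs(c), floor)
        worst = max(worst, abs(r - c) / scale)
    return worst


def outputs_agree(reference, candidate, tolerance=1.0e-6):
    """True if the two builds agree to within the assumed tolerance."""
    return max_relative_difference(reference, candidate) <= tolerance


# Made-up values standing in for output from two compiler builds
build_a = [1.0, 2.0, 3.5]
build_b = [1.0, 2.0000001, 3.5]
print(outputs_agree(build_a, build_b))
```

A disagreement larger than the tolerance does not say which compiler is at fault, only that the results deserve a closer look.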
Defining good versus poor performance depends on the criteria applied. The basic performance categories that need to be examined are:
Specific metrics need to be defined and examined in each category. HiPERiSM uses multiple metrics that show either rates (i.e. number per unit time), or ratios (ratio of operations or instructions of different categories). Choice of the appropriate metric is critical in determining "good" performance. For example a floating point metric is meaningless for the Kallman algorithm, whereas it is meaningful for the SOM algorithm. The optimal balance between the basic categories listed above is important and, as has always been the case in HPC, overlap of operations or instructions is critical to enhanced performance. A recent presentation [cf. George Delic in CCS2006] attempts to set the absolute scale of "good" vs "poor" performance (at least in a small subset of compilers and hardware platforms). The technical reports (as they are updated) will discuss the relevant categories for each benchmark under study.
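The distinction between rate metrics and ratio metrics can be sketched with a few PAPI preset event names. The counter values and elapsed time below are invented for illustration, the particular metrics shown are our assumptions rather than HiPERiSM's definitions, and not every event is available on every platform.

```python
def rate(count, seconds):
    """Events per unit time."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return count / seconds


def ratio(numerator, denominator):
    """Ratio of two event counts; 0.0 if the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Invented counter values for a single benchmark execution
counters = {
    "PAPI_TOT_INS": 8.0e9,   # total instructions
    "PAPI_FP_OPS": 1.0e9,    # floating point operations
    "PAPI_LD_INS": 3.0e9,    # load instructions
    "PAPI_SR_INS": 1.0e9,    # store instructions
    "PAPI_VEC_INS": 5.0e8,   # vector instructions
}
elapsed_seconds = 4.0

mflops = rate(counters["PAPI_FP_OPS"], elapsed_seconds) / 1.0e6
memory_fraction = ratio(counters["PAPI_LD_INS"] + counters["PAPI_SR_INS"],
                        counters["PAPI_TOT_INS"])
vector_fraction = ratio(counters["PAPI_VEC_INS"], counters["PAPI_TOT_INS"])

print(mflops, memory_fraction, vector_fraction)
```

With these invented values the memory instruction fraction is high and the vector fraction is low, which is the pattern the text associates with codes that cannot exploit SSE instructions.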
The fundamental problem with commodity hardware solutions is the performance cost of accessing memory. Processor performance has improved by leaps-and-bounds in the last decade or so, whereas memory latency has not improved on a corresponding scale [STREAM]. As a consequence, the challenge of optimizing the balance between memory operations and arithmetic operations is crucial because commodity architectures compromise on memory bandwidth and latency to reduce costs.
Applications with a voluminous rate of total memory instructions need to be examined carefully. A high rate of memory instruction issue need not be an indicator of a performance bottleneck. Benchmarks with good vector character (e.g. SOM) that deliver of the order of 1 Gflop on a Pentium 4 Xeon can also show high memory access rates. But if an application has low vector instruction rates and voluminous memory access rates (e.g. AERMOD), performance is constricted on commodity architectures where memory bandwidth is limited by the FSB and cache design. We have even observed in such cases that the compiler with the lowest execution time is also the one with the lowest memory instruction rate. Parenthetically, we note that even with good vector code, if memory access can be reduced by algorithm changes, performance can be improved. This is the case for the SOM benchmark where an alternative algorithm avoids a redundant memory copy and boosts performance by more than 30%. This experience demonstrates that a little care in managing memory access can have very significant performance consequences.
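The kind of change that removes a redundant memory copy can be illustrated with a generic iterative sweep. This is not the SOM code or its alternative algorithm; it is a hypothetical one-dimensional example in which the first version copies the updated array back after every sweep and the second simply exchanges the roles of the two buffers.

```python
def sweep_with_copy(values, sweeps):
    """Neighbor averaging that copies the new array back after each sweep."""
    old = list(values)
    new = list(values)
    for _ in range(sweeps):
        for i in range(1, len(old) - 1):
            new[i] = 0.5 * (old[i - 1] + old[i + 1])
        old[:] = new              # redundant copy: every element moved again
    return old


def sweep_with_swap(values, sweeps):
    """Same sweep, but the two buffers exchange roles instead of copying."""
    old = list(values)
    new = list(values)
    for _ in range(sweeps):
        for i in range(1, len(old) - 1):
            new[i] = 0.5 * (old[i - 1] + old[i + 1])
        old, new = new, old       # swap references; no element is copied
    return old


start = [0.0, 0.0, 0.0, 0.0, 1.0]
print(sweep_with_copy(start, 3) == sweep_with_swap(start, 3))   # prints True
```

Both versions produce identical results, but the second touches memory only for the arithmetic itself, which is the general principle behind the improvement described above.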
Memory access bottlenecks can occur in several ways. Usually it is assumed that the bottleneck arises in L1 cache with data or instruction misses. However, our analyses also reveal another important source of memory performance inhibition. Between the processor and the first level of cache (L1) there is the TLB cache. The translation lookaside buffer (TLB) is a small buffer (or cache) to which the processor presents a virtual memory address and looks up a table for a translation to a physical memory address. If the address is found in the TLB table then there is a hit (no translation is computed) and the processor continues. The TLB buffer is usually small, and efficiency depends on hit rates as high as 98%. If the translation is not found (a TLB miss) then several cycles are lost while the physical address is translated. Therefore TLB misses degrade performance. These TLB cache misses can occur for either data or instructions. In the case of AERMOD, it is the instruction TLB misses that are critical because of the voluminous incidence of control transfer instructions.
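The sensitivity to TLB hit rate can be seen with a simple average cost model. The cycle counts below (1 cycle for a hit, 30 cycles lost on a miss) are assumptions chosen only to illustrate the effect; actual penalties depend on the processor and on where the page table entry is found.

```python
def average_access_cycles(hit_rate, hit_cycles, miss_penalty_cycles):
    """Average cycles per address translation: every access pays the hit
    cost, and the fraction that misses also pays the miss penalty."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles


# Assumed costs: 1 cycle per hit, 30 extra cycles per miss
for hit_rate in (0.999, 0.98, 0.90):
    cycles = average_access_cycles(hit_rate, 1.0, 30.0)
    print(f"hit rate {hit_rate:.3f}: {cycles:.2f} cycles per access")
```

Under these assumed costs, a drop in hit rate from 98% to 90% more than doubles the average cost per access, which is why codes with many control transfers and scattered instruction addresses can suffer.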
For optimal performance code must be structured to present compilers with ample opportunity to engage SSE instructions and overlap with memory (or I/O) operations. This means applying the usual practices for vector code construction:
There is nothing new in these basic practices for constructing vector code - they are the same that applied when serial code was ported to Cray vector architectures. They are also the appropriate rules for constructing code with good parallel potential, e.g. applying task parallelism to the DO loop code block.
Developments in hardware and software offer the opportunity of orders of magnitude increases in performance. However, they also require refined programming practices from cluster users coupled with precise measurement of performance metrics using hardware performance counters. Model developers should plan for transition of existing models to multi-core parallel architectures and evaluate the software options with respect to suitability to this task based on the criteria of level-of-effort, usability, and scalability. Such a plan should have focal points:
Prototyping with easy-to-use parallelizing compilers and OpenMP provides input to a decision making process on the serial-to-parallel conversion strategy. HiPERiSM's experience with OpenMP shows that the level of effort is some five to ten times less than that in using MPI. For large applications hybrid parallel models that use OpenMP and MPI in combination will determine the potential for scalability with clustered SMP nodes.
HiPERiSM Consulting, LLC, (919) 484-9803 (Voice)
(919) 806-2813 (Facsimile)