Hiperism Consulting, LLC: HCTR-2001-1

HiPERiSM's Technical Reports

HiPERiSM - High Performance Algorism Consulting

HCTR-2001-1: Compiler Performance 2

1.0 Combinatorial problems (part 2)

1.1 The permanent of (0,1) matrices

This is a continuation of the previous report's tests with the Kallman algorithm to compute the permanent of a (0,1) matrix. A discussion of other alternatives, and why the Kallman algorithm is preferred, is discussed by Delic and Cash in Computer Physics Communications, vol. 124, pp. 315-329, and the source code in question is available from CPC.

Kallman's algorithm relies on packing the rows of a (0,1) matrix in the respective bits of a word and then performs group operations on theses words, e.g. with masking operations, or population count. While the Intel™ based architectures discussed here are limited to 32 bit words, by a simple device, it is possible to spread row sizes longer than 32 columns across mutliple words. The version of the Kallman algorithm (KALLPC) studied here employs this method. The 64 bit version (KALLC90) is a different source code (not discussed in this report) and for 64 bit word platforms can deliver performance superior to the IA-32 results shown here.

Table 1.1 shows results of CPU time for the KALLPC code compiled using two different Fortran 90 compilers with the Linux™ operating system. KALLPC was executed in single thread mode on a dual processor Intel™ Pentium III 933MHz workstation with a 256MB on-processor L2 cache (this is the Flip Chip PGA version of the Pentium III). The KALLPC application is CPU intensive, performs a small amount of I/O only at the beginning and end of each run, and has modest memory requirements (~ 2 Mbytes). The Kallman algorithm runs in scalar mode because of a complex branching structure that inhibits parallelization.

Both KALLPC (and KALLC90) perform no floating point work and therefore they do not test numerical performance. Both codes are integer, logical, and bit-wise operation intensive and represent a very special type of metric for compiler and implementation performance. Therefore the usual caution should be exercised in making generalizations. Numerical performance of these compilers will be the subject of separate reports.

Table 1.1: CPU time in seconds for KALLPC on a 933MHz dual processor Intel™ Pentium III workstation with Red Hat Linux™ 6.2 for two different Fortran 90 compilers (see notes to table for details).
Matrix size N	Portland Group, Inc., Fortran 90, v3.1 [1]	Absoft Fortran 90, v6.2 [2,3]
30	0.60	0.25
44	136.9	46.4
48	22.9	7.4
52	84.0	26.5
56	721.1	229.2
60	46079.4	15029.3
[1] PGF90™ fortran compiler (Linux™ distribution) from the Portland Group Inc., (http://www.pgroup.com).
[2] Absoft Pro Fortran™ (Linux™ distribution) from the Absoft Corporation (http://www.absoft.com).
[3] Compare this result for N=60 with 66148.6 sec (f90) on the Cray C90™ (240MHz) in the 64 bit version, KALLC90, using the HPF POPCNT intrinsic function, or 22032.1 sec (f77) on the SGI Onyx2 with R10,000 processors (195MHz, 4MB L2 cache).

HiPERiSM Consulting, LLC, (919) 484-9803 (Voice)

(919) 806-2813 (Facsimile)