fluent.com home page

   
 
 

Considerations for Parallel CFD Enhancements on SGI ccNUMA and Cluster Architectures

Mark Kremenetsky, Ph.D., Principal Scientist, CFD Applications
SGI Mountain View, CA, mdk@sgi.com, 650.933.2304
Tom Tysinger, Ph.D., Principal Engineer
Fluent Inc, Lebanon, NH, tlt@fluent.com, 603.643.2600
Stan Posey, HPC Applications Market Development SGI Mountain View, CA, sposey@sgi.com, 650.933.1689

 


Abstract

The maturity of Computational Fluid Dynamics (CFD) methods and the increasing computational power of contemporary computers has enabled industry to incorporate CFD technology into several stages of a design process. As the application of CFD technology grows from component level analysis to system level, the complexity and size of models are increasing continuously. Successful simulation requires synergy between CAD, grid generation and CFD solvers.

The requirement for shorter design cycles has put severe limitations on the turnaround time of the numerical simulations. The time required for (1) mesh generation for computational domains of complex geometry and (2) obtaining numerical solutions for flows with complex physics has traditionally been the pacing item for CFD applications. Unstructured grid generation techniques and parallel algorithms have been instrumental in making such calculations affordable. Availability of these algorithms in commercial packages has grown in the last few years and parallel performance has become a very important factor in the selection of such methods for production work.

Although extensive research has been devoted in determining the optimum parallel paradigm, in practice the best parallel performance can be obtained only when algorithm and paradigms take into consideration the architectural design of the target computer system they are intended for. This paper addresses the issues related to efficient performance of the commercial CFD software FLUENT on a cache coherent Non Uniform Memory Architecture, or ccNUMA. Also presented are results from implementation of FLUENT on a cluster of systems for both the Linux and SGI IRIX operating systems. Issues related to performance of the message passing system and data placement are investigated for efficient scalability of FLUENT when applied to a variety of industrial problems.

1. Introduction

In recent years Computer Aided Engineering (CAE) has contributed significantly to changes in the manufacturing product development process. The major factors behind increased use of CAE are the requirements for continuos improvement in product performance and quality, reductions in number and cost of prototypes, and faster time to market.

Due to these requirements the conventional method of product design and development has evolved into a modern approach that relies more on CAE tools and techniques. A segment of CAE, Computational Fluid Dynamics (CFD) has recently been accepted as one such CAE tool and is used extensively to guide and compliment experimental methods used for product design and verification.

Beginning in the mid-1990's, the introduction of unstructured CFD grid technology, accurate and robust numerical solutions and the availability of powerful parallel computers have acted as catalysts in the rapid acceptance of a CFD-assisted design approach. The availability of commercial as well as in-house codes that use parallel processing has increased considerably in recent years leading to larger models and reduced solution times. All parallel implementations on existing system architectures are based on two alternatives (1) fine-grain and (2) coarse-grain parallelism.

The first parallel paradigm is often termed compiler parallelism. It exploits loop level parallelism, implemented by automatic parallel compilers for shared memory architectures. This technique is popular for its ease-of-use and incremental approach for parallelization of existing source code. The coarse-grain parallel method can be further subdivided into (2a) shared memory parallelism, (2b) library based distributed memory parallelism such as high performance FORTRAN (HPF), and (2c) parallelism based on explicit message passing with software systems such as MPI.

Distributed memory coarse-grain parallelism (or 2c) has increasingly become the preferred method because of its ability to accommodate both shared and distributed memory computer architectures. It is also considered to provide better scalability, although recent studies indicate that selection of the solver algorithm and not the programming model is responsible for scalability impediments usually associated with the shared memory parallel model [3].

This assortment of parallel programming paradigms and parallel computer architectures necessitates careful porting and performance tuning of various applications to ensure they take a full advantage of computational system resources. The objectives of the present study are the examination of performance issues associated with the implementation of commercial CFD software FLUENT on ccNUMA and cluster system architectures.

An investigation of performance bottlenecks was conducted for FLUENT with industrial-sized cases. This paper describes the parallel implementation of the FLUENT solver, the system environment for the investigations and results for each. Also, a new directions in parallel performance enhancement, such as hybrid parallel programming paradigms and new class of communication primitives are discussed.

2. FLUENT CFD Software

The commercial CFD software FLUENT is a fully-unstructured finite-volume CFD solver for complex flows ranging from incompressible (subsonic) to mildly compressible (transonic) to highly compressible (supersonic and hypersonic) flows. The wealth of physical models in FLUENT allows accurate prediction of laminar, transitional and turbulent flows, along with various modes of heat transfer, chemical reaction, multiphase flows and other complex phenomena.

The cell-based discretization approach used in FLUENT is capable of handling arbitrary convex polyhedral elements. For solution strategy, FLUENT allows a choice of two numerical methods, either segregated or coupled. With either method FLUENT solves the governing integral equations for conservation of mass, momentum, energy and other scalars such as turbulence and chemical species.

The FLUENT solution consists of a technique that includes (1) division of the domain into discrete control-volumes based upon a computational grid, (2) the integration of governing equations on the individual control volumes to construct algebraic equations for the discrete dependent variables such as velocities, pressure, temperature, and conserved scalars, and (3) linearization of discretized equations for solution of the resulting linear system to yield updated values of dependent variables.

Both segregated and coupled numerical methods employ a similar finite-volume discretization process but their approach to linearization and solution of the discretized equations is different. A point implicit (Gauss-Seidel) linear equation solver is used in conjunction with an Algebraic Multigrid (AMG) scheme to solve the resultant linear system for the dependent variables in each cell.

The AMG scheme is most often used but FLUENT also offers a full-approximation storage (FAS) multigrid scheme also. AMG is an essential component of both the segregated and coupled implicit solvers, while FAS is an important but optional component of the coupled explicit solver [2]. Parallelism is implemented through a coarse-grain, domain decomposition technique with use of the message passing interface (MPI) system.

3. Computer System Architectures

Provided are descriptions of the system architectures used for investigation of the FLUENT parallel studies. These include two proprietary SGI systems based on the IRIX operating system and and Intel-IA32 architecture based on the Linux operating system.

3.1 SGI ccNUMA Architecture

The majority of FLUENT computations presented in this paper were performed on an SGI 2000 (formerly Origin2000) family of servers. The SGI 2000 is a cache-coherent non-uniform access multiprocessor (ccNUMA) architecture [1]. The SGI ccNUMA memory is physically distributed among the nodes but it is globally addressable to all processors through the interconnection network.

The distribution of memory among processors ensures that memory latency is reduced. The globally addressable memory model is retained but memory access times are no longer uniform. The ccNUMA design incorporates hardware and software features that minimize latency differences between remote and local memory. Page migration hardware moves data closer into memory closer to a processor that frequently uses it, meaning that most memory references are local.

Cache coherence is maintained via a directory based protocol and caches are used to reduce memory latency as well. While data only exists in either local or remote memory, copy of the data can exist in various processor caches. Keeping these copies consistent is the responsibility of the logic (cache-coherent protocol) of the various hubs. The directory-based cache coherence protocol is preferable to snooping since it reduces the amount of coherence traffic; cache-line invalidations are broadcasted only to those CPUs actually using the cache line instead to all CPUs in the system.

The building block of the system is the node, which contains two processors, up to 4 GB of main memory and its corresponding directory memory, and a connection to a portion of IO subsystem. The hub chip is the distributed memory controller and is responsible for providing transparent access to all of the distributed memory in a cache-coherent manner to all of the processors and I/O devices. The nodes can be connected together via any choice of scalable interconnection network.

3.2 Linux and IRIX Cluster Systems

Cluster computing is based on the simple approach of connecting several compute servers to utilize their collective resources for solving a single or multiple problems quickly. The rapidly decreasing price-performance of hardware and system software, emerging high-speed networks, and the availability of mature commercial CFD application software allows cluster computing to become an attractive option for industrial CAE users.

There are two major classes of a cluster configurations (1) capacity cluster and (2) capability cluster. A capacity cluster is targeted for solutions of multiple problems, each running on a dedicated single CPU with minimum communication between individual servers. A capability cluster is used as a collective computational power of several computational nodes for solution of a single problem as rapidly as possible.

The capability approach requires a well developed and fast interconnect between clustered nodes. The studies presented in this paper concentrate only on capability clusters. Additionally we will subdivide the capability cluster configuration into two distinctive subclasses, (1) high end cluster consisting of few powerful multiprocessor nodes connected within a simple but powerful and mostly proprietary network topology, and (2) low end clusters employing substantial numbers of mostly single or low CPU count nodes connected in general via inexpensive commercially available networks.

4. Parallel Performance Issues

The fundamental issues behind parallel algorithm design are well understood and described in various research publications. For grid-based problems such as the numerical solution of partial differential equations, there are four sources of overhead that can affect parallel performance declining:

  • Non-optimal algorithm and algorithmic overhead: The best sequential algorithm may often be difficult or impossible to parallelize (e.g. triangular solver). In such cases the parallel algorithm may have a larger operation count than a sequential one. Additionally, in order to avoid communication overhead one may wish to duplicate some computations on different processors (e.g. double flux calculations)
  • Software: parallelization often results in an increase of the (relative) software overheads such as the those associated with indexing, procedure calls, and so on.
  • Load imbalance: The execution time of a parallel algorithm is determined by the execution of the processor having the largest amount of work. Should the computational workload not be evenly distributed, load imbalance will result and processor idling will occur, meaning certain processors must wait for other processors to finish a particular computation.
  • Communication and synchronization: All time spent in communication and synchronization is pure overhead.

The parallel version of FLUENT was carefully designed to minimize major sources of parallel inefficiencies. For example, it is well known that multigrid algorithms in general, and AMG in particular (the implementation in FLUENT) are well suited for multiprocessor computing. Minimization of communication overhead through optimal surface-volume ratio, and efficient load balance are achieved with the METIS graph-partitioning scheme of the grid domain.

Still with such a sophisticated approach, parallel performance can exhibit unsatisfactory results owing to the lack of special mapping to the specific architecture of a particular computer system. Parallel performance of FLUENT, as originally ported on the SGI ccNUMA architecture did not meet initial expectations, with most models scaling only up to 4 processors. Examination of the parallel performance for a number of cases identified bottlenecks with MPI latency and non-enforcement of processor-memory affinity (data placement) as the key reasons for limited scalability. The data placement concern was related to a feature of the ccNUMA architecture and was addressed through implementation of the IRIX dplace set of tools.

The latency bottlenecks associated with MPI required more attention. FLUENT parallelization is based on an explicit message passing paradigm which utilizes MPI to exchange boundary information between partitions. For AMG schemes, MPI is used for information exchange of both fine-grid as well as with coarse-grid levels. The size of MPI messages decreases for coarse-grids making the message initiation (latency) cost more important than the message transmission cost (bandwidth). Thus, unlike classic one-level algorithms where the bandwidth of the MPI implementation is critical to scalability, FLUENT scalability depends primarily on latency of the MPI implementation.

Total MPI latency is determined by both the specifics of a system architecture and the implementation of MPI for that system. Since system architecture latency is determined by design of a particular interconnect, overall latency improvements can only be made to the MPI implementation. Modifications to the MPI software to ensure "awareness" of a specific architecture is the only way to reduce the total latency and subsequently the communication overhead.

Table 1. shows scalability data for one of the FLUENT test problems using both MPICH (public domain MPI) and the SGI implementation of MPI coupled with the use of placement tools. Clearly the later provides higher scalability when compared to the public domain of MPI. Using a "ping-pong" test verifies that MPICH latency is almost three times higher than SGI-MPI specific implementation. The results presented in the later sections will demonstrate the influence of MPI latency on the FLUENT scalabilty for various architectures of SGI systems.

Table 1. Baseline FLUENT Study: Automotive Aerodynamics

CPUs
FLUENT MPICH Baseline
FLUENT + SGI MPI + dplace
1
1.0
1.0
2
1.7
1.9
4
2.9
3.9
8
4.0
7.9
16
7.8
16.2
32
10.0
29.1
64
n/a
45.0
96
n/a
58.3

It should be noted that the current release of FLUENT for SGI systems is instrumented with data placement tools for use with MPICH and exhibits improved scalability over this experiment that was conducted during the beginning of our investigations. Still the lower latency of SGI-MPI over MPICH will produce maximum efficiency in parallel performance.

5. FLUENT Parallel Performance

The overall objective of these studies is to demonstrate the performance of FLUENT CFD software on the SGI ccNUMA architecture as a single system image (SSI) configuration, as well as with various cluster configurations based on SGI systems. Performance is examined on a moderate SSI and clusters with several models less than 1M cells, then on a large SSI and large cluster for models much greater than 1M cells.

5.1 Linux and IRIX Cluster Performance

The use of a cluster of systems as a CFD computational resource is increasingly appealing for CAE professionals. Flexibility, local control, relatively low cost combined with suitable performance makes clusters a popular choice, especially for small-sized engineering companies and departments. The major components of a cluster for CFD software are the set of computational nodes (single or multiprocessors), a network interconnect of hardware and software, and an operating system software and administration tools. Specific configurations used for the moderate SSI and cluster experiments are summarized in Table 2.

Table 2. System Environments for FLUENT Performance Investigations

System

Processor

Topology

O/S

Interconnect

SGI 2400
MIPS R10000
/250Mhz
1 x 16
IRIX 6.5
N/A (SSI)
SGI 2100
MIPS R10000
/250Mhz
4 x 8
IRIX 6.5
HIPPI
SGI 1400
Intel Xeon
/500Mhz
4 x 4
Linux 2.2.5
100BT/HUB
SGI 1400
Intel Xeon
/500Mhz
32 x 4
Linux 2.2.5
Myrinet

Studies on cluster systems were conducted on FLUENT performance as a function of architecture type, processor type and choice of interconnect. Performance depends on a number of factors including latency and bandwidth of each cluster arrangement. A simple "ping-pong" test allows one to measure the bandwidth and latency for each systems in this study. Table 3. provides results of this test.

Table 3. Communication Rates From Ping-Pong Test

System

SGI 2400

SGI 2100

SGI 1400

SGI 1400/Myr

Latency (microsec)

12.3
139.5
168.5
18.7

Bytes

Bandwidth (MB/sec)

8
0.6512
0.0574
0.0475
0.4277
1024
30.8285
5.2996
3.3621
5.6313
4096
71.5594
14.0763
5.0581
14.4953
16384
84.6533
21.9109
7.4708
34.3379
65356
122.5513
45.3720
8.9236
38.7949
262144
127.1931
64.2009
8.8473
37.6015
1048576
83.7358
69.5913
8.5831
36.4732
4194304
72.7694
70.1449
8.7337
36.5582

From the ping-pong test it is observed that the Single System Image (SSI) architecture (SGI 2400) with its hypercube topology, fast NUMALink communication hardware and SGI-MPI communication software exhibit the lowest latency and highest bandwidth. Next in line is the SGI 1400 Linux cluster (SGI 1400/Myr) with Myrinet interconnect and MPICH port to GM Myrinet communication protocol. The SGI 2100 cluster with HIPPI interconnect shows quite respectable levels of bandwidth but latency is almost an order higher than SSI latency. And finally, the SGI 1400 Linux cluster with 100BT demonstrates the least favorable communication parameters.

It is well known that the parallel performance of numerical applications is influenced by the size of the models for a particular benchmarking case as well as by the number of CPU used for that particular execution. In order to investigate the behavior of these system architectures in the broad spectrum of problem sizes, three benchmark tests are chosen for the study:

  • SMALL - Flow through a 90 degree elbow duct with 78,887 cells
  • MEDIUM - Turbulent flow in an engine valve-port with 242,782 cells
  • LARGE - Transonic flow around a fighter aircraft with 847,764 cells

The following tables provide results of computational speed, presented in the form of how many jobs can be completed in a 24 hour period. For this metric, the larger numbers are better. Also, parallel performance scaling is also provided for each.

Table 4. FLUENT Performance for SMALL Model

Model: 90 degree elbow duct, 78,887 tetrahedral cells, k-e turbulence, segregated implicit solver Metric: Number of jobs completed in 24 hours with parallel speed-up

CPUs

SGI 2400

SGI 2100

SGI 1400

SGI 1400/Myr

1
606 1.0
606 1.0
432 1.0
427 1.0
2
1234 2.0
1234 2.0
786 1.8
786 1.8
4
2304 3.8
2304 3.8
987 2.3
987 2.3
8
3456 5.7
3142 5.2
735 1.7
1819 4.3
16
3142 5.2
2659 4.4
364 0.8
3142 7.4

Table 5. FLUENT Performance for MEDIUM Model

Model: Engine valveport, 242,782 hybrid cells, k-e turbulence, segregated implicit solver Metric: Number of jobs completed in 24 hours with parallel speed-up

CPUs

SGI 2400

SGI 2100

SGI 1400

SGI 1400/Myr

1
116 1.0
115 1.0
96 1.0
98 1.0
2
213 1.9
212 1.8
166 1.7
175 1.8
4
421 3.6
416 3.6
243 2.5
252 2.6
8
823 7.1
804 7.0
432 4.5
508 5.2
16
1234 10.7
1115 9.7
576 6.0
1017 10.3
32
2033 17.6
1382 12.0
n/a n/a
1571 16.0

Table 6. FLUENT Performance for LARGE Model

Model: Fighter aircraft, 847,764 hexahedral cells, RNG k-e turbulence, coupled explicit solver Metric: Number of jobs completed in 24 hours with parallel speed-up

CPUs

SGI 2400

SGI 2100

SGI 1400

SGI 1400/Myr

1
12.5 1.0
12.5 1.0
10.2 1.0
10.6 1.0
2
24.1 1.9
23.3 1.9
15.9 1.6
16.3 1.6
4
45.7 3.8
47.2 3.8
26.9 2.7
28.6 2.7
8
85.1 6.8
80.4 6.4
43.6 4.3
52.6 5.0
16
154.3 12.3
110.1 8.8
39.6 3.9
80.0 7.6
32
259.8 20.6
91.9 7.3
n/a n/a
118.0 11.2
64
n/a n/a
n/a n/a
n/a n/a
159.3 15.0

Despite that the systems in this study use CPUs that provide approximately the same hardware peak performance (about 0.5 GFLOPS) substantial variation in absolute performance is observed, and especially for low numbers of processors. This is explained by a strong dependency of FLUENT single CPU performance on the memory subsystem parameters, particularly memory bandwidth and secondary cache size. But these studies concentrate on parallel performance metrics which are mostly defined by interconnect characteristics rather than CPU performance. In general, most of the results confirm the well known fact that larger problems exhibit better scaling.

The SMALL model begins to peak in parallel scaling at 8 CPUs, where as the LARGE model continue to scale even after 32 processors. But the most important observation is that the level of parallel scalablity clearly correlates with the latency characteristics for a given interconnect. Both SSI and the Linux cluster with Myrinet exhibit smallest latency and the highest parallel scaling. Systems with slower interconnects (and especially 100BT) are even more limited in scaling ability.

Another interesting fact is that as the number of processors increase, the ratio of computation work to communication overhead is diminishing, such that network speeds practically define the absolute performance of the application. This is especially clear on the example of the SMALL benchmark. Note for example the 1 CPU and 16 CPU times of SGI 2400 and SGI 1400/Myr. For the 1 CPU case SGI 2400 is faster by 1.4 fold over SGI 1400/Myr, yet they are equal at 16 CPUs.

This is a result of much larger communication overhead that exists at 16 CPUs over the 1 CPU case, since the work per CPU is much smaller at 16 partitions for 16 CPUs. For the 1 CPU case, there was only 1 partition (the complete domain) and no communication costs. These results show that low cost clusters with fast network speeds (low latency communication protocols) can be a viable solution as low- and mid-range CFD servers for modest problem size.

5.2 SGI ccNUMA Performance

Another important class of computer systems that provide resources capable of solving Grand Challenge level problems are often refered to as supercomputers. Until recently only vector computers (e.g. a Cray T90) were labeled as supercomputers. The arrival of massively parallel processing (MPP) computers during the 1990's changed this situation. Today a new class of massively parallel computers are called upon for simulation of Grand Challenge application problems within various areas of science and engineering including CFD.

This class of CFD applications drive issues in computer science and numerical methods, design requirements in supercomputing architectures, high speed communications, visualization and database performance. Success with Grand Challenge CFD problems involves coordinated research in mathematical models of physical behavior, numerical algorithms, parallel implementations and computer science methodologies.

Historically speaking, only large research facilities such as government funded national laboratories and university-based supercomputer centers were involved in the solution of extremely large problems and correspondingly, they were major customers for supercomputer vendors. During recent years, the increasing computational demands of the aerospace and automotive industries, along with rapidly decreasing costs of large-scale computing significantly changed this situation.

As a result of substantial improvements in processing speed and increases in storage capacity of a new generation of parallel computers, combined with improvements in computational algorithms, the industry was able to satisfy the growing demand for supercomputing resources. Practically every large aerospace and automotive company have installed supercomputer class systems that are highly utilized for mainstream engineering practice, and the majority of those belong to the MPP class systems.

MPP systems can generally come in 2 varieties: (1) large SSI systems that allow the use of both shared and distributed memory parallel paradigms across the complete system (e.g. the SGI 2800 512 CPU system at NASA Ames Research Center), and (2) clusters of smaller multiprocessors systems connected with a high performance network (e.g. 64 x SGI 2800 128 CPU cluster system, for a total of 6144 CPUs, ASCI Blue Mountain at Los Alamos National Laboratory).

The same spectrum of supercomputer investments occurs in industry, but on a smaller scale. Several major automotive companies have SGI 2800 128 CPU systems installed for various applications including simulation of various flow phenomena related to vehicle aerodynamics. Similarly, several aerospace companies have installed SGI 2800 128 CPU systems and in one case, a cluster of 4 x SGI 2400 64 CPU (total 256 CPU) for aerodynamic simulations related to design of aircraft and turbomachinery.

In such applications for both the automotive and aerospace industries, many use the FLUENT commercial CFD software, rather than develop CFD software in-house. The important question that is asked most often by these companies is -- which configuration, large SSI or cluster of smaller SSIs, provides the best performance for FLUENT simulations and why? In order to investigate and draw conclusions for such comparisons, the systems from Table 7. were configured for FLUENT benchmark tests:

Table 7. System Environments for FLUENT Performance Investigations

System

Processor

Topology

O/S

Interconnect

SGI 2800
MIPS R12000/300Mhz
1 x 256
IRIX 6.5
N/A (SSI)
SGI 2800
MIPS R12000/300Mhz
4 x 64
IRIX 6.5
HIPPI

Latency was measured for each configuration using an MPI point-to-point test, and the results are provided in Table 8. Note the increase in latency of more than 11-fold for the cluster configuration with HIPPI interconnect over the SSI.

Table 8. Measured Latency From MPI Point-to-Point Test

System

SGI 2800/SSI

SGI 2800/Cluster

Latency (microsec)

11.8
139.2

In order to avoid any possible constraints of parallel performance due to the size of a particular model, an EXTRA LARGE test case was constructed that consisted of about 30 million cells. This case was derived from one of the FLUENT Standard Benchmarks called FL5L2, which is a simulation of external aerodynamics about an automotive vehicle. A grid refinement was conducted that increased the model size 8-fold over the original size of 3,618,080 cells.

During the tests for the cluster configuration, the load was uniformly distributed among the four cluster nodes, each with 64 CPUs. The tests began at 30 processors which, for example required the use of 8 CPUs on two nodes of the cluster and 7 CPUs on the other two nodes. The SSI tests began at 10 CPUs and performance results for each are shown in Table 9.

Table 9. FLUENT Performance for EXTRA LARGE Model

Model: Auto aerodynamics, 28,944,640 tetrahedral cells, k-e turbulence, segregated implicit solver Metric: Number of jobs completed in 24 hours with parallel speed-up

CPUs

SGI 2800/SSI

SGI 2800/Cluster

10
9.1 1.0 *
n/a
20
18.4 2.0
n/a
30
35.0 3.8
24.7
60
60.7 6.7
48.2
120
120.0 13.2
88.6
240
188.9 20.8
70.2

* All speed-up results are relative to the 10 CPU result

The results for this EXTRA LARGE model show that the cluster is limited in scalability and therefore absolute performance when compared with the SSI. What's more, this performance difference increases as the number of CPUs employed also increases. There are two reasons for this behavior. First and most importantly is the higher inter-host latency which was discussed previously. The second reason is due the limited pipeline capability of the cluster's inter-host communication.

The SSI system offers a rich hyper-cube topology such that very few outstanding messages are required to negotiate a message passing from source to destination. The situation is much different for the cluster since inter-host communication with HIPPI doesn't provide the pipeline capability to resolve this bottleneck of having very few communication routes between hosts. This behavior was not so evident for the low-end clusters where the number of model partitions per host is low, but it becomes increasingly influential for large multiprocessor hosts for a model that contains a high number of partitions in need of inter-host communication. The evidence provides a conclusion that for solution of large models on a large number of processors, the SSI architecture provides much better performance than any cluster configuration.

6. Considerations For Further Performance Enhancements

Program development for parallel computers is becoming less complex compared with recent past efforts. For example the current status of the shared memory parallel paradigm often allows respectable, yet moderate scalability with simple implementation of automatic compiler technology. For high levels of scalability the situation is not as simple, since consideration must be given to the details of a particular microprocessor and system architecture.

Often performance gains can be realized with improvements based upon single processor considerations before any parallel programming begins. As an example, major changes were implemented from FLUENT 4.2 to FLUENT 5.0 that enhanced the solver for improved compiler scheduling and cache contention. These improvements were made possible by changes to the data structure that was changed from a link-list approach to an array based approach that provides more uniform memory stride and data re-use.

The FLUENT single CPU performance improvements that were developed to better utilize hierarchical memory structure, not only improve performance on existing systems, but provide full performance advantages to introductions of faster microprocessors in the future. The new release improvements to FLUENT are given in Table 10. and demonstrate the performance increases that can be obtained with introduction of a new data structure and a faster microprocessor.

Table 10. FLUENT New Release Improvements (seconds/iteration)

CPUs

FLUENT/UNS 4.2.10 MIPS R10000 /195Mhz

FLUENT 5.0 MIPS R10000
/195Mhz

FLUENT 5.0 MIPS R10000
/250Mhz

1
3712
2490
1779
10
338
149
101
20
139
61
51
30
97
43
35
60
54
28
n/a

Future single processor performance enhancements for FLUENT and other applications will come from introduction of new a generation of MIPS processors (R12000/400 Mhz and R14000/500 Mhz) and from the upcoming Intel IA-64 processor family. SGI Applications Engineering completed a first phase of FLUENT porting on the first processor in the IA-64 family, the Itanium at 800 Mhz, 3.2 Gflops peak performance. The preliminary results were presented at the Intel Developer's Forum 2000 recently in Palm Springs, CA and generated much interest within the scientific software development community.

Future progress in FLUENT parallel performance can be influenced from both the system software and hardware development fronts. One example is an SGI developed implementation of point-to-point communication primitives that are a subset of the MPI-2 standard. This primitive known as one-sided MPI and associated primitives use the memory structure of the ccNUMA architecture to a much fuller extent than conventional two-sided functions from MPI-1.

Recent measurements conducted at SGI showed that MPI_get and MPI_put calls reduce latency almost 4-fold in comparison with the conventional MPI implementation. The experiment was part of an ongoing development effort and as such, communication call interfaces are not completely user-friendly. Even still, an effort was made to include this communication layer in FLUENT and initial studies show a significant improvement to parallel scalability.

An additional consideration for increases in parallel performance is the coupling of various parallel paradigms within in the same code. This approach is best suited for the SGI ccNUMA architecture which allows to use both shared and distributed memory paradigms. FLUENT developers skillfully used this opportunity for performance improvements to multi-phase flow simulations. The motivation for this approach was parallelization of the Lagrangian system of coordinates when particulate matter is present in a flow field simulation.

For all FLUENT simulations including multi-phase, FLUENT employs a static frame of reference known as an Eulerian coordinate system. This is most convenient in multi-phase simulations for description of the continuum phase of fluid flow such as a gas. The static nature of this reference frame allows creation of static partitioning, which is an effective method for utilization of the distributed memory parallel model. At the same time, the discrete phase of a multi-phase simulation is usually described and calculated in a dynamic (moving) frame of reference known as a Lagrangian coordinate system.

To impose a static partitioning scheme on a Lagrangian topology is difficult and not an effective means of parallelization. The proposed hybrid parallel scheme with combined distributed and shared memory parallel paradigm allows resolution of this conflict very elegantly. The implementation means that both coordinate systems (Eularian and Lagrangian) work in a segregated manner and exchange data between the two phases. This exchange occurs upon completion of a distributed or shared memory parallel step of each iteration.

This hybrid implementation was completed for release of FLUENT 5.1, and Table 11. demonstrates the improved parallel performance over FLUENT 5.0 for multi-phase applications. FLUENT 5.0 offered distributed memory parallel continuum phase and sequential discrete phase, and FLUENT 5.1 offers parallelism of both continuum and discrete through the hybrid parallel approach. The model for this example is a FLUENT Standard Benchmark known as FL5M1.

Table 11. FLUENT Performance with Hybrid Parallel (parallel speed-up)

Model: Coal combustion in a boiler, 155,188 tetrahedral cells, 6 species with reaction, dispersed phase, P1 radiation, k-e turbulence, segregated implicit solver.

CPUs

FLUENT 5.0

FLUENT 5.1

1
1.0
1.0
2
1.4
2.0
4
1.8
3.7
8
2.0
6.6
16
2.1
9.8
32
2.1
10.9

The hybrid parallel programming model provides improved scalability over the standard MPI implementation for this complex problem. Note that there are limitations to use of the hybrid parallel scheme in a cluster environment since the available number of shared memory threads is restricted by the number of CPUs on a particular shared-memory node of the cluster. In this case the SSI environment has an advantage since the shared memory phase of algorithm can use the same number of processors as a distributed memory phase of a particular simulation, which will lead to improved parallel performance.

7. Conclusions

Presented were parallel performance results of the commercial CFD software FLUENT for a variety of applications and system configurations. Results of these experiments lead to the following observations:

  • For small and medium sized FLUENT models, UNIX and Linux clusters can be a reasonable solution in the range of 8 to 16 CPUs. The major bottleneck of the investigated algorithm is communication latency, which can be resolved with a careful choice of interconnect communication hardware and software.
  • For large and extra large models, the ccNUMA SSI architecture provides superior FLUENT parallel performance owing to characteristics of rapid internal interconnects.
  • For single CPU performance increases, consideration must be give to the correlation between the given data structures and algorithms, and to the specific memory hierarchy of the particular target system architecture.
  • The ccNUMA architecture allows a coupling of various parallel programming paradigms which can lead to a much improved parallel performance of multi-physics applications.

References

[1] Laudon, J. and Lenoski, D., "The SGI Origin ccNUMA Highly Scalable Server," SGI Published White Paper, March 1997.

[2] FLUENT 5 User's Guide, 1998, Fluent Incorporated, Lebanon, NH.