
Considerations for Parallel CFD Enhancements on SGI ccNUMA and Cluster Architectures Mark Kremenetsky, Ph.D., Principal Scientist, CFD Applications |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Table 1. Baseline FLUENT Study: Automotive Aerodynamics |
||
|
CPUs
|
FLUENT MPICH Baseline
|
FLUENT + SGI MPI + dplace
|
|
1
|
1.0
|
1.0
|
|
2
|
1.7
|
1.9
|
|
4
|
2.9
|
3.9
|
|
8
|
4.0
|
7.9
|
|
16
|
7.8
|
16.2
|
|
32
|
10.0
|
29.1
|
|
64
|
n/a
|
45.0
|
|
96
|
n/a
|
58.3
|
It should be noted that the current release of FLUENT for SGI systems is instrumented with data placement tools for use with MPICH and exhibits improved scalability over this experiment that was conducted during the beginning of our investigations. Still the lower latency of SGI-MPI over MPICH will produce maximum efficiency in parallel performance.
The overall objective of these studies is to demonstrate the performance of FLUENT CFD software on the SGI ccNUMA architecture as a single system image (SSI) configuration, as well as with various cluster configurations based on SGI systems. Performance is examined on a moderate SSI and clusters with several models less than 1M cells, then on a large SSI and large cluster for models much greater than 1M cells.
The use of a cluster of systems as a CFD computational resource is increasingly appealing for CAE professionals. Flexibility, local control, relatively low cost combined with suitable performance makes clusters a popular choice, especially for small-sized engineering companies and departments. The major components of a cluster for CFD software are the set of computational nodes (single or multiprocessors), a network interconnect of hardware and software, and an operating system software and administration tools. Specific configurations used for the moderate SSI and cluster experiments are summarized in Table 2.
Table 2. System Environments for FLUENT Performance Investigations |
||||
System |
Processor |
Topology |
O/S |
Interconnect |
|
SGI 2400
|
MIPS R10000
/250Mhz |
1 x 16
|
IRIX 6.5
|
N/A (SSI)
|
|
SGI 2100
|
MIPS R10000
/250Mhz |
4 x 8
|
IRIX 6.5
|
HIPPI
|
|
SGI 1400
|
Intel Xeon
/500Mhz |
4 x 4
|
Linux 2.2.5
|
100BT/HUB
|
|
SGI 1400
|
Intel Xeon
/500Mhz |
32 x 4
|
Linux 2.2.5
|
Myrinet
|
Studies on cluster systems were conducted on FLUENT performance as a function of architecture type, processor type and choice of interconnect. Performance depends on a number of factors including latency and bandwidth of each cluster arrangement. A simple "ping-pong" test allows one to measure the bandwidth and latency for each systems in this study. Table 3. provides results of this test.
Table 3. Communication Rates From Ping-Pong Test |
||||
System |
SGI 2400 |
SGI 2100 |
SGI 1400 |
SGI 1400/Myr |
Latency (microsec) |
12.3
|
139.5
|
168.5
|
18.7
|
Bytes |
Bandwidth (MB/sec) |
|||
|
8
|
0.6512
|
0.0574
|
0.0475
|
0.4277
|
|
1024
|
30.8285
|
5.2996
|
3.3621
|
5.6313
|
|
4096
|
71.5594
|
14.0763
|
5.0581
|
14.4953
|
|
16384
|
84.6533
|
21.9109
|
7.4708
|
34.3379
|
|
65356
|
122.5513
|
45.3720
|
8.9236
|
38.7949
|
|
262144
|
127.1931
|
64.2009
|
8.8473
|
37.6015
|
|
1048576
|
83.7358
|
69.5913
|
8.5831
|
36.4732
|
|
4194304
|
72.7694
|
70.1449
|
8.7337
|
36.5582
|
From the ping-pong test it is observed that the Single System Image (SSI) architecture (SGI 2400) with its hypercube topology, fast NUMALink communication hardware and SGI-MPI communication software exhibit the lowest latency and highest bandwidth. Next in line is the SGI 1400 Linux cluster (SGI 1400/Myr) with Myrinet interconnect and MPICH port to GM Myrinet communication protocol. The SGI 2100 cluster with HIPPI interconnect shows quite respectable levels of bandwidth but latency is almost an order higher than SSI latency. And finally, the SGI 1400 Linux cluster with 100BT demonstrates the least favorable communication parameters.
It is well known that the parallel performance of numerical applications is influenced by the size of the models for a particular benchmarking case as well as by the number of CPU used for that particular execution. In order to investigate the behavior of these system architectures in the broad spectrum of problem sizes, three benchmark tests are chosen for the study:
The following tables provide results of computational speed, presented in the form of how many jobs can be completed in a 24 hour period. For this metric, the larger numbers are better. Also, parallel performance scaling is also provided for each.
Table 4. FLUENT Performance for SMALL Model |
||||
|
Model: 90 degree elbow duct, 78,887 tetrahedral cells, k-e turbulence, segregated implicit solver Metric: Number of jobs completed in 24 hours with parallel speed-up |
||||
CPUs |
SGI 2400 |
SGI 2100 |
SGI 1400 |
SGI 1400/Myr |
|
1
|
606 1.0
|
606 1.0
|
432 1.0
|
427 1.0
|
|
2
|
1234 2.0
|
1234 2.0
|
786 1.8
|
786 1.8
|
|
4
|
2304 3.8
|
2304 3.8
|
987 2.3
|
987 2.3
|
|
8
|
3456 5.7
|
3142 5.2
|
735 1.7
|
1819 4.3
|
|
16
|
3142 5.2
|
2659 4.4
|
364 0.8
|
3142 7.4
|
Table 5. FLUENT Performance for MEDIUM Model |
||||
|
Model: Engine valveport, 242,782 hybrid cells, k-e
turbulence, segregated implicit solver Metric: Number of jobs completed
in 24 hours with parallel speed-up
|
||||
CPUs |
SGI 2400 |
SGI 2100 |
SGI 1400 |
SGI 1400/Myr |
|
1
|
116 1.0
|
115 1.0
|
96 1.0
|
98 1.0
|
|
2
|
213 1.9
|
212 1.8
|
166 1.7
|
175 1.8
|
|
4
|
421 3.6
|
416 3.6
|
243 2.5
|
252 2.6
|
|
8
|
823 7.1
|
804 7.0
|
432 4.5
|
508 5.2
|
|
16
|
1234 10.7
|
1115 9.7
|
576 6.0
|
1017 10.3
|
|
32
|
2033 17.6
|
1382 12.0
|
n/a n/a
|
1571 16.0
|
Table 6. FLUENT Performance for LARGE Model |
||||
|
Model: Fighter aircraft, 847,764 hexahedral cells,
RNG k-e turbulence, coupled explicit solver Metric: Number of jobs
completed in 24 hours with parallel speed-up
|
||||
CPUs |
SGI 2400 |
SGI 2100 |
SGI 1400 |
SGI 1400/Myr |
|
1
|
12.5 1.0
|
12.5 1.0
|
10.2 1.0
|
10.6 1.0
|
|
2
|
24.1 1.9
|
23.3 1.9
|
15.9 1.6
|
16.3 1.6
|
|
4
|
45.7 3.8
|
47.2 3.8
|
26.9 2.7
|
28.6 2.7
|
|
8
|
85.1 6.8
|
80.4 6.4
|
43.6 4.3
|
52.6 5.0
|
|
16
|
154.3 12.3
|
110.1 8.8
|
39.6 3.9
|
80.0 7.6
|
|
32
|
259.8 20.6
|
91.9 7.3
|
n/a n/a
|
118.0 11.2
|
|
64
|
n/a n/a
|
n/a n/a
|
n/a n/a
|
159.3 15.0
|
Despite that the systems in this study use CPUs that provide approximately the same hardware peak performance (about 0.5 GFLOPS) substantial variation in absolute performance is observed, and especially for low numbers of processors. This is explained by a strong dependency of FLUENT single CPU performance on the memory subsystem parameters, particularly memory bandwidth and secondary cache size. But these studies concentrate on parallel performance metrics which are mostly defined by interconnect characteristics rather than CPU performance. In general, most of the results confirm the well known fact that larger problems exhibit better scaling.
The SMALL model begins to peak in parallel scaling at 8 CPUs, where as the LARGE model continue to scale even after 32 processors. But the most important observation is that the level of parallel scalablity clearly correlates with the latency characteristics for a given interconnect. Both SSI and the Linux cluster with Myrinet exhibit smallest latency and the highest parallel scaling. Systems with slower interconnects (and especially 100BT) are even more limited in scaling ability.
Another interesting fact is that as the number of processors increase, the ratio of computation work to communication overhead is diminishing, such that network speeds practically define the absolute performance of the application. This is especially clear on the example of the SMALL benchmark. Note for example the 1 CPU and 16 CPU times of SGI 2400 and SGI 1400/Myr. For the 1 CPU case SGI 2400 is faster by 1.4 fold over SGI 1400/Myr, yet they are equal at 16 CPUs.
This is a result of much larger communication overhead that exists at 16 CPUs over the 1 CPU case, since the work per CPU is much smaller at 16 partitions for 16 CPUs. For the 1 CPU case, there was only 1 partition (the complete domain) and no communication costs. These results show that low cost clusters with fast network speeds (low latency communication protocols) can be a viable solution as low- and mid-range CFD servers for modest problem size.
Another important class of computer systems that provide resources capable of solving Grand Challenge level problems are often refered to as supercomputers. Until recently only vector computers (e.g. a Cray T90) were labeled as supercomputers. The arrival of massively parallel processing (MPP) computers during the 1990's changed this situation. Today a new class of massively parallel computers are called upon for simulation of Grand Challenge application problems within various areas of science and engineering including CFD.
This class of CFD applications drive issues in computer science and numerical methods, design requirements in supercomputing architectures, high speed communications, visualization and database performance. Success with Grand Challenge CFD problems involves coordinated research in mathematical models of physical behavior, numerical algorithms, parallel implementations and computer science methodologies.
Historically speaking, only large research facilities such as government funded national laboratories and university-based supercomputer centers were involved in the solution of extremely large problems and correspondingly, they were major customers for supercomputer vendors. During recent years, the increasing computational demands of the aerospace and automotive industries, along with rapidly decreasing costs of large-scale computing significantly changed this situation.
As a result of substantial improvements in processing speed and increases in storage capacity of a new generation of parallel computers, combined with improvements in computational algorithms, the industry was able to satisfy the growing demand for supercomputing resources. Practically every large aerospace and automotive company have installed supercomputer class systems that are highly utilized for mainstream engineering practice, and the majority of those belong to the MPP class systems.
MPP systems can generally come in 2 varieties: (1) large SSI systems that allow the use of both shared and distributed memory parallel paradigms across the complete system (e.g. the SGI 2800 512 CPU system at NASA Ames Research Center), and (2) clusters of smaller multiprocessors systems connected with a high performance network (e.g. 64 x SGI 2800 128 CPU cluster system, for a total of 6144 CPUs, ASCI Blue Mountain at Los Alamos National Laboratory).
The same spectrum of supercomputer investments occurs in industry, but on a smaller scale. Several major automotive companies have SGI 2800 128 CPU systems installed for various applications including simulation of various flow phenomena related to vehicle aerodynamics. Similarly, several aerospace companies have installed SGI 2800 128 CPU systems and in one case, a cluster of 4 x SGI 2400 64 CPU (total 256 CPU) for aerodynamic simulations related to design of aircraft and turbomachinery.
In such applications for both the automotive and aerospace industries, many use the FLUENT commercial CFD software, rather than develop CFD software in-house. The important question that is asked most often by these companies is -- which configuration, large SSI or cluster of smaller SSIs, provides the best performance for FLUENT simulations and why? In order to investigate and draw conclusions for such comparisons, the systems from Table 7. were configured for FLUENT benchmark tests:
Table 7. System Environments for FLUENT Performance Investigations |
||||
System |
Processor |
Topology |
O/S |
Interconnect |
|
SGI 2800
|
MIPS R12000/300Mhz
|
1 x 256
|
IRIX 6.5
|
N/A (SSI)
|
|
SGI 2800
|
MIPS R12000/300Mhz
|
4 x 64
|
IRIX 6.5
|
HIPPI
|
Latency was measured for each configuration using an MPI point-to-point test, and the results are provided in Table 8. Note the increase in latency of more than 11-fold for the cluster configuration with HIPPI interconnect over the SSI.
Table 8. Measured Latency From MPI Point-to-Point Test |
||
System |
SGI 2800/SSI |
SGI 2800/Cluster |
Latency (microsec) |
11.8
|
139.2
|
In order to avoid any possible constraints of parallel performance due to the size of a particular model, an EXTRA LARGE test case was constructed that consisted of about 30 million cells. This case was derived from one of the FLUENT Standard Benchmarks called FL5L2, which is a simulation of external aerodynamics about an automotive vehicle. A grid refinement was conducted that increased the model size 8-fold over the original size of 3,618,080 cells.
During the tests for the cluster configuration, the load was uniformly distributed among the four cluster nodes, each with 64 CPUs. The tests began at 30 processors which, for example required the use of 8 CPUs on two nodes of the cluster and 7 CPUs on the other two nodes. The SSI tests began at 10 CPUs and performance results for each are shown in Table 9.
Table 9. FLUENT Performance for EXTRA LARGE Model |
||
|
Model: Auto aerodynamics, 28,944,640 tetrahedral
cells, k-e turbulence, segregated implicit solver Metric: Number
of jobs completed in 24 hours with parallel speed-up
|
||
CPUs |
SGI 2800/SSI |
SGI 2800/Cluster |
|
10
|
9.1 1.0 *
|
n/a
|
|
20
|
18.4 2.0
|
n/a
|
|
30
|
35.0 3.8
|
24.7
|
|
60
|
60.7 6.7
|
48.2
|
|
120
|
120.0 13.2
|
88.6
|
|
240
|
188.9 20.8
|
70.2
|
* All speed-up results are relative to the 10 CPU result
The results for this EXTRA LARGE model show that the cluster is limited in scalability and therefore absolute performance when compared with the SSI. What's more, this performance difference increases as the number of CPUs employed also increases. There are two reasons for this behavior. First and most importantly is the higher inter-host latency which was discussed previously. The second reason is due the limited pipeline capability of the cluster's inter-host communication.
The SSI system offers a rich hyper-cube topology such that very few outstanding messages are required to negotiate a message passing from source to destination. The situation is much different for the cluster since inter-host communication with HIPPI doesn't provide the pipeline capability to resolve this bottleneck of having very few communication routes between hosts. This behavior was not so evident for the low-end clusters where the number of model partitions per host is low, but it becomes increasingly influential for large multiprocessor hosts for a model that contains a high number of partitions in need of inter-host communication. The evidence provides a conclusion that for solution of large models on a large number of processors, the SSI architecture provides much better performance than any cluster configuration.
Program development for parallel computers is becoming less complex compared with recent past efforts. For example the current status of the shared memory parallel paradigm often allows respectable, yet moderate scalability with simple implementation of automatic compiler technology. For high levels of scalability the situation is not as simple, since consideration must be given to the details of a particular microprocessor and system architecture.
Often performance gains can be realized with improvements based upon single processor considerations before any parallel programming begins. As an example, major changes were implemented from FLUENT 4.2 to FLUENT 5.0 that enhanced the solver for improved compiler scheduling and cache contention. These improvements were made possible by changes to the data structure that was changed from a link-list approach to an array based approach that provides more uniform memory stride and data re-use.
The FLUENT single CPU performance improvements that were developed to better utilize hierarchical memory structure, not only improve performance on existing systems, but provide full performance advantages to introductions of faster microprocessors in the future. The new release improvements to FLUENT are given in Table 10. and demonstrate the performance increases that can be obtained with introduction of a new data structure and a faster microprocessor.
Table 10. FLUENT New Release Improvements (seconds/iteration) |
|||
CPUs |
FLUENT/UNS 4.2.10 MIPS R10000 /195Mhz |
FLUENT 5.0 MIPS R10000
|
FLUENT 5.0 MIPS R10000
|
|
1
|
3712
|
2490
|
1779
|
|
10
|
338
|
149
|
101
|
|
20
|
139
|
61
|
51
|
|
30
|
97
|
43
|
35
|
|
60
|
54
|
28
|
n/a
|
Future single processor performance enhancements for FLUENT and other applications will come from introduction of new a generation of MIPS processors (R12000/400 Mhz and R14000/500 Mhz) and from the upcoming Intel IA-64 processor family. SGI Applications Engineering completed a first phase of FLUENT porting on the first processor in the IA-64 family, the Itanium at 800 Mhz, 3.2 Gflops peak performance. The preliminary results were presented at the Intel Developer's Forum 2000 recently in Palm Springs, CA and generated much interest within the scientific software development community.
Future progress in FLUENT parallel performance can be influenced from both the system software and hardware development fronts. One example is an SGI developed implementation of point-to-point communication primitives that are a subset of the MPI-2 standard. This primitive known as one-sided MPI and associated primitives use the memory structure of the ccNUMA architecture to a much fuller extent than conventional two-sided functions from MPI-1.
Recent measurements conducted at SGI showed that MPI_get and MPI_put calls reduce latency almost 4-fold in comparison with the conventional MPI implementation. The experiment was part of an ongoing development effort and as such, communication call interfaces are not completely user-friendly. Even still, an effort was made to include this communication layer in FLUENT and initial studies show a significant improvement to parallel scalability.
An additional consideration for increases in parallel performance is the coupling of various parallel paradigms within in the same code. This approach is best suited for the SGI ccNUMA architecture which allows to use both shared and distributed memory paradigms. FLUENT developers skillfully used this opportunity for performance improvements to multi-phase flow simulations. The motivation for this approach was parallelization of the Lagrangian system of coordinates when particulate matter is present in a flow field simulation.
For all FLUENT simulations including multi-phase, FLUENT employs a static frame of reference known as an Eulerian coordinate system. This is most convenient in multi-phase simulations for description of the continuum phase of fluid flow such as a gas. The static nature of this reference frame allows creation of static partitioning, which is an effective method for utilization of the distributed memory parallel model. At the same time, the discrete phase of a multi-phase simulation is usually described and calculated in a dynamic (moving) frame of reference known as a Lagrangian coordinate system.
To impose a static partitioning scheme on a Lagrangian topology is difficult and not an effective means of parallelization. The proposed hybrid parallel scheme with combined distributed and shared memory parallel paradigm allows resolution of this conflict very elegantly. The implementation means that both coordinate systems (Eularian and Lagrangian) work in a segregated manner and exchange data between the two phases. This exchange occurs upon completion of a distributed or shared memory parallel step of each iteration.
This hybrid implementation was completed for release of FLUENT 5.1, and Table 11. demonstrates the improved parallel performance over FLUENT 5.0 for multi-phase applications. FLUENT 5.0 offered distributed memory parallel continuum phase and sequential discrete phase, and FLUENT 5.1 offers parallelism of both continuum and discrete through the hybrid parallel approach. The model for this example is a FLUENT Standard Benchmark known as FL5M1.
Table 11. FLUENT Performance with Hybrid Parallel (parallel speed-up) |
||
|
Model: Coal combustion in a boiler, 155,188 tetrahedral cells, 6 species with reaction, dispersed phase, P1 radiation, k-e turbulence, segregated implicit solver. |
||
CPUs |
FLUENT 5.0 |
FLUENT 5.1 |
|
1
|
1.0
|
1.0
|
|
2
|
1.4
|
2.0
|
|
4
|
1.8
|
3.7
|
|
8
|
2.0
|
6.6
|
|
16
|
2.1
|
9.8
|
|
32
|
2.1
|
10.9
|
The hybrid parallel programming model provides improved scalability over the standard MPI implementation for this complex problem. Note that there are limitations to use of the hybrid parallel scheme in a cluster environment since the available number of shared memory threads is restricted by the number of CPUs on a particular shared-memory node of the cluster. In this case the SSI environment has an advantage since the shared memory phase of algorithm can use the same number of processors as a distributed memory phase of a particular simulation, which will lead to improved parallel performance.
Presented were parallel performance results of the commercial CFD software FLUENT for a variety of applications and system configurations. Results of these experiments lead to the following observations:
[1] Laudon, J. and Lenoski, D., "The SGI Origin ccNUMA Highly Scalable Server," SGI Published White Paper, March 1997.
[2] FLUENT 5 User's Guide, 1998, Fluent Incorporated, Lebanon, NH.