Fine/Open: Large industrial unsteady CFD simulations (NUMECA)

Fine/Open prototype solver dedicated to aero-external applications

Direct Numerical Simulation (DNS) and Large Eddy Simulation (LES) generally require highly accurate, lengthy computations as well as extremely fine meshes when second-order methods are used. For complex industrial geometries and high-Reynolds-number flows, the computational cost is a major limitation on the use of CFD in a development cycle, where a simulation must complete within at most a day. Under this constraint, industries must compromise on accuracy, using coarser meshes or less accurate physical modelling, in order to lower the computation time. In 2013, a DNS of a somewhat academic test case was conducted on a 3-billion-point mesh using a scheme of order 5-6 and 32,000 processors, in just one day. Applying such mesh density to industrial geometries implies much larger meshes, ranging from 1 to 10 billion mesh points and more.

The best chance to perform such large simulations within a day is to move towards exascale systems, i.e. systems capable of one exaFLOPS (a billion billion calculations per second). This can be achieved only by increasing the number of processors beyond 100,000. The CFD solver then needs excellent scalability, since communication between the processors becomes a major bottleneck at such processor counts.

AllScale Potential

The implementation of the Fine/Open solver is based on MPI and achieved reasonable scalability on several thousands of cores. However, as the number of targeted cores grew to several tens of thousands, severe scalability problems arose, which forced the introduction of nested parallelism through the use of OpenMP. This implementation strategy dramatically increased maintenance and development efforts. Furthermore, resilience to hardware failure remains an open issue.

Through the use of AllScale’s unified programming model these software development problems will be eliminated. At its core, Fine/Open is based on isolated update operations on mesh points simulating individual time steps. This mesh-time structure is amenable to AllScale’s recursive decomposition pattern. This restructuring will reduce global communication, increase scalability, and introduce support for platform portability. It also provides the foundation for AllScale’s sophisticated load balancing and resilience management services, adding to the capabilities of the Fine/Open solver. In particular, the auto-tuning of applications for customer target architectures as well as the configurable multi-objective optimization of the AllScale runtime system will enable customers to utilize their computational resources more effectively, in line with their own trade-offs among optimization objectives.

Using the AllScale runtime system, the scalability and efficiency of parallel algorithms will be maintained while increasing the number of processors by one or two orders of magnitude. Moreover, the number of partitions should be reduced by exploiting the shared-memory properties of the architecture. Finally, coding efforts will be significantly reduced, since the AllScale framework ensures that the code seamlessly runs optimally on various kinds of architectures (CPUs, GPUs, accelerators, …) and the parallelization effort is managed by the AllScale tool chain (hence no need for explicit MPI instrumentation in the solver).

A new version of the Fine/Open solver dedicated to aero-external applications has already been ported to the AllScale runtime system for shared-memory architectures.

Scheme of the Fine™/Open prototype split into 3 executables

At the current stage of development, the Fine/Open pilot application provides a multigrid compressible steady and unsteady Navier-Stokes solver including one turbulence model (Spalart-Allmaras). First tests of this prototype show scalability similar to that of the reference solver.

Fine™/Open prototype speedup

The figure above shows that the first tests achieve scalability similar to that of the reference solver for up to 12 cores. Note that the node used for this test has only 12 physical cores, providing 24 hyper-threaded cores. The loss of scalability observed for the prototype beyond that point, when compared to the reference application, is thus attributed to poor management of the hyper-threaded cores; this should be further investigated. Note also that, in sequential execution, the prototype performs 10 % faster than the reference application.

This prototype, in its distributed version, will target a full generic business-jet aircraft in landing configuration, requiring a billion mesh points and 10,000 to 100,000 time steps, with the objective of running within 2-3 days on 50,000+ cores, and within only a few hours on even larger systems.

Using the AllScale runtime system

In the AllScale environment, user applications are implemented on top of a set of generic parallel APIs offered by the project, the AllScale API. This component is subdivided into two layers: the User-Level API and the Core API. The Core API covers a small, concise set of essential parallel primitives. To support application developers, the User API layer, introduced and maintained by expert developers, maps common parallel patterns to the Core API primitives. It supports stencil operations, reductions and parallel loops, as well as branch-and-bound and divide-and-conquer algorithms, together with a comprehensive set of data structures (e.g. arrays, trees, sets, maps). For the Fine/Open pilot application, a dedicated part of the User API covers all the needs of a CFD solver: the data structures (the mesh and its hierarchy as well as all data necessary for the simulation), memory management (memory access, allocation and deallocation) and parallelism (parallel loops, communication between processes, synchronization). This data structure is represented in the figure below.

The User API has been designed as a high-level interface for application developers. It eases development by abstracting a number of operations (such as memory allocation, or communication and synchronization between processes), allowing faster development while keeping maintenance costs low. A snippet showing how the heat fluxes are computed in the Fine/Open solver is given in the box below. This piece of code illustrates that the User API is developer-friendly yet powerful. While the code is fully parallelized, there is no explicit formulation of the parallelism other than the pfor loops. The developer does not need to perform explicit synchronization or communication. However, the developer needs to be aware of the shared-memory paradigm: the program must not write twice to the same memory chunk at the same time. Note also that memory allocations are entirely managed by the API, i.e. there is no operator such as new (or malloc) or delete in the code. Finally, the data structure is easily accessible via templated routines that, for instance, loop over the nodes (cells, faces, …), access the edges (i.e. the connectivities, via getNeighbor(s)) or the node data (such as cellCenter).


Example of code for computing heat flux in parallel:

// Memory allocation for the conductivity (available on the cells)
auto conductivity = mesh.template createNodeData<Cell,value_t,Level>();
// Memory allocation for the fluxes (available only on the faces)
auto flux = mesh.template createNodeData<Face,value_t,Level>();
// Initialization of the conductivity (from a constant)
mesh.template pforAll<Cell,Level>([&](auto c) { conductivity[c] = kappa; });
// Accessing the cell center position
auto& cellCenter = geom.template get<CellCenter,Level>();

// Flux computation. Parallel loop over the internal faces.
mesh.template pforAll<Face,Level>([&](auto f){
    // Accessing the Face to Cell connectivity
    auto left  = mesh.template getNeighbor<Face_2_Cell_Left_Upward>(f);
    auto right = mesh.template getNeighbor<Face_2_Cell_Right_Downward>(f);
    value_t face_conductivity = 0.5 * (conductivity[left] + conductivity[right]);
    value_t distance = dist(cellCenter[left], cellCenter[right]);
    value_t gradT = (varsInCell[TEMP][right] - varsInCell[TEMP][left]) / distance;
    flux[f] = face_conductivity * gradT * faceSurf[f];
});

// Redistribution of the fluxes over the cells. This step is needed in order
// to prevent the algorithm from accessing the same memory slot twice.
mesh.template pforAll<Cell,Level>([&](auto c){
    // Accessing the Cell to Face connectivity
    auto lefts  = mesh.template getNeighbors<Cell_2_Face_left>(c);
    auto rights = mesh.template getNeighbors<Cell_2_Face_right>(c);
    for (auto f : lefts)  { residual[c] -= flux[f]; }
    for (auto f : rights) { residual[c] += flux[f]; }
});


So far, the Fine™/Open pilot application has been compiled and tested using the standard toolchain. The mesh data structure ported from the prototype to the API implementation complies with the requirements of the AllScale API, and various performance optimizations have been made. All features of the prototype are implemented and validated:

  • Steady and unsteady flow solver.
  • Restart functionality.
  • Multigrid approach.
  • Three physical models (Euler, laminar Navier-Stokes and Spalart-Allmaras turbulent Navier-Stokes).
  • Three numerical schemes (first/second order upwind scheme, central scalar and central matrix).
  • Efficient parallel IO.

The next step is to compile the prototype within the full AllScale toolchain.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671603

Contact Details

General Coordinator

Thomas Fahringer

Scientific Coordinator

Herbert Jordan