Resilience Techniques Survey

Background to the survey

Analysis of results

Q1: In which field of computational science do you work?

Q2: In which languages are your applications written?

Q3: Which of the following software technologies do you use in your application today?

Q4: Which are the principal mathematical kernels that dominate your applications?

Q5: Which numerical representation is most common in your applications today?

Q6: How many CPUs do you use in typical jobs today (more than one answer allowed)?

Q7: Have you considered the challenges of ExaScale computing, and are you actively considering changes to bring your code to readiness for ExaScale?

Q8: Have you considered the impact of HARD and SOFT faults in your application?

Q9: Resiliency against HARD faults can be provided by using a generic checkpoint/restart mechanism. Does your application software employ checkpoint/restart mechanisms? If so, please describe briefly.

Q10: Do you use any of the following error checking mechanisms?

Summarizing the results of the survey

Conclusions and Future Work

Bibliography

 

Background to the survey

In its early stages the AllScale project undertook a horizon-scanning activity, including a review of the literature on resilience techniques applicable to ExaScale computing.

The AllScale pilot applications reported that checkpoint/restart for hard faults was their preferred solution. The project decided to engage with a wider community of end users and undertook a survey, the results of which are reported here. We posed ten questions covering the full range of issues facing users as the ExaScale era approaches. The questions are shown below, along with an analysis of the responses.

 

Analysis of results

In this section we present each question in the survey and the responses to that question.

 

Q1: In which field of computational science do you work?

 

Area of computational science                    Number of responses
Numerical mathematics                             1
Nuclear physics                                   1
Materials science and solid-state physics         1
Computational chemistry and plasma physics        7
Medical and engineering science                   1
Physical oceanography                             2
Condensed matter physics                          1
Fluid dynamics, turbulent atmospheric flows      11
Planetary atmospheres                             1
Climate modelling                                 1
Programming models and runtimes                   1
Total                                            28

Table 1: Q1 responses – field of computational science

Analysis: There is a fairly wide spread of application areas among the respondents, and therefore among the groups surveyed. All of these are well-known fields in HPC, with the possible exception of medical and engineering science. The responses reflect the fact that the survey was circulated mainly via large HPC user communities, which explains the dominance of computational chemistry and CFD.

 

Q2: In which languages are your applications written?

Multiple answers were allowed per respondent, so the totals reflect every language in use rather than the number of respondents.

Figure 1: Distribution of languages used for program implementation

Analysis: The dominance of Fortran is no surprise, particularly given the high proportion of computational chemists among the respondents. The fact that C++ has overtaken C as the second most prevalent language is encouraging for the AllScale project.

The presence of scripting languages reflects the fact that many HPC suites are steered by scripts, particularly Linux shell scripts. Perhaps the most interesting outcome here is that Assembler is still in use and, at 10%, comes in fourth place.

 

Q3: Which of the following software technologies do you use in your application today?

As with the previous question, multiple answers were allowed.

 

Figure 2: Analysis of the types of parallel runtime environments used

 

Analysis: It is clear from the figure that MPI is the dominant implementation tool, closely followed by OpenMP. Together these far outweigh the sum of all of the other technologies. Task-based runtimes are clearly in a minority.

 

Q4: Which are the principal mathematical kernels that dominate your applications?

Multiple answers were allowed for this question.

Figure 3: Analysis of the types of mathematical kernels used

Analysis: Solution of differential equations, numerical integration and linear algebra are the dominant kernels, no doubt correlating with the fact that the majority of respondents work in computational fluid dynamics or computational chemistry.

 

Q5: Which numerical representation is most common in your applications today?

Figure 4: Analysis of the different numerical representations

Analysis: The dominance of double precision over all other representations is clear. It is noteworthy that a significant number of applications use a mixture of double and single precision.
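
Where mixed precision is used, a common motivation is to halve memory traffic for bulk data while keeping sensitive reductions accurate. The short C++ sketch below illustrates one such pattern; it is a minimal assumed example, not taken from any surveyed application: field values are stored in single precision, while the accumulation is carried out in double precision.

    #include <cstdio>
    #include <vector>

    int main() {
        // Bulk data in single precision: half the memory and bandwidth of double.
        std::vector<float> field(1000000, 1.0e-3f);

        // The reduction is accumulated in double precision, since summing many
        // small single-precision values is where rounding error builds up.
        double sum = 0.0;
        for (float x : field) sum += x;  // each element is promoted to double

        std::printf("sum = %.10f\n", sum);
        return 0;
    }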

 

Q6: How many CPUs do you use in typical jobs today (more than one answer allowed)?

Respondents were asked to consider all of the types of jobs that they regularly run. The information is presented as a pie chart.

Figure 5: Distribution of jobs by number of CPUs used

Analysis: The largest block here is jobs using 1024-4095 CPUs, followed by those using 256-511 CPUs and then 128-255 CPUs. This reflects the fact that many of the surveyed users test their code bases on smaller systems, probably local university systems, before moving to larger national or PRACE-type systems. It is notable that some users report using over 100,000 CPUs per job, and a similar number report between 10,000 and 99,999 CPUs.

 

Q7: Have you considered the challenges of ExaScale computing, and are you actively considering changes to bring your code to readiness for ExaScale?

This relatively simple question produced the following results:

46.43% – YES

25.00% – NO

10.71% – Have considered changes but are not carrying them out (too risky/costly)

17.86% – Do not wish to answer

This provides a much wider picture than the one we had found when surveying the AllScale pilot applications in D5.1.

 

Q8: Have you considered the impact of HARD and SOFT faults in your application?

Table 2: Determining whether users have considered hard and soft faults

Analysis: Approximately 40% of the users make no attempt to handle faults in their code; written responses suggest that they simply resubmit the job. It is intriguing to note that 15% of users did not wish to answer the question, although no reasons for this are available from the survey.

 

Q9: Resiliency against HARD faults can be provided by using a generic checkpoint/restart mechanism. Does your application software employ checkpoint/restart mechanisms? If so, please describe briefly.

We received responses from all 28 respondents to the survey. Of these, 17 indicated that a checkpoint/restart mechanism was indeed part of their software package. The implementations varied significantly. For example, one respondent replied that job chains are realized and that checkpoints can be coordinated via MPI_BARRIER statements, while many others replied that periodic checkpoint files were written or, in a few cases, a checkpoint at every timestep. Interestingly, one respondent working in the field of computational chemistry reported (including a publication reference) that in the area of non-deterministic global optimization "we do better than that: our program is fully resistant against hard faults", cf. DOI 10.1021/acs.jctc.6b00716.
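
To make the dominant pattern concrete, the following minimal C++/MPI sketch shows periodic checkpointing in a time-stepping code: each rank writes its own checkpoint file at a fixed interval and resumes from it after a restart. All names (State, apply_timestep, the file layout) are illustrative assumptions, not taken from any respondent's software.

    #include <mpi.h>
    #include <cstdio>
    #include <string>
    #include <vector>

    // The data that must survive a restart.
    struct State {
        int step = 0;
        std::vector<double> field;
    };

    // One checkpoint file per rank; the barrier marks the set as complete.
    void write_checkpoint(const State& s, int rank) {
        std::string name = "ckpt_rank" + std::to_string(rank) + ".bin";
        if (FILE* f = std::fopen(name.c_str(), "wb")) {
            std::fwrite(&s.step, sizeof s.step, 1, f);
            std::size_t n = s.field.size();
            std::fwrite(&n, sizeof n, 1, f);
            std::fwrite(s.field.data(), sizeof(double), n, f);
            std::fclose(f);
        }
        MPI_Barrier(MPI_COMM_WORLD);  // all ranks agree the checkpoint exists
    }

    // Restore the state if a checkpoint file is present; returns success.
    bool read_checkpoint(State& s, int rank) {
        std::string name = "ckpt_rank" + std::to_string(rank) + ".bin";
        FILE* f = std::fopen(name.c_str(), "rb");
        if (!f) return false;
        std::size_t n = 0;
        bool ok = std::fread(&s.step, sizeof s.step, 1, f) == 1
               && std::fread(&n, sizeof n, 1, f) == 1;
        s.field.resize(n);
        ok = ok && std::fread(s.field.data(), sizeof(double), n, f) == n;
        std::fclose(f);
        return ok;
    }

    void apply_timestep(State& s) { ++s.step; /* real physics goes here */ }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        State s;
        s.field.assign(1000, 0.0);
        read_checkpoint(s, rank);  // after a hard fault, resume from the checkpoint

        const int total_steps = 100, interval = 10;  // checkpoint every 10 steps
        while (s.step < total_steps) {
            apply_timestep(s);
            if (s.step % interval == 0) write_checkpoint(s, rank);
        }
        MPI_Finalize();
        return 0;
    }

A production implementation would additionally have all ranks agree on a common restart step and keep at least two checkpoint generations, so that a fault during writing cannot destroy the only valid copy.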

This is reference [3] below, reporting on an efficient massively parallel implementation of genetic algorithms for chemical and materials science problems, based solely on Java virtual machine (JVM) technologies and standard networking protocols. The paper points out that the lack of complicated dependencies allows for a highly portable solution exploiting strongly heterogeneous components within a single computational context. At runtime, the implementation is almost completely immune to hardware failure, and additional computational resources can be added or subtracted dynamically if needed. This raises an interesting concept of using resilient virtual machines, such as the JVM, to decouple the implementation from the runtime completely.
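
The generic mechanism behind such immunity is a master/worker task farm in which the master re-queues any work lost to a failed worker. The sketch below is emphatically not the implementation of [3]; it is a self-contained, single-process C++ illustration of the re-queueing idea, with the failure of one worker simulated.

    #include <cstdio>
    #include <map>
    #include <queue>

    struct Task { int id; };

    int main() {
        const int n_tasks = 8;
        std::queue<Task> pending;       // work not yet handed out
        std::map<int, Task> in_flight;  // work currently held by a worker
        for (int i = 0; i < n_tasks; ++i) pending.push({i});

        int completed = 0;
        bool simulated_failure_pending = true;
        while (completed < n_tasks) {
            // Hand out all pending work (in a real system: send to remote workers).
            while (!pending.empty()) {
                Task t = pending.front();
                pending.pop();
                in_flight[t.id] = t;
            }
            // Collect outcomes. The worker holding task 3 "dies" once; the master
            // simply re-queues that task, so no work is ever lost.
            for (auto it = in_flight.begin(); it != in_flight.end(); ) {
                if (it->first == 3 && simulated_failure_pending) {
                    simulated_failure_pending = false;
                    pending.push(it->second);  // resilience: run it elsewhere
                    std::printf("task %d lost to a failed worker, re-queued\n",
                                it->first);
                } else {
                    std::printf("task %d completed\n", it->first);
                    ++completed;
                }
                it = in_flight.erase(it);
            }
        }
        return 0;
    }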

 

Q10: Do you use any of the following error checking mechanisms?

Table 3: Analysis of the different types of error checking mechanisms incorporated

Analysis: It is interesting to note here that checks of numerical accuracy and computation of application-specific constraints together represent the main techniques used to check for errors.
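
As an illustration of an application-specific constraint check, the C++ sketch below monitors a conserved quantity (here, hypothetically, total mass under a conservative scheme) after every timestep; a drift beyond tolerance flags a possible soft error. The quantity, tolerance and recovery action are assumptions for illustration only.

    #include <cmath>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // Recompute the conserved quantity from the current solution.
    double total_mass(const std::vector<double>& density) {
        return std::accumulate(density.begin(), density.end(), 0.0);
    }

    int main() {
        std::vector<double> density(1000, 1.0);
        const double mass0 = total_mass(density);     // reference value
        const double tol = 1e-10 * std::fabs(mass0);  // relative tolerance

        for (int step = 0; step < 100; ++step) {
            // ... advance the solution; a conservative scheme preserves mass ...
            const double drift = std::fabs(total_mass(density) - mass0);
            if (drift > tol) {
                std::fprintf(stderr,
                             "step %d: mass drift %.3e exceeds tolerance, "
                             "possible soft error\n", step, drift);
                return 1;  // in practice: roll back to a checkpoint and retry
            }
        }
        std::puts("all constraint checks passed");
        return 0;
    }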

 

Summarizing the results of the survey

In this section we bring together the results of each question in our survey and present some findings from it.

Forty percent of all applications surveyed reported that they make no attempt to cater for hard or soft faults at present. However, almost fifty percent of the applications attempt to validate the numerical accuracy of their results. Double precision remains the dominant numerical representation. Given the distribution of languages reported, this suggests that double-precision data representations are in use across all of the reported languages.

We find that almost 60% of all groups surveyed have already investigated the changes that they will need to apply to their code bases when moving them to run in an ExaScale environment, including the challenges that resiliency will impose. The fact that Fortran is a dominant language in this response group implies that even Fortran users are considering these challenges.

Interestingly, 10% of all respondents have decided that the required changes are either too costly or too risky to implement at this time. We believe that this will prove to be a false economy and that these groups will eventually have to make the changes.

We note that relatively few groups use a task-based runtime model, with OpenMP and MPI being, unsurprisingly, the most widely used tools for exploiting parallelism today. It was interesting to see the response from materials science to Q9, reporting resiliency against hardware faults using a JVM-based approach. In that work [3], two parallelization technologies are supported: shared-memory parallelization based on the Java thread model, and distributed-memory parallelization based on Java wrappers of the MPI API (mpiJava and MPJ Express); both implementations were shown to scale linearly.

 

Conclusions and Future Work

We have surveyed 28 groups of application scientists from across Europe and North America to determine their understanding of resiliency requirements in the emerging field of ExaScale computing. This represents a much wider sample of the scientific software development community than the three AllScale pilot applications and addresses the issue raised by the EC reviewers at the mid-term review. While we attempted to reach as many application groups as possible, we recognise a strong bias towards computational physics and chemistry in our responses. However, this correlates approximately with the distribution of workloads at many national and PRACE HPC facilities. Clearly, we are unable to compete with leading market analysts [4] in conducting our survey, and we were unable to engage the defense sciences community or the emerging data-analytics users in commercial computing [5].

We believe that AllScale is well placed to market itself to the community that we have surveyed. In particular, we have learnt from this survey which mathematical kernels are widely used, and AllScale can take this into account in future work when developing new training materials aimed at marketing the system to a wider community. It is noteworthy that solving partial differential equations, numerical integration and linear algebra, generally in double-precision floating-point arithmetic, are the dominant mathematical kernels in use by this group of application scientists. On a different note, the continuing wide use of the Fortran language, most likely due to large legacy code bases, may present a barrier to the uptake of C++ tools such as AllScale [6,7].

 

Bibliography

  1. R. F. Barrett et al., "Navigating an Evolutionary Fast Path to Exascale", 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, Salt Lake City, UT, 2012, pp. 355-365. doi: 10.1109/SC.Companion.2012.55
  2. S. Ahern, S. R. Alam, M. R. Fahey, R. J. Hartman-Baker, R. F. Barrett, R. A. Kendall, D. B. Kothe, R. T. Mills, R. Sankaran, A. N. Tharrington and J. B. White III, "Scientific Application Requirements for Leadership Computing at the Exascale", United States, 2007. doi: 10.2172/1081802
  3. J. M. Dieterich and B. Hartke, "Error-Safe, Portable, and Efficient Evolutionary Algorithms Implementation with High Scalability", J. Chem. Theory Comput., 2016, 12 (10), pp. 5226-5233. doi: 10.1021/acs.jctc.6b00716
  4. "Impact of Exascale Computing Trends in Key Sectors (TechVision)", Frost and Sullivan, 2015, available at https://store.frost.com/impact-of-exascale-computing-trends-in-key-sectors-techvision.html, last accessed 20th March 2018.
  5. S. Rubenoff, "How Deep Learning is Causing a ‘Seismic Shift’ in the Retail Industry", InsideHPC, Special Report, available at https://insidehpc.com/2018/02/deep-learning-shift-retail-industry/, last accessed 20th March 2018.
  6. S. X. Yang, D. Gannon, S. Srinivas, F. Bodin and P. Bode, "High Performance Fortran interface to the parallel C++", Proceedings of the IEEE Scalable High Performance Computing Conference, Knoxville, TN, 1994, pp. 301-308. doi: 10.1109/SHPCC.1994.296658
  7. M. Spencer, R. Ferreira, M. Beynon, T. Kurc, U. Catalyurek, A. Sussman and J. Saltz, "Executing Multiple Pipelined Data Analysis Operations in the Grid", Supercomputing ACM/IEEE 2002 Conference, pp. 54, 2002. ISSN 1063-9535.

 

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671603

Contact Details

General Coordinator

Thomas Fahringer

Scientific Coordinator

Herbert Jordan