Parallel Hybrid Testing Tool for Applications Developed by Using MPI + OpenACC Dual-Programming Model

designed to test applications programmed using the dual-programming model MPI + OpenACC or the single programming model OpenACC.


Introduction
In recent years, building massively parallel supercomputing systems based on heterogeneous architectures has been one of the top research topics. Creating parallel programs has therefore become increasingly important, but there is a lack of parallel programming languages, and the majority of traditional programming languages cannot support parallelism efficiently. As a result, programming models have been created to add parallelism to existing programming languages. A programming model is a set of instructions, operations, and constructs used to support parallelism.
Today, there are various programming models with different features, created for different purposes. They include message passing, such as MPI [1], and shared-memory parallelism, such as OpenMP [2]. Some programming models also support heterogeneous systems, which consist of a Graphics Processing Unit (GPU) coupled with a traditional CPU. Heterogeneous parallel programming models include CUDA [3] and OpenCL [4], which are low-level programming models, and OpenACC [5], which is a high-level heterogeneous programming model.
Testing parallel applications is difficult because parallel errors are hard to detect due to the non-deterministic behavior of parallel applications. Even after detecting the errors and modifying the source code, it is not easy to determine whether the errors have been corrected or merely hidden. Integrating two different programming models inside the same application makes it even more difficult to test. Despite the available testing tools that detect static and dynamic errors, there is still a shortage of testing tools that detect run-time errors in systems implemented in high-level programming models. The rest of this paper is structured as follows. Section 2 describes the research objectives, and Section 3 gives a brief overview of some programming models and some run-time errors. Related work is discussed in Section 4, the proposed architecture is presented in Section 5, a discussion follows in Section 6, and Section 7 gives the conclusion and future work.

Research Objectives
This research aims to develop a parallel hybrid testing tool for systems implemented in the MPI + OpenACC dual-programming model with the C++ programming language. The hybrid technique combines static and dynamic testing to detect real and potential run-time errors, both by analyzing the source code and by monitoring the program during run-time. Using parallel hybrid techniques will reduce the testing time and cover a wide range of errors. The following are the primary objectives of our research:

Provide new static testing techniques for detecting real and potential run-time errors in systems implemented in the dual-programming model (OpenACC and MPI) and the C++ programming language.
These techniques analyze the source code before compilation to detect static errors. Some run-time errors can also be detected from the source code, such as the send-send deadlock in Figure 2 B. These real errors should be reported to the developers to fix, because they will definitely occur at run-time. Potential run-time errors, in contrast, are errors that might or might not occur after compilation and during run-time. The causes of these potential errors can be detected from the source code before compilation by using static testing. If they are not detected, however, they can become run-time errors. The developers should therefore be warned about these errors and consider them; our tool will also instrument them by using an assertion language.
The source code under test combines the C++ program with the source code of the dual-programming model, which yields one large body of source code with a considerable number of statements. The static testing techniques will decrease the time spent detecting run-time errors after compilation, which will speed up system testing. They will also allow us to correct errors or to inform developers by providing them with a list of potential errors that might occur in some cases at run-time.
The example in Figure 1 shows a potential run-time error. When process_1 first receives a request from any process other than process_0, there is no problem. However, if process_1 receives from process_0 first, the statement REC_FROM (P_0) will never complete, and process_1 will wait forever. In that case, we can discover from the source code that this may cause a run-time error (deadlock). This situation is called a potential deadlock. Figure 2 shows an example of a real run-time error, a deadlock that happens because Process_0 blocks while waiting to receive from Process_1, which in turn blocks while waiting to receive from Process_0. The same happens between Process_2 and Process_3.
The assertion language will be used to specify the properties of the programs under test and to verify that the developers' assumptions about the program remain valid during run-time. During testing, assertion statements help to record information, test the correctness of statements, and monitor the values of variables. To do this, the dynamic tester will automatically insert assertion statements into the code and then provide a method for capturing, organizing, and analyzing the assertion output. This instrumentation technique increases the error detection capability of a test. The instrumentation approach is based on the idea that the tested part of a program can be specified in terms of assertions, or values that variables must assume, at specific critical points in the program that can cause run-time errors [6].
Usually, each assertion statement is preceded by the comment symbol of the programming language, such as "//" in C++. The main reason is to reduce the size of the compiled code delivered to customers, because any statement that starts with the comment symbol is ignored during compilation. In other words, the assert statements are present in the source code but not in the compiled code that is delivered to customers.
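As an illustration of this comment-based convention, the following is a minimal sketch of how an instrumentor might turn commented assertions into executable checks. The `//@assert` marker, the `instrument` function, and the `report_violation` call that it emits are hypothetical names chosen for this sketch; the actual syntax of the proposed assertion language is not defined in this paper.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical marker: the paper only states that assertions start with "//".
const std::string kMarker = "//@assert ";

// Rewrites every line of the form "<indent>//@assert <expr>" into
// "<indent>if (!(<expr>)) report_violation(<line number>);".
// All other lines, including ordinary comments, are copied unchanged.
std::vector<std::string> instrument(const std::vector<std::string>& source) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < source.size(); ++i) {
        const std::string& line = source[i];
        const std::size_t first = line.find_first_not_of(" \t");
        if (first != std::string::npos &&
            line.compare(first, kMarker.size(), kMarker) == 0) {
            const std::string indent = line.substr(0, first);
            const std::string expr = line.substr(first + kMarker.size());
            // report_violation is an assumed run-time hook, not a real API.
            out.push_back(indent + "if (!(" + expr + ")) report_violation(" +
                          std::to_string(i + 1) + ");");
        } else {
            out.push_back(line);
        }
    }
    return out;
}
```

Because the uninstrumented source contains only comments, compiling it directly yields a binary without any testing code, while the instrumented copy carries the checks.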

Provide new parallel dynamic testing techniques for detecting run-time errors in systems implemented in the dual-programming model (OpenACC and MPI) and the C++ programming language.
These techniques will use the provided assertion language to detect errors that occur during run-time, by instrumenting and analyzing the system while it runs. This is challenging because different factors and complicated scenarios can cause these errors. Testing parallel programs is also difficult because of the nature and behavior of such programs, which adds work for the testing tool in covering every possible scenario of test cases and data. As a result, detecting parallel run-time errors is more difficult. Furthermore, dynamic techniques are sensitive to the execution environment and can affect the system execution time.

Integrate the provided techniques to develop a parallel hybrid testing tool for systems implemented in the dual-programming model (OpenACC and MPI) and the C++ programming language.
Our proposed architecture will integrate static and dynamic testing techniques to create a new hybrid testing tool for parallel systems. This lets us take advantage of both techniques: some of the dynamic errors are detected from the source code by static testing, which will reduce the system execution time. Our system will also work in parallel to detect run-time errors, by creating testing threads according to the number of application threads. Both intra-process and inter-process run-time detection will be included in our tool. The intra-process detector will be responsible for detecting run-time errors that occur within a process, and the inter-process detector for detecting errors that occur between processes.

Background
This section presents the main components involved in our research. It covers the programming models used in our research and explains why they were chosen. It also describes some run-time errors and testing techniques.

OpenACC
OpenACC, which stands for open accelerators, was first released in November 2011 at the International Conference for High-Performance Computing, Networking, Storage and Analysis [7]. OpenACC is a directive-based open standard developed by Cray, CAPS, NVIDIA and PGI. They designed OpenACC as a simple high-level parallel programming model for heterogeneous CPU/GPU systems that is compatible with the FORTRAN, C, and C++ programming languages. The OpenACC Standard Organization defines OpenACC as "a user-driven directive-based performance-portable parallel programming model designed for scientists and engineers interested in porting their codes to a wide variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than required with a low-level model." [5]. The latest version of OpenACC was released in November 2017. OpenACC has several features and advantages compared with other heterogeneous parallel programming models, including:
• Portability: Unlike a programming model such as CUDA, which works only on NVIDIA GPU accelerators, OpenACC is portable across different types of GPU accelerators, hardware, platforms, and operating systems [8].
• OpenACC is compatible with various compilers and gives flexibility to the compiler implementations.
• It is a high-level programming model, which makes targeting accelerators easier by hiding low-level details. To generate low-level GPU programs, OpenACC relies on the compiler, guided by the programmer's code [9].
• Better performance with less programming effort: GPU code can be added to existing programs with little effort, which reduces the programmer's workload, improves productivity, and achieves better performance than OpenCL and CUDA [10].
• OpenACC allows users to specify three levels of parallelism by using three clauses: gang, worker, and vector.
OpenACC has had a strong and significant impact on the HPC community as well as other scientific communities. Jeffrey Vetter (HPC luminary and Joint Professor, Georgia Institute of Technology) wrote: "OpenACC represents a major development for the scientific community. Programming models for open science by definition need to be flexible, open and portable across multiple platforms. OpenACC is well-designed to fill this need." [5].

Message Passing Interface (MPI)
Message Passing Interface (MPI) [1] is a message-passing library interface specification. The first official version of MPI was released in May 1994. MPI is a message-passing parallel programming model that moves data from the address space of one process to that of another by using cooperative operations on each process. MPI aims to establish a standard for writing message-passing programs that is portable, efficient, and flexible. MPI is a specification, not a language or an implementation, and all MPI operations are expressed as functions, subroutines, or methods for programming languages including FORTRAN, C, and C++. MPI has several implementations, including open source implementations, such as Open MPI [11] and MPICH [12], and commercial implementations, such as IBM Spectrum MPI [13] and Intel MPI [14]. MPI has several features and advantages. A new version of the MPI standard [1] is in progress. It aims to add new techniques, approaches, or concepts that will help MPI address the needs of current and next-generation applications and architectures. The new version will better support hybrid programming models, including hybrid MPI+X concerns, and fault tolerance in MPI applications.

Dual-Level Programming Model: MPI + OpenACC
Integrating more than one programming model can enhance parallelism, performance, and the ability to work with heterogeneous platforms. This combination will also help in moving to Exascale systems, which need more powerful programming models that support massively parallel supercomputing systems. Hybrid programming models can be classified as:
• Single-Level Programming Model: MPI
• Dual-Level Programming Model: MPI + X
• Tri-Level Programming Model: MPI + X + Y
In our research, the dual-programming model (MPI + OpenACC) will be discussed. As mentioned earlier, MPI and OpenACC each have various advantages. By combining them, we enhance parallelism and performance, reduce programming effort, and take advantage of heterogeneous GPU accelerators. This is achieved by using OpenACC, which can be compiled for multiple device types, including GPUs from multiple vendors and multi-core CPUs, as well as for different hardware architectures. MPI is used to exchange data between different nodes, as shown in Figure 3 [15], which displays how to use MPI for inter-GPU communication with OpenACC.
The dual-programming model MPI + OpenACC can be a practical way to write portable and scalable applications for heterogeneous architectures. It inherits high performance, scalability, and portability from MPI, and programmability and portability from OpenACC [16]. However, this dual-programming model might introduce different types of run-time errors, which have different behaviors and causes. Some complexities and inefficiencies might also arise, including redundant data movement and excessive synchronization between the models, which need to be considered and handled. Even so, it is preferable to using CUDA or OpenCL, which are more complicated and harder to program, resulting in lower productivity.

Common Run-Time Errors
Several types of run-time errors occur after compilation and cannot be detected by compilers, and they cause the program not to meet the user requirements. Some of these errors share a name but differ in their causes or in their behavior. For example, a deadlock in MPI has different causes and behaviors from a deadlock in OpenACC, and run-time errors in the dual-programming model differ again. Some run-time errors also occur only in a particular programming model. By investigating the specification of the latest version, OpenACC 2.7 [17], we found that OpenACC has a recurring run-time error: if a variable is not present on the current device, a run-time error results. This case occurs on non-shared-memory devices for several OpenACC clauses.
Similarly, some routines issue a run-time error if the data is not present. Detecting such errors is not easy, and detecting them in applications developed with the dual-programming model is even more complicated. In the following, some common run-time errors are presented and discussed in general, with some examples.

Deadlock
A deadlock is a situation in which a program is in a waiting state for an indefinite amount of time. In other words, one or more threads in a group are blocked forever without consuming CPU cycles. There are two types of deadlock: resource deadlock and communication deadlock. A resource deadlock is the situation where a thread waits for a resource held by another thread in order to proceed.
Similarly, a communication deadlock occurs when some threads wait for messages that they never receive [18][19][20]. The causes of deadlock differ depending on the programming models used and on the nature and behavior of the system. Once a deadlock occurs, it is not difficult to detect, but in some cases it is difficult to detect before it happens, because it occurs only under specific interleavings. Finally, deadlocks in any system can be potential or real deadlocks.
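A real communication deadlock such as the one described for Figure 2 can be viewed as a cycle in a wait-for relation between processes. The following is a minimal sketch under that assumption; the function name `processes_on_wait_cycle` and the encoding of the relation as a vector are hypothetical choices for illustration and are not the detection method of the proposed tool.

```cpp
#include <vector>

// waits_for[p] is the rank that process p is blocked on in a blocking
// receive, or -1 if p is not blocked. Returns the ranks that lie on a
// cycle of the wait-for relation, in increasing order.
std::vector<int> processes_on_wait_cycle(const std::vector<int>& waits_for) {
    const int n = static_cast<int>(waits_for.size());
    std::vector<int> on_cycle;
    for (int start = 0; start < n; ++start) {
        int cur = start;
        // Follow at most n edges; returning to start means start is on a cycle.
        for (int step = 0; step < n; ++step) {
            cur = waits_for[cur];
            if (cur < 0 || cur >= n) break;  // chain ends at an unblocked process
            if (cur == start) {
                on_cycle.push_back(start);
                break;
            }
        }
    }
    return on_cycle;
}
```

For the situation of Figure 2, where Process_0 and Process_1 wait on each other and so do Process_2 and Process_3, the input would be `{1, 0, 3, 2}` and all four ranks would be reported. A process that only waits on a member of a cycle is not itself reported by this sketch, although it can never proceed either.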

Livelock
Livelock is similar to deadlock, except that in a livelock two or more processes continuously change their state in response to changes in the other processes. In other words, it occurs when one or more threads continuously change their states (and hence consume CPU cycles) in response to state changes in other threads, without doing any useful work. As a result, none of the processes makes progress or completes [21,22]. In a livelock, a thread might not be blocked forever, and it is hard to distinguish a livelock from a long-running process. Livelock can also lead to performance and power consumption problems because of the useless busy-wait cycles.

Race Condition
A race condition is a situation that may occur when multiple threads execute concurrently and the order in which the threads run changes the result of the concurrent execution. The execution timing and order affect the program's correctness [20,23]. Some researchers do not differentiate between a data race and a race condition; the difference is explained in the data race definition.

Data Race
A data race occurs when two memory accesses in the program are performed concurrently by two threads and target the same location [23,24], with at least one of them being a write. For example, one read and one write may happen at the same memory location at the same time. As for the relation between the two terms, a race condition is a data race that causes an error; a data race, however, does not always lead to a race condition.
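The definition can be expressed as a small predicate over recorded memory accesses. The sketch below is illustrative only: the `Access` record and both function names are hypothetical, and it assumes that every pair of accesses in the log may run concurrently, that is, it ignores any ordering imposed by synchronization.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One recorded memory access (hypothetical trace record).
struct Access {
    int thread;              // id of the accessing thread
    std::uintptr_t address;  // accessed memory location
    bool is_write;           // true for a write, false for a read
};

// Two accesses conflict if they come from different threads, target the
// same location, and at least one of them is a write.
bool conflicts(const Access& a, const Access& b) {
    return a.thread != b.thread && a.address == b.address &&
           (a.is_write || b.is_write);
}

// Counts conflicting pairs, assuming all accesses may be concurrent.
std::size_t count_conflicting_pairs(const std::vector<Access>& accesses) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < accesses.size(); ++i)
        for (std::size_t j = i + 1; j < accesses.size(); ++j)
            if (conflicts(accesses[i], accesses[j])) ++count;
    return count;
}
```

A practical detector would additionally track synchronization so that ordered accesses are not reported; that part is omitted here.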

Mismatching
Mismatching is a situation that occurs in the arguments of a call; it can often be detected locally and is sometimes even detected by the compiler. Mismatching can take several forms, including a wrong type or number of arguments, arguments involving more than one call, or arguments in collective calls. Developers need to pay special attention when comparing matched pairs of derived data types. Examples of mismatching that occur in MPI are given in [23].

Testing Techniques
Many techniques are used in software testing, including static, dynamic, and other techniques. Static testing is the process of analyzing the source code before the compilation phase to detect static errors. It handles only the application source code, without launching it, which gives us the ability to analyze the code in detail and achieve full coverage. However, static analysis of parallel applications is complicated by unpredictable program behavior, which is in the nature of parallel applications. Even so, static analysis is beneficial for detecting potential run-time errors and some real run-time errors that are obvious from the source code, such as some types of deadlocks and race conditions.
Dynamic testing is the process of analyzing the system during run-time to detect dynamic (run-time) errors. It requires launching the program, is sensitive to the execution environment, and slows down application execution. Dynamic analysis is useful for parallel applications because it gives the flexibility to monitor each thread of the parallel application and detect errors in it. However, it is difficult to cover the whole parallel code with tests, and after correcting the errors, it cannot be confirmed whether the errors are corrected or merely hidden.
Finally, the types and behaviors of the errors determine which technique should be used, because some errors cannot be detected by static analysis and others cannot be detected by dynamic techniques. As a result, our research uses a hybrid technique. Furthermore, this hybrid technique will work in parallel to detect parallel run-time errors and analyze the application's threads.
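As an example of a static check for a potential deadlock, consider the wildcard-receive situation of Figure 1: a receive from any source may consume the message that a later receive from a specific source is waiting for. The following is a deliberately conservative sketch. The constant `kAnySource` merely stands in for a wildcard such as MPI_ANY_SOURCE, the function name is hypothetical, and the rule over-approximates: it warns whenever a wildcard receive precedes a source-specific receive, without counting how many messages each sender actually sends.

```cpp
#include <vector>

// Stand-in for a wildcard source such as MPI_ANY_SOURCE (assumption).
const int kAnySource = -1;

// receives: source ranks of one process's blocking receives, in program
// order. Returns true if a wildcard receive is followed by a receive from
// a specific rank, which the sketch treats as a potential deadlock.
bool has_potential_wildcard_deadlock(const std::vector<int>& receives) {
    bool seen_wildcard = false;
    for (int source : receives) {
        if (source == kAnySource) {
            seen_wildcard = true;
        } else if (seen_wildcard) {
            return true;  // the wildcard may already have taken this message
        }
    }
    return false;
}
```

A warning from such a check corresponds to a potential error: it would be reported to the developer and could be handed to the dynamic part for confirmation at run-time.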

Related Works
Many studies have addressed software testing for HPC and parallel software. These studies vary in purpose and scope: some propose testing tools or detection techniques for one specific type of error, others for several types. Some studies use static testing techniques [25][26][27][28] to detect errors by analyzing the source code and to find real as well as potential run-time errors [29,30]; others use dynamic testing techniques [31,32] to detect errors during execution, or hybrid testing techniques [33][34][35]. The targeted programming models also vary, from testing tools for single-level programming models to tools for tri-level programming models. Even within the same class of programming model, testing differs from one model to another, because each programming model has different errors to detect, as discussed in Section 3.4.
For specific types of errors, many studies have worked on detecting deadlock, livelock, and race conditions by using different techniques. For deadlock detection, many tools and studies use static or dynamic testing techniques to detect both resource and communication deadlocks. UNDEAD [19] is a deadlock detection and prevention tool that helps defeat deadlocks in production software with improved run-time performance and memory overhead. More deadlock detection approaches can be found in [19,36]. Regarding data races, a hybrid test-driven approach was introduced in [35] to detect data races in task-parallel programs, and further data race detection approaches appear in [28,37]. Finally, some livelock detection techniques have been proposed in [21,22].
Regarding the programming models under test, many approaches have been introduced to test and detect errors in parallel software. Many studies address single-level programming models such as MPI, OpenMP, CUDA, and OpenCL, while others focus on dual-level programming models, including MPI + X hybrid programming models for homogeneous and heterogeneous systems. One popular combination is MPI + OpenMP, which appears in [33,38,39]. Some of these studies focus on dynamic testing, while others focus on regression testing, which is the process of analyzing the system after the maintenance phase.
Many HPC debuggers exist, in both commercial and open source versions. One commercial debugger is ALLINEA DDT [40], which supports C++, MPI, OpenMP, and Pthreads. It has been designed to work at all scales, including Petascale. Another is TotalView [41], which supports MPI, Pthreads, OpenMP, and CUDA. However, these debuggers do not help to test for or detect errors; they are used to find the causes behind errors. The developer also needs to select the thread, process, and kernel to be investigated. Regarding open source testing tools, ARCHER [37] is a data race detector that combines static and dynamic techniques to identify data races in large OpenMP applications. AutomaDeD [42] (Automata-based Debugging for Dissimilar Parallel Tasks) is a tool that detects MPI errors by comparing the similarities and dissimilarities between tasks. MEMCHECKER [11] finds hard-to-catch memory errors in MPI applications, such as overwriting of memory regions used in non-blocking communication and one-sided communication. Furthermore, MUST [32] detects run-time errors in MPI and reports them to the developers, including MPI deadlock detection, data type matching, and detection of communication buffer overlaps.
Few studies address testing OpenACC or detecting its static and dynamic errors, although some research is related to OpenACC testing. In [43], the authors evaluate three commercial OpenACC compilers by creating a validation suite that contains 140 test cases for OpenACC 2.0. They also check the conformance, correctness, and completeness of specific compilers for the new features of OpenACC 2.0. This test suite was built on the same concept as the first OpenACC 1.0 validation test suite in [44], in which three commercial compilers were evaluated: CAPS, PGI, and CRAY. Similarly, an OpenACC test suite was published in [45] for OpenACC version 2.5, a previous version, to validate and verify compilers' implementations of OpenACC features.
Recently, another approach to testing OpenACC applications was published in [46]. It detects numerical differences that can arise from computational differences between OpenACC directives. The authors propose having the compiler generate code that runs each compute region on both the host CPU and the GPU. The values computed on the host and on the GPU are then compared, using OpenACC data directives and clauses to decide what data to compare.
Despite the efforts made in creating and proposing software testing tools for parallel applications, much remains to be done, primarily for OpenACC and for dual-programming models for heterogeneous systems. Finally, to the best of our knowledge, no parallel testing tool has been built to test applications programmed by using the dual-programming model MPI + OpenACC.
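The comparison step can be pictured with a small sketch. It is not the implementation from [46]: the function name `mismatched_indices`, the relative-tolerance criterion, and the representation of results as two plain vectors are assumptions made here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the indices at which host and device results differ by more than
// rel_tol relative to the larger magnitude of the two values.
std::vector<std::size_t> mismatched_indices(const std::vector<double>& host,
                                            const std::vector<double>& device,
                                            double rel_tol) {
    assert(host.size() == device.size());  // both runs cover the same data
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < host.size(); ++i) {
        const double scale = std::max(std::fabs(host[i]), std::fabs(device[i]));
        if (std::fabs(host[i] - device[i]) > rel_tol * scale) bad.push_back(i);
    }
    return bad;
}
```

A tolerance is needed because floating-point results on a CPU and a GPU can legitimately differ in the last bits, for example when operations are reordered.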

Proposed Architecture
We propose a parallel hybrid testing tool for the dual-programming model (MPI + OpenACC) and the C++ programming language, as shown in Figure 4. This architecture has the flexibility to detect potential run-time errors and report them to the developer, to detect them automatically by inserting assertions and executing the instrumented program to obtain a list of run-time errors, or to detect dynamic errors. The architecture uses hybrid testing techniques, comprising static and dynamic testing. The static testing part is shown in Figure 5 and the dynamic part in Figure 6.
The source code includes the C++ programming language and MPI + OpenACC as the dual-programming model. The part displayed in Figure 5 is responsible for detecting real and potential run-time errors by using static testing. This part produces a list of potential run-time errors for the developer.
This list can also be an input to the assertion process, so that these potential errors are automatically detected and avoided during the dynamic testing part. Any real run-time errors are reported to the developer with warning messages, because they must be corrected: they will definitely occur at run-time. Real run-time errors discovered from the source code can also be corrected automatically before the process moves to the dynamic testing part, which reduces the testing time and improves the testing performance. The static part of the architecture includes:
• Lexical analyzer: It takes the source code, which includes C++, MPI, and OpenACC, as input. The analyzer can interpret the source code because it holds all the information related to the programming language and the chosen programming models, including keywords, reserved words, operators, and variable and constant definitions. It converts the application source code into tokens and places them in token tables. The output of this analyzer is a token table, which lists the token names and their respective types.
• Parser: This part is responsible for analyzing the syntax of the input source code and confirming that it follows the rules of a formal grammar. It produces a structural representation of the input (a parse tree) that shows how the syntactic elements relate to each other, checking for correct syntax in the process.
• State transition graph generator: This part generates a state graph for the user program, which includes C++, MPI, and OpenACC. The state graph can be represented by any suitable data structure, such as a matrix or a linked list.
• State graph comparator: It takes the graph of the user program as input and compares it with the state graphs of each programming language and model. The comparator has access to state graph libraries, which hold the correct grammar of the respective programming language and models.
Any differences found by these comparisons are reported in a list of potential run-time errors, together with the real run-time errors that can be detected by the static part of the architecture. The real run-time errors are delivered to the developer for correction, because they will undoubtedly occur if they are not corrected. The potential run-time errors are passed on to the assertion insertion step and then instrumented, so that they are considered in the dynamic part of the proposed architecture.
The dynamic testing part of the proposed architecture is shown in Figure 6. It takes the source code and the assertion statements as input and passes them to the instrumentor. Depending on the semantics of the assertion language, the instrumentor produces code in the target programming language. The instrumentor consists of four modules: a lexical analyzer, a parser, a semantic analyzer, and a code translator. Its output is an instrumented source code, which includes the user code and the testing code, both written in the programming language of the user code.
Instrumentation can be done in two ways. The first is to add the testing code, that is, the assertion statements, directly to the source code, which leads to a bigger code size because the result contains both user code and testing code. The second is to add the assert statements as calls to API functions, which test the part of the code that needs to be tested. This method leads to a smaller code size, because wherever a test is needed only a call statement is written and the function performs the test. The difference is noticeable when the same testing code applies to several parts of the user code: with the first method the testing code is repeated many times, while with the second it is written once and called multiple times. Further investigation of the instrumentation will be considered in our future work.
The resulting instrumented code is compiled and linked, which yields executable code that includes the user executable code and the run-time subsystems. Finally, this executable code is run and provides a list of run-time errors.
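To make the first stage of the static part more concrete, the following is a minimal sketch of a lexical analyzer that produces a token table for one line of source code. The token categories, the `Token` record, and the `tokenize` function are hypothetical and much simpler than a real C++ lexer; for instance, an OpenACC directive is kept as a single token and any identifier starting with `MPI_` is classified as an MPI call.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum class TokenType { AccDirective, MpiCall, Identifier, Number, Symbol };

struct Token {
    std::string text;
    TokenType type;
};

// Splits one source line into a token table (simplified sketch).
std::vector<Token> tokenize(const std::string& line) {
    std::vector<Token> tokens;
    const std::size_t first = line.find_first_not_of(" \t");
    if (first == std::string::npos) return tokens;  // blank line
    if (line.compare(first, 11, "#pragma acc") == 0) {
        tokens.push_back({line.substr(first), TokenType::AccDirective});
        return tokens;
    }
    std::size_t i = first;
    while (i < line.size()) {
        const unsigned char c = static_cast<unsigned char>(line[i]);
        if (std::isspace(c)) { ++i; continue; }
        std::size_t j = i;
        if (std::isalpha(c) || c == '_') {
            while (j < line.size() &&
                   (std::isalnum(static_cast<unsigned char>(line[j])) || line[j] == '_'))
                ++j;
            const std::string word = line.substr(i, j - i);
            const bool is_mpi = word.compare(0, 4, "MPI_") == 0;
            tokens.push_back({word, is_mpi ? TokenType::MpiCall : TokenType::Identifier});
        } else if (std::isdigit(c)) {
            while (j < line.size() && std::isdigit(static_cast<unsigned char>(line[j]))) ++j;
            tokens.push_back({line.substr(i, j - i), TokenType::Number});
        } else {
            j = i + 1;  // any other character is a one-character symbol
            tokens.push_back({std::string(1, line[i]), TokenType::Symbol});
        }
        i = j;
    }
    return tokens;
}
```

The resulting table of token names and types is the kind of output that the parser and the state graph generator would consume in later stages.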

Discussion
Many tools and studies address the detection of run-time errors in parallel systems that use the MPI, CUDA, and OpenMP programming models. However, even though OpenACC works across heterogeneous architectures, hardware, and platforms, and is used by non-computer-science specialists, who can easily introduce errors, there is no research or testing tool that detects OpenACC run-time errors. OpenACC is also increasingly used in different research fields and is one of the main programming models targeting Exascale systems. Recently, OpenACC has been used in five of 13 applications to accelerate performance on Summit, the top supercomputer in the world. Three of the top five HPC applications use OpenACC as well. This increase in the use of OpenACC will therefore come with more errors that need to be detected.
In our tool, we use hybrid testing techniques, comprising static and dynamic testing. This combination takes advantage of both testing techniques, reduces their disadvantages, and reduces the testing time. The first part of the hybrid technique is static testing, which analyzes the source code before compilation to detect static errors. Some run-time errors can also be detected from the source code and should be reported to developers to fix, because they will definitely occur at run-time. In addition, potential run-time errors are errors that might or might not occur after compilation and during run-time, depending on the execution behavior. The causes of these potential errors can be detected from the source code before compilation by using static testing. If they are not detected, however, they can become run-time errors. As a result, the developers should be warned about these errors and consider them.
The second part of the hybrid technique is dynamic testing, which detects errors that occur during run-time by instrumenting and analyzing the system while it runs. This is challenging because different factors and complicated scenarios can cause these errors. In addition, testing parallel programs is difficult because of the nature and behavior of such programs, which adds work for the testing tool in covering every possible scenario of test cases and data. Furthermore, dynamic techniques are sensitive to the execution environment and can affect the system execution time. Finally, the type and behavior of the run-time errors determine which technique should be used, because some errors cannot be detected by static analysis and others cannot be detected by dynamic techniques.

Conclusion and Future Works
High-performance computing has become increasingly important, and Exascale supercomputers will be feasible by 2020; therefore, building massively parallel supercomputing systems based on heterogeneous architectures has become even more important for increasing parallelism. Using hybrid programming models to create parallel systems has several advantages, but mixing parallel models within the same application leads to more complex code. Testing such complex applications is difficult and needs new techniques for detecting run-time errors.
We proposed a parallel hybrid testing tool for detecting run-time errors in systems implemented in C++ and MPI + OpenACC. The proposed solution integrates static and dynamic testing techniques to build a new hybrid testing tool for parallel systems. This lets us take advantage of both techniques: some of the dynamic errors are detected from the source code by static testing, which will reduce the system execution time. Our system will also work in parallel to detect run-time errors, by creating testing threads according to the number of application threads.
In our future work, we will identify and classify OpenACC run-time errors and study their behavior and causes, to guide the building of our testing tool. We will also implement our architecture and evaluate its ability to detect OpenACC run-time errors, and we will identify and address the run-time errors that result from the dual-programming model MPI + OpenACC. Our experiments will be conducted on the AZIZ supercomputer, which is one of the top ten supercomputers in the Kingdom of Saudi Arabia. In June 2016, AZIZ was ranked No. 359 among the Top 500 supercomputers in the world.

Conflict of Interest
The authors declare no conflict of interest.