Improving System Reliability Assessment of Safety-Critical Systems using Machine Learning Optimization Techniques

Article history: Received: 14 November, 2017 Accepted: 10 January, 2018 Online: 30 January, 2018


Introduction
Reliability assessment of safety-critical systems is becoming an almost insurmountable challenge. In the near future, the engineering of new vehicle applications such as driving assistance functions or even autonomous driving systems will inevitably entail significantly increased engineering sophistication and longer test cycles. In the automotive domain, functional safety continues to be ensured on the basis of the international ISO 26262 standard. As both the levels of functionality such systems provide and their degree of interaction with their environment increase, an adequate increase in system safety assessment capabilities is required. This paper is an extension of work originally presented at the 10th IEEE International Conference on Software Testing, Verification and Validation (ICST 2017) [1] and describes a methodology for efficiently assessing system safety. The focus of the paper is on regression testing of safety-critical systems consisting of black-box components. This scenario is common for automotive electronic systems, where testing time is expensive and should be reduced without an uncontrolled reduction in reliability.
The work reported here correspondingly seeks to increase testing efficiency by reducing the number of selected test cases in a regression test cycle. When a selection decision is made, the following two types of errors are possible:
• a test case is selected but would pass (type-I error, false-positive case) and
• a test case is not selected but would fail (type-II error, false-negative case).
Accordingly, we model a classifier Ĥ for solving the following optimization problem.
A good standard of test efficiency calls for the avoidance of false-positives. This requires minimization of the probability of mistakenly assuming the rival hypothesis (H_1: test case fails) even though the null hypothesis (H_0: test case passes) is correct. Conversely, false-negatives mean that system failures remain undetected; the occurrence of this type of error must therefore be avoided with very stringent requirements. Thus, a predefined limit p_FN,MAX for the probability of a false-negative is defined. [1] proposed a concept for the selection of test cases based on a stochastic model. This paper, in contrast, proposes a holistic optimization framework for the safety assessment of safety-critical systems based on machine learning optimization techniques. We suggest an incrementally and actively learned linear classifier whose parameters are estimated on the basis of Bayesian inference rules. As a result, our novel approach for modeling a linear classifier outperforms other machine learning approaches in terms of sensitivity.
Furthermore, this paper deals with the following fundamentally important research question: The machine learning approach is trained with data (test evaluations) obtained during a concurrently running regression test. How much training data is enough? When does regression test selection actually start?
We extend the proposed selection method [1] by introducing suitable test case features that are used in the machine learning approach to increase performance (see [2]). However, each feature introduced increases the complexity of the optimization problem (cf. Eq. 1), as it adds a new dimension to the optimization. [3] and [4] suggest that high-dimensional optimization problems can be solved in reasonable timeframes by using evolutionary algorithms instead of a (grid-)search-based approach as given in [1]. Accordingly, we propose an evolutionary optimization approach for increasing testing efficiency.
Further extensions, such as the introduction of a prioritization strategy for test cases in order to select higher-priority test cases, will also be presented within this paper. In our novel approach, a linear classifier is trained in an online session; the ordering of the training data on the basis of a prioritization strategy therefore has the potential to improve our classifiers' performance.
We also provide an industrial case study to show the advantages of the suggested selection method. The study uses data from several regression test cycles of an ECU of a German car manufacturer, showing how test effort can be reduced significantly while the rates of both false-negatives and false-positives are kept at very low values. In this example, we quadruple test efficiency while keeping the false-negative probability at 1%.
We first discuss related work in Sec. 2 and explain basic definitions in Sec. 3. We then motivate our research topic in Sec. 4 by giving some background information on regression tests and the challenges they pose. In Sec. 5, we give a brief overview of known machine learning methods' performance in solving safety-critical binary classification tasks. The concept of our novel machine learning approach is presented in Sec. 6. Sec. 7 discusses optimization strategies, and Sec. 8 focuses on the importance of the learning phase for the success of our approach. An industrial case study with real data is then given in Sec. 9. Finally, Sec. 10 presents the paper's conclusions.

Related Work
The automotive industry is currently engaged in a laborious quality assessment process around newly engineered driver assistance and active and passive safety functions, while functional safety is ensured according to the international ISO 26262 standard [5]. Reliability assessment of systems is possible through both model-checking and testing.
Model-checking is used for verifying conditions on system properties. [6] states that system requirements can also be validated by model-checking techniques. The idea is to check the degree to which system properties are met and to deduce logical conclusions about the satisfaction of system requirements. Model-checking has therefore gained wide acceptance in the hardware and protocol verification communities [7]. Motivated by the fact that numerical model-checking approaches cannot be directly applied to black-box components, as a usable formal model is not available, we focus on model-checking-driven black-box testing [6] and statistical model-checking techniques [8]. However, there exist some approaches for interactively learning finite state systems of black-box components (see [9] and [10]), proposed as black box checking in [11]. Learning a model is an expensive task, as the interactively learned model has to be repeatedly adapted to correct inaccuracies. Moreover, some assumptions about the system to be checked, such as the number of internal states, are necessary; furthermore, conformance testing for ensuring the accuracy of the learned model has to be performed iteratively [9].
[8] therefore outlines the advantages of statistical model-checking as being simple, efficient and uniformly applicable to white- and even black-box systems. [6] motivates on-the-fly generation of test cases for checking system properties; here, a test case is generated for simulating a system for a finite number of executions. All these executions are used as individual attempts to discharge a statistical hypothesis test and finally for checking the satisfaction of a dedicated system property.
Model-checking-driven testing, or even simply testing a system in order to validate its requirements, is an expensive task, especially where safety-critical systems are concerned. Our focus, however, is on regression testing, which means that the entire system under test has already been tested once but has to be tested again due to system modifications that have been carried out. The purpose of regression testing is to provide confidence that unchanged parts of the system are not affected by these modifications [12]. White-box selection techniques have been comprehensively researched [13,14]. However, we are here considering black-box components, and hence selecting test cases that only check modified system blocks becomes difficult.
Since neither the implementation of black-box systems nor information on the system modifications performed is available [12], a selective regression test cannot reasonably be conducted.
Accordingly, regression testing of safety-critical black-box systems ends up simply executing all existing test cases; this is a retest-all approach [12]. This is also motivated by the fact that in the automotive industry, up to 80% of system failures [1] detected during a regression test have not occurred previously. The reason is that many unintended bugs are often introduced during a bug-fixing process, so that between two system releases many new, unknown errors appear. To reduce the overall test effort, we apply a test case selection method [1] based on hypothesis tests. Those test cases that are assumed to fail on their executions are accordingly selected. However, type errors while performing hypothesis tests are possible, as, for instance, in statistical model-checking.
We extend the proposed selection method into a holistic machine learning-driven optimization framework that utilizes suitable test case features for increasing testing efficiency (see [2]). Machine learning methods are often trained in so-called batch modes. Nevertheless, many applications in the field of autonomous robotics or driving are trained on the basis of continuously arriving training data [15]. Thus, incremental learning facilitates learning from streaming data and hence is exposed to continuous model adaptation [15]. The handling of non-stationary data is especially important in applications like voice and face recognition due to dynamically evolving patterns. Accordingly, many adaptive clustering models have been proposed, including incremental K-means and evolutionary spectral clustering techniques [16].
Furthermore, labeling input data is often awkward and expensive [17], and hence accurately training models can be difficult. Therefore, semi-supervised learning techniques have been developed for learning from both labeled and unlabeled data [17]. Motivated by these techniques, we propose a similar approach for effectively learning from labeled data. Hence, we cluster binary-labeled data into more than two clusters to improve a classifier's learning capability via the optimization of an objective function. Our optimization framework thus utilizes evolutionary optimization algorithms for handling the optimization complexity. Minimization of labeling cost on the basis of active learning strategies [18,19] will also be dealt with in this paper.

Basic Definitions
We define the test suite T = {t_i | 1 ≤ i ≤ M} consisting of a total of M test cases. T_Exec ⊆ T and T̄_Exec ⊆ T are the subsets of T containing the test cases that are executed and deselected in a current regression test, respectively. Based on the test case executions (∀t_i ∈ T_Exec), a system's reliability is actually learned, and thus the machine learning algorithm is trained.
The focus in supervised learning is on understanding the relationship between features and data (here, test case evaluations) [4]. Therefore, a test case needs to be coded as a feature vector so that the indication of the coded features for a system failure can be learned in a supervised fashion. Such an indication is not just a highly probable forecast of an expected system failure; it is rather a particular risk-associated recognition.
First of all, a feature can be any individual measurable property of a test case. The data type of a feature is mostly numeric, but strings are also possible. However, such features need to be informative, discriminative and independent of one another if they are to be relevant and non-redundant. The definition of suitable features increases the classifier performance [20]. In our application, a feature can be various properties, such as a
• subjective ranking of a test case based on expert knowledge; such rankings can hint at the error susceptibility of verified parts of the system;
• verified function's safety integrity level, known as the ASIL in automotive applications [5];
• name of a function whose reliability is assured;
• reference to any hardware component of a circuit board that is being tested in a hardware-in-the-loop (HiL) test environment;
• number of electronic control units involved in total during the testing of a networked functionality; such a number can hint at the complexity of the networked functionality and hence at its error susceptibility.
We define the entire set of features Φ = {φ_f | 1 ≤ f ≤ F} of test cases that might be relevant for understanding the behavior of test cases. Features may be, e.g., φ_1 = {'QM', 'A', 'B', 'C', 'D'} (ASIL) or φ_2 = {f_1, f_2, f_3} (function name). Hence, a test case can verify a function f_3 that has ASIL A.
The following passages discuss the selection of suitable features, which is an important strategy for improving a classifier's performance.
• Sometimes less is more: If the defined set Φ is too large, it can cause huge training effort, high dimensionality of the optimization problem and overfitting. Thus we define a selection mask b_s = (1, 0, 0, ..., 1) of length F for selecting relevant features Φ_s. If the f-th entry of b_s is greater than or equal to 1, then the corresponding feature φ_f ∈ Φ is selected and added to Φ_s; otherwise it is not.
• The set of main features Φ_m ⊆ Φ_s is coded as follows: If the f-th entry of b_s is equal to 2, then the corresponding feature φ_f ∈ Φ_s is at the same time a main feature φ_f ∈ Φ_m; otherwise it is not. The main features are used to establish the overall training data set: the training data is adapted to each test case t_i, yielding the set T_{t_i}. Hence, we define that two test cases t_i and t_j are equivalent, t_i ≡ t_j, if they agree in all their main features. Additionally, a cross-product transformation of the selected features Φ_s yields the set Ψ of combined features.
In addition, the function state : T × R → S is defined; it returns the state of a dedicated test case in a concrete regression test. The state has to be either 'Pass' or 'Fail', except for cases where the test case has not been executed, so that its state is undefined. Therefore, S = {'Pass', 'Fail', 'Undefined'} defines the set of possible states. Furthermore, the set R = {r_k | 0 ≤ k ≤ K} includes r_0, which is the current regression test, and older regression tests starting from the last regression r_1 back to the first considered regression r_K. Lastly, we define the tuple history(t_i) = (state(t_i, r_1), ..., state(t_i, r_K)) containing t_i's previous test results.
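The definitions above can be sketched in a few lines of Python. The concrete data layout (feature names, dict-based test results) is our own illustration, not part of the paper:

```python
# Illustrative sketch of the basic definitions: selection mask b_s,
# the state function and history tuples. Feature names and results
# are invented for the example.

# b_s entries: 0 = dropped, 1 = selected (Phi_s), 2 = main feature (Phi_m)
features = ["ranking", "asil", "function", "hw_ref", "num_ecus"]
b_s = [0, 2, 2, 1, 1]

phi_s = [f for f, b in zip(features, b_s) if b >= 1]   # selected features
phi_m = [f for f, b in zip(features, b_s) if b == 2]   # main features

# state : T x R -> S with S = {'Pass', 'Fail', 'Undefined'}
results = {("t1", "r1"): "Pass", ("t1", "r2"): "Fail"}

def state(t, r):
    # 'Undefined' if the test case was not executed in regression r
    return results.get((t, r), "Undefined")

def history(t, K):
    # previous results (r_1 .. r_K), excluding the current regression r_0
    return tuple(state(t, f"r{k}") for k in range(1, K + 1))

print(phi_s, phi_m, history("t1", 3))
```

Two test cases would then be equivalent in the sense above if they agree on all `phi_m` values.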

Motivation
In practice, finding suitable features is a difficult task. Since we focus here on black-box systems, system-internal information that might be useful for understanding the system behavior is not available. As a result, we can only define the features listed above, which might be too high-level for classifying system failures. To illustrate this fact, Fig. 1 a) shows a typical situation: the behavior of test cases in relation to arbitrarily defined features φ_1 and φ_2 is given. Passed and failed test cases are represented by green squares and red diamonds respectively, and white circles stand for test cases yet to be executed.
Figure 1: An artificial regression test with test cases.
We can see that the green squares and the red diamonds are widely scattered. Thus, defining a hyperplane in order to set two acceptance regions for passing and failing test cases is no easy matter. In order to solve this complex task, we develop a novel approach that is basically motivated by the following thought experiment: all test cases represented in Fig. 1 a) are now assigned either to Fig. 1 b) or to Fig. 1 c) according to a certain mapping. The individual mappings of test cases will be discussed later, in Sec. 6. In the next step, a cross-product transformation is performed in order to group test cases into sub-regions (we refer to these later as sub-clusters). In our example, we create six sub-regions. Table 1 lists the empirical failure probabilities of each sub-region.
In order to keep our thought experiment very simple, we will neglect statistical computations for now and focus only on the main idea of our novel approach. The introduction of Bayesian networks and hence the derivation of weights for linear classifiers will be discussed later, in Sec. 6. We assume for now that the calculated failure probabilities of test cases in Fig. 1 b) and Fig. 1 c) are correlated. So that our example remains very simple, we also require that the failure probabilities of the corresponding sub-regions are equal. This assumption reduces the complexity of the following classification task enormously. We will classify the test cases t_1, t_2, t_3 and t_4 according to whether a selection is necessary or not. Table 1: Failure probability of each sub-region.
Only if t_1 passes will the failure probability of sub-region VI in Fig. 1 b) be equal to the failure probability of sub-region VI in Fig. 1 c). According to this fact, t_1 is assumed to pass, and hence it is deselected. Furthermore, t_2 will be selected, as a fail of a test case inside sub-region V is expected. However, t_2 passes; based on the same consideration, t_3 is also selected and finally fails. Since a failure probability of 1/3 is now expected in sub-region V, t_4 is assumed to pass, and therefore it does not need to be selected. Table 2 summarizes all decisions executed.

Table 2: Selection decisions of the thought experiment.
Test case  State  Decision    Result
t_1        Pass   Deselected  True-Negative
t_2        Pass   Selected    False-Positive
t_3        Fail   Selected    True-Positive
t_4        Pass   Deselected  True-Negative

So our novel approach for solving binary classification tasks is based on calculated empirical probabilities and empirically evaluated correlations among those probabilities. The behavior of test cases can be precisely estimated on the basis of the calculated correlations. In practice, the failure probabilities of the same sub-regions in Fig. 1 b) and Fig. 1 c) are often not exactly equal, but these failure probabilities are correlated. So the main task is to find good sub-regions for maximizing the empirically evaluated correlations and thus for precisely estimating the behavior of test cases. A more detailed explanation of our novel approach will follow in Sec. 6.
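The decision logic of the thought experiment can be sketched as follows. The evaluation lists per sub-region and the 0.5 decision threshold are our own illustrative choices; the paper's actual decisions follow the correlation argument above:

```python
# Minimal sketch of the thought-experiment selection logic: compute
# empirical failure probabilities per sub-region (as in Table 1) and
# deselect a test case when the correlated sub-region of the other
# cluster suggests no further failure. Data values are invented.

evals = {                      # 1 = 'Fail', 0 = 'Pass', per sub-region
    "IV": [0, 0, 0],
    "V":  [1, 0, 0],           # one fail already observed (like t_3)
    "VI": [0, 0],
}

def p_fail(sub):
    e = evals[sub]
    return sum(e) / len(e)

# A new test case (like t_4) lies in sub-region V of the second
# cluster; the correlated first-cluster estimate is 1/3, so no further
# failure is expected and the test case is deselected.
expected = p_fail("V")
threshold = 0.5                # illustrative decision threshold
decision = "select" if expected >= threshold else "deselect"
print(expected, decision)
```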

Performance of Known Machine Learning Methods
We have already indicated, with the example regression test in Sec. 4 (see Fig. 1), that given the distribution of the input data, many machine learning methods cannot reasonably be applied for solving the constrained optimization problem (cf. Eq. 1). We will now briefly demonstrate that training linear classifiers in the classical sense, by minimizing a loss function, cannot perform well in solving safety-critical binary classification tasks. The situation is that only a small percentage of the data is actually labeled with one ('Fail'). Furthermore, failed and passed test cases are widely scattered in the feature space, which makes detecting failing test cases extremely difficult. Additionally, the performance of deep neural networks is validated in the following.
The evaluation results (precision/recall) of these machine learning methods are given in Table 3. Each machine learning method is trained in batch mode. The training data consists of all obtained test evaluations of a specific regression test that will also be analyzed in our industrial case study in Sec. 9. For evaluating the machine learning methods, we used the training data first for training and later for testing (training data = test data). Even so, the sensitivity of both machine learning methods is zero, and thus we propose a novel approach for determining a linear classifier's parameters in Sec. 6.
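The effect can be reproduced on synthetic data. The data set, seed and least-squares classifier below are our own illustration, not the paper's setup: with rare 'Fail' labels that are uncorrelated with the scattered features, a loss-minimizing linear classifier never predicts a failure, so its sensitivity collapses to zero.

```python
import numpy as np

# Sketch of why a loss-minimizing linear classifier fails here:
# 'Fail' labels are rare and uncorrelated with the features, so the
# fitted scores never cross the decision threshold. Data is synthetic.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                      # scattered features
y = (rng.uniform(size=200) < 0.05).astype(float)    # ~5% 'Fail' labels

X1 = np.hstack([X, np.ones((200, 1))])              # add bias column
w, *_ = np.linalg.lstsq(X1, y, rcond=None)          # least-squares fit
pred = X1 @ w >= 0.5                                # select if score >= 0.5

tp = int(np.sum(pred & (y == 1)))
fn = int(np.sum(~pred & (y == 1)))
recall = tp / (tp + fn)                             # sensitivity
print(int(y.sum()), recall)
```

Since the fitted scores hover around the label mean of roughly 0.05, every prediction is 'Pass' and the recall is zero, mirroring the zero sensitivity in Table 3.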

Concept
The concept of our novel approach is shown in Fig. 2.
We start by specifying a feature set Φ, taking its subset Φ_s and finally constituting a cross-product feature transformation to obtain the set Ψ. Based on Ψ and by applying the check-function on T_Exec, test cases can be grouped. If we look back to the example where we grouped the test cases in Fig. 1 a), we see that there is a relationship between test cases' features and their assignments to sub-regions. Correspondingly, we introduce the definitions of clusters and sub-clusters of test cases. In the first step, the test cases ∀t_k ∈ T_{t_i} are assigned to clusters based on their history-tuples. A cluster is basically a partition of T_{t_i} and consists of test cases that have the same history-tuples. Accordingly, the number of distinct history-tuples N determines the total number of clusters. This is the step that has already been shown in Sec. 4, where the test cases inside Fig. 1 a) were individually mapped into Figs. 1 b) and 1 c). In this way, the already executed test cases depicted in Fig. 1 b) belong to one cluster and, analogously, those executed test cases depicted in Fig. 1 c) belong to another cluster.
In the next step, each cluster C_n is subdivided into L sub-clusters. A test case t_k ∈ C_n is an element of the l-th sub-cluster C_{n,l} if check(t_k, ψ_l) is true. We originally introduced the terminology of sub-regions in Sec. 4. However, we focus in what follows on discrete-valued features, which means that grouping test cases into sub-clusters is more appropriate. By introducing the function eval : T_Exec → {0, 1}, which returns 1 for a failing and 0 for a passing test case, the calculation of the failure probabilities p_{n,l} can be given in Eq. 4.
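The grouping step can be sketched directly from these definitions. The test-case records (history tuples, combined feature values ψ, eval outcomes) are invented for illustration:

```python
from collections import defaultdict

# Sketch of the grouping step: clusters C_n by history tuple,
# sub-clusters C_{n,l} by combined feature value psi, and empirical
# failure probabilities p_{n,l} per sub-cluster. Records are invented.

# (test case, history tuple, combined feature psi, eval: 1 = 'Fail')
executed = [
    ("t1", ("Pass", "Pass"), ("A", "f1"), 0),
    ("t2", ("Pass", "Pass"), ("A", "f2"), 1),
    ("t3", ("Fail", "Pass"), ("A", "f1"), 1),
    ("t4", ("Fail", "Pass"), ("A", "f1"), 0),
]

clusters = defaultdict(lambda: defaultdict(list))
for _, hist, psi, ev in executed:
    clusters[hist][psi].append(ev)        # C_n -> C_{n,l} -> evals

p = {(hist, psi): sum(evs) / len(evs)
     for hist, subs in clusters.items() for psi, evs in subs.items()}

print(p[(("Fail", "Pass"), ("A", "f1"))])   # p_{n,l} = 0.5
```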
The selection decisions in our example in Sec. 4 (cf. Fig. 1) were taken based on calculated failure probabilities. Additionally, the correlations between the failure probabilities were considered. Accordingly, we need a stochastic model for estimating the classifier's sensitivity and specificity. We propose a univariate and also a multivariate stochastic model. The shortcomings of the univariate stochastic model for solving the optimization problem (cf. Eq. 1) will be discussed later to motivate the introduction of the multivariate stochastic model. First of all, the next step introduces a multidimensional Gaussian distribution that constitutes a distribution for the failure probabilities of test cases. Based on this distribution, two distinct Bayesian belief networks for the two stochastic models will be introduced.
In the following, we interpret p_{n,l}, 1 ≤ l ≤ L, as realizations of a random variable X_n. X_n is Gaussian distributed based on the following assumption: since each test case evaluation is a binary experiment with two possible outcomes ('0' or '1'), it can be regarded as a realization of a binary random variable. As the sum of independent random variables results in a Gaussian random variable according to the central limit theorem [21], considering test case evaluations as independent random experiments justifies X_n's assumed distribution. However, test cases are executed on the same system, and there may be some dependencies between test case evaluations that cannot be directly validated by such means as performing code inspections. As a result, we assume a mix of dependent and independent test case evaluations, and hence the Gaussian assumption is still valid. The moments of X_n are the empirical mean E[X_n] = µ_n and variance VAR(X_n) = σ_n² of the realizations p_{n,l}. As we introduced in total N Gaussian random variables, the moments of the multidimensional Gaussian distribution are given analogously by the mean vector and the covariance matrix. Since the constraint of the optimization problem (cf. Eq. 1) has to be fulfilled, an accurate sensitivity estimation has to be performed iteratively.
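The moment estimation can be sketched numerically. The p_{n,l} values below are invented; the point is only that the sub-cluster failure probabilities of each cluster feed the empirical mean vector and covariance matrix of the multidimensional Gaussian:

```python
import numpy as np

# Sketch of the moment estimation: the sub-cluster failure
# probabilities p_{n,l} are treated as realizations of X_n; the
# multivariate Gaussian is parameterized by their empirical mean
# vector and covariance matrix. Values are invented.
p = np.array([                 # rows: clusters n, cols: sub-clusters l
    [0.0, 1 / 3, 0.2, 0.5],    # realizations of X_1
    [0.1, 0.4,   0.2, 0.6],    # realizations of X_2
])

mu = p.mean(axis=1)            # mean vector (E[X_1], E[X_2])
cov = np.cov(p)                # 2x2 covariance matrix of (X_1, X_2)

print(mu, cov[0, 1])           # positive covariance: correlated clusters
```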

Sensitivity Estimation
The formula for calculating the classifier's false-negative selection probability is given in Eq. 5.
However, p̂_FN = N̂_FN / (N̂_FN + N_TP) has to be estimated, since the number of mistakenly deselected failing test cases N_FN is unknown and is thus estimated by N̂_FN. The number of already detected failing test cases is given by N_TP. Before a decision can be taken on whether a test case t_i can be deselected, the currently allowed risk of taking a wrong decision has to be estimated in advance. The estimate N̂_FN has to be adjusted by the risk term x. Given that p̂_FN ≤ p_FN,MAX is required, the maximum allowed false-negative probability for deselecting the next test case t_i is given in Eq. 6.
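One plausible reading of this budget, sketched below: before deselecting t_i, solve the constraint (N̂_FN + x) / (N̂_FN + x + N_TP) ≤ p_FN,MAX for the additional risk x. This closed form is our own reconstruction of the idea, not necessarily the paper's Eq. 6:

```python
# Hedged sketch of the running false-negative budget: how much
# additional false-negative risk x keeps the estimate
# p_FN = N_FN / (N_FN + N_TP) below the limit. The closed form is
# our reading of the constraint, not the paper's exact Eq. 6.

def max_allowed_risk(n_fn_hat, n_tp, p_max):
    # solve (n_fn_hat + x) / (n_fn_hat + x + n_tp) <= p_max for x
    return p_max * n_tp / (1.0 - p_max) - n_fn_hat

x = max_allowed_risk(n_fn_hat=0.5, n_tp=99, p_max=0.01)
print(round(x, 4))   # -> 0.5
```

Plugging the result back in, (0.5 + 0.5) / (0.5 + 0.5 + 99) = 0.01, so the limit is met exactly.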

Univariate Stochastic Model
We model the Bayesian network consisting of the random variables X, H and Ĥ in Fig. 3. In the univariate stochastic model, the focus is on modeling only one failure probability distribution. Thus the random variable X stands for the previously defined X_1, and its realization x is given by p_{1,l}, where l is the index of the sub-cluster C_{1,l} that fulfills t_i ∈ C_{1,l}. H and Ĥ are binary random variables for modeling test case states and classifier decisions. As the state of a test case t_i is a priori unknown, it needs to be modeled by a corresponding random variable. According to the realization of X, a pass or a fail of the corresponding test case t_i, whose failure probability distribution is modeled by X, is expected. Finally, Ĥ takes a decision for t_i based on its failure probability x:

Figure 3: Bayesian network consisting of the random variables X, H and Ĥ.
According to Ĥ's selection rule, a very simple hyperplane y(x) = x − p_TH is derived, where in the case of y(x) ≥ 0 a selection decision is taken. A particularly important factor is the definition of the threshold probability p_TH, as its setting determines the classifier's sensitivity and specificity. The common ways of estimating false-negative and false-positive probabilities are given in Eq. 9 and 10, respectively.
However, we can only estimate the probability density function (pdf) p(X), in contrast to the conditional probability density functions p(X|H_0) and p(X|H_1). The reason for this is that pdf estimations are based on mean calculations of test case evaluations; passing and failing test cases are both considered in calculating average failure probabilities. Thus, p(X) is a distribution over failure probabilities of passing as well as failing test cases. Accordingly, p(X|H_0) and p(X|H_1) cannot be estimated, and in conclusion, p̂_FN and p̂_FP cannot be estimated as in Eq. 9 and 10. The threshold probability p_TH is calculated based on the estimation of p̂_FN, as the relation in Eq. 11 holds. As Eq. 11 cannot be directly estimated, the relation in Eq. 12 is used for estimating p̂_FN and finally p_TH. p̂_FN is estimated in Eq. 12 according to the assumption that the quantiles of p(X|H_1) are larger than the quantiles of p(X). By solving Eq. 12, the threshold probability is computed as given in Eq. 13, with µ = E[X] and σ² = VAR(X). As a result, the classifier's sensitivity is larger than 1 − p_FN,Limit, since its false-negative selection probability is smaller than p_FN,Limit. Finally, the decision regions of the linear classifier are defined (cf. Eq. 7 and 8) by determining p_TH. Furthermore, the maximization of the classifier's specificity is required by the definition of the constrained optimization problem (cf. Eq. 1). Accordingly, the classifier's false-positive selection probability is estimated as given in Eq. 14 and shown in Fig. 4. However, p(X|H_0) is not given, and thus p̂_FP cannot reasonably be estimated. Furthermore, p̂_FP cannot reasonably be minimized, as no optimization parameter is defined; consequently, we need a so-called multivariate stochastic model to do this. First, the idea of minimizing p̂_FP and hence gaining testing efficiency by considering several distribution functions is explained.
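Under the quantile-domination assumption, requiring P(X < p_TH) ≤ p_FN,Limit for a Gaussian p(X) reduces the threshold to a Gaussian quantile. The closed form below is our reading of Eq. 13, with illustrative parameter values:

```python
from statistics import NormalDist

# Hedged sketch of the threshold computation: with p(X) ~ N(mu, sigma^2)
# and the assumption that the quantiles of p(X|H1) dominate those of
# p(X), requiring P(X < p_TH) <= p_limit yields a Gaussian quantile.
# This is our reading of Eq. 13, not a verbatim reproduction.

def threshold(mu, sigma, p_limit):
    # inv_cdf of a small p_limit is negative: p_TH lies below the mean
    return mu + sigma * NormalDist().inv_cdf(p_limit)

p_th = threshold(mu=0.3, sigma=0.1, p_limit=0.01)
print(p_th < 0.3)
```

Because p_TH lies well below the mean, almost every test case is selected: false-negatives stay rare, but the classifier is unspecific, which is exactly the shortcoming the multivariate model addresses.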

Preliminaries
Let us assume that two dependent Gaussian random variables X and X′ are given. The focus is again on estimating p̂_FN and p̂_FP. Fig. 5 shows the probability distribution functions p(X), p(X|H_0) and p(X|H_1) as in Fig. 4. Additionally, the a-posteriori failure probability distribution function p(X|X′) is shown. The main idea is to use several observations of distinct dependent random variables to achieve a considerably more representative a-posteriori failure probability distribution function that is relatively narrow within a certain range. So p(X|X′) is considered the more representative distribution for the failure probabilities, and hence p̂_FN and p_TH are estimated by using this distribution function. Comparing Figs. 4 and 5, it can easily be seen that p̂_FP is basically minimized, since the risk of a false-negative selection is computed based on p(X|X′), which allows a more representative risk estimation.
All in all, by considering a set of dependent Gaussian random variables and by using the information about their observations, a more representative a-posteriori failure probability distribution function is achieved, which allows a more precise risk estimation. Accordingly, the probability of a false-positive selection can be minimized. As a result, a multivariate stochastic model is created to exploit the dependency information between random variables and thus finally achieve testing efficiency.
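The narrowing effect is standard bivariate-Gaussian conditioning; the parameter values below are invented for illustration:

```python
from math import sqrt

# Standard bivariate-Gaussian conditioning, illustrating why the
# a-posteriori distribution p(X|X') is narrower: observing a
# correlated X' = x2 shifts the mean and shrinks the standard
# deviation by sqrt(1 - rho^2). Parameter values are invented.
mu1, mu2 = 0.3, 0.4
s1, s2 = 0.10, 0.12
rho = 0.8                      # correlation between X and X'

x2 = 0.6                       # observation of X'
mu_post = mu1 + rho * s1 / s2 * (x2 - mu2)   # conditional mean
s_post = s1 * sqrt(1 - rho**2)               # conditional std. dev.

print(round(mu_post, 4), round(s_post, 4))
```

The stronger the dependency (larger |rho|), the narrower p(X|X′) becomes, which is exactly what makes the risk estimate more representative.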

Multivariate Stochastic Model
By using the dependency between the random variables X_n, 1 ≤ n ≤ N, a considerably more accurate estimation of p̂_FN is achieved, and hence p̂_FP is minimized. Fig. 6 shows the modeled Bayesian network consisting of the random variables X_n, 1 ≤ n ≤ N, H and Ĥ. The focus is on taking a selection decision for an arbitrary test case t_i. X_N now models the failure probability distribution of t_i. In our previous example, as shown in Fig. 1 c), the failure probability distribution of test case t_4 was calculated based on the empirically evaluated failure probabilities of the test cases inside Fig. 1 c). Thus t_4 was an element of C_2 (N = 2), and its failure probability was modeled by X_2. Analogously, X_1 was defined by the empirically evaluated failure probabilities of the test cases inside Fig. 1 b). In the interest of simplification, we always assume that the currently considered test case t_i is an element of cluster C_N and thus that X_N models its failure probability distribution.
Furthermore, we can calculate the dependency among X_n, 1 ≤ n ≤ N. However, the Bayesian network in Fig. 6 also models the statistical dependency between H and the further random variables X_n, 1 ≤ n ≤ N − 1. These dependencies cannot be calculated but have to be modeled for estimating p̂_FN and p̂_FP.
First of all, we model the classifier Ĥ(·) on the basis of the maximum likelihood estimate x_ML. Accordingly, the likelihood estimate is a weighted sum, as given in Eq. 17, with weights w_n, 1 ≤ n ≤ N − 1, as given in Eq. 18.
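The shape of such a weighted-sum estimate can be sketched as follows. The raw weight values are invented; the paper derives the actual w_n in Eq. 18, which we do not reproduce here:

```python
# Hedged sketch of the likelihood estimate of Eq. 17: the observed
# failure probabilities x_n of the correlated clusters are combined
# into a weighted sum x_ML. The weights below are illustrative
# placeholders, not the paper's Eq. 18.

def x_ml(observations, raw_weights):
    total = sum(raw_weights)
    weights = [w / total for w in raw_weights]      # normalize to 1
    return sum(w * x for w, x in zip(weights, observations))

x = x_ml(observations=[0.2, 0.5], raw_weights=[3.0, 1.0])
print(x)   # lies closer to the strongly weighted observation 0.2
```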
Further, p_TH has to be calculated based on a precise estimation of p̂_FN. Thus we derive a calculation formula for p̂_FN for the case N = 2, but we will also provide a general calculation formula for p̂_FN for an arbitrary number N of random variables.

Derivation of probability distribution functions
In the following, some probability distributions are derived that are used for estimating p̂_FN. First of all, the joint pdf

p(Ĥ, X_1, H, X_2) = p(Ĥ|X_1) p(X_1|H) p(H|X_2) p(X_2)    (19)

and the conditional pdf p(Ĥ, X_1, H | X_2) are given in Eq. 19 and 20, respectively.
Thus, the probability P(Ĥ = H_0 | H_1) can be estimated by using the relation in Eq. 22, as given in Eq. 23.
Since p̂_FN cannot be directly estimated, as the conditional pdf p(Ĥ|H) is not available for computing P(Ĥ = H_0 | H_1), the relation in Eq. 23 is used for estimating an upper bound for p̂_FN. The linear classifier's actual false-negative probability is therefore smaller than the calculated upper bound.
Since the constraint in Eq. 24 has to be fulfilled, we solve the inequality in Eq. 25.
As x_2 = P(H_1 | X_2 = x_2) holds, the following inequality is finally solved.
Eq. 26 is derived for the case N = 2, but in the general case, where the number of random variables X_n, 1 ≤ n ≤ N, is given by an arbitrary N, the following inequality has to be solved.
By solving Eq. 27, the threshold probability is obtained together with the corresponding weights. On this basis, the differential mutual information is defined in Eq. 31.

Conditional Independence
We have already motivated and introduced the dependent random variables X_n, 1 ≤ n ≤ N. Test case failure probabilities are correlated, since test cases are executed on the same system, and thus they show dependent behavior. The random variables X_n, 1 ≤ n ≤ N, are nevertheless conditionally independent. This means that the information gained from a test case evaluation dominates, such that a test case's originally calculated failure probability becomes irrelevant once its state has been observed. Accordingly, the dependency among failure probabilities vanishes after observation of the test case evaluations: a failure of a test case t_m is then expected based on the information about the evaluation of another test case t_n, and no longer on t_n's originally calculated failure probability. Thus the remaining random variables X_n, 1 ≤ n ≤ N − 1, become independent of the random variable X_N after observation of H's realization (cf. Fig. 6).
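The effect can be illustrated with a minimal discrete stand-in for the network of Fig. 6: a binary hypothesis H with two children X_1 and X_2. All probability values below are hypothetical and chosen only for illustration; the paper's variables are continuous.

```python
# Hypothetical binary Bayesian network H -> X1, H -> X2, standing in
# for the failure indicators of Fig. 6 (all numbers are illustrative).
P_H = {0: 0.9, 1: 0.1}                     # H0: system passes, H1: fails
P_X_given_H = {0: {0: 0.95, 1: 0.05},      # P(X = x | H = h)
               1: {0: 0.30, 1: 0.70}}

def joint(x1, x2):
    """P(X1 = x1, X2 = x2), marginalized over H."""
    return sum(P_H[h] * P_X_given_H[h][x1] * P_X_given_H[h][x2]
               for h in P_H)

def marginal(x):
    return sum(P_H[h] * P_X_given_H[h][x] for h in P_H)

# Marginally, X1 and X2 are dependent: the joint does not factorize.
print(joint(1, 1))                  # P(X1 = 1, X2 = 1)
print(marginal(1) * marginal(1))    # P(X1 = 1) * P(X2 = 1): differs

# Once H is observed, the dependency vanishes: by the network's
# factorization, P(X1, X2 | H = h) = P(X1 | h) * P(X2 | h) for each h,
# so a test case's originally calculated failure probability becomes
# irrelevant after the corresponding evaluation has been observed.
```

The printed marginal joint (about 0.051) clearly exceeds the product of the marginals (about 0.013), while conditioning on H restores independence, mirroring the argument above.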

Specificity Estimation
The specificity is given by the term 1 − p̂_FP. As extensive mathematical derivations are needed to obtain a calculation formula for p̂_FP, the derivation steps are given in the appendix; in what follows, only the result is given.

Theorem 1 (False-Positive Probability Estimation). p̂_FP is estimated as given in Eq. 32.
For the case N = 2, Eq. 32 can be simplified; after several calculation steps, Eq. 54 results. Fig. 7 shows five plots of p̂_FP, in sub-figures (a) through (e), for different values of the displacement ∆ = µ_n − µ_{n,H_0}. The actual value of ∆ is unknown; however, the focus is on the minimization of p̂_FP, and p̂_FP decreases in each sub-figure of Fig. 7. The actual value of ∆ only determines how fast p̂_FP decreases. Hence we can solve the optimization problem (cf. Eq. 1) by minimizing p̂_FP. We considered two random variables X_1 and X_2, as in our example in Sec. 4 where we created two clusters. A very important factor here is the underlying strategy for clustering test cases: as the distribution of the random variables X_1 and X_2 is directly related to the clustering strategy, the main focus is on the maximization of the differential mutual information I(X_1; X_2). Accordingly, I is an optimization parameter for effectively reducing p̂_FP.

Optimization
The first strategy is to optimize the feature selection. Optimal features are learned in an unsupervised learning session where an evolutionary optimization framework is applied to search for optimal features. The next strategy is to improve the labeling of test cases through an active learning strategy.

Evolutionary Optimization
Clustering (and sub-clustering) of test cases is performed based on features. Therefore, different clusterings are possible for different selections of feature subsets (Φ_m, Φ_s). Accordingly, a different statistical model is obtained, as it reflects the failure frequencies in the clusters. Furthermore, the differential mutual information (cf. Eq. 31) depends on the statistical dependencies and thus changes for different clusterings. Sec. 6 proposed a calculation formula for the weights w_n, 0 ≤ n ≤ N, of a linear classifier. However, those formulas still depend on the differential mutual information I. A desired sensitivity has to be guaranteed, and thus the hyperplane is adjusted according to the value of I. It can be shown that for small values of I, the position of the hyperplane still guarantees the desired sensitivity, but the false-positive selection probability increases. To minimize the false-positive selection probability, the differential mutual information therefore has to be maximized; this is the final strategy for solving the constrained optimization problem (cf. Eq. 1).
First, clustering depends on the history-tuples of test cases; for example, the length of the history-tuples determines the maximum number |S|_K of clusters. Second, feature selection is optimized. In summary, K (the number of considered previous regressions) is an optimization parameter, and b_s (coding selected and main features) is an optimization matrix. This is, however, a large-scale, high-dimensional optimization problem, as there exist many possible settings for K and b_s. [3] and [4] suggest that such high-dimensional optimization problems can be solved in reasonable time by using evolutionary algorithms. Accordingly, an evolutionary optimization framework is applied for solving the optimization problem at hand. As each setting of K and b_s is one possible solution for clustering test cases, which in turn is the basis for the derivation of a stochastic model, the fitness of a solution can be evaluated by calculating the extracted information I in Eq. 31. The parameter and matrix setting with the best fitness survives and is returned by the evolutionary optimization algorithm.

Fig. 8 shows the overall flow chart of the evolutionary optimization framework. First of all, a new population consisting of several genotypes is initialized. Each genotype stands for a possible setting of K and b_s. In the next step, the corresponding phenotypes of the genotypes are derived; hence each phenotype encodes a stochastic model. The population is then evaluated, wherein the fitness of each phenotype is calculated. A bad fitness is also possible due to bad statistical properties of the underlying stochastic model: statistical calculations based on the stochastic model that a phenotype encodes cannot guarantee the desired statistical confidence bounds. This will be explained in more detail in Sec. 8. Phenotypes with bad fitness cannot survive and hence are eliminated.
Accordingly, the remaining genotypes (phenotypes) are stochastically selected, and new genotypes are successively generated by crossover and mutation operations. After a certain number of iterations, the phenotype with the best fitness is selected and used in the selection algorithm. However, if the population is empty because all phenotypes were of bad fitness, the training mode is activated, in which test cases are still executed without running the selection algorithm.
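The loop of Fig. 8 can be sketched compactly. The encoding below (K as an integer, b_s as a bit vector), the operators, and the toy fitness function are all illustrative assumptions; the paper's actual genotypes, operators, and fitness I (Eq. 31) are more involved:

```python
import random

def evolve(fitness, n_features, k_max, pop_size=20, generations=30,
           mutation_rate=0.1, seed=42):
    """Minimal evolutionary search over (K, b_s): K is the history
    length, b_s a feature-selection bit vector. `fitness` stands in
    for the extracted information I of Eq. 31; encoding and operators
    are illustrative, not the paper's exact framework."""
    rng = random.Random(seed)

    def random_genotype():
        return (rng.randint(1, k_max),
                tuple(rng.randint(0, 1) for _ in range(n_features)))

    def mutate(g):
        k, bits = g
        if rng.random() < mutation_rate:
            k = rng.randint(1, k_max)
        bits = tuple(b ^ 1 if rng.random() < mutation_rate else b
                     for b in bits)
        return (k, bits)

    def crossover(a, b):
        cut = rng.randint(0, n_features)
        return (a[0], a[1][:cut] + b[1][cut:])

    population = [random_genotype() for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the fitter half; phenotypes with bad fitness are eliminated
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]
        children = [mutate(crossover(rng.choice(survivors),
                                     rng.choice(survivors)))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)

# Toy fitness: prefer K = 4 and the first three features selected.
def toy_fitness(g):
    k, bits = g
    return -abs(k - 4) + sum(bits[:3]) - 0.5 * sum(bits[3:])

best_k, best_bits = evolve(toy_fitness, n_features=6, k_max=10)
print(best_k, best_bits)
```

Truncation selection is used here for brevity; the paper's framework uses stochastic selection, and a real fitness evaluation would derive a stochastic model from each phenotype before computing I.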

Active Learning
The classifier's decisions can be conducted as hard decisions or as soft decisions. Once taken, hard decisions are never changed later on, in contrast to soft decisions. The test efficiency can be significantly increased by conducting soft decisions instead of hard decisions.

Hard Decision
[1] conducts hard decisions, since a selected test case is automatically executed and a once-deselected test case is never selected again in the current regression test. In the following, the disadvantage of conducting hard decisions is explained in detail in relation to classifier decisions.
The linear classifier's decision depends on the current estimation of N̂_FN, as it calculates the allowed residual risk p_FN,Limit (cf. Eq. 6) of potentially taking a wrong decision. N̂_FN returns the number of supposedly unrevealed system failures that would be detected by those already deselected test cases that are elements of the deselected set T̄_Exec. Accordingly, the linear classifier's decision depends on the decisions it has already taken (T̄_Exec), and hence it is memory driven.
Each deselected test case t_j has an individual additional contribution N̂_FN,j (cf. Eq. 36) to the overall estimation N̂_FN, such that the relation in Eq. 35 holds.
N̂_FN,j is the product of t_j's failure probability x and the false-negative probability P(Ĥ = H_0 | H_1) incurred by deselecting t_j, as given in Eq. 36.
Because of this, the deselection of an arbitrary test case can cause the residual risk p_FN,Limit to reach zero as N̂_FN increases (cf. Eq. 6). This means that no more risk is allowed (p_FN,Limit = 0), and all remaining test cases consequently have to be selected.
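The bookkeeping can be sketched as follows. The budget rule used here (a simple cap on the accumulated risk) is a hypothetical stand-in for the p_FN,Limit calculation of Eq. 6, and all names and numbers are illustrative:

```python
def simulate_deselection_budget(test_cases, p_fn_hat, p_fn_max):
    """Hypothetical sketch of the risk bookkeeping: each deselected
    test case t_j contributes N_FN_j = x_j * p_fn_hat (cf. Eq. 36) to
    the running estimate N_FN; once the residual budget is used up,
    every remaining test case must be selected and executed."""
    n_fn = 0.0
    decisions = []
    for name, x in test_cases:                 # x = failure probability
        contribution = x * p_fn_hat
        # Illustrative budget rule, standing in for p_FN,Limit (Eq. 6)
        if n_fn + contribution <= p_fn_max * len(test_cases):
            n_fn += contribution
            decisions.append((name, "deselect"))
        else:
            decisions.append((name, "select"))
    return decisions, n_fn

cases = [("t1", 0.01), ("t2", 0.02), ("t3", 0.40), ("t4", 0.04)]
decisions, risk = simulate_deselection_budget(cases, p_fn_hat=0.5,
                                              p_fn_max=0.01)
print(decisions, risk)
```

Note how the risky test case t3 alone exhausts the budget and is forced into selection, while the low-risk test cases remain deselected.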
Indeed, selecting test cases even when their deselection would be allowed according to the risk calculations is sometimes the better choice. This is the case when p_FN,Limit is zero: it can be significantly increased by selecting and executing an already deselected test case in order to eliminate its risk. When this is done, N̂_FN decreases, hence p_FN,Limit increases, and a residual risk for further deselections is regained.
The amount ∆N̂_FN by which N̂_FN can be decreased by selecting an arbitrary test case is significant here. If, later on, more than one test case can be deselected, and these deselected test cases together add the same amount ∆N̂_FN of expected unrevealed system failures to N̂_FN, this is in fact a gain in terms of reducing the regression test effort. The strategy is therefore to primarily deselect those test cases with lower failure probabilities in order to increase testing efficiency.
As a result, the regression test efficiency can be increased. Therefore, the proposed selection method [1] is extended by a soft decision methodology: each decision for deselecting a test case is now regarded as a soft decision that might be changed later. (Note that the other way round is impossible, since an already selected test case is automatically executed on the system under test, and hence deselecting it later makes no sense.) Fig. 9 shows the logic for managing soft selection decisions. Let us assume that t_i is the next test case analyzed by the linear classifier. If t_i is deselected, it is queued into a priority queue, whereby its priority is calculated as given in Eq. 37.
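The queue logic of Fig. 9 can be sketched with a binary heap. Here the priority is taken to be the failure probability itself, which is an assumption for illustration; the paper's prio(·) is defined in Eq. 37 and may differ:

```python
import heapq

class SoftDecisionQueue:
    """Sketch of the soft-decision logic of Fig. 9 using a max-priority
    queue (heapq is a min-heap, so priorities are negated). prio() here
    is simply the failure probability; the paper's Eq. 37 may differ."""

    def __init__(self):
        self._heap = []

    def deselect(self, test_case, priority):
        heapq.heappush(self._heap, (-priority, test_case))

    def peek(self):
        """Most probably failing deselected test case, without removal."""
        if not self._heap:
            return None, 0.0
        neg_prio, tc = self._heap[0]
        return tc, -neg_prio

    def resolve_selection(self, test_case, priority):
        """A test case t_i was selected; execute whichever of t_i and
        the riskiest deselected test case t_j has the higher priority,
        and keep the other one queued."""
        t_j, prio_j = self.peek()
        if t_j is not None and prio_j > priority:
            heapq.heappop(self._heap)            # t_j leaves the queue
            self.deselect(test_case, priority)   # t_i waits instead
            return t_j
        return test_case

q = SoftDecisionQueue()
q.deselect("t1", 0.02)
q.deselect("t2", 0.30)   # most probably failing deselected test case
print(q.resolve_selection("t3", 0.10))   # t2 outranks t3: execute t2
print(q.resolve_selection("t4", 0.50))   # t4 outranks all queued cases
```

Both peek and push are O(log n), so revisiting earlier soft decisions adds little overhead per classifier decision.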

Soft Decision
In the other case, if t_i is selected, the test cases deselected up to this point are analyzed with the aim of improving the trade-off between the assumed risk and the total number of deselected test cases. The most probably failing deselected test case t_j is obtained by the peek operation on the priority queue. The priorities of t_i and t_j are compared, and the test case with the higher priority is selected and executed on the system under test. If t_j is executed, it is removed from the set of deselected test cases (T̄_Exec ← T̄_Exec \ t_j) and added to the set of executed test cases (T_Exec ← T_Exec ∪ t_j). Furthermore, t_j's state is evaluated, eval(t_j), and the empirical failure probabilities of the test cases are updated in Algorithm 1. Since the calculated failure probabilities are averages of test case evaluations, the failure probabilities of those sub-clusters (see Eq. 4) that contain t_j have to be updated. Accordingly, the failure probabilities of all t_k ∈ T̄_Exec are updated in Algorithm 1.

Algorithm 1 Test case selection algorithm
procedure update_statistics(T̄_Exec, t_j)    ▷ T̄_Exec contains already deselected test cases; t_j is executed
    for each test case t_k ∈ T̄_Exec do
        ∃! C_{n,l} such that t_k ∈ C_{n,l}    ▷ find the sub-cluster of t_k and thus determine n and l
        if t_j ∈ C_{n,l} then
            p_{n,l} ← (1 / |C_{n,l}|) · Σ_{t_i ∈ C_{n,l}} eval(t_i)    ▷ see Eq. 4
            P(H_1) ← p_{n,l}
            update t_k's priority: prio(t_k)    ▷ see Eq. 37
        end if
    end for
    if eval(t_j) == 1 then
        N_TP ← N_TP + 1
    end if
end procedure

The important point is that even the failure probability of t_i is computed again. In most cases, t_i would again be deselected. Nevertheless, it is possible that the execution of t_j has failed, such that a further system failure has been found. In such a case, even t_i's failure probability may have increased, so its deselection has to be checked again by the linear classifier.
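An executable rendering of Algorithm 1 could look as follows. The data structures (dicts mapping test cases to clusters, evaluations, and priorities) are assumptions made for illustration; the paper does not prescribe a concrete representation:

```python
def update_statistics(deselected, t_j, clusters, evaluations, priorities,
                      counters):
    """Executable sketch of Algorithm 1. `clusters` maps a test case to
    its unique sub-cluster C_{n,l} (a list of test cases); `evaluations`
    maps an executed test case to 0 (pass) or 1 (fail). All container
    names are illustrative assumptions."""
    for t_k in deselected:
        c = clusters[t_k]                  # the unique C_{n,l} containing t_k
        if t_j in c:
            # Empirical failure probability of the sub-cluster (Eq. 4);
            # unexecuted members count as 0.
            p_nl = sum(evaluations.get(t, 0) for t in c) / len(c)
            priorities[t_k] = p_nl         # update prio(t_k), cf. Eq. 37
    if evaluations.get(t_j) == 1:          # t_j failed: one more true positive
        counters["N_TP"] += 1

cluster = ["t1", "t2", "t3"]
clusters = {t: cluster for t in cluster}
evaluations = {"t2": 1}                    # t2 was executed and failed
priorities, counters = {}, {"N_TP": 0}
update_statistics(["t1", "t3"], "t2", clusters, evaluations, priorities,
                  counters)
print(priorities, counters)
```

After t2's failure, the priorities of the still-deselected t1 and t3 rise to the updated sub-cluster failure rate, so the linear classifier will reconsider their deselection.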
All in all, testing efficiency can be significantly increased by performing soft selection decisions. The performance of both selection strategies (hard decision and soft decision) will be compared in Sec. 9.

Learning Phase
The learning phase is of essential importance, since during this phase the system reliability is actually learned. Test case selection is a safety-critical binary classification task, as system failures may remain undetected; hence, a corresponding quality measure for wrong decisions is required. Accordingly, risk estimations of potentially undetected system failures due to the deselection of test cases have to be as accurate as possible. The more the system is learned during a regression test, the more precise the risk estimations become. However, learning a system in terms of understanding its reliability is a costly process, as it requires test cases to be executed. The fundamentally important research question is how much training data is enough for safely selecting test cases with a desired sensitivity.

Statistical Sensitivity Estimation
We have already required a specific sensitivity in the constrained optimization problem (cf. Eq. 1). Accordingly, we define the confidence level in Eq. 38, which is derived from the constraint of Eq. 1.
Ψ is an estimator for the number of false-negatives, N̂_FN = Σ_{t_i ∈ T̄_Exec} N̂_FN,i, with a bound γ. Ψ = Σ_i ψ_i is composed of several random variables ψ_i, each standing for the distribution of one N̂_FN,i. The distribution of ψ_i is complex, since the individual contribution of a deselected test case t_i is given by N̂_FN,i = x_N · p̂_FN, where x_N is t_i's failure probability and p̂_FN is the corresponding estimated false-negative probability. The following theorem, already proved in [1], gives the formula for the false-negative probability estimation.
Theorem 2 (False-Negative Probability). For a given p_th, the false-negative probability P(Ĥ = H_0 | H_1) can be calculated in terms of x_N, the failure probability of a test case, and µ_N, σ_N, the parameters of the probability distribution function N(µ_N, σ_N).

We therefore choose the following approach to solving Eq. 38. We simplify the definition of ψ_i to ψ_i = p̂_FN · X_N, where p̂_FN is assumed to be a constant value without any distribution. This step simplifies the calculation complexity of Eq. 38 significantly, as Ψ becomes simply a weighted sum of Gaussian random variables. However, the variance of p̂_FN is of course relevant and should not simply be neglected. Accordingly, we require a maximum confidence interval width for p̂_FN, such that the estimated false-negative probability is sufficiently accurate and hence can indeed be treated as a constant value without statistical deviation. We calculate the confidence interval [p̂_FN^(l); p̂_FN^(u)]. The Wilson score interval [22] delivers confidence bounds for binomial proportions. Therefore, we calculate the confidence intervals [x_n^(l); x_n^(u)].
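The Wilson score interval [22] has a simple closed form. The sketch below implements the standard formula; the default z-value corresponds to a 99% confidence level, matching the criterion used in the next subsection:

```python
import math

def wilson_interval(successes, trials, z=2.5758):
    """Wilson score interval for a binomial proportion; the default z
    corresponds to a two-sided 99% confidence level."""
    if trials == 0:
        return 0.0, 1.0
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / trials
                                   + z * z / (4 * trials * trials))
    return center - half, center + half

# Observed failure rate of a sub-cluster: 3 failures in 40 executions
lo, hi = wilson_interval(3, 40)
print(lo, hi)   # the interval narrows as more executions accumulate
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves well for the small failure counts typical of regression testing.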

Criteria for Training
In order to guarantee a statistical bound on the sensitivity at a 99% confidence level, the following conditions have to be checked.
If both conditions are fulfilled, the risk calculations in the selection algorithm are sufficiently accurate, and selection decisions can be performed. However, if one condition is not fulfilled, the training mode remains active, such that test cases still have to be executed.

Industrial Case Study
At a German premium car manufacturer, each regression test constitutes a system release test, and thus the system test takes up to several weeks according to [5]. However, a first detected system failure makes a system release impossible, so optimizing the current regression for high efficiency in reducing the regression effort becomes justified.
Close to the so-called start of production (SOP) of a vehicle, many electronic control units (ECUs) often have only a few critical spots, and thus each regression test is expected to be a system release test. Since many test cases pass, a lot of time is spent observing passing test cases. Therefore, reducing the number of executed passing test cases once a system failure is detected (and a system release is hence no longer possible), and reserving the limited testing time for fault-revealing test cases, decreases the regression test effort significantly. In any case, a final regression test will succeed after further system updates have been conducted; this constitutes the final release test that meets the high quality standards of [5].
In our industrial case study, we applied our selection method to a production-ready controller that implements complex networked functionalities for the protection of passengers and other road users. Its test effort is therefore immense, and hence we apply our regression test selection method to accelerate its testing phase. Fig. 10 shows the right-hand side of the well-known V-Model (see [23]); the focus of our case study is on system testing.
A hardware-in-the-loop (HiL) simulator [24] is used for validating an ECU's networked complex functionalities as well as its I/O interaction and its robustness during voltage drops, as it provides an effective platform for testing complex real-time embedded systems.

Figure 10: A HiL simulator is used for performing the system test.

Further, we selected the following test case features for training the machine learning algorithm: • the name of the verified system parts and • the name of the function whose reliability is assured. Since the quality of our selection decisions is hedged on a stochastic level, a statistical deviation of the false-positive probabilities can occur during different runs of our selection method. Therefore, we conducted several independent runs of a regression test, where we set p_FN,MAX = 1%. The boxplots and the quantiles of the false-positive probabilities are given in Fig. 11 and Table 4, respectively. Fig. 11 shows the overall boxplots of the false-positive probabilities achieved during the regression test replications. To compare the hard with the soft decision strategy, we performed distinct regression test replications with the parameter for 'soft decision' disabled and enabled, respectively.
It can be seen from Fig. 11a) and Fig. 11b) that the average false-positive probability is about 74% for hard decisions and 23% for soft decisions. As already mentioned, conducting hard decisions does not allow for global optimization of the trade-off between the already assumed risk and the corresponding total number of deselected test cases. Global optimization requires analyzing all test cases deselected thus far over and over again and, if necessary, selecting an already deselected test case. Test cases with a higher failure probability should thus be reconsidered for selection in an ongoing regression test in order to potentially deselect further, less risky test cases. As a result, the regression test effort can be reduced considerably more by applying soft decisions.
Furthermore, the condition in Eq. 40 on the false-negative probability p_FN, or equivalently on the number N_FN of actually occurring false-negatives, was fulfilled in all conducted regression test replications.
Our implemented test case selection algorithm runs on a desktop CPU specified in Table 5. We conducted a multithreaded execution of the evolutionary algorithm such that the fitness of all phenotypes in a population is computed in parallel (32 threads in total). The average CPU load is approximately 95%, and the maximum memory allocation is about 4 GB. The mean analysis time for deciding whether a test case should be selected is 0.9 s.

Conclusion and Future Work
We proposed a holistic optimization framework for the safety assessment of systems during regression testing. To this end, we designed a linear classifier for (de-)selecting test cases according to a risk-based classification. We defined an optimization problem in which the classifier's specificity has to be maximized while its sensitivity still has to exceed a certain threshold 1 − p_FN,MAX. Accordingly, we developed a novel method for determining the weights of a linear classifier that solves this optimization problem. We have theoretically shown that the classifier's performance is directly interrelated with the successful selection of relevant test case features. Lastly, we applied our method to a production-ready controller and analyzed the overall regression test effort subject to an active learning strategy. We have demonstrated that significant savings can be achieved in the regression testing of safety-critical systems. As feature selection is a complex task, and the evolutionary optimization presumably finds only local optima, more thorough research in this field may allow further reductions of the classifier's false-positive selection probability.

Appendix
In the following, a detailed proof of Theorem 1 is given, relating to the proofs given in [1].

Proof of Theorem 1
Proof. According to [1], the maximum likelihood estimation x_{N,ML} (abbreviated x_ML in the following) is given in Eq. 41.
As ΦΛ⁻¹φᵀ = P_1Σ⁻¹P_2ᵀ and φΛ⁻¹φᵀ = P_2Σ⁻¹P_2ᵀ = Σ⁻¹ hold, x_ML simplifies accordingly. However, µ_{H_0} is an unknown parameter vector; additionally, the second-order moments are assumed to be invariant of the event H_0. We will calculate p̂_FP in dependency on the unknown vector µ_{H_0} and will qualitatively show that p̂_FP can be minimized independently of µ_{H_0} by an optimization strategy. In showing this, we demonstrate that the concept of our work is validated and mathematically proved. First of all, Eq. 44 is restated in Eq. 45 with an additional term.
Accordingly, p̂_FP is estimated as given in Eq. 47.
By substituting U := Σ_{n=1}^{N−1} w_n (X_n − µ_{n,H_0}) and V := w_N (X_N − µ_{N,H_0}), with µ_{n,H_0} = [µ_{H_0}]_n, the false-positive estimation is given as follows.
Based on the fact that H_0 is given, U and V are conditionally independent (see the explanation in Sec. 6 and Eq. 50, respectively).
It can easily be seen what the variance σ²_U of U equals; since the summands of U are conditionally independent given H_0, the variance is further simplified as given in Eq. 52.
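Assuming the standard simplification for independent Gaussians, the variance of a weighted sum U = Σ_n w_n (X_n − µ_n) is Σ_n w_n² σ_n². The following Monte Carlo sketch (weights and parameters are illustrative) checks this identity, which is plausibly the simplification behind Eq. 52:

```python
import numpy as np

# Monte Carlo check: for independent X_n ~ N(mu_n, sigma_n), the
# variance of U = sum_n w_n (X_n - mu_n) is sum_n w_n^2 sigma_n^2
# (the standard independence simplification; parameters illustrative).
rng = np.random.default_rng(1)
w = np.array([0.5, 1.5, -0.8])
sigma = np.array([1.0, 0.4, 2.0])
mu = np.array([0.2, -1.0, 3.0])

samples = rng.normal(mu, sigma, size=(200_000, 3))
u = ((samples - mu) * w).sum(axis=1)

analytic = float((w ** 2 * sigma ** 2).sum())
empirical = float(u.var())
print(analytic, empirical)   # the two agree up to sampling noise
```

The empirical variance matches the analytic value up to Monte Carlo error, supporting the conditional-independence argument used in the proof.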
Finally, the false-positive probability is estimated as given in Eq. 53. For the case N = 2, Eq. 53 can be simplified; after several calculation steps, Eq. 54 results.