Representation of Clinical Information in Outpatient Oncology for Prognosis Using Regression

Article Info
Article history: Received: 01 October 2016; Accepted: 19 October 2016; Online: 27 October 2016

Abstract
The determination of length of survival, or prognosis, is often viewed through statistical hazard models or with respect to a future reference time point in a classification approach (e.g., survival after 2 or 5 years). In this research, regression was used to determine a patient's prognosis. In addition, multiple behavioral representations of clinical data, including difference trends and splines, are considered as predictor variables, in contrast to the demographic and tumor characteristics often used. With this approach, the number of clinical samples taken from the available patient data was explored in conjunction with the behavioral representation. The models with the best prognostic performance had data representations that included limited clinical samples and some behavioral interpretations.


Introduction
This paper is an extension of work originally presented in 2016 at the IEEE International Conference on Electro Information Technology (EIT) [1]. It extends the prior work by focusing on the prediction of the length of survival through regression rather than classification techniques. The link between the representation of the patient clinical data and the regression methods for prognosis is explored. The results show that the data representation with the best prognostic performance may include limited clinical samples and also behavioral interpretations of the data.
The American Cancer Society estimates that 1,685,210 new cases of cancer will be diagnosed in 2016, with 1,630 individuals expected to lose their lives to cancer each day [2]. For those affected by cancer, an accurate prognosis of the length of survival is an important problem, which needs to be addressed in order to give patients and their families information about the effectiveness of treatments, end-of-life treatment, and/or palliative care.
Many factors may go into cancer prognosis prediction, including the type of cancer (some types of cancer are curable or go into long-term remission, and others have a low five-year survival rate), the severity of the cancer (stages), patient-specific history and condition (comorbidities, state of health, etc.), and treatments. For any given representation, different methods may be used to predict patient prognosis. Many of the techniques consider binary survival, providing information only on whether a patient will live to a certain point in time. Alternative prognosis methods include classification and regression, which provide more information on the length of survival.
For this work, the representation of clinical data from an outpatient oncology data set is considered for prognosis. The clinical data for the patients, consisting of multi-modal, non-uniform, time-limited data, is represented through samples taken at discrete time points and with two behavioral representations, difference trends and splines. The prognosis was predicted as length of survival (LOS) using linear and quadratic regression, Gaussian Process regression with a constant basis, and Support Vector Regression (SVR) using radial basis function and linear kernels. The predicted LOS was compared with the actual LOS for each patient to evaluate the prediction models (presented in terms of absolute and relative error).

ASTESJ ISSN: 2415-6698
Related work concerning approaches for oncology representation and prognosis is presented in Section 2. The methods for representing the clinical data and experimental design are then presented in Sections 3 and 4. Finally, the results of the regression analysis are presented in Section 5.

Background
Machine learning has played a role in many different aspects of oncology, including diagnosis, recurrence, prognosis, image analysis, malignancy, and staging of tumors [3]. In these methods, the data used can include gene expressions, radiographic images, tissue biopsy sample data, and predictors such as sex, age, cancer stage, tumor thickness, and cancer stage traits such as positive nodes [4]. Cancer tumor staging is a common tool in the data as it considers the size of the tumor, the involvement of lymph nodes, and whether the cancer has spread [5].
The clinical data observations can be treated as a time series. In this form, several methods are available for representing or transforming the data, e.g., Fourier analysis (DFT), wavelet analysis (DWT), and piecewise aggregate approximation (PAA) [6]. Temporal abstraction approaches, which describe a behavior over a period of time (e.g., weight increasing while hemoglobin is decreasing), have also been used to represent clinical data [7][8]. To address the multiple sampling frequencies and types of observations that occur, it is also possible to reduce the values for each observation type to a single value for each period [9].
The prediction of survival is often considered from a statistical standpoint with life tables [10] or with approaches like Kaplan-Meier or the Cox proportional hazard model [11]. These have the limitation of not providing information about the probability of death for an individual, only insight based on the population survival over time [12]. Other approaches have been extended to look at survival chances with respect to a point in time; however, they are limited to a single point, that is, whether a patient will survive up to time X, where the time points generally considered are 0.5, 1, 2, 3, and 5 years [13].
Diverse machine learning techniques have been used for predicting survival time, including support vector machines [14], Bayesian networks [15], k-nearest neighbor, and random forest [16]. In one study, 5-year survivability was predicted for patients with breast cancer, with a reported accuracy of 89-94% using neural networks, decision trees, and logistic regression [17]. Multi-class classification provides more insight into survival time than a binary classifier, with narrower windows of prognosis. Examples of multi-class approaches include an ensemble method with 400 support vector machines as binary classifiers [13] and neural networks with four classes [18].
Given the complexity of clinical data, classification can also be based on training that incorporates multiple experts. In this approach, temporal abstraction is used to simplify the data, and different algorithms, including majority rule and SVM, are used to create consensus classification models [19].

Methods
The data used in this study was provided by a private outpatient oncology practice and made available to the researchers by EMOL Health of Clawson, MI.

Data Collection and LOS Reference Points
For each patient, routine clinical and laboratory tests (weight, WT; albumin, ALB; and hemoglobin, HGB) and treatment administration dates (chemotherapy, blood transfusions, and two erythropoietins) were collected for two years. The amount and duration of data collected vary between patients depending on the number of visits and survival time. The age at time of death was confirmed with the Social Security Death Index.
Outpatient clinical data is problematic because of non-uniform sampling, e.g., the time between clinic visits or laboratory tests is not uniform. Additionally, the type of clinical information collected may vary between visits and between patients, e.g., different blood tests may be ordered during each visit, or not at all for a given patient. The non-uniformity can be observed in Figure 1, where each set of observations is for a different patient and presents a unique distribution of observations.
A prognosis is formed with respect to a reference time point. For example, predicting whether a patient has a LOS of two years requires establishing a reference point from which to count the two years. We establish three reference points, $t$, $t^*_1$, and $t^*_2$, as the basis of the LOS prediction. For each patient, the reference time point $t$ is set when the first type of observation ceases being measured (see Figure 1C). This point was selected to minimize extrapolation errors and the need to deal with missing data. To avoid bias ($t$ coincides with an observation), $t^*_1$ and $t^*_2$ are selected at random from a range about $t$: $t^*_1 \in [t-15, t+5]$, i.e., from 15 days further from death to 5 days closer to death, and $t^*_2 \in [t-28, t+14]$.
The reference points $t^*$ are used in forming the data representation. The evaluation of the LOS prediction is based on the reference points $t^*_1$ and $t^*_2$.
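The selection of the randomized reference points can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the fixed seed, the example day value, and the use of a uniform distribution over continuous days are all assumptions, since the text only states that the points are selected at random from the given ranges.

```python
import random


def draw_reference_points(t, rng):
    """Draw t*_1 and t*_2 at random around the reference day t.

    Days are counted so that larger values are closer to death:
    t - 15 is 15 days further from death, t + 5 is 5 days closer.
    A uniform draw is an assumption of this sketch.
    """
    t_star_1 = rng.uniform(t - 15, t + 5)
    t_star_2 = rng.uniform(t - 28, t + 14)
    return t_star_1, t_star_2


rng = random.Random(0)  # fixed seed so the draw is repeatable (assumption)
t_star_1, t_star_2 = draw_reference_points(400.0, rng)  # day 400 is hypothetical
```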

Data Representation
Three representations of the patient clinical observations are considered: clinical data sample values, difference trends, and splines. A fourth type of data representation, numeric occurrences, is used for the counts of medical treatments which the patient experiences. In the data set, these counts include blood transfusions, two different erythropoietins, and chemotherapy. The numeric occurrences (number of treatments) are expressed in native units prior to standardization.
A patient's clinical values are estimated at uniform intervals for ALB, HGB, and WT, at $t^*$ and then back at intervals of 7 or 14 days. An example is shown in Figure 1C, where vertical lines mark where the clinical data samples are to be estimated, at times $t^*$, $t^*-7$, and $t^*-14$ (a sample spacing of 7 days). Cubic splines were used to obtain values at sample times that fall between clinical observations; the splines were evaluated at the desired sample times to provide inputs for predicting LOS. These values are standardized as inputs to the model.
A difference trend (Diffs) describes the observed behavior as increasing, decreasing, or stable via a difference between values for ALB, HGB, and WT. Two versions are considered. In the first, one difference value (1 Diffs) is calculated between the values at $t^*$ and 90 days earlier, $t^*-90$ (note that the values may be predicted, as a sample may not have been collected at this exact time); see Figure 1B. Alternatively, two trends (2 Diffs) are found, one from $t^*$ back 45 days and one from that point back an additional 45 days (also shown in Figure 1C).
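As an illustration of estimating values at uniformly spaced sample times from irregular observations, the sketch below builds a cubic spline through the observations and evaluates it at the sample times. The natural end conditions (zero second derivative at both ends), the function names, and the HGB values are assumptions made for this example; the text does not specify the spline boundary conditions.

```python
from bisect import bisect_right


def sample_times(t_star, count, spacing):
    """Sample times t*, t* - spacing, t* - 2*spacing, ... (newest first)."""
    return [t_star - k * spacing for k in range(count)]


def natural_cubic_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys).

    xs must be strictly increasing. Natural end conditions are an assumption.
    """
    n = len(xs) - 1                                  # number of intervals
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)                              # second derivatives at knots
    if n >= 2:
        cp, dp = [], []                              # forward sweep (Thomas algorithm)
        for i in range(1, n):
            a, b, c = h[i - 1], 2.0 * (h[i - 1] + h[i]), h[i]
            d = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
            denom = b - a * cp[-1] if cp else b
            d_adj = d - a * dp[-1] if dp else d
            cp.append(c / denom)
            dp.append(d_adj / denom)
        for i in range(n - 1, 0, -1):                # back substitution
            m[i] = dp[i - 1] - cp[i - 1] * m[i + 1]

    def evaluate(t):
        # Pick the interval containing t; the end pieces are reused outside the data.
        i = min(max(bisect_right(xs, t) - 1, 0), n - 1)
        left, right = xs[i + 1] - t, t - xs[i]
        return ((m[i] * left ** 3 + m[i + 1] * right ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * left
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * right)

    return evaluate


# Hypothetical HGB observations (day, value) at irregular visit times.
days = [0.0, 12.0, 30.0, 41.0, 60.0]
hgb = [12.1, 11.8, 11.2, 11.5, 10.9]
spline = natural_cubic_spline(days, hgb)
samples = [spline(t) for t in sample_times(55.0, 3, 7.0)]  # at days 55, 48, 41
```

The sampled values would then be standardized before being passed to a regression model.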
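The two difference-trend variants can be sketched as below. The helper names, the tolerance used to call a trend stable, and the weight curve are hypothetical; the text does not state how a stable trend is defined. `value_at` stands for any function that estimates the clinical value at a given day, for example a spline evaluated at that time.

```python
def difference_trends(value_at, t_star, n_trends=1, window=90.0):
    """Differences over consecutive equal pieces of the window ending at t_star.

    n_trends=1 gives the 90-day difference (1 Diffs); n_trends=2 gives two
    45-day differences (2 Diffs), most recent first.
    """
    step = window / n_trends
    return [value_at(t_star - k * step) - value_at(t_star - (k + 1) * step)
            for k in range(n_trends)]


def trend_label(diff, tol=0.1):
    """Label a difference; tol is a hypothetical threshold for 'stable'."""
    if diff > tol:
        return "increasing"
    if diff < -tol:
        return "decreasing"
    return "stable"


def weight_at(day):
    """Hypothetical weight curve (kg) that falls steadily over time."""
    return 80.0 - 0.05 * day


one_diff = difference_trends(weight_at, 200.0, n_trends=1)   # about [-4.5]
two_diffs = difference_trends(weight_at, 200.0, n_trends=2)  # about [-2.25, -2.25]
labels = [trend_label(d) for d in two_diffs]
```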
Finally, splines are used to describe the behavior of the observations. A two-piece second order spline is used to fit the entire observation period for ALB, HGB, and WT observations for a patient (unlike the difference trend which has a recent specified period of consideration); see Figure 1A. The splines' slope coefficient is discretized and used as input to predict LOS.
In summary, the predictors for prognosis include the number of treatments and the following options to consider in the evaluation: 0-5 patient clinical sample values; 1, 2, or no difference trends; and inclusion or not of spline coefficients.
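The options listed above define a small grid of candidate representations. The sketch below simply enumerates that grid; the dictionary keys are hypothetical names, and whether every combination was actually evaluated in the study is not stated in the text.

```python
from itertools import product

sample_counts = range(0, 6)      # 0-5 clinical sample values
diff_options = (0, 1, 2)         # no Diffs, 1 Diffs, or 2 Diffs
spline_options = (False, True)   # spline coefficients excluded or included

representations = [
    {"n_samples": n, "n_diffs": d, "use_splines": s}
    for n, d, s in product(sample_counts, diff_options, spline_options)
]
# 6 * 3 * 2 = 36 candidate representations; treatment counts are always included.
```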

Length of Survival (LOS) Prediction via Regression
Regression is a supervised learning technique that aims to develop a model mapping an input $x$ to an output $f(x)$. The assigned output is a prediction of a continuous quantity or numerical value.

Linear and Quadratic Regression
In linear regression, the numerical result $f(x)$ is found through a linear model,
$$f(x) = w^{\top} x + w_0,$$
where $x$ is the input and $w$ is the weight that fits the model, which for a linear model is the slope. The parameter $w_0$ is the offset or bias parameter to adjust the fit. The parameters are chosen to minimize the error when fitting the training set.
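Similar to linear regression, quadratic regression determines a numerical outcome but from a higher-order model that adds second-order terms in the inputs; for a single input $x$, for example,
$$f(x) = w_2 x^2 + w_1 x + w_0.$$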
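For intuition, a least-squares fit of the linear model with a single predictor can be written in closed form. The sketch below is only illustrative: the function name and the toy data are assumptions, and the models in this work use several predictors rather than one.

```python
def fit_linear(xs, ys):
    """Least-squares slope w and offset w0 for f(x) = w * x + w0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx               # slope minimizing the squared training error
    w0 = mean_y - w * mean_x    # offset so the line passes through the means
    return w, w0


# Toy data lying exactly on the line y = 3x + 2.
w, w0 = fit_linear([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0])
```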

Gaussian Process Regression
With a Gaussian Process (GP), the inputs are treated as a set of random variables and incorporated with a covariance function to determine a probabilistic outcome of the regression value [20]. The model is defined by the mean and the covariance functions. Given the $K$ input pairs $(x_i, y_i)$, the GP regression model, assuming a zero mean, summarizes to [21]
$$\begin{bmatrix} y_1 \\ \vdots \\ y_K \\ f(x) \end{bmatrix} \sim \mathcal{N}(\mathbf{0}, \mathbf{C}),$$
such that $\mathbf{C}$ is the covariance matrix evaluated considering the training set inputs $x_1, \ldots, x_K$ and the current input $x$. The covariance matrix can incorporate a kernel or function to modify its behavior, often smoothing it or bringing periodicity to it [21]. With an appropriate covariance function, the predictive uncertainty increases in regions further away from known values and shrinks near them [22]. The constant basis function is used in this analysis.

Support Vector Regression
Support vector regression (SVR) is a kernel-based approach to determine the regression output. The regression is a set of linear functions,
$$f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b = \langle w, x \rangle + b,$$
that is aimed to have the error minimized through the $\varepsilon$-insensitive loss function, and where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers. The support vectors are represented in the term $x_i$, and during the fit process the variables $w$ and $b$ are determined, such that $w$ is the weight and $b$ is the offset or bias. To allow for the spread in the values, slack variables $\xi_i$ and $\xi_i^*$ are used. The objective is then to minimize [23]
$$\frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*)$$
when there are $l$ samples, where the constant $C > 0$ weights the slack penalty. To support this boundary, the slack variables must be greater than or equal to zero, $\xi_i, \xi_i^* \geq 0$ [23]. In the evaluation, the constraints
$$y_i - \langle w, x_i \rangle - b \leq \varepsilon + \xi_i, \qquad \langle w, x_i \rangle + b - y_i \leq \varepsilon + \xi_i^*$$
are used to relate the loss and slack variables to the function. The SVR approach can be extended to allow kernels which satisfy Mercer's condition to be used. In our work, linear and radial basis function kernels are used.

[Figure 1A: Two-piece splines.]
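The ε-insensitive loss and the two kernels used in this work can be sketched as plain functions. The function names and the `gamma` width parameter of the radial basis function kernel are assumptions for illustration; the parameter values used in the study are not given here.

```python
import math


def eps_insensitive_loss(y_true, y_pred, eps):
    """Zero inside the eps tube around the target, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)


def linear_kernel(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def rbf_kernel(u, v, gamma):
    """Radial basis function kernel exp(-gamma * ||u - v||^2); gamma is hypothetical."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


inside_tube = eps_insensitive_loss(10.0, 10.5, eps=1.0)   # error within eps
outside_tube = eps_insensitive_loss(10.0, 13.0, eps=1.0)  # error exceeds eps by 2
```

Errors smaller than ε contribute nothing to the objective, which is why only a subset of the training points (the support vectors) ends up shaping the fitted function.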

Experimental Design
Multiple ways to represent the patient observations were discussed: clinical data samples, difference trends, and splines. The number of clinical data samples considered varies from zero to five. The number of difference trends included in the evaluation is zero to two. The spline information is either included or not. All input variables which are not discrete are standardized.
For the evaluation, multiple regression approaches are used, including linear and quadratic regression, GP, and SVR with radial basis function and linear kernels.
In all evaluations, a 10-fold cross-validation approach was used to train and test. The SVR parameters were selected through a nested cross-validation approach. The performance was compared based on the absolute and relative error in the LOS determined for each model evaluated. Statistical p-values from a t-test were used to verify statistical differences, or the lack thereof, when comparing different representation techniques within evaluation methods.
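A 10-fold split of the patients into training and test sets can be sketched as follows. The function name, the shuffling seed, and the patient count are assumptions; the nested cross-validation used for SVR parameter selection would repeat the same kind of split inside each training set.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # seed is an assumption
    folds = [indices[i::k] for i in range(k)]     # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# 25 is a hypothetical number of patients.
for train, test in k_fold_splits(25, k=10, seed=1):
    pass  # fit a model on the train indices, then predict LOS for the test indices
```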

Results
The first part of the evaluation examined the impact of the number of clinical sample values in the representation (0-5). The data representation also included both behavioral interpretations, namely 1 Diffs and splines. Table 2 shows that, based on the lowest median relative error, the best performance was not with more samples but with zero or one, for all methods but SVR with a linear kernel (although the difference in median relative error between 1, 2, 3, or 5 samples is small). The analysis of the p-values from the t-test showed that more samples had no statistically significant benefit over fewer samples for the models. An exception is the quadratic regression, which had a p-value of 0.05 in comparing the performance of 1 versus 3 samples. The same analysis was done using $t^*_1$ as the reference point, which led to similar results and conclusions. Because the models with more samples do not perform statistically better, the next part of the evaluation includes only one clinical sample value.
Table 3 presents results examining the performance benefit of including the behavioral representations, namely difference trends (Diffs) and splines. With two exceptions, SVR with the RBF kernel and the quadratic regression, the best performing models contained one behavioral representation. Across the modes of behavioral representation considered, the models did not show any statistically significant benefit, with p-values greater than 0.1 in most cases. One exception is the quadratic regression, where the model with no splines and no Diffs showed a statistically significant improvement over the model with 2 Diffs and splines, with a p-value of 0.014. Similar results were observed for $t^*_1$.
The different regression methodologies show an ability to work with the diversity in the clinical data inputs to various degrees. The best performing methodology is consistently the SVR with the linear kernel, followed by the linear regression approach. The RBF kernel version of the SVR did well with the data, just not as well as the linear kernel method, and the GP was not as successful with the fit but did not have the high degree of variance in the error that was seen with the quadratic regression.
The best performing models for each regression methodology are shown in Table 4. Overall, these models perform best with one behavioral representation and either zero or one sample included. In a couple of cases the performance was best with multiple behavioral representations included (both Diffs and splines), and in one case more than one sample was beneficial, based on the lower median relative errors.
In Table 4, the median absolute error is also reported. However, it may be a deceiving measure, since the same amount of absolute error may hold more meaning in some cases than in others (e.g., an error of 30 days for a patient surviving 40 days versus 180 days). Therefore, to help control for each patient's LOS, the relative error has been reported and used to compare representations and methods. Overall, the best performance in the absolute error was also seen with the SVR methods using this representation approach.
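The example in the preceding paragraph can be worked through numerically. In the sketch below, the relative error is taken to be the absolute error divided by the actual LOS, which is an assumption since the exact definition is not spelled out here; the function names and the predicted values are hypothetical.

```python
from statistics import median


def absolute_error(predicted_los, actual_los):
    """Absolute difference between predicted and actual LOS, in days."""
    return abs(predicted_los - actual_los)


def relative_error(predicted_los, actual_los):
    """Absolute error as a fraction of the actual LOS (assumed definition)."""
    return abs(predicted_los - actual_los) / actual_los


# The same 30-day error for patients surviving 40 days and 180 days.
short_survivor = relative_error(70.0, 40.0)    # 30 / 40 = 0.75
long_survivor = relative_error(210.0, 180.0)   # 30 / 180, about 0.167

# Hypothetical (predicted, actual) LOS pairs summarized by the median.
pairs = [(70.0, 40.0), (210.0, 180.0), (95.0, 100.0)]
median_relative = median(relative_error(p, a) for p, a in pairs)
median_absolute = median(absolute_error(p, a) for p, a in pairs)
```

The identical absolute error is 75% of the shorter survival time but under 17% of the longer one, which is the motivation for comparing models on relative error.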

Conclusion
The inclusion of more clinical sample values does not provide a statistically significant improvement in the prognostic performance, measured as a reduction in relative error, using regression methodologies. What does help improve the ability to determine a prognosis is the inclusion of behavioral representations and the selection of appropriate regression methods, like the SVR method used here. While regression and classification are not directly comparable, the benefits of the behavioral representations seen here are consistent with the results of the prior work. There are several future directions for this work with respect to the data representation. For example, rather than using sampling with interpolation, an alternative would be to consider dimensionality reduction techniques to reduce the need for samples and behavioral representations.

Conflict of Interest
The authors declare no conflict of interest.