Time-to-Event Analysis for Recovery from Coronavirus Disease (COVID-19): A Case Study on Wuhan and Elsewhere in China from Jan 1 to Feb 11, 2020

COVID-19 is a viral disease that became a pandemic and represents a very great challenge worldwide. The purpose of this article is to analyze COVID-19 patients' data using time-to-event analysis and to identify the factors that affect the recovery time from COVID-19. The datasets used in this study cover clinically diagnosed and confirmed cases with a recorded date of onset in Wuhan and elsewhere in China from Jan 1 to Feb 11, 2020. We used regression imputation to replace the missing symptom-onset dates based on the report dates. We fitted the Kaplan-Meier estimator and the Cox regression model to our data. The predictor variables (factors) are age, sex, and time from onset to hospitalization. The results show that the young age group recovers from COVID-19 better than the old age group (log-rank p-value 0.00012) and that, at any time, 1.9 times as many patients in the young age group are having an event (recovery) proportionally to the old age group. The results also show a non-significant difference between the male and female groups in recovering from COVID-19 (log-rank p-value 0.63), and that the early time to hospitalization group recovers from COVID-19 better than the late time to hospitalization group (log-rank p-value 0.0052). This study demonstrates the association of recovery time from COVID-19 with age, sex, and time to hospitalization. © 2020 ASTES Publishers. All rights reserved.


Introduction
The novel coronavirus disease, also known as COVID-19, is a viral disease that became a pandemic and turned out to be one of the greatest challenges the world has faced since World War Two. The virus originated in Wuhan city, which is located in the Hubei province of China, and it is spreading at a fast rate around the globe. Diseases caused by viral infection continue to emerge and raise serious public health issues worldwide. Several viral epidemics have appeared in the last twenty years [1]. In 2002 the severe acute respiratory syndrome coronavirus (SARS-CoV), which is reported to be still circulating in China [2]-[4], appeared, followed by H1N1 influenza in 2009. Most recently, in 2012, the Middle East respiratory syndrome coronavirus, known as MERS-CoV, was recorded. The viral genome consists mainly of positive-stranded RNA and has a distinct structure [5]. There are four genera in the Coronavirinae family: α, β, γ, and δ. It is believed that the virus is present in wild animals, since it has been isolated from bats and other animals [6]. The novel COVID-19 causes mild to moderate respiratory illness, but some people, and people with underlying health problems, can develop serious illness. Worldwide, this disease has affected more than twelve million people, and the number of people who have died from the infection exceeds five hundred thousand according to the World Health Organization (WHO) report at the time of writing. According to WHO, mild or asymptomatic COVID-19 infections represent 80% of the cases, while severe infections, which require oxygen, and critical infections, which require ventilation, represent 15% and 5% of the cases respectively. So far, the mortality for COVID-19, which is the total number of deaths divided by the total number of cases, is about 4.5%, or roughly 5% (this percentage is calculated from figures taken from the WHO web site on 09 July 2020, which show 12,196,982 total infections and 552,781 total deaths).
This mortality is higher than that of seasonal influenza, which is below 1% according to WHO.
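The crude mortality figure quoted above can be rechecked from the two WHO totals given in the text. The short sketch below only repeats that division; the variable names are ours and it makes no adjustment for reporting delays or undetected cases.

```python
# Totals quoted in the text from the WHO web site on 09 July 2020.
total_cases = 12_196_982
total_deaths = 552_781

# Crude mortality as defined in the text: total deaths divided by total cases.
crude_mortality = total_deaths / total_cases

print(f"Crude mortality: {crude_mortality:.1%}")
```

With these totals the ratio comes to roughly 4.5%, which is close to the 5% order of magnitude discussed in the text and well above 1%.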
Normally, data mining and machine learning methods can be used to analyze biomedical datasets [7]. When the data include survival times, a different analysis approach is required. This approach, the study of the time from entry into a study until a subsequent event occurs, is known as survival analysis. Survival analysis is applied in different disciplines such as medicine, engineering, social or behavioral sciences, and biology [8]-[16]. In medicine, survival analysis is used to study people at risk of experiencing a negative event such as death, which is where the name survival analysis comes from. Survival analysis is also applicable to areas other than mortality, such as analyzing the time taken to recover from certain diseases or the time taken to perform certain exercises to maximum tolerance [17]-[20]. Normally we compare two or more groups of patients with respect to the time to the event. More than one event can be considered in the same analysis, but we normally take one event at a time as the event of interest in the study, and it can be death or recovery [21].
Many methods can be used for survival analysis. These include the Kaplan-Meier method, which is an estimator of survival probabilities [19,22], and the Cox regression model, which is now known as the Cox Proportional Hazard Model (CPHM) [23]. These two methods are considered among those that have contributed significantly to the development of the survival analysis field.
Many studies were conducted to model the survival time and to predict the mortality risk for COVID-19. Guillermo Salinas-Escudero et al. applied survival analysis to study the effect of COVID-19 in the Mexican population [24]. The factors they used include age, sex, comorbidities, hospitalization, and admission to the intensive care unit. They applied the Kaplan-Meier and Cox regression models to their data. Their results show that men and older people have higher mortality than women and young people respectively. Monira Mollazehi et al. modeled the survival time to recover from COVID-19 [25]. They used data from Singapore for the period between January 23 and March 13, 2020. Their purpose was to identify the factors affecting the recovery time from COVID-19. They used the patient's age and nationality as predictors, and they found that younger patients recover from COVID-19 faster than older patients and that Singaporean patients recover faster than non-Singaporean patients. They compared the results of different models and found that the Weibull model fits their data best. Using the Weibull model, they obtained hazard rates of 1.01 and 0.76 for age and nationality respectively. Qinxia Wang et al. used survival-convolution models to model the duration for which a patient remains infectious to others [26]. Noam Barda et al. proposed a hybrid methodology to construct a multivariable prediction model. In their hybrid method, they used a baseline model trained on population data to discriminate the risk, and then they used a multicalibration algorithm for the risk prediction [27]. Different factors may influence the mortality or the recovery time from COVID-19. These factors can be used to divide the patients into two or more groups, and they include age, gender, and the time from acquiring the illness to hospitalization. This study aims to investigate whether these factors affect recovery time.
The datasets used in this study were downloaded from GitHub (https://github.com/mrc-ide/COVID19_CFR_submission). From these datasets, we used two. The first one is for cases that died from COVID-19 in Hubei, and the second is for patients returning to their home countries, which was obtained from six flights that departed between Jan 30 and Feb 1, 2020.
The rest of the paper is organized as follows. The next section describes the materials and methods: it starts by showing how we prepared the dataset, then describes the imputation technique used to replace the missing data, and finally explains the Kaplan-Meier survival curve, the log-rank test, and the Cox proportional hazards (PH) model. We present the results and the discussion in the third section and the conclusion in the last section.

Dataset
The datasets that we used in this study are for clinically diagnosed and confirmed cases with a recorded date of onset in Wuhan and elsewhere in China from Jan 1 to Feb 11, 2020. From these datasets, we used two. The first one is for cases that died from COVID-19 in Hubei. It contains the features sex, age, date of symptom onset, date of hospitalization, and date of death or recovery from COVID-19. The date of symptom onset is not available for some cases, so we used regression-based imputation to replace the missing data. The second dataset is for patients returning to their home countries, which was obtained from six flights that departed between Jan 30 and Feb 1, 2020. Here too, the missing dates of symptom onset were replaced using regression imputation, and then we merged the two datasets. We removed the cases where the sex, the age, or the date of hospitalization is not available, and we ended up with 693 cases, which represent recovered and died patients. The datasets were downloaded from GitHub (https://github.com/mrc-ide/COVID19_CFR_submission). They were used by [28], which extracted them from the WHO-China Joint Mission report to estimate the severity of COVID-19 using a model-based analysis.

Regression imputation
The datasets that we downloaded have missing data in the date of symptom onset. Instead of deleting all the cases that have missing data, we used imputation to replace the missing values with estimates, because the timing of symptom onset is needed to study the recovery time. We used regression imputation to preserve all cases by replacing the missing date of symptom onset with a probable value estimated from the date of the report, because it is clear from Figure 1 that there is a strong correlation between these two dates. In Figure 2, the scatter plot shows the relationship between these two dates, and the value of R^2 (0.7266) emphasizes the strength of this relationship. The model used to estimate the missing dates of symptom onset is y = 0.8227x − 0.5428, also shown in Figure 2, where x and y represent the report date and the date of symptom onset respectively. Preserving the cases with missing data using regression imputation has several advantages. It avoids deleting the cases with missing data, which can alter the variance and the shape of the distribution. It also substitutes the missing value based on another variable, so no novel information is added. As a result, we have an increased sample size and therefore a reduced standard error [29]-[31].
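The imputation step described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the dates are coded as numeric day values (the source does not state the coding), and the function names and the three example cases are hypothetical. Only the slope and intercept come from the text.

```python
from typing import List, Optional

# Coefficients of the fitted line reported in the text: y = 0.8227x - 0.5428,
# where x is the report date and y the symptom-onset date.
SLOPE = 0.8227
INTERCEPT = -0.5428


def impute_onset(report_day: float) -> float:
    """Predict the onset day from the report day using the reported line."""
    return SLOPE * report_day + INTERCEPT


def fill_missing_onsets(report_days: List[float],
                        onset_days: List[Optional[float]]) -> List[float]:
    """Keep observed onset days and replace missing ones (None) with predictions."""
    return [
        onset if onset is not None else impute_onset(report)
        for report, onset in zip(report_days, onset_days)
    ]


# Hypothetical example: dates coded as day numbers (an assumption).
# The second case has no recorded onset date and is imputed.
reports = [20.0, 30.0, 25.0]
onsets = [15.0, None, 21.0]
completed = fill_missing_onsets(reports, onsets)
print(completed)
```

Observed values pass through unchanged, so the imputation only affects the cases that would otherwise have been dropped.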

Kaplan-Meier survival curve
In the Kaplan-Meier survival curve, the survival times, including the censored data (observations for which the event is not observed), are denoted $t_1, t_2, \ldots, t_n$. These times are ordered by increasing duration for a group of $n$ subjects. We can estimate the proportion (survival rate) of subjects $S(t)$ surviving beyond any follow-up time $t_p$ as [17]:

$$S(t_p) = \prod_{i=1}^{p} \frac{r_i - d_i}{r_i}$$

Here $r_i$ represents the number of subjects alive just before time $t_i$, $t_p$ is the largest survival time considered, and $i$ is any value between 1 and $p$; $d_i$ represents the number of subjects who died at time $t_i$, so $d_i = 0$ for censored observations. Before the occurrence of the first event all the patients are alive, therefore $S(t) = 1$. Considering time $t_i$, where the number of events (deaths) is $d_i$ and the number alive just before $t_i$ is $r_i$, $S(t_i)$ can be calculated as:

$$S(t_i) = S(t_{i-1}) \left(1 - \frac{d_i}{r_i}\right)$$

For censored observations we have no information about the survival time; therefore, $S(t_i)$ is not recalculated for them, and the survival curve does not change at the time of a censored observation. At the next event, the number of patients at risk is reduced by the number of censored observations between the two events [32].
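The product-limit calculation described above can be sketched in a few lines of plain Python. This is an illustrative implementation under the usual convention that a subject censored at a given time still counts as at risk at that time; the function name and the five-subject example are hypothetical and are not taken from the study data.

```python
from collections import Counter
from typing import List, Tuple


def kaplan_meier(times: List[float],
                 events: List[int]) -> List[Tuple[float, int, int, float]]:
    """Return (t_i, r_i, d_i, S(t_i)) at each distinct event time.

    times  : follow-up time of each subject
    events : 1 if the event was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # events plus censorings at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # The curve only steps down at times where an event occurs.
            survival *= (at_risk - d) / at_risk
            curve.append((t, at_risk, d, survival))
        # Everyone observed at t (event or censored) leaves the risk set.
        at_risk -= leaving[t]
    return curve


# Hypothetical data: five subjects, two of them censored (event = 0).
example_times = [2, 3, 3, 5, 6]
example_events = [1, 0, 1, 1, 0]
curve = kaplan_meier(example_times, example_events)
for t, r, d, s in curve:
    print(f"t={t}: at risk={r}, events={d}, S(t)={s:.3f}")
```

In the example, the subject censored at time 3 contributes to the risk set at time 3 but not at time 5, which is how censoring reduces the number at risk at the next event.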

Log-rank test
Normally, we need to compare the survival curves of two groups. For this purpose we use the log-rank test, which is related to a test that uses the logarithms of the ranks of the data. It is used under the assumptions that i) the survival times are continuous or ordinal, and ii) one group's risk of an event relative to the other does not change with time. When a death event occurs at time $t_i$, we consider the total number alive ($r_i$) and the number still alive up to time $t_i$ in a specific group (say group A), $r_{Ai}$. Let $d_i$ be the total number of deaths, i.e. events, at time $t_i$. Then the expected number of deaths in group A at time $t_i$ can be calculated as:

$$E_{Ai} = \frac{r_{Ai}\, d_i}{r_i}$$

The total number of expected deaths for group A can then be calculated as:

$$E_A = \sum_{i} E_{Ai}$$

The total number of expected deaths in group B can be calculated from the total number of expected deaths for group A, given that the total number of events is $n$, as follows:

$$E_B = n - E_A$$

In the log-rank test, the data for the two groups combined are ordered and then each event, in turn, is considered starting at time $t = 0$. The log-rank statistic is then calculated for the two groups based on the summed observed minus expected score for each group, and it is given as follows:

$$\chi^2 = \frac{(O_A - E_A)^2}{E_A} + \frac{(O_B - E_B)^2}{E_B}$$

Here $O_A$ and $O_B$ represent the total number of observed events in groups A and B respectively, and $E_A$ and $E_B$ represent the total number of expected events in group A and group B respectively. This statistic is compared with the $\chi^2$ distribution with one degree of freedom to decide, at a specific level of significance, whether there is a significant difference between the two groups.
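The expected-count bookkeeping described above can be sketched as follows. This is a minimal illustration of the approximate log-rank statistic, the sum of (O − E)²/E over the two groups, referred to a chi-square distribution with one degree of freedom. The function name and the two small groups are hypothetical, and statistical software often uses a variance-based version of the statistic, so results may differ slightly from packaged routines.

```python
import math
from typing import List, Tuple


def log_rank(times_a: List[float], events_a: List[int],
             times_b: List[float], events_b: List[int]) -> Tuple[float, float]:
    """Approximate log-rank test for two groups.

    Returns (chi-square statistic, p-value with 1 degree of freedom).
    Assumes both groups end up with a positive expected event count.
    """
    all_times = times_a + times_b
    all_events = events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})

    observed_a = sum(events_a)
    observed_b = sum(events_b)
    expected_a = 0.0
    for t in event_times:
        r_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        r_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d = sum(1 for x, e in zip(all_times, all_events) if x == t and e == 1)
        expected_a += d * r_a / (r_a + r_b)
    # Expected events in B follow from the total number of events.
    expected_b = (observed_a + observed_b) - expected_a

    chi2 = ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical groups with no censoring (all events observed).
group_a_times, group_a_events = [1, 2, 3, 4], [1, 1, 1, 1]
group_b_times, group_b_events = [3, 5, 6, 7], [1, 1, 1, 1]
chi2, p_value = log_rank(group_a_times, group_a_events,
                         group_b_times, group_b_events)
print(f"chi-square = {chi2:.3f}, p = {p_value:.3f}")
```

Comparing a group with itself gives a statistic of zero and a p-value of one, which is a quick sanity check on the bookkeeping.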

Cox regression model
The Cox regression model, also known as the Cox proportional hazards model (CPHM), is used to investigate the association between the survival time of patients and one or more predictor variables. CPHM is a regression model that has a dependent variable and independent variables, and it is used to assess the effect of specific variables on the event. Its formula is written as shown in the following equation:

$$h(t) = h_0(t) \exp\left(B_1 X_1 + B_2 X_2 + \cdots + B_p X_p\right)$$

where $h_0(t)$ is the baseline hazard, the $X_i$ are time-independent predictor variables, and the $B_i$ are the regression coefficients. It is important to note that Kaplan-Meier curves and log-rank tests work with a categorical predictor variable and can describe the survival according to only one factor under investigation. The CPHM can work with quantitative as well as categorical predictor variables, and it can assess the effect of several factors on the survival time of patients at the same time.
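To make the role of the coefficients concrete, the sketch below evaluates a hazard of the usual Cox form, h(t) = h0(t) exp(B1 X1 + ... + Bp Xp), and shows that the ratio of hazards for two covariate profiles does not depend on the baseline hazard. The function names and baseline values are hypothetical; the coefficient is chosen so that the hazard ratio equals the value 1.9 reported in this paper for the young age group, purely as an illustration.

```python
import math
from typing import Sequence


def cox_hazard(baseline_hazard: float,
               coefficients: Sequence[float],
               covariates: Sequence[float]) -> float:
    """Hazard at one time point: h0(t) * exp(B1*X1 + ... + Bp*Xp)."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, covariates))
    return baseline_hazard * math.exp(linear_predictor)


def hazard_ratio(coefficients: Sequence[float],
                 covariates_1: Sequence[float],
                 covariates_0: Sequence[float]) -> float:
    """Ratio of hazards for two covariate profiles; the baseline cancels."""
    difference = sum(b * (x1 - x0)
                     for b, x1, x0 in zip(coefficients, covariates_1, covariates_0))
    return math.exp(difference)


# Single binary covariate: 1 = young, 0 = old (reference group).
# The coefficient log(1.9) is an illustrative value, not a fitted estimate.
coefficients = [math.log(1.9)]
ratio = hazard_ratio(coefficients, [1], [0])

# The same ratio is obtained whatever (hypothetical) baseline hazard is used.
ratio_low_baseline = cox_hazard(0.02, coefficients, [1]) / cox_hazard(0.02, coefficients, [0])
ratio_high_baseline = cox_hazard(0.50, coefficients, [1]) / cox_hazard(0.50, coefficients, [0])
print(ratio, ratio_low_baseline, ratio_high_baseline)
```

This is why the model can report a single hazard ratio per predictor without estimating the baseline hazard explicitly.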

Results and discussion
To analyze the COVID-19 data, we used the survival and survminer packages in R. In these COVID-19 patients' data, the event of interest is the recovery of the patient and the outcome is the time in days until recovery. We must consider an important analytical problem called censoring, which occurs when patients at risk have died or their recovery time is not known because they were lost to follow-up. In such cases the patient's exact recovery time is not known, at least within the period of the study, so the patient's survival time is considered censored. In other words, in this study a case is censored if the patient died or was lost to follow-up within the study period, which runs from Jan 1 to Feb 11, 2020, as described in the dataset subsection. After extracting and preparing the data, we read it into R. We consider three predictor variables (patient gender, age, and time to hospitalization). The gender variable is categorical and can easily be analyzed using the Kaplan-Meier survival curve and the log-rank test, since we have two groups, male and female. The patient age is a continuous variable; therefore, we need to convert it into a categorical variable to be able to use it in this way. To do so we use a cutoff, where ages greater than the cutoff are considered old and ages less than the cutoff are considered young. To determine the cutoff, we look at the overall distribution of age values using the histogram shown in Figure 3, which suggests a cutoff of 50. We also converted time to hospitalization to a categorical variable by considering hospitalization within 6 days as 'Early' and hospitalization after more than 6 days as 'Late'. To analyze the data based on the age group, we created a survival object and fitted the Kaplan-Meier curves by passing the created survival object to the survfit function. We obtained the results given in Table 1.
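The data preparation described above (event coding, censoring, and dichotomizing age and time to hospitalization) can be sketched as follows. The analysis in the paper was done in R; this Python sketch is only illustrative. The record layout, field names, outcome labels, and example cases are hypothetical, and assigning an age of exactly 50 to the young group is an assumption, since the text only specifies ages above and below the cutoff.

```python
from dataclasses import dataclass

AGE_CUTOFF_YEARS = 50            # cutoff suggested by the age histogram in the text
EARLY_HOSPITALIZATION_DAYS = 6   # within 6 days counts as 'Early'


@dataclass
class Case:
    age: float
    days_onset_to_hospitalization: float
    days_to_outcome: float
    outcome: str  # hypothetical labels: "recovered", "died" or "lost"


def age_group(case: Case) -> str:
    # Assumption: an age exactly equal to the cutoff is treated as young.
    return "Old" if case.age > AGE_CUTOFF_YEARS else "Young"


def hospitalization_group(case: Case) -> str:
    if case.days_onset_to_hospitalization <= EARLY_HOSPITALIZATION_DAYS:
        return "Early"
    return "Late"


def event_indicator(case: Case) -> int:
    """1 if the event of interest (recovery) was observed, 0 if censored."""
    return 1 if case.outcome == "recovered" else 0


# Hypothetical cases, not taken from the study data.
cases = [
    Case(age=34, days_onset_to_hospitalization=3, days_to_outcome=18, outcome="recovered"),
    Case(age=67, days_onset_to_hospitalization=9, days_to_outcome=12, outcome="died"),
    Case(age=50, days_onset_to_hospitalization=6, days_to_outcome=25, outcome="lost"),
]
rows = [
    (c.days_to_outcome, event_indicator(c), age_group(c), hospitalization_group(c))
    for c in cases
]
for row in rows:
    print(row)
```

Each row pairs a follow-up time with an event indicator, which is the same information a survival object carries, plus the two grouping variables used for the comparisons.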
Normally, the results obtained from the survfit function are the probabilities of non-recovery, i.e. of not yet having the event, as shown in the 4th column of Table 1 (Table 1 shows the results up to day 16). In this study, we are looking for a positive event (recovery); therefore, we calculate the recovery rate as (1 − the probability of non-recovery). The results show that in the old age group 1 patient out of 343 recovered over the four-day period; therefore, the probability of non-recovery is (343−1)/343 = 0.997 (see the first row of Table 1), so the recovery rate is (1 − 0.997) = 0.003. Over the five-day period (see the second row of Table 1), 21 of the remaining 342 patients were lost to follow-up (censored), so the number of remaining patients on the 5th day is 321. One of the remaining patients recovered on the 5th day; therefore, the proportion not recovered is 0.994. We can calculate the survival at a specific time t as the product of the observed survival rates until t, i.e. S(t) = p.1 * p.2 * ... * p.t, where p.1 is the proportion of patients surviving past the first time point, p.2 is the proportion surviving past the second time point, and so forth.
It is very important to take into account that, starting from p.2, we should consider only those patients who survived past the previous time point when calculating the survival rate for the following time point. In other words, p.2, p.3, ..., p.t are survival rates that are conditional on the previous survival rates. Given the assumption of independent and random censoring, we assume that the 21 patients who were censored were similar to the 321 who remained at risk regarding their survival experience. Since 1 of the 321 patients who remained at risk recovered on the 5th day and 1 patient recovered on the 4th day, the total number of patients who recovered over the course of 5 days is 2. Subtracting 2 from the original number, 343, yields 341. The recovery rate on the 5th day is then (1 − 341/343) = 0.006. The same analysis is applied to the young age group (the results up to day 16 out of 30 days are shown in Table 1). The lower and upper 95% confidence limits indicate how precise the estimate is [33]. In the first row of Table 1, the lower and upper 95% CI show that we are 95% confident that the interval (0.991, 1.000) contains the true value of the parameter. We can also see that this interval is very narrow, which means that the estimate is very precise. This narrow interval is associated with a very small standard error (0.003).
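The first-row figures quoted above (0.997, a standard error of 0.003, and the interval 0.991 to 1.000) can be reproduced with a short calculation. The sketch assumes that the standard error comes from Greenwood's formula and that the confidence interval was computed on the log scale and capped at 1, which to our understanding is the default behaviour of survfit in R; the paper does not state this explicitly, so treat it as an assumption.

```python
import math

# First row of Table 1 (old age group): 343 at risk, 1 recovery on day 4.
at_risk_day4 = 343
recovered_day4 = 1
survival_day4 = (at_risk_day4 - recovered_day4) / at_risk_day4

# Greenwood's formula with a single event time: sum of d / (r * (r - d)).
greenwood_sum = recovered_day4 / (at_risk_day4 * (at_risk_day4 - recovered_day4))
std_error = survival_day4 * math.sqrt(greenwood_sum)

# Assumed log-scale 95% interval, with the upper limit capped at 1.
z = 1.96
lower = survival_day4 * math.exp(-z * math.sqrt(greenwood_sum))
upper = min(1.0, survival_day4 * math.exp(z * math.sqrt(greenwood_sum)))

# Second row: 21 censored, so 321 at risk on day 5, of whom 1 recovers.
at_risk_day5 = 321
recovered_day5 = 1
survival_day5 = survival_day4 * (at_risk_day5 - recovered_day5) / at_risk_day5

print(f"S(4) = {survival_day4:.3f}, SE = {std_error:.3f}, "
      f"95% CI = ({lower:.3f}, {upper:.3f})")
print(f"S(5) = {survival_day5:.3f}, recovery rate = {1 - survival_day5:.3f}")
```

Under these assumptions the calculation agrees with the values reported in the text to three decimal places.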
The corresponding survival curve can be obtained by applying the function ggsurvplot to the fitted survival object. The obtained curves (see Figure 4) are step functions that allow us to compare the survival time of the two age groups. Typically, the curve starts at 1, representing the fact that none of the patients has had the event at entry into the study (see Figure 4 A). Over time the curve represents the probability of patients remaining non-recovered. Since we are interested in the probability of recovery, we drew the curve starting from 0 to represent the proportion of recovered patients, as shown in Figure 4 B. It is clear from Figure 4 B that the curve of the young age group consistently lies above that of the old age group. This indicates that the young age group recovers from COVID-19 better than the old age group. We note that the two functions are somewhat close to each other in the first few days. This indicates that the advantage of the young age group emerges later in the course of the disease rather than in its first days. The estimate of the median recovery time for the young age group can be obtained from Figure 4 by selecting the value on the time axis that corresponds to a survival probability of 0.5. From the figure, it is clear that the median recovery time is greater than 20 days. The p-value of the log-rank test is 0.00012, which is significant, taking p < 0.05 to indicate statistical significance. In other words, the results show that there is a significant difference between young and old patients regarding recovery from COVID-19.
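Reading the median off a step curve, as described above, amounts to finding the first time at which the estimated probability of not yet having recovered drops to 0.5 or below. The sketch below shows that lookup; the function name and the curve values are hypothetical and are not taken from Figure 4.

```python
from typing import List, Optional, Tuple


def median_time(curve: List[Tuple[float, float]]) -> Optional[float]:
    """Return the first time at which the step curve reaches 0.5 or below.

    curve : (time, probability of not yet having the event) pairs,
            sorted by increasing time.
    Returns None when the curve never drops to 0.5 within follow-up.
    """
    for time, probability in curve:
        if probability <= 0.5:
            return time
    return None


# Hypothetical step curve, for illustration only.
example_curve = [(4, 0.90), (10, 0.70), (22, 0.48), (30, 0.30)]
print(median_time(example_curve))
```

When the curve never reaches 0.5 during follow-up, the median is not estimable from the data, which the function signals by returning None.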
To analyze the data based on gender (Male, Female), we created a survival object directly, since the gender of the patient is already a categorical variable and does not need to be converted. Then we fitted the Kaplan-Meier curves by passing the created survival object to the survfit function. We obtained the results given in Table 2 (we show the results for the first 15 days). The results show that in the female group 1 patient out of 295 recovered over the four-day period; therefore, the probability of non-recovery is 294/295 = 0.997 (see the first row). The recovery rate on the 4th day is then (1 − 0.997) = 0.003. The rest of the female group results and the male group results can be described in the same way as the age group results given in Table 1. The corresponding survival curve for sex is shown in Figure 5, where the step functions allow us to compare the survival time of the two sex groups. Figure 5 A shows the probability of patients remaining unrecovered based on gender. The survival functions of the female group and the male group follow similar paths from time 0 up to day 40, and accordingly the p-value (0.63) from the log-rank test is not significant, taking p < 0.05 to indicate statistical significance. Figure 5 B shows the curves starting from 0, which represent the proportion of recovered patients based on sex.
We used the Cox regression model to measure the effect of the different factors on recovery from COVID-19. In Cox regression the measure of effect is the hazard ratio. The hazard is the instantaneous event rate, that is, the probability that a patient has the event at time t given that the event has not occurred (the patient has not recovered) up to time t [23,34]. A hazard ratio of 1 means that the event rates are the same in the two groups. Figure 7 shows that at any time 1.9 times as many patients in the young age group are having an event (recovery) proportionally to the old age group, which is taken as the reference, and the value 0.001** shows that this result is statistically significant. The hazard ratio supports the results that we obtained from the step functions depicted in Figure 4 B. Regarding the sex group, the results in the figure show that the hazard ratio is 1, which indicates that there is no difference between male and female patients in recovery from COVID-19. This result supports the results that we obtained from the step functions. We note that the p-value differs from the one obtained with the Kaplan-Meier estimator and the log-rank test. This is because the Cox model estimates the hazard ratio and the respective risk of the event, whereas the Kaplan-Meier estimator and the log-rank test work with the survival probability [35]. Therefore, the results yielded by these different methods can differ in terms of significance. We also analyzed the data based on the time to hospitalization (early, late). We considered the time to hospitalization as early if the patient was hospitalized within 6 days of catching the disease and as late if the patient was hospitalized more than 6 days after catching the disease. Since not all COVID-19 patients are hospitalized, we deleted the cases that have no hospitalization date.
We then created a survival object and fitted the Kaplan-Meier curves by passing the created survival object to the survfit function. We obtained the results given in Table 3, which shows the results for the first 20 days. The results show that in the early time to hospitalization group 1 patient out of 159 recovered over the four-day period; therefore, the proportion not recovered is 158/159 = 0.994 (see the first row of Table 3), and the proportion recovered is 1 − 0.994 = 0.006. In the late hospitalization group, the results show that over 12 days 1 patient out of 103 recovered, so the proportion not recovered is 102/103 = 0.990 and hence the proportion recovered is 1 − 0.990 = 0.010. We note that in the early hospitalization group recovery starts at day 4, while in the late hospitalization group recovery starts at day 12.
The survival curve of the time to hospitalization groups is shown in Figure 6. It is clear from Figure 6 B that the curve of the early time to hospitalization group consistently lies above that of the late time to hospitalization group. This indicates that the early time to hospitalization group recovers from COVID-19 better than the late time to hospitalization group. We also note that the two functions are somewhat close to each other in the first few days (up to day 4). This indicates that the advantage of the early time to hospitalization group appears after day 4 rather than in the first days. The p-value of the log-rank test is 0.0052, which is significant, taking p < 0.05 to indicate statistical significance. In other words, the results show that there is a significant difference between the early and the late time to hospitalization groups. The Cox regression model for time to hospitalization yielded the hazard ratio, a relative measure that compares the late time to hospitalization group with the early time to hospitalization group, as shown in Figure 8. A hazard ratio of 0.54 for the late hospitalization group tells us that patients who were sent to hospital late have a lower chance of recovering than patients who were sent to hospital early, who served as the reference for calculating the hazard ratio. As shown by the forest plot, the respective 95% confidence interval is (0.35, 0.84) and this result is significant (p-value = 0.006). Using this model, we can see that the time to hospitalization variable significantly influences the patients' recovery from COVID-19. We also note that the obtained p-value differs from the one obtained with the Kaplan-Meier estimator and the log-rank test, for the same reason that we presented when analyzing the sex and age groups.
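The reported hazard ratio, confidence interval, and p-value for late hospitalization can be checked for internal consistency. The sketch below assumes that the interval is a Wald interval, symmetric on the log scale, and that the p-value comes from a two-sided normal (Wald) test; the paper does not state which test produced the p-value, so this is an approximation from the rounded published figures.

```python
import math

# Values reported in the text for the late hospitalization group.
hazard_ratio_late = 0.54
ci_lower, ci_upper = 0.35, 0.84

# Standard error of log(HR) recovered from the width of the 95% interval
# (assumes the interval is log(HR) +/- 1.96 * SE).
z_critical = 1.96
se_log_hr = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z_critical)

# Two-sided Wald test of log(HR) = 0 using the standard normal distribution.
z_statistic = math.log(hazard_ratio_late) / se_log_hr
p_value = math.erfc(abs(z_statistic) / math.sqrt(2))

# The same comparison with the late group as reference is simply the inverse.
hazard_ratio_early = 1 / hazard_ratio_late

print(f"SE(log HR) = {se_log_hr:.3f}, z = {z_statistic:.2f}, p = {p_value:.4f}")
print(f"Early vs late hazard ratio = {hazard_ratio_early:.2f}")
```

Under these assumptions the recovered p-value is close to the reported 0.006, and the inverted ratio of about 1.85 expresses the same effect as a higher recovery rate in the early group.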
The study of Salinas-Escudero et al. [24], which applied Kaplan-Meier and Cox regression models to the Mexican population, found that age has a significant effect on the outcome of COVID-19. This finding agrees with our finding on the data we used. On the other hand, their study found that sex has a significant effect, which disagrees with our finding. The study of Monira Mollazehi et al. [25] applied the Weibull model to data from Singapore. The factors they used are age and nationality. Their finding agrees with ours regarding the age group.
This research has two limitations that need to be declared. First, the dataset covers a specific region and a specific period. Second, the dataset is relatively small compared to the total number of infected cases.

Conclusion
In this work, we used survival analysis to analyze COVID-19 data obtained from clinically diagnosed and confirmed cases with a recorded date of onset in Wuhan and elsewhere in China from Jan 1 to Feb 11, 2020. For the analysis we used the Kaplan-Meier method, which is an estimator of survival probabilities, and the Cox regression model, which is known as the Proportional Hazard Model. The event of interest in our analysis is the recovery of the patients from COVID-19 and the outcome is the time in days until recovery. The predictor variables that we used are sex, age, and time to hospitalization. The results show that the young age group recovers from COVID-19 better than the old age group, with a significant difference (p-value = 0.00012), and that at any time 1.9 times as many patients in the young age group are having an event (recovery) proportionally to the old age group. The step functions of the sex group show that the female and male groups are somewhat close to each other in recovering from COVID-19, and the p-value of 0.63 indicates a non-significant difference between male and female patients, taking p < 0.05 to indicate statistical significance. The results also show that the early time to hospitalization group recovers from COVID-19 better than the late time to hospitalization group (the p-value of the log-rank test is 0.0052).

Conflict of Interest
The authors declare no conflict of interest.