This vignette describes the methodology used to simulate event times in the simsurv package. For a vignette on usage of the package, including examples and code, please see "How to use the simsurv package".
The survival function for individual $i$ is the probability that their "true" event time $T_i^*$ is greater than the current time $t$. That is, the survival function can be defined as
$$S_i(t) = P(T_i^* > t)$$
Moreover, the corresponding probability of having failed at or before time $t$ (i.e. having not survived up to time $t$) is the complement of the survival function. That is, the probability of failure is defined as
$$F_i(t) = P(T_i^* \le t) = 1 - S_i(t)$$
If the survival time $T_i^*$ is known to be drawn from some parametric distribution, then the probability of failure, $F_i(t)$, is equivalent to the cumulative distribution function (CDF) for the distribution of event times. Moreover, probability distribution theory tells us that the CDF of a continuous random variable, evaluated at that random variable, must follow a uniform distribution on the range 0 to 1 (Mood et al. (1974)). That is, $F_X(X) \sim U(0,1)$ where $F_X(\cdot)$ denotes the CDF for the continuous random variable $X$. Similarly, the complement of the CDF for $X$ must also follow a uniform distribution on the range 0 to 1, that is, $1 - F_X(X) \sim U(0,1)$.
These results therefore allow one to conclude that under a standard parametric distributional assumption for the event times $T_i^*$ ($i = 1, \dots, N$), the survival probability for individual $i$ at their true event time will be a uniform random variable on the range 0 to 1. That is,
$$S_i(T_i^*) = U_i \sim U(0,1)$$
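As an informal check of this result, the following Python sketch (an illustration only, not code from simsurv, which is an R package) draws event times from an exponential distribution and evaluates the survival function at each draw. The rate value and the names `rate` and `surv` are assumptions made for the example.

```python
import math
import random

random.seed(1)

rate = 0.5  # assumed rate of a hypothetical exponential event-time distribution


def surv(t):
    # Survival function of the exponential distribution: S(t) = exp(-rate * t)
    return math.exp(-rate * t)


# Draw "true" event times, then evaluate the survival function at each one
times = [random.expovariate(rate) for _ in range(20000)]
u = [surv(t) for t in times]

# A U(0,1) random variable has mean 1/2 and variance 1/12
mean_u = sum(u) / len(u)
var_u = sum((x - mean_u) ** 2 for x in u) / len(u)
print(mean_u, var_u)
```

If the result holds, the sample mean and variance of the transformed draws should be close to 1/2 and 1/12 respectively.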
It is then possible to extend these results to the setting of a proportional hazards model. Under a proportional hazards model the survival probability for individual $i$ at their event time $T_i^*$ can be written as
$$S_i(T_i^*) = \exp\left(-H_0(T_i^*) \exp(X_i^T \beta)\right) = U_i \sim U(0,1)$$
where $H_0(t) = \int_0^t h_0(s)\,ds$ is the cumulative baseline hazard evaluated at time $t$, and $X_i$ is a vector of covariates with associated population-level (i.e. fixed effect) parameters $\beta$.
Rearranging the equation for $S_i(t)$ and solving for $t$ leads to the following general form for the inverted survival function
$$S_i^{-1}(u) = H_0^{-1}\left(-\log(u) \exp(-X_i^T \beta)\right)$$
where $S_i^{-1}(u)$ is the inverted survival function for individual $i$, $H_0^{-1}(u)$ is the inverted cumulative baseline hazard function, and $X_i$ and $\beta$ are the covariate vector and population-level (i.e. fixed effect) parameters.
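Spelling out the rearrangement may help. Writing $u$ for the value of the survival function at time $t$, the intermediate steps can be sketched as:

$$
\begin{aligned}
u &= \exp\left(-H_0(t) \exp(X_i^T \beta)\right) \\
-\log(u) &= H_0(t) \exp(X_i^T \beta) \\
H_0(t) &= -\log(u) \exp(-X_i^T \beta) \\
t &= H_0^{-1}\left(-\log(u) \exp(-X_i^T \beta)\right)
\end{aligned}
$$

The final step assumes that $H_0$ is strictly increasing, so that its inverse exists.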
Therefore, if the cumulative baseline hazard function is invertible, we can easily simulate a new event time as
$$T_i^s = S_i^{-1}(U_i)$$
where $T_i^s$ is the simulated event time for individual $i$, $S_i^{-1}(u)$ is the inverted survival function defined previously, and $U_i$ is a random variable drawn from a $U(0,1)$ distribution. Note that if the cumulative baseline hazard is directly invertible then an analytic form will be available for $S_i^{-1}(u)$. That is, we can just plug in random draws of $U_i$ and directly calculate the simulated event times. Since independent draws of a $U(0,1)$ random variable are easily obtained using any standard statistical software, this method is easy and fast for simulating event times.
This method was first proposed by Bender et al. (2005) and is commonly known as the cumulative hazard inversion method.
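The general recipe can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the simsurv implementation: the function name `simulate_event_time`, the constant baseline hazard of 0.1, and the log hazard ratio `beta` are all hypothetical.

```python
import math
import random


def simulate_event_time(inv_cum_base_haz, linpred, u):
    # Inverted survival function: H0^{-1}(-log(u) * exp(-linpred)),
    # where linpred is the linear predictor X_i' beta
    return inv_cum_base_haz(-math.log(u) * math.exp(-linpred))


def inv_H0(v):
    # Hypothetical baseline: constant hazard 0.1, so H0(t) = 0.1 * t
    # and therefore H0^{-1}(v) = v / 0.1
    return v / 0.1


random.seed(42)
beta = 0.5  # assumed log hazard ratio for a binary covariate

times = []
for i in range(5):
    x = i % 2                    # alternate covariate values 0 and 1
    u = 1.0 - random.random()    # uniform draw in (0, 1], avoids log(0)
    times.append(simulate_event_time(inv_H0, beta * x, u))
print(times)
```

Passing the inverse cumulative baseline hazard as an argument keeps the sketch independent of any particular distribution; only `inv_H0` needs to change.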
For the standard parametric survival distributions included in the simsurv package (i.e. Weibull, exponential, Gompertz) an analytic form for $S_i^{-1}(u)$ does exist. Therefore, event times for these standard parametric distributions (assuming proportional hazards) are generated using the cumulative hazard inversion method. The parameterisations for each of these standard parametric distributions are shown next.
For the exponential distribution we have the following:
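$$h_i(t) = \lambda \exp(X_i^T \beta)$$
$$H_i(t) = \lambda t \exp(X_i^T \beta)$$
$$S_i(t) = \exp\left(-\lambda t \exp(X_i^T \beta)\right)$$
$$S_i^{-1}(u) = \frac{-\log(u)}{\lambda \exp(X_i^T \beta)}$$
where $\lambda > 0$ is the scale (sometimes known as rate) parameter.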
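A minimal Python sketch of the exponential case follows. It is illustrative only (simsurv itself is written in R), and the function names and parameter values are assumptions. The sample mean of the simulated times is compared with the theoretical mean $1 / (\lambda \exp(X_i^T \beta))$.

```python
import math
import random


def exp_inv_surv(u, lam, linpred=0.0):
    # S_i^{-1}(u) = -log(u) / (lam * exp(linpred))
    return -math.log(u) / (lam * math.exp(linpred))


def exp_surv(t, lam, linpred=0.0):
    # S_i(t) = exp(-lam * t * exp(linpred))
    return math.exp(-lam * t * math.exp(linpred))


random.seed(2024)
lam, linpred = 0.1, 0.5  # assumed scale parameter and linear predictor

# 1 - random.random() lies in (0, 1], which avoids evaluating log(0)
draws = [exp_inv_surv(1.0 - random.random(), lam, linpred) for _ in range(50000)]

mean_time = sum(draws) / len(draws)
expected_mean = 1.0 / (lam * math.exp(linpred))
print(mean_time, expected_mean)
```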
For the Weibull distribution we have the following:
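$$h_i(t) = \gamma \lambda t^{\gamma - 1} \exp(X_i^T \beta)$$
$$H_i(t) = \lambda t^{\gamma} \exp(X_i^T \beta)$$
$$S_i(t) = \exp\left(-\lambda t^{\gamma} \exp(X_i^T \beta)\right)$$
$$S_i^{-1}(u) = \left(\frac{-\log(u)}{\lambda \exp(X_i^T \beta)}\right)^{1/\gamma}$$
where $\lambda > 0$ and $\gamma > 0$ are the scale and shape parameters, respectively.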
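The Weibull formulas can be checked with a short round trip: applying $S_i$ to $S_i^{-1}(u)$ should return $u$. The Python sketch below is an illustration with assumed parameter values and hypothetical function names, not simsurv code.

```python
import math


def weibull_surv(t, lam, gamma, linpred=0.0):
    # S_i(t) = exp(-lam * t^gamma * exp(linpred))
    return math.exp(-lam * t ** gamma * math.exp(linpred))


def weibull_inv_surv(u, lam, gamma, linpred=0.0):
    # S_i^{-1}(u) = (-log(u) / (lam * exp(linpred)))^(1 / gamma)
    return (-math.log(u) / (lam * math.exp(linpred))) ** (1.0 / gamma)


lam, gamma, linpred = 0.1, 1.5, -0.4  # assumed illustrative values
u_values = [0.05, 0.5, 0.95]

# Invert each probability to a time, then map the time back to a probability
round_trip = [
    weibull_surv(weibull_inv_surv(u, lam, gamma, linpred), lam, gamma, linpred)
    for u in u_values
]
print(round_trip)
```

Setting `gamma` to 1 reduces both functions to the exponential case given earlier.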
For the Gompertz distribution we have the following:
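$$h_i(t) = \lambda \exp(\gamma t) \exp(X_i^T \beta)$$
$$H_i(t) = \frac{\lambda (\exp(\gamma t) - 1)}{\gamma} \exp(X_i^T \beta)$$
$$S_i(t) = \exp\left(-\frac{\lambda (\exp(\gamma t) - 1)}{\gamma} \exp(X_i^T \beta)\right)$$
$$S_i^{-1}(u) = \frac{1}{\gamma} \log\left[\left(\frac{-\gamma \log(u)}{\lambda \exp(X_i^T \beta)}\right) + 1\right]$$
where $\lambda > 0$ and $\gamma > 0$ are the scale and shape parameters, respectively.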
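For the Gompertz case, the sketch below (Python, illustrative only, with assumed parameter values and hypothetical function names) evaluates the inverted survival function at the same uniform draw for two values of the linear predictor. Under proportional hazards, the larger linear predictor should give the earlier event time.

```python
import math


def gompertz_surv(t, lam, gamma, linpred=0.0):
    # S_i(t) = exp(-lam * (exp(gamma * t) - 1) / gamma * exp(linpred))
    return math.exp(-lam * (math.exp(gamma * t) - 1.0) / gamma * math.exp(linpred))


def gompertz_inv_surv(u, lam, gamma, linpred=0.0):
    # S_i^{-1}(u) = (1 / gamma) * log(-gamma * log(u) / (lam * exp(linpred)) + 1)
    return (1.0 / gamma) * math.log(
        -gamma * math.log(u) / (lam * math.exp(linpred)) + 1.0
    )


lam, gamma = 0.05, 0.2  # assumed scale and shape parameters
u = 0.4                 # a single uniform draw, fixed for reproducibility

t_low = gompertz_inv_surv(u, lam, gamma, linpred=0.0)   # baseline individual
t_high = gompertz_inv_surv(u, lam, gamma, linpred=1.0)  # higher hazard
print(t_low, t_high)
```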
If the cumulative baseline hazard function is not invertible, then numerical root finding can be used to solve for $t$. This method has been discussed by both Bender et al. (2005) and Crowther and Lambert (2013). In simsurv this is required for the two-component mixture distributions (assuming proportional hazards). An analytical form is available for the survival function of each of these distributions, but numerical root finding must be used to invert the survival function. In practice, the simsurv package uses the stats::uniroot function, which is based on the method of Brent (1973). This means iteratively finding a solution to the equation $S_i(t) - U_i = 0$.
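The idea of solving $S_i(t) - U_i = 0$ numerically can be illustrated as follows. This Python sketch deliberately swaps in plain bisection for Brent's method, so it is not how stats::uniroot works internally. It uses a Weibull survival function with assumed parameters, chosen because the closed-form inverse is available for comparison. All names are hypothetical.

```python
import math


def bisect_root(f, lower, upper, tol=1e-10, max_iter=200):
    # Plain bisection: repeatedly halve an interval known to contain a sign change
    f_lo, f_hi = f(lower), f(upper)
    if f_lo == 0.0:
        return lower
    if f_lo * f_hi > 0:
        raise ValueError("root is not bracketed by [lower, upper]")
    for _ in range(max_iter):
        mid = 0.5 * (lower + upper)
        f_mid = f(mid)
        if f_mid == 0.0 or (upper - lower) < tol:
            return mid
        if f_lo * f_mid < 0:
            upper = mid
        else:
            lower, f_lo = mid, f_mid
    return 0.5 * (lower + upper)


lam, gamma, linpred = 0.1, 1.5, 0.3  # assumed illustrative values


def surv(t):
    # Weibull survival function under proportional hazards
    return math.exp(-lam * t ** gamma * math.exp(linpred))


def closed_form(u):
    # Analytic inverse, used here only to check the numerical answer
    return (-math.log(u) / (lam * math.exp(linpred))) ** (1.0 / gamma)


u_draws = [0.9, 0.5, 0.1]
numeric = [bisect_root(lambda t, u=u: surv(t) - u, 0.0, 100.0) for u in u_draws]
exact = [closed_form(u) for u in u_draws]
print(numeric, exact)
```

The upper search limit of 100 is an assumption for this example; in general it must be large enough that the survival probability at that time falls below the uniform draw.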
The two-component mixture distributions in simsurv are parameterised in the same way as the survsim Stata package (Crowther and Lambert (2012)). That is, they are additive on the survival scale, with a parameter defining the mixing proportions, i.e.
$$S_0(t) = \pi S_{01}(t) + (1 - \pi) S_{02}(t)$$
where $S_0(t)$ is the baseline survival function, $S_{01}(t)$ and $S_{02}(t)$ are baseline survival functions for the two component distributions, and $0 \le \pi \le 1$ is the mixing parameter. The specific parameterisations for the hazard and survival functions of each of the two-component mixture distributions in simsurv are shown next.
For the two-component exponential mixture distribution we have the following:
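$$h_i(t) = \left[\frac{\pi \lambda_1 \exp(-\lambda_1 t) + (1 - \pi) \lambda_2 \exp(-\lambda_2 t)}{\pi \exp(-\lambda_1 t) + (1 - \pi) \exp(-\lambda_2 t)}\right] \exp(X_i^T \beta)$$
$$H_i(t) = -\log\left[\pi \exp(-\lambda_1 t) + (1 - \pi) \exp(-\lambda_2 t)\right] \exp(X_i^T \beta)$$
$$S_i(t) = \left[\pi \exp(-\lambda_1 t) + (1 - \pi) \exp(-\lambda_2 t)\right]^{\exp(X_i^T \beta)}$$
where $\lambda_1 > 0$ and $\lambda_2 > 0$ are the scale (sometimes known as rate) parameters for the component exponential distributions.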
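To make the mixture case concrete, the Python sketch below codes the survival and hazard functions of the two-component exponential mixture and inverts the survival function by bisection. It is an illustration only: simsurv is an R package and uses stats::uniroot rather than bisection, and the parameter values, function names, and upper search limit are assumptions.

```python
import math


def mix_exp_surv(t, pi, lam1, lam2, linpred=0.0):
    # S_i(t) = [pi * exp(-lam1 * t) + (1 - pi) * exp(-lam2 * t)]^exp(linpred)
    base = pi * math.exp(-lam1 * t) + (1.0 - pi) * math.exp(-lam2 * t)
    return base ** math.exp(linpred)


def mix_exp_hazard(t, pi, lam1, lam2, linpred=0.0):
    # Ratio of the mixture density to the mixture survival, scaled by exp(linpred)
    num = pi * lam1 * math.exp(-lam1 * t) + (1.0 - pi) * lam2 * math.exp(-lam2 * t)
    den = pi * math.exp(-lam1 * t) + (1.0 - pi) * math.exp(-lam2 * t)
    return (num / den) * math.exp(linpred)


def simulate_mix_exp(u, pi, lam1, lam2, linpred=0.0, upper=1000.0, tol=1e-10):
    # Solve S_i(t) - u = 0 by bisection; S_i is decreasing in t
    lower = 0.0
    if mix_exp_surv(upper, pi, lam1, lam2, linpred) > u:
        raise ValueError("upper bound too small to bracket the event time")
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if mix_exp_surv(mid, pi, lam1, lam2, linpred) > u:
            lower = mid   # survival still above u, so the event time is later
        else:
            upper = mid
    return 0.5 * (lower + upper)


params = dict(pi=0.3, lam1=0.05, lam2=0.5, linpred=0.2)  # assumed values
u = 0.4  # a single uniform draw, fixed for reproducibility
t_sim = simulate_mix_exp(u, **params)
print(t_sim, mix_exp_surv(t_sim, **params))
```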
For the two-component Weibull mixture distribution we have the following:
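$$h_i(t) = \left[\frac{\pi \gamma_1 \lambda_1 t^{\gamma_1 - 1} \exp(-\lambda_1 t^{\gamma_1}) + (1 - \pi) \gamma_2 \lambda_2 t^{\gamma_2 - 1} \exp(-\lambda_2 t^{\gamma_2})}{\pi \exp(-\lambda_1 t^{\gamma_1}) + (1 - \pi) \exp(-\lambda_2 t^{\gamma_2})}\right] \exp(X_i^T \beta)$$
$$H_i(t) = -\log\left[\pi \exp(-\lambda_1 t^{\gamma_1}) + (1 - \pi) \exp(-\lambda_2 t^{\gamma_2})\right] \exp(X_i^T \beta)$$
$$S_i(t) = \left[\pi \exp(-\lambda_1 t^{\gamma_1}) + (1 - \pi) \exp(-\lambda_2 t^{\gamma_2})\right]^{\exp(X_i^T \beta)}$$
where $\lambda_1 > 0$ and $\lambda_2 > 0$ are the scale parameters, and $\gamma_1 > 0$ and $\gamma_2 > 0$ are the shape parameters, for the component Weibull distributions.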
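For the two-component Gompertz mixture distribution we have the following:
$$h_i(t) = \left[\frac{\pi \lambda_1 \exp(\gamma_1 t) \exp\left(-\frac{\lambda_1 (\exp(\gamma_1 t) - 1)}{\gamma_1}\right) + (1 - \pi) \lambda_2 \exp(\gamma_2 t) \exp\left(-\frac{\lambda_2 (\exp(\gamma_2 t) - 1)}{\gamma_2}\right)}{\pi \exp\left(-\frac{\lambda_1 (\exp(\gamma_1 t) - 1)}{\gamma_1}\right) + (1 - \pi) \exp\left(-\frac{\lambda_2 (\exp(\gamma_2 t) - 1)}{\gamma_2}\right)}\right] \exp(X_i^T \beta)$$
$$H_i(t) = -\log\left[\pi \exp\left(-\frac{\lambda_1 (\exp(\gamma_1 t) - 1)}{\gamma_1}\right) + (1 - \pi) \exp\left(-\frac{\lambda_2 (\exp(\gamma_2 t) - 1)}{\gamma_2}\right)\right] \exp(X_i^T \beta)$$
$$S_i(t) = \left[\pi \exp\left(-\frac{\lambda_1 (\exp(\gamma_1 t) - 1)}{\gamma_1}\right) + (1 - \pi) \exp\left(-\frac{\lambda_2 (\exp(\gamma_2 t) - 1)}{\gamma_2}\right)\right]^{\exp(X_i^T \beta)}$$
where $\lambda_1 > 0$ and $\lambda_2 > 0$ are the scale parameters, and $\gamma_1 > 0$ and $\gamma_2 > 0$ are the shape parameters, for the component Gompertz distributions.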
If one can obtain an algebraic closed-form solution for the inverse cumulative baseline hazard, $H_0^{-1}(u)$, then a major benefit of the cumulative hazard inversion method is that it is simple and computationally efficient. Moreover, it can be used to generate survival times for a variety of standard parametric baseline hazards, for example, the exponential, Weibull or Gompertz distributions. Even if the cumulative baseline hazard cannot be inverted analytically, one can still use numerical root finding, as described in the previous section, to invert the survival function and solve for $t$.
However, numerical root finding still requires an analytical form for the survival function. For some complex data generating processes it may not be possible to obtain a closed-form expression for the cumulative baseline hazard $H_0(t)$, and therefore the form of $S_i(t)$ will also be intractable. Crowther and Lambert (2013) therefore proposed an extension to overcome these issues. Their extension incorporates both numerical root finding and numerical quadrature. The numerical quadrature is used to evaluate the cumulative hazard in situations where it cannot be evaluated analytically.
An example of a situation where their method is required is the introduction of time-dependent effects on the hazard scale (i.e. non-proportional hazards). The introduction of time-dependent effects often leads to an intractable integral when evaluating the cumulative hazard. The method therefore involves iterating between numerical quadrature and numerical root finding until an appropriate solution for $t$ is obtained. This is the method used by the simsurv package when the user supplies their own hazard or log hazard function for generating the event times, or when they are simulating with time-dependent effects (i.e. non-proportional hazards). The numerical integration is based on Gauss-Kronrod quadrature with a default of 15 nodes (although the user can choose between 7, 11, or 15 nodes). For further details on the method we refer the reader to the paper by Crowther and Lambert (2013).
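The interplay between quadrature and root finding can be sketched as follows. This Python illustration makes two simplifications that differ from simsurv: it uses 5-point Gauss-Legendre quadrature instead of Gauss-Kronrod, and bisection instead of stats::uniroot. The hazard function, with a log hazard ratio that changes linearly in time, and all parameter values are hypothetical.

```python
import math

# 5-point Gauss-Legendre nodes and weights on [-1, 1]
GL_NODES = (-0.9061798459386640, -0.5384693101056831, 0.0,
            0.5384693101056831, 0.9061798459386640)
GL_WEIGHTS = (0.2369268850561891, 0.4786286704993665, 0.5688888888888889,
              0.4786286704993665, 0.2369268850561891)


def cum_hazard(hazard, t):
    # Approximate H(t), the integral of the hazard over [0, t], by rescaling
    # the quadrature nodes from [-1, 1] to [0, t]
    half = 0.5 * t
    return half * sum(w * hazard(half * (1.0 + z)) for z, w in zip(GL_NODES, GL_WEIGHTS))


def simulate_time(hazard, u, upper=50.0, tol=1e-10):
    # Solve H(t) = -log(u), which is equivalent to S(t) = u. Each bisection
    # step re-evaluates the cumulative hazard by quadrature.
    target = -math.log(u)
    lower = 0.0
    if cum_hazard(hazard, upper) < target:
        raise ValueError("upper bound too small to bracket the event time")
    while upper - lower > tol:
        mid = 0.5 * (lower + upper)
        if cum_hazard(hazard, mid) < target:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)


def make_hazard(x, lam=0.05, beta0=-0.5, beta1=0.1):
    # Hypothetical hazard: baseline 2 * lam * t, with a time-dependent
    # log hazard ratio beta0 + beta1 * t for covariate x
    return lambda t: 2.0 * lam * t * math.exp(x * (beta0 + beta1 * t))


u = 0.5  # a single uniform draw, fixed for reproducibility
haz = make_hazard(x=1)
t_sim = simulate_time(haz, u)
print(t_sim, math.exp(-cum_hazard(haz, t_sim)))
```

A single 5-point rule is only adequate for smooth hazards over short intervals; a rule with more nodes, such as the 15-node Gauss-Kronrod rule mentioned above, would be the more cautious choice in practice.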
Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Stat Med 2005;24(11):1713-1723.
Brent R. Algorithms for Minimization without Derivatives. Englewood Cliffs, NJ: Prentice-Hall, 1973.
Crowther MJ, Lambert PC. Simulating complex survival data. Stata J 2012;12(4):674-687.
Crowther MJ, Lambert PC. Simulating biologically plausible complex survival data. Stat Med 2013;32(23):4118-4134.
Mood AM, Graybill FA, Boes DC. Introduction to the Theory of Statistics. New York: McGraw-Hill, 1974.