\documentclass{article} \usepackage[pdftex]{graphicx} \usepackage{Sweave} \usepackage{amsmath} \addtolength{\textwidth}{1in} \addtolength{\oddsidemargin}{-.5in} \setlength{\evensidemargin}{\oddsidemargin} \SweaveOpts{keep.source=TRUE, fig=FALSE} %\VignetteIndexEntry{Mixed Effects Cox Models} %\VignetteDepends{coxme} %\VignetteDepends{kinship2} \newcommand{\myfig}[1]{\resizebox{\textwidth}{!} {\includegraphics{#1.pdf}}} \title{Mixed Effects Cox Models} \author{Terry Therneau \\Mayo Clinic} \begin{document} \maketitle \section{Introduction} Like many other projects, this software library started with a data set and a problem. From this came statistical ideas for a solution, followed by some initial programming --- which more than anything else helped to define the \emph{real} computational and statistical issues --- and then a more ambitious programming solution. The problem turned out to be harder than I thought; the first release-worthy code took over 3 years in gestation and resulted in the \emph{kinship} library. This in turn led to application to a larger set of problems, and a complete re-design of the underlying code. It was then split into separate libraries: coxme contains the central code for mixed effects Cox and linear models, kinship2 contains code for the construction and manipulation of pedigree data, and bdsmatrix contains underlying routines for block-diagonal sparse matrices. \section{Model} The coxme function fits the model \begin{align*} \lambda(t) &= \lambda_0(t) e^{X\beta + Zb} \\ b &\sim G(0, \Sigma(\theta)) \end{align*} where $\lambda_0$ is an unspecified baseline hazard function, $X$ and $Z$ are the design matrices for the fixed and random effects, respectively, $\beta$ is the vector of fixed-effects coefficients and $b$ is the vector of random effects coefficients. The random effects distribution $G$ is modeled as Gaussian with mean zero and a variance matrix $\Sigma$, which in turn depends a vector of parameters $\theta$. For any fixed values of $\beta$ and $b$ we define the usual Cox partial likelihood PL (often labeled as a log-likelihood in printouts) as \begin{equation*} \log[PL(\beta,b))] = \sum_{i=1}^n \int_0^\infty \left[ Y_i(t) \eta_i(t) - \log \left(\sum_j Y_j(t)e^{\eta_j(t)} \right)\right] \end{equation*} where $\eta_i(t) = X_i(t)\beta + Z_i(t)b$ is the linear score for subject $i$ at time $t$ and $Y_i(t)$ describes the risk set, $Y_i(t) = 1$ if subject $i$ is still under observation at time $t$ and 0 otherwise. For further details see chapter 3 of Therneau and Grambsch \cite{Therneau00} or any other mid-level survival text. We can integrate out the random effects to create the integrated partial likelihood \begin{equation*} IPL(\beta, \theta) = \frac{1}{(2\pi)^{q/2}|\Sigma(\theta)|^{1/2}} \int PPL(\beta, b) e^{-b'\Sigma^{-1}(\theta)b/2}\,db \end{equation*} where $q$ is the length of $b$, i.e., the number of random effects. Ripatti and Palmgren \cite{Ripatti02} showed that the IPL can be treated as a likelihood, just as integrated full likelihoods can. An ML estimate is obtained by joint maximization over $\beta$ and $\theta$. We can compare a nested set of random or fixed effects Cox models using chi-square tests. The REML estimate is obtained by also integrating out the fixed effects \begin{equation*} REML(\theta) = \frac{1}{(2\pi)^{q/2}|\Sigma(\theta)|^{1/2}} \int PPL(\beta, b) e^{-b'\Sigma^{-1}(\theta)b/2}\,db \,d\beta \end{equation*} The REML estimate is obtained by maximization of this quantity over $\theta$. As in linear models, the REML estimate cannot be used for testing fixed effects, due to the lack of a proper prior distribution for $\beta$. For instance, assume that \texttt{age} were a candidate variable. The REML value for a fit with age in days versus one with age in years will differ by the constant $\log(365.25)$; this has no impact on the estimate of $\theta$, but makes the comparison of two models, one with and one without age, completely arbitrary. In linear models the REML estimate is often preferred, since practical experience has shown it to be more reliable; the ML estimates are often too large. By analogy, some have recommended using a REML estimate for the mixed effects Cox model. However, either theoretical and practical evidence for its superiority is sparse. The coxme function currently only provides an ML estimate. For a very simple fit such as <>= fit1 <- coxme(Surv(endage, cancer) ~ parity + (1| famid)) @ found below there will be one random intercept per group (family), and $Z$ will be the usual design matrix for a one-way anova, i.e., the same design matrix as would be used by a linear model \texttt{lm(y $\sim$ factor(famid)-1)}. In this case $Z_{ij} =1$ iff subject $i$ is a member of family $j$. The variance of the random effects in this case is $\Sigma= \theta I$. Note that if there are $k$ groups the vector of random effects coefficients $b$ will have $k$ elements. Because of the shrinkage provided by the random effect they will satisfy $\sum b_k =0$, rather than having one of their members set to zero. With respect to the random effects the \texttt{contrasts} option has no relevance. \section{Simple Models} For many random effects models the random effects can be specified very simply in the model formula. As an example consider the \texttt{eortc} data set, which is included with the package for illustration. This is a simple simulated example, based on the results of a breast cancer trial undertaken by the European Organization for Research and Treatment of Cancer. There are 37 enrolling centers with enrollments ranging from 21 to 247 subjects. We start by fitting a simple model with a random intercept per center. <<>>= library(coxme) stem(table(eortc$center)) efit1 <- coxph(Surv(y, uncens) ~ trt, eortc) efit2 <- coxme(Surv(y, uncens) ~ trt + (1|center), eortc) @ Random effects are specified in the formula by a parenthesised expression which contains a vertical bar separating effects on the left from grouping variables on the right. In the case above we read it as an intercept (effect) per center (group). <<>>= print(efit2) @ The components of the printout are \begin{itemize} \item The total number of observations and the total number of events (deaths) in the data set. \item The computational effort, summarized as the number of iterations for the optim routine and the underlying Newton-Raphson iterations used. \item The log partial likelihood for a model with no covariates or random effects, the fitted partial likelihood, and the value with the random effects integrated out. We will normally be interested in the null and integrated values. (The log values are printed, but labeled as PL and IPL for brevity). \item Likelihood ratio tests based on the integrated and penalized views of the model, along with penalized values. The AIC penalizes by twice the effective degrees of freedom, and the BIC by $\log(d)$ times the effective degrees of freedom, where $d$ is the number of events. \item A summary of the fixed effects \item A summary of the variances of the random effects \end{itemize} After integrating out the random effects, the log partial likelihood for the mixed effects model is \Sexpr{round(diff(efit2$loglik[1:2])*2, 1)}. As a test of the random effects, we would normally compare this to \texttt{efit1}, the fit with no random effects which has a log partial likelihood of \Sexpr{round(diff(efit1$loglik)*2,1)} The estimated standard deviation between centers of .33 is fairly substantial (more on this below). The difference is $>100$ on one degree of freedom, which is highly significant. Another way to approach the fit is as a penalized model. The coefficients are viewed as the solution to a penalized problem $\log(PPL(\beta, b)) - \log(PL(\beta,b)) - p(\theta)$ where the penalty function $p(\theta) = b'\Sigma^{-1}(\theta)b/2$. In this case we use the ordinary $PL$ for assessment, but with an effective degrees of freedom which lies somewhere between the total number of coefficients $(\beta, b)$ of 38 and the degrees of freedom for the fixed effects of 1. For this particular problem $\Sigma(\theta) = \theta I$, a multiple of the identity matrix and our penalty is $p(\theta) = -1/(2\theta) \sum{j=1}{37} b_j^2$. As the penalty goes to zero the fit will be equivalent to treating factor(center) as a fixed effect with degrees of freedom equal to 38, as the penalty becomes larger the effective degrees of freedom will decrease. In this case a test for the random effects would be based on (319.7 - 105.7) on 27.7 degrees of freedom. We do not normally use this test, but the adjusted degrees of freedom is useful in summarizing the fit. The major components of the model can be extracted via the following functions: \begin{center} \begin{tabular}{r|ccc} &\multicolumn{3}{c}{Method} \\ &lme/lmer & coxme $<2.2$ & coxme $\ge 2.2$ \\ \hline fixed effects coefficients & fixef & fixef & fixef \\ random effects coefficients& ranef & & ranef \\ fixed effects variance matrix& vcov& & vcov \\ random effects parameters ($\theta$) & VarCorr & ranef & VarCorr \end{tabular} \end{center} Note that for version 2.2 and greater these follow the same pattern as the linear mixed effects models in R; prior to version 2.2 the ranef method was incorrectly assigned. One feature of the mixed effects Cox model is that the standard deviation of the random effect is directly interpretable. The random effects $b_j$ for each center $j$ are in the risk score, a value of .33 for instance (one standard deviation above the mean) corresponds to a relative risk of $\exp(.33)= 1.39$, an almost 40\% higher risk of death for subjects at that center. (In real data sets we would of course have adjusted for imbalances in patient risk factors between centers.) The random effects coefficients can be retrieved using the \texttt{ranef} function. The code below shows center effects ranging from less than 1/2 to over 1.6 times the average risk for the study. Because there may be multiple random effects in a model the ranef function returns a list, with one element per random effect. We are interested in the first element (only element in this case) of the list. <<>>= stem(exp(ranef(efit2)[[1]])) @ To look at random treatment effects within center we can add a nested effect <<>>= efit3 <- coxme(Surv(y, uncens) ~ trt + (1 | center/trt), eortc) efit3 @ This shows a further improvement in fit, but by much smaller amount. There is one important difference between how random effects in coxme and in lmer are handled concerning intercepts. Linear models have an implied intercept term. Unless explicitly removed, the model \verb!lm(y ~x)! will actually fit the model \verb!lm(y ~x +1)!. This carries forward into the notation of lmer where a random effect \verb!(age | group)! will automatically add the random intercept and fit \verb!(1 + age | group)!. Cox models do not have an intercept term, and an automatic ``1'' is not added to either fixed effects \emph{or} random effects portions of the formula. \section{The Minnesota Breast Cancer Family Study} \subsection{Background} Details of the study can be found in Sellers \cite{Sellers99} Briefly, in 1944 a family study of breast cancer was initiated at the Dight institute for Human Genetics at the University of Minnesota to examine the influence of heredity on the risk of breast cancer. Probands were a consecutive series of all breast cancer patients ascertained at the tumor clinic of the University hospital between 1944 and 1952. A total of 544 families were studied, representing data on 4418 family members. Results of that early study provided some of the first evidence that breast cancer clusters in families \cite{Anderson58}. Records of the research then sat dormant for 40 years, until a series of follow-up studies was initiated in the late 1990s. Prevalent cases ($n=40$) and families where most of the relatives other than the proband were deceased at baseline ($n=42$) were excluded. Of the remainder 30 families could be contacted and 6 refused, leaving 426 participating families. After extending pedigrees to the current generation, the final data set has 28081 subjects. The full set of subjects along with a few selected variables is included in the \texttt{kinship2} package as the \texttt{minnbreast} data set. The analysis of this data set was the original genesis of the coxme function. <<>>= library(coxme) library(kinship2) data(minnbreast) options(show.signif.stars=FALSE) makefig <- function(file) { pdf(paste(file, "pdf", sep='.'), width=7, height=5) par(mar=c(5.1, 4.1, .1, .1)) } names(minnbreast) with(minnbreast, table(sex, cancer, exclude=NULL)) mped <- with(minnbreast, pedigree(id, fatherid, motherid, sex, affected=cancer, famid=famid, status=proband)) makefig("cfig1") plot(mped["8"]) #figure 1 dev.off() @ \begin{figure} \myfig{cfig1} \caption{Pedigree for family 8 of the Minnesota Breast Cancer Study. Cancer cases are filled, the proband is marked with a slash} \label{cfig1} \end{figure} Figure \ref{cfig1} shows the pedigree for family 8. The original case (the proband) has a mother, daughter, and niece with breast cancer. This is a high risk family. There is also a brother-in-law with prostate cancer. An aside. Our first action was to turn off one of the most egregious abominations of statistical packages: using stars to train our users that ``$<.05$'' is all that matters, when in fact that is one of the most useless ways known to employ statistical methods. I also like to place my figures within the document using the latex \verb+\begin{figure}+ command, hence I use \texttt{fig=FALSE} as a default Sweave option and generate the pdfs directly. \subsection{Simple models} The simplest model is to look for a random intercept per family. <<>>= minnfemale <- minnbreast[minnbreast$sex == 'F' & !is.na(minnbreast$sex),] fit0 <- coxph(Surv(endage, cancer) ~ I(parity>0), minnfemale, subset=(proband==0)) summary(fit0) fit1 <- coxme(Surv(endage, cancer) ~ I(parity>0) + (1|famid), minnfemale, subset=(proband==0)) print(fit1) @ Note that we don't include the proband in the model. Because they are the index case that caused a family to be included, they have in a probability of 1 for cancer and don't fit into the predictive framework. From the simple Cox model we see that ever having a child is protective, reducing the risk of breast cancer by about 30\%. A mixed effects model has nearly the same estimate for parity. The random family effect, an estimated intercept (excess risk) for each family, has a standard deviation of .41. We would expect about 15\% of the families to be 1 standard deviation or more above the mean, and these families will have a breast cancer risk that is exp(.41)=1.5 times the norm. A similar fraction have lower risk. This is a modestly large familial effect. The result of \texttt{fit1} is also known as a \emph{shared frailty} model. The term originally arose in demography, where the excess risk $f_k=\exp(b_k)$ for family $k$ could be thought of as an underlying propensity for failure or ``frailty'' for the subject. The statistics literature is divided on the use of $f$ or frailty as the underlying random variable or using $b$. The first leads to investigation of random effects distributions that are $>0$ such as the gamma, positive stable, and log-normal. We strongly prefer the latter form, however, first since it ties into the very familiar notational structure of linear mixed effects models, and second because the Gaussian forms a simpler framework for multiple and correlated random effects. \begin{figure} \myfig{cfig2} \caption{The number of breast cancers in each family versus the estimated familial risk} \label{cfig2} \end{figure} <<>>= ncancer <- with(minnfemale, tapply(cancer, famid, sum, na.rm=T)) pyears <- with(minnfemale, tapply(endage -18, famid, sum, na.rm=T)) count <- with(minnfemale, tapply(cancer, famid, function(x) sum(!is.na(x)))) indx <- match(names(ranef(fit1)[[1]]), names(ncancer)) makefig("cfig2") plot(ncancer[indx], exp(ranef(fit1)[[1]]), log='y', xlab="Number of cancers per family", ylab="Estimated familial risk") abline(h=1, lty=2) text(c(8.1, 1.6), c(.85, 1.2), c("165", "72")) dev.off() indx <- match(c(72,165), names(ncancer)) temp <- cbind(ncancer, count, pyears, 100*ncancer/pyears)[indx,] dimnames(temp) <- list(c(72, 165), c("Cancers", "N", "Years of FU", "Rate")) print(round(temp,2)) @ Figure \ref{cfig2} shows number of cancers in each family versus the estimated excess risk from the model. Family 72 has a high risk but only 2 cases; there are only 5 female relatives in the family and the cancers occur at ages 32 and 36. Because the family is small the model has shrunk the estimated risk closer to the overall mean. Family 165 has 8 cancers but is has lower risk than the average Minnesota subject due to the large family size. \begin{figure} \myfig{cfig3} \caption{Profile likelihood for the per-family random effects model} \label{cfig3} \end{figure} There is not a good closed form formula for the variance of the variance estimates themselves, i.e., one that is both reliable and simple to compute. Valid confidence intervals can be obtained from a profile likelihood, however. We start by computing the likelihood over a range of values for the variance. <<>>= estvar <- seq(.2, .6, length=15)^2 #range of std values loglik <- double(15) for (i in 1:15) { tfit <- coxme(Surv(endage, cancer) ~ I(parity>0) + (1|famid), data=minnfemale, subset=(proband==0), vfixed=estvar[i]) loglik[i] <- 2*diff(tfit$loglik)[1] } makefig("cfig3") plot(sqrt(estvar), loglik, xlab="Std of the random effect", ylab="2 * loglik") abline(h=2*diff(fit1$loglik)[1] - qchisq(.95, 1), lty=2) dev.off() @ The 95\% profile likelihood confidence interval is the region where the curve lies above the line, i.e., the set of values $x$ for which a 1 degree of freedom likelihood ratio test would not reject the hypothesis that the true standard deviation = $x$. (It will sometimes take a couple of tries to guess an appropriate range for the variance, for many data sets it will be considerably wider than this one). We can easily get a numeric estimate from the graph, giving a confidence interval from .28 to .53. <<>>= temp <- 2*diff(fit1$loglik)[1] - loglik approx(temp[1:8], sqrt(estvar[1:8]), 3.84)$y approx(temp[9:15], sqrt(estvar[9:15]), 3.84)$y @ \subsection{Correlated Random Effects} The simple family intercept model has provided some insight, but it has two serious flaws. \begin{itemize} \item Subjects will be counted in a family's risk who should actually %' be excluded. In family 8 there are also 4 women who have married into the family; they are not shown on figure \ref{cfig1} since they have no children and hence have no \emph{genetic} relationship to family 8 at all. Yet they are counted in the familial risk. \item More generally, the simple model cannot account for degrees of association. In figure \ref{cfig1} subject 162 is a marry-in and so does not share the familial burden of the proband/mother/daughter cancers. Yet she is connected indirectly to the family through her high risk daughter. The largest families have over 200 individuals, some of whom are only distantly related while others are close. \end{itemize} We approach this by making use of the kinship matrix $K$. Formally, $K_{ij}$ is the probability that for any given gene, an allele randomly chosen from subject $i$ and another randomly chosen from subject $j$ will be identical by descent, that is, have been passed down from a common ancestor. Then $2K$ is a measure of the expected fraction of shared genes: 1 on the diagonal, 1/2 for mother/daughter or sib/sib, 1/4 for grandparent/grandchild or uncle/niece, etc. If subject $i$ and $j$ are from different families then $K_{ij}=0$ by definition, since they have no common ancestor. The relevant Cox model is \begin{align*} \lambda_i(t) &= \lambda_0(t) e^{X\beta + b_i} \\ b_i &\sim N(0, \sigma^2 2K) \end{align*} Each subject has an individual random genetic risk $b_i$, but those risks are correlated according to the strength of relationship. The model is easy to fit in \texttt{coxme} <<>>= kmat <- kinship(mped) fit2 <- coxme(Surv(endage, cancer) ~ I(parity>0) + (1|id), data=minnfemale, varlist=coxmeMlist(2*kmat, rescale=F), subset=(proband==0)) print(fit2) @ The estimated genetic effect is much larger, with a standard deviation of almost 0.9. This means that there are multiple subjects in the study with quite large relative risks, exp(.9)= 2.5 fold greater than the average Minnesotan. Note that even though we are only interested in fitting the females, creation of the kinship matrix \texttt{kmat} requires use of all the subjects in the pedigree. When the model is fit appropriate rows/columns of the matrix are selected by matching its dimnames with the id variable. In fitting the model we have to explicitly describe the variance structure using one of a small set of variance functions. The coxmeMlist function accepts a set of matrices $V_1$, $V_2$, \ldots as arguments and fits the an overall variance matrix $\sigma_1^2 V_1 + \sigma_2^2 V_2 + \ldots$, solving for the optimal values of $\sigma$. A major portion of the code in the coxmeMlist function is devoted to ensuring that each level of the grouping variable --- \texttt{id} in this instance --- exists in the matrix, that all matrices have the same row and column order, and other consistency checks. Since $K_{ij}=0$ for any unrelated individuals, the kinship matrix is mostly zeros. Although it has a theoretical size of just under 800 million elements (\texttt{prod(dim(kmat))}) only the 500,000 non-zero elements are stored (\texttt{length(kmat@x)}) using a sparse matrix form based on the Matrix package. To maximize storage efficiency families are clustered together in adjacent rows. This carries through to the calculations, and thus it is the order of the elements in \texttt{kmat} which determines the order of the random effects coefficients $b$ in the fitted model. Do not expect them to be sorted in the same way as as they were in the data set, nor in the sorted order found for a factor variable in an ordinary Cox model. (For a simple model like \texttt{fit1} which does not require sparse storage they will be in standard order, however.) \subsection{Breast and prostate cancer} An interesting genetic question is whether there might be a connection between various cancers. That is, inherited traits that predispose subjects to all or some subset of malignancies. Grabrick et al \cite{Grabrick03} undertook a sub study within the Minnesota Breast Family cohort of possible genetic connections between breast and prostate cancer. A subset of 206 families was selected, and all men in those families were invited to participate. Living male members were sent a questionnaire, for those deceased or unable to answer a shortened form was sent to their spouse or close relative. More to do for this section \section{Random slopes} \begin{thebibliography}{9} \bibitem{Anderson58} V E Anderson, H O Goodman, S C Reed. \emph{Variables related to human breast cancer}. University of Minnesota Press, 1958. \bibitem{Grabrick03} D M Grabrick, J R Cerhan, R A Veirkant, T M Therneau, J C Cheville, D J Tindall, and T A Sellers, Evaluation of familial clustering of breast and prostate cancer in the Minnesota Breast Cancer Family Study, \emph{Cancer Detection and Prevention}, 27:30--36, 2003. \bibitem{Ripatti02} S Ripatti and J Palmgren. Estimation of multivariate frailty models using penalized partial likelihood. Biometics 56:1016--1022, 2002. \bibitem{Sellers99} T A Sellers, R A King, J R Cerhan, P L Chen, D M Grabrick, L H Kushi, W S Oetting, R A Vierkant, C M Vachon, F J Couch, T M Therneau, L C Hartman and V E Anderson. Fifty year follow-up of cancer incidence in a historical cohort of Minnesota breast cancer families. \emph{Cancer Epid Biomarkers Prev} 8:1051--7, 1999. \bibitem{Therneau00} T M Therneau and P M Grambsch, Modeling Surivival Data: Extending the Cox Model, Springer-Verlag, 2000. \end{thebibliography} \end{document}