--- title: "Introduction to the EDFtest Package" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{EDFtest-vignette} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` `EDFtest` package contains functions for the calculation of goodness-of-fit test statistics and their $p$-values. The three statistics computed are the Empirical Distribution function statistics called Cramér-von Mises ($W^{2}$), Anderson-Darling ($A^{2}$), and Watson statistic ($U^{2}$). The statistics and their $p$-values can be used to assess an assumed distribution. In the simplest situation you have an i.i.d. sample from some distribution $F$ and want to test the hypothesis that $F$ is a member of some specific parametric family. The following families are available: N(location=$\mu$,scale=$\sigma^{2}$), Gamma(shape=$\alpha$,scale=$\beta$), Logistic(location=$\mu$,scale=$s$), Laplace(location=$\mu$,scale=$b$), Weibull(shape=$\alpha$,scale=$\beta$), and Exponential(scale=$\theta$). ## Theory This package computes goodness-of-fit test statistics $A^2$, $W^2$, and $U^2$ -- the Anderson-Darling, Cramér-von Mises, and Watson statistics. These statistics are used to test the null hypothesis that a sample $X_1,\ldots, X_n$ is drawn from a distribution $F$ which belongs to a specified parametric family of distributions against the alternative that $F$ is not equal to any member of that parametric family. The three test statistics were originally defined to test the null hypothesis that a sample, say $U_1, \ldots, U_n$, is drawn from the uniform distribution on [0,1]. For this problem the empirical cumulative distribution function, $F_n$, of the $U$ sample is compared to the cumulative distribution function $F(u) = u$ (on [0,1]) of the Uniform[0,1] distribution. All three statistics depend on the random function, called the empirical process, $$ W_n(x) = \sqrt{n}\{F_n(x) - x\} $$ defined for $0 \le x \le 1$. The oldest of the three statistics is $W^2$ given by $$ \int_0^1 W_n^2(x) \, dx. $$ Anderson and Darling suggested dividing $W_n(x)$ by its standard deviation to give more weight to the tails (near 0 and 1).This gives the statistic $$ A^2 = \int_0^1 \frac{W_n^2(x)}{x(1-x)} \, dx. $$ Finally Watson adapted the statistic to testing for uniformity of data on a circle. In this case the circle is scaled to have circumference 1 so that starting from some point on the circle and going round the arc length traversed moves from 0 to 1. The resulting statistic is $$ U^2 = \int_0^1 (W_n(u) - \int_0^1 W_n(v) \, dv )^2 \, du. $$ Watson's key observation is that this definition gives a statistic whose value does not depend on the choice of 'some point on the circle' mentioned above. Note, however, that the statistic can be used even if the data did not come from a circle. Our package contains generic functions (`AD`, `CvM`, and `Watson`) which compute these test statistics using computing formulas which can be found, for instance, in articles by Michael Stephens in a volume edited by Ralph D'Agostino and Stephens called *Tests of goodness-of-fit*. Other problems in which we want to check whether or not a sample $X_1,\ldots,X_n$ has some specific *continuous* distribution $F_0$ (such as N(0,1)) are handled by realizing that the $X_i$ have distribution $F_0$ if and only if the 'probability integral transforms' $U_i=F_0(X_i)$ have the standard uniform distribution. So to test the null hypothesis $F=F_0$ we do the probability integral transform and then apply one of the tests of uniformity. These tests are extended to test the composite hypothesis that $F$ is $F_0(\cdot | \theta)$ for some parametric model indexed by $\theta$. The parameter is estimated by maximum likelihood to get $\hat\theta$ and this estimate is used to produce $$ \hat{U}_i = F(X_i | \hat\theta). $$ Our statistics are then calculated from these $\hat{U}_i$ values which should, if the null hypothesis is correct, be approximately uniform. Having computed the test statistic we want to use we then need to compute an appropriate $p$-value. In this package we use the following limit theory to compute asymptotic $p$-values. For regular families, assuming the null hypothesis holds, each of the statistics converges in distribution to an object of the form $$ \int_0^1 Y^2(u) \, du $$ where $Y$ is a Gaussian process with mean function 0 and covariance function $\rho(u,v)$ which depends on the particular model being tested and usually too on the correct parameter values so that $\rho(u,v)=\rho(u,v,\theta)$. It follows that the distribution of the integral is the same as that of $$ S \equiv \sum_{i=1}^\infty \lambda_i^2 Z_i^2 $$ where the $Z_i$ are i.i.d. standard normal (so $Z_i^2$ is $\chi_1^2$) and the $\lambda$ are the eigenvalues of the covariance $\rho$. A number $\lambda$ is an eigenvalue if there is a non-zero function $f$ solving the equation $$ \int_0^1 \rho(u,v,\theta) f(v) \, dv = \lambda f(u). $$ In the i.i.d. sample setting the covariance function $\rho$ for the Cramér-von Mises statistic is given by $$ \min\{u,v\}-uv - \psi(u,\theta)^T {\cal I}^{-1}(\theta) \psi(v,\theta). $$ Here ${\cal I}$ is the $p \times p$ Fisher information matrix for a single observation evaluated at $\theta$ and the function $\psi$ is the $p$ vector with $i$th component $$ \psi_i(v,\theta) = \frac{\partial}{\partial \theta_i} F(y|\theta) $$ evaluated at the value of $y$ solving $F(y|\theta) = v$. For the Anderson-Darling statistic this covariance function must be divided by $\sqrt{u(1-u)v(1-v)}$. For Watson's statistic we replace $\rho$ above by $$ \rho_U(u,v) = \rho(u,v) -\int_0^1 \rho(s,v) \, ds -\int_0^1 \rho(u,t)\, dt + \int_0^1\int_0^1 \rho(s,t)\, ds \, dt. $$ For the built-in parametric models specified here we use the specific forms of these functions appropriate to the model; for models which are not built-in we offer another approach which is described later. ### Built-in distributions Our computational flow for built-in distributions is then the following: 1. Solve the likelihood equations to compute the maximum likelihood estimate $\hat\theta$ for the data set $x$. 2. Compute the probability integral transforms of the vector $x$ using the $\hat\theta$ distribution function for the model being tested. Then compute the value of the test statistic the user has selected. 3. Compute approximate solutions of the eigenvalue equation $$ \int_0^1 \rho(u,v,\hat\theta) f(v) \, dv = \lambda f(u). $$ by discretization. Specifically let $s_i = \frac{i}{m+1}$ for $i$ running from 1 to $m$ for some $m$ (in the code we take `m=100` by default) and create the matrix $R$ given by $$ R_{i,j} = \rho(s_i,s_j,\hat\theta). $$ Notice that the precise form of $\rho$ depends on the statistic being used and on the family of distributions being tested. 4. Then compute the $m$ eigenvalues, say $\hat\lambda_1,\ldots,\hat\lambda_m$ of $R$. 5. Approximate the distribution of $S$ above by the distribution of $$ \sum_{i=1}^m \hat\lambda_i Z_i^2. $$ We use the package `CompQuadForm` to compute $p$-values from this distribution. ### User supplied distributions Users can add their own distributions by providing two functions * `Fdist(x,thetahat, ...)` which takes parameter estimates and a data set `x` and computes the probability integral transform for each element of `x`. `Fdist` must return a vector of $n$ probabilities * `Score(x,thetahat, ...)` which takes parameter estimates and a data set `x` and computes, for each entry in `x`, the component of the score function due to observation `x`. These must be returned in an $n$ by $p$ matrix with 1 row for each observation and 1 column for each parameter. The user is also expected to supply the value `thetahat` of the maximum likelihood estimate. The work-flow above is then modified as follows: 1. Compute the probability integral transforms of the vector $x$ using `Fdist(x,thetahat,...)`. The compute the value of the test statistic the user has selected. 2. Estimate $\rho(s_i,s_j,\hat\theta)$ using sandwich estimates of the covariance function and of the fisher information matrix. To estimate the fisher information matrix we + Call `Score(x,thetahat, ...)` to get an $n \times p$ matrix, say $A$. + Compute the $p \times p$ matrix $\hat{FI}$ given by $\hat{FI}=A^TA/n$. 3. To estimate the covariance function we compute the vector $\hat U$ of probability integral transforms using `Fdist`. Then we compute the $n \times p$ matrix $D$ with $ij$th entry $$ D_{ij} = \sum_{k=1}^n 1(s_i \le \hat{U}_k)A_{kj} /n. $$ We replace the matrix in Step 3 above by the $m \times m$ matrix $$ R = R_0 - D (\hat{FI}^{-1}) D^T $$ where the $ij$ component of the matrix $R_0$ is $\min(s_i,s_j) - s_i s_j$. For the Anderson-Darling test we then divide $R_{ij}$ by $\sqrt{s_i(1-s_i)s_j(1-s_j)}$. And for Watson's test we replace $R$ by $$ (I-J)R(I-J) $$ where $I$ is the $m \times m$ identity matrix and $J$ is the $m \times m$ matrix with every entry given by $1/m$. Thus we have swept out row and column means from $R$. For the built-in cases these adjustments were made in computing $\rho(u,v,\hat\theta)$. Steps 4 and 5 are then as in the previous case. ## Example Here is an example showing you how to use `EDFtest` package to perform goodness-of-fit tests for a given data set. We are going to use Fisher's or Anderson's `iris` data set for our demonstration. `iris` is a data frame with 150 observations and 5 variables which are `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`. Assumed that we have a special interest in the width of sepal in this sample. Here is a histogram of it: ```{r} hist(iris$Sepal.Width, main="Width of sepal") ``` We may conclude that this sample might follow a normal or gamma distribution from above histogram. Now, let's use the `EDFtest` package to perform the goodness-of-fit tests to justify our guess. ```{r,warning = FALSE,message = FALSE} library(EDFtest) set.seed("100") x=iris$Sepal.Width shape=estimate.gamma(x)[1] # Anderson-Darling statistic and P-value (asq=AD.gamma(x)) AD.gamma.pvalue(a=asq,shape=shape)$P #Cramér-von Mises statistic and P-value (wsq=CvM.gamma(x)) CvM.gamma.pvalue(w=wsq,shape=shape)$P #You can also use following generic functions gof.gamma(x,print=TRUE) #Imhof gof.gamma.bootstrap(x,M=10000) #bootstrap ``` We calculated Anderson-Darling and Cramér-von Mises statistics and $p$-values of the sample by both `imhof` and bootstrap methods. In `AD.gamma.pvalue` and `CvM.gamma.pvalue` functions, we use `imhof` function in `CompQuadForm` package to calculated 100 eigenvalues, by default, for the calculation of their $p$-values. Using `imhof` method, $p$-value for $A^{2}$ is 0.057625 and for $W^{2}$ is 0.02859593. At the same time, $p$-values by 10,000 bootstrap are 0.0289 for $A^{2}$ and 0.0576 for $W^{2}$. Both methods are fairly consistent. Now, we can do similar tests for Normal model. ```{r,warning = FALSE,message = FALSE} set.seed("100") # Anderson-Darling statistic and P-value (asq=AD.normal(x)) AD.normal.pvalue(a=asq)$P #Cramér-von Mises statistic and P-value (wsq=CvM.normal(x)) CvM.normal.pvalue(w=wsq)$P #You can also use following generic functions gof.normal(x,print=TRUE) #Imhof gof.normal.bootstrap(x,M=10000) #bootstrap ``` We can see that $p$-value for $A^{2}$ is 0.02037737 and for $W^{2}$ is 0.009486189. At the same time, $p$-values by 10,000 bootstrap are 0.0205 for $A^{2}$ and 0.0105 for $W^{2}$. Both methods are fairly consistent.