\documentclass{book} %% Changes made by JV %% * Changed 6th edition references to 7th %% * Devore6 -> Devore7 %% * went through examples in this text and matched with new numbers, %% page nos. %% * copied some files from Devore6, as the variable names were better %% than those provided through the ascii textfiles \usepackage{hyperref} \usepackage{url} \usepackage{amsmath} \usepackage{pgf} %%\VignetteIndexEntry{Using the Devore7 package} %%\VignetteDepends{Devore7} \newcommand\code{\bgroup\@codex} \def\@codex#1{{\normalfont\ttfamily\hyphenchar\font=-1 #1}\egroup} \begin{document} \SweaveOpts{engine=R,eps=FALSE,pdf=TRUE,width=5,height=3} \setkeys{Gin}{width=\textwidth} \title{Using the Devore7 package with R} \author{Douglas Bates\\Department of Statistics\\ University of Wisconsin-Madison.\\ Updated for the 7th edition by\\ John Verzani, CUNY/CSI.~\footnote{All comments or suggestions regarding this document should be sent to John Verzani (\url{Devore7@gmail.com}).} } \date{} \maketitle <>= options(width=75, show.signif.stars = FALSE) library(Devore7) library(lattice) @ \tableofcontents \setcounter{chapter}{-1} \chapter{Preliminaries} This document describes the use of the statistical package R as computing support in an introductory statistics course based on the text \emph{Probability and Statistics for Engineering and the Sciences (7th edition)} by Jay Devore (Thomson Brooks-Cole, 2008). We demonstrate how R can be used to reproduce the results in many of the examples in the text. One of the desirable features of this text is the number of examples and exercises based on real data sets the Prof.{} Devore has culled from the engineering literature. As they are real data, some of the data sets are large and have a complex structure. Although it is not difficult to enter these data into a computer package like R, the process is tedious and error-prone. Furthermore, it is not much of a learning experience. We have provided copies of the data sets for the examples and the exercises in a ``package'', named \texttt{Devore7}, that can be used with R. This document is also part of the \texttt{Devore7} package.~\footnote{For the \texttt{Devore7} package, some of the data sets were not provided by the publisher. Many, but not all, missing ones were ported from the \texttt{Devore6} package.} You may wish to try some of the examples in this section as you are reading it. We assume that you have both R and the \texttt{Devore7} package for R installed. (See Appendix~\ref{app:Installing} for instructions if you need to do this first.) \subsection*{Calculating a median} \label{prelim:xmp0114} Suppose that we wish to reproduce the calculation of the median of the %% CHECK THIS data on transferrin receptor concentration shown in Example 1.13 (p.~27 of the text). As there are only 12 concentrations, we could enter the data by hand. Start R and type <>= conc = c(7.6, 8.3, 9.3, 9.4, 9.4, 9.7, 10.4, 11.5, 11.9, 15.2, 16.2, 20.4) str(conc) median(conc) @ The first line assigns the 12 data values as a numeric vector to the name \texttt{conc}, a short form of ``concentration''. The function named \texttt{"c"} concatenates a series of data values into a vector that can be assigned a name. In the next line the \texttt{"str"} function is used to examine the structure of the object named \texttt{"conc"}. The output shows that this is a numeric vector of length 12 and displays the first several data values so you can check them against the data in the text. Finally the \texttt{"median"} function, applied to the \texttt{conc} vector returns the median. \subsection*{Using the Devore7 package} \label{sec:UsingDevore7} Entering the data, as shown above, is suitable for small data sets. An alternative and preferred way to access the data for the larger data sets, is to use the \texttt{Devore7} package and load the data set. The data set for Example 1.13 is called \texttt{xmp01.13}. In general, data sets for examples in the text are named \texttt{xmp}\textit{cc.nn}, where \textit{cc} is the two-digit chapter number and \textit{nn} is the two-digit example number. Data sets for exercises are named \texttt{ex}\textit{cc.nn}. (Single digit chapter or example numbers have a \texttt{0} prepended as in \texttt{xmp01.13} so that the names sort in the correct order.) You must attach the \texttt{Devore7} package every time you start R if you are to have access to the data sets from the textbook like this. To attach the package to an R session use <>= library(Devore7) @ after starting R or select \texttt{Packages -> Load package -> Devore7} from the menu bar. (If this produces an error see Appendix~\ref{app:Installing} for instructions on installing the \texttt{Devore7} package.) After attaching the package, you can load a data set with <>= data(xmp01.13) str(xmp01.13) @ The first line loads the data set into the current R session. The second line provides a description of the structure of the data set. It is a good practice always to use \texttt{str} on the data set after loading it. \subsection*{Form of the data sets} \label{sec:Form} The data set \texttt{xmp01.13} is not a single vector like \texttt{conc}. It is a data table (called a ``data frame'' in R) with one column and 12 rows. All the data sets in the \texttt{Devore7} package are data frames. The output from \texttt{str} indicates that the name of the first column is \texttt{concentration}. To calculate the median we must give both the name of the data frame and the name of the column. We can do this in three ways, as described in Appendix~\ref{sec:accessing}. We will focus on just one of these ways, which is to use the \texttt{"with"} function. <>= with(xmp01.13, median(concentration)) @ The \texttt{with} function indicates which data set should be used to gain access to the column (or ``variable'') called \texttt{concentration}. \subsection*{Summary} \label{sec:prelim:summary} To recap: \begin{enumerate} \item You should have R installed on a computer and the \texttt{Devore7} package for R installed. (See Appendix~\ref{app:Installing} for instructions if you need to do this.) \item At the beginning of each session use <>= library(Devore7) @ to allow access to the data sets from the package. \item To load the data for a specific example or a specific exercise use a name of the form \texttt{xmp}\textit{cc.nn} or \texttt{ex}\textit{cc.nn} to \texttt{data()} then check the structure with \texttt{str()}. <>= data(xmp01.13) str(xmp01.13) @ \end{enumerate} In the remainder of this document we will not show these steps explicitly. The R functions we have mentioned are shown in Table~\ref{tab:general}. See also Appendix~\ref{app:help}. \begin{table}[htbp] \centering \begin{tabular}[]{l l} Function & Purpose \\\hline q() & quit R\\ help(name) & display help on an object (function or data set)\\ help.search(``topic'')& search for functions related to a topic\\ library(name) & Make data sets from a package available for loading\\ data(name) & Load a data set \\ str(name) & Display a brief description of the structure \\ with(dataset,$\dots$) & Use the variables in a data set \\ \hline \end{tabular} \caption{R functions for general use} \label{tab:general} \end{table} \chapter{Overview and Descriptive Statistics} \label{ch:overview} We will follow the same sequence of topics and chapter headings as in the text and will begin each chapter with a table of R functions that are used in the chapter. Table~\ref{tab:ch1} lists functions used in chapter 1. \begin{table}[htbp] \centering \begin{tabular}{l l} \multicolumn{1}{c}{\textbf{Function}} & \multicolumn{1}{c}{\textbf{Description}} \\\hline stem(x) & stem-and-leaf display\\ hist(x) & histogram\\ boxplot(x)& boxplot\\ mean(x) & mean (i.e. average) value of x\\ median(x) & median\\ var(x) & sample variance\\ sd(x) & sample standard deviation\\ log(x) & natural logarithm (works on entire vectors)\\ log(x, 10)& common (base 10) logarithm\\ sqrt(x) & square root\\ \verb+x^(1/3)+ & cube root\\ \hline \end{tabular} \caption{R functions used with chapter 1} \label{tab:ch1} \end{table} \section{Example 1.1} \label{sec:xmp01.01} Example 1.1 (p.~4) lists the ambient temperatures ($^\circ$F) for each test firing or actual launch of the space shuttle prior to the \emph{Challenger} tradgedy in 1986. In Figure 1.1 (p.~4) these data are displayed as a stem-and-leaf plot and as a histogram. There are 36 data values. A stem-and-leaf plot similar to that in Figure 1.1 (p.~4) can be produced with <>= with(xmp01.01, stem(temp)) @ (Remember that first you must attach the \texttt{Devore7} package, load the \texttt{xmp01.01} data set and check its structure. We do not show those steps here.) The \texttt{hist} function produces a histogram. \begin{center} <>= with(xmp01.01, hist(temp)) @ \end{center} The stem-and-leaf plot and the histogram shown here are not exactly the same as those shown in Figure 1.1 (p.~4). In section~\ref{sec:ch1enhance} we show how optional arguments to \texttt{stem} and to \texttt{hist} could be used to produce displays similar to those in the text. \section{Example 1.5} \label{sec:xmp01.05} The data in example 1.5 (p.~11), on the percentage of undergraduate students who are binge drinkers at 140 different campuses, are presented as a stem-and-leaf display in Figure 1.4 (p.~12). <>= with(xmp01.05, stem(bingePct)) with(xmp01.05, stem(bingePct, scale = 0.5)) @ The first stem-and-leaf display is more spread out than the one in Figure~1.4 (p.~12). In the second call to \texttt{stem} we use the optional argument \texttt{scale} to shrink the scale by a factor of $\frac{1}{2}$ so the resulting display is similar to that in Figure~1.4. \section{Example 1.6} \label{sec:xmp01.06} As in Example~1.5 a stem-and-leaf display is created, this time from data on yardages of golf courses as given in \emph{Golf Magazine}. <>= with(xmp01.06, stem(yardage)) @ % \section{Examples 1.7 and 1.8} % \label{sec:xmp01.07} % Example~1.7 describes the \emph{time-series plot} in Figure~1.6 % (p.~14) of quarterly beer consumption. These data are not available % so we show how a time-series plot of the shuttle launch temperatures % from Example~1.1 could be created. % <>= % opar = par(mar=c(3.5,4.1,0.1,0.1)) % @ % \begin{center} % <>= % with(xmp01.01, plot(temp, type = "b")) % @ % \end{center}%$ % <>= % par(opar) % @ % The optional argument \texttt{type="b"} to the \texttt{plot} function % indicates that both the points and the connecting lines should be drawn on % the plot. % There is no easy way to use R to produce the dotplot shown in % Figure~1.7 (p.~14) for Example~1.8. \section{Example 1.9} \label{sec:xmp01.09} A histogram such as shown in Figure~1.8 (p.~16) is produced by the \texttt{hist} function. \begin{center} <>= with(xmp01.09, hist(consump)) @ \end{center} The vertical axis on this histogram is frequency. For a vertical axis on the scale of density (relative frequency divided by bin width) use the optional argument \texttt{freq = FALSE}. \section{Example 1.10} \label{sec:xmp01.10} Figure~1.10 (p.~28) shows an example of a histogram with unequal bin widths. The optional argument \texttt{breaks} to the \texttt{hist} function is used to set the breakpoints for the bins. When unequal bin widths are used, the vertical axis switches to the density scale. \begin{center} <>= with(xmp01.10, hist(strength, breaks = c(2,4,6,8,12,20,30))) @ \end{center} The \texttt{breaks} argument is a vector created by concatenating several numbers with the \texttt{c} function. % \section{Example 1.12} % \label{sec:xmp01.12} % Figure~1.13 (p.~22) shows a barplot for the motorcycle ownership % data. As this data set is not available in the \texttt{Devore7} % package, we create the vector of % frequencies, assign names to the frequencies, and then use % \texttt{barplot} to produce the plot % <>= % cycles = c(41,27,20,18,3,11) % names(cycles) = c("Honda", "Yamaha", "Kawasaki", "Harley", "BMW", "other") % barplot(cycles) % @ \section{Examples 1.12 and 1.13} \label{sec:xmp01.13} Numeric measures of location are calculated with the functions \texttt{mean} and \texttt{median}. Another useful summary function is \texttt{sum}. A brief summary, including the mean, the median, the quartiles, the maximum and minimum is returned by \texttt{summary}. <>= with(xmp01.12, mean(crackLength)) with(xmp01.12, sum(crackLength)) with(xmp01.13, median(concentration)) with(xmp01.12, summary(crackLength)) with(xmp01.13, summary(concentration)) @ \section{Example 1.14} \label{sec:xmp01.14} The trimmed mean, described in Example~1.14 (p.~28) is obtained by using the optional \texttt{trim} argument to \texttt{mean}. <>= with(xmp01.14, summary(copper)) with(xmp01.14, mean(copper, trim = 0.1)) @ \section{Example 1.15} \label{sec:xmp01.15} Functions \texttt{var} and \texttt{sd} provide the sample variance and sample standard deviation, respectively. The ``computing formula'' described on p.~34 is not used by these functions because that formula can have poor numerical properties. <>= with(xmp01.15, var(Strength)) with(xmp01.15, sd(Strength)) @ \section{Examples 1.17} \label{sec:xmp01.17} Function \texttt{boxplot} provides the boxplot. By default a vertical boxplot is constructed. Use the optional argument \texttt{horizontal=TRUE} to get a horizontal boxplot <>= opar = par(mar=c(3.5,4.1,0.1,0.1)) @ \begin{center} <>= with(xmp01.17, boxplot(depth, horizontal = TRUE)) @ \end{center} <>= par(opar) @ \section{Example 1.18} \label{sec:xmp01.19} <>= opar = par(mar=c(3.5,4.1,0.1,0.1)) @ \begin{center} <>= with(xmp01.18, boxplot(C1, horizontal = TRUE)) @ \end{center} <>= par(opar) @ \section{Comparative boxplots} \label{sec:xmp0119} The data for Example~1.19 (p.~38) are not available so we use the data on scores for creamy and crunchy peanut butters (exercise~1.15, p.~21) to illustrate comparative boxplots. This data set has two columns and 37 rows. The score is the first column and the indicator of ``Creamy'' or ``Crunchy'' is the second column. We say that these data are in the stacked format (see Appendix~\ref{app:stacked} for details). In the call to \texttt{boxplot} we use the formula \Sexpr{Quote(Score ~ Type)} to indicate that \texttt{Score} is the response and \texttt{Type} defines the groups for the comparative boxplot. <>= opar = par(mar=c(3.5,4.1,0.1,0.1)) @ \begin{center} <>= with(ex01.15, boxplot(Score ~ Type, horizontal = TRUE, las = 1)) @ \end{center} <>= par(opar) @ \section{Enhancing graphical displays} \label{sec:ch1enhance} In this chapter we have used several functions, such as \texttt{hist}, \texttt{barplot}, \texttt{plot}, and \texttt{boxplot} that produce graphical displays of data. These functions all can take optional arguments that provide more effective displays. For example, in \S~\ref{sec:xmp0119}, we used the optional argument \texttt{las=1} to \texttt{boxplot} to change the label style on the vertical axis so the labels are horizontal rather than vertical. The \texttt{las=1} argument can be used with other graphics functions. Compare \begin{center} <>= with(xmp01.09, hist(consump, las = 1)) @ \end{center} with the display in \S~\ref{sec:xmp01.09} We have also seen how the \texttt{breaks} argument can be used with \texttt{hist} and how the \texttt{scale} argument can be used with \texttt{stem}. To reproduce a display like Figure~1.1 (p.~4) we would use <<>>= with(xmp01.01, stem(temp, scale=2)) @ and <>= with(xmp01.01, hist(temp, breaks=c(25,35,45,55,65,75,85))) @ \chapter{Probability} \label{ch:Probability} Table~\ref{tab:ch2} lists a function used in chapter 2. \begin{table}[htbp] \centering \begin{tabular}{l l} \multicolumn{1}{c}{\textbf{Function}} & \multicolumn{1}{c}{\textbf{Description}} \\ \hline choose(n,k) & calculate $\binom{n}{k}$\\ \hline \end{tabular} \caption{R functions used in chapter 2} \label{tab:ch2} \end{table} \section{Example 2.23} \label{sec:xmp0223} In this chapter on probability there is little use for R functions except for the \texttt{choose} function that evaluates the number of combinations of $k$ objects selected from $n$, written $\binom{n}{k}$ and described in section~2.3 (pp.~64--65). To calculate the first probability in example 2.23 <>= choose(15,3)*choose(10,3)/choose(25,6) @ \chapter[Discrete Random Variables]{Discrete Random Variables and Probability Distributions} \label{cha:Discrete} To quote the document ``Introduction to R'' \begin{quote} One convenient use of R is to provide a comprehensive set of statistical tables. Functions are provided to evaluate the cumulative distribution function P(X <= x), the probability density function and the quantile function (given q, the smallest x such that P(X <= x) > q), and to simulate from the distribution. \begin{center} \begin{tabular}{l l l} \multicolumn{1}{c}{Distribution}& \multicolumn{1}{c}{R name}& \multicolumn{1}{c}{additional arguments}\\\hline binomial& `binom'& `size, prob'\\ geometric& `geom' & `prob'\\ hypergeometric&`hyper'&`m, n, k'\\ negative binomial&`nbinom'&`size, prob'\\ Poisson&`pois'&`lambda'\\ \hline \end{tabular} \end{center} Prefix the name given here by `d' for the density, `p' for the CDF, `q' for the quantile function and `r' for simulation (\textbf{r}andom deviates). The first argument is `x' for `dXXX', `q' for `pXXX', `p' for `qXXX' and `n' for `rXXX' (except for `rhyper' and `rwilcox', for which it is `nn'). \end{quote} These functions are more versatile and more accurate than using probability tables. \section{Example 3.31} \label{sec:xmp0330} For $X$ having a binomial distribution with $n=6$ and $p=0.5$ we are to calculate $P(X=3)$, $P(3\le X)$, and $P(X\le 1)$. <>= dbinom(3, size = 6, prob = 0.5) dbinom(3:6, size = 6, prob = 0.5) sum(dbinom(3:6, size = 6, prob = 0.5)) 1 - pbinom(2, size = 6, prob = 0.5) pbinom(2, size = 6, prob = 0.5, lower = FALSE) pbinom(1, 6, 0.5) @ The first call evaluates $b(3; 6, .5)$. The second call evaluates the probability function at $3,\dots,6$ using the \texttt{":"} operator that generates the sequence from 3 to 6. If we sum this vector of probabilities we get $P(3\le X)$. An alternative is to use $P(3\le X)=1-P(X\le2)$ and evaluate $P(X\le 2)$ with \texttt{pbinom}. Another alternative is to use cumulative probability in the upper tail, obtained with the optional argument \texttt{lower=FALSE} to \texttt{pbinom}. Finally \texttt{pbinom} is used to calculate $P(X\le 1)$. \section{Example 3.32} <>= pbinom(8,15,0.2) dbinom(8,15,0.2) 1-pbinom(7,15,0.2) sum(dbinom(4:7,15,0.2)) @ \section{Example 3.35} R uses a different, but equivalent, set of parameters for the hypergeometric distribution than does the text. In the text the parameters of the hypergeometric are $N$, the population size, $n$, the sample size, and $M$, the number of ``successes'' in the population. In R the sample size is called $k$, the parameter $m$ corresponds to $M$ in the text, and $n$ is $N-M$. Thus what is written in the text as $h(2;5,12,20)$ becomes <>= dhyper(2,12,8,5) @ \section{Example 3.36} \label{sec:xmp0336} <>= dhyper(2,5,20,10) phyper(2,5,20,10) @ \section{Example 3.38} \label{sec:xmp0338} The negative binomial density function, \texttt{dnbinom}, shown in the text as $nb(10; 5, 2)$, has essentially the same calling sequence in R. The cumulative probability function is \texttt{pnbinom}. <>= dnbinom(10, 5, 0.2) pnbinom(10, 5, 0.2) @ \section{Example 3.39} \label{sec:xmp0339} <>= dpois(5, lambda = 4.5) ppois(5, 4.5) @ \section{Example 3.40} \label{sec:xmp0340} The Poisson distribution can be used to approximate binomial probabilities with large $n$ and small $p$. However, there is no need to do so because the exact binomial probabilities can be evaluated. <>= dbinom(1, 400, 0.005) dpois(1,2) pbinom(3, 400, 0.005) ppois(3, 2) @ \chapter[Continuous Random Variables]{Continuous Random Variables and Probability Distributions} \label{cha:Continuous} The set of continous distributions available in R is \begin{center} \begin{tabular}{l l l} \multicolumn{1}{c}{Distribution}& \multicolumn{1}{c}{R name}& \multicolumn{1}{c}{additional arguments}\\\hline beta & `beta' & `shape1, shape2, ncp' \\ Cauchy & `cauchy' & `location, scale' \\ $\chi^2$ & `chisq' & `df, ncp' \\ exponential & `exp' & `rate' \\ F & `f' & `df1, df1, ncp' \\ gamma & `gamma' & `shape, scale' \\ log-normal & `lnorm' & `meanlog, sdlog' \\ logistic & `logis' & `location, scale' \\ normal & `norm' & `mean, sd' \\ Student's t & `t' & `df, ncp' \\ uniform & `unif' & `min, max' \\ Weibull & `weibull'& `shape, scale'\\ Wilcoxon & `wilcox' & `m, n'\\ \hline \end{tabular} \end{center} As with the discrete distributions, prefix the name given here by `d' for the density, `p' for the CDF, `q' for the quantile function and `r' for simulation (\textbf{r}andom deviates). The first argument is `x' for `dXXX', `q' for `pXXX', `p' for `qXXX' and `n' for `rXXX' Not all the distributions shown above are discussed in the text. \section{Example 4.13} \label{sec:xmp0413} Function \texttt{pnorm} provides the standard normal cumulative distribution function by default. The optional arguments \texttt{mean} and \texttt{sd} can be set to values other than 0 and 1 to provide probabilities from any normal distribution. $P(Z\le 1.25)$ and $P(Z\le-1.25)$ are evaluated as <>= pnorm(1.25) pnorm(-1.25) @ $P(Z>1.25)$ can be evaluated in two ways <>= 1-pnorm(1.25) pnorm(1.25, lower=FALSE) @ To evaluate probabilities of intervals, such as $P(-0.38\le Z\le1.25)$, apply \texttt{pnorm} to the endpoints as a vector (created with the \texttt{"c"} function) which returns a vector of probabilities. The \texttt{"diff"} function forms successive differences from which we obtain the probability in the interval. <>= pnorm(c(-0.38,1.25)) diff(pnorm(c(-0.38,1.25))) @ \section{Example 4.14} \label{sec:xmp0414} The inverse of the standard normal CDF, called the quantile function, is obtained with \texttt{qnorm}. Notice that the first argument to \texttt{qnorm} is a probability, not a percentage. <>= qnorm(0.99) @ \section{Example 4.15} \label{sec:xmp0415} To obtain $z_\alpha$, use the optional argument \texttt{lower=FALSE} to \texttt{qnorm}. <>= qnorm(0.05, lower=FALSE) @ \section{Example 4.16} \label{sec:xmp0416} Nonstandard normal distribution probabilities or quantiles are obtained with the optional arguments \texttt{mean} and \texttt{sd} to \texttt{pnorm} and \texttt{qnorm}. In this example $\mu=1.25$ and $\sigma=0.46$ and we wish to evaluate $P(1.00\le X\le 1.75)$ <>= diff(pnorm(c(1.0, 1.75), mean = 1.25, sd = 0.46)) @ \section{Example 4.18} \label{sec:xmp0417} For $\mu=64$ and $\sigma=0.78$, the 99.5th percentile is <>= qnorm(0.995, mean=64, sd=0.78) @ \section{Example 4.20} \label{sec:xmp0420} The normal approximation to binomial probabilities can be calculated but, like the Poisson approximation, it is not necessary as the exact binomial probabilities can also be calculated. <>= pnorm(10.5, mean = 12.5, sd = sqrt(12.5*0.75)) pbinom(10, size = 50, prob = 0.25) diff(pnorm(c(4.5,15.5), mean = 12.5, sd = sqrt(12.5*0.75))) diff(pbinom(c(4,15), size = 50, prob = 0.25)) @ \section{Example 4.23} \label{sec:xmp0423} The parameters $\alpha$ and $\beta$ of the gamma distribution are named \texttt{shape} and \texttt{scale} respectively in R. In this example $\alpha=2$ and $\beta$ has the default value of 1. <>= pgamma(c(3,5), shape=2) diff(pgamma(c(3,5), shape=2)) pgamma(4, shape = 2, lower = FALSE) @ \section{Example 4.24} \label{sec:xmp0424} <>= diff(pgamma(c(60,120), shape = 8, scale = 15)) pgamma(30, shape = 8, scale = 15, lower = FALSE) @ \section{Examples 4.21 and 4.22} \label{sec:xmp0422} In R the parameter $\lambda$ of the exponential distribution is called \texttt{rate} <>= pexp(10, rate = 0.2) diff(pexp(c(5,10), rate = 0.2)) pexp(2, rate = 0.5, lower = FALSE) @ \section{Example 4.25} \label{sec:xmp0425} The parameters $\alpha$ and $\beta$ of the Weibull distribution are called \texttt{shape} and \texttt{scale}. <>= pweibull(10, shape = 2, scale = 10) qweibull(0.95, shape = 2, scale = 10) @ \section{Example 4.27} \label{sec:xmp0427} The lognormal distribution takes two parameters named \texttt{meanlog} and \texttt{sdlog}. <>= diff(plnorm(c(1,2), meanlog = 0.375, sdlog = 0.25)) qlnorm(0.99, meanlog = 0.375, sdlog = 0.25) @ \section{Example 4.28} \label{sec:xmp0428} R provides probabilities and quantiles of the standard beta distribution with $A=0$ and $B=1$. The parameters $\alpha$ and $\beta$ are called \texttt{shape1} and \texttt{shape2} respectively. To use other values of $A$ and $B$ the scaling must be done manually. In this example $A=2$, $B=5$, $\alpha=2$ and $\beta=3$. To transform to a standard beta distribution we replace $x$ by $(x-2)/(5-2)$ <>= pbeta((3-2)/(5-2), shape1 = 2, shape2 = 3) @ \section{Examples 4.29 and 4.30} \label{sec:xmp0429} The normal probability plot is produced with \texttt{qqnorm}. A reference line can be added with \texttt{qqline}. \begin{center} \setkeys{Gin}{width=0.7\textwidth} <>= with(xmp04.29, qqnorm(meas.err)) with(xmp04.29, qqline(meas.err)) @ \end{center} \begin{center} \setkeys{Gin}{width=0.7\textwidth} <>= with(xmp04.30, qqnorm(Voltage)) with(xmp04.30, qqline(Voltage)) @ \end{center} \section{Example 4.31} \label{sec:xmp0431} The Weibull probability plot is not available directly in R. However, the plot can be created using the formula $\ln[-\ln(1-p)]$ for the 5th, 15th$,\dots,$ and 95th percentiles as given in text. The sequence $0.05,0.15,\dots,0.95$ is generated with \texttt{seq(0.05, 0.95, 0.1)}. <>= with(xmp04.31, plot(log(-log(1-seq(0.05, 0.95, 0.1))), log(lifetime))) @ \chapter[Joint Probability Distributions]{Joint Probability Distributions and Random Samples} \label{cha:Joint} The main use of R in this chapter is for simulation experiments as described in section~5.3. \section{Example 5.22} \label{sec:xmp0522} In this example six samples of size ten are drawn from a Weibull distribution with $\alpha=2$ and $\beta=5$. To reproduce the plot of the density shown in Figure 5.6 we can use <>= options(digits=5) @ <>= curve(dweibull(x, shape = 2, scale = 5), 0, 15, las = 1) @ To get a single sample of size 10 from this Weibull distribution we use <>= rweibull(10, shape = 2, scale = 5) @ We could store such a sample as, say, \texttt{samp}, then evaluate its sample mean, sample median, and sample standard deviation. <>= samp = rweibull(10, shape = 2, scale = 5) print(samp) mean(samp) median(samp) sd(samp) @ Notice that every time \texttt{rweibull} is called a new sample is generated. This process of generating a sample and evaluating selected statistics on the sample could be repeated manually to get a total of 6 samples. For large simulation experiments this would quickly become tedious so we put these calculations in a loop. <>= means = medians = sds = numeric(6) for (i in 1:6) { samp = rweibull(10, shape = 2, scale = 5) print(samp) means[i] = mean(samp) medians[i] = median(samp) sds[i] = sd(samp) } means medians sds @ In the first line we assign the names \texttt{means}, \texttt{medians}, and \texttt{sds} to numeric vectors of length 6. Within the loop we assign individual elements in these vectors. <>= options(digits=7) @ \section{Normal data, like example 5.23} \label{sec:xmp0523a} In this example 500 samples of size $n=5$ are generated from a normal distribution with $\mu=8.25$ and $\sigma=0.75$ and the mean of each sample is calculated. We could do this in a loop, as shown above. However, it is more compact to generate a matrix with 5 rows and 500 columns then generate a histogram of the means of the columns of this matrix. Function \texttt{colMeans} calculates the means of the columns. Function \texttt{matrix} creates a matrix from a numeric vector. The user can specify the number of rows and the number of columns. If one of these is omitted, it is calculated from the length of the vector and the other dimension. We set the graphics parameter \texttt{"mfrow"} (multiple figures by row) to create a 2 by 2 array of plots. <>= par(mfrow = c(2,2)) samp5 = matrix(rnorm(500 * 5, mean = 8.25, sd = 0.75), ncol = 500) hist(colMeans(samp5), main='Samples of size 5') samp10 = matrix(rnorm(500 * 10, mean = 8.25, sd = 0.75), ncol = 500) hist(colMeans(samp10), main='Samples of size 10') samp20 = matrix(rnorm(500 * 20, mean = 8.25, sd = 0.75), ncol = 500) hist(colMeans(samp20), main='Samples of size 20') samp30 = matrix(rnorm(500 * 30, mean = 8.25, sd = 0.75), ncol = 500) hist(colMeans(samp30), main='Samples of size 30') @ <>= par(mfrow=c(1,1)) @ \section{Example 5.23} \label{sec:xmp0523} This example is similar to Example 5.22. To reproduce Figure 5.12 use <>= curve(dlnorm(x, meanlog=3, sdlog=0.4), from = 0, to = 75, las = 1) @ The samples and the histograms of the means are generated from <>= par(mfrow=c(2,2)) samp5=matrix(rlnorm(500 * 5, 2, 0.4), ncol = 500) hist(colMeans(samp5), main = 'Means of samples of size 5') samp10=matrix(rlnorm(500 * 10, 2, 0.4), ncol = 500) hist(colMeans(samp10), main = 'Means of samples of size 10') samp20=matrix(rlnorm(500 * 20, 2, 0.4), ncol = 500) hist(colMeans(samp20), main = 'Means of samples of size 20') samp30=matrix(rlnorm(500 * 30, 2, 0.4), ncol = 500) hist(colMeans(samp30), main = 'Means of samples of size 30') @ Finally the normal probability plot is generated by \begin{center} \setkeys{Gin}{width=0.7\textwidth} <>= qqnorm(colMeans(samp30)) qqline(colMeans(samp30)) @ \end{center} \chapter{Point Estimation} \label{cha:PointEst} \section{Example 6.2} \label{sec:xmp0602} The various estimators of location described in Example 6.2 (p.~229) can be evaluated as <>= with(xmp06.02, mean(Voltage)) with(xmp06.02, median(Voltage)) with(xmp06.02, mean(range(Voltage))) with(xmp06.02, mean(Voltage, trim=0.1)) @ \section{Example 6.3} \label{sec:xmp0603} Functions \texttt{var} and \texttt{sd} provide $s^2$ and $s$, the sample variance and standard deviation, respectively. <>= with(xmp06.03, var(Strength)) with(xmp06.03, sd(Strength)) @ To evaluate the alternative estimator $\hat{\sigma}=\frac{\sum\left(X_i-\bar{X}\right)}{n}$ we must evaluate the formula <>= with(xmp06.03, sum((Strength-mean(Strength))^2)/length(Strength)) @ Function \texttt{length} applied to a vector returns $n$, the number of elements in the vector. \section{Example 6.13} \label{sec:xmp0613} The calculation of the method of moments estimates in Example 6.13 (p.~244) can be split into stages <>= xbar = with(xmp06.13, mean(Survival)) xsqb = with(xmp06.13, mean(Survival^2)) xbar xsqb xbar^2/(xsqb-xbar^2) (xsqb-xbar^2)/xbar @ These estimates are slightly different from those shown in the text because the intermediate results $\bar{x}$ and $\sum x_i^2/n$ were rounded in the text. Maximum likelihood estimates are discussed later in chapter 6. We can evaluate the maximum likelihood estimates of $\alpha$ and $\beta$ for this example using the function \texttt{fitdistr} from the package \texttt{MASS} that supplements the book ``Modern Applied Statistics with S (4th ed)'' by Bill Venables and Brian Ripley (Springer, 2002). These estimates are determined by numerical optimization of the logarithm of the likelihood function and we must supply starting estimates. We use the method of moments estimates for this. <>= library(MASS) with(xmp06.13, fitdistr(Survival, dgamma, list(shape=10.577,scale=10.726))) @ The MLEs are quite different from the method of moments estimates. The numbers in parentheses under the estimates are their standard errors. \chapter[Statistical Intervals]{Statistical Intervals Based on a Single Sample} \label{cha:Intervals} Table~\ref{tab:ch7} lists functions used in chapters~\ref{cha:Intervals} and \ref{cha:Tests}. \begin{table}[htbp] \centering \begin{tabular}{l l} \multicolumn{1}{c}{\textbf{Function}} & \multicolumn{1}{c}{\textbf{Description}} \\\hline t.test(x) & Student's t test and confidence interval\\ prop.test(x,n) & Test and confidence interval on proportion\\ binom.test(x,n) & Test and confidence interval on proportion\\ \hline \end{tabular} \caption{R functions used with chapters~\ref{cha:Intervals} and \ref{cha:Tests}} \label{tab:ch7} \end{table} \section{Example 7.2} \label{sec:xmp07.02} The calculation of the confidence interval for known $\sigma$, shown in Example 7.2, could be done in R as <>= 80.0 + c(-1, 1) * 1.96 * 2.0 / sqrt(31) @ but it is probably just as easy to use a hand calculator for this. \section{Example 7.6} \label{sec:xmp07.06} The calculation of the sample mean, the sample standard deviation, and the sample size can be combined into a single statement <>= with(xmp07.06, mean(Voltage)+c(-1,1)*1.96*sd(Voltage)/sqrt(length(Voltage))) @ An alternative, and preferred way, of calculating the interval is to use \texttt{t.test}. With 48 observations the confidence interval on $\mu$ from the t distribution is nearly identical to that from the standard normal distribution. <>= with(xmp07.06, t.test(Voltage)) @ It is always a good idea to check the normal probability plot when using \texttt{t.test}, even with a large sample. \begin{center} \setkeys{Gin}{width=0.7\textwidth} <>= with(xmp07.06, qqnorm(Voltage)) @ \end{center} \section{Example 7.8} \label{sec:xmp0708} R has two functions, \texttt{binom.test} and \texttt{prop.test}, that can be used to calculate a large-sample confidence interval on the binomial proportion, $p$. Neither of them corresponds exactly the the confidence interval described on p.~295. <>= prop.test(16,48) binom.test(16,48) @ \section{Example 7.11} \label{sec:xmp0711} \begin{center} \setkeys{Gin}{width=0.7\textwidth} <>= with(xmp07.11, t.test(Elasticity)) with(xmp07.11, qqnorm(Elasticity)) @ \end{center} \section{Example 7.15} \label{sec:xmp0715} Although there is no built-in confidence interval for $\sigma^2$ in R, the \texttt{qchisq} function can be used to obtain the critical values $\chi^2_{\alpha/2,n-1}$ and $\chi^2_{1-\alpha/2,n-1}$ used to calculate the interval. \begin{center} \setkeys{Gin}{width=0.7\textwidth} <>= with(xmp07.15,qqnorm(voltage)) with(xmp07.15, 16*var(voltage)/qchisq(c(0.975,0.025), df = 16)) @ \end{center} \chapter[Tests of Hypotheses]{Tests of Hypotheses Based on a Single Sample} \label{cha:Tests} The functions described in chapter~\ref{cha:Intervals} are used for performing tests of hypotheses based on a single sample. Optional arguments are used to specify $\mu_0$ or $p_0$ and to indicate the form of the alternative hypothesis. All of these tests return a p-value, described in \S8.4 (pp.~311--317). From the p-value the result of the hypothesis test for any level $\alpha$ can be determined. \section{Example 8.8} \label{sec:xmp0808} In R the \texttt{t.test} function can be used with any size of data set. For large $n$ the t test is essentially equivalent to the large-sample z test. <>= with(xmp08.08, t.test(DCP, mu = 30, alt = "less")) @ As the p-value of 0.2349 exceeds 0.05 we do not reject $H_0:\mu=30$ versus $H_a:\mu < 30$ at level $\alpha=0.05$. Although not shown in the text book, it is of interest to examine the normal probability plot for these data \begin{center} \setkeys{Gin}{width=0.7\textwidth} <>= with(xmp08.08, qqnorm(DCP)) @ \end{center} This plot shows considerable skewness in the data. If we transform to the logarithm of the DCP measurement the skewness is diminished. \begin{center} \setkeys{Gin}{width=0.7\textwidth} <>= with(xmp08.08, qqnorm(log(DCP))) @ \end{center} and in the logarithm scale, the hypothesis test $H_0:\mu_\textit{log}=\log(30)$ versus $H_a:\mu_\textit{log}< \log(30)$ is significant at level $\alpha=0.05$. <>= with(xmp08.08, t.test(log(DCP), mu=log(30), alt="less")) @ \section{Example 8.9} \label{sec:xmp0809} <>= with(xmp08.09, t.test(MAWL, mu = 25, alt = "greater")) @ \section{Example 8.10} \label{sec:xmp0810} <>= power.t.test(n=10,delta=0.1,sd=0.1,type="one.sample",alt="one.sided") power.t.test(delta=0.1,sd=0.1,power=0.95,type="one.sample",alt="one.sided") @ \section{Example 8.11} \label{sec:xmp0811} The large-sample test for a population proportion is most closely related to the result of the \texttt{prop.test} function <>= prop.test(1276, 4115, p = 0.3, alt = "greater") @ The value labeled \texttt{X-squared} is the square of the z statistic. This version of the test uses a continuity correction. If you wish to reproduce the test statistic as given in the text book, add the optional argument \texttt{correct=FALSE} <>= prop.test(1276, 4115, p = 0.3, alt = "greater", correct=FALSE) sqrt(1.993) @ In both cases the p-value is less than 0.1 so we reject $H_0:p=0.3$ versus $H_a:p>0.3$ at level $\alpha=0.1$. \section{Example 8.13} \label{sec:xmp0813} The small sample test is provided by \texttt{binom.test} <>= binom.test(x=14,n=20,p=0.9,alt="less") @ \chapter{Inference Based on Two Samples} \label{cha:twoSample} <>= print(bwplot(type ~ strength, xmp09.07, xlab = "Tensile strength (psi)")) @ \pgfdeclareimage[width=\textwidth]{strbw}{Devore7-strbw} <>= print(qqmath(~ strength|type, xmp09.07, xlab = "Standard normal quantiles", ylab = "Tensile strength (psi)", type = c("g","p"), aspect = 1)) @ \pgfdeclareimage[width=\textwidth]{strqq}{Devore7-strqq} \section{Example 9.7} \label{sec:xmp0907} When given vectors of numbers as arguments, the summary function returns the mean, min, max and quartiles of the given vector. The qqnorm function is used to generate a qqplot of its given argument. The simplest use of the boxplot function creates a boxplot for each vector argument passed to it. <>= str(xmp09.07) @ <>= bwplot(type ~ strength, xmp09.07) @ \centerline{\pgfuseimage{strbw}} <>= qqmath(~strength|type, xmp09.07) @ \centerline{\pgfuseimage{strqq}} <>= t.test(strength ~ type, xmp09.07, alt = "greater") @ <>= print(qqmath( ~ I(bottom-surface), xmp09.08, type = c("g","p"), xlab = "Standard normal quantiles", aspect = 1, ylab = "Difference in zinc concentrations (mg/L)")) @ \pgfdeclareimage[width=\textwidth]{zincdiffqq}{Devore7-zincdiffqq} \section{Example 9.8} \label{sec:xmp0908} The t.test function conducts various types of t tests. When given one vector argument, it performs a one sample test, when two vector arguments are given, and the paired option is specified as TRUE, the function can perform a paired two sample tests as well. Options can also present to specify if a test is one or two sided. <>= str(xmp09.08) @ <>= qqmath( ~ I(bottom-surface), xmp09.08) @ \centerline{\pgfuseimage{zincdiffqq}} <>= with(xmp09.08, t.test(bottom,surface,paired=TRUE)) @ <>= print(qqmath( ~ Difference, xmp09.09, type = c("g","p"), xlab = "Standard normal quantiles", aspect = 1, ylab = "Difference in proportion of time (%)")) @ \pgfdeclareimage[width=\textwidth]{propdiffqq}{Devore7-propdiffqq} \section{Example 9.9} \label{sec:xmp0909} <>= str(xmp09.09) @ <>= qqmath(~ Difference, xmp09.09) @ \centerline{\pgfuseimage{propdiffqq}} <>= with(xmp09.09, t.test(Before, After, paired = TRUE)) @ \section{Example 9.10} This can be done directly with the differences: \label{sec:xmp0910} <>= Difference <- c(5,19,25,10,10,10,28,46,25,38,14,23,14) qqnorm(Difference) qqline(Difference) t.test(Difference) @ Or, using the data frame with the \texttt{paired=TRUE} argument of \texttt{t.test}. <<>= with(xmp09.10, t.test(slide, digital, paired=TRUE)) @ \chapter{The Analysis of Variance} \label{cha:anova} \chapter{Multifactor Analysis of Variance} \label{cha:manova} \section{Example 11.01} \label{sec:xmp1101} An \emph{interaction plot} shows the response versus the levels of one factor with the points joined according to the levels of the other factor. It an additive model is valid, the lines should be approximately parallel. For a balanced data set, the order of the factors does not affect the calculations of the sums of squares nor any of the test statistics and conclusions. However, for unbalanced data the order of the factors is important. In general we put the blocking factor(s) first and the treatment factor(s) last. The \code{TukeyHSD} function can be applied to aov models with more than one factor. If we are only interested in selected factors we can use the argument \code{which} to restrict the factors being considered. Plots are useful aids in checking if assumptions have been satisfied. To assess the constant variance assumption, residual plots are used and the normal probability plot of the residuals is used to assess the assumption of a normal distribution of the noise term. The simplest way to obtain such plots in R is to plot the fitted model object. Use \code{which = 1} to get residuals versus fitted values, \code{which = 2} to get the qqnorm plot of the residuals, or \code{which = 1:2} to get both. Before beginning, we coerce the variables \texttt{brand} and \texttt{treatment} to be factors. <<>= xmp11.01$brand = as.factor(xmp11.01$brand) xmp11.01$treatment = as.factor(xmp11.01$treatment) @ <>= with(xmp11.01, interaction.plot(treatment, brand, strength, col=2:4, lty=1)) with(xmp11.01, interaction.plot(treatment, brand, strength, col=2:4, lty=1)) with(xmp11.01, interaction.plot(brand, treatment, strength, col=2:4, lty=1)) with(xmp11.01, interaction.plot(brand, treatment, strength, col=2:4, lty=1)) anova(fm1 <- aov(strength ~ treatment + brand, xmp11.01)) TukeyHSD(fm1, which = "treatment") plot(fm1, which = 1:2) @ \section{Example 11.05} \label{sec:xmp1105} <>= str(xmp11.05) with(xmp11.05, interaction.plot(humid, brand, power, col = 2:6, lty = 1)) anova(fm1 <- aov(power ~ humid + brand, data = xmp11.05)) TukeyHSD(fm1, which = "brand") plot(TukeyHSD(fm1, which = "brand")) plot(TukeyHSD(fm1, which = "brand")) @ \section{Example 11.06} \label{sec:xmp1106} <>= str(xmp11.06) anova(fm1 <- aov(Resp ~ Subject + Stimulus, data = xmp11.06)) with(xmp11.06, interaction.plot(Stimulus, Subject, Resp, col = 2:6, lty = 1)) TukeyHSD(fm1, which = "Stimulus") plot(fm1,which = 1) range(xmp11.06$Resp) anova(fm2 <- aov(log(Resp) ~ Subject + Stimulus, data = xmp11.06)) with(xmp11.06, interaction.plot(Stimulus, Subject, log(Resp), col = 2:6, lty = 1)) plot(fm2,which = 1) opar <- par(pty = 's') plot(fm2,which = 2) par(opar) plot(TukeyHSD(fm2, which = "Stimulus")) with(xmp11.06,interaction.plot(Stimulus, Subject, fitted(fm2), col=2:6, lty=1)) @ \section{Example 11.07} \label{sec:xmp1107} <>= str(xmp11.07) xtabs(~ Variety + Density, data = xmp11.07) xtabs(Yield ~ Variety + Density, data = xmp11.07) xtabs(Yield ~ Variety + Density, xmp11.07)/ xtabs( ~ Variety + Density, xmp11.07) anova(fm1 <- aov(Yield ~ Density * Variety, data = xmp11.07)) with(xmp11.07, interaction.plot(Density, Variety, Yield, col=2:6, lty=1)) anova(fm2 <- aov(Yield ~ Density + Variety, xmp11.07)) TukeyHSD(fm2) model.tables(fm2, type = "means") @ \section{Example 11.10} \label{sec:xmp1110} <>= str(xmp11.10) xtabs(~ Period + Coat + Strain, xmp11.10) anova(fm1 <- aov(Tempr ~ Period * Coat * Strain, xmp11.10)) anova(fm2 <- update(fm1, . ~ . - Period:Strain - Period:Coat:Strain)) @ \section{Example 11.11} \label{sec:xmp1111} <>= str(xmp11.11) xtabs(as.integer(humidity) ~ row + column, xmp11.11) xtabs(abrasion ~ row + column, xmp11.11) anova(fm1 <- aov(abrasion ~ row + column + humidity, xmp11.11)) model.tables(fm1, cterms = "humidity", type = "mean") TukeyHSD(fm1, which = "humidity") @ \appendix \chapter{Installing R and the Devore7 package} \label{app:Installing} \section{What is R?} \label{sec:WhatsR} R is a freely available, open source, computer system for statistical analysis and graphics. It can be downloaded from the main R information site \url{http://www.r-project.org}, from the Comprehensive R Archive Network (CRAN) site \url{http://cran.r-project.org}, or from any of the mirrors of that site. Those in the United States, for example, are encouraged to use the U.S. mirror \url{http://cran.us.r-project.org}. R provides facilities for data input and manipulation, for graphical and numerical summaries, for simulation and exploration of probability distributions, and for statistical analysis of data. It can be used for computing support for essentially all the topics in an introductory statistics course. This document describes how to use R for computing support in a course that uses the text \emph{Probability and Statistics for Engineering and the Sciences (7th edition)} by Jay Devore (Duxbury, 2008). Althought there are some limited graphical user interface (GUI) capabilites for R, it is basically a command-line system.~\footnote{Some GUI capabilities are provided the add-on packages \texttt{Rcmdr} (\url{http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/}) and \url{pmg} (\textit{http://www.math.csi.cuny.edu/pmg}).} We will concentrate on the command-line interface, showing what the user types and what R responds. We will refer to what the user types as a ``command'' although, technically, we should use the term ``function call''. \section{Obtaining and Installing R} \label{sec:GettingR} Binary versions of R are available for various operating systems including Microsoft Windows (Windows 95 or later), the Macintosh (OS X), and several Linux distributions. Complete source code for R is also available on the archives but it is quite unlikely that you will need to compile R for use with an introductory Statistics course. As most students use R with Windows we will provide more detailed installation instructions for this operating system. The current release of R is 2.4.1. For WIndows there is a binary installer file for R. It is approximately 30MB in size and can be found at \url{http://cran.r-project.org/bin/windows/base/}. If you have a fast network connection you should download and execute this file to install R. Without a network connection or with a slow network connection you will need to make other arrangements for obtaining a copy of this file. The installation should provide a desktop icon or menu item for R. Use one of these to start R. The program should display a welcome message and a prompt \texttt{"> "}. At this point you could use it as a calculator. Try, for example, <>= 2+2 @ \section{Quitting R} \label{sec:quitting} To quit from R you can either select \texttt{File -> Exit} from the menu bar or type <>= q() @ at the prompt, as indicated in the startup message. It is necessary to type the parentheses. That is, typing \texttt{q} by itself is not sufficient. Both of these methods will bring up a confirmation panel asking if you want to save the worksheet. In most cases you will not need to save the worksheet. \section{Using data sets} \label{sec:builtinData} A standard R installation provides several data sets that are used to demonstrate different techniques. The \texttt{data} command provides a list of these <>= data() @ You can obtain a description of a data set with the \texttt{help} command or with the short form for help which is \texttt{?} followed by the name. Try, for example, <>= help(pressure) ?pressure @ \section{What is Devore7?} \label{sec:WhatsDevore7} Notice that the description of the available data sets groups them into ``packages'' such as the ``base'' package, the ``modreg'' package, etc. Packages are groups of functions and data sets that extend the capabilities of R for specific purposes. The ``Devore7'' package provides the data sets for the 7th edition of Devore's engineering statistics text book. By installing and attaching this package you will be able to use the data sets from the examples and exercises in this text book without having to enter the data by hand. The Devore7 package also provides a ``vignette'' - a document that describe particular aspects of the use of R. This document is one the vignette from the Devore7 package. \section{Installing and attaching Devore7} \label{sec:Installing} Installing and attaching a package are two different operations. Installation involves downloading the package from a web site and installing the files on the local hard drive. It only needs to be done once. A package that has been installed can be attached to an R session after which the data sets will be available in the session. To install the Devore7 package on a computer with access to the internet, either use the command <>= install.packages('Devore7') @ or select \texttt{Packages -> Install package(s) from CRAN -> Devore7} from the menu bar. If you do not have access to the Internet you will need to obtain a copy of the zip file whose name begins with \texttt{Devore7} in the proper sub-directory \url{http://cran.r-project.org/bin/windows/contrib/}. (The exact name of the file changes as the package is updated but it will always begin with \texttt{Devore7} and end with \texttt{.zip}.) Use the menu selection \texttt{Packages -> Install package(s) from local zip files} to install the package from the zip file. We emphasize that it is only necessary to do the installation once. To attach the package to an R session use <>= library(Devore7) @ after starting R or select \texttt{Packages -> Load package -> Devore7} from the menu bar. You must attach the package every time you start R if you are to have access to the data sets from the textbook. \section{Names of the data sets} \label{sec:DSnames} Data sets for exercises are named \texttt{ex}\textit{cc.nn} where \textit{cc} is two-digit chapter number and \textit{nn} is the two-digit exercise number. Thus the data for exercise 27 in chapter 10 (p. 394) is called \texttt{ex10.27}. To provide the correct order when sorting the data set names, single-digit chapter or exercise numbers have a zero prepended. The data for exercise 1 in chapter 6 (p. 240) is called \texttt{ex06.01}. Data sets for examples in the text are named \texttt{xmp}\textit{cc.nn}. A listing of all the data sets in the package can be obtained with <>= data(package = 'Devore7') @ \section{Data sets as tables} \label{sec:DataTables} All the data sets in the \texttt{Devore7} package are in a tabular form called a \emph{data frame} in R. Rows correspond to observations and columns correspond to ``variables''. We use the tabular form even when there is only one variable. The columns have names, usually reflecting the description of the data from the exercise or the example, although names like \texttt{C1} also occur frequently. (That name happens to be the default name of the first column assigned by another statistical computing system called Minitab.) You can check the names and types of data in the data frame with \texttt{str}, which prints a concise summary of the structure of the data. <>= data(xmp01.02) str(xmp01.02) @ This shows that the data for example 1.2 (p. 5) consists of 27 observations of 1 variable called \texttt{strength}, which is a numeric variable. The first several data values are printed so you can check that they correspond to the values in the text. Most of the data sets discussed in chapters 1 to 8 are univariate (i.e. only one variable), numeric data like \texttt{xmp01.19}. There are a few examples of univariate, categorical data such as the health complaints discussed in exercise 1.29 (p. 24) <>= data(ex01.29) str(ex01.29) @ These data are a set of observations of a variable that can take on only limited set of values named \texttt{B} for back pain, \texttt{C} for coughing, etc. In R such data are said to be a \texttt{factor}. Some summary information about the variables in a data frame can be obtained with the \texttt{summary} function. <>= summary(xmp01.02) summary(ex01.29) @ For a numeric variable \texttt{summary} provides a `five-number' summary and the mean (making a total of 6 numbers in all). For a factor \texttt{summary} provides a frequency table. \section{Accessing individual variables} \label{sec:accessing} The \texttt{summary} function can be applied to entire data frames or to individual variables in a data frame. This is unusual. Most graphical or numerical functions apply to individual variables. There are two ways to access a variable from within a data frame: \begin{enumerate} \item Use the name of the data set and the name of the variable separated by \texttt{\$} \item \texttt{attach} the data frame and use the variable name by itself \end{enumerate} For example, the two ways to obtain the stem-and-leaf plot of the space shuttle launch ambient temperature data from example 1.1 are <>= data(xmp01.01) str(xmp01.01) stem(xmp01.01$temp) @ and <>= attach(xmp01.01) stem(temp) @ \section{Stacked data} \label{app:stacked} When storing multiple variables in an object there are two common choices. For instance suppose we had two variables <>= a = c(1,2,3,4) b = c(2,3,5,7) @ Then a data frame could have two columns, one for each <>= df = data.frame(a,b) df @ This is a common storage method, but doesn't work well for independent samples, as they need not have the same number of observations, hence won't fit well in a rectangular format. As an alternative, the data can be stacked end to end, with an extra value indicating which variable the data refers to. <>= stack(df) @ This storage method also works well when model formula are used. Stacked data frames may be unstacked with \texttt{unstack}. For more extensive stacking options, the \texttt{reshape} add-on package is available from your nearest CRAN site. \section{Finding help} \label{app:help} R has an extensive amount of online documentation available through the \texttt{help} function. \begin{description} \item[Help on a data set] Each data set in this package has a corresponding help page. For example, to view that page for the \texttt{xmp01.01} data set, one would enter the command \texttt{help(xmp01.01)}. This is more useful for data sets in other packages, as the ones in \texttt{Devore7} are created generically. To view all the data sets in a package, say the \texttt{Devore7} package, use the \texttt{package} argument: e.g., \texttt{data(package="Devore7")}. \item[Help on a function name] To find out more about a function, say the mean, the command \texttt{help(mean)} may be used. For many functions, the document writers have provided examples. The \texttt{example} function will execute the examples. \item[Seeing a vignette] Many R packages have accompanying vignettes describing how the package works. This document is an example. Vignettes may be read from R with the command \texttt{vignette}. For instance, \texttt{vignette("Devore7")}. To list all available vignettes, drop the vignette name: \texttt{vignette()}. \end{description} \end{document}