%\VignetteIndexEntry{zipfR Tutorial}
\documentclass[a4paper,12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage{mathrsfs}
\usepackage{amsmath,amssymb}
\usepackage{url}
\usepackage{graphicx}
\usepackage{natbib}
\usepackage[a4paper]{geometry}
\geometry{left=30mm, right=30mm, top=20mm, bottom=25mm}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% title, authors, running heads

\title{The \texttt{zipfR} package for lexical statistics:\\
  A tutorial introduction}
\date{3 October 2014\\
  \texttt{zipfR} version 0.6-7}
\author{Marco Baroni \and Stefan Evert}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% some macro definitions

%% ... \TODO{something missing here} ...
\newcommand{\TODO}[1]{{$\bigl[\![$\marginpar{\textbf{\small TODO}}\textit{#1}$]\!\bigr]$}}
%% ... \CITE{missing citation} ...
\newcommand{\CITE}[1]{{$\bigl[$\marginpar{\textbf{\small REF}}#1$\bigr]$}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% the main text

\begin{document}
\maketitle

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% set up Sweave engine

\setkeys{Gin}{width=7cm}
\SweaveOpts{width=6, height=6}
<<echo=FALSE>>=
options(useFancyQuotes=FALSE)
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% 1 Motivation
\section{Introduction}

This document provides a tutorial introduction to the zipfR package
\citep{Evert:Baroni:07} for Large-Number-of-Rare-Events (LNRE) modeling of
lexical distributions \citep{Baayen:2001}. We assume that R is installed on
your computer and that you have basic familiarity with it. If this is not the
case, please start by visiting the R page at \url{https://www.r-project.org/}.
The page provides links to download sites, documentation and introductory
material, as well as a wide selection of textbooks on R programming.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% 2 Installing zipfR
\section{Installing and loading zipfR}

The zipfR package is available on CRAN, the central R repository. If you are
using the RStudio IDE (or the native R graphical user interface under Mac OS X
or Windows), installing zipfR should be as easy as selecting Package Installer
from the Packages \& Data menu, getting a list of the CRAN binaries, looking
for zipfR, selecting it and clicking on the Install Selected button (there
might be small differences between operating systems and among R versions, but
the Package Installer is very user friendly, and the steps to take should
always be fairly obvious).

If you are using the command-line version of R on Linux or another Unix
platform, you can install zipfR with the command
<<eval=FALSE>>=
install.packages("zipfR")
@
provided that you have the required access permissions to install packages on
your computer. See \verb|?install.packages| for further information about
package installation options, or take a look at the instructions found in the
general R documentation.

Whenever you intend to access zipfR functionality during an R session, you
have to load the package first. If you are using a graphical user interface,
you can select Package Manager from the Packages \& Data menu (or similar) and
load the package through the graphical options.
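If you are not sure whether zipfR is already installed on your system, a quick
check from the R prompt might look as follows (a minimal sketch using standard
base R functions, not part of zipfR itself):
<<eval=FALSE>>=
## install zipfR only if it cannot be found in the library path
if (!requireNamespace("zipfR", quietly=TRUE)) install.packages("zipfR")
@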
The recommended way to load the package, which works both on the command line
and in the GUI, is as follows:
<<>>=
library(zipfR)
@
Once zipfR has been loaded, you can browse its documentation with the command:
<<eval=FALSE>>=
?zipfR
@

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% 3 Case study 1
\section{Case study 1: Estimating the vocabulary size of a productive process}

We illustrate the basic functionalities of the package by considering a
classic problem in quantitative linguistics, i.e., that of estimating the
number of distinct \emph{types} that belong to a certain category, given a
sample of \emph{tokens} from the relevant category.

Our example data come from the Italian word formation process of \emph{ri-}
prefixation (comparable, although not identical, to English \emph{re-}
prefixation). The data were extracted from a 380 million token corpus of
newspaper Italian \citep{Baroni:etal:2004}, and they consist of all verbal
lemmas beginning with the prefix \emph{ri-} in the corpus (extracted with a
mixture of automated methods and manual checking).

Morphology has been a traditional field of application of lexical statistics
(see, e.g., \citealp{Baayen:1992}), although the general problem of computing
the number of types (vocabulary size) in a population given a sample from the
population is also relevant to many other language-related fields, from
stylometry (estimating the vocabulary size of different authors) to syntax
(measuring the size and productivity of word classes and constructions) to
language acquisition (measuring the lexical richness of children or language
learners).

We will use the symbol $V$ for vocabulary size (number of types), $N$ for
sample size (number of tokens) and $V_{m}$ ($m$ an integer value) for the
number of types that have frequency $m$. In particular, $V_1$ is the number of
\emph{hapax legomena}, i.e., the number of types that occur only once in the
corpus.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Frequency spectra}

If you have an open R session and have loaded zipfR, you can access the
\emph{ri-} frequency data under the name \texttt{ItaRi.spc}. This object is a
\textbf{frequency spectrum}, the most important data structure in zipfR. A
frequency spectrum summarizes a frequency distribution in terms of the number
of types ($V_{m}$) per frequency class ($m$), i.e., it reports how many
distinct types occur once, how many types occur twice, and so on. For example,
the first few rows of the \texttt{ItaRi.spc} spectrum (which we can inspect by
typing the object name) are:
<<>>=
ItaRi.spc
@
Thus, we see that in our corpus there are 346 \emph{ri-} verbs that occur
once, 105 \emph{ri-} verbs that occur twice, 74 that occur three times, etc.

General information about the spc data structure can be found in the help
entry for \texttt{spc}. For information on reading your own frequency spectrum
into R from a TAB-delimited text file (and saving a spectrum object to an
output text file), take a look at the documentation for the \texttt{read.spc}
and \texttt{write.spc} functions.
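For instance, importing and re-exporting a spectrum might look like this (a
minimal sketch with hypothetical file names; \texttt{read.spc} and
\texttt{write.spc} are the functions just mentioned):
<<eval=FALSE>>=
## hypothetical file names; the input is a TAB-delimited text file
## with one row per frequency class (columns m and Vm)
my.spc <- read.spc("my_process.spc")
summary(my.spc)
write.spc(my.spc, "my_process_copy.spc")
@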
You can also import a type frequency list (a list of frequencies of the
linguistic entities of interest) and generate a spectrum from it: see the
documentation for the \texttt{read.tfl} and \texttt{tfl2spc} functions (while
we do not discuss it further here, the zipfR \textbf{type frequency list}
object is of fundamental practical importance, given that input data will
often be in the form of frequency lists; see \texttt{?tfl} for details and
examples).

The zipfR package provides various utility functions to explore the frequency
distribution encoded in a spectrum object. First, \texttt{summary} applied to
a spectrum data structure returns a useful sketch of the spectrum ($N$, $V$,
and the number of types in the first few frequency classes):
<<>>=
summary(ItaRi.spc)
@
%
The function \texttt{summary} can be applied to most zipfR data structures, to
obtain basic information about the structure contents.

The function \texttt{N} returns the sample size $N$, and \texttt{V} returns
the vocabulary size $V$:
<<>>=
N(ItaRi.spc)
V(ItaRi.spc)
@
%
You can use \texttt{N} and \texttt{V} with any zipfR object for which $N$ and
$V$ constitute relevant information (although you might get different types of
information: see the documentation for these functions).

The function \texttt{Vm} returns the number of types with frequency $m$; if
the argument specifying $m$ is a vector, it returns a vector containing the
number of types for the specified range of spectrum elements:
<<>>=
Vm(ItaRi.spc, 1)
Vm(ItaRi.spc, 1:5)
@
%
You can use \texttt{Vm} and \texttt{N} to compute the famous $\mathscr{P}$
measure of productivity proposed by Harald Baayen (see, e.g.,
\citealp{Baayen:1992}). $\mathscr{P}$ is obtained by dividing the number of
hapax legomena by the overall sample size:
<<>>=
Vm(ItaRi.spc, 1) / N(ItaRi.spc)
@
%
It is informative to plot a frequency spectrum, with $m$ on the $x$ axis and
$V_{m}$ on the $y$ axis. The zipfR package provides a special \texttt{plot}
method for spectrum objects, implementing two different visualization styles:
<<eval=FALSE>>=
plot(ItaRi.spc)
plot(ItaRi.spc, log="x")
@
\begin{center}
  \begin{tabular}{cc}
<<fig=TRUE, echo=FALSE>>=
plot(ItaRi.spc)
@
    &
<<fig=TRUE, echo=FALSE>>=
plot(ItaRi.spc, log="x")
@
  \end{tabular}
\end{center}
%
As the left figure shows, applying the function \texttt{plot} to a zipfR
spectrum with no other arguments produces a histogram of the first 15
frequency classes in the spectrum. If the argument \texttt{log="x"} is passed,
we obtain by default a plot of the first 50 spectrum elements with the $x$
(i.e., $m$) axis on a logarithmic scale, as illustrated in the right figure.

You can also visualize the frequency spectrum with a standard R scatterplot by
providing $x$- and $y$-coordinates as separate arguments:
\begin{center}
  \setkeys{Gin}{width=14cm}
<<fig=TRUE>>=
with(ItaRi.spc, plot(m, Vm, main="Frequency Spectrum"))
@
\end{center}
%
This figure shows why the zipfR \texttt{plot} method does not display the full
range of frequency classes. A spectrum is often characterized by very high
values corresponding to the lowest frequency classes, and a very long tail of
frequency classes with only one member (i.e., just one word with frequency
100, just one word with frequency 103, etc.). Thus, a full spectrum plot on a
non-logarithmic scale will always have the rather uninformative L-shaped
profile seen above.

The zipfR functions try to provide reasonable default values for all
parameters that are not specified in a function call, making it very easy to
obtain basic plots of the desired data structures.
The defaults can of course be overridden. For example, you can pass a
different title through the \texttt{main} argument, or use \texttt{xlab} and
\texttt{ylab} to change the axis labels. For more information on the plotting
parameters, look at the help for \texttt{plot.spc}. In case you need even
finer control over the parameters, a zipfR spectrum (like most other zipfR
data structures) is also typed as a regular R data frame, which means that you
can access its contents with standard R syntax (\texttt{ItaRi.spc\$m},
\texttt{ItaRi.spc\$Vm}, or using the shorthand \texttt{with()} as in the last
plot above). Keep in mind, however, that the results of directly accessing the
data structure might not be what you expect. E.g., \texttt{ItaRi.spc\$Vm[50]}
does not necessarily return the value of $V_{50}$, but the value of the 50th
\emph{non-zero} frequency class! Use \texttt{Vm(ItaRi.spc, 50)} to read out
$V_{50}$ reliably.

Once you have generated a plot, you can export it to a PDF file or various
other formats using standard R functionality such as the
\texttt{dev.copy2pdf} function (or, if you are using a GUI, by selecting Save
As\ldots{} from the File menu).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Vocabulary growth curves}

Frequency spectrum plots provide valuable information about the nature of a
process: if, as is the case with \emph{ri-}, a process is characterized by a
high proportion of hapax legomena and other low frequency classes, this
indicates that the process is productive, i.e., that if we were to sample more
tokens of the same category, chances are high that we would encounter new
types.

In order to develop an intuition about how rapidly vocabulary size is growing,
another type of data structure (and associated plot) is more informative,
i.e., the \emph{vocabulary growth curve} (VGC). A vocabulary growth curve
reports vocabulary size (number of types, $V$) as a function of sample size
(number of tokens, $N$).

The data necessary to plot an \emph{empirical} vocabulary growth curve (we
discuss \emph{interpolated} and \emph{extrapolated} VGCs below) cannot be
derived from a single frequency spectrum, since the spectrum only provides
information about a single sample size, and we do not know how ``we got
there''. Consider, for example, the two ``corpora'' \texttt{a b a b a b a b}
and \texttt{a a a a b b b b}: they have the same frequency spectrum, but
different vocabulary growth curves.

Vocabulary growth data can be imported into a \textbf{vocabulary growth curve}
(\texttt{vgc}) data structure, which contains $V$ values for increasing $N$s.
Moreover, the zipfR \texttt{vgc} data structure can have optional columns for
$V_1$ to $V_9$, reporting the number of types in the corresponding frequency
classes at the specified $N$s (how many distinct types occur once, how many
types occur twice, etc.). The most interesting frequency class is typically
that of hapax legomena; thus, we often find ourselves working with VGCs that
contain the fields $N$, $V$ and $V_1$.

As a case in point, you can access the empirical VGC for the \emph{ri-} data
under the name \texttt{ItaRi.emp.vgc}.
The first few rows of this object (inspected with the handy function
\texttt{head}) are:
<<>>=
head(ItaRi.emp.vgc)
@
%
This indicates that, after the first 1,000 \emph{ri-} tokens in the target
corpus, we saw 140 distinct \emph{ri-} types, 62 of them having occurred only
once at that point; after the first 2,000 tokens, we saw 178 distinct types,
58 of them being hapax legomena at that point; and so on. For instructions on
how to import your VGCs from TAB-delimited text files and more information
about this data structure, browse the documentation for \texttt{read.vgc} and
\texttt{vgc}.

If you simply print the \texttt{vgc} object, a random selection of 25 rows
will be shown to give you a general overview of the vocabulary development
(not displayed here for reasons of space).
<<eval=FALSE>>=
ItaRi.emp.vgc
@
From the corresponding summary, you can see how many samples are included in
the vocabulary growth curve:
<<>>=
summary(ItaRi.emp.vgc)
@
%
A vocabulary growth plot (with $V$ and $V_1$ curves) can be created with the
following command. By specifying \texttt{add.m=1}, we ask for $V_1$ to be
plotted as well (it appears as a thinner line below $V$).
\begin{center}
  \setkeys{Gin}{width=14cm}
<<fig=TRUE>>=
plot(ItaRi.emp.vgc, add.m=1)
@
\end{center}
%
More information about plotting VGCs is available on the \texttt{plot.vgc}
help page.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Interpolation}
\label{interpolation}

An \emph{empirical} growth curve such as the one we just plotted is typically
not very smooth, as it reflects all the quirks due to the non-random
distribution of words and texts in a corpus. A smoother curve can be obtained
with the technique of \emph{binomial interpolation}
\citep[ch.~2]{Baayen:2001}. Given a frequency spectrum, binomial interpolation
produces the \emph{expected values} of vocabulary size (and number of types in
specific frequency classes -- e.g., number of hapax legomena) for arbitrary
sample sizes (smaller than or equal to the sample size at which the spectrum
was computed). These expected values can be thought of as the average of the
vocabulary size (or other measures) computed over a large number of
randomizations of the order of tokens in the corpus. Notice that expected
values, unlike observed counts, are not necessarily integers.

We can obtain an interpolated vocabulary growth curve even when we only have a
spectrum or frequency list. For example, we can compute an interpolated growth
curve from the \emph{ri-} spectrum as follows:
<<>>=
ItaRi.bin.vgc <- vgc.interp(ItaRi.spc, N(ItaRi.emp.vgc), m.max=1)
head(ItaRi.bin.vgc)
@
%
Besides the spectrum, \texttt{vgc.interp} requires as argument a vector of
sample sizes at which the interpolated $V$ values should be computed. In this
case, we use the same sample sizes that are contained in our empirical VGC
object. Moreover, we use \texttt{m.max=1} to request $V_1$ estimates as well
(see the \texttt{vgc.interp} documentation for more information).

You can plot the expected $V$ and $V_1$ growth curves exactly as shown above
for the empirical curves (try it).
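For instance, a minimal sketch, analogous to the plot of the empirical curve
above:
<<eval=FALSE>>=
## interpolated V and V_1 curves (V_1 is available because we used m.max=1)
plot(ItaRi.bin.vgc, add.m=1)
@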
Here, we illustrate how to use the \texttt{plot} function to compare multiple
$V$ growth curves -- specifically, the empirical and expected \emph{ri-} VGCs
(but the same method can be used to compare VGCs of different processes):%
\footnote{Note that the \texttt{+} sign at the start of the second line
  indicates that the full command does not fit into a single line and has
  spilled over to the following line (\texttt{+} is the prompt that R uses
  when you have entered an incomplete command). Simply type the entire command
  (with all continuation lines) on a single line when you run these examples.}
\begin{center}
  \setkeys{Gin}{width=14cm}
<<fig=TRUE>>=
plot(ItaRi.emp.vgc, ItaRi.bin.vgc,
     legend=c("observed", "interpolated"))
@
\end{center}
%
This command produces a color plot. For a black and white version, simply add
the option \texttt{bw=TRUE}.

Interpolated curves look smoother, abstract away from fluctuations in the
original development profile, and can be computed directly from frequency
spectra. However, if the relevant data are available, it is a good idea to
always take a look at the empirical curves as well, as they might reveal the
presence of strong non-randomness patterns in the data, which would invalidate
the assumptions underlying statistical model estimation. Indeed, the different
shapes of the empirical and expected curves for the \emph{ri-} data should be
the source of some mild concern about the validity of randomness assumptions
even in this case.

Similarly to spectrum objects, VGCs are also typed as regular R data frames,
so that their contents can be accessed with standard R syntax if finer control
over plotting and analysis parameters is needed.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Estimating $V$ and other quantities at arbitrary sample sizes}

The VGC for \emph{ri-} shows that, had we sampled a smaller number of
\emph{ri-} tokens, our dataset would have contained a smaller number of types.
For example, according to the empirical VGC data, $V$ amounts to 761 after the
first 500,000 tokens, vs.\ 1,098 for the whole dataset:%
\footnote{Note that this command only works because the vocabulary growth
  curve for \emph{ri-} happens to contain a data point for $N = 500000$.}
<<>>=
V(ItaRi.emp.vgc)[N(ItaRi.emp.vgc) == 500000]
V(ItaRi.spc)
@
%
The shape of the VGC also strongly suggests that, if we were to keep sampling
\emph{ri-} words, we would keep encountering new types, and the vocabulary
size would increase further. Thus, it is clear that the vocabulary size $V$ is
not a stable quantity, but that it increases as sample size increases.
Consequently, the $V$ in our sample cannot be taken as a reliable estimate of
the overall $V$ in the \emph{population} we are sampling from (in our case,
the population of all possible \emph{ri-} prefixed verbs in Italian), nor of
$V$s in smaller and larger samples.

We already saw how binomial interpolation can be used to estimate $V$ for
sample sizes smaller than $N$, the sample size at which the frequency spectrum
was computed. In order to \emph{extrapolate} $V$ to larger samples (up to the
whole population), we need to resort to parametric statistical models for the
distribution of frequencies in the population (see
\citealp{Evert:Baroni:07,Evert:2004}; and see \citealp[ch.~2]{Baayen:2001} on
why non-parametric extrapolation from the observed frequency spectrum is
problematic).
Parametric models appropriate for word frequency distributions (and
distributions of other linguistic types) belong to the family of
Large-Number-of-Rare-Events (LNRE) models. These are implemented in zipfR as
\textbf{LNRE model} objects. Currently, the toolkit supports three classes of
LNRE models: Generalized Inverse Gauss-Poisson (\texttt{lnre.gigp};
\citealp[ch.~4]{Baayen:2001}), Zipf-Mandelbrot (\texttt{lnre.zm};
\citealp{Evert:2004}) and finite Zipf-Mandelbrot (\texttt{lnre.fzm};
\citealp{Evert:2004}). Mathematical details are provided in the relevant
references, whereas more information about the zipfR implementations is
available in the \texttt{lnre} and \texttt{lnre.*} documentation entries. In
future releases we might add support for the other LNRE models introduced by
\citet{Baayen:2001}, although the tests of \citet{Baroni:Evert:05} suggest
that the models currently implemented in the package clearly outperform the
others.

Coming back to our \emph{ri-} data, we will now use them to estimate an fZM
model. We call the \texttt{lnre} function with the string \texttt{"fzm"} and
the \emph{ri-} spectrum as arguments (the same syntax can be used with the
other models, substituting \texttt{"fzm"} with \texttt{"zm"} or
\texttt{"gigp"}, respectively). The function automatically determines suitable
values for the three parameters of an fZM model by fitting the \emph{expected}
frequency spectrum of the model to the \emph{observed} spectrum for
\emph{ri-}. Technically, this is achieved by non-linear minimization of a cost
function that quantifies the ``distance'' between observed and expected
spectrum, but you don't have to worry about such details right now. The
command you need to enter is:
<<>>=
ItaRi.fzm <- lnre("fzm", ItaRi.spc, exact=FALSE)
@
%
We specified the name of the model we want to employ (here, fZM), the spectrum
to be used to estimate parameters and, by setting \texttt{exact=FALSE}, we
allowed approximations in the calculation of expected values (which improves
performance and numerical stability, but may lead to inaccurate results under
certain conditions; see the documentation of the \texttt{lnre} function for
details). You might try now to re-estimate the model without the
\texttt{exact=FALSE} option (noting that the estimation procedure takes
considerably longer to complete).

As usual, \texttt{summary} provides you with a sketch of the object you just
created (you can also simply type the LNRE object name):
<<>>=
summary(ItaRi.fzm)
@
%
The \texttt{summary} function applied to an LNRE model returns the parameters
of the model and other useful information. In the case of an fZM model, this
includes the number of types in the population ($S$) and a comparison of
observed and expected values for the vocabulary size and the first five
spectrum elements. Next, the \texttt{summary} reports the results of a
multivariate chi-squared test used to measure goodness of fit (see
\citealp[section 3.3]{Baayen:2001}). The lower the chi-squared statistic (and
the higher the $p$-value), the better the fit. Based on our experience (see,
e.g., \citealp{Evert:2004}), the goodness of fit reported in this case is
relatively good (although, in absolute terms, we would of course like to see
lower chi-squared values and higher $p$'s).

The fit to the observed spectrum can also be visualized with a comparative
spectrum plot.
First, we produce the spectrum of expected frequencies predicted by the fZM
model at the sample size we used to compute the model (i.e., the whole-dataset
sample size):
<<>>=
ItaRi.fzm.spc <- lnre.spc(ItaRi.fzm, N(ItaRi.fzm))
@
Now, we plot the observed and expected spectra:
\begin{center}
<<fig=TRUE>>=
plot(ItaRi.spc, ItaRi.fzm.spc, legend=c("observed", "fZM"))
@
\end{center}
%
The fZM model can now be used to obtain estimates of $V$ and $V_{m}$ (the
spectrum elements) at arbitrary sample sizes. For example, we can generate a
VGC of expected $V$s up to an $N$ of 2.8 million (about twice the size of the
\emph{ri-} sample) with the following command (notice the trick we use to
request 100 equally spaced estimates of $V$ up to a sample size of 2.8
million):
<<>>=
ItaRi.fzm.vgc <- lnre.vgc(ItaRi.fzm, (1:100) * 28e3)
@
%
It is worth mentioning that the function \texttt{lnre.vgc} can also provide
variance estimates for the vocabulary size (and the spectrum elements, when
relevant), via the \texttt{variances=TRUE} option. These could then be used to
plot confidence intervals around the growth curves. However, in our
experience, for many real-life datasets these intervals will be so narrow as
to be visually indiscernible from the curves (though not in the case of
\emph{ri-}). We do not discuss these quantities here, but see the package
documentation (\texttt{lnre.vgc}, \texttt{plot.vgc}, etc.) as well as
\citet[ch.~3]{Baayen:2001}.

We now plot the extrapolated VGC together with the empirical VGC as follows:
\begin{center}
  \setkeys{Gin}{width=14cm}
<<fig=TRUE>>=
plot(ItaRi.emp.vgc, ItaRi.fzm.vgc, N0=N(ItaRi.fzm),
     legend=c("observed", "fZM"))
@
\end{center}
%
Notice the use of the argument \texttt{N0} to highlight the position of $N_0$
(the sample size used for model estimation) with a vertical line (see the
\texttt{plot.vgc} documentation).

As an exercise, you should now use the \emph{ri-} data to estimate ZM and GIGP
models and look at the model summaries (it is instructive to compare the
predictions of the three models for the population vocabulary size $S$). Then
add the expected values of the ZM and GIGP models to the comparative spectrum
and VGC plots.

Parameter estimation is one of the most difficult aspects of LNRE modeling. We
have made great efforts to implement a robust and efficient estimation
procedure for each LNRE model, so in most cases you can conveniently rely on
the default settings. Sometimes parameter estimation will fail, however, or
produce an unsatisfactory fit to the observed frequency spectrum (as you will
be able to tell from the summary or comparative spectrum plot). In these
cases, you can use three optional arguments of the \texttt{lnre} function to
fine-tune the estimation procedure. The \texttt{cost} option allows you to
choose from a range of cost functions, while \texttt{m.max} determines how
many spectrum elements are included in the calculation of the selected cost
function. The \texttt{method} option offers several different algorithms for
minimization of the cost function. It can be interesting to look at summaries,
comparative spectrum plots and comparative vocabulary growth curves of
different LNRE models, as well as of the same model with different settings
for the parameter estimation procedure (and, of course, to test parameter
estimation for other data sets included in the \texttt{zipfR} package). See
the \texttt{lnre} help page for detailed information about available options
and a long list of examples.
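For instance, a hypothetical re-estimation of the \emph{ri-} model with
non-default settings might look as follows (a sketch for illustration only;
the particular values of \texttt{m.max} and \texttt{method} are arbitrary
choices, and \texttt{?lnre} lists the admissible values of the \texttt{cost}
argument):
<<eval=FALSE>>=
## illustrative sketch: restrict the cost function to the first 10 spectrum
## elements and switch to the NLM minimization algorithm
ItaRi.fzm.alt <- lnre("fzm", ItaRi.spc, m.max=10, method="NLM", exact=FALSE)
summary(ItaRi.fzm.alt)
@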
(Technical information for developers can be found on the
\texttt{lnre.details} and \texttt{estimate.model} help pages.)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Evaluating extrapolation quality}

Although comparison with the empirical curve allows a visual assessment of the
goodness of fit of the \emph{interpolated} values (the expected $V$s up to the
observed sample size $N$), we have no way to visually assess the quality of
extrapolation beyond $N$, since we do not have observed values at sample sizes
larger than $N$. However, if we estimate the parameters of an LNRE model from
a subset of the data we have (i.e., using only $N_0$ tokens, where $N_0 < N$),
we can then compare the extrapolation of this model up to our maximum observed
sample size $N$ to the empirical or interpolated growth curve up to $N$ (this
idea is explored in detail by \citealp{Baroni:Evert:05} and
\citealp{Baroni:Evert:07}).

If we have access to the original data, we can of course collect a spectrum or
type frequency list from the first $N_0$ tokens (outside R), and import these
data. However, here we will generate a spectrum from a random subsample of the
\texttt{ItaRi.spc} spectrum at $N$. This allows us to illustrate another
functionality of zipfR, i.e., the possibility of building the frequency
spectrum of a random (sub)sample from an available frequency spectrum. In
particular, with the following commands we create a subspectrum from a random
sample of $N_0 = 700{,}000$ tokens, about half the observed size.%
\footnote{Note that we set the random number seed in order to obtain
  reproducible results. If you omit the \texttt{set.seed} command, or set the
  seed to a different value, your subsample and the resulting LNRE model will
  be slightly different.}
%
<<>>=
set.seed(42)
ItaRi.sub.spc <- sample.spc(ItaRi.spc, N=700000)
@

We can now estimate an LNRE model from this subspectrum:%
\footnote{If parameter estimation fails (as has been observed in one of our
  test runs), try omitting the \texttt{exact=FALSE} option, specifying
  \texttt{method="NLM"}, or using a different cost function.}
<<>>=
ItaRi.sub.fzm <- lnre("fzm", ItaRi.sub.spc, exact=FALSE)
ItaRi.sub.fzm
@
%
In this case, the parameters of the model are similar, but not very close, to
those we obtained when using all the available data, and the estimated
population size has increased considerably, illustrating an undesirable
dependency of LNRE model estimation on sample size. Moreover, goodness of fit
is lower than for the model estimated from all the available data.

We generate a VGC up to the original $N$ from the \texttt{ItaRi.sub.fzm}
model:
<<>>=
ItaRi.sub.fzm.vgc <- lnre.vgc(ItaRi.sub.fzm, N=N(ItaRi.emp.vgc))
@
%
Given that we took a random subsample, it is more appropriate to compare the
resulting VGC to one that is binomially interpolated from the full spectrum
rather than to the empirical VGC (we generated \texttt{ItaRi.bin.vgc} in
Sec.~\ref{interpolation} above):
\begin{center}
  \setkeys{Gin}{width=14cm}
<<fig=TRUE>>=
plot(ItaRi.bin.vgc, ItaRi.sub.fzm.vgc, N0=N(ItaRi.sub.fzm),
     legend=c("interpolated", "fZM"))
@
\end{center}
%
The plot indicates that, despite the problems we mentioned above, the fZM
model provides a very reasonable match to the interpolated $V$ curve (it is
hard to tell the two curves apart). Of course, the issue of evaluating the
quality of LNRE models is more complex than what we illustrated here.
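A more systematic check would repeat this subsampling experiment several times
and average the results. The following sketch is purely illustrative (the
number of runs, the subsample size and the use of a simple mean are arbitrary
choices, individual runs may occasionally fail to converge, and the function
\texttt{EV}, which returns the expected vocabulary size of a model at a given
sample size, is only introduced in the second case study):
<<eval=FALSE>>=
## average the extrapolated V at the full sample size over random subsamples
n.runs <- 10
V.extrap <- numeric(n.runs)
for (i in 1:n.runs) {
  sub.spc <- sample.spc(ItaRi.spc, N=700000)    # random subspectrum
  sub.fzm <- lnre("fzm", sub.spc, exact=FALSE)  # fZM model estimated from it
  V.extrap[i] <- EV(sub.fzm, N(ItaRi.spc))      # expected V at the original N
}
mean(V.extrap)
@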
At the very least, one should also look at models estimated from the first
$N_0$ tokens, on top of those obtained from a random subsample, and at
averages of models estimated from multiple random subsamples, as in the sketch
above. The latter requires basic R programming skills, to automate the
iterative estimation, VGC generation and plotting procedure. We hope that
future versions of the zipfR toolkit will feature batch estimation and
plotting functions, to facilitate the process of running multiple
randomization experiments.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Comparing vocabulary growth curves of different categories}

For expository purposes, we focused here on the frequency distribution of a
single class (\emph{ri-} prefixed verbs in Italian). Of course, it is
typically more interesting to look at multiple frequency distributions, e.g.,
by comparing their VGCs in order to determine which of the classes under
analysis is more productive. The zipfR plotting functions make such
comparisons easy, by accepting multiple \texttt{vgc} (or \texttt{spc}) data
structures as input, and automatically determining the best graphic parameters
to plot them together.

As an example, we can compare the \emph{ri-} data with the data for another
Italian prefix, adjectival \emph{ultra-} (also extracted from the \emph{la
  Repubblica} corpus with similar methods, and available under the name
\texttt{ItaUltra.spc}). At first sight, one might think that \emph{ultra-} is
less productive than \emph{ri-}, given that its sample vocabulary size $V$ is
half that of \emph{ri-}:
<<>>=
V(ItaUltra.spc)
V(ItaRi.spc)
@
%
However, the \emph{ultra-} sample is much smaller, making a direct comparison
meaningless:
<<>>=
N(ItaUltra.spc)
N(ItaRi.spc)
@
%
In order to compare the two word formation processes, we can compute an LNRE
model from the \emph{ultra-} data and then use it to extrapolate the
\emph{ultra-} growth curve up to the size of the \emph{ri-} sample:
<<>>=
ItaUltra.fzm <- lnre("fzm", ItaUltra.spc, exact=FALSE)
ItaUltra.ext.vgc <- lnre.vgc(ItaUltra.fzm, N(ItaRi.emp.vgc))
@
\begin{center}
  \setkeys{Gin}{width=14cm}
<<fig=TRUE>>=
plot(ItaUltra.ext.vgc, ItaRi.bin.vgc, legend=c("ultra-", "ri-"))
@
\end{center}
%
Interestingly, the extrapolation suggests that \emph{ultra-}, while occurring
more rarely, has the potential to generate many more words than the already
productive \emph{ri-} process.

You should now try to apply some of the functions illustrated here to your own
datasets (all you need is a frequency list or a frequency spectrum for the
process/class you are interested in), or to some of the other example datasets
that come with the zipfR package. You can find out about the available
datasets by typing:
<<eval=FALSE>>=
data(package="zipfR")
@
You can then obtain more information about a specific dataset by typing its
name preceded by a question mark, e.g.:
<<eval=FALSE>>=
?ItaRi.spc
@
%
As we mentioned at the beginning of this case study, while word formation has
been a classic area of application of word frequency distribution modeling,
techniques to estimate vocabulary size and related quantities can find
applications in any area of linguistics where it makes sense to speak of a
vocabulary of types, and where the overall size of this vocabulary is much
larger than the size of the vocabulary observed in our sample.
For example, we could compare verbs and adjectives (to assess whether verb
formation is more productive than adjective formation -- incidentally, you can
try this with the Brown corpus verb and adjective datasets available in the
zipfR package), or word pairs that occur in two different constructions (to
assess which construction is more productive), and so on and so forth. With a
bit of creativity in the definition of the target type classes, vocabulary
statistics modeling techniques can be applied to a very wide range of
linguistic problems.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% 4 Case study 2
\section{Case study 2: Estimating lexical coverage}

The first case study showed how to apply frequency distribution analysis to a
problem of (theoretical) linguistic interest, i.e., measuring the
productivity/vocabulary size of a word formation process. In this second case
study we discuss a potential application of the same tools to a practical
problem in language engineering. This case study is slightly more involved and
requires (just a little bit) more R skill. If you feel that you've already
learned enough about zipfR to run your own experiments, feel free to skip to
the last section.

Many NLP algorithms rely heavily on specialized lexical resources; as a
consequence, their performance on Out-Of-Vocabulary (OOV) types is lower than
on words/types that are stored in the application lexicon. Among the
language-related technologies that crucially rely on a lexical resource, we
might mention part-of-speech tagging and lexicalized probabilistic
context-free grammars (see, e.g., \citealp{Jurafsky:Martin:2000}). It is
therefore important to be able to estimate the proportion of OOV words/types
that we will encounter, given a lexicon of a certain size, or, from a
different angle, to determine how large our lexicon should be in order to keep
the OOV proportion below a certain threshold.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Estimating the proportion of OOV types and tokens given a
  fixed-size lexicon}

Consider the case in which we are developing an application that will rely on
a lexicon, and we are using the first 100k lemma tokens from the Brown corpus
\citep{Kucera:Francis:1967} as our development resource. The frequency
spectrum of this subset of the Brown corpus is part of the zipfR example data
sets and, if the zipfR library is loaded, it can be accessed under the name
\texttt{Brown100k.spc}. The vocabulary size of this data set is:
<<>>=
V(Brown100k.spc)
@
%
We decide that we can afford to write lexical entries of the kind required by
our application for all the lemmas that occur at least twice in our dataset,
i.e., our lexicon of seen types will have the following size (obtained by
subtracting the hapax type count from the overall vocabulary size):
<<>>=
Vseen <- V(Brown100k.spc) - Vm(Brown100k.spc, 1)
Vseen
@
%
In this way, we will cover about 50\% of the \emph{types} in our development
set, whereas the other 50\% will count as OOV types:
<<>>=
Vseen / V(Brown100k.spc)
@
%
However, given that the OOV lemmas are hapax legomena, they actually account
for a much lower proportion of the overall \emph{tokens}:
<<>>=
Vm(Brown100k.spc, 1) / N(Brown100k.spc)
@
%
That is, given our choice to insert in the lexicon all lemmas that occur at
least twice in the development set, we cover about 94\% of the tokens in the
development set itself, while the remaining 6\% of the tokens belong to OOV
types.
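The same bookkeeping can be carried out for any frequency threshold. The
following sketch (our own illustration; the threshold \texttt{k} and the
variable names are not part of zipfR) computes type and token coverage on the
development set if we write lexicon entries for all lemmas occurring at least
$k$ times:
<<>>=
k <- 2  # lexicalize all lemmas occurring at least k times (k >= 2 assumed)
Vseen.k <- V(Brown100k.spc) - sum(Vm(Brown100k.spc, 1:(k-1)))
Nseen.k <- N(Brown100k.spc) - sum(Vm(Brown100k.spc, 1:(k-1)) * (1:(k-1)))
Vseen.k / V(Brown100k.spc)  # type coverage
Nseen.k / N(Brown100k.spc)  # token coverage
@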
However, what will happen when we use our application on a larger input? What
proportion of OOV types and tokens should we expect? To answer this question,
we estimate an LNRE model from the available spectrum, we extrapolate to
larger $N$s via the model, and we check what proportion of the types and
tokens can at best be covered by our $V_{seen}$ entries at these larger $N$s.
Of course, we have to make the very delicate assumption that the input that
our application will see is very similar to the 100k Brown fragment we used as
our development set; i.e., that larger texts fed to the application are larger
random samples of the population we sampled our development set from (indeed,
here we also assume that the development fragment itself is always part of the
larger input). The assumption might be reasonable if, e.g., our application is
targeted at automatically annotating the remaining 900k tokens from the Brown,
and for some reason we do not have access to them during development (this
sounds silly in the case of the Brown, but it might be a realistic scenario
when the corpus we are working on is still in the construction phase); or if
the application will be used to parse a certain specialized language, and the
development set is made of texts from that language.

In this case study, we estimate a ZM model from our data:
<<>>=
Brown100k.zm <- lnre("zm", Brown100k.spc)
Brown100k.zm
@
%
Notice that $S$ (the population vocabulary size) is infinite, since the ZM
model, unlike fZM, does not assume a finite population. Notice also that the
multivariate chi-squared test suggests a very poor fit. This seems to be a
general property of ZM \citep{Evert:2004}, and one that, interestingly, does
not always imply poor extrapolation quality
\citep{Baroni:Evert:05,Baroni:Evert:07}.

We can use the model we just estimated to generate expected $V$s at arbitrary
$N$s, in order to calculate the proportion of OOV types we are likely to
encounter. For example, we can see what this proportion will be if the input
contains 1, 10, or 100 million tokens (notice the function \texttt{EV} that,
given a model, produces the expected $V$ values for a vector of $N$s):
<<>>=
EV(Brown100k.zm, c(1e6, 10e6, 100e6))
@
%
Thus, the proportions of in-lexicon types for inputs of 1, 10, or 100 million
tokens are as follows (assuming that our development set is a subset of the
input, so that all the types in our lexicon occur in the input):
<<>>=
Vseen / EV(Brown100k.zm, c(1e6, 10e6, 100e6))
@
%
Equivalently, the proportions of OOV types are:
<<>>=
1 - (Vseen / EV(Brown100k.zm, c(1e6, 10e6, 100e6)))
@
%
We can get a sense of how good the estimates are by looking at the frequency
spectrum for the whole Brown corpus (\texttt{Brown.spc}), and comparing the
number of types in the whole corpus to the expected number of types at this
sample size according to the model. The observed $N$ and $V$ are:
<<>>=
N(Brown.spc)
V(Brown.spc)
@
The model-based estimate of $V$ at the same $N$ is:
<<>>=
EV(Brown100k.zm, N(Brown.spc))
@
%
This is reasonably close to the observed value (although it overestimates the
true vocabulary growth), and consequently the empirical and estimated OOV
ratios are quite close:
<<>>=
1 - (Vseen / V(Brown.spc))
1 - (Vseen / EV(Brown100k.zm, N(Brown.spc)))
@
%
Let us now move on to the probably more interesting question of estimating the
proportion of OOV \emph{tokens} in larger samples, given our lexicon of 6,477
types. First, we generate a model-derived spectrum at the desired $N$.
From this, we obtain the expected $V$ estimate and, by subtracting $V_{seen}$
from $V$, an estimate of the number of OOV types ($V_{OOV}$). At this point,
we make the best-case (but not completely unreasonable) assumption that the
OOV types will always be less frequent than the types that are already in our
lexicon. Thus, we first look at the number of hapax legomena in the estimated
spectrum. If this value is below $V_{OOV}$, we also add the number of types
occurring twice, three times, etc., until we reach a value close to
$V_{OOV}$. Once we have determined (roughly) up to which frequency class the
OOV types extend, we can calculate their overall token frequency by
multiplying the number of types in each of the OOV classes by the
corresponding frequency and summing up (to get the number of hapax legomenon
tokens, we multiply the number of hapax types by 1; to get the number of
tokens of types that occur twice, we multiply the number of twice-occurring
types by 2; and so on -- finally, to get the overall token frequency of the
classes involved, we sum all these values).

We estimate the spectrum from our ZM model at $N$ = 1,006,770, so that the
result will be directly comparable to the observed spectrum from the whole
Brown:
<<>>=
Brown.zm.spc <- lnre.spc(Brown100k.zm, N(Brown.spc))
@
%
By default, \texttt{lnre.spc} generates the first 100 spectrum elements
(ideally, we would like to generate a full frequency spectrum: however,
certain LNRE models -- in particular GIGP -- tend to run into numerical
problems with higher spectrum elements; hence the conservative default). The
estimate of $V_{OOV}$ is obtained with:%
\footnote{Notice that we cannot extract the expected $V$ from the estimated
  spectrum, since the latter does not contain all frequency classes.}
<<>>=
EV(Brown100k.zm, N(Brown.spc)) - Vseen
@
%
Starting from the lowest expected frequency class, let us sum types until we
reach a value close to 50,293:
<<>>=
sum(Vm(Brown.zm.spc, 1))
sum(Vm(Brown.zm.spc, 1:2))
@
and so on, until
<<>>=
sum(Vm(Brown.zm.spc, 1:6))
@
%
Given that we are working with rough estimates anyway, we take the sum of the
types in the lowest 6 frequency classes to be close enough to the estimated
$V_{OOV}$. Thus, to calculate the number of OOV tokens we multiply the types
in each of these classes by the corresponding frequencies and sum:
<<>>=
Noov.zm <- sum(Vm(Brown.zm.spc, 1:6) * (1:6))
Noov.zm
@
In proportion:
<<>>=
Noov.zm / N(Brown.spc)
@
%
That is, about 7.5\% of the overall tokens belong to OOV types. Let us compare
this to the same quantity computed from the full observed frequency data (the
spectrum of frequencies in the whole Brown). First, we need to determine
$V_{OOV}$ for the full Brown corpus:
<<>>=
V(Brown.spc) - Vseen
@
%
Approximating this value from the low end of the spectrum leads to:
<<>>=
sum(Vm(Brown.spc, 1:13))
@
%
Thus, the $N_{OOV}$ estimate from the observed frequency spectrum is:
<<>>=
Noov.emp <- sum(Vm(Brown.spc, 1:13) * (1:13))
Noov.emp
Noov.emp / N(Brown.spc)
@
%
The model-derived estimate of the proportion of OOV tokens is considerably
lower than the one derived from the observed spectrum (in accordance with the
general tendency of LNRE models to underestimate type-related quantities in
extrapolation -- see \citealp{Baroni:Evert:05}). However, the model-based
proportion still seems reasonable as a ballpark estimate.
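The trial-and-error search above can also be wrapped into a small helper
function. The sketch below is our own illustration (not part of zipfR): it
simply picks the frequency class whose cumulative type count comes closest to
$V_{OOV}$, and should therefore give approximately the proportion just
computed:
<<eval=FALSE>>=
## illustrative helper: token mass of the lowest frequency classes that
## together contain roughly V.oov types (searching the first m.max classes)
oov.tokens <- function(spc, V.oov, m.max=100) {
  cum.types <- cumsum(Vm(spc, 1:m.max))        # V_1, V_1 + V_2, ...
  m.cut <- which.min(abs(cum.types - V.oov))   # class closest to V.oov
  sum(Vm(spc, 1:m.cut) * (1:m.cut))            # token frequency of these classes
}
oov.tokens(Brown.zm.spc, EV(Brown100k.zm, N(Brown.spc)) - Vseen) / N(Brown.spc)
@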
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Determining lexicon size}

Suppose that, in a certain project, we will have to develop a lexical
resource, and we are in the phase in which we need to decide how large it
should be. One approach to this task is to estimate how many types we would
need to enter in the lexicon to achieve a certain coverage on inputs of
various sizes, and use this estimate as a guide in deciding the size of the
target lexicon.

For example, suppose that we want to try to reach 95\% coverage, i.e., no more
than about 5\% OOV tokens, given an expected input of 10 million words. With
our usual ZM model, we estimate the spectrum at $N = 10$ million:
<<>>=
Brown10M.zm.spc <- lnre.spc(Brown100k.zm, 10e6)
@
%
5\% of 10 million is 500,000; thus, we must sum up the lowest frequency
spectrum elements multiplied by the corresponding frequencies until we reach a
value close to 500,000. By trial and error, we find that we can get a
reasonably close value by summing up the token frequencies of the first 18
frequency classes (trial and error can be made less typing-intensive by using
the R function \texttt{cumsum} instead of \texttt{sum}, in order to obtain a
vector of cumulative sums):
<<>>=
sum(Vm(Brown10M.zm.spc, 1:18) * (1:18))
@
%
This corresponds to a $V_{OOV}$ of:
<<>>=
sum(Vm(Brown10M.zm.spc, 1:18))
@
%
Thus, the number of types we should insert in our lexicon in order to cover
about 95\% of the tokens in a 10 million token input is:
<<>>=
EV(Brown100k.zm, 10e6) - sum(Vm(Brown10M.zm.spc, 1:18))
@
%
Now, as an exercise, find out how many types we need in the lexicon in order
to cover 95\% of the tokens in a 100 million token input. Is this quantity
much higher than the one needed for the 10 million token input? Notice that
you will need to use the \texttt{m.max} argument of \texttt{lnre.spc} in order
to obtain more than 100 spectrum elements.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% 5 Outlook
\section{Directions for further exploration}

The case studies in this tutorial illustrate many functionalities of the zipfR
package. You can find out more about the package by exploring its
documentation. To practice, you can repeat the procedures above, experimenting
with different LNRE models and exploring other zipfR datasets. Of course, it
is probably more interesting for you to work with your own data. As long as
you can convert them to a plain frequency list format, you will be able to
import them into zipfR using \texttt{read.tfl}. On the zipfR website
(\url{https://zipfR.r-forge.r-project.org/}) you can also find a script to
compute empirical vocabulary growth curves from plain tokenized corpus data.

Finally, we would like to remind you that zipfR, like R, is an open source,
GPL-licensed project. We invite you to contribute to its further development
with bug reports, encouragement, suggestions, and code!

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% bibliography

\begin{thebibliography}{}

\bibitem[Baayen(1992)]{Baayen:1992}
  Harald Baayen (1992). Quantitative aspects of morphological productivity.
  \emph{Yearbook of Morphology 1991}, 109--150.

\bibitem[Baayen(2001)]{Baayen:2001}
  Harald Baayen (2001). \emph{Word frequency distributions}. Dordrecht:
  Kluwer.

\bibitem[Baroni et~al.(2004)]{Baroni:etal:2004}
  Marco Baroni, Silvia Bernardini, Federica Comastri, Lorenzo Piccioni,
  Alessandra Volpi, Guy Aston and Marco Mazzoleni (2004).
  Introducing the ``la Repubblica'' corpus: A large, annotated,
  TEI(XML)-compliant corpus of newspaper Italian. \emph{Proceedings of LREC
  2004}, 1771--1774.

\bibitem[Baroni and Evert(2005)]{Baroni:Evert:05}
  Marco Baroni and Stefan Evert (2005). Testing the extrapolation quality of
  word frequency models. In P.~Danielsson and M.~Wagenmakers, editors,
  \emph{Proceedings of Corpus Linguistics 2005}, volume~1 of \emph{The Corpus
  Linguistics Conference Series}. ISSN 1747-9398.

\bibitem[Baroni and Evert(2007)]{Baroni:Evert:07}
  Marco Baroni and Stefan Evert (2007). Words and echoes: Assessing and
  mitigating the non-randomness problem in word frequency distribution
  modeling. In \emph{Proceedings of the 45th Annual Meeting of the Association
  for Computational Linguistics}, pages 904--911, Prague, Czech Republic.

\bibitem[Evert(2004)]{Evert:2004}
  Stefan Evert (2004). A simple LNRE model for random character sequences.
  \emph{Proceedings of JADT 2004}, 411--422.

\bibitem[Evert and Baroni(2007)]{Evert:Baroni:07}
  Stefan Evert and Marco Baroni (2007). \emph{zipfR}: Word frequency
  distributions in R. In \emph{Proceedings of the 45th Annual Meeting of the
  Association for Computational Linguistics, Posters and Demonstrations
  Sessions}, pages 29--32, Prague, Czech Republic.

\bibitem[Jurafsky and Martin(2000)]{Jurafsky:Martin:2000}
  Daniel Jurafsky and James Martin (2000). \emph{Speech and language
  processing}. Upper Saddle River: Prentice Hall.

\bibitem[Ku\v{c}era and Francis(1967)]{Kucera:Francis:1967}
  Henry Ku\v{c}era and Nelson Francis (1967). \emph{Computational analysis of
  present-day American English}. Providence: Brown University Press.

\end{thebibliography}

\end{document}