% \VignetteEngine{knitr::knitr}
% \VignetteIndexEntry{PepSAVIms introduction}

\documentclass{article}
\input{Shared_Preamble.tex}

% Document start ---------------------------------------------------------------

\begin{document}

% Title and table of contents --------------------------------------------------

\begin{center}
  {\LARGE An introduction to the \pkgname/ package}

  \vspace{10mm}
  {\large \today}
\end{center}

\vspace{5mm}
\tableofcontents
\vspace{5mm}

% Begin document body ----------------------------------------------------------

% Configure global options
<<>>=
library(knitr)
knit_theme$set("default")
opts_chunk$set(cache=FALSE)
opts_knit$set(root.dir=normalizePath(".."))
options(width=90)
@

\setcounter{section}{-1}
\section{Introduction}

The \pkgname/ \R/ package provides a collection of software tools used to facilitate the prioritization of putative bioactive compounds from a complex biological matrix. The package was constructed to provide an implementation of the statistical portion of the laboratory and statistical procedure proposed in \papernamefull/ (hereafter abbreviated to \papername/) \cite{kirkpatrick2016}. This document provides an introduction to the functionality provided by the \pkgname/ package.

% Flowchart and fcn descriptions -----------------------------------------------

\section{Package Overview}

A flowchart for a data analysis performed using \pkgname/ is shown in Figure \ref{fig: flow chart}. The blue rectangles and diamond are functions in the \pkgname/ ecosystem; the pink oval denotes mass spectrometry data to be used as data inputs; the green oval denotes bioactivity data to be used as a data input; and the yellow oval denotes the analysis results of using the \methodname/ methodology and software. The prototypical data analysis workflow is the path described by the solid lines; this is the procedure performed in \papername/, and described in more detail in the \texttt{Paper\_Analysis} vignette.
The dashed lines show alternative workflows which may be performed in situations where some of the data processing steps have already been performed using other tools, or where some of the data processing steps may not be appropriate for the particular data analysis at hand.

\begin{figure}[H]
  \caption{Data analysis flow chart}
  \vspace{-6mm}
  \label{fig: flow chart}
  \centering
  \begin{tikzpicture}[node distance = 3cm, auto]
    % Place nodes
    \node [decision] (rankEN) {rankEN};
    \node [block, left of=rankEN] (filterMS) {filterMS};
    \node [cloudYel, right of=rankEN, node distance=3.75cm] (ranked_data) {\makecell[c]{ranked candid-\\ate compounds}};
    \node [block, left of=filterMS] (binMS) {binMS};
    \node [cloud, left of=binMS] (raw_data) {\makecell[c]{raw MS\\data}};
    \node [block, below left=1.5cm and 1cm of filterMS, node distance=1cm] (msDat1) {msDat};
    \node [block, below left=1.6cm and 1.45cm of rankEN, node distance=1cm] (msDat2) {msDat};
    \node [cloud, below of=msDat1] (bin_dat) {\makecell[c]{binned\\MS data}};
    \node [cloud, below of=msDat2] (filt_dat) {\makecell[c]{filtered\\MS data}};
    \node [cloudGr, below of=rankEN] (bioact) {\makecell[c]{bioactivity\\data}};
    % Draw edges
    \path [line, thick] (raw_data) -- (binMS);
    \path [line, thick] (binMS) -- (filterMS);
    \path [line, thick] (filterMS) -- (rankEN);
    \path [line, thick] (rankEN) -- (ranked_data);
    \path [line, dashed] (msDat1) -- (filterMS);
    \path [line, dashed] (msDat2) -- (rankEN);
    \path [line, dashed] (bin_dat) -- (msDat1);
    \path [line, dashed] (filt_dat) -- (msDat2);
    \path [line, thick] (bioact) -- (rankEN);
    \path [line, dashed] (binMS) to[out=95, in=90] (rankEN);
  \end{tikzpicture}
  \vspace{4mm}
  \caption*{Solid arrows represent the prototypical data analysis workflow;\\ dashed lines represent alternative workflows}
\end{figure}

A conceptual presentation of the functions in the \pkgname/ package is provided in the following subsections.
Please refer to the function documentation for more information regarding the application programming interface, as well as for further technical detail.

\subsection{The \binMS/ function}

The mass spectrometry abundance data can optionally undergo two preprocessing steps. The first step is a consolidation step: the goal is to consolidate mass spectrometry observations in the data that are believed to belong to the same underlying compound. In other words, the instrumentation may have obtained multiple reads of mass spectrometry abundances that in actuality belong to the same compound, in which case we wish to attribute all of those observations to a single compound. The function name \binMS/ derives from the fact that we use a binning procedure to consolidate the data.

The consolidation procedure proceeds as follows. First, all observations must satisfy each of the following criteria or they are removed from consideration for consolidation (i.e. they are dropped from the data).
\begin{enumerate}[label=(\roman*)]
  \item Each observation must have its peak elution time occur during the specified interval.
  \item Each observation must have a mass that falls within the specified interval.
  \item Each observation must be detected in a charge state that falls within the specified interval.
\end{enumerate}

Once the subset of observations satisfying the above criteria is obtained, a second step attempts to combine observations believed to belong to the same underlying compound. This procedure considers two observations that satisfy each of the following criteria to belong to the same compound.
\begin{enumerate}[label=(\roman*)]
  \item The absolute difference (in Daltons) of the mass-to-charge (m/z) values between the two observations is less than the specified value.
  \item The absolute difference of the peak elution time between the two observations is less than the specified value.
  \item The charge state must be the same for the two observations.
\end{enumerate}

\subsection{The \code{filterMS} function}

The second optional preprocessing step for the mass spectrometry abundance data is a filtering step. The goal of the filtering step is to further reduce the data set to focus on only those compounds that could plausibly be contributing to the bioactivity area of interest. Furthermore, these criteria aim to filter out some of the noise detected in the dataset. Filtering the candidate set prior to statistical analysis greatly increases the ability of the analysis to effectively differentiate such compounds.

The criteria for the downstream inclusion of a candidate compound are listed below.
\begin{enumerate}[label=(\roman*)]
  \item The m/z intensity maximum must fall inside the range of the bioactivity region of interest.
  \item The ratio of the m/z intensity of a species in the areas bordering the region of interest to the species maximum intensity must be less than the specified value.
  \item The right adjacent fraction to the fraction with maximum intensity for a given species must have a non-zero abundance.
  \item At least one fraction in the region of interest must have intensity greater than the specified value.
  \item The compound charge state must be less than or equal to the specified value.
\end{enumerate}

The region of interest is chosen to be a region where a high percentage of bioactivity is observed in multiple sequential fractions; the assumption is then that one or more compounds eluting in those fractions are associated with the observed bioactivity.

\subsection{The \rankEN/ function}

Once the mass spectrometry abundance data has optionally undergone any preprocessing steps, a statistical procedure to search for putative bioactive peptides is performed. This step is performed by the \rankEN/ function, which takes as inputs both the preprocessed mass spectrometry abundance data and the bioactivity data.
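
For orientation, the penalized criterion behind this step can be written in the notation of \cite{zou2005}. The symbols $y$, $X$, $\beta$, $\lambda_1$, and $\lambda_2$ are introduced here for illustration only and are not names used by the package; we assume that $y$ is a vector summarizing the bioactivity in each fraction (for example, an average over replicates) and that the columns of $X$ hold the abundances of the candidate compounds across the same fractions. Under these assumptions the naive elastic net criterion is
\[
  L(\lambda_1, \lambda_2, \beta)
    = \| y - X\beta \|_2^2
    + \lambda_2 \| \beta \|_2^2
    + \lambda_1 \| \beta \|_1 ,
\]
where $\lambda_2 \geq 0$ weights the quadratic ($\ell_2$) term and $\lambda_1 \geq 0$ weights the $\ell_1$ term. For a fixed $\lambda_2$, decreasing $\lambda_1$ from a large value traces out a path of minimizers along which coefficients become nonzero one after another.
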
The procedure works by specifying the level of the $\ell_2$ penalty parameter in the elastic net penalty \cite{zou2005}, and tracking the inclusion of the coefficients corresponding to compounds into the nonzero set along the elastic net path. An ordered list of candidate compounds is obtained by providing the order in which the coefficients corresponding to compounds entered the nonzero set.

\subsection{The \msDat/ function}

The functions \filterMS/ and \rankEN/ require objects of class \code{msDat} as the arguments providing the mass spectrometry abundance data; the \msDat/ function takes mass spectrometry data as its input and creates an object of this class. Note, however, that the \binMS/ and \filterMS/ functions return objects that inherit from the \code{msDat} class, and consequently you can provide an object created by \binMS/ or \filterMS/ anywhere that an \code{msDat} class object is required.

The raison d'\^{e}tre for an object of class \code{msDat} is to allow for a consistent interface to \filterMS/ and \rankEN/ whether the mass spectrometry data is obtained via a prior call to \binMS/, \filterMS/, or directly from the user via a call to \msDat/.

% Flowchart and fcn descriptions -----------------------------------------------

\section{An example case: performing the \methodname/ pipeline}

In the following section we will illustrate the usage of the core \pkgname/ functions by walking through the prototypical \methodname/ pipeline. A flowchart for this prototypical \methodname/ pipeline is shown below.
\begin{figure}[H]
  \caption{Data analysis flow chart for the data analysis performed in \papername/}
  \centering
  \begin{tikzpicture}[node distance = 3cm, auto]
    % Place nodes
    \node [decision] (rankEN) {rankEN};
    \node [block, left of=rankEN] (filterMS) {filterMS};
    \node [cloudYel, right of=rankEN, node distance=3.75cm] (ranked_data) {\makecell[c]{ranked candid-\\ate compounds}};
    \node [block, left of=filterMS] (binMS) {binMS};
    \node [cloud, left of=binMS] (raw_data) {\makecell[c]{raw MS\\data}};
    \node [cloudGr, below of=rankEN] (bioact) {\makecell[c]{bioactivity\\data}};
    % Draw edges
    \path [line, thick] (raw_data) -- (binMS);
    \path [line, thick] (binMS) -- (filterMS);
    \path [line, thick] (filterMS) -- (rankEN);
    \path [line, thick] (rankEN) -- (ranked_data);
    \path [line, thick] (bioact) -- (rankEN);
  \end{tikzpicture}
  \label{fig: flow chart prototypical}
\end{figure}

\subsection{Reading in the data}

% Data from \papername/ is included as part of the package; in this document we
% will use the data to illustrate the usage of the package interface.  The goal of
% this document is to provide examples for users wishing to

\subsubsection{Reading in the mass spectrometry data}

First we load some mass spectrometry data included with the package into memory.

<<>>=
# Load package into memory
library(PepSAVIms)

# The mass spectrometry data is provided as a data.frame
is.data.frame(mass_spec)

# There are 30,799 mass-to-charge levels, and 38 variables
dim(mass_spec)

# The first four variables provide the m/z level, time of peak retention, mass,
# and charge state of each observation.  The remaining 34 variables are the mass
# spectrometric intensities for each compound across fractions 11 through 43, and fraction 47.
names(mass_spec)
@

\subsubsection{Reading in the bioactivity data}

Next we load some bioactivity data included with the package into memory.
The object \texttt{bioact} is a list containing bioactivity values (in replicates) for sweet violet against several bacterial and viral pathogens, as well as some cancer cell lines; we will arbitrarily choose the bioactivity data against the Gram-negative bacterium \textit{E. coli} as an illustrative example, and create an object for this data from the element in the list.

<<>>=
# Load data into memory
data(bioact)

# bioact is a list with each element corresponding to bioactivity data
is.list(bioact)

# Names of the elements in bioact
names(bioact)

# Arbitrarily select one of the datasets for further examples
EC <- bioact$ec

# EC is provided as a data.frame
is.data.frame(EC)

# EC contains data for 3 replicates and 44 fractions
dim(EC)

# The names of the fractions for which bioactivity observations were obtained
names(EC)
@

\subsection{Using \texttt{binMS}}
\label{sec: using binMS}

Now that we have the mass spectrometry data loaded into memory, the first step in the \pkgname/ pipeline is to consolidate mass spectrometry observations in the data that are believed to belong to the same underlying compound. This is performed via the \binMS/ function.

In order to perform its task, the consolidation routine needs to be provided with data for each mass-to-charge ratio and charge state, the retention time, and (optionally) the mass. These can be provided as data vectors, or can be given as columns in the mass spectrometry data. When provided as part of the mass spectrometry data, the variables can each be identified either by column index or by column name.

The mass spectrometry intensities are required to be included as part of the data provided as the argument to \texttt{mass\_spec}.
If all of the columns in \texttt{mass\_spec} correspond either to mass spectrometry intensities or alternatively provide data for the m/z, charge, mass, or time of peak retention values, then we can specify the \texttt{ms\_inten} argument as \texttt{NULL}, which indicates that all remaining columns are to be included. In either case, we can always provide a vector either of column indices or of column names to specify which columns of \texttt{mass\_spec} to include as mass spectrometry intensities.

As part of the procedure, we also need to specify acceptable values for the time of peak retention, an allowable mass range, and an allowable charge state range. These are provided as length-2 vectors specifying the lower and upper bounds of the acceptable ranges.

Finally, we need to specify how close the mass-to-charge values and the times of peak retention must be for observations to be presumed to come from the same underlying compound. This is done by specifying cutoff values; observations whose differences are larger than these values are considered to come from two distinct mass-to-charge levels.

\begin{figure}[H]
  \caption{\texttt{binMS} data input flowchart}
  \centering
  \includegraphics{images/Binning_Flowchart}
\end{figure}

\subsubsection{\texttt{binMS} specification via variable names}

In this example we specify the data for the mass-to-charge, charge, mass, time of peak retention, and the intensities by specifying the names of the corresponding columns in \texttt{mass\_spec}. Note that we do not need to provide the exact names of the columns; we only need to provide enough information to yield a unique match.

Further notice that we are able to provide \texttt{NULL} as the argument specifying the mass spectrometry intensities. Recall that the mass spectrometry data has columns for the m/z, charge, mass, and time of peak retention data, and that the remaining columns are intensities.
Thus specifying \texttt{NULL} causes the function to include all of the remaining columns as intensity data after removing the m/z, charge, mass, and time of peak retention data.

<<>>=
# Perform consolidation using names
bin_out <- binMS(mass_spec = mass_spec,
                 mtoz = "m/z",
                 charge = "Charge",
                 mass = "Mass",
                 time_peak_reten = "Reten",
                 ms_inten = NULL,
                 time_range = c(14, 45),
                 mass_range = c(2000, 15000),
                 charge_range = c(2, 10),
                 mtoz_diff = 0.05,
                 time_diff = 60)
@

\subsubsection{\texttt{binMS} via names, indices, and vectors}

In the following example we explicitly provide the data for \texttt{mass} and \texttt{time\_peak\_reten} by passing data vectors as their arguments. Notice that now we have to specify the columns from \texttt{mass\_spec} to use as intensity data, since after removing the columns containing the m/z and charge data, there are still non-intensity columns in \texttt{mass\_spec} (specifically the columns containing the mass and time of peak retention data). See Figure \ref{fig: binMS input} for a visual schematic of the internal \texttt{binMS} processing of the data inputs for the following example.

<<>>=
# Make copies of some of the vectors in mass_spec to pass directly to function
mass_vals <- mass_spec[, "Mass"]
time_vals <- mass_spec[, "Retention time (min)"]

# Vector of names for the intensity columns.  We include the leading underscore
# so as to prevent any ambiguity between the fraction number and date.
inten_nm <- c(paste0("_", 11:43), "_47")

# Perform consolidation alternate input
bin_out_v2 <- binMS(mass_spec = mass_spec,
                    mtoz = "m/z",
                    charge = "Charge",
                    mass = mass_vals,
                    time_peak_reten = time_vals,
                    ms_inten = inten_nm,
                    time_range = c(14, 45),
                    mass_range = c(2000, 15000),
                    charge_range = c(2, 10),
                    mtoz_diff = 0.05,
                    time_diff = 60)

# We get the same results whether the data is specified via column names or
# via data vectors
identical(bin_out_v2, bin_out)
@

\begin{figure}[H]
  \caption{\texttt{binMS} data inputs corresponding to \texttt{bin\_out\_v2}}
  \label{fig: binMS input}
  \vspace{5mm}
  \centering
  \includegraphics{images/Binning_Input}
  % \vspace{5mm}
  % \caption*{
  %   \begin{minipage}[t]{1\textwidth}
  %     In this example, the m/z and charge values
  %     for the data are provided as part of the \texttt{mass\_spec} data, and the
  %     time of peak retention and mass data are each provided as separate data
  %     vectors.  The \texttt{binMS} function internally splits apart the data
  %     before performing the consolidation.
  %   \end{minipage}
  % }
\end{figure}

\vspace{10mm}

\subsubsection{The print and summary function for \texttt{binMS}}

The \texttt{binMS} class is equipped with a print and a summary function. We can see from the output that there were originally 30,799 mass-to-charge levels. Filtering by the inclusion criteria reduced the data to 10,902 levels, and consolidation of the data then resulted in 6,258 levels.

<<>>=
# Print the size of the consolidated data
bin_out

# Show summary information describing the consolidation process
summary(bin_out)
@

\subsection{Using \texttt{filterMS}}

The next step in the \pkgname/ pipeline is to remove any candidate compounds whose observed abundances make it scientifically unlikely that they correspond to a compound with an effect on the bioactivity area of interest. This step is performed by the \texttt{filterMS} function.
When using the \texttt{filterMS} function, the first task is to specify the region of interest and bordering region. This is done by providing an \texttt{msDat} object as an argument to \texttt{msObj}, and then:
\begin{enumerate}[label=(\roman*)]
  \item specifying a contiguous region of columns from the intensity matrix by providing a vector either of column names or of column indices (\texttt{region} formal argument)
  \item specifying the bordering region relative to the region of interest by providing either the value \texttt{"all"}, the value \texttt{"none"}, or a length-1 or length-2 vector providing the width of the left and right bordering regions (\texttt{border} formal argument)
\end{enumerate}

\begin{figure}[H]
  \centering
  \includegraphics{images/Region_Of_Interest}
  \vspace{6mm}
  \caption{Region of interest and bordering regions}
\end{figure}

Once the region of interest and bordering region have been specified, the remaining variables \texttt{bord\_ratio}, \texttt{min\_inten}, and \texttt{max\_chg} are each specified by selecting a single numeric value; see the function documentation for the precise definition of these variables.

\subsubsection{\texttt{filterMS}: specifying the region of interest}

In the following code snippet we provide examples of using \texttt{filterMS} with \texttt{region} specified either by using column names or by column indices.
<<>>=
# Invoke filterMS using column names to specify the region of interest
filter_out <- filterMS(msObj = bin_out,
                       region = paste0("VO_", 17:25),
                       border = "all",
                       bord_ratio = 0.01,
                       min_inten = 1000,
                       max_chg = 10)

# The column indices 7-15 correspond to fractions 17-25
colnames(filter_out)[7:15]

# Invoke filterMS using indices to specify the region of interest
filter_out_v2 <- filterMS(msObj = bin_out,
                          region = 7:15,
                          border = "all",
                          bord_ratio = 0.01,
                          min_inten = 1000,
                          max_chg = 10)

# Confirm that the two objects are equivalent
identical(filter_out_v2, filter_out)
@

\subsubsection{\texttt{filterMS}: specifying bordering region}

In the following code snippet we provide examples of using \texttt{filterMS} with \texttt{border} specified either by using a length-1 or length-2 numeric vector. Since in both cases the choices of \texttt{border} are large enough to encompass all of the fractions not included in the region of interest, these choices have the same effect as specifying \texttt{"all"}.

<<>>=
# Use one value to specify the width of both the left and the right bordering
# region
filter_out_v3 <- filterMS(msObj = bin_out,
                          region = paste0("VO_", 17:25),
                          border = 100,
                          bord_ratio = 0.01,
                          min_inten = 1000,
                          max_chg = 10)

# Use two values to specify the left width and right width of the bordering
# region
filter_out_v4 <- filterMS(msObj = bin_out,
                          region = paste0("VO_", 17:25),
                          border = c(150, 200),
                          bord_ratio = 0.01,
                          min_inten = 1000,
                          max_chg = 10)

# We get the same result by specifying the left and right bordering regions as
# having widths 100 as by choosing "all"
identical(filter_out_v3$msDatObj, filter_out$msDatObj)

# We get the same result by specifying the left and right bordering regions as
# having widths 150 and 200 as by choosing "all"
identical(filter_out_v4$msDatObj, filter_out$msDatObj)
@

\subsubsection{The print and summary function for \texttt{filterMS}}

The \texttt{filterMS} class is equipped with a print and a summary function.
We see that the number of candidate compounds is reduced from 6,258 compounds to 225. Note that when the number of fractions in the region of interest or the bordering regions is large, the summary function omits printing the fractions so as to prevent the output from becoming overly lengthy; in this case the fraction names for the bordering region are omitted.

<<>>=
# Print the size of the filtered data
filter_out

# Show summary information describing the filtering process
summary(filter_out)
@

\subsection{Using \texttt{rankEN}}

Once the mass spectrometry abundance data has optionally undergone any preprocessing steps, a statistical procedure to search for candidate compounds for reduction of bioactivity levels is performed. This step is performed by the \rankEN/ function, which takes as inputs both the preprocessed mass spectrometry abundance data and the bioactivity levels data.

\subsubsection{\texttt{rankEN}: specifying the region of interest for mass spectrometry and bioactivity data}

The first task in invoking the \texttt{rankEN} procedure is to specify the region of interest for mass spectrometry and bioactivity data. This can be done by specifying the appropriate column names or column indices in the respective data. The argument for \texttt{region\_ms} should specify the region of interest for the mass spectrometry data by providing the appropriate column names or column indices with respect to the argument provided for \texttt{msObj}. Similarly, the argument for \texttt{region\_bio} should be with respect to the argument for \texttt{bioact}. It is worth clarifying that the region of interest should be the same for the intensity and bioactivity data (i.e. the region should correspond to the same fractions for each); it just might be the case that the column names or indices differ.
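
As a minimal sketch of this last point, the following chunk uses invented column names (they are hypothetical and are not the names found in the package data) to show how the same fractions can sit at different column positions in the two data sets. Only base \R/ functions are used.

<<>>=
# Hypothetical column names, assumed for illustration only
ms_names  <- paste0("VO_", 11:43)     # toy intensity columns, fractions 11-43
bio_names <- paste0("frac_", 1:44)    # toy bioactivity columns, fractions 1-44

# The region of interest is fractions 21-24 in both data sets
region_frac <- 21:24

# Column indices of the region with respect to each set of names
region_ms_idx  <- match(paste0("VO_", region_frac), ms_names)
region_bio_idx <- match(paste0("frac_", region_frac), bio_names)

# The same fractions are found at different column positions
region_ms_idx
region_bio_idx
@

Either the names or the indices obtained in this way could then be supplied for the corresponding region argument.
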
Once the mass spectrometry and bioactivity data have been provided and the region of interest for each has been specified, it remains to specify
\begin{itemize}
  \item the quadratic penalty parameter for the elastic net penalty
  \item a switch specifying whether the function should retain only compounds that are positively correlated with the bioactivity
  \item the maximum number of candidate compounds to retain
\end{itemize}
See the function documentation for more detail.

<<>>=
# Perform the candidate ranking procedure with fractions 21-24 as the region of
# interest
rank_out <- rankEN(msObj = filter_out,
                   bioact = EC,
                   region_ms = paste0("_", 21:24),
                   region_bio = paste0("_", 21:24),
                   lambda = 0.001,
                   pos_only = TRUE,
                   ncomp = NULL)
@

\subsubsection{The summary function for \texttt{rankEN}}

The \texttt{rankEN} class is equipped with a print and summary function. The summary function provides a list of the candidate compounds obtained by the procedure and their correlation with mean bioactivity levels across replicates.

<<>>=
# Prints the dimensions of the data
rank_out

# Shows the first 10 candidate compounds obtained by the procedure
summary(rank_out, 10)
@

\subsubsection{Accessing the ranked candidate compounds}

The m/z and charge values of the ranked candidate compounds as well as their correlations with respect to the average bioactivity levels for the region of interest can be extracted and returned as a \texttt{data.frame} via the \texttt{extract\_ranked} function.
<<>>=
# Extract the ranked candidates
ranked_candidates <- extract_ranked(rank_out)

# Return object is a data.frame
is.data.frame(ranked_candidates)

# Print first few candidates; should be the same as from the summary function
head(ranked_candidates)
@

% PepSAVIms objects and class methods ------------------------------------------

\section{Data access and manipulation tools}

In this section we take a deeper look into the objects created by the functions in \pkgname/ and consider how to further access and manipulate the data throughout the process.

\subsection{The mass spectrometry data class hierarchy}

The core data structure in the \pkgname/ package is the \texttt{msDat} \textit{class}; it is the class of the object returned by the \texttt{msDat} \textit{function}. This data structure is used to maintain mass spectrometry data. See the \texttt{msDat} function documentation for further details regarding the \texttt{msDat} class internal structure.

The importance of the \texttt{msDat} class lies in the fact that the \texttt{binMS} and \texttt{filterMS} \textit{classes} derive from it; these classes are the objects that are returned from the \texttt{binMS} and \texttt{filterMS} \textit{functions}. The \texttt{binMS} and \texttt{filterMS} classes are essentially objects that decorate the \texttt{msDat} class; each of these data structures includes additional information used by the class's respective summary function to describe the data processing procedure. See the respective function documentation for more detail regarding the internal structure of the \texttt{binMS} and \texttt{filterMS} classes.
\begin{figure}[H]
  \caption{The \pkgname/ class hierarchy}
  \centering
  % Adapted from http://www.texample.net/tikz/examples/tree/
  \begin{tikzpicture}[sibling distance=10em,
      every node/.style = {shape=rectangle, rounded corners,
        draw, align=center,
        top color=white, bottom color=blue!20}]
    \node {\texttt{msDat}}
      child { node {\texttt{binMS}} }
      child { node {\texttt{filterMS}} };
  \end{tikzpicture}
  \label{fig: class hierarchy}
\end{figure}

\subsection{The \texttt{extractMS} function}

The \texttt{extractMS} function is a convenience function taking an object inheriting from the \texttt{msDat} class and returning the encapsulated mass spectrometry data as either (as specified by the user):
\begin{enumerate}[label=\roman*)]
  \item a matrix
  \item an \texttt{msDat} object (strictly an \texttt{msDat} object, i.e. not a subclass)
\end{enumerate}

\subsubsection{Extracting a \texttt{matrix} object}

The user may find it convenient to view and/or manipulate the mass spectrometry data encapsulated in an \texttt{msDat} object as a data matrix. This data may always be refactored again as an \texttt{msDat} object via the \texttt{msDat} function. Converting an \texttt{msDat} object to a matrix is done by specifying \texttt{type} as \texttt{"matrix"}. The matrix object returned by the \texttt{extractMS} function has a form where the first two columns provide the mass-to-charge and charge values of the data respectively, and the remaining columns provide the intensity data across the fractions.

One important situation where the user may wish to refactor the mass spectrometry data into the form of a matrix is as an intermediate step when converting the data to, for example, a comma-separated values file or native spreadsheet data format.
<<>>=
# Refactor the data as a matrix
filter_matr <- extractMS(msObj = filter_out, type = "matrix")

# Return object is a matrix
is.matrix(filter_matr)

# The data has two extra columns, one each for the m/z and charge information
dim(filter_matr)

# Compare to the result of calling dim on the original msDat object
dim(filter_out)

# Print the first few rows and columns of the newly formed matrix.  The row
# names of the matrix are the concatenation of the mass-to-charge ratio and
# charge state, separated by a /.
filter_matr[1:5, 1:4]
@

\vspace{2mm}
\noindent Once the data has been refactored in matrix form, any of the usual \R/ data export tools can be used.

<<>>=
# Save the data as a csv file.  Probably don't want to keep the row names as that
# information is contained in the first two columns of the data.
write.csv(filter_matr, file = "filtered_mass_spec.csv", row.names = FALSE)
@

\subsubsection{Extracting an \texttt{msDat} object}

An alternative to refactoring a \texttt{binMS} or \texttt{filterMS} object as a \texttt{matrix} is to extract the internal \texttt{msDat} object. This is done by specifying \texttt{type} as \texttt{"msDat"}. A potential advantage of keeping the data as an \texttt{msDat} object is that it avoids later having to convert the data back from a matrix into an \texttt{msDat} object. The \texttt{msDat} class is equipped with fundamental operations common to typical matrix classes such as data printing and matrix subsetting (described in section \ref{sec: msDat interface}).

The main difference between the extracted \texttt{msDat} object and its encapsulating \texttt{binMS} or \texttt{filterMS} object is that the print and summary functions for the \texttt{msDat} object emulate the print and summary functions for a matrix of intensity values rather than describing the consolidation or filtering process.
<<>>=
# Extract the encapsulated msDat object
filter_msDat <- extractMS(filter_out, "msDat")

# For a subclass of msDat the extractMS function has the effect of performing
# the following command
filter_msDat_v2 <- filter_out$msDatObj

# extractMS is the same as copying the msDatObj element for a subclass of msDat
identical(filter_msDat_v2, filter_msDat)

# Calling extractMS on an object that is strictly of class msDat is effectively
# a no-op
filter_msDat_v3 <- extractMS(filter_msDat, "msDat")

# extractMS on a strictly msDat object returns the original object
identical(filter_msDat_v3, filter_msDat)

# Printing the extracted msDat object prints the intensity matrix (as opposed to
# the print function for binMS or filterMS objects).  Also compare this to the
# extracted matrix in the previous section: in this form the mass-to-charge and
# charge data is not exposed to the user.
filter_msDat[1:5, 1:2]
@

\subsection{The \texttt{msDat} function}

As might be expected, the \texttt{msDat} \textit{function} takes mass spectrometry data as input and returns an \texttt{msDat} \textit{object}. In one sense the \texttt{msDat} function can be thought of as the inverse of the \texttt{extractMS} function with \texttt{type} specified as \texttt{"matrix"}; while in this form \texttt{extractMS} turns an \texttt{msDat} object into a matrix, the \texttt{msDat} function turns a matrix into an \texttt{msDat} object.

In general, the \texttt{msDat} function is used to create an \texttt{msDat} object for use as input for either the \texttt{filterMS} or the \texttt{rankEN} function. This need may occur when the researcher wishes to (i) enter the \methodname/ pipeline without executing either or both of the \texttt{binMS} or \texttt{filterMS} functions or (ii) do some additional processing between steps of the pipeline using the raw data.

The forms in which the data can be input are similar to those for the \texttt{binMS} function, so we do not go into great detail here.
In fact the internal routines used to process the arguments are the same for both functions.

<<>>=
# Construct an msDat object from object created by a call to extractMS
filter_out_v5 <- msDat(mass_spec = filter_matr,
                       mtoz = "mtoz",
                       charge = "charge",
                       ms_inten = NULL)

# Confirm that reconstructed msDat object is equal.  Need to ignore attributes
# when testing for equality b/c row names are not retained.
all.equal(filter_out_v5, filter_out$msDatObj, check.attributes=FALSE)
@

\subsection{The \texttt{msDat} class interface}
\label{sec: msDat interface}

The \texttt{msDat} class and its subclasses \texttt{binMS} and \texttt{filterMS} are equipped with some basic class methods to support fundamental data operations. The basic operations that are supported are the following:
\begin{itemize}
  \item \texttt{dim}, \texttt{nrow}, \texttt{ncol} (in terms of the dimensions of the intensity data)
  \item \texttt{dimnames}, \texttt{colnames}, and \texttt{row.names} read / write
  \item extract or replace via the \texttt{[ ]} operator
  \item \texttt{print}
\end{itemize}

We've used many of these functions throughout the rest of this document without explanation, but now let us provide some concrete examples.
<<>>=
# Check the dimension; can also use nrow, ncol
dim(filter_msDat)

# Print the first few rows and columns
filter_msDat[1:5, 1:3]

# Let's change the fraction names to something more concise
colnames(filter_msDat) <- c(paste0("frac", 11:43), "frac47")

# Print the first few rows and columns with the new fraction names
filter_msDat[1:5, 1:10]

# Suppose there are some m/z levels that we wish to remove
filter_msDat <- filter_msDat[-c(2, 4), ]

# Print the first few rows and columns after removing rows 2 and 4
filter_msDat[1:5, 1:10]

# Suppose that there was an instrumentation error and that we need to change
# some values
filter_msDat[1, paste0("frac", 12:17)] <- c(55, 57, 62, 66, 71, 79)

# Print the first few rows and columns after changing some of the values in
# the first row
filter_msDat[1:5, 1:10]
@

% bibliography -----------------------------------------------------------------

\begin{thebibliography}{9}

\bibitem{kirkpatrick2016}
Kirkpatrick et al. A PRISMS pipeline for natural product bioactive peptide discovery. Under review.

\bibitem{zou2005}
Zou, H., \& Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.

\end{thebibliography}

\end{document}