\documentclass[a4paper]{article}
\usepackage[margin=2.25cm]{geometry}
\usepackage{xspace}
%%\usepackage[round]{natbib}
\usepackage[colorlinks=true,urlcolor=blue]{hyperref}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{{\it #1}}
\newcommand{\Prisma}{\pkg{PRISMA}\xspace}
\SweaveOpts{keep.source=TRUE, strip.white=all}
%% \VignetteIndexEntry{Quick introduction}
<<echo=FALSE>>=
if (!exists("PRISMA", .GlobalEnv)) library(PRISMA)
@
\begin{document}
\title{Introduction to the \Prisma package}
\author{Tammo Krueger}
\date{\today\\[1cm] \url{https://github.com/tammok/PRISMA}}
\maketitle

\section*{Introduction}

This vignette gives a first tour of the features of the \Prisma package.
We give an overview of the application of the algorithm; the full story
is available in the papers \cite{krueger12,krueger10}. If you use the
\Prisma package in your research, please cite at least one of these
references. The \Prisma package essentially consists of three parts:
\begin{enumerate}
\item Efficient reading of the output of \code{sally}, an extremely fast
  n-gram processor available at \url{http://www.mlsec.org/sally/}
\item Testing-based feature dimension reduction
\item Optimized matrix factorization of the reduced data, exploiting the
  replicate structure of the data
\end{enumerate}
For the theory behind these parts please consult
\cite{krueger12,krueger10}. We start this walk-through by reading
\code{sally} data, then show the inner structure of the resulting data
object, to which the replicate-aware non-negative matrix factorization
can be applied.

\section*{Loading the Data}

This section serves as a reference for how to apply the processing chain
to new data in order to get a usable \Prisma data set. The generated
data set is already prepackaged inside the \Prisma package and can be
loaded via \code{data(asap)}. Before executing the examples, please
extract asap.tar.gz, located in the \code{extdata} path of the \Prisma
package, to find all data necessary to understand the processing chain
from the raw data (asap.raw) to the sally file (asap.sally) and the
optimized file (asap.fsally). The asap.sally file can be produced as
follows:
\begin{verbatim}
sally -c asap.cfg asap.raw asap.sally
\end{verbatim}
This call generates asap.sally from the raw data found in asap.raw. To
speed up the loading of the data in R, one should apply the
\code{sallyPreprocessing.py} Python script as follows:
\begin{verbatim}
python sallyPreprocessing.py asap.sally asap.fsally
\end{verbatim}
Now the data is ready to be efficiently loaded and processed in R via
\code{loadPrismaData("asap")}, which also executes the feature dimension
reduction step.

\section*{The \Prisma Data Set}

As an example we use the prepackaged ASAP toy data set as described in
\cite{krueger10}:
<<>>=
data(asap)
asap
@
We see that the feature reduction step worked quite well. Let's have a
look behind the scenes:
<<>>=
asap$data
@
This shows us the reduced form of the initial data matrix in a features
$\times$ documents representation, i.e., a replicate-free version of it.
We can see that the features partly consist of grouped tokens (for
instance, \code{admin.php par action} contains 3 tokens, which always
co-occurred in the data) and how these tokens are present in the
different documents.
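Since the reduced data behaves like an ordinary numeric matrix with the
features as row names, its shape and the grouped feature names can also
be inspected directly. A minimal sketch (not evaluated here, and
assuming the printed representation above reflects a plain matrix):
<<eval=FALSE>>=
## rows are the (possibly grouped) features,
## columns are the replicate-free documents
dim(asap$data)
## feature names, including grouped tokens such as "admin.php par action"
rownames(asap$data)
@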
We can see the initial tokens before the grouping and their
corresponding group assignment in the \code{group} variable:
<<>>=
asap$group
@
The member variable \code{unprocessed} contains the initial data matrix
before the feature selection and grouping step. If we want to
reconstruct all replicates in the reduced feature space, we need the
\code{getDuplicateData} function:
<<>>=
dim(getDuplicateData(asap))
dim(asap$unprocessed)
@
This blows up the reduced matrix to the full 10,000 initial data points
in the reduced feature space. To see how often a specific entry of the
reduced data matrix was present in the original data, we can have a look
at the duplicate count:
<<>>=
asap$duplicatecount
sum(asap$duplicatecount)
@

\section*{The Replicate-Aware Non-Negative Matrix Factorization (NMF)}

The replicate-aware NMF is a matrix factorization method which describes
the data in terms of a new system of base vectors, i.e., each data point
is described as a weighted sum of these base vectors. Thus, the base
vectors can be seen as the parts of which a document is constructed.
Furthermore, the new coordinates of a document (the base weights) can
also be interpreted as a soft clustering. Before we can apply the NMF,
we need to specify the inner dimension of the factorization. This can
either be supplied as a number (which should be even if \code{pca.init}
is \code{TRUE}) or as a \code{prismaDimension} object generated by the
fully automated dimension estimation method:
<<>>=
asapDim = estimateDimension(asap)
asapDim
@
Equipped with this object, we can now apply the NMF to the data:
\begin{verbatim}
> asapNMF = prismaNMF(asap, asapDim, time=60)
Error: 3771.392
Error: 3113.138
Error: 2855.863
Error: 2810.286
Error: 2765.763
Error: 2755.29
Error: 2752.505
> asapLabels = getMatrixFactorizationLabels(asapNMF)
> table(asapLabels)
asapLabels
   1    2    3    4    5    6    7    8
 623  607  602  660 1696 2473  817 2522
\end{verbatim}
We can look at the results via \code{plot(asapNMF)}, which is shown in
Figure \ref{fig:asap}. We can see that the NMF extracts a \code{search}
template, the four \code{admin.php}-action templates, a Firefox
template, and two \code{static} templates, which reproduces the results
of \cite{krueger10}, Section 3.1, with added user agents as ``noise''.
\begin{figure}[tb]
\centering
\includegraphics{asap}
\caption{Result of the replicate-aware NMF on the \code{asap} data set.}
\label{fig:asap}
\end{figure}

\section*{Interface to the \pkg{tm} Package}

To allow the application of the replicate-aware NMF to corpora generated
by the \pkg{tm} package \cite{feinerer08}, the \Prisma package contains
a converter function which maps a \pkg{tm} corpus object to a \Prisma
data object. We exemplify this procedure with an already stemmed and
cleansed version of the 15 subsections of \cite{krueger2013}:
\begin{verbatim}
> data(thesis)
> thesis
A corpus with 15 text documents
> thesis = corpusToPrisma(thesis, NULL, TRUE)
> thesis
PRISMA data tm-Corpus
Unprocessed data: # features: 2002 # entries: 15
Processed data: # features: 2002 # entries: 15
> thesisNMF = prismaNMF(thesis, 3, pca.init=FALSE)
Error: 1329.73
Error: 1310.481
Error: 1295.959
Error: 1295.509
\end{verbatim}
Since we have just 15 documents, the feature reduction step and the
correlation analysis suffer from too little data, as does the PCA-based
initialization scheme. Thus, we skip all these processing steps and
apply the NMF directly to the data with three components as an educated
guess.
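As with the ASAP data, the base weights of this factorization can again
be read as a soft clustering of the 15 documents. A short sketch using
\code{getMatrixFactorizationLabels} as above (output omitted, since the
random initialization makes the exact cluster assignments vary):
\begin{verbatim}
> thesisLabels = getMatrixFactorizationLabels(thesisNMF)
> table(thesisLabels)
\end{verbatim}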
To analyze the result we look at the top-scoring words of each base
vector, i.e., those above the 99\% quantile (roughly the top 20 words)
of the corresponding column of the base matrix:
\begin{verbatim}
> isQuantile = (t(thesisNMF$B) > apply(thesisNMF$B, 2, quantile, prob=.99))
> maxFeatures = apply(isQuantile, 1, function(r) which(r == 1))
> rownames(thesis$data)[maxFeatures[, 1]]
 [1] "add"      "align"    "associ"   "cluster"  "communic" "correct"
 [7] "extract"  "fill"     "format"   "inner"    "machin"   "messag"
[13] "obvious"  "preserv"  "reflect"  "return"   "simul"    "templat"
[19] "trace"    "transit"  "tri"
> rownames(thesis$data)[maxFeatures[, 2]]
 [1] "behavior"   "chang"      "configur"   "crossvalid" "drop"
 [6] "fast"       "figur"      "follow"     "lead"       "learn"
[11] "lower"      "observ"     "optim"      "overal"     "procedur"
[16] "process"    "relat"      "shown"      "speed"      "statist"
[21] "use"
> rownames(thesis$data)[maxFeatures[, 3]]
 [1] "addit"    "applic"   "approach" "attack"   "base"     "construct"
 [7] "content"  "exploit"  "method"   "model"    "network"  "normal"
[13] "protocol" "server"   "similar"  "simpl"    "structur" "techniqu"
[19] "token"    "traffic"  "use"
\end{verbatim}
These word stems accurately describe the contents of the three chapters
of \cite{krueger2013}, which concludes the analysis of this section.

\bibliographystyle{plain}
\bibliography{PRISMA}

\end{document}