\documentclass[a4paper]{article}
\usepackage[margin=2.25cm]{geometry}
\usepackage{xspace}
%%\usepackage[round]{natbib}
\usepackage[colorlinks=true,urlcolor=blue]{hyperref}
\newcommand{\code}[1]{\texttt{#1}}
\newcommand{\pkg}[1]{{\it #1}}
\newcommand{\Prisma}{\pkg{PRISMA}\xspace}
\SweaveOpts{keep.source=TRUE, strip.white=all}
%% \VignetteIndexEntry{Quick introduction}
<<echo=FALSE>>=
if (!exists("PRISMA", .GlobalEnv)) library(PRISMA)
@
\begin{document}
\title{Introduction to the \Prisma package}
\author{Tammo Krueger}
\date{\today\\[1cm] \url{https://github.com/tammok/PRISMA}}
\maketitle

\section*{Introduction}

This vignette gives a first tour of the features of the \Prisma package.
We give an overview of the application of the algorithm; the full story
is available in the papers \cite{krueger12,krueger10}. If you use the
\Prisma package in your research, please cite at least one of these
references. The \Prisma package essentially consists of three parts:
\begin{enumerate}
\item Efficient reading of the output of \code{sally}, an extremely fast
  n-gram processor available at \url{http://www.mlsec.org/sally/}
\item Testing-based feature dimension reduction
\item Optimized matrix factorization of the reduced data, exploiting the
  replicate structure of the data
\end{enumerate}
For the theory behind these parts please consult
\cite{krueger12,krueger10}. We start this walk-through by reading
\code{sally} data, then show the inner structure of the resulting data
object, to which the replicate-aware non-negative matrix factorization
can be applied.

\section*{Loading the Data}

This section serves as a reference for how to apply the processing chain
to new data in order to get a usable \Prisma data set. The generated
data set is already prepackaged inside the \Prisma package and can be
loaded via \code{data(asap)}. Before executing the examples, please
extract asap.tar.gz, located in the \code{extdata} path of the \Prisma
package, to find all data necessary to understand the processing chain
from the raw data (asap.raw) to the sally file (asap.sally) and the
optimized file (asap.fsally). The asap.sally file can be produced as
follows:
\begin{verbatim}
sally -c asap.cfg asap.raw asap.sally
\end{verbatim}
This call generates asap.sally from the raw data found in asap.raw. To
speed up the loading of the data in R, one should apply the
\code{sallyPreprocessing.py} Python script as follows:
\begin{verbatim}
python sallyPreprocessing.py asap.sally asap.fsally
\end{verbatim}
Now the data is ready to be efficiently loaded and processed in R via
\code{loadPrismaData("asap")}, which also executes the feature dimension
reduction step.

\section*{The \Prisma Data Set}

As an example we use the prepackaged ASAP toy data set as described in
\cite{krueger10}:
<<>>=
data(asap)
asap
@
We see that the feature reduction step worked quite well. Let's have a
look behind the scenes:
<<>>=
asap$data
@
This shows us the reduced form of the initial data matrix in a features
$\times$ documents representation, i.e., a replicate-free version of it.
We can see that the features partly consist of grouped tokens (for
instance, \code{admin.php par action} contains 3 tokens, which always
co-occurred in the data) and how these tokens are present in the
different documents.
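Since the reduced data behaves like an ordinary numeric matrix with the
features as row names, its shape and the grouped feature names can also
be inspected directly. A minimal sketch (not evaluated here, and
assuming the printed representation above reflects a plain matrix):
<<eval=FALSE>>=
## rows are the (possibly grouped) features,
## columns are the replicate-free documents
dim(asap$data)
## feature names, including grouped tokens such as "admin.php par action"
rownames(asap$data)
@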
We can see the initial tokens before the grouping and their
corresponding group assignment in the \code{group} variable:
<<>>=
asap$group
@
The member variable \code{unprocessed} contains the initial data matrix
before the feature selection and grouping step. If we want to
reconstruct all replicates in the reduced feature space, we need the
\code{getDuplicateData} function:
<<>>=
dim(getDuplicateData(asap))
dim(asap$unprocessed)
@
This blows up the reduced matrix to the full 10,000 initial data points
in the reduced feature space. To see how often a specific entry of the
reduced data matrix was present in the original data, we can have a look
at the duplicate count:
<<>>=
asap$duplicatecount
sum(asap$duplicatecount)
@

\section*{The Replicate-Aware Non-Negative Matrix Factorization (NMF)}

The replicate-aware NMF is a matrix factorization method which describes
the data in terms of a new system of base vectors, i.e., each data point
is described as a weighted sum of these base vectors. Thus, the base
vectors can be seen as the parts of which a document is constructed.
Furthermore, the new coordinates of a document (the base weights) can
also be interpreted as a soft clustering. Before we can apply the NMF,
we need to specify the inner dimension of the factorization. This can
either be supplied as a number (which should be even if \code{pca.init}
is \code{TRUE}) or as a \code{prismaDimension} object generated by the
fully automated dimension estimation method:
<<>>=
asapDim = estimateDimension(asap)
asapDim
@
Equipped with this object, we can now apply the NMF to the data:
\begin{verbatim}
> asapNMF = prismaNMF(asap, asapDim, time=60)
Error: 3771.392
Error: 3113.138
Error: 2855.863
Error: 2810.286
Error: 2765.763
Error: 2755.29
Error: 2752.505
> asapLabels = getMatrixFactorizationLabels(asapNMF)
> table(asapLabels)
asapLabels
   1    2    3    4    5    6    7    8
 623  607  602  660 1696 2473  817 2522
\end{verbatim}
We can look at the results via \code{plot(asapNMF)}, which is shown in
Figure \ref{fig:asap}. We can see that the NMF extracts a \code{search}
template, the four \code{admin.php}-action templates, a Firefox
template, and two \code{static} templates, which reproduces the results
of \cite{krueger10}, Section 3.1, with added user agents as ``noise''.
\begin{figure}[tb]
\centering
\includegraphics{asap}
\caption{Result of the replicate-aware NMF on the \code{asap} data set.}
\label{fig:asap}
\end{figure}

\section*{Interface to the \pkg{tm} Package}

To allow the application of the replicate-aware NMF to corpora generated
by the \pkg{tm} package \cite{feinerer08}, the \Prisma package contains
a converter function which maps a \pkg{tm} corpus object to a \Prisma
data object. We exemplify this procedure with an already stemmed and
cleansed version of the 15 subsections of \cite{krueger2013}:
\begin{verbatim}
> data(thesis)
> thesis
A corpus with 15 text documents
> thesis = corpusToPrisma(thesis, NULL, TRUE)
> thesis
PRISMA data tm-Corpus
Unprocessed data: # features: 2002 # entries: 15
Processed data: # features: 2002 # entries: 15
> thesisNMF = prismaNMF(thesis, 3, pca.init=FALSE)
Error: 1329.73
Error: 1310.481
Error: 1295.959
Error: 1295.509
\end{verbatim}
Since we have just 15 documents, the feature reduction step and the
correlation analysis suffer from too little data, as does the PCA-based
initialization scheme. Thus, we skip all these processing steps and
apply the NMF directly to the data with three components as an educated
guess.
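As with the ASAP data, the base weights of this factorization can again
be read as a soft clustering of the 15 documents. A short sketch using
\code{getMatrixFactorizationLabels} as above (output omitted, since the
random initialization makes the exact cluster assignments vary):
\begin{verbatim}
> thesisLabels = getMatrixFactorizationLabels(thesisNMF)
> table(thesisLabels)
\end{verbatim}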
To analyze the result we look at the top-scoring words of each base
vector, i.e., those above the 99\% quantile (roughly the top 20 words)
of the corresponding column of the base matrix:
\begin{verbatim}
> isQuantile = (t(thesisNMF$B) > apply(thesisNMF$B, 2, quantile, prob=.99))
> maxFeatures = apply(isQuantile, 1, function(r) which(r == 1))
> rownames(thesis$data)[maxFeatures[, 1]]
 [1] "add"      "align"    "associ"   "cluster"  "communic" "correct"
 [7] "extract"  "fill"     "format"   "inner"    "machin"   "messag"
[13] "obvious"  "preserv"  "reflect"  "return"   "simul"    "templat"
[19] "trace"    "transit"  "tri"
> rownames(thesis$data)[maxFeatures[, 2]]
 [1] "behavior"   "chang"      "configur"   "crossvalid" "drop"
 [6] "fast"       "figur"      "follow"     "lead"       "learn"
[11] "lower"      "observ"     "optim"      "overal"     "procedur"
[16] "process"    "relat"      "shown"      "speed"      "statist"
[21] "use"
> rownames(thesis$data)[maxFeatures[, 3]]
 [1] "addit"    "applic"   "approach" "attack"   "base"     "construct"
 [7] "content"  "exploit"  "method"   "model"    "network"  "normal"
[13] "protocol" "server"   "similar"  "simpl"    "structur" "techniqu"
[19] "token"    "traffic"  "use"
\end{verbatim}
These word stems accurately describe the contents of the three chapters
of \cite{krueger2013}, which concludes the analysis of this section.

\bibliographystyle{plain}
\bibliography{PRISMA}

\end{document}