---
title: "The parseLatex package"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{The parseLatex package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## Introduction

Parsing LaTeX is tricky, because LaTeX macros (in LaTeX packages, or
in user code) can change the parsing rules as they go.  

`parseLatex` is not a LaTeX interpreter (at least mostly; see the
detailed comparison below), so it can't follow such changes:  it
applies the same parsing rules to all the code that it looks at.  If you're
using a LaTeX package that uses non-standard rules, you can use
those, but they have to apply to the whole section of code
passed to `parseLatex()`.

Subject to the limitation that the code only uses one
set of
rules, `parseLatex` should be able to parse any LaTeX code.  It
extends the base `tools::parseLatex()` function in a few ways:

1.  It classifies every character in the source file according to 
TeX "catcodes".  The base function only handles some of them.
2.  The `parseLatex::parseLatex()` function marks its output with
class `"LaTeX2"` instead of `"LaTeX"`, and marks each item in the
output with class `"LaTeX2item"`.  This allows it to print things
in a more readable way.
3.  The `parseLatex` package includes a large selection of functions
for extracting and modifying parts of the parsed LaTeX.

More differences are listed below.
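
As a quick check of the second point, the following sketch parses the same string with both parsers and compares the classes of the results.  The input string and the variable names are made up for illustration.

```{r}
# Illustrative input; any short LaTeX fragment would do
txt <- "Hello \\textbf{world}!"

base_parsed <- tools::parseLatex(txt)
pkg_parsed <- parseLatex::parseLatex(txt)

class(base_parsed)      # class assigned by the base parser
class(pkg_parsed)       # class assigned by this package
class(pkg_parsed[[1]])  # class of an individual item
```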

## Demo

A simple demonstration is in order.

First, we use `knitr` to create a LaTeX table.
```{r}
library(knitr)
latex <- kable(mtcars[1:2, 1:2], format = "latex")
cat(latex)
```

Next, we parse it in `parseLatex`.  
```{r}
library(parseLatex)
parsed <- parseLatex(latex)
```
Printing the result would appear to duplicate the input, but in fact
it is quite different.  `parsed` is a list of class `"LaTeX2"`.  Items
in the list are of class `"LaTeX2item"`.  In this example, there
are only two items:  the blank that `knitr` puts at the beginning of
each table, and a second entry which is the whole table environment:
```{r}
parsed[[1]]
parsed[[2]]
```
"`SPECIAL`" and "`ENVIRONMENT`" label the types of items. The table
environment contains the environment name, and a `"LaTeX2"` list
containing all the content.

If we hadn't known where we put it, we could find the table location
using `find_env()`:
```{r}
find_env(parsed, "tabular")
```

We can extract the table, and use other functions to work with it:

```{r}
table <- parsed[[find_env(parsed, "tabular")]]
# Get the alignment options from the content
columnOptions(table)
tableCell(table, 2, 2) # The title counts as a row
tableCell(table, 1, 1) <- "Model"
table
```

## Differences from `tools::parseLatex`

The parser in this package is based on the one used by the base R
`tools::parseLatex` function (which I also wrote, based on other 
parsers in R).  The output format is similar, but not compatible.
These are the main differences.

- In both this package and `tools::parseLatex`, the result of calling
the parser is a list of items.  
- The list has class `"LaTeX2"` in this
package, and class `"LaTeX"` in `tools::parseLatex`.
- All items have an attribute returned by the `latexTag()` function
identifying the type of item.  In this package the possible tags are
```{r echo = FALSE}
tags <- data.frame(Tag = c("BLOCK", "COMMENT", "DISPLAYMATH", "ENVIRONMENT", "MACRO", "MATH", "SPECIAL", "TEXT", "VERB", "DEFINITION", "ERROR"),
        Description = c(
          "A block enclosed in curly braces",
          "A LaTeX comment",
          "A display math block",
          "A LaTeX environment",
          "A LaTeX macro",
          "An inline math block",
          "A non-alphabetic character",
          "Text (consisting of letters only)",
          "A verbatim environment",
          "A command or environment definition",
          "A block of items referenced in an error message"),
        Type = c("list", "character", "list", "list", "character",
                 "list", "character", "character", "character", "list", "list"))
knitr::kable(tags, booktabs = TRUE)
```
- The `tools::parseLatex()` parser does not have the `SPECIAL` tag;
such characters are included in `TEXT`.  It also doesn't
have the `DEFINITION` or `ERROR` tags.  Definitions are treated
as regular macros, which sometimes leads to parsing errors.
Errors are always fatal.
- This parser stops when it reaches `\end{document}`, just
as LaTeX does.  The `tools::parseLatex()` parser continues
parsing beyond that, often leading to parsing errors as
it tries to parse things that LaTeX would ignore.
- In both implementations, some items (`COMMENT`, `MACRO`, `SPECIAL`, `TEXT`, and `VERB`)
are stored as length 1 character vectors; the others are stored
as lists of items corresponding to their content.  
- The list storage is different between the two.  The
`tools::parseLatex()` function
stores some lists in two levels (e.g. the content of
an environment named `item` would be in `item[[2]]`), while in
this package, all lists contain the content directly (e.g.
the content of that environment would be in `item` itself).
- This package marks all items with class `"LaTeX2item"`.  `tools::parseLatex()` does not assign a class to items.
- This package provides print methods for class `"LaTeX2item"`
so that individual items print nicely.
- Parsing errors are reported more informatively by this package.
- This parser supports "magic comments".  See the next section for details.
- This parser is more flexible in handling `verb` macros
like `\Sexpr`.  The `tools::parseLatex()` parser assumed
there would be no braces within the macro (which is the
case for legal `Sweave()` source).  This parser assumes 
any braces within the macro are balanced, e.g. this would 
be legal:\
\
`\Sexpr{1 + {x <- 2; x + 1}}`\
\
whereas unbalanced braces would not be.
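
To see the tags in practice, here is a minimal sketch that parses two small made-up inputs and asks `latexTag()` for the tag of each top-level item.  The second input is the `\Sexpr` example with balanced braces from the list above.  The input strings and variable names are illustrative only.

```{r}
library(parseLatex)

# Hypothetical input mixing a macro, a block, text and a comment
mixed <- parseLatex("\\textbf{bold} text % a comment\n")
tags <- sapply(mixed, latexTag)
tags

# Balanced braces inside \Sexpr should be accepted by this parser
sexpr <- parseLatex("\\Sexpr{1 + {x <- 2; x + 1}}")
sapply(sexpr, latexTag)
```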


As mentioned above, `parseLatex()` does a little bit more
than parsing.  Both versions recognize LaTeX environments
and verbatim code.

The parser in this package also takes special action 
when it sees the `document` environment:  it stops
parsing at `\end{document}`. (You can use the
`get_leftovers()` function to see what parts of the input
were skipped.) 
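
A small sketch of this behaviour follows.  The document text is invented, and the call assumes that `get_leftovers()` accepts the source text as its first argument; check `?get_leftovers` if it behaves differently in your version.

```{r}
library(parseLatex)

# Made-up document with extra text after \end{document}
doc <- paste(
  "\\begin{document}",
  "Some text.",
  "\\end{document}",
  "Scratch notes that LaTeX would ignore.",
  sep = "\n"
)

parsed_doc <- parseLatex(doc)
parsed_doc

# Assumption: the first argument is the original source text
get_leftovers(doc)
```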

It also changes the rules
a bit when it sees macros 
defining things:  `\newenvironment`, `\renewenvironment`,
`\newcommand`, `\renewcommand` and `\providecommand`.  The 
arguments to these macros are parsed but not interpreted,
allowing definitions to parse without triggering
a syntax error.  For example:

    \newenvironment{newenv}{\begin{oldenv}}{\end{oldenv}}
    
The `\begin{oldenv}` part of the definition shouldn't be
interpreted here as the start of an `oldenv` environment,
because `\end{oldenv}` isn't in the same `{}` block.
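
The definition above can be passed to the parser as is.  This sketch only checks that it parses and shows which tags the top-level items receive; the variable name is illustrative.

```{r}
library(parseLatex)

# The definition from the text, with backslashes escaped for R
def <- parseLatex(
  "\\newenvironment{newenv}{\\begin{oldenv}}{\\end{oldenv}}"
)
def
sapply(def, latexTag)
```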

The plain TeX counterpart of these macros is `\def`.  The parser
recognizes it and tries to handle it, but `\def` allows
some really arcane syntax.  If you use
that, it probably won't be parsed properly.  Stick with
simple syntax like

    \def\bea{\begin{eqnarray*}}
    
and you should be okay.  

## Magic Comments

The `parseLatex::parseLatex()` parser can parse most
LaTeX inputs, but not all.  To allow it to be used on
files that contain unsupported syntax, it allows "magic
comments" to be inserted to control its actions.

Several LaTeX editors support magic comments of the form 
`% !TEX ...`, and those were the model for `parseLatex`
magic comment support.  There are six magic comments supported
in this parser:

- `% !parser off` This tells the parser to absorb all following
text as part of the comment, so anything that would be
classed as a parsing error is never seen.
- `% !parser on` This tells it to resume normal parsing.
- `% !parser verb [name]` This tells the parser
to add the name to the list of macros holding verbatim
text, i.e. the list given by the `verb` argument when
`parseLatex()` was called.  The name should include the 
backslash, e.g.\
\
`  % !parser verb \Sexpr`\
\
would add the default verb macro.
- `% !parser defcmd [name]` This tells the parser to add the name
to the list of macros that define commands, like `\newcommand`.
- `% !parser defenv [name]` This does the same for macros that
define environments, like `\newenvironment`.
- `% !parser verbatim [name]` This tells the parser
to add the name to the list of environments holding verbatim
text, i.e. the list given by the `verbatim` argument. 
For example\
\
`  % !parser verbatim Sinput`\
\
would add one of the default verbatim environments.
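
As a minimal sketch of the first two magic comments, the following made-up input hides an unmatched `\begin` from the parser.  The text and variable names are illustrative; each magic comment sits on its own line with single spaces, following the formatting rules described in this section.

```{r}
library(parseLatex)

# Hypothetical source: the line between the magic comments
# would normally be a parsing error
src <- paste(
  "Before.",
  "% !parser off",
  "\\begin{oops} with no matching end",
  "% !parser on",
  "After.",
  sep = "\n"
)

magic <- parseLatex(src)
magic
sapply(magic, latexTag)
```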

The parser is quite strict about the format of the magic
comments.  The whitespace between parts of it must be
spaces, not tabs, and nothing else can appear in the comment
after the magic text other than more spaces.

## Work in progress!

This is a work in progress, so if you have a use for something like
this and need help, post an "issue" on the GitHub page: 
https://github.com/dmurdoch/parseLatex .