R-Package to facilitate the annotation of text data by hand in R.
The goal of the handcodeR package is to provide an easy to use app to
annotate text data by hand. Often times when we work with text data, we
rely on hand coded annotations of texts either as unit of analysis in
itself, or as training and test samples for supervised machine learning
tools to classify text data. handcodeR offers a Shiny-App that can be
run within R to annotate individual texts one by one in up to six
different variables. To do so, the package uses the function
handcode()
:
handcode()
opens a Shiny-App which allows for
hand-coding strings of text into pre-defined categories. You can code
between one and six variables at a time. It returns a data frame with
your coded annotations.I present a short step-by-step guide as well as the functions in more detail below.
To cite the handcodeR package, you can use:
Isermann, Lukas. (2023). handcodeR: Text annotation app. R package version 0.1.2. http://doi.org/10.5281/zenodo.8075100.
You can also access the preferred citation as well as the bibtex entry for the handcodeR Package via R:
citation("handcodeR")
#> To cite handcodeR in publications, please use:
#>
#> Isermann, Lukas. 2023. handcodeR: Text annotation app. R package
#> version 0.1.2. https://doi.org/10.5281/zenodo.8075100
#>
#> Ein BibTeX-Eintrag für LaTeX-Benutzer ist
#>
#> @Misc{,
#> title = {handcodeR: Text annotation app},
#> author = {Lukas Isermann},
#> year = {2023},
#> note = {R package version 0.1.2},
#> doi = {10.5281/zenodo.8075100},
#> url = {https://github.com/liserman/handcodeR},
#> }
A stable version of handcodeR
can be directly accessed
on CRAN:
install.packages("handcodeR", force = TRUE)
To install the latest development version of handcodeR
directly from GitHub
use:
library(devtools) # Tools to Make Developing R Packages Easier
::install_github("liserman/handcodeR", force = TRUE) devtools
First, load the package
library(handcodeR) # classify texts by hand in R
In the following, we are going to exemplify the workflow of the package using a minimal working example.
The workflow of the package follows a simple rule:
If you start the coding process, initialize the coding with
handcode()
by providing a text vector of texts you wish to
annotate as data
input, and up to three named character
vectors of categories you want to code. Hand code as much data as you
would like and return the output data frame via the
save and exit
-button.
If you want to resume coding that you have already been working
on, continue the coding with handcode()
by providing the
data frame you received as output from your last call of
handcode()
as data
input.
The main function of the handcodeR package is
handcode()
. handcode()
takes either a vector
of texts and up to 6 named character vectors with classification
categories, or a data frame already initialized by
handcode()
as input. The function allows users to annotate
texts using the pre-defined categories in an interactive Shiny-App and
returns a data frame of the texts with their annotations.
In order to demonstrate the functionality of handcode()
,
we first use the R-package archiveRetriever
to download a New York Times article on the presidential debate between
Joe Biden and Donald Trump in the 2020 American presidential campaign.
We split the article in individual sentences which we can then annotate
with handcode()
.
# Install pacman if not already installed
if(!require(pacman)) install.packages("pacman")
# Use pacman to install and load archiveRetriever and stringr
::p_load(archiveRetriever,
pacman
stringr)
# Use the archiveRetriever to download article
<- scrape_urls(Urls = "http://web.archive.org/web/20201001004918/https://www.nytimes.com/2020/09/30/opinion/biden-trump-2020-debate.html",
nytimes_article Paths = c(title = "//h1[@itemprop='headline']",
author = "//span[@itemprop='name']",
date = "//time//text()",
article = "//section[@itemprop='articleBody']//p"))
# Split up the article in different sentences
<- unlist(str_split(nytimes_article$article, pattern = "(?<=(?<!Mr)[\\.!?])\\s"))
sentences
head(sentences)
#> [1] "I wasn’t in the crowd of people who believed Joe Biden shouldn’t deign to debate President Trump, but put me in the crowd that believes he shouldn’t debate him again."
#> [2] "Not after Tuesday night’s horror show: a disgrace to the format, an insult to the country, a nearly pointless 90 minutes."
#> [3] "And, I should add, a degradation of the presidency itself, which Trump had degraded so thoroughly already."
#> [4] "He put on a performance so contemptuous, so puerile, so dishonest and so across-the-board repellent that the moderator, Chris Wallace, morphed into some amalgam of elementary-school principal, child psychologist, traffic cop and roadkill."
#> [5] "No matter how Wallace pleaded with Trump or admonished him, he couldn’t make him behave."
#> [6] "But then why should Wallace have an experience any different from that of Trump’s chiefs of staff, of all the other former administration officials who have fled for the hills, of the Republican lawmakers who just threw up their hands and threw away any scruples they had?"
We can now use these sentences as input in handcode()
to
annotate the individual sentences of the New York Times article. We will
annotate two variables measuring the candidate a sentence talks about
and the sentiment of the statement.
<- handcode(data = sentences,
annotated candidate = c("Joe Biden", "Donald Trump"),
sentiment = c("positive", "negative"))
If we want to see not only the sentence we are currently coding, but
also the surrounding sentences, we can use the option
context = TRUE
. This gives us our current sentence
alongside its previous and following sentence. To not generate any
confusion about which sentence is currently evaluated, the surrounding
sentences are shown in gray.
<- handcode(data = sentences,
annotated candidate = c("Joe Biden", "Donald Trump"),
sentiment = c("positive", "negative"),
context = TRUE)
If our text vector does not form a continuous text, but you
nonetheless want to provide a previous and next sentence as context, you
can also specify a vector with all previous and a vector with all next
sentences as pre
and post
inputs.
# Vectors of all previous and all subsequent sentences
<- c("", sentences[2:length(sentences)])
previous <- c(sentences[2:length(sentences)-1])
subsequent
<- handcode(data = sentences,
annotated candidate = c("Joe Biden", "Donald Trump"),
sentiment = c("positive", "negative"),
context = TRUE,
pre = previous,
post = subsequent)
We can stop the annotation process at any point by clicking on the
button save and exit
. Once we click this button, the app
will close and the function returns a data frame with our texts and
annotations.
annotated#> # A tibble: 60 × 3
#> texts candidate sentiment
#> <chr> <fct> <fct>
#> 1 I wasn’t in the crowd of people who believed Joe Biden s… "Joe Bid… "negativ…
#> 2 Not after Tuesday night’s horror show: a disgrace to the… "Not app… "negativ…
#> 3 And, I should add, a degradation of the presidency itsel… "" ""
#> 4 He put on a performance so contemptuous, so puerile, so … "" ""
#> 5 No matter how Wallace pleaded with Trump or admonished h… "" ""
#> 6 But then why should Wallace have an experience any diffe… "" ""
#> 7 Trump runs roughshod over everyone and everything, and o… "" ""
#> 8 Almost from the start, he talked over Biden, taunting hi… "" ""
#> 9 He interrupted him and interrupted him and then interrup… "" ""
#> 10 “Mr. President, I’m the moderator of this debate, and I … "" ""
#> # ℹ 50 more rows
We can resume the annotation process at any point by using the
returned data frame from our last execution of handcode()
as input to a new handcode()
command. By default, the
function will resume the annotation at the first text that has not been
annotated yet.
<- handcode(data = annotated,
annotated context = TRUE)
To facilitate the classification process, handcode()
takes the keyboard shortcuts space
for
previous
and enter
for next
. If
you go back to already coded lines of your data, the app automatically
displays your previous coding, if you go to new lines of your data, the
default values for your variables always are ““. If the last row of your
data is reached, next
automatically leads to the saving of
the data and exit from the Shiny-App.
By default, handcode
uses the first line in the input
data that has not been annotated yet as start value. However, the option
start
allows users to specify with which observation they
want to start their coding process. If we have not yet annotated lines
of data that lie between already coded lines of data, you can also
specify start = "all_empty"
to annotate all lines that have
not been coded yet in the order in which they appear.
Sometimes, we explicitly want to display texts in a random order to
rule out that the context of a text within the larger body of texts
influences our annotations. If we want to randomize the order in which
texts are displayed, we can set the option
randomize = TRUE
. This will, however, not influence the
order in which texts are sorted in the resulting output.
As a default, handcode
will display one missing category
“Not applicable”. If you want a different, or more than one missing
category, you can provide a character vector of missing categories you
would like to have displayed as missing
. Missing categories
will automatically be displayed in gray. In the output these values will
be returned with a leading and trailing _
.
<- handcode(data = sentences,
annotated candidate = c("Joe Biden", "Donald Trump"),
sentiment = c("positive", "negative"),
missing = c("Not applicable", "Undecided"))