This tutorial shows how to use gittargets with the Git-based data
versioning backend. Before proceeding, please read the README file or
documentation website front page for an overview of the package.
Please begin with the installation instructions on the documentation
website. In addition, if your targets pipeline generates large data
files, consider installing Git LFS. The Git data backend in gittargets
automatically opts into Git LFS, so you should not need to do any manual
configuration to reap the performance benefits.
This backend uses local Git only. It is possible to push the snapshotted
data store to a cloud service like GitHub, GitLab, or Bitbucket, but this
is the user’s responsibility. Pipelines usually generate large data
files, and GitHub and its peers have file size limitations. Also,
gittargets automatically opts into Git LFS locally (unless git_lfs is
FALSE in tar_git_init()), and Git LFS on the cloud is a paid service.
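As a brief sketch of the opt-out mentioned above, the git_lfs argument can be set when the data repository is first created. This assumes gittargets is installed and that the working directory is the root of a targets project whose data store already exists.

```r
library(gittargets)

# Initialize the data store's Git repository without the Git LFS opt-in.
# `git_lfs` is the argument named in the text; all other arguments keep
# their defaults.
tar_git_init(git_lfs = FALSE)
```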
The most important steps of the Git data backend are as follows. The rest of this vignette walks through these steps in greater depth.
1. tar_git_init(): initialize a Git/Git-LFS repository for the data store.
2. Run the pipeline (e.g. tar_make()) and commit any changes to the source code.
3. tar_git_snapshot(): create a data snapshot for the current code commit.
4. tar_git_checkout(): revert the data to the appropriate prior snapshot.
To begin development, we write a _targets.R file for a targets pipeline.
targets can handle large complex pipelines for machine learning, Bayesian
data analysis, and much more. However, this tutorial focuses on a much
simpler pipeline for the sake of pedagogical simplicity.
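The target script itself does not appear in this copy of the tutorial. A minimal sketch, assuming it mirrors the UKgas version shown later but loads the airquality dataset that appears in the output below, might look like this:

```r
# _targets.R (assumed contents, reconstructed from the later UKgas version)
library(targets)
list(
  tar_target(data, datasets::airquality), # built-in R dataset
  tar_target(result, summary(data))       # summary statistics per column
)
```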
With our target script in hand, we run the pipeline.[1]
tar_make()
#> ▶ dispatched target data
#> ● completed target data [0.001 seconds]
#> ▶ dispatched target result
#> ● completed target result [0.002 seconds]
#> ▶ end pipeline [0.073 seconds]
We inspect the output with tar_read().
tar_read(result)
#>      Ozone           Solar.R           Wind             Temp
#>  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
#>  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
#>  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
#>  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
#>  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
#>  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
#>  NA's   :37       NA's   :7
#>      Month            Day
#>  Min.   :5.000   Min.   : 1.0
#>  1st Qu.:6.000   1st Qu.: 8.0
#>  Median :7.000   Median :16.0
#>  Mean   :6.993   Mean   :15.8
#>  3rd Qu.:8.000   3rd Qu.:23.0
#>  Max.   :9.000   Max.   :31.0
#>
We usually iterate between writing code and running the pipeline until
we have a decent set of results. After that, we commit the code to a Git
repository, which may or may not live on GitHub.[2] Happy Git with R is a
great way to learn Git, and the gert package is a convenient way to
interact with Git from R.
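The commands for this first code commit are not shown here. A minimal sketch using gert, assuming the project is not yet a Git repository, that a Git user name and email are already configured, and reusing the commit message that appears later in the checkout log, could be:

```r
library(gert)

git_init()            # create the code repository in the project root
git_add("_targets.R") # stage only the target script, not the _targets/ data store
git_commit("Begin analyzing the airquality dataset")
```

Leaving _targets/ unstaged matches the status report further down, where the data store is listed as a new, unstaged path in the code repository.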
Before we snapshot the data, we should check that the code is up to
date in the Git repository and the targets are up to date in the
pipeline. The tar_git_status() function is an easy way to do this.[3]
tar_git_status()
#>
#> ── Data Git status ─────────────────────────────────────────────────────────────
#> ✖ No Git repository for the data store.
#> ! Create one with `gittargets::tar_git_init()`.
#>
#> ── Code Git status ─────────────────────────────────────────────────────────────
#> # A tibble: 1 × 3
#>   file      status staged
#>   <chr>     <chr>  <lgl>
#> 1 _targets/ new    FALSE
#>
#> ── Outdated targets ────────────────────────────────────────────────────────────
#> ✔ All targets are up to date.
Our code and pipeline look ready for a data snapshot. First, we
initialize the data repository with tar_git_init(). tar_git_init()
writes a .gitattributes file in the data store that automatically opts
into Git LFS. If you have Git LFS but do not wish to use it, please
remove the .gitattributes file after calling tar_git_init().
tar_git_init()
#> ✔ Created data store Git repository
#> ✔ Wrote to _targets/.gitattributes for git-lfs: <https://git-lfs.com>.
#> ✔ Created stub commit without data.
#> • Run tar_git_snapshot() to put the data files under version control.
Then, we create our first data commit with tar_git_snapshot().[4]
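The call that produces the log below is not shown in this copy. A sketch, assuming the confirmation prompt is skipped through the status argument of tar_git_snapshot() (an assumption about how the example was run, not something the text states):

```r
library(gittargets)

# Snapshot the data store for the current code commit.
# status = FALSE skips the tar_git_status() report and the confirmation prompt.
tar_git_snapshot(status = FALSE)
```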
#> • Creating data branch code=c6aabcc36fb7ed7ca269312b3367dca1c3db40e4.
#> • Staging data files.
#> ✔ Staged 6 files in the data store.
#> • Committing data changes.
#> ✔ Created new data snapshot a0bd5f1f3e341e83b577702490d180f9dc80a358.
#> • Packing references.
In the Git data backend, a data snapshot is a special kind of Git
commit (gray boxes above). Each data commit is part of a data branch
(vertical dashed lines above), and each data branch is specific to the
current code commit (green and brown boxes above). In fact, each data
branch name is of the form "code=<SHA1>", where <SHA1> is the Git SHA1
hash of the corresponding code commit. You can always create a data
snapshot, but it will supersede any prior data snapshot you already have
for the current code commit. To revert to a previous data snapshot for a
given code commit, you will need to manually enter the data repository
and check out the relevant data commit.
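To see this branch layout in your own project, one option is to point gert at the data store. The sketch below assumes the data store is the default _targets/ folder and that it holds the Git repository created by tar_git_init(). It only reads from the repository and changes nothing.

```r
library(gert)

# List the branches of the data repository. Data branches are expected to be
# named "code=<SHA1>", one per code commit that has a snapshot.
branches <- git_branch_list(repo = "_targets")
branches[startsWith(branches$name, "code="), c("name", "commit", "updated")]

# Show the data commits reachable from the currently checked-out data branch.
git_log(repo = "_targets")
```

Checking out one of the listed commits by hand is then an ordinary Git operation inside _targets/.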
Development typically happens in cycles: develop the code, run the
pipeline, commit the code, snapshot the data, and repeat. Not all code
commits need a data snapshot, especially if the targets pipeline
generates a lot of data. But even then, it is helpful to snapshot the
data at key milestones, e.g. if an alternative research question comes up
and it is desirable to create a new Git branch for the code. For example,
suppose we wish to apply the same pipeline to a different dataset. The
code changes:
# _targets.R
library(targets)
list(
  tar_target(data, datasets::UKgas), # different dataset
  tar_target(result, summary(data))
)
We run the pipeline and inspect the new output.
tar_make()
#> ▶ dispatched target data
#> ● completed target data [0.001 seconds]
#> ▶ dispatched target result
#> ● completed target result [0.001 seconds]
#> ▶ end pipeline [0.063 seconds]
We put the code in a new Git branch.
git_branch_create("UKgas")
git_add("_targets.R")
#> # A tibble: 2 × 3
#>   file       status   staged
#>   <chr>      <chr>    <lgl>
#> 1 _targets.R modified TRUE
#> 2 _targets/  new      FALSE
git_commit("Switch to UKgas dataset")
#> [1] "7a2cebc2be82e62ee9628b144ea392832ae4d6bd"
Finally, we create a data snapshot for the new code commit.
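The call that produces the log below is not shown in this copy. A sketch, assuming the confirmation prompt is again skipped with the status argument:

```r
library(gittargets)

# Create a data branch and snapshot for the new code commit on the UKgas branch.
tar_git_snapshot(status = FALSE)
```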
#> • Creating data branch code=7a2cebc2be82e62ee9628b144ea392832ae4d6bd.
#> • Staging data files.
#> ✔ Staged 5 files in the data store.
#> • Committing data changes.
#> ✔ Created new data snapshot 62120b4ce2256ea68de6acb6671288c9a1848a7c.
#> • Packing references.
Now, suppose we want to switch the project back to the original dataset
(airquality). To transition completely, we need to revert both the code
and the data. If we only revert the code, then the data store will still
reflect the UKgas dataset, and none of our targets will be up to date.
This is a good time to pause and check the gittargets log to see which
code commits have available data snapshots.[5]
tar_git_log()
#> # A tibble: 2 × 6
#>   message_code  message_data time_code           time_data           commit_code
#>   <chr>         <chr>        <dttm>              <dttm>              <chr>
#> 1 Switch to UK… Switch to U… 2023-12-04 13:55:36 2023-12-04 13:55:36 7a2cebc2be…
#> 2 Begin analyz… Begin analy… 2023-12-04 13:55:34 2023-12-04 13:55:35 c6aabcc36f…
#> # ℹ 1 more variable: commit_data <chr>
To check out the old airquality code, we use gert::git_branch_checkout().
But because we did not revert the data, our results still reflect the
UKgas dataset.
Thus, all our targets are out of date.
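The commands for this step are not shown in this copy. The sketch below assumes the original code lives on a branch named airquality (a hypothetical name; substitute whichever branch holds the first commit) and uses tar_outdated() from targets to list the targets that no longer match the code.

```r
library(gert)
library(targets)

# Switch the code repository back to the original branch.
# "airquality" is an assumed branch name for illustration.
git_branch_checkout("airquality")

# The data store still holds the UKgas results, so this prints the UKgas summary.
tar_read(result)

# Names of targets that are out of date relative to the checked-out code.
tar_outdated()
```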
To bring our targets back up to date with the airquality data, instead
of beginning a potentially long computation with tar_make(), we can check
out the data snapshot that matches our current code commit.
tar_git_checkout()
#> ✔ Checked out data snapshot a0bd5f1f3e341e83b577702490d180f9dc80a358.
#> • Code commit: code=c6aabcc36fb7ed7ca269312b3367dca1c3db40e4
#> • Message: Begin analyzing the airquality dataset
#> • Resetting to HEAD of checked-out snapshot.
Now, our results reflect the airquality dataset we previously analyzed.
tar_read(result)
#>      Ozone           Solar.R           Wind             Temp
#>  Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
#>  1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
#>  Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
#>  Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
#>  3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
#>  Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
#>  NA's   :37       NA's   :7
#>      Month            Day
#>  Min.   :5.000   Min.   : 1.0
#>  1st Qu.:6.000   1st Qu.: 8.0
#>  Median :7.000   Median :16.0
#>  Mean   :6.993   Mean   :15.8
#>  3rd Qu.:8.000   3rd Qu.:23.0
#>  Max.   :9.000   Max.   :31.0
#>
And all our targets are up to date.
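One way to confirm this, sketched with tar_outdated() from targets, which returns the names of the targets that would rerun:

```r
library(targets)

# An empty character vector means no target needs to rerun.
outdated <- tar_outdated()
length(outdated) == 0
```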
It is common to merge code branches into one another. When this
happens, a new merge commit is created in the code repository, and the
data repository remains unchanged. In fact, the only change is that the
code repository is now at a new commit that has no data snapshot yet. If
you wish, simply create a new data snapshot with tar_git_snapshot(). If
the code commit immediately prior had an up-to-date data snapshot of its
own, then the new snapshot for the merge commit should cost little
storage or runtime.
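A minimal sketch of that sequence, assuming a hypothetical branch named feature is merged into the current branch with gert:

```r
library(gert)
library(gittargets)

# Merge a hypothetical "feature" branch into the current code branch.
# (A fast-forward merge would not create a new merge commit.)
git_merge("feature")

# Review code, data, and target status before deciding to snapshot.
tar_git_status()

# Optionally record a data snapshot for the new code commit.
tar_git_snapshot()
```

If tar_git_status() reports outdated targets after the merge, rerunning the pipeline before the snapshot keeps the snapshot consistent with the merged code.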
Only files inside the targets data store are tracked in a gittargets
data snapshot. If your pipeline requires custom external files, you may
put them in a folder called _targets/user/. That way, gittargets will
automatically put them under data version control in the next snapshot.
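For instance, a file could be copied into that folder with base R. The file name below is hypothetical and is created on the spot only so the sketch is self-contained.

```r
# Create the user folder inside the data store if it does not exist yet.
user_dir <- file.path("_targets", "user")
dir.create(user_dir, recursive = TRUE, showWarnings = FALSE)

# A stand-in external file (hypothetical name and contents).
writeLines("site,value\nA,1", "lookup.csv")

# Copy it into the data store so the next snapshot includes it.
file.copy("lookup.csv", file.path(user_dir, "lookup.csv"), overwrite = TRUE)
```

The pipeline would then read the file from _targets/user/lookup.csv rather than from its original location.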
If your targets pipeline generates large data files, consider
installing Git LFS. Once you install Git LFS, it should just work on your
project right out of the box because tar_git_init() writes Git LFS
tracking rules to _targets/.gitattributes.
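The contents of that file are not reproduced in this copy. To inspect them in your own project, a small base R sketch (assuming the default _targets/ data store and that tar_git_init() has already run):

```r
attributes_file <- file.path("_targets", ".gitattributes")

# Print the Git LFS tracking rules if the file exists.
if (file.exists(attributes_file)) {
  writeLines(readLines(attributes_file))
} else {
  message("No .gitattributes found; run tar_git_init() first.")
}
```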
In addition, every data snapshot with tar_git_snapshot() creates a new
Git branch. With thousands of commits and thus thousands of branches,
performance may suffer unless you ensure pack_refs is TRUE in the
function call (default).[6]
[1] https://books.ropensci.org/targets/hpc.html describes heavy-duty alternatives to tar_make().
[2] Alternatives to GitHub include GitLab and Bitbucket.
[3] Helper functions tar_git_status_code(), tar_git_status_targets(), and tar_git_status_data() each generate a piece of the tar_git_status() output.
[4] Ordinarily, tar_git_snapshot() runs tar_git_status() and prompts the user to confirm the snapshot. But in this example, we skip this step.
[5] If you choose not to call tar_git_snapshot() for some code commits, then not all your code commits will have available data snapshots.
[6] Alternatively, you can call tar_git_snapshot(pack_refs = FALSE) and then afterwards run git pack-refs --all (https://git-scm.com/docs/git-pack-refs) in the command line with your working directory inside _targets/.