Main etl methods
Each of the three main etl methods must take an etl_foo object as its first argument, and must return an etl_foo object (preferably invisibly). These methods are pipeable and predictable, but not pure, since by design they have side effects (e.g., downloading files). Your major task in writing the foo package will be to write these functions. How you write them is entirely up to you, and the particular implementation will of course depend on the purpose of foo.
All three of the main etl methods should take the same set of arguments. Most commonly these define the span of time for the files that you want to extract, transform, or load. For example, in the airlines package, these functions take optional year and month arguments.
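As a minimal sketch of these conventions, the three methods for a hypothetical foo package might share a signature like the one below. The generics are redefined locally as stand-ins so that the sketch runs without etl installed (a real package would extend the generics exported by etl). The log_step() helper, the default values, and the method bodies, which only record what they were asked to do, are assumptions made for illustration.

```r
# Stand-in generics so this sketch runs on its own; a real package
# would extend the generics exported by etl instead.
etl_extract   <- function(obj, ...) UseMethod("etl_extract")
etl_transform <- function(obj, ...) UseMethod("etl_transform")
etl_load      <- function(obj, ...) UseMethod("etl_load")

# Hypothetical helper: record which step ran and for which period.
log_step <- function(obj, step, year, month) {
  entry <- sprintf("%s %d-%02d", step, as.integer(year), as.integer(month))
  attr(obj, "log") <- c(attr(obj, "log"), entry)
  obj
}

# All three methods take the object first, share the same arguments,
# and return the object invisibly.
etl_extract.etl_foo <- function(obj, year = 2020, month = 1, ...) {
  # a real method would download raw files here (the side effect)
  invisible(log_step(obj, "extract", year, month))
}

etl_transform.etl_foo <- function(obj, year = 2020, month = 1, ...) {
  # a real method would clean the raw files here
  invisible(log_step(obj, "transform", year, month))
}

etl_load.etl_foo <- function(obj, year = 2020, month = 1, ...) {
  # a real method would write the cleaned files to the database here
  invisible(log_step(obj, "load", year, month))
}

# A bare object with the right classes, standing in for a real etl_foo object.
foo <- structure(list(), class = c("etl_foo", "etl"))

foo <- etl_extract(foo, year = 2019, month = 6)
foo <- etl_transform(foo, year = 2019, month = 6)
foo <- etl_load(foo, year = 2019, month = 6)
attr(foo, "log")
```

Because each method returns the object it was given, the three calls could equally be chained with a pipe.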
We illustrate with cities, whose methods unfortunately take only the ... argument. Also, etl_cities uses etl_load.default(), so there is no etl:::etl_load.etl_cities() method.
# The methods are not exported, so reach into the etl namespace with :::
etl:::etl_extract.etl_cities %>% args()
etl:::etl_transform.etl_cities %>% args()
# There is no etl_load.etl_cities() method, so inspect the default instead
etl:::etl_load.default %>% args()
Other etl methods
There are four additional functions in the etl toolchain:
etl_init() - initialize the database
etl_cleanup() - delete unnecessary files
etl_update() - run etl_extract(), etl_transform(), and etl_load() in succession with the same arguments
etl_create() - run etl_init(), etl_update(), and etl_cleanup() in succession
These functions can generally be used without modification and thus are not commonly extended by foo.
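The composition described above can be sketched with stand-in functions. This is not the etl source: every name below (make_step(), the *_step functions, update_sketch(), create_sketch()) is hypothetical, and passing the arguments only through the update step is an assumption based on the description of etl_update().

```r
# Hypothetical step factory: each step records its name and returns
# the object invisibly, like the real etl methods are expected to.
make_step <- function(name) {
  function(obj, ...) {
    obj$steps <- c(obj$steps, name)
    invisible(obj)
  }
}

init_step      <- make_step("init")
extract_step   <- make_step("extract")
transform_step <- make_step("transform")
load_step      <- make_step("load")
cleanup_step   <- make_step("cleanup")

# Mirrors the description of etl_update(): extract, transform, and load
# in succession, each receiving the same arguments.
update_sketch <- function(obj, ...) {
  obj <- extract_step(obj, ...)
  obj <- transform_step(obj, ...)
  load_step(obj, ...)
}

# Mirrors the description of etl_create(): init, update, then cleanup.
create_sketch <- function(obj, ...) {
  obj <- init_step(obj)
  obj <- update_sketch(obj, ...)
  cleanup_step(obj)
}

obj <- create_sketch(list(steps = character(0)), year = 2019)
obj$steps
```

Since the wrappers only call the three main methods in order, a package that defines those methods gets the update and create behavior without extra work.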
The etl_init() function will initialize the SQL database.
If you want to contribute your own hard-coded SQL initialization script, it must be placed in inst/sql/. The etl_init() function will look there for files whose extensions match the database type. For example, scripts written for MySQL should have the .mysql file extension, while scripts written for PostgreSQL should have the .postgresql file extension.
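To make the extension-matching rule concrete, here is a small sketch of how such a lookup could work. The helper find_init_scripts() and the file names are hypothetical, and this is not how etl_init() is necessarily implemented; it only shows files being selected by an extension that equals the database type.

```r
# Hypothetical helper: return the scripts in sql_dir whose file
# extension equals the database type (e.g. "mysql", "postgresql").
find_init_scripts <- function(sql_dir, db_type) {
  pattern <- paste0("\\.", db_type, "$")
  list.files(sql_dir, pattern = pattern, full.names = TRUE)
}

# Build a throwaway directory standing in for a package's inst/sql/.
sql_dir <- file.path(tempdir(), "inst_sql_demo")
dir.create(sql_dir, showWarnings = FALSE)
file.create(file.path(sql_dir, c("init.mysql", "init.postgresql", "notes.txt")))

basename(find_init_scripts(sql_dir, "mysql"))       # only the MySQL script
basename(find_init_scripts(sql_dir, "postgresql"))  # only the PostgreSQL script
find_init_scripts(sql_dir, "sqlite")                # no match
```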
If no such file exists, all of the tables and views in the database will be deleted, and new table schemas will be created on the fly by dplyr.
etl_foo object attributes
Every etl_foo object has a directory where it can store files and a DBIConnection where it can write to a database. By default, these come from tempdir() and RSQLite::SQLite(), but the user can alternatively specify other locations.
## No database was specified so I created one for you at:
## /tmp/Rtmp6KULdZ/filed38749e44164.sqlite3
## List of 2
## $ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
## .. ..@ ptr :<externalptr>
## .. ..@ dbname : chr "/tmp/Rtmp6KULdZ/filed38749e44164.sqlite3"
## .. ..@ loadable.extensions: logi TRUE
## .. ..@ flags : int 70
## .. ..@ vfs : chr ""
## .. ..@ ref :<environment: 0x56233406dbd8>
## .. ..@ bigint : chr "integer64"
## .. ..@ extended_types : logi FALSE
## $ disco: NULL
## - attr(*, "class")= chr [1:6] "etl_cities" "etl" "src_SQLiteConnection" "src_dbi" ...
## - attr(*, "pkg")= chr "etl"
## - attr(*, "dir")= chr "/tmp/Rtmp6KULdZ"
## - attr(*, "raw_dir")= chr "/tmp/Rtmp6KULdZ/raw"
## - attr(*, "load_dir")= chr "/tmp/Rtmp6KULdZ/load"
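Output in this format comes from calling str() on an etl object. As a hedged sketch of specifying a different location, the snippet below passes a directory through the dir argument of etl(). The directory name is made up for illustration, the call is skipped if etl or RSQLite is not installed, and the database is left at its default rather than supplied explicitly.

```r
# A directory of our own choosing (the name is arbitrary).
my_dir <- file.path(tempdir(), "cities_data")
dir.create(my_dir, showWarnings = FALSE)

# Only run the etl call when the packages are available.
if (requireNamespace("etl", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # No database is given, so the default SQLite database is still used.
  cities <- etl::etl("cities", dir = my_dir)
  attr(cities, "dir")
}
```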
Note that an etl_foo object is also a src_dbi object and a src_sql object. Please see the dbplyr vignette for more information about these database connections.