--- title: "Advanced Scheduling" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Advanced Scheduling} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- This vignette covers more advanced concepts and examples related to scheduling. At it's core, maestro adheres to two keys principles: 1. Stateless: It does not need to be continuously running - it can be run in a serverless architecture 2. Use of *rounded scheduling*: The timeliness of pipeline executions depends on how often you run your orchestrator Here, we'll explain these concepts more deeply in the context of maestro and give some concrete examples. ## Stateless Execution Maestro takes a unique approach to scheduling compared to other orchestration tools. Whereas most schedulers involve a continuously running program to monitor the time and execute jobs when the current time is right, maestro is designed to run intermittently. It also doesn't need to save or cache data between executions - in other words, it's *stateless*. This design has several benefits; namely, you can run it in a serverless way which saves on compute resources. However, to achieve this it takes some shortcuts which may mean that precise timeliness is lost. This will become clearer in our examples. ## Rounded Scheduling The timeliness of a pipeline is measured in how close the scheduled execution time is to the actual execution time. **Maestro is only as timely as it needs to be relative to the unit of time you are interested in**. This is the concept of *rounded scheduling*. If you run your orchestrator once daily, then the timeliness of the pipelines will be within the nearest day - it doesn't care that you specified a pipeline to run at exactly 09:21:20 each day. If you run it every 10 minutes, then the timeliness of the pipelines will be within the nearest 10 minute interval. Let's look at some examples: First, we'll consider one pipeline that is scheduled to run daily at 09:20:00 and we'll configure the orchestrator to run daily. ```{r, echo=FALSE, warning=FALSE, message=FALSE} dir.create("pipelines") writeLines( " # ./pipelines/daily_example.R #' daily_example maestro pipeline #' #' @maestroFrequency 1 day #' @maestroStartTime 2024-06-20 09:20:00 daily_example <- function() { # Pipeline code } ", con = "pipelines/daily_example.R" ) ``` ```{r, eval=FALSE} # ./pipelines/daily_example.R #' daily_example maestro pipeline #' #' @maestroFrequency 1 day #' @maestroStartTime 2024-06-20 09:20:00 daily_example <- function() { # Pipeline code } ``` For demonstration purposes, I'll manually set the check time to be 08:00:00 UTC (this is the time maestro will use to compare against the scheduled time). In practice, you almost always want this to be the system time using either `Sys.time()` or `lubridate::now()`. ```{r} # ./orchestrator.R library(maestro) schedule <- build_schedule() status <- run_schedule( schedule, orch_frequency = "1 day", check_datetime = as.POSIXct("2024-06-20 08:00:00", tz = "UTC") ) ``` We can see that the pipeline executed even though the current time was not 09:20:00. This is because we set the orchestrator to run daily and so it considers it close enough to within a day. Let's see what happens if we up the frequency of the orchestrator: ```{r} # ./orchestrator.R status <- run_schedule( schedule, orch_frequency = "15 minutes", check_datetime = as.POSIXct("2024-06-20 08:00:00", tz = "UTC") ) ``` It was skipped because it wasn't within a 15 minute degree of difference but the output tells us that will next run at `2024-06-20 09:15:00`.\ \ The takeaway message is that the timeliness of the pipeline depends on how frequently the orchestrator runs. Remember that when you declare the `orch_frequency = "15 minutes"` that is essentially a contract stating that you *will* run it every 15 minutes - maestro does not do this for you. If you run the orchestrator more or less frequently than you said you would unexpected things will happen. Specifically, if you run it more frequently than stated, your pipelines will run more often than expected, likewise less frequently than stated means that pipelines won't run as often. ## *How often should I schedule my orchestrator?* If you have a single pipeline or even multiple pipelines that all run at the same time this is an easy question to answer. In practice (and in our experience using maestro in production) you have multiple pipelines that run at different intervals and different times. Maybe some run hourly, some run daily, and others run monthly. ### Example 1 Let's say we have three pipelines with the following frequencies and start times: ```{r, echo=FALSE} data.frame( name = paste0("pipe", 1:3), frequency = c("1 hour", "2 days", "4 months"), start_time = as.POSIXct(c("2024-06-18 12:30:00", "2024-06-18 06:00:00", "2024-06-20 00:00:00")) ) ``` A good starting point is to schedule it as often as the highest frequency pipeline in the project - so `1 hour` in the above example. If you run it on \*:30:00 minute each day, pipe1 will execute nearly exactly at the scheduled time and the other pipelines will be executed 30 minutes early. If you're comfortable with this margin of error then it's no big deal, but if not then an orchestrator frequency of `30 minutes` will ensure all pipelines run as scheduled exactly. Let's see another example: ### Example 2 ```{r, echo=FALSE} data.frame( name = paste0("pipe", 4:6), frequency = c("1 hour", "1 hour", "1 hour"), start_time = as.POSIXct(c("2024-06-18 00:00:00", "2024-06-18 00:10:00", "2024-06-18 00:20:00")) ) ``` All three pipelines are hourly but they start on different 10-minute intervals. If we run the orchestrator at `1 hour` they'll all execute at the same time. If it is important that the execute at different times, we should set it to `10 minutes`. So a good heuristic is to run it as often as the smallest interval of time difference between any pipeline. This is pretty good so long as we don't run it so often that the pipelines can't complete before the next execution time. We don't recommend running the orchestrator more frequently than every 5 minutes unless you're confident that your pipelines are fast to execute.[^1] [^1]: Make sure you consider the amount of time it takes to execute `build_schedule()` and that you account for the additional work done in `run_schedule()` not related to your actual pipeline logic. Maestro has a function for providing a reasonable estimate for the ideal orchestrator frequency called `suggest_orch_frequency()`. It uses the heuristic of the most frequent interval halved (e.g., if your highest frequency pipeline is 1 hour, it suggests 30 minutes) to a minimum 15 minutes or your most frequent interval if it's less. This is a bit more conservative but protects against running things too often. If timeliness and high frequency are really important to you, maestro might not be your best choice. ## Final Remarks When it comes time to deploy your project make sure that whatever you use to actually run your project (e.g., cron, TaskScheduler, Google Cloud Scheduler, etc.) is indeed running it at the same frequency as you have in the orchestrator. It's best to stick to using whole units of time rather than fractional units - if my orchestrator runs every 15 minutes I like to have it run on the 00:00, 15:00, 30:00, and 45:00 minutes. It makes reasoning about the scheduling simpler. ```{r cleanup, echo=FALSE, message=FALSE, warning=FALSE} unlink("pipelines", recursive = TRUE) ```