Advanced Scheduling

This vignette covers more advanced concepts and examples related to scheduling. At its core, maestro adheres to two key principles:

  1. Stateless: It does not need to be continuously running - it can be run in a serverless architecture
  2. Use of rounded scheduling: The timeliness of pipeline executions depends on how often you run your orchestrator

Here, we’ll explain these concepts more deeply in the context of maestro and give some concrete examples.

Stateless Execution

Maestro takes a unique approach to scheduling compared to other orchestration tools. Whereas most schedulers involve a continuously running program to monitor the time and execute jobs when the current time is right, maestro is designed to run intermittently. It also doesn’t need to save or cache data between executions - in other words, it’s stateless.

This design has several benefits; namely, you can run it in a serverless way which saves on compute resources. However, to achieve this it takes some shortcuts which may mean that precise timeliness is lost. This will become clearer in our examples.

Rounded Scheduling

The timeliness of a pipeline is measured by how close the actual execution time is to the scheduled execution time. Maestro is only as timely as it needs to be relative to the unit of time you are interested in. This is the concept of rounded scheduling. If you run your orchestrator once daily, then the timeliness of the pipelines will be within the nearest day - maestro doesn’t care that you specified a pipeline to run at exactly 09:21:20 each day. If you run it every 10 minutes, then the timeliness of the pipelines will be within the nearest 10 minute interval.

Let’s look at some examples:

First, we’ll consider one pipeline that is scheduled to run daily at 09:20:00 and we’ll configure the orchestrator to run daily.

# ./pipelines/daily_example.R
#' daily_example maestro pipeline
#'
#' @maestroFrequency 1 day
#' @maestroStartTime 2024-06-20 09:20:00
daily_example <- function() {

  # Pipeline code
}

For demonstration purposes, I’ll manually set the check time to be 08:00:00 UTC (this is the time maestro will use to compare against the scheduled time). In practice, you almost always want this to be the system time using either Sys.time() or lubridate::now().

# ./orchestrator.R

library(maestro)

schedule <- build_schedule()
## ℹ 1 script successfully parsed
status <- run_schedule(
  schedule,
  orch_frequency = "1 day",
  check_datetime = as.POSIXct("2024-06-20 08:00:00", tz = "UTC")
)
## 
## ── Running pipelines ▶ 
## ℹ ./pipelines/daily_example.R daily_example
## ✔ ./pipelines/daily_example.R daily_example [23ms]
## 
## 
## ── Pipeline execution completed ■ | 0.031 sec elapsed 
## ✔ 1 success | → 0 skipped | ! 0 warnings | ✖ 0 errors | ◼ 1 total
## ────────────────────────────────────────────────────────────────────────────────
## 
## ── Next scheduled pipelines ❯ 
## Pipe name | Next scheduled run
## • daily_example | 2024-06-20

We can see that the pipeline executed even though the current time was not 09:20:00. This is because we set the orchestrator to run daily, so maestro considers any check time within the same day to be close enough.

Let’s see what happens if we up the frequency of the orchestrator:

# ./orchestrator.R
status <- run_schedule(
  schedule,
  orch_frequency = "15 minutes",
  check_datetime = as.POSIXct("2024-06-20 08:00:00", tz = "UTC")
)
## 
## ── Running pipelines ▶
## → ./pipelines/daily_example.R daily_example
## 
## ── Pipeline execution completed ■ | 0.002 sec elapsed
## ✔ 0 successes | → 1 skipped | ! 0 warnings | ✖ 0 errors | ◼ 1 total
## ────────────────────────────────────────────────────────────────────────────────
## 
## ── Next scheduled pipelines ❯
## Pipe name | Next scheduled run
## • daily_example | 2024-06-20 09:15:00

It was skipped because the check time wasn’t within 15 minutes of the scheduled time, but the output tells us that it will next run at 2024-06-20 09:15:00.
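One way to picture both outcomes is to round the scheduled time and the check time down to the orchestrator interval and compare the results. The sketch below is only an illustration of that idea in base R; the helpers floor_to_interval() and is_due_sketch() are hypothetical names, rounding down is a simplifying assumption, and none of this is maestro’s internal implementation.

```r
# Illustrative sketch only: NOT maestro's internal logic.
# Assumption: times are rounded DOWN to the orchestrator interval.

# Round a POSIXct time down to a multiple of `interval_secs` (UTC).
floor_to_interval <- function(x, interval_secs) {
  as.POSIXct(
    floor(as.numeric(x) / interval_secs) * interval_secs,
    origin = "1970-01-01", tz = "UTC"
  )
}

# A pipeline is treated as "due" when both times fall in the same interval.
is_due_sketch <- function(scheduled, check, interval_secs) {
  floor_to_interval(scheduled, interval_secs) ==
    floor_to_interval(check, interval_secs)
}

scheduled <- as.POSIXct("2024-06-20 09:20:00", tz = "UTC")
check     <- as.POSIXct("2024-06-20 08:00:00", tz = "UTC")

due_daily <- is_due_sketch(scheduled, check, 24 * 60 * 60) # same day
due_15min <- is_due_sketch(scheduled, check, 15 * 60)      # different 15-minute slots
```

Under these assumptions due_daily is TRUE and due_15min is FALSE, which matches the run and the skip shown above.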

The takeaway message is that the timeliness of the pipeline depends on how frequently the orchestrator runs. Remember that when you declare orch_frequency = "15 minutes", that is essentially a contract stating that you will run it every 15 minutes - maestro does not do this for you. If you run the orchestrator more or less frequently than you said you would, unexpected things will happen. Specifically, if you run it more frequently than stated, your pipelines will run more often than expected. Likewise, if you run it less frequently than stated, your pipelines won’t run as often as expected.
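To make the contract concrete, here is a small base R simulation. It assumes the same simplified rule as a mental model (a pipeline runs whenever the check time and the scheduled time round down to the same declared interval); count_runs() and the check times are hypothetical, and maestro’s real behaviour may differ in detail.

```r
# Illustrative simulation only: NOT maestro's internal logic.
# Assumption: a pipeline runs when check and scheduled times round down
# to the same slot of the DECLARED orchestrator frequency.
floor_secs <- function(x, interval_secs) {
  floor(as.numeric(x) / interval_secs) * interval_secs
}

# Count how many orchestrator checks would trigger the pipeline.
count_runs <- function(scheduled, checks, declared_secs) {
  sum(floor_secs(checks, declared_secs) == floor_secs(scheduled, declared_secs))
}

scheduled     <- as.POSIXct("2024-06-20 09:00:00", tz = "UTC")
declared_secs <- 15 * 60 # orch_frequency = "15 minutes"
start         <- as.POSIXct("2024-06-20 08:00:00", tz = "UTC")

# Actual cadences of the external scheduler (e.g., cron).
as_promised <- seq(start, by = "15 min", length.out = 8)       # 08:00 ... 09:45
too_often   <- seq(start, by = "5 min", length.out = 24)       # 08:00 ... 09:55
too_rarely  <- seq(start + 20 * 60, by = "30 min", length.out = 4) # 08:20 ... 09:50

runs_as_promised <- count_runs(scheduled, as_promised, declared_secs)
runs_too_often   <- count_runs(scheduled, too_often, declared_secs)
runs_too_rarely  <- count_runs(scheduled, too_rarely, declared_secs)
```

In this sketch the pipeline runs once when the contract is honoured, three times when the orchestrator is invoked every 5 minutes, and not at all when it is invoked every 30 minutes at off-slot times.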

How often should I schedule my orchestrator?

If you have a single pipeline or even multiple pipelines that all run at the same time this is an easy question to answer. In practice (and in our experience using maestro in production) you have multiple pipelines that run at different intervals and different times. Maybe some run hourly, some run daily, and others run monthly.

Example 1

Let’s say we have three pipelines with the following frequencies and start times:

##    name frequency          start_time
## 1 pipe1    1 hour 2024-06-18 12:30:00
## 2 pipe2    2 days 2024-06-18 06:00:00
## 3 pipe3  4 months 2024-06-20 00:00:00

A good starting point is to schedule the orchestrator as often as the highest frequency pipeline in the project - so every 1 hour in the above example. If you run it at *:30:00 each hour, pipe1 will execute almost exactly at its scheduled time and the other pipelines will be executed 30 minutes early. If you’re comfortable with this margin of error then it’s no big deal, but if not, an orchestrator frequency of 30 minutes will ensure all pipelines run exactly as scheduled.
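The 30 minute gap in Example 1 can be checked with a little arithmetic. The base R sketch below measures how far each start time sits from the preceding orchestrator tick; the anchor time (a first tick at 00:30:00) and the helper offset_from_tick() are assumptions made for illustration.

```r
# Pipelines from Example 1.
pipelines <- data.frame(
  name = c("pipe1", "pipe2", "pipe3"),
  start_time = as.POSIXct(
    c("2024-06-18 12:30:00", "2024-06-18 06:00:00", "2024-06-20 00:00:00"),
    tz = "UTC"
  )
)

# Hypothetical first orchestrator tick: half past the hour.
anchor <- as.POSIXct("2024-06-18 00:30:00", tz = "UTC")

# Minutes between each start time and the preceding orchestrator tick.
offset_from_tick <- function(start_time, interval_secs, anchor) {
  ((as.numeric(start_time) - as.numeric(anchor)) %% interval_secs) / 60
}

offset_hourly      <- offset_from_tick(pipelines$start_time, 60 * 60, anchor)
offset_half_hourly <- offset_from_tick(pipelines$start_time, 30 * 60, anchor)
```

With an hourly orchestrator the offsets are 0, 30, and 30 minutes; with a 30 minute orchestrator every start time lands exactly on a tick.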

Let’s see another example:

Example 2

##    name frequency          start_time
## 1 pipe4    1 hour 2024-06-18 00:00:00
## 2 pipe5    1 hour 2024-06-18 00:10:00
## 3 pipe6    1 hour 2024-06-18 00:20:00

All three pipelines are hourly but they start on different 10-minute intervals. If we run the orchestrator every 1 hour they’ll all execute at the same time. If it is important that they execute at different times, we should set it to 10 minutes.

So a good heuristic is to run the orchestrator as often as the smallest time difference between the scheduled times of any two pipelines. This works well so long as we don’t run it so often that the pipelines can’t complete before the next execution time. We don’t recommend running the orchestrator more frequently than every 5 minutes unless you’re confident that your pipelines are fast to execute. [1]
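For Example 2, this heuristic can be worked out by hand or with a few lines of base R. The sketch assumes all pipelines share the same frequency (1 hour here), so only the position of each start time within the hour matters; it is a back-of-the-envelope check, not a maestro function.

```r
# Pipelines from Example 2: all hourly, staggered start times.
start_times <- as.POSIXct(
  c("2024-06-18 00:00:00", "2024-06-18 00:10:00", "2024-06-18 00:20:00"),
  tz = "UTC"
)
freq_secs <- 60 * 60 # shared pipeline frequency: 1 hour

# Position of each start time within the hour, in seconds.
offsets <- sort(unique(as.numeric(start_times) %% freq_secs))

# Smallest gap between neighbouring scheduled times, in minutes.
# Assumption: the wrap-around gap to the next hour is ignored.
smallest_gap_mins <- min(diff(offsets)) / 60
```

Here smallest_gap_mins is 10, which agrees with the 10 minute orchestrator frequency chosen above.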

Maestro has a function called suggest_orch_frequency() that provides a reasonable estimate of the ideal orchestrator frequency. It uses the heuristic of halving the most frequent interval (e.g., if your highest frequency pipeline is 1 hour, it suggests 30 minutes), down to a minimum of 15 minutes, or your most frequent interval if that is less than 15 minutes. This is a bit more conservative but protects against running things too often. If timeliness and high frequency are really important to you, maestro might not be your best choice.
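The rule described above can be restated as a few lines of base R. This is only a paraphrase of the prose, written under the assumption that frequencies are expressed in minutes; suggest_frequency_sketch() is a hypothetical helper and is not the source code of suggest_orch_frequency().

```r
# Hypothetical restatement of the heuristic described in the text.
# `freq_mins`: pipeline frequencies in minutes.
suggest_frequency_sketch <- function(freq_mins) {
  shortest <- min(freq_mins)
  # If the most frequent pipeline already runs every 15 minutes or less,
  # use that interval as-is.
  if (shortest <= 15) {
    return(shortest)
  }
  # Otherwise halve it, but never go below 15 minutes.
  max(shortest / 2, 15)
}

hourly_and_daily <- suggest_frequency_sketch(c(60, 24 * 60)) # halved: 30
every_20_minutes <- suggest_frequency_sketch(20)             # floored at 15
every_10_minutes <- suggest_frequency_sketch(c(10, 60))      # kept at 10
```

In a real project you would call suggest_orch_frequency() itself rather than a helper like this one.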

Final Remarks

When it comes time to deploy your project, make sure that whatever you use to actually run it (e.g., cron, TaskScheduler, Google Cloud Scheduler, etc.) is indeed running it at the same frequency as you declared in the orchestrator. It’s best to stick to whole units of time rather than fractional units - if my orchestrator runs every 15 minutes, I like to have it run at minutes 00, 15, 30, and 45 of each hour. It makes reasoning about the scheduling simpler.


  1. Make sure you consider the amount of time it takes to execute build_schedule() and that you account for the additional work done in run_schedule() not related to your actual pipeline logic.