rredlist benchmarks

William Gearty

2024-12-10

Introduction

rredlist provides two APIs, a higher-level one that takes slightly more time but returns the data in a more user-friendly format (a list), and a lower-level one (i.e., functions that end with “_“) that takes less time but does no processing of the data (returning the raw JSON string). Both APIs return the exact same information, but it is up to the user whether the format processing is worth the extra time, especially when performing bulk operations. To help inform this decision by the user, here is some benchmarking related to the two APIs. First, we’ll break down the total difference in computation time between the two APIs, then we’ll dig into what components are causing this difference. We’ll use microbenchmark::microbenchmark() which has very little computational overhead. Note that the time units vary from comparison to comparison, and the speed of these functions may be highly hardware- and network-dependent.

library(rredlist)
library(microbenchmark)

Head-to-head benchmarks

We’ll start by benchmarking the two APIs head-to-head. We’ll test a couple of use cases, in rough order of increasing complexity.

1. Get species count

microbenchmark(
  rl_sp_count(),
  rl_sp_count_(),
  times = 10
)
#> Unit: milliseconds
#>            expr      min       lq     mean   median       uq      max neval cld
#>   rl_sp_count() 112.1753 112.9675 116.0770 114.6603 119.2189 126.5307    10   a
#>  rl_sp_count_() 111.1262 112.3976 116.1929 115.4099 116.0395 127.9950    10   a

2. Lookup individual assessment

microbenchmark(
  rl_assessment(136250858),
  rl_assessment_(136250858),
  times = 10
)
#> Unit: milliseconds
#>                       expr      min       lq     mean   median       uq      max neval cld
#>   rl_assessment(136250858) 257.1335 266.4728 286.9152 277.9695 313.7179 322.6544    10   a
#>  rl_assessment_(136250858) 252.2234 260.1433 311.6918 287.6650 323.1625 478.7026    10   a

3. Taxonomic lookup with defaults

microbenchmark(
  rl_family(),
  rl_family_(),
  times = 10
)
#> Unit: milliseconds
#>          expr      min       lq     mean   median       uq      max neval cld
#>   rl_family() 118.1405 119.3698 123.6975 120.1616 123.0535 144.8783    10   a
#>  rl_family_() 116.6910 118.6265 156.5042 119.3431 132.2629 428.2468    10   a

4. Taxonomic lookup with query (one page of results)

microbenchmark(
  rl_family("Rheidae"),
  rl_family_("Rheidae"),
  times = 10
)
#> Unit: milliseconds
#>                   expr      min       lq     mean   median       uq      max neval cld
#>   rl_family("Rheidae") 616.2717 617.7706 626.2586 621.4997 637.6830 647.0362    10   a
#>  rl_family_("Rheidae") 610.7493 618.8672 633.7496 627.0082 650.9436 677.3837    10   a

5. Taxonomic lookup with query (~10 pages of results)

microbenchmark(
  rl_family("Corvidae", quiet = TRUE),
  rl_family_("Corvidae", quiet = TRUE),
  times = 10
)
#> Unit: seconds
#>                                  expr      min       lq      mean    median       uq      max neval cld
#>   rl_family("Corvidae", quiet = TRUE) 9.913948 9.984358 10.109995 10.153347 10.17414 10.23669    10  a 
#>  rl_family_("Corvidae", quiet = TRUE) 9.866713 9.895834  9.985805  9.986027 10.06351 10.07924    10   b

6. Taxonomic lookup with query (~40 pages of results)

microbenchmark(
  rl_family("Tyrannidae", quiet = TRUE),
  rl_family_("Tyrannidae", quiet = TRUE),
  times = 10
)
#> Unit: seconds
#>                                    expr      min       lq     mean   median       uq      max neval cld
#>   rl_family("Tyrannidae", quiet = TRUE) 35.19196 35.23844 35.50863 35.28687 35.38780 37.11036    10  a 
#>  rl_family_("Tyrannidae", quiet = TRUE) 34.76442 34.91699 35.00961 35.02371 35.08263 35.22149    10   b

7. Taxonomic lookup with query (~900 pages of results)

microbenchmark(
  rl_class("Aves", quiet = TRUE),
  rl_class_("Aves", quiet = TRUE),
  times = 10
)
#> Unit: seconds
#>                             expr      min       lq     mean   median       uq      max neval cld
#>   rl_class("Aves", quiet = TRUE) 1137.053 1138.816 1142.828 1140.627 1145.490 1157.649    10  a 
#>  rl_class_("Aves", quiet = TRUE) 1127.992 1131.732 1132.963 1133.211 1135.561 1136.631    10   b

And the winner is…

As you can see above, the two APIs take roughly the same amount of time for most use cases. I previously said that the low-level API is designed to be faster. While most of these comparisons agree with that statement, the time reduction is usually a few milliseconds per function call. When we get into more complex queries, like returning multiple pages of API results, we start to see larger time reductions, especially as the number of pages of results increases (10+ seconds for hundreds of pages).

Query breakdown

Based on the above, it doesn’t seem to matter much, time-wise, whether we parse the data or not. So then what takes up all of the query time? Let’s break down the process of querying the API and downloading a single page of assessments using some of the internal functions of rredlist:

microbenchmark(
  res <- rredlist:::rr_GET_raw("taxa/family/Rheidae"), # get the raw data for the first page
  x <- res$parse("UTF-8"), # parse the raw response data to JSON
  rredlist:::rl_parse(x, parse = TRUE), # parse the JSON to a list of dataframes
  rredlist:::rl_parse(x, parse = FALSE), # parse the JSON to a list of lists
  times = 10
)
#> Unit: microseconds
#>                                                 expr        min         lq        mean      median
#>  res <- rredlist:::rr_GET_raw("taxa/family/Rheidae") 612396.800 616270.001 629656.6911 618445.5010
#>                              x <- res$parse("UTF-8")    770.001    875.801   1043.2110    898.9515
#>                 rredlist:::rl_parse(x, parse = TRUE)   1030.701   1110.602   1604.1812   1256.4015
#>                rredlist:::rl_parse(x, parse = FALSE)     59.501     62.102     81.9312     85.3010
#>          uq        max neval cld
#>  644299.401 658205.701    10  a 
#>     930.101   2283.101    10   b
#>    2094.601   2946.201    10   b
#>      94.601    113.401    10   b

The above benchmarking shows us that the vast majority of time is spent downloading data from the IUCN API. For a single page of results, even the highest level of parsing takes only 0.15% of the time it takes to download the data. Further, while parsing to a list of dataframes (parse = TRUE) takes about 10 times as long as just parsing to a list of lists (parse = FALSE), both methods remain very quick compared to the process of downloading the data.

Now let’s break down a multi-page query:

microbenchmark(
  lst <- rredlist:::page_assessments("taxa/family/Tyrannidae",
                                     key = rredlist:::check_key(NULL),
                                     quiet = TRUE), # get the data for all of the pages
  rredlist:::combine_assessments(lst, parse = TRUE), # parse the JSON to a list of dataframes
  rredlist:::combine_assessments(lst, parse = FALSE), # parse the JSON to a list of lists
  times = 10
)
#> Unit: milliseconds
#>                                                                                                               expr
#>  lst <- rredlist:::page_assessments("taxa/family/Tyrannidae",      key = rredlist:::check_key(NULL), quiet = TRUE)
#>                                                                  rredlist:::combine_assessments(lst, parse = TRUE)
#>                                                                 rredlist:::combine_assessments(lst, parse = FALSE)
#>         min         lq        mean     median         uq        max neval cld
#>  34726.9180 35026.3228 35239.94843 35172.9793 35400.4013 35931.0798    10 a  
#>    238.9440   243.5736   249.50199   248.0357   253.9973   263.0525    10  b 
#>     12.4059    12.8977    15.36693    13.4167    14.0818    27.1965    10   c

Again, even with about 40 pages of data to parse, the download takes the vast majority of the time. The highest-level parsing has increased to about 1% of the time it takes to download the data, but this remains less than a second compared to the ~35 second download.

Conclusion

Ultimately, both APIs take about the same amount of time because the majority of time is spent downloading the data from the IUCN database and reading it into R. For larger downloads, the parsing done by the high-level API may take an appreciable amount of time (tenths of seconds to seconds), It’s possible that users who are calling these functions many (e.g., thousands) of times would appreciate this time reduction. However, for most users it probably won’t matter. Furthermore, keep in mind that if you use the low-level API you will likely need to do your own processing after the fact in order to do any sort of downstream analyses. Ultimately, the choice is up to you.