The {bbw} package was developed primarily as a tool for analysing complex sample survey data. It was developed specifically for use with the Rapid Assessment Method (RAM) and the Simple Spatial Survey Method (S3M).
The indicatorsHH dataset is a survey dataset collected from a RAM survey in the Bakool, Bay, and Middle Shabelle regions of Somalia. The villageData dataset contains the list of villages/clusters that were sampled in the survey that collected the indicatorsHH dataset. Together, these datasets are a good basis for demonstrating the use of the {bbw} package to perform blocked weighted bootstrap estimation.
bootBW()
The bootBW() function is the original bootstrap resampling function of the package. It can be used as follows:
boot_df <- bootBW(
  x = indicatorsHH, w = villageData, statistic = bootClassic,
  params = c("anc1", "anc2")
)
#> ✔ x has the appropriate/expected data structure
This call to bootBW() takes the survey dataset indicatorsHH as its first argument (x). This dataset is expected to have a variable labelled psu, which identifies the primary sampling unit from which data was collected during the survey, and additional variables for the indicators to be estimated. The second argument (w) is the dataset listing the primary sampling units that were sampled to collect the survey data specified in x. This dataset, which in this case is villageData, should have at least a variable labelled psu, which identifies the primary sampling unit and matches the same variable in the survey dataset, and a variable labelled pop for the population size of the primary sampling unit. The statistic argument specifies the type of statistic to apply to the bootstrap replicates. Two such functions are available from the {bbw} package: bootClassic() and bootPROBIT(). For this example, the bootClassic() function is used to get the mean value of the bootstrap replicates. This is generally useful for binomial indicators and for continuous variables for which a mean is wanted. The params argument takes the names of the indicators in x to be estimated. In this example, two indicator names for antenatal care are specified. Finally, the replicates argument specifies the number of bootstrap replicates to be performed. The default of 400 replicates is used here. This results in the following (showing the first 10 rows):
head(boot_df, 10)
#> anc1 anc2
#> 1 0.1864175 0.01874714
#> 2 0.2290978 0.02035985
#> 3 0.2343529 0.02641509
#> 4 0.2548555 0.03084955
#> 5 0.2698864 0.02662863
#> 6 0.2151356 0.01819052
#> 7 0.1937834 0.02677702
#> 8 0.2148349 0.01678766
#> 9 0.2593480 0.02891566
#> 10 0.2151414 0.01922153
The result is a data.frame of bootstrap replicates with the number of rows equal to the number of replicates and the number of columns equal to the number of params specified. Hence, boot_df has 400 rows and 2 columns.
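Before calling bootBW(), it can help to confirm that x and w have the structure described earlier. The following is a minimal sketch in base R using invented toy data; the helper check_bbw_inputs() is hypothetical and is not part of {bbw}.

```r
## Toy survey data: one row per respondent, with the psu identifier and
## two indicator columns (values are invented for illustration)
x_toy <- data.frame(
  psu  = c(1, 1, 2, 2, 3, 3),
  anc1 = c(1, 0, 1, 1, 0, 0),
  anc2 = c(0, 0, 1, 0, 0, 0)
)

## Toy list of sampled clusters: psu identifier and population size
w_toy <- data.frame(
  psu = c(1, 2, 3),
  pop = c(250, 410, 180)
)

## Hypothetical helper (not a {bbw} function): TRUE only if the columns
## described in the text are present and every psu in x is listed in w
check_bbw_inputs <- function(x, w, params) {
  all(c("psu", params) %in% names(x)) &&
    all(c("psu", "pop") %in% names(w)) &&
    all(x$psu %in% w$psu)
}

check_bbw_inputs(x_toy, w_toy, params = c("anc1", "anc2"))
```

A check like this only covers column names and matching identifiers; it says nothing about whether the population sizes themselves are sensible.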
Using boot_df, which contains bootstrap replicates of the indicators anc1 and anc2, each indicator can be estimated with a 95% confidence interval using the percentile bootstrap method. This can be done simply using the quantile() function from the stats package as follows:
est_df <- lapply(
  X = boot_df,
  FUN = quantile,
  probs = c(0.5, 0.025, 0.975)
) |>
  do.call(rbind, args = _)
The quantile() function is used to get the 50th percentile (for the estimate) and the 2.5th and 97.5th percentiles of the bootstrap replicates (for the lower and upper confidence limits, respectively) of the indicator estimate. This gives the following results:
est_df
#> 50% 2.5% 97.5%
#> anc1 0.2316597 0.17709920 0.28849265
#> anc2 0.0218962 0.01347537 0.03484835
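The percentile method can also be tried without the survey data. The sketch below uses simulated numbers as a stand-in for bootstrap replicates (they are not survey results), and the column names est, lcl and ucl are chosen here for illustration.

```r
## Simulated stand-in for bootstrap replicates (not real survey results):
## 400 rows, one column per indicator
set.seed(1)
reps_toy <- data.frame(
  anc1 = rnorm(400, mean = 0.23, sd = 0.03),
  anc2 = rnorm(400, mean = 0.02, sd = 0.005)
)

## Median as the point estimate; 2.5th and 97.5th percentiles as the
## lower and upper 95% confidence limits
est_toy <- do.call(
  rbind,
  lapply(reps_toy, quantile, probs = c(0.5, 0.025, 0.975))
)
colnames(est_toy) <- c("est", "lcl", "ucl")
est_toy
```

Because the limits are read directly from the replicates, the interval need not be symmetric around the estimate.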
Note that the indicatorsHH dataset has geographical stratification. Specifically, the survey from which this data was collected was designed to be representative of three regions in Somalia, with the regions identified through the region variable in indicatorsHH. Because of this, the more appropriate bootstrap resampling approach is to resample within each region. Doing this with the original bootBW() function requires restructuring the survey dataset by region and then passing the region-stratified datasets individually to the bootBW() function. This may look something like this:
## Split indicators by region ----
indicators_by_region <- split(indicatorsHH, f = indicatorsHH$region)
## Split psus by region ----
psus_by_region <- split(villageData, f = villageData$region)
## Bootstrap ----
boot_df <- Map(
  f = bootBW,
  x = indicators_by_region,
  w = psus_by_region,
  statistic = rep(list(get("bootClassic")), length(indicators_by_region)),
  params = rep(list(c("anc1", "anc2")), length(indicators_by_region))
)
#> ✔ x has the appropriate/expected data structure
#> ✔ x has the appropriate/expected data structure
#> ✔ x has the appropriate/expected data structure
The bootBW() function only accepts single data.frame inputs for the x and w arguments. Hence, to resample data from within each region, the datasets have to be split into separate data.frame inputs per region and bootBW() applied to each separately. In the example above, this is done by collecting each of the inputs to bootBW() into a list and then using the Map() function to pass them to bootBW() one region at a time. This produces a list containing the data.frame of bootstrap replicates for each region (shown below):
class(boot_df)
#> [1] "list"
head(boot_df$Bay, 10)
#> anc1 anc2
#> 1 0.4043419 0.013568521
#> 2 0.3907104 0.020491803
#> 3 0.3224044 0.023224044
#> 4 0.2645862 0.016282225
#> 5 0.2708618 0.008207934
#> 6 0.3297151 0.024423338
#> 7 0.3627717 0.004076087
#> 8 0.3662551 0.016460905
#> 9 0.3410641 0.016371078
#> 10 0.2277628 0.014824798
head(boot_df$Bakool, 10)
#> anc1 anc2
#> 1 0.2916667 0.17415730
#> 2 0.2928177 0.09497207
#> 3 0.3260274 0.14804469
#> 4 0.2747253 0.11864407
#> 5 0.2900552 0.11797753
#> 6 0.1823204 0.05849582
#> 7 0.4065934 0.16343490
#> 8 0.2727273 0.11731844
#> 9 0.2821918 0.06944444
#> 10 0.2939560 0.09749304
head(boot_df$`Middle Shabelle`, 10)
#> anc1 anc2
#> 1 0.1723447 0.011055276
#> 2 0.2550607 0.018367347
#> 3 0.1330724 0.010816126
#> 4 0.2830189 0.024551464
#> 5 0.1921569 0.014792899
#> 6 0.2217782 0.010989011
#> 7 0.2117647 0.007881773
#> 8 0.2165156 0.019172553
#> 9 0.2195122 0.015625000
#> 10 0.2274549 0.015075377
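The split-then-Map() pattern used here can be tried on toy data. In the sketch below, all values are invented and toy_stat() is a hypothetical stand-in for a function that takes x and w; it is not bootBW() and does no resampling.

```r
## Toy data with a stratification column (values invented)
x_toy <- data.frame(
  region = rep(c("A", "B"), each = 4),
  psu    = c(1, 1, 2, 2, 3, 3, 4, 4),
  anc1   = c(1, 0, 1, 1, 0, 0, 1, 0)
)
w_toy <- data.frame(
  region = c("A", "A", "B", "B"),
  psu    = 1:4,
  pop    = c(250, 410, 180, 320)
)

## Hypothetical stand-in for a function taking x and w:
## population-weighted mean of the per-psu means of one indicator
toy_stat <- function(x, w, param) {
  psu_means <- tapply(x[[param]], x$psu, mean)
  weighted.mean(psu_means, w$pop[match(names(psu_means), w$psu)])
}

## split() returns lists named by region in the same (sorted) order, so
## Map() pairs each region's survey data with its own cluster list
x_by_region <- split(x_toy, f = x_toy$region)
w_by_region <- split(w_toy, f = w_toy$region)
res <- Map(
  f = toy_stat, x = x_by_region, w = w_by_region,
  MoreArgs = list(param = "anc1")
)
res
```

Arguments that are identical for every region can be supplied once through MoreArgs rather than repeated with rep(list(...)); both approaches give Map() the same inputs.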
To estimate the per-region results from this bootstrap resampling, the following can be implemented:
est_df <- lapply(
  X = boot_df,
  FUN = function(x) lapply(
    x, FUN = quantile, probs = c(0.5, 0.025, 0.975)
  ) |>
    do.call(rbind, args = _)
)
est_df <- data.frame(
  ## Repeat each region name once per indicator row so labels stay aligned
  region = rep(names(est_df), times = vapply(est_df, nrow, integer(1))),
  indicators = lapply(est_df, FUN = row.names) |> unlist(),
  do.call(rbind, args = est_df)
)
row.names(est_df) <- NULL
which results in the following output:
est_df
#> region indicators X50. X2.5. X97.5.
#> 1 Bakool anc1 0.30261405 0.188862799 0.41167127
#> 2 Bakool anc2 0.11251780 0.050903865 0.19504237
#> 3 Bay anc1 0.32391543 0.217411669 0.43781930
#> 4 Bay anc2 0.01893172 0.002766156 0.03663750
#> 5 Middle Shabelle anc1 0.20220114 0.134820317 0.27676772
#> 6 Middle Shabelle anc2 0.01724140 0.007237952 0.03006251
From this demonstration, the bootBW() function proves to be straightforward to implement and can be easily incorporated into a user’s workflow based on their dataset and their analytic needs. However, as shown above, this flexibility requires considerably more code from the user to get from resampling to indicator estimates.
Starting from v0.3.0, an alternative set of functions is available to perform blocked weighted bootstrap resampling that facilitates all the steps from resampling to estimation. Below is an example of how to use this alternative set of functions for the same tasks shown above.
This set of functions attempts to make the blocked weighted bootstrap algorithm more efficient through vectorisation and the use of parallelisation techniques. The function syntax has been kept consistent with bootBW() for ease of transition.
boot_bw()
The boot_bw() function is the alternative bootstrap resampling function of the package. It can be used as follows:
boot_df <- boot_bw(
  x = indicatorsHH, w = villageData, statistic = bootClassic,
  params = c("anc1", "anc2")
)
This call to boot_bw() takes the survey dataset indicatorsHH as its first argument (x). This dataset is expected to have a variable labelled psu, which identifies the primary sampling unit from which data was collected during the survey, and additional variables for the indicators to be estimated. The second argument (w) is the dataset listing the primary sampling units that were sampled to collect the survey data specified in x. This dataset, which in this case is villageData, should have at least a variable labelled psu, which identifies the primary sampling unit and matches the same variable in the survey dataset, and a variable labelled pop for the population size of the primary sampling unit. The statistic argument specifies the type of statistic to apply to the bootstrap replicates. Two such functions are available from the {bbw} package: bootClassic() and bootPROBIT(). For this example, the bootClassic() function is used to get the mean value of the bootstrap replicates. This is generally useful for binomial indicators and for continuous variables for which a mean is wanted. The params argument takes the names of the indicators in x to be estimated. In this example, two indicator names for antenatal care are specified. Finally, the replicates argument specifies the number of bootstrap replicates to be performed. The default of 400 replicates is used here. As can be noted, boot_bw() takes the same type of arguments as bootBW() and the syntax is exactly the same. Hence, using this alternative function will be familiar to those who have experience with the original function.
However, the output of the boot_bw() function is structured differently from that of the bootBW() function. The boot_bw() function produces an object of class boot_bw.
class(boot_df)
#> [1] "boot_bw"
A boot_bw object is a list with four named components: params for the values specified for the params argument, replicates for the number of bootstrap replicates performed, strata for the values specified for stratification, and boot_data for the bootstrap results.
names(boot_df)
#> [1] "params" "replicates" "strata" "boot_data"
The boot_data component of the boot_bw object corresponds to the output of the bootBW() function.
Other than the difference in the structure of the output, this alternative function also has three additional arguments for the new features it provides.
strata - the variable name in x that provides information on the stratification in the survey data. This is set to NULL by default, signifying no stratification. This argument allows the user to perform stratified bootstrap resampling conveniently through the boot_bw() function.
parallel - whether or not to use parallel computation for the bootstrap resampling. This is set to FALSE by default, in which case bootstrap resampling is done sequentially, as with the bootBW() function. If set to TRUE, the function sets up parallel computing and utilises the machine's available cores (see the cores argument below).
cores - the number of cores to use for parallel computation. This is only evaluated if parallel = TRUE. By default, this is set to one less than the total number of cores available on the current machine.
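To see what a default of one less than the total would be on a given machine, the parallel package that ships with R can be queried. This is a sketch of one way to compute such a value and is an assumption about the approach, not {bbw}'s own code.

```r
## Total cores reported for this machine; detectCores() can return NA
n_total <- parallel::detectCores()

## One less than the total, but never fewer than one core
n_cores <- if (is.na(n_total)) 1L else max(1L, n_total - 1L)
n_cores
```

Leaving one core free is a common convention so that the machine stays responsive while the resampling runs.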
To use these new features and functionality, the call to boot_bw() would look something like this:
boot_df <- boot_bw(
  x = indicatorsHH, w = villageData, statistic = bootClassic,
  params = c("anc1", "anc2"), strata = "region", parallel = TRUE
)
This produces a list object of class boot_bw with the same components as above. The only difference is that the boot_data component is a list (instead of a data.frame), with each component being the data.frame of bootstrap resampling output for one of the strata in the dataset.
class(boot_df)
#> [1] "boot_bw"
class(boot_df$boot_data)
#> [1] "list"
names(boot_df$boot_data)
#> [1] "Bakool" "Bay" "Middle Shabelle"
The boot_bw_estimate() function can then be applied to the output of the boot_bw() function to get the indicator estimates with 95% confidence intervals.
boot_bw_estimate(boot_df)
#> region indicator est lcl ucl
#> 1 Bakool anc1 0.43888889 0.38881944 0.48888889
#> 2 Bakool anc2 0.38055556 0.32497749 0.43062500
#> 3 Bay anc1 0.71619066 0.63887512 0.77849135
#> 4 Bay anc2 0.00254615 0.00000000 0.01294677
#> 5 Middle Shabelle anc1 0.20757542 0.14514451 0.28293531
#> 6 Middle Shabelle anc2 0.05065259 0.03133757 0.07453108
#> se
#> 1 0.027718319
#> 2 0.027983726
#> 3 0.036466569
#> 4 0.003743969
#> 5 0.036375151
#> 6 0.011463590
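To make the shape of this output concrete, the sketch below builds a mock list with the four named components described earlier and summarises it per stratum. Everything here is an assumption for illustration: the replicate values are simulated, summarise_mock() is a hypothetical helper rather than boot_bw_estimate() itself, and taking the median as the estimate and the standard deviation of the replicates as the standard error is one plausible choice, not a statement of what the package does.

```r
## Mock object mimicking the four named components described above;
## the replicate values are simulated, not real survey results
set.seed(2)
mock_boot <- list(
  params = c("anc1", "anc2"),
  replicates = 400,
  strata = "region",
  boot_data = list(
    A = data.frame(anc1 = runif(400, 0.2, 0.4), anc2 = runif(400, 0, 0.1)),
    B = data.frame(anc1 = runif(400, 0.5, 0.8), anc2 = runif(400, 0, 0.05))
  )
)

## Hypothetical summariser (not boot_bw_estimate() itself)
summarise_mock <- function(obj) {
  per_stratum <- Map(
    f = function(df, stratum) {
      ## 3 x n matrix: rows are the 50th, 2.5th and 97.5th percentiles
      q <- vapply(df, quantile, numeric(3), probs = c(0.5, 0.025, 0.975))
      data.frame(
        strata = stratum,
        indicator = names(df),
        est = q[1, ], lcl = q[2, ], ucl = q[3, ],
        se = vapply(df, sd, numeric(1)),
        row.names = NULL
      )
    },
    obj$boot_data, names(obj$boot_data)
  )
  out <- do.call(rbind, per_stratum)
  row.names(out) <- NULL
  out
}

est_mock <- summarise_mock(mock_boot)
est_mock
```

Building the stratum label inside the per-stratum function keeps each label attached to its own rows, so nothing depends on recycling when the pieces are stacked.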
These two functions can be piped together for a single workflow from bootstrap resampling to estimation.
boot_bw(
  x = indicatorsHH, w = villageData, statistic = bootClassic,
  params = c("anc1", "anc2"), strata = "region", parallel = TRUE
) |>
  boot_bw_estimate()
#> region indicator est lcl ucl
#> 1 Bakool anc1 0.438888889 0.3805556 0.49444444
#> 2 Bakool anc2 0.376731302 0.3138889 0.43888889
#> 3 Bay anc1 0.719130072 0.6487833 0.78255787
#> 4 Bay anc2 0.002534854 0.0000000 0.01262706
#> 5 Middle Shabelle anc1 0.203423968 0.1428536 0.27819673
#> 6 Middle Shabelle anc2 0.051256281 0.0339071 0.07622767
#> se
#> 1 0.030425611
#> 2 0.030033802
#> 3 0.034273078
#> 4 0.003372086
#> 5 0.033966913
#> 6 0.010573679