In this vignette, we demonstrate neural Bayes estimation with gridded data using convolutional neural networks (CNNs). These architectures require completely-observed gridded data; in this vignette, we also show how one can use CNNs with incomplete data using two techniques.
Before proceeding, we load the required packages (see here
for instructions on installing Julia and the Julia packages
NeuralEstimators
and Flux
):
library("NeuralEstimators")
library("JuliaConnectoR")
library("ggplot2")
juliaEval('using NeuralEstimators, Flux')
#> Starting Julia ...
We first consider the case that the data are collected completely over a regular grid.
We develop a neural Bayes estimator for a spatial Gaussian process model, where \boldsymbol{Z} \equiv (Z_{1}, \dots, Z_{n})' are data collected at locations \{\boldsymbol{s}_{1}, \dots, \boldsymbol{s}_{n}\} in a spatial domain that is a subset of \mathbb{R}^2. The data are assumed to be spatially-correlated mean-zero Gaussian random variables with exponential covariance function, \textrm{cov}(Z_i, Z_j) = \textrm{exp}(-\|\boldsymbol{s}_i - \boldsymbol{s}_j\|/\theta), with unknown range parameter \theta > 0. Here, we take the spatial domain be the unit square, we simulate data on a grid with 16^2 = 256 possible observation locations, and we adopt a uniform prior, \theta \sim \rm{Unif}(0, 0.4).
To begin, we define a function for sampling from the prior
distribution and a function for marginal simulation from the statistical
model. Note that our simulation code can be made faster (e.g., using
parallel versions of lapply
, utilising GPUs, etc.), but we
provide a relatively simple implementation for pedagogical purposes:
# Sampling from the prior distribution
# K: number of samples to draw from the prior
prior <- function(K) {
theta <- 0.4 * runif(K) # draw from the prior distribution
theta <- t(theta) # reshape to matrix
return(theta)
}
# Marginal simulation from the statistical model
# theta: a matrix of parameters drawn from the prior
# m: number of conditionally independent replicates for each parameter vector
simulate <- function(theta, m = 1) {
# Spatial locations, a grid over the unit square, and spatial distance matrix
N <- 16 # number of locations along each dimension of the grid
S <- expand.grid(seq(1, 0, len = N), seq(0, 1, len = N))
D <- as.matrix(dist(S))
# Simulate conditionally independent replicates for each parameter vector
Z <- lapply(1:ncol(theta), function(k) {
Sigma <- exp(-D/theta[k]) # covariance matrix
L <- t(chol(Sigma)) # lower Cholesky factor of Sigma
n <- nrow(L) # number of observation locations
mm <- if (length(m) == 1) m else sample(m, 1) # allow for variable sample sizes
z <- matrix(rnorm(n*mm), nrow = n, ncol = mm) # standard normal variates
Z <- L %*% z # conditionally independent replicates from the model
Z <- array(Z, dim = c(N, N, 1, mm)) # reshape to multidimensional array
Z
})
return(Z)
}
A note on data format: When working with the package
NeuralEstimators
, the following general rules regarding
assumed data format apply:
When using CNNs, each data set must be stored as a multi-dimensional array. The penultimate dimension stores the so-called channels (this dimension is singleton for univariate processes, two for bivariate processes, etc.), while the final dimension stores conditionally independent replicates. For example, to store 50 conditionally independent replicates of a bivariate spatial process measured over a 10\times15 grid, one would construct an array of dimension 10\times15\times2\times50.
Now, let’s visualise a few realisations from our spatial Gaussian process model:
Next, we design our neural-network architecture.
The package NeuralEstimators
is centred on the DeepSets
framework (Zaheer et
al., 2017), which allows for making inference with an arbitrary
number of independent replicates and for incorporating both neural
(learned) and user-defined summary statistics. Specifically, the neural
Bayes estimator is taken to be of the form,
\hat{\boldsymbol{\theta}}(\boldsymbol{Z}_1,\dots,\boldsymbol{Z}_m) =
\boldsymbol{\phi}\bigg\{\frac{1}{m}
\sum_{i=1}^m\boldsymbol{\psi}(\boldsymbol{Z}_i)\bigg\},
where \boldsymbol{Z}_1,\dots,\boldsymbol{Z}_m
are replicates of \boldsymbol{Z},
\boldsymbol{\psi}(\cdot) is a
conventional neural network whose architecture depends on the
multivariate structure of the data (e.g., a CNN for gridded data), and
\boldsymbol{\phi}(\cdot) is
(always) a multilayer perceptron (MLP). See Sainsbury-Dale et
al. (2024) for further details on the use of DeepSets in the context
of neural Bayes estimation and a discussion on the framework’s
connection to conventional estimators. See also Dumoulin and Visin (2016)
for a detailed description of CNNs and their construction, which is
beyond the scope of this vignette:
estimator <- juliaEval('
# Summary network
psi = Chain(
Conv((3, 3), 1 => 32, relu), # 3x3 convolutional filter, 1 input channel to 32 output channels, ReLU activation
MaxPool((2, 2)), # 2x2 max pooling layer for dimension reduction
Conv((3, 3), 32 => 64, relu), # 3x3 convolutional filter, 32 input channels to 64 output channels, ReLU activation
MaxPool((2, 2)), # 2x2 max pooling layer for dimension reduction
Flux.flatten # flatten the output to feed into a fully connected layer
)
# Inference network
phi = Chain(
Dense(256, 64, relu), # fully connected layer, 256 input neurons to 64 output neurons, ReLU activation
Dense(64, 1) # fully connected layer, 64 input neurons to 1 output neuron
)
# Construct DeepSet object and initialise a point estimator
deepset = DeepSet(psi, phi)
estimator = PointEstimator(deepset)
')
Next, we train the estimator using parameter vectors sampled from the prior distribution, and data simulated from the statistical model conditional on these parameter vectors:
# Construct training and validation sets
K <- 25000 # size of the training set
theta_train <- prior(K) # parameter vectors used in stochastic-gradient descent during training
theta_val <- prior(K/10) # parameter vectors used to monitor performance during training
Z_train <- simulate(theta_train) # data used in stochastic-gradient descent during training
Z_val <- simulate(theta_val) # data used to monitor performance during training
# Train the estimator
estimator <- train(
estimator,
theta_train = theta_train,
theta_val = theta_val,
Z_train = Z_train,
Z_val = Z_val
)
#> Computing the initial validation risk... Initial validation risk = 0.17335139
#> Computing the initial training risk... Initial training risk = 0.17565463
#> Epoch: 1 Training risk: 0.07 Validation risk: 0.05 Run time of epoch: 29.999 seconds
#> Epoch: 2 Training risk: 0.04 Validation risk: 0.037 Run time of epoch: 6.462 seconds
#> Epoch: 3 Training risk: 0.03 Validation risk: 0.029 Run time of epoch: 6.491 seconds
#> Epoch: 4 Training risk: 0.027 Validation risk: 0.027 Run time of epoch: 6.512 seconds
#> Epoch: 5 Training risk: 0.024 Validation risk: 0.025 Run time of epoch: 6.525 seconds
#> Epoch: 6 Training risk: 0.023 Validation risk: 0.024 Run time of epoch: 6.649 seconds
#> Epoch: 7 Training risk: 0.022 Validation risk: 0.024 Run time of epoch: 6.67 seconds
#> Epoch: 8 Training risk: 0.021 Validation risk: 0.023 Run time of epoch: 6.948 seconds
#> Epoch: 9 Training risk: 0.021 Validation risk: 0.023 Run time of epoch: 6.936 seconds
#> Epoch: 10 Training risk: 0.021 Validation risk: 0.023 Run time of epoch: 6.553 seconds
#> Epoch: 11 Training risk: 0.02 Validation risk: 0.023 Run time of epoch: 6.504 seconds
#> Epoch: 12 Training risk: 0.02 Validation risk: 0.022 Run time of epoch: 6.515 seconds
#> Epoch: 13 Training risk: 0.019 Validation risk: 0.022 Run time of epoch: 6.513 seconds
#> Epoch: 14 Training risk: 0.019 Validation risk: 0.022 Run time of epoch: 6.513 seconds
#> Epoch: 15 Training risk: 0.019 Validation risk: 0.021 Run time of epoch: 6.522 seconds
#> Epoch: 16 Training risk: 0.019 Validation risk: 0.021 Run time of epoch: 6.506 seconds
#> Epoch: 17 Training risk: 0.019 Validation risk: 0.022 Run time of epoch: 6.496 seconds
#> Epoch: 18 Training risk: 0.018 Validation risk: 0.022 Run time of epoch: 6.527 seconds
#> Epoch: 19 Training risk: 0.018 Validation risk: 0.021 Run time of epoch: 6.509 seconds
#> Epoch: 20 Training risk: 0.018 Validation risk: 0.021 Run time of epoch: 6.512 seconds
#> Epoch: 21 Training risk: 0.018 Validation risk: 0.021 Run time of epoch: 6.502 seconds
#> Epoch: 22 Training risk: 0.017 Validation risk: 0.021 Run time of epoch: 6.506 seconds
#> Epoch: 23 Training risk: 0.017 Validation risk: 0.021 Run time of epoch: 6.605 seconds
#> Epoch: 24 Training risk: 0.017 Validation risk: 0.021 Run time of epoch: 6.545 seconds
#> Epoch: 25 Training risk: 0.017 Validation risk: 0.021 Run time of epoch: 6.537 seconds
#> Epoch: 26 Training risk: 0.017 Validation risk: 0.021 Run time of epoch: 6.535 seconds
#> Epoch: 27 Training risk: 0.017 Validation risk: 0.021 Run time of epoch: 6.54 seconds
#> Stopping early since the validation loss has not improved in 5 epochs
It is good practice to assess the performance of a trained estimator using out-of-sample test data:
# Test the estimator using out-of-sample data
theta <- prior(1000)
Z <- simulate(theta)
assessment <- assess(estimator, theta, Z, estimator_names = "NBE")
#> Running estimator NBE...
plotestimates(assessment)
Once the neural Bayes estimator has been trained and assessed, it can
be applied to observed data collected completely over a 16\times 16 grid using
estimate()
. Below, we use simulated data as a surrogate for
observed data:
theta <- matrix(0.1) # true parameter
Z <- simulate(theta) # pretend that this is observed data
estimate(estimator, Z) # point estimates from the "observed" data
#> [,1]
#> [1,] 0.1062393
#> attr(,"JLDIM")
#> [1] 1 1
In practice, data are often incomplete and, in the next section, we describe methods that facilitate neural Bayes estimation in these settings.
We now consider the case where the data are collected over a regular grid, but where some elements of the grid are unobserved. This situation often arises in, for example, remote-sensing applications, where the presence of cloud cover prevents measurement in some places.
For instance, our data may look like:
We here consider two techniques that facilitate neural Bayes estimation with incomplete data: the “masking approach” of Wang et al. (2024) and the “neural EM algorithm”.
The first technique that we consider is the so-called masking approach of Wang et al. (2024). The strategy involves completing the data by replacing missing values with zeros, and using auxiliary variables to encode the missingness pattern, which are also passed into the network.
Let \boldsymbol{Z} denote the complete-data vector. Then, the masking approach considers inference based on \boldsymbol{W}, a vector of indicator variables that encode the missingness pattern (with elements equal to one or zero if the corresponding element of \boldsymbol{Z} is observed or missing, respectively), and \boldsymbol{U} \equiv \boldsymbol{Z} \odot \boldsymbol{W}, where \odot denotes elementwise multiplication and the product of a missing element and zero is defined to be zero. Irrespective of the missingness pattern, \boldsymbol{U} and \boldsymbol{W} have the same fixed dimensions and hence may be processed easily using a single neural network. A neural point estimator is then trained on realisations of \{\boldsymbol{U}, \boldsymbol{W}\} which, by construction, do not contain any missing elements.
Since the missingness pattern \boldsymbol{W} is now an input to the neural network, it must be incorporated during the training phase. When interest lies only in making inference from a single already-observed data set, \boldsymbol{W} is fixed and known. However, amortised inference, whereby one trains a single neural network that will be used to make inference with many data sets, requires a model for the missingness pattern \boldsymbol{W}.
Below, we define a helper function that removes a specified proportion of the data completely at random, and which serves to define our missingness model when using the masking approach in this example:
# Replaces a proportion of elements with NA
removedata <- function(Z, proportion = runif(1, 0.2, 0.8)) {
# Ensure proportion is between 0 and 1
if (proportion < 0 || proportion > 1) stop("Proportion must be between 0 and 1")
# Randomly sample indices to replace
n <- length(Z)
n_na <- round(proportion * n)
na_indices <- sample(1:n, n_na)
# Replace selected elements with NA
Z[na_indices] <- NA
return(Z)
}
To make the general strategy concrete, consider the encoded data set \{\boldsymbol{U}, \boldsymbol{W}\} constructed from the following incomplete data \boldsymbol{Z}_1:
Next, we construct and train a masked neural Bayes estimator. Here, the first convolutional layer takes two input channels, since we store the encoded data \boldsymbol{U} in the first channel and the missingness pattern \boldsymbol{W} in the second. Otherwise, the architecture remains as given above:
masked_estimator <- juliaEval('
# Summary network
psi = Chain(
Conv((3, 3), 2 => 32, relu),
MaxPool((2, 2)),
Conv((3, 3), 32 => 64, relu),
MaxPool((2, 2)),
Flux.flatten
)
# Inference network
phi = Chain(Dense(256, 64, relu), Dense(64, 1))
deepset = DeepSet(psi, phi)
')
Next, we generate incomplete data for training and validation using
removedata()
, and construct the corresponding encoded data
sets \{\boldsymbol{U},
\boldsymbol{W}\} using encodedata()
:
Training and assessment of the neural Bayes estimator then proceeds as usual:
# Train the estimator
masked_estimator <- train(
masked_estimator,
theta_train = theta_train,
theta_val = theta_val,
Z_train = UW_train,
Z_val = UW_val
)
#> Computing the initial validation risk... Initial validation risk = 0.2802006
#> Computing the initial training risk... Initial training risk = 0.27887243
#> Epoch: 1 Training risk: 0.086 Validation risk: 0.067 Run time of epoch: 7.147 seconds
#> Epoch: 2 Training risk: 0.06 Validation risk: 0.06 Run time of epoch: 6.799 seconds
#> Epoch: 3 Training risk: 0.053 Validation risk: 0.053 Run time of epoch: 6.764 seconds
#> Epoch: 4 Training risk: 0.048 Validation risk: 0.051 Run time of epoch: 6.752 seconds
#> Epoch: 5 Training risk: 0.045 Validation risk: 0.048 Run time of epoch: 6.738 seconds
#> Epoch: 6 Training risk: 0.043 Validation risk: 0.045 Run time of epoch: 6.741 seconds
#> Epoch: 7 Training risk: 0.042 Validation risk: 0.045 Run time of epoch: 6.727 seconds
#> Epoch: 8 Training risk: 0.04 Validation risk: 0.043 Run time of epoch: 6.748 seconds
#> Epoch: 9 Training risk: 0.039 Validation risk: 0.043 Run time of epoch: 6.751 seconds
#> Epoch: 10 Training risk: 0.038 Validation risk: 0.043 Run time of epoch: 6.754 seconds
#> Epoch: 11 Training risk: 0.038 Validation risk: 0.043 Run time of epoch: 6.727 seconds
#> Epoch: 12 Training risk: 0.037 Validation risk: 0.042 Run time of epoch: 6.758 seconds
#> Epoch: 13 Training risk: 0.036 Validation risk: 0.042 Run time of epoch: 6.864 seconds
#> Epoch: 14 Training risk: 0.035 Validation risk: 0.041 Run time of epoch: 6.754 seconds
#> Epoch: 15 Training risk: 0.035 Validation risk: 0.041 Run time of epoch: 6.829 seconds
#> Epoch: 16 Training risk: 0.034 Validation risk: 0.04 Run time of epoch: 6.744 seconds
#> Epoch: 17 Training risk: 0.033 Validation risk: 0.04 Run time of epoch: 6.864 seconds
#> Epoch: 18 Training risk: 0.033 Validation risk: 0.04 Run time of epoch: 6.741 seconds
#> Epoch: 19 Training risk: 0.033 Validation risk: 0.04 Run time of epoch: 6.788 seconds
#> Epoch: 20 Training risk: 0.032 Validation risk: 0.04 Run time of epoch: 6.794 seconds
#> Epoch: 21 Training risk: 0.032 Validation risk: 0.041 Run time of epoch: 6.721 seconds
#> Epoch: 22 Training risk: 0.031 Validation risk: 0.039 Run time of epoch: 6.742 seconds
#> Epoch: 23 Training risk: 0.031 Validation risk: 0.039 Run time of epoch: 6.748 seconds
#> Epoch: 24 Training risk: 0.031 Validation risk: 0.04 Run time of epoch: 6.784 seconds
#> Epoch: 25 Training risk: 0.03 Validation risk: 0.04 Run time of epoch: 6.762 seconds
#> Epoch: 26 Training risk: 0.03 Validation risk: 0.039 Run time of epoch: 6.735 seconds
#> Epoch: 27 Training risk: 0.029 Validation risk: 0.039 Run time of epoch: 6.718 seconds
#> Epoch: 28 Training risk: 0.029 Validation risk: 0.039 Run time of epoch: 6.746 seconds
#> Epoch: 29 Training risk: 0.029 Validation risk: 0.039 Run time of epoch: 6.735 seconds
#> Epoch: 30 Training risk: 0.028 Validation risk: 0.039 Run time of epoch: 6.734 seconds
#> Epoch: 31 Training risk: 0.028 Validation risk: 0.039 Run time of epoch: 6.759 seconds
#> Epoch: 32 Training risk: 0.028 Validation risk: 0.04 Run time of epoch: 6.76 seconds
#> Epoch: 33 Training risk: 0.027 Validation risk: 0.04 Run time of epoch: 6.749 seconds
#> Epoch: 34 Training risk: 0.027 Validation risk: 0.041 Run time of epoch: 6.763 seconds
#> Epoch: 35 Training risk: 0.027 Validation risk: 0.039 Run time of epoch: 6.734 seconds
#> Stopping early since the validation loss has not improved in 5 epochs
# Test the estimator with many data sets, each with a missingness proportion of 25%
theta <- prior(1000)
Z <- simulate(theta)
Z1 <- lapply(Z, removedata, 0.25)
UW <- lapply(Z1, encodedata)
assessment <- assess(masked_estimator, theta, UW, estimator_names = "Masked NBE")
#> Running estimator Masked NBE...
plotestimates(assessment)
Note that the variance of the estimates is larger when compared to the estimates obtained from complete data; this is to be expected, since missingness reduces the available information for making inference on \boldsymbol{\theta}.
Once trained and assessed, we can apply our masked neural Bayes estimator to (incomplete) observed data. The data must be encoded in the same manner that was done during training. Below, we use simulated data as a surrogate for real data, with a missingness proportion of 0.25:
Let \boldsymbol{Z}_1 and \boldsymbol{Z}_2 denote the observed and unobserved (i.e., missing) data, respectively, and let \boldsymbol{Z} \equiv (\boldsymbol{Z}_1', \boldsymbol{Z}_2')' denote the complete data. A classical approach to facilitating inference when data are missing is the expectation-maximisation (EM) algorithm (Dempster et al., 1977).
The neural EM algorithm is an approximate version of the conventional (Bayesian) Monte Carlo EM algorithm (Wei and Tanner, 1990) which, at the lth iteration, updates the parameter vector through \boldsymbol{\theta}^{(l)} = \textrm{argmax}_{\boldsymbol{\theta}} \sum_{h = 1}^H \ell(\boldsymbol{\theta}; \boldsymbol{Z}_1, \boldsymbol{Z}_2^{(lh)}) + \log \pi_H(\boldsymbol{\theta}), where realisations of the missing-data component, \{\boldsymbol{Z}_2^{(lh)} : h = 1, \dots, H\}, are sampled from the probability distribution of \boldsymbol{Z}_2 given \boldsymbol{Z}_1 and \boldsymbol{\theta}^{(l-1)}, and where \pi_H(\boldsymbol{\theta}) \propto \{\pi(\boldsymbol{\theta})\}^H is a concentrated version of the chosen prior. Note that when \pi(\boldsymbol{\theta}) is uniform, as is the case in our working example, the distribution implied by \pi_H(\cdot) is the same as that implied by \pi(\cdot).
Given the conditionally simulated data, the neural EM algorithm performs the above EM update using a neural network that returns the MAP estimate from the conditionally simulated data. Such a neural network, which we refer to as a neural MAP estimator, can be obtained by training a neural Bayes estimator under a continuous relaxation of the 0–1 loss function, for example, the loss function, L_{\kappa}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) = \rm{tanh}(\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|/\kappa), \quad \kappa > 0, which yields the 0–1 loss function in the limit as \kappa \to 0.
An advantage of using the neural EM algorithm is that the neural-network architecture does not need to be altered compared with that used in the complete-data case, and we can therefore use our complete-data estimator trained earlier as the starting point of our neural MAP estimator; this is known as pretraining, and it can substantially reduce the computational cost of training.
Below, we train a neural MAP estimator, employing the 0–1 surrogate loss function given above with \kappa = 0.1:
# Train the estimator under the tanh loss, a surrogate for the 0-1 loss
estimator <- train(
estimator,
theta_train = theta_train,
theta_val = theta_val,
Z_train = Z_train,
Z_val = Z_val,
loss = tanhloss(0.1)
)
#> Computing the initial validation risk... Initial validation risk = 0.1976900831136299
#> Computing the initial training risk... Initial training risk = 0.1612901621867654
#> Epoch: 1 Training risk: 0.169 Validation risk: 0.206 Run time of epoch: 7.36 seconds
#> Epoch: 2 Training risk: 0.167 Validation risk: 0.199 Run time of epoch: 6.357 seconds
#> Epoch: 3 Training risk: 0.165 Validation risk: 0.201 Run time of epoch: 6.369 seconds
#> Epoch: 4 Training risk: 0.164 Validation risk: 0.205 Run time of epoch: 6.372 seconds
#> Epoch: 5 Training risk: 0.161 Validation risk: 0.21 Run time of epoch: 6.345 seconds
#> Epoch: 6 Training risk: 0.161 Validation risk: 0.201 Run time of epoch: 6.386 seconds
#> Stopping early since the validation loss has not improved in 5 epochs
An advantage of the neural EM algorithm is that the training of the neural network is exactly the same as the complete-data case, so the methods for assessing the estimator that we describe in Section 1 can be applied directly here.
Next, we define a function for conditional simulation. For the current model, this simply involves sampling from a conditional multivariate Gaussian distribution (see, e.g., here):
simulateconditional <- function(Z, theta, nsims = 1){
# Coerce theta to scalar if given as a matrix
theta <- theta[1]
# Save the original dimensions
dims <- dim(Z)
N <- nrow(Z)
# Convert to vector
Z <- c(Z)
# Indices of observed and missing elements
I_1 <- which(!is.na(Z))
I_2 <- which(is.na(Z))
n_1 <- length(I_1)
n_2 <- length(I_2)
# Extract the observed data
Z_1 <- Z[I_1]
# Spatial locations and distance matrices
S <- expand.grid(seq(1, 0, len = N), seq(0, 1, len = N))
D <- as.matrix(dist(S))
D_22 <- D[I_2, I_2]
D_11 <- D[I_1, I_1]
D_12 <- D[I_1, I_2]
# Marginal covariance matrices
Sigma_22 <- exp(-D_22 / theta)
Sigma_11 <- exp(-D_11 / theta)
Sigma_12 <- exp(-D_12 / theta)
# Conditional covariance matrix, cov(Z_2 | Z_1, theta), its Cholesky factor,
# and the conditional mean, E(Z_2 | Z_1, theta)
L_11 <- t(chol(Sigma_11))
x <- solve(L_11, Sigma_12)
y <- solve(L_11, Z_1)
Sigma <- Sigma_22 - crossprod(x)
L <- t(chol(Sigma))
mu <- c(crossprod(x, y))
# Simulate from the conditional distribution Z_2 | Z_1, theta ~ Gau(mu, Sigma)
z <- matrix(rnorm(n_2 * nsims), nrow = n_2, ncol = nsims)
Z_2 <- mu + L %*% z
# Combine Z_1 with conditionally-simulated replicates of Z_2
Z <- sapply(1:nsims,function(l){
z <-rep(NA, n_1 + n_2)
z[I_1]<- Z_1
z[I_2]<- Z_2[, l]
z
})
# Convert Z to an array with appropriate dimensions
Z <- array(Z, dim=c(dims,1, nsims))
return(Z)
}
Let’s visualise a few conditional simulations given incomplete data \boldsymbol{Z}_1. Below, the left panel shows the incomplete data \boldsymbol{Z}_1, and the remaining panels show conditional simulations given \boldsymbol{Z}_1 and the true parameter \theta = 0.2:
The final step is to define a function that implements the Monte Carlo EM algorithm. This involves the specification of an initial estimate \boldsymbol{\theta}^{(0)}, the maximum number of iterations, and a convergence criterion:
# Monte Carlo EM algorithm
EM <- function(Z1, # incomplete data
estimator, # (neural) MAP estimator
theta_0, # initial estimate
niterations = 100, # maximum number of iterations
tolerance = 0.01, # convergence tolerance
nconsecutive = 3, # number of consecutive iterations for which the convergence criterion must be met
nsims = 1, # Monte Carlo sample size
verbose = TRUE, # print current estimate to console if TRUE
return_iterates = FALSE # return all iterates if TRUE
) {
if(verbose) print(paste("Initial estimate:", theta_0))
theta_l <- theta_0 # initial estimate
convergence_counter <- 0 # initialise counter for consecutive convergence
# Initialize a matrix to store all iterates as columns
p <- length(theta_0)
theta_all <- matrix(NA, nrow = p, ncol = niterations + 1)
theta_all[, 1] <- theta_0
for (l in 1:niterations) {
# complete the data by conditional simulation
Z <- simulateconditional(drop(Z1), theta_l, nsims = nsims)
# compute the MAP estimate from the conditionally sampled replicates
if ("JuliaProxy" %in% class(estimator)) {
theta_l_plus_1 <- c(estimate(estimator, Z)) # neural MAP estimator
} else {
theta_l_plus_1 <- estimator(Z, theta_l) # analytic MAP estimator
}
# check convergence criterion
if (max(abs(theta_l_plus_1 - theta_l) / abs(theta_l)) < tolerance) {
# increment counter if condition is met
convergence_counter <- convergence_counter + 1
# check if convergence criterion has been met for required number of iterations
if (convergence_counter == nconsecutive) {
if(verbose) message("The EM algorithm has converged")
theta_all[, l + 1] <- theta_l_plus_1 # store the final iterate
break
}
} else {
# reset counter if condition is not met
convergence_counter <- 0
}
theta_l <- theta_l_plus_1
theta_all[, l + 1] <- theta_l # store the iterate
if(verbose) print(paste0("Iteration ", l, ": ", theta_l))
}
# Remove unused columns if convergence occurred before max iterations
theta_all <- theta_all[, 1:(l + 1), drop = FALSE]
# Return all iterates if return_iterates is TRUE, otherwise return the last iterate
if (return_iterates) {
return(theta_all)
} else {
return(theta_all[, ncol(theta_all)])
}
}
We are now ready to apply the neural EM algorithm with incomplete data. Here, we use the same incomplete data \boldsymbol{Z}_1 simulated conditionally on \theta = 0.2 at the end of the preceding subsection:
theta_0 <- 0.1
EM(Z1, estimator, theta_0, nsims = 100)
#> [1] "Initial estimate: 0.1"
#> [1] "Iteration 1: 0.166097149252892"
#> [1] "Iteration 2: 0.198796659708023"
#> [1] "Iteration 3: 0.214800849556923"
#> [1] "Iteration 4: 0.220833465456963"
#> [1] "Iteration 5: 0.217493936419487"
#> [1] "Iteration 6: 0.219034016132355"
#> [1] "Iteration 7: 0.219332501292229"
#> The EM algorithm has converged
#> [1] 0.2213956
Visualise the Monte Carlo variability with different Monte Carlo sample sizes:
all_H <- c(1, 10, 100)
dfs <- list()
for (H in all_H) {
estimates <- c(EM(Z1, estimator, theta_0, nsims = H, return_iterates = TRUE, verbose = FALSE, tolerance = 0.0001))
df <- data.frame(
iteration = 1:length(estimates),
estimate = estimates,
H = H
)
dfs <- c(dfs, list(df))
}
df <- do.call(rbind, dfs)
ggplot(df) +
geom_line(aes(x = iteration, y = estimate)) +
facet_wrap(~ H, labeller = labeller(H = function(labels) paste0("H = ", labels)), nrow = 1) +
theme_bw()
We have considered two approaches that facilitate neural Bayes estimation with incomplete data.
+ Does not require conditional simulation, and is therefore broadly applicable.
+ Can be used with all loss functions that are amenable to gradient-based training.
- Needs the neural network to take an additional input (the missingness pattern).
- More complicated learning task.
- Requires a model for the missingness mechanism.
+ Neural-network architecture is the same as that used in the complete-data case.
+ Simpler learning task (mapping from the complete data to the parameter space).
+ Does not require a model for the missingness mechanism.
- Requires conditional simulation, which places restrictions on the applicable class of models.
- Limited to providing MAP estimates.