Processing math: 0%

NeuralEstimators: Incomplete gridded data

Matthew Sainsbury-Dale, Andrew Zammit-Mangion, and Raphaël Huser

In this vignette, we demonstrate neural Bayes estimation with gridded data using convolutional neural networks (CNNs). These architectures require completely-observed gridded data; in this vignette, we also show how one can use CNNs with incomplete data using two techniques.

Before proceeding, we load the required packages (see here for instructions on installing Julia and the Julia packages NeuralEstimators and Flux):

library("NeuralEstimators")
library("JuliaConnectoR")
library("ggplot2")
juliaEval('using NeuralEstimators, Flux')
#> Starting Julia ...

1 Complete data

We first consider the case that the data are collected completely over a regular grid.

We develop a neural Bayes estimator for a spatial Gaussian process model, where \boldsymbol{Z} \equiv (Z_{1}, \dots, Z_{n})' are data collected at locations \{\boldsymbol{s}_{1}, \dots, \boldsymbol{s}_{n}\} in a spatial domain that is a subset of \mathbb{R}^2. The data are assumed to be spatially-correlated mean-zero Gaussian random variables with exponential covariance function, \textrm{cov}(Z_i, Z_j) = \textrm{exp}(-\|\boldsymbol{s}_i - \boldsymbol{s}_j\|/\theta), with unknown range parameter \theta > 0. Here, we take the spatial domain be the unit square, we simulate data on a grid with 16^2 = 256 possible observation locations, and we adopt a uniform prior, \theta \sim \rm{Unif}(0, 0.4).

To begin, we define a function for sampling from the prior distribution and a function for marginal simulation from the statistical model. Note that our simulation code can be made faster (e.g., using parallel versions of lapply, utilising GPUs, etc.), but we provide a relatively simple implementation for pedagogical purposes:

# Sampling from the prior distribution
# K: number of samples to draw from the prior
prior <- function(K) { 
  theta <- 0.4 * runif(K) # draw from the prior distribution
  theta <- t(theta)       # reshape to matrix
  return(theta)
}

# Marginal simulation from the statistical model
# theta: a matrix of parameters drawn from the prior
# m: number of conditionally independent replicates for each parameter vector
simulate <- function(theta, m = 1) { 
  
  # Spatial locations, a grid over the unit square, and spatial distance matrix
  N <- 16 # number of locations along each dimension of the grid
  S <- expand.grid(seq(1, 0, len = N), seq(0, 1, len = N))
  D <- as.matrix(dist(S))         
  
  # Simulate conditionally independent replicates for each parameter vector
  Z <- lapply(1:ncol(theta), function(k) {
    Sigma <- exp(-D/theta[k])  # covariance matrix
    L <- t(chol(Sigma))        # lower Cholesky factor of Sigma
    n <- nrow(L)               # number of observation locations
    mm <- if (length(m) == 1) m else sample(m, 1) # allow for variable sample sizes
    z <- matrix(rnorm(n*mm), nrow = n, ncol = mm) # standard normal variates
    Z <- L %*% z               # conditionally independent replicates from the model
    Z <- array(Z, dim = c(N, N, 1, mm)) # reshape to multidimensional array
    Z
  })
  
  return(Z)
}

A note on data format: When working with the package NeuralEstimators, the following general rules regarding assumed data format apply:

When using CNNs, each data set must be stored as a multi-dimensional array. The penultimate dimension stores the so-called channels (this dimension is singleton for univariate processes, two for bivariate processes, etc.), while the final dimension stores conditionally independent replicates. For example, to store 50 conditionally independent replicates of a bivariate spatial process measured over a 10\times15 grid, one would construct an array of dimension 10\times15\times2\times50.

Now, let’s visualise a few realisations from our spatial Gaussian process model:

Next, we design our neural-network architecture.

The package NeuralEstimators is centred on the DeepSets framework (Zaheer et al., 2017), which allows for making inference with an arbitrary number of independent replicates and for incorporating both neural (learned) and user-defined summary statistics. Specifically, the neural Bayes estimator is taken to be of the form, \hat{\boldsymbol{\theta}}(\boldsymbol{Z}_1,\dots,\boldsymbol{Z}_m) = \boldsymbol{\phi}\bigg\{\frac{1}{m} \sum_{i=1}^m\boldsymbol{\psi}(\boldsymbol{Z}_i)\bigg\}, where \boldsymbol{Z}_1,\dots,\boldsymbol{Z}_m are replicates of \boldsymbol{Z}, \boldsymbol{\psi}(\cdot) is a conventional neural network whose architecture depends on the multivariate structure of the data (e.g., a CNN for gridded data), and \boldsymbol{\phi}(\cdot) is (always) a multilayer perceptron (MLP). See Sainsbury-Dale et al. (2024) for further details on the use of DeepSets in the context of neural Bayes estimation and a discussion on the framework’s connection to conventional estimators. See also Dumoulin and Visin (2016) for a detailed description of CNNs and their construction, which is beyond the scope of this vignette:

estimator <- juliaEval('
  # Summary network
  psi = Chain(
      Conv((3, 3), 1 => 32, relu),   # 3x3 convolutional filter, 1 input channel to 32 output channels, ReLU activation
      MaxPool((2, 2)),               # 2x2 max pooling layer for dimension reduction
      Conv((3, 3), 32 => 64, relu),  # 3x3 convolutional filter, 32 input channels to 64 output channels, ReLU activation
      MaxPool((2, 2)),               # 2x2 max pooling layer for dimension reduction
      Flux.flatten                   # flatten the output to feed into a fully connected layer
  )
  
  # Inference network
  phi = Chain(
      Dense(256, 64, relu),          # fully connected layer, 256 input neurons to 64 output neurons, ReLU activation
      Dense(64, 1)                   # fully connected layer, 64 input neurons to 1 output neuron
  )
  
  # Construct DeepSet object and initialise a point estimator
  deepset = DeepSet(psi, phi)
  estimator = PointEstimator(deepset)
')

Next, we train the estimator using parameter vectors sampled from the prior distribution, and data simulated from the statistical model conditional on these parameter vectors:

# Construct training and validation sets 
K <- 25000                       # size of the training set 
theta_train <- prior(K)          # parameter vectors used in stochastic-gradient descent during training
theta_val   <- prior(K/10)       # parameter vectors used to monitor performance during training
Z_train <- simulate(theta_train) # data used in stochastic-gradient descent during training
Z_val   <- simulate(theta_val)   # data used to monitor performance during training

# Train the estimator 
estimator <- train(
  estimator,
  theta_train = theta_train,
  theta_val   = theta_val,
  Z_train = Z_train,
  Z_val   = Z_val
)
#> Computing the initial validation risk... Initial validation risk = 0.17335139
#> Computing the initial training risk... Initial training risk = 0.17565463
#> Epoch: 1  Training risk: 0.07  Validation risk: 0.05  Run time of epoch: 29.999 seconds
#> Epoch: 2  Training risk: 0.04  Validation risk: 0.037  Run time of epoch: 6.462 seconds
#> Epoch: 3  Training risk: 0.03  Validation risk: 0.029  Run time of epoch: 6.491 seconds
#> Epoch: 4  Training risk: 0.027  Validation risk: 0.027  Run time of epoch: 6.512 seconds
#> Epoch: 5  Training risk: 0.024  Validation risk: 0.025  Run time of epoch: 6.525 seconds
#> Epoch: 6  Training risk: 0.023  Validation risk: 0.024  Run time of epoch: 6.649 seconds
#> Epoch: 7  Training risk: 0.022  Validation risk: 0.024  Run time of epoch: 6.67 seconds
#> Epoch: 8  Training risk: 0.021  Validation risk: 0.023  Run time of epoch: 6.948 seconds
#> Epoch: 9  Training risk: 0.021  Validation risk: 0.023  Run time of epoch: 6.936 seconds
#> Epoch: 10  Training risk: 0.021  Validation risk: 0.023  Run time of epoch: 6.553 seconds
#> Epoch: 11  Training risk: 0.02  Validation risk: 0.023  Run time of epoch: 6.504 seconds
#> Epoch: 12  Training risk: 0.02  Validation risk: 0.022  Run time of epoch: 6.515 seconds
#> Epoch: 13  Training risk: 0.019  Validation risk: 0.022  Run time of epoch: 6.513 seconds
#> Epoch: 14  Training risk: 0.019  Validation risk: 0.022  Run time of epoch: 6.513 seconds
#> Epoch: 15  Training risk: 0.019  Validation risk: 0.021  Run time of epoch: 6.522 seconds
#> Epoch: 16  Training risk: 0.019  Validation risk: 0.021  Run time of epoch: 6.506 seconds
#> Epoch: 17  Training risk: 0.019  Validation risk: 0.022  Run time of epoch: 6.496 seconds
#> Epoch: 18  Training risk: 0.018  Validation risk: 0.022  Run time of epoch: 6.527 seconds
#> Epoch: 19  Training risk: 0.018  Validation risk: 0.021  Run time of epoch: 6.509 seconds
#> Epoch: 20  Training risk: 0.018  Validation risk: 0.021  Run time of epoch: 6.512 seconds
#> Epoch: 21  Training risk: 0.018  Validation risk: 0.021  Run time of epoch: 6.502 seconds
#> Epoch: 22  Training risk: 0.017  Validation risk: 0.021  Run time of epoch: 6.506 seconds
#> Epoch: 23  Training risk: 0.017  Validation risk: 0.021  Run time of epoch: 6.605 seconds
#> Epoch: 24  Training risk: 0.017  Validation risk: 0.021  Run time of epoch: 6.545 seconds
#> Epoch: 25  Training risk: 0.017  Validation risk: 0.021  Run time of epoch: 6.537 seconds
#> Epoch: 26  Training risk: 0.017  Validation risk: 0.021  Run time of epoch: 6.535 seconds
#> Epoch: 27  Training risk: 0.017  Validation risk: 0.021  Run time of epoch: 6.54 seconds
#> Stopping early since the validation loss has not improved in 5 epochs

It is good practice to assess the performance of a trained estimator using out-of-sample test data:

# Test the estimator using out-of-sample data
theta <- prior(1000)
Z <- simulate(theta)
assessment <- assess(estimator, theta, Z, estimator_names = "NBE")
#>  Running estimator NBE...
plotestimates(assessment)

Once the neural Bayes estimator has been trained and assessed, it can be applied to observed data collected completely over a 16\times 16 grid using estimate(). Below, we use simulated data as a surrogate for observed data:

theta    <- matrix(0.1)         # true parameter
Z        <- simulate(theta)     # pretend that this is observed data
estimate(estimator, Z)          # point estimates from the "observed" data
#>           [,1]
#> [1,] 0.1062393
#> attr(,"JLDIM")
#> [1] 1 1

In practice, data are often incomplete and, in the next section, we describe methods that facilitate neural Bayes estimation in these settings.

2 Incomplete data

We now consider the case where the data are collected over a regular grid, but where some elements of the grid are unobserved. This situation often arises in, for example, remote-sensing applications, where the presence of cloud cover prevents measurement in some places.

For instance, our data may look like:

We here consider two techniques that facilitate neural Bayes estimation with incomplete data: the “masking approach” of Wang et al. (2024) and the “neural EM algorithm”.

2.1 The masking approach

The first technique that we consider is the so-called masking approach of Wang et al. (2024). The strategy involves completing the data by replacing missing values with zeros, and using auxiliary variables to encode the missingness pattern, which are also passed into the network.

Let \boldsymbol{Z} denote the complete-data vector. Then, the masking approach considers inference based on \boldsymbol{W}, a vector of indicator variables that encode the missingness pattern (with elements equal to one or zero if the corresponding element of \boldsymbol{Z} is observed or missing, respectively), and \boldsymbol{U} \equiv \boldsymbol{Z} \odot \boldsymbol{W}, where \odot denotes elementwise multiplication and the product of a missing element and zero is defined to be zero. Irrespective of the missingness pattern, \boldsymbol{U} and \boldsymbol{W} have the same fixed dimensions and hence may be processed easily using a single neural network. A neural point estimator is then trained on realisations of \{\boldsymbol{U}, \boldsymbol{W}\} which, by construction, do not contain any missing elements.

Since the missingness pattern \boldsymbol{W} is now an input to the neural network, it must be incorporated during the training phase. When interest lies only in making inference from a single already-observed data set, \boldsymbol{W} is fixed and known. However, amortised inference, whereby one trains a single neural network that will be used to make inference with many data sets, requires a model for the missingness pattern \boldsymbol{W}.

Below, we define a helper function that removes a specified proportion of the data completely at random, and which serves to define our missingness model when using the masking approach in this example:

# Replaces a proportion of elements with NA
removedata <- function(Z, proportion = runif(1, 0.2, 0.8)) {
  
  # Ensure proportion is between 0 and 1
  if (proportion < 0 || proportion > 1) stop("Proportion must be between 0 and 1")
  
  # Randomly sample indices to replace
  n <- length(Z)
  n_na <- round(proportion * n)
  na_indices <- sample(1:n, n_na)
  
  # Replace selected elements with NA
  Z[na_indices] <- NA
  
  return(Z)
}

To make the general strategy concrete, consider the encoded data set \{\boldsymbol{U}, \boldsymbol{W}\} constructed from the following incomplete data \boldsymbol{Z}_1:

Next, we construct and train a masked neural Bayes estimator. Here, the first convolutional layer takes two input channels, since we store the encoded data \boldsymbol{U} in the first channel and the missingness pattern \boldsymbol{W} in the second. Otherwise, the architecture remains as given above:

masked_estimator <- juliaEval('
  # Summary network
  psi = Chain(
    Conv((3, 3), 2 => 32, relu),
    MaxPool((2, 2)),
    Conv((3, 3),  32 => 64, relu),
    MaxPool((2, 2)),
    Flux.flatten
    )
    
    # Inference network
  phi = Chain(Dense(256, 64, relu), Dense(64, 1))

  deepset = DeepSet(psi, phi)
')

Next, we generate incomplete data for training and validation using removedata(), and construct the corresponding encoded data sets \{\boldsymbol{U}, \boldsymbol{W}\} using encodedata():

UW_train <- encodedata(lapply(Z_train, removedata))
UW_val   <- encodedata(lapply(Z_val, removedata))

Training and assessment of the neural Bayes estimator then proceeds as usual:

# Train the estimator 
masked_estimator <- train(
  masked_estimator,
  theta_train = theta_train,
  theta_val   = theta_val,
  Z_train = UW_train,
  Z_val   = UW_val
)
#> Computing the initial validation risk... Initial validation risk = 0.2802006
#> Computing the initial training risk... Initial training risk = 0.27887243
#> Epoch: 1  Training risk: 0.086  Validation risk: 0.067  Run time of epoch: 7.147 seconds
#> Epoch: 2  Training risk: 0.06  Validation risk: 0.06  Run time of epoch: 6.799 seconds
#> Epoch: 3  Training risk: 0.053  Validation risk: 0.053  Run time of epoch: 6.764 seconds
#> Epoch: 4  Training risk: 0.048  Validation risk: 0.051  Run time of epoch: 6.752 seconds
#> Epoch: 5  Training risk: 0.045  Validation risk: 0.048  Run time of epoch: 6.738 seconds
#> Epoch: 6  Training risk: 0.043  Validation risk: 0.045  Run time of epoch: 6.741 seconds
#> Epoch: 7  Training risk: 0.042  Validation risk: 0.045  Run time of epoch: 6.727 seconds
#> Epoch: 8  Training risk: 0.04  Validation risk: 0.043  Run time of epoch: 6.748 seconds
#> Epoch: 9  Training risk: 0.039  Validation risk: 0.043  Run time of epoch: 6.751 seconds
#> Epoch: 10  Training risk: 0.038  Validation risk: 0.043  Run time of epoch: 6.754 seconds
#> Epoch: 11  Training risk: 0.038  Validation risk: 0.043  Run time of epoch: 6.727 seconds
#> Epoch: 12  Training risk: 0.037  Validation risk: 0.042  Run time of epoch: 6.758 seconds
#> Epoch: 13  Training risk: 0.036  Validation risk: 0.042  Run time of epoch: 6.864 seconds
#> Epoch: 14  Training risk: 0.035  Validation risk: 0.041  Run time of epoch: 6.754 seconds
#> Epoch: 15  Training risk: 0.035  Validation risk: 0.041  Run time of epoch: 6.829 seconds
#> Epoch: 16  Training risk: 0.034  Validation risk: 0.04  Run time of epoch: 6.744 seconds
#> Epoch: 17  Training risk: 0.033  Validation risk: 0.04  Run time of epoch: 6.864 seconds
#> Epoch: 18  Training risk: 0.033  Validation risk: 0.04  Run time of epoch: 6.741 seconds
#> Epoch: 19  Training risk: 0.033  Validation risk: 0.04  Run time of epoch: 6.788 seconds
#> Epoch: 20  Training risk: 0.032  Validation risk: 0.04  Run time of epoch: 6.794 seconds
#> Epoch: 21  Training risk: 0.032  Validation risk: 0.041  Run time of epoch: 6.721 seconds
#> Epoch: 22  Training risk: 0.031  Validation risk: 0.039  Run time of epoch: 6.742 seconds
#> Epoch: 23  Training risk: 0.031  Validation risk: 0.039  Run time of epoch: 6.748 seconds
#> Epoch: 24  Training risk: 0.031  Validation risk: 0.04  Run time of epoch: 6.784 seconds
#> Epoch: 25  Training risk: 0.03  Validation risk: 0.04  Run time of epoch: 6.762 seconds
#> Epoch: 26  Training risk: 0.03  Validation risk: 0.039  Run time of epoch: 6.735 seconds
#> Epoch: 27  Training risk: 0.029  Validation risk: 0.039  Run time of epoch: 6.718 seconds
#> Epoch: 28  Training risk: 0.029  Validation risk: 0.039  Run time of epoch: 6.746 seconds
#> Epoch: 29  Training risk: 0.029  Validation risk: 0.039  Run time of epoch: 6.735 seconds
#> Epoch: 30  Training risk: 0.028  Validation risk: 0.039  Run time of epoch: 6.734 seconds
#> Epoch: 31  Training risk: 0.028  Validation risk: 0.039  Run time of epoch: 6.759 seconds
#> Epoch: 32  Training risk: 0.028  Validation risk: 0.04  Run time of epoch: 6.76 seconds
#> Epoch: 33  Training risk: 0.027  Validation risk: 0.04  Run time of epoch: 6.749 seconds
#> Epoch: 34  Training risk: 0.027  Validation risk: 0.041  Run time of epoch: 6.763 seconds
#> Epoch: 35  Training risk: 0.027  Validation risk: 0.039  Run time of epoch: 6.734 seconds
#> Stopping early since the validation loss has not improved in 5 epochs

# Test the estimator with many data sets, each with a missingness proportion of 25%
theta <- prior(1000)
Z <- simulate(theta)
Z1 <- lapply(Z, removedata, 0.25)
UW <- lapply(Z1, encodedata)
assessment <- assess(masked_estimator, theta, UW, estimator_names = "Masked NBE")
#>  Running estimator Masked NBE...
plotestimates(assessment)

Note that the variance of the estimates is larger when compared to the estimates obtained from complete data; this is to be expected, since missingness reduces the available information for making inference on \boldsymbol{\theta}.

Once trained and assessed, we can apply our masked neural Bayes estimator to (incomplete) observed data. The data must be encoded in the same manner that was done during training. Below, we use simulated data as a surrogate for real data, with a missingness proportion of 0.25:

theta <- matrix(0.2)
Z <- simulate(theta)[[1]] 
Z1 <- removedata(Z, 0.25)
UW <- encodedata(Z1)
c(estimate(masked_estimator, UW))
#> [1] 0.2607822

2.2 The neural EM algorithm

Let \boldsymbol{Z}_1 and \boldsymbol{Z}_2 denote the observed and unobserved (i.e., missing) data, respectively, and let \boldsymbol{Z} \equiv (\boldsymbol{Z}_1', \boldsymbol{Z}_2')' denote the complete data. A classical approach to facilitating inference when data are missing is the expectation-maximisation (EM) algorithm (Dempster et al., 1977).

The neural EM algorithm is an approximate version of the conventional (Bayesian) Monte Carlo EM algorithm (Wei and Tanner, 1990) which, at the lth iteration, updates the parameter vector through \boldsymbol{\theta}^{(l)} = \textrm{argmax}_{\boldsymbol{\theta}} \sum_{h = 1}^H \ell(\boldsymbol{\theta}; \boldsymbol{Z}_1, \boldsymbol{Z}_2^{(lh)}) + \log \pi_H(\boldsymbol{\theta}), where realisations of the missing-data component, \{\boldsymbol{Z}_2^{(lh)} : h = 1, \dots, H\}, are sampled from the probability distribution of \boldsymbol{Z}_2 given \boldsymbol{Z}_1 and \boldsymbol{\theta}^{(l-1)}, and where \pi_H(\boldsymbol{\theta}) \propto \{\pi(\boldsymbol{\theta})\}^H is a concentrated version of the chosen prior. Note that when \pi(\boldsymbol{\theta}) is uniform, as is the case in our working example, the distribution implied by \pi_H(\cdot) is the same as that implied by \pi(\cdot).

Given the conditionally simulated data, the neural EM algorithm performs the above EM update using a neural network that returns the MAP estimate from the conditionally simulated data. Such a neural network, which we refer to as a neural MAP estimator, can be obtained by training a neural Bayes estimator under a continuous relaxation of the 0–1 loss function, for example, the loss function, L_{\kappa}(\boldsymbol{\theta}, \hat{\boldsymbol{\theta}}) = \rm{tanh}(\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}\|/\kappa), \quad \kappa > 0, which yields the 0–1 loss function in the limit as \kappa \to 0.

An advantage of using the neural EM algorithm is that the neural-network architecture does not need to be altered compared with that used in the complete-data case, and we can therefore use our complete-data estimator trained earlier as the starting point of our neural MAP estimator; this is known as pretraining, and it can substantially reduce the computational cost of training.

Below, we train a neural MAP estimator, employing the 0–1 surrogate loss function given above with \kappa = 0.1:

# Train the estimator under the tanh loss, a surrogate for the 0-1 loss 
estimator <- train(
  estimator,
  theta_train = theta_train,
  theta_val   = theta_val,
  Z_train = Z_train,
  Z_val   = Z_val, 
  loss = tanhloss(0.1)
)
#> Computing the initial validation risk... Initial validation risk = 0.1976900831136299
#> Computing the initial training risk... Initial training risk = 0.1612901621867654
#> Epoch: 1  Training risk: 0.169  Validation risk: 0.206  Run time of epoch: 7.36 seconds
#> Epoch: 2  Training risk: 0.167  Validation risk: 0.199  Run time of epoch: 6.357 seconds
#> Epoch: 3  Training risk: 0.165  Validation risk: 0.201  Run time of epoch: 6.369 seconds
#> Epoch: 4  Training risk: 0.164  Validation risk: 0.205  Run time of epoch: 6.372 seconds
#> Epoch: 5  Training risk: 0.161  Validation risk: 0.21  Run time of epoch: 6.345 seconds
#> Epoch: 6  Training risk: 0.161  Validation risk: 0.201  Run time of epoch: 6.386 seconds
#> Stopping early since the validation loss has not improved in 5 epochs

An advantage of the neural EM algorithm is that the training of the neural network is exactly the same as the complete-data case, so the methods for assessing the estimator that we describe in Section 1 can be applied directly here.

Next, we define a function for conditional simulation. For the current model, this simply involves sampling from a conditional multivariate Gaussian distribution (see, e.g., here):

simulateconditional <- function(Z, theta, nsims = 1){
  
  # Coerce theta to scalar if given as a matrix
  theta <- theta[1] 
  
  # Save the original dimensions
  dims <- dim(Z) 
  N <- nrow(Z)
  
  # Convert to vector 
  Z <- c(Z) 
  
  # Indices of observed and missing elements 
  I_1 <- which(!is.na(Z)) 
  I_2 <- which(is.na(Z))  
  n_1 <- length(I_1)
  n_2 <- length(I_2)
  
  # Extract the observed data 
  Z_1 <- Z[I_1] 
  
  # Spatial locations and distance matrices
  S <- expand.grid(seq(1, 0, len = N), seq(0, 1, len = N))
  D <- as.matrix(dist(S))  
  D_22 <- D[I_2, I_2]
  D_11 <- D[I_1, I_1]
  D_12 <- D[I_1, I_2]
  
  # Marginal covariance matrices
  Sigma_22 <- exp(-D_22 / theta)
  Sigma_11 <- exp(-D_11 / theta)
  Sigma_12 <- exp(-D_12 / theta)
  
  # Conditional covariance matrix, cov(Z_2 | Z_1, theta), its Cholesky factor, 
  # and the conditional mean, E(Z_2 | Z_1, theta)
  L_11 <- t(chol(Sigma_11))
  x <- solve(L_11, Sigma_12)
  y <- solve(L_11, Z_1)
  Sigma <- Sigma_22 - crossprod(x)
  L <- t(chol(Sigma))
  mu <- c(crossprod(x, y))
  
  # Simulate from the conditional distribution Z_2 | Z_1, theta ~ Gau(mu, Sigma)
  z <- matrix(rnorm(n_2 * nsims), nrow = n_2, ncol = nsims)
  Z_2 <- mu + L %*% z
  
  # Combine Z_1 with conditionally-simulated replicates of Z_2
  Z <- sapply(1:nsims,function(l){
    z <-rep(NA, n_1 + n_2)
    z[I_1]<- Z_1
    z[I_2]<- Z_2[, l]
    z
  })
  
  # Convert Z to an array with appropriate dimensions
  Z <- array(Z, dim=c(dims,1, nsims))
  
  return(Z)
}

Let’s visualise a few conditional simulations given incomplete data \boldsymbol{Z}_1. Below, the left panel shows the incomplete data \boldsymbol{Z}_1, and the remaining panels show conditional simulations given \boldsymbol{Z}_1 and the true parameter \theta = 0.2:

The final step is to define a function that implements the Monte Carlo EM algorithm. This involves the specification of an initial estimate \boldsymbol{\theta}^{(0)}, the maximum number of iterations, and a convergence criterion:

# Monte Carlo EM algorithm 
EM <- function(Z1,                # incomplete data
               estimator,         # (neural) MAP estimator
               theta_0,           # initial estimate
               niterations = 100, # maximum number of iterations
               tolerance = 0.01,  # convergence tolerance
               nconsecutive = 3,  # number of consecutive iterations for which the convergence criterion must be met
               nsims = 1,         # Monte Carlo sample size
               verbose = TRUE,    # print current estimate to console if TRUE
               return_iterates = FALSE  # return all iterates if TRUE
               ) {
  
  if(verbose) print(paste("Initial estimate:", theta_0))
  theta_l <- theta_0          # initial estimate
  convergence_counter <- 0    # initialise counter for consecutive convergence
  
  # Initialize a matrix to store all iterates as columns
  p <- length(theta_0)
  theta_all <- matrix(NA, nrow = p, ncol = niterations + 1)
  theta_all[, 1] <- theta_0

  for (l in 1:niterations) {
    # complete the data by conditional simulation
    Z <- simulateconditional(drop(Z1), theta_l, nsims = nsims)
    # compute the MAP estimate from the conditionally sampled replicates
    if ("JuliaProxy" %in% class(estimator)) {
      theta_l_plus_1 <- c(estimate(estimator, Z)) # neural MAP estimator
    } else {
      theta_l_plus_1 <- estimator(Z, theta_l)     # analytic MAP estimator
    }
    # check convergence criterion
    if (max(abs(theta_l_plus_1 - theta_l) / abs(theta_l)) < tolerance) {
      # increment counter if condition is met
      convergence_counter <- convergence_counter + 1  
      # check if convergence criterion has been met for required number of iterations
      if (convergence_counter == nconsecutive) {        
        if(verbose) message("The EM algorithm has converged")
        theta_all[, l + 1] <- theta_l_plus_1  # store the final iterate
        break
      }
    } else {
      # reset counter if condition is not met
      convergence_counter <- 0  
    }
    theta_l <- theta_l_plus_1  
    theta_all[, l + 1] <- theta_l  # store the iterate
    if(verbose) print(paste0("Iteration ", l, ": ", theta_l))
  }
  
  # Remove unused columns if convergence occurred before max iterations
  theta_all <- theta_all[, 1:(l + 1), drop = FALSE]

  # Return all iterates if return_iterates is TRUE, otherwise return the last iterate
  if (return_iterates) {
    return(theta_all)
  } else {
    return(theta_all[, ncol(theta_all)])
  }
}

We are now ready to apply the neural EM algorithm with incomplete data. Here, we use the same incomplete data \boldsymbol{Z}_1 simulated conditionally on \theta = 0.2 at the end of the preceding subsection:

theta_0 <- 0.1
EM(Z1, estimator, theta_0, nsims = 100)
#> [1] "Initial estimate: 0.1"
#> [1] "Iteration 1: 0.166097149252892"
#> [1] "Iteration 2: 0.198796659708023"
#> [1] "Iteration 3: 0.214800849556923"
#> [1] "Iteration 4: 0.220833465456963"
#> [1] "Iteration 5: 0.217493936419487"
#> [1] "Iteration 6: 0.219034016132355"
#> [1] "Iteration 7: 0.219332501292229"
#> The EM algorithm has converged
#> [1] 0.2213956

Visualise the Monte Carlo variability with different Monte Carlo sample sizes:

all_H <- c(1, 10, 100)
dfs <- list()
for (H in all_H) {
    estimates <- c(EM(Z1, estimator, theta_0, nsims = H, return_iterates = TRUE, verbose = FALSE, tolerance = 0.0001))
    df <- data.frame(
      iteration = 1:length(estimates),
      estimate = estimates, 
      H = H
      )
    dfs <- c(dfs, list(df))
}
df <- do.call(rbind, dfs)
ggplot(df) + 
  geom_line(aes(x = iteration, y = estimate)) + 
  facet_wrap(~ H, labeller = labeller(H = function(labels) paste0("H = ", labels)), nrow = 1) + 
  theme_bw()

2.3 Summary

We have considered two approaches that facilitate neural Bayes estimation with incomplete data.

  1. The masking approach, where the input to the neural network is the complete-data vector with missing entries replaced by a constant (typically zero), along with a vector of indicator variables that encode the missingness pattern.

    + Does not require conditional simulation, and is therefore broadly applicable.

    + Can be used with all loss functions that are amenable to gradient-based training.

    - Needs the neural network to take an additional input (the missingness pattern).

    - More complicated learning task.

    - Requires a model for the missingness mechanism.

  2. The neural EM algorithm, a Monte Carlo EM algorithm where the incomplete data are completed using conditional simulation.

    + Neural-network architecture is the same as that used in the complete-data case.

    + Simpler learning task (mapping from the complete data to the parameter space).

    + Does not require a model for the missingness mechanism.

    - Requires conditional simulation, which places restrictions on the applicable class of models.

    - Limited to providing MAP estimates.