Abstract
Functional data analysis is a statistical approach used to analyze data that appear as functions or images. Functional data analysis methods can be used to analyze device-based measures such as physical activity and sleep. Although device-based measures are more objective than self-reported measures of physical activity patterns or sleep activity, device-based measures can still be prone to measurement error. When assessing associations between scalar-valued outcomes and device-based measures, scalar-on-function regression models that correct for measurement error can be applied. We develop the MECfda package for several scalar-on-function regression models including multi-level generalized scalar-on-function regression models and functional quantile linear regression models. The developed package implements several bias-correction methods that account for the presence of multiple functional and scalar covariates prone to measurement error in various scalar-on-function regression settings.
Functional data analysis (FDA) is an essential statistical approach for handling high-dimensional data that appear in the form of functions or images [1–3]. In many applications, data are collected continuously or at frequent time intervals, resulting in complex functions over time. When the objective is to study the relationship between a scalar response and both functional and scalar covariates, scalar-on-function regression models become a natural choice.
Although self-reported measures (e.g., dietary intake) are known to suffer from measurement error [4], recent studies have demonstrated that even device-based measurements (such as those from wearable devices) can be error prone [5, 6]. Failing to correct for measurement error in either scalar or function-valued covariates may lead to biased parameter estimates.
The goal of the MECfda package is to provide solutions for a range of scalar-on-function regression models—including multi-level generalized scalar-on-function regression models and functional quantile regression models—while incorporating several bias-correction methods for measurement error in scalar-on-function regression problem.
This package is developed in R.
The general form of the scalar-on-function regression model is given by \[T(F_{Y_i|X_i,Z_i}) = \sum_{l=1}^{L} \int_{\Omega_l} \beta_l(t) X_{li}(t) dt + (1,Z_i^T)\gamma\] where
A statistical functional is defined as a mapping from a probability distribution to a real number. Statistical functionals represent quantifiable properties of probability distributions, (e.g., an expectation, a variance or a quantile).
Two common model types supported in this package are:
Multi-level Generalized Scalar-on-Function
Regression:
This model is formulated as \[
T\bigl(F_{Y_i\mid X_i,Z_i}\bigr) = g\Bigl\{E(Y_i\mid X_i,Z_i)\Bigr\} =
g\left\{\int_\mathbb{R}ydF_{Y_i|X_i,Z_i}(y)\right\},
\] where \(g(\cdot)\) is a
strictly monotonic link function analogous to that used in generalized
linear models.
Scalar-on-Function Quantile Regression:
This model is expressed as \[
T\bigl(F_{Y_i\mid X_i,Z_i}\bigr) = Q_{Y_i\mid X_i,Z_i}(\tau) =
F_{Y_i\mid X_i,Z_i}^{-1}(\tau),
\] where \(\tau \in (0,1)\)
denotes the quantile of interest.
Function-valued variables (functional variables) are typically recorded as measurements at specific time points, resulting in an \(n \times m\) data matrix \((x_{ij})\), where each row corresponds to an observation (subject) and each column to a measurement time.
Function-valued variables (functional variables) are typically recorded as the measurements of the values of the functions at specific (time) points in the domain. The data of a function-valued variable are often presented in the form of a matrix, \((x_{ij})_{n\times m}\), where \(x_{ij} = f_i(t_j)\), represents the value of \(f_i(t_j)\), each row corresponds to an observation (subject) and each column to a measurement time.
functional_variable
In the MECfda package, the s4 class
functional_variable
represents function-valued data stored
in matrix form. The main slots include:
X
: The data matrix \((x_{ij})\);t_0
and
period
: The left endpoint and length of
the function’s domain, respectively (The class only support the domain
in the format of interval);t_points
: A vector of the time points
at which the measurements are recorded.A dedicated method, dim
, returns the number of subjects
and the number of measurement points for a
functional_variable
object.
Let \(\{\rho_{k}\}_{k=1}^\infty\) be a complete basis for \(L^2(\Omega)\). For an arbitrary function \(\beta(t) \in L^2(\Omega)\), there exists a sequence of (real) number \(\{c_k\}\) such that \[ \beta(t) = \sum_{k=1}^\infty c_k \, \rho_k(t). \]
For the function-valued variable \(X_i(t)\), we have \[ \int_\Omega \beta(t) X_i(t) \, dt = \int_\Omega X_i(t) \sum_{k=1}^\infty c_{k}\rho_{k}(t) dt = \sum_{k=1}^\infty c_k \int_\Omega \rho_k(t) X_i(t) \, dt. \]
In practice, for \(\int_\Omega\beta(t) X_i(t) dt\) in statistical models, we truncate the infinite series to \(p\) terms (\(p<\infty\)), using \(c_{k} \int_\Omega X_i(t) \rho_{k}(t) dt\) to approximate \(\int_\Omega\beta(t) X_i(t) dt\) so that the problem reduces to estimating a finite number of parameters \(c_1, \dots, c_p\), and treating \(\int_\Omega \rho_k(t) X_i(t) \, dt\) as new covariates. The number of basis, \(p\), can be related to the sample size, \(n\), and the \(p\) is in a form of \(p(n)\).
In practice, we may not necessarily use the truncated complete basis of \(L^2\) function space. Instead, we can just use a finite sequence of linearly independent functions as the basis function. Common choices for the basis include the Fourier basis, b-spline basis, and eigenfunction basis.
In numerical computing level, performing basis expansion methods on function-valued variable data in matrix form as we mentioned is to compute \((b_{ik})_{n\times p}\), where \(b_{ik} = \int_\Omega f(t)\rho_k(t) dt\).
On the interval \([t_0, t_0+T]\), the Fourier basis consists of
\[ \frac{1}{2}, \quad \cos\left(\frac{2\pi}{T}k(x-t_0)\right), \quad \sin\left(\frac{2\pi}{T}k(x-t_0)\right), \quad k = 1, 2, \dots \]
Fourier_series
In the MECfda package, s4 class
Fourier_series
represents a Fourier series of the form
\[ \frac{a_0}{2} + \sum_{k=1}^{p_a} a_k \cos\left(\frac{2\pi k (x-t_0)}{T}\right) + \sum_{k=1}^{p_b} b_k \sin\left(\frac{2\pi k (x-t_0)}{T}\right),\quad x \in [t_0, t_0+T]. \]
Its main slots include:
double_constant
: The value of \(a_0\);cos
and
sin
: The coefficients \(a_k\) and \(b_k\) for the cosine and sine terms,
respectively;k_cos
and
k_sin
: The frequency values corresponding
to the cosine and sine coefficients;t_0
and
period
: The left endpoint, \(t_0\), and length, \(T\), of the domain (interval \([t_0, t_0+T]\)), respectively.Additional methods such as plot
,
FourierSeries2fun
, and extractCoef
are
provided for visualization, function evaluation, and coefficient
extraction.
fsc = Fourier_series(
double_constant = 3,
cos = c(0,2/3),
sin = c(1,7/5),
k_cos = 1:2,
k_sin = 1:2,
t_0 = 0,
period = 1
)
plot(fsc)
FourierSeries2fun(fsc,seq(0,1,0.05))
#> [1] 2.1666667 3.1712610 3.6252757 3.4344848 2.7346112 1.8333333
#> [7] 1.0888125 0.7715265 0.9623175 1.5254623 2.1666667 2.5532270
#> [13] 2.4497052 1.8164508 0.8324982 -0.1666667 -0.8133005 -0.8465074
#> [19] -0.2132530 0.9074283 2.1666667
extractCoef(fsc)
#> $a_0
#> 0
#> 3
#>
#> $a_k
#> 1 2
#> 0.0000000 0.6666667
#>
#> $b_k
#> 1 2
#> 1.0 1.4
The object fsc
represents the summation \[\frac32 + \frac23 \cos(2\pi\cdot2x) + \sin(2\pi
x) + \frac75\sin(2\pi\cdot2x).\]
A b-spline basis \(\{B_{i,p}(x)\}_{i=-p}^{k}\) on the interval \([t_0, t_{k+1}]\) is defined recursively as follows: - For \(p = 0\), \[ B_{i,0}(x) = \begin{cases} I_{(t_i, t_{i+1}]}(x), & i = 0, 1, \dots, k, \\ 0, & \text{otherwise}. \end{cases} \] - For \(p > 0\), the recursion is given by \[ B_{i,p}(x) = \frac{x-t_i}{t_{i+p}-t_i} B_{i,p-1}(x) + \frac{t_{i+p+1}-x}{t_{i+p+1}-t_{i+1}} B_{i+1,p-1}(x). \]
At discontinuities in the interval \((t_0,t_k)\), the basis function is defined as its limit, \(B_{i,p}(x) = \lim_{t\to x} B_{i,p}(t)\), to ensure continuity.
bspline_basis
and
bspline_series
The s4 class bspline_basis
represents a b-spline basis,
\(\{B_{i,p}(x)\}_{i=-p}^{k}\), on the
interval \([t_0, t_{k+1}]\). Its key
slots include:
Boundary.knots
: The boundary \([t_0, t_{k+1}]\), (default is \([0,1]\));knots
: The internal spline knots,
\((t_1,\dots,t_k)\), (automatically
generated as equally spaced, \(t_j = t_0 +
j\cdot\frac{t_{k+1}-t_0}{k+1}\), if not assigned);intercept
: Indicates whether an
intercept is included (default is TRUE
);df
: Degrees of freedom of the basis,
which is the number of the splines, equal to \(p + k + 1\) (by default \(k =0\) and \(\text{df} = p+1\));degree
: The degree of the spline,
which is the degree of piecewise polynomials, \(p\), (default is 3).The s4 class bspline_series
represents the linear
combination
\[ \sum_{i=0}^{k} b_i B_{i,p}(x), \]
where the slot coef
stores the
coefficients \(b_i\) (\(i = 0,\dots,k\)) and
bspline_basis
specifies the associated
basis (represented by a bspline_basis
object).
Methods such as plot
and bsplineSeries2fun
are provided for visualization and function evaluation.
bsb = bspline_basis(
Boundary.knots = c(0,24),
df = 7,
degree = 3
)
bss = bspline_series(
coef = c(2,1,3/4,2/3,7/8,5/2,19/10),
bspline_basis = bsb
)
plot(bss)
bsplineSeries2fun(bss,seq(0,24,0.5))
#> [1] 2.0000000 1.7677509 1.5690908 1.4011502 1.2610597 1.1459499 1.0529514
#> [8] 0.9791948 0.9218107 0.8779297 0.8446824 0.8191993 0.7986111 0.7805266
#> [15] 0.7644676 0.7504340 0.7384259 0.7284433 0.7204861 0.7145544 0.7106481
#> [22] 0.7087674 0.7089120 0.7110822 0.7152778 0.7216857 0.7312404 0.7450629
#> [29] 0.7642747 0.7899969 0.8233507 0.8654574 0.9174383 0.9804145 1.0555073
#> [36] 1.1438380 1.2465278 1.3634786 1.4897151 1.6190430 1.7452675 1.8621942
#> [43] 1.9636285 2.0433759 2.0952418 2.1130317 2.0905511 2.0216053 1.9000000
The object bsb
represents \(\{B_{i,3}(x)\}_{i=-3}^{0}\), and the object
bss
represents
\[2B_{i,-3}(x)+B_{i,-2}(x)+\frac34B_{i,-1}(x)+\frac23B_{i,0}(x)
+ \frac78B_{i,1}(x) +
\frac52B_{i,2}(x) +\frac{19}{10}B_{i,3}(x),\] where \(x\in[t_0,t_4]\) and \(t_0=0\), \(t_k =
t_{k-1}+6\) ,\(k=1,2,3,4\).
Suppose \(K(s,t)\in L^2(\Omega\times \Omega)\), \(f(t)\in L^2(\Omega)\). Then \(K\) induces an linear operator \(\mathcal{K}\) by \[(\mathcal{K}f)(x) = \int_{\Omega} K(t,x)f(t)dt\] If \(\xi \in L^2(\Omega)\) such that \[\mathcal{K}\xi = \lambda \xi\] where \(\lambda\in \mathbb{C}\), we call \(\xi\) an eigenfunction/eigenvector of \(\mathcal{K}\), and \(\lambda\) an eigenvalue associated with \(\xi\).
For a stochastic process \(\{X(t),t\in\Omega\}\) we call the orthogonal basis, \(\{\xi_k\}_{k=1}^\infty\) corresponding to eigenvalues \(\{\lambda_k\}_{k=1}^\infty\) (\(\lambda_1\geq\lambda_2\geq\dots\)), induced by \[K(s,t)=\operatorname{Cov}(X(t),X(s))\] a functional principal component (FPC) basis for \(L^2(\Omega)\).
The eigenfunction basis relies on the covariance function \(K(s,t)\). And we usually cannot analytically express the equation of the basis funcitons in eigenfunction basis. In practice, the covariance function is estimated from the data and the eigenfunctions are computed numerically.
Sometimes, analytical expressions for the basis functions are not available. In these cases, they can be represented numerically. Let \(\{\rho_k\}_{k=1}^\infty\) denote a basis of the function space. We can numerically approximate the basis by storing the values of a finite subset of these functions at a finite set of points in the domain, i.e., \(\rho_k(t_j)\) for \(k=1,\dots,p\) and \(j=1,\dots,m\).
numeric_basis
and
numericBasis_series
In this package, the s4 class numeric_basis
is designed
to represent a finite sequence of functions \(\{\rho_k\}_{k=1}^p\) by their values at a
finite set of points within their domain. In this context, all functions
share the same domain, which is assumed to be an interval. The key slots
include:
basis_function
: The matrix \((\zeta_{jk})_{m \times p}\), where each
element is defined as \(\zeta_{jk} =
\rho_k(t_j)\) for \(j = 1, \dots,
m\) and \(k = 1, \dots, p\). In
this matrix, each row corresponds to a point in the domain, and each
column corresponds to a specific basis function.t_points
: A numeric vector that
represents the points in the domain at which the basis functions are
evaluated. The \(j\)-th element of this
vector corresponds to the \(j\)-th row
of the basis_function
matrix.t_0
and
period
: The left endpoint and length of
the domain, respectively.Additionally, the package provides an s4 class
numericBasis_series
, which represents a linear combination
of the basis functions represented by a numeric_basis
object. The key slots include:
coef
: The linear coefficients for the
summation series.numeric_basis
: Function basis as
represented by a numeric_basis
object.basis2fun
The generic function basis2fun
is provided for
Fourier_series
, bspline_series
, and
numericBasis_series
objects. For a
Fourier_series
object, it is equivalent to
FourierSeries2fun
; for a bspline_series
object, it is equivalent to bsplineSeries2fun
; For a
numericBasis_series
object, it is equivalent to
numericBasisSeries2fun
.
basis2fun(fsc,seq(0,1,0.05))
#> [1] 2.1666667 3.1712610 3.6252757 3.4344848 2.7346112 1.8333333
#> [7] 1.0888125 0.7715265 0.9623175 1.5254623 2.1666667 2.5532270
#> [13] 2.4497052 1.8164508 0.8324982 -0.1666667 -0.8133005 -0.8465074
#> [19] -0.2132530 0.9074283 2.1666667
basis2fun(bss,seq(0,24,0.5))
#> [1] 2.0000000 1.7677509 1.5690908 1.4011502 1.2610597 1.1459499 1.0529514
#> [8] 0.9791948 0.9218107 0.8779297 0.8446824 0.8191993 0.7986111 0.7805266
#> [15] 0.7644676 0.7504340 0.7384259 0.7284433 0.7204861 0.7145544 0.7106481
#> [22] 0.7087674 0.7089120 0.7110822 0.7152778 0.7216857 0.7312404 0.7450629
#> [29] 0.7642747 0.7899969 0.8233507 0.8654574 0.9174383 0.9804145 1.0555073
#> [36] 1.1438380 1.2465278 1.3634786 1.4897151 1.6190430 1.7452675 1.8621942
#> [43] 1.9636285 2.0433759 2.0952418 2.1130317 2.0905511 2.0216053 1.9000000
functional_variable
classThe MECfda pakcage provide the methods
fourier_basis_expansion
,
bspline_basis_expansion
, FPC_basis_expansion
,
and numeric_basis_expansion
for the class
functional_variable
, which perform basis expansion using
Fourier basis, b-spline basis, functional principal component basis, and
numerical basis respectively.
The package approximates the integral
\[ \int_{\Omega} \rho_{k}(t) X_{i}(t) \, dt \approx \frac{\mu(\Omega)}{|T|} \sum_{t \in T} \rho_{k}(t) X_{i}(t), \]
where \(\mu(\cdot)\) denote the Lebesgue measure. \(T\) denotes the set of measurement time points for \(X_i(t)\) and \(|T|\) represents the cardinal number of \(T\).
fcRegression(Y, FC, Z, formula.Z, family = gaussian(link = "identity"),
basis.type = c("Fourier", "Bspline"), basis.order = 6L,
bs_degree = 3)
The MECfda package provides a function,
fcRegression
, for fitting generalized linear mixed-effect
models—including ordinary linear models and generalized linear models
with fixed and random effects—by discretizing function-valued covariates
using basis expansion. The function can solve a linear model of the
form
\[ g\bigl(E(Y_i\mid X_i,Z_i)\bigr) = \sum_{l=1}^{L} \int_{\Omega_l} \beta_l(t) X_{li}(t) \, dt + (1,Z_i^T)\gamma. \]
It supports one or multiple function-valued covariates as fixed
effects, and zero, one, or multiple scalar covariates, which can be
specified as either fixed or random effects. The response variable,
function-valued covariates, and scalar covariates are supplied
separately via the arguments Y
, FC
, and
Z
, respectively, allowing great flexibility in data
format.
Response Variable:
The response can be provided as an atomic vector, a one-column matrix,
or a data frame. However, a one-column data frame or matrix with a
column name is recommended so that the response variable’s name is
clearly specified.
Function-Valued Covariates:
The function-valued covariates can be input as a
functional_variable
object, a matrix, a data frame, or a
list of these objects. If a single functional_variable
object (or matrix or data frame) is provided via FC
, the
model includes one function-valued covariate. If a list is provided, the
model accommodates multiple function-valued covariates, with each
element in the list representing a distinct covariate.
Scalar Covariates:
Scalar covariates can be provided as a matrix, data frame, atomic
vector, or even NULL
if no scalar covariates are present.
If Z
is omitted or set to NULL
, no scalar
covariate is included and the argument formula.Z
should
also be NULL
or omitted. When an atomic vector is used for
Z
, it represents a single scalar covariate (without a
name). Hence, even when only one scalar covariate is included, it is
recommended to supply it as a matrix or data frame with a column name.
The formula.Z
argument specifies which parts of
Z
to use and whether scalar covariates should be treated as
fixed or random effects.
Other key arguments include:
family
: Specifies the response
variable’s distribution and the link function to be used.basis.type
: Indicates the type of
basis function for the expansion. Options include
'Fourier'
, 'Bspline'
, and 'FPC'
,
corresponding to the Fourier basis, b-spline basis, and functional
principal component basis, respectively.basis.order
: Specifies the number of
basis functions to use. For the Fourier basis (i.e., \(\frac{1}{2},\ \sin(kt),\ \cos(kt)\) for
\(k = 1, \dots, p_f\)),
basis.order
equals \(p_f\). For the b-spline basis \(\{B_{i,p}(x)\}_{i=-p}^{k}\),
basis.order
is set to \(k+p+1\). For the FPC basis,
basis.order
represents the number of functional principal
components.bs_degree
: Specifies the degree of the
piecewise polynomials for the b-spline basis; this argument is only
necessary when using the b-spline basis.The function returns an s3 object of class fcRegression
,
which is a list containing the following elements:
Fourier_series
, bspline_series
, or
numericBasis_series
objects representing the
function-valued linear coefficients of the covariates.Additionally, the predict
method can be used to obtain
predicted values from the fitted model, and the fc.beta
method is available to evaluate the estimated function-valued
coefficient parameters \(\beta_l(t)\).
data(MECfda.data.sim.0.0)
res = fcRegression(FC = MECfda.data.sim.0.0$FC,
Y=MECfda.data.sim.0.0$Y,
Z=MECfda.data.sim.0.0$Z,
family = gaussian(link = "identity"),
basis.order = 5, basis.type = c('Bspline'),
formula.Z = ~ Z_1 + (1|Z_2))
t = (0:100)/100
plot(x = t, y = fc.beta(res,1,t), ylab = expression(beta[1](t)))
fcQR(Y, FC, Z, formula.Z, tau = 0.5, basis.type = c("Fourier", "Bspline"),
basis.order = 6L, bs_degree = 3)
The MECfda package provides a function,
fcRegression
, for fitting quantile linear regression models
by discretizing function-valued covariates using basis expansion. The
function can solve a linear model of the form
\[Q_{Y_i|X_i,Z_i}(\tau) =
\sum_{l=1}^L\int_{\Omega_l} \beta_l(\tau,t) X_{li}(t) dt
+ (1,Z_i^T)\gamma.\] The function supports one or multiple
function-valued covariates, as well as zero, one, or multiple
scalar-valued covariates. Data input follows the same guidelines as for
the fcRegression
function, and the treatment of scalar
covariates is determined by the formula.Z
argument. The
primary difference between fcQR
and
fcRegression
is that the quantile linear regression model
does not incorporate random effects. The quantile \(\tau\) is specified by the argument
tau
, and the type and parameters of the basis functions are
defined by the basis.type
, basis.order
, and
bs_degree
arguments, just as in
fcRegression
.
The function returns an s3 object of class fcQR
, which
is a list containing the following elements:
Fourier_series
, bspline_series
, or
numericBasis_series
objects representing the
function-valued linear coefficients of the covariates.Additionally, the predict
method can be used to obtain
predicted values from the fitted model, and the fc.beta
method is available to evaluate the estimated function-valued
coefficient parameters \(\beta_l(t)\).
data(MECfda.data.sim.0.0)
res = fcQR(FC = MECfda.data.sim.0.0$FC,
Y=MECfda.data.sim.0.0$Y,
Z=MECfda.data.sim.0.0$Z,
tau = 0.5,
basis.order = 5, basis.type = c('Bspline'),
formula.Z = ~ .)
t = (0:100)/100
plot(x = t, y = fc.beta(res,1,t), ylab = expression(beta[1](t)))
Data collected in real-world settings often include measurement error, especially function-valued data. Measurement error in a data set may result in estimation bias. The *MECfda** package also provides bias correction estimation method functions for certain linear regression models for use with data containing measurement error.
ME.fcRegression_MEM(
data.Y,
data.W,
data.Z,
method = c("UP_MEM", "MP_MEM", "average"),
t_interval = c(0, 1),
t_points = NULL,
d = 3,
family.W = c("gaussian", "poisson"),
family.Y = "gaussian",
formula.Z,
basis.type = c("Fourier", "Bspline"),
basis.order = NULL,
bs_degree = 3,
smooth = FALSE,
silent = TRUE
)
Wearable monitoring devices permit the continuous monitoring of
biological processes, such as blood glucose metabolism, and behaviors,
such as sleep quality and physical activity.
Continuous monitoring often collect data in 60-second epochs over
multiple days, resulting in high-dimensional multi-level longitudinal
curves that are best described and analyzed as multi-level functional
data.
Although researchers have previously addressed measurement error in
scalar covariates prone to error, less work has been done for correcting
measurement error in multi-level high dimensional curves prone to
heteroscedastic measurement error. Therefore, Luan et. al. proposed
mixed effects-based-model-based (MEM) methods for bias correction due to
measurement error in multi-level functional data from the exponential
family of distributions that are prone to complex heteroscedastic
measurement error.
They first developed MEM methods to adjust for biases due to the
presence of measurement error in multi-level generalized functional
regression models.
They assumed that the distributions of the function-valued covariates
prone to measurement error belong to the exponential family. This
assumption allows for a more general specification of the distributions
of error-prone functional covariates compared to current approaches that
often entail normality assumptions for these observed measures. The
approach adopted by Luan et al. allows a nonlinear association between
the true measurement and the observed measurement prone to measurement
error.
Second, they treated the random errors in the observed measures as
complex heteroscedastic errors from the Gaussian distribution with
covariance error functions. Third, these methods can be used to evaluate
relationships between multi-level function-valued covariates with
complex measurement error structures and scalar outcomes with
distributions in the exponential family.
Fourth, they treat the function-valued covariate as an observed measure
of the true function-valued unobserved latent covariate.
Additionally, these methods employ a point-wise method for fitting the
multi-level functional MEM approach, avoiding the need to compute
complex and intractable integrals that would be required in traditional
approaches for reducing biases due to measurement error in multi-level
functional data [7].
Their statistical model is as follows: \[\begin{align*} &g(E(Y_i|X_i,Z_i)) = \int_{\Omega} \beta(t) X_{i}(t) dt + (1,Z_i^T)\gamma\\ &h(E(W_{ij}(t)|X_i(t))) = X_i(t)\\ &X_i(t) = \mu_x(t) + \varepsilon_{xi}(t) \end{align*}\] where the response variable \(Y_i\) and scalar-valued covariates \(Z_i\) are measured without error, function-valued covariate \(X_i(t)\) is repeatedly measured with error as \(W_{ij}(t)\). The model also includes the following assumptions:
They proposed a MEM estimation method to correct for bias caused by the presence of measurement error in the function-valued covariate. This method allows for the investigation of a nonlinear measurement error model, in which the relationship between the true and observed measurements is not constrained to be linear, and the distribution assumption for the observed measurement is relaxed to encompass the exponential family rather than being limited to a Gaussian distribution.
The MEM approach is a two stage method that employs functional
mixed-effects models. The first stage of the MEM approach involves using
a functional mixed-effects model
to approximate the true measure \(X_i(t)\) with the repeated observed
measurement \(W_{ij}(t)\). The MEM
approach is primarily based on the assumptions that \(h[E\{W_{ij}(t)|X_i(t)\}] = X_i(t)\) and
\(X_i(t) = \mu_x(t) +
\varepsilon_{xi}(t)\). That is, in the functional mixed-effects
model containing one fixed intercept and one random intercept, the
random intercept is assumed to to be the subject-wise deviation of \(X_i(t)\) from the mean process \(\mu_x(t)\), and the fixed intercept is
assumed to represent \(\mu_x(t)\). The
second stage involves replacing \(X_i(t)\) in the regression model with the
resulting approximation of \(X_i(t)\)
from the first stage. The MEM approach employs point-wise (UP_MEM) and
multi-point-wise (MP_MEM) estimation procedures to avoid potential
computational complexities caused by analyses of multi-level functional
data and computations of potentially intractable and complex
integrals.
The MECfda package provides the function
ME.fcRegression_MEM
for applying bias-correction estimation
methods. In this function, the response variable, function-valued
covariates, and scalar covariates are supplied separately via the
arguments data.Y
, data.W
, and
data.Z
, respectively.
data.Y
(Response Variable):
The response variable can be provided as an atomic vector, a one-column
matrix, or a data frame. However, a one-column data frame with a column
name is recommended so that the response variable is explicitly
identified.
data.W
(Measurement Data):
This argument represents \(W\), the
observed measurement of the true function-valued covariate \(X\) in the statistical model. It should be
provided as a three-dimensional array where each row corresponds to a
subject, each column to a measurement (time) point, and each layer to a
separate observation.
data.Z
(Scalar Covariates):
The scalar covariates can be input as a matrix, data frame, or atomic
vector. If the model does not include any scalar covariates,
data.Z
can be omitted or set to NULL
. For a
single scalar covariate, although an atomic vector is acceptable, a data
frame or matrix with a column name is recommended. For multiple scalar
covariates, use a matrix or data frame with named columns.
Other key arguments include:
method
: Specifies the method for
constructing the substitute for \(X\);
available options are 'UP_MEM'
, 'MP_MEM'
, and
'average'
.t_interval
: Defines the domain of the
function-valued covariate as a two-element vector (default is
c(0,1)
, representing the interval \([0,1]\)).t_points
: Specifies the sequence of
measurement time points (default is NULL
).d
: When using the MP_MEM
method, this argument sets the number of measurement points involved
(default is 3, which is also the minimum value).family.W
: Specifies the distribution
of \(W\) given \(X\); available options are
'gaussian'
and 'poisson'
.family.Y
: Describes the error
distribution and link function for the response variable (see
stats::family
for details).formula.Z
: Indicates which parts of
data.Z
to include in the model and whether to treat the
scalar covariates as fixed or random effects.basis.type
: Indicates the type of
basis function for the expansion. Options include
'Fourier'
, 'Bspline'
, and 'FPC'
,
corresponding to the Fourier basis, b-spline basis, and functional
principal component basis, respectively.basis.order
: Specifies the number of
basis functions to use. For the Fourier basis (i.e., \(\frac{1}{2},\ \sin(kt),\ \cos(kt)\) for
\(k = 1, \dots, p_f\)),
basis.order
equals \(p_f\). For the b-spline basis \(\{B_{i,p}(x)\}_{i=-p}^{k}\),
basis.order
is set to \(k+p+1\). For the FPC basis,
basis.order
represents the number of functional principal
components.bs_degree
: Specifies the degree of the
piecewise polynomials for the b-spline basis; this argument is only
necessary when using the b-spline basis.smooth
: Specifies whether to smooth
the substitution for \(X\) (default is
FALSE
).silent
: Controls whether the function
displays its progress during execution (default is
TRUE
).The function ME.fcRegression_MEM
returns an object of
class fcRegression
.
The package also provide a function, MEM_X_hat
. This
function can return the \(\hat X(t)\)
in this bias-correction method.
ME.fcQR_IV.SIMEX(
data.Y,
data.W,
data.Z,
data.M,
tau = 0.5,
t_interval = c(0, 1),
t_points = NULL,
formula.Z,
basis.type = c("Fourier", "Bspline"),
basis.order = NULL,
bs_degree = 3
)
Health a major concerns for many people, and as technology advances,
wearable devices have become the mainstream method for monitoring and
evaluating individual physical activity levels. Despite personal
preferences in brands and feature design, the accuracy of the data
presented is what makes the device a great product. These functional
data collected from devices are generally considered more accurate and
subjective compared to other objective methods like questionnaires and
self-reports. Because physical activity levels are not directly
observable, algorithms are used generate corresponding summary reports
of different activity level using complex data (e.g. steps, heart
rate).
However, these results may differ depending on which device is used. And
aside from variation in hardware, data collected could also vary by the
combinations between how the device is worn and the activity of
interest. Although current devices may be sufficiently accurate to
monitor general physical activity levels, more precise data may enable
additional functions such as detecting irregular heart rhythms or
radiation exposures that would contribute toward improving the health of
the general public and elevating the overall well-being of society.
Quantile regression is a tool that was developed to meet the need for modeling complex relationships between a set of covariates and quantiles of an outcome. In obesity studies, the effects of interventions or covariates on body mass index (BMI) are most commonly estimated using ordinary least square methods. However, mean regression approaches are unable to address these effects for individuals whose BMI values fall in the upper quantile of the distribution. Compared with traditional linear regression approaches, quantile regression approaches make no assumptions regarding the distribution of residuals and are robust to outlying observations.
Tekwe et. al. proposed a simulation extrapolation (SIMEX) estimation method that uses an instrumental variable to correct for bias due to measurement error in scalar-on-function quantile linear regression. They demonstrated the usefulness of this method by evaluating the association between physical activity and obesity using the National Health and Nutrition Examination Survey (NHANES) dataset [8]. Their statistical model is as follows: \[\begin{align*} &Q_{Y_i|X_i,Z_i}(\tau) = \int_{\Omega} \beta(\tau,t) X_{i}(t) dt + (1,Z_i^T)\gamma(\tau)\\ &W_i(t) = X_i(t) + U_i(t)\\ &M_i(t) = \delta(t) X_i(t) + \eta_i(t) \end{align*}\] where the response variable \(Y_i\) and scalar-valued covariates \(Z_i\) are measured without error, the function-valued covariate \(X_i(t)\) is measured with error as \(W_i(t)\), and \(M_i(t)\) is a measured instrumental variable. The model also includes the following assumptions:
for \(\forall t,s\in[0,1]\) and \(i = 1,\dots,n\).
Their bias correction estimation method uses a two-stage strategy to correct for measurement error in a function-valued covariate and then fits a linear quantile regression model. In the first stage, an instrumental variable is used to estimate the covariance matrix associated with the measurement error. In the second stage, SIMEX is used to correct for measurement error in the function-valued covariate.
The MECfda package provides the function
ME.fcQR_IV.SIMEX
for applying its bias-correction
estimation method in scalar-on-function quantile regression. The
arguments are described as follows:
data.Y
(Response Variable):
The response variable can be provided as an atomic vector, a one-column
matrix, or a data frame. A one-column data frame with a column name is
recommended for clarity.
data.W
(Measurement Data):
This argument corresponds to \(W\), the
observed measurement of \(X\) in the
statistical model. It should be provided as a data frame or matrix where
each row represents a subject and each column corresponds to a
measurement (time) point.
data.Z
(Scalar Covariates):
The scalar covariate data can be omitted (or set to NULL
)
if no scalar covariates are included in the model. Alternatively, it can
be provided as an atomic vector (for a single scalar covariate) or as a
matrix/data frame (for multiple scalar covariates). A data frame with
column names is recommended.
data.M
(Instrumental
Variable):
This argument provides the data for \(M\), the instrumental variable. It should
be supplied as a data frame or matrix where each row represents a
subject and each column corresponds to a measurement (time)
point.
tau
(Quantile Level):
Specifies the quantile level \(\tau \in
(0,1)\) for the quantile regression model. The default is
0.5.
t_interval
(Domain of the Function-Valued
Covariate):
A two-element vector defining the interval over which the
function-valued covariate is defined (default is c(0,1)
,
representing the interval \([0,1]\)).
t_points
(Measurement Time
Points):
Specifies the sequence of time points at which measurements are taken.
The default is NULL
.
formula.Z
(Scalar Covariate
Formula):
This argument determines which components of data.Z
are
included in the model and how the scalar covariates are treated (i.e.,
as fixed or random effects).
basis.type
: Indicates the type of
basis function for the expansion. Options include
'Fourier'
, 'Bspline'
, and 'FPC'
,
corresponding to the Fourier basis, b-spline basis, and functional
principal component basis, respectively.
basis.order
: Specifies the number
of basis functions to use. For the Fourier basis (i.e., \(\frac{1}{2},\ \sin(kt),\ \cos(kt)\) for
\(k = 1, \dots, p_f\)),
basis.order
equals \(p_f\). For the b-spline basis \(\{B_{i,p}(x)\}_{i=-p}^{k}\),
basis.order
is set to \(k+p+1\). For the FPC basis,
basis.order
represents the number of functional principal
components.
bs_degree
: Specifies the degree of
the piecewise polynomials for the b-spline basis; this argument is only
necessary when using the b-spline basis.
The function ME.fcQR_IV.SIMEX
returns an object of class
ME.fcQR_IV.SIMEX
, which is a list containing the following
elements:
coef.X
: A Fourier_series
or bspline_series
object representing the function-valued
coefficient parameter of the function-valued covariate.coef.Z
: The estimated linear
coefficients of the scalar covariates.coef.all
: The original estimate of the
linear coefficients.function.basis.type
: The type of
function basis used.basis.order
: The number of basis
functions used (as specified in the input).t_interval
: A two-element vector
representing the interval, which defines the domain of the
function-valued covariate.t_points
: The sequence of measurement
(time) points.formula
: The regression model.formula.Z
: A formula object containing
only the scalar covariates.zlevels
: The levels of any categorical
(non-continuous) scalar covariates.This comprehensive structure ensures that users can flexibly input their data and clearly specify the modeling framework for bias-correction in scalar-on-function quantile regression.
ME.fcQR_CLS(
data.Y,
data.W,
data.Z,
tau = 0.5,
t_interval = c(0, 1),
t_points = NULL,
grid_k,
grid_h,
degree = 45,
observed_X = NULL
)
Zhang et. al. proposed a corrected loss score estimation method for scalar-on-function quantile linear regression to correct for bias due to measurement error [9]. Their statistical model is as follows: \[\begin{align*} &Q_{Y_i|X_i,Z_i}(\tau) = \int_{\Omega} \beta(\tau,t) X_{i}(t) dt + (1,Z_i^T)\gamma(\tau)\\ &W_i(t) = X_i(t) + U_i(t) \end{align*}\] where the response variable \(Y_i\) and the scalar-valued covariates \(Z_i\) are measured without error, the function-valued covariate \(X_i(t)\) is measured with error as \(W_i(t)\).
Zhang et al. proposed a new corrected loss function for a partially functional linear quantile model with functional measurement error in this manuscript. They established a corrected quantile objective function for an observed measurement, which is an unbiased estimator of the quantile objective function that would be obtained if the true measurements were available. The estimators of the regression parameters are obtained by optimizing the resulting corrected loss function. The resulting estimator of the regression parameters is shown to be consistent.
The MECfda package provides a function,
ME.fcQR_CLS
, for implementing their bias-correction
estimation method. Below is a description of its arguments and return
values:
data.Y
(Response Variable):
The response variable can be provided as an atomic vector, a one-column
matrix, or a data frame. The recommended format is a one-column data
frame with a column name.
data.W
(Measurement Data for \(W\)):
This argument represents \(W\), the
observed measurement of \(X\) in the
statistical model. It should be provided as a three-dimensional array
where each row corresponds to a subject, each column to a measurement
(time) point, and each layer to an observation.
data.Z
(Scalar Covariates):
Scalar covariates can be omitted (or set to NULL
) if no
scalar covariates are included in the model, provided as an atomic
vector for a single scalar covariate, or as a matrix or data frame for
multiple covariates. The recommended format is a data frame with column
names.
tau
(Quantile Level):
Specifies the quantile \(\tau \in
(0,1)\) to be used in the quantile regression model. The default
value is 0.5.
t_interval
(Domain of the Function-Valued
Covariate):
A two-element vector that specifies the domain (interval) of the
function-valued covariate. The default is c(0,1)
,
representing the interval \([0,1]\).
t_points
(Measurement Time
Points):
Specifies the sequence of measurement (time) points. The default is
NULL
.
grid_k
(Candidate Basis
Numbers):
An atomic vector where each element is a candidate number for the basis
functions.
grid_h
(Candidate Tuning Parameter
Values):
A non-zero atomic vector where each element is a candidate value for the
tuning parameter.
degree
(Degree for Derivative and Integral
Calculations):
This argument is used when computing derivatives and integrals, with a
default value of 45—which is sufficient for most scenarios.
observed_X
(Observed \(X\) Data for Variance
Estimation):
Used for estimating parametric variance; the default value is
NULL
.
The function returns an object of class ME.fcQR_CLS
(a
list) containing the following elements:
estimated_beta_hat
: Estimated
coefficients from the corrected loss function (including the functional
component).estimated_beta_t
: The estimated
functional curve.SE_est
: The estimated parametric
variance (returned only if observed_X
is not
NULL
).estimated_Xbasis
: The basis matrix
used in the estimation.res_naive
: Results from the naive
(uncorrected) method.Tekwe et. al. proposed a bias correction estimation method for scalar-on-function linear regression model with measurement error using an instrumental variable [6]. Their statistical model is as follows: \[\begin{align*} &Y_i = \int_0^1 \beta(t)X_i(t)dt + \varepsilon_i\\ &W_i(t) = X_i(t) + U_i(t)\\ &M_i(t) = \delta X_i(t) + \eta_i(t) \end{align*}\] where the response variable \(Y_i\) and are measured without error, the function-valued covariate \(X_i(t)\) is measured with error as \(W_i(t)\), and \(M_i(t)\) is an measured instrumental variable. They included the following additional assumptions:
for \(\forall t,s\in[0,1]\) and \(i = 1,\dots,n\).
The MECfda package provides the function
ME.fcLR_IV
for implementing their bias-correction
estimation method. The arguments are as follows:
data.Y
(Response Variable):
The response variable can be provided as an atomic vector, a one-column
matrix, or a data frame. The recommended format is a one-column data
frame with a column name.
data.W
(Measurement Data for \(W\)):
This argument represents \(W\), the
observed measurement of \(X\) in the
statistical model. It should be provided as a data frame or matrix,
where each row represents a subject and each column corresponds to a
measurement (time) point.
data.M
(Instrumental
Variable):
This argument provides the data for \(M\), the instrumental variable. It should
be provided as a data frame or matrix, with each row representing a
subject and each column representing a measurement (time)
point.
t_interval
(Domain of the Function-Valued
Covariate):
A two-element vector that specifies the domain (interval) of the
function-valued covariate. The default is c(0,1)
,
representing the interval \([0,1]\).
t_points
(Measurement Time
Points):
Specifies the sequence of measurement (time) points. The default is
NULL
.
CI.bootstrap
(Bootstrap Confidence
Interval):
A logical flag indicating whether to return a confidence interval using
the bootstrap method. The default is FALSE
.
The function returns an object of class ME.fcLR_IV
,
which is a list containing the following elements:
beta_tW
: Parameter estimates.CI
: Confidence interval, which is
returned only if CI.bootstrap
is TRUE
.The package MECfda includes some functions to generate simulated data to test the functions in the package.