--- title: "Beta-Select Demonstration: SEM by 'lavaan'" date: "2024-11-08" output: rmarkdown::html_vignette: number_sections: true vignette: > %\VignetteIndexEntry{Beta-Select Demonstration: SEM by 'lavaan'} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: "references.bib" csl: apa.csl --- # Introduction This article demonstrates how to use `lav_betaselect()` from the package [`betaselectr`](https://sfcheung.github.io/betaselectr/) to standardize selected variables in a model fitted by `lavaan` and forming confidence intervals for the parameters. # Data and Model The sample dataset from the package `betaselectr` will be used in this demonstration: ``` r library(betaselectr) head(data_test_medmod) #> dv iv mod med cov1 cov2 #> 1 7.487873 11.42573 16.65805 42.28988 54.14051 15.56069 #> 2 8.474931 16.64790 22.66332 42.08692 39.21125 17.61286 #> 3 11.206539 14.81278 22.80955 32.76869 31.97963 20.77333 #> 4 10.148827 15.79632 22.94451 43.96807 42.72187 15.66971 #> 5 7.421606 14.29621 24.51562 37.10942 42.74174 21.97132 #> 6 6.846435 12.00819 25.22163 35.46051 30.85914 22.35444 ``` This is the path model, fitted by `lavaan::sem()`: ``` r library(lavaan) #> This is lavaan 0.6-19 #> lavaan is FREE software! Please report any bugs. mod <- " med ~ iv + mod + iv:mod + cov1 + cov2 dv ~ med + iv + cov1 + cov2 " fit <- sem(mod, data_test_medmod) ``` The model has a moderator, `mod`, posited to moderate the effect from `iv` to `med`. The product term is `iv:mod`. These are the results: ``` r summary(fit) #> lavaan 0.6-19 ended normally after 2 iterations #> #> Estimator ML #> Optimization method NLMINB #> Number of model parameters 11 #> #> Number of observations 200 #> #> Model Test User Model: #> #> Test statistic 2.303 #> Degrees of freedom 2 #> P-value (Chi-square) 0.316 #> #> Parameter Estimates: #> #> Standard errors Standard #> Information Expected #> Information saturated (h1) model Structured #> #> Regressions: #> Estimate Std.Err z-value P(>|z|) #> med ~ #> iv -6.373 0.985 -6.473 0.000 #> mod -3.899 0.614 -6.346 0.000 #> iv:mod 0.286 0.039 7.340 0.000 #> cov1 -0.093 0.070 -1.327 0.185 #> cov2 0.242 0.133 1.823 0.068 #> dv ~ #> med 0.092 0.011 8.098 0.000 #> iv 0.227 0.038 5.896 0.000 #> cov1 -0.006 0.013 -0.454 0.650 #> cov2 0.030 0.025 1.230 0.219 #> #> Variances: #> Estimate Std.Err z-value P(>|z|) #> .med 60.292 6.029 10.000 0.000 #> .dv 2.087 0.209 10.000 0.000 ``` # Problems With Standardization We can request the standardized solution using `lavaan::standardizedSolution()`: ``` r standardizedSolution(fit, output = "text") #> #> Regressions: #> est.std Std.Err z-value P(>|z|) ci.lower ci.upper #> med ~ #> iv -1.855 0.259 -7.158 0.000 -2.363 -1.347 #> mod -1.956 0.280 -6.988 0.000 -2.504 -1.407 #> iv:mod 3.588 0.428 8.390 0.000 2.750 4.426 #> cov1 -0.077 0.058 -1.332 0.183 -0.189 0.036 #> cov2 0.105 0.057 1.836 0.066 -0.007 0.218 #> dv ~ #> med 0.459 0.052 8.845 0.000 0.357 0.560 #> iv 0.331 0.053 6.279 0.000 0.228 0.434 #> cov1 -0.024 0.054 -0.454 0.650 -0.130 0.081 #> cov2 0.066 0.054 1.233 0.218 -0.039 0.171 #> #> Variances: #> est.std Std.Err z-value P(>|z|) ci.lower ci.upper #> .med 0.656 0.050 13.243 0.000 0.559 0.753 #> .dv 0.569 0.050 11.353 0.000 0.471 0.667 ``` However, for this model, there are several problems: - The product term, `iv:mod`, is also standardized. This is inappropriate. One simple but underused solution is to standardize the variables *before* forming the product term [@friedrich_defense_1982]. - The confidence intervals are formed using the delta-method, which has been found to be inferior to methods such as nonparametric percentile bootstrap confidence interval for the standardized solution [@falk_are_2018]. Although there are situations in which the delta-method confidence and the nonparametric percentile bootstrap confidences can be similar (e.g., sample size is large and the sample estimates are not extreme), it is still safe to at least try both methods and compare the results. - There are cases in which some variables are measured by meaningful units and do not need to be standardized. for example, if `cov1` is age measured by year, then age is more meaningful than "standardized age". - In path analysis, categorical variables are usually represented by dummy variables, each of them having only two possible values (0 or 1). It is not meaningful to standardize the dummy variables. # Beta-Select `lav_betaselect()` The function `lav_betaselect()` can be used to solve this problem by: - standardizing variables before product terms are formed, - standardizing only variables for which standardization can facilitate interpretation, and - forming confidence intervals that take into account selected standardization. We call the coefficients computed by this kind of standardization *beta*s-select ($\beta{s}_{Select}$, $\beta_{Select}$ in singular form), to differentiate them from coefficients computed by standardizing all variables, including product terms. ## Estimates Only Suppose we only need to solve the first problem, with the product term computed after `iv` and `mod` are standardized: ``` r fit_beta <- lav_betaselect(fit) ``` ``` r fit_beta ``` This is the output if printed using the default options: ``` #> #> Selected Standardization: #> #> Standard Error: Nil #> #> Parameter Estimates Settings: #> #> Standard errors: Standard #> Information: Expected #> Information saturated (h1) model: Structured #> #> Regressions: #> BetaSelect #> med ~ #> iv -1.855 #> mod -1.956 #> iv:mod 0.400 #> cov1 -0.077 #> cov2 0.105 #> dv ~ #> med 0.459 #> iv 0.331 #> cov1 -0.024 #> cov2 0.066 #> #> Footnote: #> - Variable(s) standardized: cov1, cov2, dv, iv, med, mod #> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both #> original estimates and betas-select. #> - Product terms (iv:mod) have variables standardized before computing #> them. The product term(s) is/are not standardized. ``` Compared to the solution with the product term standardized, the coefficient of `iv:mod` changed substantially from 3.588 to 0.286. As shown by @cheung_improving_2022, the coefficient of *standardized* product term (`iv:mod`) can be substantially different from the properly standardized product term (the product of standardized `iv` and standardized `mod`). The footnote will also indicate variables that are standardized, and remarked that product terms are formed *after* standardization. ## Estimates and Bootstrap Confidence Interval Suppose we want to address both the first and the second problems, with - the product term computed after `iv` and `mod` standardized, and - bootstrap confidence intervals used, that take into account the sampling variation of the standardizers (the standard deviations). We can call `lav_betaselect()` again, with additional arguments set: ``` r fit_beta <- lav_betaselect(fit, std_se = "bootstrap", bootstrap = 5000, iseed = 2345, parallel = "snow", ncpus = 20) #> Finding product terms in the model ... #> Finished finding product terms. #> #> Compute bootstrapping standardized solution: ``` These are the additional arguments: - `std_se`: The method to compute the standard errors as well as confidence intervals. Set to `"bootstrap"` for nonparametric bootstrapping. - `iseed`: The seed for the random number generator used for bootstrapping. Set this to an integer to make the results reproducible. - `parallel`: The method to be used for parallel processing. It will be passed to `lavaan::bootstrapLavaan()`. Supported values are `"none"`, `"snow"`, and `"multicore"`. - `ncpus`: The number of CPU cores to use if `parallel` processing is not `"none"`. Default is `parallel::detectCores(logical = FALSE) - 1`, or the number of *physical* cores minus one. This is the output if printed with default options: ``` r fit_beta ``` ``` #> #> Selected Standardization: #> #> Standard Error: Nonparametric bootstrap #> Bootstrap samples: 5000 #> Confidence Interval: Percentile #> Level of Confidence: 95.0% #> #> Parameter Estimates Settings: #> #> Standard errors: Standard #> Information: Expected #> Information saturated (h1) model: Structured #> #> Regressions: #> BetaSelect SE Z p-value Sig CI.Lo CI.Hi CI.Sig #> med ~ #> iv -1.855 0.248 -7.490 0.000 *** -2.307 -1.332 Sig. #> mod -1.956 0.281 -6.950 0.000 *** -2.453 -1.348 Sig. #> iv:mod 0.400 0.047 8.565 0.000 *** 0.298 0.481 Sig. #> cov1 -0.077 0.057 -1.353 0.185 -0.186 0.038 n.s. #> cov2 0.105 0.061 1.725 0.094 . -0.019 0.219 n.s. #> dv ~ #> med 0.459 0.052 8.828 0.000 *** 0.348 0.553 Sig. #> iv 0.331 0.051 6.480 0.000 *** 0.229 0.431 Sig. #> cov1 -0.024 0.058 -0.418 0.686 -0.137 0.090 n.s. #> cov2 0.066 0.058 1.139 0.259 -0.050 0.178 n.s. #> #> Footnote: #> - Variable(s) standardized: cov1, cov2, dv, iv, med, mod #> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> - Standard errors, p-values, and confidence intervals are not computed #> for betas-select which are fixed in the standardized solution. #> - P-values for betas-select are asymmetric bootstrap p-value computed #> by the method of Asparouhov and Muthén (2021). #> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both #> original estimates and betas-select. #> - Product terms (iv:mod) have variables standardized before computing #> them. The product term(s) is/are not standardized. ``` In this dataset, with 200 cases, the delta-method confidence intervals are close to the bootstrap confidence intervals, except obviously for the product term because the coefficient of the product term has substantially different values in the two solutions. ## Estimates and Bootstrap Confidence Intervals, With Only Selected Variables Standardized Suppose we want to address also the the third issue, and standardize only some of the variables. This can be done using either `to_standardize` or `not_to_standardize`. - Use `to_standardize` when the number of variables to standardize is much fewer than the number of variables not to standardize. - Use `not_to_standardize` when the number variables to standardize is much more than the the number of variables not to standardize. For example, suppose we only need to standardize `dv` and `iv`, `cov1`, and `cov2`, this is the call to do this, setting `to_standardize` to `c("iv", "dv", "cov1", "cov2")`: ``` r fit_beta_select_1 <- lav_betaselect(fit, std_se = "bootstrap", to_standardize = c("iv", "dv", "cov1", "cov2"), bootstrap = 5000, iseed = 2345, parallel = "snow", ncpus = 20) ``` If we want to standardize all variables except for `dv` and `mod`, we can use this call, and set `not_to_standardize` to `c("mod", "dv")`: ``` r fit_beta_select_2 <- lav_betaselect(fit, std_se = "bootstrap", not_to_standardize = c("mod", "dv"), bootstrap = 5000, iseed = 2345, parallel = "snow", ncpus = 20) ``` The results of these calls are identical, and only those of the second version are printed: ``` r fit_beta_select_2 ``` ``` #> Selected Standardization: #> #> Standard Error: Nonparametric bootstrap #> Bootstrap samples: 5000 #> Confidence Interval: Percentile #> Level of Confidence: 95.0% #> #> Parameter Estimates Settings: #> #> Standard errors: Standard #> Information: Expected #> Information saturated (h1) model: Structured #> #> Regressions: #> BetaSelect SE Z p-value Sig CI.Lo CI.Hi CI.Sig #> med ~ #> iv -1.855 0.248 -7.490 0.000 *** -2.307 -1.332 Sig. #> mod -0.407 0.059 -6.950 0.000 *** -0.510 -0.280 Sig. #> iv:mod 0.083 0.010 8.565 0.000 *** 0.062 0.100 Sig. #> cov1 -0.077 0.057 -1.353 0.185 -0.186 0.038 n.s. #> cov2 0.105 0.061 1.725 0.094 . -0.019 0.219 n.s. #> dv ~ #> med 0.878 0.116 7.567 0.000 *** 0.635 1.092 Sig. #> iv 0.634 0.100 6.337 0.000 *** 0.430 0.826 Sig. #> cov1 -0.047 0.112 -0.418 0.686 -0.265 0.168 n.s. #> cov2 0.126 0.111 1.137 0.259 -0.093 0.345 n.s. #> #> Footnote: #> - Variable(s) standardized: cov1, cov2, iv, med #> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> - Standard errors, p-values, and confidence intervals are not computed #> for betas-select which are fixed in the standardized solution. #> - P-values for betas-select are asymmetric bootstrap p-value computed #> by the method of Asparouhov and Muthén (2021). #> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both #> original estimates and betas-select. #> - Product terms (iv:mod) have variables standardized before computing #> them. The product term(s) is/are not standardized. ``` The footnotes show that, by specifying that `dv` and `mod` are not standardized, all the other four variables are standardized: `iv`, `med`, `cov1`, and `cov2`. Therefore, in this case, it is more convenient to use `not_to_standardize`. When reporting *beta*s-*select*, researchers need to state which variables are standardized and which are not. This can be done in table notes, or in a column of the parameter estimate tables. The output can of `lav_betaselect()` can be printed with `show_Bs.by` set to `TRUE` to demonstrate the second approach: ``` r print(fit_beta_select_2, show_Bs.by = TRUE) ``` ``` #> Regressions: #> BetaSelect SE Z p-value Sig CI.Lo CI.Hi CI.Sig Selected #> med ~ #> iv -1.855 0.248 -7.490 0.000 *** -2.307 -1.332 Sig. iv,med #> mod -0.407 0.059 -6.950 0.000 *** -0.510 -0.280 Sig. med #> iv:mod 0.083 0.010 8.565 0.000 *** 0.062 0.100 Sig. iv,med #> cov1 -0.077 0.057 -1.353 0.185 -0.186 0.038 n.s. cov1,med #> cov2 0.105 0.061 1.725 0.094 . -0.019 0.219 n.s. cov2,med #> dv ~ #> med 0.878 0.116 7.567 0.000 *** 0.635 1.092 Sig. med #> iv 0.634 0.100 6.337 0.000 *** 0.430 0.826 Sig. iv #> cov1 -0.047 0.112 -0.418 0.686 -0.265 0.168 n.s. cov1 #> cov2 0.126 0.111 1.137 0.259 -0.093 0.345 n.s. cov2 #> #> Footnote: #> - Variable(s) standardized: cov1, cov2, iv, med #> - Sig codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> - Standard errors, p-values, and confidence intervals are not computed #> for betas-select which are fixed in the standardized solution. #> - P-values for betas-select are asymmetric bootstrap p-value computed #> by the method of Asparouhov and Muthén (2021). #> - Call 'print()' and set 'standardized_only' to 'FALSE' to print both #> original estimates and betas-select. #> - The column 'Selected' lists variable(s) standardized when computing #> the standardized coefficient of a parameter. ('NA' for user-defined #> parameters because they are computed from other standardized #> parameters.) #> - Product terms (iv:mod) have variables standardized before computing #> them. The product term(s) is/are not standardized. ``` ## Categorical Variables When calling `lav_betaselect()`, variables with only two values in the dataset are assumed to be categorical and will not be standardized by default. This can be overriden by setting `skip_categorical_x` to `FALSE`, though not recommended. # Conclusion In structural equation modeling, there are situations in which standardizing all variables is not appropriate, or when standardization needs to be done before forming product terms. We are not aware of tools that can do appropriate standardization *and* form confidence intervals that takes into account the selective standardization. By promoting the use of *beta*s-*select* using `lav_betaselect()`, we hope to make it easier for researchers to do appropriate Standardization in when reporting SEM results. # References