--- title: "Information Consistency-Based Measures for Spatial Stratified Heterogeneity" author: "Wenbo Lv" date: "2024-12-01" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{sshicm} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} ---   ## 1. Introduction to `sshicm` package ### 1.1 The `sshicm` package can be used to address following issues: - Information consistency-based measures of spatial stratified heterogeneity intensity for continuous and nominal variables. - Strength of spatial pattern associations based on information consistency measures. ### 1.2 Example data in the `sshicm` package #### baltim data "baltim" consists of [Baltimore home sale prices and hedonics][5]. In total, there are 221 instances in "baltim" data. The explanatory variables are whether it is a detached unit (DWELL), whether it has a patio (PATIO), whether it has a fireplace (FIREPL), whether it has air conditioning (AC), and whether the dwelling is in Baltimore County (CITCOU, while the target variable is the sale price of the home (PRICE). #### cinc data "cinc" is derived from [the 2008 Cincinnati Crime + Socio-Demographics dataset][6]. It includes spatial data on 457 objects located on an irregular lattice. The explanatory variables are male population (MALE), female population (FEMALE), median age (MEDIAN_AGE), average family size (AVG_FAMSIZ), and population density (DENSITY), while the target variable is the existence of theft (THEFT_D). ![**Figure 1**. Maps of the baltim and cinc data sets. ([Bai et al. 2023][2])](../man/figures/sshicm/sshicm_example_data.jpg){width=500px} ### 1.3 Functions in the `sshicm` package #### Two functions for vector-type inputs of dependent and independent variables. - `sshic()` for continuous dependent variable - `sshin()` for continuous nominal variable #### Regression-style data frame modeling function A function `sshicm()` that yields all results in a single line, with the `type` parameter set to `IC` (Continuous) or `IN` (Nominal) to specify whether the dependent variable is a continuous or nominal variable. ## 2. The principle of measuring spatial stratified heterogeneity based on information consistency **Note: All explanatory variables must be discretized in advance or inherently be discrete nominal variables.** ### 2.1 When the dependent variable is a continuous variable: $$ I_{C}\left(d,s\right) = \sum_{s_{i} \in S}p\left(s_{i}\right)\frac{ \arctan \left(\textbf{RelE} \left( f_{d_{i}} \mid \mid f \right) \right)}{\pi / 2} $$ where $d_i$ is the random variable corresponding to the target variable in stratum $s_i$ , and $f_{d_i}$ and $f$ are the density functions of $d_i$ and $d$, respectively. Additionally, $\textbf{RelE} \left( f_{d_{i}} \mid \mid f \right)$ is the relative entropy of $f_{d_i}$ and $f$. $$ \textbf{RelE} \left( f_{d_{i}} \mid \mid f \right) = H \left(f_{d_{i}} , f\right) - H \left(f_{d_{i}}\right) = \sum_{i = 1}^{n} f_{d_{i}} \log \frac{1}{f} - \sum_{i = 1}^{n} f_{d_{i}} \log \frac{1}{f_{d_{i}}} = \sum_{i = 1}^{n} f_{d_{i}} \log \frac{f_{d_{i}}}{f} $$ ### 2.2 When the dependent variable is a nominal variable: $$ I_{N}\left(d,s\right) = \frac{I \left(d,s\right)}{I \left(d\right)} = \frac{I \left(d\right) - I \left(d \mid s\right)}{I \left(d\right)} = 1 - \frac{\sum_{s_i \in S}\sum_{x \in V_d} p\left(s_i,x\right) \log p\left(x \mid s_i\right)}{\sum_{x \in V_d} p\left(x\right) \log p\left(x\right)} $$ where $p\left(x\right)$ is the probability of observing $x$ in $U$, $p\left(s_i,x\right)$ is the probability of observing $s_i$ and $x$ in $U$, and $p\left(x \mid s_i\right)$ is the probability of observing $x$ given that the stratum is $s_i$. ## 3. Examples of the `sshicm` package ```r install.packages("sshicm", dep = TRUE) ``` ``` r library(sshicm) ``` ``` r baltim = sf::read_sf(system.file("extdata/baltim.gpkg",package = "sshicm")) sshicm(PRICE ~ .,baltim,type = "IC") ## # A tibble: 5 × 3 ## Variable Ic Pv ## ## 1 DWELL 0.648 0.00801 ## 2 AC 0.223 0.0591 ## 3 PATIO 0.168 0.556 ## 4 FIREPL 0.135 0.667 ## 5 CITCOU 0.0898 0.988 ``` ``` r cinc = sf::read_sf(system.file("extdata/cinc.gpkg",package = "sshicm")) sshicm(THEFT_D ~ .,cinc,type = "IN") ## # A tibble: 5 × 3 ## Variable In Pv ## ## 1 DENSITY 0.776 0.0681 ## 2 MEDIAN_AGE 0.228 0.0230 ## 3 MALE 0.0367 0 ## 4 AVG_FAMSIZ 0.0205 0.00300 ## 5 FEMALE 0.00584 0.0200 ``` ``` r ntds = gdverse::NTDs sshicm(incidence ~ watershed + elevation + soiltype,data = ntds) ## # A tibble: 3 × 3 ## Variable Ic Pv ## ## 1 elevation 0.293 0.0250 ## 2 watershed 0.177 0.0521 ## 3 soiltype 0.117 0.0671 ``` ## Reference Wang, J., Haining, R., Zhang, T., Xu, C., Hu, M., Yin, Q., … Chen, H. (2024). Statistical Modeling of Spatially Stratified Heterogeneous Data. Annals of the American Association of Geographers, 114(3), 499–519. [https://doi.org/10.1080/24694452.2023.2289982][1]. Bai, H., Wang, H., Li, D., & Ge, Y. (2023). Information Consistency-Based Measures for Spatial Stratified Heterogeneity. Annals of the American Association of Geographers, 113(10), 2512–2524. [https://doi.org/10.1080/24694452.2023.2223700][2]. Wang, J., Li, X., Christakos, G., Liao, Y., Zhang, T., Gu, X., & Zheng, X. (2010). Geographical Detectors‐Based Health Risk Assessment and its Application in the Neural Tube Defects Study of the Heshun Region, China. International Journal of Geographical Information Science, 24(1), 107–127. [https://doi.org/10.1080/13658810802443457][3]. Wang, J. F., Zhang, T. L., & Fu, B. J. A measure of spatial stratified heterogeneity. Ecological indicators, 2016. 67, 250-256. [https://doi.org/10.1016/j.ecolind.2016.02.052][4].   [1]: https://doi.org/10.1080/24694452.2023.2289982 [2]: https://doi.org/10.1080/24694452.2023.2223700 [3]: https://doi.org/10.1080/13658810802443457 [4]: https://doi.org/10.1016/j.ecolind.2016.02.052 [5]: https://geodacenter.github.io/data-and-lab/baltim/ [6]: https://geodacenter.github.io/data-and-lab/walnut_hills/