Extensible R framework for subgroup discovery (Atzmueller (2015)), contrast patterns (Chen (2022)), emerging patterns (Dong (1999)) and association rules (Agrawal (1994)). Both crisp (binary) and fuzzy data are supported. The framework generates conditions in the form of elementary conjunctions, evaluates them on a dataset and checks the induced sub-data for interesting statistical properties. Currently, the package searches for implicative association rules and conditional correlations (Hájek (1978)). A user-defined function may also be supplied, which is then evaluated on each generated condition to search for custom patterns.
To install the stable version of nuggets from CRAN, type the following command within the R session:
install.packages("nuggets")
You can also install the development version of nuggets from GitHub with:
install.packages("devtools")
devtools::install_github("beerda/nuggets")
We start by loading the needed packages:
library(tidyverse)
library(nuggets)
We are going to use the CO2 dataset as an example:
head(CO2)
#> Plant Type Treatment conc uptake
#> 1 Qn1 Quebec nonchilled 95 16.0
#> 2 Qn1 Quebec nonchilled 175 30.4
#> 3 Qn1 Quebec nonchilled 250 34.8
#> 4 Qn1 Quebec nonchilled 350 37.2
#> 5 Qn1 Quebec nonchilled 500 35.3
#> 6 Qn1 Quebec nonchilled 675 39.2
First, the numeric columns need to be transformed to factors:
d <- mutate(CO2,
            conc = cut(conc, c(-Inf, 175, 350, 675, Inf)),
            uptake = cut(uptake, c(-Inf, 17.9, 28.3, 37.12)))
head(d)
#> Plant Type Treatment conc uptake
#> 1 Qn1 Quebec nonchilled (-Inf,175] (-Inf,17.9]
#> 2 Qn1 Quebec nonchilled (-Inf,175] (28.3,37.1]
#> 3 Qn1 Quebec nonchilled (175,350] (28.3,37.1]
#> 4 Qn1 Quebec nonchilled (175,350] <NA>
#> 5 Qn1 Quebec nonchilled (350,675] (28.3,37.1]
#> 6 Qn1 Quebec nonchilled (350,675] <NA>
Then every column can be dichotomized, i.e., dummy logical columns may be created for each factor level:
d <- dichotomize(d)
head(d)
#> # A tibble: 6 × 23
#> `Plant=Qn1` `Plant=Qn2` `Plant=Qn3` `Plant=Qc1` `Plant=Qc3` `Plant=Qc2`
#> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 TRUE FALSE FALSE FALSE FALSE FALSE
#> 2 TRUE FALSE FALSE FALSE FALSE FALSE
#> 3 TRUE FALSE FALSE FALSE FALSE FALSE
#> 4 TRUE FALSE FALSE FALSE FALSE FALSE
#> 5 TRUE FALSE FALSE FALSE FALSE FALSE
#> 6 TRUE FALSE FALSE FALSE FALSE FALSE
#> # ℹ 17 more variables: `Plant=Mn3` <lgl>, `Plant=Mn2` <lgl>, `Plant=Mn1` <lgl>,
#> # `Plant=Mc2` <lgl>, `Plant=Mc3` <lgl>, `Plant=Mc1` <lgl>,
#> # `Type=Quebec` <lgl>, `Type=Mississippi` <lgl>,
#> # `Treatment=nonchilled` <lgl>, `Treatment=chilled` <lgl>,
#> # `conc=(-Inf,175]` <lgl>, `conc=(175,350]` <lgl>, `conc=(350,675]` <lgl>,
#> # `conc=(675, Inf]` <lgl>, `uptake=(-Inf,17.9]` <lgl>,
#> # `uptake=(17.9,28.3]` <lgl>, `uptake=(28.3,37.1]` <lgl>
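To illustrate what dichotomization means for a single column, here is a minimal base R sketch that builds one logical dummy column per factor level of Treatment. It is only an illustration of the idea and makes no claim about how dichotomize() is implemented; the variable names `f` and `dummies` are hypothetical.

```r
# Base R only; CO2 ships with the datasets package.
f <- CO2$Treatment

# One logical column per factor level: TRUE where the row has that level.
dummies <- sapply(levels(f), function(lev) f == lev)

# Name the columns in the "column=level" style used above.
colnames(dummies) <- paste0("Treatment=", levels(f))

head(dummies)
```

Every row has exactly one TRUE among the dummy columns of a given factor, which is what makes the disjoint vector described next useful.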
Before searching for rules, it is a good idea to create the vector of disjoints. Columns with equal values in the disjoint vector will not be combined together. This speeds up the search, as it makes no sense, e.g., to combine Plant=Qn1 and Plant=Qn2 in a single condition.
disj <- sub("=.*", "", colnames(d))
print(disj)
#> [1] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [7] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [13] "Type" "Type" "Treatment" "Treatment" "conc" "conc"
#> [19] "conc" "conc" "uptake" "uptake" "uptake"
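To get a feeling for how much the disjoint vector prunes, the following base R sketch counts the pairs of columns that share a group and therefore never need to be combined. The vector is rebuilt here with rep() so that the snippet runs on its own; the group sizes are read off the printed output above, and the names `pairs` and `same_group` are illustrative only.

```r
# Rebuild the disjoint vector from the group sizes printed above.
disj <- rep(c("Plant", "Type", "Treatment", "conc", "uptake"),
            times = c(12, 2, 2, 4, 3))

# All unordered pairs of column indices.
pairs <- combn(length(disj), 2)

# A pair is skipped when both columns belong to the same group.
same_group <- disj[pairs[1, ]] == disj[pairs[2, ]]

sum(same_group)  # pairs excluded by the disjoint vector: 77
ncol(pairs)      # all pairs of the 23 columns: 253
```

Already at length 2, roughly 30 % of the candidate conjunctions are ruled out, and the saving grows with longer conditions.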
Once the data are prepared, the dig_implications function may be invoked. It takes the dataset as its first parameter and a pair of “tidyselect” expressions to select the column names to appear in the left- and right-hand side of the rule (antecedent and consequent).
result <- dig_implications(d,
                           antecedent = !starts_with("Treatment"),
                           consequent = starts_with("Treatment"),
                           disjoint = disj,
                           min_support = 0.02,
                           min_confidence = 0.8)
result <- arrange(result, desc(support))
print(result)
#> # A tibble: 225 × 7
#> antecedent consequent support confidence coverage lift count
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 {Type=Mississippi,uptake=… {Treatmen… 0.155 0.813 0.190 1.63 16
#> 2 {Type=Mississippi,uptake=… {Treatmen… 0.119 1 0.119 2 10
#> 3 {Plant=Mc3} {Treatmen… 0.0833 1 0.0833 2 7
#> 4 {Plant=Mc1} {Treatmen… 0.0833 1 0.0833 2 7
#> 5 {Plant=Qn1} {Treatmen… 0.0833 1 0.0833 2 7
#> 6 {Plant=Mc2} {Treatmen… 0.0833 1 0.0833 2 7
#> 7 {Plant=Mn1} {Treatmen… 0.0833 1 0.0833 2 7
#> 8 {Plant=Mn2} {Treatmen… 0.0833 1 0.0833 2 7
#> 9 {Plant=Mn3} {Treatmen… 0.0833 1 0.0833 2 7
#> 10 {Plant=Qc2} {Treatmen… 0.0833 1 0.0833 2 7
#> # ℹ 215 more rows
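To make the columns of this result concrete, the following base R sketch recomputes them by hand for one rule, {Plant=Mc3} with consequent {Treatment=chilled}. It assumes the usual definitions: support is the relative frequency of rows satisfying both sides, coverage is the relative frequency of the antecedent, confidence is their ratio, and lift is confidence divided by the relative frequency of the consequent. The variable names are illustrative and not part of nuggets.

```r
# Base R only; CO2 ships with the datasets package.
ante <- CO2$Plant == "Mc3"          # rows satisfying the antecedent
cons <- CO2$Treatment == "chilled"  # rows satisfying the consequent
n <- nrow(CO2)

support    <- sum(ante & cons) / n          # 7 / 84, about 0.0833
coverage   <- sum(ante) / n                 # 7 / 84, about 0.0833
confidence <- support / coverage            # 1
lift       <- confidence / (sum(cons) / n)  # 1 / 0.5 = 2

c(support = support, confidence = confidence,
  coverage = coverage, lift = lift)
```

These values agree with the {Plant=Mc3} row printed above.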
The nuggets package allows a user-defined callback function to be executed on each generated frequent condition. That way, custom types of patterns may be searched for. The following example replicates the search for implicative rules using a custom callback function. For that, the dataset has to be dichotomized and the disjoint vector created as in the previous example:
head(d)
#> # A tibble: 6 × 23
#> `Plant=Qn1` `Plant=Qn2` `Plant=Qn3` `Plant=Qc1` `Plant=Qc3` `Plant=Qc2`
#> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl>
#> 1 TRUE FALSE FALSE FALSE FALSE FALSE
#> 2 TRUE FALSE FALSE FALSE FALSE FALSE
#> 3 TRUE FALSE FALSE FALSE FALSE FALSE
#> 4 TRUE FALSE FALSE FALSE FALSE FALSE
#> 5 TRUE FALSE FALSE FALSE FALSE FALSE
#> 6 TRUE FALSE FALSE FALSE FALSE FALSE
#> # ℹ 17 more variables: `Plant=Mn3` <lgl>, `Plant=Mn2` <lgl>, `Plant=Mn1` <lgl>,
#> # `Plant=Mc2` <lgl>, `Plant=Mc3` <lgl>, `Plant=Mc1` <lgl>,
#> # `Type=Quebec` <lgl>, `Type=Mississippi` <lgl>,
#> # `Treatment=nonchilled` <lgl>, `Treatment=chilled` <lgl>,
#> # `conc=(-Inf,175]` <lgl>, `conc=(175,350]` <lgl>, `conc=(350,675]` <lgl>,
#> # `conc=(675, Inf]` <lgl>, `uptake=(-Inf,17.9]` <lgl>,
#> # `uptake=(17.9,28.3]` <lgl>, `uptake=(28.3,37.1]` <lgl>
print(disj)
#> [1] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [7] "Plant" "Plant" "Plant" "Plant" "Plant" "Plant"
#> [13] "Type" "Type" "Treatment" "Treatment" "conc" "conc"
#> [19] "conc" "conc" "uptake" "uptake" "uptake"
As we want to search for implicative rules with some minimum support and confidence, we define variables to hold those thresholds. We also need to define a callback function that will be called for each frequent condition found. Its purpose is to generate the rules with the obtained condition as the antecedent:
min_support <- 0.02
min_confidence <- 0.8

f <- function(condition, support, foci_supports) {
    conf <- foci_supports / support
    sel <- !is.na(conf) & conf >= min_confidence & !is.na(foci_supports) & foci_supports >= min_support
    conf <- conf[sel]
    supp <- foci_supports[sel]

    lapply(seq_along(conf), function(i) {
        list(antecedent = format_condition(names(condition)),
             consequent = format_condition(names(conf)[[i]]),
             support = supp[[i]],
             confidence = conf[[i]])
    })
}
The callback function f() defines three arguments: condition, support and foci_supports. The names of the arguments are not arbitrary: based on the argument names of the callback function, the search algorithm decides which information to pass to it. Here condition is a vector of indices representing the conjunction of predicates in a condition. By a predicate we mean a column in the source dataset. The support argument receives the relative frequency of the condition in the dataset. foci_supports is a vector of supports of special predicates, which we call “foci” (plural of “focus”), within the rows satisfying the condition. For implicative rules, foci are potential rule consequents.
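As a rough illustration of these quantities, the sketch below computes by hand, in base R, what support and foci_supports would look like for the single condition Type=Mississippi combined with uptake in (-Inf,17.9]. It assumes, as the callback above implies, that each focus support is the relative frequency of rows satisfying both the condition and the focus. The names `cond_rows` and `foci` are hypothetical and not part of nuggets.

```r
# Base R only; CO2 ships with the datasets package.
# Rows satisfying the condition Type=Mississippi & uptake in (-Inf, 17.9].
cond_rows <- CO2$Type == "Mississippi" & CO2$uptake <= 17.9

# Logical matrix of foci (the potential consequents).
foci <- cbind("Treatment=nonchilled" = CO2$Treatment == "nonchilled",
              "Treatment=chilled"    = CO2$Treatment == "chilled")

# Relative frequency of the condition in the whole dataset.
support <- mean(cond_rows)

# Relative frequency of rows satisfying the condition and each focus.
foci_supports <- colMeans(foci & cond_rows)

# Confidence of each candidate rule, as computed inside f().
conf <- foci_supports / support
round(rbind(foci_supports, conf), 3)
```

For the focus Treatment=chilled this gives a support of about 0.155 and a confidence of about 0.813, the same values reported for the top-ranked rule in the dig_implications output earlier.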
Now we can run the digging for rules:
result <- dig(d,
              f = f,
              condition = !starts_with("Treatment"),
              focus = starts_with("Treatment"),
              disjoint = disj,
              min_length = 1,
              min_support = min_support)
As we return a list of lists in the callback function, we have to flatten the first level of lists in the result and bind it into a data frame:
result <- result %>%
  unlist(recursive = FALSE) %>%
  map(as_tibble) %>%
  do.call(rbind, .) %>%
  arrange(desc(support))
print(result)
#> # A tibble: 225 × 4
#> antecedent consequent support confidence
#> <chr> <chr> <dbl> <dbl>
#> 1 {Type=Mississippi,uptake=(-Inf,17.9]} {Treatment=chilled} 0.155 0.813
#> 2 {Type=Mississippi,uptake=(28.3,37.1]} {Treatment=nonchill… 0.119 1
#> 3 {Plant=Mc3} {Treatment=chilled} 0.0833 1
#> 4 {Plant=Mc1} {Treatment=chilled} 0.0833 1
#> 5 {Plant=Qn1} {Treatment=nonchill… 0.0833 1
#> 6 {Plant=Mc2} {Treatment=chilled} 0.0833 1
#> 7 {Plant=Mn1} {Treatment=nonchill… 0.0833 1
#> 8 {Plant=Mn2} {Treatment=nonchill… 0.0833 1
#> 9 {Plant=Mn3} {Treatment=nonchill… 0.0833 1
#> 10 {Plant=Qc2} {Treatment=chilled} 0.0833 1
#> # ℹ 215 more rows
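The flattening step can be tried in isolation on a small hand-made list. The sketch below uses base R only and a made-up nested list that mimics the shape returned by the callback (one list of rules per condition, possibly empty); the rule values are invented purely for illustration.

```r
# A toy stand-in for the raw result: one element per condition,
# each holding zero or more rules (values are made up).
nested <- list(
  list(list(antecedent = "{a}", consequent = "{x}",
            support = 0.20, confidence = 0.90)),
  list(),  # a condition that produced no rule
  list(list(antecedent = "{b}", consequent = "{x}",
            support = 0.10, confidence = 0.85),
       list(antecedent = "{b}", consequent = "{y}",
            support = 0.30, confidence = 1.00))
)

# Remove one level of nesting: a flat list of rules.
flat <- unlist(nested, recursive = FALSE)

# Turn each rule into a one-row data frame and stack the rows.
rules <- do.call(rbind, lapply(flat, as.data.frame))

# Sort by decreasing support, as in the pipeline above.
rules <- rules[order(-rules$support), ]
rules
```

Conditions that yield no rule contribute an empty list, which simply disappears when the first level is flattened.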