The feature selection procedure is implemented with Python’s scikit-learn library. The package requires an Anaconda installation to work properly. The Python requirements are listed in the requirements.txt file.
You can install the development version of GeneSelectR from GitHub with:
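A minimal sketch of the installation call, assuming devtools is available and the repository lives at dzhakparov/GeneSelectR (the repository path is an assumption; adjust it if it differs):
# install the development version from GitHub
# NOTE: the repository path is an assumption
devtools::install_github('dzhakparov/GeneSelectR')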
GeneSelectR relies on Python for feature selection via reticulate. When you install the library and launch it for the first time, you need to create the working environment for the package. There is a convenience function for this:
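A sketch of this first-time setup, assuming the convenience function is called configure_environment() (the function name is an assumption; see the package documentation for the exact helper):
# one-time setup: creates the conda environment and installs the Python dependencies
# NOTE: the function name is an assumption
GeneSelectR::configure_environment()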
Running it will prompt you to create a new conda environment with the default name 'GeneSelectR_env' that will be used in your future GeneSelectR sessions. After that you will be asked whether the necessary Python packages should be installed into this working environment.
Due to restrictions in how reticulate binds to Python, once this initial step is done every GeneSelectR session has to start by setting the correct conda working environment:
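A sketch of this per-session setup, assuming the helper is called set_reticulate_python() and the environment keeps the default name (both are assumptions; check the package documentation):
# point reticulate at the GeneSelectR conda environment for this session
# NOTE: function and environment names are assumptions
GeneSelectR::set_reticulate_python(env_name = 'GeneSelectR_env')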
The data matrix should be a data frame with samples as rows and features as columns. The example dataset can be accessed via:
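Assuming the example dataset is shipped with the package, it can be loaded with the standard data() call:
# load the example dataset bundled with the package
data("UrbanRandomSubset", package = "GeneSelectR")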
This is a bulk RNA-seq dataset obtained from blood samples of 149 African children stratified into those with Atopic Dermatitis (AD) and healthy controls (HC). Additionally, the whole dataset contains a stratification variable for the children's location (Urban and Rural); this data snippet contains the Urban samples only. The columns represent genes, the samples are in rows, and the column treatment contains the diagnosis label.
This data subset was used in the PharML 2022 workshop at the ECML-PKDD 2022 conference. The full paper is available at doi.org/10.1007/978-3-031-23633-4_18.
As noted before, every new session in which you use GeneSelectR should start with the following line, which sets the correct conda environment:
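As in the installation section (the helper name is an assumption):
GeneSelectR::set_reticulate_python(env_name = 'GeneSelectR_env')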
Prior to running the feature selection function, let’s prepare the data to have a suitable format:
X <- UrbanRandomSubset %>% dplyr::select(-treatment) # get the feature matrix
y <- UrbanRandomSubset['treatment'] # store the data point labels separately
Due to how sklearn handles data, it is critical to convert the label column y into a numeric representation. For example, you can do it like this:
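A sketch of one way to do the recoding (the actual label values are not shown here, so a generic factor-based conversion is used):
# recode the character diagnosis labels into integers
y <- dplyr::mutate(y, treatment = as.integer(as.factor(treatment)))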
Also, make sure that y is a vector, and not a data frame with one column:
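# extract the label column as a plain vector
y <- y$treatment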
To run the feature selection procedure with default settings, you should run the GeneSelectR::GeneSelectR() function:
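A minimal default call (a sketch, assuming only X and y are required; njobs = -1 uses all available cores, as in the examples later in this vignette):
# run feature selection with the default pipelines and parameter grids
selection_results <- GeneSelectR::GeneSelectR(X = X,
                                              y = y,
                                              njobs = -1L)
Custom hyperparameter grids for the default feature selection methods can be supplied through the fs_param_grids argument; for example, a grid like the following can be passed: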
fs_param_grids <- list(
  "Lasso" = list(
    "feature_selector__estimator__C" = c(0.01, 0.1, 1L, 10L),
    "feature_selector__estimator__solver" = c('liblinear', 'saga')
  ),
  "Univariate" = list(
    "feature_selector__param" = seq(50L, 200L, by = 50L)
  ),
  "boruta" = list(
    "feature_selector__perc" = seq(80L, 100L, by = 10L),
    "feature_selector__n_estimators" = c(50L, 100L, 250L, 500L)
  ),
  "RandomForest" = list(
    "feature_selector__estimator__n_estimators" = seq(100L, 500L, by = 50L),
    "feature_selector__estimator__max_depth" = c(10L, 20L, 30L),
    "feature_selector__estimator__min_samples_split" = c(2L, 5L, 10L),
    "feature_selector__estimator__min_samples_leaf" = c(1L, 2L, 4L),
    "feature_selector__estimator__bootstrap" = c(TRUE, FALSE)
  )
)
Pipeline Selection: If custom pipelines are provided, they are used. Otherwise, default pipelines are created using the chosen feature selection methods. Each pipeline contains two preprocessing steps: a VarianceThreshold set to 0.85 to filter out low-variance features, and a StandardScaler. Following the standard sklearn workflow to prevent data leakage, every pipeline is fitted on the training data and evaluated on the test/validation data.
Repeated Train-Test Splitting: The data undergoes repeated splitting into training and test sets, with the number of repetitions defined by n_splits. For each split, every pipeline is tuned via cross-validation on the training portion and, if perform_split is TRUE, evaluated on the held-out test portion.
4.1 Note on Inbuilt vs Permutation feature importance: The GeneSelectR package offers two ways of scoring selected features, and it is important to understand the difference between these two approaches. In short: inbuilt feature importance is the importance reported by the fitted estimator itself (for example, coefficients for Lasso or impurity-based importance for Random Forest), while permutation importance measures the drop in model performance when a feature's values are randomly shuffled.
A Note on Permutation Importance: Permutation importance provides insights specific to the model it is computed for. A feature that is vital for one model might not be for another. Hence, it is crucial to understand that permutation importance does not indicate the intrinsic predictive value of a feature in isolation. Moreover, it is vital to have a trustworthy model (with good cross-validation scores) before considering its permutation importance values. Features deemed unimportant in a poorly performing model could be crucial for a well-performing one. It is always recommended to evaluate a model's predictive power using held-out data or cross-validation prior to computing permutation importances. For more information, please refer to the official sklearn documentation on permutation importance.
Note: Permutation importance calculation is computationally intensive and may increase the time needed to finish the analysis run. As a general recommendation, it should be used only if large computational resources are available.
GeneSelectR also supports the scikit-optimize method BayesSearchCV, which performs Bayesian optimization of hyperparameters. Depending on the settings and the data, it could potentially speed up computation:
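A sketch, assuming Bayesian search is requested through a search_type argument (the argument name is an assumption; check the GeneSelectR() documentation for the exact interface):
# request BayesSearchCV instead of the default grid search
# NOTE: the search_type argument name is an assumption
selection_results <- GeneSelectR::GeneSelectR(X = X,
                                              y = y,
                                              njobs = -1L,
                                              search_type = 'bayesian')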
The GeneSelectR workflow is highly customizable depending on your needs. For example, if you wish to add other feature selection methods from sklearn, you should pass them as a named list like this:
# sklearn is already imported when the library is loaded
# define the feature selection submodule and wanted methods with an estimator
feature_selection <- sklearn$feature_selection
select_from_model <- feature_selection$SelectFromModel
RFE <- feature_selection$RFE
rf <- sklearn$ensemble$RandomForestClassifier
# feature selection methods of your choice
my_methods <- list('RFE' = RFE(estimator = rf(), n_features_to_select = 100L),
'SelectFromModel' = select_from_model(estimator = rf()))
The parameters for these methods should be passed in the same way, with the prefix 'feature_selector__' for every parameter:
# params for the feature selection methods of your choice
my_params <- list('RFE' = list('feature_selector__step' = c(0.1, 0.001, 1L, 10L)),
'SelectFromModel' = list('feature_selector__estimator__n_estimators' = c(50L, 100L, 250L, 500L),
"feature_selector__estimator__max_depth" = c(10L, 20L, 30L),
"feature_selector__estimator__bootstrap" = c(TRUE, FALSE))
)
Finally, we can pass them as arguments to the GeneSelectR() function:
selection_results <- GeneSelectR::GeneSelectR(X = X,
y = y,
njobs = -1,
feature_selection_methods=my_methods,
fs_param_grids = my_params)
Besides feature selection methods, preprocessing steps as well as other classification estimators can be passed. For example, if you want to pass other preprocessing steps you can do it like this:
minmax <- sklearn$preprocessing$MinMaxScaler()
varthr <- sklearn$feature_selection$VarianceThreshold()
preprocessing <- list( 'MinMaxScaler' = minmax,
'VarianceThreshold' = varthr)
selection_results <- GeneSelectR::GeneSelectR(X = X,
y = y,
njobs = -1,
feature_selection_methods=my_methods,
fs_param_grids = my_params,
preprocessing_steps = preprocessing)
The same can be done with the classifying estimator. For example, if you want to use the XGBoost classifier instead of the default random forest:
# import xgboost
# NOTE: it should be installed in the working conda env
xgb <- reticulate::import('xgboost')
xgb.classifier <- xgb$XGBClassifier()
selection_results <- GeneSelectR::GeneSelectR(X = X,
y = y,
njobs = -1,
classifier = xgb.classifier)
Note: if you supply your own classifier, you have to specify the parameter grid for it with a ‘classifier__’ prefix:
xgb <- reticulate::import('xgboost')
xgb.classifier <- xgb$XGBClassifier()
xgb_param_grid <- list(
"classifier__learning_rate" = c(0.01, 0.05, 0.1),
"classifier__n_estimators" = c(100L, 200L, 300L),
"classifier__max_depth" = c(3L, 5L, 7L),
"classifier__min_child_weight" = c(1L, 3L, 5L),
"classifier__gamma" = c(0, 0.1, 0.2),
"classifier__subsample" = c(0.8, 1.0),
"classifier__colsample_bytree" = c(0.8, 1.0)
)
selection_results <- GeneSelectR::GeneSelectR(X = X,
y = y,
njobs = -1,
classifier = xgb.classifier,
classifier_grid = xgb_param_grid)
The function returns an object of class "PipelineResults", containing:
- 'best_pipeline': A named list containing parameters of the best performing pipeline.
- 'cv_results': Results of cross-validation for each pipeline.
- 'inbuilt_feature_importance': Aggregated inbuilt feature importance scores.
- 'test_metrics': If the perform_split parameter was set to TRUE, a dataframe of test metrics for each pipeline.
- 'cv_mean_score': A dataframe that summarizes the mean cross-validation scores.
Finally, you can visualize the most important features for every feature importance calculation method by calling the plotting function:
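A sketch, assuming the plotting helper is called plot_feature_importance() and takes a top_n_features argument (both names are assumptions; check the package documentation):
# plot the top 10 features for every feature selection method
# NOTE: function and argument names are assumptions
plot_feature_importance(selection_results, top_n_features = 10L)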
The plot shows the mean feature importance of the top n features of your choice, ordered by importance score in descending order.
You can plot the feature selection procedure metrics with the following function:
plot_metrics(selection_results)
# or access it as a dataframe
selection_results$test_metrics
selection_results$cv_mean_score
Additionally, you can inspect whether your feature selection lists have overlapping features. To do that, use the following:
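For example, using the results object from above:
overlap_coefficients <- calculate_overlap_coefficients(selection_results)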
This will return a dataframe with three types of overlap coefficients for inbuilt feature importance and permutation importance (if calculated): Sørensen-Dice, Overlap and Jaccard. The coefficients can also be visualized as overlap heatmaps. To do so, do the following:
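For example, using the coefficients computed above:
plot_overlap_heatmaps(overlap_coefficients)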
Additionally, if you have any custom lists (e.g. differential gene expression list) you can pass it as an argument like this:
custom_list <- list(custom_list = c('char1','char2','char3','char4','char5'),
custom_list2 = c('char1','char2','char3','char4','char5'))
overlap1 <- calculate_overlap_coefficients(selection_results, custom_lists = custom_list)
plot_overlap_heatmaps(overlap1)
For your convenience, a wrapper function for clusterProfiler GO enrichment is implemented, along with a function to get gene annotations. After GeneSelectR has been run, fetch gene annotations as follows:
# Get the annotations for the genes in the analysis
# The example dataset contains sequencing from humans
ah <- AnnotationHub::AnnotationHub()
human_ens <- AnnotationHub::query(ah, c("Homo sapiens", "EnsDb"))
human_ens <- human_ens[['AH98047']]
annotations_ahb <- ensembldb::genes(human_ens, return.type = "data.frame") %>%
dplyr::select(gene_id,gene_name,entrezid,gene_biotype)
annotations_df <- annotate_gene_lists(pipeline_results = selection_results,
annotations_ahb = annotations_ahb,
format = 'ENSEMBL')
annotations_df
The format argument can be set to three possible options: (1) 'ENSEMBL' (e.g. ENSGXXXXXXX), (2) 'SYMBOL' (e.g. TSPAN) or (3) 'ENTREZID'. The resulting AnnotatedGeneLists class object contains the following fields:
# Formal class 'AnnotatedGeneLists' [package "GeneSelectR"] with 2 slots
# ..@ inbuilt :List of 4
# .. ..$ Lasso :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
# .. .. .. ..@ SYMBOL : chr [1:54] "MCOLN3" "AGL" "NBPF26" "EXO1" ...
# .. .. .. ..@ ENSEMBL : chr [1:54] "ENSG00000055732" "ENSG00000162688" "ENSG00000273136" "ENSG00000174371" ...
# .. .. .. ..@ ENTREZID: chr [1:54] "55283" "178" "101060684" "9156" ...
# .. ..$ Univariate :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
# .. .. .. ..@ SYMBOL : chr [1:74] "CDA" "BMP8B" "MCOLN3" "AGL" ...
# .. .. .. ..@ ENSEMBL : chr [1:74] "ENSG00000158825" "ENSG00000116985" "ENSG00000055732" "ENSG00000162688" ...
# .. .. .. ..@ ENTREZID: chr [1:74] "978" "656" "55283" "178" ...
# .. ..$ RandomForest:Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
# .. .. .. ..@ SYMBOL : chr [1:73] "CDA" "BMP8B" "MCOLN3" "NBPF26" ...
# .. .. .. ..@ ENSEMBL : chr [1:73] "ENSG00000158825" "ENSG00000116985" "ENSG00000055732" "ENSG00000273136" ...
# .. .. .. ..@ ENTREZID: chr [1:73] "978" "656" "55283" "101060684" ...
# .. ..$ boruta :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
# .. .. .. ..@ SYMBOL : chr [1:36] "MCOLN3" "AGL" "NBPF26" "BNIPL" ...
# .. .. .. ..@ ENSEMBL : chr [1:36] "ENSG00000055732" "ENSG00000162688" "ENSG00000273136" "ENSG00000163141" ...
# .. .. .. ..@ ENTREZID: chr [1:36] "55283" "178" "101060684" "149428" ...
# ..@ permutation:List of 4
# .. ..$ Lasso :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
# .. .. .. ..@ SYMBOL : chr [1:4] "HBB" "" "ABCC12" "LTF"
# .. .. .. ..@ ENSEMBL : chr [1:4] "ENSG00000244734" "ENSG00000259529" "ENSG00000140798" "ENSG00000012223"
# .. .. .. ..@ ENTREZID: chr [1:4] "3043" "NA" "94160" "4057"
# .. ..$ Univariate :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
# .. .. .. ..@ SYMBOL : chr [1:19] "KAZN" "RGS16" "IGFN1" "CHI3L1" ...
# .. .. .. ..@ ENSEMBL : chr [1:19] "ENSG00000189337" "ENSG00000143333" "ENSG00000163395" "ENSG00000133048" ...
# .. .. .. ..@ ENTREZID: chr [1:19] "23254" "6004" "91156" "1116" ...
# .. ..$ RandomForest:Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
# .. .. .. ..@ SYMBOL : chr [1:16] "ZBED6" "CYP2E1" "HBB" "OTOGL" ...
# .. .. .. ..@ ENSEMBL : chr [1:16] "ENSG00000257315" "ENSG00000130649" "ENSG00000244734" "ENSG00000165899" ...
# .. .. .. ..@ ENTREZID: chr [1:16] "100381270" "1571" "3043" "283310" ...
# .. ..$ boruta :Formal class 'GeneList' [package "GeneSelectR"] with 3 slots
# .. .. .. ..@ SYMBOL : chr [1:8] "CHI3L1" "HBB" "IDH1" "PTPRN" ...
# .. .. .. ..@ ENSEMBL : chr [1:8] "ENSG00000133048" "ENSG00000244734" "ENSG00000138413" "ENSG00000054356" ...
# .. .. .. ..@ ENTREZID: chr [1:8] "1116" "3043" "3417" "5798" ...
For example, ENTREZID for Random Forest gene list derived from inbuilt feature importance can be accessed like this:
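Based on the object structure shown above:
# ENTREZ IDs for the Random Forest list derived from inbuilt feature importance
annotations_df@inbuilt$RandomForest@ENTREZID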
You can also add custom lists (e.g. a background for enrichment, or a differential gene expression list) to fetch annotations for them:
custom_list <- list(background = c('gene1', 'gene2', 'gene3', 'gene4'),
DEGs = c('gene5','gene6','gene7'))
annotations_df_with_custom <- annotate_gene_lists(pipeline_results = selection_results,
annotations_ahb = annotations_ahb,
format = 'ENSEMBL',
custom_lists = custom_list)
annotations_df_with_custom
There is a wrapper function to run GO enrichment analysis with the clusterProfiler package. To run a GO enrichment analysis with default settings, simply run:
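For example, using the annotated gene lists from the previous step (all other arguments left at their defaults):
annotated_GO <- GO_enrichment_analysis(annotations_df_with_custom)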
This function can take different arguments depending on your need:
annotated_GO <- GO_enrichment_analysis(annotations_df_with_custom,
                                       list_type = 'permutation', # run GO enrichment on permutation-based selected features
                                       keyType = 'ENSEMBL', # run the analysis with ENSEMBL IDs
                                       ont = 'BP') # use the Biological Process ontology
annotated_GO
The returned object is a list containing a clusterProfiler enrichResult object for every feature selection list.
You can calculate the fraction of enriched GO terms in each feature selection list that are child terms of specified parent GO terms:
annot_child_fractions <- compute_GO_child_term_metrics(GO_data = annotated_GO,
GO_terms = c("GO:0002376", "GO:0044419"),
plot = FALSE)
This will produce a dataframe containing, for each feature selection list and parent GO term, the total number of identified GO terms (without p-value filtering), the number of child terms of the specified parent term, the fraction of child terms among all identified terms, and a hypergeometric test p-value:
str(annot_child_fractions)
# 'data.frame': 12 obs. of 6 variables:
# $ feature_list : chr "Lasso" "Univariate" "RandomForest" "boruta" ...
# $ all_terms_number : int 5413 5421 5981 4643 1308 1752 5413 5421 5981 4643 ...
# $ offspring_nodes_number: int 432 449 495 379 196 170 112 110 128 100 ...
# $ offspring_terms : chr "GO:0001767;GO:0001768;GO:0001771;GO:0001773;GO:0001774;GO:0001776;GO:0001779;GO:0001780;GO:0001782;GO:0001787;G"| __truncated__ "GO:0001767;GO:0001768;GO:0001771;GO:0001773;GO:0001774;GO:0001776;GO:0001780;GO:0001782;GO:0001787;GO:0001909;G"| __truncated__ "GO:0001767;GO:0001768;GO:0001771;GO:0001773;GO:0001774;GO:0001776;GO:0001779;GO:0001782;GO:0001787;GO:0001865;G"| __truncated__ "GO:0001767;GO:0001768;GO:0001771;GO:0001773;GO:0001774;GO:0001776;GO:0001779;GO:0001780;GO:0001782;GO:0001865;G"| __truncated__ ...
# $ fraction : num 7.98 8.28 8.28 8.16 14.98 ...
# $ GO_term : chr "GO:0002376" "GO:0002376" "GO:0002376" "GO:0002376" ...
If the plot argument is set to TRUE, it will produce the following plot:
The final step of the analysis is the clustering and semantic similarity analysis of GO terms from every list. This is done with the simplifyEnrichment R package. For convenient data input, a wrapper around the simplifyGOFromMultipleLists() function is implemented:
# simplifyEnrichment produces a heatmap
hmap <- run_simplify_enrichment(annotated_GO,
                                method = 'louvain',
                                measure = 'Resnik')
This will produce a heatmap with clustered GO terms from the feature selection lists:
For more information on simplifyEnrichment please visit the official package webpage.