Introduction to featureselectr
The featureselectr package is an object-oriented package containing functionality for feature selection, model building, and classifier development.
Two primary functions in featureselectr
- feature_selection() - Sets up the feature selection object containing all search information.
- Search() - Performs the actual search.
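In practice the two are used in sequence. Below is a minimal sketch of the workflow, abbreviated from the full example later in this vignette (data and feats are defined there):
fs_setup <- feature_selection(
  data,                          # data frame of features + response
  candidate_features = feats,    # character vector of candidate features
  model_type = model_type_nb(response = "class_response"),
  search_type = search_type_forward_model(max_steps = 10L),
  cost = "AUC"
)
fs_result <- Search(fs_setup)    # perform the actual search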
Search Type Helpers
There are two main search types to choose from. See
?search_type.
search_type_forward_model() # Forward stepwise
#> ── Forward Search ────────────────────────────────────────────────────
#> • display_name 'Forward Stepwise Model Search'
#> • max_steps 20
#> ──────────────────────────────────────────────────────────────────────
search_type_backward_model() # Backward stepwise
#> ── Backward Search ───────────────────────────────────────────────────
#> • display_name 'Backward Stepwise Model Search'
#> ──────────────────────────────────────────────────────────────────────
Model Type Helpers
There are three main model types to choose from. See
?model_type.
model_type_lr() # Logistic regression
#> ── Model: logistic regression ────────────────────────────────────────
#> • response 'Response'
#> ──────────────────────────────────────────────────────────────────────
model_type_lm() # Linear regression
#> ── Model: linear regression ──────────────────────────────────────────
#> • response 'Response'
#> ──────────────────────────────────────────────────────────────────────
model_type_nb() # Naive Bayes
#> ── Model: naive Bayes ────────────────────────────────────────────────
#> • response 'Response'
#> ──────────────────────────────────────────────────────────────────────
Cost Helpers
There are five available cost functions, which the user typically does not need to call directly. Simply pass one of the following as a string to the cost = argument of feature_selection(). See ?feature_selection and ?cost.
- AUC: Area under the curve (classification)
- CCC: Concordance Correlation Coefficient (regression)
- MSE: Mean-squared Error (regression)
- R2: R-squared (regression)
- sens/spec: Sensitivity + Specificity (sum; classification)
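As a rough illustration of what the "AUC" cost measures (not the package's internal implementation): the area under the ROC curve equals the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counting one half.
# Illustration only -- not featureselectr's internal code
auc_by_hand <- function(score, truth) {
  pos <- score[truth == 1]
  neg <- score[truth == 0]
  # proportion of positive/negative pairs ranked correctly (ties count 1/2)
  mean(outer(pos, neg, `>`) + 0.5 * outer(pos, neg, `==`))
}
withr::with_seed(101, auc_by_hand(score = runif(40), truth = rep(0:1, each = 20L)))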
Feature Selection with Naive Bayes
The analysis below is performed with the simulated data set from wranglr::simdata. We fit a naive Bayes model during the feature selection. The setup below specifies 3 independent runs of 5-fold cross-validation. A different number of folds may generate slightly different results, but a 20-25% hold-out is fairly common. Of course, more runs (repeats) will take longer.
There are 5 features that should be significant in a binary classification context; they are identified in the attributes of the object itself. We will restrict the search to the top 10 steps (there are 40 candidate features in total, so the remaining 35 are potential false positives).
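As a quick sanity check on the hold-out size, 5-fold cross-validation leaves exactly one fifth of the 100 rows out in each fold. A base-R illustration (not package code):
folds <- withr::with_seed(1, sample(rep(seq_len(5L), length.out = 100L)))
table(folds)  # 20 rows (20%) held out per fold
#> folds
#>  1  2  3  4  5
#> 20 20 20 20 20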
Set Up the feature_selection Object
data <- wranglr::simdata
# True positive features
attributes(data)$sig_feats$class
#> [1] "seq.2802.68" "seq.9251.29" "seq.1942.70" "seq.5751.80"
#> [5] "seq.9608.12"
# log-transform, center, and scale
cs <- function(x) {
out <- log10(x)
out <- out - mean(out)
out / sd(out)
}
# randomly scramble the order of the features
feats <- withr::with_seed(123, sample(helpr:::get_analytes(data)))
data[, feats] <- apply(data[, feats], 2, cs)
# set model type and column name of response variable
mt <- model_type_nb(response = "class_response")
# set search method function to 'forward' and 'model'
# restrict to the top 10 steps in the search; then stop
sm <- search_type_forward_model(max_steps = 10L)
# setup feature selection object
fs_setup <- feature_selection(
data,
candidate_features = feats,
model_type = mt,
search_type = sm,
runs = 3L,
folds = 5L,
cost = "AUC",
random_seed = 1
)
fs_setup
#> ══ Feature Selection Object ══════════════════════════════════════════
#> ── Dataset Info ──────────────────────────────────────────────────────
#> • Rows 100
#> • Columns 55
#> • FeatureData 40
#> ── Search Optimization Info ──────────────────────────────────────────
#> • No. Candidates '40'
#> • Response Field 'class_response'
#> • Cross Validation Runs '3'
#> • Cross Validation Folds '5'
#> • Stratified Folds 'FALSE'
#> • Model Type 'fs_nb'
#> • Search Type 'fs_forward_model'
#> • Cost Function 'AUC'
#> • Random Seed '1'
#> • Display Name 'Forward Stepwise Model Search'
#> • Search Complete 'FALSE'
#> ══════════════════════════════════════════════════════════════════════
Perform the Search
The S3 generic Search() performs the actual feature selection; method dispatch occurs according to the class of fs_setup.
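Because Search() is a generic, you can confirm the dispatch yourself with base R (the exact class names are package internals, so none are shown here):
class(fs_setup)   # the class vector Search() dispatches on
methods(Search)   # the Search() methods registered by featureselectr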
fs_nb <- Search(fs_setup)
Plot the Selection Paths
There is an S3 plot() method that easily visualizes the steps of the selection algorithm and highlights the peak (AUC) as well as the models at 1 and 2 standard errors (1se/2se) from the peak. The two panels show a distribution-free representation of the data (left; Wilcoxon signed-ranks with medians) and a distribution-dependent representation (right; means with standard errors and 95% confidence intervals).
plot(fs_nb)
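For reference, the right-hand panel's intervals are the familiar mean ± standard error construction; a 95% confidence interval of the mean is typically computed as below (generic R with made-up per-fold AUC values, not package code):
auc_folds <- c(0.81, 0.84, 0.79, 0.88, 0.83)   # hypothetical per-fold AUCs
se <- sd(auc_folds) / sqrt(length(auc_folds))  # standard error of the mean
mean(auc_folds) + c(-1, 1) * qt(0.975, df = length(auc_folds) - 1) * se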
Logistic Regression
We can use the update() method to modify the existing feature_selection object.
fs_update <- update(
fs_setup, # the `feature_selection` object being modified
model_type = model_type_lr("class_response"), # logistic reg
search_type = search_type_forward_model(max_steps = 15L), # increase max steps
stratify = TRUE # now stratify
)
fs_update
#> ══ Feature Selection Object ══════════════════════════════════════════
#> ── Dataset Info ──────────────────────────────────────────────────────
#> • Rows 100
#> • Columns 55
#> • FeatureData 40
#> ── Search Optimization Info ──────────────────────────────────────────
#> • No. Candidates '40'
#> • Response Field 'class_response'
#> • Cross Validation Runs '3'
#> • Cross Validation Folds '5'
#> • Stratified Folds 'TRUE'
#> • Model Type 'fs_lr'
#> • Search Type 'fs_forward_model'
#> • Cost Function 'AUC'
#> • Random Seed '1'
#> • Display Name 'Forward Stepwise Model Search'
#> • Search Complete 'FALSE'
#> ══════════════════════════════════════════════════════════════════════
Perform the Search
fs_lr <- Search(fs_update)
Return the Plot Features
get_fs_features(fs_lr)
#> ══ Features ══════════════════════════════════════════════════════════
#> • features_max 12
#> • features_1se 8
#> • features_2se 6
#> ── features_max ──────────────────────────────────────────────────────
#> 'seq.1942.70', 'seq.9297.97', 'seq.9608.12', 'seq.4914.10', 'seq.9360.55', 'seq.8142.63', 'seq.1130.49', 'seq.9373.82', 'seq.9251.29', 'seq.2802.68', 'seq.3459.49', 'seq.6356.60'
#> ── features_1se ──────────────────────────────────────────────────────
#> 'seq.1942.70', 'seq.9608.12', 'seq.9360.55', 'seq.8142.63', 'seq.9373.82', 'seq.9251.29', 'seq.2802.68', 'seq.3459.49'
#> ── features_2se ──────────────────────────────────────────────────────
#> 'seq.1942.70', 'seq.9608.12', 'seq.8142.63', 'seq.9251.29', 'seq.2802.68', 'seq.3459.49'
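The returned feature sets can then seed a downstream model. A hypothetical sketch, assuming list-style access to the 1-SE set (the accessor shape and the plain glm() fit are assumptions, not documented featureselectr API):
sel <- get_fs_features(fs_lr)$features_1se   # assumed list-style access
fml <- stats::reformulate(sel, response = "class_response")
final_fit <- stats::glm(fml, data = data, family = binomial())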
Class Stratification Plot
You can also check the class proportions (i.e., class imbalance) of the cross-validation folds in classification problems. This is most evident when comparing folds with and without forced stratification. Below, the cross-validation folds are plotted without stratification (left) and after updating the object to include stratification (right):
no_strat <- feature_selection(
data, candidate_features = feats,
model_type = mt, search_type = sm,
runs = 2L, folds = 3L
)
with_strat <- update(no_strat, stratify = TRUE)
plot_cross(no_strat) + plot_cross(with_strat)
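For intuition, stratified fold assignment simply samples fold ids within each class so that every fold preserves the overall class balance. A base-R illustration of the idea (not the package's internal code):
stratified_folds <- function(y, k = 5L) {
  fold <- integer(length(y))
  for (lvl in unique(y)) {
    idx <- which(y == lvl)  # rows belonging to this class
    # spread this class's rows as evenly as possible across the k folds
    fold[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  fold
}
fold_id <- withr::with_seed(2, stratified_folds(data$class_response, k = 3L))
table(fold_id, data$class_response)  # near-equal class split in every fold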
