
The featureselectr package is an object-oriented package containing functionality designed for feature selection, model building, and/or classifier development.


Two primary functions in featureselectr


Search Type Helpers

There are two main search types to choose from. See ?search_type.

search_type_forward_model()   # Forward stepwise
#> ── Forward Search ────────────────────────────────────────────────────
#> • display_name  'Forward Stepwise Model Search'
#> • max_steps     20
#> ──────────────────────────────────────────────────────────────────────

search_type_backward_model()  # Backward stepwise
#> ── Backward Search ───────────────────────────────────────────────────
#> • display_name  'Backward Stepwise Model Search'
#> ──────────────────────────────────────────────────────────────────────

Model Type Helpers

There are three main model types to choose from. See ?model_type.

model_type_lr()  # Logistic regression
#> ── Model: logistic regression ────────────────────────────────────────
#> • response    'Response'
#> ──────────────────────────────────────────────────────────────────────

model_type_lm()  # Linear regression
#> ── Model: linear regression ──────────────────────────────────────────
#> • response    'Response'
#> ──────────────────────────────────────────────────────────────────────

model_type_nb()  # Naive Bayes
#> ── Model: naive Bayes ────────────────────────────────────────────────
#> • response    'Response'
#> ──────────────────────────────────────────────────────────────────────

Cost Helpers

There are five available cost functions, which the user typically does not need to call directly. Simply pass one of the following as a string to the cost = argument of feature_selection(). See ?feature_selection and perhaps ?cost.

  • AUC: Area under the curve (classification)
  • CCC: Concordance Correlation Coefficient (regression)
  • MSE: Mean-squared Error (regression)
  • R2: R-squared (regression)
  • sens/spec: Sensitivity + Specificity (sum; classification)
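
As a rough illustration of what these five metrics measure, here is a minimal base-R sketch. All function names (`auc_sketch()`, `ccc_sketch()`, and so on) are hypothetical and are not the package's internal implementations. The 0.5 cutoff in `sens_spec_sketch()` and the use of the coefficient of determination for R2 are assumptions made for the example.

```r
# Illustrative base-R versions of the five cost metrics.
# All function names are hypothetical, not featureselectr internals.

# AUC via the rank-sum (Mann-Whitney) identity;
# `truth` is coded 0/1 and `prob` is the predicted P(class = 1)
auc_sketch <- function(truth, prob) {
  r  <- rank(prob)                  # mid-ranks handle ties
  n1 <- sum(truth == 1)
  n0 <- sum(truth == 0)
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Lin's concordance correlation coefficient (moments with 1/n denominators)
ccc_sketch <- function(obs, pred) {
  mo  <- mean(obs)
  mp  <- mean(pred)
  so  <- mean((obs - mo)^2)
  sp  <- mean((pred - mp)^2)
  sop <- mean((obs - mo) * (pred - mp))
  2 * sop / (so + sp + (mo - mp)^2)
}

# mean-squared error
mse_sketch <- function(obs, pred) mean((obs - pred)^2)

# coefficient of determination (assumed definition of "R2")
r2_sketch <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# sensitivity + specificity at an assumed probability cutoff
sens_spec_sketch <- function(truth, prob, cutoff = 0.5) {
  pred <- as.integer(prob >= cutoff)
  sens <- sum(pred == 1 & truth == 1) / sum(truth == 1)
  spec <- sum(pred == 0 & truth == 0) / sum(truth == 0)
  sens + spec
}

truth <- c(0, 0, 1, 1)
prob  <- c(0.1, 0.4, 0.35, 0.8)
auc_sketch(truth, prob)
sens_spec_sketch(truth, prob)
```

Note that MSE is minimized while the other four are maximized, so a search routine has to know the direction of each metric.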

Feature Selection with Naive Bayes

The analysis below uses the simulated data set from wranglr::simdata. We fit a Naive Bayes model during feature selection. The setup below specifies 3 independent runs of 5-fold cross-validation.

A different number of folds might generate slightly different results, but a 20-25% hold-out (i.e. 5 or 4 folds) is fairly common. Of course, more runs (repeats) will take longer. There are 5 features that should be significant in a binary classification context; they are identified in the attributes of the object itself. We will restrict the search to the first 10 steps (there are 40 total features, leaving 35 potential false positives).
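
To make the resampling scheme concrete, the following base-R sketch assigns 100 rows to folds for 3 runs of 5-fold cross-validation. The helper `make_folds()` is hypothetical and only illustrates the idea; the package's own fold assignment may differ.

```r
# Sketch: 3 independent runs of 5-fold cross-validation on 100 rows.
# `make_folds()` is a hypothetical helper, not part of featureselectr.
make_folds <- function(n, folds) {
  # shuffle a balanced vector of fold labels 1..folds
  sample(rep(seq_len(folds), length.out = n))
}

set.seed(1)
n     <- 100L
runs  <- 3L
folds <- 5L

# one column of fold labels per run
fold_ids <- replicate(runs, make_folds(n, folds))

dim(fold_ids)              # rows x runs
table(fold_ids[, 1]) / n   # fraction of rows held out by each fold
```

With 5 folds, each fold holds out one fifth of the rows, which is the 20% hold-out mentioned above.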

Setup feature_select Object

data <- simdata

# True positive features
attributes(data)$sig_feats$class
#> [1] "seq.2802.68" "seq.9251.29" "seq.1942.70" "seq.5751.80"
#> [5] "seq.9608.12"

# log-transform, center, and scale
cs <- function(x) {
  out <- log10(x)
  out <- out - mean(out)
  out / sd(out)
}

# randomly scramble the order of the features
feats <- withr::with_seed(123, sample(helpr:::get_analytes(data)))
data[, feats] <- apply(data[, feats], 2, cs)

# set model type and column name of response variable
mt <- model_type_nb(response = "class_response")

# set search method function to 'forward' and 'model'
# restrict to the top 10 steps in the search; then stop
sm <- search_type_forward_model(max_steps = 10L)

# setup feature selection object
fs_setup <- feature_selection(
  data,
  candidate_features = feats,
  model_type  = mt,
  search_type = sm,
  runs  = 3L,
  folds = 5L,
  cost  = "AUC",
  random_seed = 1
)

fs_setup
#> ══ Feature Selection Object ══════════════════════════════════════════
#> ── Dataset Info ──────────────────────────────────────────────────────
#> • Rows                      100
#> • Columns                   55
#> • FeatureData               40
#> ── Search Optimization Info ──────────────────────────────────────────
#> • No. Candidates            '40'
#> • Response Field            'class_response'
#> • Cross Validation Runs     '3'
#> • Cross Validation Folds    '5'
#> • Stratified Folds          'FALSE'
#> • Model Type                'fs_nb'
#> • Search Type               'fs_forward_model'
#> • Cost Function             'AUC'
#> • Random Seed               '1'
#> • Display Name              'Forward Stepwise Model Search'
#> • Search Complete           'FALSE'
#> ══════════════════════════════════════════════════════════════════════

The S3 method Search() performs the actual feature selection; method dispatch depends on the class of the object passed to it (here fs_setup).

fs_nb <- Search(fs_setup)
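
For readers unfamiliar with forward stepwise selection, the general idea can be sketched in base R as follows. This is a simplified illustration under stated assumptions (a single cross-validation run, logistic regression via `glm()`, AUC as the cost), and every function name in it is hypothetical. It is not the package's Search() implementation.

```r
# Sketch of greedy forward stepwise selection with a cross-validated AUC.
# Every name below is hypothetical; this is not the package's Search() method.

auc <- function(truth, prob) {
  r  <- rank(prob)
  n1 <- sum(truth == 1)
  n0 <- sum(truth == 0)
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# mean held-out AUC of a logistic regression on `feats` across folds
cv_auc <- function(df, feats, response, fold_id) {
  form <- reformulate(feats, response = response)
  mean(vapply(sort(unique(fold_id)), function(k) {
    fit <- suppressWarnings(
      glm(form, data = df[fold_id != k, ], family = binomial())
    )
    test <- df[fold_id == k, ]
    auc(test[[response]], predict(fit, newdata = test, type = "response"))
  }, numeric(1)))
}

forward_search <- function(df, candidates, response, fold_id, max_steps) {
  selected <- character(0)
  path     <- numeric(0)
  for (step in seq_len(min(max_steps, length(candidates)))) {
    remaining <- setdiff(candidates, selected)
    # score each remaining candidate when added to the current model
    scores <- vapply(remaining, function(f) {
      cv_auc(df, c(selected, f), response, fold_id)
    }, numeric(1))
    selected <- c(selected, remaining[which.max(scores)])
    path     <- c(path, max(scores))
  }
  data.frame(step = seq_along(selected), feature = selected, auc = path)
}

# toy data: only `x1` carries signal
set.seed(42)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- rbinom(n, 1, plogis(2 * df$x1))
fold_id <- sample(rep(1:5, length.out = n))

res <- forward_search(df, c("x1", "x2", "x3"), "y", fold_id, max_steps = 2)
res
```

Each step adds the single candidate that most improves the cross-validated cost, and `max_steps` caps the length of the path in the same spirit as the `max_steps` argument used above.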

Plot the Selection Paths

The S3 plot() method visualizes the steps of the selection algorithm, highlighting the peak (AUC) and the models at 1σ and 2σ from the peak. The 2 panels show a distribution-free representation of the data (left; Wilcoxon signed-ranks with medians) and a distribution-dependent representation (right; standard errors with means and 95% CIs).

plot(fs_nb)
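
The peak and the 1σ and 2σ models can be thought of as follows. The sketch below uses made-up per-step means and standard errors, and it assumes the rule "smallest model whose mean cost is within k standard errors of the peak"; the exact rule used by the package may differ.

```r
# Sketch: locate the peak step and the smallest models within 1 and 2
# standard errors of it. Values are made up for illustration, and the
# selection rule is an assumption about what the plot highlights.
mean_auc <- c(0.70, 0.78, 0.83, 0.86, 0.87, 0.88, 0.875, 0.87)
se_auc   <- rep(0.015, length(mean_auc))

peak <- which.max(mean_auc)

# smallest step whose mean cost is within `k` standard errors of the peak
within_k_se <- function(k) {
  which(mean_auc >= mean_auc[peak] - k * se_auc[peak])[1]
}

c(max = peak, one_se = within_k_se(1), two_se = within_k_se(2))
```

Choosing a model within 1 or 2 standard errors of the peak trades a small loss in cost for a more parsimonious feature set.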


Logistic Regression

We can use the update() method to modify the existing feature_select object.

fs_update <- update(
  fs_setup,   # the `feature_selection` object being modified
  model_type  = model_type_lr("class_response"), # logistic reg
  search_type = search_type_forward_model(max_steps = 15L), # increase max steps
  stratify    = TRUE   # now stratify
)

fs_update
#> ══ Feature Selection Object ══════════════════════════════════════════
#> ── Dataset Info ──────────────────────────────────────────────────────
#> • Rows                      100
#> • Columns                   55
#> • FeatureData               40
#> ── Search Optimization Info ──────────────────────────────────────────
#> • No. Candidates            '40'
#> • Response Field            'class_response'
#> • Cross Validation Runs     '3'
#> • Cross Validation Folds    '5'
#> • Stratified Folds          'TRUE'
#> • Model Type                'fs_lr'
#> • Search Type               'fs_forward_model'
#> • Cost Function             'AUC'
#> • Random Seed               '1'
#> • Display Name              'Forward Stepwise Model Search'
#> • Search Complete           'FALSE'
#> ══════════════════════════════════════════════════════════════════════

Perform the Search

fs_lr <- Search(fs_update)

Plot the Selection Algorithm

plot(fs_lr)

Return the Plot Features

get_fs_features(fs_lr)
#> ══ Features ══════════════════════════════════════════════════════════
#> • features_max  12
#> • features_1se  8
#> • features_2se  6
#> ── features_max ──────────────────────────────────────────────────────
#> 'seq.1942.70', 'seq.9297.97', 'seq.9608.12', 'seq.4914.10', 'seq.9360.55', 'seq.8142.63', 'seq.1130.49', 'seq.9373.82', 'seq.9251.29', 'seq.2802.68', 'seq.3459.49', 'seq.6356.60'
#> ── features_1se ──────────────────────────────────────────────────────
#> 'seq.1942.70', 'seq.9608.12', 'seq.9360.55', 'seq.8142.63', 'seq.9373.82', 'seq.9251.29', 'seq.2802.68', 'seq.3459.49'
#> ── features_2se ──────────────────────────────────────────────────────
#> 'seq.1942.70', 'seq.9608.12', 'seq.8142.63', 'seq.9251.29', 'seq.2802.68', 'seq.3459.49'

Class Stratification Plot

For binary classification problems, you can also check the class proportions (imbalances) within the cross-validation folds.

This should be most evident when comparing the folds with and without forced stratification. Below is a sample plot of the cross-validation folds without stratification (left) and after an update to the object to include stratification (right):

no_strat <- feature_selection(
  data, candidate_features = feats,
  model_type = mt, search_type = sm,
  runs  = 2L, folds = 3L
)
with_strat <- update(no_strat, stratify = TRUE)
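# combining the two plots with `+` assumes plot_cross() returns ggplot
# objects and that the patchwork package is attached
plot_cross(no_strat) + plot_cross(with_strat)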
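
The effect of stratification can also be seen numerically. The base-R sketch below compares per-fold class proportions under random and stratified fold assignment for an imbalanced binary response; both helpers are hypothetical and are not the package's internals.

```r
# Sketch: unstratified vs. stratified fold assignment for a binary response.
# Both helpers are hypothetical and only illustrate the idea.
random_folds <- function(y, folds) {
  sample(rep(seq_len(folds), length.out = length(y)))
}

stratified_folds <- function(y, folds) {
  fold_id <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # deal fold labels within each class so proportions are preserved
    fold_id[idx] <- sample(rep(seq_len(folds), length.out = length(idx)))
  }
  fold_id
}

set.seed(1)
y <- rep(c("control", "disease"), times = c(70, 30))

tab_random <- table(y, random_folds(y, 3))
tab_strat  <- table(y, stratified_folds(y, 3))

# proportion of each class per fold (each column sums to 1)
prop.table(tab_random, margin = 2)
prop.table(tab_strat, margin = 2)
```

In the stratified table every fold contains the same number of "disease" rows, whereas the random assignment can drift away from the overall 70/30 split.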