Feature Selection Object Declaration

Declares and generates a feature_select class object within the Feature Selection framework. This object holds the data (bootstrapped or split into cross-validation folds), the model type, the search type, the cost function, and the underlying data structure used by other functions.

Usage

feature_selection(
  data,
  candidate_features,
  model_type,
  search_type,
  runs = 1L,
  folds = 1L,
  cost = c("AUC", "R2", "CCC", "MSE", "sens", "spec"),
  bootstrap = FALSE,
  stratify = FALSE,
  strat_column = NULL,
  random_seed = 101L
)

# S3 method for class 'feature_select'
print(x, ...)

is_feature_select(x)

# S3 method for class 'feature_select'
update(object, ...)

Arguments

data

A data.frame containing features and clinical data suitable for modeling.

candidate_features

character(n). Candidate features, i.e. column names, from the data object.

model_type

An instantiated model_type object, generated via a call to one of the model_type() functions.

search_type

An instantiated search_type object, generated via a call to one of the search_type() functions.

runs

integer(1). How many runs (repeats) to perform.

folds

integer(1). Number of cross-validation folds to perform.

cost

character(1). A string to be used in defining the cost function. One of:

AUC: Area Under the Curve
MSE: Mean-Squared Error
CCC: Concordance Correlation Coefficient
R2: R-squared (for regression models)
sens or spec: Sensitivity + Specificity
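As an illustration of what two of these cost options measure, here is a minimal base-R sketch. The helpers sens_plus_spec() and ccc() are hypothetical names written for this example; they are not the package's cost functions, and the package may compute these quantities differently.

```r
# Hypothetical helper: sensitivity + specificity from true and predicted classes.
sens_plus_spec <- function(truth, pred, positive) {
  tp <- sum(truth == positive & pred == positive)  # true positives
  fn <- sum(truth == positive & pred != positive)  # false negatives
  tn <- sum(truth != positive & pred != positive)  # true negatives
  fp <- sum(truth != positive & pred == positive)  # false positives
  tp / (tp + fn) + tn / (tn + fp)
}

# Hypothetical helper: Lin's concordance correlation coefficient,
# using population (divide-by-n) variance and covariance.
ccc <- function(x, y) {
  sxy <- mean((x - mean(x)) * (y - mean(y)))
  sx2 <- mean((x - mean(x))^2)
  sy2 <- mean((y - mean(y))^2)
  2 * sxy / (sx2 + sy2 + (mean(x) - mean(y))^2)
}

# Toy data (made up for illustration only)
truth <- c("a", "a", "a", "b", "b")
pred  <- c("a", "a", "b", "b", "b")
ss <- sens_plus_spec(truth, pred, positive = "a")  # 2/3 + 1

obs <- c(1.0, 2.0, 3.0, 4.0)
perfect <- ccc(obs, obs)        # identical vectors give perfect concordance
shifted <- ccc(obs, obs + 1)    # a constant shift lowers concordance
```

The sum of sensitivity and specificity ranges from 0 to 2, so larger is better; CCC, unlike plain correlation, penalizes systematic offsets between predictions and observations.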

bootstrap

logical(1). Should the data be bootstrapped rather than split into cross-validation folds? The result is multiple runs (defined by runs) with one fold each. For each run, the full data set is sampled with replacement to generate a training set of the same size as the original. The samples not chosen during sampling make up the test set.
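The sampling scheme described for bootstrap can be sketched in base R as follows. This is an illustration of the idea only, not the package's internal code; the row count n and the variable names are assumptions made for the example.

```r
# One bootstrap run: training rows are drawn with replacement,
# and the rows never drawn form the test set.
set.seed(101L)                 # mirrors the documented default seed
n <- 100L                      # hypothetical number of rows in `data`

train_idx <- sample.int(n, size = n, replace = TRUE)  # same size as the data
test_idx  <- setdiff(seq_len(n), train_idx)           # rows not chosen

n_train  <- length(train_idx)                  # equals n
n_unique <- length(unique(train_idx))          # distinct rows used for training
overlap  <- intersect(train_idx, test_idx)     # empty by construction
```

Because sampling is with replacement, some rows appear more than once in the training set, which is why a non-empty test set usually remains.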

stratify

logical(1). Should cross-validation folds be stratified based upon the column specified in strat_column?

strat_column

character(1). Which column to use for stratification of the cross-validation folds. If NULL (default), the column named by the response parameter of the model_type object is used.
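To show what stratification by a column achieves, here is a minimal base-R sketch that assigns rows to folds so that each fold keeps roughly the same class proportions. The helper stratified_folds() is hypothetical and written for this example; how the package actually builds its folds is not stated here.

```r
# Hypothetical helper: assign each row to one of `k` folds,
# balancing fold membership within each level of `strata`.
stratified_folds <- function(strata, k) {
  fold <- integer(length(strata))
  for (idx in split(seq_along(strata), strata)) {
    # shuffle the rows of this stratum, then deal them out round-robin
    idx <- idx[sample.int(length(idx))]
    fold[idx] <- rep_len(seq_len(k), length(idx))
  }
  fold
}

set.seed(101L)
# Toy stratification column: 60 controls and 40 disease rows (made up)
class_response <- factor(rep(c("control", "disease"), times = c(60L, 40L)))
fold <- stratified_folds(class_response, k = 5L)
tab  <- table(class_response, fold)   # each fold: 12 controls, 8 disease
```

Without stratification, a random split could leave some folds with very few rows of the minority class, which makes per-fold cost estimates unstable.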

random_seed

integer(1). Used to control the random number generator for reproducibility.

x, object

A feature_select class object.

...

Arguments passed to update, supplied in argument = value format. Arguments from the original call that are not supplied are preserved.
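The "preserve what is not supplied" behavior can be pictured with the call-editing pattern used by stats::update.default. The sketch below applies that pattern to a quoted call; update_call() is a hypothetical helper, and whether the feature_select method is implemented this way is an assumption.

```r
# Hypothetical helper: overwrite only the named arguments in a stored call.
update_call <- function(cl, ...) {
  extras <- list(...)
  for (nm in names(extras)) {
    cl[[nm]] <- extras[[nm]]   # replace (or add) the named argument
  }
  cl
}

# A stored call, quoted so that nothing is evaluated here
orig <- quote(feature_selection(data, runs = 5L, folds = 5L, cost = "sens"))
new  <- update_call(orig, runs = 20L, cost = "AUC")

new$runs    # replaced
new$folds   # carried over from the original call
new$cost    # replaced
```

In stats::update.default the modified call is then evaluated to rebuild the object, which is consistent with changed runs or folds requiring the cross-validation indices to be recomputed.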

Value

A "feature_select" class object; a list of:

data: The original feature data to use.
candidate_features: The list of candidate features.
model_type: A list containing model type variables of the appropriate class for the desired model type.
search_type: A list containing search type variables of the appropriate class for the desired search type.
cost: A string of the type of cost function.
cost_fxn: A list containing cost variables of the appropriate class for the desired object cost function.
runs: The number of runs.
folds: The number of folds.
random_seed: The random seed used.
cross_val: A list containing the training and test indices of the various cross-validation folds.
search_complete: Logical; whether the object has completed a search.
call: The original matched call.

Functions

is_feature_select(): Checks whether x is a valid feature_select class object.

References

Hastie, Tibshirani, and Friedman. Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd Ed. Springer. 2009.

Author

Stu Field, Kirk DeLisle

Examples

# Simulated Test Data
data <- wranglr::simdata

# Setup response variable
data$class_response <- factor(data$class_response)

mt <- model_type_lr("class_response")
sm <- search_type_forward_model(15L, display_name = "Forward Algorithm")
ft <- helpr:::get_analytes(data)   # select candidate features
fs <- feature_selection(data, candidate_features = ft,
                        model_type = mt, search_type = sm, cost = "sens",
                        runs = 5L, folds = 5L)
# S3 Print method
fs
#> ══ Feature Selection Object ═══════════════════════════════════════════
#> ── Dataset Info ───────────────────────────────────────────────────────
#> • Rows                      100
#> • Columns                   55
#> • FeatureData               40
#> ── Search Optimization Info ───────────────────────────────────────────
#> • No. Candidates            '40'
#> • Response Field            'class_response'
#> • Cross Validation Runs     '5'
#> • Cross Validation Folds    '5'
#> • Stratified Folds          'FALSE'
#> • Model Type                'fs_lr'
#> • Search Type               'fs_forward_model'
#> • Cost Function             'sens'
#> • Random Seed               '101'
#> • Display Name              'Forward Algorithm'
#> • Search Complete           'FALSE'
#> ═══════════════════════════════════════════════════════════════════════

# Using the S3 Update method to modify existing `feature_select` object:
#   change model type, cost function, and random seed
fs2 <- update(fs, model_type = model_type_nb("class_response"),
              cost = "AUC", random_seed = 99L)
fs2
#> ══ Feature Selection Object ═══════════════════════════════════════════
#> ── Dataset Info ───────────────────────────────────────────────────────
#> • Rows                      100
#> • Columns                   55
#> • FeatureData               40
#> ── Search Optimization Info ───────────────────────────────────────────
#> • No. Candidates            '40'
#> • Response Field            'class_response'
#> • Cross Validation Runs     '5'
#> • Cross Validation Folds    '5'
#> • Stratified Folds          'FALSE'
#> • Model Type                'fs_nb'
#> • Search Type               'fs_forward_model'
#> • Cost Function             'AUC'
#> • Random Seed               '99'
#> • Display Name              'Forward Algorithm'
#> • Search Complete           'FALSE'
#> ═══════════════════════════════════════════════════════════════════════

# change number of runs & folds
#   requires re-calculation of cross-validation parameters
fs3 <- update(fs, runs = 20L, folds = 10L)
fs3
#> ══ Feature Selection Object ═══════════════════════════════════════════
#> ── Dataset Info ───────────────────────────────────────────────────────
#> • Rows                      100
#> • Columns                   55
#> • FeatureData               40
#> ── Search Optimization Info ───────────────────────────────────────────
#> • No. Candidates            '40'
#> • Response Field            'class_response'
#> • Cross Validation Runs     '20'
#> • Cross Validation Folds    '10'
#> • Stratified Folds          'FALSE'
#> • Model Type                'fs_lr'
#> • Search Type               'fs_forward_model'
#> • Cost Function             'sens'
#> • Random Seed               '101'
#> • Display Name              'Forward Algorithm'
#> • Search Complete           'FALSE'
#> ═══════════════════════════════════════════════════════════════════════