Stability Selection

Performs stability selection on a set of predictive features and a response variable. Stability selection is performed using a user-specified kernel. For classification problems the l1-logistic kernel should be used and the response should be a vector or factor of with two class labels. For lasso the response column should be a numeric vector. For "Cox" the response is a two column matrix containing the event time in the first column and the censoring indicator in the second column. The randomized lasso is used if the alpha parameter is set to a value less than 1. In a randomized Lasso the model coefficients are randomly re-weighted when calculating the regularization term. This weighting can be performed in two different ways. If Pw = NA then these random weights are sampled uniformly between alpha and 1. If Pw is supplied, then the random weights are chosen to be alpha with probability Pw and 1 otherwise. The latter choice is used in Theorem 2 in Meinshausen and Buhlmann. Recommended values of alpha and Pw are [0.5, 0.2].

The is_stab_sel() function checks whether an object is class stab_sel. See inherits().

Usage

stability_selection(
  x,
  y = NULL,
  kernel = c("l1-logistic", "lasso", "ridge", "Cox", "pca.sd", "pca.thresh",
    "multinomial"),
  num_iter = 100,
  parallel = FALSE,
  alpha = 0.8,
  Pw = 0.5,
  num_perms = 0,
  standardize = TRUE,
  lambda_min_ratio = 0.1,
  beta_threshold = 0L,
  elastic_alpha = 1,
  lambda_pad = 20,
  impute_outliers = FALSE,
  impute_n_sigma = 3,
  r_seed = sample(1000, 1),
  ...
)

is_stab_sel(x)

# S3 method for class 'stab_sel'
print(x, ...)

# S3 method for class 'stab_sel'
summary(object, ..., thresh)

# S3 method for class 'stab_sel'
plot(
  x,
  thresh = 0.6,
  custom_labels = NULL,
  main = NULL,
  sort_by_AUC = TRUE,
  ln_cols = unlist(col_palette),
  add_perm = FALSE,
  emp_thresh = seq(1, 0.1, by = -0.01),
  ...
)

Arguments

x

A numeric \(n x p\) matrix of predictive features containing n observation rows and p feature columns. Alternatively, a stab_sel class object if passing to one of the S3 generic methods.

y

The response variable. If kernel is "l1-logistic" then a vector of binary class labels. If kernel is "Cox" then it is a two column matrix with the event time in the first column and the censoring indicator (1 = event, 0 = censored) in the second column.

kernel

character(1). A string describing the underlying model used for selection. Options are:

"l1-logistic" (default)
"lasso"
"Cox"
"ridge"
"multinomial"
"pca.sd"
"pca.thresh"

num_iter

integer(1). Defining the number of sub-sampling iterations for the stability selection.

parallel

logical(1). Should parallel processing via multiple cores be implemented? Must be on a Linux platform and have the parallel package installed. Otherwise defaults to 1 core.

alpha

numeric(1). Value defining the weakness parameter for the randomized regularization. This is the minimum random weight applied to each beta coefficient in the regularization.

Pw

numeric(1). Value defining probability of a weak weight, see alpha. If Pw = NA then the coefficient weights are sampled uniformly from alpha to 1.

num_perms

integer(1). The number of permutations to use in calculating the empirical false positive rate.

standardize

logical(1). Whether the data should be centered and scaled.

lambda_min_ratio

The minimum value of lambda/max(lambda) to use during the selection procedure. See glmnet().

beta_threshold

numeric(1). Floating point value defining selection levels for ridge regression. Since ridge regression will not zero out coefficients, selection of coefficient curves by selection probability is not effective. Any variable having a coefficient with absolute value greater than or equal to beta_threshold will be selected.

elastic_alpha

numeric(1). Floating point value between 0 and 1. When 0, the results of glmnet() are equivalent to Ridge regression. When 1, the results are equivalent to Lasso. Any value between 0 and 1 creates a compromise between L1 and L2 penalty.

lambda_pad

The lambda path is padded with high values of lambda in order to produce a more appealing plot. Occasionally, the degree of padding needs to be adjusted in order to produce better resolution at low values of lambda. Typical values for this parameter are 20 (default), 15, 10, or 5.

impute_outliers

logical(1). Should statistical outliers (\(3 * \sigma\)) be imputed to approximate a Gaussian distribution during stability selection? See wranglr::impute_outliers().

impute_n_sigma

numeric(1). Standard deviation outlier threshold for imputing outliers if impute_outliers = TRUE, ignored otherwise.

r_seed

integer(1). Seed for the random number generator, allowing for reproducibility of results.

...

Additional arguments passed to one of the S3 methods for stab_sel class objects, generics include:

plot.stab_sel()
print.stab_sel()
summary.stab_sel()

object

An stab_sel class object.

thresh

A numeric minimum selection probability threshold. This value can also be a vector of values in [0, 1], but ideally greater than 0.50.

custom_labels

character(n). Character vector of additional features to label in the plot, see Details.

main

character(1). Optional plot title (default depends on kernel).

sort_by_AUC

logical(1). If TRUE, entries in the legend will be sorted by their curve AUC values which are in parentheses following the variable name in the legend.

ln_cols

character(n). A vector of colors to be used as line colors in plotting. Recycled as necessary.

add_perm

logical(1). Should empirical false discovery lines from the null permutation be added to the plot (only if permutation was performed)? This can be time consuming depending on the number of permutations.

emp_thresh

numeric(n). A vector describing the empirical threshold values to be used (default = seq(1, 0.1, by = 0.01)).

Value

A stab_sel class object:

stabpath_matrix: A matrix of \(features x lambda_seq\) containing stability selection probabilities. A row in this matrix corresponds to a stability selection path for a single feature.
lambda: the sequence of lambdas used for regularization. They correspond to the columns of stabpath_matrix.
alpha: the weakness parameter provided in the call.
Pw: the weak weight probability provided in the call.
kernel: the kernel used (e.g. l1-logistic).
num_iter: The number of iterations used in computing the stability paths.
standardize: should the data be standardized prior to analysis?
lambda_min_ratio: ?
perm_data: Logical. Is there permuted data to perform empirical FDR?
permpath_list: list containing information to calculated the permutation paths of the empirical false positive rate.
perm_lambda: The lambda used in the permuted lists.
permpath_max: max lambda for the permuted lists (I think).
beta: A matrix of the betas calculated during the selection process.
r_seed: The random seed used.

The is_stab_sel function returns a logical boolean.

The S3 print method returns:

Stability Selection Kernel: The kernel used in the stability selection algorithm.
Weakness: The weakness used (alpha argument).
Weakness Probability: The probability of the weakness being applied (Pw = argument).
Number of Iterations: Number of iterations in the selection (num_iter = argument).
Standardized: Was the data standardized prior to stability selection?
Imputed Outliers: Were statistical outliers imputed to the Gaussian approximation prior to stability selection?
Lambda Max: The maximum lambda (tuning parameter) used.
Lambda Max: The maximum lambda (tuning parameter) used.
Random Seed: The seed passed to the random number generator for subset selection.

A ggplot.

Details

Stability selection can be performed on multiple cores by setting parallel = TRUE. This functionality requires parallel::mclapply() from the parallel package. This is not available for Windows based OS.

Functions

print(stab_sel): S3 print method for class stab_sel.
summary(stab_sel): The S3 summary method for class stab_sel.
plot(stab_sel): The S3 plot method plots the selection paths for the features. This plot closely resembles a lasso coefficient plot with the regularization parameter (lambda) plotted on x-axis and the feature selection probability (rather than the model coefficient) is plotted on the y-axis.

Plots the regularization parameter (lambda) on the x-axis and the selection probability on the y-axis. The regularization parameter is plotted as lambda/max(lambda) so that it is in the range from 1 to 0. The selection probability corresponds to the number of times a particular marker was chosen at a given value of lambda. Each line in the plot is a marker and represents the stability selection path over the range of regularization parameter. All features that have a maximum selection probability greater than thresh (shown as a dotted horizontal line) are colored and labeled and the remaining features are colored gray and unlabeled. Additionally, you can provide a set of custom labels that will be colored and labeled regardless of their max selection probability. Each feature is labeled with a capital letter and the full name of the feature is indicated in the legend along with the AUC for its curve in parentheses.

Note

Additional features can be passed as strings to the summary method via the add_features argument.

References

Meinshausen, N. and Buhlmann, P. (2010), Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72: 417-473. doi: 10.1111/j.1467-9868.2010.00740.x

Author

Michael R. Mehan, Stu Field, and Robert Kirk DeLisle

Examples

# l1-logistic
withr::with_seed(101, {
  n_feat      <- 20
  n_samp      <- 100
  x           <- matrix(rnorm(n_samp * n_feat), n_samp, n_feat)
  colnames(x) <- paste0("feat", "_", head(letters, n_feat))
  y           <- sample(1:2, n_samp, replace = TRUE)
  stab_sel    <- stability_selection(x, y, kernel = "l1-logistic", r_seed = 101)
})
#> ✓ Using kernel: 'l1-logistic' and 1 core (serial)

# Cox
xcox <- feature_matrix(stabilityselectr:::log_rfu(simdata))

# Note: this works because colnames are already "time" and "status".
#   In 'real' datasets, you may need to rename the final matrix as
#   "time" and "status".

ycox <- select(simdata, time, status) |> as.matrix()
stab_sel_cox <- stability_selection(xcox, ycox, kernel = "Cox", r_seed = 3)
#> ✓ Using kernel: 'Cox' and 1 core (serial)
# Test for class `stab_sel`
is_stab_sel(stab_sel)
#> [1] TRUE

# S3 print method
stab_sel
#> ══ Stability Selection (Kernel: l1-logistic) ══════
#> • Weakness (alpha)            0.8
#> • Weakness Probability (Pw)   0.5
#> • Number of Iterations        100
#> • Standardized                'Yes'
#> • Imputed Outliers            'No'
#> • Lambda Max                  0.144
#> • Lambda Min Ratio            0.1
#> • Permuted Data               'No'
#> • Random Seed                 101
#> ═══════════════════════════════════════════════════════════════════════

# S3 summary method
summary(stab_sel, thresh = 0.6)
#> # A tibble: 20 × 4
#>    feature MaxSelectProb   AUC FDRbound
#>    <chr>           <dbl> <dbl>    <dbl>
#>  1 feat_d          0.945 0.422   0.0125
#>  2 feat_t          0.935 0.416   0.025 
#>  3 feat_a          0.915 0.342   0.0375
#>  4 feat_j          0.915 0.430   0.05  
#>  5 feat_s          0.905 0.343   0.0625
#>  6 feat_m          0.9   0.410   0.075 
#>  7 feat_l          0.89  0.326   0.0875
#>  8 feat_f          0.885 0.300   0.1   
#>  9 feat_q          0.88  0.290   0.113 
#> 10 feat_g          0.87  0.269   0.125 
#> 11 feat_e          0.86  0.283   0.138 
#> 12 feat_n          0.86  0.226   0.15  
#> 13 feat_r          0.86  0.352   0.163 
#> 14 feat_c          0.855 0.326   0.175 
#> 15 feat_k          0.85  0.292   0.188 
#> 16 feat_b          0.845 0.275   0.2   
#> 17 feat_o          0.84  0.243   0.213 
#> 18 feat_h          0.835 0.274   0.225 
#> 19 feat_p          0.83  0.236   0.238 
#> 20 feat_i          0.785 0.242   0.25  
summary(stab_sel, thresh = 0.8, add_features = "feat_c")   # force feat_c into table
#> # A tibble: 19 × 4
#>    feature MaxSelectProb   AUC FDRbound
#>    <chr>           <dbl> <dbl>    <dbl>
#>  1 feat_d          0.945 0.422  0.00417
#>  2 feat_t          0.935 0.416  0.00833
#>  3 feat_a          0.915 0.342  0.0125 
#>  4 feat_j          0.915 0.430  0.0167 
#>  5 feat_s          0.905 0.343  0.0208 
#>  6 feat_m          0.9   0.410  0.025  
#>  7 feat_l          0.89  0.326  0.0292 
#>  8 feat_f          0.885 0.300  0.0333 
#>  9 feat_q          0.88  0.290  0.0375 
#> 10 feat_g          0.87  0.269  0.0417 
#> 11 feat_e          0.86  0.283  0.0458 
#> 12 feat_n          0.86  0.226  0.05   
#> 13 feat_r          0.86  0.352  0.0542 
#> 14 feat_c          0.855 0.326  0.0583 
#> 15 feat_k          0.85  0.292  0.0625 
#> 16 feat_b          0.845 0.275  0.0667 
#> 17 feat_o          0.84  0.243  0.0708 
#> 18 feat_h          0.835 0.274  0.075  
#> 19 feat_p          0.83  0.236  0.0792 
# S3 plot method
plot(stab_sel, thresh = 0.8)