
Adaptation of the rsample::vfold_cv() function and its utilities. Modified to remove testing and class structures not required for local usage and to accommodate two-level stratification. Up to 2 levels of stratification can be specified through the breaks parameter:

  • No stratification: breaks = NULL

  • One level stratification: breaks is a list of length 1, where the name of the list element specifies the stratification variable and the value of the element specifies the stratification structure.

  • Two level stratification: breaks is a list of length 2, where the name of the first element specifies the first level stratification variable and the value of the first element specifies its stratification structure. Similarly, the name of the second element specifies the second level stratification variable and its value specifies its stratification structure.
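The three forms of breaks listed above can be sketched as plain R lists. The column names status and time, the toy data frame, and the check_breaks() helper are hypothetical and only illustrate the rules as stated here; they are not part of the package.

```r
# Toy data with one discrete and one continuous column (hypothetical names)
toy <- data.frame(status = c(0L, 1L, 1L, 0L), time = c(1.2, 3.4, 0.7, 2.9))

breaks_none <- NULL                          # no stratification
breaks_one  <- list(status = NA)             # one level, discrete variable
breaks_two  <- list(time = 4L, status = NA)  # two levels, continuous then discrete

# Hypothetical validator mirroring the documented constraints:
# NULL, or a named list of length 1 or 2 whose names are columns of `data`
check_breaks <- function(breaks, data) {
  if (is.null(breaks)) return(invisible(TRUE))
  stopifnot(
    is.list(breaks),
    length(breaks) %in% 1:2,
    !is.null(names(breaks)),
    all(names(breaks) %in% names(data))
  )
  invisible(TRUE)
}

check_breaks(breaks_none, toy)
check_breaks(breaks_one, toy)
check_breaks(breaks_two, toy)
```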

analysis() : [Questioning]

assessment() : [Questioning]

Usage

create_kfold(data, k = 10L, repeats = 1L, breaks = NULL, ...)

is.x_split(x)

analysis(object, i = NULL)

assessment(object, i = NULL)

Arguments

data

A data.frame class object. The data to be subset.

k

integer(1). The number of partitions of the data set.

repeats

integer(1). The number of times to repeat the k-fold partitioning.

breaks

A named list or NULL. If NULL, no stratification is performed. If a named list, it must be of length 1 or 2, where the name of the i-th element gives the column name in data of the i-th stratification variable and the value of the i-th element specifies the stratification structure for that variable. See Details for further information.

...

Variables to be passed to the stratification step. Currently limited to depth. The number of stratification bins is based on min(5, floor(n / depth)), where n = length(x).

x

An R object to test.

object

An x_split object.

i

An integer or NULL. If an integer, the index of the split for which the analysis or assessment data is to be retrieved. If NULL, the data for all splits are retrieved.
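As a rough illustration of the bin-count rule given for the depth variable, the expression min(5, floor(n / depth)) can be evaluated directly. The helper name n_strat_bins() and the depth values used are assumptions for illustration; the package's actual default for depth is not stated here.

```r
# Hypothetical helper: evaluates the documented expression for a vector x
n_strat_bins <- function(x, depth) {
  n <- length(x)
  min(5, floor(n / depth))
}

x_long  <- seq_len(200)  # n = 200
x_short <- seq_len(60)   # n = 60

n_strat_bins(x_long, depth = 20)   # floor(200 / 20) = 10, capped at 5
n_strat_bins(x_short, depth = 20)  # floor(60 / 20) = 3
```

With few observations relative to depth, the expression yields fewer than 5 bins, which keeps each bin from becoming too sparsely populated.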

Value

An x_split object (extension of the list class). Element data contains the original data. Element splits contains a tibble. Each row of the tibble corresponds to an individual split. Column split contains lists with named elements "analysis" and "assessment". These elements contain the indices of data to be used for each category. Columns Fold and Repeat provide fold and repeat indices for each corresponding split.

is.x_split(): Logical. TRUE if x inherits from class x_split.

analysis(): A list ... each element containing an object of class data.frame.

assessment(): A list ... each element containing an object of class data.frame.
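To make the layout of the returned object more concrete, the sketch below hand-builds a mock with the same element names. It uses a base data.frame with a list column as a stand-in for the tibble, and the data and indices are invented; it is not produced by create_kfold() and only shows how stored indices map back to rows of data.

```r
# Invented data and split indices, for illustration only
dat <- data.frame(id = 1:6, y = c(2.1, 3.4, 1.9, 4.0, 2.7, 3.3))

split_1 <- list(analysis = c(1L, 2L, 4L, 5L), assessment = c(3L, 6L))
split_2 <- list(analysis = c(1L, 2L, 3L, 6L), assessment = c(4L, 5L))

# Base data.frame standing in for the tibble; `split` is a list column
splits <- data.frame(Fold = 1:2, Repeat = c(1L, 1L))
splits$split <- list(split_1, split_2)

mock <- structure(list(data = dat, splits = splits),
                  class = c("x_split", "list"))

# Rows of `data` held out for assessment in the 2nd split
idx <- mock$splits$split[[2]]$assessment
assessment_2 <- mock$data[idx, ]
assessment_2
```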

Details

For stratification variables that are factor, character, or numeric with 5 or fewer unique values, the stratification structure should be set as NA. For example, if stratifying only on status, a binary variable, breaks = list(status = NA).

If the stratification variable is continuous or has more than 5 unique values, the stratification structure can be specified as either the number of quantile-based stratification bins or as a numeric vector providing the bin boundaries (must fully span the range of the stratification variable). For example, if the stratification variable, x, is a continuous variable in [0,1], breaks = list(x = 4) indicates stratification into 4 bins, the boundaries of which are determined internally using

quantile(x, probs = seq(0.0, 1.0, length.out = 5))

and breaks = list(x = c(0.0, 0.25, 0.75, 1.0)) specifies a 3 bin structure: [0, 0.25], (0.25, 0.75], and (0.75, 1.0]. Note: the lowest boundary is always taken as inclusive.
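The two ways of specifying bins for a continuous variable can be reproduced with base R. The use of cut() with include.lowest = TRUE is an assumption about how bin membership might be assigned, chosen because it matches the interval notation above; the simulated x is likewise illustrative.

```r
set.seed(42)
x <- runif(200)  # illustrative continuous variable in [0, 1]

# Number of bins given (4): boundaries come from the quantiles of x
q_bounds <- quantile(x, probs = seq(0.0, 1.0, length.out = 5))
bins_quantile <- cut(x, breaks = q_bounds, include.lowest = TRUE)
nlevels(bins_quantile)  # 4 bins

# Explicit boundaries: [0, 0.25], (0.25, 0.75], (0.75, 1.0]
bounds <- c(0.0, 0.25, 0.75, 1.0)
bins_explicit <- cut(c(0, 0.25, 0.5, 1), breaks = bounds, include.lowest = TRUE)
as.integer(bins_explicit)  # bin index for each of the four test values
```

Because the lowest boundary is inclusive, the value 0 falls in the first bin rather than being dropped, and 0.25 also belongs to the first bin since each upper boundary is closed.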

Examples

# no stratification
sample_no_strat <- create_kfold(simdata, k = 4L, repeats = 2L)

# stratification on 1 discrete variable
sample_one <- create_kfold(simdata, k = 4L, repeats = 2L,
                           breaks = list(status = NA))

# stratification on 2 variables; 1 continuous + 1 discrete
sample_two <- create_kfold(simdata, k = 4L, repeats = 2L,
                           breaks = list(time = 4L, status = NA))

# retrieve analysis data for 2nd split
an_2 <- analysis(sample_no_strat, 2L)

# retrieve analysis data for all splits
an_all <- analysis(sample_no_strat)

# retrieve assessment data for 2nd split
ass_2 <- assessment(sample_no_strat, 2L)

# retrieve assessment data for all splits
ass_all <- assessment(sample_no_strat)