Create k-Fold Partitioning
An adaptation of the rsample::vfold_cv() function and its utilities,
modified to remove testing and class structures not required for local
usage and to accommodate two-level stratification.
Up to two levels of stratification can be specified through the breaks parameter:
- No stratification: breaks = NULL.
- One-level stratification: breaks is a list of length 1, where the name of the list element specifies the stratification variable and the value of the element specifies the stratification structure.
- Two-level stratification: breaks is a list of length 2, where the name of the first element specifies the first-level stratification variable and the value of the first element specifies its stratification structure. Similarly, the name of the second element specifies the second-level stratification variable and its value specifies its stratification structure.
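The three accepted shapes of breaks can be illustrated with a small validity check. This is only a sketch: check_breaks() and the toy data frame are hypothetical names introduced here for illustration and are not part of the documented interface.

```r
# Hypothetical helper (not part of the documented API): reports which of the
# three documented shapes a `breaks` value has, or stops if it has none.
check_breaks <- function(breaks, data) {
  if (is.null(breaks)) return("none")
  stopifnot(
    is.list(breaks),
    length(breaks) %in% c(1L, 2L),          # at most two levels
    !is.null(names(breaks)),                 # elements must be named
    all(nzchar(names(breaks))),
    all(names(breaks) %in% names(data))      # names must be columns of data
  )
  if (length(breaks) == 1L) "one-level" else "two-level"
}

# Toy data, made up for this illustration only.
toy <- data.frame(time = c(1.2, 3.4, 5.6, 7.8), status = c(0L, 1L, 0L, 1L))

check_breaks(NULL, toy)
check_breaks(list(status = NA), toy)
check_breaks(list(time = 4L, status = NA), toy)
```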
Usage
create_kfold(data, k = 10L, repeats = 1L, breaks = NULL, ...)
is.x_split(x)
analysis(object, i = NULL)
assessment(object, i = NULL)
Arguments
- data
A data.frame class object. The data to be subset.
- k
integer(1). The number of partitions of the data set.
- repeats
integer(1). The number of times to repeat the k-fold partitioning.
- breaks
A named list or NULL. If NULL, no stratification is performed. If a named list, it must be of length 1 or 2, where the name of the i-th element indicates the column header of data containing the i-th stratification variable and the value of the i-th element specifies the stratification structure for that variable. See Details for further information.
- ...
Additional arguments passed to the stratification step. Currently limited to depth. The number of stratification bins is based on min(5, floor(n / depth)), where n = length(x).
- x
An R object to test.
- object
An x_split object.
- i
An integer or NULL. If an integer, the split for which the analysis or assessment data is to be retrieved. If NULL, the data for all splits are retrieved.
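The bin-count rule given for depth can be worked through numerically. The helper below is a hypothetical illustration of the stated formula only. The depth values are arbitrary choices for the example, not documented defaults.

```r
# Hypothetical helper illustrating the documented rule min(5, floor(n / depth)).
n_strat_bins <- function(x, depth) {
  n <- length(x)
  min(5, floor(n / depth))
}

# 200 observations with depth = 20: floor(200 / 20) = 10, capped at 5.
bins_large <- n_strat_bins(seq_len(200), depth = 20)

# 60 observations with depth = 20: floor(60 / 20) = 3, below the cap.
bins_small <- n_strat_bins(seq_len(60), depth = 20)
```

Under this rule, a larger depth yields fewer bins for small data sets, and the count never exceeds 5.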
Value
An x_split
object (extension of the list class). Element data
contains the original data. Element splits
contains a tibble. Each row
of the tibble corresponds to an individual split. Column split
contains
lists with named elements "analysis" and "assessment". These
elements contain the indices of data
to be used for each category.
Columns Fold
and Repeat
provide fold and repeat indices for
each corresponding split.
is.x_split()
: Logical. TRUE
if x
inherits from
class x_split
.
analysis()
: A list ... each element containing an object of
class data.frame
.
assessment()
: A list ... each element containing an object of
class data.frame
.
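To make the index-based layout concrete, the following base R sketch builds a comparable structure by hand for an unstratified, unrepeated 4-fold partition. It is an assumption-laden illustration of the idea, not the package's implementation: the toy data, the seed, and all object names are made up here.

```r
set.seed(42)  # arbitrary seed so the toy partition is reproducible
toy <- data.frame(id = 1:12, y = rnorm(12))
k <- 4L

# Randomly assign each row to one of k folds of equal size.
fold_id <- sample(rep(seq_len(k), length.out = nrow(toy)))

# One list per split holding row indices, mirroring the "analysis" and
# "assessment" elements described above.
splits <- lapply(seq_len(k), function(f) {
  list(analysis = which(fold_id != f), assessment = which(fold_id == f))
})

# Retrieving the data for split 2 is then plain row indexing.
analysis_2 <- toy[splits[[2]]$analysis, ]
assessment_2 <- toy[splits[[2]]$assessment, ]
```

Storing indices rather than copies of the data keeps the object small: the data are held once and each split only records which rows it uses.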
Details
For stratification variables that are factor, character, or numeric with
5 or fewer unique values, the stratification structure should be set
as NA
. For example, if stratifying only on status, a binary variable,
breaks = list(status = NA)
.
If the stratification variable is continuous or has more than 5 unique
values, the stratification structure can be specified as either the
number of quantile-based stratification bins or as a numeric vector
providing the bin boundaries (must fully span the range of the
stratification variable). For example, if the stratification variable,
x
, is a continuous variable in [0,1]
, breaks = list(x = 4)
indicates
stratification into 4 bins, the boundaries of which are determined
internally from quantiles of x,
and breaks = list(x = c(0.0, 0.25, 0.75, 1.0))
specifies a 3-bin
structure: [0, 0.25]
, (0.25, 0.75]
, and (0.75, 1.0]
.
Note: the lowest boundary is always taken as inclusive.
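Both ways of specifying bins can be mimicked with base R's cut() and quantile(). This sketch only illustrates the binning behavior described above, including the inclusive lowest boundary. It makes no claim about how the function computes its bins internally, and the vectors x and z are made-up examples.

```r
# Explicit boundaries: three bins [0, 0.25], (0.25, 0.75], (0.75, 1].
x <- c(0, 0.1, 0.25, 0.3, 0.5, 0.75, 0.8, 1)
explicit_bins <- cut(x, breaks = c(0, 0.25, 0.75, 1), include.lowest = TRUE)
as.integer(explicit_bins)  # bin index of each value

# Quantile-based boundaries: four bins with equal counts.
z <- 1:20
q_breaks <- quantile(z, probs = seq(0, 1, length.out = 5))
quantile_bins <- cut(z, breaks = q_breaks, include.lowest = TRUE)
table(quantile_bins)
```

Without include.lowest = TRUE, the minimum value would fall outside the first bin and become NA, which is why the lowest boundary needs to be inclusive.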
Examples
# no stratification
sample_no_strat <- create_kfold(simdata, k = 4L, repeats = 2L)
# stratification on 1 discrete variable
sample_one <- create_kfold(simdata, k = 4L, repeats = 2L,
                           breaks = list(status = NA))
# stratification on 2 variables; 1 continuous + 1 discrete
sample_two <- create_kfold(simdata, k = 4L, repeats = 2L,
                           breaks = list(time = 4L, status = NA))
# retrieve analysis data for 2nd split
an_2 <- analysis(sample_no_strat, 2L)
# retrieve analysis data for all splits
an_all <- analysis(sample_no_strat)
# retrieve assessment data for 2nd split
ass_2 <- assessment(sample_no_strat, 2L)
# retrieve assessment data for all splits
ass_all <- assessment(sample_no_strat)