Difference between StratifiedKFold and StratifiedShuffleSplit in sklearn
StratifiedKFold and StratifiedShuffleSplit are two stratified sampling techniques that can be used to split a dataset into train and test sets in scikit-learn.
The main difference between StratifiedKFold and StratifiedShuffleSplit is in the way the data is split. StratifiedKFold performs k-fold cross-validation: the data is divided into k non-overlapping folds, and the model is trained and evaluated k times, each time using a different fold as the test set and the remaining folds as the training set, so every sample appears in a test set exactly once. StratifiedShuffleSplit, on the other hand, randomly shuffles the data and splits it into train and test sets. The number of splits is specified by the user, and each split is drawn independently, so the test sets of different splits may overlap and some samples may never appear in a test set.
Both StratifiedKFold and StratifiedShuffleSplit aim to maintain the class balance in the train and test sets, meaning that the proportion of samples from each class is approximately the same in the train and test sets as it is in the original dataset. This is particularly important when the classes in the dataset are imbalanced (i.e., there is a disproportionate number of samples from one class compared to the other).
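To see this stratification concretely, here is a minimal, hypothetical sketch (the make_classification toy dataset below is an assumption for illustration, not part of the original examples) that compares the class proportions in the full data, the train set, and the test set:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit

# Toy imbalanced dataset: roughly 90% class 0, 10% class 1 (arbitrary choice)
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

sss_toy = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss_toy.split(X_toy, y_toy))

# Each print should show approximately the same [0.9, 0.1] class proportions
print("full: ", np.bincount(y_toy) / len(y_toy))
print("train:", np.bincount(y_toy[train_index]) / len(train_index))
print("test: ", np.bincount(y_toy[test_index]) / len(test_index))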
Here is an example of how to use StratifiedKFold for k-fold cross-validation:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

# Load data and labels into X and y (as NumPy arrays)
X = ...
y = ...

# Define the model (any scikit-learn estimator)
model = ...

# Set the number of folds for cross-validation
n_folds = 5

# Create the StratifiedKFold object
skf = StratifiedKFold(n_splits=n_folds, shuffle=True)

# Iterate through the folds
for train_index, test_index in skf.split(X, y):
    # Split the data into train and test sets
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train and evaluate the model on this fold
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")
In this example, the data is split into 5 folds, and the model is trained and evaluated 5 times, each time using a different fold as the test set and the remaining folds as the training set.
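As a brief aside, the same StratifiedKFold object can also be passed to scikit-learn's cross_val_score helper instead of writing the loop by hand; this sketch assumes the same X, y, model, and skf placeholders defined in the example above:

from sklearn.model_selection import cross_val_score

# cross_val_score repeats the fit/predict/score cycle for every fold of skf
scores = cross_val_score(model, X, y, cv=skf, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")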
Here is an example of how to use StratifiedShuffleSplit to split the data into train and test sets multiple times:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score

# Load data and labels into X and y (as NumPy arrays)
X = ...
y = ...

# Define the model (any scikit-learn estimator)
model = ...

# Set the number of splits
n_splits = 10

# Create the StratifiedShuffleSplit object
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=42)

# Iterate through the splits
for train_index, test_index in sss.split(X, y):
    # Split the data into train and test sets
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train and evaluate the model on this split
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")
In this example, the data is shuffled and split into train (80%) and test (20%) sets 10 times, according to the number of splits specified; each split is drawn independently of the others.
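A related point, shown here as a small sketch on hypothetical toy data (the make_classification call is an assumption for illustration): because each StratifiedShuffleSplit split is drawn independently, the test sets from different splits can overlap, whereas the test folds produced by StratifiedKFold never do:

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit

X_toy, y_toy = make_classification(n_samples=100, random_state=0)
sss_toy = StratifiedShuffleSplit(n_splits=2, test_size=0.2, random_state=0)

# Collect the test indices produced by the two independent splits
test_sets = [set(test_index) for _, test_index in sss_toy.split(X_toy, y_toy)]

# With independent random splits, some samples typically appear in both test sets
print("Samples shared by the two test sets:", len(test_sets[0] & test_sets[1]))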
In summary, StratifiedKFold is used for k-fold cross-validation, and StratifiedShuffleSplit is used to split the data into train and test sets multiple times, with a random shuffle of the data in each split. Both techniques maintain the class balance in the train and test sets.