Difference between StratifiedKFold and StratifiedShuffleSplit in sklearn
StratifiedKFold and StratifiedShuffleSplit are two stratified sampling techniques that can be used to split a dataset into train and test sets in scikit-learn.
The main difference between StratifiedKFold and StratifiedShuffleSplit is in the way the data is split. StratifiedKFold performs k-fold cross-validation: the data is divided into k non-overlapping folds, and the model is trained and evaluated k times, each time using a different fold as the test set and the remaining folds as the training set, so every sample appears in a test set exactly once. StratifiedShuffleSplit, on the other hand, randomly shuffles the data and splits it into train and test sets. The number of splits is specified by the user, and each split is drawn independently, so the test sets of different splits may overlap and some samples may never appear in a test set.
Both StratifiedKFold and StratifiedShuffleSplit aim to maintain the class balance in the train and test sets, meaning that the proportion of samples from each class is approximately the same in the train and test sets as it is in the original dataset. This is particularly important when the classes in the dataset are imbalanced (i.e., there is a disproportionate number of samples from one class compared to the other).
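To see this stratification concretely, here is a minimal, hypothetical sketch (the make_classification toy dataset below is an assumption for illustration, not part of the original examples) that compares the class proportions in the full data, the train set, and the test set:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit

# Toy imbalanced dataset: roughly 90% class 0, 10% class 1 (arbitrary choice)
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

sss_toy = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss_toy.split(X_toy, y_toy))

# Each print should show approximately the same [0.9, 0.1] class proportions
print("full: ", np.bincount(y_toy) / len(y_toy))
print("train:", np.bincount(y_toy[train_index]) / len(train_index))
print("test: ", np.bincount(y_toy[test_index]) / len(test_index))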
Here is an example of how to use StratifiedKFold for k-fold cross-validation:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

# Load data and labels into X and y (as NumPy arrays)
X = ...
y = ...

# Define the model (any scikit-learn estimator)
model = ...

# Set the number of folds for cross-validation
n_folds = 5

# Create the StratifiedKFold object
skf = StratifiedKFold(n_splits=n_folds, shuffle=True)

# Iterate through the folds
for train_index, test_index in skf.split(X, y):
    # Split the data into train and test sets
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train and evaluate the model on this fold
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")
In this example, the data is split into 5 folds, and the model is trained and evaluated 5 times, each time using a different fold as the test set and the remaining folds as the training set.
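As a brief aside, the same StratifiedKFold object can also be passed to scikit-learn's cross_val_score helper instead of writing the loop by hand; this sketch assumes the same X, y, model, and skf placeholders defined in the example above:

from sklearn.model_selection import cross_val_score

# cross_val_score repeats the fit/predict/score cycle for every fold of skf
scores = cross_val_score(model, X, y, cv=skf, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")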
Here is an example of how to use StratifiedShuffleSplit to split the data into train and test sets multiple times:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import accuracy_score

# Load data and labels into X and y (as NumPy arrays)
X = ...
y = ...

# Define the model (any scikit-learn estimator)
model = ...

# Set the number of splits
n_splits = 10

# Create the StratifiedShuffleSplit object
sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=42)

# Iterate through the splits
for train_index, test_index in sss.split(X, y):
    # Split the data into train and test sets
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train and evaluate the model on this split
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.2f}")
In this example, the data is shuffled and split into train (80%) and test (20%) sets 10 times, according to the number of splits specified; each split is drawn independently of the others.
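A related point, shown here as a small sketch on hypothetical toy data (the make_classification call is an assumption for illustration): because each StratifiedShuffleSplit split is drawn independently, the test sets from different splits can overlap, whereas the test folds produced by StratifiedKFold never do:

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit

X_toy, y_toy = make_classification(n_samples=100, random_state=0)
sss_toy = StratifiedShuffleSplit(n_splits=2, test_size=0.2, random_state=0)

# Collect the test indices produced by the two independent splits
test_sets = [set(test_index) for _, test_index in sss_toy.split(X_toy, y_toy)]

# With independent random splits, some samples typically appear in both test sets
print("Samples shared by the two test sets:", len(test_sets[0] & test_sets[1]))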
In summary, StratifiedKFold is used for k-fold cross-validation, and StratifiedShuffleSplit is used to split the data into train and test sets multiple times, with a random shuffle of the data in each split. Both techniques maintain the class balance in the train and test sets.