How to use train_test_split in Scikit-learn

Quick answer
Use train_test_split from sklearn.model_selection to split arrays or DataFrames into random train and test subsets. Pass the features and labels as positional arguments; the function returns a train and a test subset for each input, in the same order. Its parameters control the test size, the random seed (random_state), and whether the data is shuffled.

PREREQUISITES

  • Python 3.8+
  • pip install "scikit-learn>=1.2"

Setup

Install scikit-learn with pip if it is not already installed, then import train_test_split from sklearn.model_selection. Quote the version specifier so the shell does not interpret > as output redirection.

bash
pip install "scikit-learn>=1.2"

Step by step

This example demonstrates splitting a dataset of features and labels into training and testing sets with a 75/25 split and a fixed random seed for reproducibility.

python
from sklearn.model_selection import train_test_split
import numpy as np

# Example dataset
X = np.arange(20).reshape((10, 2))  # 10 samples, 2 features each
y = np.arange(10)  # 10 labels

# Request a 25% test set; with 10 samples the test count rounds up to 3, leaving 7 for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, shuffle=True
)

print("X_train:\n", X_train)
print("X_test:\n", X_test)
print("y_train:", y_train)
print("y_test:", y_test)
output
X_train:
 [[ 0  1]
 [14 15]
 [ 4  5]
 [18 19]
 [ 8  9]
 [ 6  7]
 [12 13]]
X_test:
 [[16 17]
 [ 2  3]
 [10 11]]
y_train: [0 7 2 9 4 3 6]
y_test: [8 1 5]
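
As a quick sanity check on the example above, the sketch below repeats the same call twice and compares the results. It is only an illustration: the names split_a, split_b, same, and paired are made up for this check and are not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))  # row i is [2*i, 2*i + 1]
y = np.arange(10)                   # label i belongs to row i

# Two calls with the same random_state should produce the same split
split_a = train_test_split(X, y, test_size=0.25, random_state=42)
split_b = train_test_split(X, y, test_size=0.25, random_state=42)
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

X_train, X_test, y_train, y_test = split_a

# Each feature row should still sit next to its own label after the split
paired = np.array_equal(X_train[:, 0] // 2, y_train)

print("train/test sizes:", len(X_train), len(X_test))  # 7 and 3 for 10 samples
print("identical across calls:", same)
print("rows still paired with labels:", paired)
```

Checking sizes and pairing like this is a cheap way to confirm that a split behaves as expected before training a model on it.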

Common variations

  • Use train_size instead of test_size to specify training set proportion.
  • Set shuffle=False to split without shuffling (useful for time series).
  • Split multiple arrays simultaneously (e.g., features, labels, sample weights).
python
from sklearn.model_selection import train_test_split

# Multiple arrays split example
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 1, 0, 1]
sample_weights = [0.1, 0.2, 0.3, 0.4]

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, sample_weights, test_size=0.5, random_state=0
)

print("X_train:", X_train)
print("y_train:", y_train)
print("w_train:", w_train)
output
X_train: [[3, 4], [1, 2]]
y_train: [1, 0]
w_train: [0.2, 0.1]
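
The first two variations in the list above (train_size and shuffle=False) can be sketched together. The series list is hypothetical data standing in for ten ordered time steps.

```python
from sklearn.model_selection import train_test_split

# Hypothetical ordered data, e.g. ten consecutive time steps
series = list(range(10))

# train_size sets the training proportion; the remainder becomes the test set.
# shuffle=False keeps the original order, so the test set is the tail of the data.
train, test = train_test_split(series, train_size=0.8, shuffle=False)

print("train:", train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print("test:", test)    # [8, 9]
```

With shuffling disabled, random_state has no effect, so it is left out here.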

Troubleshooting

  • If you get a ValueError about input lengths, ensure all input arrays have the same first dimension length.
  • For reproducible splits, always set random_state to a fixed integer.
  • If you pass both test_size and train_size as fractions, make sure their sum does not exceed 1.
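
The two error cases from the list above can be reproduced on purpose, which may help when recognizing them in a real traceback. This is a minimal sketch: y_short, length_error, and size_error are illustrative names, and the exact error wording may differ between scikit-learn versions, so the code only checks that a ValueError is raised.

```python
from sklearn.model_selection import train_test_split

X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 1, 0, 1]
y_short = [0, 1, 0]  # one label missing on purpose

# Case 1: inputs with different numbers of samples
try:
    train_test_split(X, y_short, test_size=0.5)
    length_error = None
except ValueError as exc:
    length_error = str(exc)

# Case 2: fractional test_size and train_size that add up to more than 1
try:
    train_test_split(X, y, test_size=0.6, train_size=0.6)
    size_error = None
except ValueError as exc:
    size_error = str(exc)

print("length mismatch:", length_error)
print("sizes too large:", size_error)
```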

Key Takeaways

  • Use train_test_split to easily split datasets into train and test subsets with control over size and randomness.
  • Set random_state for reproducible splits across runs.
  • You can split multiple arrays simultaneously, such as features, labels, and sample weights.
  • Disable shuffling for time series or ordered data by setting shuffle=False.
  • Always verify input array lengths match to avoid errors.
Verified 2026-04