How to use train_test_split in Scikit-learn

Q: How to use train_test_split in Scikit-learn

Use train_test_split from sklearn.model_selection to split your dataset arrays or DataFrames into random train and test subsets. It accepts features and labels as inputs and returns split subsets, with parameters to control test size, random state, and shuffling.

PREREQUISITES

Python 3.8+
pip install "scikit-learn>=1.2"

Setup

Install scikit-learn with pip if it is not already installed, then import train_test_split from sklearn.model_selection. Quote the version specifier so the shell does not treat >= as an output redirect.

bash

pip install "scikit-learn>=1.2"

Step by step

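This example splits a dataset of features and labels into training and test sets with test_size=0.25 and a fixed random seed for reproducibility. Because the dataset has only 10 samples, the test count (2.5) is rounded up to 3, so the actual split is 7 training samples and 3 test samples.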

python

from sklearn.model_selection import train_test_split
import numpy as np

# Example dataset
X = np.arange(20).reshape((10, 2))  # 10 samples, 2 features each
y = np.arange(10)  # 10 labels

# Split dataset: request 25% for testing (rounded up to 3 of 10 samples)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, shuffle=True
)

print("X_train:\n", X_train)
print("X_test:\n", X_test)
print("y_train:", y_train)
print("y_test:", y_test)

output

X_train:
 [[ 0  1]
 [14 15]
 [ 4  5]
 [18 19]
 [ 8  9]
 [ 6  7]
 [12 13]]
X_test:
 [[16 17]
 [ 2  3]
 [10 11]]
y_train: [0 7 2 9 4 3 6]
y_test: [8 1 5]
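
A fractional test_size is converted to a whole number of samples, with the test count rounded up. The sketch below reuses the same toy arrays to check the resulting shapes and to confirm that features and labels stay aligned after shuffling. The `aligned` variable is only an illustrative helper, not part of the scikit-learn API.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))  # 10 samples, 2 features each
y = np.arange(10)                   # label i belongs to row i

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 0.25 * 10 = 2.5, which is rounded up to 3 test samples
print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)

# Rows and labels are shuffled together: row i of X is [2*i, 2*i + 1],
# so the first feature of every training row should equal 2 * its label
aligned = np.array_equal(X_train[:, 0], 2 * y_train)
print(aligned)  # True
```

On small datasets it is worth printing the shapes like this, since the effective ratio can differ noticeably from the requested one.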

Common variations

Use train_size instead of test_size to specify training set proportion.
Set shuffle=False to split without shuffling (useful for time series).
Split multiple arrays simultaneously (e.g., features, labels, sample weights).

python

from sklearn.model_selection import train_test_split

# Multiple arrays split example
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 1, 0, 1]
sample_weights = [0.1, 0.2, 0.3, 0.4]

X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
    X, y, sample_weights, test_size=0.5, random_state=0
)

print("X_train:", X_train)
print("y_train:", y_train)
print("w_train:", w_train)

output

X_train: [[7, 8], [1, 2]]
y_train: [1, 0]
w_train: [0.4, 0.1]
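
The first two variations can be combined. The following is a minimal sketch, assuming the rows of a toy array are already in time order: train_size=0.8 sets the training proportion, and shuffle=False keeps the original order so the test set is simply the tail of the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))  # rows are assumed to be in time order
y = np.arange(10)

# train_size=0.8 keeps the first 8 samples for training;
# the remaining 2 samples become the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, shuffle=False
)

print(y_train)  # [0 1 2 3 4 5 6 7]
print(y_test)   # [8 9]
```

With shuffle=False, random_state has no effect, so it is left out here.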

Troubleshooting

If you get a ValueError about inconsistent numbers of samples, ensure all input arrays have the same length along their first dimension.
For reproducible splits, always set random_state to a fixed integer.
If you pass both test_size and train_size as fractions, check that they sum to at most 1.
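
The checks above can be reproduced on a toy dataset. This is a rough sketch rather than a definitive recipe: the variable names (`length_error`, `size_error`) are illustrative, and the exact error messages may differ between scikit-learn versions, so they are printed rather than assumed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape((10, 2))
y = np.arange(10)

# 1. Mismatched lengths: 10 feature rows but only 9 labels
length_error = None
try:
    train_test_split(X, y[:9], test_size=0.25)
except ValueError as err:
    length_error = str(err)
    print("Length mismatch:", length_error)

# 2. Fractions that sum to more than 1 (0.8 + 0.5)
size_error = None
try:
    train_test_split(X, y, train_size=0.8, test_size=0.5)
except ValueError as err:
    size_error = str(err)
    print("Bad sizes:", size_error)

# 3. The same random_state produces the same split on every call
_, _, _, y_test_a = train_test_split(X, y, test_size=0.25, random_state=7)
_, _, _, y_test_b = train_test_split(X, y, test_size=0.25, random_state=7)
same_split = np.array_equal(y_test_a, y_test_b)
print(same_split)  # True
```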

Key Takeaways

Use train_test_split to easily split datasets into train and test subsets with control over size and randomness.
Set random_state for reproducible splits across runs.
You can split multiple arrays simultaneously, such as features, labels, and sample weights.
Disable shuffling for time series or ordered data by setting shuffle=False.
Always verify input array lengths match to avoid errors.

Verified 2026-04
