How to use OneHotEncoder in Scikit-learn

Use OneHotEncoder from sklearn.preprocessing to turn categorical features into one-hot encoded vectors. Call fit on your data and then transform, or call fit_transform to do both in one step. The output is a SciPy sparse matrix by default, or a dense NumPy array when sparse_output=False.

PREREQUISITES

Python 3.8+
pip install "scikit-learn>=1.2" (the sparse_output parameter used below was added in 1.2)

Setup

Install scikit-learn with pip if it is not already installed, then import OneHotEncoder from sklearn.preprocessing. Quote the version specifier so the shell does not treat > as an output redirect.

bash

pip install "scikit-learn>=1.2"

Step by step

This example encodes a single categorical feature with OneHotEncoder. fit_transform learns the categories and returns the one-hot encoded array in one call; the output columns follow the sorted order stored in encoder.categories_.

python

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Sample categorical data
X = np.array([['red'], ['green'], ['blue'], ['green'], ['red']])

# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False)  # sparse_output=False returns dense array

# Fit and transform the data
X_encoded = encoder.fit_transform(X)

print("Categories:", encoder.categories_)
print("One-hot encoded array:\n", X_encoded)

output

Categories: [array(['blue', 'green', 'red'], dtype='<U5')]
One-hot encoded array:
 [[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

Common variations

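Keep the default sparse_output=True to get a SciPy sparse matrix, which saves memory when features have many categories.
Encode multiple categorical columns by passing a 2D array with one column per feature.
Use handle_unknown='ignore' so that a category not seen during fit is encoded as all zeros for that feature instead of raising an error during transform.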

python

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Multiple categorical features
X_multi = np.array([
    ['red', 'S'],
    ['green', 'M'],
    ['blue', 'L'],
    ['green', 'XL'],
    ['red', 'S']
])

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_multi_encoded = encoder.fit_transform(X_multi)

print("Categories:", encoder.categories_)
print("One-hot encoded array shape:", X_multi_encoded.shape)
print(X_multi_encoded)

output

Categories: [array(['blue', 'green', 'red'], dtype='<U5'), array(['L', 'M', 'S', 'XL'], dtype='<U5')]
One-hot encoded array shape: (5, 7)
[[0. 0. 1. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 1. 0.]]

Troubleshooting

If transform raises a ValueError about unknown categories, create the encoder with handle_unknown='ignore'.
Input must be 2D and array-like; reshape a 1D array with reshape(-1, 1).
Use sparse_output=False to get a dense NumPy array instead of a sparse matrix. On scikit-learn versions before 1.2 this parameter is named sparse.
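
The last two points can be sketched together, assuming a 1D array of made-up color labels. The array is reshaped into a single column before encoding, and the default sparse result is converted to a dense array with toarray(). Variable names are illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

colors = np.array(['red', 'green', 'blue', 'green'])  # 1D, shape (4,)

# reshape(-1, 1) turns the 1D array into a single-column 2D array, shape (4, 1)
colors_2d = colors.reshape(-1, 1)

# With default settings the encoder returns a SciPy sparse matrix
encoder = OneHotEncoder()
encoded_sparse = encoder.fit_transform(colors_2d)

# toarray() converts the sparse result to a dense NumPy array when needed
encoded_dense = encoded_sparse.toarray()

print(colors_2d.shape, sparse.issparse(encoded_sparse), encoded_dense.shape)
```

Converting to dense is convenient for printing small examples, but for features with many categories it gives up the memory savings of the sparse format.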

Key Takeaways

Use OneHotEncoder to convert categorical features into binary one-hot vectors.
Set handle_unknown='ignore' to safely transform unseen categories without errors.
Choose sparse_output=True for memory efficiency or sparse_output=False for dense arrays.
Input data must be 2D; reshape 1D arrays before encoding.
You can encode multiple categorical columns simultaneously by passing a 2D array.

Verified 2026-04
