How to use OneHotEncoder in Scikit-learn
Quick answer
Use OneHotEncoder from sklearn.preprocessing to transform categorical features into one-hot encoded vectors. Fit the encoder on your data, then call transform (or do both at once with fit_transform) to get the encoded output as a sparse matrix (the default) or a dense array.
Prerequisites
- Python 3.8+
- pip install "scikit-learn>=1.2"
Setup
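Install scikit-learn with pip if it is not already installed, then import OneHotEncoder from sklearn.preprocessing. The examples below use the sparse_output parameter, which was added in scikit-learn 1.2 (earlier releases call it sparse), so install version 1.2 or later. Quote the requirement so the shell does not treat >= as an output redirect.
pip install "scikit-learn>=1.2"
Step by step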
This example shows how to encode a simple categorical feature array using OneHotEncoder. It fits the encoder and transforms the data into one-hot encoded format.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Sample categorical data
X = np.array([['red'], ['green'], ['blue'], ['green'], ['red']])
# Initialize OneHotEncoder
encoder = OneHotEncoder(sparse_output=False) # sparse_output=False returns dense array
# Fit and transform the data
X_encoded = encoder.fit_transform(X)
print("Categories:", encoder.categories_)
print("One-hot encoded array:\n", X_encoded)
output
Categories: [array(['blue', 'green', 'red'], dtype='<U5')]
One-hot encoded array:
 [[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
Common variations
- Use sparse_output=True (the default) to get a sparse matrix output for memory efficiency.
- Encode multiple categorical columns by passing a 2D array with multiple features.
- Use handle_unknown='ignore' to avoid errors on unseen categories during transform; an unseen category is encoded as all zeros for that feature.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# Multiple categorical features
X_multi = np.array([
['red', 'S'],
['green', 'M'],
['blue', 'L'],
['green', 'XL'],
['red', 'S']
])
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_multi_encoded = encoder.fit_transform(X_multi)
print("Categories:", encoder.categories_)
print("One-hot encoded array shape:", X_multi_encoded.shape)
print(X_multi_encoded)
output
Categories: [array(['blue', 'green', 'red'], dtype='<U5'), array(['L', 'M', 'S', 'XL'], dtype='<U5')]
One-hot encoded array shape: (5, 7)
[[0. 0. 1. 0. 0. 1. 0.]
 [0. 1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 1. 0.]]
Troubleshooting
- If you get a ValueError about unknown categories during transform, set handle_unknown='ignore' when creating the encoder.
- Ensure input data is 2D array-like; reshape 1D arrays with reshape(-1, 1).
- Use sparse_output=False if you want a dense numpy array output instead of a sparse matrix.
Key Takeaways
- Use OneHotEncoder to convert categorical features into binary one-hot vectors.
- Set handle_unknown='ignore' to safely transform unseen categories without errors.
- Choose sparse_output=True for memory efficiency or sparse_output=False for dense arrays.
- Input data must be 2D; reshape 1D arrays before encoding.
- You can encode multiple categorical columns simultaneously by passing a 2D array.