How to tune XGBoost hyperparameters
Quick answer
To tune XGBoost hyperparameters, use techniques like grid search or randomized search with scikit-learn's GridSearchCV or RandomizedSearchCV. Focus on key parameters such as max_depth, learning_rate, n_estimators, and subsample to optimize model accuracy and prevent overfitting.

Prerequisites
- Python 3.8+
- pip install xgboost scikit-learn numpy
- Basic knowledge of Python and machine learning
Setup
Install xgboost and scikit-learn libraries if not already installed. Import necessary modules and prepare your dataset.
pip install xgboost scikit-learn numpy

Step by step
This example demonstrates tuning XGBoost hyperparameters using GridSearchCV on a classification dataset.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
# Load dataset
X, y = load_breast_cancer(return_X_y=True)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define model
model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
# Define hyperparameter grid
param_grid = {
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.1, 0.2],
'n_estimators': [50, 100, 200],
'subsample': [0.8, 1.0]
}
# Setup GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, scoring='accuracy', verbose=1)
# Fit grid search
grid_search.fit(X_train, y_train)
# Best parameters
print('Best hyperparameters:', grid_search.best_params_)
# Evaluate best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print('Test accuracy:', accuracy_score(y_test, y_pred))

Output
Fitting 3 folds for each of 54 candidates, totalling 162 fits
Best hyperparameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100, 'subsample': 1.0}
Test accuracy: 0.956140350877193

Common variations
RandomizedSearchCV samples a fixed number of parameter settings instead of exhaustively trying every combination, which makes tuning faster when the grid is large. Alternatively, libraries like Optuna or Ray Tune offer more advanced (e.g. Bayesian) optimization. For finer-grained control, also tune colsample_bytree, gamma, and min_child_weight.
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats as stats
param_dist = {
'max_depth': stats.randint(3, 10),
'learning_rate': stats.uniform(0.01, 0.3),
'n_estimators': stats.randint(50, 300),
'subsample': stats.uniform(0.6, 0.4)
}
random_search = RandomizedSearchCV(
estimator=XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
param_distributions=param_dist,
n_iter=20,
cv=3,
scoring='accuracy',
verbose=1,
random_state=42
)
random_search.fit(X_train, y_train)
print('Best hyperparameters (random search):', random_search.best_params_)
best_random_model = random_search.best_estimator_
y_pred_random = best_random_model.predict(X_test)
print('Test accuracy (random search):', accuracy_score(y_test, y_pred_random))

Output
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best hyperparameters (random search): {'learning_rate': 0.123, 'max_depth': 5, 'n_estimators': 150, 'subsample': 0.85}
Test accuracy (random search): 0.9473684210526315

Troubleshooting
- If you encounter overfitting, reduce max_depth, lower subsample and colsample_bytree, or increase min_child_weight and gamma.
- If training is slow, reduce n_estimators or use early stopping with XGBClassifier.
- In xgboost 1.x, set use_label_encoder=False and eval_metric explicitly to avoid warnings; the use_label_encoder parameter was removed in xgboost 2.0.
Key Takeaways
- Use GridSearchCV or RandomizedSearchCV from scikit-learn to systematically tune XGBoost hyperparameters.
- Focus on tuning max_depth, learning_rate, n_estimators, and subsample for the biggest gains.
- Consider advanced optimization libraries like Optuna for more efficient hyperparameter search.
- In xgboost 1.x, set use_label_encoder=False and specify eval_metric to avoid deprecation warnings; the parameter was removed in xgboost 2.0.
- Monitor for overfitting and adjust parameters accordingly, using early stopping if needed.