Code beginner · 3 min read

How to train an XGBoost classifier in Python

Direct answer
Use the xgboost Python library: prepare your dataset, wrap it in a DMatrix, and train with xgboost.train(), or use the scikit-learn-compatible XGBClassifier.fit() API.

Setup

Install
bash
pip install xgboost scikit-learn
Imports
python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Examples

In: Train on Iris dataset with default parameters
Out: Accuracy: 0.97
In: Train on Iris dataset with max_depth=3 and 50 rounds
Out: Accuracy: 0.96
In: Train on Iris dataset with early stopping rounds
Out: Accuracy: 0.97

Integration steps

  1. Import necessary libraries including xgboost and scikit-learn
  2. Load and split your dataset into training and test sets
  3. Convert data into DMatrix format for XGBoost
  4. Define training parameters and number of boosting rounds
  5. Train the model using xgb.train() or XGBClassifier.fit()
  6. Evaluate the model on test data and print accuracy

Full code

python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix for XGBoost
train_dmatrix = xgb.DMatrix(X_train, label=y_train)
test_dmatrix = xgb.DMatrix(X_test, label=y_test)

# Set parameters for multi-class classification
params = {
    "objective": "multi:softmax",  # output class directly
    "num_class": 3,
    "max_depth": 4,
    "eta": 0.3,
    "eval_metric": "merror"
}
num_rounds = 50

# Train model
bst = xgb.train(params, train_dmatrix, num_rounds)

# Predict
preds = bst.predict(test_dmatrix)

# Evaluate
accuracy = accuracy_score(y_test, preds)
print(f"Accuracy: {accuracy:.2f}")
output
Accuracy: 0.97

API trace

Request
json
{"params": {"objective": "multi:softmax", "num_class": 3, "max_depth": 4, "eta": 0.3, "eval_metric": "merror"}, "dtrain": "DMatrix", "num_boost_round": 50}
Response
json
{"booster": "Booster object", "predictions": [0, 1, 2, ...]}
Extract: Use <code>bst.predict(test_dmatrix)</code> to get predicted class labels

Variants

Using XGBClassifier API

Use this higher-level API for simpler syntax and integration with scikit-learn pipelines.

python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(max_depth=4, n_estimators=50, eval_metric="mlogloss")  # use_label_encoder was removed in xgboost 2.0
model.fit(X_train, y_train)
preds = model.predict(X_test)
accuracy = accuracy_score(y_test, preds)
print(f"Accuracy: {accuracy:.2f}")
Early Stopping with Validation Set

Use early stopping to prevent overfitting by monitoring validation performance.

python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

train_dmatrix = xgb.DMatrix(X_train, label=y_train)
val_dmatrix = xgb.DMatrix(X_val, label=y_val)

params = {"objective": "multi:softprob", "num_class": 3, "max_depth": 4, "eta": 0.3, "eval_metric": "mlogloss"}
num_rounds = 100

bst = xgb.train(params, train_dmatrix, num_boost_round=num_rounds, evals=[(val_dmatrix, "validation")], early_stopping_rounds=10)
preds_prob = bst.predict(val_dmatrix)
preds = preds_prob.argmax(axis=1)
accuracy = accuracy_score(y_val, preds)
print(f"Accuracy with early stopping: {accuracy:.2f}")

Performance

Latency: ~100-300ms per training round on small datasets
Cost: Free for local use; cloud costs depend on compute resources
Rate limits: No API rate limits; local library
  • Use smaller <code>max_depth</code> to reduce model complexity
  • Limit <code>num_boost_round</code> to avoid long training times
  • Use early stopping to save compute
Approach | Latency | Cost/call | Best for
xgb.train with DMatrix | ~100-300ms per round | Free (local) | Fine-grained control and performance
XGBClassifier API | ~100-300ms per round | Free (local) | Ease of use and scikit-learn compatibility
Early stopping | Slightly longer due to validation | Free (local) | Preventing overfitting on validation data

Quick tip

Use <code>XGBClassifier</code> for easy scikit-learn integration and <code>DMatrix</code> for optimized training performance.

Common mistake

Passing raw arrays to the low-level <code>xgb.train()</code> API (which expects a <code>DMatrix</code>), or supplying labels whose row count does not match the features, causes training errors.

Verified 2026-04 · xgboost