Code beginner · 3 min read

How to train an XGBoost classifier in Python

Direct answer
Use the xgboost Python library: prepare your dataset, wrap it in a DMatrix, and train with xgboost.train(), or use the scikit-learn-compatible XGBClassifier.fit() API.

Setup

Install
bash
pip install xgboost scikit-learn
Imports
python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Examples

In: Train on Iris dataset with default parameters
Out: Accuracy: 0.97
In: Train on Iris dataset with max_depth=3 and 50 rounds
Out: Accuracy: 0.96
In: Train on Iris dataset with early stopping rounds
Out: Accuracy: 0.97

Integration steps

  1. Import necessary libraries including xgboost and scikit-learn
  2. Load and split your dataset into training and test sets
  3. Convert data into DMatrix format for XGBoost
  4. Define training parameters and number of boosting rounds
  5. Train the model using xgb.train() or XGBClassifier.fit()
  6. Evaluate the model on test data and print accuracy

Full code

python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix for XGBoost
train_dmatrix = xgb.DMatrix(X_train, label=y_train)
test_dmatrix = xgb.DMatrix(X_test, label=y_test)

# Set parameters for multi-class classification
params = {
    "objective": "multi:softmax",  # output class directly
    "num_class": 3,
    "max_depth": 4,
    "eta": 0.3,
    "eval_metric": "merror"
}
num_rounds = 50

# Train model
bst = xgb.train(params, train_dmatrix, num_rounds)

# Predict
preds = bst.predict(test_dmatrix)

# Evaluate
accuracy = accuracy_score(y_test, preds)
print(f"Accuracy: {accuracy:.2f}")
output
Accuracy: 0.97

API trace

Request
json
{"params": {"objective": "multi:softmax", "num_class": 3, "max_depth": 4, "eta": 0.3, "eval_metric": "merror"}, "dtrain": "DMatrix", "num_boost_round": 50}
Response
json
{"booster": "Booster object", "predictions": [0, 1, 2, ...]}
Extract: Use <code>bst.predict(test_dmatrix)</code> to get predicted class labels

Variants

Using XGBClassifier API

Use this higher-level API for simpler syntax and integration with scikit-learn pipelines.

python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = xgb.XGBClassifier(max_depth=4, n_estimators=50, eval_metric="mlogloss")  # use_label_encoder was removed in xgboost 2.0
model.fit(X_train, y_train)
preds = model.predict(X_test)
accuracy = accuracy_score(y_test, preds)
print(f"Accuracy: {accuracy:.2f}")
Early Stopping with Validation Set

Use early stopping to prevent overfitting by monitoring validation performance.

python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

train_dmatrix = xgb.DMatrix(X_train, label=y_train)
val_dmatrix = xgb.DMatrix(X_val, label=y_val)

params = {"objective": "multi:softprob", "num_class": 3, "max_depth": 4, "eta": 0.3, "eval_metric": "mlogloss"}
num_rounds = 100

bst = xgb.train(params, train_dmatrix, num_boost_round=num_rounds, evals=[(val_dmatrix, "validation")], early_stopping_rounds=10)
preds_prob = bst.predict(val_dmatrix)
preds = preds_prob.argmax(axis=1)
accuracy = accuracy_score(y_val, preds)
print(f"Accuracy with early stopping: {accuracy:.2f}")

Performance

Latency: ~100-300ms per training round on small datasets
Cost: Free for local use; cloud costs depend on compute resources
Rate limits: No API rate limits; local library
  • Use smaller <code>max_depth</code> to reduce model complexity
  • Limit <code>num_boost_round</code> to avoid long training times
  • Use early stopping to save compute
Approach | Latency | Cost/call | Best for
xgb.train with DMatrix | ~100-300ms per round | Free (local) | Fine-grained control and performance
XGBClassifier API | ~100-300ms per round | Free (local) | Ease of use and scikit-learn compatibility
Early stopping | Slightly longer due to validation | Free (local) | Preventing overfitting on validation data

Quick tip

Use <code>XGBClassifier</code> for easy scikit-learn integration and <code>DMatrix</code> for optimized training performance.

Common mistake

Passing raw arrays to the low-level <code>xgb.train()</code> API (which expects a <code>DMatrix</code>), or supplying labels whose row count does not match the features, causes training errors.

Verified 2026-04 · xgboost