How to use LightGBM in Python
Direct answer
Use the lightgbm Python package to train and predict with LightGBM models: create a Dataset, train with lgb.train(), and predict with model.predict().
Setup
Install
pip install lightgbm numpy scikit-learn
Imports
import lightgbm as lgb
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
Examples
in: Train LightGBM on the breast cancer dataset with default parameters
out: Accuracy on test set: 0.95
in: Train LightGBM with 100 boosting rounds and early stopping
out: Accuracy on test set: 0.96
in: Predict probabilities for test samples
out: [0.02, 0.98, 0.15, ...]
Integration steps
- Install LightGBM and dependencies using pip
- Load and split your dataset into training and testing sets
- Create a LightGBM Dataset object from training data
- Define training parameters and train the model with lgb.train()
- Use the trained model to predict on test data
- Evaluate predictions with metrics like accuracy_score
Full code
import lightgbm as lgb
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target
# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create LightGBM dataset
train_data = lgb.Dataset(X_train, label=y_train)
# Define parameters
params = {
'objective': 'binary',
'metric': 'binary_logloss',
'verbose': -1
}
# Train model
model = lgb.train(params, train_data, num_boost_round=100)
# Predict
y_pred_prob = model.predict(X_test)
# Convert probabilities to binary predictions
y_pred = (y_pred_prob > 0.5).astype(int)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on test set: {accuracy:.2f}")
Output
Accuracy on test set: 0.95
API trace
LightGBM runs in-process, so no network request is involved. The trace below is only a schematic, JSON-style view of what goes into lgb.train() and what the returned model holds.
Request
{"params": {"objective": "binary", "metric": "binary_logloss", "verbose": -1}, "train_data": {"features": [[...]], "labels": [...]}, "num_boost_round": 100}
Response
{"model": {"booster": "gbdt", "num_trees": 100, "feature_names": [...], "tree_info": [...]}}
Extract
Use the model object returned by lgb.train() to call model.predict() for inference.
Variants
Using LightGBM sklearn API
Use the sklearn API for simpler integration with scikit-learn pipelines and a familiar fit/predict interface.
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize classifier
clf = lgb.LGBMClassifier(n_estimators=100)
# Train
clf.fit(X_train, y_train)
# Predict
y_pred = clf.predict(X_test)
# Evaluate
print(f"Accuracy on test set: {accuracy_score(y_test, y_pred):.2f}")
Early stopping with validation set
Use early stopping to prevent overfitting by monitoring validation performance.
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
train_data = lgb.Dataset(X_train, label=y_train)
val_data = lgb.Dataset(X_val, label=y_val, reference=train_data)
params = {'objective': 'binary', 'metric': 'binary_logloss', 'verbose': -1}
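# The early_stopping_rounds argument of lgb.train() was removed in LightGBM 4.0; use the callback instead
model = lgb.train(params, train_data, num_boost_round=1000, valid_sets=[val_data], callbacks=[lgb.early_stopping(stopping_rounds=10)])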
y_pred_prob = model.predict(X_val, num_iteration=model.best_iteration)
y_pred = (y_pred_prob > 0.5).astype(int)
print(f"Accuracy with early stopping: {accuracy_score(y_val, y_pred):.2f}")
Multiclass classification example
Use this pattern for multiclass classification tasks with LightGBM.
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
train_data = lgb.Dataset(X_train, label=y_train)
params = {'objective': 'multiclass', 'num_class': 3, 'metric': 'multi_logloss', 'verbose': -1}
model = lgb.train(params, train_data, num_boost_round=100)
y_pred_prob = model.predict(X_test)
y_pred = y_pred_prob.argmax(axis=1)
print(f"Multiclass accuracy: {accuracy_score(y_test, y_pred):.2f}")
Performance
Latency: ~200 ms per 100 boosting rounds on a typical CPU (indicative only; actual time depends on dataset size and hardware)
Cost: Free open-source library, no API cost
Rate limits: No rate limits, runs locally
- Use early stopping to cut training time once the validation metric stops improving
- Limit num_boost_round to avoid overfitting and long training
- Use LightGBM's native categorical feature support to reduce preprocessing (no one-hot encoding needed)
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| LightGBM native API | ~200ms | Free | Full control and speed |
| LightGBM sklearn API | ~250ms | Free | Easy integration with sklearn pipelines |
| LightGBM with early stopping | ~220ms | Free | Prevent overfitting with validation |
Quick tip
Use LightGBM's Dataset class to efficiently handle large datasets and speed up training.
Common mistake
Forgetting to convert predicted probabilities to class labels when doing classification.
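A minimal sketch of that conversion, using NumPy only: the helper names binary_labels and multiclass_labels are hypothetical, and the 0.5 threshold is a common default rather than a requirement. For the binary objective, model.predict() returns one probability per row; for multiclass it returns one row of class probabilities per sample.

```python
import numpy as np


def binary_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels."""
    return (np.asarray(probabilities) > threshold).astype(int)


def multiclass_labels(probabilities):
    """Pick the class with the highest probability in each row."""
    return np.asarray(probabilities).argmax(axis=1)


# Shapes mirror what model.predict() returns for each objective
binary_probs = np.array([0.02, 0.98, 0.15])
multi_probs = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])

print(binary_labels(binary_probs))     # [0 1 0]
print(multiclass_labels(multi_probs))  # [0 2]
```

The sklearn API's clf.predict() already returns class labels, so this step applies only to the native lgb.train() interface.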