
How to use GPTQ quantization with vLLM

Quick answer
Load a GPTQ-quantized model with vLLM by passing the checkpoint path to the LLM constructor; vLLM detects the GPTQ format from the checkpoint's quantization config. GPTQ reduces model size and speeds up inference with little loss in accuracy. Run inference through the generate method, or serve the model over an OpenAI-compatible HTTP API with the vllm serve CLI.

PREREQUISITES

  • Python 3.8+
  • pip install vllm
  • A GPTQ-quantized model checkpoint compatible with vLLM

Setup

Install the vLLM package via pip and prepare your GPTQ-quantized model checkpoint. Ensure Python 3.8 or higher is installed.

bash
pip install vllm

Step by step

Load a GPTQ-quantized model in Python using vLLM by specifying the path to the quantized checkpoint. Then generate text with the model.

python
from vllm import LLM, SamplingParams

# Path to your GPTQ-quantized model directory or checkpoint
quantized_model_path = "/path/to/gptq-quantized-model"

# Load the quantized model; vLLM detects GPTQ from the checkpoint's
# quantization config (or pass quantization="gptq" explicitly)
llm = LLM(model=quantized_model_path, quantization="gptq")

# Generate text with sampling parameters
outputs = llm.generate([
    "Explain GPTQ quantization in simple terms."
], SamplingParams(temperature=0.7, max_tokens=100))

print(outputs[0].outputs[0].text)
output
GPTQ quantization is a technique that compresses large language models by reducing the precision of weights, enabling faster and more memory-efficient inference without significant loss in accuracy.
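Before loading, it can help to sanity-check that the checkpoint really is in GPTQ format. Hugging Face-style GPTQ checkpoints carry a quantization_config block in config.json, which is what loaders inspect to pick the GPTQ code path. A minimal sketch; the detect_quant_method helper and the sample config are illustrative, not part of vLLM:

```python
import json
import os
import tempfile

def detect_quant_method(checkpoint_dir):
    """Return quant_method from config.json, or None if unquantized."""
    with open(os.path.join(checkpoint_dir, "config.json")) as f:
        config = json.load(f)
    return config.get("quantization_config", {}).get("quant_method")

# Write a sample GPTQ-style config into a temporary checkpoint directory
with tempfile.TemporaryDirectory() as ckpt:
    sample = {
        "model_type": "llama",
        "quantization_config": {"quant_method": "gptq", "bits": 4, "group_size": 128},
    }
    with open(os.path.join(ckpt, "config.json"), "w") as f:
        json.dump(sample, f)
    print(detect_quant_method(ckpt))  # gptq
```

If this returns something other than "gptq" (or None), the checkpoint is not a GPTQ model and vLLM will load it unquantized or fail.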

Common variations

You can also serve a GPTQ-quantized model over HTTP with the vllm serve CLI, which exposes an OpenAI-compatible API. Adjust sampling parameters or swap in a different quantized checkpoint as needed.

bash
# Start vLLM's OpenAI-compatible server with the GPTQ-quantized model
vllm serve /path/to/gptq-quantized-model --port 8000

python
# Query the running server via the OpenAI-compatible API
from openai import OpenAI

# A local vLLM server accepts any key unless started with --api-key
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
response = client.chat.completions.create(
    # The served model name defaults to the path unless --served-model-name is set
    model="/path/to/gptq-quantized-model",
    messages=[{"role": "user", "content": "What is GPTQ quantization?"}]
)
print(response.choices[0].message.content)
output
GPTQ quantization compresses model weights to accelerate inference while preserving accuracy, enabling efficient deployment of large language models.

Troubleshooting

  • If you see errors loading the model, verify the checkpoint path points to a valid GPTQ-quantized model compatible with vLLM.
  • For performance issues, ensure you have the latest vLLM version and sufficient GPU memory.
  • If inference is slow, pass multiple prompts to a single generate call so vLLM can batch them, and tune SamplingParams (e.g. lower max_tokens).
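For the GPU-memory bullet above, a back-of-the-envelope estimate is often enough: weights take roughly parameter count × bits per parameter / 8 bytes, so 4-bit GPTQ cuts an FP16 model's weight footprint by about 4x. Note this covers weights only; KV cache and activations need additional memory on top. Illustrative arithmetic:

```python
def weight_footprint_gb(n_params, bits_per_param):
    """Approximate weight memory in GiB: params * bits / 8 bytes."""
    return n_params * bits_per_param / 8 / (1024 ** 3)

n = 7_000_000_000  # a 7B-parameter model
fp16 = weight_footprint_gb(n, 16)
gptq4 = weight_footprint_gb(n, 4)
print(f"FP16: {fp16:.1f} GiB, 4-bit GPTQ: {gptq4:.1f} GiB")
# FP16: 13.0 GiB, 4-bit GPTQ: 3.3 GiB
```

If the 4-bit figure plus a few GiB of headroom exceeds your GPU's memory, the model will not fit even quantized.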

Key Takeaways

  • Load GPTQ-quantized models in vLLM by specifying the quantized checkpoint path in the LLM constructor.
  • Use vLLM's generate method or CLI serve command to run efficient inference with GPTQ models.
  • Ensure compatibility of the quantized checkpoint and keep vLLM updated for best performance.