How to · Intermediate · 3 min read

How to use speculative decoding in vLLM

Quick answer
Configure speculative decoding when you construct the LLM: pass a speculative_config that names a smaller draft model and sets num_speculative_tokens (older vLLM releases took speculative_model and num_speculative_tokens as direct constructor arguments). The technique accelerates generation by letting the small draft model propose several tokens at a time, which the main model then verifies in a single forward pass.
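The verification step behaves like rejection sampling over the draft model's proposals. The toy sketch below is independent of vLLM and only illustrates the per-token accept rule; the residual resampling on rejection is noted but not implemented:

```python
import random

def accept_draft_token(p_main: float, p_draft: float, rng: random.Random) -> bool:
    # Accept the draft token with probability min(1, p_main / p_draft),
    # where p_main and p_draft are each model's probability for that token.
    # On rejection, the main model resamples from a residual distribution
    # (not shown here), which keeps the output distribution unchanged.
    return rng.random() < min(1.0, p_main / p_draft)

rng = random.Random(0)

# Main model at least as confident as the draft: always accepted.
print(accept_draft_token(0.4, 0.2, rng))  # True

# Main model less confident: accepted about p_main / p_draft of the time.
accept_rate = sum(accept_draft_token(0.1, 0.4, rng) for _ in range(10_000)) / 10_000
print(round(accept_rate, 2))  # close to 0.25
```

The more often the draft's tokens are accepted, the larger the speedup, which is why the draft model should be a smaller sibling of the main model rather than an unrelated one.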

PREREQUISITES

  • Python 3.9+ (recent vLLM releases have dropped 3.8 support)
  • pip install vllm
  • A GPU with enough memory for both the main and the draft model
  • Access to compatible model checkpoints (e.g. from the Hugging Face Hub)

Setup

Install the vllm package and prepare your environment to run vLLM models locally. Ensure you have a compatible model checkpoint downloaded or accessible.

bash
pip install vllm

Step by step

This example enables speculative decoding through the vLLM Python API. Speculative decoding is configured on the LLM constructor, not on SamplingParams: a smaller draft model proposes several tokens at a time, and the larger main model verifies them in one pass, so the output distribution matches ordinary decoding.

python
from vllm import LLM, SamplingParams

# Speculative decoding is configured on the LLM constructor. Exact argument
# names vary across vLLM releases: recent versions take a speculative_config
# dict; older ones took speculative_model and num_speculative_tokens directly.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",         # main (target) model
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # smaller draft model
        "num_speculative_tokens": 5,                  # draft tokens per verification step
    },
)

# Ordinary sampling parameters; nothing speculative-specific goes here
sampling_params = SamplingParams(temperature=0.7, max_tokens=50)

# Prompt to generate text from
prompt = "Explain speculative decoding in vLLM."

# Generate output
outputs = llm.generate([prompt], sampling_params)

# Print the generated text (the completion only; the prompt is not echoed)
print(outputs[0].outputs[0].text)
output
Speculative decoding is a technique that speeds up text generation by using a smaller, faster model to predict multiple tokens ahead. The main model then verifies these tokens, allowing for faster overall generation without sacrificing quality.

Common variations

You can adjust num_speculative_tokens to control the speed-quality-of-speculation tradeoff: more draft tokens per step raise the potential speedup but also the cost of rejected proposals. A smaller draft model proposes tokens faster but tends to have more of them rejected during verification. You can also run vLLM asynchronously with streaming output through its async engine, as in the following example.

python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Import paths and argument names here assume a recent vLLM release and have
# changed between versions; check the docs for the version you have installed.
async def async_generate():
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(
            model="meta-llama/Llama-3.1-8B-Instruct",
            speculative_config={
                "model": "meta-llama/Llama-3.2-1B-Instruct",
                "num_speculative_tokens": 3,
            },
        )
    )
    sampling_params = SamplingParams(temperature=0.7, max_tokens=50)

    # generate() is an async generator that streams partial RequestOutputs
    final_output = None
    async for output in engine.generate(
        "What is speculative decoding?", sampling_params, request_id="req-0"
    ):
        final_output = output
    print(final_output.outputs[0].text)

asyncio.run(async_generate())
output
Speculative decoding is a method that accelerates language model inference by using a smaller model to propose tokens which are then verified by the main model, improving throughput.
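Recent vLLM releases also support draft-model-free n-gram speculation, where proposals are looked up in the prompt itself rather than generated by a second model. The config keys below are a sketch assuming a recent release; names have varied across versions:

```python
# Draft-model-free speculation: proposed tokens are found by matching
# n-grams in the prompt, so no second model occupies GPU memory.
# Key names assume a recent vLLM release and have varied across versions.
ngram_config = {
    "method": "ngram",
    "num_speculative_tokens": 5,  # tokens proposed per verification step
    "prompt_lookup_max": 4,       # longest prompt n-gram used for matching
}

# Passed at construction time, e.g.:
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
#           speculative_config=ngram_config)
print(ngram_config["method"])  # ngram
```

This variant helps most on tasks that reuse prompt text heavily, such as summarization or code editing, since prompt n-grams are then frequently accepted.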

Troubleshooting

  • If you see errors about model loading, verify the model names and that checkpoints are downloaded.
  • If generation is slower than expected, try lowering num_speculative_tokens or a different draft model; the speedup depends on how often the main model accepts the draft's tokens.
  • Ensure your environment has sufficient GPU memory for both models when using speculative decoding.
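If memory is tight, the LLM constructor accepts options that cap vLLM's footprint; gpu_memory_utilization and max_model_len are standard constructor arguments, and the values below are illustrative only:

```python
# Illustrative values; tune for your hardware. Both keys are standard
# vLLM LLM-constructor options.
memory_kwargs = {
    "gpu_memory_utilization": 0.85,  # fraction of GPU memory vLLM may claim
    "max_model_len": 4096,           # shorter max context -> smaller KV cache
}

# Applied at construction time alongside the speculative settings, e.g.:
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", **memory_kwargs)
print(memory_kwargs)
```

Lowering max_model_len is often the easier lever, since the KV cache, not the weights, dominates memory once both models are loaded.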

Key Takeaways

  • Enable speculative decoding in vLLM by passing a speculative_config with a smaller draft model to the LLM constructor.
  • Adjust num_speculative_tokens to balance potential speedup against wasted draft work during generation.
  • Use asynchronous API calls for non-blocking speculative decoding workflows.
  • Ensure both main and speculative models are compatible and properly loaded to avoid runtime errors.
Verified 2026-04 · meta-llama/Llama-3.1-8B-Instruct