How to use structured outputs in vLLM
Quick answer
Use `vllm` to generate structured outputs by prompting the model to produce JSON or another structured format, then parsing the output string in Python. The `vllm` Python API returns raw text, so you handle structure by designing prompts and parsing the results accordingly.

Prerequisites
- Python 3.8+
- `pip install vllm`
- Basic knowledge of JSON and Python parsing
Setup
Install the vllm package and ensure you have Python 3.8 or higher. No API key is required for local usage.
```shell
pip install vllm
```

Step by step
Use `vllm` to generate structured JSON output by crafting a prompt that instructs the model to respond in JSON format, then parse the output string with Python's `json` module.
```python
from vllm import LLM, SamplingParams
import json

# Initialize the LLM with a local model
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Prompt instructing the model to output JSON only
prompt = '''
Generate a JSON object with keys 'name', 'age', and 'city' for a fictional person.
Respond ONLY with the JSON object.
'''

# Generate output deterministically
outputs = llm.generate([prompt], SamplingParams(temperature=0.0))

# Extract the raw text of the first completion
text_output = outputs[0].outputs[0].text.strip()
print("Raw model output:", text_output)

# Parse the JSON output
try:
    structured_output = json.loads(text_output)
    print("Parsed structured output:", structured_output)
except json.JSONDecodeError as e:
    print("Failed to parse JSON:", e)
```

Output

```
Raw model output: {"name": "Alice", "age": 30, "city": "Seattle"}
Parsed structured output: {'name': 'Alice', 'age': 30, 'city': 'Seattle'}
```

Common variations
- Use different models by changing the `model` parameter in `LLM()`.
- Adjust `SamplingParams` for creativity or determinism (e.g., `temperature=0.7` for more varied outputs).
- For asynchronous usage, use `vllm.engine.async_llm_engine.AsyncLLMEngine` (advanced).
- When running a `vllm` server, query it via the OpenAI-compatible HTTP API with the `openai` SDK and parse structured outputs similarly.
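The server variation can be sketched with only the standard library, POSTing to the OpenAI-compatible `/v1/chat/completions` endpoint. The base URL, model name, and helper names (`build_request`, `query_server`) are assumptions for illustration; you can substitute the `openai` SDK for the raw HTTP call.

```python
import json
import urllib.request

# Base URL of a locally running vllm server (an assumption for illustration;
# started with e.g. `vllm serve meta-llama/Llama-3.1-8B-Instruct`).
BASE_URL = "http://localhost:8000/v1"


def build_request(prompt: str, model: str) -> urllib.request.Request:
    """Build an OpenAI-compatible /chat/completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def query_server(prompt: str, model: str) -> dict:
    """Send the prompt and parse the JSON object in the model's reply."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        body = json.load(resp)
    text_output = body["choices"][0]["message"]["content"].strip()
    return json.loads(text_output)


# Example (requires a running server):
# query_server("Generate a JSON object with keys 'name', 'age', and 'city' "
#              "for a fictional person. Respond ONLY with the JSON object.",
#              "meta-llama/Llama-3.1-8B-Instruct")
```

The parsing step at the end is identical to the local example: the server returns raw text in `choices[0].message.content`, and structure still comes from the prompt plus `json.loads`.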
Troubleshooting
- If JSON parsing fails, verify the prompt strictly instructs the model to output valid JSON only.
- Trim whitespace and remove any extra text before parsing.
- Use `temperature=0.0` to reduce randomness and improve output consistency.
- Check model logs or outputs for partial or malformed JSON and refine the prompt accordingly.
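The cleaning steps above can be collected into a small helper. `extract_json` is a hypothetical name, and its heuristics (stripping markdown fences, slicing out the first `{...}` span) are assumptions about common failure modes, not part of the `vllm` API.

```python
import json


def extract_json(text: str) -> dict:
    """Best-effort extraction of a JSON object from raw model output.

    Handles common failure modes: surrounding whitespace, markdown code
    fences, and extra prose before/after the JSON object.
    """
    text = text.strip()
    # Remove markdown code fences such as ```json ... ```
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    # Slice from the first '{' to the last '}' to drop surrounding prose
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(text[start:end + 1])
```

A helper like this is a fallback, not a substitute for a strict prompt: it will still fail on truncated or genuinely malformed JSON, which usually calls for a prompt refinement or a retry.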
Key Takeaways
- Use explicit prompts to instruct `vllm` to output structured JSON for reliable parsing.
- Parse the raw text output with Python's `json` module to convert it to structured data.
- Set `temperature=0.0` in `SamplingParams` for deterministic structured outputs.
- You can run `vllm` locally or query a running `vllm` server via the OpenAI-compatible API.
- Careful prompt design and output cleaning are essential to avoid JSON parsing errors.