How to use structured outputs in vLLM
Quick answer
Use `vllm` to generate structured outputs by prompting the model to produce JSON or another structured format, then parsing the output string in Python. The `vllm` Python API returns raw text, so you handle structure by designing prompts and parsing the results accordingly.

Prerequisites
- Python 3.8+
- `pip install vllm`
- Basic knowledge of JSON and Python parsing
Setup
Install the vllm package and ensure you have Python 3.8 or higher. No API key is required for local usage.
```shell
pip install vllm
```

Step by step
Use `vllm` to generate structured JSON output by crafting a prompt that instructs the model to respond in JSON format, then parse the output string with Python's `json` module.
```python
from vllm import LLM, SamplingParams
import json

# Initialize the LLM with a local model
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Prompt instructing the model to output JSON only
prompt = '''
Generate a JSON object with keys 'name', 'age', and 'city' for a fictional person.
Respond ONLY with the JSON object.
'''

# Generate output deterministically
outputs = llm.generate([prompt], SamplingParams(temperature=0.0))

# Extract the raw text of the first completion
text_output = outputs[0].outputs[0].text.strip()
print("Raw model output:", text_output)

# Parse the JSON output
try:
    structured_output = json.loads(text_output)
    print("Parsed structured output:", structured_output)
except json.JSONDecodeError as e:
    print("Failed to parse JSON:", e)
```

Output

```
Raw model output: {"name": "Alice", "age": 30, "city": "Seattle"}
Parsed structured output: {'name': 'Alice', 'age': 30, 'city': 'Seattle'}
```

Common variations
- Use different models by changing the `model` parameter in `LLM()`.
- Adjust `SamplingParams` for creativity or determinism (e.g., `temperature=0.7` for more varied outputs).
- For asynchronous usage, use `vllm.engine.async_llm_engine.AsyncLLMEngine` (advanced).
- When running a `vllm` server, query it via the OpenAI-compatible HTTP API with the `openai` SDK and parse structured outputs similarly.
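The server variation can be sketched with only the standard library, POSTing to the OpenAI-compatible `/v1/chat/completions` endpoint. The base URL, model name, and helper names (`build_request`, `query_server`) are assumptions for illustration; you can substitute the `openai` SDK for the raw HTTP call.

```python
import json
import urllib.request

# Base URL of a locally running vllm server (an assumption for illustration;
# started with e.g. `vllm serve meta-llama/Llama-3.1-8B-Instruct`).
BASE_URL = "http://localhost:8000/v1"


def build_request(prompt: str, model: str) -> urllib.request.Request:
    """Build an OpenAI-compatible /chat/completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def query_server(prompt: str, model: str) -> dict:
    """Send the prompt and parse the JSON object in the model's reply."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        body = json.load(resp)
    text_output = body["choices"][0]["message"]["content"].strip()
    return json.loads(text_output)


# Example (requires a running server):
# query_server("Generate a JSON object with keys 'name', 'age', and 'city' "
#              "for a fictional person. Respond ONLY with the JSON object.",
#              "meta-llama/Llama-3.1-8B-Instruct")
```

The parsing step at the end is identical to the local example: the server returns raw text in `choices[0].message.content`, and structure still comes from the prompt plus `json.loads`.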
Troubleshooting
- If JSON parsing fails, verify the prompt strictly instructs the model to output valid JSON only.
- Trim whitespace and remove any extra text before parsing.
- Use `temperature=0.0` to reduce randomness and improve output consistency.
- Check model logs or outputs for partial or malformed JSON and refine the prompt accordingly.
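The cleaning steps above can be collected into a small helper. `extract_json` is a hypothetical name, and its heuristics (stripping markdown fences, slicing out the first `{...}` span) are assumptions about common failure modes, not part of the `vllm` API.

```python
import json


def extract_json(text: str) -> dict:
    """Best-effort extraction of a JSON object from raw model output.

    Handles common failure modes: surrounding whitespace, markdown code
    fences, and extra prose before/after the JSON object.
    """
    text = text.strip()
    # Remove markdown code fences such as ```json ... ```
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    # Slice from the first '{' to the last '}' to drop surrounding prose
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("No JSON object found in model output")
    return json.loads(text[start:end + 1])
```

A helper like this is a fallback, not a substitute for a strict prompt: it will still fail on truncated or genuinely malformed JSON, which usually calls for a prompt refinement or a retry.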
Key Takeaways
- Use explicit prompts to instruct `vllm` to output structured JSON for reliable parsing.
- Parse the raw text output with Python's `json` module to convert it to structured data.
- Set `temperature=0.0` in `SamplingParams` for deterministic structured outputs.
- You can run `vllm` locally or query a running `vllm` server via the OpenAI-compatible API.
- Careful prompt design and output cleaning are essential to avoid JSON parsing errors.