High severity intermediate · Fix: 5-10 min

ValueError: Invalid thinking block format

ValueError (thinking tag parsing in QwQ-32B response handler)

What this error means
QwQ-32B returns reasoning output wrapped in XML thinking tags, and the parser fails when tags are malformed, nested incorrectly, or missing closing delimiters.

Stack trace

traceback
Traceback (most recent call last):
  File "inference.py", line 42, in parse_thinking_output
    raise ValueError(f"Invalid thinking block format: opening tag found at {start} but no closing </thinking> tag")
ValueError: Invalid thinking block format: opening tag found at 342 but no closing </thinking> tag

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "app.py", line 78, in generate_response
    result = handler.extract_reasoning(llm_output)
ValueError: Unable to parse QwQ-32B thinking output — malformed XML structure
QUICK FIX
Set `max_tokens=16384` in generation config and remove custom `stopping_criteria` to allow QwQ-32B to complete thinking tag generation without truncation.

Why it happens

QwQ-32B is a reasoning model that outputs its internal thinking process wrapped in <thinking>...</thinking> XML tags before providing the final answer. When the model generates nested thinking blocks, fails to close tags, or the response is truncated mid-generation (due to max_tokens limits), the parsing logic cannot extract the reasoning correctly. This is especially common when max_tokens is too low, the model is interrupted, or custom stopping criteria strip the closing tag.

Detection

Add a pre-processing step that logs raw model output before parsing: `if '<thinking>' in output and '</thinking>' not in output: log_malformed_output(output)`. Monitor for truncated responses in production and adjust max_tokens accordingly. Use regex validation to catch unclosed tags before parser execution.

Causes & fixes

1

Response is truncated before the closing </thinking> tag (max_tokens limit reached mid-generation)

✓ Fix

Increase max_tokens in your generation call to at least 16384 for QwQ-32B: `max_tokens=16384` in HuggingFace generation_config or `num_predict=16384` in Ollama

2

Custom stopping criteria or repetition_penalty settings cause premature generation termination

✓ Fix

Disable custom stopping_criteria for QwQ-32B reasoning phase, or allow the model to complete thinking before applying stopping logic: use `stopping_criteria=None` during reasoning generation

3

Nested or malformed thinking tags in the output (rare in recent versions, common with fine-tuned models)

✓ Fix

Use regex-based fallback parser to extract content between first <thinking> and last </thinking>, handling malformed nesting gracefully with `re.search(r'<thinking>(.*?)</thinking>', output, re.DOTALL)`

4

Parser expects thinking tags but model was called without reasoning capability (model ID mismatch)

✓ Fix

Verify model is `Qwen/QwQ-32B-Preview` or `Qwen/QwQ-32B` from HuggingFace, not a non-reasoning variant like `Qwen/Qwen2.5-32B`

Code: broken vs fixed

Broken - triggers the error
python
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

model_id = "Qwen/QwQ-32B-Preview"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Explain quantum computing step by step."
inputs = tokenizer(prompt, return_tensors="pt")

# BUG: max_tokens too low, cuts off thinking tags
outputs = model.generate(
    **inputs,
    max_tokens=512,  # ← Too low — thinking block gets truncated
    temperature=0.7
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# This fails: ValueError: Invalid thinking block format
if '<thinking>' in response and '</thinking>' not in response:
    raise ValueError(f"Invalid thinking block format: opening tag found but no closing </thinking> tag")
Fixed - works correctly
python
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import re

model_id = "Qwen/QwQ-32B-Preview"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Explain quantum computing step by step."
inputs = tokenizer(prompt, return_tensors="pt")

# FIX: Set max_tokens high enough for complete thinking block + answer
outputs = model.generate(
    **inputs,
    max_tokens=16384,  # ← Increased to allow full thinking generation
    temperature=0.7,
    stopping_criteria=None  # ← Removed to prevent premature termination
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Safe extraction with regex fallback
try:
    thinking_match = re.search(r'<thinking>(.*?)</thinking>', response, re.DOTALL)
    if thinking_match:
        thinking_block = thinking_match.group(1)
        answer_match = re.search(r'</thinking>\s*(.*)', response, re.DOTALL)
        answer = answer_match.group(1) if answer_match else ""
        print(f"Thinking: {thinking_block[:200]}...")
        print(f"Answer: {answer}")
    else:
        print("No thinking tags found in output.")
        print(f"Full response: {response}")
except ValueError as e:
    print(f"Parse error (non-fatal): {e}")
    print(f"Falling back to raw response: {response}")
Increased max_tokens to 16384 to allow complete reasoning generation, removed stopping_criteria that was truncating output, and added regex-based fallback parsing to handle edge cases gracefully without crashing.

Workaround

If you cannot immediately increase max_tokens globally, wrap the generation in a retry loop with exponentially increasing max_tokens: start at 8192, catch ValueError, retry with 12288, then 16384. Alternatively, extract the thinking block using regex with `re.search(r'<thinking>(.*?(?=</thinking>))', output, re.DOTALL)` to handle incomplete closing tags, and log the truncation event for monitoring.

Prevention

Always allocate max_tokens >= 16384 for reasoning-model inference in production. Use a configuration system that routes QwQ-32B requests to a dedicated parameter set separate from smaller models. Implement response validation before parsing that checks for balanced opening/closing thinking tags. Set up monitoring alerts for truncated reasoning outputs (log length warnings when response approaches max_tokens). For critical use cases, implement a fallback to a non-reasoning variant with manual prompt engineering as a safety net.

Python 3.9+ · transformers >=4.42.0 · tested on 4.45.x
Verified 2026-04 · Qwen/QwQ-32B-Preview, Qwen/QwQ-32B
Verify ↗

Community Notes

No notes yetBe the first to share a version-specific fix or tip.