Workflow · Beginner · 5 min

Model-specific formatting preferences

What you will learn

Choose the right input format and structural cues based on which LLM you're targeting, since different models respond differently to markdown, JSON, XML, and plain text.

Step 2 of 'Constructing Your First Prompt': after you've defined your task but before you write the actual instruction

Why this matters

Skipping this step can make identical prompts perform 30–50% worse on one model than on another. You'll chase bugs in your solution when the real issue is that Claude prefers XML tags while GPT-4 prefers markdown. The result is inconsistent output and wasted iteration.

Explanation

Different LLMs were trained on different token patterns and respond differently to structural markers. GPT-4 and GPT-4o were trained heavily on markdown and code blocks. Claude was optimized for XML-like tags and clean text. Gemini responds well to numbered lists. The model doesn't "fail" with the wrong formatting; it just performs subtly worse. A prompt that scores 92% with Claude might score 78% with GPT-4 if you use XML instead of markdown code fences.

The decision happens here: before you write your main instruction, check the vendor's documentation for the model (model cards and prompting guides) and test 2–3 formatting styles on a small example. For 2026 LLMs, the safe default is markdown code blocks with triple backticks for structured data, while XML tags are the better fit for Claude. JSON works for all models but feels less natural for narrative instructions.

What to watch: This is not about "correct" syntax: all modern LLMs understand all formats. It's about what this particular model was optimized for during training. Your goal is to match training data patterns so the model's learned associations fire correctly.

Code

Illustrative only - not runnable without a valid API key

````python
# pip install openai anthropic

import os

from anthropic import Anthropic
from openai import OpenAI


def test_formatting_on_gpt4():
    # Read the key from the environment rather than hard-coding it.
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

    # GPT-style prompt: structured data inside a markdown code fence.
    markdown_prompt = """Analyze this data:
```json
{"user": "alice", "score": 95}
```
What's the sentiment in one word?"""

    # OpenAI chat models are called through chat.completions, not messages.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": markdown_prompt}
        ],
        max_tokens=10,
    )
    answer = response.choices[0].message.content
    print(f"GPT-4o with markdown: {answer}")
    return answer


def test_formatting_on_claude():
    client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

    # Claude-style prompt: the same data wrapped in XML-like tags.
    xml_prompt = """<task>
Analyze this data:
<data>
{"user": "alice", "score": 95}
</data>
What's the sentiment in one word?
</task>"""

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=10,
        messages=[
            {"role": "user", "content": xml_prompt}
        ],
    )
    answer = response.content[0].text
    print(f"Claude with XML: {answer}")
    return answer


if __name__ == "__main__":
    gpt_result = test_formatting_on_gpt4()
    claude_result = test_formatting_on_claude()
````

Output

GPT-4o with markdown: Positive
Claude with XML: Positive

Your options

Recommended

Markdown code blocks with backticks (GPT-4, GPT-4o, Gemini)

You're using OpenAI's models or Google's Gemini. Also a safe default when you don't know which model will be used.

Pros

Matches training data distribution for GPT series. Natural for developers. Clear visual separation. Works for code, JSON, and structured text.

Cons

Claude performs slightly better with XML. Not optimal if you switch to Claude mid-project.

````text
Analyze this data:
```json
{"user": "alice", "score": 95}
```
What's the sentiment?
````

XML-like tags (Claude, Anthropic models)

You're using Claude (3.5 Sonnet, Opus) or planning to. Anthropic's training explicitly optimized for this.

Pros

Claude's native preference. Clean nesting for complex data. No escaping issues. Anthropic documentation recommends this.

Cons

Feels less natural to developers. GPT-4 handles XML tags, but they are not the format it was tuned for.

<task>
Analyze this data:
<data>
{"user": "alice", "score": 95}
</data>
What's the sentiment?
</task>

Plain text with numbered list structure (fallback, all models)

You need a single prompt that works identically across all models. Rare, but useful for benchmarking.

Pros

Universal. No formatting ambiguity. Easiest to debug across models.

Cons

Leaves performance on the table: typically 10–15% worse than model-optimized formatting.

1. Analyze the following data
2. Data: user is alice, score is 95
3. Question: What's the sentiment?
4. Provide your answer in one sentence.

Validation step

Run the same prompt (converted to each model's preferred format) on both a GPT model and a Claude model, using the same task. Compare response quality and token efficiency. GPT should perform 5–10% better with markdown; Claude should perform 5–10% better with XML. If performance is identical, formatting doesn't matter for this task. If one model is noticeably worse, you've found a formatting mismatch.
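The comparison above can be sketched as a small harness that scores each formatting style against the same examples. This is a minimal sketch, not a benchmark: `accuracy_by_format`, `to_markdown`, `to_xml`, and `fake_model` are hypothetical names, exact-match scoring is an assumption, and the stub model is deliberately rigged so the wiring is visible without an API key. Swap `fake_model` for a function that calls your real client.

```python
from typing import Callable, Dict, List, Tuple

FENCE = "`" * 3  # built at runtime so no literal fence appears in this example


def to_markdown(data: str) -> str:
    """Hypothetical formatter: data inside a markdown code fence."""
    return (
        f"Analyze this data:\n{FENCE}json\n{data}\n{FENCE}\n"
        "What's the sentiment in one word?"
    )


def to_xml(data: str) -> str:
    """Hypothetical formatter: data inside XML-like tags."""
    return (
        f"<task>\nAnalyze this data:\n<data>\n{data}\n</data>\n"
        "What's the sentiment in one word?\n</task>"
    )


def accuracy_by_format(
    ask_model: Callable[[str], str],
    formatters: Dict[str, Callable[[str], str]],
    examples: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return the exact-match accuracy of one model for each formatting style."""
    if not examples:
        raise ValueError("need at least one (data, expected) example")
    scores = {}
    for name, render in formatters.items():
        correct = sum(
            ask_model(render(data)).strip().lower() == expected.lower()
            for data, expected in examples
        )
        scores[name] = correct / len(examples)
    return scores


def fake_model(prompt: str) -> str:
    # Stand-in for a real API call; rigged to favor XML purely for the demo.
    return "Positive" if "<data>" in prompt else "Unsure"


examples = [('{"user": "alice", "score": 95}', "positive")]
scores = accuracy_by_format(
    fake_model, {"markdown": to_markdown, "xml": to_xml}, examples
)
print(scores)
```

Run the harness once per model with the same `examples`, then compare the per-format scores side by side.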

At scale

At scale (100+ prompts in production), formatting preferences compound. A 5% performance gap per prompt becomes 2–3 hours of compute waste per million API calls. If you're switching models mid-pipeline, format conversion becomes a bottleneck. For production systems serving multiple models, template the formatting: one prompt template, one format converter per target model.


Rollback plan

If you discover a model performs poorly with your chosen format mid-project, convert all prompts to that model's preferred format and re-test on a validation set. Use find-and-replace for structural markers (markdown to XML conversion is usually regex-safe). Re-run benchmarks on 20–50 examples before deploying the change.
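The find-and-replace step might look like the following. It is a rough sketch under stated assumptions: `markdown_to_xml` and the default `data` tag are hypothetical, and the pattern only handles simple, non-nested fenced blocks with a non-empty body, so review the converted prompts before re-running benchmarks.

```python
import re

FENCE = "`" * 3  # built at runtime so no literal fence appears in this example

# Opening fence with optional language tag, lazy body, closing fence on its own line.
FENCED_BLOCK = re.compile(
    rf"^{FENCE}[ \t]*(\w*)[ \t]*\n(.*?)\n{FENCE}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def markdown_to_xml(prompt: str, tag: str = "data") -> str:
    """Replace each fenced code block in `prompt` with <tag>...</tag>."""

    def _wrap(match: re.Match) -> str:
        body = match.group(2)  # group(1) is the language tag, dropped here
        return f"<{tag}>\n{body}\n</{tag}>"

    # A function replacement avoids backslash-escape surprises in the body.
    return FENCED_BLOCK.sub(_wrap, prompt)


markdown_prompt = (
    "Analyze this data:\n"
    f"{FENCE}json\n"
    '{"user": "alice", "score": 95}\n'
    f"{FENCE}\n"
    "What's the sentiment?"
)
xml_prompt = markdown_to_xml(markdown_prompt)
print(xml_prompt)
```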

Debug symptoms

Prompt works great with GPT-4 but Claude gives worse answers with identical logic

Diagnosis

Claude's training data distribution prefers XML or plain text; markdown is not its native input pattern

Fix

Convert your markdown code blocks (```json ... ```) to <data>...</data> XML tags and retest

Response quality is inconsistent: sometimes great, sometimes mediocre, but the prompt logic is correct

Diagnosis

You're using a neutral format (plain text) that works across models but doesn't activate either model's optimized pathways

Fix

Add explicit formatting, either markdown code blocks or XML tags, matching your target model

A prompt that worked last month now scores 20% lower after a model update

Diagnosis

The model's weights were updated; its preference for formatting patterns may have shifted slightly

Fix

Re-test the 2–3 formatting options on the new model version to confirm current preference

Production upgrade path

Production version: Create a `PromptTemplate` class with format converters per model. Store the canonical prompt as a data structure (dict/JSON), then apply format rules at client time: one instruction payload, multiple formatters. This lets you A/B test formatting changes without rewriting prompts. Example: `PromptTemplate(instruction="...", variables={...}).render(model="gpt-4o", format="markdown")` vs `.render(model="claude-opus", format="xml")`. This scales to 50+ prompts without duplication.
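One way the class described above could be sketched is shown here. Only the `PromptTemplate(...).render(model=..., format=...)` call shape comes from the text; the formatter functions, the `FORMATTERS` and `DEFAULT_FORMAT_BY_PREFIX` tables, and the choice to make `format` optional are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional

FENCE = "`" * 3  # built at runtime so no literal fence appears in this example


def _as_markdown(instruction: str, variables: Dict[str, Any]) -> str:
    return f"{instruction}\n{FENCE}json\n{json.dumps(variables)}\n{FENCE}"


def _as_xml(instruction: str, variables: Dict[str, Any]) -> str:
    return f"<task>\n{instruction}\n<data>\n{json.dumps(variables)}\n</data>\n</task>"


# Hypothetical registries: one formatter per style, one default style per model family.
FORMATTERS: Dict[str, Callable[[str, Dict[str, Any]], str]] = {
    "markdown": _as_markdown,
    "xml": _as_xml,
}
DEFAULT_FORMAT_BY_PREFIX = {"gpt": "markdown", "claude": "xml"}


@dataclass
class PromptTemplate:
    """Canonical prompt stored as data; formatting is applied only at render time."""

    instruction: str
    variables: Dict[str, Any] = field(default_factory=dict)

    def render(self, model: str, format: Optional[str] = None) -> str:
        # `format` shadows the builtin, but matches the call shape in the text.
        if format is None:
            format = next(
                (style for prefix, style in DEFAULT_FORMAT_BY_PREFIX.items()
                 if model.startswith(prefix)),
                "markdown",  # fallback when the model family is unknown
            )
        if format not in FORMATTERS:
            raise ValueError(f"unknown format: {format!r}")
        return FORMATTERS[format](self.instruction, self.variables)


template = PromptTemplate(
    instruction="Analyze this data:",
    variables={"user": "alice", "score": 95},
)
print(template.render(model="gpt-4o", format="markdown"))
print(template.render(model="claude-opus", format="xml"))
```

Because the instruction and variables stay in one place, an A/B test of formatting only changes the `format` argument, not the prompt itself.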

Common gotcha

The most common mistake: assuming 'all models understand all formats so it doesn't matter.' It's true they understand them, but understanding ≠ performing well. You'll iterate for hours trying to improve a prompt when the real fix is changing markdown backticks to XML tags. This especially happens when moving code from a GPT project to Claude: the prompt logic is identical but formatting is wrong, so the first attempt always underperforms.

Experienced dev note

Experienced practitioners know this is a time-saver, not a performance ceiling. Yes, all formats work, but format matching is how you get from 80% to 92% accuracy without changing your core instruction. The flip side: don't over-optimize formatting. Spending 2 hours testing 10 format variations for a 2% gain is a poor trade. Spend 15 minutes testing the model's documented preference, then move on. Also, keep your format consistent within a single prompt: mixing markdown and XML in one message confuses the model more than using one format suboptimally.

Check your understanding

You're building a multi-model system that runs the same task on both GPT-4o and Claude. Your current prompt uses markdown code blocks and gets 89% accuracy on GPT-4o but only 81% on Claude. What's your hypothesis for why Claude underperforms, and what's your first diagnostic step?

Answer hint

Claude was trained with XML patterns as a core structural component. Your first step is to convert the markdown blocks to XML tags on Claude only, then re-test on the same examples to see if accuracy improves to ~87%+.
