How to format training data for fine-tuning LLMs
JSONL) file where each line is a JSON object containing prompt and completion fields. The prompt is the input text, and the completion is the desired model output, both formatted as strings.PREREQUISITES
Python 3.8+OpenAI API key (free tier works)pip install openai>=1.0
Setup
Install the OpenAI Python SDK and set your API key as an environment variable to prepare for fine-tuning.
pip install openai>=1.0 Step by step
Prepare your training data as a JSONL file where each line is a JSON object with prompt and completion keys. The prompt is the input text you want the model to learn from, and the completion is the expected output. Both should be strings, and the completion usually ends with a stop token or newline to signal the end.
Example format:
{"prompt": "Translate English to French: Hello, how are you?\n", "completion": " Bonjour, comment ça va?\n"}Use the OpenAI CLI or API to upload and fine-tune your model with this data.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Example training data line
training_example = {
"prompt": "Translate English to French: Hello, how are you?\n",
"completion": " Bonjour, comment ça va?\n"
}
# Save to JSONL file
import json
with open("training_data.jsonl", "w") as f:
f.write(json.dumps(training_example) + "\n")
print("Training data saved in JSONL format.") Training data saved in JSONL format.
Common variations
You can format data for different fine-tuning tasks by adjusting the prompt and completion structure. For chat models, include role tokens like User: and Assistant: in the prompt and completion. Async and streaming fine-tuning workflows depend on the provider's SDK capabilities.
Troubleshooting
- If your fine-tuning job fails, verify your JSONL file is valid JSON and each line contains both
promptandcompletionkeys. - Ensure the
completionends with a newline or stop token to prevent the model from generating unwanted text. - Check for consistent formatting and avoid very long prompts or completions that exceed token limits.
Key Takeaways
- Use JSONL format with one JSON object per line containing 'prompt' and 'completion' keys.
- Both 'prompt' and 'completion' must be strings; 'completion' should end with a newline or stop token.
- Validate your JSONL file before fine-tuning to avoid errors.
- Adjust prompt-completion formatting for different tasks like translation, chat, or summarization.
- Keep prompts and completions within token limits to ensure successful fine-tuning.