How to prepare training data for OpenAI fine-tuning
prompt and a completion field. Ensure the data is clean, representative, and formatted with clear input-output pairs to guide the model during fine-tuning.PREREQUISITES
Python 3.8+OpenAI API key (free tier works)pip install openai>=1.0
Setup
Install the OpenAI Python SDK and set your API key as an environment variable to authenticate requests.
pip install openai Step by step
Create a JSONL file where each line is a JSON object with prompt and completion keys. The prompt is the input text, and the completion is the desired output the model should learn to generate.
Example format:
{"prompt": "Translate English to French: 'Hello'", "completion": " Bonjour"}Note the space before the completion text, which helps the model distinguish prompt from completion.
import json
# Example training data as a list of dicts
training_data = [
{"prompt": "Translate English to French: 'Hello'", "completion": " Bonjour"},
{"prompt": "Translate English to French: 'Goodbye'", "completion": " Au revoir"}
]
# Write to JSONL file
with open('fine_tune_data.jsonl', 'w', encoding='utf-8') as f:
for entry in training_data:
json.dump(entry, f)
f.write('\n')
print("Training data saved to fine_tune_data.jsonl") Training data saved to fine_tune_data.jsonl
Common variations
You can prepare data for different tasks by adjusting the prompt and completion format, such as question-answering, summarization, or code generation. For example, for Q&A:
{"prompt": "Q: What is AI?\nA:", "completion": " Artificial Intelligence is..."}Use the same JSONL format but tailor prompts to your use case.
Troubleshooting
If your fine-tuning job fails, check that your JSONL file is valid JSON with no trailing commas or syntax errors. Also, ensure each completion starts with a space and ends with a stop token (like a newline or punctuation) to help the model learn boundaries.
Use JSON validators or json.load() in Python to verify file integrity before uploading.
import json
try:
with open('fine_tune_data.jsonl', 'r', encoding='utf-8') as f:
for line in f:
json.loads(line)
print("JSONL file is valid.")
except json.JSONDecodeError as e:
print(f"JSON error: {e}") JSONL file is valid.
Key Takeaways
- Use a JSONL file with prompt-completion pairs for OpenAI fine-tuning data.
- Ensure completions start with a space and end with a stop token for clarity.
- Validate JSONL syntax before uploading to avoid fine-tuning errors.