What is the ShareGPT dataset format

Quick answer
The ShareGPT dataset format is a JSONL-based conversation structure: each line of the file is a JSON object representing one conversation with alternating user and assistant messages. It stores the chat history as an ordered list of message objects, which makes it well suited to fine-tuning chat-based language models.

How it works

The ShareGPT dataset format organizes each conversation as an ordered list of messages. Every message carries a role, such as user or assistant, and a content string holding the text. Think of it as a chat transcript in which turns alternate between the person asking questions and the AI answering them. Keeping the turns in order preserves context and dialogue flow, which chat models need in order to learn coherent multi-turn conversations. Key names vary between tools: the original ShareGPT exports label each message with from (human or gpt) and value, while the role and content naming used in this article is a variant that fine-tuning tools commonly accept as well.
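The alternating structure described above can be checked mechanically. The following is a minimal sketch, assuming the role and content keys used in this article; the helper name is_well_formed is hypothetical and not part of any library.

```python
from typing import Dict, List

Message = Dict[str, str]


def is_well_formed(messages: List[Message]) -> bool:
    """Return True if messages alternate user/assistant, starting with user.

    Assumption: messages use the "role" and "content" keys shown in this
    article. Datasets that use other key names need a different check.
    """
    expected = "user"
    for message in messages:
        # Reject a message whose speaker breaks the alternation
        if message.get("role") != expected:
            return False
        # Reject a message whose text is missing or not a string
        if not isinstance(message.get("content"), str):
            return False
        expected = "assistant" if expected == "user" else "user"
    # An empty conversation carries no dialogue, so treat it as invalid
    return len(messages) > 0
```

A check like this is typically run once over the whole dataset before training, so that malformed conversations can be dropped or repaired early.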

Concrete example

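In a JSONL file, each line is a JSON object representing one conversation. The minimal example below shows two conversations, each with one user message and one assistant reply. It is pretty-printed as a JSON array for readability; in an actual JSONL file, each conversation object would sit on its own line without the enclosing brackets.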

```json
[
  {
    "id": "conv1",
    "conversations": [
      {"role": "user", "content": "What is the capital of France?"},
      {"role": "assistant", "content": "The capital of France is Paris."}
    ]
  },
  {
    "id": "conv2",
    "conversations": [
      {"role": "user", "content": "How do I write a for loop in Python?"},
      {"role": "assistant", "content": "You can write a for loop like this:\nfor i in range(5):\n    print(i)"}
    ]
  }
]
```
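To move between the pretty-printed array above and the one-object-per-line JSONL layout, a short conversion is enough. This is a sketch using only Python's standard json module; the function names array_to_jsonl and read_jsonl are hypothetical helpers chosen for illustration.

```python
import json
from typing import Any, Dict, List


def array_to_jsonl(array_text: str) -> str:
    """Convert a JSON array of conversations into JSONL text."""
    records = json.loads(array_text)
    # json.dumps escapes newlines inside strings, so each record
    # is guaranteed to occupy exactly one line
    return "\n".join(json.dumps(record) for record in records)


def read_jsonl(jsonl_text: str) -> List[Dict[str, Any]]:
    """Parse JSONL text into a list of conversation objects."""
    # Skip blank lines, such as a trailing newline at the end of a file
    return [json.loads(line) for line in jsonl_text.split("\n") if line.strip()]
```

Because every record is serialized independently, a JSONL file can be read line by line without loading the whole dataset into memory, which is one reason the layout is popular for large training sets.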

When to use it

Use the ShareGPT dataset format when you want to fine-tune or train chat-based large language models on real conversational data that includes both user queries and assistant responses. It is more structure than isolated single-turn question-answer pairs need, since simpler prompt-response formats handle those, and it does not fit non-dialogue data. Its main benefit is that models see each reply together with the turns that preceded it, which improves their ability to maintain coherent conversations.
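To make the idea of multi-turn context concrete, the sketch below splits one conversation into (history, reply) pairs, one per assistant message. This is only an illustration of the concept, not how any particular training library prepares its data; the function name context_reply_pairs and the follow-up question about Italy are hypothetical.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def context_reply_pairs(messages: List[Message]) -> List[Tuple[List[Message], str]]:
    """Pair every assistant reply with all messages that came before it."""
    pairs = []
    for index, message in enumerate(messages):
        if message["role"] == "assistant":
            # The history is everything said before this reply
            pairs.append((messages[:index], message["content"]))
    return pairs


# Illustrative two-exchange conversation (the second exchange is made up)
conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "And of Italy?"},
    {"role": "assistant", "content": "The capital of Italy is Rome."},
]

pairs = context_reply_pairs(conversation)
```

The second pair's history contains three messages, so the follow-up question is only answerable in light of the first exchange. That dependency is what isolated question-answer pairs cannot express.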

Key terms

| Term | Definition |
| --- | --- |
| JSONL | A file format where each line is a separate JSON object. |
| Role | The speaker in the conversation, typically 'user' or 'assistant'. |
| Content | The text message content of a conversation turn. |
| Conversation | A sequence of messages forming a dialogue exchange. |
| Fine-tuning | Training a pre-trained model further on specific data to specialize it. |

Key takeaways

  • The ShareGPT dataset format stores conversations as JSON objects with user and assistant roles.
  • It preserves multi-turn dialogue context essential for fine-tuning chat models.
  • Use it when training models to handle realistic conversational flows, not isolated Q&A.
Verified 2026-04 · gpt-4o, claude-3-5-sonnet-20241022