Finetuning
Preparing training data
Everything about training data requirements and how to format your dataset files
Dataset File Format
Your training dataset should be a `jsonl` file, meaning that each line contains a JSON object representing one chat. A chat JSON object should contain a `messages` field, which looks like this:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
Each message contains a `role` and a `content` field. The `content` field represents, as the name suggests, the content of a message and is a string. The `role` can be one of the following:
- `system`: The system prompt indicating general model behaviour. The system prompt only exists once in a chat and has to be the first message.
- `user`: The user role represents the person interacting with your LLM.
- `assistant`: The assistant role represents the LLM. Its messages are the ones that the model will be fine-tuned on.
The following rules apply to your list of messages:
- The first message has to be a `system` prompt.
- The next message has to be a `user` message, and the following messages should alternate between `assistant` and `user`.
- The last message has to be an `assistant` message.
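The ordering rules above can be checked programmatically before uploading. The following is an illustrative sketch, not an official validator; the function name `validate_chat` is an assumption:

```python
def validate_chat(chat):
    """Check the message-ordering rules described above for one chat object."""
    roles = [m["role"] for m in chat["messages"]]
    # Rule 1: the first message must be the system prompt.
    assert roles[0] == "system", "first message must be a system prompt"
    # Rule 3: the last message must come from the assistant.
    assert roles[-1] == "assistant", "last message must be an assistant message"
    # Rule 2: after the system prompt, roles alternate user/assistant.
    for i, role in enumerate(roles[1:]):
        expected = "user" if i % 2 == 0 else "assistant"
        assert role == expected, f"message {i + 1} should have role '{expected}'"

example = {"messages": [
    {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."},
]}
validate_chat(example)  # passes silently; raises AssertionError on a bad chat
```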
Dataset Size & Context Lengths
We recommend having at least 100 training chats to obtain good training results. We currently support the following maximum context lengths:
- 7B Models: 2700 tokens
- 13B Models: 2100 tokens
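Exact token counts depend on the model's tokenizer, which this page does not specify. As a rough pre-flight check you could approximate length with a characters-per-token heuristic; the constant of four characters per token below is an assumption for ballpark filtering only, not a substitute for the real tokenizer:

```python
# Maximum context lengths from the table above, keyed by model size.
MAX_TOKENS = {"7B": 2700, "13B": 2100}
CHARS_PER_TOKEN = 4  # crude heuristic assumption, not an exact tokenizer


def approx_tokens(chat):
    """Very rough token estimate for one chat, based on character count."""
    text = " ".join(m["content"] for m in chat["messages"])
    return len(text) // CHARS_PER_TOKEN


def fits(chat, model_size="7B"):
    """True if the chat's estimated length is within the model's context limit."""
    return approx_tokens(chat) <= MAX_TOKENS[model_size]
```

Chats flagged by such a check would still need to be verified with the actual tokenizer before being trimmed or dropped.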