Finetuning
Preparing training data
Everything about training data requirements and how to format your dataset files
Dataset File Format
Your training dataset should be a `jsonl` file, meaning that each line contains a JSON object representing one chat. A chat JSON object should contain a `messages` field, which looks like this:
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
Each message contains a `role` and a `content` field. The `content` field represents, as the name suggests, the content of a message and is a string. The `role` can be one of the following:
- `system`: The system prompt indicating general model behaviour. The system prompt only exists once in a chat and has to be the first message.
- `user`: The user role represents the person interacting with your LLM.
- `assistant`: The assistant role represents the LLM. Its messages are the ones that the model will be fine-tuned on.
The following rules apply to your list of messages:
- The first message has to be a `system` prompt.
- The next message has to be a `user` message, and the following messages should alternate between `assistant` and `user`.
- The last message has to be an `assistant` message.
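The ordering rules above can be checked programmatically before uploading. The following is an illustrative sketch, not an official validator; the function name `validate_chat` is an assumption:

```python
def validate_chat(chat):
    """Check the message-ordering rules described above for one chat object."""
    roles = [m["role"] for m in chat["messages"]]
    # Rule 1: the first message must be the system prompt.
    assert roles[0] == "system", "first message must be a system prompt"
    # Rule 3: the last message must come from the assistant.
    assert roles[-1] == "assistant", "last message must be an assistant message"
    # Rule 2: after the system prompt, roles alternate user/assistant.
    for i, role in enumerate(roles[1:]):
        expected = "user" if i % 2 == 0 else "assistant"
        assert role == expected, f"message {i + 1} should have role '{expected}'"

example = {"messages": [
    {"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."},
    {"role": "user", "content": "What's the capital of France?"},
    {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."},
]}
validate_chat(example)  # passes silently; raises AssertionError on a bad chat
```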
Dataset Size & Context Lengths
We recommend having at least 100 training chats to obtain good training results. We currently support the following maximum context lengths:
- 7B Models: 2700 tokens
- 13B Models: 2100 tokens
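Exact token counts depend on the model's tokenizer, which this page does not specify. As a rough pre-flight check you could approximate length with a characters-per-token heuristic; the constant of four characters per token below is an assumption for ballpark filtering only, not a substitute for the real tokenizer:

```python
# Maximum context lengths from the table above, keyed by model size.
MAX_TOKENS = {"7B": 2700, "13B": 2100}
CHARS_PER_TOKEN = 4  # crude heuristic assumption, not an exact tokenizer


def approx_tokens(chat):
    """Very rough token estimate for one chat, based on character count."""
    text = " ".join(m["content"] for m in chat["messages"])
    return len(text) // CHARS_PER_TOKEN


def fits(chat, model_size="7B"):
    """True if the chat's estimated length is within the model's context limit."""
    return approx_tokens(chat) <= MAX_TOKENS[model_size]
```

Chats flagged by such a check would still need to be verified with the actual tokenizer before being trimmed or dropped.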