Dataset File Format

Your training dataset should be a jsonl file, meaning that each line contains a json object representing a chat. Each chat json object should contain a messages field, which looks like this:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}

Each message contains a role and a content field. The content field is a string holding the text of the message. The role can be one of the following:

  • system: The system prompt describing general model behaviour. It appears exactly once per chat and has to be the first message.
  • user: The user role represents the person interacting with your LLM.
  • assistant: The assistant role represents the LLM. Its messages are the ones that the model will be fine-tuned on.

The following should apply to your list of messages:

  1. The first message has to be a system prompt.
  2. The next message has to be a user message, and the following messages should alternate between assistant and user.
  3. The last message has to be an assistant message.
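These constraints are easy to check before uploading. Below is a minimal validation sketch in Python, assuming the dataset lives in a file called train.jsonl (the file name and error messages are illustrative):

import json

def validate_chat(messages):
    # Rule 1: the first message has to be a system prompt.
    assert messages and messages[0]["role"] == "system", "first message must be a system prompt"
    # Rule 2: after the system prompt, user and assistant messages alternate, starting with user.
    expected = "user"
    for message in messages[1:]:
        assert message["role"] == expected, f"expected a {expected} message"
        assert isinstance(message["content"], str), "content must be a string"
        expected = "assistant" if expected == "user" else "user"
    # Rule 3: the last message has to be an assistant message.
    assert messages[-1]["role"] == "assistant", "last message must be an assistant message"

with open("train.jsonl", encoding="utf-8") as f:
    for chat_line in f:
        validate_chat(json.loads(chat_line)["messages"])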

Dataset Size & Context Lengths

We recommend having at least 100 training chats to obtain good training results. We currently support the following maximum context lengths:

  • 7B Models: 2700 tokens
  • 13B Models: 2100 tokens
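To check whether your chats fit within these limits, you can estimate token counts with a tokenizer from the same model family. The sketch below uses a Hugging Face tokenizer as an approximation; the model name is only an example and may not exactly match the tokenizer used during fine-tuning.

import json
from transformers import AutoTokenizer

# Assumption: an open Llama-2 tokenizer is a reasonable proxy; swap in the
# tokenizer that matches the model you are fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

MAX_TOKENS = 2700  # 7B models; use 2100 for 13B models

def chat_token_count(messages):
    # Rough estimate: tokenize the concatenated message contents.
    text = "\n".join(message["content"] for message in messages)
    return len(tokenizer(text)["input_ids"])

with open("train.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        chat = json.loads(line)
        if chat_token_count(chat["messages"]) > MAX_TOKENS:
            print(f"chat on line {line_number} exceeds the maximum context length")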