Train

POST
v1/train

Fine-tune a language model by adjusting its weights to make it more likely to produce the results you want. Training modifies the model's internal parameters based on your provided examples, enabling it to learn new behaviors, preferences, and response patterns.

To train the model, you provide a conversation context (messages) along with multiple possible responses, each with a reward value indicating how desirable that response is. The model learns from these examples by updating its weights to favor responses with higher rewards, gradually improving its ability to generate the types of outputs you prefer.

This process allows you to customize the model's behavior for specific tasks, styles, or preferences without needing to build a model from scratch.
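As a quick illustration, an abbreviated training request body could look like the following (a trimmed-down version of the full curl example at the end of this page):

{
  "model"     : "MyCustomModel",
  "messages"  : [
    { "role": "user", "content": "What is the capital of France?" }
  ],
  "responses" : [
    { "reward": 1,    "response": "The capital of France is Paris." },
    { "reward": -0.5, "response": "I don't know." }
  ],
  "method"         : "gce",
  "learning_steps" : 2
}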

Parameters


model:required string

The identifier of the language model to use for training.

Note: This must be a model you have created yourself; the standard models Llama-3.1-8b and DeepSeek-R1-8b are read-only. You must first make a copy of these models and then train the copy.


messages:required Array of Message

Message Structure

Models are trained to operate on alternating user and assistant conversational turns. In a training request, you specify the prior conversational turns with the messages parameter; the responses you provide (see below) are candidates for the next assistant turn that the model learns from.

Message

Each input message must be an object with a role and content. You can specify three types of roles:

  • system - Sets instructions and context for the entire conversation. This is placed at the beginning of the messages array.
  • user - Represents input from the end user
  • assistant - Represents responses from the model

System Message

The system role allows you to provide context, instructions, and guidelines that influence the model's behavior throughout the conversation. System messages are optional but powerful for:

  • Defining the assistant's role, personality, or expertise
  • Setting response format requirements (e.g., "always respond in JSON")
  • Providing domain-specific context or constraints
  • Establishing behavioral guidelines

System messages are included in the messages array as the first message:

{
  "model"    : "your-model-name",
  "messages" : [
    {
      "role"    : "system",
      "content" : "You are a helpful API documentation assistant. Provide clear, concise explanations with code examples."
    },
    {
      "role"    : "user",
      "content" : "Why is the sky blue?"
    }
  ]
}

responses:required Array of Response

Response Array

Contains an array of possible responses. Each Response contains a response (the text) and a reward.

The rewards are normalized to ensure that responses with above-average rewards receive positive advantages (encouraging the model to favor them), while responses with below-average rewards receive negative advantages (discouraging those patterns).
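For example, the two responses in the sample request at the end of this page have rewards 1 and -0.5: their mean is 0.25 and their standard deviation is 0.75, so after normalization (assuming mean-centering followed by division by the standard deviation, as in GRPO-style advantage computation) the advantages become +1 and -1, matching the advantage values reported in the sample response.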


stream:optional boolean

Streaming allows you to see the results of individual training steps incrementally as they are generated, instead of receiving a single status report when training is completed.

  • false - Returns the complete response only after training is finished
  • true - Streams response chunks as the training processes, allowing for real-time display

method:required TrainingMethod

Training Method

Different training methods are implemented. The recommended method is "gce", an improved "sft" method that makes it possible to learn multiple responses at the same time.

GCE

Value:gce

This is our most stable learning method. It is a combination of DeepSeek's GRPO and SFT (Supervised Fine-Tuning), which allows you to train multiple responses with rewards in the same way as the GRPO algorithm.

Direct Preference Optimization

Value:dpo

Direct Preference Optimization is a reinforcement learning method that trains language models to align with human preferences without requiring a separate reward model. Unlike traditional RLHF (Reinforcement Learning from Human Feedback) which uses PPO and requires fitting a reward model, DPO optimizes the model directly using a simple binary cross-entropy loss on preference data. This makes it more stable, computationally efficient, and easier to implement while achieving comparable or better performance than PPO-based methods.

Research: https://arxiv.org/abs/2305.18290

GRPO (Group Relative Policy Optimization)

Value:grpo

Group Relative Policy Optimization is a variant of Proximal Policy Optimization that enhances reasoning capabilities while reducing memory consumption. GRPO eliminates the need for a separate value function (critic network) and instead generates multiple outputs for each prompt, computing advantages relative to the group's average reward. This approach is particularly effective for mathematical reasoning and complex problem-solving tasks, as demonstrated in DeepSeek-Math and DeepSeek-R1 models.

Research: https://arxiv.org/abs/2402.03300
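As a sketch only, a GRPO request body might combine several scored candidate responses; the question, response texts, and rewards below are placeholder values, while epsilon and beta follow the default and recommendation described later on this page:

{
  "model"    : "MyCustomModel",
  "messages" : [
    { "role": "user", "content": "What is 17 * 3?" }
  ],
  "responses" : [
    { "reward": 1,   "response": "17 * 3 = 51." },
    { "reward": 0.2, "response": "I think the answer is 51." },
    { "reward": -1,  "response": "17 * 3 = 54." }
  ],
  "method"         : "grpo",
  "learning_steps" : 2,
  "epsilon"        : 0.2,
  "beta"           : 0.0
}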

SFT (Supervised Fine-Tuning)

Value:sft

Supervised Fine-Tuning is the foundational method for adapting pre-trained language models to specific tasks or domains using labeled data. The model is trained on curated examples of desired behavior using a standard language modeling objective (next token prediction). SFT is typically the first step in the alignment process before applying more advanced techniques like DPO or GRPO. It's computationally efficient and effective for teaching models to follow instructions, adopt specific writing styles, or gain domain expertise.

Research: https://arxiv.org/abs/2310.05492


learning_steps:required number

The number of training iterations or optimization steps to perform during the fine-tuning process. Each step represents one update to the model's parameters. More learning steps generally lead to better model adaptation but increase training time and cost.

Considerations:

Do not use too high a value here. Because the model learns one thing at a time, too many steps will cause overfitting, where the model memorizes the training examples rather than generalizing. Instead, use multiple calls with different training data to teach the model new behaviour.

Alternatively, you can use the target_accuracy parameter to automatically stop training when the target accuracy is reached.


target_accuracy:optional number

This parameter works together with learning_steps. During training, accuracy is measured as the percentage of words that match the target response. A word is considered to match if the model assigns it a probability above 50%.

When this target accuracy is reached, training stops automatically. Training will also stop when the maximum number of training steps (specified by learning_steps) has been reached, whichever comes first.
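As an illustrative fragment (assuming target_accuracy uses the same 0-to-1 scale as the accuracy field in the sample response below), the following settings cap training at 20 steps but stop earlier once 90% of the target words are predicted correctly:

  "learning_steps"  : 20,
  "target_accuracy" : 0.9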


learning_rate:optional number
Range: Positive decimal (typically between 1E-4 and 1E-7)

Controls how much the model's parameters are adjusted during each training step. The learning rate determines the size of the steps taken during optimization - larger values make bigger updates, while smaller values make more cautious adjustments.

If not specified, the model will use its default learning rate, which has been pre-configured based on the model architecture and recommended training practices.

Behavior:

  • High learning rates (1E-4 to 1E-5) - Faster training but risk of instability or overshooting optimal parameters
  • Low learning rates (1E-5 to 1E-6) - Slower but more stable training, good for preserving pre-trained knowledge

Recommendations:

  • Use the model's default learning rate unless you have specific performance issues
  • Lower learning rates are generally safer for fine-tuning pre-trained models to avoid catastrophic forgetting
  • If training appears unstable or loss values spike, reduce the learning rate
  • If training progresses too slowly, carefully increase the learning rate

epsilon:optional number

This is used in the GRPO training method. It is the clipping parameter that prevents destructively large policy updates by constraining the probability ratio between new and old policies to [1-epsilon, 1+epsilon]. Default is 0.2.
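In the standard clipped-objective formulation, this means each update optimizes min(r * A, clip(r, 1 - epsilon, 1 + epsilon) * A), where r is the ratio between the new and old policy probabilities for a token and A is its advantage.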


beta:optional number

A critical hyperparameter that controls how much the model is allowed to deviate from its reference policy during training. The role and recommended values of beta differ significantly between DPO and GRPO methods.

Beta in DPO (Direct Preference Optimization)

In DPO, beta is the primary trade-off parameter that controls the strength of the preference learning signal relative to staying close to the reference model.

How it works:

  • Beta appears in the DPO loss function as a coefficient that weights the log-probability ratios between preferred and non-preferred responses
  • Higher beta values make the model more conservative, keeping it closer to the reference policy and making smaller adjustments based on preferences
  • Lower beta values allow more aggressive optimization toward human preferences, potentially diverging more from the reference model

Recommended ranges for DPO:

  • beta = 0.1 to 0.3 - Standard range for most DPO fine-tuning tasks
  • beta = 0.1 - More aggressive preference learning, allows significant deviation from reference
  • beta = 0.5 - More conservative, maintains stronger alignment with reference model behavior

Beta in GRPO (Group Relative Policy Optimization)

In GRPO, beta serves as a KL divergence penalty coefficient for regularization, but is typically set much lower than in DPO or disabled entirely.

How it works:

  • When beta = 0.0 (default), no KL penalty is applied and the reference model is not loaded, reducing memory usage
  • When beta > 0.0, a KL penalty term prevents the model from diverging too far from the reference policy
  • This regularization helps preserve general capabilities while learning new tasks

Note: For GRPO, we highly recommend setting this to 0.0, because there is a problem with the KL divergence making the model less stable when it is applied.
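To make the two cases concrete, the recommendations above translate into request fragments like these (illustrative values only):

  "method" : "dpo",
  "beta"   : 0.1

  "method" : "grpo",
  "beta"   : 0.0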

Output


TrainResponse

This is the standard (non-streaming) response format returned by the Train endpoint. It contains performance metrics for each training example and usage statistics for the training operation.


performance:required Array of TrainPerformance

An array of performance metrics, one for each response sample provided in the training data.

TrainPerformance members
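As shown in the sample response at the end of this page, each TrainPerformance entry contains an id for the response sample, the computed advantage, the accuracy reached on that sample, and the training loss.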

usage:required TrainUsage

Usage statistics for the training operation.

Train Usage

Usage statistics for the training operation, including token counts and training configuration details.

TrainUsage members
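As shown in the sample response at the end of this page, TrainUsage includes input_tokens, output_tokens, num_samples, steps, training_tokens and duration_ms.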

Train (standard)

curl http://localhost:45678/v1/train \
    -X POST \
    -H "Authorization: Bearer $TIGER_API_KEY"\
    -H 'Content-Type: application/json' \
    -d "{
  \"model\": \"MyCustomModel\",
  \"messages\": [
    {
      \"role\": \"system\",
      \"content\": \"You are a helpful assistant.\"
    },
    {
      \"role\": \"user\",
      \"content\": \"What is the capital of France?\"
    }
  ],
  \"responses\": [
    {
      \"reward\": 1,
      \"response\": \"The capital of France is Paris.\"
    },
    {
      \"reward\": -0.5,
      \"response\": \"I don't know.\"
    }
  ],
  \"method\": \"gce\",
  \"learning_steps\": 2,
  \"stream\": false,
  \"learning_rate\": 0.00001,
  \"beta\": 0.04,
  \"epsilon\": 0.2
}"

Response (standard)

{
  "performance" : [
    {
      "id"        : 1,
      "advantage" : 1,
      "accuracy"  : 1,
      "loss"      : 0.0167236
    },
    {
      "id"        : 2,
      "advantage" : -1,
      "accuracy"  : 0.333333,
      "loss"      : 5.15625
    }
  ],
  "usage"       : {
    "input_tokens"    : 28,
    "output_tokens"   : 14,
    "num_samples"     : 2,
    "steps"           : 2,
    "training_tokens" : 350,
    "duration_ms"     : 4294
  }
}