API Reference
Train
Fine-tune a language model by adjusting its weights to make it more likely to produce the results you want. Training modifies the model's internal parameters based on your provided examples, enabling it to learn new behaviors, preferences, and response patterns.
To train the model, you provide a conversation context (messages) along with multiple possible responses, each with a reward value indicating how desirable that response is. The model learns from these examples by updating its weights to favor responses with higher rewards, gradually improving its ability to generate the types of outputs you prefer.
This process allows you to customize the model's behavior for specific tasks, styles, or preferences without needing to build a model from scratch.
Parameters
The identifier of the language model to use for training.
Note: This must be a model you have created yourself; the standard models Llama-3.1-8b and DeepSeek-R1-8b are read-only. You must first make a copy of one of these models and then train the copy.
Message Structure
Models are trained to operate on alternating user and assistant conversational turns. When creating a new Message, you specify the prior conversational turns with the messages parameter, and the model then generates the next Message in the conversation.
Message
Each input message must be an object with a role and content. You can specify three types of roles:
- system - Sets instructions and context for the entire conversation. This is placed at the beginning of the messages array.
- user - Represents input from the end user
- assistant - Represents responses from the model
System Message
The system role allows you to provide context, instructions, and guidelines that influence the model's behavior throughout the conversation. System messages are optional but powerful for:
- Defining the assistant's role, personality, or expertise
- Setting response format requirements (e.g., "always respond in JSON")
- Providing domain-specific context or constraints
- Establishing behavioral guidelines
System messages are included in the messages array as the first message:
{
"model" : "your-model-name",
"messages" : [
{
"role" : "system",
"content" : "You are a helpful API documentation assistant. Provide clear, concise explanations with code examples."
},
{
"role" : "user",
"content" : "Why is the sky blue?"
}
]
}
Response Array
Contains an array of possible responses; each response object contains a response (text) and a reward.
The rewards are normalized to ensure that responses with above-average rewards get positive advantages (encouraging the model to favor them), while responses with below-average rewards get negative advantages (discouraging those patterns).
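As an illustration, the normalization can be thought of as centering the rewards on the group mean and scaling by their standard deviation. This is a sketch of the idea, not the service's internal code:
# Illustrative sketch of reward normalization (assumed, not the exact server-side code).
from statistics import mean, pstdev

def normalize_rewards(rewards: list[float]) -> list[float]:
    """Center rewards on the group mean so above-average responses get positive
    advantages and below-average ones get negative advantages."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mu) / sigma for r in rewards]

print(normalize_rewards([1.0, -0.5]))  # [1.0, -1.0]
For the two rewards in the example request at the end of this page (1 and -0.5), this produces advantages of 1 and -1, consistent with the advantage values in the example response.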
Streaming allows you to see the results of individual training steps incrementally as they are generated, instead of receiving a single status report when training is completed.
- false - Returns the complete response only after training is finished
- true - Streams response chunks as training progresses, allowing for real-time display (see the client sketch below)
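A minimal streaming client sketch in Python, assuming the endpoint emits one chunk per line while stream is true; the exact chunk format is not specified here:
# Hypothetical streaming client; the newline-delimited chunk format is an assumption.
import os
import requests

payload = {
    "model": "MyCustomModel",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "responses": [{"reward": 1, "response": "The capital of France is Paris."}],
    "method": "gce",
    "stream": True,
}

with requests.post(
    "http://localhost:45678/v1/train",
    headers={"Authorization": f"Bearer {os.environ['TIGER_API_KEY']}"},
    json=payload,
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if line:
            print(line.decode())  # one chunk per training step (assumed)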
Training Method
Different training methods are implemented. The recommended method is "gce", an improved "sft" method that makes it possible to learn from multiple responses at the same time.
GCE
Value:gce
This is our most stable learning method. It is a combination of DeepSeek's GRPO and SFT (Supervised Fine-Tuning), which allows you to train on multiple responses with rewards in the same way as the GRPO algorithm.
Direct Preference Optimization
Value:dpo
Direct Preference Optimization is a reinforcement learning method that trains language models to align with human preferences without requiring a separate reward model. Unlike traditional RLHF (Reinforcement Learning from Human Feedback) which uses PPO and requires fitting a reward model, DPO optimizes the model directly using a simple binary cross-entropy loss on preference data. This makes it more stable, computationally efficient, and easier to implement while achieving comparable or better performance than PPO-based methods.
Research: https://arxiv.org/abs/2305.18290
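For intuition, the DPO objective from the paper above can be sketched as a logistic loss on the beta-weighted log-probability margin between the preferred and non-preferred response. This is illustrative only; the endpoint computes the loss internally:
# Sketch of the DPO objective from arXiv:2305.18290 (illustrative, not the service's internal code).
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Binary cross-entropy on the margin between preferred and non-preferred responses."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# A higher beta keeps the policy closer to the reference model; a lower beta
# allows larger moves toward the preferred response.
print(dpo_loss(-1.0, -3.0, -1.5, -2.5, beta=0.1))  # ~0.64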
GRPO (Group Relative Policy Optimization)
Value:grpo
Group Relative Policy Optimization is a variant of Proximal Policy Optimization that enhances reasoning capabilities while reducing memory consumption. GRPO eliminates the need for a separate value function (critic network) and instead generates multiple outputs for each prompt, computing advantages relative to the group's average reward. This approach is particularly effective for mathematical reasoning and complex problem-solving tasks, as demonstrated in DeepSeek-Math and DeepSeek-R1 models.
Research: https://arxiv.org/abs/2402.03300
SFT (Supervised Fine-Tuning)
Value:sft
Supervised Fine-Tuning is the foundational method for adapting pre-trained language models to specific tasks or domains using labeled data. The model is trained on curated examples of desired behavior using a standard language modeling objective (next token prediction). SFT is typically the first step in the alignment process before applying more advanced techniques like DPO or GRPO. It's computationally efficient and effective for teaching models to follow instructions, adopt specific writing styles, or gain domain expertise.
Research: https://arxiv.org/abs/2310.05492
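As a rough illustration, the SFT objective is the average negative log-likelihood of the tokens in the desired response under the model. This is a sketch, not the service's implementation:
# Sketch of the SFT (next-token prediction) objective.
import math

def sft_loss(target_token_logprobs: list[float]) -> float:
    """Average negative log-likelihood over the tokens of the desired response."""
    return -sum(target_token_logprobs) / len(target_token_logprobs)

# Log-probabilities the model assigns to each token of the target response (made-up values).
print(sft_loss([math.log(0.9), math.log(0.7), math.log(0.95)]))  # ~0.17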
The number of training iterations or optimization steps to perform during the fine-tuning process. Each step represents one update to the model's parameters. More learning steps generally lead to better model adaptation but increase training time and cost.
Considerations:
Do not use too high a value here. Because you are teaching the model one thing at a time, too many steps will result in overfitting, where the model memorizes the training examples rather than generalizing. Instead, use multiple calls with different training data to teach the model new behaviour.
Alternatively, you can use the targetAccuracy parameter to automatically stop training when the target accuracy is reached.
This parameter works together with learningSteps. During training, accuracy is measured as the percentage of words that match the target response. A word is considered to match if the model assigns it a probability above 50%.
When this target accuracy is reached, training stops automatically. Training will also stop when the maximum number of training steps (specified by learningSteps) has been reached, whichever comes first.
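The stopping rule can be sketched as follows, assuming per-word probabilities from the model; the exact server-side measurement may differ:
# Illustrative sketch of the targetAccuracy / learningSteps stopping rule (assumed logic).
def accuracy(word_probs: list[float]) -> float:
    """Fraction of target words the model assigns a probability above 50%."""
    return sum(p > 0.5 for p in word_probs) / len(word_probs)

def should_stop(step: int, learning_steps: int, acc: float, target_accuracy: float) -> bool:
    """Stop when the target accuracy is reached or the step budget is exhausted, whichever comes first."""
    return acc >= target_accuracy or step >= learning_steps

print(accuracy([0.9, 0.4, 0.8]))                                             # ~0.67
print(should_stop(step=3, learning_steps=5, acc=0.95, target_accuracy=0.9))  # True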
Controls how much the model's parameters are adjusted during each training step. The learning rate determines the size of the steps taken during optimization - larger values make bigger updates, while smaller values make more cautious adjustments.
If not specified, the model will use its default learning rate, which has been pre-configured based on the model architecture and recommended training practices.
Behavior:
- High learning rates (1E-4 - 1E-5) - Faster training but risk of instability or overshooting optimal parameters
- Low learning rates (1E-5 - 1E-6) - Slower but more stable training, good for preserving pre-trained knowledge
Recommendations:
- Use the model's default learning rate unless you have specific performance issues
- Lower learning rates are generally safer for fine-tuning pre-trained models to avoid catastrophic forgetting
- If training appears unstable or loss values spike, reduce the learning rate
- If training progresses too slowly, carefully increase the learning rate
This is used in the GRPO training method. It is the clipping parameter that prevents destructively large policy updates by constraining the probability ratio between new and old policies to [1-epsilon, 1+epsilon]. Default is 0.2.
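A sketch of the clipped surrogate term for a single token, combined with the group-relative advantage described earlier. Illustrative only, not the service's internal code:
# Sketch of the PPO/GRPO-style clipped objective for a single token.
import math

def clipped_objective(new_logp: float, old_logp: float, advantage: float, epsilon: float = 0.2) -> float:
    """Constrain the probability ratio to [1 - epsilon, 1 + epsilon] before weighting by the advantage."""
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # The surrogate takes the more pessimistic (smaller) of the two terms.
    return min(ratio * advantage, clipped * advantage)

print(clipped_objective(new_logp=-0.5, old_logp=-1.0, advantage=1.0))  # ratio ~1.65 is clipped to 1.2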
A critical hyperparameter that controls how much the model is allowed to deviate from its reference policy during training. The role and recommended values of beta differ significantly between DPO and GRPO methods.
Beta in DPO (Direct Preference Optimization)
In DPO, beta is the primary trade-off parameter that controls the strength of the preference learning signal relative to staying close to the reference model.
How it works:
- Beta appears in the DPO loss function as a coefficient that weights the log-probability ratios between preferred and non-preferred responses
- Higher beta values make the model more conservative, keeping it closer to the reference policy and making smaller adjustments based on preferences
- Lower beta values allow more aggressive optimization toward human preferences, potentially diverging more from the reference model
Recommended ranges for DPO:
- beta = 0.1 to 0.3 - Standard range for most DPO fine-tuning tasks
- beta = 0.1 - More aggressive preference learning, allows significant deviation from reference
- beta = 0.5 - More conservative, maintains stronger alignment with reference model behavior
Beta in GRPO (Group Relative Policy Optimization)
In GRPO, beta serves as a KL divergence penalty coefficient for regularization, but is typically set much lower than in DPO or disabled entirely.
How it works:
- When beta = 0.0 (default), no KL penalty is applied and the reference model is not loaded, reducing memory usage
- When beta > 0.0, a penalty term prevents the model from diverging too far from the reference policy
- This regularization helps preserve general capabilities while learning new tasks (a sketch of the penalty term follows this list)
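When beta is greater than zero, a per-token KL penalty is subtracted from the objective. The sketch below uses a common unbiased KL estimator; the exact form used by the service is an assumption:
# Sketch of a per-token KL penalty (only applied when beta > 0.0; exact estimator assumed).
import math

def kl_penalty(policy_logp: float, ref_logp: float, beta: float) -> float:
    """Unbiased per-token estimate of KL(policy || reference), scaled by beta."""
    log_ratio = ref_logp - policy_logp
    kl = math.exp(log_ratio) - log_ratio - 1.0  # always >= 0
    return beta * kl

print(kl_penalty(policy_logp=-1.2, ref_logp=-1.0, beta=0.04))  # small penalty for a small divergence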
Note: For GRPO, we highly recommend setting this to 0.0, because the KL divergence penalty can make the model less stable when it is applied.
Output
TrainResponse
This is the standard (non-streaming) response format returned by the Train endpoint. It contains performance metrics for each training example and usage statistics for the training operation.
An array of performance metrics, one for each response sample provided in the training data.
Usage statistics for the training operation.
Train Usage
Usage statistics for the training operation, including token counts and training configuration details.
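For reference, the fields of the response, taken from the example output below, can be described with a client-side sketch (not an official schema):
# Client-side sketch of the TrainResponse shape, based on the example output below.
from typing import TypedDict

class Performance(TypedDict):
    id: int           # index of the response sample
    advantage: float  # normalized advantage derived from the reward
    accuracy: float   # fraction of target words predicted with probability above 50%
    loss: float       # training loss for this sample

class TrainUsage(TypedDict):
    input_tokens: int
    output_tokens: int
    num_samples: int
    steps: int
    training_tokens: int
    duration_ms: int

class TrainResponse(TypedDict):
    performance: list[Performance]
    usage: TrainUsage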
curl http://localhost:45678/v1/train \
-X POST \
-H "Authorization: Bearer $TIGER_API_KEY"\
-H 'Content-Type: application/json' \
-d "{
\"model\": \"MyCustomModel\",
\"messages\": [
{
\"role\": \"system\",
\"content\": \"You are a helpful assistant.\"
},
{
\"role\": \"user\",
\"content\": \"What is the capital of France?\"
}
],
\"responses\": [
{
\"reward\": 1,
\"response\": \"The capital of France is Paris.\"
},
{
\"reward\": -0.5,
\"response\": \"I don't know.\"
}
],
\"method\": \"gce\",
\"learning_steps\": 2,
\"stream\": false,
\"learning_rate\": 0.00001,
\"beta\": 0.04,
\"epsilon\": 0.2
}"{
"performance" : [
{
"id" : 1,
"advantage" : 1,
"accuracy" : 1,
"loss" : 0.0167236
},
{
"id" : 2,
"advantage" : -1,
"accuracy" : 0.333333,
"loss" : 5.15625
}
],
"usage" : {
"input_tokens" : 28,
"output_tokens" : 14,
"num_samples" : 2,
"steps" : 2,
"training_tokens" : 350,
"duration_ms" : 4294
}
}
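The same request expressed in Python, as a sketch using the requests library; the endpoint, headers, and fields mirror the curl example above:
# Python equivalent of the curl example above.
import os
import requests

payload = {
    "model": "MyCustomModel",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "responses": [
        {"reward": 1, "response": "The capital of France is Paris."},
        {"reward": -0.5, "response": "I don't know."},
    ],
    "method": "gce",
    "learning_steps": 2,
    "stream": False,
    "learning_rate": 1e-5,
    "beta": 0.04,
    "epsilon": 0.2,
}

resp = requests.post(
    "http://localhost:45678/v1/train",
    headers={"Authorization": f"Bearer {os.environ['TIGER_API_KEY']}"},
    json=payload,
)
result = resp.json()
for sample in result["performance"]:
    print(sample["id"], sample["advantage"], sample["accuracy"], sample["loss"])
print(result["usage"])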