Generate Response

POST
/v1/generate_response

Send a structured list of input messages with text, and the model will generate the next message in the conversation.

Parameters


model:required string

The identifier of the language model to use for generating the response. This should correspond to a valid model name from your available models (e.g., Llama-3.1-8b, DeepSeek-R1-8b, etc.).

You can also specify a model you have created.


messages:required Array of Message

Message Structure

Models are trained to operate on alternating user and assistant conversational turns. When creating a new Message, you specify the prior conversational turns with the messages parameter, and the model then generates the next Message in the conversation.

Message

Each input message must be an object with a role and content. You can specify three types of roles:

  • system - Sets instructions and context for the entire conversation. This is placed at the beginning of the messages array.
  • user - Represents input from the end user
  • assistant - Represents responses from the model
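
For example, a request that continues an existing conversation includes the prior assistant turn in messages, ending with the new user turn (the follow-up content here is illustrative):

{
  "model"    : "your-model-name",
  "messages" : [
    {
      "role"    : "user",
      "content" : "Why is the sky blue?"
    },
    {
      "role"    : "assistant",
      "content" : "Because sunlight scatters off gas molecules in the atmosphere (Rayleigh scattering)."
    },
    {
      "role"    : "user",
      "content" : "Does the same effect explain red sunsets?"
    }
  ]
}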

System Message

The system role allows you to provide context, instructions, and guidelines that influence the model's behavior throughout the conversation. System messages are optional but powerful for:

  • Defining the assistant's role, personality, or expertise
  • Setting response format requirements (e.g., "always respond in JSON")
  • Providing domain-specific context or constraints
  • Establishing behavioral guidelines

System messages are included in the messages array as the first message:

{
  "model"    : "your-model-name",
  "messages" : [
    {
      "role"    : "system",
      "content" : "You are a helpful API documentation assistant. Provide clear, concise explanations with code examples."
    },
    {
      "role"    : "user",
      "content" : "Why is the sky blue?"
    }
  ]
}
Message members

  • role - One of "system", "user", or "assistant"
  • content - The text content of the message

stream:optional boolean

Controls whether the response is returned as a complete message or streamed incrementally as it's generated.

  • false - Returns the complete response only after generation is finished
  • true - Streams response chunks as they're generated, allowing for real-time display
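
As a sketch, a streaming client might look like the following Python. The chunk format shown is an assumption (newline-delimited JSON with the same message shape as the standard response); this document does not specify the exact wire format, so adapt the parsing to what the server actually emits.

import json
import requests

# Minimal streaming sketch. Assumes newline-delimited JSON chunks --
# an assumption, not a format confirmed by this document.
resp = requests.post(
    "http://localhost:45678/v1/generate_response",
    headers={"Authorization": "Bearer YOUR_TIGER_API_KEY"},
    json={
        "model": "Llama-3.1-8b",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stream": True,
    },
    stream=True,  # tell requests not to buffer the whole body
)
resp.raise_for_status()

for line in resp.iter_lines():
    if not line:
        continue  # skip keep-alive blank lines
    chunk = json.loads(line)
    # Hypothetical chunk shape -- adjust the key path to the real payload.
    print(chunk.get("message", {}).get("content", ""), end="", flush=True)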

temperature:optional number

Controls the randomness and creativity of the model's responses. Lower values make output more focused and deterministic, while higher values increase randomness and creativity.

  • 0.0 - Most deterministic, repeatable responses
  • 1.0 - Balanced creativity and coherence (recommended for most use cases)
  • 2.0 - Maximum randomness and creativity

Note: For tasks requiring consistency (like data extraction or classification), use lower values (0.0-0.3). For creative tasks (like brainstorming or storytelling), higher values (0.7-1.5) work better.
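
To build intuition, here is a toy Python sketch of the standard softmax-with-temperature idea (illustration only, not the server's implementation): lower temperatures sharpen the distribution toward the most likely token, higher temperatures flatten it.

import math

# Toy token logits, not real model output.
logits = {"the": 2.0, "a": 1.0, "blue": 0.5}

def softmax_with_temperature(logits, temperature):
    # As temperature -> 0 this approaches argmax (deterministic);
    # temperature must be > 0 here to avoid division by zero.
    scaled = {tok: l / temperature for tok, l in logits.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / total for tok, v in scaled.items()}

print(softmax_with_temperature(logits, 0.2))  # sharply peaked on "the"
print(softmax_with_temperature(logits, 1.5))  # flatter, more random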


top_p:optional number

Also known as "nucleus sampling," this parameter controls the diversity of responses by limiting the model to consider only the most probable tokens whose cumulative probability reaches the specified threshold.

  • 0.1 - Very focused, only highly probable tokens
  • 0.5 - Moderately diverse output
  • 1.0 - Considers all tokens based on their probability

Note: It's generally recommended to adjust either temperature OR top_p, but not both simultaneously. When top_p is less than 1.0, the model samples from the smallest set of tokens whose cumulative probability exceeds the threshold.
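
Here is a toy Python sketch of that selection rule (illustration only, not the server's implementation): tokens are taken in order of probability until their cumulative mass reaches the threshold.

# Toy probability distribution over the next token.
probs = {"the": 0.4, "a": 0.3, "blue": 0.2, "green": 0.1}

def nucleus(probs, top_p):
    # Keep the most probable tokens until their cumulative mass
    # reaches top_p; the model then samples only from this set.
    kept, cumulative = [], 0.0
    for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept.append(token)
        cumulative += p
        if cumulative >= top_p:
            break
    return kept

print(nucleus(probs, 0.5))  # ['the', 'a']
print(nucleus(probs, 1.0))  # all tokens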

Output


GenerateResponse

This is the standard (non-streaming) response format returned by the Generate Response endpoint. It contains the generated message along with usage statistics.


id:required string

A unique identifier for the conversation session.


message:required Message

The generated message from the assistant. Contains the role (always "assistant") and the content (the generated text).


usage:required Usage

Usage

Usage statistics for API requests, including token counts and processing duration.

Usage members

  • duration_ms - Time taken to generate the response, in milliseconds
  • input_tokens - Number of tokens in the input messages
  • output_tokens - Number of tokens in the generated message
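
For example, you can derive generation throughput from these fields (values taken from the sample response below):

usage = {"duration_ms": 1297, "input_tokens": 33, "output_tokens": 33}

# Output tokens generated per second of processing time.
tokens_per_second = usage["output_tokens"] / (usage["duration_ms"] / 1000)
print(f"{tokens_per_second:.1f} tokens/s")  # 25.4 tokens/s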

Generate Response (standard)

curl http://localhost:45678/v1/generate_response \
    -X POST \
    -H "Authorization: Bearer $TIGER_API_KEY" \
    -H 'Content-Type: application/json' \
    -d '{
  "model": "Llama-3.1-8b",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant. Answer questions briefly."
    },
    {
      "role": "user",
      "content": "Why is the sky blue?"
    }
  ],
  "stream": false,
  "temperature": 0.7,
  "top_p": 0.95
}'

Response (standard)

{
  "id"      : "7",
  "message" : {
    "content" : "The sky appears blue because of a phenomenon called Rayleigh scattering, where sunlight scatters off tiny molecules of gases in the atmosphere, such as nitrogen and oxygen.",
    "role"    : "assistant"
  },
  "usage"   : {
    "duration_ms"   : 1297,
    "input_tokens"  : 33,
    "output_tokens" : 33
  }
}