Streaming Guide
Streaming allows you to receive responses from the TigerCity.ai API incrementally as they are generated, rather than waiting for the complete response. This provides a better user experience with real-time feedback and faster perceived response times.
How Streaming Works
When you enable streaming by setting stream: true in your API request, the server uses HTTP chunked transfer encoding to send the response in multiple parts. Each chunk is a JSON object that represents a different part of the generation process.
Request Format
To enable streaming, simply add "stream": true to your request body. For example:
{
  "model": "Llama-3.1-8b",
  "messages": [
    {
      "role": "user",
      "content": "Why is the sky blue?"
    }
  ],
  "stream": true
}

Response Format
The streaming response consists of multiple JSON chunks, one per line (newline-delimited JSON). Each chunk is a JSON object with a type field that indicates what kind of data it contains:
Chunk Types
- Info chunks (type: "info") - Status messages during model loading, such as "Loading model" or "Loading tokenizer". These provide feedback about the initialization process.
- Delta chunks (type: "delta") - Incremental text tokens as they are generated. Each delta chunk contains a small piece of the response text. You accumulate these to build the complete message.
- Stop chunks (type: "stop") - The final chunk, indicating that generation is complete. This includes the stop reason (e.g., "end_turn", "max_tokens") and usage statistics (duration, token counts).
- Error chunks (type: "error") - Error messages when something goes wrong during processing. These indicate that an error occurred and generation cannot continue.
Example Streaming Response
Here's what a typical streaming response looks like:
{"type": "info", "message": "Loading model llama-3.1-8b"}
{"type": "info", "message": "Loading tokenizer"}
{"type": "delta", "delta": "The"}
{"type": "delta", "delta": " sky"}
{"type": "delta", "delta": " appears"}
{"type": "delta", "delta": " blue"}
{"type": "delta", "delta": " because"}
{"type": "stop", "stop_reason": "end_turn", "usage": {
"duration_ms": 1396,
"input_tokens": 33,
"output_tokens": 35
}}Processing Streaming Responses
To process a streaming response, you need to:
- Read the response body as a stream (not as a single JSON document)
- Parse each line as a separate JSON object
- Handle each chunk type appropriately:
  - Display info messages to show progress
  - Append delta text to build the complete response
  - Extract usage statistics from the stop chunk
  - Handle error messages and stop processing if an error chunk is received
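Before the language-specific examples, here is a minimal sketch of this loop as a reusable Python helper. It assumes the line-delimited chunk format shown above; the helper name collect_stream is ours, not part of the API.

import json

def collect_stream(lines):
    """Accumulate delta chunks from an iterable of JSON lines.

    Returns (text, usage); raises RuntimeError on an error chunk.
    """
    text, usage = '', None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        if chunk['type'] == 'delta':
            text += chunk['delta']       # build up the response text
        elif chunk['type'] == 'stop':
            usage = chunk.get('usage')   # duration and token counts
        elif chunk['type'] == 'error':
            raise RuntimeError(chunk.get('message', 'stream error'))
        # info chunks are progress messages; ignore them here
    return text, usage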
JavaScript/TypeScript Example
const response = await fetch('http://localhost:45678/v1/generate_response', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.TIGER_API_KEY}`
  },
  body: JSON.stringify({
    model: 'Llama-3.1-8b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true
  })
});

// Fail fast on HTTP errors before reading the stream
if (!response.ok) {
  throw new Error(`HTTP ${response.status}: ${await response.text()}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let done = false;

while (!done) {
  const { value, done: readerDone } = await reader.read();
  if (readerDone) break;
  buffer += decoder.decode(value, { stream: true });

  const lines = buffer.split('\n');
  buffer = lines.pop() || ''; // Keep incomplete line in buffer

  for (const line of lines) {
    if (!line.trim()) continue;
    let chunk;
    try {
      chunk = JSON.parse(line);
    } catch {
      console.error('Failed to parse JSON chunk:', line);
      continue;
    }
    if (chunk.type === 'info') {
      console.log('Info:', chunk.message);
    } else if (chunk.type === 'delta') {
      process.stdout.write(chunk.delta); // Print incrementally
    } else if (chunk.type === 'stop') {
      console.log('\nGeneration complete:', chunk.usage);
    } else if (chunk.type === 'error') {
      console.error('Error:', chunk.message);
      done = true; // Stop reading the stream on error
      break;
    }
  }
}

Python Example
import requests
import json
import os
import sys

response = requests.post(
    'http://localhost:45678/v1/generate_response',
    headers={
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {os.environ["TIGER_API_KEY"]}'
    },
    json={
        'model': 'Llama-3.1-8b',
        'messages': [{'role': 'user', 'content': 'Hello!'}],
        'stream': True
    },
    stream=True
)

# Check if the request was successful before reading the stream
if response.status_code != 200:
    print(f'HTTP Error: {response.status_code} - {response.text}', file=sys.stderr)
    sys.exit(1)

try:
    for line in response.iter_lines():
        if not line:
            continue
        try:
            chunk = json.loads(line)
        except json.JSONDecodeError:
            print(f'Error: Failed to parse JSON chunk: {line}', file=sys.stderr)
            continue
        if chunk.get('type') == 'info':
            print(f'Info: {chunk["message"]}')
        elif chunk.get('type') == 'delta':
            print(chunk['delta'], end='', flush=True)
        elif chunk.get('type') == 'stop':
            print(f'\nGeneration complete: {chunk["usage"]}')
        elif chunk.get('type') == 'error':
            print(f'Error: {chunk["message"]}', file=sys.stderr)
            break  # Stop processing on error
except requests.exceptions.RequestException as e:
    print(f'Request error: {e}', file=sys.stderr)
    sys.exit(1)

Benefits of Streaming
- Faster perceived response time - Users see text appearing immediately rather than waiting for the entire response
- Better user experience - Real-time feedback makes the application feel more responsive
- Progress visibility - Info chunks provide insight into what's happening during model loading
- Early cancellation - You can stop reading the stream whenever you need to (see the sketch after this list), though the server will continue generating until completion
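A rough sketch of client-side cancellation with the requests library, assuming a streaming response opened as in the Python example above; the 200-character cutoff is arbitrary.

import json

# Assumes `response` is an open streaming response (see the Python example).
chars_printed = 0
for line in response.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    if chunk.get('type') == 'delta':
        print(chunk['delta'], end='', flush=True)
        chars_printed += len(chunk['delta'])
        if chars_printed > 200:   # arbitrary client-side cutoff
            response.close()      # close the connection and stop reading
            break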
When to Use Streaming
Use streaming when:
- Building interactive chat interfaces where users expect to see text appear in real-time
- You want to provide progress feedback during long-running operations
- You need to process responses incrementally rather than all at once
Use non-streaming when:
- You need the complete response before processing it
- You're making automated API calls where real-time display isn't important
- You prefer simpler error handling with a single response object
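For comparison, the equivalent non-streaming call is a single blocking request. This minimal sketch assumes only that the body is one JSON object; its exact fields are documented in the API Reference.

import os
import requests

response = requests.post(
    'http://localhost:45678/v1/generate_response',
    headers={'Authorization': f'Bearer {os.environ["TIGER_API_KEY"]}'},
    json={
        'model': 'Llama-3.1-8b',
        'messages': [{'role': 'user', 'content': 'Hello!'}]
        # no "stream": true - the server returns one complete response
    }
)
response.raise_for_status()
data = response.json()  # a single JSON object; see the API Reference for its fields
print(data)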
Supported Endpoints
Streaming is currently supported for:
- Generate Response - Stream text generation as it happens
- Train - Stream training progress and performance metrics
Error Handling
When streaming, errors can occur at any point. Make sure to:
- Check the HTTP status code before starting to read the stream
- Handle JSON parsing errors gracefully (malformed chunks)
- Implement timeout handling for streams that don't complete
- Validate chunk structure before accessing fields
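As one example, timeout handling with the requests library might look like the sketch below. requests applies the read timeout to each gap between received bytes, which suits streams; the values shown are arbitrary.

import os
import sys
import requests

try:
    response = requests.post(
        'http://localhost:45678/v1/generate_response',
        headers={'Authorization': f'Bearer {os.environ["TIGER_API_KEY"]}'},
        json={
            'model': 'Llama-3.1-8b',
            'messages': [{'role': 'user', 'content': 'Hello!'}],
            'stream': True
        },
        stream=True,
        timeout=(5, 30)  # 5s to connect, 30s of allowed silence between bytes
    )
    response.raise_for_status()  # surface HTTP errors before streaming
    for line in response.iter_lines():
        ...  # parse and dispatch chunks as shown in the Python example above
except requests.exceptions.Timeout:
    print('Stream timed out', file=sys.stderr)
except requests.exceptions.HTTPError as e:
    print(f'HTTP error: {e}', file=sys.stderr)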
For more details on specific endpoints, see the API Reference.