Streaming Guide

Streaming allows you to receive responses from the TigerCity.ai API incrementally as they are generated, rather than waiting for the complete response. This provides a better user experience with real-time feedback and faster perceived response times.

How Streaming Works

When you enable streaming by setting stream: true in your API request, the server uses HTTP chunked transfer encoding to send the response in multiple parts. Each chunk is a JSON object that represents a different part of the generation process.

Request Format

To enable streaming, add "stream": true to your request body. For example:

{
  "model": "Llama-3.1-8b",
  "messages": [
    {
      "role": "user",
      "content": "Why is the sky blue?"
    }
  ],
  "stream": true
}

Response Format

The streaming response consists of newline-delimited JSON chunks, one per line. Each chunk is a JSON object with a type field that indicates what kind of data it contains (a type-level sketch follows the list):

Chunk Types

  • Info chunks (type: "info") - Status messages during model loading, such as "Loading model" or "Loading tokenizer". These provide feedback about the initialization process.
  • Delta chunks (type: "delta") - Incremental text tokens as they are generated. Each delta chunk contains a small piece of the response text. You accumulate these to build the complete message.
  • Stop chunks (type: "stop") - The final chunk indicating that generation is complete. This includes the stop reason (e.g., "end_turn", "max_tokens") and usage statistics (duration, token counts).
  • Error chunks (type: "error") - Error messages when something goes wrong during processing. These indicate that an error occurred and generation cannot continue.
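
For reference, these shapes can be modeled as a TypeScript discriminated union. This is a sketch inferred from the example chunks in this guide, not an official schema; additional fields, and the full set of stop_reason values, may exist:

// Hypothetical chunk shapes, inferred from the example responses in this guide.
interface InfoChunk {
  type: 'info';
  message: string;          // e.g. "Loading model llama-3.1-8b"
}

interface DeltaChunk {
  type: 'delta';
  delta: string;            // incremental text; concatenate to build the reply
}

interface StopChunk {
  type: 'stop';
  stop_reason: string;      // e.g. "end_turn", "max_tokens"
  usage: {
    duration_ms: number;
    input_tokens: number;
    output_tokens: number;
  };
}

interface ErrorChunk {
  type: 'error';
  message: string;
}

type StreamChunk = InfoChunk | DeltaChunk | StopChunk | ErrorChunk;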

Example Streaming Response

Here's what a typical streaming response looks like:

{"type": "info", "message": "Loading model llama-3.1-8b"}
{"type": "info", "message": "Loading tokenizer"}
{"type": "delta", "delta": "The"}
{"type": "delta", "delta": " sky"}
{"type": "delta", "delta": " appears"}
{"type": "delta", "delta": " blue"}
{"type": "delta", "delta": " because"}
{"type": "stop", "stop_reason": "end_turn", "usage": {
  "duration_ms": 1396,
  "input_tokens": 33,
  "output_tokens": 35
}}
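
Concatenating the delta fields in arrival order reconstructs the generated text. With the chunks above:

// Joining the delta values from the example above, in order:
const deltas = ['The', ' sky', ' appears', ' blue', ' because'];
const text = deltas.join(''); // "The sky appears blue because"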

Processing Streaming Responses

To process a streaming response, you need to:

  1. Read the response body as a stream (not as a single JSON document)
  2. Parse each line as a separate JSON object
  3. Handle each chunk type appropriately:
    • Display info messages to show progress
    • Append delta text to build the complete response
    • Extract usage statistics from the stop chunk
    • Handle error messages and stop processing if an error chunk is received

JavaScript/TypeScript Example

const response = await fetch('http://localhost:45678/v1/generate_response', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.TIGER_API_KEY}`
  },
  body: JSON.stringify({
    model: 'Llama-3.1-8b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true
  })
});

// Check the HTTP status code before starting to read the stream
if (!response.ok) {
  throw new Error(`HTTP ${response.status}: ${await response.text()}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let finished = false;

while (!finished) {
  const { value, done } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() || ''; // Keep the incomplete trailing line in the buffer

  for (const line of lines) {
    if (!line.trim()) continue;

    let chunk;
    try {
      chunk = JSON.parse(line);
    } catch {
      console.error('Failed to parse JSON chunk:', line);
      continue; // Skip malformed chunks
    }

    if (chunk.type === 'info') {
      console.log('Info:', chunk.message);
    } else if (chunk.type === 'delta') {
      process.stdout.write(chunk.delta); // Print incrementally
    } else if (chunk.type === 'stop') {
      console.log('\nGeneration complete:', chunk.usage);
    } else if (chunk.type === 'error') {
      console.error('Error:', chunk.message);
      finished = true; // A bare break would only exit the inner loop
      break;
    }
  }
}

Python Example

import requests
import json
import os
import sys

response = requests.post(
    'http://localhost:45678/v1/generate_response',
    headers={
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {os.environ["TIGER_API_KEY"]}'
    },
    json={
        'model': 'Llama-3.1-8b',
        'messages': [{'role': 'user', 'content': 'Hello!'}],
        'stream': True
    },
    stream=True,
    timeout=60  # Connect/read timeout so the request cannot hang indefinitely
)

# Check if request was successful
if response.status_code != 200:
    print(f'HTTP Error: {response.status_code} - {response.text}', file=sys.stderr)
    sys.exit(1)

try:
    for line in response.iter_lines():
        if line:
            try:
                chunk = json.loads(line)

                if chunk.get('type') == 'info':
                    print(f'Info: {chunk["message"]}')
                elif chunk.get('type') == 'delta':
                    print(chunk['delta'], end='', flush=True)
                elif chunk.get('type') == 'stop':
                    print(f'\nGeneration complete: {chunk["usage"]}')
                elif chunk.get('type') == 'error':
                    print(f'Error: {chunk["message"]}', file=sys.stderr)
                    break  # Stop processing on error
            except json.JSONDecodeError:
                print(f'Error: Failed to parse JSON chunk: {line}', file=sys.stderr)
                continue
except requests.exceptions.RequestException as e:
    print(f'Request error: {e}', file=sys.stderr)
    sys.exit(1)

Benefits of Streaming

  • Faster perceived response time - Users see text appearing immediately rather than waiting for the entire response
  • Better user experience - Real-time feedback makes the application feel more responsive
  • Progress visibility - Info chunks provide insight into what's happening during model loading
  • Early cancellation - You can stop processing if needed, though the server will continue generating until completion (see the sketch after this list)
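
For the cancellation point, here is a minimal sketch using an AbortController with the fetch-based example above. As noted, aborting only stops the client from reading; the server continues generating:

// Pass an AbortSignal to fetch; calling abort() stops the client-side stream.
const controller = new AbortController();

const response = await fetch('http://localhost:45678/v1/generate_response', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.TIGER_API_KEY}`
  },
  body: JSON.stringify({
    model: 'Llama-3.1-8b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true
  }),
  signal: controller.signal // In-flight reads reject with an AbortError when aborted
});

// Later, e.g. from a "Stop" button handler:
controller.abort();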

When to Use Streaming

Use streaming when:

  • Building interactive chat interfaces where users expect to see text appear in real-time
  • You want to provide progress feedback during long-running operations
  • You need to process responses incrementally rather than all at once

Use non-streaming when:

  • You need the complete response before processing it
  • You're making automated API calls where real-time display isn't important
  • You prefer simpler error handling with a single response object

Supported Endpoints

Streaming is currently supported for:

  • Generate Response - Stream text generation as it happens
  • Train - Stream training progress and performance metrics

Error Handling

When streaming, errors can occur at any point. Make sure to:

  • Check the HTTP status code before starting to read the stream
  • Handle JSON parsing errors gracefully (malformed chunks)
  • Implement timeout handling for streams that don't complete (a sketch follows this list)
  • Validate chunk structure before accessing fields
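
For the timeout point, one approach is an idle timer that aborts the request if no chunk arrives within a chosen window. This sketch builds on the JavaScript example above; the 30-second window is an arbitrary choice:

// Idle-timeout sketch: abort the stream if no data arrives for 30 seconds.
const controller = new AbortController();
let idleTimer = setTimeout(() => controller.abort(), 30_000);

const response = await fetch('http://localhost:45678/v1/generate_response', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.TIGER_API_KEY}`
  },
  body: JSON.stringify({
    model: 'Llama-3.1-8b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true
  }),
  signal: controller.signal
});

const reader = response.body.getReader();
while (true) {
  const { value, done } = await reader.read(); // Rejects with an AbortError on timeout
  if (done) break;
  clearTimeout(idleTimer); // Data arrived: reset the idle timer
  idleTimer = setTimeout(() => controller.abort(), 30_000);
  // ...decode and parse lines as in the main example above...
}
clearTimeout(idleTimer);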

For more details on specific endpoints, see the API Reference.