Streaming Guide

Streaming allows you to receive responses from the TigerCity.ai API incrementally as they are generated, rather than waiting for the complete response. This provides a better user experience with real-time feedback and faster perceived response times.

How Streaming Works

When you enable streaming by setting stream: true in your API request, the server uses HTTP chunked transfer encoding to send the response in multiple parts. Each chunk is a JSON object that represents a different part of the generation process.

Request Format

To enable streaming, add "stream": true to your request body. For example:

{
  "model": "Llama-3.1-8b",
  "messages": [
    {
      "role": "user",
      "content": "Why is the sky blue?"
    }
  ],
  "stream": true
}

Response Format

The streaming response consists of newline-delimited JSON chunks, one per line. Each chunk is a JSON object with a type field that indicates what kind of data it contains (a type-level sketch follows the list):

Chunk Types

  • Info chunks (type: "info") - Status messages during model loading, such as "Loading model" or "Loading tokenizer". These provide feedback about the initialization process.
  • Delta chunks (type: "delta") - Incremental text tokens as they are generated. Each delta chunk contains a small piece of the response text. You accumulate these to build the complete message.
  • Stop chunks (type: "stop") - The final chunk indicating that generation is complete. This includes the stop reason (e.g., "end_turn", "max_tokens") and usage statistics (duration, token counts).
  • Error chunks (type: "error") - Error messages when something goes wrong during processing. These indicate that an error occurred and generation cannot continue.
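
For reference, these shapes can be modeled as a TypeScript discriminated union. This is a sketch inferred from the example chunks in this guide, not an official schema; additional fields, and the full set of stop_reason values, may exist:

// Hypothetical chunk shapes, inferred from the example responses in this guide.
interface InfoChunk {
  type: 'info';
  message: string;          // e.g. "Loading model llama-3.1-8b"
}

interface DeltaChunk {
  type: 'delta';
  delta: string;            // incremental text; concatenate to build the reply
}

interface StopChunk {
  type: 'stop';
  stop_reason: string;      // e.g. "end_turn", "max_tokens"
  usage: {
    duration_ms: number;
    input_tokens: number;
    output_tokens: number;
  };
}

interface ErrorChunk {
  type: 'error';
  message: string;
}

type StreamChunk = InfoChunk | DeltaChunk | StopChunk | ErrorChunk;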

Example Streaming Response

Here's what a typical streaming response looks like:

{"type": "info", "message": "Loading model llama-3.1-8b"}
{"type": "info", "message": "Loading tokenizer"}
{"type": "delta", "delta": "The"}
{"type": "delta", "delta": " sky"}
{"type": "delta", "delta": " appears"}
{"type": "delta", "delta": " blue"}
{"type": "delta", "delta": " because"}
{"type": "stop", "stop_reason": "end_turn", "usage": {
  "duration_ms": 1396,
  "input_tokens": 33,
  "output_tokens": 35
}}
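
Concatenating the delta fields in arrival order reconstructs the generated text. With the chunks above:

// Joining the delta values from the example above, in order:
const deltas = ['The', ' sky', ' appears', ' blue', ' because'];
const text = deltas.join(''); // "The sky appears blue because"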

Processing Streaming Responses

To process a streaming response, you need to:

  1. Read the response body as a stream (not as a single JSON document)
  2. Parse each line as a separate JSON object
  3. Handle each chunk type appropriately:
    • Display info messages to show progress
    • Append delta text to build the complete response
    • Extract usage statistics from the stop chunk
    • Handle error messages and stop processing if an error chunk is received

JavaScript/TypeScript Example

const response = await fetch('http://localhost:45678/v1/generate_response', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.TIGER_API_KEY}`
  },
  body: JSON.stringify({
    model: 'Llama-3.1-8b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true
  })
});

// Check the HTTP status code before starting to read the stream
if (!response.ok) {
  throw new Error(`HTTP ${response.status}: ${await response.text()}`);
}

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
let finished = false;

while (!finished) {
  const { value, done } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() || ''; // Keep the incomplete trailing line in the buffer

  for (const line of lines) {
    if (!line.trim()) continue;

    let chunk;
    try {
      chunk = JSON.parse(line);
    } catch {
      console.error('Failed to parse JSON chunk:', line);
      continue; // Skip malformed chunks
    }

    if (chunk.type === 'info') {
      console.log('Info:', chunk.message);
    } else if (chunk.type === 'delta') {
      process.stdout.write(chunk.delta); // Print incrementally
    } else if (chunk.type === 'stop') {
      console.log('\nGeneration complete:', chunk.usage);
    } else if (chunk.type === 'error') {
      console.error('Error:', chunk.message);
      finished = true; // A bare break would only exit the inner loop
      break;
    }
  }
}

Python Example

import requests
import json
import os
import sys

response = requests.post(
    'http://localhost:45678/v1/generate_response',
    headers={
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {os.environ["TIGER_API_KEY"]}'
    },
    json={
        'model': 'Llama-3.1-8b',
        'messages': [{'role': 'user', 'content': 'Hello!'}],
        'stream': True
    },
    stream=True,
    timeout=60  # Connect/read timeout so the request cannot hang indefinitely
)

# Check if request was successful
if response.status_code != 200:
    print(f'HTTP Error: {response.status_code} - {response.text}', file=sys.stderr)
    sys.exit(1)

try:
    for line in response.iter_lines():
        if line:
            try:
                chunk = json.loads(line)

                if chunk.get('type') == 'info':
                    print(f'Info: {chunk["message"]}')
                elif chunk.get('type') == 'delta':
                    print(chunk['delta'], end='', flush=True)
                elif chunk.get('type') == 'stop':
                    print(f'\nGeneration complete: {chunk["usage"]}')
                elif chunk.get('type') == 'error':
                    print(f'Error: {chunk["message"]}', file=sys.stderr)
                    break  # Stop processing on error
            except json.JSONDecodeError:
                print(f'Error: Failed to parse JSON chunk: {line}', file=sys.stderr)
                continue
except requests.exceptions.RequestException as e:
    print(f'Request error: {e}', file=sys.stderr)
    sys.exit(1)

Benefits of Streaming

  • Faster perceived response time - Users see text appearing immediately rather than waiting for the entire response
  • Better user experience - Real-time feedback makes the application feel more responsive
  • Progress visibility - Info chunks provide insight into what's happening during model loading
  • Early cancellation - You can stop processing if needed, though the server will continue generating until completion (see the sketch after this list)
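
For the cancellation point, here is a minimal sketch using an AbortController with the fetch-based example above. As noted, aborting only stops the client from reading; the server continues generating:

// Pass an AbortSignal to fetch; calling abort() stops the client-side stream.
const controller = new AbortController();

const response = await fetch('http://localhost:45678/v1/generate_response', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.TIGER_API_KEY}`
  },
  body: JSON.stringify({
    model: 'Llama-3.1-8b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true
  }),
  signal: controller.signal // In-flight reads reject with an AbortError when aborted
});

// Later, e.g. from a "Stop" button handler:
controller.abort();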

When to Use Streaming

Use streaming when:

  • Building interactive chat interfaces where users expect to see text appear in real-time
  • You want to provide progress feedback during long-running operations
  • You need to process responses incrementally rather than all at once

Use non-streaming when:

  • You need the complete response before processing it
  • You're making automated API calls where real-time display isn't important
  • You prefer simpler error handling with a single response object

Supported Endpoints

Streaming is currently supported for:

  • Generate Response - Stream text generation as it happens
  • Train - Stream training progress and performance metrics

Error Handling

When streaming, errors can occur at any point. Make sure to:

  • Check the HTTP status code before starting to read the stream
  • Handle JSON parsing errors gracefully (malformed chunks)
  • Implement timeout handling for streams that don't complete (a sketch follows this list)
  • Validate chunk structure before accessing fields
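
For the timeout point, one approach is an idle timer that aborts the request if no chunk arrives within a chosen window. This sketch builds on the JavaScript example above; the 30-second window is an arbitrary choice:

// Idle-timeout sketch: abort the stream if no data arrives for 30 seconds.
const controller = new AbortController();
let idleTimer = setTimeout(() => controller.abort(), 30_000);

const response = await fetch('http://localhost:45678/v1/generate_response', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': `Bearer ${process.env.TIGER_API_KEY}`
  },
  body: JSON.stringify({
    model: 'Llama-3.1-8b',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: true
  }),
  signal: controller.signal
});

const reader = response.body.getReader();
while (true) {
  const { value, done } = await reader.read(); // Rejects with an AbortError on timeout
  if (done) break;
  clearTimeout(idleTimer); // Data arrived: reset the idle timer
  idleTimer = setTimeout(() => controller.abort(), 30_000);
  // ...decode and parse lines as in the main example above...
}
clearTimeout(idleTimer);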

For more details on specific endpoints, see the API Reference.