Streaming Output
Overview
Vitamem supports streaming output, delivering LLM response tokens as they are generated rather than waiting for the complete response. This enables responsive UIs with progressive text rendering.
How It Works
The streaming flow follows the same lifecycle as non-streaming chat():
- Thread resolution — Dormant/closed threads are redirected, cooling threads are reactivated
- User message saved — Your message is stored before the LLM call
- Memory retrieval — If autoRetrieve is enabled, memories are fetched and injected into context
- Streaming begins — The LLM generates tokens, yielded one-by-one via an AsyncGenerator (a sketch of the returned shape follows this list)
- Completion — After the stream ends, the full reply is saved to storage and thread timestamps are updated
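For orientation, here is a rough sketch of the value that chatStream() resolves to, as implied by the examples below. The interface name and exact field types are assumptions rather than Vitamem's exported types.

```ts
// Rough shape of the chatStream() result, inferred from the examples below.
// The interface name and exact field types are assumptions, not exported types.
interface ChatStreamResult {
  stream: AsyncGenerator<string>; // yields reply text chunks one at a time
  thread: { state: string };      // resolved thread; metadata is available immediately
  memories?: unknown[];           // retrieved memories, when autoRetrieve is enabled
}
```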
Basic Streaming
```ts
const { stream, thread, memories } = await vm.chatStream({
  threadId: "thread-123",
  message: "What was my last A1C level?",
});

// Metadata (thread, memories) is available immediately
console.log("Thread state:", thread.state);
console.log("Memories used:", memories?.length ?? 0);

// Consume the stream
let fullReply = "";
for await (const chunk of stream) {
  process.stdout.write(chunk); // Print tokens as they arrive
  fullReply += chunk;
}
// After the loop, the reply has been saved to storage automatically
```

Convenience Method
```ts
// Automatically resolves or creates a thread for the user
const { stream, thread } = await vm.chatWithUserStream({
  userId: "user-abc",
  message: "How is my blood pressure trending?",
});

for await (const chunk of stream) {
  process.stdout.write(chunk);
}
```

With Custom System Prompt
```ts
const { stream } = await vm.chatStream({
  threadId: "thread-123",
  message: "Summarize my health history",
  systemPrompt: "You are a concise health assistant. Respond in bullet points.",
});

for await (const chunk of stream) {
  process.stdout.write(chunk);
}
```

Server-Sent Events (SSE)
When building a web API, you can convert the AsyncGenerator to an SSE stream:
```ts
// Next.js App Router route handler (any framework using web-standard Request/Response)
export async function POST(request: Request) {
  const { threadId, message } = await request.json();
  const vm = getVitamem();

  const { stream, thread, memories } = await vm.chatStream({ threadId, message });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      // Send metadata first
      controller.enqueue(encoder.encode(
        `data: ${JSON.stringify({ type: "meta", thread, memories })}\n\n`
      ));

      // Stream text chunks
      for await (const chunk of stream) {
        controller.enqueue(encoder.encode(
          `data: ${JSON.stringify({ type: "delta", chunk })}\n\n`
        ));
      }

      // Signal completion
      controller.enqueue(encoder.encode(
        `data: ${JSON.stringify({ type: "done" })}\n\n`
      ));
      controller.close();
    },
  });

  return new Response(readable, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache, no-transform",
    },
  });
}
```

Browser Consumer
```ts
const res = await fetch("/api/chat", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    Accept: "text/event-stream",
  },
  body: JSON.stringify({ threadId, message }),
});

const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // Keep any trailing partial line in the buffer; parse only complete lines
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? "";

  for (const line of lines) {
    if (line.startsWith("data: ")) {
      const event = JSON.parse(line.slice(6));
      if (event.type === "delta") {
        appendToUI(event.chunk);
      }
    }
  }
}
```

Fallback Behavior
If the LLM adapter does not implement chatStream, Vitamem automatically falls back to non-streaming: it calls chat(), waits for the full response, and yields it as a single chunk. This means chatStream() is always safe to call regardless of adapter support.
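Conceptually, the fallback behaves like the sketch below. This is an illustration only, not Vitamem's source, and it assumes the adapter's chat() resolves to the full reply text.

```ts
// Illustrative sketch of the fallback path; not Vitamem's actual implementation.
// Assumes adapter.chat() resolves to the full reply text.
async function* streamOrFallback(
  adapter: {
    chat(messages: unknown[]): Promise<string>;
    chatStream?(messages: unknown[]): AsyncGenerator<string>;
  },
  messages: unknown[],
): AsyncGenerator<string> {
  if (adapter.chatStream) {
    yield* adapter.chatStream(messages); // adapter streams natively
  } else {
    yield await adapter.chat(messages);  // single chunk containing the full reply
  }
}
```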
Storage Behavior
The complete reply is saved to storage after the stream finishes — not incrementally. If the consumer disconnects mid-stream, the partial response is not persisted.
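If persisting the full reply matters even when the user stops watching, one option is to keep draining the stream to completion. The sketch below assumes the save runs when the generator finishes, as described above; uiDetached and renderChunk are hypothetical names for your own UI wiring.

```ts
// Sketch: the post-stream save only runs once the generator completes, so keep
// iterating even after the UI no longer needs chunks. uiDetached and
// renderChunk are hypothetical placeholders for your own UI wiring.
const { stream } = await vm.chatStream({ threadId, message });

let uiDetached = false; // e.g. set to true when the user navigates away

for await (const chunk of stream) {
  if (!uiDetached) {
    renderChunk(chunk);
  }
  // No early break: breaking out of the loop would skip the post-stream save.
}
```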
Custom Adapter Support
To add streaming to a custom adapter, implement the optional chatStream method:
```ts
const myAdapter: LLMAdapter = {
  async chat(messages) { /* ... */ },

  async *chatStream(messages) {
    const stream = await myLLMClient.generate({ messages, stream: true });
    for await (const chunk of stream) {
      yield chunk.text;
    }
  },

  async extractMemories(messages) { /* ... */ },
  async embed(text) { /* ... */ },
};
```