# Auto-Retrieve

When `autoRetrieve` is enabled, Vitamem automatically searches the user's memory store on every `chat()` call and injects relevant memories as a system message before sending the conversation to the LLM. This means your application does not need to call `retrieve()` or build memory-aware system prompts manually.
## How It Works

With `autoRetrieve: true`, every call to `chat()` performs these additional steps:
1. **Embed the user's message** — Vitamem calls `llm.embed(message)` to get a vector representation of what the user just said.
2. **Search memories** — The embedding is used to search the user's memory store via `storage.searchMemories()`, returning the most semantically similar facts.
3. **Inject as system message** — If any memories are found, Vitamem prepends a system message to the conversation with the format:

   ```
   Relevant context from previous sessions:
   - Has Type 2 diabetes (confirmed)
   - Takes metformin 500mg twice daily (confirmed)
   - Prefers morning check-ins (inferred)
   ```

4. **Return memories for transparency** — The `chat()` response includes a `memories` field containing the `MemoryMatch[]` that were injected, so your application can log or display them.
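The "inject as system message" step can be sketched as a small formatting helper. This is illustrative only: the `MemoryMatch` shape below is inferred from the examples on this page, not Vitamem's actual types.

```typescript
// Hypothetical shape, inferred from the examples on this page.
interface MemoryMatch {
  content: string;
  source: "confirmed" | "inferred";
  score: number;
}

// Build the system message that gets prepended when memories are found.
// Returns null when there is nothing to inject (the step is skipped).
function formatMemoryContext(memories: MemoryMatch[]): string | null {
  if (memories.length === 0) return null;
  const lines = memories.map((m) => `- ${m.content} (${m.source})`);
  return ["Relevant context from previous sessions:", ...lines].join("\n");
}
```

Returning `null` for an empty result mirrors the documented behavior that no system message is added when the search finds nothing.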
## Enabling Auto-Retrieve

```ts
const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY,
  storage: "ephemeral",
  autoRetrieve: true, // default: false
});
```

## Using It
```ts
const { reply, thread, memories } = await mem.chat({
  threadId: thread.id,
  message: "How should I adjust my diet?",
});

// `memories` contains what was injected
if (memories && memories.length > 0) {
  console.log("Context injected:");
  for (const m of memories) {
    console.log(`  - ${m.content} (${m.source}, score: ${m.score.toFixed(2)})`);
  }
}
```

The `memories` field is only present when `autoRetrieve` is enabled. When disabled, it is `undefined`.
## Auto-Retrieve vs Manual Retrieve

There are two patterns for getting memories into your LLM context. Each has trade-offs.
### Auto-Retrieve (`autoRetrieve: true`)

```ts
// Just call chat() -- memories are injected automatically
const { reply, memories } = await mem.chat({
  threadId: thread.id,
  message: userMessage,
});
```

Advantages:
- Zero boilerplate — works out of the box
- Every message gets relevant context, even mid-conversation
- The injected system message format is consistent
Trade-offs:
- One extra embedding API call per `chat()` call (the user's message must be embedded to search)
- You cannot customize which memories are injected or how they are formatted
- The search query is always the raw user message, which may not be the best retrieval query for every situation
### Manual Retrieve (`retrieve()` + `systemPrompt`)

```ts
// 1. Retrieve memories with a custom query
const memories = await mem.retrieve({
  userId,
  query: "medications health conditions current treatment",
  limit: 5,
});

// 2. Format however you want
const context = memories
  .filter((m) => m.score > 0.7) // custom threshold
  .map((m) => m.content)
  .join("\n");

// 3. Inject via systemPrompt
const { reply } = await mem.chat({
  threadId: thread.id,
  message: userMessage,
  systemPrompt: `User context:\n${context}\n\nRespond as a caring health companion.`,
});
```

Advantages:
- Full control over the retrieval query (can be different from the user’s message)
- Custom filtering (score thresholds, source filtering, topic-specific queries)
- Custom formatting of the injected context
- Can retrieve once per session instead of once per message
Trade-offs:
- More code to write and maintain
- You must remember to retrieve and inject on every relevant interaction
## When to Use Each

| Scenario | Recommendation |
|---|---|
| Prototyping or simple chat apps | autoRetrieve: true |
| Health companion with topic-specific queries | Manual retrieve() |
| High-volume API with cost concerns | Manual retrieve() (retrieve once per session, not per message) |
| Multi-turn conversation where context matters throughout | autoRetrieve: true |
| You need to filter by memory source (confirmed only) | Manual retrieve() |
| You want the simplest possible integration | autoRetrieve: true |
## Combining Both

You can enable `autoRetrieve` and still pass a `systemPrompt`. Both will be included. The auto-retrieved memories are injected as the first system message, followed by your custom system prompt:

```ts
const { reply } = await mem.chat({
  threadId: thread.id,
  message: "What about my exercise routine?",
  systemPrompt: "You are a supportive health companion. Be warm and encouraging.",
});

// The LLM sees:
// 1. System: "Relevant context from previous sessions: ..." (auto-injected)
// 2. System: "You are a supportive health companion. ..." (your prompt)
// 3. User/assistant message history
// 4. User: "What about my exercise routine?"
```

## Memory Formatting

When memories are injected into the LLM context (via `autoRetrieve` or manual retrieval), Vitamem can format them to maximize usefulness and minimize token waste. Three formatting features are available:
### Priority Signaling

Memories can be prefixed with priority markers so the LLM knows which facts are most important:

```
[CRITICAL] Allergic to penicillin (confirmed)
[IMPORTANT] Takes metformin 500mg twice daily (confirmed)
[INFO] Prefers morning check-ins (inferred)
```

Enable with `prioritySignaling: true`. Pinned memories are marked `[CRITICAL]`, confirmed facts are `[IMPORTANT]`, and inferred facts are `[INFO]`.
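The marker assignment described above can be sketched as follows. The `pinned` flag and field names here are assumptions for illustration, not Vitamem's documented types.

```typescript
type Source = "confirmed" | "inferred";

// Map a memory to its priority marker per the rules above:
// pinned -> [CRITICAL], confirmed -> [IMPORTANT], inferred -> [INFO].
// (`pinned` is a hypothetical field name.)
function priorityMarker(memory: { source: Source; pinned?: boolean }): string {
  if (memory.pinned) return "[CRITICAL]";
  return memory.source === "confirmed" ? "[IMPORTANT]" : "[INFO]";
}
```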
### Chronological Grouping

Memories can be sorted by date and grouped under month/year headers, giving the LLM a timeline view:

```
--- March 2026 ---
- Has Type 2 diabetes (mentioned 2026-03-15)
- Takes metformin 500mg twice daily (mentioned 2026-03-15)

--- April 2026 ---
- A1C improved to 6.5 (mentioned 2026-04-02)
```

Enable with `chronologicalRetrieval: true`.
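A minimal sketch of this grouping, assuming memories carry an ISO date (the `mentionedAt` field name is hypothetical):

```typescript
interface DatedMemory {
  content: string;
  mentionedAt: string; // ISO date, e.g. "2026-03-15" (hypothetical field name)
}

// Sort by date, then group under "--- Month Year ---" headers.
function groupChronologically(memories: DatedMemory[]): string {
  const sorted = [...memories].sort((a, b) => a.mentionedAt.localeCompare(b.mentionedAt));
  const groups = new Map<string, string[]>();
  for (const m of sorted) {
    const header = new Date(`${m.mentionedAt}T00:00:00Z`).toLocaleString("en-US", {
      month: "long",
      year: "numeric",
      timeZone: "UTC",
    });
    if (!groups.has(header)) groups.set(header, []);
    groups.get(header)!.push(`- ${m.content} (mentioned ${m.mentionedAt})`);
  }
  return [...groups.entries()]
    .map(([header, lines]) => `--- ${header} ---\n${lines.join("\n")}`)
    .join("\n\n");
}
```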
### Cache-Friendly Context

To optimize for LLM prefix caching (supported by OpenAI, Anthropic, and others), Vitamem can split memory context into a stable prefix and a dynamic suffix:
- Stable prefix: User profile + pinned memories (rarely changes between calls)
- Dynamic suffix: Query-retrieved memories (changes per message)
This layout maximizes cache hits on the stable portion, reducing latency and cost for repeated interactions. Enable with `cacheableContext: true`.
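One way to picture the split is as a pair of strings assembled separately. This is a sketch of the idea, not Vitamem's internals; the function and parameter names are made up.

```typescript
// Assemble memory context as a cache-friendly [stable prefix, dynamic suffix] pair.
// The stable part should be byte-identical across calls so the provider's
// prefix cache can reuse it; only the suffix varies per message.
function buildCacheableContext(
  profile: string,
  pinned: string[],
  retrieved: string[],
): { stablePrefix: string; dynamicSuffix: string } {
  const stablePrefix = [profile, ...pinned.map((p) => `- ${p}`)].join("\n");
  const dynamicSuffix = retrieved.map((r) => `- ${r}`).join("\n");
  return { stablePrefix, dynamicSuffix };
}
```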
## Configuration

```ts
const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY!,
  storage: "ephemeral",
  autoRetrieve: true,
  prioritySignaling: true,
  chronologicalRetrieval: true,
  cacheableContext: true,
});
```

See Memory Formatting for details on how each option affects the injected context.
## Performance Considerations

Each `chat()` call with `autoRetrieve` makes one additional API call to your embedding provider (`llm.embed()`). For OpenAI's `text-embedding-3-small`, this adds roughly 10-50 ms of latency and costs about $0.00002 per call (as of early 2026). Embedding costs are negligible at this scale; the real efficiency gain is in LLM input tokens, since `autoRetrieve` injects only the most relevant memories rather than dumping the entire memory store into context. This keeps per-turn LLM costs flat regardless of how many memories exist.
If latency or cost is a concern, prefer the manual approach: call `retrieve()` once when the session starts and pass the result as `systemPrompt` for all subsequent messages in that session.
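That session-scoped pattern can be sketched as a small wrapper. The `mem` method shapes below follow the examples earlier on this page; the wrapper itself (`startSession`) is hypothetical, not part of Vitamem's API.

```typescript
// Minimal interfaces mirroring the calls used elsewhere on this page (assumed shapes).
interface Mem {
  retrieve(opts: { userId: string; query: string; limit: number }): Promise<{ content: string }[]>;
  chat(opts: { threadId: string; message: string; systemPrompt?: string }): Promise<{ reply: string }>;
}

// Retrieve once at session start, then reuse the formatted context for every message.
async function startSession(mem: Mem, userId: string, threadId: string) {
  const memories = await mem.retrieve({
    userId,
    query: "health conditions medications preferences", // session-level query (example)
    limit: 5,
  });
  const context = memories.map((m) => `- ${m.content}`).join("\n");
  const systemPrompt = `User context:\n${context}`;

  // One retrieval (and one embedding) at startup; later sends reuse `systemPrompt`.
  return (message: string) => mem.chat({ threadId, message, systemPrompt });
}
```

The trade-off, as noted above, is that context retrieved at session start can go stale if the conversation shifts topics mid-session.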