Auto-Retrieve

When autoRetrieve is enabled, Vitamem automatically searches the user’s memory store on every chat() call and injects relevant memories as a system message before sending the conversation to the LLM. This means your application does not need to call retrieve() or build memory-aware system prompts manually.

With autoRetrieve: true, every call to chat() performs these additional steps:

  1. Embed the user’s message — Vitamem calls llm.embed(message) to get a vector representation of what the user just said.
  2. Search memories — The embedding is used to search the user’s memory store via storage.searchMemories(), returning the most semantically similar facts.
  3. Inject as system message — If any memories are found, Vitamem prepends a system message to the conversation with the format:
Relevant context from previous sessions:
- Has Type 2 diabetes (confirmed)
- Takes metformin 500mg twice daily (confirmed)
- Prefers morning check-ins (inferred)
  4. Return memories for transparency — The chat() response includes a memories field containing the MemoryMatch[] that were injected, so your application can log or display them.
const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY,
  storage: "ephemeral",
  autoRetrieve: true, // default: false
});
// `thread` comes from an earlier call (e.g. the thread returned by a previous chat())
const { reply, memories } = await mem.chat({
  threadId: thread.id,
  message: "How should I adjust my diet?",
});

// `memories` contains what was injected
if (memories && memories.length > 0) {
  console.log("Context injected:");
  for (const m of memories) {
    console.log(`  - ${m.content} (${m.source}, score: ${m.score.toFixed(2)})`);
  }
}

The memories field is only present when autoRetrieve is enabled. When disabled, it is undefined.
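
The injection step described above can be sketched as follows. This is a hypothetical simplification of Vitamem's internals, not its actual implementation; the MemoryMatch shape matches the fields used elsewhere on this page, and buildMemorySystemMessage is an illustrative helper name:

```typescript
// Illustrative sketch of how the auto-injected system message is assembled.
interface MemoryMatch {
  content: string;
  source: "confirmed" | "inferred";
  score: number;
}

function buildMemorySystemMessage(memories: MemoryMatch[]): string | undefined {
  // No memories found → no system message is prepended.
  if (memories.length === 0) return undefined;
  const lines = memories.map((m) => `- ${m.content} (${m.source})`);
  return `Relevant context from previous sessions:\n${lines.join("\n")}`;
}

const msg = buildMemorySystemMessage([
  { content: "Has Type 2 diabetes", source: "confirmed", score: 0.91 },
  { content: "Prefers morning check-ins", source: "inferred", score: 0.74 },
]);
console.log(msg);
// Relevant context from previous sessions:
// - Has Type 2 diabetes (confirmed)
// - Prefers morning check-ins (inferred)
```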

There are two patterns for getting memories into your LLM context. Each has trade-offs.

Auto-Retrieve (autoRetrieve: true)

// Just call chat() -- memories are injected automatically
const { reply, memories } = await mem.chat({
  threadId: thread.id,
  message: userMessage,
});

Advantages:

  • Zero boilerplate — works out of the box
  • Every message gets relevant context, even mid-conversation
  • The injected system message format is consistent

Trade-offs:

  • One extra embedding API call per chat() call (the user’s message must be embedded to search)
  • You cannot customize which memories are injected or how they are formatted
  • The search query is always the raw user message, which may not be the best retrieval query for every situation

Manual Retrieve (retrieve() + systemPrompt)

// 1. Retrieve memories with a custom query
const memories = await mem.retrieve({
  userId,
  query: "medications health conditions current treatment",
  limit: 5,
});

// 2. Format however you want
const context = memories
  .filter((m) => m.score > 0.7) // custom threshold
  .map((m) => m.content)
  .join("\n");

// 3. Inject via systemPrompt
const { reply } = await mem.chat({
  threadId: thread.id,
  message: userMessage,
  systemPrompt: `User context:\n${context}\n\nRespond as a caring health companion.`,
});

Advantages:

  • Full control over the retrieval query (can be different from the user’s message)
  • Custom filtering (score thresholds, source filtering, topic-specific queries)
  • Custom formatting of the injected context
  • Can retrieve once per session instead of once per message

Trade-offs:

  • More code to write and maintain
  • You must remember to retrieve and inject on every relevant interaction
| Scenario | Recommendation |
| --- | --- |
| Prototyping or simple chat apps | autoRetrieve: true |
| Health companion with topic-specific queries | Manual retrieve() |
| High-volume API with cost concerns | Manual retrieve() (retrieve once per session, not per message) |
| Multi-turn conversation where context matters throughout | autoRetrieve: true |
| You need to filter by memory source (confirmed only) | Manual retrieve() |
| You want the simplest possible integration | autoRetrieve: true |

You can enable autoRetrieve and still pass a systemPrompt. Both will be included. The auto-retrieved memories are injected as the first system message, followed by your custom system prompt:

const { reply } = await mem.chat({
  threadId: thread.id,
  message: "What about my exercise routine?",
  systemPrompt: "You are a supportive health companion. Be warm and encouraging.",
});
// The LLM sees:
// 1. System: "Relevant context from previous sessions: ..." (auto-injected)
// 2. System: "You are a supportive health companion. ..." (your prompt)
// 3. User/assistant message history
// 4. User: "What about my exercise routine?"

When memories are injected into the LLM context (via autoRetrieve or manual retrieval), Vitamem can format them to maximize usefulness and minimize token waste. Three formatting features are available:

Memories can be prefixed with priority markers so the LLM knows which facts are most important:

[CRITICAL] Allergic to penicillin (confirmed)
[IMPORTANT] Takes metformin 500mg twice daily (confirmed)
[INFO] Prefers morning check-ins (inferred)

Enable with prioritySignaling: true. Pinned memories are marked [CRITICAL], confirmed facts are [IMPORTANT], and inferred facts are [INFO].
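
The marker mapping can be sketched as below. This is a minimal illustration of the rule just described (pinned → [CRITICAL], confirmed → [IMPORTANT], inferred → [INFO]); the Memory type and the pinned field name are assumptions for the sketch:

```typescript
// Illustrative priority-marker mapping, mirroring the prioritySignaling rules.
type Memory = { content: string; source: "confirmed" | "inferred"; pinned?: boolean };

function priorityMarker(m: Memory): string {
  if (m.pinned) return "[CRITICAL]";          // pinned memories outrank everything
  return m.source === "confirmed" ? "[IMPORTANT]" : "[INFO]";
}

const formatted = [
  { content: "Allergic to penicillin", source: "confirmed" as const, pinned: true },
  { content: "Takes metformin 500mg twice daily", source: "confirmed" as const },
  { content: "Prefers morning check-ins", source: "inferred" as const },
].map((m) => `${priorityMarker(m)} ${m.content} (${m.source})`);

console.log(formatted.join("\n"));
// [CRITICAL] Allergic to penicillin (confirmed)
// [IMPORTANT] Takes metformin 500mg twice daily (confirmed)
// [INFO] Prefers morning check-ins (inferred)
```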

Memories can be sorted by date and grouped under month/year headers, giving the LLM a timeline view:

--- March 2026 ---
- Has Type 2 diabetes (mentioned 2026-03-15)
- Takes metformin 500mg twice daily (mentioned 2026-03-15)
--- April 2026 ---
- A1C improved to 6.5 (mentioned 2026-04-02)

Enable with chronologicalRetrieval: true.
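
The grouping above can be sketched as a sort-then-bucket pass. The DatedMemory type and mentionedAt field are assumptions for illustration, not Vitamem's actual schema:

```typescript
// Illustrative chronological grouping: sort by mention date, emit a month/year
// header whenever the month changes.
type DatedMemory = { content: string; mentionedAt: string }; // ISO date "YYYY-MM-DD"

function groupChronologically(memories: DatedMemory[]): string {
  const sorted = [...memories].sort((a, b) => a.mentionedAt.localeCompare(b.mentionedAt));
  const lines: string[] = [];
  let currentHeader = "";
  for (const m of sorted) {
    const d = new Date(m.mentionedAt + "T00:00:00Z");
    const month = d.toLocaleString("en-US", { month: "long", timeZone: "UTC" });
    const header = `--- ${month} ${d.getUTCFullYear()} ---`;
    if (header !== currentHeader) {
      lines.push(header);
      currentHeader = header;
    }
    lines.push(`- ${m.content} (mentioned ${m.mentionedAt})`);
  }
  return lines.join("\n");
}

const timeline = groupChronologically([
  { content: "A1C improved to 6.5", mentionedAt: "2026-04-02" },
  { content: "Has Type 2 diabetes", mentionedAt: "2026-03-15" },
]);
console.log(timeline);
// --- March 2026 ---
// - Has Type 2 diabetes (mentioned 2026-03-15)
// --- April 2026 ---
// - A1C improved to 6.5 (mentioned 2026-04-02)
```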

To optimize for LLM prefix caching (supported by OpenAI, Anthropic, and others), Vitamem can split memory context into a stable prefix and a dynamic suffix:

  • Stable prefix: User profile + pinned memories (rarely changes between calls)
  • Dynamic suffix: Query-retrieved memories (changes per message)

This layout maximizes cache hits on the stable portion, reducing latency and cost for repeated interactions. Enable with cacheableContext: true.
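
The split can be sketched as below. The function and its inputs are illustrative (not Vitamem's API); the key property is that the stable portion is byte-identical across calls so provider prefix caches can hit on it:

```typescript
// Illustrative cacheable-context layout: stable prefix first, dynamic suffix last.
type Mem = { content: string; pinned?: boolean };

function buildContext(profile: string, allMemories: Mem[], queryMatches: Mem[]): string {
  // Stable prefix: user profile + pinned memories (rarely changes between calls).
  const stable = [
    `User profile:\n${profile}`,
    ...allMemories.filter((m) => m.pinned).map((m) => `- ${m.content}`),
  ].join("\n");
  // Dynamic suffix: query-retrieved memories (changes per message).
  const dynamic = queryMatches.map((m) => `- ${m.content}`).join("\n");
  return `${stable}\n\nRelevant to this message:\n${dynamic}`;
}
```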

const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY!,
  storage: "ephemeral",
  autoRetrieve: true,
  prioritySignaling: true,
  chronologicalRetrieval: true,
  cacheableContext: true,
});

See Memory Formatting for details on how each option affects the injected context.

Each chat() call with autoRetrieve makes one additional API call to your embedding provider (llm.embed()). For OpenAI’s text-embedding-3-small, this adds roughly 10-50ms of latency and costs $0.00002 per call (as of early 2026). While embedding costs are negligible at this scale, the real efficiency gain is in LLM input tokens — autoRetrieve injects only the most relevant memories rather than dumping the entire memory store into context. This keeps per-turn LLM costs flat regardless of how many memories exist.

If latency or cost is a concern, prefer the manual approach: call retrieve() once when the session starts and pass the result as systemPrompt for all subsequent messages in that session.
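
The retrieve-once-per-session pattern can be sketched with a small cache. Here `mem` is assumed to be a Vitamem instance as configured earlier; the Map-based cache and the query string are illustrative choices, not part of Vitamem:

```typescript
// Illustrative session-level cache: retrieve once per thread, reuse the
// formatted context as the systemPrompt for every later message.
const sessionContext = new Map<string, string>();

async function chatWithSessionContext(
  mem: any,
  userId: string,
  threadId: string,
  message: string,
) {
  let context = sessionContext.get(threadId);
  if (context === undefined) {
    // One retrieve() per session instead of one embedding call per message.
    const memories = await mem.retrieve({
      userId,
      query: "health conditions medications preferences",
      limit: 5,
    });
    context = memories.map((m: any) => `- ${m.content}`).join("\n");
    sessionContext.set(threadId, context);
  }
  return mem.chat({
    threadId,
    message,
    systemPrompt: `User context:\n${context}`,
  });
}
```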