Auto-Retrieve

When autoRetrieve is enabled, Vitamem automatically searches the user’s memory store on every chat() call and injects relevant memories as a system message before sending the conversation to the LLM. This means your application does not need to call retrieve() or build memory-aware system prompts manually.

With autoRetrieve: true, every call to chat() performs these additional steps:

  1. Embed the user’s message — Vitamem calls llm.embed(message) to get a vector representation of what the user just said.
  2. Search memories — The embedding is used to search the user’s memory store via storage.searchMemories(), returning the most semantically similar facts.
  3. Inject as system message — If any memories are found, Vitamem prepends a system message to the conversation with the format:
Relevant context from previous sessions:
- Has Type 2 diabetes (confirmed)
- Takes metformin 500mg twice daily (confirmed)
- Prefers morning check-ins (inferred)
  4. Return memories for transparency — The chat() response includes a memories field containing the MemoryMatch[] that were injected, so your application can log or display them.
const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY,
  storage: "ephemeral",
  autoRetrieve: true, // default: false
});
// `thread` comes from an earlier call (e.g. the thread returned by a previous chat())
const { reply, memories } = await mem.chat({
  threadId: thread.id,
  message: "How should I adjust my diet?",
});

// `memories` contains what was injected
if (memories && memories.length > 0) {
  console.log("Context injected:");
  for (const m of memories) {
    console.log(`  - ${m.content} (${m.source}, score: ${m.score.toFixed(2)})`);
  }
}

The memories field is only present when autoRetrieve is enabled. When disabled, it is undefined.
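
The injection step described above can be sketched as follows. This is a hypothetical simplification of Vitamem's internals, not its actual implementation; the MemoryMatch shape matches the fields used elsewhere on this page, and buildMemorySystemMessage is an illustrative helper name:

```typescript
// Illustrative sketch of how the auto-injected system message is assembled.
interface MemoryMatch {
  content: string;
  source: "confirmed" | "inferred";
  score: number;
}

function buildMemorySystemMessage(memories: MemoryMatch[]): string | undefined {
  // No memories found → no system message is prepended.
  if (memories.length === 0) return undefined;
  const lines = memories.map((m) => `- ${m.content} (${m.source})`);
  return `Relevant context from previous sessions:\n${lines.join("\n")}`;
}

const msg = buildMemorySystemMessage([
  { content: "Has Type 2 diabetes", source: "confirmed", score: 0.91 },
  { content: "Prefers morning check-ins", source: "inferred", score: 0.74 },
]);
console.log(msg);
// Relevant context from previous sessions:
// - Has Type 2 diabetes (confirmed)
// - Prefers morning check-ins (inferred)
```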

There are two patterns for getting memories into your LLM context. Each has trade-offs.

Auto-Retrieve (autoRetrieve: true)

// Just call chat() -- memories are injected automatically
const { reply, memories } = await mem.chat({
  threadId: thread.id,
  message: userMessage,
});

Advantages:

  • Zero boilerplate — works out of the box
  • Every message gets relevant context, even mid-conversation
  • The injected system message format is consistent

Trade-offs:

  • One extra embedding API call per chat() call (the user’s message must be embedded to search)
  • You cannot customize which memories are injected or how they are formatted
  • The search query is always the raw user message, which may not be the best retrieval query for every situation

Manual Retrieve (retrieve() + systemPrompt)

// 1. Retrieve memories with a custom query
const memories = await mem.retrieve({
  userId,
  query: "medications health conditions current treatment",
  limit: 5,
});

// 2. Format however you want
const context = memories
  .filter((m) => m.score > 0.7) // custom threshold
  .map((m) => m.content)
  .join("\n");

// 3. Inject via systemPrompt
const { reply } = await mem.chat({
  threadId: thread.id,
  message: userMessage,
  systemPrompt: `User context:\n${context}\n\nRespond as a caring health companion.`,
});

Advantages:

  • Full control over the retrieval query (can be different from the user’s message)
  • Custom filtering (score thresholds, source filtering, topic-specific queries)
  • Custom formatting of the injected context
  • Can retrieve once per session instead of once per message

Trade-offs:

  • More code to write and maintain
  • You must remember to retrieve and inject on every relevant interaction
| Scenario | Recommendation |
| --- | --- |
| Prototyping or simple chat apps | autoRetrieve: true |
| Health companion with topic-specific queries | Manual retrieve() |
| High-volume API with cost concerns | Manual retrieve() (retrieve once per session, not per message) |
| Multi-turn conversation where context matters throughout | autoRetrieve: true |
| You need to filter by memory source (confirmed only) | Manual retrieve() |
| You want the simplest possible integration | autoRetrieve: true |

You can enable autoRetrieve and still pass a systemPrompt. Both will be included. The auto-retrieved memories are injected as the first system message, followed by your custom system prompt:

const { reply } = await mem.chat({
  threadId: thread.id,
  message: "What about my exercise routine?",
  systemPrompt: "You are a supportive health companion. Be warm and encouraging.",
});
// The LLM sees:
// 1. System: "Relevant context from previous sessions: ..." (auto-injected)
// 2. System: "You are a supportive health companion. ..." (your prompt)
// 3. User/assistant message history
// 4. User: "What about my exercise routine?"

When memories are injected into the LLM context (via autoRetrieve or manual retrieval), Vitamem can format them to maximize usefulness and minimize token waste. Three formatting features are available:

Memories can be prefixed with priority markers so the LLM knows which facts are most important:

[CRITICAL] Allergic to penicillin (confirmed)
[IMPORTANT] Takes metformin 500mg twice daily (confirmed)
[INFO] Prefers morning check-ins (inferred)

Enable with prioritySignaling: true. Pinned memories are marked [CRITICAL], confirmed facts are [IMPORTANT], and inferred facts are [INFO].
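
The marker mapping can be sketched as below. This is a minimal illustration of the rule just described (pinned → [CRITICAL], confirmed → [IMPORTANT], inferred → [INFO]); the Memory type and the pinned field name are assumptions for the sketch:

```typescript
// Illustrative priority-marker mapping, mirroring the prioritySignaling rules.
type Memory = { content: string; source: "confirmed" | "inferred"; pinned?: boolean };

function priorityMarker(m: Memory): string {
  if (m.pinned) return "[CRITICAL]";          // pinned memories outrank everything
  return m.source === "confirmed" ? "[IMPORTANT]" : "[INFO]";
}

const formatted = [
  { content: "Allergic to penicillin", source: "confirmed" as const, pinned: true },
  { content: "Takes metformin 500mg twice daily", source: "confirmed" as const },
  { content: "Prefers morning check-ins", source: "inferred" as const },
].map((m) => `${priorityMarker(m)} ${m.content} (${m.source})`);

console.log(formatted.join("\n"));
// [CRITICAL] Allergic to penicillin (confirmed)
// [IMPORTANT] Takes metformin 500mg twice daily (confirmed)
// [INFO] Prefers morning check-ins (inferred)
```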

Memories can be sorted by date and grouped under month/year headers, giving the LLM a timeline view:

--- March 2026 ---
- Has Type 2 diabetes (mentioned 2026-03-15)
- Takes metformin 500mg twice daily (mentioned 2026-03-15)
--- April 2026 ---
- A1C improved to 6.5 (mentioned 2026-04-02)

Enable with chronologicalRetrieval: true.
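
The grouping above can be sketched as a sort-then-bucket pass. The DatedMemory type and mentionedAt field are assumptions for illustration, not Vitamem's actual schema:

```typescript
// Illustrative chronological grouping: sort by mention date, emit a month/year
// header whenever the month changes.
type DatedMemory = { content: string; mentionedAt: string }; // ISO date "YYYY-MM-DD"

function groupChronologically(memories: DatedMemory[]): string {
  const sorted = [...memories].sort((a, b) => a.mentionedAt.localeCompare(b.mentionedAt));
  const lines: string[] = [];
  let currentHeader = "";
  for (const m of sorted) {
    const d = new Date(m.mentionedAt + "T00:00:00Z");
    const month = d.toLocaleString("en-US", { month: "long", timeZone: "UTC" });
    const header = `--- ${month} ${d.getUTCFullYear()} ---`;
    if (header !== currentHeader) {
      lines.push(header);
      currentHeader = header;
    }
    lines.push(`- ${m.content} (mentioned ${m.mentionedAt})`);
  }
  return lines.join("\n");
}

const timeline = groupChronologically([
  { content: "A1C improved to 6.5", mentionedAt: "2026-04-02" },
  { content: "Has Type 2 diabetes", mentionedAt: "2026-03-15" },
]);
console.log(timeline);
// --- March 2026 ---
// - Has Type 2 diabetes (mentioned 2026-03-15)
// --- April 2026 ---
// - A1C improved to 6.5 (mentioned 2026-04-02)
```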

To optimize for LLM prefix caching (supported by OpenAI, Anthropic, and others), Vitamem can split memory context into a stable prefix and a dynamic suffix:

  • Stable prefix: User profile + pinned memories (rarely changes between calls)
  • Dynamic suffix: Query-retrieved memories (changes per message)

This layout maximizes cache hits on the stable portion, reducing latency and cost for repeated interactions. Enable with cacheableContext: true.
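
The split can be sketched as below. The function and its inputs are illustrative (not Vitamem's API); the key property is that the stable portion is byte-identical across calls so provider prefix caches can hit on it:

```typescript
// Illustrative cacheable-context layout: stable prefix first, dynamic suffix last.
type Mem = { content: string; pinned?: boolean };

function buildContext(profile: string, allMemories: Mem[], queryMatches: Mem[]): string {
  // Stable prefix: user profile + pinned memories (rarely changes between calls).
  const stable = [
    `User profile:\n${profile}`,
    ...allMemories.filter((m) => m.pinned).map((m) => `- ${m.content}`),
  ].join("\n");
  // Dynamic suffix: query-retrieved memories (changes per message).
  const dynamic = queryMatches.map((m) => `- ${m.content}`).join("\n");
  return `${stable}\n\nRelevant to this message:\n${dynamic}`;
}
```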

const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY!,
  storage: "ephemeral",
  autoRetrieve: true,
  prioritySignaling: true,
  chronologicalRetrieval: true,
  cacheableContext: true,
});

See Memory Formatting for details on how each option affects the injected context.

Each chat() call with autoRetrieve makes one additional API call to your embedding provider (llm.embed()). For OpenAI’s text-embedding-3-small, this adds roughly 10-50ms of latency and costs $0.00002 per call (as of early 2026). While embedding costs are negligible at this scale, the real efficiency gain is in LLM input tokens — autoRetrieve injects only the most relevant memories rather than dumping the entire memory store into context. This keeps per-turn LLM costs flat regardless of how many memories exist.

If latency or cost is a concern, prefer the manual approach: call retrieve() once when the session starts and pass the result as systemPrompt for all subsequent messages in that session.
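
The retrieve-once-per-session pattern can be sketched with a small cache. Here `mem` is assumed to be a Vitamem instance as configured earlier; the Map-based cache and the query string are illustrative choices, not part of Vitamem:

```typescript
// Illustrative session-level cache: retrieve once per thread, reuse the
// formatted context as the systemPrompt for every later message.
const sessionContext = new Map<string, string>();

async function chatWithSessionContext(
  mem: any,
  userId: string,
  threadId: string,
  message: string,
) {
  let context = sessionContext.get(threadId);
  if (context === undefined) {
    // One retrieve() per session instead of one embedding call per message.
    const memories = await mem.retrieve({
      userId,
      query: "health conditions medications preferences",
      limit: 5,
    });
    context = memories.map((m: any) => `- ${m.content}`).join("\n");
    sessionContext.set(threadId, context);
  }
  return mem.chat({
    threadId,
    message,
    systemPrompt: `User context:\n${context}`,
  });
}
```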