Deduplication

Vitamem uses cosine similarity to prevent storing semantically duplicate memories. When new facts are extracted from a conversation, they are compared against all existing memories for that user before being saved.

For each newly extracted fact:

  1. The fact is embedded into a vector
  2. The vector is compared against all existing user memory vectors using cosine similarity
  3. If similarity ≥ threshold (default: 0.92), the fact is considered a duplicate and skipped
  4. If similarity < threshold for all existing memories, the fact is saved

Additionally, new facts within the same batch are deduplicated against each other — so if the same conversation produces two similar facts, only the first one is kept.
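
The steps above, including the within-batch check, can be sketched roughly as follows. This is a minimal illustration, not Vitamem's actual internals: the `EmbeddedFact` type and the `cosineSimilarity` and `filterNewFacts` helpers are hypothetical names, and the sketch assumes every fact has already been embedded into a vector.

```typescript
// Hypothetical shape for a fact that has already been embedded (assumption).
type EmbeddedFact = { text: string; vector: number[] };

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keeps only incoming facts that are below the threshold against every
// existing memory AND every fact already kept from the same batch,
// so the first of two similar facts in a batch wins.
function filterNewFacts(
  existing: EmbeddedFact[],
  incoming: EmbeddedFact[],
  threshold = 0.92,
): EmbeddedFact[] {
  const kept: EmbeddedFact[] = [];
  for (const fact of incoming) {
    const pool = [...existing, ...kept];
    const isDuplicate = pool.some(
      (memory) => cosineSimilarity(memory.vector, fact.vector) >= threshold,
    );
    if (!isDuplicate) kept.push(fact);
  }
  return kept;
}
```

Comparing against `kept` as well as `existing` is what gives the "only the first one is kept" behavior for similar facts in one batch.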

The default threshold of 0.92 is calibrated to:

  • Catch near-identical phrasings (“Prefers TypeScript” vs “Uses TypeScript”)
  • Allow genuinely distinct but related facts to coexist (“Uses React for web apps” vs “Uses React Native for mobile”)

You can adjust this threshold when initializing:

// Lower threshold = more aggressive deduplication (fewer, broader memories)
// Higher threshold = more permissive (more memories, more specificity)
const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY!,
  storage: "ephemeral",
  deduplicationThreshold: 0.92, // the default, shown explicitly
});

Suppose the user has an existing memory: “Prefers TypeScript over JavaScript”.

In a new conversation they say: “I always write my projects in TypeScript.”

The extraction might produce: “Uses TypeScript for projects”

Cosine similarity between “Prefers TypeScript over JavaScript” and “Uses TypeScript for projects” vectors ≈ 0.95 → duplicate, skipped.

Later they say: “I just switched from AWS to Vercel for deployment.”

Extraction produces: “Switched from AWS to Vercel”

Cosine similarity against all existing memories < 0.92 → new fact, saved.

Under Vitamem’s two-tier threshold system, facts are not simply “duplicate or new”: there is a middle ground for updated information.

| Similarity Range | Classification | Action |
| --- | --- | --- |
| >= 0.92 (deduplicationThreshold) | Exact duplicate | New fact is discarded |
| >= 0.75 (supersedeThreshold) and < 0.92 | Same topic, updated value | Existing memory is updated in-place |
| < 0.75 | New distinct fact | Saved as a new memory |

When a new fact falls in the supersede band (between supersedeThreshold and deduplicationThreshold), Vitamem recognizes it as the same topic with an updated value and replaces the existing memory content rather than creating a conflicting duplicate.

Example: A user says “I’m learning React” and that fact is stored. Months later they say “I’ve become proficient in React”. The cosine similarity between these two facts is ~0.88 — above the supersede threshold (0.75) but below the dedup threshold (0.92). Instead of storing both values (which would create conflicting memories), the old memory is updated to “Proficient in React”.
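
The three bands can be expressed as a small classification function. This is a sketch under stated assumptions: `classifyBySimilarity`, `DedupAction`, and `Thresholds` are hypothetical names, and it assumes the decision is made on the highest similarity found among the user's existing memories, which the text above does not spell out.

```typescript
// Hypothetical names for illustration; not part of Vitamem's public API.
type DedupAction = "discard" | "supersede" | "save";

interface Thresholds {
  deduplicationThreshold: number;
  supersedeThreshold: number;
}

// Defaults taken from the table above.
const DEFAULT_THRESHOLDS: Thresholds = {
  deduplicationThreshold: 0.92,
  supersedeThreshold: 0.75,
};

// maxSimilarity: the highest cosine similarity between the new fact and
// any existing memory (assumed to be the deciding value).
function classifyBySimilarity(
  maxSimilarity: number,
  thresholds: Thresholds = DEFAULT_THRESHOLDS,
): DedupAction {
  if (maxSimilarity >= thresholds.deduplicationThreshold) return "discard";
  if (maxSimilarity >= thresholds.supersedeThreshold) return "supersede";
  return "save";
}
```

With the defaults, the React example (similarity of about 0.88) falls in the supersede band, while a similarity of 0.95 would be discarded as a duplicate.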

Both thresholds are configurable via VitamemConfig:

const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY!,
  storage: "ephemeral",
  deduplicationThreshold: 0.95, // stricter duplicate detection
  supersedeThreshold: 0.80, // narrower supersede band
});

Goals referencing the same health metric are deduplicated using a specialized strategy that differs from general memory deduplication.

Unlike standard memory deduplication (which relies on cosine similarity between embeddings), goal deduplication uses metric-keyword matching and word-overlap analysis. This approach is more deterministic for structured health targets:

| Strategy | Used for | Method |
| --- | --- | --- |
| Cosine similarity (≥ 0.92) | General memories | Embedding vector comparison |
| Metric-keyword + word overlap | Health goals | Pattern matching on known metric terms |

The system recognizes common health metric keywords: A1C, blood pressure, weight, glucose, cholesterol, and BMI.

A user sets a goal: “Lower A1C below 7.0%”. Later, they say: “Maintain A1C below 7.0%”.

Both goals reference the A1C metric keyword and share significant word overlap. The result is a single goal — the newer wording (“Maintain A1C below 7.0%”) replaces the older one, just like memory supersede but driven by keyword matching instead of vector similarity.

Goals that do not contain a recognized metric keyword fall back to exact-match deduplication. This conservative approach avoids incorrectly merging unrelated goals (e.g., “Walk 10,000 steps daily” and “Drink 8 glasses of water daily” are both kept as distinct goals).
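
One way to picture the goal strategy is the sketch below. It is illustrative only: the function names, the use of Jaccard overlap on lowercase word sets, and the 0.5 overlap cutoff are all assumptions, since the exact overlap measure and cutoff are not documented here. Only the keyword list and the exact-match fallback come from the description above.

```typescript
// Metric keywords listed in the documentation, lowercased for matching.
const METRIC_KEYWORDS = ["a1c", "blood pressure", "weight", "glucose", "cholesterol", "bmi"];

// Returns the first recognized metric keyword in the goal, if any.
function findMetric(goal: string): string | undefined {
  const lower = goal.toLowerCase();
  for (const keyword of METRIC_KEYWORDS) {
    // Word boundaries avoid matching "weight" inside "weightlifting".
    if (new RegExp(`\\b${keyword}\\b`).test(lower)) return keyword;
  }
  return undefined;
}

// Unique lowercase alphanumeric words in a goal.
function uniqueWords(goal: string): string[] {
  const words: string[] = goal.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return words.filter((word, index) => words.indexOf(word) === index);
}

// Jaccard overlap: shared words / total distinct words (assumed measure).
function wordOverlap(a: string, b: string): number {
  const wordsA = uniqueWords(a);
  const wordsB = uniqueWords(b);
  const shared = wordsA.filter((word) => wordsB.indexOf(word) !== -1).length;
  const total = wordsA.length + wordsB.length - shared;
  return total === 0 ? 0 : shared / total;
}

// minOverlap = 0.5 is an assumed cutoff, not a documented default.
function isSameGoal(existing: string, incoming: string, minOverlap = 0.5): boolean {
  const metricA = findMetric(existing);
  const metricB = findMetric(incoming);
  if (metricA === undefined || metricB === undefined) {
    // No recognized metric keyword: fall back to exact-match comparison
    // (trimmed and case-sensitive here, which is also an assumption).
    return existing.trim() === incoming.trim();
  }
  return metricA === metricB && wordOverlap(existing, incoming) >= minOverlap;
}
```

Under these assumptions, “Lower A1C below 7.0%” and “Maintain A1C below 7.0%” are treated as the same goal, while the two daily-habit goals above contain no metric keyword and remain distinct.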

Without deduplication, users who interact regularly would accumulate hundreds of redundant memories (“Prefers TypeScript”, “Uses TypeScript”, “Writes in TypeScript”, …). This degrades retrieval quality because the vector store becomes diluted with near-duplicate entries.

Vitamem’s deduplication keeps the memory store lean and retrieval precise — which is especially important for applications where users interact over months or years, whether that’s a health companion, coaching assistant, or support agent.

Deduplication also has a direct cost benefit: by preventing memory bloat, it keeps both the embedding index and LLM context injection small. Without deduplication, a user mentioning the same fact across 10 sessions would create 10 near-identical memories, all competing for retrieval slots and inflating context tokens.