Deduplication
Vitamem uses cosine similarity to prevent storing semantically duplicate memories. When new facts are extracted from a conversation, they are compared against all existing memories for that user before being saved.
How It Works
For each newly extracted fact:
- The fact is embedded into a vector
- The vector is compared against all existing user memory vectors using cosine similarity
- If similarity ≥ threshold (default: 0.92), the fact is considered a duplicate and skipped
- If similarity < threshold for all existing memories, the fact is saved
Additionally, new facts within the same batch are deduplicated against each other — so if the same conversation produces two similar facts, only the first one is kept.
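The steps above, including within-batch deduplication, can be sketched as follows. The `cosineSimilarity` and `filterDuplicates` helpers are illustrative, not Vitamem's actual internals:

```typescript
// Illustrative sketch of the dedup pass; helper names are hypothetical.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const DEDUP_THRESHOLD = 0.92; // documented default

/** Returns the candidate vectors that are not duplicates of an existing
 *  memory or of an earlier fact in the same batch. */
function filterDuplicates(
  existing: number[][],
  batch: number[][],
  threshold = DEDUP_THRESHOLD,
): number[][] {
  const kept: number[][] = [];
  for (const vec of batch) {
    const isDup = [...existing, ...kept].some(
      (m) => cosineSimilarity(vec, m) >= threshold,
    );
    if (!isDup) kept.push(vec); // first occurrence in a batch wins
  }
  return kept;
}
```

Note that comparing against `kept` as well as `existing` is what makes the first fact in a batch win over later near-duplicates.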
The Similarity Threshold
The default threshold of 0.92 is calibrated to:
- Catch near-identical phrasings (“Prefers TypeScript” vs “Uses TypeScript”)
- Allow genuinely distinct but related facts to coexist (“Uses React for web apps” vs “Uses React Native for mobile”)
You can adjust this threshold when initializing:
```ts
// Lower threshold = more aggressive deduplication (fewer, broader memories)
// Higher threshold = more permissive (more memories, more specificity)
const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY!,
  storage: "ephemeral",
});
// deduplication threshold is set in the embedding pipeline, default 0.92
```
Example
Suppose the user has an existing memory: “Prefers TypeScript over JavaScript”.
In a new conversation they say: “I always write my projects in TypeScript.”
The extraction might produce: “Uses TypeScript for projects”
Cosine similarity between “Prefers TypeScript over JavaScript” and “Uses TypeScript for projects” vectors ≈ 0.95 → duplicate, skipped.
Later they say: “I just switched from AWS to Vercel for deployment.”
Extraction produces: “Switched from AWS to Vercel”
Cosine similarity against all existing memories < 0.92 → new fact, saved.
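Under the assumed scores, each decision reduces to a single threshold comparison. This is a sketch; 0.95 is the illustrative score from the example above, and 0.41 is an arbitrary low score standing in for "dissimilar to everything stored":

```typescript
const DEDUP_THRESHOLD = 0.92; // documented default

// maxSimilarity = the new fact's highest similarity against all existing memories.
function shouldSave(maxSimilarity: number): boolean {
  return maxSimilarity < DEDUP_THRESHOLD;
}

shouldSave(0.95); // "Uses TypeScript for projects" → false, skipped as duplicate
shouldSave(0.41); // "Switched from AWS to Vercel" → true, saved
```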
Memory Supersede
Starting with Vitamem’s two-tier threshold system, facts are not simply “duplicate or new” — there is a middle ground for updated information.
Two-Tier Threshold System
| Similarity Range | Classification | Action |
|---|---|---|
| ≥ 0.92 (deduplicationThreshold) | Exact duplicate | New fact is discarded |
| ≥ 0.75 and < 0.92 (supersedeThreshold) | Same topic, updated value | Existing memory is updated in place |
| < 0.75 | New distinct fact | Saved as a new memory |
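The table maps directly to a three-way classification. The `classifyFact` helper and `Action` type below are hypothetical; the threshold defaults are the documented values:

```typescript
type Action = "discard" | "supersede" | "save";

// Hypothetical helper mirroring the two-tier table above.
function classifyFact(
  similarity: number,
  deduplicationThreshold = 0.92,
  supersedeThreshold = 0.75,
): Action {
  if (similarity >= deduplicationThreshold) return "discard"; // exact duplicate
  if (similarity >= supersedeThreshold) return "supersede";   // same topic, updated value
  return "save";                                              // new distinct fact
}
```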
How Supersede Works
When a new fact falls in the supersede band (between supersedeThreshold and deduplicationThreshold), Vitamem recognizes it as the same topic with an updated value and replaces the existing memory content rather than creating a conflicting duplicate.
Example: A user says “I’m learning React” and that fact is stored. Months later they say “I’ve become proficient in React”. The cosine similarity between these two facts is ~0.88 — above the supersede threshold (0.75) but below the dedup threshold (0.92). Instead of storing both values (which would create conflicting memories), the old memory is updated to “Proficient in React”.
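In code, the supersede behavior might look like the following sketch. The `Memory` shape and `applyFact` helper are hypothetical; the thresholds are the documented defaults:

```typescript
interface Memory {
  id: string;
  content: string;
  vector: number[];
}

// Sketch: apply a new fact against the store using the two-tier rule.
// `similarityTo` abstracts away the embedding comparison.
function applyFact(
  store: Memory[],
  fact: Memory,
  similarityTo: (m: Memory) => number,
): void {
  let best: Memory | null = null;
  let bestSim = -1;
  for (const m of store) {
    const s = similarityTo(m);
    if (s > bestSim) { bestSim = s; best = m; }
  }
  if (best && bestSim >= 0.92) return;   // exact duplicate: discard
  if (best && bestSim >= 0.75) {         // supersede band: update in place
    best.content = fact.content;
    best.vector = fact.vector;
    return;
  }
  store.push(fact);                      // distinct: save as new memory
}
```

With the React example above, a similarity of ~0.88 lands in the supersede band, so the stored content is overwritten instead of duplicated.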
Configuring Thresholds
Both thresholds are configurable via VitamemConfig:
```ts
const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY!,
  storage: "ephemeral",
  deduplicationThreshold: 0.95, // stricter duplicate detection
  supersedeThreshold: 0.80,     // narrower supersede band
});
```
Goal Deduplication
Goals referencing the same health metric are deduplicated semantically using a specialized strategy that differs from general memory deduplication.
How Goal Dedup Differs
Unlike standard memory deduplication (which relies on cosine similarity between embeddings), goal deduplication uses metric-keyword matching and word-overlap analysis. This approach is more deterministic for structured health targets:
| Strategy | Used for | Method |
|---|---|---|
| Cosine similarity (≥ 0.92) | General memories | Embedding vector comparison |
| Metric-keyword + word overlap | Health goals | Pattern matching on known metric terms |
The system recognizes common health metric keywords: A1C, blood pressure, weight, glucose, cholesterol, and BMI.
Example
A user sets a goal: “Lower A1C below 7.0%”. Later, they say: “Maintain A1C below 7.0%”.
Both goals reference the A1C metric keyword and share significant word overlap. The result is a single goal — the newer wording (“Maintain A1C below 7.0%”) replaces the older one, just like memory supersede but driven by keyword matching instead of vector similarity.
Fallback for Unrecognized Metrics
Goals that do not contain a recognized metric keyword fall back to exact-match deduplication. This conservative approach avoids incorrectly merging unrelated goals (e.g., “Walk 10,000 steps daily” and “Drink 8 glasses of water daily” are both kept as distinct goals).
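A minimal sketch of this strategy follows. The keyword list comes from this page; the specific overlap rule (shared-word ratio with a 0.5 cutoff) and the helper names are assumptions, not Vitamem's actual implementation:

```typescript
// Metric keywords listed in the docs; matching is case-insensitive.
const METRIC_KEYWORDS = ["a1c", "blood pressure", "weight", "glucose", "cholesterol", "bmi"];

function metricOf(goal: string): string | null {
  const g = goal.toLowerCase();
  return METRIC_KEYWORDS.find((k) => g.includes(k)) ?? null;
}

// Fraction of words shared between the two goal strings (assumed rule).
function wordOverlap(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const wb = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  const shared = [...wa].filter((w) => wb.has(w)).length;
  return shared / Math.max(wa.size, wb.size);
}

// True when the incoming goal should replace the existing one.
function isSameGoal(existing: string, incoming: string): boolean {
  const me = metricOf(existing);
  const mi = metricOf(incoming);
  if (me && mi) return me === mi && wordOverlap(existing, incoming) >= 0.5;
  // No recognized metric: conservative exact-match fallback.
  return existing.trim().toLowerCase() === incoming.trim().toLowerCase();
}
```

With this rule, “Lower A1C below 7.0%” and “Maintain A1C below 7.0%” share the A1C metric and most of their words, so they merge; the two unrecognized-metric goals above differ textually, so both survive.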
Why This Matters
Without deduplication, users who interact regularly would accumulate hundreds of redundant memories (“Prefers TypeScript”, “Uses TypeScript”, “Writes in TypeScript”, …). This degrades retrieval quality because the vector store becomes diluted with near-duplicate entries.
Vitamem’s deduplication keeps the memory store lean and retrieval precise — which is especially important for applications where users interact over months or years, whether that’s a health companion, coaching assistant, or support agent.
Deduplication also has a direct cost benefit: by preventing memory bloat, it keeps both the embedding index and LLM context injection small. Without deduplication, a user mentioning the same fact across 10 sessions would create 10 near-identical memories, all competing for retrieval slots and inflating context tokens.