Deduplication
Vitamem uses cosine similarity to prevent storing semantically duplicate memories. When new facts are extracted from a conversation, they are compared against all existing memories for that user before being saved.
How It Works
For each newly extracted fact:
- The fact is embedded into a vector
- The vector is compared against all existing user memory vectors using cosine similarity
- If similarity ≥ threshold (default: 0.92), the fact is considered a duplicate and skipped
- If similarity < threshold for all existing memories, the fact is saved
Additionally, new facts within the same batch are deduplicated against each other — so if the same conversation produces two similar facts, only the first one is kept.
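The steps above, including within-batch deduplication, can be sketched as follows. The `cosineSimilarity` and `filterDuplicates` helpers are illustrative, not Vitamem's actual internals:

```typescript
// Illustrative sketch of the dedup pass; helper names are hypothetical.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const DEDUP_THRESHOLD = 0.92; // documented default

/** Returns the candidate vectors that are not duplicates of an existing
 *  memory or of an earlier fact in the same batch. */
function filterDuplicates(
  existing: number[][],
  batch: number[][],
  threshold = DEDUP_THRESHOLD,
): number[][] {
  const kept: number[][] = [];
  for (const vec of batch) {
    const isDup = [...existing, ...kept].some(
      (m) => cosineSimilarity(vec, m) >= threshold,
    );
    if (!isDup) kept.push(vec); // first occurrence in a batch wins
  }
  return kept;
}
```

Note that comparing against `kept` as well as `existing` is what makes the first fact in a batch win over later near-duplicates.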
The Similarity Threshold
The default threshold of 0.92 is calibrated to:
- Catch near-identical phrasings (“Prefers TypeScript” vs “Uses TypeScript”)
- Allow genuinely distinct but related facts to coexist (“Uses React for web apps” vs “Uses React Native for mobile”)
You can adjust this threshold when initializing:
```ts
// Lower threshold = more aggressive deduplication (fewer, broader memories)
// Higher threshold = more permissive (more memories, more specificity)
const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY!,
  storage: "ephemeral",
});
// deduplication threshold is set in the embedding pipeline, default 0.92
```
Example
Suppose the user has an existing memory: “Prefers TypeScript over JavaScript”.
In a new conversation they say: “I always write my projects in TypeScript.”
The extraction might produce: “Uses TypeScript for projects”
Cosine similarity between “Prefers TypeScript over JavaScript” and “Uses TypeScript for projects” vectors ≈ 0.95 → duplicate, skipped.
Later they say: “I just switched from AWS to Vercel for deployment.”
Extraction produces: “Switched from AWS to Vercel”
Cosine similarity against all existing memories < 0.92 → new fact, saved.
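Under the assumed scores, each decision reduces to a single threshold comparison. This is a sketch; 0.95 is the illustrative score from the example above, and 0.41 is an arbitrary low score standing in for "dissimilar to everything stored":

```typescript
const DEDUP_THRESHOLD = 0.92; // documented default

// maxSimilarity = the new fact's highest similarity against all existing memories.
function shouldSave(maxSimilarity: number): boolean {
  return maxSimilarity < DEDUP_THRESHOLD;
}

shouldSave(0.95); // "Uses TypeScript for projects" → false, skipped as duplicate
shouldSave(0.41); // "Switched from AWS to Vercel" → true, saved
```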
Memory Supersede
Starting with Vitamem’s two-tier threshold system, facts are not simply “duplicate or new” — there is a middle ground for updated information.
Two-Tier Threshold System
| Similarity Range | Classification | Action |
|---|---|---|
| ≥ 0.92 (deduplicationThreshold) | Exact duplicate | New fact is discarded |
| ≥ 0.75 and < 0.92 (supersedeThreshold) | Same topic, updated value | Existing memory is updated in place |
| < 0.75 | New distinct fact | Saved as a new memory |
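The table maps directly to a three-way classification. The `classifyFact` helper and `Action` type below are hypothetical; the threshold defaults are the documented values:

```typescript
type Action = "discard" | "supersede" | "save";

// Hypothetical helper mirroring the two-tier table above.
function classifyFact(
  similarity: number,
  deduplicationThreshold = 0.92,
  supersedeThreshold = 0.75,
): Action {
  if (similarity >= deduplicationThreshold) return "discard"; // exact duplicate
  if (similarity >= supersedeThreshold) return "supersede";   // same topic, updated value
  return "save";                                              // new distinct fact
}
```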
How Supersede Works
When a new fact falls in the supersede band (between supersedeThreshold and deduplicationThreshold), Vitamem recognizes it as the same topic with an updated value and replaces the existing memory content rather than creating a conflicting duplicate.
Example: A user says “I’m learning React” and that fact is stored. Months later they say “I’ve become proficient in React”. The cosine similarity between these two facts is ~0.88 — above the supersede threshold (0.75) but below the dedup threshold (0.92). Instead of storing both values (which would create conflicting memories), the old memory is updated to “Proficient in React”.
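In code, the supersede behavior might look like the following sketch. The `Memory` shape and `applyFact` helper are hypothetical; the thresholds are the documented defaults:

```typescript
interface Memory {
  id: string;
  content: string;
  vector: number[];
}

// Sketch: apply a new fact against the store using the two-tier rule.
// `similarityTo` abstracts away the embedding comparison.
function applyFact(
  store: Memory[],
  fact: Memory,
  similarityTo: (m: Memory) => number,
): void {
  let best: Memory | null = null;
  let bestSim = -1;
  for (const m of store) {
    const s = similarityTo(m);
    if (s > bestSim) { bestSim = s; best = m; }
  }
  if (best && bestSim >= 0.92) return;   // exact duplicate: discard
  if (best && bestSim >= 0.75) {         // supersede band: update in place
    best.content = fact.content;
    best.vector = fact.vector;
    return;
  }
  store.push(fact);                      // distinct: save as new memory
}
```

With the React example above, a similarity of ~0.88 lands in the supersede band, so the stored content is overwritten instead of duplicated.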
Configuring Thresholds
Both thresholds are configurable via VitamemConfig:
```ts
const mem = await createVitamem({
  provider: "openai",
  apiKey: process.env.OPENAI_API_KEY!,
  storage: "ephemeral",
  deduplicationThreshold: 0.95, // stricter duplicate detection
  supersedeThreshold: 0.80,     // narrower supersede band
});
```
Goal Deduplication
Goals referencing the same health metric are deduplicated semantically using a specialized strategy that differs from general memory deduplication.
How Goal Dedup Differs
Unlike standard memory deduplication (which relies on cosine similarity between embeddings), goal deduplication uses metric-keyword matching and word-overlap analysis. This approach is more deterministic for structured health targets:
| Strategy | Used for | Method |
|---|---|---|
| Cosine similarity (≥ 0.92) | General memories | Embedding vector comparison |
| Metric-keyword + word overlap | Health goals | Pattern matching on known metric terms |
The system recognizes common health metric keywords: A1C, blood pressure, weight, glucose, cholesterol, and BMI.
Example
A user sets a goal: “Lower A1C below 7.0%”. Later, they say: “Maintain A1C below 7.0%”.
Both goals reference the A1C metric keyword and share significant word overlap. The result is a single goal — the newer wording (“Maintain A1C below 7.0%”) replaces the older one, just like memory supersede but driven by keyword matching instead of vector similarity.
Fallback for Unrecognized Metrics
Goals that do not contain a recognized metric keyword fall back to exact-match deduplication. This conservative approach avoids incorrectly merging unrelated goals (e.g., “Walk 10,000 steps daily” and “Drink 8 glasses of water daily” are both kept as distinct goals).
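A minimal sketch of this strategy follows. The keyword list comes from this page; the specific overlap rule (shared-word ratio with a 0.5 cutoff) and the helper names are assumptions, not Vitamem's actual implementation:

```typescript
// Metric keywords listed in the docs; matching is case-insensitive.
const METRIC_KEYWORDS = ["a1c", "blood pressure", "weight", "glucose", "cholesterol", "bmi"];

function metricOf(goal: string): string | null {
  const g = goal.toLowerCase();
  return METRIC_KEYWORDS.find((k) => g.includes(k)) ?? null;
}

// Fraction of words shared between the two goal strings (assumed rule).
function wordOverlap(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\W+/).filter(Boolean));
  const wb = new Set(b.toLowerCase().split(/\W+/).filter(Boolean));
  const shared = [...wa].filter((w) => wb.has(w)).length;
  return shared / Math.max(wa.size, wb.size);
}

// True when the incoming goal should replace the existing one.
function isSameGoal(existing: string, incoming: string): boolean {
  const me = metricOf(existing);
  const mi = metricOf(incoming);
  if (me && mi) return me === mi && wordOverlap(existing, incoming) >= 0.5;
  // No recognized metric: conservative exact-match fallback.
  return existing.trim().toLowerCase() === incoming.trim().toLowerCase();
}
```

With this rule, “Lower A1C below 7.0%” and “Maintain A1C below 7.0%” share the A1C metric and most of their words, so they merge; the two unrecognized-metric goals above differ textually, so both survive.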
Why This Matters
Without deduplication, users who interact regularly would accumulate hundreds of redundant memories (“Prefers TypeScript”, “Uses TypeScript”, “Writes in TypeScript”, …). This degrades retrieval quality because the vector store becomes diluted with near-duplicate entries.
Vitamem’s deduplication keeps the memory store lean and retrieval precise — which is especially important for applications where users interact over months or years, whether that’s a health companion, coaching assistant, or support agent.
Deduplication also has a direct cost benefit: by preventing memory bloat, it keeps both the embedding index and LLM context injection small. Without deduplication, a user mentioning the same fact across 10 sessions would create 10 near-identical memories, all competing for retrieval slots and inflating context tokens.