Ollama

Install Vitamem and the OpenAI peer dependency. The Ollama adapter is a thin wrapper over the OpenAI adapter that points at Ollama’s OpenAI-compatible API.

npm install vitamem openai

Then install Ollama itself from ollama.com.

If Ollama is running locally with default settings, zero configuration is needed:

import { createVitamem } from "vitamem";
const mem = await createVitamem({
  provider: "ollama",
  storage: "ephemeral",
});

No API key is required. The adapter uses llama3.2 for chat and memory extraction, nomic-embed-text for embeddings, and connects to http://localhost:11434/v1.

Before using Vitamem with Ollama, pull the default models:

ollama pull llama3.2
ollama pull nomic-embed-text

Ollama must be running when your application starts. Verify it is available:

ollama list
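If you want to check programmatically that the required models have been pulled before starting your application, you can parse the output of `ollama list` yourself. A minimal sketch; the `parseModelNames` and `missingModels` helpers are illustrative, not part of Vitamem or the Ollama CLI:

```typescript
// Sketch: check that required models appear in `ollama list` output.
// Both helpers below are illustrative, not part of Vitamem.

/** Extract model names (first column) from `ollama list` output. */
function parseModelNames(listOutput: string): string[] {
  return listOutput
    .trim()
    .split("\n")
    .slice(1) // skip the NAME / ID / SIZE / MODIFIED header row
    .map((line) => line.split(/\s+/)[0]);
}

/** Return required models that are not present (tag-insensitive). */
function missingModels(listOutput: string, required: string[]): string[] {
  const pulled = parseModelNames(listOutput).map((n) => n.split(":")[0]);
  return required.filter((r) => !pulled.includes(r.split(":")[0]));
}

const sample = [
  "NAME                     ID              SIZE      MODIFIED",
  "llama3.2:latest          a80c4f17acd5    2.0 GB    2 days ago",
  "nomic-embed-text:latest  0a109f422b47    274 MB    2 days ago",
].join("\n");

console.log(missingModels(sample, ["llama3.2", "nomic-embed-text"])); // nothing missing
console.log(missingModels(sample, ["mistral"])); // mistral not pulled yet
```

Run `ollama list` via your process-spawning utility of choice and pass its stdout to `missingModels`; a non-empty result means a `ollama pull` is still needed.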

For full control, use createOllamaAdapter:

import { createOllamaAdapter, createVitamem } from "vitamem";
const llm = createOllamaAdapter({
  chatModel: "llama3.2",
  embeddingModel: "nomic-embed-text",
  baseUrl: "http://localhost:11434/v1",
});
const mem = await createVitamem({
  llm,
  storage: "ephemeral",
});

All options are optional. The adapter works with no arguments at all.

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| chatModel | string | "llama3.2" | Ollama model for chat and memory extraction. |
| embeddingModel | string | "nomic-embed-text" | Ollama model for text embeddings. |
| baseUrl | string | "http://localhost:11434/v1" | Ollama server URL. |
| extractionPrompt | string | Built-in prompt | Custom prompt for memory extraction. Must include a {conversation} placeholder. |
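Since a custom extractionPrompt must contain a {conversation} placeholder, it can be worth checking that constraint before constructing the adapter. A small sketch; the hasConversationPlaceholder helper and the prompt text are illustrative, not part of Vitamem:

```typescript
// Sketch: validate a custom extraction prompt before passing it to
// createOllamaAdapter. The helper below is illustrative, not part of Vitamem.

const customPrompt = `
Extract durable facts about the user from the conversation below.
Return a JSON array of short statements.

{conversation}
`;

/** The adapter substitutes the transcript for {conversation}. */
function hasConversationPlaceholder(prompt: string): boolean {
  return prompt.includes("{conversation}");
}

if (!hasConversationPlaceholder(customPrompt)) {
  throw new Error("extractionPrompt must include a {conversation} placeholder");
}

// Passed validation; safe to use, e.g.:
// const llm = createOllamaAdapter({ extractionPrompt: customPrompt });
```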
Chat models:

| Model | Size | Notes |
| --- | --- | --- |
| llama3.2 | 3B | Default. Fast, good general quality. |
| llama3.2:1b | 1B | Smallest and fastest, lower extraction accuracy. |
| llama3.3 | 70B | High quality, requires significant RAM. |
| mistral | 7B | Good balance of speed and capability. |
| gemma2 | 9B | Strong instruction following. |
| phi3 | 3.8B | Compact, good at structured output (JSON extraction). |
Embedding models:

| Model | Dimensions | Notes |
| --- | --- | --- |
| nomic-embed-text | 768 | Default. Purpose-built embedding model, good quality. |
| all-minilm | 384 | Smaller vectors, faster, slightly lower accuracy. |
| mxbai-embed-large | 1024 | Higher accuracy, larger vectors. |

For example, to swap both models:
const llm = createOllamaAdapter({
  chatModel: "mistral",
  embeddingModel: "mxbai-embed-large",
});

By default, Vitamem runs up to 5 embedding requests in parallel during the dormant transition pipeline. Ollama processes requests sequentially on most hardware, so parallel requests queue up without a speed benefit and can increase memory pressure.

Set embeddingConcurrency: 1 when using Ollama:

const mem = await createVitamem({
  provider: "ollama",
  storage: "ephemeral",
  embeddingConcurrency: 1,
});

Or when using the adapter factory:

import { createOllamaAdapter, createVitamem } from "vitamem";
const mem = await createVitamem({
  llm: createOllamaAdapter(),
  storage: "ephemeral",
  embeddingConcurrency: 1,
});

Ollama runs entirely on your machine. Once models are pulled, no internet connection is required. This makes it ideal for:

  • Development and testing — no API keys, no costs, no rate limits.
  • Privacy-sensitive deployments — health data never leaves the device.
  • Air-gapped environments — works behind firewalls with no external access.
  • CI/CD pipelines — deterministic tests without API dependencies.
// Complete offline setup
import { createOllamaAdapter, createVitamem } from "vitamem";

const mem = await createVitamem({
  llm: createOllamaAdapter(),
  storage: "ephemeral",
  embeddingConcurrency: 1,
});

// Everything runs locally -- no network calls
const thread = await mem.createThread({ userId: "local-user" });
const { reply } = await mem.chat({
  threadId: thread.id,
  message: "I started taking vitamin D supplements last week.",
});

If Ollama is running on a different machine (e.g., a GPU server on your network), point baseUrl to it:

const llm = createOllamaAdapter({
  baseUrl: "http://192.168.1.50:11434/v1",
});
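To avoid hardcoding the address, you can resolve it from an environment variable yourself. The OLLAMA_BASE_URL name and the resolveBaseUrl helper below are illustrative conventions, not something Vitamem reads automatically:

```typescript
// Sketch: resolve the Ollama URL from an environment variable with a
// localhost fallback. OLLAMA_BASE_URL and resolveBaseUrl are illustrative
// conventions, not read by Vitamem itself.

function resolveBaseUrl(env: Record<string, string | undefined>): string {
  return env.OLLAMA_BASE_URL ?? "http://localhost:11434/v1";
}

const baseUrl = resolveBaseUrl(process.env);

// Then construct the adapter with it, e.g.:
// const llm = createOllamaAdapter({ baseUrl });
```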

Under the hood, createOllamaAdapter delegates to createOpenAIAdapter with Ollama-specific defaults:

// This:
createOllamaAdapter({ chatModel: "mistral" });

// Is equivalent to:
createOpenAIAdapter({
  apiKey: "ollama", // Ollama ignores the API key
  chatModel: "mistral",
  embeddingModel: "nomic-embed-text",
  baseUrl: "http://localhost:11434/v1",
});

This works because Ollama implements the OpenAI-compatible chat completions and embeddings endpoints.
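To make the compatibility concrete, here is the shape of an OpenAI-style chat request that Ollama's /v1/chat/completions endpoint accepts. This is a hand-written sketch for illustration; the actual payload is assembled by the OpenAI SDK inside the adapter:

```typescript
// Sketch of an OpenAI-style request body that Ollama's
// /v1/chat/completions endpoint accepts. Hand-written for illustration;
// the real payload is produced by the OpenAI SDK inside the adapter.

interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

const request: ChatRequest = {
  model: "llama3.2",
  messages: [{ role: "user", content: "Hello!" }],
};

// Ollama accepts the same JSON the OpenAI API would:
// POST http://localhost:11434/v1/chat/completions
const body = JSON.stringify(request);
```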

The Ollama adapter inherits streaming support from the OpenAI adapter. Use chatStream() or chatWithUserStream() on the Vitamem instance:

const { stream } = await mem.chatStream({
  threadId: thread.id,
  message: "How is my sleep tracking looking?",
});
for await (const chunk of stream) {
  process.stdout.write(chunk);
}

See Streaming Output for the full guide.

“Connection refused” errors — Make sure Ollama is running. Start it with ollama serve or check that the Ollama desktop app is open.

“Model not found” errors — Pull the model first with ollama pull <model-name>.

Slow embedding pipeline — Set embeddingConcurrency: 1 and consider using a smaller embedding model like all-minilm.

Out of memory — Use smaller models (llama3.2:1b for chat, all-minilm for embeddings) or increase your system swap space.