System architecture

Persistent memory extraction and hierarchical context compression, working together through a hook-driven lifecycle.

Extraction, storage, retrieval

After every assistant response, the pipeline evaluates the conversation, extracts structured facts, and stores them for future sessions.

Conversation exchange

User sends a message and the model responds. This triggers the extraction pipeline asynchronously.

LLM extraction

A purpose-built prompt evaluates the exchange and outputs structured JSON. Up to 10 facts per turn, validated against six strict schemas.
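As a sketch, the validation gate could look like the following TypeScript; the fact-type names and fields are illustrative assumptions, since the actual six schemas are not reproduced here. Only the 10-facts-per-turn cap comes from the text.

```typescript
// Hypothetical fact shape — the real six schemas are not shown in this
// document; the per-turn cap of 10 is documented.
type FactType =
  | "preference" | "identity" | "project"
  | "decision" | "relationship" | "event";

interface ExtractedFact {
  type: FactType;
  content: string;
  confidence: number; // 0..1
}

const FACT_TYPES = new Set([
  "preference", "identity", "project", "decision", "relationship", "event",
]);
const MAX_FACTS_PER_TURN = 10;

// Parse the LLM's JSON output, drop anything that fails validation,
// and enforce the per-turn cap.
function validateFacts(raw: unknown): ExtractedFact[] {
  if (!Array.isArray(raw)) return [];
  return raw
    .filter((f): f is ExtractedFact => {
      if (typeof f !== "object" || f === null) return false;
      const fact = f as Record<string, unknown>;
      return (
        typeof fact.type === "string" && FACT_TYPES.has(fact.type) &&
        typeof fact.content === "string" && fact.content.length > 0 &&
        typeof fact.confidence === "number" &&
        fact.confidence >= 0 && fact.confidence <= 1
      );
    })
    .slice(0, MAX_FACTS_PER_TURN);
}
```

Dropping invalid entries rather than rejecting the whole batch keeps one malformed fact from discarding an otherwise usable extraction.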

Embedding & deduplication

Each fact is vectorized into 4096 dimensions via Qwen3-Embedding-8B. If a semantically similar memory exists, it is merged rather than duplicated.
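Deduplication hinges on a cosine-similarity check between the new fact's vector and existing memories. A minimal sketch; the 0.9 merge threshold is an assumption, not a documented value:

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Merge instead of insert when an existing memory is semantically close.
// The 0.9 threshold is illustrative, not taken from this document.
function shouldMerge(candidate: number[], existing: number[], threshold = 0.9): boolean {
  return cosine(candidate, existing) >= threshold;
}
```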

Qdrant storage

Validated vectors are inserted into Qdrant with rich metadata, partitioned by hashed user ID for strict isolation.
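User isolation rests on hashing the user ID into a partition key before any point is written. A sketch assuming SHA-256; the document does not name the hash function:

```typescript
import { createHash } from "node:crypto";

// Derive a stable, opaque partition key from a raw user ID.
// SHA-256 is an assumption — the actual hash used is not specified here.
function partitionKey(userId: string): string {
  return createHash("sha256").update(userId).digest("hex");
}
```

Filtering every query by this key keeps one user's memories invisible to every other user's searches.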

Semantic retrieval

Before each new user message, cosine-similarity search pulls the most relevant memories. Results are formatted by type and injected into the system prompt in under 200ms.
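Grouping retrieved memories by type before injection could look like this sketch; the section-heading and bullet format of the output is illustrative, not the actual prompt layout:

```typescript
interface Memory {
  type: string;    // e.g. "preference", "project"
  content: string;
  score: number;   // cosine-similarity score from the search
}

// Group memories by type and render one block per type for
// injection into the system prompt.
function formatMemories(memories: Memory[]): string {
  const byType = new Map<string, string[]>();
  for (const m of memories) {
    if (!byType.has(m.type)) byType.set(m.type, []);
    byType.get(m.type)!.push(m.content);
  }
  return [...byType.entries()]
    .map(([type, items]) => `## ${type}\n${items.map((i) => `- ${i}`).join("\n")}`)
    .join("\n\n");
}
```

Pure string assembly like this is what keeps the formatting step in the sub-10 ms range cited below.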

Three tiers, one budget

When session history reaches 70% of the model's context limit, the compression engine activates and redistributes content across three tiers.

50%

Current

Most recent exchanges at full fidelity. Uncompressed, immediate context for the active task.

30%

Topics

Older exchanges grouped and summarized by the LLM into medium-detail topic blocks.

20%

Bulks

The oldest project history, merged and heavily compressed into high-level summaries that preserve only critical information.
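In code, the 50/30/20 split amounts to allocating each tier a fixed fraction of the total token budget. A minimal sketch:

```typescript
// Documented tier ratios: 50% current, 30% topics, 20% bulks.
const TIER_RATIOS = { current: 0.5, topics: 0.3, bulks: 0.2 } as const;

// Compute per-tier token budgets from the total history budget.
function tierBudgets(totalTokens: number) {
  return {
    current: Math.floor(totalTokens * TIER_RATIOS.current),
    topics: Math.floor(totalTokens * TIER_RATIOS.topics),
    bulks: Math.floor(totalTokens * TIER_RATIOS.bulks),
  };
}
```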

How compression runs

Token evaluation

Total token count of session history is computed after each assistant response.

Threshold gate

If the total is below 70% of the model's absolute context limit, no action is taken.

Budget audit

The engine calculates which tier is most over its allocated ratio (50/30/20).

Targeted summarization

The most over-budget tier's oldest entries are summarized by the LLM and pushed down the hierarchy.

Stabilization loop

Audit and summarization repeat until every tier fits within its budget. State is persisted in SQLite.
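The loop described above (audit, summarize the most over-budget tier, repeat) can be sketched as follows, with summarize standing in for the LLM call:

```typescript
interface TierState { name: string; tokens: number; budget: number; }

// Repeatedly shrink whichever tier is furthest over its budget ratio
// until every tier fits. `summarize` stands in for the LLM summarization
// step and must return a reduced token count.
function stabilize(
  tiers: TierState[],
  summarize: (tier: TierState) => number,
  maxPasses = 100, // safety guard against a non-reducing summarizer
): TierState[] {
  for (let pass = 0; pass < maxPasses; pass++) {
    const over = tiers
      .filter((t) => t.tokens > t.budget)
      .sort((a, b) => b.tokens / b.budget - a.tokens / a.budget)[0];
    if (!over) break; // all tiers within budget — stabilized
    over.tokens = summarize(over);
  }
  return tiers;
}
```

Auditing by ratio rather than absolute overage means a small tier that has doubled gets attention before a large tier that is marginally over.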

Technology

Bun + TypeScript

Runtime and language: end-to-end type safety across the codebase, plus Bun's native file and process APIs.

Qdrant

Vector database for persistent memory storage. Cosine-similarity search filtered by user ID.

SQLite + Drizzle

Session state, compression state, configuration, and the provider registry. ACID-compliant with safe concurrent access.

mem0

Framework handling embedding generation, deduplication logic, and memory lifecycle management.

Qwen3-Embedding-8B

4096-dimensional embeddings via OpenRouter for high-fidelity semantic search.

LLM providers

OpenRouter, Anthropic, OpenAI, and Google for extraction, summarization, and primary chat inference.

Latency targets

All memory and compression operations run outside the critical chat path.

100-300ms Embedding generation
<200ms Vector search
<10ms Context formatting
2-5s Full extraction (async)