Persistent memory extraction and hierarchical context compression work together through a hook-driven lifecycle.
After every assistant response, the pipeline evaluates the conversation, extracts structured facts, and stores them for future sessions.
User sends a message and the model responds. This triggers the extraction pipeline asynchronously.
A purpose-built prompt evaluates the exchange and outputs structured JSON: up to 10 facts per turn, each validated against one of six strict schemas.
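A minimal sketch of the validation step, assuming hypothetical fact types and field names (the actual six schemas are not shown in the source):

```typescript
// Illustrative fact types; the real six schema names may differ.
type FactType =
  | "preference" | "identity" | "project"
  | "decision" | "relationship" | "event";

interface ExtractedFact {
  type: FactType;
  content: string;    // the fact itself, stated atomically
  confidence: number; // 0..1, as reported by the extraction prompt (assumed field)
}

const MAX_FACTS_PER_TURN = 10;

// Validate the raw JSON emitted by the extraction prompt, dropping
// anything that does not match a known schema shape.
function validateFacts(raw: unknown): ExtractedFact[] {
  if (!Array.isArray(raw)) throw new Error("expected a JSON array of facts");
  const types = new Set<FactType>([
    "preference", "identity", "project", "decision", "relationship", "event",
  ]);
  const facts = raw.filter((f): f is ExtractedFact =>
    typeof f === "object" && f !== null &&
    types.has((f as ExtractedFact).type) &&
    typeof (f as ExtractedFact).content === "string" &&
    typeof (f as ExtractedFact).confidence === "number",
  );
  return facts.slice(0, MAX_FACTS_PER_TURN); // hard cap per turn
}
```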
Each fact is vectorized into 4096 dimensions via Qwen3-Embedding-8B. If a semantically similar memory exists, it is merged rather than duplicated.
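The merge decision can be sketched as a cosine check against existing memories; the threshold below is an assumed value for illustration, not the pipeline's actual setting:

```typescript
// Assumed similarity cutoff; the real pipeline's threshold is not documented here.
const SIMILARITY_THRESHOLD = 0.92;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Merge instead of inserting when an existing memory is semantically close.
function shouldMerge(incoming: number[], existing: number[]): boolean {
  return cosine(incoming, existing) >= SIMILARITY_THRESHOLD;
}
```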
Validated vectors are inserted into Qdrant with rich metadata, partitioned by hashed user ID for strict isolation.
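One way to derive the hashed partition key and payload; the hash choice (SHA-256) and the payload field names here are assumptions for illustration:

```typescript
import { createHash } from "node:crypto";

// Hash the user ID so raw IDs never appear in stored payloads; every
// search is then filtered by this key for strict per-user isolation.
function partitionKey(userId: string): string {
  return createHash("sha256").update(userId).digest("hex");
}

// Example metadata attached to each vector (field names are illustrative).
function buildPayload(userId: string, factType: string, content: string) {
  return {
    user: partitionKey(userId), // isolation filter field
    type: factType,
    content,
    createdAt: new Date().toISOString(),
  };
}
```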
Before each new user message, cosine-similarity search pulls the most relevant memories. Results are formatted by type and injected into the system prompt in under 200ms.
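The formatting step might look like this sketch, grouping retrieved memories by type before injection; the section headings and field names are illustrative:

```typescript
interface Memory { type: string; content: string; score: number }

// Group memories by type and render a compact block for the system prompt,
// with the highest-similarity results first within each section.
function formatMemories(memories: Memory[]): string {
  const byType = new Map<string, string[]>();
  for (const m of [...memories].sort((a, b) => b.score - a.score)) {
    if (!byType.has(m.type)) byType.set(m.type, []);
    byType.get(m.type)!.push(m.content);
  }
  const sections: string[] = [];
  for (const [type, items] of byType) {
    sections.push(`[${type}]\n` + items.map((c) => `- ${c}`).join("\n"));
  }
  return sections.join("\n\n");
}
```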
When session history reaches 70% of the model's context limit, the compression engine activates and redistributes content across three tiers.
Recent tier: the most recent exchanges at full fidelity; uncompressed, immediate context for the active task.
Medium tier: older exchanges grouped and summarized by the LLM into medium-detail topic blocks.
Ancient tier: project history merged and heavily compressed into high-level summaries preserving only critical information.
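The three tiers can be sketched as token budgets under the 50/30/20 split; applying the ratios to the full context limit, rather than to the 70% trigger point, is an assumption here:

```typescript
interface TierBudgets { recent: number; medium: number; ancient: number }

// Per-tier token budgets under the 50/30/20 allocation.
function allocateBudgets(contextLimit: number): TierBudgets {
  return {
    recent: Math.floor(contextLimit * 0.5),  // full-fidelity exchanges
    medium: Math.floor(contextLimit * 0.3),  // topic-block summaries
    ancient: Math.floor(contextLimit * 0.2), // heavily compressed history
  };
}
```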
Total token count of session history is computed after each assistant response.
If the total is below 70% of the model's absolute context limit, no action is taken.
The engine calculates which tier most exceeds its allocated share of the budget (50% recent, 30% medium, 20% ancient).
The most over-budget tier's oldest entries are summarized by the LLM and pushed down the hierarchy.
Audit and summarization repeat iteratively until all tiers fit within budget. State is persisted in SQLite.
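The audit-and-summarize loop above can be sketched as follows; `summarize` stands in for the LLM call, and the batch size of two entries is an assumed detail, not the engine's actual value:

```typescript
interface Entry { tokens: number; text: string }
type Tier = "recent" | "medium" | "ancient";

const RATIOS: Record<Tier, number> = { recent: 0.5, medium: 0.3, ancient: 0.2 };
const ORDER: Tier[] = ["recent", "medium", "ancient"];
const TRIGGER = 0.7; // compression activates at 70% of the context limit

function compress(
  tiers: Record<Tier, Entry[]>,
  contextLimit: number,
  summarize: (batch: Entry[]) => Entry, // LLM summarization, stubbed here
): void {
  const used = (t: Tier) => tiers[t].reduce((n, e) => n + e.tokens, 0);
  const total = () => ORDER.reduce((n, t) => n + used(t), 0);
  if (total() < contextLimit * TRIGGER) return; // below trigger: no action

  for (let guard = 0; guard < 1000; guard++) {
    // Find the tier most over its allocated ratio of the budget.
    let worst: Tier | undefined;
    let worstRatio = 1;
    for (const t of ORDER) {
      const ratio = used(t) / (contextLimit * RATIOS[t]);
      if (ratio > worstRatio) { worstRatio = ratio; worst = t; }
    }
    if (!worst) return; // every tier fits its budget: done

    // Summarize the worst tier's oldest entries and push them down.
    const summary = summarize(tiers[worst].splice(0, 2));
    const next = ORDER[ORDER.indexOf(worst) + 1];
    if (next) tiers[next].push(summary); // demote into the next tier
    else tiers[worst].unshift(summary);  // ancient tier re-summarizes in place
  }
}
```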
Runtime: Bun, with type safety across the entire codebase and native file and process APIs.
Qdrant: vector database for persistent memory storage, with cosine-similarity search filtered by user ID.
SQLite: session state, compression state, configuration, and provider registry; ACID-compliant with proper concurrency.
Memory framework: embedding generation, deduplication logic, and memory lifecycle management.
Qwen3-Embedding-8B: 4096-dimensional embeddings via OpenRouter for high-fidelity semantic search.
LLM providers: OpenRouter, Anthropic, OpenAI, and Google for extraction, summarization, and primary chat inference.
All memory and compression operations run outside the critical chat path.
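One way to keep this work off the critical path, sketched as fire-and-forget scheduling (a minimal pattern, not necessarily how the hooks are wired):

```typescript
// Queue background work after the response is sent; the chat turn never
// awaits it, and failures are logged rather than surfaced to the user.
function scheduleBackground(task: () => Promise<void>): void {
  void task().catch((err) => console.error("background task failed:", err));
}
```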