
Latency & SLA

Honest, per-endpoint latency guidance and the production pattern for keeping memory off your user's response path. Retrieval is fast; fact extraction is LLM-bound — run it asynchronously.

The one rule for production

Read on the hot path, write off it. Call search to fetch context inline (sub-second warm), then extract/learn memories asynchronously after you've already responded to the user. Never block a chat turn on /v1/memories/process in synchronous mode.
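
As a minimal sketch of this rule, the handler below reads inline and hands the slow write to a background thread pool. The three helper functions are hypothetical stand-ins (not Hebbrix client calls) so the example runs without network access, and the pool size is an assumption you should tune for your own load.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

# Shared pool for off-hot-path work. max_workers=4 is an assumption.
background = ThreadPoolExecutor(max_workers=4)

def search_context(message: str) -> list[str]:
    """Hypothetical stand-in for POST /v1/search (fast, inline)."""
    return [f"context for: {message}"]

def generate_reply(message: str, context: list[str]) -> str:
    """Hypothetical stand-in for your own LLM call."""
    return f"reply to {message!r} using {len(context)} snippet(s)"

def learn_from_turn(message: str, reply: str) -> str:
    """Hypothetical stand-in for POST /v1/memories/process (slow, LLM-bound)."""
    time.sleep(0.3)  # simulates extraction latency, scaled down for the demo
    return "completed"

def handle_turn(message: str) -> tuple[str, Future]:
    context = search_context(message)         # read on the hot path
    reply = generate_reply(message, context)  # answer the user first
    # Write off the hot path: submit() returns immediately with a Future.
    future = background.submit(learn_from_turn, message, reply)
    return reply, future
```

The caller can return `reply` to the user right away and inspect `future` later (or ignore it) without the extraction time showing up in the chat turn.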

What is fast, what is LLM-bound

Fast (retrieval path)

Search, raw writes, and reads are dominated by vector + keyword lookup and a rerank pass — warm p50 is sub-second. The raw vector lookup itself is single-digit-millisecond; the full hybrid + rerank path is what you see end-to-end.

LLM-bound (write/reasoning path)

Fact extraction, conflict resolution, and dialectic reasoning call a reasoning model. These are seconds, not milliseconds — and they are exactly the operations you should run in the background.

Per-endpoint guidance

Typical observed latency on warm infrastructure. These are guidance targets for capacity planning, not contractual guarantees — see SLA below. Cold starts and large batches run higher.

| Endpoint | Typical p50 | Class | Mode |
|---|---|---|---|
| POST /v1/search | ~0.4–1s warm | Retrieval | Inline (hot path) |
| POST /v1/memories (raw store) | sub-second | Write | Inline ok |
| GET /v1/memories/{id} | sub-second | Read | Inline |
| POST /v1/chat/completions | ~2–8s | LLM + retrieval | Stream tokens |
| POST /v1/memories/process (sync) | ~7–22s | LLM extraction | Use async |
| POST /v1/memories/process (async_dispatch) | ~1–2s to 202 | Queued | Background |
| POST /v1/profile/dialectic | ~10–14s | LLM reasoning | Background / await |
| PATCH /v1/memories/{id} (re-extract) | ~10–15s | LLM + re-index | Background |

Note: these are end-to-end numbers covering the full hybrid retrieval + rerank (and, for chat, LLM generation), not an isolated vector-lookup figure. Measure from your own clients, and compare the result against the X-Process-Time response header, which reports server-side time in milliseconds.

Production pattern: async extraction

Set async_dispatch: true on /v1/memories/process. You get a 202 with a job_id in ~1–2s; poll /v1/memories/jobs/{job_id} for completion. Your user already has their answer.

Python (async extraction)
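import os
import time

import requests

BASE = "https://api.hebbrix.com/v1"
H = {"Authorization": f"Bearer {os.environ['HEBBRIX_API_KEY']}"}

# Placeholder for this example: the user's message for the current turn
user_message = "What did we decide about the launch date?"

# 1) Inline: fetch context for THIS turn (fast, hot path)
#    The 10s client timeouts below are illustrative, not a documented value.
resp = requests.post(f"{BASE}/search", headers=H, timeout=10,
    json={"query": user_message, "collection_id": "coll_123", "limit": 5})
resp.raise_for_status()
ctx = resp.json()

# 2) Respond to the user with YOUR LLM (using ctx); placeholder reply shown here
assistant_reply = "<your LLM's answer, built from ctx>"

# 3) Off the hot path: learn from the turn ASYNCHRONOUSLY
resp = requests.post(f"{BASE}/memories/process", headers=H, timeout=10, json={
    "messages": [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_reply},
    ],
    "collection_id": "coll_123",
    "async_dispatch": True,        # <-- returns 202 + job_id in ~1-2s
})
resp.raise_for_status()
job_id = resp.json()["job_id"]

# 4) (Optional) poll for completion in a background worker, with a deadline
deadline = time.monotonic() + 120  # stop polling after 2 minutes (illustrative)
while time.monotonic() < deadline:
    s = requests.get(f"{BASE}/memories/jobs/{job_id}", headers=H, timeout=10).json()
    if s["status"] in ("completed", "failed"):
        break
    time.sleep(1)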
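
If a background worker does poll, it is usually worth bounding the wait and backing off between attempts. The helper below is a sketch under assumptions: `wait_for_job`, its parameters, and the delay values are hypothetical and not part of the Hebbrix API. Only the `completed` and `failed` status values are taken from the example above. The status fetch is injected as a callable, so the helper has no network dependency of its own.

```python
import time
from typing import Callable

def wait_for_job(fetch_status: Callable[[], dict],
                 timeout_s: float = 60.0,
                 initial_delay_s: float = 0.5,
                 max_delay_s: float = 8.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job reports a terminal status or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        status = fetch_status()
        if status.get("status") in ("completed", "failed"):
            return status
        # Give up rather than sleep past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(
                f"job still {status.get('status')!r} after {timeout_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

With the names from the example above, a call could look like `wait_for_job(lambda: requests.get(f"{BASE}/memories/jobs/{job_id}", headers=H, timeout=10).json())`.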

SLA & status

What we publish

The numbers above are operating targets. Formal per-plan p50/p95/p99 SLA targets and uptime commitments are part of the enterprise agreement — contact us for the current SLA document.

Observability

Every response carries X-Process-Time (server ms) and X-Request-ID. Measure real latency from your own clients and include the request id when reporting a slow call.
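
One way to capture both headers next to client-side timing is a small wrapper like the sketch below. It assumes X-Process-Time arrives as a plain numeric string in milliseconds and that the response object exposes a `headers` mapping (as `requests.Response` does). `timed_call`, the logger name, and the `slow_ms` threshold are hypothetical choices for illustration.

```python
import logging
import time
from typing import Any, Callable, Optional

log = logging.getLogger("hebbrix.latency")  # hypothetical logger name

def timed_call(send: Callable[[], Any], slow_ms: float = 1000.0) -> dict:
    """Run one request; record client-side ms, server-side ms, and request id."""
    start = time.perf_counter()
    response = send()
    client_ms = (time.perf_counter() - start) * 1000.0

    raw = response.headers.get("X-Process-Time")
    try:
        # Assumption: the header value is a bare number of milliseconds.
        server_ms: Optional[float] = float(raw) if raw is not None else None
    except ValueError:
        server_ms = None

    record = {
        "client_ms": client_ms,
        "server_ms": server_ms,
        "request_id": response.headers.get("X-Request-ID"),
    }
    if client_ms > slow_ms:
        # Keep the request id in the log line so it can go into a support report.
        log.warning("slow call: %s", record)
    return record
```

The gap between `client_ms` and `server_ms` is roughly network and client overhead, which helps separate a slow server from a slow path to it.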

Don't treat warm, single-client latency as your SLA. Measure p95/p99 under your real concurrency, and keep all LLM-bound writes in the background so a slow extraction never delays a user-facing response.
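
A rough sketch of such a measurement follows. It fires a batch of calls through a thread pool and reports nearest-rank percentiles. `measure`, `percentile`, and the default request and concurrency counts are hypothetical; pass in a zero-argument callable that performs one real request against your own workload.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def measure(call: Callable[[], object],
            requests_total: int = 200,
            concurrency: int = 20) -> dict:
    """Time `call` under concurrent load; return p50/p95/p99 in milliseconds."""
    def one(_: int) -> float:
        start = time.perf_counter()
        call()
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(one, range(requests_total)))
    return {p: percentile(latencies, p) for p in (50, 95, 99)}
```

Set `concurrency` to match the number of simultaneous requests you expect in production, since tail latency measured with a single client tends to look better than it will under load.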
