Latency & SLA
Honest, per-endpoint latency guidance and the production pattern for keeping memory off your user's response path. Retrieval is fast; fact extraction is LLM-bound — run it asynchronously.
The one rule for production
Read on the hot path, write off it. Call POST /v1/search to fetch context inline (typically under a second when warm), then extract and learn memories asynchronously, after you have already responded to the user. Never block a chat turn on /v1/memories/process in synchronous mode.
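One way to apply this rule inside a request handler is to hand the write to a background pool once the reply exists. The sketch below is illustrative only: `handle_turn`, `search`, `generate`, and `learn` are hypothetical names for callables your application would supply, and the pool size is arbitrary.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor

# One shared pool for background memory writes (size is an arbitrary choice here).
_background = ThreadPoolExecutor(max_workers=4)


def handle_turn(user_message, search, generate, learn):
    """Answer on the hot path; schedule the slow write once the reply exists.

    search, generate, and learn are hypothetical callables supplied by your app:
      search(message) -> context         (e.g. wraps POST /v1/search)
      generate(message, context) -> str  (your own LLM call)
      learn(message, reply) -> Any       (e.g. wraps POST /v1/memories/process)
    """
    context = search(user_message)           # inline read
    reply = generate(user_message, context)  # inline generation
    # Submitted but not awaited: the caller gets the reply immediately.
    pending: Future = _background.submit(learn, user_message, reply)
    return reply, pending
```

The returned future is optional to inspect. A handler can ignore it, or attach a callback that logs failures so that a failed extraction is visible without ever touching the user-facing response.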
What is fast, what is LLM-bound
Fast (retrieval path)
Search, raw writes, and reads are dominated by vector + keyword lookup and a rerank pass, so warm p50 is about a second or less (per-endpoint figures are in the table below). The raw vector lookup itself takes single-digit milliseconds; the full hybrid + rerank path is what you see end-to-end.
LLM-bound (write/reasoning path)
Fact extraction, conflict resolution, and dialectic reasoning call a reasoning model. These are seconds, not milliseconds — and they are exactly the operations you should run in the background.
Per-endpoint guidance
Typical observed latency on warm infrastructure. These are guidance targets for capacity planning, not contractual guarantees — see SLA below. Cold starts and large batches run higher.
| Endpoint | Typical p50 | Class | Mode |
|---|---|---|---|
| POST /v1/search | ~0.4–1s warm | Retrieval | Inline (hot path) |
| POST /v1/memories (raw store) | sub-second | Write | Inline ok |
| GET /v1/memories/{id} | sub-second | Read | Inline |
| POST /v1/chat/completions | ~2–8s | LLM + retrieval | Stream tokens |
| POST /v1/memories/process (sync) | ~7–22s | LLM extraction | Use async |
| POST /v1/memories/process (async_dispatch) | ~1–2s to 202 | Queued | Background |
| POST /v1/profile/dialectic | ~10–14s | LLM reasoning | Background / await |
| PATCH /v1/memories/{id} (re-extract) | ~10–15s | LLM + re-index | Background |
Note: these are end-to-end numbers — the full hybrid retrieval + rerank (and, for chat, LLM generation) — not an isolated vector-lookup figure. Measure against your own clients using the X-Process-Time response header.
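As a rough worked example, the ranges in the table above can be added up to see what a user waits for with and without synchronous extraction. This is back-of-the-envelope budgeting, not a percentile calculation, and the 2 s figure for your own LLM is an assumption made purely for illustration.

```python
# Typical warm p50 ranges in seconds, copied from the table above.
SEARCH = (0.4, 1.0)         # POST /v1/search
PROCESS_SYNC = (7.0, 22.0)  # POST /v1/memories/process (sync)

# Assumption for illustration only: your own LLM takes about 2 s to reply.
OWN_LLM = (2.0, 2.0)


def add_ranges(*ranges):
    """Sum (low, high) latency ranges for steps that run one after another."""
    return (sum(r[0] for r in ranges), sum(r[1] for r in ranges))


hot_path = add_ranges(SEARCH, OWN_LLM)
blocked = add_ranges(SEARCH, OWN_LLM, PROCESS_SYNC)

print(f"search + LLM:          {hot_path[0]:.1f}-{hot_path[1]:.1f} s")
print(f"... + sync extraction: {blocked[0]:.1f}-{blocked[1]:.1f} s")
```

Under these assumptions the turn takes about 2.4 to 3.0 s with extraction in the background, against roughly 9.4 to 25.0 s when the turn waits on synchronous extraction.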
Production pattern: async extraction
Set async_dispatch: true on /v1/memories/process. You get a 202 with a job_id in ~1–2s; poll /v1/memories/jobs/{job_id} for completion. Your user already has their answer.
```python
import os, time, requests

BASE = "https://api.hebbrix.com/v1"
H = {"Authorization": f"Bearer {os.environ['HEBBRIX_API_KEY']}"}

# 1) Inline: fetch context for THIS turn (fast, hot path)
ctx = requests.post(f"{BASE}/search",
    headers=H, json={"query": user_message, "collection_id": "coll_123", "limit": 5}).json()

# 2) Respond to the user with YOUR LLM (using ctx) ... already done here ...

# 3) Off the hot path: learn from the turn ASYNCHRONOUSLY
job = requests.post(f"{BASE}/memories/process", headers=H, json={
    "messages": [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_reply},
    ],
    "collection_id": "coll_123",
    "async_dispatch": True,  # <-- returns 202 + job_id in ~1-2s
}).json()

# 4) (Optional) poll for completion in a background worker
job_id = job["job_id"]
while True:
    s = requests.get(f"{BASE}/memories/jobs/{job_id}", headers=H).json()
    if s["status"] in ("completed", "failed"):
        break
    time.sleep(1)
```
SLA & status
What we publish
The numbers above are operating targets. Formal per-plan p50/p95/p99 SLA targets and uptime commitments are part of the enterprise agreement — contact us for the current SLA document.
Observability
Every response carries X-Process-Time (server ms) and X-Request-ID. Measure real latency from your own clients and include the request id when reporting a slow call.
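A minimal sketch of that measurement: time the call on the client, read the two headers, and report the difference. `timed_call` and `send` are hypothetical names, and the code assumes the `X-Process-Time` value is a plain number of milliseconds, which is worth confirming against a real response.

```python
import time


def timed_call(send):
    """Run one request and compare client wall time with the server-reported time.

    send is a hypothetical zero-argument callable that performs the HTTP call
    and returns an object with a .headers mapping (requests.Response fits).
    """
    start = time.perf_counter()
    response = send()
    client_ms = (time.perf_counter() - start) * 1000.0

    raw = response.headers.get("X-Process-Time")
    try:
        # Assumption: the header value is a plain number of milliseconds.
        server_ms = float(raw) if raw is not None else None
    except ValueError:
        server_ms = None

    return {
        "request_id": response.headers.get("X-Request-ID"),
        "client_ms": client_ms,
        "server_ms": server_ms,
        # Time spent outside the server: network, TLS, client-side queuing.
        "overhead_ms": None if server_ms is None else client_ms - server_ms,
    }
```

With `requests`, usage could look like `timed_call(lambda: requests.post(url, headers=H, json=body))`. A large `overhead_ms` points at the network or the client rather than the API, and the `request_id` is the value to quote when reporting a slow call.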
Don't benchmark warm-solo latency as your SLA. Measure p95/p99 under your real concurrency, and keep all LLM-bound writes in the background so a slow extraction never delays a user-facing response.
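To get tail latencies rather than a single warm number, one option is to replay a representative call from several threads and summarize the samples. The helper names, the request count, and the concurrency level below are all assumptions for illustration; set them to match your real traffic.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def summarize(samples_ms):
    """Return p50/p95/p99 from latency samples (needs at least two samples)."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98], "count": len(samples_ms)}


def measure_under_load(call, total=200, concurrency=16):
    """Run call() `total` times across `concurrency` threads; report percentiles in ms."""
    def one(_):
        start = time.perf_counter()
        call()
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        samples = list(pool.map(one, range(total)))

    return summarize(samples)
```

Here `call` would wrap a single inline request such as a search. A few hundred samples give only a coarse p99, so treat the result as a sanity check and run longer tests before committing to a number.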
