Latency & SLA - Performance Guidance

Honest, per-endpoint latency guidance and the production pattern for keeping memory extraction off your user's response path. Retrieval is fast; fact extraction is LLM-bound, so run it asynchronously.

Read on the hot path, write off it. Call search to fetch context inline (sub-second warm), then extract/learn memories asynchronously after you've already responded to the user. Never block a chat turn on /v1/memories/process in synchronous mode.

01. What is fast, what is LLM-bound

Fast (retrieval path). Search, raw writes, and reads are dominated by vector + keyword lookup and a rerank pass. Warm p50 is sub-second. The raw vector lookup itself is single-digit-millisecond; the full hybrid + rerank path is what you see end-to-end.
LLM-bound (write/reasoning path). Fact extraction, conflict resolution, and dialectic reasoning call a reasoning model. These are seconds, not milliseconds, and they are exactly the operations you should run in the background.

02. Per-endpoint guidance

Typical observed latency on warm infrastructure. These are guidance targets for capacity planning, not contractual guarantees (see SLA below). Cold starts and large batches run higher.

Endpoint	Typical p50	Class	Mode
`POST /v1/search`	~0.4–1s warm	Retrieval	Inline (hot path)
`POST /v1/memories` (raw store)	sub-second	Write	Inline OK
`GET /v1/memories/{id}`	sub-second	Read	Inline
`POST /v1/chat/completions`	~2–8s	LLM + retrieval	Stream tokens
`POST /v1/memories/process` (sync)	~7–22s	LLM extraction	Use async (see 03)
`POST /v1/memories/process` (async_dispatch)	~1–2s to 202	Queued	Background
`POST /v1/profile/dialectic`	~10–14s	LLM reasoning	Background / await
`PATCH /v1/memories/{id}` (re-extract)	~10–15s	LLM + re-index	Background

Note: these are end-to-end numbers covering the full hybrid retrieval + rerank (and, for chat, LLM generation), not an isolated vector-lookup figure. Measure against your own clients using the X-Process-Time response header.
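
To make that measurement concrete, here is a minimal sketch that compares client-side wall time with the server-reported time. It assumes X-Process-Time carries a bare number of milliseconds; `parse_server_ms` and `timed_call` are hypothetical helper names, not part of the API.

```python
import time
from typing import Callable, Mapping, Optional


def parse_server_ms(value: Optional[str]) -> Optional[float]:
    """Parse X-Process-Time, ASSUMED here to be a bare number of milliseconds."""
    if value is None:
        return None
    try:
        return float(value.strip())
    except ValueError:
        return None  # unexpected format: report "unknown" rather than guess


def timed_call(send: Callable[[], Mapping[str, str]]) -> dict:
    """Run send(), which performs one request and returns its response headers."""
    start = time.perf_counter()
    headers = send()
    client_ms = (time.perf_counter() - start) * 1000.0
    server_ms = parse_server_ms(headers.get("X-Process-Time"))
    return {
        "client_ms": client_ms,    # what your user actually waits for
        "server_ms": server_ms,    # what the server reports
        # Network + TLS + client overhead, when the server time is known.
        "overhead_ms": None if server_ms is None else client_ms - server_ms,
        "request_id": headers.get("X-Request-ID"),
    }
```

With the requests library, `send` could be something like `lambda: requests.post(url, headers=H, json=payload, timeout=5).headers`.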

03. Production pattern: async extraction

Set async_dispatch: true on /v1/memories/process. You get a 202 with a job_id in ~1–2s; poll /v1/memories/jobs/{job_id} for completion. Your user already has their answer.

Python (async extraction)

import os
import time

import requests

BASE = "https://api.hebbrix.com/v1"
H = {"Authorization": f"Bearer {os.environ['HEBBRIX_API_KEY']}"}

user_message = "What did I say about my travel plans?"  # placeholder: this turn's user input

# 1) Inline: fetch context for THIS turn (fast, hot path)
resp = requests.post(f"{BASE}/search", headers=H, timeout=5,  # timeout value is an example
    json={"query": user_message, "collection_id": "coll_123", "limit": 5})
resp.raise_for_status()
ctx = resp.json()

# 2) Respond to the user with YOUR LLM (using ctx) ... already done here ...
assistant_reply = "You mentioned a trip in March."  # placeholder: the reply you already sent

# 3) Off the hot path: learn from the turn ASYNCHRONOUSLY
resp = requests.post(f"{BASE}/memories/process", headers=H, timeout=10, json={
    "messages": [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": assistant_reply},
    ],
    "collection_id": "coll_123",
    "async_dispatch": True,        # returns 202 + job_id in ~1-2s
})
resp.raise_for_status()
job = resp.json()

# 4) (Optional) poll for completion in a background worker
job_id = job["job_id"]
deadline = time.monotonic() + 120  # example cap so the loop cannot spin forever
while time.monotonic() < deadline:
    r = requests.get(f"{BASE}/memories/jobs/{job_id}", headers=H, timeout=10)
    r.raise_for_status()
    s = r.json()
    if s["status"] in ("completed", "failed"):
        break
    time.sleep(1)

04. SLA & status

What we publish. The numbers above are operating targets. Formal per-plan p50/p95/p99 SLA targets and uptime commitments are part of the enterprise agreement. Contact us for the current SLA document.
Observability. Every response carries X-Process-Time (server-side processing time in ms) and X-Request-ID. Measure real latency from your own clients and include the request ID when reporting a slow call.

Don't treat warm, single-request latency as your SLA. Measure p95/p99 under your real concurrency, and keep all LLM-bound writes in the background so a slow extraction never delays a user-facing response.
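
One way to get those tail numbers is to run the same call from several threads and take nearest-rank percentiles of the client-side timings. This is a rough sketch, not a load-testing tool; `measure`, `percentile`, and `summarize` are hypothetical helpers, and `call` stands for whatever request your application actually makes.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def measure(call: Callable[[], object], total: int, concurrency: int) -> List[float]:
    """Run call() `total` times across `concurrency` threads; return latencies in ms."""
    def one(_: int) -> float:
        start = time.perf_counter()
        call()
        return (time.perf_counter() - start) * 1000.0

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(one, range(total)))


def percentile(samples_ms: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample that covers pct% of all samples."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples_ms: Sequence[float]) -> Dict[str, float]:
    """p50/p95/p99 of the collected latencies."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

For example, `summarize(measure(lambda: do_search(), total=500, concurrency=20))`, where `do_search` is your own wrapper around the search request, gives tail latency at roughly 20 requests in flight.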