Documents & Media

Upload documents, images, videos, and audio files. Hebbrix automatically extracts content, generates embeddings, and makes everything searchable.

Supported Formats

Documents

PDF, DOCX, TXT, MD, HTML, CSV

Images

PNG, JPG, WebP, GIF, SVG

Videos

MP4, WebM, MOV (transcription)

Audio

MP3, WAV, M4A (transcription)
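
A client can check a file's extension against the lists above before uploading, which avoids a round trip for files the service would not accept. The sketch below is illustrative only: `SUPPORTED_FORMATS` and `media_kind` are hypothetical helper names, not part of the Hebbrix API, and the check covers only the extensions listed on this page.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the Supported Formats lists on this page.
SUPPORTED_FORMATS = {
    "document": {"pdf", "docx", "txt", "md", "html", "csv"},
    "image": {"png", "jpg", "webp", "gif", "svg"},
    "video": {"mp4", "webm", "mov"},
    "audio": {"mp3", "wav", "m4a"},
}


def media_kind(filename: str) -> Optional[str]:
    """Return the media category for a filename, or None if unsupported."""
    ext = Path(filename).suffix.lower().lstrip(".")
    for kind, extensions in SUPPORTED_FORMATS.items():
        if ext in extensions:
            return kind
    return None


print(media_kind("research_paper.pdf"))  # document
print(media_kind("talk.MOV"))            # video
print(media_kind("archive.zip"))         # None
```

The extension is only a hint; the server still validates the file during the Upload step.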

Processing Pipeline

  1. Upload - File uploaded and validated
  2. Extract - Text extraction, OCR for images, transcription for audio/video
  3. Chunk - Content split into semantic chunks with overlap
  4. Embed - Generate vector embeddings for each chunk
  5. Index - Store in vector database for fast retrieval
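
The idea of overlapping chunks in the Chunk step can be illustrated with a simple sliding window. This is a sketch only: Hebbrix splits on semantic boundaries, whereas the hypothetical `chunk_words` below uses fixed-size word windows, and the `size` and `overlap` values are illustrative rather than the service's actual settings.

```python
from typing import List


def chunk_words(text: str, size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
for chunk in chunk_words(sample, size=4, overlap=1):
    print(chunk)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk.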

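Endpoints

The examples on this page use:

POST /v1/documents/upload - Upload a file as multipart/form-data
GET /v1/documents/{doc_id} - Retrieve a document and its processing state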

Code Examples

Upload a Document

Python
import os
import requests

BASE = "https://api.hebbrix.com/v1"
H = {"Authorization": f"Bearer {os.environ['HEBBRIX_API_KEY']}"}

# POST /v1/documents/upload is multipart/form-data.
# `collection_id` is a FORM field (not a query param). Omit it to let the
# server auto-assign your default collection — the response field
# `collection_auto_assigned` will be true when that happens.
with open("research_paper.pdf", "rb") as f:
    r = requests.post(
        f"{BASE}/documents/upload",
        headers=H,
        files={"file": ("research_paper.pdf", f, "application/pdf")},
        data={
            "collection_id": "col_xyz",  # optional
            "category": "research",      # optional
            "tags": "ml,2024",           # optional, comma-separated
        },
    )
r.raise_for_status()  # stop on HTTP errors before parsing the body
body = r.json()
doc = body["document"]

print(f"Document ID: {doc['id']}")
# Legacy internal status (fine-grained enum):
#   uploaded / processing / searchable / enriching / enriched / processed / failed / deleted
print(f"status = {doc['status']}")
# PDF-contract lifecycle status (prefer this for new integrations):
#   pending / processing / completed / failed / deleted
print(f"processing_status = {doc['processing_status']}")
print(f"auto_assigned_default = {body['collection_auto_assigned']}")

Poll until processing completes

Python (with polling)
import time

doc_id = doc["id"]
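deadline = time.monotonic() + 300  # client-side limit of 5 minutes (arbitrary)
while True:
    r = requests.get(f"{BASE}/documents/{doc_id}", headers=H)
    r.raise_for_status()
    doc = r.json()["document"]
    # "deleted" is also terminal: such a document will never complete.
    if doc["processing_status"] in ("completed", "failed", "deleted"):
        break
    if time.monotonic() > deadline:
        raise TimeoutError(f"Document {doc_id} still {doc['processing_status']} after 5 minutes")
    print(f"Processing ({doc['processing_status']})…")
    time.sleep(2)

if doc["processing_status"] != "completed":
    raise RuntimeError(f"Processing {doc['processing_status']}: {doc.get('processing_error')}")

print(f"Document ready: {doc['chunk_count']} chunks, {doc['memory_count']} memories")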
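
For long transcriptions a fixed 2-second poll can generate many requests. The sketch below shows one possible alternative, polling with capped exponential backoff. `wait_for_document` and its parameters are hypothetical names, and `fetch` is any callable that returns the `document` object, so the example runs here against a fake instead of the network.

```python
import time
from typing import Callable, Dict

# Terminal values of processing_status, as listed in the upload example.
TERMINAL = {"completed", "failed", "deleted"}


def wait_for_document(
    fetch: Callable[[], Dict],
    timeout: float = 300.0,
    first_delay: float = 1.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call `fetch` until processing_status is terminal or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    delay = first_delay
    while True:
        doc = fetch()
        if doc["processing_status"] in TERMINAL:
            return doc
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"still {doc['processing_status']} after {timeout}s")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # exponential backoff, capped


# Demo with a fake fetch so the sketch runs without network access.
states = iter(["pending", "processing", "completed"])
result = wait_for_document(
    lambda: {"processing_status": next(states)},
    sleep=lambda seconds: None,
)
print(result["processing_status"])  # completed
```

Against the real API, `fetch` could wrap the GET request from the polling example and return `r.json()["document"]`.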

cURL Example

The endpoint is multipart/form-data. collection_id is a form field (not a query param); omit it to let the server auto-assign the caller's default collection. Add -H "X-Hebbrix-Require-Collection: true" to forbid silent defaulting.

Upload Document
# Upload to a specific collection
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -F "file=@document.pdf" \
  -F "collection_id=col_xyz" \
  -F "category=research" \
  -F "tags=ml,2024"

# Let the server auto-assign to your default collection
# (response will include  "collection_auto_assigned": true )
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -F "file=@document.pdf"

# Strict mode — 422 if collection_id is missing
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -H "X-Hebbrix-Require-Collection: true" \
  -F "file=@document.pdf" \
  -F "collection_id=col_xyz"
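
The same three variants can be expressed in Python by assembling the headers and form fields separately from the HTTP call. The helper name `upload_request_parts` is hypothetical; the header and field names are the ones shown in the cURL examples above.

```python
from typing import Dict, Optional, Tuple


def upload_request_parts(
    api_key: str,
    collection_id: Optional[str] = None,
    require_collection: bool = False,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, form_fields) for POST /v1/documents/upload."""
    headers = {"Authorization": f"Bearer {api_key}"}
    if require_collection:
        # Strict mode: the server rejects the upload instead of defaulting.
        headers["X-Hebbrix-Require-Collection"] = "true"
    form_fields: Dict[str, str] = {}
    if collection_id is not None:
        # Sent as a form field, not as a query parameter.
        form_fields["collection_id"] = collection_id
    return headers, form_fields


headers, data = upload_request_parts(
    "mem_sk_your_api_key", "col_xyz", require_collection=True
)
print(headers)
print(data)
```

The two dictionaries can then be passed as `headers=` and `data=` to `requests.post`, alongside `files=` for the file itself, as in the Python upload example.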

Processing Status

Document processing is asynchronous. Check the processing_status field:

pending - Queued for processing
processing - Currently being processed
completed - Ready for search
failed - Processing failed; check the processing_error field

Processing lifecycle & status fields

A document moves through a single lifecycle. The fields below describe the same progression from different angles, which is why they can look like they overlap. Use this section as the canonical reference.

State machine

uploaded → extracting → processing → indexing → searchable

At any step the document can transition to failed instead.
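
The lifecycle can be written down as a small transition table, which is handy for sanity-checking state changes seen by a client. This is a sketch under stated assumptions: the names `LIFECYCLE`, `TRANSITIONS`, and `can_transition` are hypothetical, and it assumes that "searchable" and "failed" are terminal, which this page does not state explicitly.

```python
from typing import Dict, List, Set

# Order taken from the state machine above.
LIFECYCLE: List[str] = ["uploaded", "extracting", "processing", "indexing", "searchable"]

# Each non-final state may advance to the next state or fall to "failed".
TRANSITIONS: Dict[str, Set[str]] = {
    current: {nxt, "failed"} for current, nxt in zip(LIFECYCLE, LIFECYCLE[1:])
}
TRANSITIONS["searchable"] = set()  # assumed terminal (end of the happy path)
TRANSITIONS["failed"] = set()      # assumed terminal


def can_transition(current: str, new: str) -> bool:
    """Return True if `new` is a legal next state after `current`."""
    return new in TRANSITIONS.get(current, set())


print(can_transition("uploaded", "extracting"))  # True
print(can_transition("indexing", "failed"))      # True
print(can_transition("uploaded", "searchable"))  # False
```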

Field meanings

Field - Meaning
processing_status - The authoritative lifecycle state (e.g. "pending", "processing", "completed", "failed").
status - A coarse/legacy alias of the lifecycle (e.g. "processed"); prefer processing_status.
index_status - Indexing sub-state: "indexing" while embeddings/BM25 are being written, "completed" when done.
is_searchable - Boolean; true means the document's memories are retrievable via search right now. This can be true before memories_indexed equals memories_total, because search becomes available incrementally.
memories_created - Number of memories extracted from the document.
memories_indexed - Number of those memories fully embedded/indexed so far.
memories_total - Total memories expected for the document.

Treat is_searchable: true as the readiness signal for querying; use memories_indexed === memories_total only if you need every chunk fully indexed.
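
The two readiness checks can be sketched as small predicates over the document object. The function names are hypothetical, and the sketch assumes the fields from the table appear at the top level of the `document` object, which this page does not confirm.

```python
from typing import Dict


def is_queryable(doc: Dict) -> bool:
    """True as soon as search can return this document's memories."""
    return bool(doc.get("is_searchable"))


def is_fully_indexed(doc: Dict) -> bool:
    """True only when every expected memory has been indexed."""
    total = doc.get("memories_total")
    return total is not None and doc.get("memories_indexed") == total


# A document that is searchable while indexing is still in progress.
partial = {"is_searchable": True, "memories_indexed": 3, "memories_total": 10}
print(is_queryable(partial), is_fully_indexed(partial))  # True False
```

Most integrations only need `is_queryable`; `is_fully_indexed` suits cases such as evaluation runs where partial results would skew the outcome.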
