Documents & Media

Upload documents, images, videos, and audio files. Hebbrix automatically extracts content, generates embeddings, and makes everything searchable.

Supported Formats

Documents

PDF, DOCX, TXT, MD, HTML, CSV

Images

PNG, JPG, WebP, GIF, SVG

Videos

MP4, WebM, MOV (transcription)

Audio

MP3, WAV, M4A (transcription)
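
A client-side check against the list above can catch unsupported files before an upload is attempted. The following is a minimal sketch: the helper name `media_kind` and the group labels are illustrative, not part of the Hebbrix API, and the server may accept extensions not listed here (for example `.jpeg`).

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the Supported Formats list. The group labels
# ("document", "image", ...) are illustrative names, not API values.
SUPPORTED = {
    "document": {"pdf", "docx", "txt", "md", "html", "csv"},
    "image": {"png", "jpg", "webp", "gif", "svg"},
    "video": {"mp4", "webm", "mov"},
    "audio": {"mp3", "wav", "m4a"},
}


def media_kind(filename: str) -> Optional[str]:
    """Return the media group for a filename, or None if it is not listed."""
    ext = Path(filename).suffix.lower().lstrip(".")
    for kind, extensions in SUPPORTED.items():
        if ext in extensions:
            return kind
    return None
```

Calling `media_kind("research_paper.pdf")` returns `"document"`, while an unlisted extension such as `.zip` returns `None`.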

Processing Pipeline

1. Upload - File uploaded and validated
2. Extract - Text extraction, OCR for images, transcription for audio/video
3. Chunk - Content split into semantic chunks with overlap
4. Embed - Generate vector embeddings for each chunk
5. Index - Store in vector database for fast retrieval
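
To illustrate what "chunks with overlap" means in the Chunk step, here is a simplified sketch. It uses fixed-size word windows rather than semantic boundaries, and the function name, `size`, and `overlap` defaults are assumptions chosen for illustration; Hebbrix's actual chunker and its parameters are not documented here.

```python
from typing import List


def chunk_words(text: str, size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into windows of `size` words, each sharing `overlap`
    words with the previous window. Illustrative only."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary appears in both neighboring chunks, so a search can still match it as a whole.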

Endpoints

Code Examples

Upload a Document

Python

import os
import requests

BASE = "https://api.hebbrix.com/v1"
H = {"Authorization": f"Bearer {os.environ['HEBBRIX_API_KEY']}"}

# POST /v1/documents/upload is multipart/form-data.
# `collection_id` is a FORM field (not a query param). Omit it to let the
# server auto-assign your default collection — the response field
# `collection_auto_assigned` will be true when that happens.
with open("research_paper.pdf", "rb") as f:
    r = requests.post(
        f"{BASE}/documents/upload",
        headers=H,
        files={"file": ("research_paper.pdf", f, "application/pdf")},
        data={
            "collection_id": "col_xyz",  # optional
            "category": "research",      # optional
            "tags": "ml,2024",           # optional, comma-separated
        },
    )
r.raise_for_status()  # surface 4xx/5xx errors before parsing the body
body = r.json()
doc = body["document"]

print(f"Document ID: {doc['id']}")
# Legacy internal status (fine-grained enum):
#   uploaded / processing / searchable / enriching / enriched / processed / failed / deleted
print(f"status = {doc['status']}")
# PDF-contract lifecycle status (prefer this for new integrations):
#   pending / processing / completed / failed / deleted
print(f"processing_status = {doc['processing_status']}")
print(f"auto_assigned_default = {body['collection_auto_assigned']}")

Poll until processing completes

Python (with polling)

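import time

# Reuses requests, BASE, H, and doc from the upload example above.
doc_id = doc["id"]
while True:
    r = requests.get(f"{BASE}/documents/{doc_id}", headers=H)
    r.raise_for_status()  # stop on 4xx/5xx instead of failing on a missing key
    doc = r.json()["document"]
    if doc["processing_status"] in ("completed", "failed"):
        break
    print(f"Processing ({doc['processing_status']})…")
    time.sleep(2)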

if doc["processing_status"] == "failed":
    raise RuntimeError(f"Processing failed: {doc.get('processing_error')}")

print(f"Document ready — {doc['chunk_count']} chunks, {doc['memory_count']} memories")
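
The loop above polls every two seconds with no upper bound. For long-running jobs, a bounded wait with exponential backoff may be preferable. The sketch below is one way to do that: `wait_for_document` and its parameters are hypothetical names, and `fetch` is any callable you supply that returns the `document` object from `GET /v1/documents/{id}`. It also treats `deleted` as terminal, since that value appears in the `processing_status` enum.

```python
import time
from typing import Any, Callable, Dict

# Values of processing_status after which polling has nothing left to wait for.
TERMINAL = {"completed", "failed", "deleted"}


def wait_for_document(
    fetch: Callable[[], Dict[str, Any]],
    timeout: float = 300.0,
    initial_delay: float = 1.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch() until processing_status is terminal, doubling the
    delay between attempts (capped at max_delay). Raises TimeoutError
    if the next wait would run past the deadline."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        doc = fetch()
        if doc["processing_status"] in TERMINAL:
            return doc
        if time.monotonic() + delay > deadline:
            raise TimeoutError(
                f"still {doc['processing_status']!r} after {timeout} seconds"
            )
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

With the earlier example's names, `fetch` could be `lambda: requests.get(f"{BASE}/documents/{doc_id}", headers=H).json()["document"]`.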

cURL Example

The endpoint is multipart/form-data. collection_id is a form field (not a query param); omit it to let the server auto-assign the caller's default collection. Add -H "X-Hebbrix-Require-Collection: true" to forbid silent defaulting.

Upload Document

# Upload to a specific collection
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -F "file=@document.pdf" \
  -F "collection_id=col_xyz" \
  -F "category=research" \
  -F "tags=ml,2024"

# Let the server auto-assign to your default collection
# (response will include  "collection_auto_assigned": true )
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -F "file=@document.pdf"

# Strict mode — 422 if collection_id is missing
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -H "X-Hebbrix-Require-Collection: true" \
  -F "file=@document.pdf" \
  -F "collection_id=col_xyz"
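
The three cURL variants differ only in their headers and form fields, so in Python they can be produced by one helper. This is a sketch under assumptions: `upload_kwargs` is a hypothetical helper, not part of any Hebbrix SDK. It builds only the `headers` and `data` arguments and fails early in strict mode, mirroring the 422 the server is described as returning when `collection_id` is missing.

```python
from typing import Any, Dict, List, Optional


def upload_kwargs(
    api_key: str,
    collection_id: Optional[str] = None,
    category: Optional[str] = None,
    tags: Optional[List[str]] = None,
    require_collection: bool = False,
) -> Dict[str, Any]:
    """Build headers and form fields for POST /v1/documents/upload."""
    headers = {"Authorization": f"Bearer {api_key}"}
    if require_collection:
        if not collection_id:
            # The server is documented to answer 422 here; fail before sending.
            raise ValueError("strict mode requires collection_id")
        headers["X-Hebbrix-Require-Collection"] = "true"
    data: Dict[str, str] = {}
    if collection_id:
        data["collection_id"] = collection_id
    if category:
        data["category"] = category
    if tags:
        data["tags"] = ",".join(tags)  # the API expects a comma-separated string
    return {"headers": headers, "data": data}
```

The result can be passed straight through, for example `requests.post(url, files={"file": f}, **upload_kwargs(key, collection_id="col_xyz"))`.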

Processing Status

Document processing is asynchronous. Check the processing_status field:

Value	Meaning
`pending`	Queued for processing
`processing`	Currently being processed
`completed`	Ready for search
`failed`	Processing failed; check the `processing_error` field

Processing lifecycle & status fields

A document moves through a single lifecycle. The fields below describe the same progression from different angles, which is why they can look like they overlap. Use this section as the canonical reference.

State machine

uploaded → extracting → processing → indexing → searchable

At any step the document can transition to failed instead.
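
The progression can be written down as a small transition table, which is useful when validating status changes in client code or tests. This is a sketch: the state names come from the diagram above, and treating `searchable` and `failed` as terminal is an assumption. Whether the API returns these exact strings in any field is not confirmed here.

```python
from typing import Set

# Happy-path order, as shown in the state machine diagram.
PIPELINE = ["uploaded", "extracting", "processing", "indexing", "searchable"]


def allowed_next(state: str) -> Set[str]:
    """Return the states a document may move to from `state`.
    Assumes "searchable" and "failed" are terminal."""
    if state not in PIPELINE or state == PIPELINE[-1]:
        return set()
    following = PIPELINE[PIPELINE.index(state) + 1]
    return {following, "failed"}
```

For example, `allowed_next("extracting")` yields `{"processing", "failed"}`, and a terminal state yields an empty set.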

Field meanings

Field	Meaning
`processing_status`	The authoritative lifecycle state (e.g. `"pending"`, `"processing"`, `"completed"`, `"failed"`).
`status`	A legacy internal status with its own, finer-grained set of values (e.g. `"processed"`); prefer `processing_status`.
`index_status`	Indexing sub-state: `"indexing"` while embeddings/BM25 are being written, `"completed"` when done.
`is_searchable`	Boolean; `true` means the document's memories are retrievable via search right now. This can be `true` before `memories_indexed` equals `memories_total`, because search becomes available incrementally.
`memories_created`	Number of memories extracted from the document.
`memories_indexed`	Number of those memories fully embedded/indexed so far.
`memories_total`	Total memories expected for the document.

Treat is_searchable: true as the readiness signal for querying; wait for memories_indexed === memories_total only if you need every memory fully indexed.
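
The readiness rule can be expressed as a small predicate over the document object. This is a minimal sketch: the function name and the `require_full_index` flag are illustrative, and the field names are the ones listed in the table above.

```python
from typing import Any, Dict


def ready_for_search(doc: Dict[str, Any], require_full_index: bool = False) -> bool:
    """Return True when the document can be queried.

    By default only is_searchable is checked, because search becomes
    available incrementally. With require_full_index=True, every
    expected memory must also be indexed."""
    if not doc.get("is_searchable", False):
        return False
    if require_full_index:
        total = doc.get("memories_total")
        return total is not None and doc.get("memories_indexed") == total
    return True
```

A document with `is_searchable: true`, `memories_indexed: 3`, and `memories_total: 10` passes the default check but not the strict one.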
