Hebbrix
Documents

Documents & Media

Upload documents, images, videos, and audio files. Hebbrix automatically extracts content, generates embeddings, and makes everything searchable.

Supported Formats

Documents

PDF, DOCX, TXT, MD, HTML, CSV

Images

PNG, JPG, WebP, GIF, SVG

Videos

MP4, WebM, MOV (transcription)

Audio

MP3, WAV, M4A (transcription)

Processing Pipeline

  1. 1Upload - File uploaded and validated
  2. 2Extract - Text extraction, OCR for images, transcription for audio/video
  3. 3Chunk - Content split into semantic chunks with overlap
  4. 4Embed - Generate vector embeddings for each chunk
  5. 5Index - Store in vector database for fast retrieval

Endpoints

Code Examples

Upload a Document

Python
import os
import requests

BASE = "https://api.hebbrix.com/v1"
H = {"Authorization": f"Bearer {os.environ['HEBBRIX_API_KEY']}"}

# POST /v1/documents/upload is multipart/form-data.
# `collection_id` is a FORM field (not a query param). Omit it to let the
# server auto-assign your default collection — the response field
# `collection_auto_assigned` will be true when that happens.
with open("research_paper.pdf", "rb") as f:
    r = requests.post(
        f"{BASE}/documents/upload",
        headers=H,
        files={"file": ("research_paper.pdf", f, "application/pdf")},
        data={
            "collection_id": "col_xyz",  # optional
            "category": "research",      # optional
            "tags": "ml,2024",           # optional, comma-separated
        },
    )
body = r.json()
doc = body["document"]

print(f"Document ID: {doc['id']}")
# Legacy internal status (fine-grained enum):
#   uploaded / processing / searchable / enriching / enriched / processed / failed / deleted
print(f"status = {doc['status']}")
# PDF-contract lifecycle status (prefer this for new integrations):
#   pending / processing / completed / failed / deleted
print(f"processing_status = {doc['processing_status']}")
print(f"auto_assigned_default = {body['collection_auto_assigned']}")

Poll until processing completes

Python (with polling)
import time

doc_id = doc["id"]
while True:
    r = requests.get(f"{BASE}/documents/{doc_id}", headers=H)
    doc = r.json()["document"]
    if doc["processing_status"] in ("completed", "failed"):
        break
    print(f"Processing ({doc['processing_status']})…")
    time.sleep(2)

if doc["processing_status"] == "failed":
    raise RuntimeError(f"Processing failed: {doc.get('processing_error')}")

print(f"Document ready — {doc['chunk_count']} chunks, {doc['memory_count']} memories")

cURL Example

The endpoint is multipart/form-data. collection_id is a form field (not a query param); omit it to let the server auto-assign the caller's default collection. Add -H "X-Hebbrix-Require-Collection: true" to forbid silent defaulting.

Upload Document
# Upload to a specific collection
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -F "file=@document.pdf" \
  -F "collection_id=col_xyz" \
  -F "category=research" \
  -F "tags=ml,2024"

# Let the server auto-assign to your default collection
# (response will include  "collection_auto_assigned": true )
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -F "file=@document.pdf"

# Strict mode — 422 if collection_id is missing
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -H "X-Hebbrix-Require-Collection: true" \
  -F "file=@document.pdf" \
  -F "collection_id=col_xyz"

Processing Status

Document processing is asynchronous. Check the status field:

pending

Queued for processing

processing

Currently being processed

completed

Ready for search

failed

Check error field

Assistant

Ask me anything about Hebbrix