Documents & Media
Upload documents, images, videos, and audio files. Hebbrix automatically extracts content, generates embeddings, and makes everything searchable.
Supported Formats
Documents
PDF, DOCX, TXT, MD, HTML, CSV
Images
PNG, JPG, WebP, GIF, SVG
Videos
MP4, WebM, MOV (transcription)
Audio
MP3, WAV, M4A (transcription)
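A client can check a filename against the list above before uploading, which avoids a round trip for files that will be rejected. The sketch below is illustrative only: the `SUPPORTED` table and the `media_kind` helper are hypothetical names, and the extension sets are copied from the list above (so variants such as `.jpeg` are not assumed to be accepted).

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the Supported Formats list; nothing else is assumed.
SUPPORTED = {
    "document": {".pdf", ".docx", ".txt", ".md", ".html", ".csv"},
    "image": {".png", ".jpg", ".webp", ".gif", ".svg"},
    "video": {".mp4", ".webm", ".mov"},
    "audio": {".mp3", ".wav", ".m4a"},
}


def media_kind(filename: str) -> Optional[str]:
    """Return the media family for a filename, or None if its extension is not listed."""
    ext = Path(filename).suffix.lower()  # case-insensitive match on the last suffix
    for kind, extensions in SUPPORTED.items():
        if ext in extensions:
            return kind
    return None
```

Calling `media_kind("research_paper.PDF")` returns `"document"`, while an unlisted extension such as `notes.xlsx` returns `None`. The server remains the authority on what it accepts; this is only a local pre-check.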
Processing Pipeline
1. Upload - File uploaded and validated
2. Extract - Text extraction, OCR for images, transcription for audio/video
3. Chunk - Content split into semantic chunks with overlap
4. Embed - Generate vector embeddings for each chunk
5. Index - Store in vector database for fast retrieval
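To make the "chunks with overlap" idea in the Chunk step concrete, here is a minimal sketch of a sliding window over a list of words. It is not Hebbrix's chunker: the pipeline above describes semantic chunking, whose boundaries and sizes are not documented here. The function name `chunk_with_overlap` and the `size` and `overlap` defaults are assumptions chosen for illustration.

```python
from typing import List, Sequence


def chunk_with_overlap(words: Sequence[str], size: int = 200, overlap: int = 40) -> List[List[str]]:
    """Split `words` into windows of up to `size` items; neighbours share `overlap` items."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(list(words[start:start + size]))
        if start + size >= len(words):  # this window already reached the end
            break
    return chunks
```

With `size=4` and `overlap=1`, the ten words `w0` to `w9` become three chunks, and the last word of each chunk reappears as the first word of the next. Overlap of this kind keeps a sentence that straddles a boundary retrievable from either side.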
Endpoints
Code Examples
Upload a Document
Python
import os
import requests
BASE = "https://api.hebbrix.com/v1"
H = {"Authorization": f"Bearer {os.environ['HEBBRIX_API_KEY']}"}
# POST /v1/documents/upload is multipart/form-data.
# `collection_id` is a FORM field (not a query param). Omit it to let the
# server auto-assign your default collection — the response field
# `collection_auto_assigned` will be true when that happens.
with open("research_paper.pdf", "rb") as f:
    r = requests.post(
        f"{BASE}/documents/upload",
        headers=H,
        files={"file": ("research_paper.pdf", f, "application/pdf")},
        data={
            "collection_id": "col_xyz",  # optional
            "category": "research",      # optional
            "tags": "ml,2024",           # optional, comma-separated
        },
    )
body = r.json()
doc = body["document"]
print(f"Document ID: {doc['id']}")
# Legacy internal status (fine-grained enum):
# uploaded / processing / searchable / enriching / enriched / processed / failed / deleted
print(f"status = {doc['status']}")
# PDF-contract lifecycle status (prefer this for new integrations):
# pending / processing / completed / failed / deleted
print(f"processing_status = {doc['processing_status']}")
print(f"auto_assigned_default = {body['collection_auto_assigned']}")

Poll until processing completes
Python (with polling)
import time
doc_id = doc["id"]
while True:
    r = requests.get(f"{BASE}/documents/{doc_id}", headers=H)
    doc = r.json()["document"]
    if doc["processing_status"] in ("completed", "failed"):
        break
    print(f"Processing ({doc['processing_status']})…")
    time.sleep(2)
if doc["processing_status"] == "failed":
    raise RuntimeError(f"Processing failed: {doc.get('processing_error')}")
print(f"Document ready: {doc['chunk_count']} chunks, {doc['memory_count']} memories")

cURL Example
The endpoint is multipart/form-data. collection_id is a form field (not a query param); omit it to let the server auto-assign the caller's default collection. Add -H "X-Hebbrix-Require-Collection: true" to forbid silent defaulting.
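For Python callers, the same strict-mode request can be assembled as keyword arguments for `requests.post`. This is a sketch under stated assumptions: `strict_upload_kwargs` is a hypothetical helper, and only the URL, the `X-Hebbrix-Require-Collection` header, and the `collection_id` form field come from this page.

```python
BASE = "https://api.hebbrix.com/v1"


def strict_upload_kwargs(api_key: str, collection_id: str) -> dict:
    """Build url/headers/data for an upload that forbids silent defaulting."""
    return {
        "url": f"{BASE}/documents/upload",
        "headers": {
            "Authorization": f"Bearer {api_key}",
            # With this header set, the server rejects uploads that omit collection_id.
            "X-Hebbrix-Require-Collection": "true",
        },
        # collection_id travels as a multipart form field, not a query parameter.
        "data": {"collection_id": collection_id},
    }


# Usage (needs the third-party `requests` package and a real API key):
#   with open("document.pdf", "rb") as f:
#       r = requests.post(files={"file": f}, **strict_upload_kwargs(key, "col_xyz"))
#   r.raise_for_status()
```

Keeping the request description separate from the network call makes the header and form-field placement easy to inspect or unit-test without touching the API.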
Upload Document
# Upload to a specific collection
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
-H "Authorization: Bearer mem_sk_your_api_key" \
-F "file=@document.pdf" \
-F "collection_id=col_xyz" \
-F "category=research" \
-F "tags=ml,2024"
# Let the server auto-assign to your default collection
# (response will include "collection_auto_assigned": true )
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
-H "Authorization: Bearer mem_sk_your_api_key" \
-F "file=@document.pdf"
# Strict mode — 422 if collection_id is missing
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
-H "Authorization: Bearer mem_sk_your_api_key" \
-H "X-Hebbrix-Require-Collection: true" \
-F "file=@document.pdf" \
-F "collection_id=col_xyz"

Processing Status
Document processing is asynchronous. Check the processing_status field:
pending - Queued for processing
processing - Currently being processed
completed - Ready for search
failed - Check the processing_error field
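Because processing is asynchronous, a client loop that waits for a terminal status benefits from a deadline, so that a stuck job does not block forever. The helper below is a sketch, not part of any Hebbrix SDK: `wait_until_done`, its parameters, and the default timeout are assumptions. The caller supplies `fetch_status`, a function that returns the current `processing_status` string (for example by calling GET /v1/documents/{id}).

```python
import time
from typing import Callable

# Terminal values taken from the status list above.
TERMINAL = {"completed", "failed"}


def wait_until_done(
    fetch_status: Callable[[], str],
    timeout: float = 300.0,   # assumed default, tune for your file sizes
    interval: float = 2.0,    # seconds between polls
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status is terminal; raise TimeoutError once `timeout` has elapsed."""
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"document still {status!r} after {timeout} s")
        sleep(interval)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised in tests without real delays. In normal use, pass just `fetch_status` and inspect the returned string to decide whether to read `processing_error`.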
