Documents & Media
Upload documents, images, videos, and audio files. Hebbrix automatically extracts content, generates embeddings, and makes everything searchable.
Supported Formats
Documents
PDF, DOCX, TXT, MD, HTML, CSV
Images
PNG, JPG, WebP, GIF, SVG
Videos
MP4, WebM, MOV (transcription)
Audio
MP3, WAV, M4A (transcription)
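As a rough illustration, a client can check a file's extension against the list above before uploading. This is a minimal sketch: `SUPPORTED` and `media_kind` are hypothetical names, not part of the Hebbrix API, and the server may accept or reject extensions differently (for example, `.jpeg` is not listed above, so it is treated as unsupported here).

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the "Supported Formats" list above.
# Assumption: matching is done on the lowercase file extension only.
SUPPORTED = {
    "document": {"pdf", "docx", "txt", "md", "html", "csv"},
    "image": {"png", "jpg", "webp", "gif", "svg"},
    "video": {"mp4", "webm", "mov"},
    "audio": {"mp3", "wav", "m4a"},
}


def media_kind(filename: str) -> Optional[str]:
    """Return the media category for a filename, or None if it is not listed."""
    ext = Path(filename).suffix.lower().lstrip(".")
    for kind, extensions in SUPPORTED.items():
        if ext in extensions:
            return kind
    return None
```

Calling `media_kind("talk.mov")` returns `"video"`, which also tells you the file will go through transcription rather than plain text extraction.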
Processing Pipeline
1. Upload - File uploaded and validated
2. Extract - Text extraction, OCR for images, transcription for audio/video
3. Chunk - Content split into semantic chunks with overlap
4. Embed - Generate vector embeddings for each chunk
5. Index - Store in vector database for fast retrieval
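To make the idea of chunking with overlap concrete, here is a minimal sketch. It uses a fixed-size word window, which is a deliberate simplification: the source says Hebbrix splits content into semantic chunks but does not describe how, so `chunk_words` and its `size` and `overlap` parameters are illustrative assumptions, not the actual implementation.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap`
    words with the previous window.

    Assumption: a plain word window stands in for semantic chunking.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")

    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some words twice.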
Endpoints
Code Examples
Upload a Document
import os
import requests

BASE = "https://api.hebbrix.com/v1"
H = {"Authorization": f"Bearer {os.environ['HEBBRIX_API_KEY']}"}

# POST /v1/documents/upload is multipart/form-data.
# `collection_id` is a FORM field (not a query param). Omit it to let the
# server auto-assign your default collection; the response field
# `collection_auto_assigned` will be true when that happens.
with open("research_paper.pdf", "rb") as f:
    r = requests.post(
        f"{BASE}/documents/upload",
        headers=H,
        files={"file": ("research_paper.pdf", f, "application/pdf")},
        data={
            "collection_id": "col_xyz",  # optional
            "category": "research",      # optional
            "tags": "ml,2024",           # optional, comma-separated
        },
    )
r.raise_for_status()  # surface HTTP errors before parsing the body

body = r.json()
doc = body["document"]
print(f"Document ID: {doc['id']}")

# Legacy internal status (fine-grained enum):
# uploaded / processing / searchable / enriching / enriched / processed / failed / deleted
print(f"status = {doc['status']}")

# PDF-contract lifecycle status (prefer this for new integrations):
# pending / processing / completed / failed / deleted
print(f"processing_status = {doc['processing_status']}")
print(f"auto_assigned_default = {body['collection_auto_assigned']}")

Poll until processing completes
import time

doc_id = doc["id"]

# Poll the document until it reaches a terminal lifecycle state.
while True:
    r = requests.get(f"{BASE}/documents/{doc_id}", headers=H)
    r.raise_for_status()
    doc = r.json()["document"]
    if doc["processing_status"] in ("completed", "failed"):
        break
    print(f"Processing ({doc['processing_status']})…")
    time.sleep(2)

if doc["processing_status"] == "failed":
    raise RuntimeError(f"Processing failed: {doc.get('processing_error')}")

print(f"Document ready: {doc['chunk_count']} chunks, {doc['memory_count']} memories")

cURL Example
The endpoint is multipart/form-data. collection_id is a form field (not a query param); omit it to let the server auto-assign the caller's default collection. Add -H "X-Hebbrix-Require-Collection: true" to forbid silent defaulting.
# Upload to a specific collection
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -F "file=@document.pdf" \
  -F "collection_id=col_xyz" \
  -F "category=research" \
  -F "tags=ml,2024"

# Let the server auto-assign to your default collection
# (response will include "collection_auto_assigned": true)
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -F "file=@document.pdf"

# Strict mode: the server returns 422 if collection_id is missing
curl -X POST "https://api.hebbrix.com/v1/documents/upload" \
  -H "Authorization: Bearer mem_sk_your_api_key" \
  -H "X-Hebbrix-Require-Collection: true" \
  -F "file=@document.pdf" \
  -F "collection_id=col_xyz"

Processing Status
Document processing is asynchronous. Check the processing_status field:
- pending: Queued for processing
- processing: Currently being processed
- completed: Ready for search
- failed: Check the processing_error field
Processing lifecycle & status fields
A document moves through a single lifecycle. The fields below describe the same progression from different angles, which is why they can look like they overlap. Use this section as the canonical reference.
State machine
pending → processing → completed
At any step the document can transition to failed instead.
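The lifecycle can be sketched as a small transition table. This is an illustration under stated assumptions: the linear order pending, processing, completed is inferred from the processing_status values listed on this page, `TRANSITIONS`, `can_transition`, and `is_terminal` are hypothetical client-side names, and the `deleted` state is left out because the page does not say when it can occur.

```python
# Assumed transitions: pending -> processing -> completed,
# with "failed" reachable from any non-terminal step.
TRANSITIONS = {
    "pending": {"processing", "failed"},
    "processing": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def can_transition(current: str, new: str) -> bool:
    """Return True if moving from `current` to `new` fits the assumed lifecycle."""
    return new in TRANSITIONS.get(current, set())


def is_terminal(state: str) -> bool:
    """Return True for known states with no outgoing transitions."""
    return state in TRANSITIONS and not TRANSITIONS[state]
```

A polling client can use `is_terminal(doc["processing_status"])` as its stop condition instead of hard-coding the two terminal values.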
Field meanings
| Field | Meaning |
|---|---|
| processing_status | The authoritative lifecycle state (e.g. "pending", "processing", "completed", "failed"). |
| status | A legacy internal status with finer-grained values (e.g. "searchable", "processed"); prefer processing_status. |
| index_status | Indexing sub-state: "indexing" while embeddings/BM25 are being written, "completed" when done. |
| is_searchable | Boolean; true means the document's memories are retrievable via search right now. This can be true before memories_indexed equals memories_total, because search becomes available incrementally. |
| memories_created | Number of memories extracted from the document. |
| memories_indexed | Number of those memories fully embedded/indexed so far. |
| memories_total | Total memories expected for the document. |
Treat is_searchable: true as the readiness signal for querying; use memories_indexed === memories_total only if you need every chunk fully indexed.
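The readiness rule above could be wrapped in a polling helper along these lines. This is a sketch, not an official client: `indexing_progress`, `is_ready`, and `wait_until_ready` are hypothetical names, the timeout and interval defaults are arbitrary, and treating a `memories_total` of 0 as "not ready yet" in strict mode is an assumption. The `fetch` callable is expected to return the `document` object from `GET /documents/{id}`.

```python
import time
from typing import Callable


def indexing_progress(doc: dict) -> float:
    """Fraction of expected memories indexed so far (0.0 if the total is unknown)."""
    total = doc.get("memories_total") or 0
    indexed = doc.get("memories_indexed") or 0
    return indexed / total if total else 0.0


def is_ready(doc: dict, require_full_index: bool = False) -> bool:
    """Apply the readiness rule: is_searchable by default, full index if requested."""
    if require_full_index:
        total = doc.get("memories_total") or 0
        # Assumption: a total of 0 means extraction has not reported yet.
        return total > 0 and doc.get("memories_indexed") == total
    return bool(doc.get("is_searchable"))


def wait_until_ready(
    fetch: Callable[[], dict],
    require_full_index: bool = False,
    timeout: float = 300.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll `fetch` until the document is ready, has failed, or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        doc = fetch()
        if doc.get("processing_status") == "failed":
            raise RuntimeError(f"Processing failed: {doc.get('processing_error')}")
        if is_ready(doc, require_full_index):
            return doc
        if time.monotonic() >= deadline:
            raise TimeoutError("document was not ready before the timeout")
        sleep(interval)


# Example wiring, reusing BASE, H, and doc_id from the polling example:
# fetch = lambda: requests.get(f"{BASE}/documents/{doc_id}", headers=H).json()["document"]
# doc = wait_until_ready(fetch)
```

Passing `fetch` and `sleep` as parameters keeps the helper independent of any HTTP library and makes it easy to exercise with canned responses.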
