# Document Storage

Hyphen provides built-in document storage so workflows can operate on uploaded files rather than requiring data to be embedded in API payloads. Upload a CSV, PDF, or JSON file once, get back a `document_id`, and reference it anywhere in your workflow using the `doc:` prefix.

This separates data ingestion from workflow execution — no base64-encoded payloads, no external URL fetching at runtime, no SSRF risk.


## Uploading Documents

Upload via multipart form data:

```bash
curl -X POST https://apisvr.tryhyphen.com/documents \
  -H "X-Org-Id: acme-corp" \
  -F "file=@transactions_q4.csv" \
  -F 'name=Q4 Transactions' \
  -F 'tags=["finance", "quarterly"]' \
  -F 'metadata={"department": "accounting"}' \
  -F 'ttl_days=90'
```

Response:

```json
{
  "document": {
    "id": "doc_a1b2c3d4e5f6",
    "name": "Q4 Transactions",
    "content_type": "text/csv",
    "size_bytes": 245890,
    "checksum_sha256": "e3b0c44298fc...",
    "current_version": 1,
    "status": "ready",
    "tags": ["finance", "quarterly"],
    "created_at": "2026-02-01T00:00:00Z"
  }
}
```

### Supported Content Types

| Content Type | Extensions | Max Size |
|---|---|---|
| CSV | `.csv` | 50 MB |
| JSON | `.json` | 50 MB |
| Plain text | `.txt` | 50 MB |
| PDF | `.pdf` | 50 MB |
| Excel | `.xlsx`, `.xls` | 50 MB |
| Images | `.png`, `.jpg`, `.jpeg` | 50 MB |

### Deduplication

If you upload a file with the same SHA-256 checksum as an existing document in your org, Hyphen returns the existing `document_id` instead of creating a duplicate. This is transparent — you still get a 201 with the existing document, and a `deduplicated: true` flag in the response indicates what happened.
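Since deduplication keys on the file's SHA-256 digest, you can compute the same checksum client-side before uploading to predict whether a new document will be created. A minimal sketch in Python, streaming the file so large uploads never sit fully in memory:

```python
import hashlib

def sha256_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

If the result matches an existing document's `checksum_sha256`, the upload will resolve to that document.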

### Upload from URL

For files already hosted elsewhere:

```bash
curl -X POST https://apisvr.tryhyphen.com/documents/from-url \
  -H "X-Org-Id: acme-corp" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://storage.example.com/reports/q4.csv",
    "name": "Q4 Report",
    "tags": ["finance"]
  }'
```

Hyphen fetches the file, validates it, and stores it internally. The external URL is not accessed at workflow execution time.


## Referencing Documents in Workflows

Use the `doc:` prefix anywhere you would normally pass inline data:

```json
{
  "type": "matcher",
  "properties": {
    "left": "doc:doc_a1b2c3d4e5f6",
    "right": "doc:doc_x7y8z9w0v1u2",
    "matchOn": ["invoice_id"],
    "tolerance": 0.02,
    "outputMatched": "matched",
    "outputUnmatchedLeft": "unmatched_invoices"
  }
}
```

At execution time, the engine resolves doc: references by streaming the file from managed storage and parsing it based on content type.
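A sketch of what that resolution step might look like, assuming a hypothetical `fetch` callable that loads and parses a document by id (the engine's actual implementation is not documented here):

```python
from typing import Any, Callable

DOC_PREFIX = "doc:"

def resolve_doc_refs(value: Any, fetch: Callable[[str], Any]) -> Any:
    """Recursively replace 'doc:<id>' strings with fetched, parsed content."""
    if isinstance(value, str) and value.startswith(DOC_PREFIX):
        return fetch(value[len(DOC_PREFIX):])
    if isinstance(value, dict):
        return {k: resolve_doc_refs(v, fetch) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_doc_refs(v, fetch) for v in value]
    return value
```

Applied to the matcher properties above, `left` and `right` would resolve to parsed datasets while scalar properties like `tolerance` pass through unchanged.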

### Resolution by Content Type

| Content Type | Resolves To | Example Use |
|---|---|---|
| `text/csv` | `Array<Object>` — header row becomes keys | Matcher `left`/`right`, Loop `items_path` |
| `application/json` | Object or Array as-is | Any property expecting structured data |
| `text/plain` | String | LLM template input, text analysis |
| `application/pdf` | Binary handle: `{ doc_id, content_type, storage_key, size_bytes }` | Agent tool input for document processing |
| `image/*` | Binary handle | Agent tool input for image analysis |
| Excel (`.xlsx`) | `Array<Object>` — first sheet, header row becomes keys | Same as CSV |

CSV parsing is strict: UTF-8 only, comma-delimited, first row as the header, and consistent column counts across rows. Malformed files fail at upload time with diagnostic errors (e.g., "Row 47 has 6 columns, expected 5 based on header row"). Fix the file and re-upload.
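The same strict rules can be mirrored client-side to catch malformed files before upload. A sketch, with the error message format modeled on the example above:

```python
import csv
import io

def parse_strict_csv(text: str) -> list[dict[str, str]]:
    """Parse CSV under the documented rules: first row is the header,
    every data row must match the header's column count."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("Empty CSV")
    header = rows[0]
    records = []
    # Row numbers count from the top of the file, header included.
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            raise ValueError(
                f"Row {i} has {len(row)} columns, "
                f"expected {len(header)} based on header row"
            )
        records.append(dict(zip(header, row)))
    return records
```

This also shows the resolved shape: each row becomes an object keyed by the header, matching the `Array<Object>` form in the table above.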

### Using `doc:` with Loop
```json
{
  "type": "loop",
  "properties": {
    "mode": "foreach",
    "items_path": "doc:doc_customer_list",
    "item_variable_name": "customer",
    "actions_to_execute": [
      { "type": "send_welcome_email", "properties": { "email": "@customer.email" } }
    ]
  }
}
```

The CSV resolves to an array; each row becomes `@customer` inside the loop.
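A sketch of how an `@path` reference such as `@customer.email` could resolve against the per-iteration context (the exact resolution rules are assumptions based on the example above):

```python
def resolve_path(ref: str, context: dict) -> object:
    """Resolve an '@a.b.c' reference against a context dict.
    Values without the '@' prefix are treated as literals."""
    if not ref.startswith("@"):
        return ref
    value = context
    for part in ref[1:].split("."):
        value = value[part]
    return value
```

For each CSV row, the engine would bind the row to `customer` and resolve `@customer.email` before running the inner action.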


## Versioning

Upload a new version of an existing document without changing the document_id:

```bash
curl -X POST https://apisvr.tryhyphen.com/documents/doc_a1b2c3d4e5f6/versions \
  -H "X-Org-Id: acme-corp" \
  -F "file=@transactions_q4_updated.csv" \
  -F 'change_note=Updated with December corrections'
```

The `current_version` increments automatically. Workflows referencing `doc:doc_a1b2c3d4e5f6` always get the latest version.

### Pinning to a Version

For audit reproducibility, pin a specific version:

```json
{
  "left": "doc:doc_a1b2c3d4e5f6@2"
}
```

The `@N` suffix locks the reference to version N. This matters for regulated workflows where you need to prove exactly which dataset was used in a given run.

### Reading Version History

Use the document audit endpoint for version history:

```bash
curl https://apisvr.tryhyphen.com/documents/doc_a1b2c3d4e5f6/audit \
  -H "X-Org-Id: acme-corp"
```

The audit stream includes version_created entries with version metadata.


## Webhook Triggers

Register webhooks to trigger your own automation service when documents are uploaded:

```json
{
  "event": "document.uploaded",
  "url": "https://automation.example.com/hyphen/hooks/document-uploaded",
  "filter": {
    "tags": ["invoices"],
    "content_type": "text/csv"
  },
  "auto_payload": {
    "workflow_id": "wfl-123e4567-e89b-12d3-a456-426614174000"
  }
}
```
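A sketch of the filter semantics, assuming every listed tag must be present and `content_type` must match exactly (these matching rules are assumptions, not documented behavior):

```python
def matches_filter(document: dict, flt: dict) -> bool:
    """Return True if a document satisfies a webhook filter."""
    wanted_tags = set(flt.get("tags", []))
    if not wanted_tags.issubset(document.get("tags", [])):
        return False
    content_type = flt.get("content_type")
    if content_type is not None and document.get("content_type") != content_type:
        return False
    return True
```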

Your automation service can then call `POST /workflows/:id/execute` with:

```json
{
  "input": {
    "invoices": "doc:<document_id>"
  }
}
```

This enables the upload-to-execution pattern: data providers drop files, workflows run automatically. See Webhooks for full configuration.
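A sketch of the glue your automation service might implement, turning a `document.uploaded` delivery into an execute call (the exact webhook payload shape is an assumption based on the examples above):

```python
def build_execute_request(event: dict) -> tuple[str, dict]:
    """Map a document.uploaded webhook payload to a workflow execute call.
    Returns the request path and JSON body."""
    workflow_id = event["auto_payload"]["workflow_id"]
    doc_id = event["document"]["id"]
    path = f"/workflows/{workflow_id}/execute"
    body = {"input": {"invoices": f"doc:{doc_id}"}}
    return path, body
```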


## Audit Trail

Every document action is tracked:

| Event | When |
|---|---|
| `uploaded` | New document created |
| `version_created` | New version uploaded |
| `downloaded` | File content retrieved |
| `metadata_updated` | Metadata changed |
| `deleted` | Document soft-deleted |

```bash
curl https://apisvr.tryhyphen.com/documents/doc_a1b2c3d4e5f6/audit?limit=50 \
  -H "X-Org-Id: acme-corp"
```

The `document.referenced` event links the document to a specific `run_id`, creating a bidirectional audit trail: from the document, you can see which workflow runs used it; from a workflow run, you can see which documents it consumed.


## Storage Limits

| Limit | Default |
|---|---|
| Storage per organization | 10 GB |
| Documents per organization | 10,000 |
| Maximum file size | 50 MB |
| Maximum versions per document | 100 |

Check current usage:

```bash
curl https://apisvr.tryhyphen.com/documents/storage-usage \
  -H "X-Org-Id: acme-corp"
```
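A client-side pre-flight check against the default limits can avoid failed uploads. A sketch using the defaults from the table above (`used_bytes` and `doc_count` would come from the storage-usage response; its field names are not shown here, so they are passed in explicitly):

```python
MAX_FILE_BYTES = 50 * 1024 * 1024        # 50 MB per file
MAX_ORG_BYTES = 10 * 1024 * 1024 * 1024  # 10 GB per organization
MAX_DOCS = 10_000                        # documents per organization

def can_upload(file_size: int, used_bytes: int, doc_count: int) -> bool:
    """Check a prospective upload against the documented default limits."""
    return (
        file_size <= MAX_FILE_BYTES
        and used_bytes + file_size <= MAX_ORG_BYTES
        and doc_count < MAX_DOCS
    )
```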

## Common Patterns

**Reconciliation with uploaded datasets.** Upload invoices and payments as CSVs → reference both in a matcher step → agent investigates exceptions → results logged to custom table.

**Agent document processing.** Upload a PDF → agent receives a binary handle → agent uses an LLM action to extract text → agent reasons about the content → routes to appropriate workflow.

**Scheduled reconciliation with versioned data.** Upload new data monthly → version the same document → scheduled workflow always runs against the latest version. Pin previous versions for historical comparisons.

**Webhook-driven ingestion.** External system drops a CSV → document.uploaded webhook fires → agent classifies the document → triggers the correct processing workflow. This is the Agent as Trigger pattern with document storage.

Document storage is infrastructure, not a primitive. It enhances existing primitives — matcher can consume `doc:` references for its datasets, loop can iterate over uploaded CSVs, and agents can process uploaded PDFs. Documents plug into the same `@path` context resolution system used everywhere else.

→ Next: [Conditional Logic](/platform/conditional-logic)