# Document Storage

Hyphen provides built-in document storage so workflows can operate on uploaded files rather than requiring data to be embedded in API payloads. Upload a CSV, PDF, or JSON file once, get back a `document_id`, and reference it anywhere in your workflow using the `doc:` prefix.

This separates data ingestion from workflow execution — no base64-encoded payloads, no external URL fetching at runtime, no SSRF risk.


## Uploading Documents

Upload via multipart form data:

```bash
curl -X POST https://apisvr.tryhyphen.com/documents \
  -H "X-Org-Id: acme-corp" \
  -F "file=@transactions_q4.csv" \
  -F 'name=Q4 Transactions' \
  -F 'tags=["finance", "quarterly"]' \
  -F 'metadata={"department": "accounting"}' \
  -F 'ttl_days=90'
```

Response:

```json
{
  "document": {
    "id": "doc_a1b2c3d4e5f6",
    "name": "Q4 Transactions",
    "content_type": "text/csv",
    "size_bytes": 245890,
    "checksum_sha256": "e3b0c44298fc...",
    "current_version": 1,
    "status": "ready",
    "tags": ["finance", "quarterly"],
    "created_at": "2026-02-01T00:00:00Z"
  }
}
```

### Supported Content Types

| Content Type | Extensions | Max Size |
|---|---|---|
| CSV | `.csv` | 50 MB |
| JSON | `.json` | 50 MB |
| Plain text | `.txt` | 50 MB |
| PDF | `.pdf` | 50 MB |
| Excel | `.xlsx`, `.xls` | 50 MB |
| Images | `.png`, `.jpg`, `.jpeg` | 50 MB |

### Deduplication

If you upload a file with the same SHA-256 checksum as an existing document in your org, Hyphen returns the existing `document_id` instead of creating a duplicate. This is transparent — you still get a 201 with the existing document, and a `deduplicated: true` flag in the response indicates what happened.
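Since deduplication keys on the file's SHA-256 digest, you can compute the same checksum client-side before uploading to predict whether a new document will be created. A minimal sketch in Python, streaming the file so large uploads never sit fully in memory:

```python
import hashlib

def sha256_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

If the result matches an existing document's `checksum_sha256`, the upload will resolve to that document.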

### Upload from URL

For files already hosted elsewhere:

```bash
curl -X POST https://apisvr.tryhyphen.com/documents/from-url \
  -H "X-Org-Id: acme-corp" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://storage.example.com/reports/q4.csv",
    "name": "Q4 Report",
    "tags": ["finance"]
  }'
```

Hyphen fetches the file, validates it, and stores it internally. The external URL is not accessed at workflow execution time.


## Referencing Documents in Workflows

Use the `doc:` prefix anywhere you would normally pass inline data:

```json
{
  "type": "matcher",
  "properties": {
    "left": "doc:doc_a1b2c3d4e5f6",
    "right": "doc:doc_x7y8z9w0v1u2",
    "matchOn": ["invoice_id"],
    "tolerance": 0.02,
    "outputMatched": "matched",
    "outputUnmatchedLeft": "unmatched_invoices"
  }
}
```

At execution time, the engine resolves doc: references by streaming the file from managed storage and parsing it based on content type.
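A sketch of what that resolution step might look like, assuming a hypothetical `fetch` callable that loads and parses a document by id (the engine's actual implementation is not documented here):

```python
from typing import Any, Callable

DOC_PREFIX = "doc:"

def resolve_doc_refs(value: Any, fetch: Callable[[str], Any]) -> Any:
    """Recursively replace 'doc:<id>' strings with fetched, parsed content."""
    if isinstance(value, str) and value.startswith(DOC_PREFIX):
        return fetch(value[len(DOC_PREFIX):])
    if isinstance(value, dict):
        return {k: resolve_doc_refs(v, fetch) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_doc_refs(v, fetch) for v in value]
    return value
```

Applied to the matcher properties above, `left` and `right` would resolve to parsed datasets while scalar properties like `tolerance` pass through unchanged.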

### Resolution by Content Type

| Content Type | Resolves To | Example Use |
|---|---|---|
| `text/csv` | `Array<Object>` — header row becomes keys | Matcher `left`/`right`, Loop `items_path` |
| `application/json` | Object or Array as-is | Any property expecting structured data |
| `text/plain` | String | LLM template input, text analysis |
| `application/pdf` | Binary handle: `{ doc_id, content_type, storage_key, size_bytes }` | Agent tool input for document processing |
| `image/*` | Binary handle | Agent tool input for image analysis |
| Excel (`.xlsx`) | `Array<Object>` — first sheet, header row becomes keys | Same as CSV |

CSV parsing is strict: UTF-8 only, comma-delimited, first row as the header, and consistent column counts across rows. Malformed files fail at upload time with diagnostic errors (e.g., "Row 47 has 6 columns, expected 5 based on header row"). Fix the file and re-upload.
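The same strict rules can be mirrored client-side to catch malformed files before upload. A sketch, with the error message format modeled on the example above:

```python
import csv
import io

def parse_strict_csv(text: str) -> list[dict[str, str]]:
    """Parse CSV under the documented rules: first row is the header,
    every data row must match the header's column count."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        raise ValueError("Empty CSV")
    header = rows[0]
    records = []
    # Row numbers count from the top of the file, header included.
    for i, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            raise ValueError(
                f"Row {i} has {len(row)} columns, "
                f"expected {len(header)} based on header row"
            )
        records.append(dict(zip(header, row)))
    return records
```

This also shows the resolved shape: each row becomes an object keyed by the header, matching the `Array<Object>` form in the table above.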

### Using `doc:` with Loop
```json
{
  "type": "loop",
  "properties": {
    "mode": "foreach",
    "items_path": "doc:doc_customer_list",
    "item_variable_name": "customer",
    "actions_to_execute": [
      { "type": "send_welcome_email", "properties": { "email": "@customer.email" } }
    ]
  }
}
```

The CSV resolves to an array; each row becomes `@customer` inside the loop.
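A sketch of how an `@path` reference such as `@customer.email` could resolve against the per-iteration context (the exact resolution rules are assumptions based on the example above):

```python
def resolve_path(ref: str, context: dict) -> object:
    """Resolve an '@a.b.c' reference against a context dict.
    Values without the '@' prefix are treated as literals."""
    if not ref.startswith("@"):
        return ref
    value = context
    for part in ref[1:].split("."):
        value = value[part]
    return value
```

For each CSV row, the engine would bind the row to `customer` and resolve `@customer.email` before running the inner action.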


## Versioning

Upload a new version of an existing document without changing the document_id:

```bash
curl -X POST https://apisvr.tryhyphen.com/documents/doc_a1b2c3d4e5f6/versions \
  -H "X-Org-Id: acme-corp" \
  -F "file=@transactions_q4_updated.csv" \
  -F 'change_note=Updated with December corrections'
```

The `current_version` increments automatically. Workflows referencing `doc:doc_a1b2c3d4e5f6` always get the latest version.

### Pinning to a Version

For audit reproducibility, pin a specific version:

```json
{
  "left": "doc:doc_a1b2c3d4e5f6@2"
}
```

The `@N` suffix locks the reference to version N. This matters for regulated workflows where you need to prove exactly which dataset was used in a given run.

### Reading Version History

Use the document audit endpoint for version history:

```bash
curl https://apisvr.tryhyphen.com/documents/doc_a1b2c3d4e5f6/audit \
  -H "X-Org-Id: acme-corp"
```

The audit stream includes version_created entries with version metadata.


## Webhook Triggers

Register webhooks to trigger your own automation service when documents are uploaded:

```json
{
  "event": "document.uploaded",
  "url": "https://automation.example.com/hyphen/hooks/document-uploaded",
  "filter": {
    "tags": ["invoices"],
    "content_type": "text/csv"
  },
  "auto_payload": {
    "workflow_id": "wfl-123e4567-e89b-12d3-a456-426614174000"
  }
}
```
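A sketch of the filter semantics, assuming every listed tag must be present and `content_type` must match exactly (these matching rules are assumptions, not documented behavior):

```python
def matches_filter(document: dict, flt: dict) -> bool:
    """Return True if a document satisfies a webhook filter."""
    wanted_tags = set(flt.get("tags", []))
    if not wanted_tags.issubset(document.get("tags", [])):
        return False
    content_type = flt.get("content_type")
    if content_type is not None and document.get("content_type") != content_type:
        return False
    return True
```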

Your automation service can then call `POST /workflows/:id/execute` with:

```json
{
  "input": {
    "invoices": "doc:<document_id>"
  }
}
```

This enables the upload-to-execution pattern: data providers drop files, workflows run automatically. See Webhooks for full configuration.
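A sketch of the glue your automation service might implement, turning a `document.uploaded` delivery into an execute call (the exact webhook payload shape is an assumption based on the examples above):

```python
def build_execute_request(event: dict) -> tuple[str, dict]:
    """Map a document.uploaded webhook payload to a workflow execute call.
    Returns the request path and JSON body."""
    workflow_id = event["auto_payload"]["workflow_id"]
    doc_id = event["document"]["id"]
    path = f"/workflows/{workflow_id}/execute"
    body = {"input": {"invoices": f"doc:{doc_id}"}}
    return path, body
```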


## Audit Trail

Every document action is tracked:

| Event | When |
|---|---|
| `uploaded` | New document created |
| `version_created` | New version uploaded |
| `downloaded` | File content retrieved |
| `metadata_updated` | Metadata changed |
| `deleted` | Document soft-deleted |

```bash
curl https://apisvr.tryhyphen.com/documents/doc_a1b2c3d4e5f6/audit?limit=50 \
  -H "X-Org-Id: acme-corp"
```

The `document.referenced` event links the document to a specific `run_id`, creating a bidirectional audit trail: from the document, you can see which workflow runs used it; from a workflow run, you can see which documents it consumed.


## Storage Limits

| Limit | Default |
|---|---|
| Storage per organization | 10 GB |
| Documents per organization | 10,000 |
| Maximum file size | 50 MB |
| Maximum versions per document | 100 |

Check current usage:

```bash
curl https://apisvr.tryhyphen.com/documents/storage-usage \
  -H "X-Org-Id: acme-corp"
```
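A client-side pre-flight check against the default limits can avoid failed uploads. A sketch using the defaults from the table above (`used_bytes` and `doc_count` would come from the storage-usage response; its field names are not shown here, so they are passed in explicitly):

```python
MAX_FILE_BYTES = 50 * 1024 * 1024        # 50 MB per file
MAX_ORG_BYTES = 10 * 1024 * 1024 * 1024  # 10 GB per organization
MAX_DOCS = 10_000                        # documents per organization

def can_upload(file_size: int, used_bytes: int, doc_count: int) -> bool:
    """Check a prospective upload against the documented default limits."""
    return (
        file_size <= MAX_FILE_BYTES
        and used_bytes + file_size <= MAX_ORG_BYTES
        and doc_count < MAX_DOCS
    )
```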

## Common Patterns

**Reconciliation with uploaded datasets.** Upload invoices and payments as CSVs → reference both in a matcher step → agent investigates exceptions → results logged to custom table.

**Agent document processing.** Upload a PDF → agent receives a binary handle → agent uses an LLM action to extract text → agent reasons about the content → routes to appropriate workflow.

**Scheduled reconciliation with versioned data.** Upload new data monthly → version the same document → scheduled workflow always runs against the latest version. Pin previous versions for historical comparisons.

**Webhook-driven ingestion.** External system drops a CSV → document.uploaded webhook fires → agent classifies the document → triggers the correct processing workflow. This is the Agent as Trigger pattern with document storage.

Document storage is infrastructure, not a primitive. It enhances existing primitives — matcher can consume `doc:` references for its datasets, loop can iterate over uploaded CSVs, and agents can process uploaded PDFs. Documents plug into the same `@path` context resolution system used everywhere else.

→ Next: [Conditional Logic](/platform/conditional-logic)