
Hugging Face Deployment

Primary target: Hugging Face Docker Space on upgraded GPU hardware.

What Gets Deployed

  • FastAPI backend
  • static web frontend
  • Hugging Face OAuth routes
  • ephemeral SQLite-backed session queue

Required Space Settings

  • SDK: Docker
  • Port: 7860
  • OAuth: enabled via README metadata
  • Hardware: upgraded GPU recommended

Recommended Runtime Variables

Core:

  • MODEL_NAME=sapientinc/HRM-Text-1B
  • DEVICE_PREFERENCE=auto
  • DTYPE_PREFERENCE=auto
  • ATTN_IMPLEMENTATION=eager
  • LOW_CPU_MEM_USAGE=true
  • TRUST_REMOTE_CODE=true
  • PRELOAD_MODEL=true
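
As a rough illustration of how a backend might consume the core variables above, here is a minimal sketch that reads them from the environment into a typed settings object. The `CoreSettings` class and the `_env_bool` and `load_core_settings` helpers are hypothetical names chosen for this example, not the project's actual code; only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Interpret common truthy strings; fall back to the default when unset."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class CoreSettings:
    # Hypothetical container; field names simply mirror the variable names.
    model_name: str
    device_preference: str
    dtype_preference: str
    attn_implementation: str
    low_cpu_mem_usage: bool
    trust_remote_code: bool
    preload_model: bool


def load_core_settings() -> CoreSettings:
    """Read the core variables, defaulting to the recommended values."""
    return CoreSettings(
        model_name=os.environ.get("MODEL_NAME", "sapientinc/HRM-Text-1B"),
        device_preference=os.environ.get("DEVICE_PREFERENCE", "auto"),
        dtype_preference=os.environ.get("DTYPE_PREFERENCE", "auto"),
        attn_implementation=os.environ.get("ATTN_IMPLEMENTATION", "eager"),
        low_cpu_mem_usage=_env_bool("LOW_CPU_MEM_USAGE", True),
        trust_remote_code=_env_bool("TRUST_REMOTE_CODE", True),
        preload_model=_env_bool("PRELOAD_MODEL", True),
    )
```

Calling `load_core_settings()` once at startup and passing the result around keeps environment parsing in one place, so a mistyped boolean is caught early instead of surfacing in the middle of a request.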

Traffic limits:

  • MAX_TRACE_TOKENS=256
  • MAX_SENTENCES=16
  • JOB_WORKERS=1
  • MAX_QUEUED_JOBS=8
  • MAX_ACTIVE_JOBS_PER_USER=2
  • REQUIRE_AUTH=true
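
The limits above could be enforced with an admission check along these lines. This is a sketch under stated assumptions: `TrafficLimits`, `admit_job`, and `clamp_trace_tokens` are hypothetical names, and whether the real backend rejects or clamps oversized requests is not stated in this document.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TrafficLimits:
    # Defaults mirror the recommended values listed above.
    max_trace_tokens: int = 256
    max_queued_jobs: int = 8
    max_active_jobs_per_user: int = 2
    require_auth: bool = True


def admit_job(
    limits: TrafficLimits,
    user_id: Optional[str],
    queued_jobs: int,
    active_jobs_for_user: int,
) -> Tuple[bool, str]:
    """Decide whether a new job may enter the queue; returns (accepted, reason)."""
    if limits.require_auth and user_id is None:
        return False, "sign-in required"
    if queued_jobs >= limits.max_queued_jobs:
        return False, "queue full"
    if active_jobs_for_user >= limits.max_active_jobs_per_user:
        return False, "too many active jobs for this user"
    return True, "ok"


def clamp_trace_tokens(limits: TrafficLimits, requested: int) -> int:
    """Cap a requested trace length at MAX_TRACE_TOKENS (assumed policy: clamp, never below 1)."""
    return max(1, min(requested, limits.max_trace_tokens))
```

Checking authentication first, then the global queue, then the per-user cap means the cheapest rejection reason is reported before any per-user bookkeeping is consulted.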

Deploy Flow

  1. Create new Hugging Face Space with Docker SDK.
  2. Push repo contents.
  3. Set runtime variables in Space settings.
  4. Upgrade Space hardware to a GPU tier.
  5. Wait for build.
  6. Verify:
    • GET /healthz
    • sign-in works
    • one short analysis completes
    • JSON / CSV export works
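
The first verification step can be scripted. The sketch below assumes that `/healthz` answers a plain GET with HTTP 200 when the app is healthy, which this document does not spell out; `http_status` and `check_health` are hypothetical helper names, and the base URL is a placeholder for your Space's address.

```python
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request to the given URL."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def check_health(base_url: str, fetch: Callable[[str], int] = http_status) -> bool:
    """Return True if GET <base_url>/healthz reports 200 (assumed healthy status)."""
    url = base_url.rstrip("/") + "/healthz"
    try:
        return fetch(url) == 200
    except OSError:
        # urllib raises URLError/HTTPError (both OSError subclasses) on
        # connection failures and on 4xx/5xx responses.
        return False


if __name__ == "__main__":
    # Placeholder address; substitute the URL of your own Space.
    print(check_health("https://your-space.example"))
```

The `fetch` parameter exists so the check can be exercised without network access. A private Space would also need an authorization header, which this sketch leaves out. The sign-in, analysis, and export checks still need a manual pass.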

Operational Notes

  • Local disk is ephemeral. The SQLite-backed session queue and session history disappear on restart.
  • OAuth helper is mocked in local runs; inside a Space it uses real Hugging Face OAuth.
  • Keep public defaults conservative. Long traces can OOM small GPUs.
  • If queue pressure grows, lower token caps before increasing worker count.

Common Failure Modes

  • attn_implementation is not eager (check ATTN_IMPLEMENTATION=eager):
    • attribution is disabled for the model
  • unsupported model layout:
    • generation may still work, but attribution fails early with a clear error
  • OOM:
    • reduce MAX_TRACE_TOKENS or MAX_SENTENCES, or choose a larger GPU
  • cold start slow:
    • keep PRELOAD_MODEL=true