---
title: Dragon-3B Inference API
emoji: πŸ‰
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

πŸ‰ Dragon-3B Inference API

FastAPI REST server for Dragon-3B-Base-alpha (Gated DeltaNet architecture) optimized for HuggingFace Spaces with T4 GPU.

## πŸš€ Features

- **Clean Architecture**: Modular codebase with an `app/` structure
- **T4 GPU Optimized**: Configured for HF Spaces T4 GPU (25-35 tok/s base, 80-100 tok/s with flash-linear-attention)
- **FastAPI REST API**: Interactive docs at `/docs`
- **Health Monitoring**: `/health` and `/info` endpoints
- **Automatic Optimizations**: Detects and uses flash-linear-attention if available
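The last feature above, automatic detection of flash-linear-attention, typically comes down to an import-availability check. A minimal sketch of that pattern (an assumption about the implementation, not the Space's exact code; the package installs under the module name `fla`):

```python
import importlib.util

# Probe for the optional flash-linear-attention package (module name `fla`)
# without importing it; find_spec returns None when it is not installed.
HAS_FLA = importlib.util.find_spec("fla") is not None

print(f"flash-linear-attention available: {HAS_FLA}")
```

The server can then branch on `HAS_FLA` to pick the faster kernels when present and fall back to the base path otherwise.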

## πŸ”§ Hardware Requirements

**Required:** T4 GPU (or better)

To configure GPU in your Space:

  1. Go to Space Settings
  2. Select "T4 small" or better
  3. Save and rebuild

## πŸ“‘ API Endpoints

### `GET /`

Basic API information

### `GET /health`

Health check with model status

### `GET /info`

Detailed model and system information

### `POST /generate`

Generate text from a prompt

```json
{
  "prompt": "The future of AI is",
  "max_new_tokens": 150,
  "temperature": 0.7,
  "top_p": 0.9
}
```
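A request like the one above can be sent with any HTTP client. Below is a minimal Python sketch using only the standard library; the Space URL is a placeholder you must replace with your own, and the response fields depend on the server's actual schema:

```python
import json
from urllib import request

# Placeholder -- replace with your Space's public URL.
SPACE_URL = "https://your-space.hf.space"

def build_payload(prompt, max_new_tokens=150, temperature=0.7, top_p=0.9):
    """Serialize the JSON body expected by POST /generate."""
    return json.dumps({
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }).encode("utf-8")

def generate(prompt, **kwargs):
    """Send a generation request and return the decoded JSON response."""
    req = request.Request(
        f"{SPACE_URL}/generate",
        data=build_payload(prompt, **kwargs),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    print(generate("The future of AI is"))
```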

## πŸ”‘ Configuration

Set in Space Settings β†’ Repository secrets:

- `HF_TOKEN` - Your HuggingFace token (required to download the model)
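A repository secret set this way reaches the running Space as an environment variable. A minimal sketch of how the server side might read it (an assumed pattern, not the Space's exact code):

```python
import os

# Repository secrets are exposed to the container as environment variables.
HF_TOKEN = os.environ.get("HF_TOKEN")

if HF_TOKEN is None:
    # Without a token, downloading a gated or private model will fail.
    print("HF_TOKEN is not set -- the model download may be rejected")
```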

## πŸ“Š Performance

| Hardware | Speed (tokens/sec) | Notes |
|----------|--------------------|-------|
| T4 GPU   | 25-35              | Base (without flash-linear-attention) |
| T4 GPU   | 80-100             | With flash-linear-attention |
| L4 GPU   | 35-45              | Base (without flash-linear-attention) |
| L4 GPU   | 100-120            | With flash-linear-attention |

πŸ› οΈ Development

This Space uses:

- PyTorch >= 2.0
- `transformers` >= 4.57
- FastAPI + Uvicorn
- Optional: flash-linear-attention (3-4x speedup)

πŸ“ Notes

- The first request is slow because the model loads on demand (~30-60 s)
- Subsequent requests are fast
- The Space sleeps after 48 h of inactivity (free tier)
- The build compiles flash-linear-attention (slower build, faster inference)

## πŸ”— Links