---
title: Dragon-3B Inference API
emoji: 🐉
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---
# Dragon-3B Inference API

A FastAPI REST server for Dragon-3B-Base-alpha (Gated DeltaNet architecture), optimized for HuggingFace Spaces with a T4 GPU.
## Features
- **Clean Architecture**: Modular codebase with an `app/` structure
- **T4 GPU Optimized**: Configured for HF Spaces T4 GPU (25-35 tok/s base, 80-100 tok/s with flash-linear-attention)
- **FastAPI REST API**: Interactive docs at `/docs`
- **Health Monitoring**: `/health` and `/info` endpoints
- **Automatic Optimizations**: Detects and uses flash-linear-attention if available
## Hardware Requirements
**Required:** T4 GPU (or better)
To configure a GPU for your Space:

1. Go to Space Settings
2. Select "T4 small" or better
3. Save and rebuild
## API Endpoints
### `GET /`

Basic API information.

### `GET /health`

Health check with model status.

### `GET /info`

Detailed model and system information.
### `POST /generate`

Generate text from a prompt:

```json
{
  "prompt": "The future of AI is",
  "max_new_tokens": 150,
  "temperature": 0.7,
  "top_p": 0.9
}
```
## Configuration
Set in Space Settings → Repository secrets:

- `HF_TOKEN` - Your HuggingFace token (required to download the model)
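
For illustration, a minimal sketch of how the token would typically be read inside the app and passed to the model download; the repo id is a placeholder, and the actual loading code in `app/` may differ:

```python
# Sketch: read the HF_TOKEN secret injected by the Space and pass it to the download.
# The repo id below is a placeholder; the real loading code lives in app/.
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-org/Dragon-3B-Base-alpha"  # placeholder repo id
hf_token = os.environ.get("HF_TOKEN")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=hf_token)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, token=hf_token)
```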
## Performance
| Hardware | Speed (tokens/sec) | Notes |
|---|---|---|
| T4 GPU | 25-35 | Base (without flash-attn) |
| T4 GPU | 80-100 | With flash-linear-attention |
| L4 GPU | 35-45 | Base |
| L4 GPU | 100-120 | With flash-linear-attention |
## Development
This Space uses:
- PyTorch >= 2.0
- transformers >= 4.57
- FastAPI + Uvicorn
- Optional: flash-linear-attention (3-4x speedup; see the detection sketch below)
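
A minimal sketch of how the optional dependency can be detected at runtime; the import name `fla` is an assumption, and the Space's own detection logic may differ:

```python
# Sketch: check whether the optional flash-linear-attention package is installed.
# The package's import name is assumed to be "fla"; the Space's own check may differ.
import importlib.util

HAS_FLA = importlib.util.find_spec("fla") is not None
print(f"flash-linear-attention available: {HAS_FLA}")
```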
## Notes
- First request is slow (model loading takes ~30-60s); see the readiness-polling sketch after this list
- Subsequent requests are fast
- Space sleeps after 48h inactivity (free tier)
- Includes flash-linear-attention compilation (slower build, faster inference)
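
Because of the slow first load, a client can poll `/health` until the Space responds and the reported model status looks ready. A sketch, assuming a placeholder Space URL and leaving the health payload's exact fields to the server:

```python
# Sketch: poll /health until the Space responds, then inspect the reported model status.
# The Space URL is a placeholder; the fields of the health payload depend on the server.
import time
import requests

SPACE_URL = "https://YOUR-USERNAME-dragon-3b-inference.hf.space"  # placeholder

for attempt in range(30):  # up to ~5 minutes of waiting
    try:
        health = requests.get(f"{SPACE_URL}/health", timeout=10)
        health.raise_for_status()
        print(health.json())  # contains the model status per the /health endpoint above
        break
    except requests.RequestException:
        time.sleep(10)  # the Space may still be building, loading the model, or waking up
```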