---
title: Dragon-3B Inference API
emoji: πŸ‰
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
---

πŸ‰ Dragon-3B Inference API

FastAPI REST server for Dragon-3B-Base-alpha (Gated DeltaNet architecture) optimized for HuggingFace Spaces with T4 GPU.

## πŸš€ Features

- **Clean Architecture**: Modular codebase with an `app/` structure
- **T4 GPU Optimized**: Configured for HF Spaces T4 GPU (25-35 tok/s base, 80-100 tok/s with flash-linear-attention)
- **FastAPI REST API**: Interactive docs at `/docs`
- **Health Monitoring**: `/health` and `/info` endpoints
- **Automatic Optimizations**: Detects and uses flash-linear-attention if available
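The last feature above, automatic detection of flash-linear-attention, typically comes down to an import-availability check. A minimal sketch of that pattern (an assumption about the implementation, not the Space's exact code; the package installs under the module name `fla`):

```python
import importlib.util

# Probe for the optional flash-linear-attention package (module name `fla`)
# without importing it; find_spec returns None when it is not installed.
HAS_FLA = importlib.util.find_spec("fla") is not None

print(f"flash-linear-attention available: {HAS_FLA}")
```

The server can then branch on `HAS_FLA` to pick the faster kernels when present and fall back to the base path otherwise.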

## πŸ”§ Hardware Requirements

**Required:** T4 GPU (or better)

To configure GPU in your Space:

  1. Go to Space Settings
  2. Select "T4 small" or better
  3. Save and rebuild

## πŸ“‘ API Endpoints

### `GET /`

Basic API information

### `GET /health`

Health check with model status

### `GET /info`

Detailed model and system information

### `POST /generate`

Generate text from a prompt

```json
{
  "prompt": "The future of AI is",
  "max_new_tokens": 150,
  "temperature": 0.7,
  "top_p": 0.9
}
```
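A request like the one above can be sent with any HTTP client. Below is a minimal Python sketch using only the standard library; the Space URL is a placeholder you must replace with your own, and the response fields depend on the server's actual schema:

```python
import json
from urllib import request

# Placeholder -- replace with your Space's public URL.
SPACE_URL = "https://your-space.hf.space"

def build_payload(prompt, max_new_tokens=150, temperature=0.7, top_p=0.9):
    """Serialize the JSON body expected by POST /generate."""
    return json.dumps({
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }).encode("utf-8")

def generate(prompt, **kwargs):
    """Send a generation request and return the decoded JSON response."""
    req = request.Request(
        f"{SPACE_URL}/generate",
        data=build_payload(prompt, **kwargs),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    print(generate("The future of AI is"))
```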

## πŸ”‘ Configuration

Set in Space Settings β†’ Repository secrets:

- `HF_TOKEN` - Your HuggingFace token (required to download the model)
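A repository secret set this way reaches the running Space as an environment variable. A minimal sketch of how the server side might read it (an assumed pattern, not the Space's exact code):

```python
import os

# Repository secrets are exposed to the container as environment variables.
HF_TOKEN = os.environ.get("HF_TOKEN")

if HF_TOKEN is None:
    # Without a token, downloading a gated or private model will fail.
    print("HF_TOKEN is not set -- the model download may be rejected")
```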

## πŸ“Š Performance

| Hardware | Speed (tokens/sec) | Notes |
|----------|--------------------|-------|
| T4 GPU   | 25-35              | Base (without flash-linear-attention) |
| T4 GPU   | 80-100             | With flash-linear-attention |
| L4 GPU   | 35-45              | Base (without flash-linear-attention) |
| L4 GPU   | 100-120            | With flash-linear-attention |

πŸ› οΈ Development

This Space uses:

- PyTorch >= 2.0
- `transformers` >= 4.57
- FastAPI + Uvicorn
- Optional: flash-linear-attention (3-4x speedup)

πŸ“ Notes

- The first request is slow because the model loads on demand (~30-60 s)
- Subsequent requests are fast
- The Space sleeps after 48 h of inactivity (free tier)
- The build compiles flash-linear-attention (slower build, faster inference)

## πŸ”— Links