Tech Context
Technology Stack
Core Framework
- FastAPI: Modern, high-performance web framework
- Uvicorn: ASGI server for running FastAPI
- Python 3.8+: Required for type hints and async features
AI/ML Libraries
- Transformers: Hugging Face library for model loading
- PyTorch: Backend for transformers
- Accelerate: Model optimization and distribution
- HuggingFace Hub: Model downloading and authentication
Utilities
- Pydantic: Data validation and settings management
- python-dotenv: Environment variable management
- python-multipart: Form data handling
Dependencies (requirements.txt)
fastapi
uvicorn[standard]
transformers
huggingface_hub
torch
accelerate
python-multipart
python-dotenv
Configuration
Environment Variables
# .env file
DEFAULT_MODEL_NAME="unsloth/functiongemma-270m-it"
HUGGINGFACE_TOKEN="hf_xxx" # Optional, for gated models
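In practice `load_dotenv()` handles this at startup; as a minimal stdlib sketch of what that call does (read `KEY="value"` pairs from a file and export them, without overriding variables already set in the environment):

```python
import os

def load_env_file(path=".env"):
    """Minimal sketch of python-dotenv's load_dotenv(): parse KEY="value"
    lines and export them into os.environ (existing values win)."""
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and whitespace
            if "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

# After loading, fall back to a default when the variable is unset:
# load_env_file()
# model = os.getenv("DEFAULT_MODEL_NAME", "fallback-model")
```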
Model Cache
- Location: ./my_model_cache
- Structure: Hugging Face cache format
- Management: automatic via the transformers library
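One way to pin the cache location (a sketch, assuming the env-var approach; `HF_HOME` is read by huggingface_hub/transformers at import time, and `from_pretrained(..., cache_dir=...)` works per call instead):

```python
import os

# Must run before transformers/huggingface_hub are imported, since the
# cache root is resolved at import time.
os.environ.setdefault("HF_HOME", "./my_model_cache")
```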
API Endpoints
1. GET /
Purpose: Health check and welcome message
Response:
{"message": "Welcome to HF-Model-Runner API! Visit /docs for API documentation."}
2. POST /download
Purpose: Download and initialize a model
Request:
{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
Response:
{
  "status": "success",
  "message": "Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 downloaded successfully",
  "loaded": true,
  "current_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}
3. POST /v1/chat/completions
Purpose: OpenAI-compatible chat completion
Request:
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 500,
  "temperature": 1.0
}
Response:
{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}
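Assembling this payload is plain dict construction; a sketch (field names taken from the response above, token counts supplied by the caller):

```python
import time

def build_chat_completion(model, reply_text, prompt_tokens, completion_tokens):
    """Assemble an OpenAI-style chat.completion payload."""
    created = int(time.time())
    return {
        "id": f"chatcmpl-{created}",
        "object": "chat.completion",
        "created": created,
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply_text},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }

payload = build_chat_completion("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                                "Hello! How can I help you?", 10, 8)
```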
Module Structure
app.py (Main Application)
# Global state
model_name = None
pipe = None
tokenizer = None

# Startup event
@app.on_event("startup")
async def startup_event():
    load_dotenv()
    default_model = os.getenv("DEFAULT_MODEL_NAME", "fallback")
    # Initialize pipeline for the default model

# Routes: GET /, POST /download, POST /v1/chat/completions
utils/model.py (Model Management)
class DownloadRequest(BaseModel):
    model: str

def check_model(model_name) -> tuple
def download_model(model_name) -> tuple
def initialize_pipeline(model_name) -> tuple
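The `-> tuple` signatures follow an `(ok, message)` convention; a sketch of `check_model` under that convention (the real check presumably queries the Hugging Face Hub — here only the `org/name` repo-id format is validated so the example runs offline):

```python
import re

def check_model(model_name):
    """Return (ok, message); offline stand-in for the Hub lookup."""
    if re.fullmatch(r"[\w.-]+/[\w.-]+", model_name):
        return True, f"{model_name} looks like a valid repo id"
    return False, f"{model_name} is not a valid 'org/name' repo id"
```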
utils/chat_request.py (Request Validation)
class ChatRequest(BaseModel):
    model: Optional[str]
    messages: List[Dict[str, Any]]
    max_tokens: Optional[int]
    temperature: Optional[float]
    # ... other fields
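A runnable sketch of this model with defaults filled in (the `500` and `1.0` defaults are assumptions taken from the request example above, not confirmed from the source):

```python
from typing import Any, Dict, List, Optional
from pydantic import BaseModel

class ChatRequest(BaseModel):
    # model is optional: when omitted, the server uses the currently loaded model
    model: Optional[str] = None
    messages: List[Dict[str, Any]]
    max_tokens: Optional[int] = 500       # assumed default
    temperature: Optional[float] = 1.0    # assumed default

# Only messages is required; everything else falls back to defaults.
req = ChatRequest(messages=[{"role": "user", "content": "Hello"}])
```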
utils/chat_response.py (Response Generation)
class ChatResponse(BaseModel): ...
class ChatChoice(BaseModel): ...
class ChatUsage(BaseModel): ...
def convert_json_format(input_data) -> dict
def create_chat_response(request, pipe, tokenizer) -> ChatResponse
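One piece `create_chat_response` must produce is the `usage` block; a sketch of that counting step (the whitespace tokenizer is a stand-in — the real counts come from the Hugging Face tokenizer loaded at startup):

```python
def count_usage(prompt, completion, tokenizer):
    """Fill the usage block from token counts. `tokenizer` is anything
    with an encode() that returns a sequence of tokens."""
    prompt_tokens = len(tokenizer.encode(prompt))
    completion_tokens = len(tokenizer.encode(completion))
    return {"prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens}

class WhitespaceTokenizer:
    # Stand-in for the model tokenizer, for illustration only.
    def encode(self, text):
        return text.split()

usage = count_usage("Hello there", "Hi! How can I help you?", WhitespaceTokenizer())
```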
Deployment
Hugging Face Spaces
- SDK: Docker
- Port: 7860 (standard for HF Spaces)
- Requirements: All dependencies in requirements.txt
- Environment: .env file for configuration
Local Development
# Install dependencies
pip install -r requirements.txt
# Run server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# Access
http://localhost:7860
http://localhost:7860/docs
Error Handling
Common Errors
- Model Not Found: HTTP 404 from check_model()
- Download Failed: HTTP 500 with error message
- Initialization Failed: HTTP 500 with error detail
- Pipeline Error: Exception in create_chat_response()
Logging
- Startup: Model initialization status
- Download: Progress and success/failure
- Chat: Token counts and errors
Performance Considerations
Memory
- Single model loaded at a time
- Tokenizer cached
- Pipeline reused across requests
Latency
- Startup: One-time initialization cost
- Chat: Inference time (depends on model size)
- Download: Network + disk I/O
Scalability
- Single model per instance
- Stateless API routes
- Async handlers for concurrency
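Since transformers inference is blocking, async handlers only help concurrency if the blocking call is offloaded; a minimal sketch of that pattern (Python 3.9+ for `asyncio.to_thread`; `run_inference` stands in for the pipeline call):

```python
import asyncio
import time

def run_inference(prompt):
    # Stand-in for a blocking transformers pipeline call.
    time.sleep(0.01)
    return f"echo: {prompt}"

async def chat_handler(prompt):
    # Offload the blocking call so the event loop keeps serving other requests.
    return await asyncio.to_thread(run_inference, prompt)

result = asyncio.run(chat_handler("Hello"))
```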