# Tech Context
## Technology Stack
### Core Framework
- **FastAPI**: Modern, high-performance web framework
- **Uvicorn**: ASGI server for running FastAPI
- **Python 3.8+**: Required for type hints and async features
### AI/ML Libraries
- **Transformers**: Hugging Face library for model loading
- **PyTorch**: Backend for transformers
- **Accelerate**: Model optimization and distribution
- **HuggingFace Hub**: Model downloading and authentication
### Utilities
- **Pydantic**: Data validation and settings management
- **python-dotenv**: Environment variable management
- **python-multipart**: Form data handling
## Dependencies (requirements.txt)
```
fastapi
uvicorn[standard]
transformers
huggingface_hub
torch
accelerate
python-multipart
python-dotenv
```
## Configuration
### Environment Variables
```bash
# .env file
DEFAULT_MODEL_NAME="unsloth/functiongemma-270m-it"
HUGGINGFACE_TOKEN="hf_xxx" # Optional, for gated models
```
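At startup these variables are read via `python-dotenv` and then accessed like ordinary environment variables. A minimal sketch (the fallback model name here simply mirrors the `.env` example above):

```python
import os

# In app.py, python-dotenv's load_dotenv() is called first so the .env file
# is merged into the process environment; after that, plain os.getenv works.
model_name = os.getenv("DEFAULT_MODEL_NAME", "unsloth/functiongemma-270m-it")
hf_token = os.getenv("HUGGINGFACE_TOKEN")  # None unless set (only needed for gated models)
```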
### Model Cache
- **Location**: `./my_model_cache`
- **Structure**: Hugging Face cache format
- **Management**: Automatic via transformers library
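The Hugging Face cache format stores each repo under a `models--{org}--{name}/snapshots/` directory. A hedged helper for checking whether a model is already on disk (`is_cached` is illustrative, not part of this codebase):

```python
from pathlib import Path

def is_cached(model_id: str, cache_dir: str = "./my_model_cache") -> bool:
    # HF cache layout: <cache_dir>/models--{org}--{name}/snapshots/<revision>/
    snapshots = Path(cache_dir) / f"models--{model_id.replace('/', '--')}" / "snapshots"
    # True only if at least one snapshot revision exists
    return snapshots.is_dir() and any(snapshots.iterdir())
```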
## API Endpoints
### 1. GET /
**Purpose**: Health check and welcome message
**Response**:
```json
{"message": "Welcome to HF-Model-Runner API! Visit /docs for API documentation."}
```
### 2. POST /download
**Purpose**: Download and initialize a model
**Request**:
```json
{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
```
**Response**:
```json
{
  "status": "success",
  "message": "Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 downloaded successfully",
  "loaded": true,
  "current_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}
### 3. POST /v1/chat/completions
**Purpose**: OpenAI-compatible chat completion
**Request**:
```json
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 500,
  "temperature": 1.0
}
```
**Response**:
```json
{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}
```
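The envelope above can be assembled from plain dictionaries. A minimal sketch (the helper name `make_chat_response` is illustrative; the real code builds Pydantic `ChatResponse` models instead):

```python
import time

def make_chat_response(content: str, prompt_tokens: int,
                       completion_tokens: int, model: str) -> dict:
    # Field names and shape mirror the OpenAI-compatible response example above
    now = int(time.time())
    return {
        "id": f"chatcmpl-{now}",
        "object": "chat.completion",
        "created": now,
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```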
## Module Structure
### app.py (Main Application)
```python
# Global state
model_name = None
pipe = None
tokenizer = None

# Startup event
@app.on_event("startup")
async def startup_event():
    load_dotenv()
    default_model = os.getenv("DEFAULT_MODEL_NAME", "fallback")
    # Initialize pipeline

# Routes: GET /, POST /download, POST /v1/chat/completions
```
### utils/model.py (Model Management)
```python
class DownloadRequest(BaseModel):
    model: str

def check_model(model_name) -> tuple: ...
def download_model(model_name) -> tuple: ...
def initialize_pipeline(model_name) -> tuple: ...
```
### utils/chat_request.py (Request Validation)
```python
class ChatRequest(BaseModel):
    model: Optional[str]
    messages: List[Dict[str, Any]]
    max_tokens: Optional[int]
    temperature: Optional[float]
    # ... other fields
```
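A runnable version of this model might look as follows; the defaults of 500 tokens and temperature 1.0 are assumptions taken from the request example above, not confirmed project defaults:

```python
from typing import Any, Dict, List, Optional
from pydantic import BaseModel

class ChatRequest(BaseModel):
    model: Optional[str] = None        # falls back to the currently loaded model
    messages: List[Dict[str, Any]]     # OpenAI-style role/content pairs
    max_tokens: Optional[int] = 500    # assumed default
    temperature: Optional[float] = 1.0 # assumed default

# Only messages is required; everything else has a default
req = ChatRequest(messages=[{"role": "user", "content": "Hello"}])
```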
### utils/chat_response.py (Response Generation)
```python
class ChatResponse(BaseModel): ...
class ChatChoice(BaseModel): ...
class ChatUsage(BaseModel): ...

def convert_json_format(input_data) -> dict: ...
def create_chat_response(request, pipe, tokenizer) -> ChatResponse: ...
```
## Deployment
### Hugging Face Spaces
- **SDK**: Docker
- **Port**: 7860 (standard for HF Spaces)
- **Requirements**: All dependencies in requirements.txt
- **Environment**: .env file for configuration
### Local Development
```bash
# Install dependencies
pip install -r requirements.txt
# Run server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# Access
http://localhost:7860
http://localhost:7860/docs
```
## Error Handling
### Common Errors
1. **Model Not Found**: HTTP 404 from check_model()
2. **Download Failed**: HTTP 500 with error message
3. **Initialization Failed**: HTTP 500 detail
4. **Pipeline Error**: Exception in create_chat_response()
### Logging
- Startup: Model initialization status
- Download: Progress and success/failure
- Chat: Token counts and errors
## Performance Considerations
### Memory
- Single model loaded at a time
- Tokenizer cached
- Pipeline reused across requests
### Latency
- Startup: One-time initialization cost
- Chat: Inference time (depends on model size)
- Download: Network + disk I/O
### Scalability
- Single model per instance
- Stateless API routes
- Async handlers for concurrency