# Tech Context

## Technology Stack

### Core Framework
- **FastAPI**: Modern, high-performance web framework
- **Uvicorn**: ASGI server for running FastAPI
- **Python 3.8+**: Required for type hints and async features

### AI/ML Libraries
- **Transformers**: Hugging Face library for model loading
- **PyTorch**: Backend for Transformers
- **Accelerate**: Model optimization and distribution
- **HuggingFace Hub**: Model downloading and authentication

### Utilities
- **Pydantic**: Data validation and settings management
- **python-dotenv**: Environment variable management
- **python-multipart**: Form data handling

## Dependencies (requirements.txt)

```
fastapi
uvicorn[standard]
transformers
huggingface_hub
torch
accelerate
python-multipart
python-dotenv
```

## Configuration

### Environment Variables

```bash
# .env file
DEFAULT_MODEL_NAME="unsloth/functiongemma-270m-it"
HUGGINGFACE_TOKEN="hf_xxx"  # Optional, for gated models
```

### Model Cache
- **Location**: `./my_model_cache`
- **Structure**: Hugging Face cache format
- **Management**: Automatic via the transformers library

## API Endpoints

### 1. GET /

**Purpose**: Health check and welcome message

**Response**:
```json
{"message": "Welcome to HF-Model-Runner API! Visit /docs for API documentation."}
```

### 2. POST /download

**Purpose**: Download and initialize a model

**Request**:
```json
{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
```

**Response**:
```json
{
  "status": "success",
  "message": "Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 downloaded successfully",
  "loaded": true,
  "current_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}
```
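The `/download` exchange above can be sketched from the client side. This is a minimal illustration using only the standard library; the `build_download_payload` and `parse_download_reply` helpers (and the hard-coded reply) are hypothetical, not part of the project, and only show the payload a client would serialize and the fields it can expect back:

```python
import json

def build_download_payload(model_id: str) -> str:
    """Serialize the request body for POST /download."""
    return json.dumps({"model": model_id})

def parse_download_reply(raw: str) -> bool:
    """Return True when the reply reports the model as loaded."""
    reply = json.loads(raw)
    return reply.get("status") == "success" and reply.get("loaded", False)

payload = build_download_payload("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Example reply mirroring the documented response shape
reply = json.dumps({
    "status": "success",
    "message": "Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 downloaded successfully",
    "loaded": True,
    "current_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
})
print(parse_download_reply(reply))  # True
```

In a real client the payload would be sent with any HTTP library (e.g. `requests.post(url, data=payload)`); only the JSON shapes here come from the endpoint documentation.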
### 3. POST /v1/chat/completions

**Purpose**: OpenAI-compatible chat completion

**Request**:
```json
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 500,
  "temperature": 1.0
}
```

**Response**:
```json
{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}
```

## Module Structure

### app.py (Main Application)

```python
# Global state
model_name = None
pipe = None
tokenizer = None

# Startup event
@app.on_event("startup")
async def startup_event():
    load_dotenv()
    default_model = os.getenv("DEFAULT_MODEL_NAME", "fallback")
    # Initialize pipeline

# Routes: GET /, POST /download, POST /v1/chat/completions
```

### utils/model.py (Model Management)

```python
class DownloadRequest(BaseModel):
    model: str

def check_model(model_name) -> tuple
def download_model(model_name) -> tuple
def initialize_pipeline(model_name) -> tuple
```

### utils/chat_request.py (Request Validation)

```python
class ChatRequest(BaseModel):
    model: Optional[str]
    messages: List[Dict[str, Any]]
    max_tokens: Optional[int]
    temperature: Optional[float]
    # ... other fields
```

### utils/chat_response.py (Response Generation)

```python
class ChatResponse(BaseModel): ...
class ChatChoice(BaseModel): ...
class ChatUsage(BaseModel): ...
```
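The field definitions of these three response models are elided above. As a rough sketch of the shapes implied by the documented `/v1/chat/completions` response — using plain dataclasses rather than the project's actual Pydantic models, with all field names inferred from that JSON example:

```python
import time
from dataclasses import dataclass, field, asdict

# Hypothetical stand-ins for the Pydantic models; field names are
# inferred from the documented /v1/chat/completions response body.
@dataclass
class ChatUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

@dataclass
class ChatChoice:
    index: int
    message: dict
    finish_reason: str = "stop"

@dataclass
class ChatResponse:
    id: str
    model: str
    choices: list
    usage: ChatUsage
    object: str = "chat.completion"
    created: int = field(default_factory=lambda: int(time.time()))

resp = ChatResponse(
    id="chatcmpl-1234567890",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    choices=[ChatChoice(0, {"role": "assistant", "content": "Hello!"})],
    usage=ChatUsage(prompt_tokens=10, completion_tokens=8, total_tokens=18),
)
print(asdict(resp)["usage"]["total_tokens"])  # 18
```

`asdict` flattens the nested dataclasses into the dict a JSON serializer would emit, matching the response example above.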
```python
def convert_json_format(input_data) -> dict
def create_chat_response(request, pipe, tokenizer) -> ChatResponse
```

## Deployment

### Hugging Face Spaces
- **SDK**: Docker
- **Port**: 7860 (standard for HF Spaces)
- **Requirements**: All dependencies in requirements.txt
- **Environment**: .env file for configuration

### Local Development

```bash
# Install dependencies
pip install -r requirements.txt

# Run server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Access
http://localhost:7860
http://localhost:7860/docs
```

## Error Handling

### Common Errors
1. **Model Not Found**: HTTP 404 from check_model()
2. **Download Failed**: HTTP 500 with error message
3. **Initialization Failed**: HTTP 500 with error detail
4. **Pipeline Error**: Exception raised in create_chat_response()

### Logging
- Startup: Model initialization status
- Download: Progress and success/failure
- Chat: Token counts and errors

## Performance Considerations

### Memory
- Single model loaded at a time
- Tokenizer cached
- Pipeline reused across requests

### Latency
- Startup: One-time initialization cost
- Chat: Inference time (depends on model size)
- Download: Network + disk I/O

### Scalability
- Single model per instance
- Stateless API routes
- Async handlers for concurrency
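The single-model-per-instance constraint above amounts to replacing one set of globals (`model_name`, `pipe`, `tokenizer`) on each successful download. A hedged sketch of that policy — class and method names here are illustrative, not the app's actual code:

```python
# Illustrative sketch of the "single model loaded at a time" policy:
# a successful download replaces the previous model's state wholesale,
# so at most one pipeline occupies memory at any time.
class ModelState:
    def __init__(self):
        self.model_name = None
        self.pipe = None
        self.tokenizer = None

    def swap(self, model_name, pipe, tokenizer):
        """Replace the current model; the old pipeline becomes unreferenced."""
        self.model_name = model_name
        self.pipe = pipe
        self.tokenizer = tokenizer

    @property
    def loaded(self) -> bool:
        return self.pipe is not None

state = ModelState()
print(state.loaded)  # False
state.swap("TinyLlama/TinyLlama-1.1B-Chat-v1.0", object(), object())
state.swap("unsloth/functiongemma-270m-it", object(), object())
print(state.model_name)  # unsloth/functiongemma-270m-it
print(state.loaded)  # True
```

The `object()` placeholders stand in for a real pipeline and tokenizer; the point is only that a second `swap` drops the first model's references, keeping memory bounded to one model.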