# Tech Context
## Technology Stack
### Core Framework
- **FastAPI**: Modern, high-performance web framework
- **Uvicorn**: ASGI server for running FastAPI
- **Python 3.8+**: Required for type hints and async features
### AI/ML Libraries
- **Transformers**: Hugging Face library for model loading
- **PyTorch**: Backend for Transformers
- **Accelerate**: Model optimization and distribution
- **HuggingFace Hub**: Model downloading and authentication
### Utilities
- **Pydantic**: Data validation and settings management
- **python-dotenv**: Environment variable management
- **python-multipart**: Form data handling
## Dependencies (requirements.txt)
```
fastapi
uvicorn[standard]
transformers
huggingface_hub
torch
accelerate
python-multipart
python-dotenv
```
## Configuration
### Environment Variables
```bash
# .env file
DEFAULT_MODEL_NAME="unsloth/functiongemma-270m-it"
HUGGINGFACE_TOKEN="hf_xxx"  # Optional, for gated models
```
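At startup the server reads these variables with a fallback default. A minimal sketch of that pattern (the `import` guard is only so the snippet stands alone; in the app, python-dotenv is a hard dependency):

```python
import os

# python-dotenv populates os.environ from the .env file;
# guarded here so the sketch runs even without the package installed.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

# Fall back to the documented default when DEFAULT_MODEL_NAME is unset.
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL_NAME", "unsloth/functiongemma-270m-it")
HF_TOKEN = os.getenv("HUGGINGFACE_TOKEN")  # None unless configured
```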
### Model Cache
- **Location**: `./my_model_cache`
- **Structure**: Hugging Face cache format
- **Management**: Automatic via the transformers library
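Transformers' `from_pretrained` accepts a `cache_dir` argument, so redirecting the cache to `./my_model_cache` can be sketched as below (`load_cached` is an illustrative helper, not a function from the codebase; calling it downloads weights on first use):

```python
from pathlib import Path

CACHE_DIR = Path("./my_model_cache")

def load_cached(model_name: str):
    """Load a model/tokenizer pair, caching weights under CACHE_DIR."""
    # Deferred import so the sketch is importable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=CACHE_DIR)
    model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=CACHE_DIR)
    return model, tokenizer
```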
## API Endpoints
### 1. GET /
**Purpose**: Health check and welcome message
**Response**:
```json
{"message": "Welcome to HF-Model-Runner API! Visit /docs for API documentation."}
```
### 2. POST /download
**Purpose**: Download and initialize a model
**Request**:
```json
{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
```
**Response**:
```json
{
  "status": "success",
  "message": "Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 downloaded successfully",
  "loaded": true,
  "current_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}
```
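Calling the endpoint can be sketched with only the standard library (`download_via_api` is a hypothetical client helper, and the base URL assumes the local development setup on port 7860):

```python
import json
import urllib.request

def download_via_api(model_id: str, base_url: str = "http://localhost:7860") -> dict:
    """POST {"model": <repo-id>} to /download and return the parsed JSON response."""
    payload = json.dumps({"model": model_id}).encode()
    req = urllib.request.Request(
        f"{base_url}/download",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```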
### 3. POST /v1/chat/completions
**Purpose**: OpenAI-compatible chat completion
**Request**:
```json
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 500,
  "temperature": 1.0
}
```
**Response**:
```json
{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}
```
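Because the route follows the OpenAI schema, any OpenAI-compatible client should work once pointed at the server's base URL. A sketch (the `openai` package is an extra dependency, and the `api_key` value is a placeholder since the server does no authentication):

```python
def chat(prompt: str, base_url: str = "http://localhost:7860/v1") -> str:
    """Send one user message through an OpenAI-compatible client."""
    # Deferred import: requires `pip install openai`.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="not-needed")
    resp = client.chat.completions.create(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        temperature=1.0,
    )
    return resp.choices[0].message.content
```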
## Module Structure
### app.py (Main Application)
```python
# Global state (one loaded model per process)
model_name = None
pipe = None
tokenizer = None

# Startup event
@app.on_event("startup")
async def startup_event():
    load_dotenv()
    default_model = os.getenv("DEFAULT_MODEL_NAME", "fallback")
    # ...initialize the pipeline for default_model

# Routes: GET /, POST /download, POST /v1/chat/completions
```
### utils/model.py (Model Management)
```python
class DownloadRequest(BaseModel):
    model: str

def check_model(model_name) -> tuple: ...
def download_model(model_name) -> tuple: ...
def initialize_pipeline(model_name) -> tuple: ...
```
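The `check_model` contract, a `(success, detail)` tuple, can be sketched against the Hub API; the broad exception handling is an assumption, and the import is deferred so the sketch stands alone:

```python
def check_model(repo_id: str):
    """Return (True, None) if the repo exists on the Hub, else (False, reason)."""
    try:
        from huggingface_hub import model_info  # network call to the Hub
        model_info(repo_id)
        return True, None
    except Exception as exc:  # missing repo, gated model, or network failure
        return False, str(exc)
```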
### utils/chat_request.py (Request Validation)
```python
class ChatRequest(BaseModel):
    model: Optional[str]
    messages: List[Dict[str, Any]]
    max_tokens: Optional[int]
    temperature: Optional[float]
    # ... other fields
```
### utils/chat_response.py (Response Generation)
```python
class ChatResponse(BaseModel): ...
class ChatChoice(BaseModel): ...
class ChatUsage(BaseModel): ...

def convert_json_format(input_data) -> dict: ...
def create_chat_response(request, pipe, tokenizer) -> ChatResponse: ...
```
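The response-assembly step reduces to building the OpenAI-shaped payload documented above. A self-contained sketch (`build_chat_response` is a hypothetical helper; the id scheme mirrors the `chatcmpl-<timestamp>` pattern in the example response):

```python
import time

def build_chat_response(model: str, content: str,
                        prompt_tokens: int, completion_tokens: int) -> dict:
    """Assemble an OpenAI-style chat.completion payload."""
    now = int(time.time())
    return {
        "id": f"chatcmpl-{now}",
        "object": "chat.completion",
        "created": now,
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```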
## Deployment
### Hugging Face Spaces
- **SDK**: Docker
- **Port**: 7860 (standard for HF Spaces)
- **Requirements**: All dependencies in requirements.txt
- **Environment**: .env file for configuration
### Local Development
```bash
# Install dependencies
pip install -r requirements.txt

# Run server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Access
# http://localhost:7860
# http://localhost:7860/docs
```
## Error Handling
### Common Errors
1. **Model Not Found**: HTTP 404 from check_model()
2. **Download Failed**: HTTP 500 with error message
3. **Initialization Failed**: HTTP 500 with error detail
4. **Pipeline Error**: Exception raised in create_chat_response()
### Logging
- Startup: Model initialization status
- Download: Progress and success/failure
- Chat: Token counts and errors
## Performance Considerations
### Memory
- Single model loaded at a time
- Tokenizer cached
- Pipeline reused across requests
### Latency
- Startup: One-time initialization cost
- Chat: Inference time (depends on model size)
- Download: Network + disk I/O
### Scalability
- Single model per instance
- Stateless API routes
- Async handlers for concurrency