# Tech Context

## Technology Stack

### Core Framework
- **FastAPI**: Modern, high-performance web framework
- **Uvicorn**: ASGI server for running FastAPI
- **Python 3.8+**: Required for type hints and async features

### AI/ML Libraries
- **Transformers**: Hugging Face library for model loading
- **PyTorch**: Backend for Transformers
- **Accelerate**: Model optimization and distribution
- **HuggingFace Hub**: Model downloading and authentication

### Utilities
- **Pydantic**: Data validation and settings management
- **python-dotenv**: Environment variable management
- **python-multipart**: Form data handling

## Dependencies (requirements.txt)

```
fastapi
uvicorn[standard]
transformers
huggingface_hub
torch
accelerate
python-multipart
python-dotenv
```

## Configuration

### Environment Variables

```bash
# .env file
DEFAULT_MODEL_NAME="unsloth/functiongemma-270m-it"
HUGGINGFACE_TOKEN="hf_xxx"  # Optional, for gated models
```

### Model Cache
- **Location**: `./my_model_cache`
- **Structure**: Hugging Face cache format
- **Management**: Automatic via the transformers library

## API Endpoints

### 1. GET /

**Purpose**: Health check and welcome message

**Response**:
```json
{"message": "Welcome to HF-Model-Runner API! Visit /docs for API documentation."}
```

### 2. POST /download

**Purpose**: Download and initialize a model

**Request**:
```json
{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
```

**Response**:
```json
{
  "status": "success",
  "message": "Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 downloaded successfully",
  "loaded": true,
  "current_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}
```
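The `/download` exchange above can be sketched from the client side. This is a minimal illustration using only the standard library; the `build_download_payload` and `parse_download_reply` helpers (and the hard-coded reply) are hypothetical, not part of the project, and only show the payload a client would serialize and the fields it can expect back:

```python
import json

def build_download_payload(model_id: str) -> str:
    """Serialize the request body for POST /download."""
    return json.dumps({"model": model_id})

def parse_download_reply(raw: str) -> bool:
    """Return True when the reply reports the model as loaded."""
    reply = json.loads(raw)
    return reply.get("status") == "success" and reply.get("loaded", False)

payload = build_download_payload("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Example reply mirroring the documented response shape
reply = json.dumps({
    "status": "success",
    "message": "Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 downloaded successfully",
    "loaded": True,
    "current_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
})
print(parse_download_reply(reply))  # True
```

In a real client the payload would be sent with any HTTP library (e.g. `requests.post(url, data=payload)`); only the JSON shapes here come from the endpoint documentation.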
### 3. POST /v1/chat/completions

**Purpose**: OpenAI-compatible chat completion

**Request**:
```json
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 500,
  "temperature": 1.0
}
```

**Response**:
```json
{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}
```

## Module Structure

### app.py (Main Application)

```python
# Global state
model_name = None
pipe = None
tokenizer = None

# Startup event
@app.on_event("startup")
async def startup_event():
    load_dotenv()
    default_model = os.getenv("DEFAULT_MODEL_NAME", "fallback")
    # Initialize pipeline

# Routes: GET /, POST /download, POST /v1/chat/completions
```

### utils/model.py (Model Management)

```python
class DownloadRequest(BaseModel):
    model: str

def check_model(model_name) -> tuple
def download_model(model_name) -> tuple
def initialize_pipeline(model_name) -> tuple
```

### utils/chat_request.py (Request Validation)

```python
class ChatRequest(BaseModel):
    model: Optional[str]
    messages: List[Dict[str, Any]]
    max_tokens: Optional[int]
    temperature: Optional[float]
    # ... other fields
```

### utils/chat_response.py (Response Generation)

```python
class ChatResponse(BaseModel): ...
class ChatChoice(BaseModel): ...
class ChatUsage(BaseModel): ...
```
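The field definitions of these three response models are elided above. As a rough sketch of the shapes implied by the documented `/v1/chat/completions` response — using plain dataclasses rather than the project's actual Pydantic models, with all field names inferred from that JSON example:

```python
import time
from dataclasses import dataclass, field, asdict

# Hypothetical stand-ins for the Pydantic models; field names are
# inferred from the documented /v1/chat/completions response body.
@dataclass
class ChatUsage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

@dataclass
class ChatChoice:
    index: int
    message: dict
    finish_reason: str = "stop"

@dataclass
class ChatResponse:
    id: str
    model: str
    choices: list
    usage: ChatUsage
    object: str = "chat.completion"
    created: int = field(default_factory=lambda: int(time.time()))

resp = ChatResponse(
    id="chatcmpl-1234567890",
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    choices=[ChatChoice(0, {"role": "assistant", "content": "Hello!"})],
    usage=ChatUsage(prompt_tokens=10, completion_tokens=8, total_tokens=18),
)
print(asdict(resp)["usage"]["total_tokens"])  # 18
```

`asdict` flattens the nested dataclasses into the dict a JSON serializer would emit, matching the response example above.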
```python
def convert_json_format(input_data) -> dict
def create_chat_response(request, pipe, tokenizer) -> ChatResponse
```

## Deployment

### Hugging Face Spaces
- **SDK**: Docker
- **Port**: 7860 (standard for HF Spaces)
- **Requirements**: All dependencies in requirements.txt
- **Environment**: .env file for configuration

### Local Development

```bash
# Install dependencies
pip install -r requirements.txt

# Run server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload

# Access
http://localhost:7860
http://localhost:7860/docs
```

## Error Handling

### Common Errors
1. **Model Not Found**: HTTP 404 from check_model()
2. **Download Failed**: HTTP 500 with error message
3. **Initialization Failed**: HTTP 500 with error detail
4. **Pipeline Error**: Exception raised in create_chat_response()

### Logging
- Startup: Model initialization status
- Download: Progress and success/failure
- Chat: Token counts and errors

## Performance Considerations

### Memory
- Single model loaded at a time
- Tokenizer cached
- Pipeline reused across requests

### Latency
- Startup: One-time initialization cost
- Chat: Inference time (depends on model size)
- Download: Network + disk I/O

### Scalability
- Single model per instance
- Stateless API routes
- Async handlers for concurrency
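The single-model-per-instance constraint above amounts to replacing one set of globals (`model_name`, `pipe`, `tokenizer`) on each successful download. A hedged sketch of that policy — class and method names here are illustrative, not the app's actual code:

```python
# Illustrative sketch of the "single model loaded at a time" policy:
# a successful download replaces the previous model's state wholesale,
# so at most one pipeline occupies memory at any time.
class ModelState:
    def __init__(self):
        self.model_name = None
        self.pipe = None
        self.tokenizer = None

    def swap(self, model_name, pipe, tokenizer):
        """Replace the current model; the old pipeline becomes unreferenced."""
        self.model_name = model_name
        self.pipe = pipe
        self.tokenizer = tokenizer

    @property
    def loaded(self) -> bool:
        return self.pipe is not None

state = ModelState()
print(state.loaded)  # False
state.swap("TinyLlama/TinyLlama-1.1B-Chat-v1.0", object(), object())
state.swap("unsloth/functiongemma-270m-it", object(), object())
print(state.model_name)  # unsloth/functiongemma-270m-it
print(state.loaded)  # True
```

The `object()` placeholders stand in for a real pipeline and tokenizer; the point is only that a second `swap` drops the first model's references, keeping memory bounded to one model.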