Tech Context
Technology Stack
Core Framework
- FastAPI: Modern, high-performance web framework
- Uvicorn: ASGI server for running FastAPI
- Python 3.8+: Required for type hints and async features
AI/ML Libraries
- Transformers: Hugging Face library for model loading
- PyTorch: Backend for transformers
- Accelerate: Model optimization and distribution
- HuggingFace Hub: Model downloading and authentication
Utilities
- Pydantic: Data validation and settings management
- python-dotenv: Environment variable management
- python-multipart: Form data handling
Dependencies (requirements.txt)
fastapi
uvicorn[standard]
transformers
huggingface_hub
torch
accelerate
python-multipart
python-dotenv
Configuration
Environment Variables
# .env file
DEFAULT_MODEL_NAME="unsloth/functiongemma-270m-it"
HUGGINGFACE_TOKEN="hf_xxx" # Optional, for gated models
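In practice `load_dotenv()` handles this at startup; as a minimal stdlib sketch of what that call does (read `KEY="value"` pairs from a file and export them, without overriding variables already set in the environment):

```python
import os

def load_env_file(path=".env"):
    """Minimal sketch of python-dotenv's load_dotenv(): parse KEY="value"
    lines and export them into os.environ (existing values win)."""
    with open(path) as f:
        for line in f:
            line = line.split("#", 1)[0].strip()  # drop comments and whitespace
            if "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

# After loading, fall back to a default when the variable is unset:
# load_env_file()
# model = os.getenv("DEFAULT_MODEL_NAME", "fallback-model")
```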
Model Cache
- Location: ./my_model_cache
- Structure: Hugging Face cache format
- Management: automatic via the transformers library
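One way to pin the cache location (a sketch, assuming the env-var approach; `HF_HOME` is read by huggingface_hub/transformers at import time, and `from_pretrained(..., cache_dir=...)` works per call instead):

```python
import os

# Must run before transformers/huggingface_hub are imported, since the
# cache root is resolved at import time.
os.environ.setdefault("HF_HOME", "./my_model_cache")
```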
API Endpoints
1. GET /
Purpose: Health check and welcome message
Response:
{"message": "Welcome to HF-Model-Runner API! Visit /docs for API documentation."}
2. POST /download
Purpose: Download and initialize a model
Request:
{"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"}
Response:
{
  "status": "success",
  "message": "Model TinyLlama/TinyLlama-1.1B-Chat-v1.0 downloaded successfully",
  "loaded": true,
  "current_model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
}
3. POST /v1/chat/completions
Purpose: OpenAI-compatible chat completion
Request:
{
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 500,
  "temperature": 1.0
}
Response:
{
  "id": "chatcmpl-1234567890",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you?"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}
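Assembling this payload is plain dict construction; a sketch (field names taken from the response above, token counts supplied by the caller):

```python
import time

def build_chat_completion(model, reply_text, prompt_tokens, completion_tokens):
    """Assemble an OpenAI-style chat.completion payload."""
    created = int(time.time())
    return {
        "id": f"chatcmpl-{created}",
        "object": "chat.completion",
        "created": created,
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply_text},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }

payload = build_chat_completion("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                                "Hello! How can I help you?", 10, 8)
```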
Module Structure
app.py (Main Application)
# Global state
model_name = None
pipe = None
tokenizer = None

# Startup event
@app.on_event("startup")
async def startup_event():
    load_dotenv()
    default_model = os.getenv("DEFAULT_MODEL_NAME", "fallback")
    # Initialize pipeline for the default model

# Routes: GET /, POST /download, POST /v1/chat/completions
utils/model.py (Model Management)
class DownloadRequest(BaseModel):
    model: str

def check_model(model_name) -> tuple
def download_model(model_name) -> tuple
def initialize_pipeline(model_name) -> tuple
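The `-> tuple` signatures follow an `(ok, message)` convention; a sketch of `check_model` under that convention (the real check presumably queries the Hugging Face Hub — here only the `org/name` repo-id format is validated so the example runs offline):

```python
import re

def check_model(model_name):
    """Return (ok, message); offline stand-in for the Hub lookup."""
    if re.fullmatch(r"[\w.-]+/[\w.-]+", model_name):
        return True, f"{model_name} looks like a valid repo id"
    return False, f"{model_name} is not a valid 'org/name' repo id"
```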
utils/chat_request.py (Request Validation)
class ChatRequest(BaseModel):
    model: Optional[str]
    messages: List[Dict[str, Any]]
    max_tokens: Optional[int]
    temperature: Optional[float]
    # ... other fields
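A runnable sketch of this model with defaults filled in (the `500` and `1.0` defaults are assumptions taken from the request example above, not confirmed from the source):

```python
from typing import Any, Dict, List, Optional
from pydantic import BaseModel

class ChatRequest(BaseModel):
    # model is optional: when omitted, the server uses the currently loaded model
    model: Optional[str] = None
    messages: List[Dict[str, Any]]
    max_tokens: Optional[int] = 500       # assumed default
    temperature: Optional[float] = 1.0    # assumed default

# Only messages is required; everything else falls back to defaults.
req = ChatRequest(messages=[{"role": "user", "content": "Hello"}])
```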
utils/chat_response.py (Response Generation)
class ChatResponse(BaseModel): ...
class ChatChoice(BaseModel): ...
class ChatUsage(BaseModel): ...
def convert_json_format(input_data) -> dict
def create_chat_response(request, pipe, tokenizer) -> ChatResponse
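One piece `create_chat_response` must produce is the `usage` block; a sketch of that counting step (the whitespace tokenizer is a stand-in — the real counts come from the Hugging Face tokenizer loaded at startup):

```python
def count_usage(prompt, completion, tokenizer):
    """Fill the usage block from token counts. `tokenizer` is anything
    with an encode() that returns a sequence of tokens."""
    prompt_tokens = len(tokenizer.encode(prompt))
    completion_tokens = len(tokenizer.encode(completion))
    return {"prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens}

class WhitespaceTokenizer:
    # Stand-in for the model tokenizer, for illustration only.
    def encode(self, text):
        return text.split()

usage = count_usage("Hello there", "Hi! How can I help you?", WhitespaceTokenizer())
```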
Deployment
Hugging Face Spaces
- SDK: Docker
- Port: 7860 (standard for HF Spaces)
- Requirements: All dependencies in requirements.txt
- Environment: .env file for configuration
Local Development
# Install dependencies
pip install -r requirements.txt
# Run server
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
# Access
http://localhost:7860
http://localhost:7860/docs
Error Handling
Common Errors
- Model Not Found: HTTP 404 from check_model()
- Download Failed: HTTP 500 with error message
- Initialization Failed: HTTP 500 with error detail
- Pipeline Error: Exception in create_chat_response()
Logging
- Startup: Model initialization status
- Download: Progress and success/failure
- Chat: Token counts and errors
Performance Considerations
Memory
- Single model loaded at a time
- Tokenizer cached
- Pipeline reused across requests
Latency
- Startup: One-time initialization cost
- Chat: Inference time (depends on model size)
- Download: Network + disk I/O
Scalability
- Single model per instance
- Stateless API routes
- Async handlers for concurrency
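Since transformers inference is blocking, async handlers only help concurrency if the blocking call is offloaded; a minimal sketch of that pattern (Python 3.9+ for `asyncio.to_thread`; `run_inference` stands in for the pipeline call):

```python
import asyncio
import time

def run_inference(prompt):
    # Stand-in for a blocking transformers pipeline call.
    time.sleep(0.01)
    return f"echo: {prompt}"

async def chat_handler(prompt):
    # Offload the blocking call so the event loop keeps serving other requests.
    return await asyncio.to_thread(run_inference, prompt)

result = asyncio.run(chat_handler("Hello"))
```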