| Model | Size | RAM (approx.) | Coding | Reasoning | License | GGUF |
|---|---|---|---|---|---|---|
| ★ Qwen2.5-Coder-7B-Instruct Q4_K_M | 7B | ~5.5 GB | ✔ Excellent | ✔ Strong | Apache 2.0 | ✔ |
| DeepSeek-Coder-V2-Lite-Instruct Q4 | 16B MoE | ~8 GB | ✔ Excellent | ✔ Excellent | DeepSeek | ✔ |
| Phi-3.5-mini-instruct Q4 | 3.8B | ~2.5 GB | ✔ Good | ✔ Good | MIT | ✔ |
| CodeLlama-7B-Instruct Q4 | 7B | ~5 GB | ✔ Good | ✘ Weaker | Llama 2 | ✔ |

Dockerfile:

    # ── Stage 1: builder (compile llama-cpp-python)
    FROM python:3.11-slim AS builder
    RUN apt-get update && apt-get install -y build-essential cmake wget ...
    # llama-cpp-python 0.3.x reads the GGML_* CMake options (formerly LLAMA_*)
    ENV CMAKE_ARGS="-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS"
    COPY requirements.txt .
    RUN pip install -r requirements.txt --target /build/deps

    # ── Stage 2: runtime (slim image)
    FROM python:3.11-slim
    # HF Spaces requires a non-root user
    RUN useradd -m -u 1000 user
    # Bring in the compiled dependencies and make /app writable by that user
    COPY --from=builder /build/deps /app/deps
    ENV PYTHONPATH=/app/deps
    RUN mkdir -p /app/models && chown -R user:user /app
    USER user
    WORKDIR /app
    COPY --chown=user:user app.py .
    # Download the GGUF model at BUILD time (~4.4 GB)
    RUN python -c "from huggingface_hub import hf_hub_download; \
        hf_hub_download(repo_id='Qwen/Qwen2.5-Coder-7B-Instruct-GGUF', \
        filename='qwen2.5-coder-7b-instruct-q4_k_m.gguf', \
        local_dir='/app/models')"
    ENV MODEL_PATH=/app/models/qwen2.5-coder-7b-instruct-q4_k_m.gguf
    # BEARER_TOKEN is injected from an HF Secret at runtime; do not hard-code it here
    EXPOSE 7860
    CMD ["python", "-m", "uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
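
app.py:

    import hmac
    import os

    from fastapi import Depends, FastAPI, HTTPException
    from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
    from llama_cpp import Llama
    from pydantic import BaseModel

    BEARER_TOKEN = os.environ.get("BEARER_TOKEN", "")  # from the HF Secret

    app = FastAPI()
    security = HTTPBearer()
    llm = Llama(model_path=os.environ["MODEL_PATH"])  # path set in the Dockerfile


    class Message(BaseModel):
        role: str
        content: str


    class ChatCompletionRequest(BaseModel):
        messages: list[Message]
        max_tokens: int = 512      # assumed default
        temperature: float = 0.2   # assumed default


    # Auth dependency: reject requests whose bearer token does not match
    def verify_token(creds: HTTPAuthorizationCredentials = Depends(security)):
        supplied = creds.credentials.encode()
        if not BEARER_TOKEN or not hmac.compare_digest(supplied, BEARER_TOKEN.encode()):
            raise HTTPException(401, "Invalid Bearer Token")


    # OpenAI-compatible endpoint
    @app.post("/v1/chat/completions", dependencies=[Depends(verify_token)])
    async def chat_completions(request: ChatCompletionRequest):
        result = llm.create_chat_completion(
            messages=[{"role": m.role, "content": m.content} for m in request.messages],
            max_tokens=request.max_tokens,
            temperature=request.temperature,
        )
        return ChatCompletionResponse(...)  # wrapped in OpenAI schema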
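
The endpoint above returns its result "wrapped in OpenAI schema". As a rough illustration of what that wrapping might look like, the sketch below builds a `chat.completion`-shaped dictionary from generated text using only the standard library. The helper name `wrap_completion` and the token counts passed to it are assumptions made for this example and are not part of the original app. Note that llama-cpp-python's `create_chat_completion` already returns a dictionary in a similar shape, so a real app may only need to pass that through.

```python
import time
import uuid


def wrap_completion(text: str, model: str, prompt_tokens: int = 0,
                    completion_tokens: int = 0) -> dict:
    """Hypothetical helper: shape generated text like an OpenAI chat.completion."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",  # random id with the usual prefix
        "object": "chat.completion",
        "created": int(time.time()),           # Unix timestamp in seconds
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }


# Example: wrap a short reply and read it back the way an OpenAI client would.
response = wrap_completion("chop oak_log", "qwen2.5-coder-7b-instruct", 20, 4)
print(response["choices"][0]["message"]["content"])
```

An OpenAI SDK client reads the reply from `choices[0].message.content`, so that is the field that has to be populated for the clients in this guide to work.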

requirements.txt:

    llama-cpp-python==0.3.4      # GGUF inference engine
    fastapi==0.115.6             # API framework
    uvicorn[standard]==0.32.1    # ASGI server
    pydantic==2.10.3             # Schema validation
    huggingface-hub==0.27.0      # Model download
    httpx==0.28.1                # HTTP client

Client (JavaScript / TypeScript):

    import OpenAI from "openai";

    const client = new OpenAI({
      baseURL: "https://<username>-<space>.hf.space/v1",
      apiKey: process.env.BEARER_TOKEN,
    });

    const response = await client.chat.completions.create({
      model: "qwen2.5-coder-7b-instruct",
      messages: [
        { role: "system", content: "You are a Minecraft bot brain..." },
        { role: "user", content: "Bot at x=120. Nearest: oak_log. Chop it." },
      ],
      max_tokens: 512,
      temperature: 0.2,
    });

Client (Python):

    import os

    from openai import OpenAI

    client = OpenAI(
        base_url="https://<username>-<space>.hf.space/v1",
        api_key=os.environ.get("BEARER_TOKEN"),
    )

    response = client.chat.completions.create(
        model="qwen2.5-coder-7b-instruct",
        messages=[
            {"role": "system", "content": "You are a Minecraft bot brain..."},
            {"role": "user", "content": "Bot at x=120. Nearest: oak_log. Chop it."},
        ],
        max_tokens=512,
        temperature=0.2,
    )

Open your Space → Settings tab → scroll down to the "Repository secrets" section.
Click "New secret" → enter BEARER_TOKEN in the Name field → enter your secret token in the Value field.
HF injects this value into the container as an environment variable at runtime, and the container rebuilds automatically. Because the token is not baked in at build time, it does not appear in image layers, and it stays out of the logs as long as the app never prints it.
The client must send the header Authorization: Bearer <your-token>. Store the token in a client-side environment variable (BEARER_TOKEN) to avoid hard-coding it.
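
To make the header format concrete, here is a minimal standard-library sketch of both sides of the exchange: the client builds the `Authorization` header from the `BEARER_TOKEN` environment variable, and the server checks an incoming header against the expected token. The function names `build_auth_headers` and `is_authorized` and the value `example-token` are hypothetical and chosen for illustration; the real service delegates header parsing to FastAPI's `HTTPBearer`.

```python
import hmac
import os


def build_auth_headers(env=os.environ) -> dict:
    """Client side: read the token from the environment instead of hard-coding it."""
    token = env.get("BEARER_TOKEN", "")
    if not token:
        raise RuntimeError("BEARER_TOKEN is not set")
    return {"Authorization": f"Bearer {token}"}


def is_authorized(header_value: str, expected_token: str) -> bool:
    """Server side: accept only 'Bearer <token>' with a matching token."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "bearer" or not token or not expected_token:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())


# Example with a throwaway token passed in as a fake environment mapping.
headers = build_auth_headers({"BEARER_TOKEN": "example-token"})
print(headers["Authorization"])                                   # Bearer example-token
print(is_authorized(headers["Authorization"], "example-token"))   # True
print(is_authorized("Bearer wrong", "example-token"))             # False
```

Treating an empty expected token as unauthorized means a Space that is missing its secret rejects every request rather than accepting an empty credential.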