Text Generation
Transformers
English
qwen2
code-generation
python
fine-tuning
Qwen
tools
agent-framework
multi-agent
conversational
Eval Results (legacy)
Instructions to use my-ai-stack/Stack-2-9-finetuned with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use my-ai-stack/Stack-2-9-finetuned with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="my-ai-stack/Stack-2-9-finetuned") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("my-ai-stack/Stack-2-9-finetuned") model = AutoModelForCausalLM.from_pretrained("my-ai-stack/Stack-2-9-finetuned") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use my-ai-stack/Stack-2-9-finetuned with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "my-ai-stack/Stack-2-9-finetuned" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "my-ai-stack/Stack-2-9-finetuned", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
- SGLang
How to use my-ai-stack/Stack-2-9-finetuned with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "my-ai-stack/Stack-2-9-finetuned" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "my-ai-stack/Stack-2-9-finetuned", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "my-ai-stack/Stack-2-9-finetuned" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "my-ai-stack/Stack-2-9-finetuned", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use my-ai-stack/Stack-2-9-finetuned with Docker Model Runner:
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
| # Deployment Troubleshooting Guide | |
| ## Quick Diagnostic | |
| Run the health check first: | |
| ```bash | |
| curl http://localhost:8000/health | |
| ``` | |
| Or use Python: | |
| ```bash | |
| python3 -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/health').read())" | |
| ``` | |
| Check logs: | |
| ```bash | |
| docker-compose logs -f vllm | |
| # or | |
| tail -f logs/vllm.log | |
| ``` | |
| --- | |
| ## Common Issues and Solutions | |
| ### 1. Docker/Compose Issues | |
| #### Problem: `docker: command not found` | |
| **Error:** Docker is not installed or not in PATH. | |
| **Solution:** | |
| ```bash | |
| # Install Docker (Ubuntu/Debian) | |
| curl -fsSL https://get.docker.com -o get-docker.sh | |
| sudo sh get-docker.sh | |
| sudo usermod -aG docker $USER | |
| # Log out and back in | |
| # Install Docker Compose | |
| sudo apt-get install docker-compose-plugin | |
| # or download binary: https://github.com/docker/compose/releases | |
| ``` | |
| #### Problem: `Cannot connect to the Docker daemon` | |
| **Error:** Permission denied or socket not found. | |
| **Solution:** | |
| ```bash | |
| # Start Docker service | |
| sudo systemctl start docker | |
| sudo systemctl enable docker | |
| # Verify permissions | |
| docker info | |
| ``` | |
| #### Problem: `nvidia: driver not installed` or GPU not detected | |
| **Error:** Docker doesn't see NVIDIA GPU. | |
| **Solution:** | |
| ```bash | |
| # Install NVIDIA Container Toolkit | |
| distribution=$(. /etc/os-release;echo $ID$VERSION_ID) | |
| curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - | |
| curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list | |
| sudo apt-get update && sudo apt-get install -y nvidia-docker2 | |
| sudo systemctl restart docker | |
| # Verify | |
| docker run --rm --gpus all nvidia/cuda:11.8-base nvidia-smi | |
| ``` | |
| --- | |
| ### 2. vLLM Service Issues | |
| #### Problem: `GPU Out of Memory (OOM)` | |
| **Error in logs:** `CUDA out of memory` or `CUDA error: out of memory` | |
| **Solution:** | |
| 1. **Reduce model memory usage** via environment variables: | |
| ```bash | |
| export GPU_MEMORY_UTILIZATION=0.7 # Lower from 0.9 | |
| export MAX_MODEL_LEN=8192 # Reduce from 131072 | |
| export BLOCK_SIZE=16 # Smaller blocks | |
| ``` | |
| 2. **Use quantized model** (recommended): | |
| - Convert model to AWQ or GGUF format | |
| - Set `QUANTIZATION=awq` in environment | |
| 3. **Use smaller model**: Switch from Llama-3.1-8B to 7B or smaller | |
| 4. **Reduce batch size**: | |
| ```bash | |
| export MAX_BATCH_SIZE=4 | |
| ``` | |
| 5. **Ensure no other processes** are using GPU: | |
| ```bash | |
| nvidia-smi # Check for other processes | |
| ``` | |
| #### Problem: `Model not found` | |
| **Error:** Model fails to load, `FileNotFoundError`, or stays in loading state. | |
| **Solution:** | |
| 1. **Check model path**: | |
| ```bash | |
| # For local model: | |
| ls -la models/ | |
| # Should contain config.json, pytorch_model.bin, etc. | |
| # For HuggingFace model: | |
| # Set MODEL_NAME to HF name, e.g., meta-llama/Llama-3.1-8B-Instruct | |
| ``` | |
| 2. **Download model manually** if automatic download fails: | |
| ```bash | |
| # Install huggingface-cli | |
| pip install huggingface-hub | |
| # Download (requires authentication for gated models) | |
| huggingface-cli login # if needed | |
| huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir models | |
| ``` | |
| 3. **Check disk space**: | |
| ```bash | |
| df -h | |
| # Need ~16GB for 8B model (32GB for original, ~8GB for quantized) | |
| ``` | |
| 4. **Use pre-downloaded model**: | |
| - Upload model to the `models/` directory before starting | |
| - Mount external volume with model | |
| #### Problem: `Health check timeout` or `503 Service Unavailable` | |
| **Cause:** Model still loading, or failed to start. | |
| **Diagnosis:** | |
| ```bash | |
| docker-compose logs vllm | |
| # Look for "Model loaded successfully" or error messages | |
| ``` | |
| **Solution:** | |
| - Wait longer (first load can take 5-15 minutes) | |
| - Check logs for specific errors (OOM, missing files) | |
| - Increase healthcheck start_period: | |
| ```yaml | |
| healthcheck: | |
| start_period: 300s # Increase from 120s | |
| ``` | |
| #### Problem: `CORS or network errors` when calling API | |
| **Symptoms:** Connection refused, network timeout. | |
| **Solution:** | |
| ```bash | |
| # Check if container is running | |
| docker-compose ps | |
| # Check port mapping | |
| docker-compose port vllm 8000 | |
| # Test from inside container | |
| docker-compose exec vllm curl http://localhost:8000/health | |
| # Check firewall | |
| sudo ufw status | |
| sudo ufw allow 8000 | |
| ``` | |
| #### Problem: `Redis connection failed` | |
| **Error:** `Could not connect to Redis` | |
| **Solution:** | |
| - Redis is optional (caching). vLLM will continue without it. | |
| - If you want Redis: | |
| ```bash | |
| docker-compose ps redis # Check if running | |
| docker-compose logs redis | |
| ``` | |
| --- | |
| ### 3. Docker Compose Issues | |
| #### Problem: `Port already in use` | |
| **Error:** ` Bind for 0.0.0.0:8000 failed: port is already allocated` | |
| **Solution:** | |
| ```bash | |
| # Find process using port | |
| lsof -i :8000 | |
| # or | |
| netstat -tulpn | grep :8000 | |
| # Kill process or change port in docker-compose.yml: | |
| # ports: | |
| # - "8001:8000" # Map host 8001 to container 8000 | |
| ``` | |
| #### Problem: `Volume mount permission denied` | |
| **Error:** Cannot mount `./models:/models` | |
| **Solution:** | |
| ```bash | |
| # Create directories with proper permissions | |
| mkdir -p models logs | |
| sudo chown -R $(id -u):$(id -g) models logs | |
| # Or run Docker with volume flags to ignore permissions | |
| ``` | |
| #### Problem: `docker-compose: command not found` | |
| **Solution:** | |
| ```bash | |
| # Docker Compose v2 (included with Docker) | |
| sudo apt-get install docker-compose-plugin | |
| # Or Docker Compose v1 (standalone) | |
| sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose | |
| sudo chmod +x /usr/local/bin/docker-compose | |
| ``` | |
| --- | |
| ### 4. Cloud Deployment Issues | |
| #### RunPod Specific | |
| **Problem: `runpodctl: command not found`** | |
| ```bash | |
| # Install | |
| curl -L https://github.com/runpod/runpodctl/releases/latest/download/runpodctl-linux-amd64 -o runpodctl | |
| sudo install runpodctl /usr/local/bin/ | |
| runpodctl config # Set API key | |
| ``` | |
| **Problem: `Template not found` or `pod creation failed`** | |
| - Ensure you have sufficient quota/balance | |
| - Check GPU availability in your region | |
| - Verify template name (case-sensitive) | |
| **Problem: `SCP/SSH connection failed`** | |
| - Pod may still be starting; wait 2-3 minutes | |
| - Check pod status: `runpodctl get pod <id>` | |
| - Verify pod is in `RUNNING` state | |
| **Problem: `Insufficient disk space` on pod** | |
| - Increase disk size in script (`DISK_SIZE=100` or higher) | |
| - Upload model separately to `/workspace/models` before starting | |
| #### Vast.ai Specific | |
| **Problem: `vastai: command not found`** | |
| ```bash | |
| pip install vastai | |
| # or download from: https://vast.ai/docs/cli | |
| ``` | |
| **Problem: `No suitable instance found`** | |
| - Relax search criteria (lower `VAST_GPU_RAM`) | |
| - Increase `VAST_SEARCH_LIMIT` | |
| - Check marketplace manually: `vastai search offers "cuda>=11.8"` | |
| **Problem: `SSH connection refused`** | |
| - Instance may still be provisioning | |
| - Check `vastai show instance <id>` | |
| - Ensure port forwarding is set up correctly | |
| **Problem: `Instance died or unresponsive`** | |
| - Check if balance depleted | |
| - Instance may have been evicted (low priority) | |
| - Use `--priority` flag or choose higher-cost instances | |
| --- | |
| ## Performance Tuning | |
| ### Reduce Latency | |
| ```bash | |
| export MAX_BATCH_SIZE=4 # Smaller batches for lower latency | |
| export MAX_MODEL_LEN=4096 # Shorter context window | |
| export GPU_MEMORY_UTILIZATION=0.8 | |
| ``` | |
| ### Increase Throughput | |
| ```bash | |
| export MAX_BATCH_SIZE=32 # Larger batches | |
| export MAX_MODEL_LEN=16384 # Longer context capability | |
| export GPU_MEMORY_UTILIZATION=0.95 | |
| ``` | |
| ### Multi-GPU Setup | |
| ```bash | |
| # Automatically detected. Ensure tensor parallel size matches GPU count: | |
| # export TENSOR_PARALLEL_SIZE=2 # For 2 GPUs (usually auto-detected) | |
| ``` | |
| --- | |
| ## Monitoring | |
| ### Health Endpoint | |
| ```bash | |
| curl http://localhost:8000/health | jq | |
| # Returns: {"status":"healthy","model":{...},"timestamp":...} | |
| ``` | |
| ### Readiness Endpoint (K8s liveness) | |
| ```bash | |
| curl http://localhost:8000/ready | |
| # Returns: {"status":"ready"} | |
| ``` | |
| ### Prometheus Metrics | |
| ```bash | |
| curl http://localhost:9090/metrics | |
| # Look for: vllm_requests_total, vllm_request_latency_seconds | |
| ``` | |
| ### Container Logs | |
| ```bash | |
| # All logs | |
| docker-compose logs -f vllm | |
| # Last 100 lines | |
| docker-compose logs --tail=100 vllm | |
| # Search for errors | |
| docker-compose logs vllm | grep -i error | |
| ``` | |
| --- | |
| ## Model Compatibility | |
| ### Supported Formats | |
| - **HuggingFace (default)**: `MODEL_FORMAT=hf` | |
| - **Local directory**: Mount model folder to `/models` | |
| - **AWQ quantized**: Set `QUANTIZATION=awq` and use AWQ model | |
| ### Gated Models (Llama 3.1, etc.) | |
| 1. Request access on HuggingFace | |
| 2. Get your token: https://huggingface.co/settings/tokens | |
| 3. Authenticate: | |
| ```bash | |
| huggingface-cli login | |
| # Paste token | |
| ``` | |
| ### Unsupported Models | |
| If vLLM doesn't support your model architecture: | |
| - Use `trust_remote_code=True` (already set) | |
| - Convert model to supported format | |
| - Check vLLM supported models: https://docs.vllm.ai/ | |
| --- | |
| ## Debug Mode | |
| Enable verbose logging: | |
| ```bash | |
| export LOG_LEVEL=DEBUG | |
| # restart services | |
| docker-compose down && docker-compose up -d | |
| ``` | |
| --- | |
| ## Getting Help | |
| 1. Check this guide for common symptoms | |
| 2. Review logs: `docker-compose logs vllm` | |
| 3. Search issues: https://github.com/vllm-project/vllm/issues | |
| 4. Community: https://discord.gg/vllm | |
| --- | |
| ## Quick Reference Commands | |
| ```bash | |
| # Start deployment | |
| cd stack-2.9-deploy | |
| ./local_deploy.sh | |
| # Stop deployment | |
| docker-compose down | |
| # View logs | |
| docker-compose logs -f vllm | |
| # Restart single service | |
| docker-compose restart vllm | |
| # Check service status | |
| docker-compose ps | |
| # Access container shell | |
| docker-compose exec vllm bash | |
| # Clean everything (WARNING: deletes data!) | |
| docker-compose down -v | |
| rm -rf models logs | |
| # Rebuild image (after Dockerfile changes) | |
| docker-compose build --no-cache vllm | |
| docker-compose up -d | |
| ``` | |