Instructions to use my-ai-stack/Stack-2-9-finetuned with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use my-ai-stack/Stack-2-9-finetuned with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="my-ai-stack/Stack-2-9-finetuned")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("my-ai-stack/Stack-2-9-finetuned")
model = AutoModelForCausalLM.from_pretrained("my-ai-stack/Stack-2-9-finetuned")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

vLLM

How to use my-ai-stack/Stack-2-9-finetuned with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "my-ai-stack/Stack-2-9-finetuned"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "my-ai-stack/Stack-2-9-finetuned",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker

```bash
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
```

SGLang

How to use my-ai-stack/Stack-2-9-finetuned with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "my-ai-stack/Stack-2-9-finetuned" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "my-ai-stack/Stack-2-9-finetuned",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "my-ai-stack/Stack-2-9-finetuned" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "my-ai-stack/Stack-2-9-finetuned",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use my-ai-stack/Stack-2-9-finetuned with Docker Model Runner:
```
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
```

Stack-2-9-finetuned / stack /deploy /TROUBLESHOOTING.md

walidsobhie-code

refactor: Squeeze folders further - cleaner structure

65888d5 2 months ago


	# Deployment Troubleshooting Guide

	## Quick Diagnostic

	Run the health check first:
	```bash
	curl http://localhost:8000/health
	```

	Or use Python:
	```bash
	python3 -c "import urllib.request; print(urllib.request.urlopen('http://localhost:8000/health').read().decode())"
	```

	Check logs:
	```bash
	docker-compose logs -f vllm
	# or
	tail -f logs/vllm.log
	```

	---

	## Common Issues and Solutions

	### 1. Docker/Compose Issues

	#### Problem: `docker: command not found`
	Error: Docker is not installed or not in PATH.

	Solution:
	```bash
	# Install Docker (Ubuntu/Debian)
	curl -fsSL https://get.docker.com -o get-docker.sh
	sudo sh get-docker.sh
	sudo usermod -aG docker $USER
	# Log out and back in

	# Install Docker Compose
	sudo apt-get install docker-compose-plugin
	# or download binary: https://github.com/docker/compose/releases
	```

	#### Problem: `Cannot connect to the Docker daemon`
	Error: Permission denied or socket not found.

	Solution:
	```bash
	# Start Docker service
	sudo systemctl start docker
	sudo systemctl enable docker

	# Verify permissions
	docker info
	```

	#### Problem: `nvidia: driver not installed` or GPU not detected
	Error: Docker doesn't see NVIDIA GPU.

	Solution:
	```bash
	# Install NVIDIA Container Toolkit
	# (legacy nvidia-docker2 method; apt-key is deprecated on newer distributions)
	distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
	curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
	curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
	sudo apt-get update && sudo apt-get install -y nvidia-docker2
	sudo systemctl restart docker

	# Verify
	docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
	```

	---

	### 2. vLLM Service Issues

	#### Problem: `GPU Out of Memory (OOM)`
	Error in logs: `CUDA out of memory` or `CUDA error: out of memory`

	Solution:

	1. Reduce model memory usage via environment variables:
	```bash
	export GPU_MEMORY_UTILIZATION=0.7 # Lower from 0.9
	export MAX_MODEL_LEN=8192 # Reduce from 131072
	export BLOCK_SIZE=16 # Smaller blocks
	```

	2. Use a quantized model (recommended):
	- Convert the model to AWQ or GGUF format
	- Set `QUANTIZATION=awq` in the environment

	3. Use a smaller model: switch from Llama-3.1-8B to a 7B or smaller model

	4. Reduce batch size:
	```bash
	export MAX_BATCH_SIZE=4
	```

	5. Ensure no other processes are using GPU:
	```bash
	nvidia-smi # Check for other processes
	```
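
The effect of lowering `MAX_MODEL_LEN` in step 1 can be estimated by computing the KV cache that one full-length sequence needs. The sketch below assumes the published Llama-3.1-8B shape (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache; the function name is illustrative and not part of the deployment scripts. vLLM's actual allocation also depends on block size and `GPU_MEMORY_UTILIZATION`, so treat the numbers as rough guidance.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Llama-3.1-8B shape: 32 layers, 8 KV heads (grouped-query attention), head_dim 128.
PER_TOKEN = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

for max_model_len in (131072, 8192, 4096):
    gib = PER_TOKEN * max_model_len / 2**30
    print(f"MAX_MODEL_LEN={max_model_len:>6}: ~{gib:.1f} GiB of KV cache per full-length sequence")
```

Under these assumptions a single 131072-token sequence needs about as much memory for its cache as the 16-bit weights themselves, which is why shortening the context is usually the first setting to try.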

	#### Problem: `Model not found`
	Error: Model fails to load, `FileNotFoundError`, or stays in loading state.

	Solution:

	1. Check model path:
	```bash
	# For local model:
	ls -la models/
	# Should contain config.json and the weight files (pytorch_model.bin or *.safetensors), etc.

	# For HuggingFace model:
	# Set MODEL_NAME to HF name, e.g., meta-llama/Llama-3.1-8B-Instruct
	```

	2. Download model manually if automatic download fails:
	```bash
	# Install huggingface-cli
	pip install huggingface-hub

	# Download (requires authentication for gated models)
	huggingface-cli login # if needed
	huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir models
	```

	3. Check disk space:
	```bash
	df -h
	# Need ~16 GB for an 8B model in 16-bit precision (~32 GB in 32-bit, ~8 GB when quantized to 8-bit)
	```
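
The disk figures above follow from parameter count times bytes per parameter. The sketch below reproduces that arithmetic and compares it with the free space on the current filesystem. The function names and the 20% headroom are assumptions for illustration; real checkpoints also include tokenizer and config files, and a download cache may temporarily hold a second copy.

```python
import shutil


def weights_size_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


def has_room(path: str, needed_gb: float, headroom: float = 1.2) -> bool:
    """True if the filesystem holding `path` has the needed space plus headroom."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb * headroom


for label, bits in (("32-bit", 32), ("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"8B model at {label}: ~{weights_size_gb(8e9, bits):.0f} GB")

print("Enough room for 16-bit weights here:", has_room(".", weights_size_gb(8e9, 16)))
```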

	4. Use pre-downloaded model:
	- Upload model to the `models/` directory before starting
	- Mount external volume with model

	#### Problem: `Health check timeout` or `503 Service Unavailable`
	Cause: Model still loading, or failed to start.

	Diagnosis:
	```bash
	docker-compose logs vllm
	# Look for "Model loaded successfully" or error messages
	```

	Solution:
	- Wait longer (first load can take 5-15 minutes)
	- Check logs for specific errors (OOM, missing files)
	- Increase healthcheck start_period:
	```yaml
	healthcheck:
	  start_period: 300s # Increase from 120s
	```
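
Because the first load can take several minutes, a script that polls the health endpoint is often more convenient than retrying by hand. The following is a minimal sketch, assuming the `/health` endpoint on port 8000 shown earlier returns HTTP 200 once the model is ready. The function names, the 15-minute limit, and the 10-second interval are illustrative choices, not part of the deployment scripts.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with HTTP 200; False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx replies such as 503.
        return False


def wait_until_healthy(url: str, max_wait: float = 900.0, interval: float = 10.0,
                       probe: Callable[[str], bool] = http_ok,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `url` until `probe` succeeds or `max_wait` seconds have elapsed."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Example call (blocks for up to 15 minutes while the model loads):
# wait_until_healthy("http://localhost:8000/health")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server; if the function returns `False`, the container logs are the next place to look.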

	#### Problem: `CORS or network errors` when calling API
	Symptoms: Connection refused, network timeout.

	Solution:
	```bash
	# Check if container is running
	docker-compose ps

	# Check port mapping
	docker-compose port vllm 8000

	# Test from inside container
	docker-compose exec vllm curl http://localhost:8000/health

	# Check firewall
	sudo ufw status
	sudo ufw allow 8000
	```

	#### Problem: `Redis connection failed`
	Error: `Could not connect to Redis`

	Solution:
	- Redis is optional (caching). vLLM will continue without it.
	- If you want Redis:
	```bash
	docker-compose ps redis # Check if running
	docker-compose logs redis
	```

	---

	### 3. Docker Compose Issues

	#### Problem: `Port already in use`
	Error: `Bind for 0.0.0.0:8000 failed: port is already allocated`

	Solution:
	```bash
	# Find process using port
	lsof -i :8000
	# or
	netstat -tulpn | grep :8000

	# Kill process or change port in docker-compose.yml:
	# ports:
	# - "8001:8000" # Map host 8001 to container 8000
	```
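
When `lsof` or `netstat` is not available, the same check can be approximated from Python. This is a sketch only: it tests whether anything accepts TCP connections on the loopback address, so a listener bound to a different interface would be missed. The helper names and the 8000-8099 search range are assumptions for illustration.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def first_free_port(start: int = 8000, end: int = 8100) -> int:
    """Return the first port in [start, end) with no listener on the loopback address."""
    for port in range(start, end):
        if not port_in_use(port):
            return port
    raise RuntimeError(f"no free port between {start} and {end - 1}")


# Example: pick a host port for the "HOST:8000" mapping in docker-compose.yml
# print(first_free_port())
```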

	#### Problem: `Volume mount permission denied`
	Error: Cannot mount `./models:/models`

	Solution:
	```bash
	# Create directories with proper permissions
	mkdir -p models logs
	sudo chown -R $(id -u):$(id -g) models logs
	# Or run Docker with volume flags to ignore permissions
	```

	#### Problem: `docker-compose: command not found`
	Solution:
	```bash
	# Docker Compose v2 plugin (invoked as `docker compose`, with a space)
	sudo apt-get install docker-compose-plugin

	# Or the standalone binary (provides the `docker-compose` command used in this guide)
	sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m)" -o /usr/local/bin/docker-compose
	sudo chmod +x /usr/local/bin/docker-compose
	```

	---

	### 4. Cloud Deployment Issues

	#### RunPod Specific

	Problem: `runpodctl: command not found`
	```bash
	# Install
	curl -L https://github.com/runpod/runpodctl/releases/latest/download/runpodctl-linux-amd64 -o runpodctl
	sudo install runpodctl /usr/local/bin/
	runpodctl config # Set API key
	```

	Problem: `Template not found` or `pod creation failed`
	- Ensure you have sufficient quota/balance
	- Check GPU availability in your region
	- Verify template name (case-sensitive)

	Problem: `SCP/SSH connection failed`
	- Pod may still be starting; wait 2-3 minutes
	- Check pod status: `runpodctl get pod <id>`
	- Verify pod is in `RUNNING` state

	Problem: `Insufficient disk space` on pod
	- Increase disk size in script (`DISK_SIZE=100` or higher)
	- Upload model separately to `/workspace/models` before starting

	#### Vast.ai Specific

	Problem: `vastai: command not found`
	```bash
	pip install vastai
	# or download from: https://vast.ai/docs/cli
	```

	Problem: `No suitable instance found`
	- Relax search criteria (lower `VAST_GPU_RAM`)
	- Increase `VAST_SEARCH_LIMIT`
	- Check marketplace manually: `vastai search offers "cuda_vers>=11.8"`

	Problem: `SSH connection refused`
	- Instance may still be provisioning
	- Check `vastai show instance <id>`
	- Ensure port forwarding is set up correctly

	Problem: `Instance died or unresponsive`
	- Check if balance depleted
	- Instance may have been evicted (low priority)
	- Use `--priority` flag or choose higher-cost instances

	---

	## Performance Tuning

	### Reduce Latency
	```bash
	export MAX_BATCH_SIZE=4 # Smaller batches for lower latency
	export MAX_MODEL_LEN=4096 # Shorter context window
	export GPU_MEMORY_UTILIZATION=0.8
	```

	### Increase Throughput
	```bash
	export MAX_BATCH_SIZE=32 # Larger batches
	export MAX_MODEL_LEN=16384 # Longer context capability
	export GPU_MEMORY_UTILIZATION=0.95
	```

	### Multi-GPU Setup
	```bash
	# GPU count is usually auto-detected. If it is not, set the tensor parallel size to match:
	# export TENSOR_PARALLEL_SIZE=2 # For 2 GPUs
	```

	---

	## Monitoring

	### Health Endpoint
	```bash
	curl http://localhost:8000/health | jq
	# Returns: {"status":"healthy","model":{...},"timestamp":...}
	```

	### Readiness Endpoint (K8s readiness probe)
	```bash
	curl http://localhost:8000/ready
	# Returns: {"status":"ready"}
	```

	### Prometheus Metrics
	```bash
	curl http://localhost:9090/metrics
	# Look for: vllm_requests_total, vllm_request_latency_seconds
	```
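
To turn the raw metrics text into numbers, a few lines of Python are enough for simple cases. The sketch below assumes the endpoint serves the Prometheus text exposition format without timestamps. The sample payload is made up for illustration, and the `_sum` and `_count` series are an assumption based on the usual Prometheus naming for latency metrics, not something this guide confirms.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Map each sample (metric name plus any labels) to its value, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamp assumed).
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Made-up sample payload, for illustration only.
SAMPLE = """\
# HELP vllm_requests_total Total requests served
# TYPE vllm_requests_total counter
vllm_requests_total 1287
vllm_request_latency_seconds_sum 412.5
vllm_request_latency_seconds_count 1287
"""

metrics = parse_metrics(SAMPLE)
mean_latency = (metrics["vllm_request_latency_seconds_sum"]
                / metrics["vllm_request_latency_seconds_count"])
print(f"requests: {metrics['vllm_requests_total']:.0f}, mean latency: {mean_latency:.3f}s")
```

For anything beyond a quick check, a real Prometheus server or client library handles labels, timestamps, and histograms more robustly than this parser.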

	### Container Logs
	```bash
	# All logs
	docker-compose logs -f vllm

	# Last 100 lines
	docker-compose logs --tail=100 vllm

	# Search for errors
	docker-compose logs vllm | grep -i error
	```
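
Searching for "error" can be noisy, so it may help to scan the logs for the specific messages this guide covers and point back to the matching section. The sketch below uses only error strings quoted in this guide; the mapping, the function name, and the sample log lines are illustrative assumptions, and real log wording may differ between versions.

```python
# Error substrings quoted in this guide, mapped to the section that covers them.
KNOWN_ERRORS = {
    "out of memory": "GPU Out of Memory (OOM)",
    "filenotfounderror": "Model not found",
    "port is already allocated": "Port already in use",
    "could not connect to redis": "Redis connection failed",
    "cannot connect to the docker daemon": "Cannot connect to the Docker daemon",
}


def triage(log_lines) -> list[str]:
    """Return the guide sections whose error text appears in the logs, in first-seen order."""
    hits = []
    for line in log_lines:
        lowered = line.lower()
        for needle, section in KNOWN_ERRORS.items():
            if needle in lowered and section not in hits:
                hits.append(section)
    return hits


# Made-up log excerpt, for illustration only.
sample_logs = [
    "INFO starting server",
    "RuntimeError: CUDA out of memory. Tried to allocate more memory",
    "WARNING Could not connect to Redis, continuing without cache",
]
for section in triage(sample_logs):
    print("See section:", section)
```

To run it against live output, the lines from `docker-compose logs vllm` could be piped in and read from `sys.stdin` instead of the sample list.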

	---

	## Model Compatibility

	### Supported Formats
	- HuggingFace (default): `MODEL_FORMAT=hf`
	- Local directory: Mount model folder to `/models`
	- AWQ quantized: Set `QUANTIZATION=awq` and use AWQ model

	### Gated Models (Llama 3.1, etc.)
	1. Request access on HuggingFace
	2. Get your token: https://huggingface.co/settings/tokens
	3. Authenticate:
	```bash
	huggingface-cli login
	# Paste token
	```

	### Unsupported Models
	If vLLM doesn't support your model architecture:
	- Use `trust_remote_code=True` (already set)
	- Convert model to supported format
	- Check vLLM supported models: https://docs.vllm.ai/

	---

	## Debug Mode

	Enable verbose logging:
	```bash
	export LOG_LEVEL=DEBUG
	# restart services
	docker-compose down && docker-compose up -d
	```

	---

	## Getting Help

	1. Check this guide for common symptoms
	2. Review logs: `docker-compose logs vllm`
	3. Search issues: https://github.com/vllm-project/vllm/issues
	4. Community: https://discord.gg/vllm

	---

	## Quick Reference Commands

	```bash
	# Start deployment
	cd stack-2.9-deploy
	./local_deploy.sh

	# Stop deployment
	docker-compose down

	# View logs
	docker-compose logs -f vllm

	# Restart single service
	docker-compose restart vllm

	# Check service status
	docker-compose ps

	# Access container shell
	docker-compose exec vllm bash

	# Clean everything (WARNING: deletes data!)
	docker-compose down -v
	rm -rf models logs

	# Rebuild image (after Dockerfile changes)
	docker-compose build --no-cache vllm
	docker-compose up -d
	```