Stack 2.9 Inference API Documentation
REST API for code generation using the Stack 2.9 fine-tuned Qwen model.
Quick Start
1. Install Dependencies
pip install -r requirements_api.txt
pip install -r requirements.txt # Core dependencies (transformers, torch, etc.)
2. Set Model Path
# Option A: Environment variable
export MODEL_PATH=/path/to/your/merged/model
# Option B: Set the variable inline for a single command
MODEL_PATH=/path/to/model uvicorn inference_api:app --port 8000
3. Start the Server
# Basic usage
uvicorn inference_api:app --host 0.0.0.0 --port 8000
# With auto-reload (development)
uvicorn inference_api:app --reload --port 8000
# Using Python directly
python inference_api.py
4. Verify It's Running
curl http://localhost:8000/health
Expected response:
{
"status": "healthy",
"model_loaded": true,
"model_path": "base_model_qwen7b",
"device": "cuda",
"cuda_available": true
}
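Loading the model can take some time, so a client may want to poll `/health` until `model_loaded` is true before sending requests. The following is a minimal sketch under that assumption; `wait_until_ready` and its `fetch_health` parameter are hypothetical names for illustration, not part of the API.

```python
import time
from typing import Callable, Dict


def wait_until_ready(
    fetch_health: Callable[[], Dict],
    timeout_s: float = 120.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll a health-check callable until it reports model_loaded == True.

    fetch_health is any zero-argument callable that returns the parsed
    /health JSON, for example:
        lambda: requests.get("http://localhost:8000/health").json()
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            health = fetch_health()
            if health.get("model_loaded"):
                return health
        except Exception:
            # The server may not be accepting connections yet; keep polling.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("model did not become ready in time")
        sleep(interval_s)
```

Passing the fetch function in as a parameter keeps the helper independent of any particular HTTP library.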
API Endpoints
GET /health
Health check endpoint to verify API and model status.
Response:
{
"status": "healthy",
"model_loaded": true,
"model_path": "/path/to/model",
"device": "cuda",
"cuda_available": true
}
GET /model-info
Get information about the currently loaded model.
Response:
{
"model_path": "/path/to/model",
"device": "cuda:0",
"dtype": "torch.float16"
}
POST /generate
Generate code completion for a prompt.
Request Body:
{
"prompt": "def two_sum(nums, target):\n    \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
"max_tokens": 128,
"temperature": 0.2,
"top_p": 0.95,
"do_sample": true,
"repetition_penalty": 1.1,
"num_return_sequences": 1
}
Parameters:
| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `prompt` | string | required | - | Input prompt to complete |
| `max_tokens` | int | 512 | 1-4096 | Maximum tokens to generate |
| `temperature` | float | 0.2 | 0.0-2.0 | Sampling temperature (higher = more creative) |
| `top_p` | float | 0.95 | 0.0-1.0 | Nucleus sampling threshold |
| `do_sample` | bool | true | - | Whether to use sampling vs greedy |
| `repetition_penalty` | float | 1.1 | 1.0-2.0 | Penalize repeated tokens |
| `num_return_sequences` | int | 1 | 1-10 | Number of sequences to generate |
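The defaults and ranges listed above can be checked on the client before a request is sent. This is a sketch only: the `GenerateRequest` class is a hypothetical helper, and the server's own validation may differ.

```python
from dataclasses import asdict, dataclass


@dataclass
class GenerateRequest:
    """Client-side mirror of the /generate parameters (hypothetical helper)."""

    prompt: str
    max_tokens: int = 512
    temperature: float = 0.2
    top_p: float = 0.95
    do_sample: bool = True
    repetition_penalty: float = 1.1
    num_return_sequences: int = 1

    def __post_init__(self) -> None:
        # Inclusive ranges copied from the parameter table.
        ranges = {
            "max_tokens": (1, 4096),
            "temperature": (0.0, 2.0),
            "top_p": (0.0, 1.0),
            "repetition_penalty": (1.0, 2.0),
            "num_return_sequences": (1, 10),
        }
        for name, (low, high) in ranges.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")

    def to_json(self) -> dict:
        """Return a dict suitable for requests.post(..., json=...)."""
        return asdict(self)
```

A request built this way fails fast with a `ValueError` instead of waiting for the server to reject it.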
Response:
{
"generated_text": "    seen = {}\n    for i, num in enumerate(nums):\n        complement = target - num\n        if complement in seen:\n            return [seen[complement], i]\n        seen[num] = i\n    return []",
"prompt": "def two_sum(nums, target):\n    \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
"model": "base_model_qwen7b",
"num_tokens": 45,
"finish_reason": "stop"
}
Example with curl:
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{
"prompt": "def fibonacci(n):\n    \"\"\"Return first n Fibonacci numbers.\"\"\"\n",
"max_tokens": 100,
"temperature": 0.2
}'
POST /chat
Conversational interface for multi-turn interactions.
Request Body:
{
"messages": [
{"role": "user", "content": "Write a function to reverse a string in Python"},
{"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"},
{"role": "user", "content": "Make it recursive instead"}
],
"max_tokens": 128,
"temperature": 0.2,
"top_p": 0.95
}
Message Roles:
- `user` - User's message
- `assistant` - Model's previous response (for conversation history)
Response:
{
"message": {
"role": "assistant",
"content": "def reverse_string(s):\n    if len(s) <= 1:\n        return s\n    return s[-1] + reverse_string(s[:-1])"
},
"model": "base_model_qwen7b",
"num_tokens": 67,
"finish_reason": "stop"
}
Example with curl:
curl -X POST http://localhost:8000/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Write a binary search function"}
],
"max_tokens": 150
}'
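For multi-turn use, the client has to keep the conversation history and resend it with each request. The sketch below shows one way to do that, assuming the request and response shapes shown above. `check_messages` and `next_turn` are hypothetical helpers, and the empty-list check is an assumption rather than documented server behavior.

```python
from typing import Dict, List

Message = Dict[str, str]
ALLOWED_ROLES = {"user", "assistant"}


def check_messages(messages: List[Message]) -> None:
    """Raise ValueError if the history would likely be rejected by /chat."""
    if not messages:
        raise ValueError("messages must not be empty")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {message.get('role')!r}")
    if messages[-1]["role"] != "user":
        # Mirrors the documented 400 error detail.
        raise ValueError("Last message must be from user")


def next_turn(history: List[Message], reply: Message, user_text: str) -> List[Message]:
    """Return a new history: old turns, the assistant reply, then the new user turn.

    reply is the "message" object from a /chat response.
    """
    updated = history + [reply, {"role": "user", "content": user_text}]
    check_messages(updated)
    return updated
```

Returning a new list rather than mutating `history` makes it easy to retry a failed request with the previous state.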
POST /generate/raw
Same as /generate but returns raw output without extracting code from markdown blocks.
Example with curl:
curl -X POST http://localhost:8000/generate/raw \
-H "Content-Type: application/json" \
-d '{
"prompt": "def quick_sort(arr):",
"max_tokens": 200
}'
POST /extract-code
Extract code from a text response that may contain markdown code blocks.
Request Body:
{
"prompt": "```python\ndef hello():\n    print(\"world\")\n```"
}
Response:
{
"code": "def hello():\n    print(\"world\")"
}
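The same kind of extraction can be approximated on the client with a regular expression. This is an illustrative sketch, not the server's implementation: `extract_code` here is a hypothetical function, and returning the stripped input when no fenced block is found is an assumption.

```python
import re

# Built programmatically so the fence marker does not appear literally here.
FENCE = "`" * 3

# Opening fence with an optional language tag, then the shortest body up to
# the next closing fence.
_BLOCK_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(text: str) -> str:
    """Return the body of the first fenced code block, or the stripped text."""
    match = _BLOCK_RE.search(text)
    if match is None:
        return text.strip()
    return match.group(1).strip("\n")
```

Only the first fenced block is returned; responses containing several blocks would need `finditer` instead of `search`.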
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | `base_model_qwen7b` | Path to model directory |
| `DEVICE` | `cuda` (if available) | Device to use: `cuda` or `cpu` |
| `PORT` | `8000` | Server port |
| `HOST` | `0.0.0.0` | Server host |
| `RELOAD` | `false` | Enable auto-reload for development |
| `DEFAULT_MAX_TOKENS` | `512` | Default max tokens |
| `DEFAULT_TEMPERATURE` | `0.2` | Default temperature |
| `DEFAULT_TOP_P` | `0.95` | Default top_p |
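As an illustration of how these variables and defaults fit together, the sketch below reads them into a typed settings object. It is not the code in `inference_api.py`: `Settings` and `load_settings` are hypothetical names, the set of strings accepted as true for `RELOAD` is an assumption, and CUDA availability is passed in as a flag so the example runs without `torch`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_path: str
    device: str
    host: str
    port: int
    reload: bool
    default_max_tokens: int
    default_temperature: float
    default_top_p: float


def load_settings(
    env: Mapping[str, str] = os.environ,
    cuda_available: bool = False,
) -> Settings:
    """Build Settings from environment variables, using the documented defaults."""
    return Settings(
        model_path=env.get("MODEL_PATH", "base_model_qwen7b"),
        # The documented default is "cuda (if available)".
        device=env.get("DEVICE", "cuda" if cuda_available else "cpu"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        # Assumption: common truthy spellings enable reload.
        reload=env.get("RELOAD", "false").strip().lower() in {"1", "true", "yes"},
        default_max_tokens=int(env.get("DEFAULT_MAX_TOKENS", "512")),
        default_temperature=float(env.get("DEFAULT_TEMPERATURE", "0.2")),
        default_top_p=float(env.get("DEFAULT_TOP_P", "0.95")),
    )
```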
Usage Examples
Python Client
import requests

API_URL = "http://localhost:8000"

# Health check
health = requests.get(f"{API_URL}/health").json()
print(f"Model loaded: {health['model_loaded']}")

# Code completion
response = requests.post(
    f"{API_URL}/generate",
    json={
        "prompt": "def merge_sort(arr):\n    \"\"\"Return sorted array.\"\"\"\n",
        "max_tokens": 200,
        "temperature": 0.3,
    },
).json()
print(response["generated_text"])
JavaScript/Node.js Client
const API_URL = "http://localhost:8000";

// Code completion
async function generate(prompt) {
  const response = await fetch(`${API_URL}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt,
      max_tokens: 128,
      temperature: 0.2,
    }),
  });
  return response.json();
}

const result = await generate("def binary_search(arr, target):");
console.log(result.generated_text);
Using with OpenAI SDK (with base_url replacement)
from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8000"
)

# Note: the SDK sends this call to POST {base_url}/completions, which is not
# among the endpoints documented above. Unless the server also exposes that
# route, adapter code is needed before this call will succeed.
response = client.completions.create(
    model="stack-2.9",
    prompt="def factorial(n):",
    max_tokens=100,
)
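One possible shape for such adapter code is a pair of pure functions that translate between an OpenAI-style completions body and the `/generate` request and response documented earlier. This is a sketch under assumptions: the function names are hypothetical, only a single string prompt is handled, and only the fields shown are mapped.

```python
import time
import uuid
from typing import Any, Dict


def completion_to_generate(body: Dict[str, Any]) -> Dict[str, Any]:
    """Map an OpenAI-style completions request body to a /generate payload."""
    if not isinstance(body.get("prompt"), str):
        raise ValueError("this sketch only supports a single string prompt")
    payload: Dict[str, Any] = {"prompt": body["prompt"]}
    # These three parameter names are shared by both request formats.
    for key in ("max_tokens", "temperature", "top_p"):
        if body.get(key) is not None:
            payload[key] = body[key]
    return payload


def generate_to_completion(result: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a /generate response in an OpenAI-style text_completion object."""
    return {
        "id": f"cmpl-{uuid.uuid4().hex}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": result["model"],
        "choices": [
            {
                "index": 0,
                "text": result["generated_text"],
                "finish_reason": result["finish_reason"],
            }
        ],
    }
```

A `/completions` route on the server could call the first function on the incoming body, run generation, and return the output of the second.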
Performance Tips
- GPU Recommended: For fastest inference, run on a GPU with CUDA
- Batch Processing: For multiple prompts, send requests sequentially (the model is loaded once and reused)
- Memory: Ensure adequate GPU memory; reduce `max_tokens` if needed
- Temperature: Use a lower temperature (0.1-0.3) for more deterministic code, and a higher one for creative tasks
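Sequential processing of several prompts can be sketched as a simple loop that records failures instead of stopping at the first one. `generate_sequentially` and its `generate` parameter are hypothetical names; `generate` stands for any callable that sends one prompt to `/generate` and returns the parsed response.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def generate_sequentially(
    prompts: Iterable[str],
    generate: Callable[[str], Dict],
) -> Tuple[List[Dict], List[Tuple[str, Exception]]]:
    """Send prompts one at a time; return (results, failures)."""
    results: List[Dict] = []
    failures: List[Tuple[str, Exception]] = []
    for prompt in prompts:
        try:
            results.append(generate(prompt))
        except Exception as exc:
            # Keep going so one bad prompt does not abort the whole batch.
            failures.append((prompt, exc))
    return results, failures
```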
Error Handling
503 Service Unavailable: Model not loaded or loading failed
{"detail": "Model not loaded. Check /health for status."}
500 Internal Server Error: Generation failed
{"detail": "Generation failed: <error message>"}
400 Bad Request: Invalid input
{"detail": "Last message must be from user"}
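On the client side, these status codes can be handled by retrying only on 503 (the model may still be loading) and raising for everything else. The sketch below assumes the `detail` field shown above; `ApiError`, `call_with_retry`, and the exponential backoff policy are illustrative choices, not part of the API.

```python
import time
from typing import Callable, Dict, Tuple


class ApiError(Exception):
    """Raised for any non-2xx response from the inference API."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(f"{status_code}: {detail}")
        self.status_code = status_code
        self.detail = detail


def call_with_retry(
    send: Callable[[], Tuple[int, Dict]],
    retries: int = 3,
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call send() and retry with exponential backoff on 503 only.

    send returns (status_code, parsed_json_body), for example:
        r = requests.post(url, json=payload); return r.status_code, r.json()
    """
    for attempt in range(retries + 1):
        status, body = send()
        if 200 <= status < 300:
            return body
        if status == 503 and attempt < retries:
            sleep(backoff_s * (2 ** attempt))
            continue
        raise ApiError(status, str(body.get("detail", "")))
    raise AssertionError("unreachable")
```

400 and 500 responses are not retried, since resending the same request is unlikely to change the outcome.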
Architecture Notes
- Single Model Instance: Model is loaded once at startup and reused
- Synchronous Generation: Uses `torch.no_grad()` for inference
- CORS Enabled: Accepts requests from any origin (configure for production)
- No Authentication: Add middleware (e.g., API key) for production deployments
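As one way to add the API key check mentioned above, the sketch below wraps any ASGI application (the kind of app `uvicorn` serves) in a small middleware. It is an illustration under assumptions: `ApiKeyMiddleware`, the `x-api-key` header name, and the 401 response body are hypothetical choices, not something the project ships.

```python
import hmac


class ApiKeyMiddleware:
    """Pure-ASGI middleware that rejects HTTP requests lacking the expected key."""

    def __init__(self, app, api_key: str) -> None:
        self.app = app
        self.api_key = api_key.encode()

    async def __call__(self, scope, receive, send) -> None:
        if scope["type"] != "http":
            # Pass lifespan and websocket events through untouched.
            await self.app(scope, receive, send)
            return
        # ASGI delivers headers as (name, value) byte pairs with lowercase names.
        headers = dict(scope.get("headers") or [])
        supplied = headers.get(b"x-api-key", b"")
        # Constant-time comparison avoids leaking key contents via timing.
        if hmac.compare_digest(supplied, self.api_key):
            await self.app(scope, receive, send)
            return
        body = b'{"detail": "Invalid or missing API key"}'
        await send({
            "type": "http.response.start",
            "status": 401,
            "headers": [
                (b"content-type", b"application/json"),
                (b"content-length", str(len(body)).encode()),
            ],
        })
        await send({"type": "http.response.body", "body": body})
```

Wrapping would look like `app = ApiKeyMiddleware(app, api_key=...)` before handing `app` to the server. Note that this sketch also rejects CORS preflight requests that do not carry the key, so browser clients would need an exemption for `OPTIONS`.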