Instructions to use my-ai-stack/Stack-2-9-finetuned with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use my-ai-stack/Stack-2-9-finetuned with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="my-ai-stack/Stack-2-9-finetuned")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("my-ai-stack/Stack-2-9-finetuned")
model = AutoModelForCausalLM.from_pretrained("my-ai-stack/Stack-2-9-finetuned")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use my-ai-stack/Stack-2-9-finetuned with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "my-ai-stack/Stack-2-9-finetuned"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/my-ai-stack/Stack-2-9-finetuned

SGLang

How to use my-ai-stack/Stack-2-9-finetuned with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "my-ai-stack/Stack-2-9-finetuned" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "my-ai-stack/Stack-2-9-finetuned" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use my-ai-stack/Stack-2-9-finetuned with Docker Model Runner:
```
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
```

Stack-2-9-finetuned / docs /API.md

walidsobhie-code

feat: add inference API, quickstart guide, roadmap, and combined tool data

b03a8a0 about 2 months ago

Stack 2.9 Inference API Documentation

REST API for code generation using the Stack 2.9 fine-tuned Qwen model.

Quick Start

1. Install Dependencies

pip install -r requirements_api.txt
pip install -r requirements.txt  # Core dependencies (transformers, torch, etc.)

2. Set Model Path

# Option A: Environment variable
export MODEL_PATH=/path/to/your/merged/model

# Option B: Set the variable inline for a single command
MODEL_PATH=/path/to/model uvicorn inference_api:app --port 8000

3. Start the Server

# Basic usage
uvicorn inference_api:app --host 0.0.0.0 --port 8000

# With auto-reload (development)
uvicorn inference_api:app --reload --port 8000

# Using Python directly
python inference_api.py

4. Verify It's Running

curl http://localhost:8000/health

Expected response:

{
  "status": "healthy",
  "model_loaded": true,
  "model_path": "base_model_qwen7b",
  "device": "cuda",
  "cuda_available": true
}
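
Loading a model can take a while, so a script that starts the server may want to wait for `model_loaded` before sending requests. The following is a minimal sketch using only the standard library; the names `fetch_health` and `wait_until_ready`, the retry count, and the delay are illustrative assumptions, not part of the API.

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:8000"):
    """GET /health and return the parsed JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
        return json.load(resp)


def wait_until_ready(fetch=fetch_health, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll until the health payload reports model_loaded, or give up."""
    for attempt in range(attempts):
        try:
            if fetch().get("model_loaded"):
                return True
        except OSError:
            # Connection refused or timed out: the server is probably still starting.
            pass
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready()` with the defaults polls `http://localhost:8000/health` for up to about a minute and returns `False` if the model never reports as loaded.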

API Endpoints

`GET /health`

Health check endpoint to verify API and model status.

Response:

{
  "status": "healthy",
  "model_loaded": true,
  "model_path": "/path/to/model",
  "device": "cuda",
  "cuda_available": true
}

`GET /model-info`

Get information about the currently loaded model.

Response:

{
  "model_path": "/path/to/model",
  "device": "cuda:0",
  "dtype": "torch.float16"
}

`POST /generate`

Generate a code completion for a prompt.

Request Body:

{
  "prompt": "def two_sum(nums, target):\n    \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
  "max_tokens": 128,
  "temperature": 0.2,
  "top_p": 0.95,
  "do_sample": true,
  "repetition_penalty": 1.1,
  "num_return_sequences": 1
}

Parameters:

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| `prompt` | string | required | - | Input prompt to complete |
| `max_tokens` | int | 512 | 1-4096 | Maximum tokens to generate |
| `temperature` | float | 0.2 | 0.0-2.0 | Sampling temperature (higher = more creative) |
| `top_p` | float | 0.95 | 0.0-1.0 | Nucleus sampling threshold |
| `do_sample` | bool | true | - | Whether to use sampling vs greedy |
| `repetition_penalty` | float | 1.1 | 1.0-2.0 | Penalize repeated tokens |
| `num_return_sequences` | int | 1 | 1-10 | Number of sequences to generate |
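
A client can check a payload against these ranges before sending it, which gives clearer messages than a rejected request. This is a sketch only: `LIMITS` and `validate_generate_payload` are hypothetical names, and treating the ranges as inclusive is an assumption, since the server's own validation is not shown here.

```python
# Ranges copied from the parameter table: name -> (type, low, high).
# Assumption: both bounds are inclusive.
LIMITS = {
    "max_tokens": (int, 1, 4096),
    "temperature": (float, 0.0, 2.0),
    "top_p": (float, 0.0, 1.0),
    "repetition_penalty": (float, 1.0, 2.0),
    "num_return_sequences": (int, 1, 10),
}


def validate_generate_payload(payload):
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    if not isinstance(payload.get("prompt"), str):
        problems.append("prompt is required and must be a string")
    for name, (kind, low, high) in LIMITS.items():
        if name not in payload:
            continue  # omitted fields fall back to the documented defaults
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        allowed = (int,) if kind is int else (int, float)
        if isinstance(value, bool) or not isinstance(value, allowed):
            problems.append(f"{name} must be of type {kind.__name__}")
        elif not low <= value <= high:
            problems.append(f"{name} must be between {low} and {high}")
    if "do_sample" in payload and not isinstance(payload["do_sample"], bool):
        problems.append("do_sample must be a boolean")
    return problems
```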

Response:

{
  "generated_text": "    seen = {}\n    for i, num in enumerate(nums):\n        complement = target - num\n        if complement in seen:\n            return [seen[complement], i]\n        seen[num] = i\n    return []",
  "prompt": "def two_sum(nums, target):\n    \"\"\"Return indices of two numbers that add up to target.\"\"\"\n",
  "model": "base_model_qwen7b",
  "num_tokens": 45,
  "finish_reason": "stop"
}

Example with curl:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "def fibonacci(n):\n    \"\"\"Return first n Fibonacci numbers.\"\"\"\n",
    "max_tokens": 100,
    "temperature": 0.2
  }'

`POST /chat`

Conversational interface for multi-turn interactions.

Request Body:

{
  "messages": [
    {"role": "user", "content": "Write a function to reverse a string in Python"},
    {"role": "assistant", "content": "def reverse_string(s):\n    return s[::-1]"},
    {"role": "user", "content": "Make it recursive instead"}
  ],
  "max_tokens": 128,
  "temperature": 0.2,
  "top_p": 0.95
}

Message Roles:

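user - A message from the user
assistant - A previous model response, included as conversation history

The last message must have the user role; otherwise the API returns 400 Bad Request with the detail "Last message must be from user".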

Response:

{
  "message": {
    "role": "assistant",
    "content": "def reverse_string(s):\n    if len(s) <= 1:\n        return s\n    return s[-1] + reverse_string(s[:-1])"
  },
  "model": "base_model_qwen7b",
  "num_tokens": 67,
  "finish_reason": "stop"
}

Example with curl:

curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a binary search function"}
    ],
    "max_tokens": 150
  }'
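
For multi-turn use, the client keeps the history and appends each assistant reply before adding the next user message. The sketch below shows one way to do that and to catch an invalid history before sending it. `check_messages` and `next_turn` are hypothetical helpers, and the assumption that only the `user` and `assistant` roles are accepted is taken from the roles listed above, not from the server code.

```python
ALLOWED_ROLES = {"user", "assistant"}  # assumption: only the documented roles


def check_messages(messages):
    """Raise ValueError if the history would likely be rejected by /chat."""
    if not messages:
        raise ValueError("messages must not be empty")
    for msg in messages:
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {msg.get('role')!r}")
    if messages[-1]["role"] != "user":
        raise ValueError("Last message must be from user")


def next_turn(history, chat_response, user_text):
    """Return a new history with the assistant reply and the next user message appended."""
    return history + [chat_response["message"], {"role": "user", "content": user_text}]
```

`next_turn` returns a new list instead of mutating `history`, so a failed request does not leave the conversation in a half-updated state.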

`POST /generate/raw`

Same as /generate but returns raw output without extracting code from markdown blocks.

Example with curl:

curl -X POST http://localhost:8000/generate/raw \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "def quick_sort(arr):",
    "max_tokens": 200
  }'

`POST /extract-code`

Extract code from a text response that may contain markdown code blocks.

Request Body:

{
  "prompt": "```python\ndef hello():\n    print(\"world\")\n```"
}

Response:

{
  "code": "def hello():\n    print(\"world\")"
}
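
To make the behavior concrete, here is a rough local equivalent of the extraction step. It is a sketch under assumptions, not the server's implementation: it returns only the first fenced block and falls back to the input text when no fence is found, and the real endpoint may handle those cases differently.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block

# Opening fence with an optional language tag, then the body up to the closing fence.
CODE_BLOCK = re.compile(FENCE + r"[\w+-]*[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(text):
    """Return the first fenced code block (or the whole text), minus surrounding newlines."""
    match = CODE_BLOCK.search(text)
    # Strip only newlines so that leading indentation in the code is preserved.
    return (match.group(1) if match else text).strip("\n")
```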

Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MODEL_PATH` | `base_model_qwen7b` | Path to model directory |
| `DEVICE` | `cuda` (if available) | Device to use: `cuda` or `cpu` |
| `PORT` | `8000` | Server port |
| `HOST` | `0.0.0.0` | Server host |
| `RELOAD` | `false` | Enable auto-reload for development |
| `DEFAULT_MAX_TOKENS` | `512` | Default max tokens |
| `DEFAULT_TEMPERATURE` | `0.2` | Default temperature |
| `DEFAULT_TOP_P` | `0.95` | Default top_p |
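
The sketch below shows how these variables could be read with their documented defaults. It illustrates the table and does not reproduce the code in `inference_api.py`: the `Settings` class, the `load_settings` function, the fallback to `cpu` when CUDA is unavailable, and the accepted spellings of a true `RELOAD` value are all assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model_path: str
    device: str
    host: str
    port: int
    reload: bool
    default_max_tokens: int
    default_temperature: float
    default_top_p: float


def load_settings(env=os.environ, cuda_available=False):
    """Build Settings from environment variables, using the documented defaults."""
    return Settings(
        model_path=env.get("MODEL_PATH", "base_model_qwen7b"),
        # Assumption: fall back to "cpu" when CUDA is not available.
        device=env.get("DEVICE", "cuda" if cuda_available else "cpu"),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        # Assumption: "true", "1" and "yes" (case-insensitive) enable reload.
        reload=env.get("RELOAD", "false").lower() in {"true", "1", "yes"},
        default_max_tokens=int(env.get("DEFAULT_MAX_TOKENS", "512")),
        default_temperature=float(env.get("DEFAULT_TEMPERATURE", "0.2")),
        default_top_p=float(env.get("DEFAULT_TOP_P", "0.95")),
    )
```

Passing `env` as a plain dictionary keeps the function easy to test without touching the real process environment.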

Usage Examples

Python Client

import requests

API_URL = "http://localhost:8000"

# Health check
health = requests.get(f"{API_URL}/health").json()
print(f"Model loaded: {health['model_loaded']}")

# Code completion
response = requests.post(
    f"{API_URL}/generate",
    json={
        "prompt": "def merge_sort(arr):\n    \"\"\"Return sorted array.\"\"\"\n",
        "max_tokens": 200,
        "temperature": 0.3,
    }
).json()

print(response["generated_text"])

JavaScript/Node.js Client

const API_URL = "http://localhost:8000";

// Code completion
async function generate(prompt) {
  const response = await fetch(`${API_URL}/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      prompt,
      max_tokens: 128,
      temperature: 0.2,
    }),
  });
  return response.json();
}

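// Top-level await requires an ES module (for example, a .mjs file);
// the global fetch requires Node.js 18 or later.
const result = await generate("def binary_search(arr, target):");
console.log(result.generated_text);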

Using with OpenAI SDK (with base_url replacement)

from openai import OpenAI

client = OpenAI(
    api_key="not-needed",
    base_url="http://localhost:8000"
)

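# Note: the SDK sends this call as POST {base_url}/completions, which is not
# among the endpoints documented above, so it likely needs an adapter route
# on the server that maps the request to /generate.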
response = client.completions.create(
    model="stack-2.9",
    prompt="def factorial(n):",
    max_tokens=100,
)
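
The adapter mentioned in the note could translate between the two request and response shapes roughly as follows. This is a hedged sketch of the mapping only: the function names are hypothetical, it assumes a single string prompt and a single returned sequence, and wiring it into a server route is left out.

```python
import time
import uuid


def completion_to_generate(body):
    """Map an OpenAI-style completions request body to a /generate payload."""
    payload = {"prompt": body["prompt"]}  # assumption: prompt is a single string
    for name in ("max_tokens", "temperature", "top_p"):
        if body.get(name) is not None:
            payload[name] = body[name]
    return payload


def generate_to_completion(result, requested_model):
    """Wrap a /generate response in an OpenAI-style text_completion object."""
    return {
        "id": f"cmpl-{uuid.uuid4().hex}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": requested_model,
        "choices": [{
            "index": 0,
            "text": result["generated_text"],
            "finish_reason": result.get("finish_reason"),
            "logprobs": None,
        }],
    }
```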

Performance Tips

GPU Recommended: For fastest inference, run on a CUDA-capable GPU
Batch Processing: For multiple prompts, send requests one after another; the model is loaded once and reused across requests
Memory: Ensure adequate GPU memory; reduce max_tokens if needed
Temperature: Use a lower temperature (0.1-0.3) for more predictable code and a higher one for creative tasks; for fully deterministic output, set do_sample to false (greedy decoding)
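
The batch and temperature tips can be combined in a small helper that sends prompts one at a time. This is a sketch: `generate_many` is a hypothetical name, and `send` is any callable you supply that posts a payload to `/generate` and returns the parsed JSON response.

```python
def generate_many(prompts, send, deterministic=False, **params):
    """Call send(payload) once per prompt, in order, and collect the generated text."""
    payload_base = dict(params)
    if deterministic:
        # Greedy decoding: request do_sample=false instead of relying on a low temperature.
        payload_base["do_sample"] = False
    results = []
    for prompt in prompts:
        response = send({"prompt": prompt, **payload_base})
        results.append(response["generated_text"])
    return results
```

With the `requests` library, `send` could be `lambda payload: requests.post(f"{API_URL}/generate", json=payload).json()`.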

Error Handling

503 Service Unavailable: Model not loaded or loading failed

{"detail": "Model not loaded. Check /health for status."}

500 Internal Server Error: Generation failed

{"detail": "Generation failed: <error message>"}

400 Bad Request: Invalid input

{"detail": "Last message must be from user"}
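
On the client side, these status codes can be handled with a small wrapper. The sketch below retries only on 503, on the assumption that the model may still be loading, and raises immediately for 400 and 500. `ApiError`, `call_with_retry`, and the retry settings are illustrative, and retrying will not help if loading has failed for good.

```python
import time


class ApiError(Exception):
    """Carries the HTTP status and the "detail" field of an error response."""

    def __init__(self, status, detail):
        super().__init__(f"{status}: {detail}")
        self.status = status
        self.detail = detail


def call_with_retry(request_once, retries=3, delay=2.0, sleep=time.sleep):
    """Call request_once() -> (status, body); retry on 503, raise ApiError otherwise."""
    for attempt in range(retries + 1):
        status, body = request_once()
        if status == 200:
            return body
        detail = body.get("detail", "") if isinstance(body, dict) else str(body)
        # 503 may be temporary (model still loading); 400 and 500 are raised at once.
        if status != 503 or attempt == retries:
            raise ApiError(status, detail)
        sleep(delay)
```

Here `request_once` is a callable you provide that performs one HTTP request and returns the status code together with the parsed JSON body.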

Architecture Notes

Single Model Instance: Model is loaded once at startup and reused
Synchronous Generation: Generation runs synchronously, and inference is wrapped in torch.no_grad() to disable gradient tracking
CORS Enabled: Accepts requests from any origin (configure for production)
No Authentication: Add middleware (e.g., API key) for production deployments
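
As a starting point for the API key suggestion, the core check can be written independently of the web framework. This sketch assumes a hypothetical `X-API-Key` request header and leaves out the wiring into the server's middleware.

```python
import hmac


def is_authorized(headers, expected_key):
    """Return True if the (hypothetical) X-API-Key header matches expected_key."""
    # Header names are case-insensitive, so normalize before looking up.
    supplied = {k.lower(): v for k, v in headers.items()}.get("x-api-key", "")
    # compare_digest runs in constant time, which avoids leaking the key via timing.
    # An empty expected_key never authorizes anything.
    return bool(expected_key) and hmac.compare_digest(supplied.encode(), expected_key.encode())
```

A middleware would call this for every request and respond with 401 when it returns `False`.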