
# Stack 2.9 Model Registry

Version tracking for all Stack 2.9 model variants.


## Model Versions

| Version | Status | Date | Base Model | Parameters | Dataset | Performance | Use Case |
|---------|--------|------|------------|------------|---------|-------------|----------|
| stack-2.9-1.5B | 🟡 In Training | 2026-04-06 | Llama 3.2-1B | 1.5B | Stack 2.9 dedup | TBD | Research, fine-tuning base |
| stack-2.9-7B | 🔴 Planned | TBD | Llama 3.1-8B | 7B | Stack 2.9 dedup | TBD | General-purpose inference |
| stack-2.9-7B-QLoRA | 🔴 Planned | TBD | Llama 3.1-8B | 7B (quantized) | Stack 2.9 dedup | TBD | Edge deployment, low-memory |

## Version Details

### stack-2.9-1.5B (Current)

- **Status:** In Training
- **Architecture:** Transformer (pretrained)
- **Base Model:** Llama 3.2-1B
- **Parameters:** 1.5B
- **Training Data:** Stack 2.9 deduplicated
- **Context Length:** 128K tokens
- **Vocabulary Size:** ~128K
- **Precision:** BF16
- **Training Hardware:** 8x H100 (to be confirmed)
- **Expected Completion:** TBD
- **Notes:** First iteration of Stack 2.9, used as the baseline for larger variants; a loading sketch follows this list
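
Once training completes and a checkpoint is published, loading should look roughly like the sketch below. The repo id `stack-2.9/stack-2.9-1.5B` is hypothetical (no checkpoint has been released yet), and the dtype matches the BF16 training precision listed above.

```python
# Minimal loading sketch. The repo id is hypothetical, since the
# stack-2.9-1.5B checkpoint has not been published yet.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stack-2.9/stack-2.9-1.5B"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the BF16 training precision above
    device_map="auto",
)

# Quick smoke test on a code-completion prompt.
inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```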

### stack-2.9-7B (Planned)

- **Status:** Planned
- **Architecture:** Transformer (pretrained)
- **Base Model:** Llama 3.1-8B
- **Parameters:** 7B
- **Training Data:** Stack 2.9 deduplicated
- **Context Length:** 128K tokens
- **Vocabulary Size:** ~128K
- **Precision:** BF16
- **Training Hardware:** TBD
- **Expected Start:** TBD
- **Notes:** Scale-up from the 1.5B variant, targeting general-purpose use

### stack-2.9-7B-QLoRA (Planned)

- **Status:** Planned
- **Architecture:** Transformer + QLoRA
- **Base Model:** Llama 3.1-8B
- **Parameters:** 7B (4-bit quantized)
- **Training Data:** Stack 2.9 deduplicated
- **Context Length:** 128K tokens
- **Vocabulary Size:** ~128K
- **Quantization:** 4-bit NF4
- **LoRA Rank:** TBD
- **LoRA Alpha:** TBD
- **LoRA Dropout:** TBD
- **Target Modules:** TBD
- **Notes:** Quantized for consumer-GPU deployment (e.g., 24 GB VRAM); see the configuration sketch after this list
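
Once the LoRA hyperparameters above are finalized, the fine-tuning setup is expected to look roughly like the sketch below, using Hugging Face `transformers`, `bitsandbytes`, and `peft`. The rank, alpha, dropout, and target-module values are placeholders, since the registry lists them as TBD.

```python
# QLoRA configuration sketch: 4-bit NF4 quantization with BF16 compute,
# matching the specs above. All LoRA hyperparameters are placeholders (TBD).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NF4, as specified above
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 compute dtype
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # base model listed in the registry
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                 # placeholder: LoRA rank is TBD
    lora_alpha=32,                        # placeholder: LoRA alpha is TBD
    lora_dropout=0.05,                    # placeholder: dropout is TBD
    target_modules=["q_proj", "v_proj"],  # placeholder: target modules are TBD
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```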

## Changelog

| Date | Version | Change |
|------|---------|--------|
| 2026-04-06 | stack-2.9-1.5B | Initial entry — training started |