# Using Stack 2.9 with Together AI

This guide explains how to use Stack 2.9 with Together AI as the model provider.

## Overview

Together AI hosts open models in the cloud and exposes them through an OpenAI-compatible API. Stack 2.9 connects to Together AI through that API, so you can use models such as:

- `togethercomputer/meta-llama-3-70b-instruct`
- `togethercomputer/CodeLlama-34b-instruct`
- `togethercomputer/Qwen2.5-Coder-32B-Instruct` (recommended for Stack 2.9)
- And many others from Together's model library
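
Because the integration goes through an OpenAI-compatible endpoint, a request is essentially a JSON POST with a bearer token. The sketch below builds such a request with the standard library only and does not send it. `build_chat_request` is a hypothetical helper for illustration, not part of Stack 2.9, and the base URL is the one used in the streaming example later in this guide.

```python
import json
import urllib.request

# Base URL taken from the streaming example in this guide.
TOGETHER_BASE_URL = "https://api.together.xyz/v1"


def build_chat_request(api_key, model, messages, **params):
    """Build (but do not send) an OpenAI-style chat completion request.

    Hypothetical helper for illustration; not part of Stack 2.9.
    """
    payload = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=f"{TOGETHER_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    api_key="your-together-api-key-here",
    model="togethercomputer/Qwen2.5-Coder-32B-Instruct",
    messages=[{"role": "user", "content": "Write a Python function to reverse a string"}],
    temperature=0.2,
)
print(request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

In practice the Stack 2.9 client or the `openai` package handles this for you; the sketch only shows what travels over the wire.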

## Prerequisites

1. **Together AI Account**: Sign up at [together.ai](https://together.ai)
2. **API Key**: Obtain your API key from the Together dashboard
3. **OpenAI Python Package**: Install `openai>=1.0.0` (required for Together client)

```bash
pip install "openai>=1.0.0"
```

## Environment Variables

Configure your environment with the following variables:

```bash
# Required: Together AI API key
export TOGETHER_API_KEY="your-together-api-key-here"

# Optional: Model selection (default: togethercomputer/Qwen2.5-Coder-32B-Instruct)
export TOGETHER_MODEL="togethercomputer/Qwen2.5-Coder-32B-Instruct"

# Optional: Provider configuration (for auto-detection)
export MODEL_PROVIDER="together"
```
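
If you want to validate these variables in your own scripts before creating a client, something like the following may help. It is a minimal sketch: `TogetherConfig` and `load_together_config` are hypothetical names, not part of Stack 2.9, and the fallback model is the default stated above.

```python
import os
from dataclasses import dataclass
from typing import Optional

# Default model as documented in the environment variable list above.
DEFAULT_MODEL = "togethercomputer/Qwen2.5-Coder-32B-Instruct"


@dataclass(frozen=True)
class TogetherConfig:
    api_key: str
    model: str
    provider: Optional[str]  # None when MODEL_PROVIDER is unset


def load_together_config(env=None):
    """Read the Together settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = (env.get("TOGETHER_API_KEY") or "").strip()
    if not api_key:
        raise ValueError("TOGETHER_API_KEY is not set")
    return TogetherConfig(
        api_key=api_key,
        model=env.get("TOGETHER_MODEL") or DEFAULT_MODEL,
        provider=env.get("MODEL_PROVIDER"),
    )


# Demo with an explicit mapping so it runs without touching the real environment.
config = load_together_config({"TOGETHER_API_KEY": "example-key"})
print(config.model)
```

Passing the mapping explicitly keeps the function easy to test; calling `load_together_config()` with no argument reads the real environment.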

### Setting up in Shell

Add these lines to your `~/.zshrc`, `~/.bashrc`, or shell profile:

```bash
# Together AI configuration
export TOGETHER_API_KEY="tog-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export TOGETHER_MODEL="togethercomputer/Qwen2.5-Coder-32B-Instruct"
```

Then reload your shell:

```bash
source ~/.zshrc  # or ~/.bashrc
```

### Using a .env file (recommended for development)

Create a `.env` file in your project root:

```env
TOGETHER_API_KEY=tog-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TOGETHER_MODEL=togethercomputer/Qwen2.5-Coder-32B-Instruct
MODEL_PROVIDER=together
```

Then load it with `python-dotenv`:

```bash
pip install python-dotenv
```

And in your Python script:

```python
from dotenv import load_dotenv
load_dotenv()  # loads .env file
```

## Usage Examples

### Command Line

Use the built-in CLI with the Together provider:

```bash
# Using the default model (togethercomputer/Qwen2.5-Coder-32B-Instruct)
python stack.py --provider together "Write a Python function to reverse a string"

# Using a specific model (override env var)
TOGETHER_MODEL=togethercomputer/Qwen2.5-Coder-32B-Instruct python stack.py --provider together "def factorial(n):"
```

### Python API

```python
from stack_2_9_eval.model_client import create_model_client

# Create Together client (reads TOGETHER_API_KEY from env)
client = create_model_client(provider="together")

# Or specify explicitly
client = create_model_client(
    provider="together",
    model="togethercomputer/Qwen2.5-Coder-32B-Instruct",
    api_key="your-api-key"
)

# Generate code
result = client.generate(
    prompt="Write a Python function to sort a list using quicksort",
    temperature=0.2,
    max_tokens=1024
)

print(result.text)
```

### Chat Mode

```python
from stack_2_9_eval.model_client import create_model_client, ChatMessage

client = create_model_client(provider="together")

messages = [
    ChatMessage(role="system", content="You are an expert Python programmer."),
    ChatMessage(role="user", content="How do I read a JSON file in Python?"),
]

result = client.chat(messages, temperature=0.2, max_tokens=512)
print(result.text)
```

### Using with Tool Calls

```python
from stack_2_9_eval.model_client import create_model_client, ChatMessage

client = create_model_client(provider="together")

# JSON Schema description of the tool the model is allowed to call
tools = [
    {
        "type": "function",
        "function": {
            "name": "FileReadTool",
            "description": "Read file contents",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File path"}
                },
                "required": ["path"]
            }
        }
    }
]

messages = [
    ChatMessage(role="user", content="Read the file 'config.yaml' and tell me what's in it")
]

result = client.chat(messages, temperature=0.2, tools=tools)
print(result.text)
# If the model requested a tool, the call is in result.raw_response (tool_calls)
```
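
Once the model requests a tool, your code has to run it and send the result back. The sketch below shows one way to dispatch such calls. It assumes `raw_response` is a dict shaped like an OpenAI chat completion, which this guide does not confirm, so adapt the field access to whatever your client actually returns. `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path


def file_read_tool(path):
    """Local implementation backing the FileReadTool schema."""
    return Path(path).read_text(encoding="utf-8")


# Maps tool names from the schema to local callables.
TOOL_REGISTRY = {"FileReadTool": file_read_tool}


def run_tool_calls(raw_response):
    """Execute each requested tool and return OpenAI-style tool messages.

    Assumes raw_response is a dict shaped like an OpenAI chat completion.
    """
    message = raw_response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        handler = TOOL_REGISTRY.get(fn["name"])
        if handler is None:
            content = f"Unknown tool: {fn['name']}"
        else:
            # Arguments arrive as a JSON-encoded string.
            content = handler(**json.loads(fn["arguments"]))
        results.append({"role": "tool", "tool_call_id": call["id"], "content": content})
    return results


# Demo with a hand-written response instead of a live API call.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.yaml"
    config_path.write_text("debug: true\n", encoding="utf-8")
    fake_response = {
        "choices": [{"message": {"role": "assistant", "content": None, "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "FileReadTool",
                         "arguments": json.dumps({"path": str(config_path)})},
        }]}}]
    }
    tool_messages = run_tool_calls(fake_response)

print(tool_messages)
```

The returned messages would then be appended to the conversation and sent back in a follow-up chat call so the model can produce its final answer.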

## Recommended Models

For Stack 2.9 use cases (coding + tool use), these Together models are recommended:

### Primary Recommendation

**`togethercomputer/qwen2.5-coder-32b-instruct`**
- Matches Stack 2.9's base model (Qwen2.5-Coder-32B)
- Excellent code generation
- Strong tool-calling capabilities
- Cost-effective: ~$0.22 / 1M tokens (input)
- Use this for production Stack 2.9 deployments

### Alternatives

**`togethercomputer/meta-llama-3-70b-instruct`**
- Larger model (70B) with strong reasoning
- Slightly higher cost but excellent quality
- Good for complex problem-solving

**`togethercomputer/codellama-34b-instruct`**
- Code-specialized Llama 34B
- Good performance, lower cost than 70B models

**`togethercomputer/qwen2.5-72b-instruct`**
- 72B variant of Qwen2.5 (if you need maximum quality)
- Higher cost and latency

### Model Selection Tips

- **Match training distribution**: Use Qwen models for Stack 2.9 pattern compatibility
- **Budget**: 32B-34B models offer the best price/performance for coding tasks
- **Latency**: Smaller models (7B-13B) are faster but less capable
- **Throughput**: Consider batching for large-scale usage

## Cost Estimation

Together AI pricing (as of 2025, check their site for current rates):

| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|-------|---------------------|----------------------|
| Qwen2.5-Coder-32B | ~0.22 | ~0.22 |
| Meta-Llama-3-70B | ~0.70 | ~0.70 |
| CodeLlama-34B | ~0.22 | ~0.22 |
| Qwen2.5-72B | ~0.70 | ~0.70 |

### Example Cost Calculation

Suppose your typical usage is:
- 100 queries/day
- Average 2,000 tokens per query (input + output)
- Using Qwen2.5-Coder-32B (input and output both priced at ~$0.22 / 1M tokens)

Then:
- Daily cost: `(100 * 2,000 / 1,000,000) * $0.22 = $0.044`
- Monthly cost (30 days): `$0.044 * 30 ≈ $1.32`

**Very affordable for development and light production use.**
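
To rerun this arithmetic with your own numbers, a small helper is enough. This is a sketch that assumes a single flat price for input and output tokens, which matches the table above for these models but may not hold for others.

```python
def estimate_cost(queries_per_day, tokens_per_query, price_per_million, days=30):
    """Return (daily, period) cost in dollars for a flat per-token price."""
    daily_tokens = queries_per_day * tokens_per_query
    daily = daily_tokens / 1_000_000 * price_per_million
    return daily, daily * days


daily, monthly = estimate_cost(100, 2000, 0.22)
print(f"Daily: ${daily:.3f}, monthly: ${monthly:.2f}")
# prints: Daily: $0.044, monthly: $1.32
```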

## Performance Considerations

- **Latency**: Expect 100-500ms per request depending on model size and complexity
- **Rate Limits**: Together provides generous rate limits (check your plan)
- **Throughput**: Use concurrent requests for batch processing (respect rate limits)
- **Streaming**: Together supports streaming; for long generations, pass `stream=True` when calling the OpenAI library directly (see Streaming Responses)
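
For the throughput point, one simple approach is a thread pool with a fixed number of workers, which caps how many requests are in flight at once. The sketch below uses a stand-in function in place of a real client call; `generate_batch` is a hypothetical helper, not part of Stack 2.9.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_batch(generate_fn, prompts, max_workers=4):
    """Run generate_fn over prompts concurrently, preserving input order.

    max_workers caps the number of in-flight requests.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_fn, prompts))


# Stand-in for something like: lambda p: client.generate(prompt=p).text
def fake_generate(prompt):
    return prompt.upper()


results = generate_batch(fake_generate, ["sort a list", "reverse a string"], max_workers=2)
print(results)
```

Capping workers limits concurrency, not requests per second, so a plan with a strict per-minute limit may still need retries or an explicit rate limiter on top.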

## Error Handling

Implement robust error handling for production:

```python
import os
import time

from stack_2_9_eval.model_client import create_model_client

def generate_with_retry(client, prompt, max_retries=3):
    """Call client.generate, retrying failed attempts with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.generate(prompt=prompt, temperature=0.2, max_tokens=1024)
        except Exception as e:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, 4s, ...
            print(f"Error: {e}. Retrying in {wait}s...")
            time.sleep(wait)

client = create_model_client(provider="together", api_key=os.getenv("TOGETHER_API_KEY"))
result = generate_with_retry(client, "Write a function to calculate prime numbers")
```

## Comparison with Other Providers

| Feature | Together AI | Ollama (local) | OpenAI | Anthropic |
|---------|-------------|----------------|--------|-----------|
| Cost (32B class) | Low (~$0.22/M) | Free (your hardware) | High (~$3/M) | High (~$3/M) |
| Qwen2.5-Coder | ✅ Supported | ✅ Via pull | ❌ No | ❌ No |
| Privacy | Cloud (check TOS) | Full local | Cloud | Cloud |
| Latency | Medium | Fast (local) | Medium | Medium |
| Setup Complexity | Low (API key) | Medium (install) | Low | Low |
| Rate Limits | Generous | Unlimited | Pay-as-you-go | Pay-as-you-go |
| Tool Calling | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |

**Best for Stack 2.9**: Together AI when you need cloud access and Qwen models without running locally.

## Troubleshooting

### API Key Errors

```
ValueError: Together AI API key required.
```

**Solution**: Set the `TOGETHER_API_KEY` environment variable or pass the `api_key` parameter.

### Model Not Found

```
openai.BadRequestError: The model '...' does not exist
```

**Solution**: Check model name spelling. Browse available models at [Together Models](https://together.ai/models). Use full model ID like `togethercomputer/qwen2.5-coder-32b-instruct`.
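
A quick way to catch spelling or casing mistakes is to compare the requested ID against a list of known IDs before calling the API. The sketch below uses `difflib` from the standard library; `suggest_model` is a hypothetical helper and the `available` list is illustrative, copied from the IDs mentioned in this guide rather than fetched from Together.

```python
import difflib


def suggest_model(requested, available):
    """Return the exact ID if present, else the closest match or None."""
    if requested in available:
        return requested
    # Compare case-insensitively, then map back to the original spelling.
    by_lower = {model_id.lower(): model_id for model_id in available}
    matches = difflib.get_close_matches(
        requested.lower(), list(by_lower), n=1, cutoff=0.6
    )
    return by_lower[matches[0]] if matches else None


# Illustrative list; in practice take the IDs from Together's model library.
available = [
    "togethercomputer/Qwen2.5-Coder-32B-Instruct",
    "togethercomputer/meta-llama-3-70b-instruct",
    "togethercomputer/CodeLlama-34b-instruct",
]
print(suggest_model("togethercomputer/qwen2.5-coder-32b-instuct", available))
```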

### Rate Limit Exceeded

**Solution**: Add retry logic with exponential backoff. Consider upgrading your Together plan.
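
One possible shape for that retry logic is capped exponential backoff with random jitter, retrying only errors you classify as transient. The sketch below is self-contained and uses a simulated error; `call_with_backoff`, `backoff_delay`, and `FakeRateLimitError` are hypothetical names introduced for illustration.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=30.0, rng=random.random):
    """Delay before retry `attempt` (0-based): capped exponential with full jitter."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_backoff(fn, is_retryable, max_retries=5, sleep=time.sleep):
    """Call fn(), retrying only when is_retryable(exc) is true."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries - 1 or not is_retryable(exc):
                raise
            sleep(backoff_delay(attempt))


# Demo: fail twice with a simulated rate-limit error, then succeed.
class FakeRateLimitError(Exception):
    pass


calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeRateLimitError("429 Too Many Requests")
    return "ok"


slept = []
result = call_with_backoff(
    flaky,
    is_retryable=lambda exc: isinstance(exc, FakeRateLimitError),
    sleep=slept.append,  # record delays instead of actually sleeping
)
print(result, len(slept))
```

When calling the `openai` package directly, the retryable check would typically test for `openai.RateLimitError`; which exception the Stack 2.9 client surfaces is not specified here, so verify that before relying on it.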

### Import Errors

```
ImportError: openai package required
```

**Solution**: Run `pip install "openai>=1.0.0"`.

## Advanced Configuration

### Custom Base URL

If you need to use a custom endpoint (e.g., for regional deployments):

```python
from stack_2_9_eval.model_client import create_model_client

client = create_model_client(
    provider="together",
    model="togethercomputer/qwen2.5-coder-32b-instruct",
    base_url="https://your-custom-endpoint.together.ai/v1"
)
```

### Timeouts and Retries

```python
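import os
# Assumption: TogetherClient is importable from the same module as create_model_client
from stack_2_9_eval.model_client import TogetherClient

client = TogetherClient(
    model="togethercomputer/qwen2.5-coder-32b-instruct",
    api_key=os.getenv("TOGETHER_API_KEY"),
    timeout=300  # request timeout in seconds (5 minutes)
)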
```

### Streaming Responses

For long generations, use streaming. This requires either modifying the Stack 2.9 client or calling the OpenAI library directly:

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.getenv("TOGETHER_API_KEY"), base_url="https://api.together.xyz/v1")

stream = client.chat.completions.create(
    model="togethercomputer/qwen2.5-coder-32b-instruct",
    messages=[{"role": "user", "content": "Write a detailed explanation of binary search"}],
    stream=True
)

# Print each text fragment as soon as it arrives
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
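
If you also need the complete text once streaming finishes, you can accumulate the fragments while forwarding them. The sketch below does that against stand-in chunk objects, so it runs without network access; `collect_stream` and `fake_chunk` are hypothetical helpers that mirror the `chunk.choices[0].delta.content` attribute path used above.

```python
from types import SimpleNamespace


def collect_stream(stream, on_text=None):
    """Join the text deltas of an OpenAI-style stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            if on_text is not None:
                on_text(text)  # e.g. print the fragment as it arrives
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk("Binary "),
    fake_chunk(None),
    fake_chunk("search"),
    SimpleNamespace(choices=[]),
]
full_text = collect_stream(fake_stream)
print(full_text)
# prints: Binary search
```

With a real stream you would pass the object returned by `client.chat.completions.create(..., stream=True)` and, for live output, something like `on_text=lambda t: print(t, end="", flush=True)`.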

## Integration with Stack 2.9 CLI

To make Together AI the default provider:

```bash
# Set environment variables permanently
echo 'export MODEL_PROVIDER="together"' >> ~/.zshrc
echo 'export TOGETHER_MODEL="togethercomputer/qwen2.5-coder-32b-instruct"' >> ~/.zshrc
source ~/.zshrc
```

Now `stack.py` will automatically use Together AI without the `--provider` flag.

## Security Best Practices

1. **Never commit API keys** to version control. Use environment variables, or a `.env` file that is listed in `.gitignore`.
2. **Rotate keys** periodically from the Together dashboard.
3. **Use minimal permissions**: Together API keys have full access, so share them only with the services that need them.
4. **Enable billing alerts** to avoid unexpected charges.
5. **Review Together's TOS** for data usage and privacy policies.

## Support

- **Together Documentation**: https://docs.together.io/
- **Stack 2.9 Issues**: https://github.com/my-ai-stack/stack-2.9/issues
- **Model Cards**: See `MODEL_CARD.md` for Stack 2.9 details

---

**Last Updated**: 2025-04-02  
**Compatible Stack 2.9 Version**: 2.9.0+