Text Generation
Transformers
English
qwen2
code-generation
python
fine-tuning
Qwen
tools
agent-framework
multi-agent
conversational
Eval Results (legacy)
Instructions to use my-ai-stack/Stack-2-9-finetuned with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use my-ai-stack/Stack-2-9-finetuned with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="my-ai-stack/Stack-2-9-finetuned") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("my-ai-stack/Stack-2-9-finetuned") model = AutoModelForCausalLM.from_pretrained("my-ai-stack/Stack-2-9-finetuned") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use my-ai-stack/Stack-2-9-finetuned with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "my-ai-stack/Stack-2-9-finetuned" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "my-ai-stack/Stack-2-9-finetuned", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
- SGLang
How to use my-ai-stack/Stack-2-9-finetuned with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "my-ai-stack/Stack-2-9-finetuned" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "my-ai-stack/Stack-2-9-finetuned", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "my-ai-stack/Stack-2-9-finetuned" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "my-ai-stack/Stack-2-9-finetuned", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use my-ai-stack/Stack-2-9-finetuned with Docker Model Runner:
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
| # Stack 2.9 Inference API Documentation | |
| REST API for code generation using the Stack 2.9 fine-tuned Qwen model. | |
| ## Quick Start | |
| ### 1. Install Dependencies | |
| ```bash | |
| pip install -r requirements_api.txt | |
| pip install -r requirements.txt # Core dependencies (transformers, torch, etc.) | |
| ``` | |
| ### 2. Set Model Path | |
| ```bash | |
| # Option A: Environment variable | |
| export MODEL_PATH=/path/to/your/merged/model | |
| # Option B: Direct parameter | |
| MODEL_PATH=/path/to/model uvicorn inference_api:app --port 8000 | |
| ``` | |
| ### 3. Start the Server | |
| ```bash | |
| # Basic usage | |
| uvicorn inference_api:app --host 0.0.0.0 --port 8000 | |
| # With auto-reload (development) | |
| uvicorn inference_api:app --reload --port 8000 | |
| # Using Python directly | |
| python inference_api.py | |
| ``` | |
| ### 4. Verify It's Running | |
| ```bash | |
| curl http://localhost:8000/health | |
| ``` | |
| Expected response: | |
| ```json | |
| { | |
| "status": "healthy", | |
| "model_loaded": true, | |
| "model_path": "base_model_qwen7b", | |
| "device": "cuda", | |
| "cuda_available": true | |
| } | |
| ``` | |
| --- | |
| ## API Endpoints | |
| ### `GET /health` | |
| Health check endpoint to verify API and model status. | |
| **Response:** | |
| ```json | |
| { | |
| "status": "healthy", | |
| "model_loaded": true, | |
| "model_path": "/path/to/model", | |
| "device": "cuda", | |
| "cuda_available": true | |
| } | |
| ``` | |
| --- | |
| ### `GET /model-info` | |
| Get information about the currently loaded model. | |
| **Response:** | |
| ```json | |
| { | |
| "model_path": "/path/to/model", | |
| "device": "cuda:0", | |
| "dtype": "torch.float16" | |
| } | |
| ``` | |
| --- | |
| ### `POST /generate` | |
| Generate code completion for a prompt. | |
| **Request Body:** | |
| ```json | |
| { | |
| "prompt": "def two_sum(nums, target):\n \"\"\"Return indices of two numbers that add up to target.\"\"\"\n", | |
| "max_tokens": 128, | |
| "temperature": 0.2, | |
| "top_p": 0.95, | |
| "do_sample": true, | |
| "repetition_penalty": 1.1, | |
| "num_return_sequences": 1 | |
| } | |
| ``` | |
| **Parameters:** | |
| | Parameter | Type | Default | Range | Description | | |
| |-----------|------|---------|-------|-------------| | |
| | `prompt` | string | required | - | Input prompt to complete | | |
| | `max_tokens` | int | 512 | 1-4096 | Maximum tokens to generate | | |
| | `temperature` | float | 0.2 | 0.0-2.0 | Sampling temperature (higher = more creative) | | |
| | `top_p` | float | 0.95 | 0.0-1.0 | Nucleus sampling threshold | | |
| | `do_sample` | bool | true | - | Whether to use sampling vs greedy | | |
| | `repetition_penalty` | float | 1.1 | 1.0-2.0 | Penalize repeated tokens | | |
| | `num_return_sequences` | int | 1 | 1-10 | Number of sequences to generate | | |
| **Response:** | |
| ```json | |
| { | |
| "generated_text": " seen = {}\n for i, num in enumerate(nums):\n complement = target - num\n if complement in seen:\n return [seen[complement], i]\n seen[num] = i\n return []", | |
| "prompt": "def two_sum(nums, target):\n \"\"\"Return indices of two numbers that add up to target.\"\"\"\n", | |
| "model": "base_model_qwen7b", | |
| "num_tokens": 45, | |
| "finish_reason": "stop" | |
| } | |
| ``` | |
| **Example with curl:** | |
| ```bash | |
| curl -X POST http://localhost:8000/generate \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "prompt": "def fibonacci(n):\n \"\"\"Return first n Fibonacci numbers.\"\"\"\n", | |
| "max_tokens": 100, | |
| "temperature": 0.2 | |
| }' | |
| ``` | |
| --- | |
| ### `POST /chat` | |
| Conversational interface for multi-turn interactions. | |
| **Request Body:** | |
| ```json | |
| { | |
| "messages": [ | |
| {"role": "user", "content": "Write a function to reverse a string in Python"}, | |
| {"role": "assistant", "content": "def reverse_string(s):\n return s[::-1]"}, | |
| {"role": "user", "content": "Make it recursive instead"} | |
| ], | |
| "max_tokens": 128, | |
| "temperature": 0.2, | |
| "top_p": 0.95 | |
| } | |
| ``` | |
| **Message Roles:** | |
| - `user` - User's message | |
| - `assistant` - Model's previous response (for conversation history) | |
| **Response:** | |
| ```json | |
| { | |
| "message": { | |
| "role": "assistant", | |
| "content": "def reverse_string(s):\n if len(s) <= 1:\n return s\n return s[-1] + reverse_string(s[:-1])" | |
| }, | |
| "model": "base_model_qwen7b", | |
| "num_tokens": 67, | |
| "finish_reason": "stop" | |
| } | |
| ``` | |
| **Example with curl:** | |
| ```bash | |
| curl -X POST http://localhost:8000/chat \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "messages": [ | |
| {"role": "user", "content": "Write a binary search function"} | |
| ], | |
| "max_tokens": 150 | |
| }' | |
| ``` | |
| --- | |
| ### `POST /generate/raw` | |
| Same as `/generate` but returns raw output without extracting code from markdown blocks. | |
| **Example with curl:** | |
| ```bash | |
| curl -X POST http://localhost:8000/generate/raw \ | |
| -H "Content-Type: application/json" \ | |
| -d '{ | |
| "prompt": "def quick_sort(arr):", | |
| "max_tokens": 200 | |
| }' | |
| ``` | |
| --- | |
| ### `POST /extract-code` | |
| Extract code from a text response that may contain markdown code blocks. | |
| **Request Body:** | |
| ```json | |
| { | |
| "prompt": "```python\ndef hello():\n print(\"world\")\n```" | |
| } | |
| ``` | |
| **Response:** | |
| ```json | |
| { | |
| "code": "def hello():\n print(\"world\")" | |
| } | |
| ``` | |
| --- | |
| ## Environment Variables | |
| | Variable | Default | Description | | |
| |----------|---------|-------------| | |
| | `MODEL_PATH` | `base_model_qwen7b` | Path to model directory | | |
| | `DEVICE` | `cuda` (if available) | Device to use: `cuda` or `cpu` | | |
| | `PORT` | `8000` | Server port | | |
| | `HOST` | `0.0.0.0` | Server host | | |
| | `RELOAD` | `false` | Enable auto-reload for development | | |
| | `DEFAULT_MAX_TOKENS` | `512` | Default max tokens | | |
| | `DEFAULT_TEMPERATURE` | `0.2` | Default temperature | | |
| | `DEFAULT_TOP_P` | `0.95` | Default top_p | | |
| --- | |
| ## Usage Examples | |
| ### Python Client | |
| ```python | |
| import requests | |
| API_URL = "http://localhost:8000" | |
| # Health check | |
| health = requests.get(f"{API_URL}/health").json() | |
| print(f"Model loaded: {health['model_loaded']}") | |
| # Code completion | |
| response = requests.post( | |
| f"{API_URL}/generate", | |
| json={ | |
| "prompt": "def merge_sort(arr):\n \"\"\"Return sorted array.\"\"\"\n", | |
| "max_tokens": 200, | |
| "temperature": 0.3, | |
| } | |
| ).json() | |
| print(response["generated_text"]) | |
| ``` | |
| ### JavaScript/Node.js Client | |
| ```javascript | |
| const API_URL = "http://localhost:8000"; | |
| // Code completion | |
| async function generate(prompt) { | |
| const response = await fetch(`${API_URL}/generate`, { | |
| method: "POST", | |
| headers: { "Content-Type": "application/json" }, | |
| body: JSON.stringify({ | |
| prompt, | |
| max_tokens: 128, | |
| temperature: 0.2, | |
| }), | |
| }); | |
| return response.json(); | |
| } | |
| const result = await generate("def binary_search(arr, target):"); | |
| console.log(result.generated_text); | |
| ``` | |
| ### Using with OpenAI SDK (with base_url replacement) | |
| ```python | |
| from openai import OpenAI | |
| client = OpenAI( | |
| api_key="not-needed", | |
| base_url="http://localhost:8000" | |
| ) | |
| # Note: This works for basic completions but may need adapter code | |
| # for full OpenAI compatibility | |
| response = client.completions.create( | |
| model="stack-2.9", | |
| prompt="def factorial(n):", | |
| max_tokens=100, | |
| ) | |
| ``` | |
| --- | |
| ## Performance Tips | |
| 1. **GPU Recommended**: For fastest inference, run on GPU with CUDA | |
| 2. **Batch Processing**: For multiple prompts, process sequentially (model is loaded once) | |
| 3. **Memory**: Ensure adequate GPU memory; reduce `max_tokens` if needed | |
| 4. **Temperature**: Use lower temperature (0.1-0.3) for deterministic code, higher for creative tasks | |
| --- | |
| ## Error Handling | |
| **503 Service Unavailable**: Model not loaded or loading failed | |
| ```json | |
| {"detail": "Model not loaded. Check /health for status."} | |
| ``` | |
| **500 Internal Server Error**: Generation failed | |
| ```json | |
| {"detail": "Generation failed: <error message>"} | |
| ``` | |
| **400 Bad Request**: Invalid input | |
| ```json | |
| {"detail": "Last message must be from user"} | |
| ``` | |
| --- | |
| ## Architecture Notes | |
| - **Single Model Instance**: Model is loaded once at startup and reused | |
| - **Synchronous Generation**: Uses `torch.no_grad()` for inference | |
| - **CORS Enabled**: Accepts requests from any origin (configure for production) | |
| - **No Authentication**: Add middleware (e.g., API key) for production deployments | |