# Tool Calling Training Data Analysis
**Generated:** 2026-04-06
**Files Analyzed:**
- `training-data/tool_examples.jsonl` (original)
- `training-data_v2/tool_examples.jsonl` (regenerated)
---
## Executive Summary
The original tool calling training data had **significant quality issues** that limited its usefulness for training a production AI coding assistant. The data was synthetically generated with systematic errors.
**Key Findings on Original Data:**
- ❌ 10.7% of examples (107 of 1,000) use incorrect tool parameters: 105 mismatched search queries and 2 wrong file paths
- ❌ Heavy prompt duplication (7.5x average)
- ❌ No multi-step tool chains (only 1 tool per example)
- ❌ All examples use identical tool definitions
**Action Taken:** Generated 500 new examples using the project's generator script.
**Recommendation:** The original data needs substantial improvements before use in training.
---
## 1. Statistics Overview
### Original Data (tool_examples.jsonl)
| Metric | Value |
|--------|-------|
| Total Examples | 1,000 |
| Unique Prompts | 133 |
| Average Duplication | 7.52x |
| Unique Tool Sequences | 5 |
| Examples with Issues | ~107 (10.7%) |
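The duplication figure follows directly from the two counts above it (1,000 examples over 133 unique prompts). A minimal sketch of that calculation, assuming the prompts have already been extracted into a list of strings; the function name `duplication_stats` is hypothetical and not part of the project's scripts:

```python
from collections import Counter


def duplication_stats(prompts):
    """Return (total, unique, average duplication) for a list of prompt strings."""
    counts = Counter(p.strip() for p in prompts)
    total = sum(counts.values())
    unique = len(counts)
    average = total / unique if unique else 0.0
    return total, unique, round(average, 2)


# Sanity check against the table: 1,000 examples over 133 unique prompts.
print(round(1000 / 133, 2))  # 7.52
```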
### New Data (training-data_v2/tool_examples.jsonl)
| Metric | Value |
|--------|-------|
| Total Examples | 500 |
| File Size | 1.9 MB |
| Tools per Example | 5 (static definition) |
### Tool Call Distribution (Original)
| Tool | Call Count |
|------|------------|
| Bash | 200 |
| FileRead | 200 |
| FileWrite | 200 |
| WebSearch | 200 |
| Grep | 200 |
All examples have exactly **one tool call** - no multi-step chains exist.
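The distribution and the single-call finding could be reproduced with a check along these lines. This is a sketch only: the record layout (a `messages` list whose assistant entries carry OpenAI-style `tool_calls`) is an assumption, since the report does not show the JSONL schema, and both function names are hypothetical.

```python
from collections import Counter


def tool_call_names(example):
    """List the tool names called in one example (assumed schema)."""
    names = []
    for message in example.get("messages", []):
        for call in message.get("tool_calls") or []:
            names.append(call["function"]["name"])
    return names


def summarize(examples):
    """Count calls per tool and how many examples chain more than one call."""
    per_tool = Counter()
    multi_step = 0
    for example in examples:
        names = tool_call_names(example)
        per_tool.update(names)
        if len(names) > 1:
            multi_step += 1
    return per_tool, multi_step
```

On the original data, `multi_step` would be expected to come back as 0, matching the finding above.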
---
## 2. Prompt Diversity Analysis (Original Data)
### Prompt Categories
| Category | Count | Percentage |
|----------|-------|------------|
| Python | 207 | 20.7% |
| React | 149 | 14.9% |
| File Read | 134 | 13.4% |
| File Write | 119 | 11.9% |
| Other | 114 | 11.4% |
| Run Command | 80 | 8.0% |
| Docker/K8s | 67 | 6.7% |
| Search | 50 | 5.0% |
| Git | 40 | 4.0% |
| Testing | 31 | 3.1% |
| Package Management | 9 | 0.9% |
### Most Duplicated Prompts
| Prompt | Occurrences |
|--------|-------------|
| "Write a simple React component to src/components/Button.jsx" | 67 |
| "Run the tests with pytest" | 40 |
| "Run npm install to install dependencies" | 40 |
---
## 3. Tool Usage Breakdown
### Tool Definitions
All 1,000 original examples use **identical tool definitions** with 5 tools:
- `Bash` - Execute bash commands
- `FileRead` - Read file contents
- `FileWrite` - Create/overwrite files
- `WebSearch` - Search the web
- `Grep` - Search for patterns in files
### Tool Call Issues Found (Original Data)
#### Wrong Search Patterns (105 instances / 10.5%)
The `WebSearch` tool frequently uses queries that don't match the user's question:
| User Question | Actual Search Query |
|--------------|---------------------|
| "How do I use async/await in Python?" | "AWS Lambda cold start optimization" |
| "How do I use React hooks properly?" | "SQL join types explained" |
| "What's the difference between Docker and Kubernetes?" | "Git rebase vs merge" |
| "How do I use React hooks properly?" | "TypeScript generics tutorial" |
| "What's the difference between Docker and Kubernetes?" | "TypeScript generics tutorial" |
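Mismatches like those in the table can be flagged with a simple lexical check. The sketch below is a heuristic, not the method used to produce the counts in this report: it treats a query as relevant when it shares at least one non-filler keyword with the question. The stopword list and function names are hypothetical, and a purely lexical test will miss synonyms and paraphrases.

```python
import re

# Filler words ignored when comparing a question with its search query.
STOPWORDS = {"how", "do", "i", "use", "in", "the", "a", "an", "and", "of",
             "what", "s", "is", "to", "between", "vs", "properly", "difference"}


def keywords(text):
    """Lowercase word tokens with common filler words removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def query_matches_question(question, query):
    """True when the search query shares at least one keyword with the question."""
    return bool(keywords(question) & keywords(query))


# First row of the table above: no shared keyword, so the pair is flagged.
print(query_matches_question("How do I use async/await in Python?",
                             "AWS Lambda cold start optimization"))  # False
```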
#### Wrong File Paths (2 instances)
The `FileWrite` tool sometimes writes to incorrect file types:
| User Request | Written Path |
|-------------|--------------|
| "Create a src/components/Header.jsx file" | Written to `config.json` |
| "Create a src/middleware.py file with settings" | Written to `config.yaml` |
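A similar check can catch wrong write targets: extract anything in the prompt that looks like a file path and confirm that the `FileWrite` call uses one of them. This is a sketch under assumptions; the regular expression is a rough approximation of "looks like a path", and the function names are hypothetical.

```python
import re

# Matches tokens such as src/components/Header.jsx (something, a dot, an extension).
PATH_PATTERN = re.compile(r"[\w./-]+\.\w+")


def requested_paths(prompt):
    """Extract tokens from the prompt that look like file paths."""
    return PATH_PATTERN.findall(prompt)


def write_path_matches(prompt, written_path):
    """True if the prompt names no path, or the written path is one it names."""
    paths = requested_paths(prompt)
    return not paths or written_path in paths


print(write_path_matches("Create a src/components/Header.jsx file", "config.json"))  # False
```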
#### Pattern/File Type Mismatches (Grep)
The `Grep` tool sometimes searches with mismatched patterns:
| Pattern | File Pattern | Issue |
|---------|-------------|-------|
| `class ` | `*.ts` | Python pattern in TypeScript files |
| `SELECT ` | `*.js` | SQL pattern in JavaScript files |
| `TODO` | `*.md` | Searching TODO in markdown files |
---
## 4. Data Quality Issues
### Critical Issues
1. **No Multi-Step Tool Chains**
- All 1,000 examples use exactly one tool call
- Real coding tasks typically require 2-5+ tool calls
- Example: "Read file → Find pattern → Search docs → Write fix"
2. **Search Query Mismatches**
- 105 of the 200 WebSearch calls (52.5%, or 10.5% of all examples) have irrelevant queries
- Indicates the generator script has logic errors
3. **Heavy Prompt Duplication**
- 133 unique prompts duplicated to 1,000 examples
- "Write a simple React component" appears 67 times
- This risks overfitting the model to specific prompts
4. **Identical Tool Definitions**
- All examples use the same 5 tools with identical descriptions
- No variation in tool schemas or parameter structures
### Moderate Issues
5. **File Path Hallucination**
- Tool calls reference files that don't exist in the actual codebase
- Example: asking for `tests/test_main.py` but reading `src/app.js`
6. **Response Fabrication**
- Assistant responses sometimes claim to show content that wasn't actually read
- Example: "Here's the README.md" when README.md wasn't the file requested
---
## 5. Recommendations for Improvement
### Immediate Actions (Completed)
1. ✅ **Regenerated Data**
```
Generated 500 new examples in training-data_v2/tool_examples.jsonl
```
### Script Fixes Needed
The generator script (`scripts/generate_tool_data.py`) needs:
1. Fix `TOOL_CALL_PAIRS` mapping - queries don't match questions
2. Fix `FILE_PATTERNS` - wrong file types for requested content
3. Add multi-step chain generation
4. Add prompt variation templates
5. Add validation to check query/content relevance
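One way to address fixes 1 and 4 together is to derive the prompt and the search query from the same topic entry, so they cannot drift apart, while drawing the phrasing from a set of templates. The sketch below is illustrative only: the contents of `scripts/generate_tool_data.py` are not shown in this report, so the `TOPICS` and `TEMPLATES` structures and the `make_search_example` function are hypothetical stand-ins rather than the script's real `TOOL_CALL_PAIRS`.

```python
import random

# Hypothetical topic -> query table; each prompt and its query come from one entry.
TOPICS = {
    "async/await in Python": "Python async await tutorial",
    "React hooks": "React hooks best practices",
}
# Hypothetical phrasing templates that add prompt variety.
TEMPLATES = [
    "How do I use {topic}?",
    "Can you explain {topic}?",
    "Show me an example of {topic}.",
]


def make_search_example(rng):
    """Pick one topic, then derive both the prompt and the query from it."""
    topic, query = rng.choice(sorted(TOPICS.items()))
    prompt = rng.choice(TEMPLATES).format(topic=topic)
    return {"prompt": prompt, "tool": "WebSearch", "arguments": {"query": query}}


example = make_search_example(random.Random(0))
```

Passing an explicit `random.Random` instance keeps generation reproducible, which makes regenerated datasets easier to compare.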
### Future Improvements
1. **Add Multi-Step Examples**
- Real tasks require reading files, searching, editing
- Generate chains of 2-4 tool calls per example
2. **Increase Prompt Diversity**
- Target 500+ unique prompts instead of duplicating
- Use template variations and paraphrasing
3. **Vary Tool Definitions**
- Different tools per example
- Add tool variations (e.g., different Bash commands)
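The first and third improvements can be sketched together: build each record from an ordered list of tool steps, and include only the tool definitions that record uses, so definitions vary from example to example. The tool names and descriptions are taken from this report; the record layout, the argument names (`path`, `pattern`, `file_pattern`, `command`), and `build_chain_example` itself are assumptions for illustration.

```python
import json

# The five tools named in this report, with their descriptions.
TOOL_DESCRIPTIONS = {
    "Bash": "Execute bash commands",
    "FileRead": "Read file contents",
    "FileWrite": "Create/overwrite files",
    "WebSearch": "Search the web",
    "Grep": "Search for patterns in files",
}


def build_chain_example(prompt, steps):
    """Build one training record from an ordered list of (tool, arguments) steps."""
    used = []
    for tool, _ in steps:
        if tool not in TOOL_DESCRIPTIONS:
            raise ValueError(f"unknown tool: {tool}")
        if tool not in used:
            used.append(tool)
    return {
        "prompt": prompt,
        # Only the tools this example calls, so definitions differ across examples.
        "tools": [{"name": t, "description": TOOL_DESCRIPTIONS[t]} for t in used],
        "tool_calls": [{"name": t, "arguments": a} for t, a in steps],
    }


record = build_chain_example(
    "Fix the failing test in tests/test_main.py",
    [
        ("FileRead", {"path": "tests/test_main.py"}),
        ("Grep", {"pattern": "def main", "file_pattern": "*.py"}),
        ("FileWrite", {"path": "src/main.py", "content": "..."}),
        ("Bash", {"command": "pytest"}),
    ],
)
line = json.dumps(record)  # one line of a JSONL file
```

A fuller generator would probably also mix in unused "distractor" tools, so the model learns to choose among options rather than call whatever is defined.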
---
## 6. Conclusion
The original `tool_examples.jsonl` data is **NOT suitable for production training** without significant improvements:
- 10.7% of examples have incorrect tool parameters
- Heavy duplication risks overfitting to specific prompts
- The absence of multi-step chains means the data does not represent real coding workflows
- Synthetic generation errors are systematic
**Action Completed:** Generated 500 new examples via the project's generator script.
**Remaining Work:** Fix the underlying generator script to eliminate the systematic errors before full-scale regeneration.
---
## Appendix: Quick Stats
### Original Data
```
Total examples: 1,000
Unique prompts: 133
Tool call issues: 107 (10.7%)
Multi-tool chains: 0 (0%)
Identical tool defs: 100%
Average duplication: 7.52x
```
### New Data (Generated)
```
Total examples: 500
File size: 1.9 MB
Location: training-data_v2/tool_examples.jsonl
```