File size: 6,908 Bytes
b03a8a0
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
# Tool Calling Training Data Analysis

**Generated:** 2026-04-06  
**Files Analyzed:** 
- `training-data/tool_examples.jsonl` (original)
- `training-data_v2/tool_examples.jsonl` (regenerated)

---

## Executive Summary

The original tool calling training data had **significant quality issues** that limited its usefulness for training a production AI coding assistant. The data was synthetically generated with systematic errors.

**Key Findings on Original Data:**
- ❌ 10.5% of tool calls use incorrect parameters (mismatched search queries, wrong files)
- ❌ Heavy prompt duplication (7.5x average)
- ❌ No multi-step tool chains (only 1 tool per example)
- ❌ All examples use identical tool definitions

**Action Taken:** Generated 500 new examples using the project's generator script.

**Recommendation:** The original data needs substantial improvements before use in training.

---

## 1. Statistics Overview

### Original Data (tool_examples.jsonl)

| Metric | Value |
|--------|-------|
| Total Examples | 1,000 |
| Unique Prompts | 133 |
| Average Duplication | 7.52x |
| Unique Tool Sequences | 5 |
| Examples with Issues | ~107 (10.7%) |

### New Data (tool_examples_v2.jsonl)

| Metric | Value |
|--------|-------|
| Total Examples | 500 |
| File Size | 1.9 MB |
| Tools per Example | 5 (static definition) |

### Tool Call Distribution (Original)

| Tool | Call Count |
|------|------------|
| Bash | 200 |
| FileRead | 200 |
| FileWrite | 200 |
| WebSearch | 200 |
| Grep | 200 |

All examples have exactly **one tool call** - no multi-step chains exist.

---

## 2. Prompt Diversity Analysis (Original Data)

### Prompt Categories

| Category | Count | Percentage |
|----------|-------|------------|
| Python | 207 | 20.7% |
| React | 149 | 14.9% |
| File Read | 134 | 13.4% |
| File Write | 119 | 11.9% |
| Other | 114 | 11.4% |
| Run Command | 80 | 8.0% |
| Docker/K8s | 67 | 6.7% |
| Search | 50 | 5.0% |
| Git | 40 | 4.0% |
| Testing | 31 | 3.1% |
| Package Management | 9 | 0.9% |

### Most Duplicated Prompts

| Prompt | Occurrences |
|--------|-------------|
| "Run the tests with pytest" | 40 |
| "Run npm install to install dependencies" | 40 |
| "Write a simple React component to src/components/Button.jsx" | 67 |

---

## 3. Tool Usage Breakdown

### Tool Definitions

All 1,000 original examples use **identical tool definitions** with 5 tools:
- `Bash` - Execute bash commands
- `FileRead` - Read file contents
- `FileWrite` - Create/overwrite files
- `WebSearch` - Search the web
- `Grep` - Search for patterns in files

### Tool Call Issues Found (Original Data)

#### Wrong Search Patterns (105 instances / 10.5%)

The `WebSearch` tool frequently uses queries that don't match the user's question:

| User Question | Actual Search Query |
|--------------|---------------------|
| "How do I use async/await in Python?" | "AWS Lambda cold start optimization" |
| "How do I use React hooks properly?" | "SQL join types explained" |
| "What's the difference between Docker and Kubernetes?" | "Git rebase vs merge" |
| "How do I use React hooks properly?" | "TypeScript generics tutorial" |
| "What's the difference between Docker and Kubernetes?" | "TypeScript generics tutorial" |

#### Wrong File Paths (2 instances)

The `FileWrite` tool sometimes writes to incorrect file types:

| User Request | Written Path |
|-------------|--------------|
| "Create a src/components/Header.jsx file" | Written to `config.json` |
| "Create a src/middleware.py file with settings" | Written to `config.yaml` |

#### Pattern/File Type Mismatches (Grep)

The `Grep` tool sometimes searches with mismatched patterns:

| Pattern | File Pattern | Issue |
|---------|-------------|-------|
| `class ` | `*.ts` | Python pattern in TypeScript files |
| `SELECT ` | `*.js` | SQL pattern in JavaScript files |
| `TODO` | `*.md` | Searching TODO in markdown files |

---

## 4. Data Quality Issues

### Critical Issues

1. **No Multi-Step Tool Chains**
   - All 1,000 examples use exactly one tool call
   - Real coding tasks typically require 2-5+ tool calls
   - Example: "Read file → Find pattern → Search docs → Write fix"

2. **Search Query Mismatches**
   - 10.5% of WebSearch calls have irrelevant queries
   - Indicates the generator script has logic errors

3. **Heavy Prompt Duplication**
   - 133 unique prompts duplicated to 1,000 examples
   - "Write a simple React component" appears 67 times
   - This creates overfitting to specific prompts

4. **Identical Tool Definitions**
   - All examples use the same 5 tools with identical descriptions
   - No variation in tool schemas or parameter structures

### Moderate Issues

5. **File Path Hallucination**
   - Tool calls reference files that don't exist in actual codebase
   - Example: asking for `tests/test_main.py` but reading `src/app.js`

6. **Response Fabrication**
   - Assistant responses sometimes claim to show content that wasn't actually read
   - Example: "Here's the README.md" when README.md wasn't the file requested

---

## 5. Recommendations for Improvement

### Immediate Actions (Completed)

1.**Regenerated Data**
   ```
   Generated 500 new examples in training-data_v2/tool_examples.jsonl
   ```

### Script Fixes Needed

The generator script (`scripts/generate_tool_data.py`) needs:

1. Fix `TOOL_CALL_PAIRS` mapping - queries don't match questions
2. Fix `FILE_PATTERNS` - wrong file types for requested content
3. Add multi-step chain generation
4. Add prompt variation templates
5. Add validation to check query/content relevance

### Future Improvements

1. **Add Multi-Step Examples**
   - Real tasks require reading files, searching, editing
   - Generate chains of 2-4 tool calls per example

2. **Increase Prompt Diversity**
   - Target 500+ unique prompts instead of duplicating
   - Use template variations and paraphrasing

3. **Vary Tool Definitions**
   - Different tools per example
   - Add tool variations (e.g., different Bash commands)

---

## 6. Conclusion

The original `tool_examples.jsonl` data is **NOT suitable for production training** without significant improvements:

- ~10% of examples have incorrect tool parameters
- Heavy duplication leads to overfitting
- No multi-step chains fail to represent real coding workflows
- Synthetic generation errors are systematic

**Action Completed:** Generated 500 new examples via the project's generator script.

**Remaining Work:** Fix the underlying generator script to eliminate the systematic errors before full-scale regeneration.

---

## Appendix: Quick Stats

### Original Data
```
Total examples:        1,000
Unique prompts:        133
Tool call issues:      107 (10.7%)
Multi-tool chains:     0 (0%)
Identical tool defs:   100%
Average duplication:   7.52x
```

### New Data (Generated)
```
Total examples:        500
File size:             1.9 MB
Location:              training-data_v2/tool_examples.jsonl
```