# Training Infrastructure Improvements

## Status: Audit Complete (Issues Found & Documented)

---

## 🔴 CRITICAL: Data Format Mismatch (Training Won't Run)

### The Problem
All training scripts expect simple text/chat formats, but the actual training data uses a **messages-array format with tool calls**:

```python
# What scripts expect (WRONG):
{"text": "...", "instruction": "...", "output": "..."}

# What the data actually contains (CORRECT):
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": null, "tool_calls": [...]}, {"role": "tool", ...}], "tools": [...]}
```

### Affected Scripts
| Script | Issue |
|--------|-------|
| `train_simple_nobnb.py` | `tokenize_function` looks for `instruction`/`output` fields, which don't exist in the data |
| `train_local.py` | References `./data/final/train.jsonl`: wrong path and wrong format |
| `train_extended_context.py` | Same `text` field assumption, so the data won't tokenize properly |
| `t4-qlora.yaml` | `text_field: "text"` and `dataset_path: "./data/final/train_combined.jsonl"` are both wrong |
| `extended-context-128k.yaml` | `dataset_path: "./training-data/final/train.jsonl"` points to a file that doesn't exist |

### Fix Required
The scripts need a data loader that converts the `messages` format to training tokens and handles:
- System message prepending
- Tool-call turns (skip or flatten)
- User/assistant turns for language modeling
- Padding and truncation at `max_length`
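
For the "flatten" option on tool-call turns, one possibility is to serialize the `tool_calls` list into the assistant text. The sketch below is only an illustration: the helper name `render_assistant_turn`, the `<tool_call>` wrapper tag, and the sample call are all hypothetical and not taken from the existing scripts. Each call is dumped as-is, so no assumption is made about its internal fields.

```python
import json


def render_assistant_turn(msg: dict) -> str:
    """Return the text for one assistant message, flattening any tool calls.

    Hypothetical helper: the <tool_call> tag is an illustrative choice and
    should be replaced by whatever markup the target chat template expects.
    """
    # Tool-call turns store content as None, so fall back to an empty string.
    parts = [msg.get("content") or ""]
    for call in msg.get("tool_calls") or []:
        # Serialize each call unchanged; sort_keys keeps the output stable.
        parts.append("<tool_call>" + json.dumps(call, sort_keys=True) + "</tool_call>")
    # Drop empty pieces so a pure tool-call turn has no leading blank line.
    return "\n".join(p for p in parts if p)


# Invented sample message, for illustration only.
example = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{"name": "ls", "arguments": {"path": "."}}],
}
print(render_assistant_turn(example))
```

Skipping tool-call turns instead would mean returning only the `content` part, at the cost of never training on the tool-call syntax.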

---

## 🔴 train_local.py Issues

1. **Broken import path**: `sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'stack/training'))` points to a directory that doesn't exist
2. **Wrong data path**: `./data/final/train.jsonl` should be `./training-data/tool_examples_combined.jsonl`
3. **Wrong config path**: `stack/training/train_config_local.yaml` doesn't exist
4. **MPS check bug**: `torch.backends.mps.is_built()` is called unguarded; it can raise `AttributeError` on PyTorch builds that lack the `torch.backends.mps` module, and it only reports build support, not whether an MPS device is actually available
5. **No 4-bit quantization**: loads the full model in FP32, which will OOM on Mac MPS
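
A guarded device check avoids the `AttributeError` risk in item 4. The sketch below is a minimal illustration, assuming a hypothetical helper named `pick_device` that receives the `torch` module as an argument; it is demonstrated with a stand-in object so it runs without PyTorch installed.

```python
from types import SimpleNamespace


def pick_device(torch_module) -> str:
    """Return 'cuda', 'mps', or 'cpu' without assuming any backend exists."""
    cuda = getattr(torch_module, "cuda", None)
    if cuda is not None and cuda.is_available():
        return "cuda"
    # getattr with a default tolerates builds that have no mps module.
    backends = getattr(torch_module, "backends", None)
    mps = getattr(backends, "mps", None)
    # is_built() only reports compile-time support; is_available() checks the device.
    if mps is not None and mps.is_built() and mps.is_available():
        return "mps"
    return "cpu"


# Stand-in for a CPU-only build with no MPS backend module at all.
fake_torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False),
    backends=SimpleNamespace(),
)
print(pick_device(fake_torch))  # cpu
```

In `train_local.py` the call would be `pick_device(torch)` after `import torch`.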

---

## 🟡 t4-qlora.yaml Issues

1. **Wrong data path**: `./data/final/train_combined.jsonl` doesn't exist
2. **Wrong format field**: `text_field: "text"` won't work with messages format
3. **Includes `neat_ft: false`**, which is not a valid HF `TrainingArguments` field
4. **No `push_to_hub_model_id`** despite `push_to_hub: true` being templated

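
Invalid fields such as `neat_ft` can be caught before training by comparing config keys against the fields of the arguments dataclass. This is a rough sketch, assuming a hypothetical helper `unknown_keys`; `DemoArguments` is a stand-in so the example runs without `transformers` installed.

```python
from dataclasses import dataclass, fields


def unknown_keys(config: dict, args_cls) -> list:
    """List config keys that are not fields of the dataclass `args_cls`."""
    valid = {f.name for f in fields(args_cls)}
    return sorted(k for k in config if k not in valid)


@dataclass
class DemoArguments:
    """Stand-in for transformers.TrainingArguments (illustration only)."""
    output_dir: str = "out"
    push_to_hub: bool = False


print(unknown_keys({"output_dir": "out", "neat_ft": False}, DemoArguments))  # ['neat_ft']
```

In practice `args_cls` would be `transformers.TrainingArguments`, and keys the training script consumes itself (for example `dataset_path`) would need to be excluded before the check.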
---

## 🟡 extended-context-128k.yaml Issues

1. **Wrong data path**: `./training-data/final/train.jsonl` doesn't exist
2. **File references `Qwen/Qwen2.5-Coder-1.5B`** but it's not clear if this model already has extended RoPE config
3. **No verification** that the base model actually has `rope_scaling` in its config.json
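
The missing verification in item 3 could be a small check on the model's `config.json`. The sketch below assumes a hypothetical helper `describe_context_config` and uses made-up demo values written to a temporary file; the values are not the real model's configuration.

```python
import json
import tempfile
from pathlib import Path


def describe_context_config(config_path: str) -> dict:
    """Report the context-length fields found in a model config.json."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return {
        "has_rope_scaling": config.get("rope_scaling") is not None,
        "rope_scaling": config.get("rope_scaling"),
        "max_position_embeddings": config.get("max_position_embeddings"),
    }


# Demo with an invented config; replace the path with the downloaded model's config.json.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"max_position_embeddings": 32768}), encoding="utf-8")
    report = describe_context_config(str(demo_path))

print(report["has_rope_scaling"])  # False
```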

---

## 🟡 evaluate_model.py Issues

1. **Wrong HumanEval format**: expects `test_cases` in the problem dict, but HumanEval records provide `prompt`, `canonical_solution`, `test`, and `entry_point` strings, and the `test` code must be executed
2. **Code execution sandbox is limited**: only specific builtins are allowed, so many standard library functions are missing
3. **No handling** of `assert` statements in test code
4. **`calculate_pass_at_k`** has a bug: `correct_in_k = sum(correct_flags[:min(k, len(correct_flags))])` counts correct samples among the first k only. pass@k should estimate the probability that at least one of k samples drawn from all n generated samples is correct
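
One way to fix item 4 is the unbiased estimator commonly used with HumanEval, pass@k = 1 - C(n-c, k) / C(n, k), where n is the number of samples generated per problem and c is how many of them passed. The sketch below swaps the first-k slice for that estimator; the function name `pass_at_k` is hypothetical and not the script's existing API.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated for the problem
    c: number of those samples that passed the tests
    k: number of samples drawn without replacement
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k incorrect samples: every draw of k contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=0, k=1))   # 0.0
print(pass_at_k(n=10, c=10, k=5))  # 1.0
```

The per-problem values are then averaged over the benchmark to obtain the reported pass@k.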

---

## 🟢 What's Working Well

- **`train_simple_nobnb.py`**: Good mixed precision logic, proper bf16/fp16 detection, paged AdamW optimizer, gradient checkpointing with `use_reentrant=False`
- **Training configs**: Comprehensive hardware-specific settings, well-documented
- **Recipes**: Good documentation of GPU requirements and expected runtimes
- **LoRA config**: Properly targets all relevant modules for Qwen

---

## ✅ Recommended Fixes (Priority Order)

### 1. Fix Data Loaders (Highest Priority)
Add a proper `load_chat_data()` function to `train_simple_nobnb.py`:

```python
from datasets import load_dataset


def load_chat_data(data_path: str, tokenizer, max_length: int = 2048, train_split: float = 0.9):
    """Load a messages-format dataset and convert it to training tokens."""
    raw_dataset = load_dataset("json", data_files=data_path, split="train")

    def tokenize_messages(example):
        # Flatten system, user, assistant, and tool turns into one string.
        text = ""
        for msg in example["messages"]:
            role = msg["role"]
            # Tool-call turns carry content=None; treat that as empty text.
            content = msg.get("content") or ""
            if role in ("system", "user", "tool"):
                text += f"<|{role}|>\n{content}\n"
            elif role == "assistant":
                # Skip tool calls for now and use only the text response.
                text += f"<|assistant|>\n{content}\n"
        # No trailing "<|assistant|>" here: a generation prompt belongs at
        # inference time, not at the end of a training example.

        # padding="max_length" requires tokenizer.pad_token to be set.
        result = tokenizer(text, truncation=True, max_length=max_length, padding="max_length")
        # Mask padding positions with -100 so they are ignored by the loss.
        result["labels"] = [
            tok if mask == 1 else -100
            for tok, mask in zip(result["input_ids"], result["attention_mask"])
        ]
        return result

    tokenized = raw_dataset.map(tokenize_messages, remove_columns=raw_dataset.column_names)
    # Hold out (1 - train_split) of the examples for evaluation; the seed is arbitrary.
    split = tokenized.train_test_split(test_size=1.0 - train_split, seed=42)
    return split["train"], split["test"]
```

### 2. Fix All Data Paths
| Config File | Current (Wrong) | Correct |
|-------------|-----------------|---------|
| `t4-qlora.yaml` | `./data/final/train_combined.jsonl` | `./training-data/tool_examples_combined.jsonl` |
| `extended-context-128k.yaml` | `./training-data/final/train.jsonl` | `./training-data/tool_examples_combined.jsonl` |
| `train_local.py` | `./data/final/train.jsonl` | `./training-data/tool_examples_combined.jsonl` |

### 3. Fix t4-qlora.yaml
- Remove `neat_ft: false` (not a valid field)
- Add `output_dir` override or create `training-configs/t4-qlora-data-fix.yaml`

### 4. Fix evaluate_model.py
- Add proper HumanEval problem loading (use the `openai/openai_humaneval` dataset from HuggingFace)
- Fix pass@k calculation
- Expand safe builtins for code execution
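
HumanEval-style checking can be sketched by joining the `prompt`, the model completion, the `test` string, and a `check(entry_point)` call into one program and running it in a child process. This is a rough illustration only: `build_program`, `passes`, and the toy problem are hypothetical, and a subprocess with a timeout is not a security sandbox, so untrusted completions still need real isolation.

```python
import subprocess
import sys


def build_program(problem: dict, completion: str) -> str:
    """Join prompt, completion, and tests into one runnable script."""
    return (
        problem["prompt"] + completion + "\n\n"
        + problem["test"] + "\n\n"
        + f"check({problem['entry_point']})\n"
    )


def passes(problem: dict, completion: str, timeout: float = 10.0) -> bool:
    """Run the joined script in a child interpreter; True if it exits cleanly."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", build_program(problem, completion)],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    # A failed assert raises AssertionError, giving a non-zero exit code.
    return proc.returncode == 0


# Invented toy problem that mirrors the HumanEval field names.
toy_problem = {
    "prompt": "def add(a, b):\n",
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n",
    "entry_point": "add",
}
print(passes(toy_problem, "    return a + b\n"))  # True
print(passes(toy_problem, "    return a - b\n"))  # False
```

Because the `test` string runs as ordinary Python, its `assert` statements need no special handling: a failure simply surfaces as a non-zero exit code.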

### 5. Fix train_local.py
- Remove broken `stack/training` import path
- Add proper 4-bit quantization support for MPS (or detect CUDA availability)
- Fix data and config paths

---

## 📁 Actual Training Data Location

```
/Users/walidsobhi/stack-2.9/training/training-data/
├── tool_examples.jsonl           (1000 lines)
├── tool_examples_combined.jsonl  (1500 lines)
└── tool_examples.json            (same data, json format)
```

Format: `{"messages": [...], "tools": [...]}`, a messages array with tool calls.

---

## 🚀 Quick Test Command

To verify that the dataset loads with the columns the fixed loaders expect:

```bash
cd /Users/walidsobhi/stack-2.9/training
python -c "
from datasets import load_dataset
ds = load_dataset('json', data_files='training-data/tool_examples_combined.jsonl', split='train')
print(f'Total examples: {len(ds)}')
print(f'Keys: {ds.column_names}')
print(f'Example: {ds[0]}')
"
```

Expected `Keys` output: `['messages', 'tools']`, not `['text']` or `['instruction', 'output']`.
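
The same schema check can be done per record without the `datasets` library, which helps locate individual bad lines. The sketch below assumes a hypothetical helper `find_problems`; the required keys and role names come from the format described above, and the two sample lines are invented.

```python
import json

REQUIRED_KEYS = ("messages", "tools")
KNOWN_ROLES = {"system", "user", "assistant", "tool"}


def find_problems(line: str) -> list:
    """Return a list of human-readable problems for one JSONL line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    # Flag messages whose role is absent or outside the known set.
    for i, msg in enumerate(record.get("messages") or []):
        if not isinstance(msg, dict) or msg.get("role") not in KNOWN_ROLES:
            problems.append(f"message {i} has an unexpected role")
    return problems


good = '{"messages": [{"role": "user", "content": "hi"}], "tools": []}'
old_style = '{"instruction": "hi", "output": "hello"}'
print(find_problems(good))       # []
print(find_problems(old_style))  # ['missing key: messages', 'missing key: tools']
```

Running it over every line of `tool_examples_combined.jsonl` and printing the line numbers with non-empty results would show whether any records deviate from the messages format.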

---

## Next Steps

1. Write a proper `load_chat_data()` function in a shared `data_utils.py`
2. Update `train_simple_nobnb.py` to use it
3. Update all YAML configs with correct data paths
4. Test with 1 epoch on small sample
5. Then scale to full training on Kaggle/A100