# Training Infrastructure Improvements

## Status: Audit Complete, Issues Found & Documented

---

## 🔴 CRITICAL: Data Format Mismatch (Training Won't Run)

### The Problem

All training scripts expect simple text/chat formats, but the actual training data uses a **messages-array format with tool calls**:

```python
# What the scripts expect (WRONG for this data):
{"text": "...", "instruction": "...", "output": "..."}

# What the data actually contains (CORRECT):
{"messages": [{"role": "system", "content": "..."},
              {"role": "user", "content": "..."},
              {"role": "assistant", "content": null, "tool_calls": [...]},
              {"role": "tool", ...}],
 "tools": [...]}
```

### Affected Scripts and Configs

| File | Issue |
|------|-------|
| `train_simple_nobnb.py` | `tokenize_function` looks for `instruction`/`output` fields, which don't exist in the data |
| `train_local.py` | References `./data/final/train.jsonl`: wrong path and wrong format |
| `train_extended_context.py` | Same `text` field assumption, so the data won't tokenize properly |
| `t4-qlora.yaml` | `text_field: "text"` and `dataset_path: "./data/final/train_combined.jsonl"` are both wrong |
| `extended-context-128k.yaml` | `dataset_path: "./training-data/final/train.jsonl"` points to a file that doesn't exist |

### Fix Required

A proper data loader that converts the `messages` format to training tokens, handling:

- System message prepending
- Tool-call turns (skip or flatten)
- User/assistant turns for language modeling
- Padding and truncation at `max_length`

---

## 🔴 train_local.py Issues

1. **Broken import path**: `sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'stack/training'))` points to a directory that doesn't exist
2. **Wrong data path**: `./data/final/train.jsonl` should be `./training-data/tool_examples_combined.jsonl`
3. **Wrong config path**: `stack/training/train_config_local.yaml` doesn't exist
4. **MPS check bug**: `torch.backends.mps.is_built()` only reports whether PyTorch was compiled with MPS support, not whether an MPS device is usable, and on older PyTorch builds that lack `torch.backends.mps` the call raises `AttributeError`. Guard the check and prefer `torch.backends.mps.is_available()`
5. **No 4-bit quantization**: loads the full model in FP32, which is likely to run out of memory on Mac MPS

---

## 🟡 t4-qlora.yaml Issues

1. **Wrong data path**: `./data/final/train_combined.jsonl` doesn't exist
2. **Wrong format field**: `text_field: "text"` won't work with the messages format
3. **Includes `neat_ft: false`**: this is not a valid HF `TrainingArguments` field
4. **No `push_to_hub_model_id`** (or its replacement, `hub_model_id`) despite `push_to_hub: true` being templated

---

## 🟡 extended-context-128k.yaml Issues

1. **Wrong data path**: `./training-data/final/train.jsonl` doesn't exist
2. **File references `Qwen/Qwen2.5-Coder-1.5B`**, but it's not clear whether this model already ships with an extended RoPE config
3. **No verification** that the base model actually has `rope_scaling` in its `config.json`

---

## 🟡 evaluate_model.py Issues

1. **Wrong HumanEval format**: expects `test_cases` in the problem dict, but HumanEval provides `canonical_solution` and `test` strings, and the `test` code has to be executed against the generated solution
2. **Code execution sandbox is limited**: only specific builtins are allowed, so many standard library functions are missing
3. **No handling** of `assert` statements in test code
4. **`calculate_pass_at_k` bug**: `correct_in_k = sum(correct_flags[:min(k, len(correct_flags))])` counts passes only among the first k samples, so the result depends on sample order. pass@k should be estimated from all n samples, of which c pass, using the standard unbiased estimator `1 - C(n-c, k) / C(n, k)`

---

## 🟢 What's Working Well

- **`train_simple_nobnb.py`**: good mixed precision logic, proper bf16/fp16 detection, paged AdamW optimizer, gradient checkpointing with `use_reentrant=False`
- **Training configs**: comprehensive hardware-specific settings, well-documented
- **Recipes**: good documentation of GPU requirements and expected runtimes
- **LoRA config**: properly targets all relevant modules for Qwen

---

## ✅ Recommended Fixes (Priority Order)

### 1. Fix Data Loaders (Highest Priority)

Add a proper `load_chat_data()` function (ideally in a shared `data_utils.py`, see Next Steps) and call it from `train_simple_nobnb.py`:

```python
from datasets import load_dataset


def load_chat_data(data_path: str, tokenizer, max_length: int = 2048, train_split: float = 0.9):
    """Load a messages-format dataset and convert it to training tokens."""
    raw_dataset = load_dataset("json", data_files=data_path, split="train")

    # Placeholder role markers; swap in the model's own chat template if it defines one.
    role_tags = {"system": "<|system|>", "user": "<|user|>",
                 "assistant": "<|assistant|>", "tool": "<|tool|>"}

    def tokenize_messages(example):
        # Flatten system, user, assistant, and tool turns into one string.
        text = ""
        for msg in example["messages"]:
            tag = role_tags.get(msg["role"])
            if tag is None:
                continue  # ignore unknown roles
            # Tool calls are skipped for now: an assistant turn whose content
            # is null contributes an empty turn.
            content = msg.get("content") or ""
            text += f"{tag}\n{content}\n"
        # No trailing "<|assistant|>" here: that is a generation prompt for
        # inference and does not belong at the end of a training example.

        result = tokenizer(text, truncation=True, max_length=max_length, padding="max_length")
        # Mask padding positions with -100 so the loss ignores them.
        result["labels"] = [
            tok if mask == 1 else -100
            for tok, mask in zip(result["input_ids"], result["attention_mask"])
        ]
        return result

    tokenized = raw_dataset.map(tokenize_messages, remove_columns=raw_dataset.column_names)
    # Hold out (1 - train_split) of the examples for evaluation.
    split = tokenized.train_test_split(test_size=1.0 - train_split, seed=42)
    return split["train"], split["test"]
```

### 2. Fix All Data Paths

| File | Current (Wrong) | Correct |
|------|-----------------|---------|
| `t4-qlora.yaml` | `./data/final/train_combined.jsonl` | `./training-data/tool_examples_combined.jsonl` |
| `extended-context-128k.yaml` | `./training-data/final/train.jsonl` | `./training-data/tool_examples_combined.jsonl` |
| `train_local.py` | `./data/final/train.jsonl` | `./training-data/tool_examples_combined.jsonl` |

### 3. Fix t4-qlora.yaml

- Remove `neat_ft: false` (not a valid field)
- Add an `output_dir` override or create `training-configs/t4-qlora-data-fix.yaml`

### 4. Fix evaluate_model.py

- Add proper HumanEval problem loading (use the `openai/openai_humaneval` dataset from HuggingFace)
- Fix the pass@k calculation
- Expand the safe builtins for code execution

### 5. Fix train_local.py

- Remove the broken `stack/training` import path
- Add proper 4-bit quantization support for MPS (or detect CUDA availability)
- Fix the data and config paths

---

## 📁 Actual Training Data Location

```
/Users/walidsobhi/stack-2.9/training/training-data/
├── tool_examples.jsonl           (1000 lines)
├── tool_examples_combined.jsonl  (1500 lines)
└── tool_examples.json            (same data, json format)
```

Format: `{"messages": [...], "tools": [...]}`, a messages array with tool calls.

---

## 🚀 Quick Test Command

To verify that the data loads in the expected format after the fixes:

```bash
cd /Users/walidsobhi/stack-2.9/training
python -c "
from datasets import load_dataset
ds = load_dataset('json', data_files='training-data/tool_examples_combined.jsonl', split='train')
print(f'Total examples: {len(ds)}')
print(f'Keys: {ds.column_names}')
print(f'Example: {ds[0]}')
"
```

Expected `Keys` output: `['messages', 'tools']`, not `['text']` or `['instruction', 'output']`.

---

## Next Steps

1. Write a proper `load_chat_data()` function in a shared `data_utils.py`
2. Update `train_simple_nobnb.py` to use it
3. Update all YAML configs with correct data paths
4. Test with 1 epoch on a small sample
5. Then scale to full training on Kaggle/A100
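
For the pass@k fix recommended above for `evaluate_model.py`, the following is a minimal sketch rather than the script's actual code. It assumes `correct_flags` holds one boolean per generated sample for a single problem; the helper names `pass_at_k` and `pass_at_k_from_flags` are hypothetical. It uses the unbiased estimator `1 - C(n-c, k) / C(n, k)` from the HumanEval paper, where n is the number of samples and c the number that pass, so the result no longer depends on the order of the samples.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k).

    n: total samples generated, c: samples that passed, k: draw size.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_from_flags(correct_flags: Sequence[bool], k: int) -> float:
    """Hypothetical wrapper: count passes over ALL samples, not the first k."""
    n = len(correct_flags)
    c = sum(1 for flag in correct_flags if flag)
    return pass_at_k(n, c, k)
```

With four samples of which one passes, `pass_at_k_from_flags` gives the same pass@1 whether the passing sample comes first or last, whereas counting only the first k flags would give 1 in one case and 0 in the other. A benchmark-level score would then be the mean of the per-problem values.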