Stack-2-9-finetuned / training /IMPROVEMENTS.md
walidsobhie-code
fix: optimize model card badges and clean YAML frontmatter
2064035

Training Infrastructure Improvements

Status: Audit Complete β€” Issues Found & Documented


πŸ”΄ CRITICAL: Data Format Mismatch (Training Won't Run)

The Problem

All training scripts expect simple text/chat formats, but the actual training data uses a messages-array format with tool calls:

# What scripts expect (WRONG):
{"text": "...", "instruction": "...", "output": "..."}

# What the data actually contains (CORRECT):
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": null, "tool_calls": [...]}, {"role": "tool", ...}], "tools": [...]}

Affected Scripts

Script Issue
train_simple_nobnb.py tokenize_function looks for instruction/output fields β€” these don't exist
train_local.py References ./data/final/train.jsonl β€” wrong path and wrong format
train_extended_context.py Same text field assumption β€” won't tokenize properly
t4-qlora.yaml text_field: "text" and dataset_path: "./data/final/train_combined.jsonl" β€” wrong
extended-context-128k.yaml dataset_path: "./training-data/final/train.jsonl" β€” file doesn't exist

Fix Required

A proper data loader that converts the messages format to training tokens, handling:

  • System message prepending
  • Tool-call turns (skip or flatten)
  • User/assistant turns for language modeling
  • Padding and truncation at max_length

πŸ”΄ train_local.py Issues

  1. Broken import path β€” sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'stack/training')) points to a directory that doesn't exist
  2. Wrong data path β€” ./data/final/train.jsonl should be ./training-data/tool_examples_combined.jsonl
  3. Wrong config path β€” stack/training/train_config_local.yaml doesn't exist
  4. MPS check bug β€” torch.backends.mps.is_built() would raise AttributeError on non-Apple hardware
  5. No 4-bit quantization β€” loads full model in FP32, will OOM on Mac MPS

🟑 t4-qlora.yaml Issues

  1. Wrong data path: ./data/final/train_combined.jsonl doesn't exist
  2. Wrong format field: text_field: "text" won't work with messages format
  3. Includes neat_ft: false β€” this is not a valid HF TrainingArguments field
  4. No push_to_hub_model_id despite push_to_hub: true being templated

🟑 extended-context-128k.yaml Issues

  1. Wrong data path: ./training-data/final/train.jsonl doesn't exist
  2. File references Qwen/Qwen2.5-Coder-1.5B but it's not clear if this model already has extended RoPE config
  3. No verification that the base model actually has rope_scaling in its config.json

🟑 evaluate_model.py Issues

  1. Wrong HumanEval format β€” expects test_cases in problem dict, but HumanEval typically uses canonical_solution + test strings that need to be executed
  2. Code execution sandbox is limited β€” only allows specific builtins; many standard library functions missing
  3. No handling of assert statements in test code
  4. calculate_pass_at_k has a bug: correct_in_k = sum(correct_flags[:min(k, len(correct_flags))]) is wrong for pass@k β€” should be number of correct out of k samples drawn, not just first k

🟒 What's Working Well

  • train_simple_nobnb.py β€” Good mixed precision logic, proper bf16/fp16 detection, paged AdamW optimizer, gradient checkpointing with use_reentrant=False
  • Training configs β€” Comprehensive hardware-specific settings, well-documented
  • Recipes β€” Good documentation of GPU requirements and expected runtimes
  • LoRA config β€” Properly targets all relevant modules for Qwen

βœ… Recommended Fixes (Priority Order)

1. Fix Data Loaders (Highest Priority)

Add a proper load_chat_data() function to train_simple_nobnb.py:

def load_chat_data(data_path: str, tokenizer, max_length: int = 2048, train_split: float = 0.9):
    """Load messages-format dataset and convert to training tokens."""
    raw_dataset = load_dataset("json", data_files=data_path, split="train")
    
    def tokenize_messages(example):
        messages = example["messages"]
        # Flatten to: system + user + assistant turns
        text = ""
        for msg in messages:
            role = msg["role"]
            content = msg.get("content", "") or ""
            if role == "system":
                text += f"<|system|>\n{content}\n"
            elif role == "user":
                text += f"<|user|>\n{content}\n"
            elif role == "assistant":
                # Skip tool calls in content for now, just use text response
                text += f"<|assistant|>\n{content}\n"
            elif role == "tool":
                text += f"<|tool|>\n{content}\n"
        text += "<|assistant|>"
        
        result = tokenizer(text, truncation=True, max_length=max_length, padding="max_length")
        result["labels"] = result["input_ids"].copy()
        return result
    
    tokenized = raw_dataset.map(tokenize_messages, remove_columns=raw_dataset.column_names)
    # ... train/test split
    return train_dataset, eval_dataset

2. Fix All Data Paths

Config File Current (Wrong) Correct
t4-qlora.yaml ./data/final/train_combined.jsonl ./training-data/tool_examples_combined.jsonl
extended-context-128k.yaml ./training-data/final/train.jsonl ./training-data/tool_examples_combined.jsonl
train_local.py ./data/final/train.jsonl ./training-data/tool_examples_combined.jsonl

3. Fix t4-qlora.yaml

  • Remove neat_ft: false (not a valid field)
  • Add output_dir override or create training-configs/t4-qlora-data-fix.yaml

4. Fix evaluate_model.py

  • Add proper HumanEval problem loading (use openai/humaneval dataset from HuggingFace)
  • Fix pass@k calculation
  • Expand safe builtins for code execution

5. Fix train_local.py

  • Remove broken stack/training import path
  • Add proper 4-bit quantization support for MPS (or detect CUDA availability)
  • Fix data and config paths

πŸ“ Actual Training Data Location

/Users/walidsobhi/stack-2.9/training/training-data/
β”œβ”€β”€ tool_examples.jsonl           (1000 lines)
β”œβ”€β”€ tool_examples_combined.jsonl  (1500 lines)
└── tool_examples.json            (same data, json format)

Format: {"messages": [...], "tools": [...]} β€” messages-array with tool calls.


πŸš€ Quick Test Command

To verify training would work after fixes:

cd /Users/walidsobhi/stack-2.9/training
python -c "
from datasets import load_dataset
ds = load_dataset('json', data_files='training-data/tool_examples_combined.jsonl', split='train')
print(f'Total examples: {len(ds)}')
print(f'Keys: {ds.column_names}')
print(f'Example: {ds[0]}')
"

Expected output: ['messages', 'tools'] β€” not ['text'] or ['instruction', 'output'].


Next Steps

  1. Write a proper load_chat_data() function in a shared data_utils.py
  2. Update train_simple_nobnb.py to use it
  3. Update all YAML configs with correct data paths
  4. Test with 1 epoch on small sample
  5. Then scale to full training on Kaggle/A100