# Qwen2.5-1.5B Math LoRA Collection
This directory aggregates all LoRA checkpoints produced by the `train_lora` pipeline. Every subfolder corresponds to one math dataset and contains 10 independent 100-shot LoRA runs (groups `00`–`09`) trained on **Qwen2.5-1.5B-Instruct** with identical hyperparameters. The adapters here are the source of truth for downstream evaluation (`../评估体系`) and for the `parameter_generator` project, which learns to map prompts to LoRA weights.
If you are new to the project, this document explains **where the data comes from, how the LoRAs are produced, and how you can reuse them for inference, evaluation, or further training**.
## Provenance
- **Base model:** `Qwen2.5-1.5B-Instruct`
- **Datasets:** sampled from `../../prepare/data/math/*.json`. Each JSON file is a list of `{prompt, response, system?}` records. `dataset_sampler.py` draws 10 disjoint groups of 100 samples (unless the dataset has fewer than 1,000 examples, in which case sampling with replacement keeps the group size fixed) using a deterministic seed derived from the dataset name; see the illustrative sketch after this list.
- **Training recipe (from `config/default.yaml`):**
  - sequence length 4,096; LoRA `r=64`, `alpha=128`, `dropout=0.05`, target modules = `{q,k,v,o,gate,up,down}_proj`
  - 12 epochs / max 1,800 steps, learning rate `1e-4`, per-device batch size `2`, gradient accumulation `16`, BF16 training, gradient checkpointing on, weight decay `0.01`, warmup ratio `0.03`, checkpoints saved every 300 steps (keeping at most 6) plus a final adapter export
  - Tokenizers are cloned from the base model (the pad token defaults to EOS if missing)
- **Monitoring & reproducibility:**
  - Trainer logs (loss, LR, throughput) are in `../logs/<dataset>/group_xx/`.
  - Slurm stdout/err for each shard live in `../logs/slurm/`.
  - `metadata.json` captures the git commit (if `GIT_COMMIT` was set), timestamps, seeds, and the effective batch size so any experiment can be repeated exactly.
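The actual logic lives in `train_lora/dataset_sampler.py`; the snippet below is only an illustrative sketch of the behaviour described above (deterministic per-dataset seed, 10 disjoint 100-shot groups, replacement fallback for small datasets). All names in it are hypothetical, not the real implementation.
```python
# Illustrative sketch only -- NOT the real dataset_sampler.py implementation.
import hashlib
import json
import random
from pathlib import Path

def derive_seed(dataset_name: str) -> int:
    # Deterministic seed from the dataset name, stable across machines and runs.
    return int(hashlib.sha256(dataset_name.encode("utf-8")).hexdigest()[:8], 16)

def sample_groups(dataset_path: Path, n_groups: int = 10, group_size: int = 100):
    records = json.loads(dataset_path.read_text(encoding="utf-8"))
    rng = random.Random(derive_seed(dataset_path.stem))
    if len(records) >= n_groups * group_size:
        # Enough data: one shuffled draw split into disjoint 100-shot groups.
        flat = rng.sample(range(len(records)), n_groups * group_size)
        return [flat[i * group_size:(i + 1) * group_size] for i in range(n_groups)]
    # Small dataset (< 1,000 examples): sample with replacement to keep the group size fixed.
    return [[rng.randrange(len(records)) for _ in range(group_size)] for _ in range(n_groups)]
```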
### End-to-end data flow
1. **Raw JSON data** comes from `../../prepare/data/math`. Each file is a list of dict objects with the following keys (a quick validation sketch follows this list):
   ```json
   {
     "prompt": "...question...",
     "response": "...reference answer...",
     "system": "optional system message"
   }
   ```
2. `python -m train_lora.dataset_sampler --config config/default.yaml` reads every dataset, filters out `GSM8K_test.json`, and deterministically samples 10 × 100 items per dataset. The samples plus metadata (indices, seeds, timestamps) are written to `../prompt_groups/<dataset>/group_xx.json`.
3. `python -m train_lora.run_tasks --run` (or the Slurm array) iterates over dataset/group pairs, loads the corresponding prompt group, and performs LoRA fine-tuning with the Hugging Face `Trainer`.
4. After training finishes, the following artifacts land in `outputs/<dataset>/group_xx/`:
   - a ready-to-use LoRA adapter (`adapter/`)
   - intermediate checkpoints for analysis/resume
   - tokenizers and metadata
5. The evaluation stacks (`../评估体系`, `../parameter_generator/评估`) and the LoRA parameter generator both consume these directories directly.
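If you add or edit raw files, it can be worth sanity-checking them against the schema from step 1 before sampling. This is a convenience sketch, not part of the pipeline; adjust `DATA_DIR` to wherever you run it from.
```python
# Convenience sketch: verify every record matches the {prompt, response, system?} schema.
import json
from pathlib import Path

DATA_DIR = Path("../../prepare/data/math")  # adjust to your working directory

def check_dataset(path: Path) -> None:
    records = json.loads(path.read_text(encoding="utf-8"))
    assert isinstance(records, list), f"{path} must contain a JSON list"
    for i, rec in enumerate(records):
        assert isinstance(rec.get("prompt"), str) and rec["prompt"].strip(), f"{path.name}[{i}]: bad prompt"
        assert isinstance(rec.get("response"), str) and rec["response"].strip(), f"{path.name}[{i}]: bad response"
        if "system" in rec:
            assert isinstance(rec["system"], str), f"{path.name}[{i}]: system must be a string"
    print(f"{path.name}: {len(records)} records OK")

for f in sorted(DATA_DIR.glob("*.json")):
    check_dataset(f)
```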
## Directory layout
```
outputs/
├── Competition_Math/
├── GSM8K_train/
├── MATH/
├── Math-IIO-68K-Mini/
├── Math-Plus/
├── Math_QA/
├── Mu-Math/
└── ToT-Math-V1/
```
Each dataset directory contains `group_00` … `group_09`. Inside every group:
| Item | Description |
| --- | --- |
| `adapter/` | Final LoRA export (`adapter_model.safetensors`, `adapter_config.json`, tokenizer + chat template snapshots, and HF `training_args.bin`). This is the folder you will load for inference. |
| `checkpoints/checkpoint-xxxx/` | Intermediate Trainer checkpoints saved every 300 steps (300–1,800). They include optimizer, scheduler, RNG state, and tokenizer copies for resuming or studying training dynamics. |
| `tokenizer/` | Standalone tokenizer snapshot identical to the one used during training; useful if you need a self-contained deployment without referencing the base model directory. |
| `prompt_group.json` | The exact 100-shot dataset used for this training run (a copy of `prompt_groups/<dataset>/group_xx.json`). Contains metadata such as sampled indices, original source file, and timestamp. |
| `metadata.json` | Provenance record with training loss, Trainer metrics, LoRA config, effective batch size/world size, timestamps, git commit (if exported), and file paths. |
| `metadata.json -> trainer_state` | Full training log history (per-step metrics). Disable via `metadata.save_training_state: false` if you want lighter metadata. |
> **Tip:** Use `metadata.json` to find the latest checkpoint, to confirm which base model/tokenizer were used, or to drive automated uploads/evaluations.
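For example, a small sketch that reads one group's `metadata.json` and locates its newest intermediate checkpoint. Only fields documented in the field reference below are assumed; adapt anything else to your own metadata.
```python
# Sketch: inspect one group's provenance and locate its newest checkpoint.
import json
import re
from pathlib import Path

group_dir = Path("outputs/Math_QA/group_00")
meta = json.loads((group_dir / "metadata.json").read_text(encoding="utf-8"))

print("run:", meta.get("dataset_name"), "group", meta.get("group_index"))
print("final train_loss:", meta.get("train_loss"), "| generated_at:", meta.get("generated_at"))

ckpt_root = Path(meta.get("checkpoint_root") or group_dir / "checkpoints")
checkpoints = sorted(ckpt_root.glob("checkpoint-*"),
                     key=lambda p: int(re.search(r"\d+", p.name).group()))
print("latest checkpoint:", checkpoints[-1] if checkpoints else "none found")
```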
## Dataset overview
| Dataset dir | Source file (relative to `prepare/data/math`) | Notes |
| --- | --- | --- |
| `Competition_Math` | `Competition_Math.json` | 100-shot groups drawn from Competition Math practice problems. |
| `GSM8K_train` | `GSM8K_train.json` | Standard GSM8K train split, excluding the public test set (`GSM8K_test.json` was filtered out). |
| `MATH` | `MATH.json` | High-school & olympiad math benchmark. |
| `Math-IIO-68K-Mini` | `Math-IIO-68K-Mini.json` | Mini version of the Math IIO dataset. |
| `Math-Plus` | `Math-Plus.json` | Composed of challenging math word problems. |
| `Math_QA` | `Math_QA.json` | Multiple-choice MathQA dataset reformatted as open-ended QA. |
| `Mu-Math` | `Mu-Math.json` | MuSR-style math reasoning set. |
| `ToT-Math-V1` | `ToT-Math-V1.json` | Tree-of-Thought flavored math prompts. |
All datasets follow the same JSON schema, so swapping between them only changes topical coverage.
## How to navigate a single group
```
Math_QA/
└── group_00/
    ├── adapter/
    │   ├── adapter_config.json
    │   ├── adapter_model.safetensors
    │   ├── tokenizer/…  (extra copies of merges, vocab, chat_template.jinja)
    │   └── training_args.bin
    ├── checkpoints/
    │   ├── checkpoint-300/
    │   ├── checkpoint-600/
    │   └── …
    ├── tokenizer/          # same as base tokenizer but pinned to this run
    ├── prompt_group.json   # 100-shot data
    └── metadata.json
```
When inspecting or sharing a run, the **minimum** file set is `adapter/` + `prompt_group.json` + `metadata.json`. Everything else speeds up resuming or auditing.
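A minimal sketch for staging exactly that file set (paths are placeholders):
```python
# Sketch: copy the minimum shareable artifacts of one run into a staging folder.
import shutil
from pathlib import Path

src = Path("outputs/Math_QA/group_00")
dst = Path("share/Math_QA/group_00")
dst.mkdir(parents=True, exist_ok=True)

shutil.copytree(src / "adapter", dst / "adapter", dirs_exist_ok=True)
for name in ("prompt_group.json", "metadata.json"):
    shutil.copy2(src / name, dst / name)
```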
## Using the adapters
### 0. Environment prerequisites
- Python ≥ 3.10, `transformers >= 4.37`, `peft >= 0.8`, `accelerate`, `safetensors`, and a GPU build of `torch`.
- The base model directory must be accessible; otherwise download `Qwen2.5-1.5B-Instruct` from Hugging Face and update the `base_model` path.
- Optional: set `HF_HOME` / `TRANSFORMERS_CACHE` to avoid repeated downloads.
### 0.5. Reproduce the training pipeline (optional)
To regenerate any adapter from scratch:
```bash
cd train_lora
python -m train_lora.dataset_sampler --overwrite   # regenerates prompt groups
python -m train_lora.train_single --dataset Math_QA --group 0
# or run the full queue
python -m train_lora.run_tasks --run
```
These commands rebuild `prompt_groups/` and `outputs/` with exactly the same seeds and configuration documented above. Slurm users should submit `sbatch run_lora_multinode.sh`.
### 1. Load adapter with PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = "Qwen2.5-1.5B-Instruct"
adapter_dir = "outputs/Math_QA/group_00/adapter"

tokenizer = AutoTokenizer.from_pretrained(adapter_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter_dir)

prompt = "Solve 3x + 7 = 22."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
Notes:
- Loading the tokenizer from `adapter/` guarantees the identical chat template and any additional tokens. You can also point to the base tokenizer path if you prefer.
- For batch inference or deployment, call `model.merge_and_unload()` to bake the LoRA weights into the base model and obtain a single combined set of weights (at the cost of losing LoRA toggling).
- For maximal throughput on a single GPU, pick the dtype your hardware prefers via `model.half()` or `model.to(torch.bfloat16)`; the adapters were trained in BF16, so keeping BF16 is the safest choice.
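Qwen2.5-Instruct models normally expect chat-template formatting, so if results with raw prompts look off, try building the input through the template. The snippet below is a sketch that continues the example above; the system message is an arbitrary example and the merged-model path is a placeholder.
```python
# Sketch: chat-template prompting, then optionally merge the LoRA for deployment.
messages = [
    {"role": "system", "content": "You are a helpful math assistant."},  # example only
    {"role": "user", "content": "Solve 3x + 7 = 22."},
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(chat_inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][chat_inputs.shape[-1]:], skip_special_tokens=True))

# Optional: bake the adapter into the base weights and save a standalone model.
merged = model.merge_and_unload()
merged.save_pretrained("merged/Math_QA_group_00")
tokenizer.save_pretrained("merged/Math_QA_group_00")
```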
### 2. Resume or continue training
```bash
python -m train_lora.train_single \
    --dataset Math_QA \
    --group 0 \
    --group-file outputs/Math_QA/group_00/prompt_group.json
```
Set `--group-file` to reuse the same 100 samples, and initialize the `Trainer` with `checkpoints/checkpoint-XXXX` via `TrainingArguments.resume_from_checkpoint`. This either reproduces a group exactly or lets you extend the training steps.
To resume manually:
```python
trainer.train(resume_from_checkpoint="outputs/Math_QA/group_00/checkpoints/checkpoint-1500")
```
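To pick up the newest checkpoint automatically instead of hard-coding the step, `transformers` provides a small helper:
```python
from transformers.trainer_utils import get_last_checkpoint

last_ckpt = get_last_checkpoint("outputs/Math_QA/group_00/checkpoints")
trainer.train(resume_from_checkpoint=last_ckpt)  # None -> starts a fresh run
```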
### 3. Evaluate with Math-Verify
The evaluation stacks in `../评估体系` and `../parameter_generator/评估` expect this directory layout. Example:
```bash
cd 评估体系
python scripts/run_all_evals.py \
    --config configs/eval_config.yaml \
    --datasets Math_QA \
    --groups 0 1
```
### 4. Packaging for distribution
- Upload only `adapter/` and `metadata.json` when sharing publicly (e.g., on Hugging Face) to avoid huge checkpoint directories; see the upload sketch after this list.
- Keep `prompt_group.json` if you want consumers to understand the training data or to regenerate LoRA weights from the same samples.
- When exporting, include a README snippet that references this document so downstream users know the provenance.
- Suggested Hugging Face layout:
  ```
  Math_QA/
    group_00/
      adapter/
      prompt_group.json
      metadata.json
      README.md   (copy the sections describing provenance + usage)
  ```
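One possible way to push that layout programmatically, assuming `huggingface_hub` is installed and you are authenticated (`huggingface-cli login`); the repo id and paths are placeholders:
```python
# Sketch: upload the minimal artifacts of one group to the Hugging Face Hub.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-org/qwen2.5-1.5b-math-lora"  # placeholder
api.create_repo(repo_id, repo_type="model", exist_ok=True)

api.upload_folder(
    repo_id=repo_id,
    folder_path="outputs/Math_QA/group_00/adapter",
    path_in_repo="Math_QA/group_00/adapter",
)
for name in ("prompt_group.json", "metadata.json"):
    api.upload_file(
        repo_id=repo_id,
        path_or_fileobj=f"outputs/Math_QA/group_00/{name}",
        path_in_repo=f"Math_QA/group_00/{name}",
    )
```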
## File reference (`metadata.json`)
Key fields you may want to automate against:
| Field | Meaning |
| --- | --- |
| `dataset_name`, `group_index` | Identify the run. |
| `prompt_group_file` | Absolute path back to the sampled dataset. |
| `checkpoint_root` | Where all intermediate checkpoints live. |
| `train_loss`, `metrics` | Final loss and the Trainer metrics dict. |
| `trainer_state` | Full log history (can be large; disable via `metadata.save_training_state`). |
| `training_args` | Exact HF `TrainingArguments` snapshot. |
| `lora_config` | Copy of the LoRA hyperparameters used. |
| `effective_batch_size` | `world_size × per_device_batch_size × grad_accum`; useful for scaling comparisons. |
| `git_commit` | Populated if the `GIT_COMMIT` env var was set before training. |
| `metrics.train_runtime`, `metrics.train_samples_per_second` | Throughput stats. |
| `generated_at` | UTC timestamp when the metadata was written. |
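These fields are enough to build quick summaries across the whole collection. A sketch that tabulates loss and runtime for every group, assuming only the documented keys:
```python
# Sketch: summarize all runs from their metadata.json files.
import json
from pathlib import Path

for meta_path in sorted(Path("outputs").glob("*/group_*/metadata.json")):
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    metrics = meta.get("metrics") or {}
    print(f"{meta.get('dataset_name', meta_path.parts[-3]):>20} "
          f"group {meta.get('group_index', meta_path.parts[-2])}: "
          f"loss={meta.get('train_loss')} runtime={metrics.get('train_runtime')}s")
```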
## Best practices
- Always match BF16 or FP16 settings between base-model loading and adapter training; these adapters were trained in BF16.
- If you edit files inside this directory, keep the structure intact; other scripts rely on relative paths (`adapter`, `tokenizer`, `metadata.json`).
- Before deploying a new LoRA, verify it with the evaluation suite, and consider merging multiple groups (e.g., ensembling or checkpoint averaging) only after confirming stability.
- Use `prompt_group.json` and `metadata.json` as documentation when presenting results; they already include seeds, sample indices, and environment details.
- If you build new LoRAs with different configs (e.g., higher rank, more steps), add a sibling directory (e.g., `outputs_v2/`) or annotate this README so collaborators know which adapters correspond to which experiment.
Happy finetuning! If you extend this collection (new datasets, extra groups, or different hyperparameters), add another section here describing the changes so downstream consumers stay informed.