# Qwen2.5-1.5B Math LoRA Collection
This directory aggregates all LoRA checkpoints produced by the `train_lora` pipeline. Every subfolder corresponds to one math dataset and contains 10 independent 100-shot LoRA runs (groups `00`–`09`) trained on **Qwen2.5-1.5B-Instruct** with identical hyperparameters. The adapters here are the source of truth for downstream evaluation (`../评估体系`) and for the `parameter_generator` project, which learns to map prompts to LoRA weights.
If you are new to the project, this document explains **where the data comes from, how the LoRAs are produced, and how you can reuse them for inference, evaluation, or further training**.
## Provenance
- **Base model:** `Qwen2.5-1.5B-Instruct`
- **Datasets:** sampled from `../../prepare/data/math/*.json`. Each JSON file is a list of `{prompt, response, system?}` records. `dataset_sampler.py` draws 10 disjoint groups of 100 samples (unless the dataset has fewer than 1,000 examples, in which case sampling with replacement keeps the group size fixed) using a deterministic seed derived from the dataset name; see the illustrative sketch after this list.
- **Training recipe (from `config/default.yaml`):**
  - sequence length 4,096; LoRA `r=64`, `alpha=128`, `dropout=0.05`, target modules = `{q,k,v,o,gate,up,down}_proj`
  - 12 epochs / max 1,800 steps, learning rate `1e-4`, per-device batch size `2`, gradient accumulation `16`, BF16 training, gradient checkpointing on, weight decay `0.01`, warmup ratio `0.03`, checkpoints saved every 300 steps (keeping at most 6) plus a final adapter export
  - Tokenizers are cloned from the base model (the pad token defaults to EOS if missing)
- **Monitoring & reproducibility:**
  - Trainer logs (loss, LR, throughput) are in `../logs/<dataset>/group_xx/`.
  - Slurm stdout/err for each shard live in `../logs/slurm/`.
  - `metadata.json` captures the git commit (if `GIT_COMMIT` was set), timestamps, seeds, and the effective batch size so any experiment can be repeated exactly.
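The actual logic lives in `train_lora/dataset_sampler.py`; the snippet below is only an illustrative sketch of the behaviour described above (deterministic per-dataset seed, 10 disjoint 100-shot groups, replacement fallback for small datasets). All names in it are hypothetical, not the real implementation.
```python
# Illustrative sketch only -- NOT the real dataset_sampler.py implementation.
import hashlib
import json
import random
from pathlib import Path

def derive_seed(dataset_name: str) -> int:
    # Deterministic seed from the dataset name, stable across machines and runs.
    return int(hashlib.sha256(dataset_name.encode("utf-8")).hexdigest()[:8], 16)

def sample_groups(dataset_path: Path, n_groups: int = 10, group_size: int = 100):
    records = json.loads(dataset_path.read_text(encoding="utf-8"))
    rng = random.Random(derive_seed(dataset_path.stem))
    if len(records) >= n_groups * group_size:
        # Enough data: one shuffled draw split into disjoint 100-shot groups.
        flat = rng.sample(range(len(records)), n_groups * group_size)
        return [flat[i * group_size:(i + 1) * group_size] for i in range(n_groups)]
    # Small dataset (< 1,000 examples): sample with replacement to keep the group size fixed.
    return [[rng.randrange(len(records)) for _ in range(group_size)] for _ in range(n_groups)]
```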
### End-to-end data flow
1. **Raw JSON data** comes from `../../prepare/data/math`. Each file is a list of dict objects with the following keys (a quick validation sketch follows this list):
   ```json
   {
     "prompt": "...question...",
     "response": "...reference answer...",
     "system": "optional system message"
   }
   ```
2. `python -m train_lora.dataset_sampler --config config/default.yaml` reads every dataset, filters out `GSM8K_test.json`, and deterministically samples 10 × 100 items per dataset. The samples plus metadata (indices, seeds, timestamps) are written to `../prompt_groups/<dataset>/group_xx.json`.
3. `python -m train_lora.run_tasks --run` (or the Slurm array) iterates over dataset/group pairs, loads the corresponding prompt group, and performs LoRA fine-tuning with the Hugging Face `Trainer`.
4. After training finishes, the following artifacts land in `outputs/<dataset>/group_xx/`:
   - a ready-to-use LoRA adapter (`adapter/`)
   - intermediate checkpoints for analysis/resume
   - tokenizers and metadata
5. The evaluation stacks (`../评估体系`, `../parameter_generator/评估`) and the LoRA parameter generator both consume these directories directly.
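If you add or edit raw files, it can be worth sanity-checking them against the schema from step 1 before sampling. This is a convenience sketch, not part of the pipeline; adjust `DATA_DIR` to wherever you run it from.
```python
# Convenience sketch: verify every record matches the {prompt, response, system?} schema.
import json
from pathlib import Path

DATA_DIR = Path("../../prepare/data/math")  # adjust to your working directory

def check_dataset(path: Path) -> None:
    records = json.loads(path.read_text(encoding="utf-8"))
    assert isinstance(records, list), f"{path} must contain a JSON list"
    for i, rec in enumerate(records):
        assert isinstance(rec.get("prompt"), str) and rec["prompt"].strip(), f"{path.name}[{i}]: bad prompt"
        assert isinstance(rec.get("response"), str) and rec["response"].strip(), f"{path.name}[{i}]: bad response"
        if "system" in rec:
            assert isinstance(rec["system"], str), f"{path.name}[{i}]: system must be a string"
    print(f"{path.name}: {len(records)} records OK")

for f in sorted(DATA_DIR.glob("*.json")):
    check_dataset(f)
```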
## Directory layout
```
outputs/
├── Competition_Math/
├── GSM8K_train/
├── MATH/
├── Math-IIO-68K-Mini/
├── Math-Plus/
├── Math_QA/
├── Mu-Math/
└── ToT-Math-V1/
```
Each dataset directory contains `group_00` … `group_09`. Inside every group:
| Item | Description |
| --- | --- |
| `adapter/` | Final LoRA export (`adapter_model.safetensors`, `adapter_config.json`, tokenizer + chat template snapshots, and HF `training_args.bin`). This is the folder you will load for inference. |
| `checkpoints/checkpoint-xxxx/` | Intermediate Trainer checkpoints saved every 300 steps (300–1,800). They include optimizer, scheduler, RNG state, and tokenizer copies for resuming or studying training dynamics. |
| `tokenizer/` | Standalone tokenizer snapshot identical to the one used during training; useful if you need a self-contained deployment without referencing the base model directory. |
| `prompt_group.json` | The exact 100-shot dataset used for this training run (a copy of `prompt_groups/<dataset>/group_xx.json`). Contains metadata such as sampled indices, original source file, and timestamp. |
| `metadata.json` | Provenance record with training loss, Trainer metrics, LoRA config, effective batch size/world size, timestamps, git commit (if exported), and file paths. |
| `metadata.json -> trainer_state` | Full training log history (per-step metrics). Disable via `metadata.save_training_state: false` if you want lighter metadata. |
> **Tip:** Use `metadata.json` to find the latest checkpoint, to confirm which base model/tokenizer were used, or to drive automated uploads/evaluations.
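For example, a small sketch that reads one group's `metadata.json` and locates its newest intermediate checkpoint. Only fields documented in the field reference below are assumed; adapt anything else to your own metadata.
```python
# Sketch: inspect one group's provenance and locate its newest checkpoint.
import json
import re
from pathlib import Path

group_dir = Path("outputs/Math_QA/group_00")
meta = json.loads((group_dir / "metadata.json").read_text(encoding="utf-8"))

print("run:", meta.get("dataset_name"), "group", meta.get("group_index"))
print("final train_loss:", meta.get("train_loss"), "| generated_at:", meta.get("generated_at"))

ckpt_root = Path(meta.get("checkpoint_root") or group_dir / "checkpoints")
checkpoints = sorted(ckpt_root.glob("checkpoint-*"),
                     key=lambda p: int(re.search(r"\d+", p.name).group()))
print("latest checkpoint:", checkpoints[-1] if checkpoints else "none found")
```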
## Dataset overview
| Dataset dir | Source file (relative to `prepare/data/math`) | Notes |
| --- | --- | --- |
| `Competition_Math` | `Competition_Math.json` | 100-shot groups drawn from Competition Math practice problems. |
| `GSM8K_train` | `GSM8K_train.json` | Standard GSM8K train split, excluding the public test set (`GSM8K_test.json` was filtered out). |
| `MATH` | `MATH.json` | High-school & olympiad math benchmark. |
| `Math-IIO-68K-Mini` | `Math-IIO-68K-Mini.json` | Mini version of the Math IIO dataset. |
| `Math-Plus` | `Math-Plus.json` | Composed of challenging math word problems. |
| `Math_QA` | `Math_QA.json` | Multiple-choice MathQA dataset reformatted as open-ended QA. |
| `Mu-Math` | `Mu-Math.json` | MuSR-style math reasoning set. |
| `ToT-Math-V1` | `ToT-Math-V1.json` | Tree-of-Thought flavored math prompts. |
All datasets follow the same JSON schema, so swapping between them only changes topical coverage.
## How to navigate a single group
```
Math_QA/
└── group_00/
    ├── adapter/
    │   ├── adapter_config.json
    │   ├── adapter_model.safetensors
    │   ├── tokenizer/…  (extra copies of merges, vocab, chat_template.jinja)
    │   └── training_args.bin
    ├── checkpoints/
    │   ├── checkpoint-300/
    │   ├── checkpoint-600/
    │   └── …
    ├── tokenizer/          # same as base tokenizer but pinned to this run
    ├── prompt_group.json   # 100-shot data
    └── metadata.json
```
When inspecting or sharing a run, the **minimum** file set is `adapter/` + `prompt_group.json` + `metadata.json`. Everything else speeds up resuming or auditing.
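A minimal sketch for staging exactly that file set (paths are placeholders):
```python
# Sketch: copy the minimum shareable artifacts of one run into a staging folder.
import shutil
from pathlib import Path

src = Path("outputs/Math_QA/group_00")
dst = Path("share/Math_QA/group_00")
dst.mkdir(parents=True, exist_ok=True)

shutil.copytree(src / "adapter", dst / "adapter", dirs_exist_ok=True)
for name in ("prompt_group.json", "metadata.json"):
    shutil.copy2(src / name, dst / name)
```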
## Using the adapters
### 0. Environment prerequisites
- Python ≥ 3.10, `transformers >= 4.37`, `peft >= 0.8`, `accelerate`, `safetensors`, and a GPU build of `torch`.
- The base model directory must be accessible; otherwise download `Qwen2.5-1.5B-Instruct` from Hugging Face and update the `base_model` path.
- Optional: set `HF_HOME` / `TRANSFORMERS_CACHE` to avoid repeated downloads.
### 0.5. Reproduce the training pipeline (optional)
To regenerate any adapter from scratch:
```bash
cd train_lora
python -m train_lora.dataset_sampler --overwrite   # regenerates prompt groups
python -m train_lora.train_single --dataset Math_QA --group 0
# or run the full queue
python -m train_lora.run_tasks --run
```
These commands rebuild `prompt_groups/` and `outputs/` with exactly the same seeds and configuration documented above. Slurm users should submit `sbatch run_lora_multinode.sh`.
### 1. Load adapter with PEFT
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = "Qwen2.5-1.5B-Instruct"
adapter_dir = "outputs/Math_QA/group_00/adapter"

tokenizer = AutoTokenizer.from_pretrained(adapter_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter_dir)

prompt = "Solve 3x + 7 = 22."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
Notes:
- Loading the tokenizer from `adapter/` guarantees the identical chat template and any additional tokens. You can also point to the base tokenizer path if you prefer.
- For batch inference or deployment, call `model.merge_and_unload()` to bake the LoRA weights into the base model and obtain a single combined set of weights (at the cost of losing LoRA toggling).
- For maximal throughput on a single GPU, pick the dtype your hardware prefers via `model.half()` or `model.to(torch.bfloat16)`; the adapters were trained in BF16, so keeping BF16 is the safest choice.
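Qwen2.5-Instruct models normally expect chat-template formatting, so if results with raw prompts look off, try building the input through the template. The snippet below is a sketch that continues the example above; the system message is an arbitrary example and the merged-model path is a placeholder.
```python
# Sketch: chat-template prompting, then optionally merge the LoRA for deployment.
messages = [
    {"role": "system", "content": "You are a helpful math assistant."},  # example only
    {"role": "user", "content": "Solve 3x + 7 = 22."},
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(chat_inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][chat_inputs.shape[-1]:], skip_special_tokens=True))

# Optional: bake the adapter into the base weights and save a standalone model.
merged = model.merge_and_unload()
merged.save_pretrained("merged/Math_QA_group_00")
tokenizer.save_pretrained("merged/Math_QA_group_00")
```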
### 2. Resume or continue training
```bash
python -m train_lora.train_single \
    --dataset Math_QA \
    --group 0 \
    --group-file outputs/Math_QA/group_00/prompt_group.json
```
Set `--group-file` to reuse the same 100 samples, and initialize the `Trainer` with `checkpoints/checkpoint-XXXX` via `TrainingArguments.resume_from_checkpoint`. This either reproduces a group exactly or lets you extend the training steps.
To resume manually:
```python
trainer.train(resume_from_checkpoint="outputs/Math_QA/group_00/checkpoints/checkpoint-1500")
```
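To pick up the newest checkpoint automatically instead of hard-coding the step, `transformers` provides a small helper:
```python
from transformers.trainer_utils import get_last_checkpoint

last_ckpt = get_last_checkpoint("outputs/Math_QA/group_00/checkpoints")
trainer.train(resume_from_checkpoint=last_ckpt)  # None -> starts a fresh run
```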
### 3. Evaluate with Math-Verify
The evaluation stacks in `../评估体系` and `../parameter_generator/评估` expect this directory layout. Example:
```bash
cd 评估体系
python scripts/run_all_evals.py \
    --config configs/eval_config.yaml \
    --datasets Math_QA \
    --groups 0 1
```
### 4. Packaging for distribution
- Upload only `adapter/` and `metadata.json` when sharing publicly (e.g., on Hugging Face) to avoid huge checkpoint directories; see the upload sketch after this list.
- Keep `prompt_group.json` if you want consumers to understand the training data or to regenerate LoRA weights from the same samples.
- When exporting, include a README snippet that references this document so downstream users know the provenance.
- Suggested Hugging Face layout:
  ```
  Math_QA/
    group_00/
      adapter/
      prompt_group.json
      metadata.json
      README.md   (copy the sections describing provenance + usage)
  ```
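One possible way to push that layout programmatically, assuming `huggingface_hub` is installed and you are authenticated (`huggingface-cli login`); the repo id and paths are placeholders:
```python
# Sketch: upload the minimal artifacts of one group to the Hugging Face Hub.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-org/qwen2.5-1.5b-math-lora"  # placeholder
api.create_repo(repo_id, repo_type="model", exist_ok=True)

api.upload_folder(
    repo_id=repo_id,
    folder_path="outputs/Math_QA/group_00/adapter",
    path_in_repo="Math_QA/group_00/adapter",
)
for name in ("prompt_group.json", "metadata.json"):
    api.upload_file(
        repo_id=repo_id,
        path_or_fileobj=f"outputs/Math_QA/group_00/{name}",
        path_in_repo=f"Math_QA/group_00/{name}",
    )
```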
## File reference (`metadata.json`)
Key fields you may want to automate against:
| Field | Meaning |
| --- | --- |
| `dataset_name`, `group_index` | Identify the run. |
| `prompt_group_file` | Absolute path back to the sampled dataset. |
| `checkpoint_root` | Where all intermediate checkpoints live. |
| `train_loss`, `metrics` | Final loss and the Trainer metrics dict. |
| `trainer_state` | Full log history (can be large; disable via `metadata.save_training_state`). |
| `training_args` | Exact HF `TrainingArguments` snapshot. |
| `lora_config` | Copy of the LoRA hyperparameters used. |
| `effective_batch_size` | `world_size × per_device_batch_size × grad_accum`; useful for scaling comparisons. |
| `git_commit` | Populated if the `GIT_COMMIT` env var was set before training. |
| `metrics.train_runtime`, `metrics.train_samples_per_second` | Throughput stats. |
| `generated_at` | UTC timestamp when the metadata was written. |
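These fields are enough to build quick summaries across the whole collection. A sketch that tabulates loss and runtime for every group, assuming only the documented keys:
```python
# Sketch: summarize all runs from their metadata.json files.
import json
from pathlib import Path

for meta_path in sorted(Path("outputs").glob("*/group_*/metadata.json")):
    meta = json.loads(meta_path.read_text(encoding="utf-8"))
    metrics = meta.get("metrics") or {}
    print(f"{meta.get('dataset_name', meta_path.parts[-3]):>20} "
          f"group {meta.get('group_index', meta_path.parts[-2])}: "
          f"loss={meta.get('train_loss')} runtime={metrics.get('train_runtime')}s")
```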
## Best practices
- Always match BF16 or FP16 settings between base-model loading and adapter training; these adapters were trained in BF16.
- If you edit files inside this directory, keep the structure intact; other scripts rely on relative paths (`adapter`, `tokenizer`, `metadata.json`).
- Before deploying a new LoRA, verify it with the evaluation suite, and consider merging multiple groups (e.g., ensembling or checkpoint averaging) only after confirming stability.
- Use `prompt_group.json` and `metadata.json` as documentation when presenting results; they already include seeds, sample indices, and environment details.
- If you build new LoRAs with different configs (e.g., higher rank, more steps), add a sibling directory (e.g., `outputs_v2/`) or annotate this README so collaborators know which adapters correspond to which experiment.
Happy finetuning! If you extend this collection (new datasets, extra groups, or different hyperparameters), add another section here describing the changes so downstream consumers stay informed.