	# TinyReasoner

	A reasoning model with fewer than 1 million parameters (about 951k) that supports tool calling.

	## Architecture
	- 2-layer LSTM
	- Hidden size: 256
	- Embedding size: 128
	- Character-level tokenizer with special tokens for capabilities.
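
The parameter budget implied by this architecture can be checked with simple arithmetic. The sketch below assumes the PyTorch `nn.LSTM` parameter layout (four gates, two bias vectors per layer) and an untied linear output head; the vocabulary sizes are illustrative assumptions, since the README does not state the actual vocabulary size.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Parameters in one LSTM layer: 4 gates, each with input weights,
    recurrent weights, and two bias vectors (the nn.LSTM layout)."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


def total_params(vocab_size: int, embed: int = 128, hidden: int = 256, layers: int = 2) -> int:
    """Embedding + stacked LSTM + untied output projection (assumed head)."""
    embedding = vocab_size * embed
    lstm = lstm_layer_params(embed, hidden)
    lstm += (layers - 1) * lstm_layer_params(hidden, hidden)
    output_head = hidden * vocab_size + vocab_size  # weights + bias
    return embedding + lstm + output_head


if __name__ == "__main__":
    # Hypothetical vocabulary sizes, chosen only to show how the total scales.
    for vocab_size in (64, 76, 128):
        print(vocab_size, total_params(vocab_size))
```

Under these assumptions the two LSTM layers alone account for 921,600 parameters, so the vocabulary-dependent embedding and output head have to fit in the remaining budget below 1M.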

	## Progress

	### Session 1: Pretraining
	- Implemented the model, tokenizer, and pretraining script.
	- Pretrained on NLTK Gutenberg corpus (approx. 1M characters).
	- Saved checkpoint: `models/pretrained.pt`.
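
As an illustration of a character-level tokenizer with special tokens, here is a minimal sketch. The class name `CharTokenizer` and its interface are hypothetical and are not taken from the repository; only the `[DEFINE]` and `[SYMPY]` token strings come from this README.

```python
class CharTokenizer:
    """Character-level tokenizer that maps each special token to a single id."""

    def __init__(self, corpus: str, special_tokens: list[str]):
        self.special_tokens = list(special_tokens)
        chars = sorted(set(corpus))
        # Special tokens take the lowest ids, followed by the corpus characters.
        self.itos = self.special_tokens + chars
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    def encode(self, text: str) -> list[int]:
        ids, i = [], 0
        while i < len(text):
            for tok in self.special_tokens:
                if text.startswith(tok, i):
                    ids.append(self.stoi[tok])
                    i += len(tok)
                    break
            else:
                # No special token starts here: fall back to one character.
                ids.append(self.stoi[text[i]])  # KeyError on unseen characters
                i += 1
        return ids

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


if __name__ == "__main__":
    tok = CharTokenizer("abc", ["[DEFINE]", "[SYMPY]"])
    ids = tok.encode("[DEFINE]cab")
    print(ids, tok.decode(ids))
```

Treating each capability marker as one token keeps tool calls short in sequence length, which matters for a small recurrent model.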

	### Session 2: Supervised Fine-Tuning (SFT)
	- Implemented `src/capabilities.py` with dictionary (NLTK WordNet) and math (SymPy) tools.
	- Implemented `src/sampler.py` with capability call detection and result injection.
	- Created `src/generate_sft_data.py` to generate 2000 synthetic reasoning traces.
	- Created `src/sft_train.py` for instruction tuning using the SOAP optimizer.
	- Fine-tuned the model on reasoning traces.
	- Saved checkpoint: `models/sft_model.pt`.
	- Verified that the model starts to use `[DEFINE]` and `[SYMPY]` tokens correctly.
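
Capability call detection and result injection could look roughly like the sketch below. It is not the code in `src/sampler.py`: the call syntax (a capability token followed by a payload that runs to the next `[` or newline) and the `[RESULT]...[/RESULT]` markers are assumptions, and the tools are stubs standing in for the WordNet and SymPy wrappers.

```python
import re
from typing import Callable, Dict

# Assumed syntax: capability token, then a payload up to the next "[" or newline.
CALL_PATTERN = re.compile(r"\[(DEFINE|SYMPY)\]([^\[\n]+)")


def inject_results(text: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """Append a hypothetical [RESULT]...[/RESULT] span after each detected call."""

    def run(match: "re.Match") -> str:
        name, payload = match.group(1), match.group(2).strip()
        try:
            result = tools[name](payload)
        except Exception as exc:  # a failing tool should not abort sampling
            result = f"error: {exc}"
        return f"{match.group(0)}[RESULT]{result}[/RESULT]"

    return CALL_PATTERN.sub(run, text)


if __name__ == "__main__":
    # Stub tools; the real project would call WordNet and SymPy here.
    stub_tools = {
        "DEFINE": lambda word: f"(definition of {word})",
        "SYMPY": lambda expr: f"(value of {expr})",
    }
    print(inject_results("Think: [DEFINE]cat\nDone", stub_tools))
```

In an actual sampler the injection would happen during generation, so that the model conditions on the tool result before producing the rest of the trace.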

	### Session 3: Reinforcement Learning (GRPO)
	- Implemented GRPO training in `src/grpo_train.py`.
	- Created multi-faceted reward functions in `src/rewards.py`.
	- Expanded prompt generation in `src/prompts.py`.
	- Updated `src/sampler.py` to support multi-rollout log-prob and mask tracking.
	- Verified RL training loop and checkpointing.
	- Saved initial RL checkpoint: `models/rl_model.pt`.
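
The core of GRPO is that each rollout's reward is normalised against the other rollouts for the same prompt. The sketch below shows that step, plus a masked log-prob sum; that the mask excludes injected tool-result tokens is an assumption about what "mask tracking" means here, and the function names are hypothetical.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalise each rollout's reward against its own group (same prompt)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def masked_logprob_sum(token_logprobs: List[float], mask: List[int]) -> float:
    """Sum log-probs over positions the policy generated itself (mask == 1)."""
    return sum(lp for lp, m in zip(token_logprobs, mask) if m)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 1.0]  # one group of four rollouts for one prompt
    print(group_relative_advantages(rewards))
    print(masked_logprob_sum([-1.0, -2.0, -3.0], [1, 0, 1]))
```

Note that when every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason low completion diversity stalls training.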

	### Session 4: Extended RL and Task Complexity
	- Enhanced `src/prompts.py` with multi-step math problems and word length comparisons.
	- Refined `src/rewards.py` with length penalties and rewards for utilizing tool results.
	- Improved `src/grpo_train.py` for continuous training and increased iterations to 500.
	- Successfully completed 500 iterations of GRPO training.
	- Verified that the model maintains reasoning traces and tool use even with increased task complexity.
	- Synced all artifacts to Hugging Face bucket.

	### Session 5: Addressing Mode Collapse and Real Data Integration
	- Identified "mode collapse" where the model defaulted to `[DEFINE]elephant` for most prompts.
	- Updated `src/generate_sft_data.py` to use a much larger vocabulary from NLTK.
	- Created `src/generate_real_sft_data.py` to incorporate real tool-calling traces (from Hermes dataset) with injected capability calls.
	- Retrained SFT model on combined synthetic and real data (3000 examples) for 10 epochs.
	- Strengthened RL rewards in `src/rewards.py`:
	  - Increased grounding reward to 0.25 per entity.
	  - Added a -0.5 penalty for tool calls that don't match prompt entities (hallucinations).
	  - Increased reward for utilizing tool results to 0.3.
	- Continued GRPO training with the new reward structure.
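
The three Session 5 reward terms can be combined as in the sketch below. The numeric values come from the list above; everything else is an assumption made for illustration, including the function name, the way entities are matched against payload tokens, and the rule that a tool result counts as "utilized" when it appears verbatim in the answer.

```python
import re
from typing import List


def tokens(text: str) -> set:
    """Lower-cased words and numbers, so "3+4" yields {"3", "4"}."""
    return set(re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower()))


def session5_reward(prompt_entities: List[str], tool_payloads: List[str],
                    tool_result: str, answer: str) -> float:
    reward = 0.0
    for payload in tool_payloads:
        payload_tokens = tokens(payload)
        matched = [e for e in prompt_entities if e.lower() in payload_tokens]
        if matched:
            reward += 0.25 * len(matched)  # grounding reward per entity
        else:
            reward -= 0.5                  # call unrelated to the prompt
    if tool_result and tool_result in answer:
        reward += 0.3                      # answer makes use of the tool result
    return reward


if __name__ == "__main__":
    print(session5_reward(["cat"], ["cat"], "a small feline", "A cat is a small feline."))
    print(session5_reward(["cat"], ["elephant"], "", "An elephant."))
```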

	### Session 6: Aggressive RL and Grounding Incentives
	- Drastically increased grounding rewards (from 0.5 to 5.0) and hallucination penalties (from -1.0 to -10.0) in `src/rewards.py`.
	- Introduced a specific -20.0 penalty for the `elephant` hallucination mode to force exploration.
	- Reduced KL penalty (`beta` from 0.01 to 0.0001) in `src/grpo_train.py` to allow the model to move away from the collapsed SFT state.
	- Increased GRPO `group_size` to 16 for better advantage estimation.
	- Implemented alternating exploration strategies in `src/sampler.py` (noise vs. temperature).
	- Observed the model starting to explore other words like `cat`, `banana`, and `jacket` in RL logs.
	- Added detailed logging of sample completions and unique completion counts in the training loop.
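
Logging unique completion counts gives a direct signal for mode collapse. A minimal sketch of such a statistic follows; the function name and the returned fields are hypothetical, not the format of the actual training logs.

```python
from collections import Counter
from typing import Dict, List


def completion_diversity(completions: List[str]) -> Dict[str, object]:
    """Summarise how concentrated a batch of rollouts is on one completion."""
    if not completions:
        raise ValueError("need at least one completion")
    counts = Counter(c.strip() for c in completions)
    top, top_count = counts.most_common(1)[0]
    return {
        "unique": len(counts),
        "total": len(completions),
        "top": top,
        "top_fraction": top_count / len(completions),
    }


if __name__ == "__main__":
    batch = ["[DEFINE]elephant"] * 3 + ["[DEFINE]cat"]
    print(completion_diversity(batch))
```

A `top_fraction` close to 1.0 across many different prompts would indicate that the policy has collapsed onto a single output regardless of input.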

	### Session 7: Curriculum Learning and Dense Rewards
	- Implemented a three-level curriculum strategy in `src/prompts.py`.
	- Updated `src/rewards.py` for dense grounding rewards.
	- Enhanced `src/grpo_train.py` with curriculum progression and persistence.
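
One way to implement curriculum progression with persistence is sketched below. The README does not describe the promotion rule, so the reward threshold, the minimum number of iterations per level, and the JSON state file are all assumptions; the levels are numbered 0 to 2 to match the "Level 0" naming used in later sessions.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path

MAX_LEVEL = 2  # three levels: 0, 1, 2


@dataclass
class CurriculumState:
    level: int = 0
    iterations_at_level: int = 0


def update(state: CurriculumState, avg_reward: float,
           promote_threshold: float, min_iterations: int = 50) -> CurriculumState:
    """Advance one level once the policy has been good enough for long enough."""
    state.iterations_at_level += 1
    ready = state.iterations_at_level >= min_iterations and avg_reward >= promote_threshold
    if ready and state.level < MAX_LEVEL:
        state.level += 1
        state.iterations_at_level = 0
    return state


def save_state(state: CurriculumState, path: str) -> None:
    Path(path).write_text(json.dumps(asdict(state)))


def load_state(path: str) -> CurriculumState:
    """Resume from disk if a state file exists, otherwise start at Level 0."""
    p = Path(path)
    return CurriculumState(**json.loads(p.read_text())) if p.exists() else CurriculumState()
```

Persisting the state alongside the checkpoint lets a resumed run continue at the level it had reached instead of restarting from Level 0.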

	### Session 8: Targeted Grounding and Curriculum Reset
	- Identified persistent mode collapse (e.g., hallucinating `elephant`).
	- Created `src/generate_grounding_data.py` to generate 200 high-quality grounding examples.
	- Modified `src/sft_train.py` to support custom datasets and fine-tuning from existing RL checkpoints.
	- Performed a "nudge" SFT phase on the grounding dataset to re-orient the model towards prompt entities.
	- Reset GRPO curriculum to Level 0 to focus on simple grounding tasks.
	- Ran 300 iterations of GRPO.
	- Observed that while the model moved away from `elephant`, it partially collapsed into a new mode (`jacket`).
	- Verified that tool use syntax and reasoning traces remain intact.

	### Session 9: Tightened Grounding Rewards and Continued RL
	- Identified new mode collapse into `jacket` and `guitar`.
	- Significantly tightened grounding rewards in `src/rewards.py`:
	  - Increased reward for grounded entities to +10.0.
	  - Added heavy penalties (-20.0) for individual hallucinated entities in tool payloads.
	  - Added specific penalties (-20.0) for `jacket` mode collapse.
	- Continued GRPO training for 300 iterations at Level 0.
	- Monitored training logs to ensure the model explores beyond collapsed modes.
	- Synced updated `rl_model.pt` and training logs to Hugging Face bucket.
	- Verified that core architecture and tool-calling mechanics remain stable via integration tests.

	### Session 10: Grounding Refresh and Evaluation Tools
	- Introduced `src/compare_models.py` to quantitatively track grounding rate and average reward.
	- Enhanced `src/generate_grounding_data.py` with diverse reasoning templates and division operations.
	- Generated an expanded dataset of 5000 grounding-focused examples.
	- Performed a targeted SFT "refresh" on this data (`models/sft_grounding_v2.pt`) to stabilize tool payloads.
	- Resumed GRPO training from the refreshed base, achieving a grounding rate of 0.55 (up from 0.3).
	- Verified that math grounding is strong, while dictionary grounding remains a focus for future sessions.
	- Confirmed that the model remains within the 1M parameter limit (~951k parameters).
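
The README does not define "grounding rate" precisely. The sketch below uses one plausible definition, stated here as an assumption: the fraction of samples in which at least one tool payload mentions an entity from the prompt. The function names are hypothetical and are not taken from `src/compare_models.py`.

```python
import re
from typing import List, Sequence, Tuple

# Each sample pairs the prompt's entities with the tool payloads the model emitted.
Sample = Tuple[List[str], List[str]]


def tokens(text: str) -> set:
    """Lower-cased words and numbers, so "3+4" yields {"3", "4"}."""
    return set(re.findall(r"[a-z]+|\d+(?:\.\d+)?", text.lower()))


def is_grounded(prompt_entities: List[str], tool_payloads: List[str]) -> bool:
    entities = {e.lower() for e in prompt_entities}
    return any(entities & tokens(payload) for payload in tool_payloads)


def grounding_rate(samples: Sequence[Sample]) -> float:
    if not samples:
        return 0.0
    grounded = sum(is_grounded(entities, payloads) for entities, payloads in samples)
    return grounded / len(samples)


if __name__ == "__main__":
    samples = [
        (["cat"], ["cat"]),        # grounded dictionary call
        (["dog"], ["elephant"]),   # hallucinated payload
        (["3", "4"], ["3+4"]),     # grounded math call
        (["tree"], []),            # no tool call at all
    ]
    print(grounding_rate(samples))
```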

	### Session 11: Quantitative Comparison and Continued RL
	- Refactored `src/compare_models.py` to support command-line arguments for easier checkpoint evaluation.
	- Evaluated all existing checkpoints (`sft_model.pt`, `rl_model.pt`, `rl_model_grounding.pt`) to establish the strongest baseline.
	- Continued GRPO RL training starting from the best `rl_model.pt` for 200 iterations at Level 0.
	- Achieved a stable grounding rate of 0.42 on Level 0 tasks.
	- Verified that math tool calls are frequently grounded, although dictionary lookups still exhibit word hallucinations in the payload.
	- Confirmed stability of the training loop and maintained model integrity via `src/integration_test.py`.

	### Session 13: Grounding Bonus and RL Stabilization
	- Refined `src/rewards.py` with a Target Grounding Bonus (+10.0) for matching exact prompt entities in capability calls.
	- Expanded the mode-collapse penalty list to include more hallucinated words (`moss`, `bat`, `sout`, `guitar`, `cat`, `banana`, `tomss`, `seet`).
	- Ran 300 iterations of GRPO starting from `models/sft_grounding_v3.pt`.
	- Achieved a stable Level 0 grounding rate of 0.42, with math grounding being significantly stronger than dictionary grounding.
	- Verified that the math reward often reaches +30.0 due to the new grounding bonus.
	- Maintained model size under 1M parameters (~951k).
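
The interaction between the Target Grounding Bonus and the mode-collapse penalty list can be sketched as follows. The +10.0 bonus and the word list come from this README. The -20.0 value is borrowed from the earlier `elephant` and `jacket` penalties and is an assumption for the expanded list, as is the rule that a listed word is not penalised when it is actually the prompt entity.

```python
from typing import List

# Words the model has collapsed onto in earlier sessions.
COLLAPSE_WORDS = {
    "elephant", "jacket", "moss", "bat", "sout",
    "guitar", "cat", "banana", "tomss", "seet",
}


def payload_reward(prompt_entities: List[str], payload: str,
                   bonus: float = 10.0, collapse_penalty: float = -20.0) -> float:
    """Score a single tool payload against the prompt's entities."""
    word = payload.strip().lower()
    targets = {e.lower() for e in prompt_entities}
    if word in targets:
        return bonus             # exact match with a prompt entity
    if word in COLLAPSE_WORDS:
        return collapse_penalty  # known collapsed mode, not asked for
    return 0.0


if __name__ == "__main__":
    print(payload_reward(["cat"], "cat"))      # listed word, but requested
    print(payload_reward(["tree"], "jacket"))  # collapsed mode
    print(payload_reward(["tree"], "river"))   # neither grounded nor listed
```

Checking the prompt match before the penalty list avoids punishing a correct call such as `[DEFINE]cat` when the prompt really asks about `cat`.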

	## Next Steps
	- Monitor the transition between curriculum levels to ensure the model maintains performance.
	- Continue RL training with these aggressive rewards until grounding (using prompt words/numbers) becomes the dominant strategy.
	- If the model still struggles with grounding, consider generating "near-miss" SFT data where only the grounding is changed.
	- Monitor training diversity to ensure the model doesn't just collapse into a different fixed word (like `cat`).
	- Fine-tune the reward balance as the model starts showing desired behaviors.

	## Usage
	To test the model, run the sampler on a checkpoint (here, the SFT checkpoint):
	```bash
	PYTHONPATH=. python3 src/sampler.py models/sft_model.pt
	```
