| # TinyReasoner | |
| A reasoning model with fewer than 1 million parameters (about 951k) that supports tool calling. | |
| ## Architecture | |
| - 2-layer LSTM | |
| - Hidden size: 256 | |
| - Embedding size: 128 | |
| - Character-level tokenizer with special tokens for capabilities. | |
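As a rough sanity check on the parameter budget, the sizes above can be plugged into the usual LSTM parameter formula. This is a minimal sketch, not the project's code: it assumes a PyTorch-style LSTM layout (two bias vectors per layer), an untied linear output layer, and a hypothetical vocabulary of 76 symbols, since the actual vocabulary size is not stated here.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Parameters in one LSTM layer, laid out as in PyTorch's nn.LSTM:
    four gates, each with input weights, recurrent weights, and two biases."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


def tiny_reasoner_params(vocab_size: int, embed: int = 128,
                         hidden: int = 256, layers: int = 2) -> int:
    """Total parameters for embedding + stacked LSTM + output projection."""
    embedding = vocab_size * embed
    # The first layer reads embeddings; later layers read the previous hidden state.
    lstm = lstm_layer_params(embed, hidden)
    lstm += sum(lstm_layer_params(hidden, hidden) for _ in range(layers - 1))
    # Assumption: an untied linear head with bias maps hidden states to characters.
    head = hidden * vocab_size + vocab_size
    return embedding + lstm + head


# Hypothetical vocabulary size (characters plus special tokens); not taken from the repo.
ASSUMED_VOCAB_SIZE = 76
print(tiny_reasoner_params(ASSUMED_VOCAB_SIZE))  # 950860 under these assumptions
```

Under these assumptions the two LSTM layers account for 921,600 parameters, so almost the whole budget sits in the recurrent weights and the vocabulary size has only a small effect on the total.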
| ## Progress | |
| ### Session 1: Pretraining | |
| - Implemented the model, tokenizer, and pretraining script. | |
| - Pretrained on NLTK Gutenberg corpus (approx. 1M characters). | |
| - Saved checkpoint: `models/pretrained.pt`. | |
| ### Session 2: Supervised Fine-Tuning (SFT) | |
| - Implemented `src/capabilities.py` with dictionary (NLTK WordNet) and math (SymPy) tools. | |
| - Implemented `src/sampler.py` with capability call detection and result injection. | |
| - Created `src/generate_sft_data.py` to generate 2000 synthetic reasoning traces. | |
| - Created `src/sft_train.py` for instruction tuning using the SOAP optimizer. | |
| - Fine-tuned the model on reasoning traces. | |
| - Saved checkpoint: `models/sft_model.pt`. | |
| - Verified that the model starts to use `[DEFINE]` and `[SYMPY]` tokens correctly. | |
| ### Session 3: Reinforcement Learning (GRPO) | |
| - Implemented GRPO training in `src/grpo_train.py`. | |
| - Created multi-faceted reward functions in `src/rewards.py`. | |
| - Expanded prompt generation in `src/prompts.py`. | |
| - Updated `src/sampler.py` to support multi-rollout log-prob and mask tracking. | |
| - Verified RL training loop and checkpointing. | |
| - Saved initial RL checkpoint: `models/rl_model.pt`. | |
| ### Session 4: Extended RL and Task Complexity | |
| - Enhanced `src/prompts.py` with multi-step math problems and word length comparisons. | |
| - Refined `src/rewards.py` with length penalties and rewards for utilizing tool results. | |
| - Improved `src/grpo_train.py` for continuous training and increased iterations to 500. | |
| - Successfully completed 500 iterations of GRPO training. | |
| - Verified that the model maintains reasoning traces and tool use even with increased task complexity. | |
| - Synced all artifacts to Hugging Face bucket. | |
| ### Session 5: Addressing Mode Collapse and Real Data Integration | |
| - Identified "mode collapse" where the model defaulted to `[DEFINE]elephant` for most prompts. | |
| - Updated `src/generate_sft_data.py` to use a much larger vocabulary from NLTK. | |
| - Created `src/generate_real_sft_data.py` to incorporate real tool-calling traces (from Hermes dataset) with injected capability calls. | |
| - Retrained SFT model on combined synthetic and real data (3000 examples) for 10 epochs. | |
| - Strengthened RL rewards in `src/rewards.py`: | |
|   - Increased grounding reward to 0.25 per entity. | |
|   - Added a -0.5 penalty for tool calls that don't match prompt entities (hallucinations). | |
|   - Increased reward for utilizing tool results to 0.3. | |
| - Continued GRPO training with the new reward structure. | |
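The reward terms listed for this session can be sketched as follows. This is only an illustration of how the three values might combine, not the contents of `src/rewards.py`: the function name, the regex-based notion of an "entity", and the assumption that tool payloads arrive already parsed are all hypothetical.

```python
import re

GROUNDING_REWARD = 0.25       # per payload entity that also appears in the prompt
HALLUCINATION_PENALTY = -0.5  # per tool call with no entity from the prompt
TOOL_RESULT_REWARD = 0.3      # when the answer makes use of the tool result


def entities(text: str) -> list[str]:
    """Assumed entity definition: lowercase words and integer literals."""
    return re.findall(r"[a-z]+|\d+", text.lower())


def grounding_score(prompt: str, payloads: list[str], used_tool_result: bool) -> float:
    """Score tool-call payloads against the entities present in the prompt."""
    prompt_entities = set(entities(prompt))
    score = 0.0
    for payload in payloads:
        grounded = [e for e in entities(payload) if e in prompt_entities]
        if grounded:
            score += GROUNDING_REWARD * len(grounded)
        else:
            score += HALLUCINATION_PENALTY
    if used_tool_result:
        score += TOOL_RESULT_REWARD
    return score


print(grounding_score("Define the word apple.", ["apple"], True))
print(grounding_score("Define the word apple.", ["elephant"], False))
```

With this shape, a collapsed completion that always calls the dictionary tool on `elephant` is penalized on every prompt that does not mention that word, while a grounded call earns a positive score.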
| ### Session 6: Aggressive RL and Grounding Incentives | |
| - Drastically increased grounding rewards (from 0.5 to 5.0) and hallucination penalties (from -1.0 to -10.0) in `src/rewards.py`. | |
| - Introduced a specific -20.0 penalty for the `elephant` hallucination mode to force exploration. | |
| - Reduced the KL penalty coefficient (`beta`, from 0.01 to 0.0001) in `src/grpo_train.py` to allow the model to move away from the collapsed SFT state. | |
| - Increased GRPO `group_size` to 16 for better advantage estimation. | |
| - Implemented alternating exploration strategies in `src/sampler.py` (noise vs. temperature). | |
| - Observed the model starting to explore other words like `cat`, `banana`, and `jacket` in RL logs. | |
| - Added detailed logging of sample completions and unique completion counts in the training loop. | |
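For context on why `group_size` matters: GRPO scores each rollout relative to the other rollouts sampled for the same prompt, so a larger group gives a less noisy baseline. The snippet below is a minimal sketch of that standard group-relative normalization. It is not taken from `src/grpo_train.py`, and the function name and `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one prompt's group of rollouts.

    Each advantage is the reward minus the group mean, divided by the
    group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A fully collapsed group (every rollout earns the same reward) yields
# zero advantages, so it contributes no learning signal.
print(group_advantages([-20.0, -20.0, -20.0, -20.0]))

# One grounded rollout among collapsed ones stands out with a positive advantage.
print(group_advantages([-20.0, -20.0, -20.0, 5.0]))
```

This also illustrates why exploration is needed alongside large penalties: if every rollout in a group hits the same collapsed mode, the normalized advantages are all zero regardless of how large the penalty is.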
| ### Session 7: Curriculum Learning and Dense Rewards | |
| - Implemented a three-level curriculum strategy in `src/prompts.py`. | |
| - Updated `src/rewards.py` for dense grounding rewards. | |
| - Enhanced `src/grpo_train.py` with curriculum progression and persistence. | |
| ### Session 8: Targeted Grounding and Curriculum Reset | |
| - Identified persistent mode collapse (e.g., hallucinating `elephant`). | |
| - Created `src/generate_grounding_data.py` to generate 200 high-quality grounding examples. | |
| - Modified `src/sft_train.py` to support custom datasets and fine-tuning from existing RL checkpoints. | |
| - Performed a "nudge" SFT phase on the grounding dataset to re-orient the model towards prompt entities. | |
| - Reset GRPO curriculum to Level 0 to focus on simple grounding tasks. | |
| - Ran 300 iterations of GRPO. | |
| - Observed that while the model moved away from `elephant`, it partially collapsed into a new mode (`jacket`). | |
| - Verified that tool use syntax and reasoning traces remain intact. | |
| ### Session 9: Tightened Grounding Rewards and Continued RL | |
| - Identified new mode collapse into `jacket` and `guitar`. | |
| - Significantly tightened grounding rewards in `src/rewards.py`: | |
|   - Increased reward for grounded entities to +10.0. | |
|   - Added heavy penalties (-20.0) for individual hallucinated entities in tool payloads. | |
|   - Added specific penalties (-20.0) for `jacket` mode collapse. | |
| - Continued GRPO training for 300 iterations at Level 0. | |
| - Monitored training logs to ensure the model explores beyond collapsed modes. | |
| - Synced updated `rl_model.pt` and training logs to Hugging Face bucket. | |
| - Verified that core architecture and tool-calling mechanics remain stable via integration tests. | |
| ### Session 10: Grounding Refresh and Evaluation Tools | |
| - Introduced `src/compare_models.py` to quantitatively track grounding rate and average reward. | |
| - Enhanced `src/generate_grounding_data.py` with diverse reasoning templates and division operations. | |
| - Generated an expanded dataset of 5000 grounding-focused examples. | |
| - Performed a targeted SFT "refresh" on this data (`models/sft_grounding_v2.pt`) to stabilize tool payloads. | |
| - Resumed GRPO training from the refreshed base, achieving a grounding rate of 0.55 (up from 0.3). | |
| - Verified that math grounding is strong, while dictionary grounding remains a focus for future sessions. | |
| - Confirmed that the model remains within the 1M parameter limit (~951k parameters). | |
| ### Session 11: Quantitative Comparison and Continued RL | |
| - Refactored `src/compare_models.py` to support command-line arguments for easier checkpoint evaluation. | |
| - Evaluated all existing checkpoints (`sft_model.pt`, `rl_model.pt`, `rl_model_grounding.pt`) to establish the strongest baseline. | |
| - Continued GRPO RL training starting from the best `rl_model.pt` for 200 iterations at Level 0. | |
| - Achieved a stable grounding rate of 0.42 on Level 0 tasks. | |
| - Verified that math tool calls are frequently grounded, although dictionary lookups still exhibit word hallucinations in the payload. | |
| - Confirmed stability of the training loop and maintained model integrity via `src/integration_test.py`. | |
| ### Session 13: Grounding Bonus and RL Stabilization | |
| - Refined `src/rewards.py` with a Target Grounding Bonus (+10.0) for matching exact prompt entities in capability calls. | |
| - Expanded the mode-collapse penalty list to include more hallucinated words (`moss`, `bat`, `sout`, `guitar`, `cat`, `banana`, `tomss`, `seet`). | |
| - Ran 300 iterations of GRPO starting from `models/sft_grounding_v3.pt`. | |
| - Achieved a stable Level 0 grounding rate of 0.42, with math grounding being significantly stronger than dictionary grounding. | |
| - Verified that the math reward often reaches +30.0 due to the new grounding bonus. | |
| - Maintained model size under 1M parameters (~951k). | |
| ## Next Steps | |
| - Monitor the transition between curriculum levels to ensure the model maintains performance. | |
| - Continue RL training with these aggressive rewards until grounding (using prompt words/numbers) becomes the dominant strategy. | |
| - If the model still struggles with grounding, consider generating "near-miss" SFT data where only the grounding is changed. | |
| - Monitor training diversity to ensure the model doesn't just collapse into a different fixed word (like `cat`). | |
| - Fine-tune the reward balance as the model starts showing desired behaviors. | |
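One simple way to monitor diversity is to track how much of a batch is taken up by the single most common tool payload. The helper below is a hypothetical sketch (the name `payload_diversity` and its return format are not from the repository), assuming the payloads have already been extracted from sampled completions.

```python
from collections import Counter


def payload_diversity(payloads: list[str]) -> dict:
    """Summarize how concentrated a batch of tool payloads is.

    Returns the number of distinct payloads, the most frequent one,
    and the share of the batch it accounts for.
    """
    if not payloads:
        return {"unique": 0, "top": None, "top_share": 0.0}
    counts = Counter(payloads)
    top, top_count = counts.most_common(1)[0]
    return {"unique": len(counts), "top": top, "top_share": top_count / len(payloads)}


stats = payload_diversity(["jacket", "jacket", "jacket", "cat"])
print(stats)  # {'unique': 2, 'top': 'jacket', 'top_share': 0.75}
```

A `top_share` that stays high across prompts mentioning different words would suggest the model has settled into a new fixed payload rather than reading the prompt.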
| ## Usage | |
| To test the model: | |
| ```bash | |
| PYTHONPATH=. python3 src/sampler.py models/sft_model.pt | |
| ``` | |