TinyReasoner
A reasoning model under 1 million parameters capable of tool calling.
Architecture
- 2-layer LSTM
- Hidden size: 256
- Embedding size: 128
- Character-level tokenizer with special tokens for capabilities.
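As a rough check on the parameter budget, the sizes above can be plugged into a parameter count. This is a sketch under stated assumptions, not the project's own accounting: it assumes the PyTorch `nn.LSTM` layout (four gates, two bias vectors per layer) and an untied output projection, and the vocabulary size of 76 is a hypothetical value chosen for illustration.

```python
def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Parameters in one LSTM layer, assuming the PyTorch nn.LSTM layout:
    four gates, each with input weights, recurrent weights, and two bias vectors."""
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


def tiny_reasoner_params(vocab_size: int, embed: int = 128,
                         hidden: int = 256, layers: int = 2) -> int:
    """Total parameters: embedding table + stacked LSTM + output projection."""
    total = vocab_size * embed  # embedding table
    in_size = embed
    for _ in range(layers):
        total += lstm_layer_params(in_size, hidden)
        in_size = hidden  # later layers take the previous layer's hidden state
    total += hidden * vocab_size + vocab_size  # untied output layer + bias (assumed)
    return total


ASSUMED_VOCAB = 76  # hypothetical character vocabulary size, including special tokens
print(tiny_reasoner_params(ASSUMED_VOCAB))  # 950860
```

Under these assumptions the two LSTM layers account for 921,600 parameters, so almost the whole budget sits in the recurrent weights and the vocabulary contributes only 385 parameters per symbol.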
Progress
Session 1: Pretraining
- Implemented the model, tokenizer, and pretraining script.
- Pretrained on NLTK Gutenberg corpus (approx. 1M characters).
- Saved checkpoint: `models/pretrained.pt`.
Session 2: Supervised Fine-Tuning (SFT)
- Implemented `src/capabilities.py` with dictionary (NLTK WordNet) and math (SymPy) tools.
- Implemented `src/sampler.py` with capability call detection and result injection.
- Created `src/generate_sft_data.py` to generate 2000 synthetic reasoning traces.
- Created `src/sft_train.py` for instruction tuning using the SOAP optimizer.
- Fine-tuned the model on reasoning traces.
- Saved checkpoint: `models/sft_model.pt`.
- Verified that the model starts to use `[DEFINE]` and `[SYMPY]` tokens correctly.
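Capability call detection and result injection can be illustrated with a small post-processing sketch. Everything beyond the `[DEFINE]` and `[SYMPY]` token names is an assumption: the payload is taken to run to the end of the line, the `[RESULT]...[/RESULT]` wrapper is hypothetical, and stub tools stand in for WordNet and SymPy so the example runs without extra dependencies. A real sampler would more likely pause generation at the call and feed the result back into the model.

```python
import re
from typing import Callable, Dict

# Assumed call syntax: a capability token followed by its payload up to the end
# of the line, e.g. "[DEFINE]elephant". The real delimiter rules may differ.
CALL_PATTERN = re.compile(r"\[(DEFINE|SYMPY)\]([^\n\[]+)")


def inject_results(text: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """Append a hypothetical [RESULT]...[/RESULT] span after every detected call."""
    def run(match):
        name, payload = match.group(1), match.group(2).strip()
        try:
            result = tools[name](payload)
        except Exception as exc:  # a failing tool should not crash sampling
            result = f"error: {exc}"
        return f"{match.group(0)}[RESULT]{result}[/RESULT]"
    return CALL_PATTERN.sub(run, text)


# Stand-in tools so the sketch runs without NLTK or SymPy installed.
stub_tools = {
    "DEFINE": lambda word: {"cat": "a small domesticated feline"}.get(word, "unknown"),
    "SYMPY": lambda expr: str(sum(int(term) for term in expr.split("+"))),
}

print(inject_results("I need the meaning. [DEFINE]cat\nNow add. [SYMPY]2+3", stub_tools))
```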
Session 3: Reinforcement Learning (GRPO)
- Implemented GRPO training in `src/grpo_train.py`.
- Created multi-faceted reward functions in `src/rewards.py`.
- Expanded prompt generation in `src/prompts.py`.
- Updated `src/sampler.py` to support multi-rollout log-prob and mask tracking.
- Verified RL training loop and checkpointing.
- Saved initial RL checkpoint: `models/rl_model.pt`.
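The core of GRPO is that each rollout is scored relative to the other rollouts for the same prompt. The sketch below shows that normalization together with a simplified masked policy-gradient loss. It is not the code in `src/grpo_train.py`: it omits ratio clipping and the KL term, the function names are hypothetical, and the use of the mask to exclude tokens the model did not generate (such as injected tool results) is an assumption about what mask tracking is for.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward against its own group (same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_policy_loss(log_probs: Sequence[Sequence[float]],
                       advantages: Sequence[float],
                       masks: Sequence[Sequence[int]]) -> float:
    """Average of -advantage * log_prob over tokens whose mask is 1."""
    total, count = 0.0, 0
    for lp_seq, adv, mask_seq in zip(log_probs, advantages, masks):
        for lp, m in zip(lp_seq, mask_seq):
            total += -adv * lp * m
            count += m
    return total / max(count, 1)


rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is one reason a collapsed policy is hard to move with RL alone.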
Session 4: Extended RL and Task Complexity
- Enhanced `src/prompts.py` with multi-step math problems and word length comparisons.
- Refined `src/rewards.py` with length penalties and rewards for utilizing tool results.
- Improved `src/grpo_train.py` for continuous training and increased iterations to 500.
- Successfully completed 500 iterations of GRPO training.
- Verified that the model maintains reasoning traces and tool use even with increased task complexity.
- Synced all artifacts to Hugging Face bucket.
Session 5: Addressing Mode Collapse and Real Data Integration
- Identified "mode collapse" where the model defaulted to `[DEFINE]elephant` for most prompts.
- Updated `src/generate_sft_data.py` to use a much larger vocabulary from NLTK.
- Created `src/generate_real_sft_data.py` to incorporate real tool-calling traces (from Hermes dataset) with injected capability calls.
- Retrained SFT model on combined synthetic and real data (3000 examples) for 10 epochs.
- Strengthened RL rewards in `src/rewards.py`:
  - Increased grounding reward to 0.25 per entity.
  - Added a -0.5 penalty for tool calls that don't match prompt entities (hallucinations).
  - Increased reward for utilizing tool results to 0.3.
- Continued GRPO training with the new reward structure.
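The Session 5 reward structure can be sketched as a single scoring function. The three constants come from the list above. The rest is assumed for illustration: what counts as an "entity" (lowercase words and integers), how a tool result is judged to have been used (substring match in the answer), and all function and argument names. No stopword filtering is applied, so common prompt words would also count as grounded here.

```python
import re
from typing import List, Set

GROUNDING_REWARD = 0.25       # per grounded entity
HALLUCINATION_PENALTY = -0.5  # per tool call whose payload shares nothing with the prompt
RESULT_USE_REWARD = 0.3       # when a tool result is reused in the answer


def entities(text: str) -> Set[str]:
    """Assumed entity definition: lowercase words and integers."""
    return set(re.findall(r"[a-z]+|\d+", text.lower()))


def grounding_reward(prompt: str, payloads: List[str],
                     results: List[str], answer: str) -> float:
    prompt_entities = entities(prompt)
    reward = 0.0
    for payload in payloads:
        matched = entities(payload) & prompt_entities
        if matched:
            reward += GROUNDING_REWARD * len(matched)
        else:
            reward += HALLUCINATION_PENALTY
    # Assumed check: the answer quotes at least one non-empty tool result.
    if any(r and r in answer for r in results):
        reward += RESULT_USE_REWARD
    return reward


print(grounding_reward("What does cat mean?", ["cat"], ["a feline"], "It is a feline."))
print(grounding_reward("What does cat mean?", ["elephant"], [], ""))
```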
Session 6: Aggressive RL and Grounding Incentives
- Drastically increased grounding rewards (from 0.5 to 5.0) and hallucination penalties (from -1.0 to -10.0) in `src/rewards.py`.
- Introduced a specific -20.0 penalty for the `elephant` hallucination mode to force exploration.
- Reduced KL penalty (`beta` from 0.01 to 0.0001) in `grpo_train.py` to allow the model to move away from the collapsed SFT state.
- Increased GRPO `group_size` to 16 for better advantage estimation.
- Implemented alternating exploration strategies in `src/sampler.py` (noise vs. temperature).
- Observed the model starting to explore other words like `cat`, `banana`, and `jacket` in RL logs.
- Added detailed logging of sample completions and unique completion counts in the training loop.
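One way to read "alternating exploration strategies" is sketched below: even-numbered rollouts perturb the logits with Gaussian noise, odd-numbered rollouts sample at a raised temperature. This is only an illustration. The alternation rule, the noise scale, the temperature value, and the function names are all assumptions, and the actual logic in `src/sampler.py` may differ. The final line mirrors the idea of logging unique completion counts, applied here to single tokens.

```python
import math
import random
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: List[float], rollout_index: int, rng: random.Random,
                 hot_temperature: float = 1.5, noise_std: float = 1.0) -> int:
    """Even rollouts: Gaussian logit noise at temperature 1. Odd rollouts: raised temperature."""
    if rollout_index % 2 == 0:
        noisy = [x + rng.gauss(0.0, noise_std) for x in logits]
        probs = softmax(noisy)
    else:
        probs = softmax(logits, temperature=hot_temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [4.0, 1.0, 0.5, 0.2]  # a sharply peaked, "collapsed" distribution
group = [sample_token(logits, i, rng) for i in range(16)]  # group_size of 16
print(len(set(group)), "unique tokens in a group of", len(group))
```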
Session 7: Curriculum Learning and Dense Rewards
- Implemented a three-level curriculum strategy in `src/prompts.py`.
- Updated `src/rewards.py` for dense grounding rewards.
- Enhanced `src/grpo_train.py` with curriculum progression and persistence.
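Curriculum progression and persistence could look roughly like the sketch below: a level counter that advances when a rolling average of some score clears a threshold, with the state saved to disk so training can resume at the same level. The class name, thresholds, window size, file name, and choice of score are all hypothetical. Only the three-level structure comes from the notes above.

```python
import json
from collections import deque
from pathlib import Path


class Curriculum:
    """Three levels (0, 1, 2); promote when the rolling mean score clears a threshold."""

    def __init__(self, thresholds=(0.6, 0.6), window=20,
                 state_path="curriculum_state.json"):
        self.thresholds = thresholds  # hypothetical promotion thresholds for levels 0 and 1
        self.window = window
        self.state_path = Path(state_path)
        self.level = 0
        self.recent = deque(maxlen=window)

    def update(self, score: float) -> int:
        """Record one iteration's score (e.g. grounding rate) and return the current level."""
        self.recent.append(score)
        window_full = len(self.recent) == self.window
        if self.level < len(self.thresholds) and window_full:
            if sum(self.recent) / self.window >= self.thresholds[self.level]:
                self.level += 1
                self.recent.clear()  # start the next level with a fresh window
        return self.level

    def save(self) -> None:
        state = {"level": self.level, "recent": list(self.recent)}
        self.state_path.write_text(json.dumps(state))

    def load(self) -> None:
        if self.state_path.exists():
            state = json.loads(self.state_path.read_text())
            self.level = state["level"]
            self.recent = deque(state["recent"], maxlen=self.window)
```

Resetting the curriculum to Level 0, as later sessions do, would amount to setting `level = 0` and clearing the window before saving.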
Session 8: Targeted Grounding and Curriculum Reset
- Identified persistent mode collapse (e.g., hallucinating `elephant`).
- Created `src/generate_grounding_data.py` to generate 200 high-quality grounding examples.
- Modified `src/sft_train.py` to support custom datasets and fine-tuning from existing RL checkpoints.
- Performed a "nudge" SFT phase on the grounding dataset to re-orient the model towards prompt entities.
- Reset GRPO curriculum to Level 0 to focus on simple grounding tasks.
- Ran 300 iterations of GRPO.
- Observed that while the model moved away from `elephant`, it partially collapsed into a new mode (`jacket`).
- Verified that tool use syntax and reasoning traces remain intact.
Session 9: Tightened Grounding Rewards and Continued RL
- Identified new mode collapse into `jacket` and `guitar`.
- Significantly tightened grounding rewards in `src/rewards.py`:
  - Increased reward for grounded entities to +10.0.
  - Added heavy penalties (-20.0) for individual hallucinated entities in tool payloads.
  - Added specific penalties (-20.0) for `jacket` mode collapse.
- Continued GRPO training for 300 iterations at Level 0.
- Monitored training logs to ensure the model explores beyond collapsed modes.
- Synced updated `rl_model.pt` and training logs to Hugging Face bucket.
- Verified that core architecture and tool-calling mechanics remain stable via integration tests.
Session 10: Grounding Refresh and Evaluation Tools
- Introduced `src/compare_models.py` to quantitatively track grounding rate and average reward.
- Enhanced `src/generate_grounding_data.py` with diverse reasoning templates and division operations.
- Generated an expanded dataset of 5000 grounding-focused examples.
- Performed a targeted SFT "refresh" on this data (`models/sft_grounding_v2.pt`) to stabilize tool payloads.
- Resumed GRPO training from the refreshed base, achieving a grounding rate of 0.55 (up from 0.3).
- Verified that math grounding is strong, while dictionary grounding remains a focus for future sessions.
- Confirmed that the model remains within the 1M parameter limit (~951k parameters).
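The two metrics tracked in this session, grounding rate and average reward, can be computed along the following lines. This is a sketch rather than the contents of `src/compare_models.py`: the grounding criterion (every word or number in the tool payload also appears in the prompt) and the sample dictionary keys `prompt`, `payload`, and `reward` are assumptions.

```python
import re
from typing import Dict, List


def tokens(text: str) -> List[str]:
    """Lowercase words and integers."""
    return re.findall(r"[a-z]+|\d+", text.lower())


def is_grounded(prompt: str, payload: str) -> bool:
    """Assumed criterion: a non-empty payload whose tokens all occur in the prompt."""
    payload_tokens = tokens(payload)
    prompt_tokens = set(tokens(prompt))
    return bool(payload_tokens) and all(t in prompt_tokens for t in payload_tokens)


def summarize(samples: List[Dict]) -> Dict[str, float]:
    """Each sample is a dict with hypothetical keys: prompt, payload, reward."""
    n = len(samples)
    if n == 0:
        return {"grounding_rate": 0.0, "avg_reward": 0.0}
    grounded = sum(is_grounded(s["prompt"], s["payload"]) for s in samples)
    return {
        "grounding_rate": grounded / n,
        "avg_reward": sum(s["reward"] for s in samples) / n,
    }


samples = [
    {"prompt": "Define cat", "payload": "cat", "reward": 10.0},
    {"prompt": "Define dog", "payload": "jacket", "reward": -20.0},
]
print(summarize(samples))
```

Reporting the rate separately for math and dictionary prompts would make the gap noted above directly visible.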
Session 11: Quantitative Comparison and Continued RL
- Refactored `src/compare_models.py` to support command-line arguments for easier checkpoint evaluation.
- Evaluated all existing checkpoints (`sft_model.pt`, `rl_model.pt`, `rl_model_grounding.pt`) to establish the strongest baseline.
- Continued GRPO RL training starting from the best `rl_model.pt` for 200 iterations at Level 0.
- Achieved a stable grounding rate of 0.42 on Level 0 tasks.
- Verified that math tool calls are frequently grounded, although dictionary lookups still exhibit word hallucinations in the payload.
- Confirmed stability of the training loop and maintained model integrity via `src/integration_test.py`.
Session 13: Grounding Bonus and RL Stabilization
- Refined `src/rewards.py` with a Target Grounding Bonus (+10.0) for matching exact prompt entities in capability calls.
- Expanded the mode-collapse penalty list to include more hallucinated words ('moss', 'bat', 'sout', 'guitar', 'cat', 'banana', 'tomss', 'seet').
- Ran 300 iterations of GRPO starting from `models/sft_grounding_v3.pt`.
- Achieved a stable Level 0 grounding rate of 0.42, with math grounding being significantly stronger than dictionary grounding.
- Verified that the math reward often reaches +30.0 due to the new grounding bonus.
- Maintained model size under 1M parameters (~951k).
Next Steps
- Monitor the transition between curriculum levels to ensure the model maintains performance.
- Continue RL training with these aggressive rewards until grounding (using prompt words/numbers) becomes the dominant strategy.
- If the model still struggles with grounding, consider generating "near-miss" SFT data where only the grounding is changed.
- Monitor training diversity to ensure the model doesn't just collapse into a different fixed word (like `cat`).
- Fine-tune the reward balance as the model starts showing desired behaviors.
Usage
To test the model:
PYTHONPATH=. python3 src/sampler.py models/sft_model.pt