Buckets: dpe1/jules-tinyreasoner

Implemented src/capabilities.py with dictionary (NLTK WordNet) and math (SymPy) tools.
Implemented src/sampler.py with capability call detection and result injection.
Created src/generate_sft_data.py to generate 2000 synthetic reasoning traces.
Created src/sft_train.py for instruction tuning using the SOAP optimizer.
Fine-tuned the model on reasoning traces.
Saved checkpoint: models/sft_model.pt.
Verified that the model starts to use [DEFINE] and [SYMPY] tokens correctly.
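
The detect-and-inject step in src/sampler.py is not shown in this log, so the following is only a minimal sketch of the idea. It assumes a call looks like [DEFINE]word or [SYMPY]expression and runs to the end of the line; the [RESULT] marker, the inject_results name, and the two stand-in tools are hypothetical. The real tools are backed by NLTK WordNet and SymPy.

```python
import re
from typing import Callable, Dict

# Assumed call syntax: a bracketed tag followed by a payload that runs to the
# end of the line, e.g. "[DEFINE]elephant". The real sampler may differ.
CALL_PATTERN = re.compile(r"\[(DEFINE|SYMPY)\]([^\n\[]+)")


def inject_results(text: str, tools: Dict[str, Callable[[str], str]]) -> str:
    """Append a (hypothetical) [RESULT]...[/RESULT] span after each call."""

    def run(match: re.Match) -> str:
        tag, payload = match.group(1), match.group(2).strip()
        try:
            result = tools[tag](payload)
        except Exception as exc:  # a failing tool should not abort sampling
            result = f"error: {exc}"
        return f"{match.group(0)}[RESULT]{result}[/RESULT]"

    return CALL_PATTERN.sub(run, text)


# Stand-in tools so the sketch runs without NLTK or SymPy installed.
def fake_define(word: str) -> str:
    return {"cat": "a small domesticated feline"}.get(word, "unknown word")


def fake_math(expr: str) -> str:
    left, right = expr.split("+")  # only handles integer "a+b"
    return str(int(left) + int(right))


TOOLS = {"DEFINE": fake_define, "SYMPY": fake_math}

if __name__ == "__main__":
    print(inject_results("Think: [DEFINE]cat\nThen [SYMPY]2+3\n", TOOLS))
```

Passing the tools in as plain callables keeps the detection logic testable on its own, independent of which library answers the call.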

Enhanced src/prompts.py with multi-step math problems and word length comparisons.
Refined src/rewards.py with length penalties and rewards for utilizing tool results.
Improved src/grpo_train.py for continuous training and increased iterations to 500.
Successfully completed 500 iterations of GRPO training.
Verified that the model maintains reasoning traces and tool use even with increased task complexity.
Synced all artifacts to Hugging Face bucket.

Identified "mode collapse" where the model defaulted to [DEFINE]elephant for most prompts.
Updated src/generate_sft_data.py to use a much larger vocabulary from NLTK.
Created src/generate_real_sft_data.py to incorporate real tool-calling traces (from Hermes dataset) with injected capability calls.
Retrained SFT model on combined synthetic and real data (3000 examples) for 10 epochs.
Strengthened RL rewards in src/rewards.py:
- Increased grounding reward to 0.25 per entity.
- Added a -0.5 penalty for tool calls that don't match prompt entities (hallucinations).
- Increased reward for utilizing tool results to 0.3.
Continued GRPO training with the new reward structure.
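
As an illustration of how the three reward terms above could combine, here is a rough sketch using the constants from this entry. It is not the contents of src/rewards.py. The entity extraction (lowercased words and digit runs), the call syntax, the shaped_reward name, and the separate final_answer argument are all assumptions.

```python
import re
from typing import List, Set

GROUNDING_REWARD = 0.25       # per grounded entity (value from the log)
HALLUCINATION_PENALTY = -0.5  # per call with no prompt entity (value from the log)
RESULT_USE_REWARD = 0.3       # for using a tool result (value from the log)

# Assumed call syntax and a deliberately crude notion of "entity".
CALL_PATTERN = re.compile(r"\[(?:DEFINE|SYMPY)\]([^\n\[]+)")
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+")


def extract_entities(text: str) -> Set[str]:
    """Lowercased words and digit runs; a stand-in for real entity matching."""
    return {tok.lower() for tok in TOKEN_PATTERN.findall(text)}


def shaped_reward(prompt: str, completion: str,
                  tool_results: List[str], final_answer: str) -> float:
    prompt_entities = extract_entities(prompt)
    reward = 0.0
    for payload in CALL_PATTERN.findall(completion):
        grounded = extract_entities(payload) & prompt_entities
        if grounded:
            reward += GROUNDING_REWARD * len(grounded)
        else:
            reward += HALLUCINATION_PENALTY  # payload shares nothing with the prompt
    # Hypothetical check: the answer counts as using a result if it quotes one.
    if any(result and result in final_answer for result in tool_results):
        reward += RESULT_USE_REWARD
    return reward


if __name__ == "__main__":
    print(shaped_reward("What is 12 + 30?", "[SYMPY]12+30", ["42"], "The answer is 42."))
    print(shaped_reward("Define cat.", "[DEFINE]elephant", ["a large mammal"], "An animal."))
```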

Drastically increased grounding rewards (from 0.5 to 5.0) and hallucination penalties (from -1.0 to -10.0) in src/rewards.py.
Introduced a specific -20.0 penalty for the elephant hallucination mode to force exploration.
Reduced KL penalty (beta from 0.01 to 0.0001) in grpo_train.py to allow the model to move away from the collapsed SFT state.
Increased GRPO group_size to 16 for better advantage estimation.
Implemented alternating exploration strategies in src/sampler.py (noise vs. temperature).
Observed the model starting to explore other words like cat, banana, and jacket in RL logs.
Added detailed logging of sample completions and unique completion counts in the training loop.
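
For context on why group size and completion diversity matter, the sketch below shows the usual group-relative advantage used in GRPO-style training, with each reward normalized by the mean and standard deviation of its group, alongside a unique-completion counter. The function names are hypothetical, and the choice of population standard deviation is an assumption; grpo_train.py may compute this differently.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


def unique_completion_count(completions: List[str]) -> int:
    """Number of distinct completions in a group, ignoring outer whitespace."""
    return len({c.strip() for c in completions})


if __name__ == "__main__":
    collapsed = ["[DEFINE]elephant"] * 16
    print(unique_completion_count(collapsed))   # one distinct completion
    print(group_advantages([-20.0] * 16)[:3])   # all zeros: no learning signal
```

When every sample in a group is identical, all rewards are equal and every advantage is zero, so even a very large penalty produces no gradient. This is one reason the unique-completion count is a useful number to log next to the reward.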

Identified persistent mode collapse (e.g., hallucinating elephant).
Created src/generate_grounding_data.py to generate 200 high-quality grounding examples.
Modified src/sft_train.py to support custom datasets and fine-tuning from existing RL checkpoints.
Performed a "nudge" SFT phase on the grounding dataset to re-orient the model towards prompt entities.
Reset GRPO curriculum to Level 0 to focus on simple grounding tasks.
Ran 300 iterations of GRPO.
Observed that while the model moved away from elephant, it partially collapsed into a new mode (jacket).
Verified that tool use syntax and reasoning traces remain intact.

Identified new mode collapse into jacket and guitar.
Significantly tightened grounding rewards in src/rewards.py:
- Increased reward for grounded entities to +10.0.
- Added heavy penalties (-20.0) for individual hallucinated entities in tool payloads.
- Added specific penalties (-20.0) for jacket mode collapse.
Continued GRPO training for 300 iterations at Level 0.
Monitored training logs to ensure the model explores beyond collapsed modes.
Synced updated rl_model.pt and training logs to Hugging Face bucket.
Verified that core architecture and tool-calling mechanics remain stable via integration tests.

Introduced src/compare_models.py to quantitatively track grounding rate and average reward.
Enhanced src/generate_grounding_data.py with diverse reasoning templates and division operations.
Generated an expanded dataset of 5000 grounding-focused examples.
Performed a targeted SFT "refresh" on this data (models/sft_grounding_v2.pt) to stabilize tool payloads.
Resumed GRPO training from the refreshed base, achieving a grounding rate of 0.55 (up from 0.3).
Verified that math grounding is strong; dictionary grounding is still weak and remains a focus for future sessions.
Confirmed that the model remains within the 1M parameter limit (~951k parameters).
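
The log does not define the grounding rate precisely. One plausible reading, sketched below, counts a completion as grounded when it makes at least one tool call and every call payload shares a word or number with the prompt. This definition, the call syntax, and the function names are assumptions, not the metric implemented in src/compare_models.py.

```python
import re
from typing import List, Set, Tuple

# Assumed call syntax and tokenization, for illustration only.
CALL_PATTERN = re.compile(r"\[(?:DEFINE|SYMPY)\]([^\n\[]+)")
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+")


def tokens(text: str) -> Set[str]:
    return {tok.lower() for tok in TOKEN_PATTERN.findall(text)}


def is_grounded(prompt: str, completion: str) -> bool:
    """True if there is at least one call and every payload overlaps the prompt."""
    payloads = CALL_PATTERN.findall(completion)
    if not payloads:
        return False
    prompt_tokens = tokens(prompt)
    return all(tokens(payload) & prompt_tokens for payload in payloads)


def grounding_rate(samples: List[Tuple[str, str]]) -> float:
    """Fraction of (prompt, completion) pairs that count as grounded."""
    if not samples:
        return 0.0
    return sum(is_grounded(p, c) for p, c in samples) / len(samples)


if __name__ == "__main__":
    demo = [
        ("Define cat", "[DEFINE]cat"),
        ("Define dog", "[DEFINE]elephant"),
        ("What is 2 + 3?", "[SYMPY]2+3"),
    ]
    print(grounding_rate(demo))
```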

Refactored src/compare_models.py to support command-line arguments for easier checkpoint evaluation.
Evaluated all existing checkpoints (sft_model.pt, rl_model.pt, rl_model_grounding.pt) to establish the strongest baseline.
Continued GRPO RL training starting from the best rl_model.pt for 200 iterations at Level 0.
Achieved a stable grounding rate of 0.42 on Level 0 tasks.
Verified that math tool calls are frequently grounded, although dictionary lookups still exhibit word hallucinations in the payload.
Confirmed stability of the training loop and maintained model integrity via src/integration_test.py.

Refined src/rewards.py with a Target Grounding Bonus (+10.0) for matching exact prompt entities in capability calls.
Expanded the mode-collapse penalty list to include more hallucinated words ('moss', 'bat', 'sout', 'guitar', 'cat', 'banana', 'tomss', 'seet').
Ran 300 iterations of GRPO starting from models/sft_grounding_v3.pt.
Achieved a stable Level 0 grounding rate of 0.42, with math grounding being significantly stronger than dictionary grounding.
Verified that the math reward often reaches +30.0 due to the new grounding bonus.
Maintained model size under 1M parameters (~951k).

Next Steps

Monitor the transition between curriculum levels to ensure the model maintains performance.
Continue RL training with these aggressive rewards until grounding (using prompt words/numbers) becomes the dominant strategy.
If the model still struggles with grounding, consider generating "near-miss" SFT data where only the grounding is changed.
Monitor training diversity to ensure the model doesn't just collapse into a different fixed word (like cat).
Fine-tune the reward balance as the model starts showing desired behaviors.

To test the model:

PYTHONPATH=. python3 src/sampler.py models/sft_model.pt
