---
license: llama3.2
datasets:
- open-thoughts/OpenThoughts-114k
- FreedomIntelligence/medical-o1-verifiable-problem
- open-r1/OpenR1-Math-220k
base_model:
- meta-llama/Llama-3.2-3B-Instruct
---

# mkurman/Llama-3.2-MedIT-3B-R1
**Important Notice:**
This model is provided strictly for research purposes and is not intended for production use. It should not be considered a validated source of medical or professional advice. Use only in controlled experimental settings.

---
## Model Overview

mkurman/Llama-3.2-MedIT-3B-R1 is a fine-tuned variant of meta-llama/Llama-3.2-3B-Instruct, adapted for research into natural language understanding and reasoning. It was trained with a multi-stage approach that combines Blurred Thoughts Supervised Fine-Tuning (BT-SFT) and Group Relative Policy Optimization (GRPO) guided by an LLM evaluator, with the aim of improving performance on specialized reasoning tasks.
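For intuition, BT-SFT can be pictured as randomly hiding part of the reasoning ("thought") tokens from the fine-tuning loss, so the model is not forced to reproduce a teacher trace token for token. The sketch below is an illustrative toy, not the actual implementation (see the linked BT-SFT repository for that); `blur_thought_labels`, the span indices, and the 0.3 masking probability are all assumptions made for this example:

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips

def blur_thought_labels(labels, thought_span, blur_prob=0.3, seed=0):
    """Return a copy of `labels` with a random fraction of the tokens
    inside `thought_span` excluded from the fine-tuning loss."""
    rng = random.Random(seed)
    start, end = thought_span
    blurred = list(labels)
    for i in range(start, end):
        if rng.random() < blur_prob:
            blurred[i] = IGNORE_INDEX  # this thought token no longer drives the loss
    return blurred

# Toy example: a 10-token target where positions 2..7 hold the reasoning trace.
labels = list(range(10))
blurred = blur_thought_labels(labels, thought_span=(2, 8))
```

Tokens outside the thought span (prompt and final answer) are always kept, so only the intermediate reasoning is "blurred".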

---
## Training Procedure

The model was developed through the following sequential steps:
1. **Blurred Thoughts Supervised Fine-Tuning (BT-SFT):**
   - **Base Model:** meta-llama/Llama-3.2-3B-Instruct
   - **Parameters:** 2,000 steps, batch size 2, gradient accumulation 16, learning rate 1e-6
   - **Dataset:** open-thoughts/OpenThoughts-114k
   - **Details:** For more on BT-SFT, see the [detailed post](https://huggingface.co/posts/mkurman/496852395740108) and the [GitHub repository](https://github.com/mkurman/blurred-thoughts-SFT).
2. **Group Relative Policy Optimization (GRPO), Stage 1:**
   - **Dataset:** FreedomIntelligence/medical-o1-verifiable-problem
   - **Training:** 200 steps
   - **LLM Evaluator:** mkurman/Qwen2.5-14B-DeepSeek-R1-1M
   - **Details:** For more on GRPO with LLM evaluators, see the [GitHub repository](https://github.com/mkurman/grpo-llm-evaluator).

3. **Group Relative Policy Optimization (GRPO), Stage 2:**
   - **Dataset:** open-r1/OpenR1-Math-220k
   - **Training:** 200 steps
   - **LLM Evaluator:** deepseek/deepseek-r1-distill-qwen-14b (via OpenRouter)
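The "group relative" part of GRPO refers to scoring each sampled completion against the other completions drawn for the same prompt, rather than against a learned value function. A minimal sketch of that normalization step, assuming the standard z-score form from the GRPO literature (the function name and example scores are illustrative; in the stages above, the LLM evaluator supplies the raw rewards):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward within its own sampled group:
    A_i = (r_i - mean(group)) / (std(group) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: scores an external LLM evaluator might assign to four
# completions sampled for the same prompt.
scores = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(scores)
```

Completions scoring above their group's mean receive positive advantages and are reinforced; below-mean completions are penalized, so no separate critic model is needed.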

---
## Datasets Utilized

- **open-thoughts/OpenThoughts-114k:**
  A synthetic reasoning dataset of roughly 114k examples with step-by-step reasoning traces, used during the initial supervised fine-tuning.

- **FreedomIntelligence/medical-o1-verifiable-problem:**
  A curated set of verifiable medical problems, used to strengthen the model's medical reasoning.

- **open-r1/OpenR1-Math-220k:**
  A large mathematics dataset designed to improve the model's reasoning and problem-solving skills in mathematical contexts.
|
| | --- |
| |
|
## Intended Use

- **Research and Experimental Applications:**
  This model is intended for academic research and exploratory projects, such as investigating advanced fine-tuning methods and evaluating performance on task-oriented conversational scenarios.

- **Controlled Environments:**
  Deploy this model only within controlled experimental frameworks where rigorous evaluation and proper safety guardrails are in place.
|
| | --- |
| |
|
## Limitations and Ethical Considerations

- **Not for Clinical or Production Use:**
  The model's outputs have not been validated for clinical accuracy or professional decision-making. It must not be used as a primary source for medical, legal, or safety-critical information.

- **Safety and Guardrails:**
  Users must implement appropriate safety measures and validation protocols. The model may produce biased or inaccurate results and should be used with caution.

- **Experimental Nature:**
  Given its research-oriented design, the model's performance can vary widely with input and context. Perform thorough testing and validation before drawing conclusions from its outputs.
---
## License

This model is released under the Llama 3.2 Community License. Users must adhere to the terms of that license when utilizing this model.
---
## Final Notice

All outputs from **mkurman/Llama-3.2-MedIT-3B-R1** are intended solely for research purposes. This model is not a comprehensive knowledge source and must not be used as a substitute for professional advice or decision-making. Ensure that all necessary guardrails and safety protocols are in place when conducting experiments with this model.