---
license: llama3.2
datasets:
- open-thoughts/OpenThoughts-114k
- FreedomIntelligence/medical-o1-verifiable-problem
- open-r1/OpenR1-Math-220k
base_model:
- meta-llama/Llama-3.2-3B-Instruct
---

# mkurman/Llama-3.2-MedIT-3B-R1
**Important Notice:**
This model is provided strictly for research purposes and is not intended for production use. It should not be considered a validated source of medical or professional advice. Use only in controlled experimental settings.

---
## Model Overview

mkurman/Llama-3.2-MedIT-3B-R1 is a fine-tuned variant of meta-llama/Llama-3.2-3B-Instruct, adapted for research into natural language understanding and reasoning. It was trained with a multi-stage approach that combines Blurred Thoughts Supervised Fine-Tuning (BT-SFT) and Group Relative Policy Optimization (GRPO) guided by an LLM evaluator, with the aim of improving performance on specialized reasoning tasks.
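For intuition, BT-SFT can be pictured as randomly hiding part of the reasoning ("thought") tokens from the fine-tuning loss, so the model is not forced to reproduce a teacher trace token for token. The sketch below is an illustrative toy, not the actual implementation (see the linked BT-SFT repository for that); `blur_thought_labels`, the span indices, and the 0.3 masking probability are all assumptions made for this example:

```python
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips

def blur_thought_labels(labels, thought_span, blur_prob=0.3, seed=0):
    """Return a copy of `labels` with a random fraction of the tokens
    inside `thought_span` excluded from the fine-tuning loss."""
    rng = random.Random(seed)
    start, end = thought_span
    blurred = list(labels)
    for i in range(start, end):
        if rng.random() < blur_prob:
            blurred[i] = IGNORE_INDEX  # this thought token no longer drives the loss
    return blurred

# Toy example: a 10-token target where positions 2..7 hold the reasoning trace.
labels = list(range(10))
blurred = blur_thought_labels(labels, thought_span=(2, 8))
```

Tokens outside the thought span (prompt and final answer) are always kept, so only the intermediate reasoning is "blurred".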

---
## Training Procedure

The model was developed through the following sequential steps:
1. **Blurred Thoughts Supervised Fine-Tuning (BT-SFT):**
   - **Base Model:** meta-llama/Llama-3.2-3B-Instruct
   - **Parameters:** 2,000 steps, batch size 2, gradient accumulation 16, learning rate 1e-6
   - **Dataset:** open-thoughts/OpenThoughts-114k
   - **Details:** For more on BT-SFT, see the [detailed post](https://huggingface.co/posts/mkurman/496852395740108) and the [GitHub repository](https://github.com/mkurman/blurred-thoughts-SFT).
2. **Group Relative Policy Optimization (GRPO), Stage 1:**
   - **Dataset:** FreedomIntelligence/medical-o1-verifiable-problem
   - **Training:** 200 steps
   - **LLM Evaluator:** mkurman/Qwen2.5-14B-DeepSeek-R1-1M
   - **Details:** For more on GRPO with LLM evaluators, see the [GitHub repository](https://github.com/mkurman/grpo-llm-evaluator).

3. **Group Relative Policy Optimization (GRPO), Stage 2:**
   - **Dataset:** open-r1/OpenR1-Math-220k
   - **Training:** 200 steps
   - **LLM Evaluator:** deepseek/deepseek-r1-distill-qwen-14b (via OpenRouter)
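The "group relative" part of GRPO refers to scoring each sampled completion against the other completions drawn for the same prompt, rather than against a learned value function. A minimal sketch of that normalization step, assuming the standard z-score form from the GRPO literature (the function name and example scores are illustrative; in the stages above, the LLM evaluator supplies the raw rewards):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward within its own sampled group:
    A_i = (r_i - mean(group)) / (std(group) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: scores an external LLM evaluator might assign to four
# completions sampled for the same prompt.
scores = [0.2, 0.4, 0.6, 0.8]
advantages = group_relative_advantages(scores)
```

Completions scoring above their group's mean receive positive advantages and are reinforced; below-mean completions are penalized, so no separate critic model is needed.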

---
## Datasets Utilized

- **open-thoughts/OpenThoughts-114k:**
  A synthetic reasoning dataset of roughly 114k examples with step-by-step reasoning traces, used during the initial supervised fine-tuning.

- **FreedomIntelligence/medical-o1-verifiable-problem:**
  A curated set of verifiable medical problems, used to strengthen the model's medical reasoning.

- **open-r1/OpenR1-Math-220k:**
  A large mathematics dataset designed to improve the model's reasoning and problem-solving skills in mathematical contexts.
|
| | --- |
| |
|
## Intended Use

- **Research and Experimental Applications:**
  This model is intended for academic research and exploratory projects, such as investigating advanced fine-tuning methods and evaluating performance on task-oriented conversational scenarios.

- **Controlled Environments:**
  Deploy this model only within controlled experimental frameworks where rigorous evaluation and proper safety guardrails are in place.
|
| | --- |
| |
|
## Limitations and Ethical Considerations

- **Not for Clinical or Production Use:**
  The model's outputs have not been validated for clinical accuracy or professional decision-making. It must not be used as a primary source for medical, legal, or safety-critical information.

- **Safety and Guardrails:**
  Users must implement appropriate safety measures and validation protocols. The model may produce biased or inaccurate results and should be used with caution.

- **Experimental Nature:**
  Given its research-oriented design, the model's performance can vary widely with input and context. Perform thorough testing and validation before drawing conclusions from its outputs.
---
## License

This model is released under the Llama 3.2 Community License. Users must adhere to the terms of that license when utilizing this model.
---
## Final Notice

All outputs from **mkurman/Llama-3.2-MedIT-3B-R1** are intended solely for research purposes. This model is not a comprehensive knowledge source and must not be used as a substitute for professional advice or decision-making. Ensure that all necessary guardrails and safety protocols are in place when conducting experiments with this model.