|
|
--- |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-3B-Instruct |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
--- |
|
|
# AbleCredit Reasoner R0 Qwen 2.5 3B Instruct |
|
|
|
|
|
## Introduction |
|
|
|
|
|
This model was trained with DeepSeek-R1-style (GRPO) reinforcement learning, using Qwen 2.5 3B Instruct as the base model.
|
|
It is primarily intended for research on applying small LLMs trained with GRPO/RL to domains such as finance and credit underwriting.
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Fine Tuned by:** AbleCredit (LightBees Technologies Private Limited, Bengaluru, India) |
|
|
- **License:** We have retained the original Qwen research license. Note that this license does not permit commercial use.
|
|
- **Finetuned from model:** Qwen/Qwen2.5-3B-Instruct |
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
Use with a standard Hugging Face Transformers setup:
|
|
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AbleCredit/AbleCredit-R0-Qwen-2.5-3B-Instruct"  # or local path to model

system_prompt = {
    "role": "system",
    "content": (
        "You are a helpful assistant. User asks a question the assistant answers it.\n"
        "The assistant first thinks about reasoning process in mind and then provides the user with the answer."
    ),
}

# Pre-fill the start of the assistant turn so generation continues inside <think>.
suffix_prompt = {
    "role": "assistant",
    "content": "Let me solve this step by step.\n<think>",
}

prompt_msgs = [
    system_prompt,
    {"role": "user", "content": "What is 15 times 3?"},
    suffix_prompt,
]

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = tokenizer.apply_chat_template(
    prompt_msgs,
    tokenize=False,
    continue_final_message=True,
    add_generation_prompt=False,
)

# Tokenize the prompt and move it to the model's device.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

print("\nGenerating response...\n")
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.5,
    min_p=0.01,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("\nResponse:\n", response)
```
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on open-source logical-reasoning datasets and a proprietary finance dataset created by AbleCredit.com.
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
Trained with DeepSeek-R1-style reinforcement learning, using GRPO with rule-based rewards.
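
The exact reward functions are not published here, but a minimal sketch of rule-based rewards might score a completion on format (presence of a `<think>…</think>` block) and on answer correctness. The tag pattern, helper names, and weights below are illustrative assumptions, not the actual training code:

```python
import re

def format_reward(completion: str) -> float:
    """Hypothetical rule: 1.0 if the completion contains a
    <think>...</think> block followed by a final answer, else 0.0."""
    return 1.0 if re.search(r"<think>.*?</think>\s*\S", completion, re.DOTALL) else 0.0

def answer_reward(completion: str, gold: str) -> float:
    """Hypothetical rule: 1.0 if the text after </think> contains the gold answer."""
    tail = completion.split("</think>")[-1]
    return 1.0 if gold in tail else 0.0

def total_reward(completion: str, gold: str) -> float:
    # Illustrative weighted sum; real weights would be tuned during training.
    return 0.2 * format_reward(completion) + 0.8 * answer_reward(completion, gold)
```

In GRPO, rewards like these are computed for each completion in a sampled group, and the group-normalized advantage replaces a learned value function.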
|
|
|
|
|
## Evaluation |
|
|
|
|
|
- The model achieves a ~67% score on the GSM8K benchmark in a **zero-shot** setting (see the benchmarking script for details).
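
The benchmarking script itself is not reproduced here; as a sketch of how a GSM8K response from this model might be scored, one plausible rule (an assumption, not the actual script) is to take the text after the closing `</think>` tag and compare its last number to the gold answer:

```python
import re

def extract_final_answer(response: str):
    """Pull the last number from the text after </think>.
    (A plausible extraction rule; the real script may differ.)"""
    tail = response.split("</think>")[-1]
    numbers = re.findall(r"-?\d+(?:\.\d+)?", tail.replace(",", ""))
    return numbers[-1] if numbers else None

def is_correct(response: str, gold: str) -> bool:
    pred = extract_final_answer(response)
    return pred is not None and float(pred) == float(gold)
```

For example, `is_correct("<think>15 * 3 = 45</think> The answer is 45.", "45")` returns `True`.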
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
[contact Harshad Saykhedkar via LinkedIn](https://www.linkedin.com/in/harshadss/) |