---
library_name: peft
license: apache-2.0
---
# MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs

🔗 Paper link: [Arxiv preprint](https://arxiv.org/abs/2507.02851)

🔗 Link to the trained models: [Hugging Face collection](https://huggingface.co/collections/purbeshmitra/motif-paper-models-686a2f36407bb88f750eef75)

The [INFTYTHINK architecture](https://arxiv.org/abs/2503.06692v1), shown below, enables multi-round thinking, extending LLM reasoning beyond the model's context size.
<p align="center">
<img src="assets/multiround.png" alt="INFTYTHINK multi-round inference" width="750">
</p>
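
To make the loop concrete, here is a minimal sketch of INFTYTHINK-style multi-round inference. The prompt template and the `generate` callback are illustrative placeholders, not the exact MOTIF implementation; the point is that each round fits in a bounded context and only a summary is carried forward.

```python
# Minimal sketch of INFTYTHINK-style multi-round inference. `generate` stands
# in for any LLM call; the prompt wording is an assumption, not MOTIF's exact template.

def multi_round_reasoning(question: str, generate, max_rounds: int = 4) -> str:
    """Reason in bounded rounds, carrying a condensed summary across rounds."""
    summary = ""
    for _ in range(max_rounds):
        prompt = (
            f"Question: {question}\n"
            f"Summary of previous reasoning: {summary or 'None'}\n"
            "Continue reasoning. If you reach the answer, state it after 'Answer:'. "
            "Otherwise, end with an updated summary after 'Summary:'."
        )
        output = generate(prompt)  # one bounded-context round of generation
        if "Answer:" in output:
            return output.split("Answer:")[-1].strip()
        # carry only the summary forward, keeping every round within the context size
        summary = output.split("Summary:")[-1].strip()
    return summary  # fall back to the last summary if no answer emerged
```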
In this work, we propose a GRPO-based training method for such a system: the accuracy reward is computed by rolling out full multi-round trajectories, and the reward is applied to the first-round inference outputs. This is depicted below:
<p align="center">
<img src="assets/multiround_grpo.png" alt="GRPO training over multi-round rollouts" width="750">
</p>
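
The group-relative part of GRPO reduces to a simple normalization over a group of rollouts from the same question. The sketch below is an illustration (function names are hypothetical, and the exact normalization used in MOTIF may differ): each reward scores the final answer of one full multi-round trajectory, and the resulting advantage is credited to that trajectory's first-round generation.

```python
# Hypothetical sketch of GRPO-style group-relative advantages. Each reward
# scores the final answer of one full multi-round rollout; the advantage is
# then applied to that rollout's first-round tokens during the policy update.

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """A_i = (r_i - mean(r)) / std(r), computed within one group of rollouts."""
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # a degenerate group (all rewards equal) gives no learning signal
    return [(r - mu) / std if std > 0 else 0.0 for r in rewards]

# Example: 4 rollouts of one question; rollouts 2 and 4 answered correctly.
print(group_relative_advantages([0.0, 1.0, 0.0, 1.0]))
# -> [-1.0, 1.0, -1.0, 1.0]
```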
## Usage
```python
from peft import PeftModel
```
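
The snippet above is only the import; a fuller loading sketch follows. The base-model and adapter identifiers are placeholders, not confirmed names (check this model card's metadata for the actual base model); the calls themselves are standard `transformers` + `peft` usage.

```python
# Sketch of loading the PEFT adapter on top of its base model. The two IDs
# below are placeholders; substitute the values from this repository and its
# model card.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "BASE_MODEL_ID"   # placeholder: base LLM the adapter was trained on
adapter_id = "ADAPTER_REPO_ID"    # placeholder: this adapter repository's ID

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base_model = AutoModelForCausalLM.from_pretrained(base_model_id)

# attach the trained PEFT adapter weights to the frozen base model
model = PeftModel.from_pretrained(base_model, adapter_id)

prompt = "Question: What is 17 * 24?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```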