Upload README.md with huggingface_hub
README.md CHANGED

```diff
@@ -2,6 +2,8 @@
 language: en
 license: apache-2.0
 base_model: Nanbeige/Nanbeige4.1-3B
+datasets:
+- TurkishCodeMan/Nanbeige4.1-3B-Gmail-Tool-Use-Datasets
 tags:
 - tool-use
 - gmail
@@ -16,19 +18,21 @@ pipeline_tag: text-generation
 Fine-tuned version of [Nanbeige/Nanbeige4.1-3B](https://huggingface.co/Nanbeige/Nanbeige4.1-3B)
 for Gmail tool-calling tasks using a two-stage training pipeline.
 
+**Training datasets:** [TurkishCodeMan/Nanbeige4.1-3B-Gmail-Tool-Use-Datasets](https://huggingface.co/datasets/TurkishCodeMan/Nanbeige4.1-3B-Gmail-Tool-Use-Datasets)
+
 ## Training Pipeline
 
 ### Stage 1 — Supervised Fine-Tuning (SFT)
-- **Dataset:**
+- **Dataset:** 740 multi-turn Gmail agent traces (`sft/traces_chatml_clean.jsonl`)
 - **Format:** ChatML with tool_calls (OpenAI function-calling schema)
 - **Method:** LoRA r=16, α=32, 7 target modules
 - **Result:** loss 0.8464 → 0.1888 · PPL 2.33 → 1.21
 
 ### Stage 2 — Direct Preference Optimization (DPO)
-- **Dataset:** 3223 preference pairs
-  - `wrong_tool` — incorrect tool selected
-  - `missing_args` — required arguments omitted
-  - `bad_answer` — poor final response
+- **Dataset:** 3223 preference pairs (`dpo/dpo_dataset.jsonl`) — 3 rejection strategies:
+  - `wrong_tool` — incorrect tool selected (~34%)
+  - `missing_args` — required arguments omitted (~32%)
+  - `bad_answer` — poor final response (~34%)
 - **Method:** DPO β=0.1, sigmoid loss, LoRA r=16, `ref_model=None` (PEFT implicit ref)
 - **Result:** val_loss=0.000765 · reward accuracy=100% · normalized margin=+0.52
 
```
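For reference, the dataset added to the front matter in this commit can be pulled straight from the Hub. A minimal sketch with the `datasets` library; the `data_files` paths are taken from the card and assumed to match the repo layout:

```python
from datasets import load_dataset

REPO = "TurkishCodeMan/Nanbeige4.1-3B-Gmail-Tool-Use-Datasets"

# Load each stage's file separately; the JSONL paths come from the card
# and are assumed to reflect the actual repo layout.
sft_traces = load_dataset(REPO, data_files="sft/traces_chatml_clean.jsonl", split="train")
dpo_pairs = load_dataset(REPO, data_files="dpo/dpo_dataset.jsonl", split="train")

print(len(sft_traces))  # card reports 740 multi-turn traces
print(len(dpo_pairs))   # card reports 3223 preference pairs
```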
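The Stage 1 traces use ChatML messages carrying OpenAI-style `tool_calls`. The record below is a hypothetical illustration of that schema only; the actual Gmail tool names and argument fields are defined by the dataset, not shown here:

```python
# Hypothetical SFT record in the ChatML + tool_calls format the card
# describes. "search_emails" and its arguments are invented for
# illustration; the real traces define their own Gmail tool schema.
example = {
    "messages": [
        {"role": "system", "content": "You are a Gmail assistant with access to tools."},
        {"role": "user", "content": "Do I have unread mail from Alice?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": "call_1",
                    "type": "function",
                    "function": {
                        "name": "search_emails",
                        "arguments": '{"query": "from:alice is:unread"}',
                    },
                }
            ],
        },
        {
            "role": "tool",
            "tool_call_id": "call_1",
            "content": '[{"subject": "Q3 report", "from": "alice@example.com"}]',
        },
        {"role": "assistant", "content": 'Yes, one unread email from Alice: "Q3 report".'},
    ]
}
```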
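Stage 1's adapter settings (r=16, α=32, 7 target modules) map directly onto a PEFT `LoraConfig`. The module list below is an assumption (the seven attention and MLP projections of a Llama-style decoder block), since the card does not name them:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Nanbeige/Nanbeige4.1-3B")

# r and alpha come from the card; the 7 target modules are an assumption
# (the standard attention + MLP projections of a Llama-style block).
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```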
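Stage 2's settings (β=0.1, sigmoid loss, `ref_model=None`) correspond to a TRL `DPOTrainer` run. A sketch under those settings; the output path and the adapter-loading step are assumptions, and `dpo_pairs` is the preference dataset loaded in the first sketch above:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

tokenizer = AutoTokenizer.from_pretrained("Nanbeige/Nanbeige4.1-3B")
base = AutoModelForCausalLM.from_pretrained("Nanbeige/Nanbeige4.1-3B")

# Attach the Stage 1 adapter; "sft-adapter" is a hypothetical local path.
model = PeftModel.from_pretrained(base, "sft-adapter", is_trainable=True)

args = DPOConfig(
    output_dir="dpo-out",  # hypothetical
    beta=0.1,              # from the card
    loss_type="sigmoid",   # from the card
)

# With a PEFT model attached, ref_model=None makes TRL compute reference
# log-probs with the adapter disabled (the implicit reference the card notes).
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=args,
    train_dataset=dpo_pairs,  # prompt/chosen/rejected pairs
    processing_class=tokenizer,
)
trainer.train()
```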
|