---
library_name: transformers
base_model:
- Qwen/Qwen3-8B
datasets:
- allenai/Dolci-Think-SFT-7B
- unicorn-team/dolci-vi-5k
- unicorn-team/unicorn-12task
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

[<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
<details><summary>See axolotl config</summary>

axolotl version: `0.13.0.dev0`
```yaml
base_model: /mnt/qwen3-8b

load_in_8bit: false
load_in_4bit: false
strict: false

plugins:
  - axolotl.integrations.liger.LigerPlugin

liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true

chat_template: qwen3
datasets:
  # Dolci
  - path: /home/aithucchien/Unicorn/data/dolci_vi_5k_single.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true

  # 12 Task
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task2_sua_loi_sai_answer.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task3_goi_mo_y_tuong_answer.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task10_hoc_tap_tuong_tac_answer.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task1_tra_loi_cau_hoi_extended_2.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task8_tao_tai_lieu_giang_day_answer.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task9_tao_noi_dung_ca_nhan_hoa_answer.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task5_ho_tro_tam_ly_answer.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task7_cham_diem_tu_dong_answer.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task6_tao_bo_cau_hoi_answer.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task1_tra_loi_cau_hoi_extended.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task4_hoc_tap_ca_nhan_hoa_answer.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task1_tra_loi_cau_hoi_answer.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true
  - path: /home/aithucchien/luannd/Unicorn/data/synthetic/output_12_task_split/task11_tu_van_an_toan_answer.jsonl
    type: chat_template
    chat_template: qwen3
    split_thinking: true


output_dir: ./outputs/qwen3_8b_dolci_vi_5k_single_12task/

sequence_len: 32768
sample_packing: true
flex_attention: true


flex_attn_compile_kwargs:
  dynamic: false
  mode: max-autotune-no-cudagraphs

wandb_project: aithucchien
wandb_entity:
wandb_watch:
wandb_name: qwen3_8b_dolci_vi_5k_single_12task
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 2e-5

bf16: true
tf32: true

resume_from_checkpoint:
logging_steps: 1

saves_per_epoch: 1

warmup_ratio: 0.1
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap

fsdp_config:
  fsdp_version: 2
  fsdp_offload_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: Qwen3DecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_reshard_after_forward: true
  fsdp_activation_checkpointing: true

special_tokens:

```

</details><br>

# Unicorn-R3

## 1. Dữ liệu huấn luyện

- Synthetic 12 task từ Gemini: Bởi vì chất lượng thinking của API Gemini không tốt nên team sử dụng chiến lược sinh câu hỏi response - sau đó sinh giả lập thinking như dưới đây:

  1. Tạo Synthetic Syllabus (bao gồm cấp bậc (Tiểu học/THCS/THPT/) - Lớp - môn học - Chủ đề trong từng môn)
  2. Sử dụng Gemini 2.5 pro để sinh trước câu hỏi dựa trên 12 task của BTC dựa trên chủ đề, môn học của synthetic syllabus.
  3. Sau đó sinh Response dựa bằng (3-pro / 2.5-pro)
  4. Bước cuối team sử dụng Gemini 2.5 Pro/Flash để tạo giả lập chuỗi tư duy (synthetic CoT) bằng cách: tạo câu hỏi -> sinh đáp án -> dựng lại biểu diễn reasoning/thinking theo phong cách mô hình Qwen. Điều này giúp mô hình tối ưu khả năng giải quyết bài toán học đường đúng logic và sát thực tế với model base hơn. Final dữ liệu ~4k8 sample.

- Public data: Từ 2M dữ liệu của bộ dữ liệu Dolci-SFT, team lọc theo ngôn ngữ và task thì giữ lại ~5k sample chất lượng cao cho tiếng Việt

- Một số phương pháp team thử nhưng chưa thành công / chưa kịp exp:
  + Crawl được hơn 10k dữ liệu multiple choices + answer từ tracnghiem.net -> sinh synthetic thinking -> Training -> kết quả thấp đi trên VMLU nên quyết định bỏ.
  + Dựa trên 2M SFT, team có phân loại, filter trên random 100k để phân loại thành 12 task chính , nhưng chưa kịp mixing để training

## 2. Training & Evaluation

| Model | VMLU |
| :--- | :--- |
| [Qwen3-4B-Thinking-2507](https://huggingface.co/Qwen/Qwen3-4B-Thinking-2507) | 70.00 |
| [Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) | 69.00 |
| [Qwen3-VL-8B-Thinking](https://huggingface.co/Qwen/Qwen3-VL-8B-Thinking) | 74.10 |
| [**Unicorn-4B-R3**](https://huggingface.co/unicorn-team/Unicorn-4B-R3) | 70.59 |
| [**Unicorn-R3**](https://huggingface.co/unicorn-team/Unicorn-R3) | 71.74 |
| [**Unicorn-VL-R3**](https://huggingface.co/unicorn-team/Unicorn-VL-R3) | **74.87** |

- Team xây dựng 2 bộ benchmark chính để đánh giá trong quá trình training:

  + Tự động: VMLU - Đánh giá trắc nghiệm
  + LLM as Jugde: Build 120 câu benchmark cho 12 task của BTC, sau đó dùng Gemini-3-Pro chấm điểm chất lượng response sinh ra, follow theo bài báo Arena và đưa ra 2 cách chấm:

    + So sánh (Thắng/Thua) 2 response của mô hình khác nhau (mục đích để chọn base model tốt nhất)
    + Rating response (điểm từ 1-10) trước, sau FT với ground_truth response (Mục đích để kiểm tra chất lượng trước và sau finetuning)

- Exp chọn model:
  
  + Lựa chọn: Dựa trên các benchmark của team + report performance từ các mô hình team thử nghiệm tất cả và nhận thấy 3 model tốt nhất cho tiếng Việt, từ thấp đến cao đó là: Qwen3-4B-Thingking-2507, Qwen3-8B, Qwen-8B-VL-Thinking
  + Training các thử nghiệm trên model Qwen3-4B sau đó final apply cho Qwen3-8B-VL và Qwen3-8B

- Training

  + LoRA/FFT: Mặc dù lượng dữ liệu nhỏ nhưng kết quả loss + eval cho thấy FFT vẫn cho hiệu suất tốt hơn
  + Packing dữ liệu, Ligerkernel, Flex attention giảm mem và tăng tốc độ training để exp được nhiều
  + Optimize lr: 2e-5 - 2e-6
  + Total exp ~40-50exp

## 3. Nộp bài

- Mô hình tốt nhất của team training được là Qwen3-VL-8B với 74.87 điểm trên VMLU, nhưng 2 task Instruction Following và Function calling thì chất lượng không bằng Qwen3-8B (chỉ có 71.74 trên VMLU).

  =>  Sau khi tính AVG điểm thì Qwen3-8B đạt 74.03 và Qwen3-8B-VL đạt 73.43 nên team quyết định chọn Qwen3-8B làm final model

- Trong quá trình inference test để hiểu hơn về mô hình, team nhận thấy Qwen3 hay mắc các lỗi về thêm các token tiếng Trung vào trong response dù đã prompt kĩ lưỡng

  => Thực hiện model pruning weight để khiến mô hình không sinh các token tiếng Trung

- Kết quả trước và sau training của Qwen3-8B:

  * VMLU: 69.0 -> 71.74
  * LLM Judge 12 task: 52 -> 72 (Gemini-2.5-Flash: 84 / Gemini-2.5-Pro: 90)