rosettia-chanka-4b-base

The 4B Chanka-specialized base model used as the starting point for the Rosettia compact-mixed SFT chain that produced Thermostatic/rosettia-chanka-4b-alpha160 (chrF++ 56.94 on the held-out clean Chanka set).

This is not the strongest deployable artifact: it scores chrF++ 43.49 on the 158-row held-out set, well below the final champion's 56.94. It is published so that other researchers can train alternative LoRA adapters on the same Chanka-specialized base without redoing the multi-hour broad-LoRA and full fine-tuning chain.

Held-out result

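| Metric (158-row clean Chanka held-out) | This base | Final champion | Δ (champion − base) |
| --- | --- | --- | --- |
| chrF++ | 43.49 | 56.94 | +13.45 |
| BLEU | 16.14 | 30.76 | +14.62 |
| token F1 | 28.94 | 46.43 | +17.49 |
| TER (↓, lower is better) | 82.49 | 62.21 | −20.28 |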
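
The Δ column can be rechecked directly from the two score columns. The sketch below recomputes each difference as champion minus base; the names `SCORES` and `deltas` are illustrative and not part of any Rosettia tooling.

```python
# Scores copied from the held-out table: (this base, final champion).
SCORES = {
    "chrF++": (43.49, 56.94),
    "BLEU": (16.14, 30.76),
    "token F1": (28.94, 46.43),
    "TER": (82.49, 62.21),  # lower is better, so a negative delta is an improvement
}


def deltas(scores):
    """Return champion-minus-base differences, rounded to two decimals."""
    return {name: round(champ - base, 2) for name, (base, champ) in scores.items()}


if __name__ == "__main__":
    for name, delta in deltas(SCORES).items():
        print(f"{name}: {delta:+.2f}")
```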

How this base was produced

| Stage | Recipe |
| --- | --- |
| 0 | Raw unsloth/Qwen3.5-4B |
| 1 | Broad Quechua LoRA SFT (~768 steps on ~169k AmericasNLP + SomosNLP quy-spa pairs, LoRA r=64/α=128, LR 5e-6) |
| 2 | Merge the broad LoRA into the full base |
| 3 | Full-parameter fine-tuning on the clean Chanka subset (48 steps, LR 2e-6, paged_adamw_8bit); produces this checkpoint |
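
Stages 1 and 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the team's training script: `BroadLoraStage` and `merge_broad_lora` are hypothetical names, the adapter and output paths are placeholders, and the merge step assumes a standard PEFT LoRA adapter. Only the hyperparameters (r=64, α=128, LR 5e-6, ~768 steps) come from the recipe above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BroadLoraStage:
    """Stage 1 hyperparameters as listed in the recipe (hypothetical container)."""
    r: int = 64
    lora_alpha: int = 128
    learning_rate: float = 5e-6
    max_steps: int = 768

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / r.
        return self.lora_alpha / self.r


def merge_broad_lora(base_id: str, adapter_path: str, out_dir: str) -> None:
    """Stage 2 sketch: fold a LoRA adapter into the base weights.

    Requires torch, transformers and peft; imported lazily so the
    hyperparameter container above stays usable without them.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    # merge_and_unload() returns the base model with the LoRA deltas added in.
    merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
    merged.save_pretrained(out_dir)
```

With r=64 and α=128 the LoRA update is scaled by 2.0. A call such as `merge_broad_lora("unsloth/Qwen3.5-4B", "path/to/broad_lora", "path/to/merged")` would then yield the input for the stage 3 full-parameter run; both paths are placeholders.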

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("Thermostatic/rosettia-chanka-4b-base")
model = AutoModelForCausalLM.from_pretrained(
    "Thermostatic/rosettia-chanka-4b-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Use the same prompt format as the champion model card.

Alternatively, attach the compact-mixed LoRA chain on top with PEFT to obtain the champion behavior (see the champion model card for the full recipe).
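
One way to attach a chain of adapters with PEFT is sketched below. It assumes each adapter in the chain is a standard LoRA adapter that is merged before the next one is loaded; whether the actual compact-mixed chain works this way is documented in the champion model card, not here. `apply_adapter_chain` and `peft_attach_and_merge` are hypothetical helpers, and the adapter paths are placeholders.

```python
def apply_adapter_chain(model, adapter_paths, attach):
    """Apply adapters in order, feeding each result into the next step."""
    for path in adapter_paths:
        model = attach(model, path)
    return model


def peft_attach_and_merge(model, adapter_path):
    """Load one LoRA adapter with PEFT and merge it into the model (needs peft)."""
    from peft import PeftModel

    return PeftModel.from_pretrained(model, adapter_path).merge_and_unload()


# Usage sketch (placeholder paths, assumed to be LoRA adapters):
# model = apply_adapter_chain(
#     model,
#     ["path/to/adapter_a", "path/to/adapter_b"],
#     peft_attach_and_merge,
# )
```

Passing the attach step as an argument keeps the ordering logic separate from PEFT, so the chain order can be checked without loading any weights.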

Intended use

  • Starting point for further Chanka SFT / LoRA experiments
  • Reproducing the Rosettia compact-mixed chain (v11→v12→v13)
  • Domain-adapting to other Chanka subdomains (the training data here comes from a judicial source)

Limitations

  • Trained on 1,055 reviewed Chanka pairs from a single judicial manual; out-of-domain coverage is unknown
  • Chanka variety only (quy_Latn); not appropriate for Cuzco-Collao (quz), Bolivian (quh), or other Quechua varieties without further adaptation
  • Tokenizer-level: inherits Qwen3.5-4B vocabulary (no Quechua-specific tokens added)
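
Because no Quechua-specific tokens were added, it may be worth checking how finely the inherited vocabulary splits Chanka words. The sketch below computes a simple fertility figure (average subword tokens per whitespace-separated word). `fertility` and `hf_tokenize` are hypothetical helpers, and tokenizing each word in isolation is a simplification that ignores leading-space effects.

```python
def fertility(tokenize, text):
    """Average number of subword tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(tokenize(word)) for word in words) / len(words)


def hf_tokenize(model_id):
    """Build a tokenize callable from a Hugging Face tokenizer (needs transformers)."""
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained(model_id)
    # tok.tokenize returns the list of subword strings for the given text.
    return lambda word: tok.tokenize(word)


# Usage sketch:
# tokenize = hf_tokenize("Thermostatic/rosettia-chanka-4b-base")
# print(fertility(tokenize, chanka_text))  # chanka_text: your own sample
```

A higher value means words are split into more pieces, which generally lengthens sequences for the same text.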

Citation / attribution

Built for #HACKATHONSomosNLP 2026 by the Thermostatic team. Data: Thermostatic/rosettia-chanka-data. The judicial-manual source PDF is in the public domain.
