YouTube entry: Rosettia video

rosettia-chanka-4b-base

The 4B Chanka-specialized base model that served as the starting point for the Rosettia compact-mixed SFT chain, which produced Thermostatic/rosettia-chanka-4b-alpha160 (chrF++ 56.94 on the held-out clean Chanka set).

This is not the strongest deployable artifact: it scores chrF++ 43.49 on the 158-row held-out set, 13.45 points below the final champion. It is published so that other researchers can train alternative LoRA adapters on the same Chanka-specialized base without redoing the multi-hour broad LoRA SFT + full-FT chain.

Held-out result

| Metric (158-row clean Chanka held-out) | This base | Final champion | Δ (champion − base) |
|---|---|---|---|
| chrF++ | 43.49 | 56.94 | +13.45 |
| BLEU | 16.14 | 30.76 | +14.62 |
| token F1 | 28.94 | 46.43 | +17.49 |
| TER (↓) | 82.49 | 62.21 | −20.28 |
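The card does not define how "token F1" is computed. A common definition is the bag-of-tokens overlap F1 on whitespace-split tokens, averaged over rows; the sketch below assumes that definition, and the function names `token_f1` and `corpus_token_f1` are hypothetical. chrF++, BLEU, and TER are usually computed with a dedicated library such as sacrebleu rather than by hand.

```python
from collections import Counter


def token_f1(hypothesis: str, reference: str) -> float:
    """Bag-of-tokens F1 between one hypothesis and one reference (0..1)."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    if not hyp_tokens or not ref_tokens:
        # Both empty counts as a perfect match; exactly one empty as a miss.
        return float(hyp_tokens == ref_tokens)
    # Multiset intersection: a shared token counts min(hyp count, ref count) times.
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def corpus_token_f1(hypotheses, references) -> float:
    """Mean per-row token F1, scaled to 0..100 like the table above."""
    scores = [token_f1(h, r) for h, r in zip(hypotheses, references)]
    return 100 * sum(scores) / len(scores)
```

Because this variant ignores word order, it rewards lexical overlap only; the actual evaluation script may tokenize or aggregate differently, so scores from this sketch are not directly comparable to the table.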

How this base was produced

| Stage | Recipe |
|---|---|
| 0 | Raw `unsloth/Qwen3.5-4B` |
| 1 | Broad Quechua LoRA SFT (~768 steps on ~169k AmericasNLP + SomosNLP `quy-spa` pairs, LoRA r=64/α=128, LR 5e-6) |
| 2 | Merge the broad LoRA into the full base |
| 3 | Full-parameter FT on the clean Chanka subset (48 steps, LR 2e-6, `paged_adamw_8bit`); produces this checkpoint |
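The stages above can be sketched with PEFT and Transformers roughly as follows. This is a minimal sketch, not the training script used for this checkpoint: only the rank, alpha, learning rates, step count, and optimizer name come from the table. The `target_modules` choice, the function names, and every path argument are assumptions, and batch size, scheduler, and data loading are omitted because the card does not state them.

```python
# Hyperparameters quoted in the recipe table.
LORA_RANK = 64
LORA_ALPHA = 128
BROAD_LORA_LR = 5e-6
FULL_FT_LR = 2e-6
FULL_FT_STEPS = 48

# Standard LoRA scales the low-rank update by alpha / r.
LORA_SCALING = LORA_ALPHA / LORA_RANK


def broad_lora_config():
    """Stage 1: LoRA settings for the broad Quechua SFT."""
    from peft import LoraConfig  # imported lazily; requires `peft`

    return LoraConfig(
        r=LORA_RANK,
        lora_alpha=LORA_ALPHA,
        target_modules="all-linear",  # assumption: not stated in the card
        task_type="CAUSAL_LM",
    )


def merge_broad_lora(base_id, adapter_dir, output_dir):
    """Stage 2: fold the trained adapter into the base weights."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(output_dir)
    return merged


def full_ft_args(output_dir):
    """Stage 3: optimizer settings for the full-parameter Chanka pass."""
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,
        max_steps=FULL_FT_STEPS,
        learning_rate=FULL_FT_LR,
        optim="paged_adamw_8bit",  # needs `bitsandbytes` at training time
    )
```

With r=64 and α=128 the LoRA update is scaled by 2.0, which partly offsets the low stage-1 learning rate.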

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tok = AutoTokenizer.from_pretrained("Thermostatic/rosettia-chanka-4b-base")
model = AutoModelForCausalLM.from_pretrained(
    "Thermostatic/rosettia-chanka-4b-base",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Use the same prompt format as the champion model card.

Or attach the compact-mixed LoRA chain on top with PEFT for the champion behavior (see the champion model card for the full recipe).
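Attaching a single adapter with PEFT might look like the sketch below. The `adapter_id` argument is a placeholder for whichever adapter repository or local directory the champion model card names; this card does not list one, and a multi-adapter chain may need extra steps described there. The function name `load_with_adapter` is hypothetical.

```python
BASE_ID = "Thermostatic/rosettia-chanka-4b-base"


def load_with_adapter(adapter_id, base_id=BASE_ID):
    """Load the base checkpoint and attach one LoRA adapter for inference.

    `adapter_id` is a placeholder: pass the Hub repo or local path given in
    the champion model card.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    # Wrap the base model; adapter weights stay separate (not merged).
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tok, model
```

Keeping the adapter unmerged makes it easy to swap in alternative adapters trained on the same base; call `merge_and_unload()` on the returned model if a single standalone checkpoint is preferred.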

Intended use

- Starting point for further Chanka SFT / LoRA experiments
- Reproducing the Rosettia compact-mixed chain (v11→v12→v13)
- Domain-adapting to other Chanka subdomains (judicial training data is the source here)

Limitations

- Trained on 1,055 reviewed Chanka pairs from a single judicial manual; out-of-domain coverage is unknown
- Chanka variety only (quy_Latn); not appropriate for Cuzco-Collao (quz), Bolivian (quh), or other Quechua varieties without further adaptation
- Tokenizer-level: inherits Qwen3.5-4B vocabulary (no Quechua-specific tokens added)
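One way to gauge the tokenizer-level limitation is to measure how many subword tokens the inherited vocabulary spends per whitespace-delimited word (often called fertility). The sketch below is an illustration, not part of the original evaluation; the helper names are hypothetical. It tokenizes each word in isolation, which can differ slightly from in-context tokenization.

```python
def tokens_per_word(tokenize, text: str) -> float:
    """Average number of tokens produced per whitespace-delimited word.

    `tokenize` is any callable mapping a string to a sequence of tokens,
    for example the `tokenize` method of a Hugging Face tokenizer.
    """
    words = text.split()
    if not words:
        return 0.0
    n_tokens = sum(len(tokenize(word)) for word in words)
    return n_tokens / len(words)


def base_tokenizer_fertility(text: str) -> float:
    """Apply the measurement with this checkpoint's tokenizer (downloads it)."""
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Thermostatic/rosettia-chanka-4b-base")
    return tokens_per_word(tok.tokenize, text)
```

A noticeably higher value on Chanka text than on Spanish text of similar length would suggest that long agglutinated Quechua words are being split into many pieces, which shortens the effective context and can slow generation.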

Citation / attribution

Built for #HACKATHONSomosNLP 2026 by the Thermostatic team. Data: Thermostatic/rosettia-chanka-data. The judicial-manual source PDF is in the public domain.

Model tree for Thermostatic/rosettia-chanka-4b-base

Qwen/Qwen3.5-4B-Base (base model) → Qwen/Qwen3.5-4B (finetuned) → unsloth/Qwen3.5-4B (finetuned) → this model (finetuned)
