---
language: vi
tags:
- vietnamese
- causal-lm
- finetuning
- viena
library_name: transformers
pipeline_tag: text-generation
license: other
base_model: vietrix/viena-60m-pretrain
---

# Viena 60M (SFT)

## Model details
- Developed by: Vietrix
- Model type: decoder-only causal LM (Llama-style)
- Parameters: ~60M
- Layers: 16
- Hidden size: 512
- Attention heads: 8 (KV heads: 4)
- Max sequence length: 1024
- RoPE theta: 10000
- Normalization/MLP: RMSNorm + SwiGLU
- Precision: BF16 training

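With tied input/output embeddings and an assumed SwiGLU intermediate size of 1280 (neither is stated on this card), the listed dimensions land close to the quoted ~60M parameters. A back-of-the-envelope count:

```python
# Rough parameter count for the listed architecture.
# Assumptions (NOT stated in the card): intermediate_size = 1280,
# tied input/output embeddings, RMSNorm weights only (no biases).
VOCAB = 32000        # embedding rows (sized for the 32k target vocab)
HIDDEN = 512
LAYERS = 16
HEADS, KV_HEADS = 8, 4
INTERMEDIATE = 1280  # assumed; not given in the card

head_dim = HIDDEN // HEADS      # 64
kv_dim = KV_HEADS * head_dim    # 256 (grouped-query attention)

embed = VOCAB * HIDDEN                               # shared with lm_head (tied)
attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim     # q, o + k, v projections
mlp = 3 * HIDDEN * INTERMEDIATE                      # gate, up, down (SwiGLU)
norms = 2 * HIDDEN                                   # two RMSNorms per layer
per_layer = attn + mlp + norms

total = embed + LAYERS * per_layer + HIDDEN          # + final RMSNorm
print(f"{total / 1e6:.1f}M parameters")              # ~60.4M under these assumptions
```

The exact total depends on the real intermediate size and whether embeddings are tied, so treat this as a sanity check, not a specification.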
## Tokenizer
- SentencePiece BPE
- Target vocab in config: 32k
- Actual vocab in `tokenizer.model`: 2105 (trained on a small corpus)
- Note: embeddings are sized for 32k; only the first 2105 tokens are used by the tokenizer.

## Training data
- Internal synthetic Vietnamese instruction/chat data.
- Train/val split: 2,000 / 200 JSONL records.
- Format: messages with roles (system/user/assistant/tool).
- PII: best-effort redaction applied during dataset preparation.

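Each JSONL record carries a list of role-tagged messages, one record per line. The field names and the sample content below are an illustrative assumption based on the common messages format, not taken from the actual dataset:

```python
import json

# Hypothetical example of one training record in the role-tagged
# messages format described above (field names are assumed).
record = {
    "messages": [
        {"role": "system", "content": "Bạn là trợ lý ảo tiếng Việt."},
        {"role": "user", "content": "Thủ đô của Việt Nam là gì?"},
        {"role": "assistant", "content": "Thủ đô của Việt Nam là Hà Nội."},
    ]
}

# Serialize as one JSONL line; ensure_ascii=False keeps Vietnamese text readable.
line = json.dumps(record, ensure_ascii=False)
parsed = json.loads(line)
roles = [m["role"] for m in parsed["messages"]]
print(roles)
```
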
## Fine-tuning procedure
- Initialized from: `vietrix/viena-60m-pretrain`.
- Objective: token-level cross-entropy; prompt loss disabled.
- Sequence length: 1024.
- Global batch size: 32 (micro-batch 8 × grad accumulation 4).
- Optimizer: AdamW, lr 2e-4, weight decay 0.01, cosine decay with warmup.
- Steps: 1,000.
- Validation every 200 steps (10 batches).

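The cosine-decay-with-warmup schedule can be sketched in a few lines; the warmup length below is an assumption, since the card only says "with warmup":

```python
import math

PEAK_LR = 2e-4
TOTAL_STEPS = 1000
WARMUP_STEPS = 100  # assumed; the card does not state the warmup length

def lr_at(step: int) -> float:
    """AdamW learning rate: linear warmup, then cosine decay toward zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

# lr ramps up to the 2e-4 peak at the end of warmup, then decays to ~0 by step 1000
```
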
## Intended use
- Vietnamese chat/instruction-following use cases.
- Research and prototyping; not a production-grade safety model.

## Limitations
- Trained on a small synthetic corpus; may hallucinate or respond incorrectly.
- Not safety-tuned for sensitive domains.
- Tokenizer vocab is small; lexical coverage is limited.

## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vietrix/viena-60m"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

If `AutoTokenizer` fails, load the SentencePiece-backed tokenizer class explicitly (`use_fast=False` is unnecessary here, since `LlamaTokenizer` is the slow tokenizer):
```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(model_id)
```