---
language: vi
tags:
- vietnamese
- causal-lm
- finetuning
- viena
library_name: transformers
pipeline_tag: text-generation
license: other
base_model: vietrix/viena-60m-pretrain
---
# Viena 60M (SFT)
## Model details
- Developed by: Vietrix
- Model type: decoder-only causal LM (Llama-style)
- Parameters: ~60M
- Layers: 16
- Hidden size: 512
- Attention heads: 8 (KV heads: 4)
- Max sequence length: 1024
- RoPE theta: 10000
- Normalization/MLP: RMSNorm + SwiGLU
- Precision: BF16 training
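As a sanity check, the architecture numbers above are consistent with the ~60M figure. Note the MLP intermediate size is not stated in this card, so the value below (≈8/3 × hidden, a common Llama-style default) is an assumption, as is weight tying between the embedding and LM head:

```python
# Back-of-the-envelope parameter count from the figures in this card.
hidden, layers, vocab = 512, 16, 32_000
heads, kv_heads = 8, 4
head_dim = hidden // heads            # 64
inter = 1365                          # ASSUMED SwiGLU intermediate size (~8/3 * hidden)

attn = hidden * hidden * 2 + hidden * (kv_heads * head_dim) * 2  # q,o + k,v projections
mlp = 3 * hidden * inter              # gate, up, down projections
embed = vocab * hidden                # ASSUMED tied with the LM head

total = layers * (attn + mlp) + embed
print(f"~{total / 1e6:.1f}M parameters")
```

This lands in the low 60M range, ignoring the (negligible) RMSNorm weights.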
## Tokenizer
- SentencePiece BPE
- Target vocab in config: 32k
- Actual vocab in tokenizer.model: 2105 (trained on a small corpus)
- Note: the embedding matrix is allocated for the 32k config value, but the tokenizer can only produce IDs for the first 2,105 rows; the remaining rows are unreachable (and untrained).
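The practical effect of this mismatch can be quantified with a small helper (hypothetical, not part of the repo); in a live session you would pass `model.get_input_embeddings().num_embeddings` and `len(tokenizer)`:

```python
def vocab_mismatch(embedding_rows: int, tokenizer_vocab: int) -> int:
    """Return how many embedding rows the tokenizer can never emit an ID for."""
    if tokenizer_vocab > embedding_rows:
        raise ValueError("tokenizer emits IDs outside the embedding table")
    return embedding_rows - tokenizer_vocab

# Numbers from this model card:
print(vocab_mismatch(32_000, 2_105))  # 29895 unreachable rows
```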
## Training data
- Internal synthetic Vietnamese instruction/chat data.
- Train/val split: 2,000 / 200 JSONL records.
- Format: messages with roles (system/user/assistant/tool).
- PII: best-effort redaction applied during dataset preparation.
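A hypothetical record in this schema, shown as one JSONL line (the field values are invented; only the role structure follows this card):

```python
import json

# One chat example = one JSON object per line in the JSONL files.
record = {
    "messages": [
        {"role": "system", "content": "Bạn là trợ lý AI hữu ích."},
        {"role": "user", "content": "Thủ đô của Việt Nam là gì?"},
        {"role": "assistant", "content": "Thủ đô của Việt Nam là Hà Nội."},
    ]
}
line = json.dumps(record, ensure_ascii=False)  # keep Vietnamese text unescaped
roles = [m["role"] for m in record["messages"]]
print(roles)  # ['system', 'user', 'assistant']
```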
## Fine-tuning procedure
- Initialized from: `vietrix/viena-60m-pretrain`.
- Objective: token-level cross-entropy with prompt tokens masked out, so loss is computed only on response tokens.
- Sequence length: 1024.
- Global batch size: 32 (per-device batch 8 × gradient accumulation 4).
- Optimizer: AdamW, lr 2e-4, weight decay 0.01, cosine decay with warmup.
- Steps: 1,000.
- Validation: every 200 steps, evaluated on 10 held-out batches.
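The prompt-masking part of the objective can be sketched as follows; `-100` is PyTorch's cross-entropy ignore index, and the function and variable names here are illustrative, not the actual training code:

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss in PyTorch

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, masking the first prompt_len positions."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Fake token IDs: a 4-token prompt followed by a 2-token response.
tokens = [5, 8, 13, 21, 34, 55]
print(mask_prompt_labels(tokens, 4))  # [-100, -100, -100, -100, 34, 55]
```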
## Intended use
- Vietnamese chat/instruction-following use cases.
- Research and prototyping; not a production-grade safety model.
## Limitations
- Trained on a small synthetic corpus; may hallucinate or respond incorrectly.
- Not safety-tuned for sensitive domains.
- Tokenizer vocab is small; lexical coverage is limited.
## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "vietrix/viena-60m"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
If `AutoTokenizer` fails, load the SentencePiece model explicitly:
```python
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained(model_id)
```
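Once loaded, generation works as with any causal LM; the sampling settings below are illustrative defaults, not tuned recommendations:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vietrix/viena-60m"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Xin chào!", return_tensors="pt")
output_ids = model.generate(
    **inputs, max_new_tokens=64, do_sample=True, temperature=0.7
)
text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(text)
```

Given the small training corpus, expect short and sometimes incoherent completions.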