lehungquangminh committed on commit ac2dce6 (verified; parent: 4dc562d)

Add model card

---
language: vi
tags:
- vietnamese
- causal-lm
- finetuning
- viena
library_name: transformers
pipeline_tag: text-generation
license: other
base_model: vietrix/viena-60m-pretrain
---

# Viena 60M (SFT)

## Model details
- Developed by: Vietrix
- Model type: decoder-only causal LM (Llama-style)
- Parameters: ~60M
- Layers: 16
- Hidden size: 512
- Attention heads: 8 (KV heads: 4)
- Max sequence length: 1024
- RoPE theta: 10000
- Normalization/MLP: RMSNorm + SwiGLU
- Precision: trained in BF16

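The hyperparameters above map onto a `transformers` Llama config; a minimal sketch (fields not listed on this card, such as the SwiGLU intermediate size, are left at the library defaults and are assumptions, not the model's actual values):

```python
from transformers import LlamaConfig

# Sketch of the stated architecture; unlisted fields keep library defaults.
config = LlamaConfig(
    vocab_size=32000,              # target vocab from the tokenizer config
    hidden_size=512,
    num_hidden_layers=16,
    num_attention_heads=8,
    num_key_value_heads=4,         # grouped-query attention (4 KV heads)
    max_position_embeddings=1024,
    rope_theta=10000.0,
)
```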
## Tokenizer
- SentencePiece BPE
- Target vocab in config: 32k
- Actual vocab in tokenizer.model: 2,105 (trained on a small corpus)
- Note: the embedding matrix is sized for 32k tokens, but the tokenizer only ever produces the first 2,105 IDs; the remaining rows are never indexed.

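A plain-Python sketch of what that mismatch means in practice (the list-of-lists table is a stand-in for the model's real embedding matrix):

```python
VOCAB_CONFIG = 32000  # embedding rows, sized for the target vocab
VOCAB_ACTUAL = 2105   # ids the SentencePiece model can actually emit
HIDDEN = 512

# Stand-in for the model's input embedding matrix.
embedding_table = [[0.0] * HIDDEN for _ in range(VOCAB_CONFIG)]

token_ids = [0, 7, 2104]  # every id the tokenizer produces is < VOCAB_ACTUAL
assert all(i < VOCAB_ACTUAL for i in token_ids)
vectors = [embedding_table[i] for i in token_ids]  # rows 2105..31999 stay unused
```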
## Training data
- Internal synthetic Vietnamese instruction/chat data.
- Train/val split: 2,000 / 200 JSONL records.
- Format: messages with roles (system/user/assistant/tool).
- PII: best-effort redaction applied during dataset preparation.

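One record in that format might look like the following (the `messages` field name and exact schema are assumptions based on the common chat-JSONL convention, not the card's documented layout):

```python
import json

# Hypothetical record in the messages-with-roles format described above.
record = {
    "messages": [
        {"role": "system", "content": "Bạn là trợ lý ảo tiếng Việt."},
        {"role": "user", "content": "Xin chào!"},
        {"role": "assistant", "content": "Chào bạn! Tôi có thể giúp gì?"},
    ]
}
line = json.dumps(record, ensure_ascii=False)  # one JSONL line per conversation
parsed = json.loads(line)
```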
## Fine-tuning procedure
- Initialized from: `vietrix/viena-60m-pretrain`.
- Objective: token-level cross-entropy; prompt tokens are excluded from the loss.
- Sequence length: 1024.
- Global batch size: 32 (micro-batch 8 × gradient accumulation 4).
- Optimizer: AdamW, lr 2e-4, weight decay 0.01, cosine decay with warmup.
- Steps: 1,000.
- Validation every 200 steps (10 batches per evaluation).

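Excluding prompt tokens from the loss is typically done by setting their labels to the ignore index (`-100` for PyTorch cross-entropy); a minimal sketch, assuming the prompt length is known per example (the helper name is hypothetical):

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids to labels, masking the first prompt_len positions."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Loss is computed only on the assistant tokens (positions 2..4 here).
labels = mask_prompt_labels([101, 5, 6, 7, 8], prompt_len=2)
```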
## Intended use
- Vietnamese chat/instruction-following use cases.
- Research and prototyping; not production-grade and not safety-aligned.

## Limitations
- Trained on a small synthetic corpus; may hallucinate or respond incorrectly.
- Not safety-tuned for sensitive domains.
- Tokenizer vocab is small; lexical coverage is limited.

## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vietrix/viena-60m"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

If `AutoTokenizer` fails, load the SentencePiece model explicitly:
```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(model_id, use_fast=False)
```