lehungquangminh committed on
Commit 8d63353 · verified · 1 Parent(s): b2b50b4

Add model card

Files changed (1)
  1. README.md +70 -0
README.md ADDED
@@ -0,0 +1,70 @@
---
language: vi
tags:
- vietnamese
- causal-lm
- pretraining
- viena
library_name: transformers
pipeline_tag: text-generation
license: other
---

# Viena 60M (Pretrain)

## Model details
- Developed by: Vietrix
- Model type: decoder-only causal LM (Llama-style)
- Parameters: ~60M
- Layers: 16
- Hidden size: 512
- Attention heads: 8 (KV heads: 4, i.e. grouped-query attention)
- Max sequence length: 1024
- RoPE theta: 10000
- Normalization/MLP: RMSNorm + SwiGLU
- Training precision: BF16
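
For reference, the numbers above map onto a `transformers` `LlamaConfig` roughly as in the sketch below. This is a reconstruction from the listed values, not the shipped config file; `intermediate_size` is not stated in this card and is a guess chosen to land near ~60M parameters.

```python
from transformers import LlamaConfig

# Hypothetical reconstruction of the architecture from the bullet list above.
config = LlamaConfig(
    vocab_size=32000,               # config-level vocab (see Tokenizer section)
    hidden_size=512,
    num_hidden_layers=16,
    num_attention_heads=8,
    num_key_value_heads=4,          # grouped-query attention
    max_position_embeddings=1024,
    rope_theta=10000.0,
    hidden_act="silu",              # SwiGLU-style gated MLP
    intermediate_size=1408,         # assumption: not stated in this card
)
```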

## Tokenizer
- SentencePiece BPE
- Vocab size in the model config: 32k
- Actual vocab in `tokenizer.model`: 2105 (trained on a small corpus)
- Note: the embedding matrix is sized for 32k entries, but only the first 2105 ids are ever produced by the tokenizer.
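
A quick way to observe this mismatch after loading is sketched below; it assumes the published checkpoint keeps the 32k embedding size described above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vietrix/viena-60m-pretrain"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(len(tokenizer))           # actual tokenizer vocab (~2105)
print(model.config.vocab_size)  # embedding rows (32000)
# Ids >= len(tokenizer) index untrained embedding rows, so keep encoded
# and generated ids within the tokenizer's own vocabulary.
```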

## Training data
- Internal synthetic Vietnamese pretraining corpus.
- Domains: Vietnam/general knowledge, math, code, identity.
- Raw JSONL entries: ~2.4k; after cleanup/dedupe, the HF dataset contains 472 unique texts.
- PII: best-effort redaction during the dataset build.
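
The cleanup pipeline itself is not published; the sketch below shows a plain exact-match dedupe of the kind described, with a hypothetical input path and field name.

```python
import json

# Hypothetical exact-match dedupe over a JSONL corpus, as described above.
seen, unique = set(), []
with open("pretrain_corpus.jsonl", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        text = json.loads(line).get("text", "").strip()  # assumed field name
        if text and text not in seen:
            seen.add(text)
            unique.append(text)

print(len(unique), "unique texts")  # the card reports 472 after this step
```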

## Training procedure
- Objective: next-token prediction over packed sequences.
- Sequence length: 1024.
- Global batch size: 64 (micro-batch 16 x gradient accumulation 4).
- Optimizer: AdamW, lr 3e-4, weight decay 0.1, cosine decay with 10% warmup.
- Steps: 2,500 (~163.8M tokens processed).
- Checkpoints saved every 1,250 steps.
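
For illustration, these hyperparameters map onto Hugging Face `TrainingArguments` as sketched below. This assumes a single-device `Trainer` run and is not the original training script; `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the hyperparameters above onto HF Trainer settings.
args = TrainingArguments(
    output_dir="viena-60m-pretrain",   # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,     # 16 x 4 = global batch 64
    learning_rate=3e-4,                # AdamW is the Trainer default optimizer
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # 10% warmup
    max_steps=2500,
    save_steps=1250,                   # checkpoints every 1,250 steps
    bf16=True,
)
# Tokens processed: 64 seqs/step * 1024 tokens * 2500 steps = 163,840,000 ≈ 163.8M
```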

## Intended use
- Base model for continued pretraining or fine-tuning on Vietnamese tasks.
- Not instruction-tuned; it produces raw completions with no alignment or safety tuning.

## Limitations
- Trained on a small synthetic corpus; coverage and factuality are limited.
- Not suitable for safety-critical or high-stakes applications.
- The tokenizer vocabulary (2105) is far smaller than the model vocabulary (32k), so lexical coverage is limited.

## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vietrix/viena-60m-pretrain"
# use_fast=False loads the slow SentencePiece tokenizer shipped with the model
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

If `AutoTokenizer` fails, load the SentencePiece model explicitly:
```python
from transformers import LlamaTokenizer

model_id = "vietrix/viena-60m-pretrain"
tokenizer = LlamaTokenizer.from_pretrained(model_id, use_fast=False)
```
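
Once loaded, a minimal generation smoke test looks like the sketch below; the prompt and decoding settings are illustrative, and output from a small, non-instruction-tuned base model will be rough.

```python
# Illustrative smoke test: greedy decoding, short completion.
inputs = tokenizer("Việt Nam là", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```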