lehungquangminh committed on
Commit 8d63353 · verified · 1 Parent(s): b2b50b4

Add model card

Files changed (1)
  1. README.md +70 -0
README.md ADDED
@@ -0,0 +1,70 @@
---
language: vi
tags:
- vietnamese
- causal-lm
- pretraining
- viena
library_name: transformers
pipeline_tag: text-generation
license: other
---

# Viena 60M (Pretrain)

## Model details
- Developed by: Vietrix
- Model type: decoder-only causal LM (Llama-style)
- Parameters: ~60M
- Layers: 16
- Hidden size: 512
- Attention heads: 8 (KV heads: 4, i.e. grouped-query attention)
- Max sequence length: 1024
- RoPE theta: 10000
- Normalization/MLP: RMSNorm + SwiGLU
- Training precision: BF16
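
For reference, the numbers above map onto a `transformers` `LlamaConfig` roughly as in the sketch below. This is a reconstruction from the listed values, not the shipped config file; `intermediate_size` is not stated in this card and is a guess chosen to land near ~60M parameters.

```python
from transformers import LlamaConfig

# Hypothetical reconstruction of the architecture from the bullet list above.
config = LlamaConfig(
    vocab_size=32000,               # config-level vocab (see Tokenizer section)
    hidden_size=512,
    num_hidden_layers=16,
    num_attention_heads=8,
    num_key_value_heads=4,          # grouped-query attention
    max_position_embeddings=1024,
    rope_theta=10000.0,
    hidden_act="silu",              # SwiGLU-style gated MLP
    intermediate_size=1408,         # assumption: not stated in this card
)
```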

## Tokenizer
- SentencePiece BPE
- Vocab size in the model config: 32k
- Actual vocab in `tokenizer.model`: 2105 (trained on a small corpus)
- Note: the embedding matrix is sized for 32k entries, but only the first 2105 ids are ever produced by the tokenizer.
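
A quick way to observe this mismatch after loading is sketched below; it assumes the published checkpoint keeps the 32k embedding size described above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vietrix/viena-60m-pretrain"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)

print(len(tokenizer))           # actual tokenizer vocab (~2105)
print(model.config.vocab_size)  # embedding rows (32000)
# Ids >= len(tokenizer) index untrained embedding rows, so keep encoded
# and generated ids within the tokenizer's own vocabulary.
```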

## Training data
- Internal synthetic Vietnamese pretraining corpus.
- Domains: Vietnam/general knowledge, math, code, identity.
- Raw JSONL entries: ~2.4k; after cleanup/dedupe, the HF dataset contains 472 unique texts.
- PII: best-effort redaction during the dataset build.
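
The cleanup pipeline itself is not published; the sketch below shows a plain exact-match dedupe of the kind described, with a hypothetical input path and field name.

```python
import json

# Hypothetical exact-match dedupe over a JSONL corpus, as described above.
seen, unique = set(), []
with open("pretrain_corpus.jsonl", encoding="utf-8") as f:  # hypothetical path
    for line in f:
        text = json.loads(line).get("text", "").strip()  # assumed field name
        if text and text not in seen:
            seen.add(text)
            unique.append(text)

print(len(unique), "unique texts")  # the card reports 472 after this step
```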

## Training procedure
- Objective: next-token prediction over packed sequences.
- Sequence length: 1024.
- Global batch size: 64 (micro-batch 16 x gradient accumulation 4).
- Optimizer: AdamW, lr 3e-4, weight decay 0.1, cosine decay with 10% warmup.
- Steps: 2,500 (~163.8M tokens processed).
- Checkpoints saved every 1,250 steps.
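
For illustration, these hyperparameters map onto Hugging Face `TrainingArguments` as sketched below. This assumes a single-device `Trainer` run and is not the original training script; `output_dir` is a placeholder.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the hyperparameters above onto HF Trainer settings.
args = TrainingArguments(
    output_dir="viena-60m-pretrain",   # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,     # 16 x 4 = global batch 64
    learning_rate=3e-4,                # AdamW is the Trainer default optimizer
    weight_decay=0.1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                  # 10% warmup
    max_steps=2500,
    save_steps=1250,                   # checkpoints every 1,250 steps
    bf16=True,
)
# Tokens processed: 64 seqs/step * 1024 tokens * 2500 steps = 163,840,000 ≈ 163.8M
```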

## Intended use
- Base model for continued pretraining or fine-tuning on Vietnamese tasks.
- Not instruction-tuned; it produces raw completions with no alignment or safety tuning.

## Limitations
- Trained on a small synthetic corpus; coverage and factuality are limited.
- Not suitable for safety-critical or high-stakes applications.
- The tokenizer vocabulary (2105) is far smaller than the model vocabulary (32k), so lexical coverage is limited.

## How to use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "vietrix/viena-60m-pretrain"
# use_fast=False loads the slow SentencePiece tokenizer shipped with the model
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

If `AutoTokenizer` fails, load the SentencePiece model explicitly:
```python
from transformers import LlamaTokenizer

model_id = "vietrix/viena-60m-pretrain"
tokenizer = LlamaTokenizer.from_pretrained(model_id, use_fast=False)
```
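
Once loaded, a minimal generation smoke test looks like the sketch below; the prompt and decoding settings are illustrative, and output from a small, non-instruction-tuned base model will be rough.

```python
# Illustrative smoke test: greedy decoding, short completion.
inputs = tokenizer("Việt Nam là", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```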