vi-en-glm: Bilingual Small Language Model (English/Vietnamese)
This model is a Small Language Model (SLM) based on the GLM (General Language Model) architecture, trained from scratch to support both English and Vietnamese. It was developed as part of an NLP project to explore bilingual text generation and cross-lingual understanding in a resource-constrained environment.
Model Description
- Language(s): Vietnamese, English
- Model Type: Causal Language Model (GLM Architecture)
- Parameters: ~200 Million
- Vocabulary Size: 32,000 (BPE Tokenizer)
Training Details
Training Data
The model was trained on an interleaved dataset of 52,000 English and 52,000 Vietnamese samples from Wikipedia. The data was cleaned to remove wiki markup and filtered to keep only high-quality content longer than 100 characters.
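The cleaning, filtering, and interleaving described above can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: the helper names (strip_wiki_markup, clean_and_filter, interleave) and the specific regular expressions are assumptions, and only the 100-character threshold comes from this card.

```python
import re
from itertools import chain

MIN_CHARS = 100  # length threshold stated in this model card


def strip_wiki_markup(text: str) -> str:
    """Rough wiki-markup cleanup (illustrative rules, not the author's)."""
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # ''italic'' and '''bold''' quote markers
    text = re.sub(r"'{2,}", "", text)
    # == Heading == -> Heading
    text = re.sub(r"^=+\s*(.*?)\s*=+$", r"\1", text, flags=re.M)
    # collapse runs of spaces and tabs
    return re.sub(r"[ \t]+", " ", text).strip()


def clean_and_filter(samples):
    """Clean each sample, then keep only those longer than MIN_CHARS."""
    cleaned = (strip_wiki_markup(s) for s in samples)
    return [s for s in cleaned if len(s) > MIN_CHARS]


def interleave(english, vietnamese):
    """Alternate en, vi, en, vi, ...; stops at the shorter list."""
    return list(chain.from_iterable(zip(english, vietnamese)))
```

With equal-sized inputs (52,000 samples per language here), interleave yields a strictly alternating sequence, so every batch window sees both languages in roughly equal proportion.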
Training Procedure
Training was conducted on 2 Kaggle T4 GPUs using fp16 precision. TF32 was also enabled, although T4 (Turing) GPUs lack TF32 support (introduced with Ampere), so it likely had no effect on this hardware.
- Epochs: 10
- Batch Size: 64 (Effective Batch Size: 256 via Gradient Accumulation)
- Learning Rate: 5e-4 with Cosine Scheduler
- Optimizer: AdamW
- Masking Strategy: GLM span masking (15% probability)
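To make the batch-size and scheduler settings above concrete, here is a small arithmetic sketch. The peak learning rate, batch sizes, epoch count, and sample count come from this card; everything else is an assumption: that 64 is the batch processed per micro-step across both GPUs, that one sample corresponds to one training sequence, and that the cosine schedule decays to zero with no warmup (the card does not say).

```python
import math

PEAK_LR = 5e-4           # from the model card
BATCH_SIZE = 64          # from the model card
EFFECTIVE_BATCH = 256    # from the model card
EPOCHS = 10              # from the model card
NUM_SAMPLES = 52_000 * 2 # English + Vietnamese samples

# Assumption: 64 is the total batch per micro-step, so 4 micro-steps
# are accumulated before each optimizer update.
accumulation_steps = EFFECTIVE_BATCH // BATCH_SIZE

# Assumption: one sample is one training sequence.
steps_per_epoch = math.ceil(NUM_SAMPLES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS


def cosine_lr(step: int, total: int, peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at step `total` (no warmup)."""
    progress = min(max(step / total, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

If 64 is instead the per-GPU batch, the two GPUs already give 128 per micro-step and only 2 accumulation steps would be needed to reach 256.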
How to Use
from transformers import GlmForCausalLM, PreTrainedTokenizerFast

# Load the model and tokenizer from the Hugging Face Hub.
model = GlmForCausalLM.from_pretrained("JohnMarble/vi-en-glm")
tokenizer = PreTrainedTokenizerFast.from_pretrained("JohnMarble/vi-en-glm")

prompt = "Việt Nam là một quốc gia"  # "Vietnam is a country"
inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False)

# Sample a continuation with nucleus (top-p) sampling.
outputs = model.generate(
    **inputs,
    max_length=50,    # total length in tokens, prompt included
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
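For readers unfamiliar with the temperature and top_p arguments used in the call to generate, the following pure-Python sketch shows the idea on a single list of logits. It is an illustration only: sample_top_p is a hypothetical helper, not part of transformers, and the library's tensor-based implementation differs in detail.

```python
import math
import random


def sample_top_p(logits, temperature=0.7, top_p=0.9, rng=random):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample among the kept tokens; random.choices normalises the weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

When one token dominates the distribution, the nucleus shrinks to that single token and sampling becomes effectively greedy; flatter distributions leave more candidates and produce more varied text.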