vi-en-glm: Bilingual Small Language Model (English/Vietnamese)

This model is a Small Language Model (SLM) based on the GLM (General Language Model) architecture, trained from scratch to support both English and Vietnamese. It was developed as part of an NLP project to explore bilingual text generation and cross-lingual understanding in a resource-constrained environment.

Model Description

  • Language(s): Vietnamese, English
  • Model Type: Causal Language Model (GLM Architecture)
  • Parameters: ~200 Million
  • Vocabulary Size: 32,000 (BPE Tokenizer)
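As a rough sanity check on the ~200M figure, here is a back-of-envelope parameter count. The hidden size and layer count below are hypothetical (the actual model configuration is not stated above); only the 32,000-token vocabulary comes from the card.

```python
def approx_glm_params(vocab=32_000, d=1024, n_layers=16, tied_embeddings=True):
    """Back-of-envelope transformer parameter count (biases/norms ignored)."""
    embed = vocab * d             # token embedding matrix
    attn = 4 * d * d              # Q, K, V, and output projections
    ffn = 2 * d * (4 * d)         # two FFN matrices with 4x expansion
    per_layer = attn + ffn
    total = embed * (1 if tied_embeddings else 2) + n_layers * per_layer
    return total
```

With these assumed hyperparameters the estimate lands in the low hundreds of millions, consistent with the ~200M stated above.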

Training Details

Training Data

The model was trained on an interleaved dataset consisting of 52,000 English and 52,000 Vietnamese samples from Wikipedia. The datasets were cleaned to remove Wikipedia markup and filtered for high-quality content longer than 100 characters.
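The cleaning and interleaving step might look roughly like the sketch below. Only the markup removal, the >100-character filter, and the interleaving are stated above; the specific regexes and function names are illustrative assumptions, not the actual preprocessing code.

```python
import re

def clean_wiki_text(text):
    """Strip common wiki markup: {{templates}}, [[link|label]], '''bold'''."""
    text = re.sub(r"\{\{.*?\}\}", "", text)                         # templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # links -> label
    text = re.sub(r"'{2,}", "", text)                               # bold/italic quotes
    return text.strip()

def build_interleaved(en_samples, vi_samples, min_chars=100):
    """Clean both corpora, drop short samples, and alternate EN/VI."""
    en = [t for t in map(clean_wiki_text, en_samples) if len(t) > min_chars]
    vi = [t for t in map(clean_wiki_text, vi_samples) if len(t) > min_chars]
    interleaved = []
    for e, v in zip(en, vi):
        interleaved.extend([e, v])
    return interleaved
```

Interleaving the two languages keeps each training batch bilingual, which helps prevent the model from drifting toward one language during training.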

Training Procedure

Training was conducted on two Kaggle T4 GPUs using fp16 mixed precision, with TF32 enabled for faster matrix operations.

  • Epochs: 10
  • Batch Size: 64 (Effective Batch Size: 256 via Gradient Accumulation)
  • Learning Rate: 5e-4 with Cosine Scheduler
  • Optimizer: AdamW
  • Masking Strategy: GLM span masking (15% probability)
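The span-masking strategy above can be sketched roughly as follows. This is a simplified illustration of GLM-style span corruption, not the exact training code: the span-length distribution and the `[S]` span-start token are assumptions; only the 15% masking budget comes from the card. Part A is the corrupted input; Part B holds the masked spans as autoregressive targets.

```python
import random

def glm_span_mask(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Sample non-overlapping spans covering ~mask_ratio of the tokens,
    replace each span with a single [MASK] in Part A, and append the
    original spans (prefixed by [S]) to Part B for prediction."""
    rng = random.Random(seed)
    n = len(tokens)
    budget = max(1, int(n * mask_ratio))  # number of tokens to mask
    masked, spans = set(), []
    while budget > 0:
        span_len = min(budget, rng.randint(1, 3))   # short spans, capped by budget
        start = rng.randrange(0, n - span_len + 1)
        if any(i in masked for i in range(start, start + span_len)):
            continue                                 # resample on overlap
        spans.append((start, span_len))
        masked.update(range(start, start + span_len))
        budget -= span_len
    spans.sort()
    part_a, part_b, i = [], [], 0
    for start, span_len in spans:
        part_a.extend(tokens[i:start])
        part_a.append(mask_token)
        part_b.extend(["[S]"] + tokens[start:start + span_len])
        i = start + span_len
    part_a.extend(tokens[i:])
    return part_a, part_b
```

During training the model sees Part A with bidirectional attention and generates Part B left to right, which is what lets a GLM-style model serve as a causal LM at inference time.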

How to Use

from transformers import GlmForCausalLM, PreTrainedTokenizerFast

model = GlmForCausalLM.from_pretrained("JohnMarble/vi-en-glm")
tokenizer = PreTrainedTokenizerFast.from_pretrained("JohnMarble/vi-en-glm")

prompt = "Việt Nam là một quốc gia"
inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False)

outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))