vi-en-glm: Bilingual Small Language Model (English/Vietnamese)

This model is a Small Language Model (SLM) based on the GLM (General Language Model) architecture, trained from scratch to support both English and Vietnamese. It was developed as part of an NLP project to explore bilingual text generation and cross-lingual understanding in a resource-constrained environment.

Model Description

  • Language(s): Vietnamese, English
  • Model Type: Causal Language Model (GLM Architecture)
  • Parameters: ~200 Million
  • Vocabulary Size: 32,000 (BPE Tokenizer)
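As a rough sanity check on the ~200M figure, here is a back-of-envelope parameter count. The hidden size and layer count below are hypothetical (the actual model configuration is not stated above); only the 32,000-token vocabulary comes from the card.

```python
def approx_glm_params(vocab=32_000, d=1024, n_layers=16, tied_embeddings=True):
    """Back-of-envelope transformer parameter count (biases/norms ignored)."""
    embed = vocab * d             # token embedding matrix
    attn = 4 * d * d              # Q, K, V, and output projections
    ffn = 2 * d * (4 * d)         # two FFN matrices with 4x expansion
    per_layer = attn + ffn
    total = embed * (1 if tied_embeddings else 2) + n_layers * per_layer
    return total
```

With these assumed hyperparameters the estimate lands in the low hundreds of millions, consistent with the ~200M stated above.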

Training Details

Training Data

The model was trained on an interleaved dataset consisting of 52,000 English and 52,000 Vietnamese samples from Wikipedia. The datasets were cleaned to remove Wikipedia markup and filtered for high-quality content longer than 100 characters.
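The cleaning and interleaving step might look roughly like the sketch below. Only the markup removal, the >100-character filter, and the interleaving are stated above; the specific regexes and function names are illustrative assumptions, not the actual preprocessing code.

```python
import re

def clean_wiki_text(text):
    """Strip common wiki markup: {{templates}}, [[link|label]], '''bold'''."""
    text = re.sub(r"\{\{.*?\}\}", "", text)                         # templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)   # links -> label
    text = re.sub(r"'{2,}", "", text)                               # bold/italic quotes
    return text.strip()

def build_interleaved(en_samples, vi_samples, min_chars=100):
    """Clean both corpora, drop short samples, and alternate EN/VI."""
    en = [t for t in map(clean_wiki_text, en_samples) if len(t) > min_chars]
    vi = [t for t in map(clean_wiki_text, vi_samples) if len(t) > min_chars]
    interleaved = []
    for e, v in zip(en, vi):
        interleaved.extend([e, v])
    return interleaved
```

Interleaving the two languages keeps each training batch bilingual, which helps prevent the model from drifting toward one language during training.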

Training Procedure

Training was conducted on two Kaggle T4 GPUs using fp16 mixed precision, with TF32 enabled for faster matrix operations.

  • Epochs: 10
  • Batch Size: 64 (Effective Batch Size: 256 via Gradient Accumulation)
  • Learning Rate: 5e-4 with Cosine Scheduler
  • Optimizer: AdamW
  • Masking Strategy: GLM span masking (15% probability)
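The span-masking strategy above can be sketched roughly as follows. This is a simplified illustration of GLM-style span corruption, not the exact training code: the span-length distribution and the `[S]` span-start token are assumptions; only the 15% masking budget comes from the card. Part A is the corrupted input; Part B holds the masked spans as autoregressive targets.

```python
import random

def glm_span_mask(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Sample non-overlapping spans covering ~mask_ratio of the tokens,
    replace each span with a single [MASK] in Part A, and append the
    original spans (prefixed by [S]) to Part B for prediction."""
    rng = random.Random(seed)
    n = len(tokens)
    budget = max(1, int(n * mask_ratio))  # number of tokens to mask
    masked, spans = set(), []
    while budget > 0:
        span_len = min(budget, rng.randint(1, 3))   # short spans, capped by budget
        start = rng.randrange(0, n - span_len + 1)
        if any(i in masked for i in range(start, start + span_len)):
            continue                                 # resample on overlap
        spans.append((start, span_len))
        masked.update(range(start, start + span_len))
        budget -= span_len
    spans.sort()
    part_a, part_b, i = [], [], 0
    for start, span_len in spans:
        part_a.extend(tokens[i:start])
        part_a.append(mask_token)
        part_b.extend(["[S]"] + tokens[start:start + span_len])
        i = start + span_len
    part_a.extend(tokens[i:])
    return part_a, part_b
```

During training the model sees Part A with bidirectional attention and generates Part B left to right, which is what lets a GLM-style model serve as a causal LM at inference time.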

How to Use

from transformers import GlmForCausalLM, PreTrainedTokenizerFast

model = GlmForCausalLM.from_pretrained("JohnMarble/vi-en-glm")
tokenizer = PreTrainedTokenizerFast.from_pretrained("JohnMarble/vi-en-glm")

prompt = "Việt Nam là một quốc gia"
inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False)

outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))