Dataset: omarkamali/wikipedia-monthly
This model is a small language model (SLM) based on the GLM (General Language Model) architecture, trained from scratch to support both English and Vietnamese. It was developed as part of an NLP project to explore bilingual text generation and cross-lingual understanding in a resource-constrained environment.
The model was trained on an interleaved dataset of 52,000 English and 52,000 Vietnamese samples from Wikipedia. The data was cleaned to remove Wikipedia markup, and only high-quality passages longer than 100 characters were kept.
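The preparation steps described above (clean, filter by length, interleave) can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function names `clean_wiki_text`, `prepare`, and `interleave` are hypothetical, the regular expressions cover only a few common markup patterns, and the "high-quality" criterion is reduced here to the 100-character length check.

```python
import re
from itertools import chain

MIN_CHARS = 100  # length threshold stated in the model card


def clean_wiki_text(text):
    """Strip a few common wiki markup patterns (illustrative, not exhaustive)."""
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # ''italic'' and '''bold''' quote runs
    text = re.sub(r"'{2,}", "", text)
    # == Heading == -> Heading
    text = re.sub(r"^=+\s*(.*?)\s*=+$", r"\1", text, flags=re.M)
    # collapse runs of spaces and tabs
    return re.sub(r"[ \t]+", " ", text).strip()


def prepare(samples, limit):
    """Clean each sample, keep those longer than MIN_CHARS, cap at `limit`."""
    cleaned = (clean_wiki_text(s) for s in samples)
    kept = [s for s in cleaned if len(s) > MIN_CHARS]
    return kept[:limit]


def interleave(en, vi):
    """Alternate samples as en0, vi0, en1, vi1, ... (stops at the shorter list)."""
    return list(chain.from_iterable(zip(en, vi)))
```

With 52,000 samples per language, `interleave(prepare(en_raw, 52000), prepare(vi_raw, 52000))` would yield a 104,000-item list that alternates between the two languages, assuming both filtered lists reach the cap.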
Training ran on two Kaggle T4 GPUs, using fp16 precision and tf32 for optimized performance.
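When reproducing this setup on other hardware, it can help to derive the precision flags from the GPU's CUDA compute capability. The sketch below is an assumption-laden helper, not part of the original training code: `precision_flags` is a hypothetical name, and the dictionary keys simply mirror the `fp16` and `tf32` fields of Hugging Face `TrainingArguments`, although the card does not say which training loop was used. TF32 math is available on Ampere GPUs (compute capability 8.0) and newer, while the T4 reports 7.5, so on a T4 the fp16 setting is most likely the one that matters.

```python
def precision_flags(compute_capability):
    """Pick mixed-precision flags from a CUDA compute capability (major, minor)."""
    major, _minor = compute_capability
    return {
        "fp16": True,        # half precision, as stated in the card
        "tf32": major >= 8,  # TF32 needs Ampere (8.x) or newer
    }


T4 = (7, 5)  # compute capability of the NVIDIA T4
print(precision_flags(T4))  # {'fp16': True, 'tf32': False}
```

On a live machine the tuple could come from `torch.cuda.get_device_capability()`, which returns the same `(major, minor)` form.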
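# Requires a transformers release that includes GlmForCausalLM
from transformers import GlmForCausalLM, PreTrainedTokenizerFast

# Load the model and tokenizer from the Hugging Face Hub
model = GlmForCausalLM.from_pretrained("JohnMarble/vi-en-glm")
tokenizer = PreTrainedTokenizerFast.from_pretrained("JohnMarble/vi-en-glm")

# Vietnamese prompt: "Vietnam is a country"
prompt = "Việt Nam là một quốc gia"
# Return only input_ids and attention_mask (no token_type_ids)
inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False)

# Nucleus sampling; max_length counts the prompt tokens as well
outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))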