ViGPT2 AIO

ViGPT2 AIO is a Vietnamese GPT-2-style causal language model, pretrained for open-ended text generation and general Vietnamese language modeling.

This model was developed to support teaching and hands-on practice in the AIO course, while also serving as a general Vietnamese pretrained language model for experimentation and downstream adaptation.

Model Details

- Model name: ViGPT2 AIO
- Architecture: GPT-2
- Task: Causal language modeling / text generation
- Language: Vietnamese
- Library: Transformers
- Weights format: Safetensors

Training Data

The model was pretrained on a mixture of Vietnamese news and Vietnamese Wikipedia text.

Data Sources

- BKAINewsCorpus, from bkai-foundation-models/BKAINewsCorpus
- Vietnamese Wikipedia, collected through a custom crawling pipeline and then cleaned from raw wikitext into plain text
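
The card does not describe how the wikitext cleaning step works. As a rough illustration only, the sketch below strips a few common wikitext constructs with regular expressions. The function name `wikitext_to_plain` and the choice of rules are assumptions, not the pipeline actually used; real wikitext (nested templates, tables, references) needs a proper parser.

```python
import re

def wikitext_to_plain(text):
    """Hypothetical helper: strip a few simple wikitext constructs."""
    # Drop simple (non-nested) templates such as {{cần dẫn nguồn}}
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove bold/italic quote markers ('' and ''')
    text = re.sub(r"'{2,}", "", text)
    # == Heading == -> Heading
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text, flags=re.MULTILINE)
    # Collapse runs of spaces and tabs
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

sample = "'''Hà Nội''' là [[thủ đô]] của [[Việt Nam|nước Việt Nam]].{{cần dẫn nguồn}}"
print(wikitext_to_plain(sample))  # Hà Nội là thủ đô của nước Việt Nam.
```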

Data Processing

Before pretraining, the corpora were cleaned and deduplicated.
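
The deduplication method is not specified here. One common approach, shown below as a minimal sketch, is exact-match deduplication on a hash of whitespace- and case-normalized text. The function name `deduplicate` and the normalization rules are assumptions for illustration; the actual pipeline may have used a different strategy, such as near-duplicate detection.

```python
import hashlib

def deduplicate(texts):
    """Keep the first occurrence of each document (assumed exact-match strategy)."""
    seen = set()
    unique = []
    for text in texts:
        # Normalize whitespace and case so trivially different copies collide
        key = " ".join(text.split()).lower()
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

docs = ["Hà Nội là thủ đô.", "Hà  Nội là thủ đô.", "Huế là cố đô."]
print(len(deduplicate(docs)))  # 2
```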

The tokenizer was trained on the raw corpora:
- bkai_train.parquet
- vi_wiki_articles_clean.parquet
The language model was pretrained on deduplicated versions of these corpora.
In the final training mixture, the Vietnamese Wikipedia corpus was upweighted relative to the news corpus.

Training Mixture

The pretraining mixture used:

- bkai_train.parquet with weight 1
- vi_wiki_articles_clean.parquet with weight 3
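
How these weights were applied is not stated. One plausible reading, sketched below, treats them as relative sampling weights, so each training example is drawn from the news corpus with probability 1/4 and from Wikipedia with probability 3/4. The helper names are hypothetical, and the weights could instead mean repetition counts (for example, three passes over Wikipedia per pass over news).

```python
import random

weights = {"bkai_train.parquet": 1, "vi_wiki_articles_clean.parquet": 3}

def mixture_probabilities(weights):
    """Normalize relative weights into sampling probabilities."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def sample_sources(weights, n, seed=0):
    """Draw n source names in proportion to their weights (assumed interpretation)."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)

probs = mixture_probabilities(weights)
print(probs)  # {'bkai_train.parquet': 0.25, 'vi_wiki_articles_clean.parquet': 0.75}
```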

Training Objective

The model was pretrained with a standard causal language modeling objective: at each position, it learns to predict the next token given only the tokens that precede it.
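
To make the objective concrete, the sketch below computes the average negative log-likelihood of each token given its prefix, which is the quantity a causal language model minimizes. It is a toy illustration, not the training code: `causal_lm_loss` and `uniform_model` are hypothetical names, and the "model" is a fixed uniform distribution over a 4-token vocabulary.

```python
import math

def causal_lm_loss(token_ids, predict_next):
    """Average negative log-likelihood of each token given the tokens before it."""
    total = 0.0
    count = 0
    for t in range(1, len(token_ids)):
        prefix = token_ids[:t]        # what the model is allowed to see
        target = token_ids[t]         # the token it must predict
        probs = predict_next(prefix)  # dict: token id -> probability
        total += -math.log(probs[target])
        count += 1
    return total / count

def uniform_model(prefix):
    # Hypothetical stand-in for a real model: ignores the prefix entirely
    return {i: 0.25 for i in range(4)}

loss = causal_lm_loss([0, 2, 1, 3], uniform_model)
print(round(loss, 4))  # 1.3863, i.e. log(4) for a uniform guess over 4 tokens
```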

Limitations

The model may generate incorrect, nonsensical, or fabricated content.
Outputs can reflect biases or artifacts present in the pretraining data.
The model is not a verified factual source and should not be used without human validation in high-stakes settings.

Usage

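import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "VLAI-AIVN/vigpt2-aio"

# Load the tokenizer; if it has no pad token, reuse EOS for padding.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the model weights and keep the config's pad token in sync with the tokenizer.
model = AutoModelForCausalLM.from_pretrained(repo_id)
model.config.pad_token_id = tokenizer.pad_token_id

# Use a GPU when available and switch to inference mode (disables dropout).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

prompt = "Việt Nam là một đất nước"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample a continuation; no gradients are needed for generation.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=80,       # upper bound on generated tokens
        do_sample=True,          # sample instead of greedy decoding
        temperature=0.8,
        top_p=0.9,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.pad_token_id,
    )

# The decoded string contains the prompt followed by the generated text.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))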
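
The temperature and top_p arguments above reshape the next-token distribution before a token is drawn. The sketch below is a simplified, pure-Python illustration of temperature scaling followed by nucleus (top-p) truncation. It is not the Transformers implementation: `top_p_filter` is a hypothetical helper, the logits are made up, and repetition_penalty is ignored.

```python
import math

def top_p_filter(logits, temperature=0.8, top_p=0.9):
    """Return {token: prob} after temperature scaling and top-p truncation."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # Softmax, subtracting the max for numerical stability
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    norm = sum(exps.values())
    probs = {tok: e / norm for tok, e in exps.items()}
    # Keep the most likely tokens until their cumulative mass reaches top_p
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize the surviving tokens so they sum to 1
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

example_logits = {"a": 2.0, "b": 1.0, "c": -5.0}  # made-up values
filtered = top_p_filter(example_logits, temperature=1.0, top_p=0.9)
print(sorted(filtered))  # ['a', 'b']
```

Lower values of top_p keep fewer candidate tokens, which tends to make output more conservative; higher values allow rarer tokens through.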


Model size: 0.1B params
Tensor type: F32
