---
language:
- vi
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B-Base
tags:
- qwen3
- causal-lm
- vietnamese
- continuous-pretraining
- unsloth
datasets:
- data-std/vi-text-corpus
pipeline_tag: text-generation
---

Qwen3-4B Vietnamese Continued Pre-trained Model

This model is a continued pre-training (CPT) version of Qwen/Qwen3-4B-Base, trained on a Vietnamese text corpus and optimized with Unsloth for efficient training.

Model Details

Model Description

  • Base Model: Qwen/Qwen3-4B-Base
  • Model Type: Causal Language Model (Decoder-only Transformer)
  • Language(s): Vietnamese (primary), English (inherited from base)
  • Training Method: Continued Pre-Training (CPT) with Unsloth optimization
  • Parameters: ~4 Billion
  • Context Length: 4096 tokens
  • License: Apache 2.0

Training Data

The model was trained on:

  • Dataset: data-std/vi-text-corpus
  • Subset: filter-by-ppl-and-length (filtered for quality by perplexity and length)
  • Language: Vietnamese text corpus
  • Processing: Automatic EOS token appending
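The EOS-appending step can be sketched in plain Python (a hypothetical illustration, not the actual preprocessing code; Qwen3 base models use `<|endoftext|>` as the end-of-text token):

```python
def append_eos(examples, eos_token):
    # Append the tokenizer's EOS token to each document so that
    # concatenated training sequences keep clear document boundaries.
    return {"text": [text + eos_token for text in examples["text"]]}

batch = {"text": ["Việt Nam là một quốc gia ở Đông Nam Á."]}
processed = append_eos(batch, "<|endoftext|>")
```

In practice this would be applied with `Dataset.map(..., batched=True)` before tokenization.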

Training Details

Training Configuration

| Parameter | Value |
|---|---|
| Base Model | unsloth/Qwen3-4B-Base |
| Max Sequence Length | 4096 tokens |
| Training Epochs | 1 |
| Batch Size (per device) | 2 |
| Gradient Accumulation Steps | 8 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW (torch) |
| Weight Decay | 0.01 |
| LR Scheduler | Cosine |
| Warmup Steps | 10 |
| Warmup Ratio | 0.03 |
| Precision | BF16 (if supported) / FP16 |
| Seed | 3407 |
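These values map directly onto Hugging Face `TrainingArguments` fields; a plain-Python sketch of the configuration (key names are assumptions mirroring the `TrainingArguments` API, values taken from the table above):

```python
# Hypothetical reconstruction of the run configuration as a plain dict;
# keys mirror Hugging Face TrainingArguments parameter names.
training_config = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 1,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 10,
    "bf16": True,  # falls back to fp16 on GPUs without BF16 support
    "seed": 3407,
    "save_steps": 100,
    "save_total_limit": 1,
}

# Effective batch size = per-device batch size x gradient accumulation steps
effective_batch_size = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
)  # 16, matching the table
```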

Training Framework

  • Framework: Unsloth + Hugging Face Transformers
  • Optimization: Full fine-tuning (all parameters trainable)
  • Checkpointing: Every 100 steps, keeping 1 checkpoint
  • Hardware: CUDA-enabled GPU

Training Methodology

This model uses Continued Pre-Training (CPT) to adapt Qwen3-4B-Base to the Vietnamese language:

  • Trained on next-token prediction objective
  • Uses DataCollatorForLanguageModeling for causal LM
  • Maintains the original model architecture
  • Enhanced Vietnamese language understanding while preserving multilingual capabilities
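For causal LM, the collator essentially copies the input IDs into the labels, masking padded positions with -100 so the loss ignores them (the model shifts labels internally for next-token prediction). A simplified illustrative reimplementation of what `DataCollatorForLanguageModeling(mlm=False)` produces, not the library code:

```python
def causal_lm_collate(batch_input_ids, pad_token_id):
    # Right-pad every sequence to the batch maximum; labels copy input_ids,
    # with padded positions set to -100 so cross-entropy skips them.
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, labels = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        labels.append(ids + [-100] * n_pad)
    return {"input_ids": input_ids, "labels": labels}

batch = causal_lm_collate([[101, 7, 8, 102], [101, 9, 102]], pad_token_id=0)
# batch["labels"][1] -> [101, 9, 102, -100]
```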

Usage

Requirements

pip install transformers torch accelerate

Basic Text Generation

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "data-std/qwen3-4b-wiki-filter-28k"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # Use torch.float16 if BF16 not supported
)

# Generate text
prompt = "Việt Nam là một quốc gia"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Chat/Instruction Format

For instruction-following tasks, you may need additional fine-tuning. Here's a basic template:

def format_instruction(instruction, context=""):
    if context:
        prompt = f"### Instruction:\n{instruction}\n\n### Context:\n{context}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt

instruction = "Giải thích về lịch sử Việt Nam"
prompt = format_instruction(instruction)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Using with Unsloth (for further fine-tuning)

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="data-std/qwen3-4b-wiki-filter-28k",
    max_seq_length=4096,
    dtype=None,  # Auto-detect
    load_in_4bit=True,  # Use 4-bit quantization for memory efficiency
)

# Continue training or perform inference

Quantization for Lower Memory Usage

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization (requires bitsandbytes: pip install bitsandbytes)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4"
)

model = AutoModelForCausalLM.from_pretrained(
    "data-std/qwen3-4b-wiki-filter-28k",
    quantization_config=quantization_config,
    device_map="auto",
)

Performance

Hardware Requirements

| Precision | VRAM Required | Inference Speed |
|---|---|---|
| FP32 | ~16 GB | Baseline |
| FP16/BF16 | ~8 GB | 2x faster |
| 4-bit (NF4) | ~3-4 GB | Slightly slower, very memory efficient |
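The VRAM figures above can be sanity-checked with a back-of-the-envelope, weights-only estimate (activations, KV cache, and framework overhead push real usage higher):

```python
def estimate_weight_memory_gib(n_params, bytes_per_param):
    # Weights-only lower bound; excludes activations, KV cache, CUDA overhead.
    return n_params * bytes_per_param / 1024**3

n_params = 4e9                                        # ~4B parameters
fp32 = estimate_weight_memory_gib(n_params, 4)        # ~14.9 GiB
fp16 = estimate_weight_memory_gib(n_params, 2)        # ~7.5 GiB
four_bit = estimate_weight_memory_gib(n_params, 0.5)  # ~1.9 GiB + quantization overhead
```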

Recommended Use Cases

  • ✅ Vietnamese text generation
  • ✅ Vietnamese language understanding
  • ✅ Content creation in Vietnamese
  • ✅ Further fine-tuning for downstream tasks
  • ✅ Research on Vietnamese NLP
  • ⚠️ Instruction-following (may need additional fine-tuning)
  • ⚠️ Multi-turn conversation (may need additional fine-tuning)

Limitations

  • Training Data: The model's knowledge is limited to the Vietnamese corpus used during continued pre-training
  • Not Instruction-Tuned: This is a base model continued pre-trained on Vietnamese text. For instruction-following capabilities, additional supervised fine-tuning (SFT) is recommended
  • Potential Biases: May reflect biases present in the training data
  • Language: While enhanced for Vietnamese, performance may vary across different Vietnamese dialects and domains
  • Generation Quality: May produce repetitive or inconsistent outputs without proper generation parameters

Ethical Considerations

  • This model should not be used for generating harmful, misleading, or discriminatory content
  • Users should verify generated content for factual accuracy
  • The model may generate biased content reflecting biases in training data
  • Not suitable for high-stakes decision-making without human oversight

Acknowledgements

  • Base Model: Qwen Team for Qwen3-4B-Base
  • Training Framework: Unsloth AI for efficient training
  • Dataset: Vietnamese text corpus from data-std/vi-text-corpus
  • Infrastructure: Trained using CUDA-enabled GPUs

Contact

For questions, issues, or collaborations, please open an issue on the model repository or contact the maintainers.

Model Card Authors

Data Standard Team

Model Card Contact

[Your contact information or repository issues page]


License: Apache 2.0

Intended Use: Research and development of Vietnamese NLP applications

Out-of-Scope Use: Generating harmful content, impersonation, high-stakes decisions without human oversight
