---
language:
- vi
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B-Base
tags:
- qwen3
- causal-lm
- vietnamese
- continuous-pretraining
- unsloth
datasets:
- data-std/vi-text-corpus
pipeline_tag: text-generation
---
# Qwen3-4B Vietnamese Continued Pre-trained Model

This model continues the pre-training of Qwen/Qwen3-4B-Base on a Vietnamese text corpus, using Unsloth for efficient training.
## Model Details

### Model Description
- Base Model: Qwen/Qwen3-4B-Base
- Model Type: Causal Language Model (Decoder-only Transformer)
- Language(s): Vietnamese (primary), English (inherited from base)
- Training Method: Continued Pre-Training (CPT) with Unsloth optimization
- Parameters: ~4 Billion
- Context Length: 4096 tokens
- License: Apache 2.0
### Training Data
The model was trained on:
- Dataset: data-std/vi-text-corpus
- Subset: `filter-by-ppl-and-length` (filtered for quality by perplexity and length)
- Language: Vietnamese
- Processing: Automatic EOS token appending
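The EOS-appending step can be sketched in plain Python. The EOS string below is an assumption for illustration; in the actual pipeline it would come from `tokenizer.eos_token` of the Qwen3 tokenizer:

```python
# Sketch of the preprocessing step: append an EOS token to every
# corpus example so document boundaries are visible to the model.
# EOS_TOKEN is illustrative; use tokenizer.eos_token in practice.
EOS_TOKEN = "<|endoftext|>"

def append_eos(examples):
    """Append the EOS token to each text in a batch (datasets.map style)."""
    return {"text": [t + EOS_TOKEN for t in examples["text"]]}

batch = {"text": ["Việt Nam là một quốc gia", "Hà Nội là thủ đô"]}
processed = append_eos(batch)
print(processed["text"][0])
```

A function of this shape is typically passed to `Dataset.map(..., batched=True)`.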
## Training Details

### Training Configuration
| Parameter | Value |
|---|---|
| Base Model | unsloth/Qwen3-4B-Base |
| Max Sequence Length | 4096 tokens |
| Training Epochs | 1 |
| Batch Size (per device) | 2 |
| Gradient Accumulation Steps | 8 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW (torch) |
| Weight Decay | 0.01 |
| LR Scheduler | Cosine |
| Warmup Steps | 10 |
| Warmup Ratio | 0.03 |
| Precision | BF16 (if supported) / FP16 |
| Seed | 3407 |
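The table above maps onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch, not the exact script used; `output_dir` and the BF16 capability check are assumptions, and `warmup_steps` takes precedence over `warmup_ratio` when both are set:

```python
import torch
from transformers import TrainingArguments

# Prefer BF16 when the GPU supports it, otherwise fall back to FP16.
bf16_ok = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

args = TrainingArguments(
    output_dir="qwen3-4b-vi-cpt",       # assumed name
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # effective batch size 16
    learning_rate=2e-5,
    optim="adamw_torch",
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    bf16=bf16_ok,
    fp16=not bf16_ok,
    save_steps=100,                     # checkpoint every 100 steps
    save_total_limit=1,                 # keep 1 checkpoint
    seed=3407,
)
```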
### Training Framework
- Framework: Unsloth + Hugging Face Transformers
- Optimization: Full fine-tuning (all parameters trainable)
- Checkpointing: Every 100 steps, keeping 1 checkpoint
- Hardware: CUDA-enabled GPU
### Training Methodology

This model uses Continued Pre-Training (CPT) to adapt Qwen3-4B-Base to Vietnamese:
- Trained on next-token prediction objective
- Uses `DataCollatorForLanguageModeling` (with `mlm=False`) for causal LM
- Maintains the original model architecture
- Enhanced Vietnamese language understanding while preserving multilingual capabilities
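The next-token prediction objective above can be illustrated in plain Python: at each position the model is trained to predict the following token. (With `DataCollatorForLanguageModeling(mlm=False)` the collator copies the input ids into `labels` and the model performs this shift internally.)

```python
# Plain-Python illustration of next-token prediction targets.
# token_ids is a toy sequence; real training uses tokenizer output.
token_ids = [101, 7, 42, 9, 102]

inputs = token_ids[:-1]   # what the model sees at each step
targets = token_ids[1:]   # what it must predict at each step

pairs = list(zip(inputs, targets))
print(pairs)  # [(101, 7), (7, 42), (42, 9), (9, 102)]
```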
## Usage

### Requirements
```bash
pip install transformers torch accelerate
```
### Basic Text Generation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "data-std/qwen3-4b-wiki-filter-28k"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # use torch.float16 if BF16 is not supported
)

# Generate text
prompt = "Việt Nam là một quốc gia"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
### Chat/Instruction Format
For instruction-following tasks, you may need additional fine-tuning. Here's a basic template:
```python
def format_instruction(instruction, context=""):
    if context:
        prompt = f"### Instruction:\n{instruction}\n\n### Context:\n{context}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt

instruction = "Giải thích về lịch sử Việt Nam"  # "Explain the history of Vietnam"
prompt = format_instruction(instruction)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is needed for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Using with Unsloth (for further fine-tuning)
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="data-std/qwen3-4b-wiki-filter-28k",
    max_seq_length=4096,
    dtype=None,         # auto-detect
    load_in_4bit=True,  # 4-bit quantization for memory efficiency
)

# Continue training or perform inference
```
### Quantization for Lower Memory Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained("data-std/qwen3-4b-wiki-filter-28k")
model = AutoModelForCausalLM.from_pretrained(
    "data-std/qwen3-4b-wiki-filter-28k",
    quantization_config=quantization_config,
    device_map="auto",
)
```
## Performance

### Hardware Requirements
| Precision | VRAM Required | Inference Speed |
|---|---|---|
| FP32 | ~16 GB | Baseline |
| FP16/BF16 | ~8 GB | 2x faster |
| 4-bit | ~3-4 GB | Slightly slower, very memory efficient |
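The VRAM figures follow from simple arithmetic on weight storage alone (a rough back-of-envelope; activations, the KV cache, and CUDA overhead add to these numbers, which is why the 4-bit row lists more than its weights-only estimate):

```python
# Back-of-envelope weight memory for a ~4B-parameter model.
params = 4e9

bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "4-bit (nf4)": 0.5}

for name, b in bytes_per_param.items():
    gib = params * b / 1024**3
    print(f"{name:12s} ~{gib:.1f} GiB of weights")
```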
### Recommended Use Cases
- ✅ Vietnamese text generation
- ✅ Vietnamese language understanding
- ✅ Content creation in Vietnamese
- ✅ Further fine-tuning for downstream tasks
- ✅ Research on Vietnamese NLP
- ⚠️ Instruction-following (may need additional fine-tuning)
- ⚠️ Multi-turn conversation (may need additional fine-tuning)
## Limitations

- Training Data: Vietnamese knowledge acquired during continued pre-training is limited to the corpus used; topics outside it rely on the base model
- Not Instruction-Tuned: This is a base model continued pre-trained on Vietnamese text. For instruction-following capabilities, additional supervised fine-tuning (SFT) is recommended
- Potential Biases: May reflect biases present in the training data
- Language: While enhanced for Vietnamese, performance may vary across different Vietnamese dialects and domains
- Generation Quality: May produce repetitive or inconsistent outputs without proper generation parameters
## Ethical Considerations
- This model should not be used for generating harmful, misleading, or discriminatory content
- Users should verify generated content for factual accuracy
- The model may generate biased content reflecting biases in training data
- Not suitable for high-stakes decision-making without human oversight
## Acknowledgements
- Base Model: Qwen Team for Qwen3-4B-Base
- Training Framework: Unsloth AI for efficient training
- Dataset: Vietnamese text corpus from data-std/vi-text-corpus
- Infrastructure: Trained using CUDA-enabled GPUs
## Contact
For questions, issues, or collaborations, please open an issue on the model repository or contact the maintainers.
## Model Card Authors
Data Standard Team
## Model Card Contact
[Your contact information or repository issues page]
**License:** Apache 2.0

**Intended Use:** Research and development of Vietnamese NLP applications

**Out-of-Scope Use:** Generating harmful content, impersonation, high-stakes decisions without human oversight