---
library_name: transformers
language:
- vi
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B-Base
tags:
- qwen3
- causal-lm
- vietnamese
- continuous-pretraining
- unsloth
datasets:
- data-std/vi-text-corpus
pipeline_tag: text-generation
---
# Qwen3-4B Vietnamese Continued Pre-trained Model
This model is a **continued pre-training** version of [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) trained on a Vietnamese text corpus, using [Unsloth](https://github.com/unslothai/unsloth) for efficient training.
## Model Details
### Model Description
- **Base Model:** Qwen/Qwen3-4B-Base
- **Model Type:** Causal Language Model (Decoder-only Transformer)
- **Language(s):** Vietnamese (primary), English (inherited from base)
- **Training Method:** Continued Pre-Training (CPT) with Unsloth optimization
- **Parameters:** ~4 Billion
- **Context Length:** 4096 tokens
- **License:** Apache 2.0
### Training Data
The model was trained on:
- **Dataset:** [data-std/vi-text-corpus](https://huggingface.co/datasets/data-std/vi-text-corpus)
- **Subset:** `filter-by-ppl-and-length` (filtered for quality by perplexity and length)
- **Language:** Vietnamese text corpus
- **Processing:** Automatic EOS token appending
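The EOS-appending step can be sketched as a simple preprocessing function. This is an illustrative sketch, not the exact pipeline code; the `<|endoftext|>` string is an assumption (it is the usual EOS token for Qwen base models, but the real value should come from the tokenizer's `eos_token` attribute):

```python
def append_eos(examples, eos_token):
    """Append the tokenizer's EOS token to every document so the model
    learns document boundaries during next-token prediction."""
    return {"text": [t + eos_token for t in examples["text"]]}

# Example with a placeholder EOS string:
docs = {"text": ["Việt Nam là một quốc gia ở Đông Nam Á."]}
processed = append_eos(docs, "<|endoftext|>")
print(processed["text"][0])
```

In practice this kind of function is applied over the whole corpus (e.g. via `datasets.Dataset.map` with `batched=True`) before tokenization.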
## Training Details
### Training Configuration
| Parameter | Value |
|-----------|-------|
| Base Model | unsloth/Qwen3-4B-Base |
| Max Sequence Length | 4096 tokens |
| Training Epochs | 1 |
| Batch Size (per device) | 2 |
| Gradient Accumulation Steps | 8 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW (torch) |
| Weight Decay | 0.01 |
| LR Scheduler | Cosine |
| Warmup | 10 steps (a `warmup_ratio` of 0.03 was also configured; in Hugging Face `TrainingArguments`, a nonzero `warmup_steps` takes precedence) |
| Precision | BF16 (if supported) / FP16 |
| Seed | 3407 |
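The effective batch size in the table follows directly from the per-device batch size and gradient accumulation steps (assuming a single-GPU run, which the table implies):

```python
per_device_batch_size = 2
gradient_accumulation_steps = 8
num_devices = 1  # assumption: single GPU
max_seq_length = 4096

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
# Upper bound on tokens seen per optimizer step (sequences packed to full length)
tokens_per_optimizer_step = effective_batch_size * max_seq_length

print(effective_batch_size)       # 16
print(tokens_per_optimizer_step)  # 65536
```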
### Training Framework
- **Framework:** Unsloth + Hugging Face Transformers
- **Optimization:** Full fine-tuning (all parameters trainable)
- **Checkpointing:** Every 100 steps, keeping 1 checkpoint
- **Hardware:** CUDA-enabled GPU
### Training Methodology
This model uses **Continued Pre-Training (CPT)** to adapt the Qwen3-4B-Base model to Vietnamese language:
- Trained on next-token prediction objective
- Uses DataCollatorForLanguageModeling for causal LM
- Maintains the original model architecture
- Enhanced Vietnamese language understanding while preserving multilingual capabilities
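A minimal sketch of how the causal-LM objective is wired up: `DataCollatorForLanguageModeling` with `mlm=False` copies `input_ids` into `labels`, and the model shifts the labels internally so that position `t` is trained to predict token `t + 1`. This pure-Python illustration (with made-up token IDs) shows the idea without requiring `transformers`:

```python
def make_causal_lm_batch(input_ids):
    """Mimic DataCollatorForLanguageModeling(mlm=False): labels are a copy
    of input_ids; the model shifts them right by one position when
    computing next-token cross-entropy."""
    return {"input_ids": input_ids, "labels": list(input_ids)}

batch = make_causal_lm_batch([101, 2023, 2003, 102])
# Effective (context, target) pairs after the model's internal shift:
targets = batch["labels"][1:]
contexts = batch["input_ids"][:-1]
print(list(zip(contexts, targets)))  # [(101, 2023), (2023, 2003), (2003, 102)]
```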
## Usage
### Requirements
```bash
pip install transformers torch accelerate
```
### Basic Text Generation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "data-std/qwen3-4b-wiki-filter-28k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # Use torch.float16 if BF16 is not supported
)

# Generate text
prompt = "Việt Nam là một quốc gia"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
### Chat/Instruction Format
For instruction-following tasks, you may need additional fine-tuning. Here's a basic template:
```python
def format_instruction(instruction, context=""):
    if context:
        prompt = f"### Instruction:\n{instruction}\n\n### Context:\n{context}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt

instruction = "Giải thích về lịch sử Việt Nam"  # "Explain the history of Vietnam"
prompt = format_instruction(instruction)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature to have any effect
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Using with Unsloth (for further fine-tuning)
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="data-std/qwen3-4b-wiki-filter-28k",
    max_seq_length=4096,
    dtype=None,         # Auto-detect
    load_in_4bit=True,  # Use 4-bit quantization for memory efficiency
)
# Continue training or perform inference
```
### Quantization for Lower Memory Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "data-std/qwen3-4b-wiki-filter-28k"

# 4-bit quantization (requires the bitsandbytes package)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)
```
## Performance
### Hardware Requirements
| Precision | VRAM Required | Inference Speed |
|-----------|---------------|-----------------|
| FP32 | ~16 GB | Baseline |
| FP16/BF16 | ~8 GB | 2x faster |
| 4-bit | ~3-4 GB | Slightly slower, very memory efficient |
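The VRAM figures above are roughly consistent with a back-of-envelope estimate of parameter memory alone (weights only; activations, the KV cache, and quantization constants add overhead, which is why the real-world figures run a little higher):

```python
num_params = 4e9  # ~4 billion parameters

bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "4-bit": 0.5}
for precision, nbytes in bytes_per_param.items():
    gib = num_params * nbytes / 1024**3
    print(f"{precision}: ~{gib:.1f} GiB for weights alone")
```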
### Recommended Use Cases
- ✅ Vietnamese text generation
- ✅ Vietnamese language understanding
- ✅ Content creation in Vietnamese
- ✅ Further fine-tuning for downstream tasks
- ✅ Research on Vietnamese NLP
- ⚠️ Instruction-following (may need additional fine-tuning)
- ⚠️ Multi-turn conversation (may need additional fine-tuning)
## Limitations
- **Training Data:** The model's knowledge is limited to the Vietnamese corpus used during continued pre-training
- **Not Instruction-Tuned:** This is a base model continued pre-trained on Vietnamese text. For instruction-following capabilities, additional supervised fine-tuning (SFT) is recommended
- **Potential Biases:** May reflect biases present in the training data
- **Language:** While enhanced for Vietnamese, performance may vary across different Vietnamese dialects and domains
- **Generation Quality:** May produce repetitive or inconsistent outputs without proper generation parameters
## Ethical Considerations
- This model should not be used for generating harmful, misleading, or discriminatory content
- Users should verify generated content for factual accuracy
- The model may generate biased content reflecting biases in training data
- Not suitable for high-stakes decision-making without human oversight
## Acknowledgements
- **Base Model:** [Qwen Team](https://huggingface.co/Qwen) for Qwen3-4B-Base
- **Training Framework:** [Unsloth AI](https://github.com/unslothai/unsloth) for efficient training
- **Dataset:** Vietnamese text corpus from data-std/vi-text-corpus
- **Infrastructure:** Trained using CUDA-enabled GPUs
## Contact
For questions, issues, or collaborations, please open an issue on the model repository or contact the maintainers.
## Model Card Authors
Data Standard Team
---
**License:** Apache 2.0
**Intended Use:** Research and development of Vietnamese NLP applications
**Out-of-Scope Use:** Generating harmful content, impersonation, high-stakes decisions without human oversight