---
library_name: transformers
language:
- vi
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B-Base
tags:
- qwen3
- causal-lm
- vietnamese
- continuous-pretraining
- unsloth
datasets:
- data-std/vi-text-corpus
pipeline_tag: text-generation
---
# Qwen3-4B Vietnamese Continued Pre-trained Model

This model continues the pre-training of [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) on a Vietnamese text corpus, using [Unsloth](https://github.com/unslothai/unsloth) for efficient training.
## Model Details

### Model Description

- **Base Model:** Qwen/Qwen3-4B-Base
- **Model Type:** Causal Language Model (decoder-only Transformer)
- **Language(s):** Vietnamese (primary), English (inherited from the base model)
- **Training Method:** Continued Pre-Training (CPT) with Unsloth optimization
- **Parameters:** ~4 billion
- **Context Length:** 4096 tokens
- **License:** Apache 2.0
### Training Data

The model was trained on:
- **Dataset:** [data-std/vi-text-corpus](https://huggingface.co/datasets/data-std/vi-text-corpus)
- **Subset:** `filter-by-ppl-and-length` (filtered for quality by perplexity and length)
- **Language:** Vietnamese
- **Processing:** An EOS token is appended to each document
## Training Details

### Training Configuration

| Parameter | Value |
|-----------|-------|
| Base Model | unsloth/Qwen3-4B-Base |
| Max Sequence Length | 4096 tokens |
| Training Epochs | 1 |
| Batch Size (per device) | 2 |
| Gradient Accumulation Steps | 8 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW (torch) |
| Weight Decay | 0.01 |
| LR Scheduler | Cosine |
| Warmup Steps | 10 |
| Warmup Ratio | 0.03 |
| Precision | BF16 (if supported), otherwise FP16 |
| Seed | 3407 |
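As a sketch of how this table maps onto Hugging Face `TrainingArguments` naming, the settings can be written as a plain dict (argument names follow the transformers convention; the dict itself is illustrative, not the original training script). Note that when both are set, transformers uses `warmup_steps` and ignores `warmup_ratio`.

```python
# Hyperparameters from the table above, in TrainingArguments naming.
# Pass as TrainingArguments(**training_config) when transformers is installed.
training_config = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "num_train_epochs": 1,
    "learning_rate": 2e-5,
    "optim": "adamw_torch",
    "weight_decay": 0.01,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 10,  # takes precedence over warmup_ratio=0.03
    "seed": 3407,
    "bf16": True,  # set fp16=True instead on GPUs without BF16 support
}

def effective_batch_size(cfg: dict, num_devices: int = 1) -> int:
    """Optimizer-step batch = per-device batch x accumulation steps x devices."""
    return (cfg["per_device_train_batch_size"]
            * cfg["gradient_accumulation_steps"]
            * num_devices)

print(effective_batch_size(training_config))  # 16 on a single GPU
```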
### Training Framework

- **Framework:** Unsloth + Hugging Face Transformers
- **Optimization:** Full fine-tuning (all parameters trainable)
- **Checkpointing:** Every 100 steps, keeping 1 checkpoint
- **Hardware:** CUDA-enabled GPU

### Training Methodology

This model uses **Continued Pre-Training (CPT)** to adapt Qwen3-4B-Base to Vietnamese:
- Trained with the next-token prediction objective
- Uses `DataCollatorForLanguageModeling` (with `mlm=False`) for causal LM batching
- Keeps the original model architecture unchanged
- Improves Vietnamese language understanding while preserving the base model's multilingual capabilities
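A minimal sketch of what causal-LM collation does, mirroring the effect of `DataCollatorForLanguageModeling(tokenizer, mlm=False)`: labels are a copy of the input ids, with padding positions set to `-100` so the cross-entropy loss ignores them. The function and constants here are illustrative, not the library implementation (which also returns tensors and an attention mask).

```python
# Illustrative causal-LM collation: pad a batch of token-id sequences and
# build labels where padding positions are masked with -100 (the index
# PyTorch's cross-entropy ignores). PAD_ID stands in for tokenizer.pad_token_id.
PAD_ID = 0
IGNORE_INDEX = -100

def collate_causal_lm(batch: list[list[int]]) -> dict:
    max_len = max(len(seq) for seq in batch)
    input_ids, labels = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        labels.append(seq + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "labels": labels}

features = collate_causal_lm([[5, 6, 7], [8, 9]])
print(features["labels"])  # [[5, 6, 7], [8, 9, -100]]
```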
## Usage

### Requirements

```bash
pip install transformers torch accelerate
```
### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "data-std/qwen3-4b-wiki-filter-28k"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # use torch.float16 if BF16 is not supported
)

# Generate text
prompt = "Việt Nam là một quốc gia"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
### Chat/Instruction Format

This is a base model, so instruction-following tasks will likely need additional fine-tuning. A basic prompt template:

```python
def format_instruction(instruction, context=""):
    if context:
        return f"### Instruction:\n{instruction}\n\n### Context:\n{context}\n\n### Response:\n"
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

instruction = "Giải thích về lịch sử Việt Nam"  # "Explain the history of Vietnam"
prompt = format_instruction(instruction)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Using with Unsloth (for further fine-tuning)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="data-std/qwen3-4b-wiki-filter-28k",
    max_seq_length=4096,
    dtype=None,         # auto-detect
    load_in_4bit=True,  # 4-bit quantization for memory efficiency
)

# Continue training or perform inference
```
### Quantization for Lower Memory Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization via bitsandbytes
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "data-std/qwen3-4b-wiki-filter-28k",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("data-std/qwen3-4b-wiki-filter-28k")
```
## Performance

### Hardware Requirements

| Precision | VRAM Required | Inference Speed |
|-----------|---------------|-----------------|
| FP32 | ~16 GB | Baseline |
| FP16/BF16 | ~8 GB | ~2x faster |
| 4-bit | ~3-4 GB | Slightly slower, very memory-efficient |
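The VRAM column follows from a simple rule of thumb: weight memory is roughly parameter count times bytes per parameter, with activations, KV cache, and quantization metadata adding overhead on top. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope weight memory: parameters x bytes per parameter.
# Ignores activation/KV-cache overhead, so real usage is somewhat higher.
PARAMS = 4e9  # ~4B parameters

def weight_gb(bytes_per_param: float) -> float:
    return PARAMS * bytes_per_param / 1024**3

print(round(weight_gb(4), 1))    # FP32: 14.9 GB
print(round(weight_gb(2), 1))    # FP16/BF16: 7.5 GB
print(round(weight_gb(0.5), 1))  # 4-bit: 1.9 GB (plus quantization metadata)
```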
### Recommended Use Cases

- ✅ Vietnamese text generation
- ✅ Vietnamese language understanding
- ✅ Content creation in Vietnamese
- ✅ Further fine-tuning for downstream tasks
- ✅ Research on Vietnamese NLP
- ⚠️ Instruction following (may need additional fine-tuning)
- ⚠️ Multi-turn conversation (may need additional fine-tuning)
## Limitations

- **Training Data:** The model's added knowledge is limited to the Vietnamese corpus used during continued pre-training
- **Not Instruction-Tuned:** This is a base model continued pre-trained on Vietnamese text; for instruction following, additional supervised fine-tuning (SFT) is recommended
- **Potential Biases:** May reflect biases present in the training data
- **Dialects and Domains:** While enhanced for Vietnamese, performance may vary across Vietnamese dialects and domains
- **Generation Quality:** May produce repetitive or inconsistent outputs without suitable generation parameters (e.g. `repetition_penalty`, `temperature`)
## Ethical Considerations

- This model should not be used to generate harmful, misleading, or discriminatory content
- Users should verify generated content for factual accuracy
- The model may reproduce biases present in its training data
- Not suitable for high-stakes decision-making without human oversight
## Acknowledgements

- **Base Model:** [Qwen Team](https://huggingface.co/Qwen) for Qwen3-4B-Base
- **Training Framework:** [Unsloth AI](https://github.com/unslothai/unsloth) for efficient training
- **Dataset:** Vietnamese text corpus from [data-std/vi-text-corpus](https://huggingface.co/datasets/data-std/vi-text-corpus)
- **Infrastructure:** Trained on CUDA-enabled GPUs
## Contact

For questions, issues, or collaborations, please open an issue on the model repository or contact the maintainers.

## Model Card Authors

Data Standard Team
---

**License:** Apache 2.0

**Intended Use:** Research and development of Vietnamese NLP applications

**Out-of-Scope Use:** Generating harmful content, impersonation, or high-stakes decision-making without human oversight