---
library_name: transformers
language:
- vi
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B-Base
tags:
- qwen3
- causal-lm
- vietnamese
- continuous-pretraining
- unsloth
datasets:
- data-std/vi-text-corpus
pipeline_tag: text-generation
---

# Qwen3-4B Vietnamese Continued Pre-trained Model

This model is the result of **continued pre-training** of [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) on a Vietnamese text corpus, using [Unsloth](https://github.com/unslothai/unsloth) for efficient training.

## Model Details

### Model Description

- **Base Model:** Qwen/Qwen3-4B-Base
- **Model Type:** Causal Language Model (Decoder-only Transformer)
- **Language(s):** Vietnamese (primary), English (inherited from base)
- **Training Method:** Continued Pre-Training (CPT) with Unsloth optimization
- **Parameters:** ~4 Billion
- **Context Length:** 4096 tokens
- **License:** Apache 2.0

### Training Data

The model was trained on:

- **Dataset:** [data-std/vi-text-corpus](https://huggingface.co/datasets/data-std/vi-text-corpus)
- **Subset:** `filter-by-ppl-and-length` (filtered for quality by perplexity and length)
- **Language:** Vietnamese text corpus
- **Processing:** Automatic EOS token appending

## Training Details

### Training Configuration

| Parameter | Value |
|-----------|-------|
| Base Model | unsloth/Qwen3-4B-Base |
| Max Sequence Length | 4096 tokens |
| Training Epochs | 1 |
| Batch Size (per device) | 2 |
| Gradient Accumulation Steps | 8 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW (torch) |
| Weight Decay | 0.01 |
| LR Scheduler | Cosine |
| Warmup Steps | 10 |
| Warmup Ratio | 0.03 |
| Precision | BF16 (if supported) / FP16 |
| Seed | 3407 |

### Training Framework

- **Framework:** Unsloth + Hugging Face Transformers
- **Optimization:** Full fine-tuning (all parameters trainable)
- **Checkpointing:** Every 100 steps, keeping 1 checkpoint
- **Hardware:** CUDA-enabled GPU

### Training Methodology

This model uses **Continued Pre-Training (CPT)** to adapt Qwen3-4B-Base to the Vietnamese language:

- Trained with a next-token prediction objective
- Uses `DataCollatorForLanguageModeling` for causal language modeling
- Maintains the original model architecture
- Enhances Vietnamese language understanding while preserving multilingual capabilities

An illustrative sketch of how this configuration maps onto a standard training script is given at the end of the Usage section below.

## Usage

### Requirements

```bash
pip install transformers torch accelerate
```

### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "data-std/qwen3-4b-wiki-filter-28k"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # Use torch.float16 if BF16 is not supported
)

# Generate text
prompt = "Việt Nam là một quốc gia"  # "Vietnam is a country"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
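The model was continued pre-trained with a 4,096-token context window. For long prompts, one option is to truncate the tokenized input so that the prompt plus the newly generated tokens still fits within that window. Below is a minimal sketch, assuming the `model` and `tokenizer` objects loaded above; the long prompt is only a placeholder.

```python
# Keep the prompt within the 4,096-token training context, leaving room for new tokens.
# Assumes `model` and `tokenizer` are already loaded as in the example above.
max_new_tokens = 256
long_prompt = "Việt Nam là một quốc gia nằm ở Đông Nam Á. " * 500  # placeholder long input

inputs = tokenizer(
    long_prompt,
    return_tensors="pt",
    truncation=True,                   # default truncation keeps the beginning of the text
    max_length=4096 - max_new_tokens,  # leave room for the generated continuation
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```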
### Chat/Instruction Format

This is a base model, so reliable instruction-following usually requires additional fine-tuning. Here's a basic prompt template to start from:

```python
def format_instruction(instruction, context=""):
    if context:
        prompt = f"### Instruction:\n{instruction}\n\n### Context:\n{context}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt

instruction = "Giải thích về lịch sử Việt Nam"  # "Explain the history of Vietnam"
prompt = format_instruction(instruction)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Using with Unsloth (for further fine-tuning)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="data-std/qwen3-4b-wiki-filter-28k",
    max_seq_length=4096,
    dtype=None,          # Auto-detect
    load_in_4bit=True,   # Use 4-bit quantization for memory efficiency
)

# Continue training or perform inference
```

### Quantization for Lower Memory Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization (requires the bitsandbytes package)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model_name = "data-std/qwen3-4b-wiki-filter-28k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)
```
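### Continued Pre-Training Setup (Illustrative)

The continued pre-training run itself used Unsloth. As a rough illustration only, the configuration reported under Training Details maps onto a plain Hugging Face `Trainer` roughly as sketched below; the dataset subset handling, the `text` column name, and the tokenization details are assumptions for the sketch, not the exact training script.

```python
# Illustrative sketch of the reported CPT configuration using the plain Hugging Face
# Trainer. The original run used Unsloth; subset/column names and tokenization details
# here are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "Qwen/Qwen3-4B-Base"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Assumes the corpus exposes a "text" column; the card reports the
# "filter-by-ppl-and-length" subset, whose exact loading path may differ.
dataset = load_dataset("data-std/vi-text-corpus", split="train")

def tokenize(batch):
    # Append an EOS token to each document, matching the processing described above.
    texts = [t + tokenizer.eos_token for t in batch["text"]]
    return tokenizer(texts, truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Causal LM (next-token prediction) objective, as described in Training Methodology.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="qwen3-4b-vi-cpt",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size 16
    learning_rate=2e-5,
    optim="adamw_torch",
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    bf16=True,                       # use fp16=True instead if BF16 is unsupported
    save_steps=100,
    save_total_limit=1,
    seed=3407,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

Note that full-parameter training of a ~4B model at this sequence length needs a large-memory GPU; Unsloth's optimizations reduce that footprint, which is why it was used for the original run.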
## Performance

### Hardware Requirements

| Precision | VRAM Required | Inference Speed |
|-----------|---------------|-----------------|
| FP32 | ~16 GB | Baseline |
| FP16/BF16 | ~8 GB | 2x faster |
| 4-bit | ~3-4 GB | Slightly slower, very memory efficient |

### Recommended Use Cases

- ✅ Vietnamese text generation
- ✅ Vietnamese language understanding
- ✅ Content creation in Vietnamese
- ✅ Further fine-tuning for downstream tasks
- ✅ Research on Vietnamese NLP
- ⚠️ Instruction-following (may need additional fine-tuning)
- ⚠️ Multi-turn conversation (may need additional fine-tuning)

## Limitations

- **Training Data:** The model's knowledge is limited to the Vietnamese corpus used during continued pre-training
- **Not Instruction-Tuned:** This is a base model continued pre-trained on Vietnamese text; for instruction-following capabilities, additional supervised fine-tuning (SFT) is recommended
- **Potential Biases:** May reflect biases present in the training data
- **Language:** While enhanced for Vietnamese, performance may vary across Vietnamese dialects and domains
- **Generation Quality:** May produce repetitive or inconsistent outputs without suitable generation parameters

## Ethical Considerations

- This model should not be used to generate harmful, misleading, or discriminatory content
- Users should verify generated content for factual accuracy
- The model may generate biased content reflecting biases in the training data
- Not suitable for high-stakes decision-making without human oversight

## Acknowledgements

- **Base Model:** [Qwen Team](https://huggingface.co/Qwen) for Qwen3-4B-Base
- **Training Framework:** [Unsloth AI](https://github.com/unslothai/unsloth) for efficient training
- **Dataset:** Vietnamese text corpus from data-std/vi-text-corpus
- **Infrastructure:** Trained on CUDA-enabled GPUs

## Contact

For questions, issues, or collaborations, please open an issue on the model repository or contact the maintainers.

## Model Card Authors

Data Standard Team

## Model Card Contact

Please open an issue on the model repository (see the Contact section above).

---

**License:** Apache 2.0

**Intended Use:** Research and development of Vietnamese NLP applications

**Out-of-Scope Use:** Generating harmful content, impersonation, or high-stakes decision-making without human oversight