---
library_name: transformers
language:
- vi
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B-Base
tags:
- qwen3
- causal-lm
- vietnamese
- continuous-pretraining
- unsloth
datasets:
- data-std/vi-text-corpus
pipeline_tag: text-generation
---
# Qwen3-4B Vietnamese Continued Pre-trained Model
This model is a **continued pre-training** version of [Qwen/Qwen3-4B-Base](https://huggingface.co/Qwen/Qwen3-4B-Base) trained on a Vietnamese text corpus, using [Unsloth](https://github.com/unslothai/unsloth) for efficient training.
## Model Details
### Model Description
- **Base Model:** Qwen/Qwen3-4B-Base
- **Model Type:** Causal Language Model (Decoder-only Transformer)
- **Language(s):** Vietnamese (primary), English (inherited from base)
- **Training Method:** Continued Pre-Training (CPT) with Unsloth optimization
- **Parameters:** ~4 Billion
- **Context Length:** 4096 tokens
- **License:** Apache 2.0
### Training Data
The model was trained on:
- **Dataset:** [data-std/vi-text-corpus](https://huggingface.co/datasets/data-std/vi-text-corpus)
- **Subset:** `filter-by-ppl-and-length` (filtered for quality by perplexity and length)
- **Language:** Vietnamese text corpus
- **Processing:** Automatic EOS token appending
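The EOS-appending step can be sketched as a simple preprocessing function. This is an illustrative sketch, not the exact pipeline code; the `<|endoftext|>` string is an assumption (it is the usual EOS token for Qwen base models, but the real value should come from the tokenizer's `eos_token` attribute):

```python
def append_eos(examples, eos_token):
    """Append the tokenizer's EOS token to every document so the model
    learns document boundaries during next-token prediction."""
    return {"text": [t + eos_token for t in examples["text"]]}

# Example with a placeholder EOS string:
docs = {"text": ["Việt Nam là một quốc gia ở Đông Nam Á."]}
processed = append_eos(docs, "<|endoftext|>")
print(processed["text"][0])
```

In practice this kind of function is applied over the whole corpus (e.g. via `datasets.Dataset.map` with `batched=True`) before tokenization.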
## Training Details
### Training Configuration
| Parameter | Value |
|-----------|-------|
| Base Model | unsloth/Qwen3-4B-Base |
| Max Sequence Length | 4096 tokens |
| Training Epochs | 1 |
| Batch Size (per device) | 2 |
| Gradient Accumulation Steps | 8 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-5 |
| Optimizer | AdamW (torch) |
| Weight Decay | 0.01 |
| LR Scheduler | Cosine |
| Warmup | 10 steps (a `warmup_ratio` of 0.03 was also configured; in Hugging Face `TrainingArguments`, a nonzero `warmup_steps` takes precedence) |
| Precision | BF16 (if supported) / FP16 |
| Seed | 3407 |
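The effective batch size in the table follows directly from the per-device batch size and gradient accumulation steps (assuming a single-GPU run, which the table implies):

```python
per_device_batch_size = 2
gradient_accumulation_steps = 8
num_devices = 1  # assumption: single GPU
max_seq_length = 4096

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_devices
# Upper bound on tokens seen per optimizer step (sequences packed to full length)
tokens_per_optimizer_step = effective_batch_size * max_seq_length

print(effective_batch_size)       # 16
print(tokens_per_optimizer_step)  # 65536
```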
### Training Framework
- **Framework:** Unsloth + Hugging Face Transformers
- **Optimization:** Full fine-tuning (all parameters trainable)
- **Checkpointing:** Every 100 steps, keeping 1 checkpoint
- **Hardware:** CUDA-enabled GPU
### Training Methodology
This model uses **Continued Pre-Training (CPT)** to adapt the Qwen3-4B-Base model to Vietnamese language:
- Trained on next-token prediction objective
- Uses DataCollatorForLanguageModeling for causal LM
- Maintains the original model architecture
- Enhanced Vietnamese language understanding while preserving multilingual capabilities
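A minimal sketch of how the causal-LM objective is wired up: `DataCollatorForLanguageModeling` with `mlm=False` copies `input_ids` into `labels`, and the model shifts the labels internally so that position `t` is trained to predict token `t + 1`. This pure-Python illustration (with made-up token IDs) shows the idea without requiring `transformers`:

```python
def make_causal_lm_batch(input_ids):
    """Mimic DataCollatorForLanguageModeling(mlm=False): labels are a copy
    of input_ids; the model shifts them right by one position when
    computing next-token cross-entropy."""
    return {"input_ids": input_ids, "labels": list(input_ids)}

batch = make_causal_lm_batch([101, 2023, 2003, 102])
# Effective (context, target) pairs after the model's internal shift:
targets = batch["labels"][1:]
contexts = batch["input_ids"][:-1]
print(list(zip(contexts, targets)))  # [(101, 2023), (2023, 2003), (2003, 102)]
```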
## Usage
### Requirements
```bash
pip install transformers torch accelerate
```
### Basic Text Generation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "data-std/qwen3-4b-wiki-filter-28k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # Use torch.float16 if BF16 is not supported
)

# Generate text
prompt = "Việt Nam là một quốc gia"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
### Chat/Instruction Format
For instruction-following tasks, you may need additional fine-tuning. Here's a basic template:
```python
def format_instruction(instruction, context=""):
    if context:
        prompt = f"### Instruction:\n{instruction}\n\n### Context:\n{context}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt

instruction = "Giải thích về lịch sử Việt Nam"  # "Explain the history of Vietnam"
prompt = format_instruction(instruction)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature to have any effect
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Using with Unsloth (for further fine-tuning)
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="data-std/qwen3-4b-wiki-filter-28k",
    max_seq_length=4096,
    dtype=None,         # Auto-detect
    load_in_4bit=True,  # Use 4-bit quantization for memory efficiency
)
# Continue training or perform inference
```
### Quantization for Lower Memory Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "data-std/qwen3-4b-wiki-filter-28k"

# 4-bit quantization (requires the bitsandbytes package)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)
```
## Performance
### Hardware Requirements
| Precision | VRAM Required | Inference Speed |
|-----------|---------------|-----------------|
| FP32 | ~16 GB | Baseline |
| FP16/BF16 | ~8 GB | 2x faster |
| 4-bit | ~3-4 GB | Slightly slower, very memory efficient |
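The VRAM figures above are roughly consistent with a back-of-envelope estimate of parameter memory alone (weights only; activations, the KV cache, and quantization constants add overhead, which is why the real-world figures run a little higher):

```python
num_params = 4e9  # ~4 billion parameters

bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "4-bit": 0.5}
for precision, nbytes in bytes_per_param.items():
    gib = num_params * nbytes / 1024**3
    print(f"{precision}: ~{gib:.1f} GiB for weights alone")
```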
### Recommended Use Cases
- ✅ Vietnamese text generation
- ✅ Vietnamese language understanding
- ✅ Content creation in Vietnamese
- ✅ Further fine-tuning for downstream tasks
- ✅ Research on Vietnamese NLP
- ⚠️ Instruction-following (may need additional fine-tuning)
- ⚠️ Multi-turn conversation (may need additional fine-tuning)
## Limitations
- **Training Data:** The model's knowledge is limited to the Vietnamese corpus used during continued pre-training
- **Not Instruction-Tuned:** This is a base model continued pre-trained on Vietnamese text. For instruction-following capabilities, additional supervised fine-tuning (SFT) is recommended
- **Potential Biases:** May reflect biases present in the training data
- **Language:** While enhanced for Vietnamese, performance may vary across different Vietnamese dialects and domains
- **Generation Quality:** May produce repetitive or inconsistent outputs without proper generation parameters
## Ethical Considerations
- This model should not be used for generating harmful, misleading, or discriminatory content
- Users should verify generated content for factual accuracy
- The model may generate biased content reflecting biases in training data
- Not suitable for high-stakes decision-making without human oversight
## Acknowledgements
- **Base Model:** [Qwen Team](https://huggingface.co/Qwen) for Qwen3-4B-Base
- **Training Framework:** [Unsloth AI](https://github.com/unslothai/unsloth) for efficient training
- **Dataset:** Vietnamese text corpus from data-std/vi-text-corpus
- **Infrastructure:** Trained using CUDA-enabled GPUs
## Contact
For questions, issues, or collaborations, please open an issue on the model repository or contact the maintainers.
## Model Card Authors
Data Standard Team
---
**License:** Apache 2.0
**Intended Use:** Research and development of Vietnamese NLP applications
**Out-of-Scope Use:** Generating harmful content, impersonation, high-stakes decisions without human oversight