---
license: apache-2.0
library_name: llama-cpp-python
tags:
- llama
- instruction-tuned
- thai
- gguf
- quantized
- q8
- rag
- chatbot
language:
- th
---

# Llama 3.2 Typhoon2 3B Instruct (GGUF Q8_0)

Fine-tuned Thai instruction-following model quantized to GGUF Q8_0 format for efficient inference.

## Model Details

- **Base Model**: typhoon-ai/llama3.2-typhoon2-3b-instruct
- **Format**: GGUF (Q8_0 quantization)
- **Parameters**: 3 billion
- **Language**: Thai
- **Use Case**: Context-aware Q&A, RAG systems, chatbots
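Context-aware Q&A here means the model is expected to answer only from supplied passages. A minimal sketch of such a prompt, where the template wording and the `build_rag_prompt` helper are illustrative assumptions rather than an official prompt format:

```python
def build_rag_prompt(context: str, question: str) -> str:
    """Build a strict RAG-style prompt that confines answers to the context."""
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Example usage (English shown for readability; the model targets Thai)
prompt = build_rag_prompt(
    "Bangkok is the capital of Thailand.",
    "What is the capital of Thailand?",
)
print(prompt)
```

The trailing `Answer:` cue and the explicit refusal instruction pair naturally with greedy decoding for deterministic, grounded answers.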

## Training

- **Framework**: Unsloth
- **Method**: Supervised Fine-Tuning (SFT)
- **Training Data**: Thai instruction-following dataset with negative samples for strictness
- **Optimization**: LoRA + 4-bit quantization during training
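To illustrate what "negative samples for strictness" means in an SFT dataset: a negative sample pairs a question whose answer is absent from the context with an explicit refusal, teaching the model not to hallucinate. The field names and wording below are illustrative assumptions, not the actual dataset schema:

```python
# Illustrative SFT examples (English shown for readability; the real
# dataset is Thai). The second example is a negative sample.
examples = [
    {
        "instruction": "What is the capital of Thailand?",
        "context": "Bangkok is the capital of Thailand.",
        "response": "Bangkok",
    },
    {
        # Negative sample: the answer is not in the context,
        # so the target response is a refusal.
        "instruction": "What is the population of Chiang Mai?",
        "context": "Bangkok is the capital of Thailand.",
        "response": "I cannot find that information in the given context.",
    },
]

for ex in examples:
    print(ex["instruction"], "->", ex["response"])
```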

## Inference

### Using llama-cpp-python

```python
from llama_cpp import Llama

# Load the quantized GGUF model (n_gpu_layers=0 keeps inference on CPU)
llm = Llama(
    model_path="model.gguf",
    n_ctx=4096,
    n_gpu_layers=0,
)

# Greedy decoding (temperature=0.0) for deterministic answers
prompt = "เมืองหลวงของประเทศไทยคืออะไร"  # "What is the capital of Thailand?"
response = llm(prompt, max_tokens=256, temperature=0.0)
```
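The call returns an OpenAI-style completion dict; the generated text lives under `choices[0]["text"]`. A sketch of extracting it, using a hard-coded illustrative response in place of a real model call:

```python
# Illustrative response dict mimicking llama-cpp-python's completion
# output shape; the field values here are made up for the example.
response = {
    "id": "cmpl-example",
    "choices": [
        {"text": " กรุงเทพมหานคร", "index": 0, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 5, "total_tokens": 17},
}

# Strip leading/trailing whitespace the model may emit around the answer
answer = response["choices"][0]["text"].strip()
print(answer)
```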

### Docker Deployment (EKS)

See the deployment guide in the chat-inference Helm chart.

## Performance

- **Quantization**: Q8_0 (8-bit)
- **Model Size**: ~3.3 GB
- **Inference Speed (CPU)**: ~2-5 tokens/sec (AWS t3.xlarge)
- **Recommended Resources**: 2-4 CPU cores, 4-6 GB RAM
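The ~3.3 GB figure is consistent with a back-of-the-envelope estimate: Q8_0 stores weights in blocks of 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 weights. A quick check (the 3-billion parameter count is approximate):

```python
# Q8_0 block layout: 32 x int8 weights + 1 x fp16 scale = 34 bytes / 32 weights
params = 3_000_000_000       # "3 billion" is approximate
bytes_per_weight = 34 / 32   # 1.0625 bytes per weight
size_gb = params * bytes_per_weight / 1e9
print(f"{size_gb:.1f} GB")   # -> 3.2 GB
```

The small remainder up to ~3.3 GB is accounted for by the exact parameter count and non-quantized tensors (e.g. some norms kept at higher precision).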

## License

Apache License 2.0