---
license: apache-2.0
library_name: llama-cpp-python
tags:
- llama
- instruction-tuned
- thai
- gguf
- quantized
- q8
- rag
- chatbot
language:
- th
---

# Llama 3.2 Typhoon2 3B Instruct (GGUF Q8_0)

A fine-tuned Thai instruction-following model quantized to the GGUF Q8_0 format for efficient inference.

## Model Details

- **Base Model**: typhoon-ai/llama3.2-typhoon2-3b-instruct
- **Format**: GGUF (Q8_0 quantization)
- **Parameters**: 3 billion
- **Language**: Thai
- **Use Cases**: Context-aware Q&A, RAG systems, chatbots

## Training

- **Framework**: Unsloth
- **Method**: Supervised Fine-Tuning (SFT)
- **Training Data**: Thai instruction-following dataset with negative samples to keep answers strictly grounded in the provided context
- **Optimization**: LoRA + 4-bit quantization during training

## Inference

### Using llama-cpp-python

```python
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",
    n_ctx=4096,       # context window
    n_gpu_layers=0,   # CPU-only; raise to offload layers to a GPU
)

prompt = "สวัสดีครับ"  # any Thai instruction
response = llm(prompt, max_tokens=256, temperature=0.0)
print(response["choices"][0]["text"])
```

### Docker Deployment (EKS)

See the deployment guide in the chat-inference Helm chart.

## Performance

- **Quantization**: Q8_0 (8-bit)
- **Model Size**: ~3.3 GB
- **Inference Speed (CPU)**: ~2-5 tokens/sec (t3.xlarge)
- **Recommended Resources**: 2-4 CPU cores, 4-6 GB RAM

## License

Apache License 2.0
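
## RAG Usage Sketch

Since the card lists RAG and context-aware Q&A as intended use cases, one common pattern is to pack retrieved passages into the system turn and ask the question in the user turn, letting `llama-cpp-python`'s `create_chat_completion` apply the chat template stored in the GGUF metadata. The sketch below is illustrative, not part of the official model card; `build_messages` and `answer` are hypothetical helper names, and the English system prompt is an assumption (a Thai prompt may match the fine-tuning data better).

```python
from typing import List, Dict


def build_messages(context: str, question: str) -> List[Dict[str, str]]:
    """Pack retrieved context into the system turn so the model
    answers only from the supplied passages."""
    return [
        {
            "role": "system",
            "content": (
                "Answer strictly from the following context. "
                "If the answer is not in the context, say you do not know.\n\n"
                + context
            ),
        },
        {"role": "user", "content": question},
    ]


def answer(llm, context: str, question: str) -> str:
    # llm is a llama_cpp.Llama instance loaded from the Q8_0 GGUF file,
    # e.g. Llama(model_path="model.gguf", n_ctx=4096, n_gpu_layers=0)
    response = llm.create_chat_completion(
        messages=build_messages(context, question),
        max_tokens=256,
        temperature=0.0,  # deterministic output for Q&A
    )
    return response["choices"][0]["message"]["content"]
```

Load the model once and reuse the `Llama` instance across requests; model loading dominates latency for a ~3.3 GB file, while each call only pays the per-token decoding cost.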