---
license: apache-2.0
library_name: llama-cpp-python
tags:
- llama
- instruction-tuned
- thai
- gguf
- quantized
- q8
- rag
- chatbot
language:
- th
---

# Llama 3.2 Typhoon2 3B Instruct (GGUF Q8_0)

Fine-tuned Thai instruction-following model quantized to GGUF Q8_0 format for efficient inference.

## Model Details

- **Base Model**: typhoon-ai/llama3.2-typhoon2-3b-instruct
- **Format**: GGUF (Q8_0 quantization)
- **Parameters**: 3 billion
- **Language**: Thai
- **Use Case**: Context-aware Q&A, RAG systems, chatbots
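Context-aware Q&A here means the model is expected to answer only from supplied passages. A minimal sketch of such a prompt, where the template wording and the `build_rag_prompt` helper are illustrative assumptions rather than an official prompt format:

```python
def build_rag_prompt(context: str, question: str) -> str:
    """Build a strict RAG-style prompt that confines answers to the context."""
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Example usage (English shown for readability; the model targets Thai)
prompt = build_rag_prompt(
    "Bangkok is the capital of Thailand.",
    "What is the capital of Thailand?",
)
print(prompt)
```

The trailing `Answer:` cue and the explicit refusal instruction pair naturally with greedy decoding for deterministic, grounded answers.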

## Training

- **Framework**: Unsloth
- **Method**: Supervised Fine-Tuning (SFT)
- **Training Data**: Thai instruction-following dataset with negative samples for strictness
- **Optimization**: LoRA + 4-bit quantization during training
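To illustrate what "negative samples for strictness" means in an SFT dataset: a negative sample pairs a question whose answer is absent from the context with an explicit refusal, teaching the model not to hallucinate. The field names and wording below are illustrative assumptions, not the actual dataset schema:

```python
# Illustrative SFT examples (English shown for readability; the real
# dataset is Thai). The second example is a negative sample.
examples = [
    {
        "instruction": "What is the capital of Thailand?",
        "context": "Bangkok is the capital of Thailand.",
        "response": "Bangkok",
    },
    {
        # Negative sample: the answer is not in the context,
        # so the target response is a refusal.
        "instruction": "What is the population of Chiang Mai?",
        "context": "Bangkok is the capital of Thailand.",
        "response": "I cannot find that information in the given context.",
    },
]

for ex in examples:
    print(ex["instruction"], "->", ex["response"])
```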

## Inference

### Using llama-cpp-python

```python
from llama_cpp import Llama

# Load the quantized GGUF model (n_gpu_layers=0 keeps inference on CPU)
llm = Llama(
    model_path="model.gguf",
    n_ctx=4096,
    n_gpu_layers=0,
)

# Greedy decoding (temperature=0.0) for deterministic answers
prompt = "เมืองหลวงของประเทศไทยคืออะไร"  # "What is the capital of Thailand?"
response = llm(prompt, max_tokens=256, temperature=0.0)
```
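The call returns an OpenAI-style completion dict; the generated text lives under `choices[0]["text"]`. A sketch of extracting it, using a hard-coded illustrative response in place of a real model call:

```python
# Illustrative response dict mimicking llama-cpp-python's completion
# output shape; the field values here are made up for the example.
response = {
    "id": "cmpl-example",
    "choices": [
        {"text": " กรุงเทพมหานคร", "index": 0, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 5, "total_tokens": 17},
}

# Strip leading/trailing whitespace the model may emit around the answer
answer = response["choices"][0]["text"].strip()
print(answer)
```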

### Docker Deployment (EKS)

See the deployment guide in the chat-inference Helm chart.

## Performance

- **Quantization**: Q8_0 (8-bit)
- **Model Size**: ~3.3 GB
- **Inference Speed (CPU)**: ~2-5 tokens/sec (AWS t3.xlarge)
- **Recommended Resources**: 2-4 CPU cores, 4-6 GB RAM
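The ~3.3 GB figure is consistent with a back-of-the-envelope estimate: Q8_0 stores weights in blocks of 32 int8 values plus one fp16 scale, i.e. 34 bytes per 32 weights. A quick check (the 3-billion parameter count is approximate):

```python
# Q8_0 block layout: 32 x int8 weights + 1 x fp16 scale = 34 bytes / 32 weights
params = 3_000_000_000       # "3 billion" is approximate
bytes_per_weight = 34 / 32   # 1.0625 bytes per weight
size_gb = params * bytes_per_weight / 1e9
print(f"{size_gb:.1f} GB")   # -> 3.2 GB
```

The small remainder up to ~3.3 GB is accounted for by the exact parameter count and non-quantized tensors (e.g. some norms kept at higher precision).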

## License

Apache License 2.0