chronorus commited on
Commit
8a124e6
·
verified ·
1 Parent(s): b5ecc31

Add model card

Browse files
Files changed (1) hide show
  1. README.md +65 -0
README.md ADDED
@@ -0,0 +1,65 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ library_name: llama-cpp-python
4
+ tags:
5
+ - llama
6
+ - instruction-tuned
7
+ - thai
8
+ - gguf
9
+ - quantized
10
+ - q8
11
+ - rag
12
+ - chatbot
13
+ language:
14
+ - th
15
+ ---
16
+
17
+ # Llama 3.2 Typhoon2 3B Instruct (GGUF Q8_0)
18
+
19
+ Fine-tuned Thai instruction-following model quantized to GGUF Q8_0 format for efficient inference.
20
+
21
+ ## Model Details
22
+
23
+ - **Base Model**: typhoon-ai/llama3.2-typhoon2-3b-instruct
24
+ - **Format**: GGUF (Q8_0 quantization)
25
+ - **Parameters**: 3 billion
26
+ - **Language**: Thai
27
+ - **Use Case**: Context-aware Q&A, RAG systems, chatbots
28
+
29
+ ## Training
30
+
31
+ - **Framework**: Unsloth
32
+ - **Method**: Supervised Fine-Tuning (SFT)
33
+ - **Training Data**: Thai instruction-following dataset with negative samples for strictness
34
+ - **Optimization**: LoRA + 4-bit quantization during training
35
+
36
+ ## Inference
37
+
38
+ ### Using llama-cpp-python
39
+
40
+ ```python
41
+ from llama_cpp import Llama
42
+
43
+ llm = Llama(
44
+ model_path="model.gguf",
45
+ n_ctx=4096,
46
+ n_gpu_layers=0,
47
+ )
48
+
49
+ response = llm(prompt, max_tokens=256, temperature=0.0)
50
+ ```
51
+
52
+ ### Docker Deployment (EKS)
53
+
54
+ See deployment guide in the chat-inference Helm chart.
55
+
56
+ ## Performance
57
+
58
+ - **Quantization**: Q8_0 (8-bit)
59
+ - **Model Size**: ~3.3 GB
60
+ - **Inference Speed (CPU)**: ~2-5 tokens/sec (t3.xlarge)
61
+ - **Recommended CPU**: 2-4 cores, 4-6 GB RAM
62
+
63
+ ## License
64
+
65
+ Apache License 2.0