File size: 14,536 Bytes
23611e1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
570280a
23611e1
 
 
 
 
 
 
 
f8a2f75
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
570280a
 
 
 
 
 
 
 
 
 
 
 
23611e1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
570280a
 
 
 
 
 
 
 
 
 
 
23611e1
 
570280a
23611e1
 
570280a
 
23611e1
 
 
 
570280a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
23611e1
 
570280a
23611e1
 
 
 
 
 
570280a
 
23611e1
570280a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
23611e1
 
570280a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
23611e1
570280a
 
 
 
 
 
 
 
 
23611e1
570280a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
23611e1
 
570280a
 
23611e1
 
570280a
 
 
 
23611e1
 
570280a
 
 
 
 
 
 
23611e1
 
 
570280a
 
 
23611e1
570280a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
23611e1
 
 
 
 
570280a
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
# Learning Roadmap — NLP to LLM

---

## Phase 2 — ML-Based NLP & Transformers

### Step 1 — Text Preprocessing
**Goal:** Understand how raw text becomes model input.

Topics:
- Tokenization at character / subword level (SentencePiece, BPE)
- What is a vocabulary? What is an `<UNK>` token?
- Padding and truncation (`max_length`)
- Attention mask — why it exists
- The difference between word tokenization (Phase 1 / PyThaiNLP) vs subword tokenization (transformers)

Practice:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
print(tokenizer("วันนี้ทำอะไรบ้าง"))
# See: input_ids, attention_mask
```

---

### Step 2 — What is a Transformer?
**Goal:** Understand the architecture behind BERT / WangchanBERTa.

Topics:
- Encoder vs Decoder (BERT = encoder only)
- Self-attention — how tokens look at each other
- What is a `[CLS]` token? What is `[SEP]`?
- Pre-training vs fine-tuning — why we don't train from scratch
- Why WangchanBERTa for Thai? (pre-trained on Thai corpus)

Resource: "The Illustrated BERT" by Jay Alammar (Google it — best visual explanation)

---

### Step 3 — Named Entity Recognition (NER)
**Goal:** Understand the task your bot needs.

Topics:
- What is NER? (extract "ฮาคุบะ" → LOCATION, "29 พ.ค." → DATE)
- BIO tagging scheme: `B-LOC`, `I-LOC`, `O`
- Token classification head on top of BERT
- How the model outputs one label per token
- `O` = Outside — means "not an entity", not a label type

Example:
```
Input : "จากฮาคุบะไปคามิโคจิ"
Tokens: ["จาก", "ฮาคุบะ", "ไป", "คามิโคจิ"]
Labels: [O,     B-LOC,    O,    B-LOC      ]
```

#### NER vs Text Classification — two different tasks, same base model

Both tasks use WangchanBERTa as the encoder, but attach different heads:

```
WangchanBERTa encoder

        ├──► Token Classification Head  →  NER
        │     (one label per token)         AutoModelForTokenClassification

        └──► Sequence Classification Head  →  Text Classification / Intent
              (one label per [CLS] token)      AutoModelForSequenceClassification
```

| | NER | Text Classification |
|---|---|---|
| Output | 1 label per token | 1 label per sentence |
| Labels | `B-LOC`, `I-DATE`, `O` … | `query_itinerary`, `greeting` … |
| Loss computed on | Every token | Only `[CLS]` token |
| Answers | *What* entities are in the text | *What* the user wants to do |

Fine-tuning for NER and fine-tuning for text classification are **separate training runs** with separate datasets, even though both start from the same checkpoint.

#### Do you need both for this bot?

NER extracts *what* is in the text (date, place). Intent classification determines *what action* to take (query, delete, add).

| Bot complexity | What you need |
|---|---|
| Single-purpose — show activities only (current bot) | NER only — everything is implicitly `query_itinerary` |
| Multi-action — query + add + delete | NER + Intent classification |
| Full assistant — arbitrary tasks | LLM handles both implicitly |

For the current itinerary bot, **NER alone is sufficient**. Intent classification becomes necessary only when the bot supports multiple actions on the same entity.

#### PyThaiNLP NER vs WangchanBERTa NER

PyThaiNLP ships a built-in NER tagger (CRF-based). It is simpler but weaker:

| | PyThaiNLP NER | WangchanBERTa NER |
|---|---|---|
| Model type | CRF — classical ML | RoBERTa + token classification head |
| Tokenizer | newmm (dictionary word-level) | SentencePiece subword |
| Context awareness | None — labels each token independently | Full bidirectional attention |
| Size | ~MB, instant | ~500 MB, needs decent CPU/GPU |
| Use case | Quick prototype | Production accuracy |

The critical difference is **context**. CRF labels tokens one by one using local features. A transformer attends to the entire sentence, so the same surface form (e.g. "มัตสึโมโต") can be correctly labeled LOC or PERSON depending on surrounding words.

**Recommended order:** get PyThaiNLP NER working first to understand BIO output format, then replicate with WangchanBERTa to feel the quality difference.

#### NER real-world applications

NER's real home is **document processing**, not chatbots (LLMs replaced NER in chatbots after 2022):

| Industry | NER use |
|---|---|
| Medical | Extract drug names, dosages, symptoms from clinical notes → structured database |
| Legal | Extract parties, dates, clauses from contracts automatically |
| Finance | Extract company names, amounts, dates from earnings reports |
| HR | Resume parsing — extract skills, companies, job titles |
| Compliance | Flag PII (names, phones, IDs) in documents for redaction before sending to external APIs |

---

### Step 4 — Fine-Tuning WangchanBERTa for NER
**Goal:** Train the model on your task.

Topics:
- What is LST20? (Thai NER dataset — your training data)
- How to load and format a dataset with HuggingFace `datasets`
- `Trainer` API — the standard fine-tuning loop
- Evaluation metrics: precision, recall, F1 (seqeval library)
- Saving and loading a checkpoint

Practice: Follow the HuggingFace NER fine-tuning tutorial, swap the dataset for LST20.

---

### Step 5 — Plug into the Bot
**Goal:** Replace `intent_engine.py` with the trained model.

- Load model with `pipeline("ner", model="your-checkpoint")`
- Extract `origin` and `destination` entities
- Uncomment the Phase 2 block in `webhook.py`

---

## Phase 3 — LLM + Context Injection (Production)

### Application layer vs Deep understanding

| Goal | Approach |
|---|---|
| Build a working chatbot now | Use pre-trained LLM via Ollama API — done |
| Understand how LLM works internally | Study the full pipeline below |
| Build your own LLM from scratch | Follow the deep learning path at the end |

---

### Step 1 — Why LLMs Change Everything
**Goal:** Understand what GPT / Claude actually do differently from BERT.

Topics:
- Decoder-only architecture (GPT, Llama, Typhoon) vs encoder-only (BERT, WangchanBERTa)
- BERT reads whole sentence bidirectionally — LLM reads left to right, generates new tokens
- Pre-training on massive text → emergent instruction following
- Why you don't need labeled data or fine-tuning for most tasks
- Zero-shot vs few-shot prompting

```
BERT (encoder):   reads [full sentence] → outputs labels for existing tokens
LLM  (decoder):   reads [prompt] → predicts next token → appends → repeats
```

---

### Step 2 — How LLM Generates Text (the behind-the-scenes pipeline)

This is what Ollama does invisibly when you call `requests.post(...)`:

```
Your plain text (system prompt + user message)

    ▼  1. Tokenize  (SentencePiece / BPE — same concept as WangchanBERTa)
["▁วัน", "ที่", "▁29", "▁ทำ", "อะไร", "บ้าง"]

    ▼  2. Token IDs  (vocabulary lookup)
[2341, 891, 445, 1203, 567, 892]

    ▼  3. Embedding lookup  (each ID → 768/4096-dim vector)

    ▼  4. Decoder transformer layers  (left-to-right attention, ~32 layers)
    │      each token attends to ALL previous tokens

    ▼  5. Predict next token  (softmax over full vocabulary)
    │      "กิจกรรม" → 42%
    │      "กำหนดการ" → 31%
    │      ... pick highest (or sample)

    ▼  6. Append predicted token → repeat from step 4
    │      until <end> token is predicted

    ▼  7. Detokenize → plain Thai text reply
```

You only see step 1 input and step 7 output. Everything in between is inside Ollama.

**Why the itinerary JSON is text, not vectors:**
- RAG converts documents to vectors for *searching large corpora*
- Your itinerary (~2,000 tokens) fits entirely in the context window
- No search needed — paste everything, LLM reads it all as tokens
- Every user message re-sends the full itinerary (stateless — no memory between calls)

---

### Step 3 — Prompt Engineering
**Goal:** Learn to control LLM behavior through prompts.

Topics:
- System prompt vs user prompt
- Role prompting ("You are a Thai travel assistant...")
- Context injection — paste your JSON into the prompt
- Output formatting (ask for bullets, specific structure)
- Temperature / top-p — controls randomness of next-token sampling

```python
# trip-bot system prompt structure
system = f"""
คุณเป็นผู้ช่วยท่องเที่ยวภาษาไทย
ทริปนี้อยู่ในช่วง 29 พ.ค. – 8 มิ.ย. 2569
ตอบตามข้อมูลนี้เท่านั้น: {itinerary_json}
"""
```

**How LLM handles what Phase 1 needed code for:**

| Phase 1 needed | LLM does automatically |
|---|---|
| Regex for date extraction | Reads "29 พ.ค." in context → understands it |
| Gazetteer for place names | Reads JSON → matches places in context |
| Intent classification | Infers what user wants from phrasing |
| Typo handling | Predicts most likely meaning from context |

---

### Ollama Model Selection

Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.

**Install:** download from ollama.com then pull a model:

```bash
ollama pull qwen2.5:3b    # recommended starting point
ollama serve              # starts local server on localhost:11434
```

**Model comparison for this bot (Thai group chat, CPU only):**

| Model | Size | Speed (CPU) | Thai quality | Recommended for |
|---|---|---|---|---|
| `qwen2.5:3b` | 1.9 GB | ~10-15s | Good | **Starting point - best balance** |
| `qwen2.5` | 4.7 GB | ~30-60s | Very good | Better quality, slower |
| `supachai/llama-3-typhoon-v1.5:8b-instruct` | 4.9 GB | ~30-60s | Best (Thai-specific) | Best Thai, needs patience |
| `llama3.2:1b` | 1.3 GB | ~5s | Decent | Fastest, weakest Thai |

**Upgrade path:**
- Start with `qwen2.5:3b` -> test response quality
- If Thai quality not good enough -> upgrade to `qwen2.5` or `typhoon`
- If too slow for group chat -> downgrade to `llama3.2:1b`

**How Ollama works:**
- Downloads model in GGUF format (quantized - 4-bit instead of 16-bit = smaller, faster)
- Runs as background server on `localhost:11434`
- Your Python code sends HTTP requests - Ollama runs the full inference pipeline internally
- You only see plain text in -> plain text out

**Group chat trigger word:**
In group chats, bot responds only when message starts with the trigger word:
```
fujisan วันที่ 29 ทำอะไรบ้าง   <- bot responds
วันที่ 29 ทำอะไรบ้าง           <- bot ignores
```

---

### Ollama Model Selection

Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.

**Install:** download from ollama.com → then pull a model:

**Model comparison for this bot (Thai group chat, CPU only):**

| Model | Size | Speed (CPU) | Thai quality | Recommended for |
|---|---|---|---|---|
| \ | 1.9 GB | ~10–15s | Good | **Starting point — best balance** |
| \ | 4.7 GB | ~30–60s | Very good | Better quality, slower |
| \ | 4.9 GB | ~30–60s | Best (Thai-specific) | Best Thai, needs patience |
| \ | 1.3 GB | ~5s | Decent | Fastest, weakest Thai |

**Upgrade path:**
- Start with \ → test response quality
- If Thai quality not good enough → upgrade to \ or - If too slow for group chat → downgrade to 
**How Ollama works:**
- Downloads model in GGUF format (quantized — 4-bit instead of 16-bit = smaller, faster)
- Runs as background server on - Your Python code sends HTTP requests — Ollama runs the full inference pipeline internally
- You only see plain text in → plain text out

**Group chat trigger word:**
In group chats, bot responds only when message starts with the trigger word:
---

### Step 4 — RAG (Retrieval-Augmented Generation)
**Goal:** Understand when and why context injection is not enough.

Topics:
- Token limit problem: when data > context window, you can't paste everything
- Embeddings — convert text chunks to vectors that capture semantic meaning
- Vector similarity search — find chunks most relevant to the query
- Retrieve relevant chunks → inject only those → LLM generates answer
- Tools: `chromadb`, `faiss`, OpenAI/Claude embeddings

**When RAG is needed vs not:**

| Data size | Approach |
|---|---|
| Small JSON / single document (trip-bot now) | Full context injection — no RAG |
| 10+ trips | Still probably fine with full injection |
| 100+ trips + reviews + guides | RAG — mandatory |

---

### Step 5 — If You Want to Build Your Own LLM

The full learning path from understanding to building from scratch:

```
Level 1 — Tokenization              ✓ done (SentencePiece, BPE, input_ids)
Level 2 — Embeddings                ✓ done (token IDs → vectors, WangchanBERTa)
Level 3 — Encoder transformer       ✓ done (BERT, attention, NER, BIO)
Level 4 — Decoder / Generation      → next (next-token prediction, autoregressive)
Level 5 — Pre-training              → how model learns from raw text (loss, backprop)
Level 6 — Build your own small LLM  → implement transformer in PyTorch from scratch
```

**Recommended resources in order:**

| Resource | What you learn |
|---|---|
| 3Blue1Brown — Neural Networks series | Backpropagation visually |
| Andrej Karpathy — makemore (YouTube) | Build bigram → MLP → transformer from scratch |
| Andrej Karpathy — nanoGPT (GitHub) | Minimal GPT in ~300 lines of PyTorch |
| HuggingFace course chapters 1–4 | Pre-training and fine-tuning at scale |
| Paper: "Attention Is All You Need" (2017) | Original transformer architecture |

nanoGPT is the single best resource — it implements exactly the pipeline above
(`tokenize → IDs → transformer layers → predict next token → repeat`) from zero.

---

## Summary

| Phase | Status | Key skill | What you built |
|---|---|---|---|
| 1 | Done | Rule-based NLP, keyword matching | Working trip chatbot |
| 2 | Done (learning) | Transformers, NER, BIO tagging, subword tokenization | Understood ML-based NLP |
| 3 | Done (production) | Prompt engineering, context injection, Typhoon API | Production LLM chatbot via `openai` SDK + `typhoon-v2.5-30b-a3b-instruct` |
| 4 (optional) | Future | Decoder architecture, pre-training, PyTorch | Build your own LLM |