File size: 14,536 Bytes
23611e1 570280a 23611e1 f8a2f75 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a 23611e1 570280a | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 | # Learning Roadmap — NLP to LLM
---
## Phase 2 — ML-Based NLP & Transformers
### Step 1 — Text Preprocessing
**Goal:** Understand how raw text becomes model input.
Topics:
- Tokenization at character / subword level (SentencePiece, BPE)
- What is a vocabulary? What is an `<UNK>` token?
- Padding and truncation (`max_length`)
- Attention mask — why it exists
- The difference between word tokenization (Phase 1 / PyThaiNLP) vs subword tokenization (transformers)
Practice:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
print(tokenizer("วันนี้ทำอะไรบ้าง"))
# See: input_ids, attention_mask
```
---
### Step 2 — What is a Transformer?
**Goal:** Understand the architecture behind BERT / WangchanBERTa.
Topics:
- Encoder vs Decoder (BERT = encoder only)
- Self-attention — how tokens look at each other
- What is a `[CLS]` token? What is `[SEP]`?
- Pre-training vs fine-tuning — why we don't train from scratch
- Why WangchanBERTa for Thai? (pre-trained on Thai corpus)
Resource: "The Illustrated BERT" by Jay Alammar (Google it — best visual explanation)
---
### Step 3 — Named Entity Recognition (NER)
**Goal:** Understand the task your bot needs.
Topics:
- What is NER? (extract "ฮาคุบะ" → LOCATION, "29 พ.ค." → DATE)
- BIO tagging scheme: `B-LOC`, `I-LOC`, `O`
- Token classification head on top of BERT
- How the model outputs one label per token
- `O` = Outside — means "not an entity", not a label type
Example:
```
Input : "จากฮาคุบะไปคามิโคจิ"
Tokens: ["จาก", "ฮาคุบะ", "ไป", "คามิโคจิ"]
Labels: [O, B-LOC, O, B-LOC ]
```
#### NER vs Text Classification — two different tasks, same base model
Both tasks use WangchanBERTa as the encoder, but attach different heads:
```
WangchanBERTa encoder
│
├──► Token Classification Head → NER
│ (one label per token) AutoModelForTokenClassification
│
└──► Sequence Classification Head → Text Classification / Intent
(one label per [CLS] token) AutoModelForSequenceClassification
```
| | NER | Text Classification |
|---|---|---|
| Output | 1 label per token | 1 label per sentence |
| Labels | `B-LOC`, `I-DATE`, `O` … | `query_itinerary`, `greeting` … |
| Loss computed on | Every token | Only `[CLS]` token |
| Answers | *What* entities are in the text | *What* the user wants to do |
Fine-tuning for NER and fine-tuning for text classification are **separate training runs** with separate datasets, even though both start from the same checkpoint.
#### Do you need both for this bot?
NER extracts *what* is in the text (date, place). Intent classification determines *what action* to take (query, delete, add).
| Bot complexity | What you need |
|---|---|
| Single-purpose — show activities only (current bot) | NER only — everything is implicitly `query_itinerary` |
| Multi-action — query + add + delete | NER + Intent classification |
| Full assistant — arbitrary tasks | LLM handles both implicitly |
For the current itinerary bot, **NER alone is sufficient**. Intent classification becomes necessary only when the bot supports multiple actions on the same entity.
#### PyThaiNLP NER vs WangchanBERTa NER
PyThaiNLP ships a built-in NER tagger (CRF-based). It is simpler but weaker:
| | PyThaiNLP NER | WangchanBERTa NER |
|---|---|---|
| Model type | CRF — classical ML | RoBERTa + token classification head |
| Tokenizer | newmm (dictionary word-level) | SentencePiece subword |
| Context awareness | None — labels each token independently | Full bidirectional attention |
| Size | ~MB, instant | ~500 MB, needs decent CPU/GPU |
| Use case | Quick prototype | Production accuracy |
The critical difference is **context**. CRF labels tokens one by one using local features. A transformer attends to the entire sentence, so the same surface form (e.g. "มัตสึโมโต") can be correctly labeled LOC or PERSON depending on surrounding words.
**Recommended order:** get PyThaiNLP NER working first to understand BIO output format, then replicate with WangchanBERTa to feel the quality difference.
#### NER real-world applications
NER's real home is **document processing**, not chatbots (LLMs replaced NER in chatbots after 2022):
| Industry | NER use |
|---|---|
| Medical | Extract drug names, dosages, symptoms from clinical notes → structured database |
| Legal | Extract parties, dates, clauses from contracts automatically |
| Finance | Extract company names, amounts, dates from earnings reports |
| HR | Resume parsing — extract skills, companies, job titles |
| Compliance | Flag PII (names, phones, IDs) in documents for redaction before sending to external APIs |
---
### Step 4 — Fine-Tuning WangchanBERTa for NER
**Goal:** Train the model on your task.
Topics:
- What is LST20? (Thai NER dataset — your training data)
- How to load and format a dataset with HuggingFace `datasets`
- `Trainer` API — the standard fine-tuning loop
- Evaluation metrics: precision, recall, F1 (seqeval library)
- Saving and loading a checkpoint
Practice: Follow the HuggingFace NER fine-tuning tutorial, swap the dataset for LST20.
---
### Step 5 — Plug into the Bot
**Goal:** Replace `intent_engine.py` with the trained model.
- Load model with `pipeline("ner", model="your-checkpoint")`
- Extract `origin` and `destination` entities
- Uncomment the Phase 2 block in `webhook.py`
---
## Phase 3 — LLM + Context Injection (Production)
### Application layer vs Deep understanding
| Goal | Approach |
|---|---|
| Build a working chatbot now | Use pre-trained LLM via Ollama API — done |
| Understand how LLM works internally | Study the full pipeline below |
| Build your own LLM from scratch | Follow the deep learning path at the end |
---
### Step 1 — Why LLMs Change Everything
**Goal:** Understand what GPT / Claude actually do differently from BERT.
Topics:
- Decoder-only architecture (GPT, Llama, Typhoon) vs encoder-only (BERT, WangchanBERTa)
- BERT reads whole sentence bidirectionally — LLM reads left to right, generates new tokens
- Pre-training on massive text → emergent instruction following
- Why you don't need labeled data or fine-tuning for most tasks
- Zero-shot vs few-shot prompting
```
BERT (encoder): reads [full sentence] → outputs labels for existing tokens
LLM (decoder): reads [prompt] → predicts next token → appends → repeats
```
---
### Step 2 — How LLM Generates Text (the behind-the-scenes pipeline)
This is what Ollama does invisibly when you call `requests.post(...)`:
```
Your plain text (system prompt + user message)
│
▼ 1. Tokenize (SentencePiece / BPE — same concept as WangchanBERTa)
["▁วัน", "ที่", "▁29", "▁ทำ", "อะไร", "บ้าง"]
│
▼ 2. Token IDs (vocabulary lookup)
[2341, 891, 445, 1203, 567, 892]
│
▼ 3. Embedding lookup (each ID → 768/4096-dim vector)
│
▼ 4. Decoder transformer layers (left-to-right attention, ~32 layers)
│ each token attends to ALL previous tokens
│
▼ 5. Predict next token (softmax over full vocabulary)
│ "กิจกรรม" → 42%
│ "กำหนดการ" → 31%
│ ... pick highest (or sample)
│
▼ 6. Append predicted token → repeat from step 4
│ until <end> token is predicted
│
▼ 7. Detokenize → plain Thai text reply
```
You only see step 1 input and step 7 output. Everything in between is inside Ollama.
**Why the itinerary JSON is text, not vectors:**
- RAG converts documents to vectors for *searching large corpora*
- Your itinerary (~2,000 tokens) fits entirely in the context window
- No search needed — paste everything, LLM reads it all as tokens
- Every user message re-sends the full itinerary (stateless — no memory between calls)
---
### Step 3 — Prompt Engineering
**Goal:** Learn to control LLM behavior through prompts.
Topics:
- System prompt vs user prompt
- Role prompting ("You are a Thai travel assistant...")
- Context injection — paste your JSON into the prompt
- Output formatting (ask for bullets, specific structure)
- Temperature / top-p — controls randomness of next-token sampling
```python
# trip-bot system prompt structure
system = f"""
คุณเป็นผู้ช่วยท่องเที่ยวภาษาไทย
ทริปนี้อยู่ในช่วง 29 พ.ค. – 8 มิ.ย. 2569
ตอบตามข้อมูลนี้เท่านั้น: {itinerary_json}
"""
```
**How LLM handles what Phase 1 needed code for:**
| Phase 1 needed | LLM does automatically |
|---|---|
| Regex for date extraction | Reads "29 พ.ค." in context → understands it |
| Gazetteer for place names | Reads JSON → matches places in context |
| Intent classification | Infers what user wants from phrasing |
| Typo handling | Predicts most likely meaning from context |
---
### Ollama Model Selection
Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.
**Install:** download from ollama.com then pull a model:
```bash
ollama pull qwen2.5:3b # recommended starting point
ollama serve # starts local server on localhost:11434
```
**Model comparison for this bot (Thai group chat, CPU only):**
| Model | Size | Speed (CPU) | Thai quality | Recommended for |
|---|---|---|---|---|
| `qwen2.5:3b` | 1.9 GB | ~10-15s | Good | **Starting point - best balance** |
| `qwen2.5` | 4.7 GB | ~30-60s | Very good | Better quality, slower |
| `supachai/llama-3-typhoon-v1.5:8b-instruct` | 4.9 GB | ~30-60s | Best (Thai-specific) | Best Thai, needs patience |
| `llama3.2:1b` | 1.3 GB | ~5s | Decent | Fastest, weakest Thai |
**Upgrade path:**
- Start with `qwen2.5:3b` -> test response quality
- If Thai quality not good enough -> upgrade to `qwen2.5` or `typhoon`
- If too slow for group chat -> downgrade to `llama3.2:1b`
**How Ollama works:**
- Downloads model in GGUF format (quantized - 4-bit instead of 16-bit = smaller, faster)
- Runs as background server on `localhost:11434`
- Your Python code sends HTTP requests - Ollama runs the full inference pipeline internally
- You only see plain text in -> plain text out
**Group chat trigger word:**
In group chats, bot responds only when message starts with the trigger word:
```
fujisan วันที่ 29 ทำอะไรบ้าง <- bot responds
วันที่ 29 ทำอะไรบ้าง <- bot ignores
```
---
### Ollama Model Selection
Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.
**Install:** download from ollama.com → then pull a model:
**Model comparison for this bot (Thai group chat, CPU only):**
| Model | Size | Speed (CPU) | Thai quality | Recommended for |
|---|---|---|---|---|
| \ | 1.9 GB | ~10–15s | Good | **Starting point — best balance** |
| \ | 4.7 GB | ~30–60s | Very good | Better quality, slower |
| \ | 4.9 GB | ~30–60s | Best (Thai-specific) | Best Thai, needs patience |
| \ | 1.3 GB | ~5s | Decent | Fastest, weakest Thai |
**Upgrade path:**
- Start with \ → test response quality
- If Thai quality not good enough → upgrade to \ or - If too slow for group chat → downgrade to
**How Ollama works:**
- Downloads model in GGUF format (quantized — 4-bit instead of 16-bit = smaller, faster)
- Runs as background server on - Your Python code sends HTTP requests — Ollama runs the full inference pipeline internally
- You only see plain text in → plain text out
**Group chat trigger word:**
In group chats, bot responds only when message starts with the trigger word:
---
### Step 4 — RAG (Retrieval-Augmented Generation)
**Goal:** Understand when and why context injection is not enough.
Topics:
- Token limit problem: when data > context window, you can't paste everything
- Embeddings — convert text chunks to vectors that capture semantic meaning
- Vector similarity search — find chunks most relevant to the query
- Retrieve relevant chunks → inject only those → LLM generates answer
- Tools: `chromadb`, `faiss`, OpenAI/Claude embeddings
**When RAG is needed vs not:**
| Data size | Approach |
|---|---|
| Small JSON / single document (trip-bot now) | Full context injection — no RAG |
| 10+ trips | Still probably fine with full injection |
| 100+ trips + reviews + guides | RAG — mandatory |
---
### Step 5 — If You Want to Build Your Own LLM
The full learning path from understanding to building from scratch:
```
Level 1 — Tokenization ✓ done (SentencePiece, BPE, input_ids)
Level 2 — Embeddings ✓ done (token IDs → vectors, WangchanBERTa)
Level 3 — Encoder transformer ✓ done (BERT, attention, NER, BIO)
Level 4 — Decoder / Generation → next (next-token prediction, autoregressive)
Level 5 — Pre-training → how model learns from raw text (loss, backprop)
Level 6 — Build your own small LLM → implement transformer in PyTorch from scratch
```
**Recommended resources in order:**
| Resource | What you learn |
|---|---|
| 3Blue1Brown — Neural Networks series | Backpropagation visually |
| Andrej Karpathy — makemore (YouTube) | Build bigram → MLP → transformer from scratch |
| Andrej Karpathy — nanoGPT (GitHub) | Minimal GPT in ~300 lines of PyTorch |
| HuggingFace course chapters 1–4 | Pre-training and fine-tuning at scale |
| Paper: "Attention Is All You Need" (2017) | Original transformer architecture |
nanoGPT is the single best resource — it implements exactly the pipeline above
(`tokenize → IDs → transformer layers → predict next token → repeat`) from zero.
---
## Summary
| Phase | Status | Key skill | What you built |
|---|---|---|---|
| 1 | Done | Rule-based NLP, keyword matching | Working trip chatbot |
| 2 | Done (learning) | Transformers, NER, BIO tagging, subword tokenization | Understood ML-based NLP |
| 3 | Done (production) | Prompt engineering, context injection, Typhoon API | Production LLM chatbot via `openai` SDK + `typhoon-v2.5-30b-a3b-instruct` |
| 4 (optional) | Future | Decoder architecture, pre-training, PyTorch | Build your own LLM |
|