Spaces:

TaipongK
/

trip-bot

Sleeping

File size: 14,536 Bytes

# Learning Roadmap — NLP to LLM

---

## Phase 2 — ML-Based NLP & Transformers

### Step 1 — Text Preprocessing
**Goal:** Understand how raw text becomes model input.

Topics:
- Tokenization at character / subword level (SentencePiece, BPE)
- What is a vocabulary? What is an `<UNK>` token?
- Padding and truncation (`max_length`)
- Attention mask — why it exists
- The difference between word tokenization (Phase 1 / PyThaiNLP) vs subword tokenization (transformers)

Practice:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
print(tokenizer("วันนี้ทำอะไรบ้าง"))
# See: input_ids, attention_mask
```

---

### Step 2 — What is a Transformer?
**Goal:** Understand the architecture behind BERT / WangchanBERTa.

Topics:
- Encoder vs Decoder (BERT = encoder only)
- Self-attention — how tokens look at each other
- What is a `[CLS]` token? What is `[SEP]`?
- Pre-training vs fine-tuning — why we don't train from scratch
- Why WangchanBERTa for Thai? (pre-trained on Thai corpus)

Resource: "The Illustrated BERT" by Jay Alammar (Google it — best visual explanation)

---

### Step 3 — Named Entity Recognition (NER)
**Goal:** Understand the task your bot needs.

Topics:
- What is NER? (extract "ฮาคุบะ" → LOCATION, "29 พ.ค." → DATE)
- BIO tagging scheme: `B-LOC`, `I-LOC`, `O`
- Token classification head on top of BERT
- How the model outputs one label per token
- `O` = Outside — means "not an entity", not a label type

Example:
```
Input : "จากฮาคุบะไปคามิโคจิ"
Tokens: ["จาก", "ฮาคุบะ", "ไป", "คามิโคจิ"]
Labels: [O,     B-LOC,    O,    B-LOC      ]
```

#### NER vs Text Classification — two different tasks, same base model

Both tasks use WangchanBERTa as the encoder, but attach different heads:

```
WangchanBERTa encoder
        │
        ├──► Token Classification Head  →  NER
        │     (one label per token)         AutoModelForTokenClassification
        │
        └──► Sequence Classification Head  →  Text Classification / Intent
              (one label per [CLS] token)      AutoModelForSequenceClassification
```

| | NER | Text Classification |
|---|---|---|
| Output | 1 label per token | 1 label per sentence |
| Labels | `B-LOC`, `I-DATE`, `O` … | `query_itinerary`, `greeting` … |
| Loss computed on | Every token | Only `[CLS]` token |
| Answers | *What* entities are in the text | *What* the user wants to do |

Fine-tuning for NER and fine-tuning for text classification are **separate training runs** with separate datasets, even though both start from the same checkpoint.

#### Do you need both for this bot?

NER extracts *what* is in the text (date, place). Intent classification determines *what action* to take (query, delete, add).

| Bot complexity | What you need |
|---|---|
| Single-purpose — show activities only (current bot) | NER only — everything is implicitly `query_itinerary` |
| Multi-action — query + add + delete | NER + Intent classification |
| Full assistant — arbitrary tasks | LLM handles both implicitly |

For the current itinerary bot, **NER alone is sufficient**. Intent classification becomes necessary only when the bot supports multiple actions on the same entity.

#### PyThaiNLP NER vs WangchanBERTa NER

PyThaiNLP ships a built-in NER tagger (CRF-based). It is simpler but weaker:

| | PyThaiNLP NER | WangchanBERTa NER |
|---|---|---|
| Model type | CRF — classical ML | RoBERTa + token classification head |
| Tokenizer | newmm (dictionary word-level) | SentencePiece subword |
| Context awareness | None — labels each token independently | Full bidirectional attention |
| Size | ~MB, instant | ~500 MB, needs decent CPU/GPU |
| Use case | Quick prototype | Production accuracy |

The critical difference is **context**. CRF labels tokens one by one using local features. A transformer attends to the entire sentence, so the same surface form (e.g. "มัตสึโมโต") can be correctly labeled LOC or PERSON depending on surrounding words.

**Recommended order:** get PyThaiNLP NER working first to understand BIO output format, then replicate with WangchanBERTa to feel the quality difference.

#### NER real-world applications

NER's real home is **document processing**, not chatbots (LLMs replaced NER in chatbots after 2022):

| Industry | NER use |
|---|---|
| Medical | Extract drug names, dosages, symptoms from clinical notes → structured database |
| Legal | Extract parties, dates, clauses from contracts automatically |
| Finance | Extract company names, amounts, dates from earnings reports |
| HR | Resume parsing — extract skills, companies, job titles |
| Compliance | Flag PII (names, phones, IDs) in documents for redaction before sending to external APIs |

---

### Step 4 — Fine-Tuning WangchanBERTa for NER
**Goal:** Train the model on your task.

Topics:
- What is LST20? (Thai NER dataset — your training data)
- How to load and format a dataset with HuggingFace `datasets`
- `Trainer` API — the standard fine-tuning loop
- Evaluation metrics: precision, recall, F1 (seqeval library)
- Saving and loading a checkpoint

Practice: Follow the HuggingFace NER fine-tuning tutorial, swap the dataset for LST20.

---

### Step 5 — Plug into the Bot
**Goal:** Replace `intent_engine.py` with the trained model.

- Load model with `pipeline("ner", model="your-checkpoint")`
- Extract `origin` and `destination` entities
- Uncomment the Phase 2 block in `webhook.py`

---

## Phase 3 — LLM + Context Injection (Production)

### Application layer vs Deep understanding

| Goal | Approach |
|---|---|
| Build a working chatbot now | Use pre-trained LLM via Ollama API — done |
| Understand how LLM works internally | Study the full pipeline below |
| Build your own LLM from scratch | Follow the deep learning path at the end |

---

### Step 1 — Why LLMs Change Everything
**Goal:** Understand what GPT / Claude actually do differently from BERT.

Topics:
- Decoder-only architecture (GPT, Llama, Typhoon) vs encoder-only (BERT, WangchanBERTa)
- BERT reads whole sentence bidirectionally — LLM reads left to right, generates new tokens
- Pre-training on massive text → emergent instruction following
- Why you don't need labeled data or fine-tuning for most tasks
- Zero-shot vs few-shot prompting

```
BERT (encoder):   reads [full sentence] → outputs labels for existing tokens
LLM  (decoder):   reads [prompt] → predicts next token → appends → repeats
```

---

### Step 2 — How LLM Generates Text (the behind-the-scenes pipeline)

This is what Ollama does invisibly when you call `requests.post(...)`:

```
Your plain text (system prompt + user message)
    │
    ▼  1. Tokenize  (SentencePiece / BPE — same concept as WangchanBERTa)
["▁วัน", "ที่", "▁29", "▁ทำ", "อะไร", "บ้าง"]
    │
    ▼  2. Token IDs  (vocabulary lookup)
[2341, 891, 445, 1203, 567, 892]
    │
    ▼  3. Embedding lookup  (each ID → 768/4096-dim vector)
    │
    ▼  4. Decoder transformer layers  (left-to-right attention, ~32 layers)
    │      each token attends to ALL previous tokens
    │
    ▼  5. Predict next token  (softmax over full vocabulary)
    │      "กิจกรรม" → 42%
    │      "กำหนดการ" → 31%
    │      ... pick highest (or sample)
    │
    ▼  6. Append predicted token → repeat from step 4
    │      until <end> token is predicted
    │
    ▼  7. Detokenize → plain Thai text reply
```

You only see step 1 input and step 7 output. Everything in between is inside Ollama.

**Why the itinerary JSON is text, not vectors:**
- RAG converts documents to vectors for *searching large corpora*
- Your itinerary (~2,000 tokens) fits entirely in the context window
- No search needed — paste everything, LLM reads it all as tokens
- Every user message re-sends the full itinerary (stateless — no memory between calls)

---

### Step 3 — Prompt Engineering
**Goal:** Learn to control LLM behavior through prompts.

Topics:
- System prompt vs user prompt
- Role prompting ("You are a Thai travel assistant...")
- Context injection — paste your JSON into the prompt
- Output formatting (ask for bullets, specific structure)
- Temperature / top-p — controls randomness of next-token sampling

```python
# trip-bot system prompt structure
system = f"""
คุณเป็นผู้ช่วยท่องเที่ยวภาษาไทย
ทริปนี้อยู่ในช่วง 29 พ.ค. – 8 มิ.ย. 2569
ตอบตามข้อมูลนี้เท่านั้น: {itinerary_json}
"""
```

**How LLM handles what Phase 1 needed code for:**

| Phase 1 needed | LLM does automatically |
|---|---|
| Regex for date extraction | Reads "29 พ.ค." in context → understands it |
| Gazetteer for place names | Reads JSON → matches places in context |
| Intent classification | Infers what user wants from phrasing |
| Typo handling | Predicts most likely meaning from context |

---

### Ollama Model Selection

Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.

**Install:** download from ollama.com then pull a model:

```bash
ollama pull qwen2.5:3b    # recommended starting point
ollama serve              # starts local server on localhost:11434
```

**Model comparison for this bot (Thai group chat, CPU only):**

| Model | Size | Speed (CPU) | Thai quality | Recommended for |
|---|---|---|---|---|
| `qwen2.5:3b` | 1.9 GB | ~10-15s | Good | **Starting point - best balance** |
| `qwen2.5` | 4.7 GB | ~30-60s | Very good | Better quality, slower |
| `supachai/llama-3-typhoon-v1.5:8b-instruct` | 4.9 GB | ~30-60s | Best (Thai-specific) | Best Thai, needs patience |
| `llama3.2:1b` | 1.3 GB | ~5s | Decent | Fastest, weakest Thai |

**Upgrade path:**
- Start with `qwen2.5:3b` -> test response quality
- If Thai quality not good enough -> upgrade to `qwen2.5` or `typhoon`
- If too slow for group chat -> downgrade to `llama3.2:1b`

**How Ollama works:**
- Downloads model in GGUF format (quantized - 4-bit instead of 16-bit = smaller, faster)
- Runs as background server on `localhost:11434`
- Your Python code sends HTTP requests - Ollama runs the full inference pipeline internally
- You only see plain text in -> plain text out

**Group chat trigger word:**
In group chats, bot responds only when message starts with the trigger word:
```
fujisan วันที่ 29 ทำอะไรบ้าง   <- bot responds
วันที่ 29 ทำอะไรบ้าง           <- bot ignores
```

---

### Ollama Model Selection

Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.

**Install:** download from ollama.com → then pull a model:

**Model comparison for this bot (Thai group chat, CPU only):**

| Model | Size | Speed (CPU) | Thai quality | Recommended for |
|---|---|---|---|---|
| \ | 1.9 GB | ~10–15s | Good | **Starting point — best balance** |
| \ | 4.7 GB | ~30–60s | Very good | Better quality, slower |
| \ | 4.9 GB | ~30–60s | Best (Thai-specific) | Best Thai, needs patience |
| \ | 1.3 GB | ~5s | Decent | Fastest, weakest Thai |

**Upgrade path:**
- Start with \ → test response quality
- If Thai quality not good enough → upgrade to \ or - If too slow for group chat → downgrade to 
**How Ollama works:**
- Downloads model in GGUF format (quantized — 4-bit instead of 16-bit = smaller, faster)
- Runs as background server on - Your Python code sends HTTP requests — Ollama runs the full inference pipeline internally
- You only see plain text in → plain text out

**Group chat trigger word:**
In group chats, bot responds only when message starts with the trigger word:
---

### Step 4 — RAG (Retrieval-Augmented Generation)
**Goal:** Understand when and why context injection is not enough.

Topics:
- Token limit problem: when data > context window, you can't paste everything
- Embeddings — convert text chunks to vectors that capture semantic meaning
- Vector similarity search — find chunks most relevant to the query
- Retrieve relevant chunks → inject only those → LLM generates answer
- Tools: `chromadb`, `faiss`, OpenAI/Claude embeddings

**When RAG is needed vs not:**

| Data size | Approach |
|---|---|
| Small JSON / single document (trip-bot now) | Full context injection — no RAG |
| 10+ trips | Still probably fine with full injection |
| 100+ trips + reviews + guides | RAG — mandatory |

---

### Step 5 — If You Want to Build Your Own LLM

The full learning path from understanding to building from scratch:

```
Level 1 — Tokenization              ✓ done (SentencePiece, BPE, input_ids)
Level 2 — Embeddings                ✓ done (token IDs → vectors, WangchanBERTa)
Level 3 — Encoder transformer       ✓ done (BERT, attention, NER, BIO)
Level 4 — Decoder / Generation      → next (next-token prediction, autoregressive)
Level 5 — Pre-training              → how model learns from raw text (loss, backprop)
Level 6 — Build your own small LLM  → implement transformer in PyTorch from scratch
```

**Recommended resources in order:**

| Resource | What you learn |
|---|---|
| 3Blue1Brown — Neural Networks series | Backpropagation visually |
| Andrej Karpathy — makemore (YouTube) | Build bigram → MLP → transformer from scratch |
| Andrej Karpathy — nanoGPT (GitHub) | Minimal GPT in ~300 lines of PyTorch |
| HuggingFace course chapters 1–4 | Pre-training and fine-tuning at scale |
| Paper: "Attention Is All You Need" (2017) | Original transformer architecture |

nanoGPT is the single best resource — it implements exactly the pipeline above
(`tokenize → IDs → transformer layers → predict next token → repeat`) from zero.

---

## Summary

| Phase | Status | Key skill | What you built |
|---|---|---|---|
| 1 | Done | Rule-based NLP, keyword matching | Working trip chatbot |
| 2 | Done (learning) | Transformers, NER, BIO tagging, subword tokenization | Understood ML-based NLP |
| 3 | Done (production) | Prompt engineering, context injection, Typhoon API | Production LLM chatbot via `openai` SDK + `typhoon-v2.5-30b-a3b-instruct` |
| 4 (optional) | Future | Decoder architecture, pre-training, PyTorch | Build your own LLM |