Spaces:

TaipongK
/

trip-bot

Sleeping

App Files Files Community

trip-bot / dev_tools /learning_roadmap.md

Pongsatorn Kanjanasantisak

using LLM typhoon-v2.5-30b-a3b-instruct

570280a 2 months ago

preview code

raw

history blame contribute delete

14.5 kB

	# Learning Roadmap — NLP to LLM

	---

	## Phase 2 — ML-Based NLP & Transformers

	### Step 1 — Text Preprocessing
	Goal: Understand how raw text becomes model input.

	Topics:
	- Tokenization at character / subword level (SentencePiece, BPE)
	- What is a vocabulary? What is an `<UNK>` token?
	- Padding and truncation (`max_length`)
	- Attention mask — why it exists
	- The difference between word tokenization (Phase 1 / PyThaiNLP) vs subword tokenization (transformers)

	Practice:
	```python
	from transformers import AutoTokenizer
	tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
	print(tokenizer("วันนี้ทำอะไรบ้าง"))
	# See: input_ids, attention_mask
	```

	---

	### Step 2 — What is a Transformer?
	Goal: Understand the architecture behind BERT / WangchanBERTa.

	Topics:
	- Encoder vs Decoder (BERT = encoder only)
	- Self-attention — how tokens look at each other
	- What is a `[CLS]` token? What is `[SEP]`?
	- Pre-training vs fine-tuning — why we don't train from scratch
	- Why WangchanBERTa for Thai? (pre-trained on Thai corpus)

	Resource: "The Illustrated BERT" by Jay Alammar (Google it — best visual explanation)

	---

	### Step 3 — Named Entity Recognition (NER)
	Goal: Understand the task your bot needs.

	Topics:
	- What is NER? (extract "ฮาคุบะ" → LOCATION, "29 พ.ค." → DATE)
	- BIO tagging scheme: `B-LOC`, `I-LOC`, `O`
	- Token classification head on top of BERT
	- How the model outputs one label per token
	- `O` = Outside — means "not an entity", not a label type

	Example:
	```
	Input : "จากฮาคุบะไปคามิโคจิ"
	Tokens: ["จาก", "ฮาคุบะ", "ไป", "คามิโคจิ"]
	Labels: [O, B-LOC, O, B-LOC ]
	```

	#### NER vs Text Classification — two different tasks, same base model

	Both tasks use WangchanBERTa as the encoder, but attach different heads:

	```
	WangchanBERTa encoder
	│
	├──► Token Classification Head → NER
	│ (one label per token) AutoModelForTokenClassification
	│
	└──► Sequence Classification Head → Text Classification / Intent
	(one label per [CLS] token) AutoModelForSequenceClassification
	```

	\| \| NER \| Text Classification \|
	\|---\|---\|---\|
	\| Output \| 1 label per token \| 1 label per sentence \|
	\| Labels \| `B-LOC`, `I-DATE`, `O` … \| `query_itinerary`, `greeting` … \|
	\| Loss computed on \| Every token \| Only `[CLS]` token \|
	\| Answers \| What entities are in the text \| What the user wants to do \|

	Fine-tuning for NER and fine-tuning for text classification are separate training runs with separate datasets, even though both start from the same checkpoint.

	#### Do you need both for this bot?

	NER extracts what is in the text (date, place). Intent classification determines what action to take (query, delete, add).

	\| Bot complexity \| What you need \|
	\|---\|---\|
	\| Single-purpose — show activities only (current bot) \| NER only — everything is implicitly `query_itinerary` \|
	\| Multi-action — query + add + delete \| NER + Intent classification \|
	\| Full assistant — arbitrary tasks \| LLM handles both implicitly \|

	For the current itinerary bot, NER alone is sufficient. Intent classification becomes necessary only when the bot supports multiple actions on the same entity.

	#### PyThaiNLP NER vs WangchanBERTa NER

	PyThaiNLP ships a built-in NER tagger (CRF-based). It is simpler but weaker:

	\| \| PyThaiNLP NER \| WangchanBERTa NER \|
	\|---\|---\|---\|
	\| Model type \| CRF — classical ML \| RoBERTa + token classification head \|
	\| Tokenizer \| newmm (dictionary word-level) \| SentencePiece subword \|
	\| Context awareness \| None — labels each token independently \| Full bidirectional attention \|
	\| Size \| ~MB, instant \| ~500 MB, needs decent CPU/GPU \|
	\| Use case \| Quick prototype \| Production accuracy \|

	The critical difference is context. CRF labels tokens one by one using local features. A transformer attends to the entire sentence, so the same surface form (e.g. "มัตสึโมโต") can be correctly labeled LOC or PERSON depending on surrounding words.

	Recommended order: get PyThaiNLP NER working first to understand BIO output format, then replicate with WangchanBERTa to feel the quality difference.

	#### NER real-world applications

	NER's real home is document processing, not chatbots (LLMs replaced NER in chatbots after 2022):

	\| Industry \| NER use \|
	\|---\|---\|
	\| Medical \| Extract drug names, dosages, symptoms from clinical notes → structured database \|
	\| Legal \| Extract parties, dates, clauses from contracts automatically \|
	\| Finance \| Extract company names, amounts, dates from earnings reports \|
	\| HR \| Resume parsing — extract skills, companies, job titles \|
	\| Compliance \| Flag PII (names, phones, IDs) in documents for redaction before sending to external APIs \|

	---

	### Step 4 — Fine-Tuning WangchanBERTa for NER
	Goal: Train the model on your task.

	Topics:
	- What is LST20? (Thai NER dataset — your training data)
	- How to load and format a dataset with HuggingFace `datasets`
	- `Trainer` API — the standard fine-tuning loop
	- Evaluation metrics: precision, recall, F1 (seqeval library)
	- Saving and loading a checkpoint

	Practice: Follow the HuggingFace NER fine-tuning tutorial, swap the dataset for LST20.

	---

	### Step 5 — Plug into the Bot
	Goal: Replace `intent_engine.py` with the trained model.

	- Load model with `pipeline("ner", model="your-checkpoint")`
	- Extract `origin` and `destination` entities
	- Uncomment the Phase 2 block in `webhook.py`

	---

	## Phase 3 — LLM + Context Injection (Production)

	### Application layer vs Deep understanding

	\| Goal \| Approach \|
	\|---\|---\|
	\| Build a working chatbot now \| Use pre-trained LLM via Ollama API — done \|
	\| Understand how LLM works internally \| Study the full pipeline below \|
	\| Build your own LLM from scratch \| Follow the deep learning path at the end \|

	---

	### Step 1 — Why LLMs Change Everything
	Goal: Understand what GPT / Claude actually do differently from BERT.

	Topics:
	- Decoder-only architecture (GPT, Llama, Typhoon) vs encoder-only (BERT, WangchanBERTa)
	- BERT reads whole sentence bidirectionally — LLM reads left to right, generates new tokens
	- Pre-training on massive text → emergent instruction following
	- Why you don't need labeled data or fine-tuning for most tasks
	- Zero-shot vs few-shot prompting

	```
	BERT (encoder): reads [full sentence] → outputs labels for existing tokens
	LLM (decoder): reads [prompt] → predicts next token → appends → repeats
	```

	---

	### Step 2 — How LLM Generates Text (the behind-the-scenes pipeline)

	This is what Ollama does invisibly when you call `requests.post(...)`:

	```
	Your plain text (system prompt + user message)
	│
	▼ 1. Tokenize (SentencePiece / BPE — same concept as WangchanBERTa)
	["▁วัน", "ที่", "▁29", "▁ทำ", "อะไร", "บ้าง"]
	│
	▼ 2. Token IDs (vocabulary lookup)
	[2341, 891, 445, 1203, 567, 892]
	│
	▼ 3. Embedding lookup (each ID → 768/4096-dim vector)
	│
	▼ 4. Decoder transformer layers (left-to-right attention, ~32 layers)
	│ each token attends to ALL previous tokens
	│
	▼ 5. Predict next token (softmax over full vocabulary)
	│ "กิจกรรม" → 42%
	│ "กำหนดการ" → 31%
	│ ... pick highest (or sample)
	│
	▼ 6. Append predicted token → repeat from step 4
	│ until <end> token is predicted
	│
	▼ 7. Detokenize → plain Thai text reply
	```

	You only see step 1 input and step 7 output. Everything in between is inside Ollama.

	Why the itinerary JSON is text, not vectors:
	- RAG converts documents to vectors for searching large corpora
	- Your itinerary (~2,000 tokens) fits entirely in the context window
	- No search needed — paste everything, LLM reads it all as tokens
	- Every user message re-sends the full itinerary (stateless — no memory between calls)

	---

	### Step 3 — Prompt Engineering
	Goal: Learn to control LLM behavior through prompts.

	Topics:
	- System prompt vs user prompt
	- Role prompting ("You are a Thai travel assistant...")
	- Context injection — paste your JSON into the prompt
	- Output formatting (ask for bullets, specific structure)
	- Temperature / top-p — controls randomness of next-token sampling

	```python
	# trip-bot system prompt structure
	system = f"""
	คุณเป็นผู้ช่วยท่องเที่ยวภาษาไทย
	ทริปนี้อยู่ในช่วง 29 พ.ค. – 8 มิ.ย. 2569
	ตอบตามข้อมูลนี้เท่านั้น: {itinerary_json}
	"""
	```

	How LLM handles what Phase 1 needed code for:

	\| Phase 1 needed \| LLM does automatically \|
	\|---\|---\|
	\| Regex for date extraction \| Reads "29 พ.ค." in context → understands it \|
	\| Gazetteer for place names \| Reads JSON → matches places in context \|
	\| Intent classification \| Infers what user wants from phrasing \|
	\| Typo handling \| Predicts most likely meaning from context \|

	---

	### Ollama Model Selection

	Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.

	Install: download from ollama.com then pull a model:

	```bash
	ollama pull qwen2.5:3b # recommended starting point
	ollama serve # starts local server on localhost:11434
	```

	Model comparison for this bot (Thai group chat, CPU only):

	\| Model \| Size \| Speed (CPU) \| Thai quality \| Recommended for \|
	\|---\|---\|---\|---\|---\|
	\| `qwen2.5:3b` \| 1.9 GB \| ~10-15s \| Good \| Starting point - best balance \|
	\| `qwen2.5` \| 4.7 GB \| ~30-60s \| Very good \| Better quality, slower \|
	\| `supachai/llama-3-typhoon-v1.5:8b-instruct` \| 4.9 GB \| ~30-60s \| Best (Thai-specific) \| Best Thai, needs patience \|
	\| `llama3.2:1b` \| 1.3 GB \| ~5s \| Decent \| Fastest, weakest Thai \|

	Upgrade path:
	- Start with `qwen2.5:3b` -> test response quality
	- If Thai quality not good enough -> upgrade to `qwen2.5` or `typhoon`
	- If too slow for group chat -> downgrade to `llama3.2:1b`

	How Ollama works:
	- Downloads model in GGUF format (quantized - 4-bit instead of 16-bit = smaller, faster)
	- Runs as background server on `localhost:11434`
	- Your Python code sends HTTP requests - Ollama runs the full inference pipeline internally
	- You only see plain text in -> plain text out

	Group chat trigger word:
	In group chats, bot responds only when message starts with the trigger word:
	```
	fujisan วันที่ 29 ทำอะไรบ้าง <- bot responds
	วันที่ 29 ทำอะไรบ้าง <- bot ignores
	```

	---

	### Ollama Model Selection

	Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.

	Install: download from ollama.com → then pull a model:

	Model comparison for this bot (Thai group chat, CPU only):

	\| Model \| Size \| Speed (CPU) \| Thai quality \| Recommended for \|
	\|---\|---\|---\|---\|---\|
	\| \ \| 1.9 GB \| ~10–15s \| Good \| Starting point — best balance \|
	\| \ \| 4.7 GB \| ~30–60s \| Very good \| Better quality, slower \|
	\| \ \| 4.9 GB \| ~30–60s \| Best (Thai-specific) \| Best Thai, needs patience \|
	\| \ \| 1.3 GB \| ~5s \| Decent \| Fastest, weakest Thai \|

	Upgrade path:
	- Start with \ → test response quality
	- If Thai quality not good enough → upgrade to \ or - If too slow for group chat → downgrade to
	How Ollama works:
	- Downloads model in GGUF format (quantized — 4-bit instead of 16-bit = smaller, faster)
	- Runs as background server on - Your Python code sends HTTP requests — Ollama runs the full inference pipeline internally
	- You only see plain text in → plain text out

	Group chat trigger word:
	In group chats, bot responds only when message starts with the trigger word:
	---

	### Step 4 — RAG (Retrieval-Augmented Generation)
	Goal: Understand when and why context injection is not enough.

	Topics:
	- Token limit problem: when data > context window, you can't paste everything
	- Embeddings — convert text chunks to vectors that capture semantic meaning
	- Vector similarity search — find chunks most relevant to the query
	- Retrieve relevant chunks → inject only those → LLM generates answer
	- Tools: `chromadb`, `faiss`, OpenAI/Claude embeddings

	When RAG is needed vs not:

	\| Data size \| Approach \|
	\|---\|---\|
	\| Small JSON / single document (trip-bot now) \| Full context injection — no RAG \|
	\| 10+ trips \| Still probably fine with full injection \|
	\| 100+ trips + reviews + guides \| RAG — mandatory \|

	---

	### Step 5 — If You Want to Build Your Own LLM

	The full learning path from understanding to building from scratch:

	```
	Level 1 — Tokenization ✓ done (SentencePiece, BPE, input_ids)
	Level 2 — Embeddings ✓ done (token IDs → vectors, WangchanBERTa)
	Level 3 — Encoder transformer ✓ done (BERT, attention, NER, BIO)
	Level 4 — Decoder / Generation → next (next-token prediction, autoregressive)
	Level 5 — Pre-training → how model learns from raw text (loss, backprop)
	Level 6 — Build your own small LLM → implement transformer in PyTorch from scratch
	```

	Recommended resources in order:

	\| Resource \| What you learn \|
	\|---\|---\|
	\| 3Blue1Brown — Neural Networks series \| Backpropagation visually \|
	\| Andrej Karpathy — makemore (YouTube) \| Build bigram → MLP → transformer from scratch \|
	\| Andrej Karpathy — nanoGPT (GitHub) \| Minimal GPT in ~300 lines of PyTorch \|
	\| HuggingFace course chapters 1–4 \| Pre-training and fine-tuning at scale \|
	\| Paper: "Attention Is All You Need" (2017) \| Original transformer architecture \|

	nanoGPT is the single best resource — it implements exactly the pipeline above
	(`tokenize → IDs → transformer layers → predict next token → repeat`) from zero.

	---

	## Summary

	\| Phase \| Status \| Key skill \| What you built \|
	\|---\|---\|---\|---\|
	\| 1 \| Done \| Rule-based NLP, keyword matching \| Working trip chatbot \|
	\| 2 \| Done (learning) \| Transformers, NER, BIO tagging, subword tokenization \| Understood ML-based NLP \|
	\| 3 \| Done (production) \| Prompt engineering, context injection, Typhoon API \| Production LLM chatbot via `openai` SDK + `typhoon-v2.5-30b-a3b-instruct` \|
	\| 4 (optional) \| Future \| Decoder architecture, pre-training, PyTorch \| Build your own LLM \|