Learning Roadmap — NLP to LLM
Phase 2 — ML-Based NLP & Transformers
Step 1 — Text Preprocessing
Goal: Understand how raw text becomes model input.
Topics:
- Tokenization at character / subword level (SentencePiece, BPE)
- What is a vocabulary? What is an
<UNK>token? - Padding and truncation (
max_length) - Attention mask — why it exists
- The difference between word tokenization (Phase 1 / PyThaiNLP) vs subword tokenization (transformers)
Practice:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
print(tokenizer("วันนี้ทำอะไรบ้าง"))
# See: input_ids, attention_mask
Step 2 — What is a Transformer?
Goal: Understand the architecture behind BERT / WangchanBERTa.
Topics:
- Encoder vs Decoder (BERT = encoder only)
- Self-attention — how tokens look at each other
- What is a
[CLS]token? What is[SEP]? - Pre-training vs fine-tuning — why we don't train from scratch
- Why WangchanBERTa for Thai? (pre-trained on Thai corpus)
Resource: "The Illustrated BERT" by Jay Alammar (Google it — best visual explanation)
Step 3 — Named Entity Recognition (NER)
Goal: Understand the task your bot needs.
Topics:
- What is NER? (extract "ฮาคุบะ" → LOCATION, "29 พ.ค." → DATE)
- BIO tagging scheme:
B-LOC,I-LOC,O - Token classification head on top of BERT
- How the model outputs one label per token
O= Outside — means "not an entity", not a label type
Example:
Input : "จากฮาคุบะไปคามิโคจิ"
Tokens: ["จาก", "ฮาคุบะ", "ไป", "คามิโคจิ"]
Labels: [O, B-LOC, O, B-LOC ]
NER vs Text Classification — two different tasks, same base model
Both tasks use WangchanBERTa as the encoder, but attach different heads:
WangchanBERTa encoder
│
├──► Token Classification Head → NER
│ (one label per token) AutoModelForTokenClassification
│
└──► Sequence Classification Head → Text Classification / Intent
(one label per [CLS] token) AutoModelForSequenceClassification
| NER | Text Classification | |
|---|---|---|
| Output | 1 label per token | 1 label per sentence |
| Labels | B-LOC, I-DATE, O … |
query_itinerary, greeting … |
| Loss computed on | Every token | Only [CLS] token |
| Answers | What entities are in the text | What the user wants to do |
Fine-tuning for NER and fine-tuning for text classification are separate training runs with separate datasets, even though both start from the same checkpoint.
Do you need both for this bot?
NER extracts what is in the text (date, place). Intent classification determines what action to take (query, delete, add).
| Bot complexity | What you need |
|---|---|
| Single-purpose — show activities only (current bot) | NER only — everything is implicitly query_itinerary |
| Multi-action — query + add + delete | NER + Intent classification |
| Full assistant — arbitrary tasks | LLM handles both implicitly |
For the current itinerary bot, NER alone is sufficient. Intent classification becomes necessary only when the bot supports multiple actions on the same entity.
PyThaiNLP NER vs WangchanBERTa NER
PyThaiNLP ships a built-in NER tagger (CRF-based). It is simpler but weaker:
| PyThaiNLP NER | WangchanBERTa NER | |
|---|---|---|
| Model type | CRF — classical ML | RoBERTa + token classification head |
| Tokenizer | newmm (dictionary word-level) | SentencePiece subword |
| Context awareness | None — labels each token independently | Full bidirectional attention |
| Size | ~MB, instant | ~500 MB, needs decent CPU/GPU |
| Use case | Quick prototype | Production accuracy |
The critical difference is context. CRF labels tokens one by one using local features. A transformer attends to the entire sentence, so the same surface form (e.g. "มัตสึโมโต") can be correctly labeled LOC or PERSON depending on surrounding words.
Recommended order: get PyThaiNLP NER working first to understand BIO output format, then replicate with WangchanBERTa to feel the quality difference.
NER real-world applications
NER's real home is document processing, not chatbots (LLMs replaced NER in chatbots after 2022):
| Industry | NER use |
|---|---|
| Medical | Extract drug names, dosages, symptoms from clinical notes → structured database |
| Legal | Extract parties, dates, clauses from contracts automatically |
| Finance | Extract company names, amounts, dates from earnings reports |
| HR | Resume parsing — extract skills, companies, job titles |
| Compliance | Flag PII (names, phones, IDs) in documents for redaction before sending to external APIs |
Step 4 — Fine-Tuning WangchanBERTa for NER
Goal: Train the model on your task.
Topics:
- What is LST20? (Thai NER dataset — your training data)
- How to load and format a dataset with HuggingFace
datasets TrainerAPI — the standard fine-tuning loop- Evaluation metrics: precision, recall, F1 (seqeval library)
- Saving and loading a checkpoint
Practice: Follow the HuggingFace NER fine-tuning tutorial, swap the dataset for LST20.
Step 5 — Plug into the Bot
Goal: Replace intent_engine.py with the trained model.
- Load model with
pipeline("ner", model="your-checkpoint") - Extract
originanddestinationentities - Uncomment the Phase 2 block in
webhook.py
Phase 3 — LLM + Context Injection (Production)
Application layer vs Deep understanding
| Goal | Approach |
|---|---|
| Build a working chatbot now | Use pre-trained LLM via Ollama API — done |
| Understand how LLM works internally | Study the full pipeline below |
| Build your own LLM from scratch | Follow the deep learning path at the end |
Step 1 — Why LLMs Change Everything
Goal: Understand what GPT / Claude actually do differently from BERT.
Topics:
- Decoder-only architecture (GPT, Llama, Typhoon) vs encoder-only (BERT, WangchanBERTa)
- BERT reads whole sentence bidirectionally — LLM reads left to right, generates new tokens
- Pre-training on massive text → emergent instruction following
- Why you don't need labeled data or fine-tuning for most tasks
- Zero-shot vs few-shot prompting
BERT (encoder): reads [full sentence] → outputs labels for existing tokens
LLM (decoder): reads [prompt] → predicts next token → appends → repeats
Step 2 — How LLM Generates Text (the behind-the-scenes pipeline)
This is what Ollama does invisibly when you call requests.post(...):
Your plain text (system prompt + user message)
│
▼ 1. Tokenize (SentencePiece / BPE — same concept as WangchanBERTa)
["▁วัน", "ที่", "▁29", "▁ทำ", "อะไร", "บ้าง"]
│
▼ 2. Token IDs (vocabulary lookup)
[2341, 891, 445, 1203, 567, 892]
│
▼ 3. Embedding lookup (each ID → 768/4096-dim vector)
│
▼ 4. Decoder transformer layers (left-to-right attention, ~32 layers)
│ each token attends to ALL previous tokens
│
▼ 5. Predict next token (softmax over full vocabulary)
│ "กิจกรรม" → 42%
│ "กำหนดการ" → 31%
│ ... pick highest (or sample)
│
▼ 6. Append predicted token → repeat from step 4
│ until <end> token is predicted
│
▼ 7. Detokenize → plain Thai text reply
You only see step 1 input and step 7 output. Everything in between is inside Ollama.
Why the itinerary JSON is text, not vectors:
- RAG converts documents to vectors for searching large corpora
- Your itinerary (~2,000 tokens) fits entirely in the context window
- No search needed — paste everything, LLM reads it all as tokens
- Every user message re-sends the full itinerary (stateless — no memory between calls)
Step 3 — Prompt Engineering
Goal: Learn to control LLM behavior through prompts.
Topics:
- System prompt vs user prompt
- Role prompting ("You are a Thai travel assistant...")
- Context injection — paste your JSON into the prompt
- Output formatting (ask for bullets, specific structure)
- Temperature / top-p — controls randomness of next-token sampling
# trip-bot system prompt structure
system = f"""
คุณเป็นผู้ช่วยท่องเที่ยวภาษาไทย
ทริปนี้อยู่ในช่วง 29 พ.ค. – 8 มิ.ย. 2569
ตอบตามข้อมูลนี้เท่านั้น: {itinerary_json}
"""
How LLM handles what Phase 1 needed code for:
| Phase 1 needed | LLM does automatically |
|---|---|
| Regex for date extraction | Reads "29 พ.ค." in context → understands it |
| Gazetteer for place names | Reads JSON → matches places in context |
| Intent classification | Infers what user wants from phrasing |
| Typo handling | Predicts most likely meaning from context |
Ollama Model Selection
Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.
Install: download from ollama.com then pull a model:
ollama pull qwen2.5:3b # recommended starting point
ollama serve # starts local server on localhost:11434
Model comparison for this bot (Thai group chat, CPU only):
| Model | Size | Speed (CPU) | Thai quality | Recommended for |
|---|---|---|---|---|
qwen2.5:3b |
1.9 GB | ~10-15s | Good | Starting point - best balance |
qwen2.5 |
4.7 GB | ~30-60s | Very good | Better quality, slower |
supachai/llama-3-typhoon-v1.5:8b-instruct |
4.9 GB | ~30-60s | Best (Thai-specific) | Best Thai, needs patience |
llama3.2:1b |
1.3 GB | ~5s | Decent | Fastest, weakest Thai |
Upgrade path:
- Start with
qwen2.5:3b-> test response quality - If Thai quality not good enough -> upgrade to
qwen2.5ortyphoon - If too slow for group chat -> downgrade to
llama3.2:1b
How Ollama works:
- Downloads model in GGUF format (quantized - 4-bit instead of 16-bit = smaller, faster)
- Runs as background server on
localhost:11434 - Your Python code sends HTTP requests - Ollama runs the full inference pipeline internally
- You only see plain text in -> plain text out
Group chat trigger word: In group chats, bot responds only when message starts with the trigger word:
fujisan วันที่ 29 ทำอะไรบ้าง <- bot responds
วันที่ 29 ทำอะไรบ้าง <- bot ignores
Ollama Model Selection
Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.
Install: download from ollama.com → then pull a model:
Model comparison for this bot (Thai group chat, CPU only):
| Model | Size | Speed (CPU) | Thai quality | Recommended for |
|---|---|---|---|---|
| \ | 1.9 GB | ~10–15s | Good | Starting point — best balance |
| \ | 4.7 GB | ~30–60s | Very good | Better quality, slower |
| \ | 4.9 GB | ~30–60s | Best (Thai-specific) | Best Thai, needs patience |
| \ | 1.3 GB | ~5s | Decent | Fastest, weakest Thai |
Upgrade path:
- Start with \ → test response quality
- If Thai quality not good enough → upgrade to \ or - If too slow for group chat → downgrade to How Ollama works:
- Downloads model in GGUF format (quantized — 4-bit instead of 16-bit = smaller, faster)
- Runs as background server on - Your Python code sends HTTP requests — Ollama runs the full inference pipeline internally
- You only see plain text in → plain text out
Group chat trigger word: In group chats, bot responds only when message starts with the trigger word:
Step 4 — RAG (Retrieval-Augmented Generation)
Goal: Understand when and why context injection is not enough.
Topics:
- Token limit problem: when data > context window, you can't paste everything
- Embeddings — convert text chunks to vectors that capture semantic meaning
- Vector similarity search — find chunks most relevant to the query
- Retrieve relevant chunks → inject only those → LLM generates answer
- Tools:
chromadb,faiss, OpenAI/Claude embeddings
When RAG is needed vs not:
| Data size | Approach |
|---|---|
| Small JSON / single document (trip-bot now) | Full context injection — no RAG |
| 10+ trips | Still probably fine with full injection |
| 100+ trips + reviews + guides | RAG — mandatory |
Step 5 — If You Want to Build Your Own LLM
The full learning path from understanding to building from scratch:
Level 1 — Tokenization ✓ done (SentencePiece, BPE, input_ids)
Level 2 — Embeddings ✓ done (token IDs → vectors, WangchanBERTa)
Level 3 — Encoder transformer ✓ done (BERT, attention, NER, BIO)
Level 4 — Decoder / Generation → next (next-token prediction, autoregressive)
Level 5 — Pre-training → how model learns from raw text (loss, backprop)
Level 6 — Build your own small LLM → implement transformer in PyTorch from scratch
Recommended resources in order:
| Resource | What you learn |
|---|---|
| 3Blue1Brown — Neural Networks series | Backpropagation visually |
| Andrej Karpathy — makemore (YouTube) | Build bigram → MLP → transformer from scratch |
| Andrej Karpathy — nanoGPT (GitHub) | Minimal GPT in ~300 lines of PyTorch |
| HuggingFace course chapters 1–4 | Pre-training and fine-tuning at scale |
| Paper: "Attention Is All You Need" (2017) | Original transformer architecture |
nanoGPT is the single best resource — it implements exactly the pipeline above
(tokenize → IDs → transformer layers → predict next token → repeat) from zero.
Summary
| Phase | Status | Key skill | What you built |
|---|---|---|---|
| 1 | Done | Rule-based NLP, keyword matching | Working trip chatbot |
| 2 | Done (learning) | Transformers, NER, BIO tagging, subword tokenization | Understood ML-based NLP |
| 3 | Done (production) | Prompt engineering, context injection, Typhoon API | Production LLM chatbot via openai SDK + typhoon-v2.5-30b-a3b-instruct |
| 4 (optional) | Future | Decoder architecture, pre-training, PyTorch | Build your own LLM |