trip-bot / dev_tools /learning_roadmap.md
Pongsatorn Kanjanasantisak
using LLM typhoon-v2.5-30b-a3b-instruct
570280a

Learning Roadmap — NLP to LLM


Phase 2 — ML-Based NLP & Transformers

Step 1 — Text Preprocessing

Goal: Understand how raw text becomes model input.

Topics:

  • Tokenization at character / subword level (SentencePiece, BPE)
  • What is a vocabulary? What is an <UNK> token?
  • Padding and truncation (max_length)
  • Attention mask — why it exists
  • The difference between word tokenization (Phase 1 / PyThaiNLP) vs subword tokenization (transformers)

Practice:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
print(tokenizer("วันนี้ทำอะไรบ้าง"))
# See: input_ids, attention_mask

Step 2 — What is a Transformer?

Goal: Understand the architecture behind BERT / WangchanBERTa.

Topics:

  • Encoder vs Decoder (BERT = encoder only)
  • Self-attention — how tokens look at each other
  • What is a [CLS] token? What is [SEP]?
  • Pre-training vs fine-tuning — why we don't train from scratch
  • Why WangchanBERTa for Thai? (pre-trained on Thai corpus)

Resource: "The Illustrated BERT" by Jay Alammar (Google it — best visual explanation)


Step 3 — Named Entity Recognition (NER)

Goal: Understand the task your bot needs.

Topics:

  • What is NER? (extract "ฮาคุบะ" → LOCATION, "29 พ.ค." → DATE)
  • BIO tagging scheme: B-LOC, I-LOC, O
  • Token classification head on top of BERT
  • How the model outputs one label per token
  • O = Outside — means "not an entity", not a label type

Example:

Input : "จากฮาคุบะไปคามิโคจิ"
Tokens: ["จาก", "ฮาคุบะ", "ไป", "คามิโคจิ"]
Labels: [O,     B-LOC,    O,    B-LOC      ]

NER vs Text Classification — two different tasks, same base model

Both tasks use WangchanBERTa as the encoder, but attach different heads:

WangchanBERTa encoder
        │
        ├──► Token Classification Head  →  NER
        │     (one label per token)         AutoModelForTokenClassification
        │
        └──► Sequence Classification Head  →  Text Classification / Intent
              (one label per [CLS] token)      AutoModelForSequenceClassification
NER Text Classification
Output 1 label per token 1 label per sentence
Labels B-LOC, I-DATE, O query_itinerary, greeting
Loss computed on Every token Only [CLS] token
Answers What entities are in the text What the user wants to do

Fine-tuning for NER and fine-tuning for text classification are separate training runs with separate datasets, even though both start from the same checkpoint.

Do you need both for this bot?

NER extracts what is in the text (date, place). Intent classification determines what action to take (query, delete, add).

Bot complexity What you need
Single-purpose — show activities only (current bot) NER only — everything is implicitly query_itinerary
Multi-action — query + add + delete NER + Intent classification
Full assistant — arbitrary tasks LLM handles both implicitly

For the current itinerary bot, NER alone is sufficient. Intent classification becomes necessary only when the bot supports multiple actions on the same entity.

PyThaiNLP NER vs WangchanBERTa NER

PyThaiNLP ships a built-in NER tagger (CRF-based). It is simpler but weaker:

PyThaiNLP NER WangchanBERTa NER
Model type CRF — classical ML RoBERTa + token classification head
Tokenizer newmm (dictionary word-level) SentencePiece subword
Context awareness None — labels each token independently Full bidirectional attention
Size ~MB, instant ~500 MB, needs decent CPU/GPU
Use case Quick prototype Production accuracy

The critical difference is context. CRF labels tokens one by one using local features. A transformer attends to the entire sentence, so the same surface form (e.g. "มัตสึโมโต") can be correctly labeled LOC or PERSON depending on surrounding words.

Recommended order: get PyThaiNLP NER working first to understand BIO output format, then replicate with WangchanBERTa to feel the quality difference.

NER real-world applications

NER's real home is document processing, not chatbots (LLMs replaced NER in chatbots after 2022):

Industry NER use
Medical Extract drug names, dosages, symptoms from clinical notes → structured database
Legal Extract parties, dates, clauses from contracts automatically
Finance Extract company names, amounts, dates from earnings reports
HR Resume parsing — extract skills, companies, job titles
Compliance Flag PII (names, phones, IDs) in documents for redaction before sending to external APIs

Step 4 — Fine-Tuning WangchanBERTa for NER

Goal: Train the model on your task.

Topics:

  • What is LST20? (Thai NER dataset — your training data)
  • How to load and format a dataset with HuggingFace datasets
  • Trainer API — the standard fine-tuning loop
  • Evaluation metrics: precision, recall, F1 (seqeval library)
  • Saving and loading a checkpoint

Practice: Follow the HuggingFace NER fine-tuning tutorial, swap the dataset for LST20.


Step 5 — Plug into the Bot

Goal: Replace intent_engine.py with the trained model.

  • Load model with pipeline("ner", model="your-checkpoint")
  • Extract origin and destination entities
  • Uncomment the Phase 2 block in webhook.py

Phase 3 — LLM + Context Injection (Production)

Application layer vs Deep understanding

Goal Approach
Build a working chatbot now Use pre-trained LLM via Ollama API — done
Understand how LLM works internally Study the full pipeline below
Build your own LLM from scratch Follow the deep learning path at the end

Step 1 — Why LLMs Change Everything

Goal: Understand what GPT / Claude actually do differently from BERT.

Topics:

  • Decoder-only architecture (GPT, Llama, Typhoon) vs encoder-only (BERT, WangchanBERTa)
  • BERT reads whole sentence bidirectionally — LLM reads left to right, generates new tokens
  • Pre-training on massive text → emergent instruction following
  • Why you don't need labeled data or fine-tuning for most tasks
  • Zero-shot vs few-shot prompting
BERT (encoder):   reads [full sentence] → outputs labels for existing tokens
LLM  (decoder):   reads [prompt] → predicts next token → appends → repeats

Step 2 — How LLM Generates Text (the behind-the-scenes pipeline)

This is what Ollama does invisibly when you call requests.post(...):

Your plain text (system prompt + user message)
    │
    ▼  1. Tokenize  (SentencePiece / BPE — same concept as WangchanBERTa)
["▁วัน", "ที่", "▁29", "▁ทำ", "อะไร", "บ้าง"]
    │
    ▼  2. Token IDs  (vocabulary lookup)
[2341, 891, 445, 1203, 567, 892]
    │
    ▼  3. Embedding lookup  (each ID → 768/4096-dim vector)
    │
    ▼  4. Decoder transformer layers  (left-to-right attention, ~32 layers)
    │      each token attends to ALL previous tokens
    │
    ▼  5. Predict next token  (softmax over full vocabulary)
    │      "กิจกรรม" → 42%
    │      "กำหนดการ" → 31%
    │      ... pick highest (or sample)
    │
    ▼  6. Append predicted token → repeat from step 4
    │      until <end> token is predicted
    │
    ▼  7. Detokenize → plain Thai text reply

You only see step 1 input and step 7 output. Everything in between is inside Ollama.

Why the itinerary JSON is text, not vectors:

  • RAG converts documents to vectors for searching large corpora
  • Your itinerary (~2,000 tokens) fits entirely in the context window
  • No search needed — paste everything, LLM reads it all as tokens
  • Every user message re-sends the full itinerary (stateless — no memory between calls)

Step 3 — Prompt Engineering

Goal: Learn to control LLM behavior through prompts.

Topics:

  • System prompt vs user prompt
  • Role prompting ("You are a Thai travel assistant...")
  • Context injection — paste your JSON into the prompt
  • Output formatting (ask for bullets, specific structure)
  • Temperature / top-p — controls randomness of next-token sampling
# trip-bot system prompt structure
system = f"""
คุณเป็นผู้ช่วยท่องเที่ยวภาษาไทย
ทริปนี้อยู่ในช่วง 29 พ.ค. – 8 มิ.ย. 2569
ตอบตามข้อมูลนี้เท่านั้น: {itinerary_json}
"""

How LLM handles what Phase 1 needed code for:

Phase 1 needed LLM does automatically
Regex for date extraction Reads "29 พ.ค." in context → understands it
Gazetteer for place names Reads JSON → matches places in context
Intent classification Infers what user wants from phrasing
Typo handling Predicts most likely meaning from context

Ollama Model Selection

Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.

Install: download from ollama.com then pull a model:

ollama pull qwen2.5:3b    # recommended starting point
ollama serve              # starts local server on localhost:11434

Model comparison for this bot (Thai group chat, CPU only):

Model Size Speed (CPU) Thai quality Recommended for
qwen2.5:3b 1.9 GB ~10-15s Good Starting point - best balance
qwen2.5 4.7 GB ~30-60s Very good Better quality, slower
supachai/llama-3-typhoon-v1.5:8b-instruct 4.9 GB ~30-60s Best (Thai-specific) Best Thai, needs patience
llama3.2:1b 1.3 GB ~5s Decent Fastest, weakest Thai

Upgrade path:

  • Start with qwen2.5:3b -> test response quality
  • If Thai quality not good enough -> upgrade to qwen2.5 or typhoon
  • If too slow for group chat -> downgrade to llama3.2:1b

How Ollama works:

  • Downloads model in GGUF format (quantized - 4-bit instead of 16-bit = smaller, faster)
  • Runs as background server on localhost:11434
  • Your Python code sends HTTP requests - Ollama runs the full inference pipeline internally
  • You only see plain text in -> plain text out

Group chat trigger word: In group chats, bot responds only when message starts with the trigger word:

fujisan วันที่ 29 ทำอะไรบ้าง   <- bot responds
วันที่ 29 ทำอะไรบ้าง           <- bot ignores

Ollama Model Selection

Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.

Install: download from ollama.com → then pull a model:

Model comparison for this bot (Thai group chat, CPU only):

Model Size Speed (CPU) Thai quality Recommended for
\ 1.9 GB ~10–15s Good Starting point — best balance
\ 4.7 GB ~30–60s Very good Better quality, slower
\ 4.9 GB ~30–60s Best (Thai-specific) Best Thai, needs patience
\ 1.3 GB ~5s Decent Fastest, weakest Thai

Upgrade path:

  • Start with \ → test response quality
  • If Thai quality not good enough → upgrade to \ or - If too slow for group chat → downgrade to How Ollama works:
  • Downloads model in GGUF format (quantized — 4-bit instead of 16-bit = smaller, faster)
  • Runs as background server on - Your Python code sends HTTP requests — Ollama runs the full inference pipeline internally
  • You only see plain text in → plain text out

Group chat trigger word: In group chats, bot responds only when message starts with the trigger word:

Step 4 — RAG (Retrieval-Augmented Generation)

Goal: Understand when and why context injection is not enough.

Topics:

  • Token limit problem: when data > context window, you can't paste everything
  • Embeddings — convert text chunks to vectors that capture semantic meaning
  • Vector similarity search — find chunks most relevant to the query
  • Retrieve relevant chunks → inject only those → LLM generates answer
  • Tools: chromadb, faiss, OpenAI/Claude embeddings

When RAG is needed vs not:

Data size Approach
Small JSON / single document (trip-bot now) Full context injection — no RAG
10+ trips Still probably fine with full injection
100+ trips + reviews + guides RAG — mandatory

Step 5 — If You Want to Build Your Own LLM

The full learning path from understanding to building from scratch:

Level 1 — Tokenization              ✓ done (SentencePiece, BPE, input_ids)
Level 2 — Embeddings                ✓ done (token IDs → vectors, WangchanBERTa)
Level 3 — Encoder transformer       ✓ done (BERT, attention, NER, BIO)
Level 4 — Decoder / Generation      → next (next-token prediction, autoregressive)
Level 5 — Pre-training              → how model learns from raw text (loss, backprop)
Level 6 — Build your own small LLM  → implement transformer in PyTorch from scratch

Recommended resources in order:

Resource What you learn
3Blue1Brown — Neural Networks series Backpropagation visually
Andrej Karpathy — makemore (YouTube) Build bigram → MLP → transformer from scratch
Andrej Karpathy — nanoGPT (GitHub) Minimal GPT in ~300 lines of PyTorch
HuggingFace course chapters 1–4 Pre-training and fine-tuning at scale
Paper: "Attention Is All You Need" (2017) Original transformer architecture

nanoGPT is the single best resource — it implements exactly the pipeline above (tokenize → IDs → transformer layers → predict next token → repeat) from zero.


Summary

Phase Status Key skill What you built
1 Done Rule-based NLP, keyword matching Working trip chatbot
2 Done (learning) Transformers, NER, BIO tagging, subword tokenization Understood ML-based NLP
3 Done (production) Prompt engineering, context injection, Typhoon API Production LLM chatbot via openai SDK + typhoon-v2.5-30b-a3b-instruct
4 (optional) Future Decoder architecture, pre-training, PyTorch Build your own LLM