Spaces:

TaipongK
/

trip-bot

Sleeping

App Files Files Community

trip-bot / dev_tools /learning_roadmap.md

Pongsatorn Kanjanasantisak

using LLM typhoon-v2.5-30b-a3b-instruct

570280a 2 months ago

preview code

raw

history blame contribute delete

14.5 kB

Learning Roadmap — NLP to LLM

Phase 2 — ML-Based NLP & Transformers

Step 1 — Text Preprocessing

Goal: Understand how raw text becomes model input.

Topics:

Tokenization at character / subword level (SentencePiece, BPE)
What is a vocabulary? What is an <UNK> token?
Padding and truncation (max_length)
Attention mask — why it exists
The difference between word tokenization (Phase 1 / PyThaiNLP) vs subword tokenization (transformers)

Practice:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
print(tokenizer("วันนี้ทำอะไรบ้าง"))
# See: input_ids, attention_mask

Step 2 — What is a Transformer?

Goal: Understand the architecture behind BERT / WangchanBERTa.

Topics:

Encoder vs Decoder (BERT = encoder only)
Self-attention — how tokens look at each other
What is a [CLS] token? What is [SEP]?
Pre-training vs fine-tuning — why we don't train from scratch
Why WangchanBERTa for Thai? (pre-trained on Thai corpus)

Resource: "The Illustrated BERT" by Jay Alammar (Google it — best visual explanation)

Step 3 — Named Entity Recognition (NER)

Goal: Understand the task your bot needs.

Topics:

What is NER? (extract "ฮาคุบะ" → LOCATION, "29 พ.ค." → DATE)
BIO tagging scheme: B-LOC, I-LOC, O
Token classification head on top of BERT
How the model outputs one label per token
O = Outside — means "not an entity", not a label type

Example:

Input : "จากฮาคุบะไปคามิโคจิ"
Tokens: ["จาก", "ฮาคุบะ", "ไป", "คามิโคจิ"]
Labels: [O,     B-LOC,    O,    B-LOC      ]

NER vs Text Classification — two different tasks, same base model

Both tasks use WangchanBERTa as the encoder, but attach different heads:

WangchanBERTa encoder
        │
        ├──► Token Classification Head  →  NER
        │     (one label per token)         AutoModelForTokenClassification
        │
        └──► Sequence Classification Head  →  Text Classification / Intent
              (one label per [CLS] token)      AutoModelForSequenceClassification

	NER	Text Classification
Output	1 label per token	1 label per sentence
Labels	`B-LOC`, `I-DATE`, `O` …	`query_itinerary`, `greeting` …
Loss computed on	Every token	Only `[CLS]` token
Answers	What entities are in the text	What the user wants to do

Fine-tuning for NER and fine-tuning for text classification are separate training runs with separate datasets, even though both start from the same checkpoint.

Do you need both for this bot?

NER extracts what is in the text (date, place). Intent classification determines what action to take (query, delete, add).

Bot complexity	What you need
Single-purpose — show activities only (current bot)	NER only — everything is implicitly `query_itinerary`
Multi-action — query + add + delete	NER + Intent classification
Full assistant — arbitrary tasks	LLM handles both implicitly

For the current itinerary bot, NER alone is sufficient. Intent classification becomes necessary only when the bot supports multiple actions on the same entity.

PyThaiNLP NER vs WangchanBERTa NER

PyThaiNLP ships a built-in NER tagger (CRF-based). It is simpler but weaker:

	PyThaiNLP NER	WangchanBERTa NER
Model type	CRF — classical ML	RoBERTa + token classification head
Tokenizer	newmm (dictionary word-level)	SentencePiece subword
Context awareness	None — labels each token independently	Full bidirectional attention
Size	~MB, instant	~500 MB, needs decent CPU/GPU
Use case	Quick prototype	Production accuracy

The critical difference is context. CRF labels tokens one by one using local features. A transformer attends to the entire sentence, so the same surface form (e.g. "มัตสึโมโต") can be correctly labeled LOC or PERSON depending on surrounding words.

Recommended order: get PyThaiNLP NER working first to understand BIO output format, then replicate with WangchanBERTa to feel the quality difference.

NER real-world applications

NER's real home is document processing, not chatbots (LLMs replaced NER in chatbots after 2022):

Industry	NER use
Medical	Extract drug names, dosages, symptoms from clinical notes → structured database
Legal	Extract parties, dates, clauses from contracts automatically
Finance	Extract company names, amounts, dates from earnings reports
HR	Resume parsing — extract skills, companies, job titles
Compliance	Flag PII (names, phones, IDs) in documents for redaction before sending to external APIs

Step 4 — Fine-Tuning WangchanBERTa for NER

Goal: Train the model on your task.

Topics:

What is LST20? (Thai NER dataset — your training data)
How to load and format a dataset with HuggingFace datasets
Trainer API — the standard fine-tuning loop
Evaluation metrics: precision, recall, F1 (seqeval library)
Saving and loading a checkpoint

Practice: Follow the HuggingFace NER fine-tuning tutorial, swap the dataset for LST20.

Step 5 — Plug into the Bot

Goal: Replace intent_engine.py with the trained model.

Load model with pipeline("ner", model="your-checkpoint")
Extract origin and destination entities
Uncomment the Phase 2 block in webhook.py

Phase 3 — LLM + Context Injection (Production)

Application layer vs Deep understanding

Goal	Approach
Build a working chatbot now	Use pre-trained LLM via Ollama API — done
Understand how LLM works internally	Study the full pipeline below
Build your own LLM from scratch	Follow the deep learning path at the end

Step 1 — Why LLMs Change Everything

Goal: Understand what GPT / Claude actually do differently from BERT.

Topics:

Decoder-only architecture (GPT, Llama, Typhoon) vs encoder-only (BERT, WangchanBERTa)
BERT reads whole sentence bidirectionally — LLM reads left to right, generates new tokens
Pre-training on massive text → emergent instruction following
Why you don't need labeled data or fine-tuning for most tasks
Zero-shot vs few-shot prompting

BERT (encoder):   reads [full sentence] → outputs labels for existing tokens
LLM  (decoder):   reads [prompt] → predicts next token → appends → repeats

Step 2 — How LLM Generates Text (the behind-the-scenes pipeline)

This is what Ollama does invisibly when you call requests.post(...):

Your plain text (system prompt + user message)
    │
    ▼  1. Tokenize  (SentencePiece / BPE — same concept as WangchanBERTa)
["▁วัน", "ที่", "▁29", "▁ทำ", "อะไร", "บ้าง"]
    │
    ▼  2. Token IDs  (vocabulary lookup)
[2341, 891, 445, 1203, 567, 892]
    │
    ▼  3. Embedding lookup  (each ID → 768/4096-dim vector)
    │
    ▼  4. Decoder transformer layers  (left-to-right attention, ~32 layers)
    │      each token attends to ALL previous tokens
    │
    ▼  5. Predict next token  (softmax over full vocabulary)
    │      "กิจกรรม" → 42%
    │      "กำหนดการ" → 31%
    │      ... pick highest (or sample)
    │
    ▼  6. Append predicted token → repeat from step 4
    │      until <end> token is predicted
    │
    ▼  7. Detokenize → plain Thai text reply

You only see step 1 input and step 7 output. Everything in between is inside Ollama.

Why the itinerary JSON is text, not vectors:

RAG converts documents to vectors for searching large corpora
Your itinerary (~2,000 tokens) fits entirely in the context window
No search needed — paste everything, LLM reads it all as tokens
Every user message re-sends the full itinerary (stateless — no memory between calls)

Step 3 — Prompt Engineering

Goal: Learn to control LLM behavior through prompts.

Topics:

System prompt vs user prompt
Role prompting ("You are a Thai travel assistant...")
Context injection — paste your JSON into the prompt
Output formatting (ask for bullets, specific structure)
Temperature / top-p — controls randomness of next-token sampling

# trip-bot system prompt structure
system = f"""
คุณเป็นผู้ช่วยท่องเที่ยวภาษาไทย
ทริปนี้อยู่ในช่วง 29 พ.ค. – 8 มิ.ย. 2569
ตอบตามข้อมูลนี้เท่านั้น: {itinerary_json}
"""

How LLM handles what Phase 1 needed code for:

Phase 1 needed	LLM does automatically
Regex for date extraction	Reads "29 พ.ค." in context → understands it
Gazetteer for place names	Reads JSON → matches places in context
Intent classification	Infers what user wants from phrasing
Typo handling	Predicts most likely meaning from context

Ollama Model Selection

Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.

Install: download from ollama.com then pull a model:

ollama pull qwen2.5:3b    # recommended starting point
ollama serve              # starts local server on localhost:11434

Model comparison for this bot (Thai group chat, CPU only):

Model	Size	Speed (CPU)	Thai quality	Recommended for
`qwen2.5:3b`	1.9 GB	~10-15s	Good	Starting point - best balance
`qwen2.5`	4.7 GB	~30-60s	Very good	Better quality, slower
`supachai/llama-3-typhoon-v1.5:8b-instruct`	4.9 GB	~30-60s	Best (Thai-specific)	Best Thai, needs patience
`llama3.2:1b`	1.3 GB	~5s	Decent	Fastest, weakest Thai

Upgrade path:

Start with qwen2.5:3b -> test response quality
If Thai quality not good enough -> upgrade to qwen2.5 or typhoon
If too slow for group chat -> downgrade to llama3.2:1b

How Ollama works:

Downloads model in GGUF format (quantized - 4-bit instead of 16-bit = smaller, faster)
Runs as background server on localhost:11434
Your Python code sends HTTP requests - Ollama runs the full inference pipeline internally
You only see plain text in -> plain text out

Group chat trigger word: In group chats, bot responds only when message starts with the trigger word:

fujisan วันที่ 29 ทำอะไรบ้าง   <- bot responds
วันที่ 29 ทำอะไรบ้าง           <- bot ignores

Ollama Model Selection

Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine.

Install: download from ollama.com → then pull a model:

Model comparison for this bot (Thai group chat, CPU only):

Model	Size	Speed (CPU)	Thai quality	Recommended for
\	1.9 GB	~10–15s	Good	Starting point — best balance
\	4.7 GB	~30–60s	Very good	Better quality, slower
\	4.9 GB	~30–60s	Best (Thai-specific)	Best Thai, needs patience
\	1.3 GB	~5s	Decent	Fastest, weakest Thai

Upgrade path:

Start with \ → test response quality
If Thai quality not good enough → upgrade to \ or - If too slow for group chat → downgrade to How Ollama works:
Downloads model in GGUF format (quantized — 4-bit instead of 16-bit = smaller, faster)
Runs as background server on - Your Python code sends HTTP requests — Ollama runs the full inference pipeline internally
You only see plain text in → plain text out

Group chat trigger word: In group chats, bot responds only when message starts with the trigger word:

Step 4 — RAG (Retrieval-Augmented Generation)

Goal: Understand when and why context injection is not enough.

Topics:

Token limit problem: when data > context window, you can't paste everything
Embeddings — convert text chunks to vectors that capture semantic meaning
Vector similarity search — find chunks most relevant to the query
Retrieve relevant chunks → inject only those → LLM generates answer
Tools: chromadb, faiss, OpenAI/Claude embeddings

When RAG is needed vs not:

Data size	Approach
Small JSON / single document (trip-bot now)	Full context injection — no RAG
10+ trips	Still probably fine with full injection
100+ trips + reviews + guides	RAG — mandatory

Step 5 — If You Want to Build Your Own LLM

The full learning path from understanding to building from scratch:

Level 1 — Tokenization              ✓ done (SentencePiece, BPE, input_ids)
Level 2 — Embeddings                ✓ done (token IDs → vectors, WangchanBERTa)
Level 3 — Encoder transformer       ✓ done (BERT, attention, NER, BIO)
Level 4 — Decoder / Generation      → next (next-token prediction, autoregressive)
Level 5 — Pre-training              → how model learns from raw text (loss, backprop)
Level 6 — Build your own small LLM  → implement transformer in PyTorch from scratch

Recommended resources in order:

Resource	What you learn
3Blue1Brown — Neural Networks series	Backpropagation visually
Andrej Karpathy — makemore (YouTube)	Build bigram → MLP → transformer from scratch
Andrej Karpathy — nanoGPT (GitHub)	Minimal GPT in ~300 lines of PyTorch
HuggingFace course chapters 1–4	Pre-training and fine-tuning at scale
Paper: "Attention Is All You Need" (2017)	Original transformer architecture

nanoGPT is the single best resource — it implements exactly the pipeline above (tokenize → IDs → transformer layers → predict next token → repeat) from zero.

Summary

Phase	Status	Key skill	What you built
1	Done	Rule-based NLP, keyword matching	Working trip chatbot
2	Done (learning)	Transformers, NER, BIO tagging, subword tokenization	Understood ML-based NLP
3	Done (production)	Prompt engineering, context injection, Typhoon API	Production LLM chatbot via `openai` SDK + `typhoon-v2.5-30b-a3b-instruct`
4 (optional)	Future	Decoder architecture, pre-training, PyTorch	Build your own LLM