# Learning Roadmap — NLP to LLM --- ## Phase 2 — ML-Based NLP & Transformers ### Step 1 — Text Preprocessing **Goal:** Understand how raw text becomes model input. Topics: - Tokenization at character / subword level (SentencePiece, BPE) - What is a vocabulary? What is an `` token? - Padding and truncation (`max_length`) - Attention mask — why it exists - The difference between word tokenization (Phase 1 / PyThaiNLP) vs subword tokenization (transformers) Practice: ```python from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased") print(tokenizer("วันนี้ทำอะไรบ้าง")) # See: input_ids, attention_mask ``` --- ### Step 2 — What is a Transformer? **Goal:** Understand the architecture behind BERT / WangchanBERTa. Topics: - Encoder vs Decoder (BERT = encoder only) - Self-attention — how tokens look at each other - What is a `[CLS]` token? What is `[SEP]`? - Pre-training vs fine-tuning — why we don't train from scratch - Why WangchanBERTa for Thai? (pre-trained on Thai corpus) Resource: "The Illustrated BERT" by Jay Alammar (Google it — best visual explanation) --- ### Step 3 — Named Entity Recognition (NER) **Goal:** Understand the task your bot needs. Topics: - What is NER? (extract "ฮาคุบะ" → LOCATION, "29 พ.ค." → DATE) - BIO tagging scheme: `B-LOC`, `I-LOC`, `O` - Token classification head on top of BERT - How the model outputs one label per token - `O` = Outside — means "not an entity", not a label type Example: ``` Input : "จากฮาคุบะไปคามิโคจิ" Tokens: ["จาก", "ฮาคุบะ", "ไป", "คามิโคจิ"] Labels: [O, B-LOC, O, B-LOC ] ``` #### NER vs Text Classification — two different tasks, same base model Both tasks use WangchanBERTa as the encoder, but attach different heads: ``` WangchanBERTa encoder │ ├──► Token Classification Head → NER │ (one label per token) AutoModelForTokenClassification │ └──► Sequence Classification Head → Text Classification / Intent (one label per [CLS] token) AutoModelForSequenceClassification ``` | | NER | Text Classification | |---|---|---| | Output | 1 label per token | 1 label per sentence | | Labels | `B-LOC`, `I-DATE`, `O` … | `query_itinerary`, `greeting` … | | Loss computed on | Every token | Only `[CLS]` token | | Answers | *What* entities are in the text | *What* the user wants to do | Fine-tuning for NER and fine-tuning for text classification are **separate training runs** with separate datasets, even though both start from the same checkpoint. #### Do you need both for this bot? NER extracts *what* is in the text (date, place). Intent classification determines *what action* to take (query, delete, add). | Bot complexity | What you need | |---|---| | Single-purpose — show activities only (current bot) | NER only — everything is implicitly `query_itinerary` | | Multi-action — query + add + delete | NER + Intent classification | | Full assistant — arbitrary tasks | LLM handles both implicitly | For the current itinerary bot, **NER alone is sufficient**. Intent classification becomes necessary only when the bot supports multiple actions on the same entity. #### PyThaiNLP NER vs WangchanBERTa NER PyThaiNLP ships a built-in NER tagger (CRF-based). It is simpler but weaker: | | PyThaiNLP NER | WangchanBERTa NER | |---|---|---| | Model type | CRF — classical ML | RoBERTa + token classification head | | Tokenizer | newmm (dictionary word-level) | SentencePiece subword | | Context awareness | None — labels each token independently | Full bidirectional attention | | Size | ~MB, instant | ~500 MB, needs decent CPU/GPU | | Use case | Quick prototype | Production accuracy | The critical difference is **context**. CRF labels tokens one by one using local features. A transformer attends to the entire sentence, so the same surface form (e.g. "มัตสึโมโต") can be correctly labeled LOC or PERSON depending on surrounding words. **Recommended order:** get PyThaiNLP NER working first to understand BIO output format, then replicate with WangchanBERTa to feel the quality difference. #### NER real-world applications NER's real home is **document processing**, not chatbots (LLMs replaced NER in chatbots after 2022): | Industry | NER use | |---|---| | Medical | Extract drug names, dosages, symptoms from clinical notes → structured database | | Legal | Extract parties, dates, clauses from contracts automatically | | Finance | Extract company names, amounts, dates from earnings reports | | HR | Resume parsing — extract skills, companies, job titles | | Compliance | Flag PII (names, phones, IDs) in documents for redaction before sending to external APIs | --- ### Step 4 — Fine-Tuning WangchanBERTa for NER **Goal:** Train the model on your task. Topics: - What is LST20? (Thai NER dataset — your training data) - How to load and format a dataset with HuggingFace `datasets` - `Trainer` API — the standard fine-tuning loop - Evaluation metrics: precision, recall, F1 (seqeval library) - Saving and loading a checkpoint Practice: Follow the HuggingFace NER fine-tuning tutorial, swap the dataset for LST20. --- ### Step 5 — Plug into the Bot **Goal:** Replace `intent_engine.py` with the trained model. - Load model with `pipeline("ner", model="your-checkpoint")` - Extract `origin` and `destination` entities - Uncomment the Phase 2 block in `webhook.py` --- ## Phase 3 — LLM + Context Injection (Production) ### Application layer vs Deep understanding | Goal | Approach | |---|---| | Build a working chatbot now | Use pre-trained LLM via Ollama API — done | | Understand how LLM works internally | Study the full pipeline below | | Build your own LLM from scratch | Follow the deep learning path at the end | --- ### Step 1 — Why LLMs Change Everything **Goal:** Understand what GPT / Claude actually do differently from BERT. Topics: - Decoder-only architecture (GPT, Llama, Typhoon) vs encoder-only (BERT, WangchanBERTa) - BERT reads whole sentence bidirectionally — LLM reads left to right, generates new tokens - Pre-training on massive text → emergent instruction following - Why you don't need labeled data or fine-tuning for most tasks - Zero-shot vs few-shot prompting ``` BERT (encoder): reads [full sentence] → outputs labels for existing tokens LLM (decoder): reads [prompt] → predicts next token → appends → repeats ``` --- ### Step 2 — How LLM Generates Text (the behind-the-scenes pipeline) This is what Ollama does invisibly when you call `requests.post(...)`: ``` Your plain text (system prompt + user message) │ ▼ 1. Tokenize (SentencePiece / BPE — same concept as WangchanBERTa) ["▁วัน", "ที่", "▁29", "▁ทำ", "อะไร", "บ้าง"] │ ▼ 2. Token IDs (vocabulary lookup) [2341, 891, 445, 1203, 567, 892] │ ▼ 3. Embedding lookup (each ID → 768/4096-dim vector) │ ▼ 4. Decoder transformer layers (left-to-right attention, ~32 layers) │ each token attends to ALL previous tokens │ ▼ 5. Predict next token (softmax over full vocabulary) │ "กิจกรรม" → 42% │ "กำหนดการ" → 31% │ ... pick highest (or sample) │ ▼ 6. Append predicted token → repeat from step 4 │ until token is predicted │ ▼ 7. Detokenize → plain Thai text reply ``` You only see step 1 input and step 7 output. Everything in between is inside Ollama. **Why the itinerary JSON is text, not vectors:** - RAG converts documents to vectors for *searching large corpora* - Your itinerary (~2,000 tokens) fits entirely in the context window - No search needed — paste everything, LLM reads it all as tokens - Every user message re-sends the full itinerary (stateless — no memory between calls) --- ### Step 3 — Prompt Engineering **Goal:** Learn to control LLM behavior through prompts. Topics: - System prompt vs user prompt - Role prompting ("You are a Thai travel assistant...") - Context injection — paste your JSON into the prompt - Output formatting (ask for bullets, specific structure) - Temperature / top-p — controls randomness of next-token sampling ```python # trip-bot system prompt structure system = f""" คุณเป็นผู้ช่วยท่องเที่ยวภาษาไทย ทริปนี้อยู่ในช่วง 29 พ.ค. – 8 มิ.ย. 2569 ตอบตามข้อมูลนี้เท่านั้น: {itinerary_json} """ ``` **How LLM handles what Phase 1 needed code for:** | Phase 1 needed | LLM does automatically | |---|---| | Regex for date extraction | Reads "29 พ.ค." in context → understands it | | Gazetteer for place names | Reads JSON → matches places in context | | Intent classification | Infers what user wants from phrasing | | Typo handling | Predicts most likely meaning from context | --- ### Ollama Model Selection Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine. **Install:** download from ollama.com then pull a model: ```bash ollama pull qwen2.5:3b # recommended starting point ollama serve # starts local server on localhost:11434 ``` **Model comparison for this bot (Thai group chat, CPU only):** | Model | Size | Speed (CPU) | Thai quality | Recommended for | |---|---|---|---|---| | `qwen2.5:3b` | 1.9 GB | ~10-15s | Good | **Starting point - best balance** | | `qwen2.5` | 4.7 GB | ~30-60s | Very good | Better quality, slower | | `supachai/llama-3-typhoon-v1.5:8b-instruct` | 4.9 GB | ~30-60s | Best (Thai-specific) | Best Thai, needs patience | | `llama3.2:1b` | 1.3 GB | ~5s | Decent | Fastest, weakest Thai | **Upgrade path:** - Start with `qwen2.5:3b` -> test response quality - If Thai quality not good enough -> upgrade to `qwen2.5` or `typhoon` - If too slow for group chat -> downgrade to `llama3.2:1b` **How Ollama works:** - Downloads model in GGUF format (quantized - 4-bit instead of 16-bit = smaller, faster) - Runs as background server on `localhost:11434` - Your Python code sends HTTP requests - Ollama runs the full inference pipeline internally - You only see plain text in -> plain text out **Group chat trigger word:** In group chats, bot responds only when message starts with the trigger word: ``` fujisan วันที่ 29 ทำอะไรบ้าง <- bot responds วันที่ 29 ทำอะไรบ้าง <- bot ignores ``` --- ### Ollama Model Selection Ollama hosts and runs inference locally — no API key, no cost, no data leaving your machine. **Install:** download from ollama.com → then pull a model: **Model comparison for this bot (Thai group chat, CPU only):** | Model | Size | Speed (CPU) | Thai quality | Recommended for | |---|---|---|---|---| | \ | 1.9 GB | ~10–15s | Good | **Starting point — best balance** | | \ | 4.7 GB | ~30–60s | Very good | Better quality, slower | | \ | 4.9 GB | ~30–60s | Best (Thai-specific) | Best Thai, needs patience | | \ | 1.3 GB | ~5s | Decent | Fastest, weakest Thai | **Upgrade path:** - Start with \ → test response quality - If Thai quality not good enough → upgrade to \ or - If too slow for group chat → downgrade to **How Ollama works:** - Downloads model in GGUF format (quantized — 4-bit instead of 16-bit = smaller, faster) - Runs as background server on - Your Python code sends HTTP requests — Ollama runs the full inference pipeline internally - You only see plain text in → plain text out **Group chat trigger word:** In group chats, bot responds only when message starts with the trigger word: --- ### Step 4 — RAG (Retrieval-Augmented Generation) **Goal:** Understand when and why context injection is not enough. Topics: - Token limit problem: when data > context window, you can't paste everything - Embeddings — convert text chunks to vectors that capture semantic meaning - Vector similarity search — find chunks most relevant to the query - Retrieve relevant chunks → inject only those → LLM generates answer - Tools: `chromadb`, `faiss`, OpenAI/Claude embeddings **When RAG is needed vs not:** | Data size | Approach | |---|---| | Small JSON / single document (trip-bot now) | Full context injection — no RAG | | 10+ trips | Still probably fine with full injection | | 100+ trips + reviews + guides | RAG — mandatory | --- ### Step 5 — If You Want to Build Your Own LLM The full learning path from understanding to building from scratch: ``` Level 1 — Tokenization ✓ done (SentencePiece, BPE, input_ids) Level 2 — Embeddings ✓ done (token IDs → vectors, WangchanBERTa) Level 3 — Encoder transformer ✓ done (BERT, attention, NER, BIO) Level 4 — Decoder / Generation → next (next-token prediction, autoregressive) Level 5 — Pre-training → how model learns from raw text (loss, backprop) Level 6 — Build your own small LLM → implement transformer in PyTorch from scratch ``` **Recommended resources in order:** | Resource | What you learn | |---|---| | 3Blue1Brown — Neural Networks series | Backpropagation visually | | Andrej Karpathy — makemore (YouTube) | Build bigram → MLP → transformer from scratch | | Andrej Karpathy — nanoGPT (GitHub) | Minimal GPT in ~300 lines of PyTorch | | HuggingFace course chapters 1–4 | Pre-training and fine-tuning at scale | | Paper: "Attention Is All You Need" (2017) | Original transformer architecture | nanoGPT is the single best resource — it implements exactly the pipeline above (`tokenize → IDs → transformer layers → predict next token → repeat`) from zero. --- ## Summary | Phase | Status | Key skill | What you built | |---|---|---|---| | 1 | Done | Rule-based NLP, keyword matching | Working trip chatbot | | 2 | Done (learning) | Transformers, NER, BIO tagging, subword tokenization | Understood ML-based NLP | | 3 | Done (production) | Prompt engineering, context injection, Typhoon API | Production LLM chatbot via `openai` SDK + `typhoon-v2.5-30b-a3b-instruct` | | 4 (optional) | Future | Decoder architecture, pre-training, PyTorch | Build your own LLM |