| --- |
| title: Thai LLM Token Comparison |
| emoji: 🚀 |
| colorFrom: blue |
| colorTo: yellow |
| sdk: docker |
| app_port: 8501 |
| tags: |
| - streamlit |
| pinned: false |
| short_description: Thai Tokenizer Arena and Benchmark. |
| --- |
| |
| While tokenizer visualizers are standard tools in the global AI landscape, |
| there is a significant gap for the Thai language, especially in official and legal contexts. |
| General-purpose tokenizers often handle complex Thai bureaucratic phrasing and long compound nouns poorly, |
| leading to 'token inflation': fragmented tokenization that adds hidden costs and can degrade performance. |
| This app compares how different LLMs 'see' Thai text. More efficient tokenization (a lower token count) |
| usually means lower inference cost and often better results on Thai language tasks. |
|
|
|
|
| # What is this? |
| This is a playground to visualize and compare how various Large Language Models (LLMs) and Embedding models "break down" (tokenize) Thai text. |
| It’s a side-by-side benchmarking tool that lets you see exactly how many tokens a model uses and how it perceives complex Thai sentences. |
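As a rough illustration of the side-by-side idea, the sketch below counts tokens for the same text under several tokenizers and ranks them. The two tokenizers here are toy stand-ins invented for this example, not the models used by the app; in practice each callable could be something like the `tokenize` method of a Hugging Face `AutoTokenizer`, which is an assumption about tooling rather than a description of this app's code.

```python
from typing import Callable, Dict, List

# A tokenizer is modelled as any callable that maps text to a list of token strings.
Tokenizer = Callable[[str], List[str]]


def char_tokenizer(text: str) -> List[str]:
    """Toy stand-in: one token per Unicode code point."""
    return list(text)


def chunk_tokenizer(text: str, size: int = 3) -> List[str]:
    """Toy stand-in: fixed-size chunks of `size` code points."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def compare_token_counts(text: str, tokenizers: Dict[str, Tokenizer]) -> Dict[str, int]:
    """Return {name: token_count}, ordered from fewest to most tokens."""
    counts = {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}
    return dict(sorted(counts.items(), key=lambda item: item[1]))


if __name__ == "__main__":
    sample = "ระเบียบคณะกรรมการ"
    ranking = compare_token_counts(
        sample,
        {"char-level (toy)": char_tokenizer, "3-char chunks (toy)": chunk_tokenizer},
    )
    for name, count in ranking.items():
        print(f"{name}: {count} tokens")
```

Because the tokenizers are passed in as plain callables, the comparison itself stays independent of any particular model library.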
|
|
| # Why this? |
| When I work with unstructured OCR text from Thai government documents, I keep running into the same headache:
| legal and official Thai documents use extremely long, formal compound words and complex phrasing.
|
|
| Most global models are trained primarily on English. |
| When they face a sentence like "ระเบียบคณะกรรมการป้องกันและปราบปราม..." that also contains Thai numerals, their tokenizers often struggle.
| They might split a single Thai word into 10+ tiny, meaningless fragments (tokens). |
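One likely contributor to this fragmentation is encoding: every character in the Thai Unicode block takes three bytes in UTF-8, so a tokenizer that falls back to raw bytes for text outside its vocabulary can emit up to three tokens per character. Whether a given model actually behaves this way depends on its tokenizer. The helper below (a hypothetical name, written for this example) only measures the byte expansion and does not tokenize anything.

```python
def utf8_expansion(text: str) -> dict:
    """Compare the code-point count of `text` with its UTF-8 byte count."""
    n_chars = len(text)
    n_bytes = len(text.encode("utf-8"))
    return {
        "chars": n_chars,
        "utf8_bytes": n_bytes,
        # Upper bound on tokens per character for a pure byte-level fallback.
        "bytes_per_char": n_bytes / n_chars if n_chars else 0.0,
    }


if __name__ == "__main__":
    stats = utf8_expansion("ระเบียบ")
    print(stats["chars"], stats["utf8_bytes"], stats["bytes_per_char"])
```

For ASCII text the same ratio is 1.0, which is part of why English-centric vocabularies look so much cheaper on English input.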
|
|
| # Why does that matter? |
|
|
| - Cost: Most APIs charge per token, so inefficient tokenization can mean paying around 3x the price for the same Thai sentence.
|
|
| - Context: Models have a limited context window. If a tokenizer is "wasteful," a document fills that window sooner, and the model loses the beginning of your document much faster.
|
|
| - Accuracy: If a model doesn't "see" Thai words correctly at the token level, it is more likely to hallucinate or miss the nuance of official Thai regulations.
|
|
| This tool helps you find the "Winner": the model that handles Thai most natively and efficiently, as shown through visualization.
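The cost and context points above come down to simple arithmetic. The sketch below uses made-up numbers (a price of 1.00 USD per million tokens and an 8,192-token context window are assumptions chosen for illustration, not figures for any real model) to show how a 3x difference in token count carries straight through to price and context usage.

```python
def estimate_cost(token_count: int, usd_per_million_tokens: float) -> float:
    """Price of processing `token_count` tokens at a per-million-token rate."""
    return token_count * usd_per_million_tokens / 1_000_000


def context_share(token_count: int, context_window: int) -> float:
    """Fraction of the context window that `token_count` tokens occupy."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return token_count / context_window


if __name__ == "__main__":
    price = 1.00      # hypothetical USD per million tokens
    window = 8192     # hypothetical context window size
    # The same document under an efficient and an inefficient tokenizer (made-up counts).
    for label, tokens in [("efficient", 1000), ("inefficient", 3000)]:
        cost = estimate_cost(tokens, price)
        share = context_share(tokens, window)
        print(f"{label}: {tokens} tokens, ${cost:.4f}, {share:.1%} of context")
```

Swapping in token counts measured by the tool and the published price of a specific API gives a quick estimate of what a given tokenizer costs on your own documents.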
|
|
|
|
| # How it works |
| Select Models: Choose from top Thai-centric models like Typhoon and SeaLLM, or global giants like Llama-3 and Gemma. |
|
|
| Input Text: Paste your text or snippets into the box. |
|
|
| Analyze Metrics: |
|
|
| - Total Tokens: Lower is better! A lower count suggests the model has a better "vocabulary" for Thai.
|
|
| - Visual Spans: Each color represents one token. If you see a lot of single characters highlighted individually, that model is struggling with Thai. |
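The two metrics above can be sketched from nothing more than a list of token strings. This is a minimal illustration under stated assumptions, not the app's actual implementation: the function names and color palette are invented for the example, and the tokens are assumed to be already decoded to display text (real tokenizers may emit marker characters or byte placeholders that need cleaning first).

```python
import html
from typing import Dict, List

# Hypothetical palette; colors cycle so neighbouring tokens are distinguishable.
PALETTE = ["#ffd6a5", "#caffbf", "#9bf6ff", "#bdb2ff"]


def token_metrics(tokens: List[str]) -> Dict[str, float]:
    """Summarize a tokenization: total count and how fragmented it is."""
    total = len(tokens)
    single = sum(1 for tok in tokens if len(tok) == 1)  # one code point each
    chars = sum(len(tok) for tok in tokens)
    return {
        "total_tokens": total,
        "single_char_tokens": single,
        "single_char_share": single / total if total else 0.0,
        "chars_per_token": chars / total if total else 0.0,
    }


def render_spans(tokens: List[str]) -> str:
    """Wrap each token in a colored <span>, escaping any HTML in the text."""
    parts = []
    for index, tok in enumerate(tokens):
        color = PALETTE[index % len(PALETTE)]
        parts.append(f'<span style="background:{color}">{html.escape(tok)}</span>')
    return "".join(parts)


if __name__ == "__main__":
    demo_tokens = ["ระ", "เ", "บี", "ย", "บ"]  # made-up split, for illustration only
    print(token_metrics(demo_tokens)["total_tokens"])
```

A high `single_char_share` corresponds to the "lots of individually highlighted characters" pattern described above. In a Streamlit app, the HTML string could be displayed with `st.markdown(..., unsafe_allow_html=True)`.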
|
|
|
|
|
|