metadata
title: Thai LLM Token Comparison
emoji: 🚀
colorFrom: blue
colorTo: yellow
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Thai Tokenizer Arena and Benchmark.

While tokenizer visualizers are standard tools in the global AI landscape, few of them cover the Thai language, especially official and legal text. General-purpose models often fragment complex Thai bureaucratic phrasing and long compound nouns. The result is "token inflation": the same text consumes more tokens, which adds hidden costs and can degrade performance. This app compares how different LLMs "see" Thai text. Efficient tokenization (a lower token count) usually leads to lower inference costs and better performance on Thai language tasks.

What is this?

This is a playground to visualize and compare how various Large Language Models (LLMs) and Embedding models "break down" (tokenize) Thai text. It’s a side-by-side benchmarking tool that lets you see exactly how many tokens a model uses and how it perceives complex Thai sentences.
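A minimal sketch of what a side-by-side comparison boils down to: run the same text through several tokenizers and rank them by token count. The two tokenizers here are stand-ins invented for illustration (a character-level splitter and a greedy matcher over a tiny hypothetical vocabulary), not the tokenizers of any model in this app. A real tokenizer's tokenize function could be plugged into the same dictionary.

```python
from typing import Callable, Dict, List, Set, Tuple


def char_tokenizer(text: str) -> List[str]:
    """Stand-in tokenizer: one token per Unicode code point."""
    return list(text)


def make_greedy_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Stand-in tokenizer: greedy longest match against a toy vocabulary.

    Any character that starts no vocabulary entry becomes its own token.
    """
    max_len = max(len(entry) for entry in vocab)

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        i = 0
        while i < len(text):
            # Try the longest candidate first, shrinking down to one character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocab:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

    return tokenize


def compare_token_counts(
    text: str, tokenizers: Dict[str, Callable[[str], List[str]]]
) -> List[Tuple[str, int]]:
    """Return (name, token_count) pairs sorted from fewest to most tokens."""
    counts = [(name, len(tokenize(text))) for name, tokenize in tokenizers.items()]
    return sorted(counts, key=lambda pair: pair[1])


# Hypothetical vocabulary that happens to contain the words in the sample.
toy_vocab = {"ระเบียบ", "คณะ", "กรรมการ"}
sample = "ระเบียบคณะกรรมการ"

ranking = compare_token_counts(
    sample,
    {"char-level": char_tokenizer, "toy-vocab": make_greedy_tokenizer(toy_vocab)},
)
for name, count in ranking:
    print(f"{name}: {count} tokens")
```

With this toy vocabulary the sample collapses into three tokens, while the character-level stand-in needs one token per code point. That gap is the kind of difference the app is meant to surface.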

Why this?

When I work with unstructured data from Thai government documents (OCR text), I run into a massive headache. Legal and official Thai documents use extremely long, formal compound words and complex phrasing.

Most global models are trained primarily on English. When they face a sentence like "ระเบียบคณะกรรมการป้องกันและปราบปราม..." with Thai numerals, they often struggle. They might split a single Thai word into 10+ tiny, meaningless fragments (tokens).
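One plausible contributor to this fragmentation is byte fallback. This is an assumption for illustration, since nothing here documents any specific model's internals. A tokenizer that falls back to raw UTF-8 bytes for text outside its learned vocabulary pays, at worst, three tokens per Thai character, because every character in the Thai Unicode block takes three bytes in UTF-8. The sketch below only measures bytes and does not call any real tokenizer.

```python
def utf8_profile(text: str) -> dict:
    """Report code points, UTF-8 bytes, and bytes per character for a string."""
    encoded = text.encode("utf-8")
    return {
        "characters": len(text),
        "utf8_bytes": len(encoded),
        "bytes_per_char": len(encoded) / len(text),
    }


# ASCII word, Thai word, Arabic digits, Thai digits.
for sample_text in ["regulation", "ระเบียบ", "2567", "๒๕๖๗"]:
    print(sample_text, utf8_profile(sample_text))
```

ASCII text costs one byte per character, so a byte-level worst case hits Thai words and Thai numerals three times harder than the English equivalent.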

Why does that matter?

Cost: Most APIs charge you per token. Inefficient tokenization can multiply the price of the same Thai sentence, for example 3x if it produces three times as many tokens.

Context: Models have a limited memory (context window). If a tokenizer is "wasteful," that window fills up sooner, so less of your document fits and the model will "forget" the beginning of it much faster.

Accuracy: If a model doesn't "see" Thai words correctly at the token level, it’s more likely to hallucinate or fail at understanding the nuance of official Thai regulations.
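The cost and context points above are simple arithmetic, sketched here with made-up numbers. The price, the context window size, the document length, and the tokens-per-character ratios are all hypothetical placeholders, not measurements from any model or API.

```python
import math


def estimate_cost(token_count: int, price_per_million_tokens: float) -> float:
    """Cost of processing token_count tokens at a flat per-token price."""
    return token_count / 1_000_000 * price_per_million_tokens


def chars_that_fit(context_window_tokens: int, tokens_per_char: float) -> int:
    """Approximate number of characters that fit in the context window."""
    return int(context_window_tokens / tokens_per_char)


# Hypothetical inputs chosen only to make the arithmetic visible.
PRICE_PER_MILLION = 1.00   # currency units per million tokens
CONTEXT_WINDOW = 8192      # tokens
DOCUMENT_CHARS = 10_000    # characters in the document

for label, tokens_per_char in [("efficient", 0.5), ("wasteful", 1.5)]:
    tokens = math.ceil(DOCUMENT_CHARS * tokens_per_char)
    cost = estimate_cost(tokens, PRICE_PER_MILLION)
    capacity = chars_that_fit(CONTEXT_WINDOW, tokens_per_char)
    print(f"{label}: {tokens} tokens, cost {cost:.4f}, window holds ~{capacity} chars")
```

Tripling the tokens per character triples the bill and cuts the text that fits in the window to a third. Both effects come from the same ratio, which is why token count is the headline metric.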

This tool helps you find the "Winner": the model that handles Thai most natively and efficiently, made visible through visualization.

How it works

Select Models: Choose from top Thai-centric models like Typhoon and SeaLLM, or global giants like Llama-3 and Gemma.

Input Text: Paste your text or snippets into the box.

Analyze Metrics:

  • Total Tokens: Lower is better! It means the model has a better "vocabulary" for Thai.

  • Visual Spans: Each color represents one token. If you see a lot of single characters highlighted individually, that model is struggling with Thai.
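The two metrics above can be sketched in a few lines, under stated assumptions: the token list is hand-written rather than produced by a real model, the color palette is arbitrary, and `render_spans` and `single_char_ratio` are hypothetical helper names, not functions from this app.

```python
import html
from typing import List

# Arbitrary background colors, cycled so neighboring tokens stay distinguishable.
PALETTE = ["#cfe8ff", "#fff3b0", "#d7f5d0", "#ffd6d6"]


def render_spans(tokens: List[str]) -> str:
    """Wrap each token in a colored <span> so token boundaries are visible."""
    parts = []
    for index, token in enumerate(tokens):
        color = PALETTE[index % len(PALETTE)]
        parts.append(f'<span style="background:{color}">{html.escape(token)}</span>')
    return "".join(parts)


def single_char_ratio(tokens: List[str]) -> float:
    """Fraction of tokens that are exactly one character long."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if len(token) == 1) / len(tokens)


# Hand-written example: one whole word followed by three single characters.
example_tokens = ["ระเบียบ", "ค", "ณ", "ะ"]
print(len(example_tokens), "tokens")
print(f"single-character share: {single_char_ratio(example_tokens):.0%}")
print(render_spans(example_tokens))
```

In a Streamlit app the resulting string could be displayed with `st.markdown(..., unsafe_allow_html=True)`; whether this app renders its spans that way is not stated here. A high single-character share is a numeric version of the "lots of individually highlighted characters" warning sign.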