	---
	title: Thai LLM Token Comparison
	emoji: 🚀
	colorFrom: blue
	colorTo: yellow
	sdk: docker
	app_port: 8501
	tags:
	- streamlit
	pinned: false
	short_description: Thai Tokenizer Arena and Benchmark.
	---

	Tokenizer visualizers are standard tools in the global AI landscape,
	but few of them cover the Thai language well, especially in official and legal contexts.
	General-purpose tokenizers often handle complex Thai bureaucratic phrasing and long compound nouns poorly,
	which leads to "token inflation": fragmented tokenization that adds hidden costs and can hurt performance.
	This app compares how different LLMs "see" Thai text. Efficient tokenization (a lower token count)
	usually means lower inference costs and often better performance on Thai language tasks.


	# What is this?
	This is a playground to visualize and compare how various Large Language Models (LLMs) and Embedding models "break down" (tokenize) Thai text.
	It’s a side-by-side benchmarking tool that lets you see exactly how many tokens a model uses and how it perceives complex Thai sentences.
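
As a rough illustration of the side-by-side idea, here is a minimal sketch that ranks tokenizers by token count for the same text. It is not the app's actual code: `char_tokenizer`, `make_vocab_tokenizer`, and `compare` are hypothetical helpers, and the two toy tokenizers only stand in for real model tokenizers, which use learned subword vocabularies rather than this greedy lookup.

```python
from typing import Callable, Dict, List, Set, Tuple

Tokenizer = Callable[[str], List[str]]


def char_tokenizer(text: str) -> List[str]:
    """Toy stand-in for a model with no Thai vocabulary: one token per character."""
    return list(text)


def make_vocab_tokenizer(vocab: Set[str]) -> Tokenizer:
    """Toy greedy longest-match tokenizer that falls back to single characters."""
    max_len = max((len(word) for word in vocab), default=1)

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        i = 0
        while i < len(text):
            # Try the longest candidate first, shrinking down to one character.
            for size in range(min(max_len, len(text) - i), 0, -1):
                piece = text[i:i + size]
                if size == 1 or piece in vocab:
                    tokens.append(piece)
                    i += size
                    break
        return tokens

    return tokenize


def compare(text: str, tokenizers: Dict[str, Tokenizer]) -> List[Tuple[str, int]]:
    """Return (name, token_count) pairs, lowest token count first."""
    counts = [(name, len(tokenize(text))) for name, tokenize in tokenizers.items()]
    return sorted(counts, key=lambda item: item[1])


# Two words taken from the sample phrase used later in this README.
WORD_A = "ระเบียบ"
WORD_B = "คณะกรรมการ"
sample = WORD_A + WORD_B

ranking = compare(
    sample,
    {
        "char_level (toy)": char_tokenizer,
        "thai_vocab (toy)": make_vocab_tokenizer({WORD_A, WORD_B}),
    },
)
for name, count in ranking:
    print(f"{name}: {count} tokens")
```

The tokenizer that already "knows" both words needs far fewer tokens than the character-level one, which is the gap the app makes visible for real models.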

	# Why this?
	When I work with unstructured data from Thai government documents (OCR text), I run into a massive headache.
	Legal and official Thai documents use extremely long, formal compound words and complex phrasing.

	Most global models are trained primarily on English.
	When they face a sentence like "ระเบียบคณะกรรมการป้องกันและปราบปราม..." (roughly, "Regulation of the Committee for the Prevention and Suppression of...") mixed with Thai numerals, they often struggle.
	They might split a single Thai word into 10+ tiny, meaningless fragments (tokens).
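
One way a single word can end up as 10+ fragments: every Thai character occupies three bytes in UTF-8, so a tokenizer that falls back to raw bytes emits three tokens per character. The sketch below shows that worst case. `byte_fallback_tokens` is a hypothetical helper, and the `<0xNN>` notation only imitates how some tokenizers display byte-level pieces; whether a given model actually falls back this far depends on its vocabulary.

```python
from typing import List


def byte_fallback_tokens(word: str) -> List[str]:
    """Worst case: each UTF-8 byte becomes its own token, shown as <0xNN>."""
    return [f"<0x{byte:02X}>" for byte in word.encode("utf-8")]


word = "ระเบียบ"  # first word of the sample phrase above
pieces = byte_fallback_tokens(word)

print(f"characters: {len(word)}")
print(f"byte-level tokens: {len(pieces)}")
print(" ".join(pieces[:6]), "...")
```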

	# Why does that matter?

	Cost: Most APIs charge per token. Inefficient tokenization can mean paying around 3x the price for the same Thai sentence.

	Context: Models have a limited memory (context window). If a tokenizer is "wasteful," the same document uses up more of that window, so less of your text fits and the model "forgets" the beginning of a long document much sooner.

	Accuracy: If a model doesn't "see" Thai words correctly at the token level, it’s more likely to hallucinate or fail at understanding the nuance of official Thai regulations.
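
The cost and context points above come down to simple arithmetic. The sketch below works through it with made-up numbers: the price, the context window size, and the per-document token counts are all assumptions for illustration, not measurements from any real model or API.

```python
def cost_usd(token_count: int, usd_per_million_tokens: float) -> float:
    """Price of processing token_count tokens at a flat per-token rate."""
    return token_count * usd_per_million_tokens / 1_000_000


def documents_that_fit(context_window: int, tokens_per_document: int) -> int:
    """How many whole documents fit into one context window."""
    return context_window // tokens_per_document


# Hypothetical numbers, chosen only to make the 3x gap visible.
PRICE_PER_MILLION = 1.00   # USD per million tokens (assumed)
CONTEXT_WINDOW = 8192      # tokens (assumed)
EFFICIENT_TOKENS = 400     # tokens per document with a Thai-friendly tokenizer (assumed)
WASTEFUL_TOKENS = 1200     # same document with a fragmenting tokenizer (assumed, 3x)

for label, tokens in [("efficient", EFFICIENT_TOKENS), ("wasteful", WASTEFUL_TOKENS)]:
    price = cost_usd(tokens, PRICE_PER_MILLION)
    fits = documents_that_fit(CONTEXT_WINDOW, tokens)
    print(f"{label}: {tokens} tokens, ${price:.6f} per document, {fits} documents per window")
```

With these assumed figures, the wasteful tokenizer costs three times as much per document and fits fewer than a third as many documents into the same window.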

	This tool helps you find the "winner": the model whose tokenizer handles Thai most natively and efficiently, as shown by the visualization.


	# How it works
	Select Models: Choose from top Thai-centric models like Typhoon and SeaLLM, or global giants like Llama-3 and Gemma.

	Input Text: Paste your text or snippets into the box.

	Analyze Metrics:

	- Total Tokens: Lower is better! For the same input text, a lower count suggests the model has a better "vocabulary" for Thai.

	- Visual Spans: Each color represents one token. If you see a lot of single characters highlighted individually, that model is struggling with Thai.
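
The two metrics above can be sketched in a few lines. This is an assumed implementation, not the app's source: `token_metrics`, `render_spans`, and the color palette are hypothetical, and the token list is typed in by hand rather than produced by a real model tokenizer.

```python
import html
from typing import Dict, List

# Assumed palette; colors simply cycle so neighboring tokens look different.
PALETTE = ["#ffd6a5", "#caffbf", "#9bf6ff", "#bdb2ff"]


def token_metrics(tokens: List[str]) -> Dict[str, float]:
    """Total token count plus how many tokens are a single character."""
    total = len(tokens)
    singles = sum(1 for token in tokens if len(token) == 1)
    return {
        "total_tokens": total,
        "single_char_tokens": singles,
        "single_char_ratio": singles / total if total else 0.0,
    }


def render_spans(tokens: List[str]) -> str:
    """Wrap each token in a colored HTML span, escaping special characters."""
    spans = []
    for index, token in enumerate(tokens):
        color = PALETTE[index % len(PALETTE)]
        spans.append(f'<span style="background:{color}">{html.escape(token)}</span>')
    return "".join(spans)


# Hand-written example: one whole word, then a word broken into single characters.
example_tokens = ["ระเบียบ"] + list("คณะ")

metrics = token_metrics(example_tokens)
print(metrics)
print(render_spans(example_tokens))
```

A high `single_char_ratio` is the numeric counterpart of seeing many individually highlighted characters. In a Streamlit app, HTML like this could be displayed with `st.markdown(..., unsafe_allow_html=True)`, though how this Space actually renders its spans is not stated here.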