Spaces: ll-monkey/Thai-LLM-Token-Comparison
ll-monkey committed on May 3

Commit 502cb8b (verified) · 1 Parent(s): 437ff7c

Update src/streamlit_app.py

Files changed (1)

src/streamlit_app.py

@@ -56,6 +56,28 @@ st.markdown("""
 Compare how different LLMs 'see' Thai text. Efficient tokenization (lower Token Count)
 usually leads to lower inference costs and better performance for Thai language tasks.
 """)
+with st.expander("Why does tokenization matter for Thai OCR and documents?", expanded=False):
+    st.markdown("""
+    ### **The Problem: Unstructured Thai Government Data**
+    When processing **OCR text from official Thai documents**, I face a particular challenge.
+    Thai is written without spaces between words, the documents often mix in Thai numerals,
+    and legal/official vocabulary is highly complex.
+    ### **What is this?**
+    This Arena helps visualize which LLM's tokenizer splits Thai document text most
+    efficiently. A "good" tokenizer keeps words or meaningful sub-words as single tokens; a "bad" one breaks
+    them into individual characters or byte fragments.
+    ### **Why does this matter?**
+    *   **Cost Efficiency:** At the same per-token price, models that use fewer tokens to represent the same text are cheaper to run.
+    *   **Memory (Context):** Efficient tokenization lets you feed longer documents into a model without exceeding its context window.
+    *   **Accuracy:** Better tokenization can lead to fewer hallucinations in RAG (Retrieval-Augmented Generation) systems.
+    ### **How to use**
+    1. Select models from the sidebar.
+    2. Paste your Thai text.
+    3. Compare the results to find the most cost-effective model for your data (cleaner segmentation and a lower token count).
+    """)
 # put choice on the sidebar
 with st.sidebar:
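
The cost argument in the new expander text can be illustrated with a small sketch. This is not the app's code: `char_tokenize` and `byte_tokenize` are hypothetical toy tokenizers standing in for whatever model tokenizers the Space loads, and the price passed to `estimate_cost` is an invented figure. The one fact it relies on is that Thai characters occupy 3 bytes each in UTF-8, so a tokenizer that falls back to raw bytes emits roughly three tokens per Thai character.

```python
from typing import Callable, Dict, List, Tuple

# A tokenizer here is any callable mapping text to a list of tokens (assumption).
Tokenizer = Callable[[str], List[str]]


def char_tokenize(text: str) -> List[str]:
    """Toy stand-in: one token per Unicode code point."""
    return list(text)


def byte_tokenize(text: str) -> List[str]:
    """Toy stand-in for pure byte fallback: one token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


def compare_token_counts(
    text: str, tokenizers: Dict[str, Tokenizer]
) -> List[Tuple[str, int]]:
    """Return (name, token_count) pairs sorted from fewest to most tokens."""
    counts = [(name, len(tokenize(text))) for name, tokenize in tokenizers.items()]
    return sorted(counts, key=lambda pair: pair[1])


def estimate_cost(token_count: int, price_per_million_tokens: float) -> float:
    """Cost of processing token_count tokens at a (hypothetical) per-million price."""
    return token_count * price_per_million_tokens / 1_000_000


sample = "ราชการไทย"  # short Thai sample text
results = compare_token_counts(
    sample,
    {"char_level": char_tokenize, "byte_level": byte_tokenize},
)

for name, count in results:
    # 2.0 is an invented price per million tokens, used only for illustration.
    print(f"{name}: {count} tokens, cost {estimate_cost(count, 2.0):.8f}")
```

At an equal per-token price, the byte-level stand-in costs three times as much as the character-level one for the same Thai text. That ratio is the kind of gap the token counts in the app are meant to expose.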