ll-monkey committed on
Commit 502cb8b · verified · 1 Parent(s): 437ff7c

Update src/streamlit_app.py

Files changed (1)
  1. src/streamlit_app.py +22 -0
src/streamlit_app.py CHANGED
@@ -56,6 +56,28 @@ st.markdown("""
  Compare how different LLMs 'see' Thai text. Efficient tokenization (lower Token Count)
  usually leads to lower inference costs and better performance for Thai language tasks.
  """)
+ with st.expander("Why does tokenization matter for Thai OCR and documents?", expanded=False):
+     st.markdown("""
+ ### **The Problem: Unstructured Thai Government Data**
+ When processing **OCR text from official Thai documents**, I face a unique challenge.
+ Thai is a non-segmented language (it has no spaces between words, and official documents also use Thai numerals), and legal/official
+ vocabulary is highly complex.
+
+ ### **What is this?**
+ This Arena helps visualize which LLM tokenizer handles Thai document text most
+ efficiently. A "good" tokenizer splits text into meaningful word-like units; a "bad" one breaks
+ it into individual characters or bytes that carry no meaning on their own.
+
+ ### **Why does this matter?**
+ * **Cost Efficiency:** Models that use fewer tokens to represent the same text are cheaper to run.
+ * **Memory (Context):** Efficient tokenization lets you fit longer documents into a model's context window.
+ * **Accuracy:** Better tokenization can reduce hallucinations in RAG (Retrieval-Augmented Generation) systems.
+
+ ### **How to use**
+ 1. Select models from the sidebar.
+ 2. Paste your Thai text.
+ 3. Look for the most cost-effective model for your data (better segmentation and a lower token count).
+ """)
 
  # put choice on the sidebar
  with st.sidebar:
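
The token-count comparison described in the expander text can be illustrated with a minimal sketch. It does not use the app's real tokenizers: `char_tokenize` and `byte_tokenize` are hypothetical stand-ins for a tokenizer that splits Thai into single characters and one that falls back to raw UTF-8 bytes, and the price passed to `estimate_cost` is an invented figure used only to show the arithmetic.

```python
from typing import Callable, Dict, List


def char_tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: one token per Unicode code point."""
    return list(text)


def byte_tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: one token per UTF-8 byte (byte-level fallback)."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


# Illustrative registry; the real app would plug in actual model tokenizers here.
TOKENIZERS: Dict[str, Callable[[str], List[str]]] = {
    "char_level": char_tokenize,
    "byte_level": byte_tokenize,
}


def compare_token_counts(
    text: str, tokenizers: Dict[str, Callable[[str], List[str]]]
) -> Dict[str, int]:
    """Return the number of tokens each tokenizer produces for the same text."""
    return {name: len(tokenize(text)) for name, tokenize in tokenizers.items()}


def estimate_cost(token_count: int, price_per_1k_tokens: float) -> float:
    """Estimate cost from a token count and an assumed price per 1,000 tokens."""
    return token_count / 1000 * price_per_1k_tokens


sample = "สวัสดี"  # "hello" in Thai: 6 code points, 3 UTF-8 bytes each
counts = compare_token_counts(sample, TOKENIZERS)
for name, count in counts.items():
    # 0.5 is a made-up price per 1,000 tokens, not a real model's pricing.
    print(f"{name}: {count} tokens, est. cost {estimate_cost(count, 0.5):.4f}")
```

Because every character in the Thai Unicode block takes three bytes in UTF-8, the byte-level stand-in produces three times as many tokens as the character-level one for the same text, which is the kind of gap the Arena is meant to surface.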