Update src/streamlit_app.py
- src/streamlit_app.py +22 -0
src/streamlit_app.py
CHANGED
|
@@ -56,6 +56,28 @@ st.markdown("""
|
|
| 56 |
Compare how different LLMs 'see' Thai text. Efficient tokenization (lower Token Count)
|
| 57 |
usually leads to lower inference costs and better performance for Thai language tasks.
|
| 58 |
""")
| 59 |
|
| 60 |
# put choice on the sidebar
|
| 61 |
with st.sidebar:
|
|
|
|
| 56 |
Compare how different LLMs 'see' Thai text. Efficient tokenization (lower Token Count)
|
| 57 |
usually leads to lower inference costs and better performance for Thai language tasks.
|
| 58 |
""")
|
| 59 |
+
with st.expander("Why does tokenization matter for Thai OCR and documents?", expanded=False):
|
| 60 |
+
st.markdown("""
|
| 61 |
+
### **The Problem: Unstructured Thai Government Data**
|
| 62 |
+
When processing **OCR text from official Thai documents**, I face a unique challenge.
|
| 63 |
+
Thai is a non-segmented language (no spaces between words, and documents often use Thai numerals), and legal/official
|
| 64 |
+
vocabulary is highly complex.
|
| 65 |
+
|
| 66 |
+
### **What is this?**
|
| 67 |
+
This Arena helps visualize which LLM "understands" Thai document structures most
|
| 68 |
+
efficiently. A "good" tokenizer sees words as meaningful units; a "bad" one breaks
|
| 69 |
+
them into meaningless characters.
|
| 70 |
+
|
| 71 |
+
### **Why does this matter?**
|
| 72 |
+
* **Cost Efficiency:** Models that use fewer tokens to represent the same text are cheaper to run.
|
| 73 |
+
* **Memory (Context):** Efficient tokenization lets you fit longer documents into a model's context window before hitting its token limit.
|
| 74 |
+
* **Accuracy:** Tokenization that keeps words intact tends to preserve meaning, which can reduce hallucinations in RAG (Retrieval-Augmented Generation) systems.
|
| 75 |
+
|
| 76 |
+
### **How to use**
|
| 77 |
+
1. Select models from the sidebar.
|
| 78 |
+
2. Paste your Thai text.
|
| 79 |
+
3. Look for the most cost-effective model for your data (cleaner word segmentation and a lower token count).
|
| 80 |
+
""")
|
| 81 |
|
| 82 |
# put choice on the sidebar
|
| 83 |
with st.sidebar:
|
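The "good" versus "bad" tokenizer contrast in the expander text can be illustrated with a small, self-contained sketch. It is not the app's code and does not use any real model's tokenizer: the word list, the sample sentence, and the helper names (`byte_tokens`, `char_tokens`, `longest_match_tokens`) are all assumptions made up for illustration. It splits the same Thai string at three granularities and reports how the token count grows as the units get less meaningful.

```python
# Toy comparison of tokenization granularities for Thai text.
# WORDS, SAMPLE and all helper names are hypothetical; no real LLM
# tokenizer is involved.

WORDS = ["ภาษา", "ไทย", "ไม่", "มี", "ช่อง", "ว่าง"]  # "Thai has no spaces"
VOCAB = set(WORDS)
SAMPLE = "".join(WORDS)  # Thai is written without spaces between words


def byte_tokens(text: str) -> list[bytes]:
    """Worst case: one token per UTF-8 byte (Thai letters take 3 bytes each)."""
    return [bytes([b]) for b in text.encode("utf-8")]


def char_tokens(text: str) -> list[str]:
    """One token per Unicode code point."""
    return list(text)


def longest_match_tokens(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation, falling back to single characters."""
    max_len = max(map(len, vocab))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


counts = {
    "bytes": len(byte_tokens(SAMPLE)),
    "characters": len(char_tokens(SAMPLE)),
    "words": len(longest_match_tokens(SAMPLE, VOCAB)),
}
baseline = counts["words"]
for name, n in counts.items():
    # Token count is a rough proxy for cost and context usage.
    print(f"{name}: {n} tokens, {n / baseline:.1f}x the word-level count")
```

Real LLM tokenizers sit somewhere between these extremes, which is why comparing their token counts on your own documents is more informative than this toy example.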