---
license: apache-2.0
language:
- my
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- myanmar
- burmese
- nlp
library_name: sentence-transformers
dataset_size: 500000
loss: MSELoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
widget:
- source_sentence: ▁ထို့ကြောင့် ကြော်ငြာ ရှင် သည် နှိပ် လိုက်ပါ ကသာ ပေးချေ လိမ့်မည်။
sentences:
- ▁ကိုယ်ပိုင် စိတ်ကူး ဉာဏ် ဖြင့် ▁တီထွင် ရေးသား နိုင်သည်။
- ▁ထိုအရာ အားလုံးက ▁အလွန် စိတ်လေး စရာ၊ ▁ကြောက်စရာကောင်း လှ သည်ဟု ▁ခံစား မိသည်။
datasets:
- DatarrX/myX-Mega-Corpus
---
# 📝 myX-Semantic-Light: An Efficient Burmese Sentence Embedding Model
## Model Description
**myX-Semantic-Light** is a lightweight sentence-transformer model optimized for the Burmese (Myanmar 🇲🇲) language. It is designed for high-speed inference and low-resource environments while maintaining robust semantic understanding.
This model was trained via **knowledge distillation** from a multilingual teacher model. It maps Burmese sentences into a **384-dimensional dense vector space**, so its embeddings take half the memory of standard 768-dimensional models.
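The distillation objective can be sketched in isolation: the student is trained to reproduce the teacher's sentence embeddings under mean-squared error (the `MSELoss` listed in the specifications below). The arrays here are random stand-ins, not real model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings: in real distillation these would come from the
# teacher and student models encoding the same batch of Burmese sentences.
teacher_emb = rng.standard_normal((8, 384))                     # teacher outputs (384-dim)
student_emb = teacher_emb + 0.01 * rng.standard_normal((8, 384))  # imperfect student

# MSELoss: mean squared difference between student and teacher vectors.
mse = np.mean((student_emb - teacher_emb) ** 2)
print(f"distillation MSE: {mse:.6f}")
```

Minimizing this quantity pushes the student's vector space toward the teacher's, which is what lets the smaller model inherit the teacher's semantics.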
### Key Applications
* **Real-time Semantic Search:** Ideal for mobile or edge applications requiring fast retrieval.
* **Efficient Clustering:** Grouping large-scale Burmese datasets with reduced memory overhead.
* **Similarity Scoring:** Determining the relationship between short phrases and sentences.
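A minimal semantic-search sketch over precomputed embeddings illustrates the retrieval use case. The corpus vectors below are random placeholders standing in for `model.encode(...)` outputs; only the ranking logic is the point.

```python
import numpy as np

def cosine_sim(query, corpus):
    # Cosine similarity between one query vector and a matrix of corpus vectors.
    query = query / np.linalg.norm(query)
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return corpus @ query

rng = np.random.default_rng(42)
corpus_emb = rng.standard_normal((5, 384))                  # pretend corpus embeddings
query_emb = corpus_emb[2] + 0.1 * rng.standard_normal(384)  # query close to entry 2

scores = cosine_sim(query_emb, corpus_emb)
best = int(np.argmax(scores))
print(best)  # index of the closest corpus entry
```

In production the corpus embeddings would be computed once and cached, so each query costs a single matrix-vector product.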
## Development & Distribution
* **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
* **Published by:** [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX)
* **Training Dataset:** [DatarrX/myX-Mega-Corpus](https://huggingface.co/datasets/DatarrX/myX-Mega-Corpus) (500,000 Rows)
* **Tokenization:** Processed using [DatarrX/myX-Tokenizer](https://huggingface.co/DatarrX/myX-Tokenizer).
## Technical Specifications
- **Base Model:** `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
- **Max Sequence Length:** 128 tokens (Optimized for short-to-medium text)
- **Output Dimension:** 384 dimensions
- **Similarity Function:** Cosine Similarity
- **Loss Function:** MSELoss
### Model Architecture
```text
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'BertModel'})
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
)
```
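The mean-pooling step in module (1) can be demonstrated on its own: token embeddings from the BERT backbone are averaged, weighted by the attention mask so padding is ignored, to produce one 384-dimensional sentence vector. The token embeddings here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

seq_len, dim = 128, 384
token_emb = rng.standard_normal((seq_len, dim))  # stand-in BertModel token outputs
attention_mask = np.zeros(seq_len)
attention_mask[:20] = 1                          # only 20 real (non-padding) tokens

# Mean pooling: average only the non-padding token embeddings.
masked = token_emb * attention_mask[:, None]
sentence_emb = masked.sum(axis=0) / attention_mask.sum()

print(sentence_emb.shape)  # (384,)
```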
## Usage
### Installation
```bash
pip install -U sentence-transformers
```
### Direct Usage (Inference)
```python
from sentence_transformers import SentenceTransformer
# Load the lightweight model
model = SentenceTransformer("DatarrX/myX-Semantic-Light")
sentences = [
    "ဝက်ခြံ ပျောက်ကင်းအောင် ဘယ်လိုလုပ်ရမလဲ။",  # "How do I get rid of acne?"
    "မျက်နှာ အသားအရေ ထိန်းသိမ်းနည်းများ",  # "Facial skin-care tips"
    "နည်းပညာ သတင်းများ ဖတ်ရှုရန်"  # "Read technology news"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
```
## Implementation Guidelines (Thresholds)
Because this model is a lightweight variant trained on a smaller subset (500K rows), its score distribution differs slightly from the 1M SOTA version.
* **Recommended Threshold:** A Cosine Similarity score of **0.40 or higher** is generally sufficient to indicate a semantic relationship.
* **Note:** For tasks requiring higher precision and deeper contextual reasoning, we recommend using the larger [myX-Semantic](https://huggingface.co/DatarrX/myX-Semantic) (1M) version with a threshold of 0.60.
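The threshold above reduces to a simple decision rule. The similarity scores below are hypothetical examples, not measurements from the model.

```python
THRESHOLD = 0.40  # recommended cutoff for myX-Semantic-Light

def is_related(score: float, threshold: float = THRESHOLD) -> bool:
    # Treat a sentence pair as semantically related when its cosine
    # similarity meets or exceeds the threshold.
    return score >= threshold

# Hypothetical cosine-similarity scores for three sentence pairs.
scores = [0.72, 0.41, 0.18]
print([is_related(s) for s in scores])  # [True, True, False]
```

If you later switch to the larger 1M model, only `THRESHOLD` needs to change (to 0.60), not the calling code.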
## Training Details
* **Samples:** 500,000 training pairs.
* **Batch Size:** 64
* **Epochs:** 1
* **Optimizer:** AdamW (`adamw_torch_fused`)
* **Training Time:** ~37 minutes on a multi-GPU setup.
### Training Logs
| Epoch | Step | Training Loss |
| :--- | :--- | :--- |
| 0.13 | 500 | 0.0035 |
| 0.51 | 2000 | 0.0029 |
| 0.90 | 3500 | 0.0027 |
## Limitations & Bias
* **Encoding:** Optimized for Unicode Burmese. Zawgyi encoding is not supported.
* **Sequence Length:** Performance may degrade for documents longer than 128 tokens due to the sequence length constraint during training.
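One common workaround for the 128-token limit is to split a long tokenized document into windows, encode each window separately, and average the resulting vectors. A minimal chunking sketch, using a list of integers as a stand-in for real token IDs:

```python
def chunk(tokens, max_len=128, stride=128):
    # Split a token sequence into windows of at most max_len tokens.
    # A stride smaller than max_len would produce overlapping windows.
    return [tokens[i:i + max_len] for i in range(0, len(tokens), stride)]

tokens = list(range(300))        # stand-in for a long tokenized document
chunks = chunk(tokens)
print([len(c) for c in chunks])  # [128, 128, 44]
```

Averaging per-chunk embeddings recovers a document-level vector, at some cost in precision compared to a model trained on long sequences.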
## License
This model is licensed under the **Apache License 2.0**.
## Citation
```bibtex
@software{khantsintheinn2026myxsemantic_light,
author = {Khant Sint Heinn},
title = {myX-Semantic-Light: An Efficient Burmese Sentence Embedding Model},
year = {2026},
publisher = {DatarrX},
  url = {https://huggingface.co/DatarrX/myX-Semantic-Light}
}
```
## About the Author
**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
**Connect with the Author:**
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)