---
license: apache-2.0
language:
- my
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- generated_from_trainer
- myanmar
- burmese
- nlp
library_name: sentence-transformers
dataset_size: 500000
loss: MSELoss
base_model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
widget:
- source_sentence: ▁ထို့ကြောင့် ကြော်ငြာ ရှင် သည် နှိပ် လိုက်ပါ ကသာ ပေးချေ လိမ့်မည်။
  sentences:
  - ▁ကိုယ်ပိုင် စိတ်ကူး ဉာဏ် ဖြင့် ▁တီထွင် ရေးသား နိုင်သည်။
  - ▁ထိုအရာ အားလုံးက ▁အလွန် စိတ်လေး စရာ၊ ▁ကြောက်စရာကောင်း လှ သည်ဟု ▁ခံစား မိသည်။
datasets:
- DatarrX/myX-Mega-Corpus
---

# 📝 myX-Semantic-Light: An Efficient Burmese Sentence Embedding Model

## Model Description
**myX-Semantic-Light** is a lightweight sentence-transformer model optimized for the Burmese (Myanmar 🇲🇲) language. It is designed for high-speed inference and low-resource environments while maintaining robust semantic understanding.

This model was trained using **Knowledge Distillation** from a multilingual teacher model. It maps Burmese sentences into a **384-dimensional dense vector space**, requiring half the vector storage of standard 768-dimensional models.
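For intuition, the distillation objective can be sketched as a plain MSE between student and teacher embeddings. This is a conceptual sketch only, not the released training script, and it assumes the teacher's embeddings are already compatible with the student's 384-dimensional space:

```python
import torch
import torch.nn.functional as F

# Conceptual sketch of the distillation objective (illustrative tensors,
# not real data). `student_emb` stands for this model's 384-d outputs on a
# batch of Burmese sentences; `teacher_emb` stands for the teacher model's
# embeddings of the same sentences, assumed here to share the 384-d space.
student_emb = torch.randn(8, 384, requires_grad=True)
teacher_emb = torch.randn(8, 384)

# MSELoss pulls each student embedding toward the teacher's, transferring
# the teacher's semantic space to the lightweight student.
loss = F.mse_loss(student_emb, teacher_emb)
loss.backward()
```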

### Key Applications
*   **Real-time Semantic Search:** Ideal for mobile or edge applications requiring fast retrieval (a retrieval sketch follows this list).
*   **Efficient Clustering:** Grouping large-scale Burmese datasets with reduced memory overhead.
*   **Similarity Scoring:** Determining the relationship between short phrases and sentences.
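
As an illustration of the retrieval use case, here is a minimal semantic-search sketch using `util.semantic_search` from `sentence-transformers`; the query and the two-document corpus are borrowed from the usage example below and stand in for a real collection:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("DatarrX/myX-Semantic-Light")

# A hypothetical mini-corpus; replace with your own documents.
corpus = [
    "မျက်နှာ အသားအရေ ထိန်းသိမ်းနည်းများ",
    "နည်းပညာ သတင်းများ ဖတ်ရှုရန်",
]
query = "ဝက်ခြံ ပျောက်ကင်းအောင် ဘယ်လိုလုပ်ရမလဲ။"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Returns one ranked hit list per query; each hit is a dict
# {"corpus_id": int, "score": float} with cosine scores by default.
hits = util.semantic_search(query_emb, corpus_emb, top_k=2)[0]
for hit in hits:
    print(f'{hit["score"]:.3f}  {corpus[hit["corpus_id"]]}')
```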

## Development & Distribution
*   **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
*   **Published by:** [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX)
*   **Training Dataset:** [DatarrX/myX-Mega-Corpus](https://huggingface.co/datasets/DatarrX/myX-Mega-Corpus) (500,000 Rows)
*   **Tokenization:** Processed using [DatarrX/myX-Tokenizer](https://huggingface.co/DatarrX/myX-Tokenizer).

## Technical Specifications
- **Base Model:** `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
- **Max Sequence Length:** 128 tokens (Optimized for short-to-medium text)
- **Output Dimension:** 384 dimensions
- **Similarity Function:** Cosine Similarity
- **Loss Function:** MSELoss

### Model Architecture
```text
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_mean_tokens': True})
)
```
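
The `Pooling` module mean-pools token embeddings into the final 384-d sentence vector. For reference, the equivalent computation with the raw `transformers` weights looks roughly like this (a sketch assuming the repository follows the standard sentence-transformers layout, where the underlying BERT weights load via `AutoModel`; the `SentenceTransformer` wrapper does all of this for you):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("DatarrX/myX-Semantic-Light")
encoder = AutoModel.from_pretrained("DatarrX/myX-Semantic-Light")

batch = tokenizer(["မျက်နှာ အသားအရေ ထိန်းသိမ်းနည်းများ"],
                  padding=True, truncation=True, max_length=128,
                  return_tensors="pt")
with torch.no_grad():
    token_embs = encoder(**batch).last_hidden_state   # (1, seq_len, 384)

# Mean pooling: average token embeddings, ignoring padding positions
mask = batch["attention_mask"].unsqueeze(-1)          # (1, seq_len, 1)
sentence_emb = (token_embs * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 384)
```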

## Usage

### Installation
```bash
pip install -U sentence-transformers
```

### Direct Usage (Inference)
```python
from sentence_transformers import SentenceTransformer

# Load the lightweight model (downloads from the Hugging Face Hub on first use)
model = SentenceTransformer("DatarrX/myX-Semantic-Light")

sentences = [
    "ဝက်ခြံ ပျောက်ကင်းအောင် ဘယ်လိုလုပ်ရမလဲ။",
    "မျက်နှာ အသားအရေ ထိန်းသိမ်းနည်းများ",
    "နည်းပညာ သတင်းများ ဖတ်ရှုရန်"
]

embeddings = model.encode(sentences)                      # shape: (3, 384)
similarities = model.similarity(embeddings, embeddings)   # 3x3 cosine matrix
print(similarities)
```

## Implementation Guidelines (Thresholds)
Because this model is a lightweight variant trained on a smaller subset (500K rows), its score distribution differs slightly from that of the 1M-row SOTA version.

*   **Recommended Threshold:** A cosine similarity score of **0.40 or higher** is generally sufficient to indicate a semantic relationship (see the filtering sketch below).
*   **Note:** For tasks requiring higher precision and deeper contextual reasoning, we recommend using the larger [myX-Semantic](https://huggingface.co/DatarrX/myX-Semantic) (1M) version with a threshold of 0.60.
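
A minimal sketch of applying that threshold to candidate pairs (the pairs here are hypothetical examples reused from the widget data above):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("DatarrX/myX-Semantic-Light")

THRESHOLD = 0.40  # recommended cut-off for this lightweight variant

pairs = [
    ("ဝက်ခြံ ပျောက်ကင်းအောင် ဘယ်လိုလုပ်ရမလဲ။",
     "မျက်နှာ အသားအရေ ထိန်းသိမ်းနည်းများ"),
    ("ဝက်ခြံ ပျောက်ကင်းအောင် ဘယ်လိုလုပ်ရမလဲ။",
     "နည်းပညာ သတင်းများ ဖတ်ရှုရန်"),
]

for a, b in pairs:
    emb = model.encode([a, b])
    # model.similarity computes cosine similarity by default
    score = float(model.similarity(emb[0:1], emb[1:2])[0][0])
    print(f"score={score:.3f}  related={score >= THRESHOLD}")
```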

## Training Details
*   **Samples:** 500,000 training pairs.
*   **Batch Size:** 64
*   **Epochs:** 1
*   **Optimizer:** AdamW (`adamw_torch_fused`)
*   **Training Time:** ~37 minutes on a multi-GPU setup.

### Training Logs
| Epoch | Step | Training Loss |
| :--- | :--- | :--- |
| 0.13 | 500 | 0.0035 |
| 0.51 | 2000 | 0.0029 |
| 0.90 | 3500 | 0.0027 |

## Limitations & Bias
*   **Encoding:** Optimized for Unicode Burmese. Zawgyi encoding is not supported.
*   **Sequence Length:** Performance may degrade for documents longer than 128 tokens due to the sequence-length constraint during training; one common workaround is chunking, sketched below.
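
For longer documents, you can chunk the text, embed each chunk, and mean-pool the results. This is a generic sketch, not an officially tuned recipe; the character window and stride values are placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("DatarrX/myX-Semantic-Light")

def embed_long_text(text: str, window: int = 300, stride: int = 250) -> np.ndarray:
    # Character windows are a crude stand-in for token-aware chunking;
    # keep each chunk comfortably under the 128-token limit.
    chunks = [text[i:i + window]
              for i in range(0, max(len(text) - window, 0) + 1, stride)]
    embeddings = model.encode(chunks)      # shape: (num_chunks, 384)
    return embeddings.mean(axis=0)         # mean-pool into one 384-d vector
```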

## License
This model is licensed under the **Apache License 2.0**.

## Citation
```bibtex
@software{khantsintheinn2026myxsemantic_light,
  author = {Khant Sint Heinn},
  title = {myX-Semantic-Light: An Efficient Burmese Sentence Embedding Model},
  year = {2026},
  publisher = {DatarrX},
  url = {https://huggingface.co/DatarrX/myX-Semantic-Light}
}
```

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:**  
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)