---
language:
- kk
- ru
- en
license: apache-2.0
tags:
- feature-extraction
- sentence-similarity
- multilingual
pipeline_tag: sentence-similarity
base_model: BAAI/bge-m3
model-index:
- name: darmm-embedding-multilingual
  results:
  - task:
      type: retrieval
      name: Retrieval
    metrics:
    - type: recall_at_1
      value: 0.9444
    - type: recall_at_3
      value: 1.0
    - type: recall_at_5
      value: 1.0
    - type: recall_at_10
      value: 1.0
---
# Darmm Multilingual Embedding
Multilingual embedding model (Kazakh/Russian/English) fine-tuned from `BAAI/bge-m3` for Darmm FAQ and product content retrieval.
## Usage
### Direct model usage (Hugging Face)
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Darmm/darmm-embedding-multilingual")
sentences = ["Darmm қызметтері қандай?", "What services does Darmm provide?"]
embeddings = model.encode(sentences)
print(embeddings.shape)
```
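For retrieval, the embeddings are typically compared by cosine similarity. The following is a minimal sketch, not code from this repo: `top_k` is a hypothetical helper, and the toy 2-D vectors stand in for the arrays that `model.encode(...)` would return.

```python
import numpy as np

def top_k(query_emb: np.ndarray, corpus_emb: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k corpus rows most similar to the query."""
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    scores = c @ q
    # Sort by descending similarity and keep the first k positions.
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()

# Toy vectors (assumption): in practice these come from model.encode(...).
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([0.9, 0.1])
indices, scores = top_k(query, corpus, k=2)
```

With real data, `corpus` would be the encoded FAQ chunks and `query` a single encoded user question.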
## Training data (verified)
- Darmm landing, academy, and mentor site text extracted from local sources.
## Training setup
- Base model: `BAAI/bge-m3`.
- Loss: `MultipleNegativesRankingLoss` (default in `scripts/train_embeddings.py`).
- Typical training params in this repo: `epochs=3`, `batch_size=2`, `max_seq_length=128`.
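To make the loss concrete, here is a rough NumPy sketch of the idea behind `MultipleNegativesRankingLoss`: each anchor is scored against every positive in the batch, and the other anchors' positives act as negatives. This is an illustration, not the code in `scripts/train_embeddings.py`; the function name `mnr_loss` is hypothetical, and `scale=20.0` with cosine similarity is assumed to mirror the sentence-transformers defaults.

```python
import numpy as np

def mnr_loss(anchor_emb: np.ndarray, positive_emb: np.ndarray, scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine similarities with in-batch negatives."""
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    p = positive_emb / np.linalg.norm(positive_emb, axis=1, keepdims=True)
    # logits[i, j] = scaled similarity between anchor i and positive j.
    logits = scale * (a @ p.T)
    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The correct "class" for anchor i is positive i (the diagonal).
    return float(-np.mean(np.diag(log_probs)))

# Toy batch of size 2 (assumption): aligned pairs vs. swapped pairs.
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_aligned = mnr_loss(anchors, anchors)
loss_swapped = mnr_loss(anchors, anchors[::-1])
```

Under this formulation, and assuming no extra hard negatives are supplied, `batch_size=2` gives each anchor only one in-batch negative per step, so larger batches generally provide a stronger training signal if memory allows.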
## Evaluation
Evaluation uses paraphrased FAQ questions mapped to the FAQ corpus:
- Corpus: `data/faq_chunks.jsonl` (369 chunks)
- Queries: `data/eval_questions.jsonl` (90 questions)
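The Recall@k metrics can be computed along the following lines. This is a sketch under assumptions, not the repo's evaluation script: `recall_at_k` is a hypothetical helper, each query is assumed to have exactly one gold chunk, and the toy vectors stand in for encoded queries and chunks (the field layout of the `.jsonl` files is not shown here).

```python
import numpy as np

def recall_at_k(query_emb, corpus_emb, gold_idx, ks=(1, 3, 5, 10)):
    """Fraction of queries whose gold chunk appears among the top-k results."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    # ranking[i] lists corpus indices for query i, most similar first.
    ranking = np.argsort(-(q @ c.T), axis=1)
    gold = np.asarray(gold_idx)[:, None]
    # Zero-based position of the gold chunk in each query's ranking.
    ranks = np.argmax(ranking == gold, axis=1)
    return {k: float(np.mean(ranks < k)) for k in ks}

# Toy data (assumption): 3 chunks, 3 queries, one gold chunk index per query.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
queries = np.array([[0.9, 0.1], [0.1, 0.9], [0.6, 0.5]])
recalls = recall_at_k(queries, corpus, gold_idx=[0, 1, 0], ks=(1, 2, 3))
```

If each of the 90 queries has a single gold chunk and counts equally, a Recall@1 of 0.9444 would correspond to 85 correct top-1 hits (85/90 ≈ 0.9444).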
## Paper & Documentation
<details>
<summary>🇬🇧 English</summary>
# Darmm: Multilingual Embeddings for FAQ Retrieval
## Abstract
We present a multilingual embedding model fine‑tuned for Darmm FAQ and product knowledge retrieval in Kazakh, Russian, and English. The model is based on `BAAI/bge-m3` and trained on Darmm website content and a handcrafted FAQ corpus. We evaluate on paraphrased FAQ questions mapped to the FAQ corpus.
## 1. Dataset
- **Sources**: Darmm landing, academy, and mentor site content (local sources) plus handcrafted FAQ data.
- **FAQ corpus**: 150 topics × 3 languages = 450 Q/A documents.
- **Chunked corpus**: 369 chunks in `data/faq_chunks.jsonl`.
## 2. Training
- **Base model**: `BAAI/bge-m3`
- **Loss**: `MultipleNegativesRankingLoss`
- **Params**: `epochs=3`, `batch_size=2`, `max_seq_length=128`
## 3. Results
Evaluation on `data/eval_questions.jsonl` (90 paraphrased queries) against the FAQ corpus:
- Recall@1 = 0.9444
- Recall@3/5/10 = 1.0
## 4. Limitations
- Performance depends on query style and corpus quality.
- Short UI strings can reduce relevance; prefer fuller FAQ entries or documentation as corpus text.
- Validate with real user questions and a held‑out test set.
</details>
<details>
<summary>🇰🇿 Kazakh</summary>
# Darmm: Multilingual Embeddings for FAQ Search
## Abstract
This model is intended for searching Darmm's FAQ and product knowledge base in Kazakh, Russian, and English. It is based on `BAAI/bge-m3` and was trained on Darmm site content and a handcrafted FAQ set. Evaluation was performed with paraphrased questions.
## 1. Data
- **Sources**: local content from the Darmm landing/academy/mentor sites and the FAQ set.
- **FAQ corpus**: 150 topics × 3 languages = 450 Q/A documents.
- **Chunked corpus**: 369 chunks in `data/faq_chunks.jsonl`.
## 2. Training
- **Base model**: `BAAI/bge-m3`
- **Loss**: `MultipleNegativesRankingLoss`
- **Params**: `epochs=3`, `batch_size=2`, `max_seq_length=128`
## 3. Results
Evaluation on `data/eval_questions.jsonl` (90 paraphrased questions):
- Recall@1 = 0.9444
- Recall@3/5/10 = 1.0
## 4. Limitations
- Results depend on query style and corpus quality.
- Short UI strings can reduce relevance.
- Be sure to validate with real user questions.
</details>
<details>
<summary>🇷🇺 Russian</summary>
# Darmm: Multilingual Embeddings for FAQ Search
## Abstract
The model is intended for search over Darmm's FAQ and knowledge base in Kazakh, Russian, and English. It is based on `BAAI/bge-m3` and fine-tuned on local content from the Darmm sites and a handcrafted FAQ corpus. Evaluation is performed on paraphrased questions.
## 1. Data
- **Sources**: local content from the Darmm sites and the FAQ corpus.
- **FAQ corpus**: 150 topics × 3 languages = 450 Q/A documents.
- **Chunked corpus**: 369 chunks in `data/faq_chunks.jsonl`.
## 2. Training
- **Base model**: `BAAI/bge-m3`
- **Loss**: `MultipleNegativesRankingLoss`
- **Params**: `epochs=3`, `batch_size=2`, `max_seq_length=128`
## 3. Results
Evaluation on `data/eval_questions.jsonl` (90 paraphrased queries):
- Recall@1 = 0.9444
- Recall@3/5/10 = 1.0
## 4. Limitations
- Results depend on query style and corpus quality.
- Short UI strings reduce relevance.
- Validate on real user questions.
</details>
## Intended use
- FAQ search and internal knowledge retrieval across kk/ru/en.
- RAG pipelines for Darmm services.
## Limitations
- Results depend on corpus quality and query style.
- Short UI strings reduce relevance; prefer fuller FAQ or documentation.
- For real-world validation, use actual user queries and a held‑out test set.