---
language:
- kk
- ru
- en
license: apache-2.0
tags:
- feature-extraction
- sentence-similarity
- multilingual
pipeline_tag: sentence-similarity
base_model: BAAI/bge-m3
model-index:
- name: darmm-embedding-multilingual
  results:
  - task:
      type: retrieval
      name: Retrieval
    metrics:
    - type: recall_at_1
      value: 0.9444
    - type: recall_at_3
      value: 1.0
    - type: recall_at_5
      value: 1.0
    - type: recall_at_10
      value: 1.0
---

# Darmm Multilingual Embedding

Multilingual embedding model (Kazakh/Russian/English) fine-tuned from `BAAI/bge-m3` for Darmm FAQ and product content retrieval.

## Usage

### Direct model usage (Hugging Face)
```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hugging Face Hub.
model = SentenceTransformer("Darmm/darmm-embedding-multilingual")
sentences = ["Darmm қызметтері қандай?", "What services does Darmm provide?"]
# Encode the sentences; the result has one embedding row per input sentence.
embeddings = model.encode(sentences)
print(embeddings.shape)
```
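
Once sentences are encoded, retrieval comes down to ranking corpus embeddings by similarity to the query embedding. The following is a minimal sketch of that ranking step, assuming cosine similarity as the scoring function. The helper names `cosine` and `top_k` and the toy 3-dimensional vectors are illustrative and not part of this repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings (illustrative only).
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [0.9, 0.1, 0.0]
hits = top_k(query, docs, k=2)
print(hits)
```

With real embeddings, the rows returned by `model.encode(...)` can be passed in place of the toy vectors. For larger corpora, a vectorized routine such as `sentence_transformers.util.semantic_search` is likely a better fit than this pure-Python loop.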

## Training data (verified)
- Darmm landing, academy, and mentor site text extracted from local sources.

## Training setup
- Base model: `BAAI/bge-m3`.
- Loss: `MultipleNegativesRankingLoss` (default in `scripts/train_embeddings.py`).
- Typical training params in this repo: `epochs=3`, `batch_size=2`, `max_seq_length=128`.
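
To make the loss above concrete, here is a rough pure-Python sketch of the idea behind `MultipleNegativesRankingLoss`: each anchor is scored against every positive in the batch, and cross-entropy pushes the matching pair above the rest. It is not the code in `scripts/train_embeddings.py`. The function name `mnr_loss` and the toy vectors are hypothetical, and `scale=20.0` with cosine similarity is assumed here to match the library defaults.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# A batch of two pairs: each anchor sees exactly one in-batch negative.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

Because in-batch negatives come from the other pairs in the batch, `batch_size=2` gives each anchor a single negative per step. Larger batches provide more negatives if memory allows.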

## Evaluation
Evaluation uses paraphrased FAQ questions mapped to the FAQ corpus:
- Corpus: `data/faq_chunks.jsonl` (369 chunks)
- Queries: `data/eval_questions.jsonl` (90 questions)
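
The Recall@k metric reported for this setup can be sketched as follows, assuming each query maps to exactly one gold chunk. The JSONL schemas are not documented here, so the query ids, chunk ids, and the function name `recall_at_k` below are hypothetical. As a sanity check on the reported number, a Recall@1 of 0.9444 over 90 queries is consistent with 85 queries answered at rank 1 (85/90 ≈ 0.9444).

```python
def recall_at_k(rankings, gold, k):
    """Fraction of queries whose gold chunk id appears among the top-k ranked ids."""
    hits = sum(1 for qid, ranked in rankings.items() if gold[qid] in ranked[:k])
    return hits / len(rankings)

# Hypothetical ranked chunk ids per query, best match first.
rankings = {
    "q1": ["c1", "c7", "c3"],
    "q2": ["c5", "c2", "c9"],
    "q3": ["c3", "c4", "c8"],
}
# Hypothetical gold chunk id for each query.
gold = {"q1": "c1", "q2": "c2", "q3": "c3"}

r1 = recall_at_k(rankings, gold, 1)
r3 = recall_at_k(rankings, gold, 3)
print(r1, r3)
```

In this toy example, two of the three queries have their gold chunk at rank 1 and all three have it within the top 3.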

## Paper & Documentation

<details>
<summary>🇬🇧 English</summary>

# Darmm: Multilingual Embeddings for FAQ Retrieval

## Abstract
We present a multilingual embedding model fine-tuned for Darmm FAQ and product knowledge retrieval in Kazakh, Russian, and English. The model is based on `BAAI/bge-m3` and trained on Darmm website content and a handcrafted FAQ corpus. We evaluate on paraphrased FAQ questions mapped to the FAQ corpus.

## 1. Dataset
- **Sources**: Darmm landing, academy, and mentor site content (local sources) plus handcrafted FAQ data.
- **FAQ corpus**: 150 topics × 3 languages = 450 Q/A documents.
- **Chunked corpus**: 369 chunks in `data/faq_chunks.jsonl`.

## 2. Training
- **Base model**: `BAAI/bge-m3`
- **Loss**: `MultipleNegativesRankingLoss`
- **Params**: `epochs=3`, `batch_size=2`, `max_seq_length=128`

## 3. Results
Evaluation on `data/eval_questions.jsonl` (90 paraphrased queries) against the FAQ corpus:
- Recall@1 = 0.9444
- Recall@3/5/10 = 1.0

## 4. Limitations
- Performance depends on query style and corpus quality.
- Short UI strings can reduce relevance; prefer richer FAQ or docs.
- Validate with real user questions and a held-out test set.

</details>

<details>
<summary>🇰🇿 Қазақша</summary>

# Darmm: FAQ іздеуге арналған көптілді эмбеддингтер

## Аңдатпа
Бұл модель Darmm‑ның FAQ және өнім білім базасын қазақ, орыс және ағылшын тілдерінде іздеуге арналған. Негізі `BAAI/bge-m3`, оқыту Darmm сайт контенті мен қолмен жасалған FAQ жиынына жүргізілді. Бағалау парафраз сұрақтар арқылы жасалды.

## 1. Деректер
- **Көздер**: Darmm landing/academy/mentor сайттарының локал контенті және FAQ жиыны.
- **FAQ корпусы**: 150 тақырып × 3 тіл = 450 Q/A құжаты.
- **Чанкталған корпус**: `data/faq_chunks.jsonl` ішінде 369 чанк.

## 2. Оқыту
- **Негізгі модель**: `BAAI/bge-m3`
- **Loss**: `MultipleNegativesRankingLoss`
- **Параметрлер**: `epochs=3`, `batch_size=2`, `max_seq_length=128`

## 3. Нәтижелер
`data/eval_questions.jsonl` (90 парафраз сұрақ) арқылы бағалау:
- Recall@1 = 0.9444
- Recall@3/5/10 = 1.0

## 4. Шектеулер
- Нәтиже сұрақ стилі мен корпус сапасына тәуелді.
- Қысқа UI мәтіндері релевантты төмендетуі мүмкін.
- Нақты пайдаланушы сұрақтарымен міндетті түрде тексеріңіз.

</details>

<details>
<summary>🇷🇺 Русский</summary>

# Darmm: Мультиязычные эмбеддинги для FAQ‑поиска

## Аннотация
Модель предназначена для поиска по FAQ и базе знаний Darmm на казахском, русском и английском. Основана на `BAAI/bge-m3` и дообучена на локальном контенте сайтов Darmm и ручном FAQ‑корпусе. Оценка проводится на перефразированных вопросах.

## 1. Данные
- **Источники**: локальный контент сайтов Darmm и FAQ‑корпус.
- **FAQ корпус**: 150 тем × 3 языка = 450 Q/A документов.
- **Чанкованный корпус**: 369 чанков в `data/faq_chunks.jsonl`.

## 2. Обучение
- **Базовая модель**: `BAAI/bge-m3`
- **Loss**: `MultipleNegativesRankingLoss`
- **Параметры**: `epochs=3`, `batch_size=2`, `max_seq_length=128`

## 3. Результаты
Оценка на `data/eval_questions.jsonl` (90 перефразированных запросов):
- Recall@1 = 0.9444
- Recall@3/5/10 = 1.0

## 4. Ограничения
- Результаты зависят от стиля запросов и качества корпуса.
- Короткие UI‑строки снижают релевантность.
- Проверяйте на реальных пользовательских вопросах.

</details>

## Intended use
- FAQ search and internal knowledge retrieval across kk/ru/en.
- RAG pipelines for Darmm services.

## Limitations
- Results depend on corpus quality and query style.
- Short UI strings reduce relevance; prefer fuller FAQ or documentation.
- For real-world validation, use actual user queries and a held-out test set.