# gemma2-mitra-embedding
Multilingual sentence embedding model based on Gemma 2, designed for semantic similarity and retrieval. It is used in the Mitra alignment stack to embed sentences in languages such as Sanskrit, Tibetan, Pali, Chinese, and English for cross-lingual sentence alignment and similarity search.

## Model Details

### Model Description

gemma2-mitra-embedding is a **sentence embedding model** that converts text into dense vectors for semantic similarity and retrieval. It follows the FlagLLM-style usage: **asymmetric** encoding with distinct formats for **queries** vs **corpus** passages. The model is built on the Gemma 2 architecture and uses special tokens `<instruct>` and `<query>` for the instruction-following prompt. Embeddings are taken from the last non-padded token hidden state and are L2-normalized. It supports 8-bit quantization (e.g. via BitsAndBytes) for lower memory use.

- **Developed by:** [More Information Needed]
- **Funded by [optional]:** [More Information Needed]
- **Shared by [optional]:** [More Information Needed]
- **Model type:** Sentence embedding model (encoder; Gemma 2 backbone)
- **Language(s) (NLP):** Multilingual — primary use in the repo: Sanskrit (sa), Tibetan (bo), Pali (pa), Chinese (zh), English (en), Hindi (hi). The prompt accepts a language name for the “find similar text in {language}” instruction.
- **License:** [More Information Needed]
- **Finetuned from model [optional]:** Gemma 2 (exact base checkpoint TBD)

### Model Sources [optional]

- **Repository:** [More Information Needed — e.g. Hugging Face model repo or alignment-backend repo]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses
### Direct Use
- **Semantic similarity:** Encode sentences and compare them via cosine similarity (embeddings are L2-normalized).

### Downstream Use [optional]

- RAG or search systems that need multilingual, instruction-aware query/corpus embeddings.
- Any application that consumes L2-normalized sentence vectors from this model.
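
Because the embeddings are L2-normalized, downstream consumers can score similarity with a plain dot product. A minimal pure-Python illustration with made-up 3-dimensional vectors (not real embeddings):

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = l2_normalize([1.0, 2.0, 3.0])
b = l2_normalize([2.0, 1.0, 0.5])

# For unit-length vectors, the dot product equals cosine similarity.
assert abs(dot(a, b) - cosine(a, b)) < 1e-9
```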
### Out-of-Scope Use

- **Generation:** The model is used as an encoder (embedding extraction), not for open-ended text generation in normal use.
- **Classification/QA without adaptation:** Best used for similarity/retrieval; task-specific heads would require additional training or design.
- **Languages not represented in the prompt language set:** Performance may degrade for languages other than those explicitly supported (e.g. bo, en, zh, pa, sa, hi); “Unknown” is used for other codes.

## Bias, Risks, and Limitations

- Training data and demographic coverage are unspecified; the model may reflect biases present in the base Gemma 2 and any fine-tuning data.
- Primary use in this repo is Buddhist/multilingual scholarly text; behavior on other domains (e.g. social media, legal, medical) is not documented.
- Embedding quality depends on using the correct **query** vs **corpus** template and the appropriate language name in the query prompt.

### Recommendations
- Use the exact prompt format (see “How to Get Started”) for queries and corpus.

## How to Get Started with the Model

The model expects **asymmetric** inputs:

- **Query:**
  Wrap the query text in the instruction prompt built from the `<instruct>` and `<query>` special tokens, including the target language name (see the example below).
- **Corpus:**
Use the raw sentence (or passage) text **only**, with no `<instruct>` or `<query>` wrapper.
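
The asymmetric formatting can be captured in a small helper. The exact instruction wording below is an assumption for illustration; only the `<instruct>`/`<query>` tokens and the language-name slot are documented:

```python
def format_query(text: str, language: str = "English") -> str:
    # Hypothetical instruction wording; only the <instruct>/<query> tokens
    # and the "{language}" slot are documented for this model.
    return f"<instruct>Find the similar text in {language}.\n<query>{text}"

def format_corpus(text: str) -> str:
    # Corpus passages are encoded as raw text, with no wrapper.
    return text

q = format_query("dharmakāya", language="Sanskrit")
```

In a retrieval pipeline, only user queries go through `format_query`; corpus passages are passed in unchanged.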
### Example (with 8-bit and Hugging Face Transformers)

The sketch below is reconstructed from the description above (instruction-wrapped queries, last-token pooling, L2 normalization); the model path and the exact instruction wording are placeholders.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "path/to/gemma2-mitra-embedding"  # illustrative path

tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.padding_side = "right"  # so the last non-padded token is easy to locate
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # optional 8-bit
    device_map="auto",
)
model.eval()


def last_token_pool(hidden_states, attention_mask):
    # Hidden state at the last non-padded position of each sequence.
    last_positions = attention_mask.sum(dim=1) - 1
    return hidden_states[torch.arange(hidden_states.size(0)), last_positions]


@torch.no_grad()
def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(model.device)
    hidden = model(**batch, output_hidden_states=True).hidden_states[-1]
    emb = last_token_pool(hidden, batch["attention_mask"])
    return F.normalize(emb, p=2, dim=1)  # L2-normalize: cosine similarity == dot product


# Query: instruction-wrapped (the instruction text here is illustrative).
queries = embed(["<instruct>Find the similar text in English.\n<query>The mind is luminous."])
# Corpus: raw sentence text only, no <instruct>/<query> wrapper.
passages = embed(["Mind is by its nature clear light."])
scores = queries @ passages.T
```

For **corpus** sentences, pass only the raw text (no `<instruct>`/`<query>`), then pool and normalize the same way.

Alternatively, use **FlagEmbedding**’s `FlagLLMModel` with this model path for `encode_queries` and `encode_corpus` (see [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)).

## Training Details

### Training Data

[More Information Needed]

### Training Procedure

#### Preprocessing [optional]

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed] (e.g. fp16/bf16 mixed precision, 8-bit inference as in repo)

#### Speeds, Sizes, Times [optional]

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

[More Information Needed]

## Model Examination [optional]

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective
- **Config (from repo):** `hidden_size=3584`, `num_hidden_layers=42`, `num_attention_heads=16`, `num_key_value_heads=8`, `intermediate_size=14336`, `head_dim=256`, `max_position_embeddings=8192`, `sliding_window=4096`, `vocab_size=256002` (includes special tokens `<instruct>`, `<query>`).
- **Special tokens:** `<instruct>`, `<query>` (see `special_tokens_map.json` / `added_tokens.json` in the model dir).
- **Objective:** Dense retrieval / semantic similarity (asymmetric query/corpus encoding).

### Compute Infrastructure

#### Hardware

- Typically run on GPU (e.g. CUDA) with optional 8-bit quantization to reduce VRAM.

#### Software

- `transformers`, `torch`, `FlagEmbedding` (optional), `bitsandbytes` for 8-bit; see `requirements.txt` in the repo.

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

- **Query vs corpus:** In asymmetric retrieval, queries use an instruction plus the query text; corpus items are encoded as plain text.
- **Last-token pooling:** The embedding is the hidden state at the last non-padded token position.
- **L2 normalization:** Embeddings are normalized so that cosine similarity equals dot product.
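
The two pooling/normalization terms can be illustrated with toy numbers, using plain Python lists as stand-ins for right-padded hidden states:

```python
import math

def last_token_pool(hidden_states, attention_mask):
    # hidden_states: per-sequence list of per-token vectors (right-padded);
    # attention_mask: 1 for real tokens, 0 for padding.
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        last = sum(mask) - 1  # index of the last non-padded token
        pooled.append(states[last])
    return pooled

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Two toy sequences: the second has one padding position.
hidden = [
    [[0.0, 1.0], [3.0, 4.0]],  # last real token at index 1
    [[6.0, 8.0], [9.9, 9.9]],  # index 1 is padding
]
mask = [[1, 1], [1, 0]]

embeddings = [l2_normalize(vec) for vec in last_token_pool(hidden, mask)]
# → [[0.6, 0.8], [0.6, 0.8]]
```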

## More Information [optional]

- This model is used as the default embedder (`gemma2mitra`) in the alignment-backend API (`/embed-sentences/` with `model=gemma2mitra`).
- Cache keys for embeddings include `model_type`, `text`, `language`, and `is_query` (see `embedding_cache.py`).
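
A hedged sketch of what such a cache key might look like; the four fields match the README, but the serialization, hashing, and function name are illustrative, not the actual `embedding_cache.py` implementation:

```python
import hashlib
import json

def embedding_cache_key(model_type: str, text: str, language: str, is_query: bool) -> str:
    # Serialize the four documented fields deterministically, then hash.
    payload = json.dumps(
        {"model_type": model_type, "text": text, "language": language, "is_query": is_query},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key = embedding_cache_key("gemma2mitra", "tathāgata", "sa", True)
```

Including `is_query` in the key matters because the same text embeds differently as a query (instruction-wrapped) than as a corpus passage.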

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]