snehrdich committed · verified
Commit b887229 · Parent(s): f111ec4

Update README.md

Files changed (1):
  1. README.md +1 -133
README.md CHANGED
@@ -12,28 +12,6 @@ tags:
 
 Multilingual sentence embedding model based on Gemma 2, designed for semantic similarity and retrieval. It is used in the Mitra alignment stack to embed sentences in languages such as Sanskrit, Tibetan, Pali, Chinese, and English for cross-lingual sentence alignment and similarity search.
 
-## Model Details
-
-### Model Description
-
-gemma2-mitra-embedding is a **sentence embedding model** that converts text into dense vectors for semantic similarity and retrieval. It follows the FlagLLM-style usage: **asymmetric** encoding with distinct formats for **queries** vs **corpus** passages. The model is built on the Gemma 2 architecture and uses special tokens `<instruct>` and `<query>` for the instruction-following prompt. Embeddings are taken from the last non-padded token hidden state and are L2-normalized. It supports 8-bit quantization (e.g. via BitsAndBytes) for lower memory use.
-
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** Sentence embedding model (encoder; Gemma 2 backbone)
-- **Language(s) (NLP):** Multilingual — primary use in the repo: Sanskrit (sa), Tibetan (bo), Pali (pa), Chinese (zh), English (en), Hindi (hi). The prompt accepts a language name for the “find similar text in {language}” instruction.
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** Gemma 2 (exact base checkpoint TBD)
-
-### Model Sources [optional]
-
-- **Repository:** [More Information Needed — e.g. Hugging Face model repo or alignment-backend repo]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-
-## Uses
-
 ### Direct Use
 
 - **Semantic similarity:** Encode sentences and compare them via cosine similarity (embeddings are L2-normalized).
@@ -46,18 +24,6 @@ gemma2-mitra-embedding is a **sentence embedding model** that converts text into
 - RAG or search systems that need multilingual, instruction-aware query/corpus embeddings.
 - Any application that consumes L2-normalized sentence vectors from this model.
 
-### Out-of-Scope Use
-
-- **Generation:** The model is used as an encoder (embedding extraction), not for open-ended text generation in normal use.
-- **Classification/QA without adaptation:** Best used for similarity/retrieval; task-specific heads would require additional training or design.
-- **Languages not represented in the prompt language set:** Performance may degrade for languages other than those explicitly supported (e.g. bo, en, zh, pa, sa, hi); “Unknown” is used for other codes.
-
-## Bias, Risks, and Limitations
-
-- Training data and demographic coverage are unspecified; the model may reflect biases present in the base Gemma 2 and any fine-tuning data.
-- Primary use in this repo is Buddhist/multilingual scholarly text; behavior on other domains (e.g. social media, legal, medical) is not documented.
-- Embedding quality depends on using the correct **query** vs **corpus** template and the appropriate language name in the query prompt.
-
 ### Recommendations
 
 - Use the exact prompt format (see “How to Get Started”) for queries and corpus.
@@ -76,7 +42,7 @@ The model expects **asymmetric** inputs:
 - **Corpus:**
   Use the raw sentence (or passage) text **only**, with no `<instruct>` or `<query>` wrapper.
 
-### Example (from this repo, with 8-bit and Hugging Face Transformers)
+### Example (with 8-bit and Hugging Face Transformers)
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
@@ -117,65 +83,6 @@ For **corpus** sentences, pass only the raw text (no `<instruct>`/`<query>`), th
 
 Alternatively, use **FlagEmbedding**’s `FlagLLMModel` with this model path for `encode_queries` and `encode_corpus` (see [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)).
 
-## Training Details
-
-### Training Data
-
-[More Information Needed]
-
-### Training Procedure
-
-#### Preprocessing [optional]
-
-[More Information Needed]
-
-#### Training Hyperparameters
-
-- **Training regime:** [More Information Needed] (e.g. fp16/bf16 mixed precision, 8-bit inference as in repo)
-
-#### Speeds, Sizes, Times [optional]
-
-[More Information Needed]
-
-## Evaluation
-
-### Testing Data, Factors & Metrics
-
-#### Testing Data
-
-[More Information Needed]
-
-#### Factors
-
-[More Information Needed]
-
-#### Metrics
-
-[More Information Needed]
-
-### Results
-
-[More Information Needed]
-
-#### Summary
-
-[More Information Needed]
-
-## Model Examination [optional]
-
-[More Information Needed]
-
-## Environmental Impact
-
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-
-## Technical Specifications [optional]
 
 ### Model Architecture and Objective
 
@@ -183,42 +90,3 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
 - **Config (from repo):** `hidden_size=3584`, `num_hidden_layers=42`, `num_attention_heads=16`, `num_key_value_heads=8`, `intermediate_size=14336`, `head_dim=256`, `max_position_embeddings=8192`, `sliding_window=4096`, `vocab_size=256002` (includes special tokens `<instruct>`, `<query>`).
 - **Special tokens:** `<instruct>`, `<query>` (see `special_tokens_map.json` / `added_tokens.json` in the model dir).
 - **Objective:** Dense retrieval / semantic similarity (asymmetric query/corpus encoding).
-
-### Compute Infrastructure
-
-#### Hardware
-
-- Typically run on GPU (e.g. CUDA) with optional 8-bit quantization to reduce VRAM.
-
-#### Software
-
-- `transformers`, `torch`, `FlagEmbedding` (optional), `bitsandbytes` for 8-bit; see `requirements.txt` in the repo.
-
-## Citation [optional]
-
-**BibTeX:**
-
-[More Information Needed]
-
-**APA:**
-
-[More Information Needed]
-
-## Glossary [optional]
-
-- **Query vs corpus:** In asymmetric retrieval, queries use an instruction plus the query text; corpus items are encoded as plain text.
-- **Last-token pooling:** The embedding is the hidden state at the last non-padded token position.
-- **L2 normalization:** Embeddings are normalized so that cosine similarity equals dot product.
-
-## More Information [optional]
-
-- This model is used as the default embedder (`gemma2mitra`) in the alignment-backend API (`/embed-sentences/` with `model=gemma2mitra`).
-- Cache keys for embeddings include `model_type`, `text`, `language`, and `is_query` (see `embedding_cache.py`).
-
-## Model Card Authors [optional]
-
-[More Information Needed]
-
-## Model Card Contact
-
-[More Information Needed]
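Since this commit trims the card’s description of the prompt convention, a minimal sketch of the asymmetric query/corpus formatting may help readers landing here. The helper names and the exact instruction wording are assumptions, based only on the `<instruct>`/`<query>` special tokens and the “find similar text in {language}” instruction quoted in the diff above:

```python
def format_query(query: str, language: str) -> str:
    # Hypothetical wrapper: the card states that queries are wrapped with the
    # special tokens <instruct> and <query>, with an instruction naming the
    # target language; the exact instruction wording here is an assumption.
    return f"<instruct>find similar text in {language}\n<query>{query}"


def format_corpus(passage: str) -> str:
    # Corpus passages are encoded as raw text, with no special-token wrapper.
    return passage


print(format_query("catvāri āryasatyāni", "English"))
# <instruct>find similar text in English
# <query>catvāri āryasatyāni
```

Queries get the instruction wrapper; corpus passages are deliberately left bare, matching the asymmetric encoding the card describes.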
 
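The last-token pooling and L2 normalization mentioned in the (now removed) glossary can be illustrated with a short, self-contained PyTorch sketch. Toy tensors stand in for real model outputs, and `last_token_pool` is a hypothetical helper name, not the repo’s actual code:

```python
import torch

def last_token_pool(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Take the hidden state at the last non-padded position, then L2-normalize."""
    # hidden: (batch, seq, dim) last-layer hidden states
    # mask:   (batch, seq) attention mask, 1 for real tokens, 0 for padding
    last_idx = mask.sum(dim=1) - 1                        # last real-token index per row
    emb = hidden[torch.arange(hidden.size(0)), last_idx]  # gather one vector per row
    return torch.nn.functional.normalize(emb, p=2, dim=1)

# Toy tensors in place of model outputs; with unit-norm embeddings,
# cosine similarity reduces to a dot product (or a matmul over a corpus).
h = torch.randn(2, 5, 8)
m = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
e = last_token_pool(h, m)
scores = e @ e.T  # pairwise cosine similarities
```

Because the vectors are unit-length, ranking a corpus against a query is a single matrix multiplication over the stacked corpus embeddings.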