---
library_name: transformers
language:
- en
- hy
base_model:
- intfloat/multilingual-e5-base
tags:
- sentence-transformers
---

<div style="background-color: rgba(119, 0, 204, 0.25); border: 2px solid #7700cc; border-radius: 8px; padding: 16px; margin: 16px 0; color: #ffffff;">
<strong>🚀 New Version Available!</strong><br><br>
A newer and significantly improved version of this model has been released! Check out <a href="https://huggingface.co/Metric-AI/armenian-text-embeddings-2-base"><strong>ATE-2</strong></a> for much better performance. It is a drop-in replacement for this model.<br>
A larger version, <a href="https://huggingface.co/Metric-AI/armenian-text-embeddings-2-large"><strong>ATE-2-large</strong></a>, is also available.
</div>

# Armenian-Text-Embeddings-1

## Model Details
- **Model Name**: Armenian-Text-Embeddings-1
- **Model Type**: Text embeddings for the Armenian language
- **Base Model**: intfloat/multilingual-e5-base
- **Version**: 1.0.0
- **License**: Apache 2.0
- **Last Updated**: November 2024
- **Model Architecture**: Transformer-based embedding model
- **Input**: Armenian text
- **Output**: Dense vector embeddings
## Quick Start
```python
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('Metric-AI/armenian-text-embeddings-1')
model = AutoModel.from_pretrained('Metric-AI/armenian-text-embeddings-1')


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padded positions, then average over the sequence dimension
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Each input text should start with "query: " or "passage: ", even for non-English texts.
# For tasks other than retrieval, you can simply use the "query: " prefix.
input_texts = [
    'query: Ինչպե՞ս պատրաստել տոլմա',  # How to make tolma
    'query: Քանի՞ գրամ սպիտակուց է հարկավոր օրական',  # How many grams of protein needed daily

    """passage: Տոլմայի բաղադրատոմս՝
Բաղադրիչներ՝
- 500գ աղացած միս
- 1 բաժակ բրինձ
- Խաղողի տերևներ
- 2 գլուխ սոխ
- Համեմունքներ՝ աղ, սև պղպեղ, քարի

Պատրաստման եղանակը՝
1. Միսը խառնել բրնձի, մանր կտրատած սոխի և համեմունքների հետ
2. Խաղողի տերևները լվանալ և թողնել տաք ջրի մեջ 10 րոպե
3. Լցոնել տերևները և դասավորել կաթսայի մեջ
4. Եփել դանդաղ կրակի վրա 45-60 րոպե""",  # Detailed tolma recipe

    """passage: Սպիտակուցի օրական չափաբաժինը կախված է մարդու քաշից, սեռից և ֆիզիկական ակտիվությունից:
Միջին հաշվով, կանանց համար խորհուրդ է տրվում 46-50 գրամ սպիտակուց օրական:
Մարզիկների համար այս թիվը կարող է հասնել մինչև 1.6-2 գրամ մարմնի քաշի յուրաքանչյուր կիլոգրամի համար:
Հղիների համար պահանջվում է լրացուցիչ 25 գրամ սպիտակուց:

Սպիտակուցի հարուստ աղբյուրներ են՝
- Հավի միս (31գ/100գ)
- Ձու (13գ/100գ)
- Ոսպ (25գ/100գ)
- Մածուն (3.5գ/100գ)"""]  # Detailed protein intake advice

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings so that dot products equal cosine similarities
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())

# [[83.96063232421875, 30.283924102783203], [32.504661560058594, 82.4246826171875]]
```
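In the printed score matrix, rows correspond to the two queries and columns to the two passages: the high diagonal scores (about 84 and 82) show that each query is matched with its relevant passage, while the off-diagonal scores stay low.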

## Support for Sentence Transformers

Below is a usage example with the sentence_transformers library. The "query: "/"passage: " prefixes described above are still required.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Metric-AI/armenian-text-embeddings-1')

# `input_texts` is the prefixed list from the Quick Start example above
embeddings = model.encode(input_texts, normalize_embeddings=True)
```

## Intended Use
### Primary Intended Uses
- Retrieval-augmented generation (RAG)
- Semantic search in Armenian
- Document similarity computation
- Cross-lingual text understanding
- Text classification tasks
- Information retrieval
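To illustrate the semantic-search and document-similarity use cases, here is a minimal sketch building on the sentence_transformers setup above; the Armenian strings are illustrative examples, not from the training data:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Metric-AI/armenian-text-embeddings-1')

# Illustrative documents and query (hypothetical examples)
docs = [
    'passage: Երևանը Հայաստանի մայրաքաղաքն է:',  # Yerevan is the capital of Armenia.
    'passage: Տոլման պատրաստվում է խաղողի տերևներով:',  # Tolma is made with grape leaves.
]
query = 'query: Ո՞րն է Հայաստանի մայրաքաղաքը:'  # What is the capital of Armenia?

# With normalized embeddings, the dot product is the cosine similarity
doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode([query], normalize_embeddings=True)

scores = query_emb @ doc_emb.T  # shape (1, len(docs))
best = int(scores.argmax())
print(docs[best])
```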

## Training Data
### Dataset Details
- **Source**: Reddit dataset with English-Armenian translations
- **Size**: 1.08M title-body pairs
- **Content Type**: Title and body text pairs
- **Token Statistics**:
  - Training Set:
    - Translated Title Tokens: 23,921,393
    - Translated Body Tokens: 194,200,654
  - Test Set:
    - Translated Title Tokens: 242,443
    - Translated Body Tokens: 1,946,164
- **Split Ratio**: 99% train, 1% test

## Training Procedure
### Training Details
- **Weight Averaging** (see the sketch after this list):
  - Base model (multilingual-e5-base): 0.6 weight
  - Fine-tuned model: 0.4 weight
- **Training Duration**: 2 days
- **Hardware**: 4 x NVIDIA A100 40GB GPUs
- **Training Parameters**:
  - Epochs: 5
  - Batch Size: 256 per GPU (1,024 total across 4 GPUs)
  - Learning Rate: 5e-5
  - Weight Decay: 0.01
  - Warmup Steps: 1000
  - Maximum Sequence Length: 128 tokens
  - FP16 Training: Enabled
  - Gradient Clipping: 1.0
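A minimal sketch of the weight averaging step described above, assuming both checkpoints share the multilingual-e5-base architecture; the fine-tuned checkpoint path and output directory are hypothetical:

```python
from transformers import AutoModel

# 0.6 * base + 0.4 * fine-tuned, parameter by parameter (weights from this card).
base = AutoModel.from_pretrained('intfloat/multilingual-e5-base')
finetuned = AutoModel.from_pretrained('path/to/finetuned-checkpoint')  # hypothetical path

finetuned_state = finetuned.state_dict()
merged_state = {}
for name, param in base.state_dict().items():
    if param.is_floating_point():
        merged_state[name] = 0.6 * param + 0.4 * finetuned_state[name]
    else:
        merged_state[name] = param  # integer buffers are copied unchanged

base.load_state_dict(merged_state)
base.save_pretrained('merged-model')  # illustrative output directory
```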

### Optimization Configuration
- **Framework**: DeepSpeed Stage 2 (a configuration sketch follows this list)
- **Optimizer**: AdamW with auto weight decay
- **Mixed Precision**: FP16 with dynamic loss scaling
- **ZeRO Optimization**: Stage 2 with:
  - Allgather partitions
  - Overlap communications
  - Contiguous gradients
- **Additional Features**:
  - Gradient checkpointing
  - Tensor parallelism (size: 2)
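The exact DeepSpeed config was not released; below is a hedged reconstruction as a Python dict, using standard DeepSpeed config keys and the hyperparameters listed in this card:

```python
# Illustrative reconstruction from the settings above; not the released config.
ds_config = {
    "train_micro_batch_size_per_gpu": 256,
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True},  # DeepSpeed uses dynamic loss scaling by default
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 5e-5, "weight_decay": 0.01},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {"warmup_num_steps": 1000},
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": True,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
```

A dict like this can typically be passed to `deepspeed.initialize` via its `config` argument.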

## Performance and Limitations
### Capabilities
- Effective for semantic similarity tasks in Armenian
- Suitable for document classification and clustering

### Limitations
- Performance may vary on domain-specific terminology
- May not capture Armenian-specific cultural contexts effectively
- Limited by the quality of training data translations

### Known Biases
- May exhibit biases present in Reddit content

## Environmental Impact
- **Training Hardware**: 4 x NVIDIA A100 40GB
- **Training Duration**: 48 hours
- **Estimated Energy Consumption**: 384 kWh (based on A100 power consumption)

## Ethical Considerations
- **Data Privacy**: Training data comes from public Reddit content
- **Potential Misuse**: Could be misused for content manipulation or spam
- **Bias**: May perpetuate social biases present in Reddit content
- **Recommendations**:
  - Monitor system outputs for harmful content
  - Implement content filtering for production use
  - Conduct regular bias assessments

## Technical Specifications
- **Model Size**: ~278M parameters (inherited from multilingual-e5-base)
- **Embedding Dimension**: 768
- **Max Sequence Length**: 512 tokens (architecture limit; fine-tuning used sequences up to 128 tokens)
- **Framework Compatibility**:
  - PyTorch
  - Hugging Face Transformers
  - DeepSpeed
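These figures can be checked locally; a minimal verification snippet, assuming the transformers setup from the Quick Start:

```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained('Metric-AI/armenian-text-embeddings-1')
print(config.hidden_size)              # embedding dimension
print(config.max_position_embeddings)  # position-embedding limit

model = AutoModel.from_pretrained('Metric-AI/armenian-text-embeddings-1')
print(sum(p.numel() for p in model.parameters()))  # total parameter count
```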

## Citation
```bibtex
@misc{armenian-text-embeddings-1,
  author = {Bughdaryan, Spartak and Navasardyan, Zaruhi and Minasyan, Bagrat and Davtyan, Hrant},
  title = {Armenian-Text-Embeddings-1: Enhanced Armenian Language Embeddings},
  year = {2024},
  howpublished = {\url{https://metric.am/blog/announcing-armenian-text-embeddings/}}
}
```

## Additional Information
### Base Model References
- multilingual-e5-base: [https://huggingface.co/intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)

### Acknowledgments
- intfloat for the original multilingual-e5-base model
- The Reddit community for the source content
- The DeepSpeed team for the optimization toolkit

## Version History
- 2.0 (March 2026): **[ATE-2](https://huggingface.co/Metric-AI/armenian-text-embeddings-2-base)**, a new open-source version with significantly improved performance. It is a drop-in replacement for v1, requiring no code changes, and is also available in a larger variant: [ATE-2-large](https://huggingface.co/Metric-AI/armenian-text-embeddings-2-large).
- 1.0.0 (November 2024): Initial release