Sentence Similarity
sentence-transformers
Safetensors
Italian
qwen3
information-retrieval
semantic-search
text-embeddings-inference
How to use DeepMount00/Ita-Search with sentence-transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("DeepMount00/Ita-Search")

sentences = [
    "Descrivi dettagliatamente il processo chimico e fisico che avviene durante la preparazione di un impasto per crostata",
    "## La Magia Chimica e Fisica nell'Impasto della Crostata: Un Viaggio Dagli Ingredienti Secchi al Trionfo del Forno\n\nLa preparazione di una crostata, apparentemente un gesto semplice e familiare, cela in realtà un affascinante balletto di reazioni chimiche e trasformazioni fisiche...",
    "## L'Arte Effimera: Creare un Dolce Paesaggio Invernale\n\nImmergiamoci nel cuore pulsante della pasticceria festiva, dove l'arte culinaria si fonde con la creatività artistica...",
    "Le piattaforme di comunicazione digitale, con la loro ubiquità crescente, si configurano come un'arma a doppio taglio nel panorama sociale contemporaneo..."
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]
```
Upload README.md
README.md
ADDED
@@ -0,0 +1,125 @@
---
tags:
- sentence-transformers
- sentence-similarity
- information-retrieval
- semantic-search
widget:
- source_sentence: "Descrivi dettagliatamente il processo chimico e fisico che avviene durante la preparazione di un impasto per crostata"
  sentences:
  - "## La Magia Chimica e Fisica nell'Impasto della Crostata: Un Viaggio Dagli Ingredienti Secchi al Trionfo del Forno\n\nLa preparazione di una crostata, apparentemente un gesto semplice e familiare, cela in realtà un affascinante balletto di reazioni chimiche e trasformazioni fisiche..."
  - "## L'Arte Effimera: Creare un Dolce Paesaggio Invernale\n\nImmergiamoci nel cuore pulsante della pasticceria festiva, dove l'arte culinaria si fonde con la creatività artistica..."
  - "Le piattaforme di comunicazione digitale, con la loro ubiquità crescente, si configurano come un'arma a doppio taglio nel panorama sociale contemporaneo..."
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

# Fine-tuned Qwen3-Embedding for Italian-English Cross-Lingual Semantic Retrieval

This model is a fine-tuned version of [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B) optimized for cross-lingual semantic retrieval, with particular emphasis on understanding Italian queries and ranking multilingual documents.

## Model Description

- **Model Type**: Dense embedding model for semantic retrieval
- **Base Model**: [Qwen/Qwen3-Embedding-0.6B](https://huggingface.co/Qwen/Qwen3-Embedding-0.6B)
- **Output Dimensionality**: 1,024-dimensional dense vectors
- **Maximum Sequence Length**: 32,768 tokens
- **Primary Languages**: Italian, English
- **Similarity Function**: Cosine similarity
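
Because the architecture ends with a `Normalize()` module, cosine similarity between two embeddings amounts to a dot product of unit-length vectors. The following is a minimal pure-Python sketch of that relationship. It uses toy 4-dimensional vectors in place of real 1,024-dimensional embeddings, and the helper names `l2_normalize` and `cosine_similarity` are illustrative, not part of any library.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of the normalized inputs."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Toy 4-dimensional vectors standing in for 1,024-dimensional embeddings
query_vec = [1.0, 2.0, 0.0, 0.0]
doc_vec = [2.0, 4.0, 0.0, 0.0]     # same direction as the query
other_vec = [0.0, 0.0, 3.0, 0.0]   # orthogonal to the query

print(cosine_similarity(query_vec, doc_vec))    # close to 1.0
print(cosine_similarity(query_vec, other_vec))  # 0.0
```

In practice `model.similarity` performs this computation on batches of embeddings; the sketch only shows what the score means.
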

## Capabilities

### Cross-Lingual Retrieval
The model performs well at matching Italian queries to English documents and vice versa, and is particularly effective in technical and academic domains.

### Domain Coverage
The training data spans diverse knowledge domains, including:
- **Medical & Health Sciences**: Diagnostic imaging, clinical procedures, medical terminology
- **STEM Fields**: Physics, computer science, geology, engineering
- **Professional Domains**: Finance, law, agriculture, software development
- **Educational Content**: Historical studies, culinary arts, general knowledge

### Query Understanding
Fine-tuning improves comprehension of:
- Conversational and informal query patterns
- Technical terminology across domains
- Cross-lingual semantic concepts
- Complex multi-faceted questions

## Training Data

The model was fine-tuned on a curated corpus of Italian-English cross-lingual data, featuring high-quality triplets designed to capture semantic nuances across multiple domains. The dataset emphasizes:

- **Hard negative mining**: Strategic inclusion of semantically related but incorrect documents
- **Cross-lingual alignment**: Balanced representation of Italian-English language pairs
- **Domain diversity**: Comprehensive coverage of academic, professional, and conversational contexts
- **Quality curation**: Manual review and automated filtering for coherence and relevance
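
This card does not state the exact training objective. To illustrate why hard negatives matter in contrastive training on triplets, the sketch below computes an InfoNCE-style loss over toy similarity scores. The function name `info_nce_loss`, the temperature value, and all scores are assumptions chosen for illustration, not details of how this model was trained.

```python
import math

def info_nce_loss(pos_score, neg_scores, temperature=0.05):
    """Cross-entropy of the positive against the positive plus all negatives.

    pos_score: similarity between the query and its relevant document.
    neg_scores: similarities between the query and negative documents.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Log-sum-exp with max subtraction for numerical stability
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]

# Hypothetical cosine similarities for a single query
positive = 0.82
easy_negative = 0.10   # unrelated topic
hard_negative = 0.75   # related topic that does not answer the query

loss_easy = info_nce_loss(positive, [easy_negative])
loss_hard = info_nce_loss(positive, [hard_negative])
print(f"loss with easy negative: {loss_easy:.4f}")
print(f"loss with hard negative: {loss_hard:.4f}")
```

The hard negative produces the larger loss, so under this kind of objective it contributes more training signal than an unrelated document.
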

## Usage

### Basic Retrieval
```python
from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("DeepMount00/Ita-Search")

# Cross-lingual query-document matching: Italian query, English documents
query = "Come si distingue una faglia trascorrente da una normale?"
documents = [
    "Strike-slip faults are characterized by horizontal movement...",
    "Normal faults occur due to extensional stress...",
    "Investment portfolio management strategies..."
]

# Encode the query and the documents with their respective prompts
query_embedding = model.encode(query, prompt="Represent this search query for finding relevant passages: ")
doc_embeddings = model.encode(documents, prompt="Represent this passage for retrieval: ")

# Similarity between the query and each document
similarities = model.similarity(query_embedding, doc_embeddings)
print(similarities)
```
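
The similarity scores computed above can be turned into a ranking. The following is a minimal sketch in plain Python, assuming the scores have already been converted to a list of floats (for example with `similarities[0].tolist()`). The scores below are made-up placeholders rather than actual model outputs, and `rank_documents` is a hypothetical helper.

```python
def rank_documents(documents, scores, top_k=None):
    """Return (document, score) pairs sorted by descending score."""
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]

documents = [
    "Strike-slip faults are characterized by horizontal movement...",
    "Normal faults occur due to extensional stress...",
    "Investment portfolio management strategies...",
]
scores = [0.71, 0.64, 0.05]  # placeholder values, not real model scores

for doc, score in rank_documents(documents, scores, top_k=2):
    print(f"{score:.2f}  {doc}")
```
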

### Prompt Templates
The model is optimized for the following prompt templates, which are passed through the `prompt` argument of `model.encode`:
- **Queries**: `"Represent this search query for finding relevant passages: "`
- **Documents**: `"Represent this passage for retrieval: "`
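
To avoid mixing up the two templates, they can be kept as constants behind small wrapper functions. This is a sketch that assumes `model` is any object exposing a sentence-transformers-style `encode(texts, prompt=...)` method; `encode_queries` and `encode_documents` are hypothetical helper names.

```python
QUERY_PROMPT = "Represent this search query for finding relevant passages: "
DOCUMENT_PROMPT = "Represent this passage for retrieval: "

def encode_queries(model, queries):
    """Encode search queries with the query prompt."""
    return model.encode(queries, prompt=QUERY_PROMPT)

def encode_documents(model, documents):
    """Encode passages with the document prompt."""
    return model.encode(documents, prompt=DOCUMENT_PROMPT)

# Usage, assuming `model` is a loaded SentenceTransformer:
# query_embeddings = encode_queries(model, ["Come si distingue una faglia trascorrente da una normale?"])
# doc_embeddings = encode_documents(model, ["Strike-slip faults are characterized by horizontal movement..."])
```
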

## Applications

- **Cross-lingual information retrieval systems**
- **Academic and technical document search**
- **Multilingual question-answering platforms**
- **Educational content recommendation**
- **Professional knowledge base systems**

## Limitations

- **Language coverage**: Primarily optimized for Italian-English pairs
- **Domain specificity**: Performance may vary on highly specialized domains not represented in training
- **Cultural context**: Reflects primarily Western/European knowledge perspectives
- **Computational requirements**: Dense representations require significant storage for large-scale deployment
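
The storage cost of dense vectors can be estimated with simple arithmetic. The sketch below assumes a flat index of 1,024-dimensional vectors stored as 32-bit floats, with no compression and no metadata. The float32 assumption and the helper name `index_size_bytes` are illustrative; quantized or compressed indexes would be smaller.

```python
def index_size_bytes(num_documents, dim=1024, bytes_per_value=4):
    """Raw storage for a flat index of dense vectors (no metadata, no compression)."""
    return num_documents * dim * bytes_per_value

for n in (100_000, 1_000_000, 10_000_000):
    gib = index_size_bytes(n) / 2**30
    print(f"{n:>10,} documents: {gib:.2f} GiB")
```

Under these assumptions, one million documents take roughly 3.8 GiB for the raw vectors alone.
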

## Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 32768, 'architecture': 'Qwen3Model'})
  (1): Pooling({'pooling_mode_lasttoken': True, 'include_prompt': True})
  (2): Normalize()
)
```
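
The pooling and normalization stages can be illustrated without the transformer itself. The sketch below applies last-token pooling followed by L2 normalization to toy per-token vectors. The function names and the 2-dimensional values are made up for illustration; the real modules operate on batched tensors of hidden states.

```python
import math

def last_token_pool(token_embeddings, attention_mask):
    """Return the vector of the last position whose attention mask is 1.

    token_embeddings: list of per-token vectors for one sequence.
    attention_mask: list of 1 (real token) or 0 (padding), same length.
    """
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return token_embeddings[last_index]

def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy sequence: three real tokens followed by one padding token
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

pooled = last_token_pool(token_embeddings, attention_mask)
print(normalize(pooled))  # [0.6, 0.8]
```

The padding position is ignored, so the sentence embedding comes from the final real token, scaled to unit length.
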

## Citation

```bibtex
@misc{qwen3-italian-retrieval-2024,
  title={Fine-tuned Qwen3-Embedding for Italian-English Cross-Lingual Semantic Retrieval},
  year={2024},
  howpublished={\url{https://huggingface.co/DeepMount00/Ita-Search}}
}
```

## Acknowledgments

This work builds upon the Qwen3-Embedding architecture and advances in contrastive learning for dense retrieval. We acknowledge the contributions of the Qwen team and the sentence-transformers community.

---

**License**: Inherits licensing terms from the base Qwen/Qwen3-Embedding-0.6B model.