azizdh00 committed on
Commit
6423c9d
·
verified ·
1 Parent(s): 2a316f2

Update README.md

  1. README.md +0 -100
README.md CHANGED
@@ -1,100 +0,0 @@
- ---
- library_name: sentence-transformers
- pipeline_tag: sentence-similarity
- tags:
- - sentence-transformers
- - feature-extraction
- - sentence-similarity
- - transformers
- - rag
- - document-embedding
- base_model: sentence-transformers/all-mpnet-base-v2
- license: apache-2.0
- ---
-
- # Document Encoder for RAG - MPNet Base V2
-
- This is a **sentence-transformers** model based on **sentence-transformers/all-mpnet-base-v2**. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks such as clustering or semantic search.
-
- ## Model Details
-
- - **Base Model**: sentence-transformers/all-mpnet-base-v2
- - **Embedding Dimension**: 768
- - **Max Sequence Length**: 384 tokens
- - **Use Case**: Document encoding for RAG (Retrieval-Augmented Generation) systems
-
- ## Usage (Sentence-Transformers)
-
- With [sentence-transformers](https://www.SBERT.net) installed, the model can be loaded and used in a few lines:
-
- ```bash
- pip install -U sentence-transformers
- ```
-
- Then you can use the model like this:
-
- ```python
- from sentence_transformers import SentenceTransformer
-
- # Load model
- model = SentenceTransformer('azizdh00/MNLP_M2_document_encoder')
-
- # Encode documents
- documents = [
-     "This is a sample document about artificial intelligence.",
-     "Machine learning is a subset of AI that uses algorithms.",
-     "Natural language processing enables computers to understand text."
- ]
-
- embeddings = model.encode(documents)
- print(f"Embeddings shape: {embeddings.shape}")
- ```
-
- ## Usage (HuggingFace Transformers)
-
- You can also use the model without sentence-transformers by running the transformer directly and mean-pooling the token embeddings:
-
- ```python
- from transformers import AutoTokenizer, AutoModel
- import torch
-
- # Load model and tokenizer
- tokenizer = AutoTokenizer.from_pretrained('azizdh00/MNLP_M2_document_encoder')
- model = AutoModel.from_pretrained('azizdh00/MNLP_M2_document_encoder')
-
- # Tokenize and encode
- def encode_text(texts):
-     encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
-     with torch.no_grad():
-         outputs = model(**encoded)
-     # Mean pooling over real tokens only (the attention mask excludes padding)
-     mask = encoded['attention_mask'].unsqueeze(-1).float()
-     return (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
-
- # Example usage
- texts = ["Sample document text"]
- embeddings = encode_text(texts)
- ```
-
- ## Training Data
-
- This model was originally trained on a large dataset of sentence pairs for semantic similarity tasks.
-
- ## Performance
-
- The model achieves strong performance on:
- - Semantic similarity tasks
- - Document retrieval
- - Clustering tasks
- - Information retrieval benchmarks
-
- ## Technical Details
-
- - **Model Type**: Sentence Transformer (MPNet)
- - **Training Procedure**: Pre-trained on sentence similarity tasks
- - **Intended Uses**: Semantic search, clustering, similarity measurement
- - **Languages**: Primarily English
-
- ## License
-
- This model is released under the Apache 2.0 License.
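
The removed README lists semantic search and document retrieval as intended uses but does not show how the embeddings would be compared. The sketch below is one common way to do that, ranking documents by cosine similarity to a query vector. It is an illustration only: the helper names `cosine_similarity` and `rank_documents` are hypothetical, and the 3-dimensional vectors are made-up stand-ins for the 768-dimensional output of `model.encode`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumed data, not real model output)
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
query_vec = [0.0, 1.0, 0.0]

ranking = rank_documents(query_vec, doc_vecs)
print(ranking[0])  # index and score of the best-matching document
```

In a RAG pipeline the top-ranked indices would select the passages passed to the generator. For large collections, a vector index would normally replace this linear scan.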