Nanny7 committed on
Commit 9f840c2 · 1 Parent(s): cd44c92

Update README.md with comprehensive modelcard

Files changed (1):
  1. README.md (+1 −43)
README.md CHANGED

pipeline_tag: sentence-similarity
library_name: transformers
---

# Hindi Sentence Embeddings Model

This is a custom sentence embedding model trained specifically for Hindi text. It combines a transformer encoder with specialized pooling strategies to produce high-quality semantic representations of Hindi sentences.

## Features

- Specialized for Hindi-language text
- Transformer architecture with an optimized attention mechanism
- Multiple pooling strategies for enhanced semantic representations
- Produces normalized vector representations for semantic similarity
- Supports semantic search and text-similarity applications
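
Because the model emits unit-length (L2-normalized) vectors, cosine similarity between two embeddings reduces to a plain dot product. A minimal sketch in plain Python with toy 3-dimensional vectors standing in for the model's 768-dimensional output (not the model's actual code):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Two toy "embeddings" (stand-ins for the model's 768-d vectors)
a = l2_normalize([0.9, 0.1, 0.3])
b = l2_normalize([0.8, 0.2, 0.4])

# For unit vectors, the dot product IS the cosine similarity
cosine = dot(a, b)
print(f"Cosine similarity: {cosine:.4f}")
```

This is why normalized outputs matter in practice: a vector store can rank by inner product alone, with no per-query norm computation.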

## Usage

### Installation

```bash
pip install torch sentencepiece scikit-learn matplotlib
git lfs install
git clone https://huggingface.co/convaiinnovations/hindi-embedding-foundational-model
cd hindi-embedding-foundational-model
```

### Enhanced RAG System

This model includes an enhanced Retrieval-Augmented Generation (RAG) system that integrates Unsloth's optimized Llama-3.2-1B-Instruct model for question answering on top of Hindi document retrieval.
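
The retrieval half of that pipeline can be sketched independently of the LLM. A toy version with hypothetical hand-written embeddings (in the real system the vectors come from this model's encoder and live in a FAISS index built by the `--index` step below):

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

# Hypothetical document store: (text, embedding) pairs.
docs = [
    ("दिल्ली भारत की राजधानी है।", [0.9, 0.1, 0.2]),  # "Delhi is the capital of India."
    ("मुझे क्रिकेट पसंद है।", [0.1, 0.8, 0.3]),       # "I like cricket."
]

def retrieve(query_emb, k=1):
    """Rank documents by cosine similarity to the query embedding."""
    scored = sorted(docs, key=lambda d: cosine(query_emb, d[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# The retrieved passage is stuffed into the LLM prompt for grounded answering.
query_emb = [0.85, 0.15, 0.25]  # stand-in for encoding "भारत की राजधानी"
context = retrieve(query_emb)
prompt = f"संदर्भ: {context[0]}\nप्रश्न: भारत की राजधानी क्या है?"
print(prompt)
```

The essential design point: retrieval quality depends entirely on the embedding model, while the LLM only rephrases whatever context it is handed.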

#### Setup and Installation

1. Install additional dependencies:

```bash
pip install unsloth transformers bitsandbytes accelerate langchain langchain-community faiss-cpu
```

2. Index your documents:

```bash
python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --data_dir ./data --output_dir ./output --index
```

3. Run in QA mode with the LLM:

```bash
python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --output_dir ./output --interactive --qa
```

### Basic Embedding Usage

```python
from hindi_embeddings import HindiEmbedder

# Initialize the embedder
model = HindiEmbedder("path/to/hindi-embedding-foundational-model")

# Encode sentences to embeddings
sentences = [
    "मुझे हिंदी भाषा बहुत पसंद है।",  # "I like the Hindi language very much."
    ...
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity between sentences
similarity = model.compute_similarity(sentences[0], sentences[1])
print(f"Similarity: {similarity:.4f}")

# Perform semantic search
query = "भारत की राजधानी"  # "The capital of India"
documents = [
    ...
]
results = model.search(query, documents)
for i, result in enumerate(results):
    print(f"{i+1}. Score: {result['score']:.4f}")
    print(f"   Document: {result['document']}")

# Visualize embeddings
example_sentences = [
    "मुझे हिंदी में पढ़ना बहुत पसंद है।",  # "I really like reading in Hindi."
    ...
]
model.visualize_embeddings(example_sentences)
```

## Model Details

This model uses a transformer-based architecture with the following enhancements:

- Pre-layer normalization for stable training
- Specialized attention mechanism with relative positional encoding
- Multiple pooling strategies (weighted, mean, attention-based)
- L2-normalized vectors for cosine similarity
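
Mean pooling, one of the strategies listed above, is easy to illustrate: average the token embeddings while skipping padding positions, then L2-normalize the result. A sketch with toy numbers (not the model's actual implementation):

```python
import math

def masked_mean_pool(token_embs, mask):
    """Average token embeddings, ignoring padding positions (mask == 0)."""
    dim = len(token_embs[0])
    total = [0.0] * dim
    count = 0
    for emb, m in zip(token_embs, mask):
        if m:
            total = [t + e for t, e in zip(total, emb)]
            count += 1
    pooled = [t / count for t in total]
    # L2-normalize so downstream cosine similarity is a plain dot product
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]

# Toy 4-token, 3-dim sequence; the last token is padding and must not
# contaminate the sentence vector
token_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 1, 0]

sentence_emb = masked_mean_pool(token_embs, mask)
print(sentence_emb)
```

Without the mask, the large padding vector would dominate the average, which is why pooling implementations always fold the attention mask in.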

Technical specifications:

- Embedding dimension: 768
- Hidden dimension: 768
- Attention heads: 12
- Vocabulary size: 50,000
- Context length: 128 tokens
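
These numbers are internally consistent: a 768-dimensional hidden state split across 12 attention heads gives each head a 64-dimensional subspace. A quick sanity check:

```python
hidden_dim = 768
num_heads = 12

# Multi-head attention requires the hidden size to divide evenly across heads
assert hidden_dim % num_heads == 0, "hidden size must divide evenly across heads"
head_dim = hidden_dim // num_heads
print(f"Per-head dimension: {head_dim}")  # 64
```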

## Applications

- Semantic search and information retrieval
- Text clustering and categorization
- Recommendation systems
- Document similarity comparison
- Content-based filtering
- RAG systems for Hindi-language content

## License

This model is released under the MIT License.