DeepMostInnovations committed
Commit c398175 · verified · 1 Parent(s): 19da03f

Add README with usage documentation

Files changed (1)
  1. README.md +122 -0
README.md ADDED
@@ -0,0 +1,122 @@
+ ---
+ language: hi
+ license: mit
+ tags:
+ - hindi
+ - embeddings
+ - sentence-embeddings
+ - semantic-search
+ - text-similarity
+ datasets:
+ - custom
+ pipeline_tag: sentence-similarity
+ library_name: transformers
+ ---
+
+ # Hindi Sentence Embeddings Model
+
+ This is a custom sentence embedding model trained specifically for Hindi text. It uses a transformer encoder with several pooling strategies to produce semantic vector representations of Hindi sentences.
+
+ ## Features
+
+ - Specialized for Hindi-language text
+ - Transformer architecture with pre-layer normalization and relative positional encoding
+ - Multiple pooling strategies (weighted, mean, attention-based)
+ - Produces L2-normalized vectors for cosine-similarity comparison
+ - Supports semantic search and text similarity applications
+
+ ## Usage
+
+ ### Installation
+
+ ```bash
+ pip install torch sentencepiece scikit-learn matplotlib
+ git lfs install
+ git clone https://huggingface.co/convaiinnovations/hindi-embeddings-model
+ cd hindi-embeddings-model
+ ```
+
+ ### Quick Start
+
+ ```python
+ from hindi_embeddings import HindiEmbedder
+
+ # Initialize the embedder from the cloned model directory
+ model = HindiEmbedder("path/to/hindi-embeddings-model")
+
+ # Encode sentences to embeddings
+ sentences = [
+     "मुझे हिंदी भाषा बहुत पसंद है।",  # "I like the Hindi language very much."
+     "मैं हिंदी भाषा सीख रहा हूँ।",  # "I am learning the Hindi language."
+ ]
+ embeddings = model.encode(sentences)
+ print(f"Embedding shape: {embeddings.shape}")
+
+ # Compute similarity between two sentences
+ similarity = model.compute_similarity(sentences[0], sentences[1])
+ print(f"Similarity: {similarity:.4f}")
+
+ # Perform semantic search: rank documents against a query
+ query = "भारत की राजधानी"  # "Capital of India"
+ documents = [
+     "दिल्ली भारत की राजधानी है।",  # "Delhi is the capital of India."
+     "मुंबई भारत का सबसे बड़ा शहर है।",  # "Mumbai is India's largest city."
+     "हिमालय पर्वत भारत के उत्तर में स्थित है।",  # "The Himalayas lie in the north of India."
+ ]
+ results = model.search(query, documents)
+ for i, result in enumerate(results):
+     print(f"{i+1}. Score: {result['score']:.4f}")
+     print(f" Document: {result['document']}")
+
+ # Visualize embeddings
+ example_sentences = [
+     "मुझे हिंदी में पढ़ना बहुत पसंद है।",  # "I really like reading in Hindi."
+     "आज मौसम बहुत अच्छा है।",  # "The weather is very nice today."
+     "भारत एक विशाल देश है।",  # "India is a vast country."
+ ]
+ model.visualize_embeddings(example_sentences)
+ ```
+
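The README does not show how `search` is implemented. As a rough illustration only, the sketch below ranks documents by the dot product of unit-length vectors and returns dictionaries with the same `score` and `document` keys used in the Quick Start. The function name `rank_documents`, the `top_k` parameter, and the toy two-dimensional vectors are assumptions made for this example and are not part of the model's API.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_documents(query_vec, doc_vecs, documents, top_k=None):
    """Hypothetical helper: score each document against the query, best first.

    Assumes every vector is already L2-normalized, so the dot product
    equals the cosine similarity.
    """
    scored = [
        {"document": doc, "score": dot(query_vec, vec)}
        for doc, vec in zip(documents, doc_vecs)
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy unit vectors standing in for real sentence embeddings (illustrative values).
query_vec = [1.0, 0.0]
doc_vecs = [[0.6, 0.8], [0.8, 0.6], [0.0, 1.0]]
documents = ["document A", "document B", "document C"]

results = rank_documents(query_vec, doc_vecs, documents, top_k=2)
for rank, result in enumerate(results, start=1):
    print(f"{rank}. Score: {result['score']:.4f}  Document: {result['document']}")
```

With real embeddings, the vectors would come from `model.encode`; whether the actual `search` method accepts a `top_k`-style argument is not stated in the README.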
+ ## Model Details
+
+ This model uses an advanced transformer-based architecture with the following enhancements:
+
+ - Pre-layer normalization for stable training
+ - Specialized attention mechanism with relative positional encoding
+ - Multiple pooling strategies (weighted, mean, attention-based)
+ - L2-normalized vectors for cosine similarity
+
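To make the last two points concrete, here is a minimal sketch of one of the listed strategies, mean pooling, followed by L2 normalization. It is written in plain Python on toy vectors and is not the model's own code; the function names and the attention-mask convention (1 for a real token, 0 for padding) are assumptions. It shows why normalization matters: once vectors have unit length, cosine similarity is just a dot product.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """For unit-length inputs, cosine similarity reduces to a dot product."""
    return sum(x * y for x, y in zip(a, b))


# Three toy token vectors; the last one is padding and is masked out.
tokens = [[3.0, 4.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

emb_a = l2_normalize(mean_pool(tokens, mask))
emb_b = l2_normalize([4.0, 3.0])
print(round(cosine_similarity(emb_a, emb_b), 4))
```

The weighted and attention-based strategies named above would replace the uniform average with learned or computed per-token weights; the README gives no further detail on how they are combined.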
+ Technical specifications:
+ - Embedding dimension: 768
+ - Hidden dimension: 768
+ - Layers: 12
+ - Attention heads: 12
+ - Vocabulary size: 50,000
+ - Context length: 128 tokens
+
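A few quantities follow directly from these numbers. The sketch below records the specifications in a small dataclass and derives the per-head dimension and the size of the token embedding table. The class name `EncoderSpec` and the `truncate` helper are hypothetical, and the README does not say how inputs longer than 128 tokens are handled, so simple truncation is an assumption here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderSpec:
    """Hypothetical container for the specifications listed in the README."""
    embedding_dim: int = 768
    hidden_dim: int = 768
    num_layers: int = 12
    num_heads: int = 12
    vocab_size: int = 50_000
    max_tokens: int = 128

    @property
    def head_dim(self) -> int:
        """Width of each attention head: the hidden size split evenly across heads."""
        if self.hidden_dim % self.num_heads != 0:
            raise ValueError("hidden_dim must be divisible by num_heads")
        return self.hidden_dim // self.num_heads

    @property
    def token_embedding_params(self) -> int:
        """Number of weights in the token embedding table (vocab_size x embedding_dim)."""
        return self.vocab_size * self.embedding_dim


def truncate(token_ids, spec):
    """Assumed behavior: keep only the first max_tokens token ids."""
    return token_ids[: spec.max_tokens]


spec = EncoderSpec()
print(spec.head_dim)                # 768 / 12
print(spec.token_embedding_params)  # 50,000 * 768
print(len(truncate(list(range(200)), spec)))
```

Each of the 12 heads works on a 64-dimensional slice, and the token embedding table alone holds 38,400,000 weights.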
+ ## Applications
+
+ - Semantic search and information retrieval
+ - Text clustering and categorization
+ - Recommendation systems
+ - Question answering
+ - Document similarity comparison
+ - Content-based filtering
+
+ ## License
+
+ This model is released under the MIT License.
+
+ ## Citation
+
+ If you use this model in your research or application, please cite us:
+
+ ```
+ @misc{convaiinnovations2025hindi,
+   author = {ConvAI Innovations},
+   title = {Hindi Sentence Embeddings Model},
+   year = {2025},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/convaiinnovations/hindi-embeddings-model}}
+ }
+ ```