---
language: hi
license: mit
tags:
- hindi
- embeddings
- sentence-embeddings
- semantic-search
- text-similarity
datasets:
- custom
pipeline_tag: sentence-similarity
library_name: transformers
---

# Hindi Sentence Embeddings Model

This is a custom sentence embedding model trained specifically for Hindi text. It uses a transformer architecture with several pooling strategies to produce semantic vector representations of Hindi sentences.

## Features

- Specialized for Hindi text
- Transformer architecture with an attention mechanism that uses relative positional encoding
- Multiple pooling strategies for building sentence representations
- Produces L2-normalized vectors suitable for cosine similarity
- Supports semantic search and text similarity applications

## Usage

### Installation

```bash
pip install torch sentencepiece scikit-learn matplotlib
git lfs install
git clone https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model-10B
cd hindi-embedding-foundational-model-10B
```

### Enhanced RAG System

This repository also includes an enhanced RAG (Retrieval-Augmented Generation) system. It uses Unsloth's optimized Llama-3.2-1B-Instruct model to answer questions over Hindi documents retrieved with the embedding model.

#### Setup and Installation

1. Install the additional dependencies:
```bash
pip install unsloth transformers bitsandbytes accelerate langchain langchain-community faiss-cpu
```

2. Index your documents:
```bash
python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --data_dir ./data --output_dir ./output --index
```

3. Run in interactive QA mode with the LLM:
```bash
python hindi-rag-system.py --model_dir /path/to/your/model --tokenizer_dir /path/to/tokenizer --output_dir ./output --interactive --qa
```

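To make the retrieve-then-answer flow above more concrete, here is a minimal sketch of the two steps that sit between indexing and generation: ranking documents against a query vector and assembling a prompt from the best matches. It is not the code in `hindi-rag-system.py`. The function names, the prompt template, and the toy 3-dimensional vectors are all assumptions made for illustration; in practice the vectors would come from the embedding model and the prompt would be passed to the LLM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, doc_vecs, documents, top_k=2):
    """Return the top_k (score, document) pairs, best match first."""
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, documents)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(question, passages):
    """Assemble a plain-text prompt (hypothetical template, not the script's own)."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy vectors stand in for real embeddings (illustrative values only).
documents = [
    "दिल्ली भारत की राजधानी है।",
    "मुंबई भारत का सबसे बड़ा शहर है।",
    "हिमालय पर्वत भारत के उत्तर में स्थित है।",
]
doc_vecs = [[0.9, 0.1, 0.0], [0.6, 0.8, 0.0], [0.0, 0.2, 0.9]]
query_vec = [1.0, 0.0, 0.0]

hits = retrieve(query_vec, doc_vecs, documents, top_k=2)
prompt = build_prompt("भारत की राजधानी क्या है?", [doc for _, doc in hits])
print(prompt)
```

A brute-force scan like this is fine for a few thousand documents; a vector index such as FAISS (installed in step 1) is the usual choice for larger collections.
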
### Basic Embedding Usage

```python
from hindi_embeddings import HindiEmbedder

# Initialize the embedder
model = HindiEmbedder("path/to/hindi-embedding-foundational-model-10B")

# Encode sentences to embeddings
sentences = [
    "मुझे हिंदी भाषा बहुत पसंद है।",
    "मैं हिंदी भाषा सीख रहा हूँ।"
]
embeddings = model.encode(sentences)
print(f"Embedding shape: {embeddings.shape}")

# Compute similarity between sentences
similarity = model.compute_similarity(sentences[0], sentences[1])
print(f"Similarity: {similarity:.4f}")

# Perform semantic search
query = "भारत की राजधानी"
documents = [
    "दिल्ली भारत की राजधानी है।",
    "मुंबई भारत का सबसे बड़ा शहर है।",
    "हिमालय पर्वत भारत के उत्तर में स्थित है।"
]
results = model.search(query, documents)
for i, result in enumerate(results):
    print(f"{i+1}. Score: {result['score']:.4f}")
    print(f"   Document: {result['document']}")

# Visualize embeddings
example_sentences = [
    "मुझे हिंदी में पढ़ना बहुत पसंद है।",
    "आज मौसम बहुत अच्छा है।",
    "भारत एक विशाल देश है।"
]
model.visualize_embeddings(example_sentences)
```

## Model Details

This model uses a transformer-based architecture with the following enhancements:

- Pre-layer normalization for stable training
- Specialized attention mechanism with relative positional encoding
- Multiple pooling strategies (weighted, mean, attention-based)
- L2-normalized vectors for cosine similarity

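The pooling and normalization steps listed above can be illustrated with a short sketch. This is not the model's implementation: the function names are hypothetical, the vectors are plain Python lists rather than tensors, and the attention scores are passed in by hand where a real model would compute them with a learned layer. It shows only the general idea of masked mean pooling, softmax-weighted (attention-style) pooling, and L2 normalization.

```python
import math


def mean_pool(token_vecs, mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_vecs[0])
    kept = [vec for vec, m in zip(token_vecs, mask) if m]
    if not kept:
        return [0.0] * dim
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


def attention_pool(token_vecs, mask, scores):
    """Softmax-weighted sum of unmasked token vectors.

    `scores` are assumed inputs here; a real model would learn them.
    """
    dim = len(token_vecs[0])
    kept = [(vec, s) for vec, m, s in zip(token_vecs, mask, scores) if m]
    if not kept:
        return [0.0] * dim
    max_s = max(s for _, s in kept)  # subtract the max for numerical stability
    weights = [math.exp(s - max_s) for _, s in kept]
    total = sum(weights)
    return [
        sum(w * vec[d] for (vec, _), w in zip(kept, weights)) / total
        for d in range(dim)
    ]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


# Two real tokens followed by one padding position (illustrative values only).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

mean_vec = mean_pool(tokens, mask)
attn_vec = attention_pool(tokens, mask, scores=[0.0, 0.0, 5.0])
unit_vec = l2_normalize(mean_vec)
print(mean_vec, attn_vec, unit_vec)
```

With equal scores on the unmasked tokens, attention pooling reduces to mean pooling, which is a quick sanity check for either implementation.
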
Technical specifications:
- Embedding dimension: 768
- Hidden dimension: 768
- Layers: 12
- Attention heads: 12
- Vocabulary size: 50,000
- Context length: 128 tokens

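Because the context length is 128 tokens, longer documents generally need to be split before they are embedded. The sketch below shows one possible way to do that with overlapping windows. It is an assumption-heavy illustration: `chunk_words` is a hypothetical helper, and it counts whitespace-separated words rather than the model's own tokens. Subword tokenizers usually produce more tokens than words, so a real budget would likely need to be lower, or measured with the model's tokenizer.

```python
def chunk_words(text, max_tokens=128, overlap=16):
    """Split text into overlapping windows of at most max_tokens words.

    Whitespace words are only a rough proxy for model tokens (assumption).
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic 300-word text used only to demonstrate the windowing.
text = " ".join(f"w{i}" for i in range(300))
chunks = chunk_words(text, max_tokens=128, overlap=16)
print(len(chunks), [len(c.split()) for c in chunks])
```

Each chunk can then be embedded separately, with the overlap reducing the chance that a sentence relevant to a query is cut in half at a chunk boundary.
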
## Applications

- Semantic search and information retrieval
- Text clustering and categorization
- Recommendation systems
- Question answering
- Document similarity comparison
- Content-based filtering
- RAG systems for Hindi language content

## License

This model is released under the MIT License.

## Citation

If you use this model in your research or application, please cite it as follows:

```
@misc{DeepMostInnovations2025hindi,
  author = {DeepMost Innovations},
  title = {Hindi Sentence Embeddings Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/DeepMostInnovations/hindi-embedding-foundational-model-10B}}
}
```