Vedika - Sanskrit NLP Toolkit
Vedika is a comprehensive toolkit for Sanskrit text processing, offering deep learning-based tools for sandhi splitting and joining, text normalization, sentence splitting, syllabification, and tokenization.
Features
- Sandhi Processing
  - Split compound Sanskrit words using attention-based neural networks
  - Join Sanskrit words with proper sandhi rules
  - Support for beam search to get multiple suggestions
- Text Processing
  - Syllabification
  - Tokenization
  - Sentence splitting
  - Text normalization
Installation
# Install from PyPI (coming soon)
pip install vedika
# Install from source
git clone https://github.com/tanuj437/vedika.git
cd vedika
pip install -e .
Requirements
- Python >= 3.8
- PyTorch >= 1.9.0
- NumPy >= 1.19.0
- Pandas >= 1.3.0
- tqdm >= 4.62.0
- regex >= 2021.8.3
Quick Start
Sandhi Splitting
from vedika import SanskritSplit
# Initialize splitter
splitter = SanskritSplit()
# Split a single word
result = splitter.split("रामायणम्")
print(result['split']) # Output: राम + अयन + अम्
# Batch processing
words = ["रामायणम्", "गीतागोविन्दम्"]
results = splitter.split_batch(words)
for result in results:
    print(f"{result['input']} → {result['split']}")
Sandhi Joining
from vedika import SandhiJoiner
# Initialize joiner
joiner = SandhiJoiner()
# Join split words
result = joiner.join("राम+अस्ति")
print(result) # Output: रामास्ति
# Batch processing
texts = ["राम+अस्ति", "गच्छ+अमि"]
results = joiner.join_batch(texts)
print(results) # ['रामास्ति', 'गच्छामि']
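For readers unfamiliar with sandhi, the two joins above are instances of a single vowel rule (a/ā + a/ā becomes ā). The sketch below hard-codes only that rule to show the kind of transformation involved. It is a toy illustration, not Vedika's neural joiner, and the helper names (`is_consonant`, `join_a_plus_a`) are hypothetical.

```python
# Toy illustration of ONE vowel sandhi rule: a/ā + a/ā -> ā.
# This is not how Vedika's neural joiner works; all names here are hypothetical.

AA_SIGN = "\u093E"         # dependent vowel sign ā (ा)
A_INDEPENDENT = "\u0905"   # independent vowel a (अ)
AA_INDEPENDENT = "\u0906"  # independent vowel ā (आ)


def is_consonant(ch: str) -> bool:
    """True for the basic Devanagari consonants (U+0915..U+0939)."""
    return "\u0915" <= ch <= "\u0939"


def join_a_plus_a(text: str) -> str:
    """Join 'left+right' when left ends in a/ā and right starts with a/ā."""
    left, right = text.split("+", 1)
    if not left or not right:
        return left + right
    if right[0] not in (A_INDEPENDENT, AA_INDEPENDENT):
        return left + right  # rule does not apply: fall back to concatenation
    if is_consonant(left[-1]):
        # A bare final consonant carries the inherent vowel 'a'.
        return left + AA_SIGN + right[1:]
    if left[-1] == AA_SIGN:
        # Left already ends in ā; just drop the initial vowel of the right word.
        return left + right[1:]
    return left + right


print(join_a_plus_a("राम+अस्ति"))  # रामास्ति
print(join_a_plus_a("गच्छ+अमि"))   # गच्छामि
```

Real sandhi involves many more rules and context-dependent exceptions, which is why a learned model is used in practice.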
Advanced Usage
Beam Search for Multiple Suggestions
# Get multiple suggestions with beam search
result = splitter.split("रामायणम्", beam_size=3)
print(f"Best split: {result['split']}")
print(f"Confidence: {result['confidence']}")
print("Alternatives:")
for alt in result['alternatives']:
    print(f"- {alt['split']} (confidence: {alt['confidence']})")
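To make the idea behind `beam_size` concrete, here is a minimal, generic beam search over a toy scoring table. It is a sketch of the general technique only: the `beam_search` function, the `EOS` marker, and `toy_model` are assumptions made for illustration and do not reflect Vedika's internal decoder.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # hypothetical end-of-sequence marker

Sequence = Tuple[str, ...]


def beam_search(
    next_logprobs: Callable[[Sequence], Dict[str, float]],
    beam_size: int = 3,
    max_len: int = 10,
) -> List[Tuple[Sequence, float]]:
    """Return up to beam_size finished sequences, best first, with log-probs."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    finished: List[Tuple[Sequence, float]] = []
    for _ in range(max_len):
        # Expand every live prefix by every possible next token.
        candidates = []
        for prefix, score in beams:
            for token, logprob in next_logprobs(prefix).items():
                candidates.append((prefix + (token,), score + logprob))
        # Keep only the beam_size highest-scoring expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == EOS:
                finished.append((seq[:-1], score))  # strip the EOS marker
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_size]


def toy_model(prefix: Sequence) -> Dict[str, float]:
    """Hypothetical stand-in for a decoder: fixed next-token log-probabilities."""
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {EOS: math.log(0.5), "b": math.log(0.5)},
        ("b",): {EOS: math.log(0.9), "a": math.log(0.1)},
    }
    return table.get(prefix, {EOS: 0.0})  # unknown prefixes end immediately


results = beam_search(toy_model, beam_size=3)
for seq, score in results:
    print("".join(seq), round(math.exp(score), 2))
```

Note that the greedy first choice ("a", probability 0.6) does not lead to the best complete sequence here; keeping several hypotheses alive is what lets the search recover the higher-scoring alternative.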
Model Information
# Get model details
info = splitter.get_model_info()
print(f"Vocabulary size: {info['vocabulary_size']}")
print(f"Device: {info['device']}")
print(f"Configuration: {info['model_config']}")
Project Structure
vedika/
├── __init__.py
├── normalizer.py
├── sandhi_join.py
├── sandhi_split.py
├── sentence_splitter.py
├── syllabification.py
├── tokenizer.py
└── data/
    ├── cleaned_metres.json
    ├── sandhi_joiner.pth
    └── sandhi_split.pth
Model Architecture
The sandhi processing models use:
- Bidirectional LSTM encoder
- GRU decoder with attention
- Multi-head attention mechanism
- Character-level processing
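As a rough illustration of what character-level processing means for Devanagari input, the sketch below maps each Unicode code point to an integer ID and pads to a fixed length. The special tokens, function names, and padding scheme are assumptions for illustration; the vocabulary actually stored in the shipped `.pth` files may be built differently.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocab(words: List[str]) -> Dict[str, int]:
    """Assign an integer ID to every distinct code point seen in words."""
    vocab = {PAD: 0, UNK: 1}
    for word in words:
        for ch in word:
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode(word: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert a word to a fixed-length list of IDs, truncating or padding."""
    ids = [vocab.get(ch, vocab[UNK]) for ch in word[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


vocab = build_vocab(["राम"])
print(encode("राम", vocab, max_len=5))   # [2, 3, 4, 0, 0]
print(encode("रामः", vocab, max_len=5))  # [2, 3, 4, 1, 0]
```

Because Python iterates over code points, vowel signs such as ा and the visarga ः are separate symbols from the consonants they attach to, so a character-level model sees them as individual input steps.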
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Authors
- Tanuj Saxena
- Soumya Sharma
Citation
If you use Vedika in your research, please cite:
@software{vedika2025,
  title={Vedika: A Sanskrit Text Processing Toolkit},
  author={Saxena, Tanuj and Sharma, Soumya},
  year={2025},
  url={https://github.com/tanuj437/vedika}
}
Contact